Posted to dev@flink.apache.org by Timo Walther <tw...@apache.org> on 2020/03/16 17:36:31 UTC

[DISCUSS] FLIP-95: New TableSource and TableSink interfaces

Hi everyone,

I'm happy to present the results of long discussions that we had 
internally. Jark, Dawid, Aljoscha, Kurt, Jingsong, me, and many more 
have contributed to this design document.

We would like to propose new long-term table source and table sink 
interfaces:

https://cwiki.apache.org/confluence/display/FLINK/FLIP-95%3A+New+TableSource+and+TableSink+interfaces

This is a requirement for FLIP-105 and finalizing FLIP-32.

The goals of this FLIP are:

- Simplify the current interface architecture:
     - Merge upsert, retract, and append sinks.
     - Unify batch and streaming sources.
     - Unify batch and streaming sinks.

- Allow sources to produce a changelog:
     - UpsertTableSources have been requested a lot by users. Now is the 
time to open the internal planner capabilities via the new interfaces.
     - According to FLIP-105, we would like to support changelogs for 
processing formats such as Debezium.

- Don't rely on DataStream API for sources and sinks:
     - According to FLIP-32, the Table API and SQL should be independent 
of the DataStream API which is why the `table-common` module has no 
dependencies on `flink-streaming-java`.
     - Source and sink implementations should only depend on the 
`table-common` module after FLIP-27.
     - Until FLIP-27 is ready, we still put most of the interfaces in 
`table-common` and strictly separate the interfaces that communicate 
with the planner from the actual runtime readers/writers.

- Implement efficient sources and sinks without planner dependencies:
     - Make Blink's internal data structures available to connectors.
     - Introduce stable interfaces for data structures that can be 
marked as `@PublicEvolving`.
     - Only require dependencies on `flink-table-common` in the future.

It finalizes the concept of dynamic tables and considers how all 
source/sink related classes play together.
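
To make the first goal concrete before diving into the document: instead
of separate upsert, retract, and append sinks, a single dynamic sink can
negotiate the kinds of changes it accepts. A minimal sketch of that idea
(illustrative signatures only; the wiki page above is authoritative):

// Sketch: one sink interface replaces upsert/retract/append sinks by
// negotiating a ChangelogMode, i.e. the set of row kinds it can digest.
interface DynamicTableSink {

    // The planner passes the change modes produced by the query; the
    // sink answers with the modes it supports (e.g. insert-only, upsert).
    ChangelogMode getChangelogMode(ChangelogMode requestedMode);

    // Hands a regular runtime sink (e.g. a SinkFunction) to the planner.
    SinkRuntimeProvider getSinkRuntimeProvider(Context context);
}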

We look forward to your feedback.

Regards,
Timo

Re: [DISCUSS] FLIP-95: New TableSource and TableSink interfaces

Posted by Kurt Young <yk...@gmail.com>.
Thanks Timo for the design doc.

In general I'm +1 to this, with a minor comment. Since we introduced
dozens of interfaces all at once, I'm not sure if it's good to annotate
them with @PublicEvolving already. I can imagine these interfaces would
only be stable after 1 or 2 major releases. Given the fact that these
interfaces will only be used by connector developers, how about we
annotate them as @Internal first? After more trying out and feedback
from connector developers, we can improve those interfaces quickly and
mark them @PublicEvolving after we are confident about them.

BTW, if I'm not mistaken, the end users will only see Row with enhanced
RowKind. This is the only one
which actually goes public IMO.

Best,
Kurt


On Tue, Mar 24, 2020 at 9:24 AM Becket Qin <be...@gmail.com> wrote:

> Hi Timo,
>
> Thanks for the proposal. I completely agree that the current Table
> connectors could be simplified quite a bit. I haven't finished reading
> everything, but here are some quick thoughts.
>
> Actually to me the biggest question is why should there be two different
> connector systems for DataStream and Table? What is the fundamental reason
> that is preventing us from merging them to one?
>
> The basic functionality of a connector is to provide capabilities to do IO
> and Serde. Conceptually, Table connectors should just be DataStream
> connectors that are dealing with Rows. It seems that quite a few of the
> special connector requirements are just a specific way to do IO / Serde.
> Taking SupportsFilterPushDown as an example, imagine we have the following
> interface:
>
> interface FilterableSource<PREDICATE> {
>     void applyFilterable(Supplier<PREDICATE> predicate);
> }
>
> And if a ParquetSource would like to support filtering, it will become:
>
> class ParquetSource implements Source, FilterableSource<FilterPredicate> {
>     ...
> }
>
> For Table, one just needs to provide a predicate supplier that converts an
> Expression to the specified predicate type. This has a few benefits:
> 1. The same unified filtering API for sources, regardless of DataStream or
> Table.
> 2. DataStream users can now also use the ExpressionToPredicate supplier
> if they want to.
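
To illustrate, a minimal sketch of such a supplier (ExpressionToPredicate,
FilterPredicate, and the translation logic are hypothetical placeholders):

import java.util.List;
import java.util.function.Supplier;

// Hypothetical: translates pushed-down SQL expressions into the source's
// native predicate type, so the same FilterableSource hook serves both
// the Table planner and DataStream users.
final class ExpressionToPredicate implements Supplier<FilterPredicate> {

    private final List<Expression> pushedFilters;

    ExpressionToPredicate(List<Expression> pushedFilters) {
        this.pushedFilters = pushedFilters;
    }

    @Override
    public FilterPredicate get() {
        FilterPredicate result = FilterPredicate.alwaysTrue();
        for (Expression e : pushedFilters) {
            result = result.and(translate(e)); // connector-specific mapping
        }
        return result;
    }

    private FilterPredicate translate(Expression e) {
        // e.g. map `a > 10` onto the source's native filter representation
        throw new UnsupportedOperationException("per-connector logic");
    }
}

The planner could then call
parquetSource.applyFilterable(new ExpressionToPredicate(filters)), and a
DataStream program could construct and pass exactly the same supplier.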
>
> To summarize, my main point is that I am wondering if it is possible to
> have a single set of connector interface for both Table and DataStream,
> rather than having two hierarchies. I am not 100% sure if this would work,
> but if it works, this would be a huge win from both code maintenance and
> user experience perspective.
>
> Thanks,
>
> Jiangjie (Becket) Qin
>
>
>
> On Tue, Mar 24, 2020 at 2:03 AM Dawid Wysakowicz <dw...@apache.org>
> wrote:
>
> > Hi Timo,
> >
> > Thank you for the proposal. I think it is an important improvement that
> > will benefit many parts of the Table API. The proposal looks really good
> > to me and personally I would be comfortable with voting on the current
> > state.
> >
> > Best,
> >
> > Dawid
> >
> > On 23/03/2020 18:53, Timo Walther wrote:
> > > Hi everyone,
> > >
> > > I received some questions around how the new interfaces play together
> > > with formats and their factories.
> > >
> > > Furthermore, for MySQL or Postgres CDC logs, the format should be able
> > > to return a `ChangelogMode`.
> > >
> > > Also, I incorporated the feedback around the factory design in general.
> > >
> > > I added a new section `Factory Interfaces` to the design document.
> > > This should be helpful to understand the big picture and connecting
> > > the concepts.
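
For illustration, a rough sketch of what such a format interface could
look like (signatures abbreviated and illustrative; the `Factory
Interfaces` section in the design document is authoritative):

// Sketch: a decoding format is created by a format factory and produces
// the runtime deserializer; a CDC format such as Debezium would declare
// that it emits INSERT, UPDATE_BEFORE, UPDATE_AFTER, and DELETE rows.
interface DecodingFormat<I> {

    I createRuntimeDecoder(DynamicTableSource.Context context, DataType producedDataType);

    // Insert-only for plain JSON/Avro; a full changelog for Debezium.
    ChangelogMode getChangelogMode();
}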
> > >
> > > Please let me know what you think.
> > >
> > > Thanks,
> > > Timo
> > >
> > >
> > > On 18.03.20 13:43, Timo Walther wrote:
> > >> Hi Benchao,
> > >>
> > >> this is a very good question. I will update the FLIP about this.
> > >>
> > >> The legacy planner will not support the new interfaces. It will only
> > >> support the old interfaces. With the next release, I think the Blink
> > >> planner is stable enough to be the default one as well.
> > >>
> > >> Regards,
> > >> Timo
> > >>
> > >> On 18.03.20 08:45, Benchao Li wrote:
> > >>> Hi Timo,
> > >>>
> > >>> Thank you and others for the efforts to prepare this FLIP.
> > >>>
> > >>> The FLIP LGTM generally.
> > >>>
> > >>> +1 for moving the Blink data structures to table-common; it will be
> > >>> useful for UDFs too in the future.
> > >>> A little question is: do we plan to support the new interfaces and
> > >>> data types in the legacy planner? Or do we only plan to support
> > >>> these new interfaces in the Blink planner?
> > >>>
> > >>> And using primary keys from the DDL instead of key information
> > >>> derived from each query is also a good idea; we met some use cases
> > >>> where this did not work very well before.
> > >>>
> > >>> This FLIP also makes the dependencies of the table modules much
> > >>> clearer; I like it very much.
> > >>>

Re: [DISCUSS] FLIP-95: New TableSource and TableSink interfaces

Posted by Timo Walther <tw...@apache.org>.
Hi Becket,

thanks for your feedback and the healthy discussion.

I think the connector story will keep many of us busy for a while. It 
would be great if concepts from SQL can positively influence the design 
of the Source/Sink abstractions. Especially, we should think about some 
guidelines for how to design a connector with a semantic-free API, as 
Dawid pointed out in his last email. We should not aim to develop 
SQL-specific/SQL-only runtime connectors.

@all: If there are no objections, I would like to start a voting thread 
by tomorrow. So this is the last call to give feedback for FLIP-95.

Thanks everyone,
Timo


On 26.03.20 14:56, Becket Qin wrote:
> Hi Timo and Dawid,
> 
> Thanks for the patient explanation. I just had a phone call with Kurt and
> Jark. I do see that there are a few abstractions for which we only see
> SQL use cases so far. Therefore, while a Source abstraction that may be
> shared across use cases with different semantics is theoretically useful,
> building it may not bring us much value at this point. So I am convinced
> that it doesn't have to be done right now, and I have no further concern
> with the design in the current FLIP.
> 
> Again, really appreciate the patient discussion! I learned quite a bit from
> it.
> 
> Cheers,
> 
> Jiangjie (Becket) Qin
> 
> On Thu, Mar 26, 2020 at 8:58 PM Dawid Wysakowicz <dw...@apache.org>
> wrote:
> 
>> Hi Becket,
>>
>> Generally I don't think connector developers should bother with
>> understanding any of the SQL concepts.
>>
>> I am not sure if we understand "connector developer" the same way. Let
>> me describe how I see the process of writing a new source (one that can
>> be used in both the Table & DataStream API):
>>
>> 1. Connector developer writes a Source that deals with the actual reading
>> and deserializing (preferably with a pluggable format/deserializer). The
>> result of that step should be something like:
>>
>> FilesystemSource
>>     .path(...)
>>     .format(ParquetFormat
>>                 .filterPredicate(/* parquet specific filter */)
>>                 .project(/* parquet specific projection */)
>>                 .map(...))
>>     .watermarkAssigner(...)
>>
>>
>> This is useful for DataStream, and we can and want to use this in the
>> Table API. Those interfaces shouldn't accept any *Translators, though.
>> That makes no sense because internally they are not dealing with, e.g.,
>> the Expression. They should accept already created predicates.
>>
>> We are not designing anything at that level. This we expect from FLIP-27.
>>
>> 2. Then we need to have a DynamicTableSource with different abilities that
>> can create e.g. the parquet filter or projection from expressions. I think
>> this is what you also describe in your second point. And this is what we
>> are designing in the FLIP. Bear in mind that e.g. Deserializer will be
>> created out of multiple SQL concepts: regular schema/computed
>> columns/possibly projections etc., each applied at different planning
>> stages.
>>
>> All of those interfaces serve the purpose of configuring the
>> DynamicTableSource so that it is able to instantiate the Source with proper
>> configuration. In other words it is a factory for the source that you can
>> configure with SQL concepts. In turn this Factory will call another factory
>> from point 1.
>>
>> I don't see a potential for unifying factories across different high
>> level APIs. Take your example of a Spatial Database that operates on
>> Coordinates and Area (even though those would rather be modeled as SQL
>> types and we would still operate on Rows; but just for the sake of the
>> example): in that respect there is no point in having a
>> PushDownComputedColumns interface in the factory for the spatial
>> database.
>>
>> Best,
>>
>> Dawid
>>
>>
>> On 26/03/2020 11:47, Becket Qin wrote:
>>
>> Hi Timo,
>>
>> Regarding "connector developers just need to know how to write an
>>
>> ExpressionToParquetFilter":
>>
>>
>> This is the entire purpose of the DynamicTableSource/DynamicTableSink.
>>
>> The bridging between SQL concepts and connector specific concepts.
>> Because this is the tricky part. How to get from a SQL concept to a
>> connctor concept.
>>
>> Maybe it is just a naming issue depending on whether one is looking
>> upward from the connector's perspective, or looking downward from the
>> SQL perspective. If we agree that the connectors should provide a
>> semantic-free API to the high level use cases, it seems we should follow
>> the former path. And if there are one or two APIs that the connector
>> developers have to understand in order to support Table / SQL, I think
>> we can just address them case by case, instead of wrapping the entire
>> low level source API with a set of new concepts.
>>
>> Correct me if I am wrong, but can we tell the following story to a
>> connector developer and get all the TableSource functionality to work?
>>
>> To provide a TableSource from a Source, one just needs to know two more
>> concepts: *Row* and *Expression*. The work to create a TableSource is
>> the following:
>> 1. A connector developer can write three classes in order to build a table
>> source:
>>
>>     - Deserializer<Row> (Must-have)
>>     - PredicateTranslator<Expression, FilterPredicate> (optional, only
>>     applicable if the Source is a FilterableSource)
>>     - PredicateTranslator<Expression, ProjectionPredicate> (optional, only
>>     applicable if the Source is a ProjectableSource)
>>
>> 2. In order to let the table source be discoverable, one needs to
>> provide a Factory (a discovery sketch follows below), and that Factory
>> provides the following as a bundle:
>>
>>     - The Source itself (Must-have)
>>     - The Deserializer<Row> (Must-have)
>>     - PredicateTranslator<Expression, FilterPredicate> (optional, only
>>     applicable when the Factory is a FilterFactory)
>>     - PredicateTranslator<Expression, ProjectionPredicate> (optional, only
>>     applicable when the Factory is a ProjectorFactory)
>>
>> 3. The Deserializer<Row> may implement one more decorative interface to
>> further convert the record after deserialization.
>>
>>     - withMapFunction<Row, Row>;
>>
>> Note that the above description only requires the connector developer
>> to understand Expression and Row. If this works, it is much easier to
>> explain than throwing a full set of new concepts at them. More
>> importantly, it is way more generic. For example, if we change Row to
>> Coordinates, and Expression to Area, we easily get a Source for a
>> Spatial Database.
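
As an illustration of the discovery step in point 2, the service
provider mechanism could be used roughly like this (a sketch;
SourceFactory and factoryIdentifier() are hypothetical names taken from
the story above):

import java.util.ServiceLoader;

// Hypothetical sketch: locate a connector factory by identifier via the
// standard Java SPI mechanism (META-INF/services entries).
final class FactoryDiscovery {

    static SourceFactory discover(String identifier) {
        for (SourceFactory factory : ServiceLoader.load(SourceFactory.class)) {
            if (factory.factoryIdentifier().equals(identifier)) {
                return factory;
            }
        }
        throw new IllegalArgumentException(
            "No source factory found for identifier: " + identifier);
    }
}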
>>
>>
>> One thing I want to call out is that the old SourceFunction and
>> InputFormat are concrete implementations that do the actual IO work,
>> while the Source API in FLIP-27 is kind of a factory by itself already.
>> So if we can push the decorative interfaces from the TableFactory layer
>> to the Source layer, it will help unify the experience for DataStream
>> and Table sources. This will also align with our goal of letting the
>> DataStream Source provide a semantic-free API that can be used by
>> different high level APIs.
>>
>>
>> BTW, Jark suggested that we can probably have an offline call to accelerate
>> the discussion. I think it is a good idea. Can we do that?
>>
>> Thanks,
>>
>> Jiangjie (Becket) Qin
>>
>>
>> On Thu, Mar 26, 2020 at 5:28 PM Timo Walther <tw...@apache.org> wrote:
>>
>>
>> Hi Becket,
>>
>> Regarding "PushDown/NestedPushDown which is internal to optimizer":
>>
>> Those concepts cannot be entirely internal to the optimizer; at some
>> point the optimizer needs to pass them into the connector specific code.
>> This code will then convert them to e.g. Parquet expressions. So there
>> must be some interface that takes a SQL Expression and converts it to
>> connector specific code. This interface between planner and connector is
>> modelled by the SupportsXXX interfaces. And you are right, if developers
>> don't care, they don't need to implement those optional interfaces, but
>> they will not get performant connectors.
>>
>> Regarding "Table connector can work with the above two mechanism":
>>
>> A table connector needs three mechanisms that are represented in the
>> current design.
>>
>> 1. a stateless discovery interface (Factory) that can convert
>> ConfigOptions to a stateful factory interface
>> (DynamicTableSource/DynamicTableSink)
>>
>> 2. a stateful factory interface (DynamicTableSource/DynamicTableSink)
>> that receives concepts from the optimizer (watermarks, filters,
>> projections) and produces runtime classes such as your
>> `ExpressionToParquetFilter`
>>
>> 3. runtime interfaces that are generated from the stateful factory; all
>> the factories that you mentioned can be used in `getScanRuntimeProvider`.
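
To connect the three mechanisms, a compressed sketch (interface names
follow the FLIP discussion; ParquetDynamicSource and the helper methods
are invented for illustration):

import java.util.ArrayList;
import java.util.List;

// 1. Stateless discovery: ConfigOptions in, stateful factory out.
class ParquetTableFactory implements DynamicTableSourceFactory {
    @Override
    public DynamicTableSource createDynamicTableSource(Context context) {
        return new ParquetDynamicSource();
    }
}

// 2. Stateful factory: receives optimizer concepts such as pushed filters.
class ParquetDynamicSource implements ScanTableSource, SupportsFilterPushDown {

    private List<ResolvedExpression> filters = new ArrayList<>();

    @Override
    public Result applyFilters(List<ResolvedExpression> filters) {
        this.filters = filters;
        // Conservatively report all filters as remaining so the planner
        // still applies them after the source.
        return Result.of(filters, filters);
    }

    // 3. Runtime interfaces generated from the stateful factory.
    @Override
    public ScanRuntimeProvider getScanRuntimeProvider(ScanContext ctx) {
        // expressionToParquetFilter(...) is the hypothetical translation
        // step discussed in this thread.
        return SourceFunctionProvider.of(
            newParquetSource(expressionToParquetFilter(filters)), false);
    }
}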
>>
>> Regarding "connector developers just need to know how to write an
>> ExpressionToParquetFilter":
>>
>> This is the entire purpose of the DynamicTableSource/DynamicTableSink:
>> the bridging between SQL concepts and connector specific concepts.
>> Because this is the tricky part: how to get from a SQL concept to a
>> connector concept.
>>
>> Regards,
>> Timo
>>
>>
>> On 26.03.20 04:46, Becket Qin wrote:
>>
>> Hi Timo,
>>
>> Thanks for the reply. I totally agree that there must be something new
>> added to the connector in order to make it work for SQL / Table. My
>> concern is mostly over what those additions should be, and how to add
>> them. To be honest, I was kind of lost when looking at interfaces such
>> as DataStructureConverter, RuntimeConverter and their internal context.
>> Also, I believe most connector developers do not care about the concept
>> of "PushDown" / "NestedPushDown", which is internal to the optimizer
>> and not even exposed to SQL writers.
>>
>> Therefore I am trying to see if we can:
>> A) Keep those additions minimal for the connector developers if they
>> don't have to know the details.
>> B) Expose as few high level concepts as possible. More specifically,
>> try to speak the connector language and expose the general mechanism
>> instead of binding it to use case semantics.
>>
>> If we can achieve the above two goals, we could avoid adding
>> unnecessary burden to the connector developers, and also make the
>> connectors more generic.
>>
>> It might be worth thinking about what additional work is necessary for
>> the connector developers. Here is what I am thinking of, please correct
>> me if I miss something.
>>
>>     1. A Factory interface that allows a high level use case, in this
>>     case SQL, to find a matching source using the service provider
>>     mechanism.
>>     2. A way for the high level use case to specify the plugins that
>>     are supported by the underlying DataStream Source.
>>
>> If the Table connector can work with the above two mechanisms, maybe we
>> can make some slight modifications to the interfaces in the current
>> FLIP.
>>
>>     - A *SourceFactory* which extends the Factory interface in the
>>       FLIP, with one more method:
>>        - *Source getSource();*
>>     - Some decorative interfaces to the SourceFactory such as:
>>        - *FilterFactory<PREDICATE, T extends Supplier<PREDICATE>>*,
>>          with the following method:
>>           - T getFilter();
>>        - *ProjectorFactory<PREDICATE, T extends Supplier<PREDICATE>>*,
>>          with the following method:
>>           - T getProjector();
>>        - *DeserializerFactory<INPUT, OUTPUT>*
>>
>> With this set of APIs, a ParquetTableSourceFactory may become:
>>
>> class ParquetTableSourceFactory implements
>>             SourceFactory,
>>             DeserializerFactory<ParquetRecords, Row>,
>>             FilterFactory<ParquetFilter, ExpressionToParquetFilter> {
>>      @Override
>>      ParquetSource getSource() { ... }
>>
>>      @Override
>>      ExpressionToParquetFilter getFilter() { ... }
>> }
>>
>> The ExpressionToParquetFilter will have an *applyPredicate(Expression)*
>> method.
>>
>> I know it does not look like a perfect interface from the pure SQL
>> perspective. And I am not even sure if this would meet all the
>> requirements for SQL, but the benefit is that the connector developers
>> just need to know how to write an ExpressionToParquetFilter in order to
>> make it work for Table, without having to understand the entire set of
>> SQL concepts.
>>
>> Thanks,
>>
>> Jiangjie (Becket) Qin
>>
>>
>>
>> On Wed, Mar 25, 2020 at 5:57 PM Timo Walther <tw...@apache.org> wrote:
>>
>>
>> Hi Becket,
>>
>> Let me clarify a few things first: Historically we thought of Table
>> API/SQL as a library on top of DataStream API. Similar to Gelly or CEP.
>> We used TypeInformation in Table API to integrate nicely with DataStream
>> API. However, the last years have shown that SQL is not just a library.
>> It is an entire ecosystem that defines data types, submission behavior,
>> execution behavior, and highly optimized SerDes. SQL is a way to declare
>> data processing end-to-end such that the planner has the full control
>> over the execution.
>>
>> But I totally agree with your concerns around connectors. There is no
>> big difference between your concerns and the current design.
>>
>> 1. "native connector interface is a generic abstraction of doing IO and
>> Serde":
>>
>> This is the case in our design. We are using SourceFunction,
>> DeserializationSchema, WatermarkAssigner, etc. all pluggable interfaces
>> that the DataStream API offers for performing runtime operations.
>>
>> 2. "advanced features ... could be provided in a semantic free way":
>>
>> I agree here. But this is an orthogonal topic that each connector
>> implementer should keep in mind. If a new connector is developed, it
>> should *not* be developed with only SQL in mind but with a good
>> abstraction such that DataStream API users can use it as well. A
>> connector should have a builder pattern to plug in all capabilities,
>> like Parquet filters etc.
>> There should be no table-specific native/runtime connectors. I think
>> this discussion is related to the discussion of FLIP-115.
>>
>> However, as I mentioned before: This FLIP only discusses the interfaces
>> for communication between planner and connector factory. As Dawid said
>> earlier, a DynamicTableSource can rather be seen as a factory that calls
>> pluggable interfaces of a native connector in the end:
>>
>> KafkaConnector.builder()
>>      .watermarkAssigner(...)
>>      .keyDeser(...)
>>      .valueDeser(...)
>>      ....
>>      .build()
>>
>> Regards,
>> Timo
>>
>>
>> On 25.03.20 09:05, Becket Qin wrote:
>>
>> Hi Kurt,
>>
>> I do not object to promoting the concepts of SQL, but I don't think we
>> should do that by introducing a new dedicated set of public connector
>> interfaces that is only for SQL. The same argument could be applied to
>> Gelly, CEP, and Machine Learning, claiming that they need to introduce
>> a dedicated public set of interfaces that fits their own concepts and
>> ask the connector developers to learn and follow their design. As an
>> analogy, if we want to promote Chinese, we don't want to force people
>> to learn ancient Chinese poems while they only need to know a few words
>> like "hello" and "goodbye".
>>
>> As design principles, here is what I think Flink connectors should look
>> like:
>>
>> 1. The native connector interface is a generic abstraction of doing IO
>> and Serde, without semantics for high level use cases such as SQL,
>> Gelly, CEP, etc.
>>
>> 2. Some advanced features that may help accelerate the IO and Serde
>> could be provided in the native connector interfaces in a semantic free
>> way so all the high level use cases can leverage them.
>>
>> 3. Additional semantics can be built on top of the native source
>> interface by providing different plugins. These plugins could be high
>> level use case aware. For example, to provide a filter to the source,
>> we can do the following:
>>
>> // An interface for all the filters that take an expression.
>> interface ExpressionFilter {
>>       FilterResult applyFilterExpression(Expression expression);
>> }
>>
>> // A filter plugin implementation that translates the SQL Expression
>> // to a ParquetFilterPredicate.
>> class ParquetExpressionFilter implements
>>         Supplier<ParquetFilterPredicate>, ExpressionFilter {
>>       // Called by the high level use case.
>>       FilterResult applyFilterExpression(Expression expression) { ... }
>>
>>       // Used by the native Source interface.
>>       ParquetFilterPredicate get() { ... }
>> }
>>
>> In this case, the connector developer just needs to write the logic of
>> translating an Expression to a Parquet FilterPredicate. They don't have
>> to understand the entire set of interfaces that we want to promote.
>> Just like they only need to know how to say "Hello" without learning
>> ancient Chinese poems.
>>
>> Again, I am not saying this is necessarily the best approach. But so
>> far it seems a reasonable design principle to tell the developers.
>>
>> Thanks,
>>
>> Jiangjie (Becket) Qin
>>
>>
>>
>> On Wed, Mar 25, 2020 at 11:53 AM Kurt Young <yk...@gmail.com> wrote:
>>
>>
>> Hi Becket,
>>
>> I don't think we should discuss this in purely engineering terms. Your
>> proposal is trying to let SQL connector developers understand as few
>> SQL concepts as possible. But quite the opposite, we are designing
>> those interfaces to emphasize the SQL concepts, to bridge high level
>> concepts into real interfaces and classes.
>>
>> We keep talking about time-varying relations and dynamic tables when we
>> introduce SQL concepts; sources and sinks are the most critical parts
>> playing with those concepts. It's essential to let Flink SQL developers
>> learn these concepts and connect them with real code through these
>> connector interfaces, so they can further write *correct* connectors
>> based on such domain knowledge.
>>
>> So this FLIP is a very important chance to express these concepts and
>> get most SQL developers aligned with the concepts and on the same page.
>> It's mostly about different levels of abstraction, and for domains like
>> SQL this is becoming more important. It helps Flink SQL go smoothly in
>> the future, and also makes it easier for new contributors. But I would
>> admit this is not that obvious for others who don't work with SQL
>> frequently.
>>
>> Best,
>> Kurt
>>
>>
>> On Wed, Mar 25, 2020 at 11:07 AM Becket Qin <be...@gmail.com> wrote:
>>
>>
>> Hi Jark,
>>
>> It is good to know that we do not expect the end users to touch those
>> interfaces.
>>
>> Then the question boils down to whether the connector developers should
>> be aware of the interfaces that are only used by the SQL optimizer. It
>> seems a win if we can avoid that.
>>
>> Two potential solutions off the top of my head are:
>> 1. An internal helper class doing the instanceof checks based on the
>> DataStream source interface and creating pluggables for that DataStream
>> source.
>> 2. Codegen the set of TableSource interfaces given a DataStream Source
>> and its corresponding TablePluggablesFactory.
>>
>> Thanks,
>>
>> Jiangjie (Becket) Qin
>>
>> On Wed, Mar 25, 2020 at 10:07 AM Jark Wu <im...@gmail.com> wrote:
>>
>>
>> Hi Becket,
>>
>> Regarding to Flavor1 and Flavor2, I want to clarify that user will
>>
>> never
>>
>> use table source like this:
>>
>> {
>>       MyTableSource myTableSource = MyTableSourceFactory.create();
>>       myTableSource.setSchema(mySchema);
>>       myTableSource.applyFilterPredicate(expression);
>>       ...
>> }
>>
>> TableFactory and TableSource are not directly exposed to end users; all
>> the methods are called by the planner, not by users.
>> Users always use DDL or a descriptor to register a table, and the
>> planner will find the factory and create sources according to the
>> properties. All the optimizations are applied automatically, e.g.
>> filter/projection pushdown; users don't need to call
>> `applyFilterPredicate` explicitly.
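
For example, the end-to-end interaction is hidden behind DDL. A minimal
sketch in the Table API (the connector options below are made up for
illustration and would need more properties in practice):

EnvironmentSettings settings =
    EnvironmentSettings.newInstance().useBlinkPlanner().build();
TableEnvironment tEnv = TableEnvironment.create(settings);

// The planner discovers the factory from the connector properties,
// creates the table source, and applies pushdowns during optimization.
tEnv.sqlUpdate(
    "CREATE TABLE orders (" +
    "  order_id BIGINT," +
    "  amount DOUBLE" +
    ") WITH (" +
    "  'connector.type' = 'kafka'," +
    "  'format.type' = 'json'" +
    ")");

// Filter pushdown (if supported) happens automatically here.
Table filtered = tEnv.sqlQuery("SELECT * FROM orders WHERE amount > 10");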
>>
>> On Wed, 25 Mar 2020 at 09:25, Becket Qin <be...@gmail.com> wrote:
>>
>>
>> Hi Timo and Dawid,
>>
>> Thanks for the clarification. They really help. You are right that we
>> are on the same page regarding the hierarchy. I think the only
>> difference between our views is the flavor of the interfaces. There are
>> two flavors of the source interface for DataStream and Table sources.
>>
>> *Flavor 1. Table Sources are some wrapper interfaces around the
>> DataStream source.*
>> Following this way, we will reach the design of the current proposal,
>> i.e. each pluggable exposed in the DataStream source will have a
>> corresponding TableSource interface counterpart at the Factory level.
>> Users will write code like this:
>>
>> {
>>       MyTableSource myTableSource = MyTableSourceFactory.create();
>>       myTableSource.setSchema(mySchema);
>>       myTableSource.applyFilterPredicate(expression);
>>       ...
>> }
>>
>> The good thing about this flavor is that from the SQL / Table
>> perspective, there is a dedicated set of Table oriented interfaces.
>>
>> The downsides are:
>> A. From the user's perspective, DataStream Source and Table Source are
>> just two different sets of interfaces, regardless of how similar they
>> are internally.
>> B. The source developers have to develop against those two sets of
>> interfaces in order to support both DataStream and Table.
>> C. It is not explicit that DataStream can actually share the pluggables
>> in Table / SQL. For example, in order to provide a filter pluggable
>> with a SQL expression, users will have to know the actual converter
>> class that converts the expression to the filter predicate and
>> construct that converter by themselves.
>>
>> ---------------
>>
>> *Flavor 2. A TableSource is a DataStream source with a bunch of
>> pluggables. No Table specific interfaces at all.*
>> Following this way, we will reach another design where you have a
>> SourceFactory and a single Pluggable factory for all the table
>> pluggables. And users will write something like:
>>
>> {
>>       Deserializer<Row> myTableDeserializer =
>>           MyTablePluggableFactory.createDeserializer(schema);
>>       MySource<Row> mySource = MySourceFactory.create(properties,
>>           myTableDeserializer);
>>       mySource.applyPredicate(
>>           MyTablePluggableFactory.createFilterPredicate(expression));
>> }
>>
>> The good thing about this flavor is that there is just one set of
>> interfaces that works for both Table and DataStream. There is no
>> difference between creating a DataStream source and creating a Table
>> source. DataStream can easily reuse the pluggables from the Table
>> sources.
>>
>> The downside is that Table / SQL won't have a dedicated API for
>> optimization. Instead of writing:
>>
>> if (myTableSource instanceof FilterableTableSource) {
>>        // Some filter push down logic.
>>        myTableSource.applyPredicate(expression);
>> }
>>
>> one would have to write:
>>
>> if (mySource instanceof FilterableSource) {
>>       // Some filter push down logic.
>>       mySource.applyPredicate(
>>           MyTablePluggableFactory.createFilterPredicate(expression));
>> }
>>
>> -------------------------
>>
>> Just to be clear, I am not saying flavor 2 is necessarily better than
>> flavor 1, but I want to make sure flavor 2 is also considered and
>> discussed.
>>
>> Thanks,
>>
>> Jiangjie (Becket) Qin.
>>
>> On Tue, Mar 24, 2020 at 10:53 PM Dawid Wysakowicz <dwysakowicz@apache.org>
>> wrote:
>>
>>
>> Hi Becket,
>>
>> I really don't think we have differing opinions. We might just not see
>> the changes in the same way yet. Personally I think of the
>> DynamicTableSource as a factory for a Source implemented for the
>> DataStream API. The important fact is that the DynamicTableSource and
>> all feature traits (SupportsFilterablePushDown,
>> SupportsProjectPushDown, etc.) work with Table API concepts such as
>> Expressions, SQL specific types, etc. In the end the implementation
>> would resemble the following (bear in mind I tremendously simplified
>> the example, just to show the relation between the two APIs):
>>
>> SupportsFilterablePushDown {
>>
>>     applyFilters(List<ResolvedExpression> filters) {
>>       this.filters = convertToDataStreamFilters(filters);
>>     }
>>
>>     Source createSource() {
>>       return Source.create()
>>         .applyFilters(this.filters);
>>     }
>> }
>>
>> or exactly as you said for the computed columns:
>>
>> SupportsComputedColumnsPushDown {
>>
>>     applyComputedColumn(ComputedColumnConverter converter) {
>>       this.deserializationSchema = new DeserializationSchema<Row>() {
>>         Row deserialize(...) {
>>           // original format, e.g. json, avro, etc.
>>           RowData row = format.deserialize(bytes);
>>           RowData enriched = converter(row);
>>         }
>>       };
>>     }
>>
>>     Source createSource() {
>>       return Source.create()
>>         .withDeserialization(deserializationSchema);
>>     }
>> }
>>
>> So to sum it up again, all those interfaces are factories that
>> configure appropriate parts of the DataStream API using Table API
>> concepts. Finally, to answer your question for the particular
>> comparisons:
>>
>> DynamicTableSource v.s. Source<Row, SourceSplitT, EnumChkT>
>> SupportsFilterablePushDown v.s. FilterableSource
>> SupportsProjectablePushDown v.s. ProjectableSource
>> SupportsWatermarkPushDown v.s. WithWatermarkAssigner
>> SupportsComputedColumnPushDown v.s. ComputedColumnDeserializer
>> ScanTableSource v.s. ChangeLogDeserializer.
>>
>> Pretty much you can think of everything on the left as factories for
>> the right side; the left side works with Table API classes
>> (Expressions, DataTypes). I hope this clarifies it a bit.
>>
>> Best,
>>
>> Dawid
>> On 24/03/2020 15:03, Becket Qin wrote:
>>
>> Hey Kurt,
>>
>> "I don't think DataStream should see some SQL specific concepts such as
>> Filtering or ComputedColumn."
>>
>> Projectable and Filterable are not necessarily SQL concepts; they could
>> be applicable to a DataStream source as well, to reduce the network
>> load. For example, ORC and Parquet should probably also be readable
>> from DataStream, right?
>>
>> ComputedColumn is not part of the Source; it is an interface extending
>> the Deserializer, which is a pluggable for the Source. From the SQL
>> perspective it is the concept of a computed column, but from the Source
>> perspective it is essentially a Deserializer which also converts the
>> records internally, assuming we allow some conversion to be embedded
>> into the source in addition to just deserialization.
>>
>> Thanks,
>>
>> Jiangjie (Becket) Qin
>>
>> On Tue, Mar 24, 2020 at 9:36 PM Jark Wu <im...@gmail.com> wrote:
>>
>> Thanks Timo for updating the formats section. That would be very
>> helpful for changelog support (FLIP-105).
>>
>> I just left 2 minor comments about some method names. In general, I'm
>> +1 to start a voting.
>>
>> --------------------------------------------------------------------------------------------------
>>
>> Hi Becket,
>>
>> I agree we shouldn't duplicate code, especially the runtime
>> implementations. However, the interfaces proposed by FLIP-95 are mainly
>> used during optimization (compiling), not at runtime. I don't think
>> there is much to share here, because Table/SQL is declarative while
>> DataStream is imperative.
>> For example, for filter push down, a DataStream FilterableSource may
>> allow accepting a FilterFunction (which is a black box for the source).
>> However, table sources should pick the pushed filter expressions, and
>> some sources may only support "=", "<", ">" conditions. Pushing a
>> FilterFunction doesn't work in the table ecosystem. That means the
>> connectors have to have some table-specific implementations.
>>
>>
>> Best,
>> Jark
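
To illustrate Jark's point concretely, the table-specific pushdown lets
the source accept only the expressions it understands and hand
everything else back to the planner. A rough sketch (the Result helper
and method names are illustrative, following the
SupportsFilterablePushDown examples elsewhere in this thread):

import java.util.ArrayList;
import java.util.List;

// Sketch: the source keeps the simple comparisons it can evaluate
// natively; all other expressions remain in the query plan.
class MyTableSource implements SupportsFilterablePushDown {

    private final List<ResolvedExpression> accepted = new ArrayList<>();

    public Result applyFilters(List<ResolvedExpression> filters) {
        List<ResolvedExpression> remaining = new ArrayList<>();
        for (ResolvedExpression filter : filters) {
            if (isSimpleComparison(filter)) { // e.g. =, <, > on a column
                accepted.add(filter);
            } else {
                remaining.add(filter);        // planner keeps this one
            }
        }
        return Result.of(accepted, remaining);
    }

    private boolean isSimpleComparison(ResolvedExpression e) {
        return false; // placeholder for a per-connector capability check
    }
}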
>>
>> On Tue, 24 Mar 2020 at 20:41, Kurt Young <yk...@gmail.com> wrote:
>>
>> Hi Becket,
>>
>> I don't think DataStream should see some SQL specific concepts such as
>> Filtering or ComputedColumn. It's better to stay within the SQL area
>> and translate to more generic concepts when translating to the
>> DataStream/Runtime layer, such as using a MapFunction to represent
>> computed column logic.
>>
>> Best,
>> Kurt
>>
>>
>> On Tue, Mar 24, 2020 at 5:47 PM Becket Qin <be...@gmail.com> wrote:
>>
>> Hi Timo and Dawid,
>>
>> It's really great that we have the same goal. I am actually wondering
>> if we can go one step further and avoid some of the interfaces in Table
>> as well. For example, if we have the FilterableSource, do we still need
>> the FilterableTableSource? Should DynamicTableSource just become a
>> Source<*Row*, SourceSplitT, EnumChkT>?
>>
>> Can you help me understand a bit more the reason we need the following
>> relational representation / wrapper interfaces v.s. the interfaces that
>> we could put into the Source in FLIP-27?
>>
>> DynamicTableSource v.s. Source<Row, SourceSplitT, EnumChkT>
>> SupportsFilterablePushDown v.s. FilterableSource
>> SupportsProjectablePushDown v.s. ProjectableSource
>> SupportsWatermarkPushDown v.s. WithWatermarkAssigner
>> SupportsComputedColumnPushDown v.s. ComputedColumnDeserializer
>> ScanTableSource v.s. ChangeLogDeserializer.
>> LookUpTableSource v.s. LookUpSource
>>
>> Assuming we have all the interfaces on the right side, do we still need
>> the interfaces on the left side? Note that the interfaces on the right
>> can be used by both DataStream and Table. If we do this, there will
>> only be one set of Source interfaces for Table and DataStream; the only
>> difference is that the Source for Table will have some specific plugins
>> and configurations. An omnipotent Source can implement all of the above
>> interfaces and take a Deserializer that implements both
>> ComputedColumnDeserializer and ChangeLogDeserializer.
>>
>> Would the SQL planner work with that?
>>
>> Thanks,
>>
>> Jiangjie (Becket) Qin
>>
>>
>>
>>
>>
>>
>> On Tue, Mar 24, 2020 at 5:03 PM Jingsong Li <jingsonglee0@gmail.com>
>> wrote:
>>
>>
>> +1. Thanks Timo for the design doc.
>>
>> We can also consider @Experimental too. But I am +1 to @PublicEvolving;
>> we should be confident in the current change.
>>
>>
>> Best,
>> Jingsong Lee
>>
>> On Tue, Mar 24, 2020 at 4:30 PM Timo Walther <tw...@apache.org>
>> wrote:
>>
>> @Becket: We totally agree that we don't need table specific connectors
>> during runtime. As Dawid said, the interfaces proposed here are just
>> for communication with the planner. Once the properties (watermarks,
>> computed columns, filters, projections, etc.) are negotiated, we can
>> configure a regular Flink connector, e.g. setting the watermark
>> assigner and deserialization schema of a Kafka connector.
>>
>> For better separation of concerns, Flink connectors should not include
>> relational interfaces and depend on flink-table. This is the
>> responsibility of the table source/sink.
>>
>> @Kurt: I would like to mark them @PublicEvolving already because we
>> need to deprecate the old interfaces as early as possible. We cannot
>> redirect to @Internal interfaces. They are not marked @Public, so we
>> can still evolve them. But a core design shift should not happen again;
>> it would leave a bad impression if we redesign over and over again.
>> Instead we should be confident in the current change.
>>
>> Regards,
>> Timo
>>
>>
>> On 24.03.20 09:20, Dawid Wysakowicz wrote:
>>
>> Hi Becket,
>>
>> Answering your question, we have the same intention not to duplicate
>> connectors between the DataStream and Table APIs. The interfaces
>> proposed in the FLIP are a way to describe the relational properties of
>> a source. The intention is, as you described, to translate all of
>> those, expressed as Expressions or other Table specific structures,
>> into a DataStream source. In other words, I think what we are doing
>> here is in line with what you described.
>>
>> Best,
>>
>> Dawid
>>
> 


Re: [DISCUSS] FLIP-95: New TableSource and TableSink interfaces

Posted by Becket Qin <be...@gmail.com>.
Hi Timo and Dawid,

Thanks for the patient explanation. I just had a phone call with Kurt and
Jark. I do see there are a few abstractions that we only see the use case
in SQL so far. Therefore while thinking of a Source abstraction that may be
shared with different use cases semantics is theoretically useful, doing
that may not bring us much value at this point. So I am convinced that it
doesn't have to be done right now and I have no further concern with the
design in the current FLIP.

Again, really appreciate the patient discussion! I learned quite a bit from
it.

Cheers,

Jiangjie (Becket) Qin

On Thu, Mar 26, 2020 at 8:58 PM Dawid Wysakowicz <dw...@apache.org>
wrote:

> Hi Becket,
>
> Generally I don't think connector developers should bother with
> understanding any of the SQL concepts.
>
> I am not sure if we understand "connector developer" the same way. Let me
> describe how I see the process of writing a new source (that can be used in
> both Table & DataStream API)
>
> 1. Connector developer writes a Source that deals with the actual reading
> and deserializing (preferably with a pluggable format/deserializer). The
> result of that step should be something like:
>
> FilesystemSource
>
>     .path(...)
>
>     .format(ParquetFormat
>
>                     .filterPredicate(/* parquet specific filter */)
>
>                     .project(/* parquet specific projection */)
>
>                     .map(...))
>
>     .watermarkAssigner(...)
>
>
> This is useful for DataStream and we can and want to use this in the Table
> API.  Those interface shouldn't accept any *Translators though. It does
> make no sense cause internally they are not dealing e.g. with the
> Expression. They should accept already created predicates.
>
> We are not designing anything at that level. This we expect from FLIP-27
>
> 2. Then we need to have a DynamicTableSource with different abilities that
> can create e.g. the parquet filter or projection from expressions. I think
> this is what you also describe in your second point. And this is what we
> are designing in the FLIP. Bear in mind that e.g. Deserializer will be
> created out of multiple SQL concepts: regular schema/computed
> columns/possibly projections etc., each applied at different planning
> stages.
>
> All of those interfaces serve the purpose of configuring the
> DynamicTableSource so that it is able to instantiate the Source with proper
> configuration. In other words it is a factory for the source that you can
> configure with SQL concepts. In turn this Factory will call another factory
> from point 1.
>
> I don't see a potential for unifying factories across different high level
> APIs. Taking your example with Spatial Database that operates on
> Coordinates and Area (even though those would rather be modeled as SQL
> types and we would still operate on Rows, but just for the sake of the
> example). In that respect there is no point in having a
> PushDownComputedColumns interface in the factory for the spatial database.
>
> Best,
>
> Dawid
>
>
> On 26/03/2020 11:47, Becket Qin wrote:
>
> Hi Timo,
>
> Regarding "connector developers just need to know how to write an
>
> ExpressionToParquetFilter":
>
>
> This is the entire purpose of the DynamicTableSource/DynamicTableSink.
>
> The bridging between SQL concepts and connector specific concepts.
> Because this is the tricky part. How to get from a SQL concept to a
> connctor concept.
>
> Maybe it is just a naming issue depending on whether one is looking upward
> from the Connectors perspective, or looking downward from the SQL
> perspective. If we agree that the connectors should provide semantic free
> API to the high level use cases, it seems we should follow the former path.
> And if there are one or two APIs that the connector developers have to
> understand in order to support Table / SQL, I think we can just address
> them case by case, instead of wrapping the entire low level source API
> with a set of new concepts.
>
> Correct me if I am wrong, can we tell the following story to a connector
> developer and get a all the TableSource functionality work?
>
> To provide a TableSource from a Source, one just need to know two more
> concepts: *Row* and *Expression*. The work to create a TableSource are
> following:
> 1. A connector developer can write three classes in order to build a table
> source:
>
>    - Deserializer<Row> (Must-have)
>    - PredicateTranslator<Expression, FilterPredicate> (optional, only
>    applicable if the Source is a FilterableSource)
>    - PredicateTranslator<Expression, ProjectionPredicate> (optional, only
>    applicable if the Source is a ProjectableSource)
>
> 2. In order to let the table source be discoverable, one need to provide a
> Factory, and that Factory provides the following as a bundle:
>
>    - The Source itself (Must-have)
>    - The Deserializer<Row> (Must-have)
>    - PredicateTranslator<Expression, FilterPredicate> (optional, only
>    applicable when the Factory is a FilterFactory)
>    - PredicateTranslator<Expression, ProjectionPredicate> (optional, only
>    applicable when the Factory is a ProjectorFactory)
>
> 3. The Deserializer<Row> may implement one more decorative interfaces to
> further convert the record after deserialization.
>
>    - withMapFunction<Row, Row>;
>
> Note that the above description only require the connector developer to
> understand Expression and Row. If this works, It is much easier to explain
> than throwing a full set of new concepts. More importantly, it is way more
> generic. For example, If we change Row to Coordinates, and Expression to
> Area, we easily get a Source for a Spatial Database.
>
>
> One thing I want to call out is that while the old SourceFunction and
> InputFormat are concrete implementations that does the actual IO work. The
> Source API in FLIP-27 itself is kind of a Factory by itself already. So if
> we can push the decorative interfaces from the TableFactory layer to the
> Source layer, it will help unify the experience for DataStream and Table
> Source. This will also align with our goal of letting the DataStream Source
> provide a semantic free API that can be used by different high level API.
>
>
> BTW, Jark suggested that we can probably have an offline call to accelerate
> the discussion. I think it is a good idea. Can we do that?
>
> Thanks,
>
> Jiangjie (Becket) Qin
>
>
> On Thu, Mar 26, 2020 at 5:28 PM Timo Walther <tw...@apache.org> <tw...@apache.org> wrote:
>
>
> Hi Becket,
>
> Regarding "PushDown/NestedPushDown which is internal to optimizer":
>
> Those concepts cannot be entirely internal to the optimizer, at some
> point the optimizer needs to pass them into the connector specific code.
> This code will then convert it to e.g. Parquet expressions. So there must
> be some interface that takes a SQL Expression and converts it to connector
> specific code. This interface between planner and connector is modelled
> by the SupportsXXX interfaces. And you are right, if developers don't
> care, they don't need to implement those optional interfaces but will
> not get performant connectors.
>
> Regarding "Table connector can work with the above two mechanism":
>
> A table connector needs three mechanisms that are represented in the
> current design.
>
> 1. a stateless discovery interface (Factory) that can convert
> ConfigOptions to a stateful factory interface
> (DynamicTableSource/DynamicTableSink)
>
> 2. a stateful factory interface (DynamicTableSource/DynamicTableSink)
> that receives concepts from the optimizer (watermarks, filters,
> projections) and produces runtime classes such as your
> `ExpressionToParquetFilter`
>
> 3. runtime interfaces that are generated from the stateful factory; all
> the factories that you mentioned can be used in `getScanRuntimeProvider`.
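>
> As a minimal sketch of how the three mechanisms relate (simplified
> signatures, not the exact FLIP interfaces):
>
> // 1. stateless discovery, converting ConfigOptions
> class MyTableSourceFactory implements Factory {
>     DynamicTableSource createDynamicTableSource(Context context) {
>         return new MyDynamicTableSource(context.getOptions());
>     }
> }
>
> // 2. stateful factory that receives concepts from the optimizer
> class MyDynamicTableSource implements ScanTableSource,
>         SupportsFilterablePushDown {
>     applyFilters(...) { /* e.g. create an ExpressionToParquetFilter */ }
>
>     // 3. hands out the runtime reader
>     getScanRuntimeProvider(...) { ... }
> }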
>
> Regarding "connector developers just need to know how to write an
> ExpressionToParquetFilter":
>
> This is the entire purpose of the DynamicTableSource/DynamicTableSink.
> The bridging between SQL concepts and connector specific concepts.
> Because this is the tricky part. How to get from a SQL concept to a
> connector concept.
>
> Regards,
> Timo
>
>
> On 26.03.20 04:46, Becket Qin wrote:
>
> Hi Timo,
>
> Thanks for the reply. I totally agree that there must be something new
> added to the connector in order to make it work for SQL / Table. My
> concern is mostly over what they should be, and how to add them. To be
> honest, I was kind of lost when looking at the interfaces such as
> DataStructureConverter, RuntimeConverter and their internal context.
> Also I believe most connector developers do not care about the concept
> of "PushDown" / "NestedPushDown", which is internal to the optimizer and
> not even exposed to SQL writers.
>
> Therefore I am trying to see if we can:
> A) Keep those additions to a minimum for the connector developers if
> they don't have to know the details.
> B) Expose as few high level concepts as possible. More specifically, try
> to speak the connector language and expose the general mechanism instead
> of binding it to use case semantics.
>
> If we can achieve the above two goals, we could avoid adding unnecessary
> burden to the connector developers, and also make the connectors more
> generic.
>
> It might be worth thinking about what additional work is necessary for
> the connector developers. Here is what I am thinking of; please correct
> me if I miss something.
>
>     1. A Factory interface that allows a high level use case, in this
>     case SQL, to find a matching source using the service provider
>     mechanism.
>     2. Allows the high level use case to specify the plugins that are
>     supported by the underlying DataStream Source.
>
> If the Table connector can work with the above two mechanisms, maybe we
> can make some slight modifications to the interfaces in the current FLIP.
>
>     - A *SourceFactory* which extends the Factory interface in the FLIP,
>     with one more method:
>        - *Source getSource();*
>     - Some decorative interfaces to the SourceFactory such as:
>        - *FilterFactory<PREDICATE, T extends Supplier<PREDICATE>>*, with
>        the following method:
>           - T getFilter();
>        - *ProjectorFactory<PREDICATE, T extends Supplier<PREDICATE>>*, with
>        the following method:
>           - T getProjector();
>        - *DeserializerFactory<INPUT, OUTPUT>*
>
> With this set of APIs, a ParquetTableSourceFactory may become:
>
> class ParquetTableSourceFactory implements
>             SourceFactory,
>             DeserializerFactory<ParquetRecords, Row>,
>             FilterFactory<ParquetFilter, ExpressionToParquetFilter> {
>      @Override
>      ParquetSource getSource() { ... }
>
>      @Override
>      ExpressionToParquetFilter getFilterSupplier() { ... }
> }
>
> The ExpressionToParquetFilter will have an *applyPredicate(Expression)*
> method.
>
> I know it does not look like a perfect interface from the pure SQL
> perspective. And I am not even sure if this would meet all the
> requirements for SQL, but the benefit is that the connector developers
> just need to know how to write an ExpressionToParquetFilter in order to
> make it work for Table, without having to understand the entire SQL
> concept.
>
> Thanks,
>
> Jiangjie (Becket) Qin
>
>
>
> On Wed, Mar 25, 2020 at 5:57 PM Timo Walther <tw...@apache.org> wrote:
>
>
> Hi Becket,
>
> Let me clarify a few things first: Historically we thought of Table
> API/SQL as a library on top of DataStream API. Similar to Gelly or CEP.
> We used TypeInformation in Table API to integrate nicely with DataStream
> API. However, the last years have shown that SQL is not just a library.
> It is an entire ecosystem that defines data types, submission behavior,
> execution behavior, and highly optimized SerDes. SQL is a way to declare
> data processing end-to-end such that the planner has the full control
> over the execution.
>
> But I totally agree with your concerns around connectors. There is no
> big difference between your concerns and the current design.
>
> 1. "native connector interface is a generic abstraction of doing IO and
> Serde":
>
> This is the case in our design. We are using SourceFunction,
> DeserializationSchema, WatermarkAssigner, etc. all pluggable interfaces
> that the DataStream API offers for performing runtime operations.
>
> 2. "advanced features ... could be provided in a semantic free way":
>
> I agree here. But this is an orthogonal topic that each connector
> implementer should keep in mind. If a new connector is developed, it
> should *not* be developed only for SQL in mind but with good abstraction
> such that also DataStream API users can use it. A connector should have
> a builder pattern to plug in all capabilities like Parquet filters etc.
> There should be no table-specific native/runtime connectors. I think
> this discussion is related to the discussion of FLIP-115.
>
> However, as I mentioned before: This FLIP only discusses the interfaces
> for communication between planner and connector factory. As Dawid said
> earlier, a DynamicTableSource can be more seen as a factory that calls
> pluggable interfaces of a native connector in the end:
>
> KafkaConnector.builder()
>     .watermarkAssigner(...)
>     .keyDeser(...)
>     .valueDeser(...)
>     ....
>     .build()
>
> Regards,
> Timo
>
>
> On 25.03.20 09:05, Becket Qin wrote:
>
> Hi Kurt,
>
> I do not object to promoting the concepts of SQL, but I don't think we
> should do that by introducing a new dedicated set of connector public
> interfaces that is only for SQL. The same argument can be applied to
> Gelly, CEP, and Machine Learning, claiming that they need to introduce a
> dedicated public set of interfaces that fits their own concepts and ask
> the connector developers to learn and follow their design. As an analogy,
> if we want to promote Chinese, we don't want to force people to learn
> ancient Chinese poems while they only need to know a few words like
> "hello" and "goodbye".
>
> As some design principles, here is what I think Flink connectors
> should look like:
>
> 1. The native connector interface is a generic abstraction of doing IO
> and Serde, without semantics for high level use cases such as SQL,
> Gelly, CEP, etc.
>
> 2. Some advanced features that may help accelerate the IO and Serde
> could be provided in the native connector interfaces in a semantic free
> way so all the high level use cases can leverage them.
>
> 3. Additional semantics can be built on top of the native source
> interface through providing different plugins. These plugins could be
> high level use case aware. For example, to provide a filter to the
> source, we can do the following:
>
> // An interface for all the filters that take an expression.
> interface ExpressionFilter {
>       FilterResult applyFilterExpression();
> }
>
> // A filter plugin implementation that translates the SQL Expression to
> // a ParquetFilterPredicate.
> class ParquetExpressionFilter implements Supplier<ParquetFilterPredicate>,
> ExpressionFilter {
>       // Called by the high level use case.
>       FilterResult applyFilterExpression() { ... }
>
>       // Used by the native Source interface.
>       ParquetFilterPredicate get() { ... }
> }
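>
> To make the two roles concrete, the call sequence could look roughly like
> this (purely illustrative, reusing the hypothetical names above):
>
> // The high level use case (e.g. the SQL planner) pushes the expression
> // into the plugin and learns what was accepted.
> ParquetExpressionFilter filter = new ParquetExpressionFilter(expression);
> FilterResult result = filter.applyFilterExpression();
>
> // The native source only sees a Supplier<ParquetFilterPredicate>.
> parquetSource.applyFilterable(filter);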
>
> In this case, the connector developer just needs to write the logic of
> translating an Expression to a Parquet FilterPredicate. They don't have
> to understand the entire set of interfaces that we want to promote. Just
> like they only need to know how to say "Hello" without learning ancient
> Chinese poems.
>
> Again, I am not saying this is necessarily the best approach. But so far
> it seems a reasonable design principle to tell the developers.
>
> Thanks,
>
> Jiangjie (becket) Qin
>
>
>
> On Wed, Mar 25, 2020 at 11:53 AM Kurt Young <yk...@gmail.com> wrote:
>
>
> Hi Becket,
>
> I don't think we should discuss this in pure engineering aspects. Your
> proposal is trying to let SQL connector developers understand as few SQL
> concepts as possible. But quite the opposite, we are designing those
> interfaces to emphasize the SQL concepts, to bridge high level concepts
> into real interfaces and classes.
>
> We keep talking about time-varying relations and dynamic tables when we
> introduce SQL concepts; sources and sinks are the most critical parts
> playing with those concepts. It's essential to let Flink SQL developers
> learn these concepts and connect them with real code by introducing
> these connector interfaces, so they can further write *correct*
> connectors based on such domain knowledge.
>
> So this FLIP is a very important chance to express these concepts and
> make most SQL developers aligned with the concepts and on the same page.
> It's mostly about different levels of abstraction, and for domains like
> SQL this is becoming more important. It helps Flink SQL go smoothly in
> the future, and also makes it easier for new contributors. But I would
> admit this is not that obvious for others who don't work with SQL
> frequently.
>
> Best,
> Kurt
>
>
> On Wed, Mar 25, 2020 at 11:07 AM Becket Qin <be...@gmail.com> wrote:
>
> Hi Jark,
>
> It is good to know that we do not expect the end users to touch those
> interfaces.
>
> Then the question boils down to whether the connector developers should
> be aware of the interfaces that are only used by the SQL optimizer. It
> seems a win if we can avoid that.
>
> Two potential solutions off the top of my head are:
> 1. An internal helper class doing the instanceof checks based on the
> DataStream source interface and creating pluggables for that DataStream
> source.
> 2. Codegen the set of TableSource interfaces given a DataStream Source
> and its corresponding TablePluggablesFactory.
>
> Thanks,
>
> Jiangjie (Becket) Qin
>
> On Wed, Mar 25, 2020 at 10:07 AM Jark Wu <im...@gmail.com> wrote:
>
>
> Hi Becket,
>
> Regarding Flavor 1 and Flavor 2, I want to clarify that users will
> never use a table source like this:
>
> {
>      MyTableSource myTableSource = MyTableSourceFactory.create();
>      myTableSource.setSchema(mySchema);
>      myTableSource.applyFilterPredicate(expression);
>      ...
> }
>
> TableFactory and TableSource are not directly exposed to end users; all
> the methods are called by the planner, not by users.
> Users always use DDL or descriptors to register a table, and the planner
> will find the factory and create sources according to the properties.
> All the optimizations are applied automatically, e.g. filter/projection
> pushdown; users don't need to call `applyFilterPredicate` explicitly.
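>
> For example, with a DDL like the following (a generic sketch, the
> connector is left open), all of the above happens behind the scenes:
>
> CREATE TABLE orders (
>     order_id BIGINT,
>     amount DOUBLE
> ) WITH (
>     'connector' = '...'
> );
>
> -- the planner may push the projection and the filter into the source
> SELECT order_id FROM orders WHERE amount > 10;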
>
> On Wed, 25 Mar 2020 at 09:25, Becket Qin <be...@gmail.com> wrote:
>
> Hi Timo and Dawid,
>
> Thanks for the clarification. They really help. You are right that we
> are on the same page regarding the hierarchy. I think the only
> difference between our views is the flavor of the interfaces. There are
> two flavors of the source interface for DataStream and Table sources.
>
> *Flavor 1. Table Sources are wrapper interfaces around the DataStream
> source.*
> Following this way, we will reach the design of the current proposal,
> i.e. each pluggable exposed in the DataStream source will have a
> corresponding TableSource interface counterpart at the Factory level.
> Users will write code like this:
>
> {
>       MyTableSource myTableSource = MyTableSourceFactory.create();
>       myTableSource.setSchema(mySchema);
>       myTableSource.applyFilterPredicate(expression);
>       ...
> }
>
> The good thing for this flavor is that from the SQL / Table's
> perspective, there is a dedicated set of Table oriented interfaces.
>
> The downsides are:
> A. From the user's perspective, DataStream Source and Table Source are
> just two different sets of interfaces, regardless of how similar they
> are internally.
> B. The source developers have to develop for those two sets of
> interfaces in order to support both DataStream and Table.
> C. It is not explicit that DataStream can actually share the pluggables
> in Table / SQL. For example, in order to provide a filter pluggable with
> a SQL expression, users will have to know the actual converter class that
> converts the expression to the filter predicate and construct that
> converter by themselves.
>
> ---------------
>
> *Flavor 2. A TableSource is a DataStream source with a bunch of
> pluggables. No Table specific interfaces at all.*
> Following this way, we will reach another design where you have a
> SourceFactory and a single Pluggable factory for all the table
> pluggables. And users will write something like:
>
> {
>       Deserializer<Row> myTableDeserializer =
> MyTablePluggableFactory.createDeserializer(schema);
>       MySource<Row> mySource = MySourceFactory.create(properties,
> myTableDeserializer);
>
> mySource.applyPredicate(MyTablePluggableFactory.createFilterPredicate(expression));
> }
>
> The good thing for this flavor is that there is just one set of
> interfaces that works for both Table and DataStream. There is no
> difference between creating a DataStream source and creating a Table
> source. DataStream can easily reuse the pluggables from the Table
> sources.
>
> The downside is that Table / SQL won't have a dedicated API for
> optimization. Instead of writing:
>
> if (myTableSource instanceof FilterableTableSource) {
>        // Some filter push down logic.
>        myTableSource.applyPredicate(expression);
> }
>
> one has to write:
>
> if (mySource instanceof FilterableSource) {
>       // Some filter push down logic.
>
> mySource.applyPredicate(MyTablePluggableFactory.createFilterPredicate(expression));
> }
>
> -------------------------
>
> Just to be clear, I am not saying flavor 2 is necessarily better than
> flavor 1, but I want to make sure flavor 2 is also considered and
> discussed.
>
> Thanks,
>
> Jiangjie (Becket) Qin.
>
> On Tue, Mar 24, 2020 at 10:53 PM Dawid Wysakowicz <dwysakowicz@apache.org> wrote:
>
>
> Hi Becket,
>
> I really think we don't have differing opinions. We might not see the
> changes in the same way yet. Personally I think of the DynamicTableSource
> as a factory for a Source implemented for the DataStream API. The
> important fact is that the DynamicTableSource and all feature traits
> (SupportsFilterablePushDown, SupportsProjectPushDown etc.) work with
> Table API concepts such as e.g. Expressions, SQL specific types etc. In
> the end the implementation would resemble this (bear in mind I
> tremendously simplified the example, just to show the relation between
> the two APIs):
>
> SupportsFilterablePushDown {
>
>     applyFilters(List<ResolvedExpression> filters) {
>       this.filters = convertToDataStreamFilters(filters);
>     }
>
>     Source createSource() {
>       return Source.create()
>         .applyFilters(this.filters);
>     }
> }
>
> or exactly as you said for the computed columns:
>
> SupportsComputedColumnsPushDown {
>
>     applyComputedColumn(ComputedColumnConverter converter) {
>       this.deserializationSchema = new DeserializationSchema<Row> {
>         Row deserialize(...) {
>           // original format, e.g. json, avro, etc.
>           RowData row = format.deserialize(bytes);
>           RowData enriched = converter(row);
>         }
>       }
>     }
>
>     Source createSource() {
>       return Source.create()
>         .withDeserialization(deserializationSchema);
>     }
> }
>
> So to sum it up again, all those interfaces are factories that configure
> appropriate parts of the DataStream API using Table API concepts.
> Finally, to answer your question for the particular comparisons:
>
> DynamicTableSource v.s. Source<Row, SourceSplitT, EnumChkT>
> SupportsFilterablePushDown v.s. FilterableSource
> SupportsProjectablePushDown v.s. ProjectableSource
> SupportsWatermarkPushDown v.s. WithWatermarkAssigner
> SupportsComputedColumnPushDown v.s. ComputedColumnDeserializer
> ScanTableSource v.s. ChangeLogDeserializer.
>
> Pretty much, you can think of everything on the left as factories for
> the right side; the left side works with Table API classes (Expressions,
> DataTypes). I hope this clarifies it a bit.
>
> Best,
>
> Dawid
> On 24/03/2020 15:03, Becket Qin wrote:
>
> Hey Kurt,
>
> I don't think DataStream should see some SQL specific concepts
>
> such
>
> as
>
> Filtering or ComputedColumn.
>
> Projectable and Filterable seem not necessarily SQL concepts, but could
> be applicable to a DataStream source as well to reduce the network load.
> For example, ORC and Parquet should probably also be readable from
> DataStream, right?
>
> ComputedColumn is not part of the Source; it is an interface extending
> the Deserializer, which is a pluggable for the Source. From the SQL
> perspective it is the concept of a computed column, but from the Source
> perspective it is essentially a Deserializer which also converts the
> records internally, assuming we allow some conversion to be embedded in
> the source in addition to just deserialization.
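>
> A tiny sketch of that idea (hypothetical names, just to illustrate):
>
> interface ComputedColumnDeserializer<T> extends Deserializer<T> {
>       // deserialize first, then append the computed columns internally
>       T appendComputedColumns(T deserialized);
> }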
>
> Thanks,
>
> Jiangjie (Becket) Qin
>
> On Tue, Mar 24, 2020 at 9:36 PM Jark Wu <im...@gmail.com> wrote:
>
> Thanks Timo for updating the formats section. That would be very helpful
> for the changelog support (FLIP-105).
>
> I just left 2 minor comments about some method names. In general, I'm +1
> to start a vote.
>
>
>
>
> --------------------------------------------------------------------------------------------------
>
> Hi Becket,
>
> I agree we shouldn't duplicate code, especially the runtime
> implementations.
> However, the interfaces proposed by FLIP-95 are mainly used during
> optimization (compiling), not at runtime.
> I don't think there is much to share here, because Table/SQL
> is declarative, but DataStream is imperative.
> For example, for filter push down, a DataStream FilterableSource may
> allow accepting a FilterFunction (which is a black box for the source).
> However, table sources should pick the pushed filter expressions;
> some sources may only support "=", "<", ">" conditions.
> Pushing a FilterFunction doesn't work in the table ecosystem. That means
> the connectors have to have some table-specific implementations.
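>
> To sketch what "pick the pushed filter expressions" means (simplified
> names, not the exact FLIP-95 signatures), the source accepts only the
> conditions it understands and hands the rest back to the planner:
>
> Result applyFilters(List<ResolvedExpression> filters) {
>     List<ResolvedExpression> accepted = new ArrayList<>();
>     List<ResolvedExpression> remaining = new ArrayList<>();
>     for (ResolvedExpression f : filters) {
>         if (isSimpleComparison(f)) {   // e.g. "=", "<", ">"
>             accepted.add(f);           // evaluated inside the source
>         } else {
>             remaining.add(f);          // planner evaluates these itself
>         }
>     }
>     return Result.of(accepted, remaining);
> }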
>
>
> Best,
> Jark
>
> On Tue, 24 Mar 2020 at 20:41, Kurt Young <yk...@gmail.com> wrote:
>
> Hi Becket,
>
> I don't think DataStream should see some SQL specific concepts such as
> Filtering or ComputedColumn. It's better to stay within the SQL area and
> translate to more generic concepts when translating to the
> DataStream/Runtime layer, such as using a MapFunction to represent
> computed column logic.
>
> Best,
> Kurt
>
>
> On Tue, Mar 24, 2020 at 5:47 PM Becket Qin <be...@gmail.com> wrote:
>
> Hi Timo and Dawid,
>
> It's really great that we have the same goal. I am actually wondering if
> we can go one step further to avoid some of the interfaces in Table as
> well. For example, if we have the FilterableSource, do we still need the
> FilterableTableSource? Should DynamicTableSource just become a
> Source<*Row*, SourceSplitT, EnumChkT>?
>
> Can you help me understand a bit more the reason we need the
> following relational representation / wrapper interfaces v.s. the
> interfaces that we could put on the Source in FLIP-27?
>
> DynamicTableSource v.s. Source<Row, SourceSplitT, EnumChkT>
> SupportsFilterablePushDown v.s. FilterableSource
> SupportsProjectablePushDown v.s. ProjectableSource
> SupportsWatermarkPushDown v.s. WithWatermarkAssigner
> SupportsComputedColumnPushDown v.s. ComputedColumnDeserializer
> ScanTableSource v.s. ChangeLogDeserializer.
> LookUpTableSource v.s. LookUpSource
>
> Assuming we have all the interfaces on the right side, do we still need
> the interfaces on the left side? Note that the interfaces on the right
> can be used by both DataStream and Table. If we do this, there will only
> be one set of Source interfaces for Table and DataStream; the only
> difference is that the Source for Table will have some specific plugins
> and configurations. An omnipotent Source can implement all the above
> interfaces and take a Deserializer that implements both
> ComputedColumnDeserializer and ChangeLogDeserializer.
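>
> Purely as a sketch (with the names hypothesized above, nothing final),
> such an omnipotent Source could be:
>
> class OmnipotentSource implements
>         Source<Row, MySplit, MyEnumChk>,
>         FilterableSource<FilterPredicate>,
>         ProjectableSource<int[]>,
>         WithWatermarkAssigner {
>     ...
> }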
>
> Would the SQL planner work with that?
>
> Thanks,
>
> Jiangjie (Becket) Qin
>
>
>
>
>
>
> On Tue, Mar 24, 2020 at 5:03 PM Jingsong Li <jingsonglee0@gmail.com> wrote:
>
>
> +1. Thanks Timo for the design doc.
>
> We could also consider @Experimental. But I am +1 to @PublicEvolving;
> we should be confident in the current change.
>
> Best,
> Jingsong Lee
>
> On Tue, Mar 24, 2020 at 4:30 PM Timo Walther <tw...@apache.org> wrote:
>
> @Becket: We totally agree that we don't need table specific connectors
> during runtime. As Dawid said, the interfaces proposed here are just for
> communication with the planner. Once the properties (watermarks,
> computed columns, filters, projection etc.) are negotiated, we can
> configure a regular Flink connector.
>
> E.g. setting the watermark assigner and deserialization schema of a
> Kafka connector.
>
> For better separation of concerns, Flink connectors should not include
> relational interfaces and depend on flink-table. This is the
> responsibility of table source/sink.
>
> @Kurt: I would like to mark them @PublicEvolving already because we need
> to deprecate the old interfaces as early as possible. We cannot redirect
> to @Internal interfaces. They are not marked @Public, so we can still
> evolve them. But a core design shift should not happen again; it would
> leave a bad impression if we redesign over and over again. Instead
> we should be confident in the current change.
>
> Regards,
> Timo
>
>
> On 24.03.20 09:20, Dawid Wysakowicz wrote:
>
> Hi Becket,
>
> Answering your question, we have the same intention not to duplicate
> connectors between the DataStream and Table APIs. The interfaces
> proposed in the FLIP are a way to describe the relational properties of
> a source. The intention is, as you described, to translate all of those,
> expressed as expressions or other Table specific structures, into a
> DataStream source. In other words, I think what we are doing here is in
> line with what you described.
>
> Best,
>
> Dawid
>
> On 24/03/2020 02:23, Becket Qin wrote:
>
> Hi Timo,
>
> Thanks for the proposal. I completely agree that the current
>
> Table
>
> connectors could be simplified quite a bit. I haven't finished
>
> reading
>
> everything, but here are some quick thoughts.
>
> Actually to me the biggest question is why should there be two
>
> different
>
> connector systems for DataStream and Table? What is the
>
> fundamental
>
> reason
>
> that is preventing us from merging them to one?
>
> The basic functionality of a connector is to provide capabilities to do
> IO and Serde. Conceptually, Table connectors should just be DataStream
> connectors that are dealing with Rows. It seems that quite a few of the
> special connector requirements are just a specific way to do IO / Serde.
>
> Taking SupportsFilterPushDown as an example, imagine we have the
> following interface:
>
> interface FilterableSource<PREDICATE> {
>        void applyFilterable(Supplier<PREDICATE> predicate);
> }
>
> And if a ParquetSource would like to support filterable, it will become:
>
> class ParquetSource implements Source,
> FilterableSource<FilterPredicate> {
>        ...
> }
>
> For Table, one just needs to provide a predicate supplier that converts
> an Expression to the specified predicate type. This has a few benefits:
>
> 1. The same unified filterable API for sources, regardless of DataStream
> or Table.
> 2. The DataStream users can now also use the ExpressionToPredicate
> supplier if they want to.
>
> To summarize, my main point is that I am wondering if it is possible to
> have a single set of connector interfaces for both Table and DataStream,
> rather than having two hierarchies. I am not 100% sure if this would
> work, but if it works, this would be a huge win from both a code
> maintenance and a user experience perspective.
>
> Thanks,
>
> Jiangjie (Becket) Qin
>
>
>
> On Tue, Mar 24, 2020 at 2:03 AM Dawid Wysakowicz <dwysakowicz@apache.org> wrote:
>
>
> Hi Timo,
>
> Thank you for the proposal. I think it is an important improvement that
> will benefit many parts of the Table API. The proposal looks really good
> to me and personally I would be comfortable with voting on the current
> state.
>
> Best,
>
> Dawid
>
> On 23/03/2020 18:53, Timo Walther wrote:
>
> Hi everyone,
>
> I received some questions around how the new interfaces play together
> with formats and their factories.
>
> Furthermore, for MySQL or Postgres CDC logs, the format should be able
> to return a `ChangelogMode`.
>
> Also, I incorporated the feedback around the factory design in general.
>
> I added a new section `Factory Interfaces` to the design document.
> This should be helpful to understand the big picture and how the
> concepts connect.
>
> Please let me know what you think.
>
> Thanks,
> Timo
>
>
> On 18.03.20 13:43, Timo Walther wrote:
>
> Hi Benchao,
>
> this is a very good question. I will update the FLIP about
>
> this.
>
> The legacy planner will not support the new interfaces. It will only
> support the old interfaces. With the next release, I think the Blink
> planner is stable enough to be the default one as well.
>
> Regards,
> Timo
>
> On 18.03.20 08:45, Benchao Li wrote:
>
> Hi Timo,
>
> Thank you and others for the efforts to prepare this FLIP.
>
> The FLIP LGTM generally.
>
> +1 for moving the Blink data structures to table-common; they will be
> useful for UDFs too in the future.
> A little question is: do we plan to support the new interfaces and data
> types in the legacy planner, or do we only plan to support these new
> interfaces in the Blink planner?
>
> And using primary keys from the DDL instead of key information derived
> from each query is also a good idea;
> we have met some use cases where this did not work very well before.
>
> This FLIP also makes the dependencies of the table modules clearer; I
> like it very much.
>
> --
> Best, Jingsong Lee
>
>
>
>
>

Re: [DISCUSS] FLIP-95: New TableSource and TableSink interfaces

Posted by Dawid Wysakowicz <dw...@apache.org>.
Hi Becket,

Generally I don't think connector developers should bother with
understanding any of the SQL concepts.

I am not sure if we understand "connector developer" the same way. Let
me describe how I see the process of writing a new source (that can be
used in both Table & DataStream API):

1. Connector developer writes a Source that deals with the actual
reading and deserializing (preferably with a pluggable
format/deserializer). The result of that step should be something like:

FilesystemSource
    .path(...)
    .format(ParquetFormat
                .filterPredicate(/* parquet specific filter */)
                .project(/* parquet specific projection */)
                .map(...))
    .watermarkAssigner(...)

This is useful for DataStream, and we can and want to use this in the
Table API. Those interfaces shouldn't accept any *Translators though. It
makes no sense because internally they are not dealing e.g. with the
Expression. They should accept already created predicates.

We are not designing anything at that level. This is what we expect from
FLIP-27.

2. Then we need to have a DynamicTableSource with different abilities
that can create e.g. the parquet filter or projection from expressions.
I think this is what you also describe in your second point. And this is
what we are designing in the FLIP. Bear in mind that e.g. the Deserializer
will be created out of multiple SQL concepts (the regular schema, computed
columns, possibly projections etc.), each applied at different planning
stages.

All of those interfaces serve the purpose of configuring the
DynamicTableSource so that it is able to instantiate the Source with
proper configuration. In other words it is a factory for the source that
you can configure with SQL concepts. In turn this Factory will call
another factory from point 1.
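
As a minimal sketch of how the factory from point 2 could delegate to the
builder from point 1 (all names here are illustrative, not a final API):

class FilesystemDynamicTableSource implements ScanTableSource,
        SupportsFilterablePushDown {

    private ParquetFilter parquetFilter; // connector specific form

    applyFilters(List<ResolvedExpression> filters) {
        // translate planner Expressions into the parquet specific filter
        this.parquetFilter = convertToParquetFilter(filters);
    }

    Source createSource() {
        // call the builder from point 1 with already translated pieces
        return FilesystemSource
            .path(...)
            .format(ParquetFormat.filterPredicate(parquetFilter))
            .build();
    }
}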

I don't see a potential for unifying factories across different high
level APIs. Take your example of a Spatial Database that operates on
Coordinates and Area (even though those would rather be modeled as SQL
types and we would still operate on Rows, but just for the sake of the
example). In that respect there is no point in having a
PushDownComputedColumns interface in the factory for the spatial database.

Best,

Dawid


On 26/03/2020 11:47, Becket Qin wrote:
> Hi Timo,
>
> Regarding "connector developers just need to know how to write an
>> ExpressionToParquetFilter":
>>
>
> This is the entire purpose of the DynamicTableSource/DynamicTableSink.
>> The bridging between SQL concepts and connector specific concepts.
>> Because this is the tricky part. How to get from a SQL concept to a
>> connector concept.
>
> Maybe it is just a naming issue depending on whether one is looking upward
> from the Connectors perspective, or looking downward from the SQL
> perspective. If we agree that the connectors should provide semantic free
> API to the high level use cases, it seems we should follow the former path.
> And if there are one or two APIs that the connector developers have to
> understand in order to support Table / SQL, I think we can just address
> them case by case, instead of wrapping the entire low level source API
> with a set of new concepts.
>
> Correct me if I am wrong: can we tell the following story to a connector
> developer and get all the TableSource functionality to work?
>
> To provide a TableSource from a Source, one just needs to know two more
> concepts: *Row* and *Expression*. The work to create a TableSource is the
> following:
> 1. A connector developer can write three classes in order to build a table
> source:
>
>    - Deserializer<Row> (Must-have)
>    - PredicateTranslator<Expression, FilterPredicate> (optional, only
>    applicable if the Source is a FilterableSource)
>    - PredicateTranslator<Expression, ProjectionPredicate> (optional, only
>    applicable if the Source is a ProjectableSource)
>
> 2. In order to let the table source be discoverable, one needs to provide a
> Factory, and that Factory provides the following as a bundle:
>
>    - The Source itself (Must-have)
>    - The Deserializer<Row> (Must-have)
>    - PredicateTranslator<Expression, FilterPredicate> (optional, only
>    applicable when the Factory is a FilterFactory)
>    - PredicateTranslator<Expression, ProjectionPredicate> (optional, only
>    applicable when the Factory is a ProjectorFactory)
>
> 3. The Deserializer<Row> may implement one or more decorative interfaces
> to further convert the records after deserialization.
>
>    - withMapFunction<Row, Row>;
>
> Note that the above description only requires the connector developer to
> understand Expression and Row. If this works, it is much easier to explain
> than throwing a full set of new concepts. More importantly, it is way more
> generic. For example, if we change Row to Coordinates, and Expression to
> Area, we easily get a Source for a Spatial Database.
>
>
> One thing I want to call out is that while the old SourceFunction and
> InputFormat are concrete implementations that do the actual IO work, the
> Source API in FLIP-27 is already kind of a factory by itself. So if
> we can push the decorative interfaces from the TableFactory layer to the
> Source layer, it will help unify the experience for DataStream and Table
> Source. This will also align with our goal of letting the DataStream Source
> provide a semantic free API that can be used by different high level API.
>
>
> BTW, Jark suggested that we can probably have an offline call to accelerate
> the discussion. I think it is a good idea. Can we do that?
>
> Thanks,
>
> Jiangjie (Becket) Qin
>
>
> On Thu, Mar 26, 2020 at 5:28 PM Timo Walther <tw...@apache.org> wrote:
>
>> Hi Becket,
>>
>> Regarding "PushDown/NestedPushDown which is internal to optimizer":
>>
>> Those concepts cannot be entirely internal to the optimizer, at some
>> point the optimizer needs to pass them into the connector specific code.
>> This code will then convert it to e.g. Parquet expressions. So there must
>> be some interface that takes a SQL Expression and converts it to connector
>> specific code. This interface between planner and connector is modelled
>> by the SupportsXXX interfaces. And you are right, if developers don't
>> care, they don't need to implement those optional interfaces but will
>> not get performant connectors.
>>
>> Regarding "Table connector can work with the above two mechanism":
>>
>> A table connector needs three mechanisms that are represented in the
>> current design.
>>
>> 1. a stateless discovery interface (Factory) that can convert
>> ConfigOptions to a stateful factory interface
>> (DynamicTableSource/DynamicTableSink)
>>
>> 2. a stateful factory interface (DynamicTableSource/DynamicTableSink)
>> that receives concepts from the optimizer (watermarks, filters,
>> projections) and produces runtime classes such as your
>> `ExpressionToParquetFilter`
>>
>> 3. runtime interfaces that are generated from the stateful factory; all
>> the factories that you mentioned can be used in `getScanRuntimeProvider`.
>>
>> Regarding "connector developers just need to know how to write an
>> ExpressionToParquetFilter":
>>
>> This is the entire purpose of the DynamicTableSource/DynamicTableSink.
>> The bridging between SQL concepts and connector specific concepts.
>> Because this is the tricky part. How to get from a SQL concept to a
>> connector concept.
>>
>> Regards,
>> Timo
>>
>>
>> On 26.03.20 04:46, Becket Qin wrote:
>>> Hi Timo,
>>>
>>> Thanks for the reply. I totally agree that there must be something new
>>> added to the connector in order to make it work for SQL / Table. My
>> concern
>>> is mostly over what they should be, and how to add them. To be honest, I
>>> was kind of lost when looking at the interfaces such as
>>> DataStructureConverter, RuntimeConverter and their internal context.
>> Also I
>>> believe most connector developers do not care about the concept of
>>> "PushDown" / "NestedPushDown" which is internal to the optimizer and not even
>>> exposed to SQL writers.
>>>
>>> Therefore I am trying to see if we can:
>>> A) Keep those additions minimum to the connector developers if they don't
>>> have to know the details.
>>> B) Expose as few high level concepts as possible. More specifically, try
>>> to speak the connector language and expose the general mechanism instead
>>> of binding it to use case semantics.
>>>
>>> If we can achieve the above two goals, we could avoid adding unnecessary
>>> burden to the connector developers, and also make the connectors more
>>> generic.
>>>
>>> It might be worth thinking about what additional work is necessary for the
>>> connector developers, here are what I am thinking of, please correct me
>> if
>>> I miss something.
>>>
>>>     1. A Factory interface that allows high level use case, in this case
>>>     SQL, to find a matching source using service provider mechanism.
>>>     2. Allows the high level use case to specify the plugins that are
>>>     supported by the underneath DataStream Source.
>>>
>>> If the Table connector can work with the above two mechanisms, maybe we can
>> make
>>> some slight modifications to the interfaces in the current FLIP.
>>>
>>>     - A *SourceFactory* which extends the Factory interface in the FLIP,
>>>     with one more method:
>>>        - *Source getSource();*
>>>     - Some decorative interfaces to the SourceFactory such as:
>>>        - *FilterFactory<PREDICATE, T extends Supplier<PREDICATE>>*, with
>> the
>>>        following method
>>>           - T getFilter();
>>>        - *ProjectorFactory<PREDICATE, T extends Supplier<PREDICATE>>*,
>> with
>>>        the following method.
>>>           - T getProjector();
>>>        - *DeserializerFactory<INPUT, OUTPUT>*
>>>
>>> With this set of API, a ParquetTableSourceFactory may become:
>>>
>>> class ParquetTableSourceFactory implements
>>>             SourceFactory,
>>>             DeserializerFactory<ParquetRecords, Row>,
>>>             FilterFactory<ParquetFilter, ExpressionToParquetFilter> {
>>>      @Override
>>>      ParquetSource getSource() { ... }
>>>
>>>      @Override
>>>      ExpressionToParquetFilter getFilterSupplier() { ... }
>>> }
>>>
>>> The ExpressionToParquetFilter will have an *applyPredicate(Expression)*
>>> method.
>>>
>>> I know it does not look like a perfect interface from the pure SQL
>>> perspective. And I am not even sure if this would meet all the
>> requirements
>>> for SQL, but the benefit is that the connector developers just need to
>> know
>>> how to write an ExpressionToParquetFilter in order to make it work for
>>> Table, without having to understand the entire SQL concept.
>>>
>>> Thanks,
>>>
>>> Jiangjie (Becket) Qin
>>>
>>>
>>>
>>> On Wed, Mar 25, 2020 at 5:57 PM Timo Walther <tw...@apache.org> wrote:
>>>
>>>> Hi Becket,
>>>>
>>>> Let me clarify a few things first: Historically we thought of Table
>>>> API/SQL as a library on top of DataStream API. Similar to Gelly or CEP.
>>>> We used TypeInformation in Table API to integrate nicely with DataStream
>>>> API. However, the last years have shown that SQL is not just a library.
>>>> It is an entire ecosystem that defines data types, submission behavior,
>>>> execution behavior, and highly optimized SerDes. SQL is a way to declare
>>>> data processing end-to-end such that the planner has the full control
>>>> over the execution.
>>>>
>>>> But I totally agree with your concerns around connectors. There is no
>>>> big difference between your concerns and the current design.
>>>>
>>>> 1. "native connector interface is a generic abstraction of doing IO and
>>>> Serde":
>>>>
>>>> This is the case in our design. We are using SourceFunction,
>>>> DeserializationSchema, WatermarkAssigner, etc. all pluggable interfaces
>>>> that the DataStream API offers for performing runtime operations.
>>>>
>>>> 2. "advanced features ... could be provided in a semantic free way":
>>>>
>>>> I agree here. But this is an orthogonal topic that each connector
>>>> implementer should keep in mind. If a new connector is developed, it
>>>> should *not* be developed only for SQL in mind but with good abstraction
>>>> such that also DataStream API users can use it. A connector should have
>>>> a builder pattern to plug in all capabilities like Parquet filters etc.
>>>> There should be no table-specific native/runtime connectors. I think
>>>> this discussion is related to the discussion of FLIP-115.
>>>>
>>>> However, as I mentioned before: This FLIP only discusses the interfaces
>>>> for communication between planner and connector factory. As Dawid said
>>>> earlier, a DynamicTableSource can be more seen as a factory that calls
>>>> pluggable interfaces of a native connector in the end:
>>>>
>>>> KafkaConnector.builder()
>>>>     .watermarkAssigner(...)
>>>>     .keyDeser(...)
>>>>     .valueDeser(...)
>>>>     ....
>>>>     .build()
>>>>
>>>> Regards,
>>>> Timo
>>>>
>>>>
>>>> On 25.03.20 09:05, Becket Qin wrote:
>>>>> Hi Kurt,
>>>>>
>>>>> I do not object to promoting the concepts of SQL, but I don't think we
>>>> should
>>>>> do that by introducing a new dedicated set of connector public
>> interfaces
>>>>> that is only for SQL. The same argument can be applied to Gelly, CEP,
>> and
>>>>> Machine Learning, claiming that they need to introduce a dedicated
>> public
>>>>> set of interfaces that fits their own concepts and ask the connector
>>>>> developers to learn and follow their design. As an analogy, if we want
>> to
>>>>> promote Chinese, we don't want to force people to learn ancient Chinese
>>>>> poem while they only need to know a few words like "hello" and
>> "goodbye".
>>>>> As some design principles, here is what I think Flink connectors
>>>>> should look like:
>>>>>
>>>>> 1. The native connector interface is a generic abstraction of doing IO
>>>> and
>>>>> Serde, without semantics for high level use cases such as SQL, Gelly,
>> CEP,
>>>>> etc.
>>>>>
>>>>> 2. Some advanced features that may help accelerate the IO and Serde
>> could
>>>>> be provided in the native connector interfaces in a semantic free way
>> so
>>>>> all the high level use cases can leverage.
>>>>>
>>>>> 3. Additional semantics can be built on top of the native source
>>>> interface
>>>>> through providing different plugins. These plugins could be high level
>>>> use
>>>>> case aware. For example, to provide a filter to the source, we can do
>> the
>>>>> following
>>>>>
>>>>> // An interface for all the filters that take an expression.
>>>>> interface ExpressionFilter {
>>>>>       FilterResult applyFilterExpression();
>>>>> }
>>>>>
>>>>> // A filter plugin implementation that translates the SQL Expression
>> to a
>>>>> ParquetFilterPredicate.
>>>>> Class ParquetExpressionFilter implements
>>>> Supplier<ParquetFilterPredicate>,
>>>>> ExpressionFilter {
>>>>>       // Called by the high level use case,
>>>>>       FilterResult applyFilterExpression() { ... }
>>>>>
>>>>>       // Used by the native Source interface.
>>>>>       ParquetFilterPredicate get() { ... }
>>>>> }
>>>>>
>>>>> In this case, the connector developer just needs to write the logic of
>>>>> translating an Expression to Parquet FilterPredicate. They don't have
>> to
>>>>> understand the entire set of interfaces that we want to promote. Just
>>>> like
>>>>> they only need to know how to say "Hello" without learning ancient
>>>> Chinese
>>>>> poem.
>>>>>
>>>>> Again, I am not saying this is necessarily the best approach. But so
>> far
>>>> it
>>>>> seems a reasonable design principle to tell the developers.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Jiangjie (becket) Qin
>>>>>
>>>>>
>>>>>
>>>>> On Wed, Mar 25, 2020 at 11:53 AM Kurt Young <yk...@gmail.com> wrote:
>>>>>
>>>>>> Hi Becket,
>>>>>>
>>>>>> I don't think we should discuss this in pure engineering aspects. Your
>>>>>> proposal is trying
>>>>>> to let SQL connector developers understand as few SQL concepts as
>>>>>> possible. But quite
>>>>>> the opposite, we are designing those interfaces to emphasize the SQL
>>>>>> concept, to bridge
>>>>>> high level concepts into real interfaces and classes.
>>>>>>
>>>>>> We keep talking about time-varying relations and dynamic table when
>>>>>> introduce SQL concepts,
>>>>>> sources and sinks are most critical part playing with those concepts.
>>>> It's
>>>>>> essential to let
>>>>>> Flink SQL developers learn these concepts and connect them with
>> real
>>>>>> codes by introducing
>>>>>> these connector interfaces and can further write *correct* connectors
>>>> based
>>>>>> on such domain
>>>>>> knowledge.
>>>>>>
>>>>>> So this FLIP is a very important chance to express these concepts and
>>>> make
>>>>>> most SQL developers
>>>>>> be aligned with the concepts and on the same page. It's mostly for different
>>>> level of
>>>>>> abstractions and for domains
>>>>>> like SQL, it's becoming more important.  It helps Flink SQL go
>> smoothly
>>>> in
>>>>>> the future, and also
>>>>>> make it easier for new contributors. But I would admit this is not
>> that
>>>>>> obvious for others who don't work
>>>>>> with SQL frequently.
>>>>>>
>>>>>> Best,
>>>>>> Kurt
>>>>>>
>>>>>>
>>>>>> On Wed, Mar 25, 2020 at 11:07 AM Becket Qin <be...@gmail.com> wrote:
>>>>>>> Hi Jark,
>>>>>>>
>>>>>>> It is good to know that we do not expect the end users to touch those
>>>>>>> interfaces.
>>>>>>>
>>>>>>> Then the question boils down to whether the connector developers
>> should
>>>>>> be
>>>>>>> aware of the interfaces that are only used by the SQL optimizer. It
>>>>>> seems a
>>>>>>> win if we can avoid that.
>>>>>>>
>>>>>>> Two potential solutions off the top of my head are:
>>>>>>> 1. An internal helper class doing the instanceOf based on DataStream
>>>>>> source
>>>>>>> interface and create pluggables for that DataStream source.
>>>>>>> 2. codegen the set of TableSource interfaces given a DataStream
>> Source
>>>>>> and
>>>>>>> its corresponding TablePluggablesFactory.
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> Jiangjie (Becket) Qin
>>>>>>>
>>>>>>> On Wed, Mar 25, 2020 at 10:07 AM Jark Wu <im...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi Becket,
>>>>>>>>
>>>>>>>> Regarding Flavor 1 and Flavor 2, I want to clarify that users will
>>>>>> never
>>>>>>>> use table source like this:
>>>>>>>>
>>>>>>>> {
>>>>>>>>      MyTableSource myTableSource = MyTableSourceFactory.create();
>>>>>>>>      myTableSource.setSchema(mySchema);
>>>>>>>>      myTableSource.applyFilterPredicate(expression);
>>>>>>>>      ...
>>>>>>>> }
>>>>>>>>
>>>>>>>> TableFactory and TableSource are not directly exposed to end users,
>>>> all
>>>>>>> the
>>>>>>>> methods are called by planner, not users.
>>>>>>>> Users always use DDL or descriptor to register a table, and planner
>>>>>> will
>>>>>>>> find the factory and create sources according to the properties.
>>>>>>>> All the optimization are applied automatically, e.g.
>> filter/projection
>>>>>>>> pushdown, users don't need to call `applyFilterPredicate`
>> explicitly.
>>>>>>>>
>>>>>>>>
>>>>>>>> On Wed, 25 Mar 2020 at 09:25, Becket Qin <be...@gmail.com> wrote:
>>>>>>>>> Hi Timo and Dawid,
>>>>>>>>>
>>>>>>>>> Thanks for the clarification. They really help. You are right that
>> we
>>>>>>> are
>>>>>>>>> on the same page regarding the hierarchy. I think the only
>> difference
>>>>>>>>> between our view is the flavor of the interfaces. There are two
>>>>>> flavors
>>>>>>>> of
>>>>>>>>> the source interface for DataStream and Table source.
>>>>>>>>>
>>>>>>>>> *Flavor 1. Table Sources are some wrapper interfaces around
>>>>>> DataStream
>>>>>>>>> source.*
>>>>>>>>> Following this way, we will reach the design of the current
>> proposal,
>>>>>>>> i.e.
>>>>>>>>> each pluggable exposed in the DataStream source will have a
>>>>>>> corresponding
>>>>>>>>> TableSource interface counterpart, which are at the Factory level.
>>>>>>> Users
>>>>>>>>> will write code like this:
>>>>>>>>>
>>>>>>>>> {
>>>>>>>>>       MyTableSource myTableSource = MyTableSourceFactory.create();
>>>>>>>>>       myTableSource.setSchema(mySchema);
>>>>>>>>>       myTableSource.applyFilterPredicate(expression);
>>>>>>>>>       ...
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> The good thing for this flavor is that from the SQL / Table's
>>>>>>>> perspective,
>>>>>>>>> there is a dedicated set of Table oriented interface.
>>>>>>>>>
>>>>>>>>> The downsides are:
>>>>>>>>> A. From the user's perspective, DataStream Source and Table Source
>>>>>> are
>>>>>>>> just
>>>>>>>>> two different sets of interfaces, regardless of how they are the
>> same
>>>>>>>>> internally.
>>>>>>>>> B. The source developers have to develop for those two sets of
>>>>>>> interfaces
>>>>>>>>> in order to support both DataStream and Table.
>>>>>>>>> C. It is not explicit that DataStream can actually share the
>>>>>> pluggable
>>>>>>> in
>>>>>>>>> Table / SQL. For example, in order to provide a filter pluggable
>> with
>>>>>>> SQL
>>>>>>>>> expression, users will have to know the actual converter class that
>>>>>>>>> converts the expression to the filter predicate and construct that
>>>>>>>>> converter by themselves.
>>>>>>>>>
>>>>>>>>> ---------------
>>>>>>>>>
>>>>>>>>> *Flavor 2. A TableSource is a DataStream source with a bunch of
>>>>>>>> pluggables.
>>>>>>>>> No Table specific interfaces at all.*
>>>>>>>>> Following this way, we will reach another design where you have a
>>>>>>>>> SourceFactory and a single Pluggable factory for all the table
>>>>>>>> pluggables.
>>>>>>>>> And users will write something like:
>>>>>>>>>
>>>>>>>>> {
>>>>>>>>>       Deserializer<Row> myTableDeserializer =
>>>>>>>>> MyTablePluggableFactory.createDeserializer(schema)
>>>>>>>>>       MySource<Row> mySource = MySourceFactory.create(properties,
>>>>>>>>> myTableDeserializer);
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>> mySource.applyPredicate(MyTablePluggableFactory.createFilterPredicate(expression));
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> The good thing for this flavor is that there is just one set of
>>>>>>> interface
>>>>>>>>> that works for both Table and DataStream. There is no difference
>>>>>>> between
>>>>>>>>> creating a DataStream source and creating a Table source.
>> DataStream
>>>>>>> can
>>>>>>>>> easily reuse the pluggables from the Table sources.
>>>>>>>>>
>>>>>>>>> The downside is that Table / SQL won't have a dedicated API for
>>>>>>>>> optimization. Instead of writing:
>>>>>>>>>
>>>>>>>>> if (MyTableSource instanceOf FilterableTableSource) {
>>>>>>>>>        // Some filter push down logic.
>>>>>>>>>        MyTableSource.applyPredicate(expression)
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> One has to write:
>>>>>>>>>
>>>>>>>>> if (MySource instanceOf FilterableSource) {
>>>>>>>>>       // Some filter push down logic.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>> mySource.applyPredicate(MyTablePluggableFactory.createFilterPredicate(expression));
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> -------------------------
>>>>>>>>>
>>>>>>>>> Just to be clear, I am not saying flavor 2 is necessarily better
>> than
>>>>>>>>> flavor 1, but I want to make sure flavor 2 is also considered and
>>>>>>>>> discussed.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>>
>>>>>>>>> Jiangjie (Becket) Qin.
>>>>>>>>>
>>>>>>>>> On Tue, Mar 24, 2020 at 10:53 PM Dawid Wysakowicz <dwysakowicz@apache.org> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Becket,
>>>>>>>>>>
>>>>>>>>>> I really think we don't have differing opinions. We might not
>> see
>>>>>>> the
>>>>>>>>>> changes in the same way yet. Personally I think of the
>>>>>>>> DynamicTableSource
>>>>>>>>>> as of a factory for a Source implemented for the DataStream API.
>>>>>> The
>>>>>>>>>> important fact is that the DynamicTableSource and all feature traits
>>>>>>>>>> (SupportsFilterablePushDown, SupportsProjectPushDown etc.) work
>>>>>> with
>>>>>>>>> Table
>>>>>>>>>> API concepts such as e.g. Expressions, SQL specific types etc. In
>>>>>> the
>>>>>>>> end
>>>>>>>>>> what the implementation would resemble is (bear in mind I
>>>>>>> tremendously
>>>>>>>>>> simplified the example, just to show the relation between the two
>>>>>>>> APIs):
>>>>>>>>>> SupportsFilterablePushDown {
>>>>>>>>>>
>>>>>>>>>>     applyFilters(List<ResolvedExpression> filters) {
>>>>>>>>>>
>>>>>>>>>>       this.filters = convertToDataStreamFilters(filters);
>>>>>>>>>>
>>>>>>>>>>     }
>>>>>>>>>>
>>>>>>>>>>     Source createSource() {
>>>>>>>>>>
>>>>>>>>>>           return Source.create()
>>>>>>>>>>
>>>>>>>>>>             .applyFilters(this.filters);
>>>>>>>>>>
>>>>>>>>>>      }
>>>>>>>>>>
>>>>>>>>>> }
>>>>>>>>>>
>>>>>>>>>> or exactly as you said for the computed columns:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> SupportsComputedColumnsPushDown {
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>     applyComputedColumn(ComputedColumnConverter converter) {
>>>>>>>>>>
>>>>>>>>>>       this.deserializationSchema = new DeserializationSchema<Row>
>> {
>>>>>>>>>>         Row deserialize(...) {
>>>>>>>>>>
>>>>>>>>>>           RowData row = format.deserialize(bytes); // original
>>>>>> format,
>>>>>>>> e.g
>>>>>>>>>> json, avro, etc.
>>>>>>>>>>
>>>>>>>>>>           RowData enriched = converter(row)
>>>>>>>>>>
>>>>>>>>>>         }
>>>>>>>>>>
>>>>>>>>>>       }
>>>>>>>>>>
>>>>>>>>>>     }
>>>>>>>>>>
>>>>>>>>>>     Source createSource() {
>>>>>>>>>>
>>>>>>>>>>           return Source.create()
>>>>>>>>>>
>>>>>>>>>>             .withDeserialization(deserializationSchema);
>>>>>>>>>>
>>>>>>>>>>      }
>>>>>>>>>>
>>>>>>>>>> }
>>>>>>>>>>
>>>>>>>>>> So to sum it up again, all those interfaces are factories that
>>>>>>>> configure
>>>>>>>>>> appropriate parts of the DataStream API using Table API concepts.
>>>>>>>> Finally
>>>>>>>>>> to answer your question for the particular comparisons:
>>>>>>>>>>
>>>>>>>>>> DynamicTableSource v.s. Source<Row, SourceSplitT, EnumChkT>
>>>>>>>>>> SupportsFilterablePushDown v.s. FilterableSource
>>>>>>>>>> SupportsProjectablePushDown v.s. ProjectableSource
>>>>>>>>>> SupportsWatermarkPushDown v.s. WithWatermarkAssigner
>>>>>>>>>> SupportsComputedColumnPushDown v.s. ComputedColumnDeserializer
>>>>>>>>>> ScanTableSource v.s. ChangeLogDeserializer.
>>>>>>>>>>
>>>>>>>>>> Pretty much you can think of everything on the left as factories for
>>>>>>>>>> the right side; the left side works with Table API classes
>>>>>>>>>> (Expressions, DataTypes). I hope this clarifies it a bit.
>>>>>>>>>>
>>>>>>>>>> Best,
>>>>>>>>>>
>>>>>>>>>> Dawid

Re: [DISCUSS] FLIP-95: New TableSource and TableSink interfaces

Posted by Becket Qin <be...@gmail.com>.
Hi Timo,

> Regarding "connector developers just need to know how to write an
> ExpressionToParquetFilter":
>


> This is the entire purpose of the DynamicTableSource/DynamicTableSink.
> The bridging between SQL concepts and connector specific concepts.
> Because this is the tricky part. How to get from a SQL concept to a
> connector concept.


Maybe it is just a naming issue depending on whether one is looking upward
from the connector's perspective, or looking downward from the SQL
perspective. If we agree that the connectors should provide a semantic-free
API to the high level use cases, it seems we should follow the former path.
And if there are one or two APIs that the connector developers have to
understand in order to support Table / SQL, I think we can just address
them case by case, instead of wrapping the entire low-level source API
with a set of new concepts.

Correct me if I am wrong, but can we tell the following story to a
connector developer and get all the TableSource functionality to work?

To provide a TableSource from a Source, one just needs to know two more
concepts: *Row* and *Expression*. The work to create a TableSource is the
following:
1. A connector developer can write three classes in order to build a table
source:

   - Deserializer<Row> (Must-have)
   - PredicateTranslator<Expression, FilterPredicate> (optional, only
   applicable if the Source is a FilterableSource)
   - PredicateTranslator<Expression, ProjectionPredicate> (optional, only
   applicable if the Source is a ProjectableSource)

2. In order to let the table source be discoverable, one needs to provide a
Factory, and that Factory provides the following as a bundle:

   - The Source itself (Must-have)
   - The Deserializer<Row> (Must-have)
   - PredicateTranslator<Expression, FilterPredicate> (optional, only
   applicable when the Factory is a FilterFactory)
   - PredicateTranslator<Expression, ProjectionPredicate> (optional, only
   applicable when the Factory is a ProjectorFactory)

3. The Deserializer<Row> may implement one or more decorative interfaces to
further convert the record after deserialization.

   - withMapFunction<Row, Row>;

Note that the above description only requires the connector developer to
understand Expression and Row. If this works, it is much easier to explain
than throwing a full set of new concepts at them. More importantly, it is
way more generic. For example, if we change Row to Coordinates, and
Expression to Area, we easily get a Source for a spatial database.
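
To make the story above concrete, here is a minimal sketch of what the
three classes and the factory bundle could look like. All names in this
sketch (SourceFactory, PredicateTranslator, FilterFactory, etc.) are
hypothetical illustrations of the idea, not existing Flink interfaces:

// --- Hypothetical sketch only; none of these names exist in Flink. ---

// Stand-ins for the real classes a connector developer already knows.
class Row {}
class Expression {}
interface Source {}

// Must-have plugin: turns raw connector records into Rows.
interface Deserializer<T> {
    T deserialize(byte[] bytes);
}

// Optional plugin: translates a Table API Expression into the
// connector-native predicate type, e.g. a Parquet FilterPredicate.
interface PredicateTranslator<FROM, TO> {
    TO translate(FROM expression);
}

// Discovery entry point: bundles the source with its table plugins.
interface SourceFactory {
    Source createSource();                  // the Source itself (must-have)
    Deserializer<Row> createDeserializer(); // record conversion (must-have)
}

// Only a factory for a filterable source additionally implements:
interface FilterFactory<P> {
    PredicateTranslator<Expression, P> createFilterTranslator();
}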


One thing I want to call out is that while the old SourceFunction and
InputFormat are concrete implementations that do the actual IO work, the
Source API in FLIP-27 is already kind of a factory by itself. So if we can
push the decorative interfaces from the TableFactory layer to the Source
layer, it will help unify the experience for DataStream and Table sources.
This will also align with our goal of letting the DataStream Source
provide a semantic-free API that can be used by different high level APIs.
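
As a rough illustration of pushing a decorative interface down to the
Source layer (again, all names below are made up for the example; the
actual FLIP-27 Source interface looks different):

// Hypothetical decoration on a FLIP-27-style source: the generic source
// API stays semantic-free; a concrete source opts into filter push down
// by implementing the decorative interface with its native predicate type.
interface FilterableSource<PREDICATE> {
    void applyPredicate(PREDICATE predicate);
}

class ParquetFilterPredicate {} // stand-in for Parquet's FilterPredicate

class ParquetSource implements FilterableSource<ParquetFilterPredicate> {
    private ParquetFilterPredicate predicate;

    @Override
    public void applyPredicate(ParquetFilterPredicate predicate) {
        // remembered here and later pushed into the Parquet reader
        this.predicate = predicate;
    }
}

Table would call applyPredicate with a predicate produced by its Expression
translator, while DataStream users could construct the same predicate
directly.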


BTW, Jark suggested that we can probably have an offline call to accelerate
the discussion. I think it is a good idea. Can we do that?

Thanks,

Jiangjie (Becket) Qin


On Thu, Mar 26, 2020 at 5:28 PM Timo Walther <tw...@apache.org> wrote:

> Hi Becket,
>
> Regarding "PushDown/NestedPushDown which is internal to the optimizer":
>
> Those concepts cannot be entirely internal to the optimizer, at some
> point the optimizer needs to pass them into the connector specific code.
> This code will then convert them to e.g. Parquet expressions. So there must
> be some interface that takes a SQL Expression and converts it to connector
> specific code. This interface between planner and connector is modelled
> by the SupportsXXX interfaces. And you are right, if developers don't
> care, they don't need to implement those optional interfaces but will
> not get performant connectors.
>
> Regarding "Table connector can work with the above two mechanism":
>
> A table connector needs three mechanisms that are represented in the
> current design.
>
> 1. a stateless discovery interface (Factory) that can convert
> ConfigOptions to a stateful factory interface
> (DynamicTableSource/DynamicTableSink)
>
> 2. a stateful factory interface (DynamicTableSource/DynamicTableSink)
> that receives concepts from the optimizer (watermarks, filters,
> projections) and produces runtime classes such as your
> `ExpressionToParquetFilter`
>
> 3. runtime interfaces that are generated from the stateful factory; all
> the factories that you mentioned can be used in `getScanRuntimeProvider`.
>
> Regarding "connector developers just need to know how to write an
> ExpressionToParquetFilter":
>
> This is the entire purpose of the DynamicTableSource/DynamicTableSink.
> The bridging between SQL concepts and connector specific concepts.
> Because this is the tricky part. How to get from a SQL concept to a
> connector concept.
>
> Regards,
> Timo
>
>
> On 26.03.20 04:46, Becket Qin wrote:
> > Hi Timo,
> >
> > Thanks for the reply. I totally agree that there must be something new
> > added to the connector in order to make it work for SQL / Table. My
> concern
> > is mostly over what they should be, and how to add them. To be honest, I
> > was kind of lost when looking at the interfaces such as
> > DataStructureConverter, RuntimeConverter and their internal context.
> Also I
> > believe most connector developers do not care about the concept of
> > "PushDown" / "NestedPushDown" which is internal to the optimizer and not even
> > exposed to SQL writers.
> >
> > Therefore I am trying to see if we can:
> > A) Keep those additions minimum to the connector developers if they don't
> > have to know the details.
> > B) Expose as less high level concept as possible. More specifically, try
> to
> > speak the connector language and expose the general mechanism instead of
> > binding them with use case semantic.
> >
> > If we can achieve the above two goals, we could avoid adding unnecessary
> > burden to the connector developers, and also make the connectors more
> > generic.
> >
> > It might be worth thinking about what additional work is necessary for the
> > connector developers, here are what I am thinking of, please correct me
> if
> > I miss something.
> >
> >     1. A Factory interface that allows high level use case, in this case
> >     SQL, to find a matching source using service provider mechanism.
> >     2. Allows the high level use case to specify the plugins that are
> >     supported by the underneath DataStream Source.
> >
> > If Table connector can work with the above two mechanism, maybe we can
> make
> > some slight modifications to the interfaces in the current FLIP.
> >
> >     - A *SourceFactory* which extends the Factory interface in the FLIP,
> >     with one more method:
> >        - *Source getSource();*
> >     - Some decorative interfaces to the SourceFactory such as:
> >        - *FilterFactory<PREDICATE, T extends Supplier<PREDICATE>>*, with
> the
> >        following method
> >           - T getFilter();
> >        - *ProjectorFactory<PREDICATE, T extends Supplier<PREDICATE>>*,
> with
> >        the following method.
> >           - T getProjector();
> >        - *DeserializerFactory<INPUT, OUTPUT>*
> >
> > With this set of API, a ParquetTableSourceFactory may become:
> >
> > class ParquetTableSourceFactory implements
> >             SourceFactory,
> >             DeserializerFactory<ParquetRecords, Row>,
> >             FilterFactory<ParquetFilter, ExpressionToParquetFilter> {
> >      @Override
> >      ParquetSource getSource() { ... }
> >
> >      @Override
> >      ExpressionToParquetFilter getFilterSupplier() { ... };
> > }
> >
> > The ExpressionToParquetFilter will have an *applyPredicate(Expression)*
> > method.
> >
> > I know it does not look like a perfect interface from the pure SQL
> > perspective. And I am not even sure if this would meet all the
> requirements
> > for SQL, but the benefit is that the connector developers just need to
> know
> > how to write an ExpressionToParquetFilter in order to make it work for
> > Table, without having to understand the entire SQL concept.
> >
> > Thanks,
> >
> > Jiangjie (Becket) Qin
> >
> >
> >
> > On Wed, Mar 25, 2020 at 5:57 PM Timo Walther <tw...@apache.org> wrote:
> >
> >> Hi Becket,
> >>
> >> Let me clarify a few things first: Historically we thought of Table
> >> API/SQL as a library on top of DataStream API. Similar to Gelly or CEP.
> >> We used TypeInformation in Table API to integrate nicely with DataStream
> >> API. However, the last years have shown that SQL is not just a library.
> >> It is an entire ecosystem that defines data types, submission behavior,
> >> execution behavior, and highly optimized SerDes. SQL is a way to declare
> >> data processing end-to-end such that the planner has the full control
> >> over the execution.
> >>
> >> But I totally agree with your concerns around connectors. There is no
> >> big difference between your concerns and the current design.
> >>
> >> 1. "native connector interface is a generic abstraction of doing IO and
> >> Serde":
> >>
> >> This is the case in our design. We are using SourceFunction,
> >> DeserializationSchema, WatermarkAssigner, etc. all pluggable interfaces
> >> that the DataStream API offers for performing runtime operations.
> >>
> >> 2. "advanced features ... could be provided in a semantic free way":
> >>
> >> I agree here. But this is an orthogonal topic that each connector
> >> implementer should keep in mind. If a new connector is developed, it
> >> should *not* be developed only for SQL in mind but with good abstraction
> >> such that also DataStream API users can use it. A connector should have
> >> a builder pattern to plug in all capabilities like Parquet filters etc.
> >> There should be no table-specific native/runtime connectors. I think
> >> this discussion is related to the discussion of FLIP-115.
> >>
> >> However, as I mentioned before: This FLIP only discusses the interfaces
> >> for communication between planner and connector factory. As Dawid said
> >> earlier, a DynamicTableSource can be more seen as a factory that calls
> >> pluggable interfaces of a native connector in the end:
> >>
> >> KafkaConnector.builder()
> >>     .watermarkAssigner(...)
> >>     .keyDeser(...)
> >>     .valueDeser(...)
> >>     ....
> >>     .build()
> >>
> >> Regards,
> >> Timo
> >>
> >>
> >> On 25.03.20 09:05, Becket Qin wrote:
> >>> Hi Kurt,
> >>>
> >>> I do not object to promote the concepts of SQL, but I don't think we
> >> should
> >>> do that by introducing a new dedicate set of connector public
> interfaces
> >>> that is only for SQL. The same argument can be applied to Gelly, CEP,
> and
> >>> Machine Learning, claiming that they need to introduce a dedicated
> public
> >>> set of interfaces that fits their own concept and ask the the connector
> >>> developers to learn and follow their design. As an analogy, if we want
> to
> >>> promote Chinese, we don't want to force people to learn ancient Chinese
> >>> poems while they only need to know a few words like "hello" and
> "goodbye".
> >>>
> >>> As some design principles, here are what I think what  Flink connectors
> >>> should look like:
> >>>
> >>> 1. The native connector interface is a generic abstraction of doing IO
> >> and
> >>> Serde, without semantics for high level use cases such as SQL, Gelly,
> CEP,
> >>> etc.
> >>>
> >>> 2. Some advanced features that may help accelerate the IO and Serde
> could
> >>> be provided in the native connector interfaces in a semantic free way
> so
> >>> all the high level use cases can leverage.
> >>>
> >>> 3. Additional semantics can be built on top of the native source
> >> interface
> >>> through providing different plugins. These plugins could be high level
> >> use
> >>> case aware. For example, to provide a filter to the source, we can do
> the
> >>> following
> >>>
> >>> // An interface for all the filters that take an expression.
> >>> interface ExpressionFilter {
> >>>       FilterResult applyFilterExpression();
> >>> }
> >>>
> >>> // A filter plugin implementation that translates the SQL Expression
> >>> // to a ParquetFilterPredicate.
> >>> class ParquetExpressionFilter implements
> >> Supplier<ParquetFilterPredicate>,
> >>> ExpressionFilter {
> >>>       // Called by the high level use case,
> >>>       FilterResult applyFilterExpression() { ... }
> >>>
> >>>       // Used by the native Source interface.
> >>>       ParquetFilterPredicate get() { ... }
> >>> }
> >>>
> >>> In this case, the connector developer just needs to write the logic of
> >>> translating an Expression to Parquet FilterPredicate. They don't have
> to
> >>> understand the entire set of interfaces that we want to promote. Just
> >> like
> >>> they only need to know how to say "Hello" without learning an ancient
> >>> Chinese poem.
> >>>
> >>> Again, I am not saying this is necessarily the best approach. But so
> far
> >> it
> >>> seems a reasonable design principle to tell the developers.
> >>>
> >>> Thanks,
> >>>
> >>> Jiangjie (Becket) Qin
> >>>
> >>>
> >>>
> >>> On Wed, Mar 25, 2020 at 11:53 AM Kurt Young <yk...@gmail.com> wrote:
> >>>
> >>>> Hi Becket,
> >>>>
> >>>> I don't think we should discuss this in pure engineering aspects. Your
> >>>> proposal is trying to let SQL connector developers understand as few
> >>>> SQL concepts as possible. But quite the opposite, we are designing
> >>>> those interfaces to emphasize the SQL concepts, to bridge high level
> >>>> concepts into real interfaces and classes.
> >>>>
> >>>> We keep talking about time-varying relations and dynamic tables when
> >>>> introducing SQL concepts, and sources and sinks are the most critical
> >>>> part playing with those concepts. It's essential to let Flink SQL
> >>>> developers learn these concepts and connect them with real code by
> >>>> introducing these connector interfaces, so they can further write
> >>>> *correct* connectors based on such domain knowledge.
> >>>>
> >>>> So this FLIP is a very important chance to express these concepts and
> >>>> make most SQL developers align on these concepts and be on the same
> >>>> page. It's mostly about different levels of abstraction, and for
> >>>> domains like SQL this is becoming more important. It helps Flink SQL
> >>>> go smoothly in the future, and also makes it easier for new
> >>>> contributors. But I would admit this is not that obvious for others
> >>>> who don't work with SQL frequently.
> >>>>
> >>>> Best,
> >>>> Kurt
> >>>>
> >>>>
> >>>> On Wed, Mar 25, 2020 at 11:07 AM Becket Qin <be...@gmail.com>
> >> wrote:
> >>>>
> >>>>> Hi Jark,
> >>>>>
> >>>>> It is good to know that we do not expect the end users to touch those
> >>>>> interfaces.
> >>>>>
> >>>>> Then the question boils down to whether the connector developers
> should
> >>>> be
> >>>>> aware of the interfaces that are only used by the SQL optimizer. It
> >>>> seems a
> >>>>> win if we can avoid that.
> >>>>>
> >>>>> Two potential solutions off the top of my head are:
> >>>>> 1. An internal helper class doing the instanceOf based on DataStream
> >>>> source
> >>>>> interface and create pluggables for that DataStream source.
> >>>>> 2. codegen the set of TableSource interfaces given a DataStream
> Source
> >>>> and
> >>>>> its corresponding TablePluggablesFactory.
> >>>>>
> >>>>> Thanks,
> >>>>>
> >>>>> Jiangjie (Becket) Qin
> >>>>>
> >>>>> On Wed, Mar 25, 2020 at 10:07 AM Jark Wu <im...@gmail.com> wrote:
> >>>>>
> >>>>>> Hi Becket,
> >>>>>>
> >>>>>> Regarding Flavor1 and Flavor2, I want to clarify that users will
> >>>> never
> >>>>>> use table source like this:
> >>>>>>
> >>>>>> {
> >>>>>>      MyTableSource myTableSource = MyTableSourceFactory.create();
> >>>>>>      myTableSource.setSchema(mySchema);
> >>>>>>      myTableSource.applyFilterPredicate(expression);
> >>>>>>      ...
> >>>>>> }
> >>>>>>
> >>>>>> TableFactory and TableSource are not directly exposed to end users,
> >> all
> >>>>> the
> >>>>>> methods are called by planner, not users.
> >>>>>> Users always use DDL or descriptor to register a table, and planner
> >>>> will
> >>>>>> find the factory and create sources according to the properties.
> >>>>>> All the optimization are applied automatically, e.g.
> filter/projection
> >>>>>> pushdown, users don't need to call `applyFilterPredicate`
> explicitly.
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> On Wed, 25 Mar 2020 at 09:25, Becket Qin <be...@gmail.com>
> >> wrote:
> >>>>>>
> >>>>>>> Hi Timo and Dawid,
> >>>>>>>
> >>>>>>> Thanks for the clarification. They really help. You are right that
> we
> >>>>> are
> >>>>>>> on the same page regarding the hierarchy. I think the only
> difference
> >>>>>>> between our views is the flavor of the interfaces. There are two
> >>>> flavors
> >>>>>> of
> >>>>>>> the source interface for DataStream and Table source.
> >>>>>>>
> >>>>>>> *Flavor 1. Table Sources are some wrapper interfaces around
> >>>> DataStream
> >>>>>>> source.*
> >>>>>>> Following this way, we will reach the design of the current
> proposal,
> >>>>>> i.e.
> >>>>>>> each pluggable exposed in the DataStream source will have a
> >>>>> corresponding
> >>>>>>> TableSource interface counterpart, which are at the Factory level.
> >>>>> Users
> >>>>>>> will write code like this:
> >>>>>>>
> >>>>>>> {
> >>>>>>>       MyTableSource myTableSource = MyTableSourceFactory.create();
> >>>>>>>       myTableSource.setSchema(mySchema);
> >>>>>>>       myTableSource.applyFilterPredicate(expression);
> >>>>>>>       ...
> >>>>>>> }
> >>>>>>>
> >>>>>>> The good thing for this flavor is that from the SQL / Table's
> >>>>>> perspective,
> >>>>>>> there is a dedicated set of Table oriented interface.
> >>>>>>>
> >>>>>>> The downsides are:
> >>>>>>> A. From the user's perspective, DataStream Source and Table Source
> >>>> are
> >>>>>> just
> >>>>>>> two different sets of interfaces, regardless of how they are the
> same
> >>>>>>> internally.
> >>>>>>> B. The source developers have to develop for those two sets of
> >>>>> interfaces
> >>>>>>> in order to support both DataStream and Table.
> >>>>>>> C. It is not explicit that DataStream can actually share the
> >>>> pluggable
> >>>>> in
> >>>>>>> Table / SQL. For example, in order to provide a filter pluggable
> with
> >>>>> SQL
> >>>>>>> expression, users will have to know the actual converter class that
> >>>>>>> converts the expression to the filter predicate and construct that
> >>>>>>> converter by themselves.
> >>>>>>>
> >>>>>>> ---------------
> >>>>>>>
> >>>>>>> *Flavor 2. A TableSource is a DataStream source with a bunch of
> >>>>>> pluggables.
> >>>>>>> No Table specific interfaces at all.*
> >>>>>>> Following this way, we will reach another design where you have a
> >>>>>>> SourceFactory and a single Pluggable factory for all the table
> >>>>>> pluggables.
> >>>>>>> And users will write something like:
> >>>>>>>
> >>>>>>> {
> >>>>>>>       Deserializer<Row> myTableDeserializer =
> >>>>>>> MyTablePluggableFactory.createDeserializer(schema)
> >>>>>>>       MySource<Row> mySource = MySourceFactory.create(properties,
> >>>>>>> myTableDeserializer);
> >>>>>>>       mySource.applyPredicate(MyTablePluggableFactory.createFilterPredicate(expression));
> >>>>>>> }
> >>>>>>>
> >>>>>>> The good thing for this flavor is that there is just one set of
> >>>>> interface
> >>>>>>> that works for both Table and DataStream. There is no difference
> >>>>> between
> >>>>>>> creating a DataStream source and creating a Table source.
> DataStream
> >>>>> can
> >>>>>>> easily reuse the pluggables from the Table sources.
> >>>>>>>
> >>>>>>> The downside is that Table / SQL won't have a dedicated API for
> >>>>>>> optimization. Instead of writing:
> >>>>>>>
> >>>>>>> if (MyTableSource instanceof FilterableTableSource) {
> >>>>>>>        // Some filter push down logic.
> >>>>>>>        MyTableSource.applyPredicate(expression)
> >>>>>>> }
> >>>>>>>
> >>>>>>> One has to write:
> >>>>>>>
> >>>>>>> if (MySource instanceof FilterableSource) {
> >>>>>>>       // Some filter push down logic.
> >>>>>>>       mySource.applyPredicate(MyTablePluggableFactory.createFilterPredicate(expression));
> >>>>>>> }
> >>>>>>>
> >>>>>>> -------------------------
> >>>>>>>
> >>>>>>> Just to be clear, I am not saying flavor 2 is necessarily better than
> >>>>>>> flavor 1, but I want to make sure flavor 2 is also considered and
> >>>>>>> discussed.
> >>>>>>>
> >>>>>>> Thanks,
> >>>>>>>
> >>>>>>> Jiangjie (Becket) Qin.
> >>>>>>>
> >>>>>>> On Tue, Mar 24, 2020 at 10:53 PM Dawid Wysakowicz <
> >>>>>> dwysakowicz@apache.org>
> >>>>>>> wrote:
> >>>>>>>
> >>>>>>>> Hi Becket,
> >>>>>>>>
> >>>>>>>> I really think we don't have differing opinions. We might not see
> >>>>>>>> the changes in the same way yet. Personally I think of the
> >>>>>>>> DynamicTableSource as a factory for a Source implemented for the
> >>>>>>>> DataStream API. The important fact is that the DynamicTableSource
> >>>>>>>> and all feature traits (SupportsFilterablePushDown,
> >>>>>>>> SupportsProjectPushDown etc.) work with Table API concepts such as
> >>>>>>>> e.g. Expressions, SQL specific types etc. In the end what the
> >>>>>>>> implementation would resemble is (bear in mind I tremendously
> >>>>>>>> simplified the example, just to show the relation between the two
> >>>>>>>> APIs):
> >>>>>>>>
> >>>>>>>> SupportsFilterablePushDown {
> >>>>>>>>
> >>>>>>>>     applyFilters(List<ResolvedExpression> filters) {
> >>>>>>>>
> >>>>>>>>       this.filters = convertToDataStreamFilters(filters);
> >>>>>>>>
> >>>>>>>>     }
> >>>>>>>>
> >>>>>>>>     Source createSource() {
> >>>>>>>>
> >>>>>>>>           return Source.create()
> >>>>>>>>
> >>>>>>>>             .applyFilters(this.filters);
> >>>>>>>>
> >>>>>>>>      }
> >>>>>>>>
> >>>>>>>> }
> >>>>>>>>
> >>>>>>>> or exactly as you said for the computed columns:
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> SupportsComputedColumnsPushDown {
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>     applyComputedColumn(ComputedColumnConverter converter) {
> >>>>>>>>
> >>>>>>>>       this.deserializationSchema = new DeserializationSchema<Row> {
> >>>>>>>>
> >>>>>>>>         Row deserialize(...) {
> >>>>>>>>
> >>>>>>>>           RowData row = format.deserialize(bytes); // original format, e.g. json, avro, etc.
> >>>>>>>>
> >>>>>>>>           RowData enriched = converter(row)
> >>>>>>>>
> >>>>>>>>         }
> >>>>>>>>
> >>>>>>>>       }
> >>>>>>>>
> >>>>>>>>     }
> >>>>>>>>
> >>>>>>>>     Source createSource() {
> >>>>>>>>
> >>>>>>>>           return Source.create()
> >>>>>>>>
> >>>>>>>>             .withDeserialization(deserializationSchema);
> >>>>>>>>
> >>>>>>>>      }
> >>>>>>>>
> >>>>>>>> }
> >>>>>>>>
> >>>>>>>> So to sum it up again, all those interfaces are factories that
> >>>>>>>> configure appropriate parts of the DataStream API using Table API
> >>>>>>>> concepts. Finally, to answer your question for the particular
> >>>>>>>> comparisons:
> >>>>>>>>
> >>>>>>>> DynamicTableSource v.s. Source<Row, SourceSplitT, EnumChkT>
> >>>>>>>> SupportsFilterablePushDown v.s. FilterableSource
> >>>>>>>> SupportsProjectablePushDown v.s. ProjectableSource
> >>>>>>>> SupportsWatermarkPushDown v.s. WithWatermarkAssigner
> >>>>>>>> SupportsComputedColumnPushDown v.s. ComputedColumnDeserializer
> >>>>>>>> ScanTableSource v.s. ChangeLogDeserializer.
> >>>>>>>>
> >>>>>>>> Pretty much you can think of everything on the left as factories
> >>>>>>>> for the right side; the left side works with Table API classes
> >>>>>>>> (Expressions, DataTypes). I hope this clarifies it a bit.
> >>>>>>>>
> >>>>>>>> Best,
> >>>>>>>>
> >>>>>>>> Dawid
> >>>>>>>> On 24/03/2020 15:03, Becket Qin wrote:
> >>>>>>>>
> >>>>>>>> Hey Kurt,
> >>>>>>>>
> >>>>>>>> I don't think DataStream should see some SQL specific concepts
> such
> >>>>> as
> >>>>>>>>
> >>>>>>>> Filtering or ComputedColumn.
> >>>>>>>>
> >>>>>>>> Projectable and Filterable seem to be not necessarily SQL concepts,
> >>>>>>>> but could be applicable to a DataStream source as well to reduce the
> >>>>>>>> network load. For example, ORC and Parquet should probably also be
> >>>>>>>> readable from DataStream, right?
> >>>>>>>> right?
> >>>>>>>>
> >>>>>>>> ComputedColumn is not part of the Source, it is an interface
> >>>> extends
> >>>>>> the
> >>>>>>>> Deserializer, which is a pluggable for the Source. From the SQL's
> >>>>>>>> perspective it has the concept of computed column, but from the
> >>>>> Source
> >>>>>>>> perspective, It is essentially a Deserializer which also converts
> >>>> the
> >>>>>>>> records internally, assuming we allow some conversion to be
> >>>> embedded
> >>>>> to
> >>>>>>>> the source in addition to just deserialization.
> >>>>>>>>
> >>>>>>>> Thanks,
> >>>>>>>>
> >>>>>>>> Jiangjie (Becket) Qin
> >>>>>>>>
> >>>>>>>> On Tue, Mar 24, 2020 at 9:36 PM Jark Wu <im...@gmail.com> <
> >>>>>>> imjark@gmail.com> wrote:
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> Thanks Timo for updating the formats section. That would be very
> >>>>>> helpful
> >>>>>>>> for changelog supporting (FLIP-105).
> >>>>>>>>
> >>>>>>>> I just left 2 minor comment about some method names. In general,
> >>>> I'm
> >>>>> +1
> >>>>>>> to
> >>>>>>>> start a voting.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>
> --------------------------------------------------------------------------------------------------
> >>>>>>>>
> >>>>>>>> Hi Becket,
> >>>>>>>>
> >>>>>>>> I agree we shouldn't duplicate code, especially the runtime
> >>>>>>>> implementations.
> >>>>>>>> However, the interfaces proposed by FLIP-95 are mainly used during
> >>>>>>>> optimization (compiling), not runtime.
> >>>>>>>> I don't think there is much to share for this. Because table/sql
> >>>>>>>> is declarative, but DataStream is imperative.
> >>>>>>>> For example, filter push down, DataStream FilterableSource may
> >>>> allow
> >>>>> to
> >>>>>>>> accept a FilterFunction (which is a black box for the source).
> >>>>>>>> However, table sources should pick the pushed filter expressions,
> >>>>> some
> >>>>>>>> sources may only support "=", "<", ">" conditions.
> >>>>>>>> Pushing a FilterFunction doesn't work in table ecosystem. That
> >>>> means,
> >>>>>> the
> >>>>>>>> connectors have to have some table-specific implementations.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> Best,
> >>>>>>>> Jark
> >>>>>>>>
> >>>>>>>> On Tue, 24 Mar 2020 at 20:41, Kurt Young <yk...@gmail.com> <
> >>>>>>> ykt836@gmail.com> wrote:
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> Hi Becket,
> >>>>>>>>
> >>>>>>>> I don't think DataStream should see some SQL specific concepts
> such
> >>>>> as
> >>>>>>>> Filtering or ComputedColumn. It's
> >>>>>>>> better to stay within SQL area and translate to more generic
> >>>> concept
> >>>>>> when
> >>>>>>>> translating to DataStream/Runtime
> >>>>>>>> layer, such as use MapFunction to represent computed column logic.
> >>>>>>>>
> >>>>>>>> Best,
> >>>>>>>> Kurt
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On Tue, Mar 24, 2020 at 5:47 PM Becket Qin <be...@gmail.com>
> >>>> <
> >>>>>>> becket.qin@gmail.com> wrote:
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> Hi Timo and Dawid,
> >>>>>>>>
> >>>>>>>> It's really great that we have the same goal. I am actually
> >>>> wondering
> >>>>>>>>
> >>>>>>>> if
> >>>>>>>>
> >>>>>>>> we
> >>>>>>>>
> >>>>>>>> can go one step further to avoid some of the interfaces in Table
> as
> >>>>>>>>
> >>>>>>>> well.
> >>>>>>>>
> >>>>>>>> For example, if we have the FilterableSource, do we still need the
> >>>>>>>> FilterableTableSource? Should DynamicTableSource just become a
> >>>>>>>> Source<*Row*,
> >>>>>>>> SourceSplitT, EnumChkT>?
> >>>>>>>>
> >>>>>>>> Can you help me understand a bit more about the reason we need the
> >>>>>>>> following relational representation / wrapper interfaces v.s. the
> >>>>>>>> interfaces that we could put to the Source in FLIP-27?
> >>>>>>>>
> >>>>>>>> DynamicTableSource v.s. Source<Row, SourceSplitT, EnumChkT>
> >>>>>>>> SupportsFilterablePushDown v.s. FilterableSource
> >>>>>>>> SupportsProjectablePushDown v.s. ProjectableSource
> >>>>>>>> SupportsWatermarkPushDown v.s. WithWatermarkAssigner
> >>>>>>>> SupportsComputedColumnPushDown v.s. ComputedColumnDeserializer
> >>>>>>>> ScanTableSource v.s. ChangeLogDeserializer.
> >>>>>>>> LookUpTableSource v.s. LookUpSource
> >>>>>>>>
> >>>>>>>> Assuming we have all the interfaces on the right side, do we still
> >>>>> need
> >>>>>>>>
> >>>>>>>> the
> >>>>>>>>
> >>>>>>>> interfaces on the left side? Note that the interfaces on the right
> >>>>> can
> >>>>>>>>
> >>>>>>>> be
> >>>>>>>>
> >>>>>>>> used by both DataStream and Table. If we do this, there will only
> >>>> be
> >>>>>>>>
> >>>>>>>> one
> >>>>>>>>
> >>>>>>>> set of Source interfaces Table and DataStream, the only difference
> >>>> is
> >>>>>>>>
> >>>>>>>> that
> >>>>>>>>
> >>>>>>>> the Source for table will have some specific plugins and
> >>>>>>>>
> >>>>>>>> configurations.
> >>>>>>>>
> >>>>>>>> An
> >>>>>>>>
> >>>>>>>> omnipotent Source can implement all the the above interfaces and
> >>>>> take a
> >>>>>>>> Deserializer that implements both ComputedColumnDeserializer and
> >>>>>>>> ChangeLogDeserializer.
> >>>>>>>>
> >>>>>>>> Would the SQL planner work with that?
> >>>>>>>>
> >>>>>>>> Thanks,
> >>>>>>>>
> >>>>>>>> Jiangjie (Becket) Qin
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On Tue, Mar 24, 2020 at 5:03 PM Jingsong Li <
> >>>> jingsonglee0@gmail.com>
> >>>>> <
> >>>>>>> jingsonglee0@gmail.com>
> >>>>>>>> wrote:
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> +1. Thanks Timo for the design doc.
> >>>>>>>>
> >>>>>>>> We can also consider @Experimental too. But I am +1 to
> >>>>>>>>
> >>>>>>>> @PublicEvolving,
> >>>>>>>>
> >>>>>>>> we
> >>>>>>>>
> >>>>>>>> should be confident in the current change.
> >>>>>>>>
> >>>>>>>> Best,
> >>>>>>>> Jingsong Lee
> >>>>>>>>
> >>>>>>>> On Tue, Mar 24, 2020 at 4:30 PM Timo Walther <tw...@apache.org>
> >>>> <
> >>>>>>> twalthr@apache.org>
> >>>>>>>>
> >>>>>>>> wrote:
> >>>>>>>>
> >>>>>>>> @Becket: We totally agree that we don't need table specific
> >>>>>>>>
> >>>>>>>> connectors
> >>>>>>>>
> >>>>>>>> during runtime. As Dawid said, the interfaces proposed here are
> >>>>>>>>
> >>>>>>>> just
> >>>>>>>>
> >>>>>>>> for
> >>>>>>>>
> >>>>>>>> communication with the planner. Once the properties (watermarks,
> >>>>>>>> computed column, filters, projection etc.) are negotiated, we can
> >>>>>>>> configure a regular Flink connector.
> >>>>>>>>
> >>>>>>>> E.g. setting the watermark assigner and deserialization schema of
> a
> >>>>>>>> Kafka connector.
> >>>>>>>>
> >>>>>>>> For better separation of concerns, Flink connectors should not
> >>>>>>>>
> >>>>>>>> include
> >>>>>>>>
> >>>>>>>> relational interfaces and depend on flink-table. This is the
> >>>>>>>> responsibility of table source/sink.
> >>>>>>>>
> >>>>>>>> @Kurt: I would like to mark them @PublicEvolving already because
> we
> >>>>>>>>
> >>>>>>>> need
> >>>>>>>>
> >>>>>>>> to deprecate the old interfaces as early as possible. We cannot
> >>>>>>>>
> >>>>>>>> redirect
> >>>>>>>>
> >>>>>>>> to @Internal interfaces. They are not marked @Public, so we can
> >>>>>>>>
> >>>>>>>> still
> >>>>>>>>
> >>>>>>>> evolve them. But a core design shift should not happen again, it
> >>>>>>>>
> >>>>>>>> would
> >>>>>>>>
> >>>>>>>> leave a bad impression if we are redesign over and over again.
> >>>>>>>>
> >>>>>>>> Instead
> >>>>>>>>
> >>>>>>>> we should be confident in the current change.
> >>>>>>>>
> >>>>>>>> Regards,
> >>>>>>>> Timo
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On 24.03.20 09:20, Dawid Wysakowicz wrote:
> >>>>>>>>
> >>>>>>>> Hi Becket,
> >>>>>>>>
> >>>>>>>> Answering your question, we have the same intention not to
> >>>>>>>>
> >>>>>>>> duplicate
> >>>>>>>>
> >>>>>>>> connectors between datastream and table apis. The interfaces
> >>>>>>>>
> >>>>>>>> proposed
> >>>>>>>>
> >>>>>>>> in
> >>>>>>>>
> >>>>>>>> the FLIP are a way to describe relational properties of a source.
> >>>>>>>>
> >>>>>>>> The
> >>>>>>>>
> >>>>>>>> intention is as you described to translate all of those expressed
> >>>>>>>>
> >>>>>>>> as
> >>>>>>>>
> >>>>>>>> expressions or other Table specific structures into a DataStream
> >>>>>>>>
> >>>>>>>> source.
> >>>>>>>>
> >>>>>>>> In other words I think what we are doing here is in line with
> >>>>>>>>
> >>>>>>>> what
> >>>>>>>>
> >>>>>>>> you
> >>>>>>>>
> >>>>>>>> described.
> >>>>>>>>
> >>>>>>>> Best,
> >>>>>>>>
> >>>>>>>> Dawid
> >>>>>>>>
> >>>>>>>> On 24/03/2020 02:23, Becket Qin wrote:
> >>>>>>>>
> >>>>>>>> Hi Timo,
> >>>>>>>>
> >>>>>>>> Thanks for the proposal. I completely agree that the current
> >>>>>>>>
> >>>>>>>> Table
> >>>>>>>>
> >>>>>>>> connectors could be simplified quite a bit. I haven't finished
> >>>>>>>>
> >>>>>>>> reading
> >>>>>>>>
> >>>>>>>> everything, but here are some quick thoughts.
> >>>>>>>>
> >>>>>>>> Actually to me the biggest question is why should there be two
> >>>>>>>>
> >>>>>>>> different
> >>>>>>>>
> >>>>>>>> connector systems for DataStream and Table? What is the
> >>>>>>>>
> >>>>>>>> fundamental
> >>>>>>>>
> >>>>>>>> reason
> >>>>>>>>
> >>>>>>>> that is preventing us from merging them to one?
> >>>>>>>>
> >>>>>>>> The basic functionality of a connector is to provide
> >>>>>>>>
> >>>>>>>> capabilities
> >>>>>>>>
> >>>>>>>> to
> >>>>>>>>
> >>>>>>>> do
> >>>>>>>>
> >>>>>>>> IO
> >>>>>>>>
> >>>>>>>> and Serde. Conceptually, Table connectors should just be
> >>>>>>>>
> >>>>>>>> DataStream
> >>>>>>>>
> >>>>>>>> connectors that are dealing with Rows. It seems that quite a few
> >>>>>>>>
> >>>>>>>> of
> >>>>>>>>
> >>>>>>>> the
> >>>>>>>>
> >>>>>>>> special connector requirements are just a specific way to do IO
> >>>>>>>>
> >>>>>>>> /
> >>>>>>>>
> >>>>>>>> Serde.
> >>>>>>>>
> >>>>>>>> Taking SupportsFilterPushDown as an example, imagine we have the
> >>>>>>>>
> >>>>>>>> following
> >>>>>>>>
> >>>>>>>> interface:
> >>>>>>>>
> >>>>>>>> interface FilterableSource<PREDICATE> {
> >>>>>>>>        void applyFilterable(Supplier<PREDICATE> predicate);
> >>>>>>>> }
> >>>>>>>>
> >>>>>>>> And if a ParquetSource would like to support filterable, it will
> >>>>>>>>
> >>>>>>>> become:
> >>>>>>>>
> >>>>>>>> class ParquetSource implements Source,
> >>>>>>>>
> >>>>>>>> FilterableSource<FilterPredicate> {
> >>>>>>>>
> >>>>>>>>        ...
> >>>>>>>> }
> >>>>>>>>
> >>>>>>>> For Table, one just needs to provide a predicate supplier that
> >>>>>>>>
> >>>>>>>> converts
> >>>>>>>>
> >>>>>>>> an
> >>>>>>>>
> >>>>>>>> Expression to the specified predicate type. This has a few
> >>>>>>>>
> >>>>>>>> benefits:
> >>>>>>>>
> >>>>>>>> 1. Same unified API for filterable for sources, regardless of
> >>>>>>>>
> >>>>>>>> DataStream or
> >>>>>>>>
> >>>>>>>> Table.
> >>>>>>>> 2. The  DataStream users now can also use the
> >>>>>>>>
> >>>>>>>> ExpressionToPredicate
> >>>>>>>>
> >>>>>>>> supplier if they want to.
> >>>>>>>>
> >>>>>>>> To summarize, my main point is that I am wondering if it is
> >>>>>>>>
> >>>>>>>> possible
> >>>>>>>>
> >>>>>>>> to
> >>>>>>>>
> >>>>>>>> have a single set of connector interface for both Table and
> >>>>>>>>
> >>>>>>>> DataStream,
> >>>>>>>>
> >>>>>>>> rather than having two hierarchies. I am not 100% sure if this
> >>>>>>>>
> >>>>>>>> would
> >>>>>>>>
> >>>>>>>> work,
> >>>>>>>>
> >>>>>>>> but if it works, this would be a huge win from both code
> >>>>>>>>
> >>>>>>>> maintenance
> >>>>>>>>
> >>>>>>>> and
> >>>>>>>>
> >>>>>>>> user experience perspective.
> >>>>>>>>
> >>>>>>>> Thanks,
> >>>>>>>>
> >>>>>>>> Jiangjie (Becket) Qin
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On Tue, Mar 24, 2020 at 2:03 AM Dawid Wysakowicz <
> >>>>>>>>
> >>>>>>>> dwysakowicz@apache.org>
> >>>>>>>>
> >>>>>>>> wrote:
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> Hi Timo,
> >>>>>>>>
> >>>>>>>> Thank you for the proposal. I think it is an important
> >>>>>>>>
> >>>>>>>> improvement
> >>>>>>>>
> >>>>>>>> that
> >>>>>>>>
> >>>>>>>> will benefit many parts of the Table API. The proposal looks
> >>>>>>>>
> >>>>>>>> really
> >>>>>>>>
> >>>>>>>> good
> >>>>>>>>
> >>>>>>>> to me and personally I would be comfortable with voting on the
> >>>>>>>>
> >>>>>>>> current
> >>>>>>>>
> >>>>>>>> state.
> >>>>>>>>
> >>>>>>>> Best,
> >>>>>>>>
> >>>>>>>> Dawid
> >>>>>>>>
> >>>>>>>> On 23/03/2020 18:53, Timo Walther wrote:
> >>>>>>>>
> >>>>>>>> Hi everyone,
> >>>>>>>>
> >>>>>>>> I received some questions around how the new interfaces play
> >>>>>>>>
> >>>>>>>> together
> >>>>>>>>
> >>>>>>>> with formats and their factories.
> >>>>>>>>
> >>>>>>>> Furthermore, for MySQL or Postgres CDC logs, the format should
> >>>>>>>>
> >>>>>>>> be
> >>>>>>>>
> >>>>>>>> able
> >>>>>>>>
> >>>>>>>> to return a `ChangelogMode`.
> >>>>>>>>
> >>>>>>>> Also, I incorporated the feedback around the factory design in
> >>>>>>>>
> >>>>>>>> general.
> >>>>>>>>
> >>>>>>>> I added a new section `Factory Interfaces` to the design
> >>>>>>>>
> >>>>>>>> document.
> >>>>>>>>
> >>>>>>>> This should be helpful to understand the big picture and
> >>>>>>>>
> >>>>>>>> connecting
> >>>>>>>>
> >>>>>>>> the concepts.
> >>>>>>>>
> >>>>>>>> Please let me know what you think?
> >>>>>>>>
> >>>>>>>> Thanks,
> >>>>>>>> Timo
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On 18.03.20 13:43, Timo Walther wrote:
> >>>>>>>>
> >>>>>>>> Hi Benchao,
> >>>>>>>>
> >>>>>>>> this is a very good question. I will update the FLIP about
> >>>>>>>>
> >>>>>>>> this.
> >>>>>>>>
> >>>>>>>> The legacy planner will not support the new interfaces. It
> >>>>>>>>
> >>>>>>>> will
> >>>>>>>>
> >>>>>>>> only
> >>>>>>>>
> >>>>>>>> support the old interfaces. With the next release, I think
> >>>>>>>>
> >>>>>>>> the
> >>>>>>>>
> >>>>>>>> Blink
> >>>>>>>>
> >>>>>>>> planner is stable enough to be the default one as well.
> >>>>>>>>
> >>>>>>>> Regards,
> >>>>>>>> Timo
> >>>>>>>>
> >>>>>>>> On 18.03.20 08:45, Benchao Li wrote:
> >>>>>>>>
> >>>>>>>> Hi Timo,
> >>>>>>>>
> >>>>>>>> Thank you and others for the efforts to prepare this FLIP.
> >>>>>>>>
> >>>>>>>> The FLIP LGTM generally.
> >>>>>>>>
> >>>>>>>> +1 for moving blink data structures to table-common, it's
> >>>>>>>>
> >>>>>>>> useful
> >>>>>>>>
> >>>>>>>> to
> >>>>>>>>
> >>>>>>>> udf too
> >>>>>>>> in the future.
> >>>>>>>> A little question is, do we plan to support the new
> >>>>>>>>
> >>>>>>>> interfaces
> >>>>>>>>
> >>>>>>>> and
> >>>>>>>>
> >>>>>>>> data
> >>>>>>>>
> >>>>>>>> types in legacy planner?
> >>>>>>>> Or we only plan to support these new interfaces in blink
> >>>>>>>>
> >>>>>>>> planner.
> >>>>>>>>
> >>>>>>>> And using primary keys from DDL instead of derived key
> >>>>>>>>
> >>>>>>>> information
> >>>>>>>>
> >>>>>>>> from
> >>>>>>>>
> >>>>>>>> each query is also a good idea,
> >>>>>>>> we met some use cases where this does not works very well
> >>>>>>>>
> >>>>>>>> before.
> >>>>>>>>
> >>>>>>>> This FLIP also makes the dependencies of table modules more
> >>>>>>>>
> >>>>>>>> clear, I
> >>>>>>>>
> >>>>>>>> like
> >>>>>>>> it very much.
> >>>>>>>>
> >>>>>>>> Timo Walther <tw...@apache.org> <tw...@apache.org>
> >>>> 于2020年3月17日周二
> >>>>>>> 上午1:36写道:
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> Hi everyone,
> >>>>>>>>
> >>>>>>>> I'm happy to present the results of long discussions that
> >>>>>>>>
> >>>>>>>> we
> >>>>>>>>
> >>>>>>>> had
> >>>>>>>>
> >>>>>>>> internally. Jark, Dawid, Aljoscha, Kurt, Jingsong, me, and
> >>>>>>>>
> >>>>>>>> many
> >>>>>>>>
> >>>>>>>> more
> >>>>>>>>
> >>>>>>>> have contributed to this design document.
> >>>>>>>>
> >>>>>>>> We would like to propose new long-term table source and
> >>>>>>>>
> >>>>>>>> table
> >>>>>>>>
> >>>>>>>> sink
> >>>>>>>>
> >>>>>>>> interfaces:
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-95%3A+New+TableSource+and+TableSink+interfaces
> >>>>>>>>
> >>>>>>>> This is a requirement for FLIP-105 and finalizing FLIP-32.
> >>>>>>>>
> >>>>>>>> The goals of this FLIP are:
> >>>>>>>>
> >>>>>>>> - Simplify the current interface architecture:
> >>>>>>>>          - Merge upsert, retract, and append sinks.
> >>>>>>>>          - Unify batch and streaming sources.
> >>>>>>>>          - Unify batch and streaming sinks.
> >>>>>>>>
> >>>>>>>> - Allow sources to produce a changelog:
> >>>>>>>>          - UpsertTableSources have been requested a lot by
> >>>>>>>>
> >>>>>>>> users.
> >>>>>>>>
> >>>>>>>> Now
> >>>>>>>>
> >>>>>>>> is the
> >>>>>>>> time to open the internal planner capabilities via the new
> >>>>>>>>
> >>>>>>>> interfaces.
> >>>>>>>>
> >>>>>>>>          - According to FLIP-105, we would like to support
> >>>>>>>>
> >>>>>>>> changelogs for
> >>>>>>>>
> >>>>>>>> processing formats such as Debezium.
> >>>>>>>>
> >>>>>>>> - Don't rely on DataStream API for source and sinks:
> >>>>>>>>          - According to FLIP-32, the Table API and SQL should
> >>>>>>>>
> >>>>>>>> be
> >>>>>>>>
> >>>>>>>> independent
> >>>>>>>> of the DataStream API which is why the `table-common`
> >>>>>>>>
> >>>>>>>> module
> >>>>>>>>
> >>>>>>>> has
> >>>>>>>>
> >>>>>>>> no
> >>>>>>>>
> >>>>>>>> dependencies on `flink-streaming-java`.
> >>>>>>>>          - Source and sink implementations should only depend
> >>>>>>>>
> >>>>>>>> on
> >>>>>>>>
> >>>>>>>> the
> >>>>>>>>
> >>>>>>>> `table-common` module after FLIP-27.
> >>>>>>>>          - Until FLIP-27 is ready, we still put most of the
> >>>>>>>>
> >>>>>>>> interfaces in
> >>>>>>>>
> >>>>>>>> `table-common` and strictly separate interfaces that
> >>>>>>>>
> >>>>>>>> communicate
> >>>>>>>>
> >>>>>>>> with a
> >>>>>>>> planner and actual runtime reader/writers.
> >>>>>>>>
> >>>>>>>> - Implement efficient sources and sinks without planner
> >>>>>>>>
> >>>>>>>> dependencies:
> >>>>>>>>
> >>>>>>>>          - Make Blink's internal data structures available to
> >>>>>>>>
> >>>>>>>> connectors.
> >>>>>>>>
> >>>>>>>>          - Introduce stable interfaces for data structures
> >>>>>>>>
> >>>>>>>> that
> >>>>>>>>
> >>>>>>>> can
> >>>>>>>>
> >>>>>>>> be
> >>>>>>>>
> >>>>>>>> marked as `@PublicEvolving`.
> >>>>>>>>          - Only require dependencies on `flink-table-common`
> >>>>>>>>
> >>>>>>>> in
> >>>>>>>>
> >>>>>>>> the
> >>>>>>>>
> >>>>>>>> future
> >>>>>>>>
> >>>>>>>> It finalizes the concept of dynamic tables and consideres
> >>>>>>>>
> >>>>>>>> how
> >>>>>>>>
> >>>>>>>> all
> >>>>>>>>
> >>>>>>>> source/sink related classes play together.
> >>>>>>>>
> >>>>>>>> We look forward to your feedback.
> >>>>>>>>
> >>>>>>>> Regards,
> >>>>>>>> Timo
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> --
> >>>>>>>> Best, Jingsong Lee
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
> >>
> >
>
>

Re: [DISCUSS] FLIP-95: New TableSource and TableSink interfaces

Posted by Timo Walther <tw...@apache.org>.
Hi Becket,

Regarding "PushDown/NestedPushDown which is internal to optimizer":

Those concepts cannot be entirely internal to the optimizer; at some 
point the optimizer needs to pass them into the connector-specific code, 
which then converts them into e.g. Parquet expressions. So there must 
be some interface that takes a SQL Expression and converts it into 
connector-specific code. This interface between planner and connector is 
modelled by the SupportsXXX interfaces. And you are right: if developers 
don't care, they don't need to implement those optional interfaces, but 
then they will not get performant connectors.
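
For example, a rough sketch of such a conversion could look as follows 
(interface and method names are taken from the FLIP document as proposed, 
exact signatures may still change; the Parquet types and the 
convertToParquet() helper are only an illustration of connector-specific 
code):

class ParquetDynamicTableSource implements ScanTableSource,
        SupportsFilterPushDown {

    private final List<FilterPredicate> parquetPredicates = new ArrayList<>();

    @Override
    public Result applyFilters(List<ResolvedExpression> filters) {
        List<ResolvedExpression> accepted = new ArrayList<>();
        List<ResolvedExpression> remaining = new ArrayList<>();
        for (ResolvedExpression filter : filters) {
            // e.g. `amount > 10` becomes
            // FilterApi.gt(FilterApi.intColumn("amount"), 10)
            FilterPredicate predicate = convertToParquet(filter);
            if (predicate != null) {
                parquetPredicates.add(predicate);
                accepted.add(filter);
            } else {
                // not expressible in Parquet; evaluated by the planner
                remaining.add(filter);
            }
        }
        return Result.of(accepted, remaining);
    }

    // remaining ScanTableSource methods omitted for brevity
}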

Regarding "Table connector can work with the above two mechanism":

A table connector needs three mechanisms that are represented in the 
current design.

1. a stateless discovery interface (Factory) that converts ConfigOptions 
into a stateful factory interface 
(DynamicTableSource/DynamicTableSink)

2. a stateful factory interface (DynamicTableSource/DynamicTableSink) 
that receives concepts from the optimizer (watermarks, filters, 
projections) and produces runtime classes such as your 
`ExpressionToParquetFilter`

3. runtime interfaces that are created by the stateful factory; all the 
factories that you mentioned can be used in `getScanRuntimeProvider`. A 
minimal sketch of how these three mechanisms fit together follows below.
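
Putting the three mechanisms together, a simplified sketch (a real 
implementation must provide further methods such as `getChangelogMode`):

// 1. stateless discovery interface, found via Java's ServiceLoader
class MyDynamicTableFactory implements DynamicTableSourceFactory {
    @Override
    public DynamicTableSource createDynamicTableSource(Context context) {
        // read the validated ConfigOptions and create the stateful factory
        return new MyDynamicTableSource(/* validated options */);
    }
}

// 2. stateful factory that negotiates watermarks, filters, projections,
// etc. with the planner via the SupportsXXX interfaces ...
class MyDynamicTableSource implements ScanTableSource, SupportsFilterPushDown {

    // 3. ... and finally produces the runtime implementation
    @Override
    public ScanRuntimeProvider getScanRuntimeProvider(ScanContext context) {
        SourceFunction<RowData> runtimeSource = createRuntimeSource(); // hypothetical helper
        return SourceFunctionProvider.of(runtimeSource, false);
    }
}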

Regarding "connector developers just need to know how to write an 
ExpressionToParquetFilter":

This is the entire purpose of the DynamicTableSource/DynamicTableSink: 
the bridging between SQL concepts and connector-specific concepts. 
Because this is the tricky part: how to get from a SQL concept to a 
connector concept.

Regards,
Timo


On 26.03.20 04:46, Becket Qin wrote:
> Hi Timo,
> 
> Thanks for the reply. I totally agree that there must be something new
> added to the connector in order to make it work for SQL / Table. My concern
> is mostly over what those additions should be, and how to add them. To be
> honest, I was kind of lost when looking at interfaces such as
> DataStructureConverter, RuntimeConverter and their internal context. Also I
> believe most connector developers do not care about the concept of
> "PushDown" / "NestedPushDown", which is internal to the optimizer and not
> even exposed to SQL writers.
> 
> Therefore I am trying to see if we can:
> A) Keep those additions to a minimum for the connector developers, if they
> don't have to know the details.
> B) Expose as few high-level concepts as possible. More specifically, try to
> speak the connector language and expose the general mechanism instead of
> binding it to use-case semantics.
> 
> If we can achieve the above two goals, we could avoid adding unnecessary
> burden to the connector developers, and also make the connectors more
> generic.
> 
> It might be worth thinking about what additional work is necessary for the
> connector developers. Here is what I am thinking of; please correct me if
> I miss something.
> 
>     1. A Factory interface that allows high level use case, in this case
>     SQL, to find a matching source using service provider mechanism.
>     2. Allows the high level use case to specify the plugins that are
>     supported by the underneath DataStream Source.
> 
> If the Table connector can work with the above two mechanisms, maybe we can
> make
> some slight modifications to the interfaces in the current FLIP.
> 
>     - A *SourceFactory* which extends the Factory interface in the FLIP,
>     with one more method:
>        - *Source getSource();*
>     - Some decorative interfaces to the SourceFactory such as:
>        - *FilterFactory<PREDICATE, T extends Supplier<PREDICATE>>*, with the
>        following method
>           - T getFilter();
>        - *ProjectorFactory<PROJECTOR, T extends Supplier<PROJECTOR>>*, with
>        the following method:
>           - T getProjector();
>        - *DeserializerFactory<INPUT, OUTPUT>*
> 
> With this set of API, a ParquetTableSourceFactory may become:
> 
> class ParquetTableSourceFactory implements
>             SourceFactory,
>             DeserializerFactory<ParquetRecords, Row>,
>             FilterFactory<ParquetFilter, ExpressionToParquetFilter> {
>      @Override
>      ParquetSource getSource() { ... }
> 
>      @Override
>      ExpressionToParquetFilter getFilterSupplier() { ... }
> }
> 
> The ExpressionToParquetFilter will have an *applyPredicate(Expression)*
> method.
> 
> I know it does not look like a perfect interface from the pure SQL
> perspective. And I am not even sure if this would meet all the requirements
> for SQL, but the benefit is that the connector developers just need to know
> how to write an ExpressionToParquetFilter in order to make it work for
> Table, without having to understand the entire SQL concept.
> 
> Thanks,
> 
> Jiangjie (Becket) Qin
> 
> 
> 
> On Wed, Mar 25, 2020 at 5:57 PM Timo Walther <tw...@apache.org> wrote:
> 
>> Hi Becket,
>>
>> Let me clarify a few things first: Historically we thought of Table
>> API/SQL as a library on top of DataStream API. Similar to Gelly or CEP.
>> We used TypeInformation in Table API to integrate nicely with DataStream
>> API. However, the last years have shown that SQL is not just a library.
>> It is an entire ecosystem that defines data types, submission behavior,
>> execution behavior, and highly optimized SerDes. SQL is a way to declare
>> data processing end-to-end such that the planner has the full control
>> over the execution.
>>
>> But I totally agree with your concerns around connectors. There is no
>> big difference between your concerns and the current design.
>>
>> 1. "native connector interface is a generic abstraction of doing IO and
>> Serde":
>>
>> This is the case in our design. We are using SourceFunction,
>> DeserializationSchema, WatermarkAssigner, etc. all pluggable interfaces
>> that the DataStream API offers for performing runtime operations.
>>
>> 2. "advanced features ... could be provided in a semantic free way":
>>
>> I agree here. But this is an orthogonal topic that each connector
>> implementer should keep in mind. If a new connector is developed, it
>> should *not* be developed with only SQL in mind but with good abstractions
>> such that DataStream API users can also use it. A connector should have
>> a builder pattern to plug in all capabilities like Parquet filters etc.
>> There should be no table-specific native/runtime connectors. I think
>> this discussion is related to the discussion of FLIP-115.
>>
>> However, as I mentioned before: This FLIP only discusses the interfaces
>> for communication between planner and connector factory. As Dawid said
>> earlier, a DynamicTableSource can be seen more as a factory that calls
>> pluggable interfaces of a native connector in the end:
>>
>> KafkaConnector.builder()
>>     .watermarkAssigner(...)
>>     .keyDeser(...)
>>     .valueDeser(...)
>>     ....
>>     .build()
>>
>> Regards,
>> Timo
>>
>>
>> On 25.03.20 09:05, Becket Qin wrote:
>>> Hi Kurt,
>>>
>>> I do not object to promoting the concepts of SQL, but I don't think we
>>> should do that by introducing a new dedicated set of connector public
>>> interfaces that is only for SQL. The same argument can be applied to
>>> Gelly, CEP, and Machine Learning, claiming that they need to introduce a
>>> dedicated public set of interfaces that fits their own concept and ask
>>> the connector developers to learn and follow their design. As an analogy,
>>> if we want to promote Chinese, we don't want to force people to learn
>>> ancient Chinese poems while they only need to know a few words like
>>> "hello" and "goodbye".
>>>
>>> As for design principles, here is what I think Flink connectors should
>>> look like:
>>>
>>> 1. The native connector interface is a generic abstraction for doing IO
>>> and Serde, without semantics for high-level use cases such as SQL, Gelly,
>>> CEP, etc.
>>>
>>> 2. Some advanced features that may help accelerate the IO and Serde could
>>> be provided in the native connector interfaces in a semantic-free way, so
>>> that all the high-level use cases can leverage them.
>>>
>>> 3. Additional semantics can be built on top of the native source
>>> interface by providing different plugins. These plugins could be aware of
>>> the high-level use case. For example, to provide a filter to the source,
>>> we can do the following:
>>>
>>> // An interface for all the filters that take an expression.
>>> interface ExpressionFilter {
>>>       FilterResult applyFilterExpression();
>>> }
>>>
>>> // A filter plugin implementation that translates the SQL Expression to
>>> // a ParquetFilterPredicate.
>>> class ParquetExpressionFilter implements Supplier<ParquetFilterPredicate>,
>>> ExpressionFilter {
>>>       // Called by the high level use case.
>>>       FilterResult applyFilterExpression() { ... }
>>>
>>>       // Used by the native Source interface.
>>>       ParquetFilterPredicate get() { ... }
>>> }
>>>
>>> In this case, the connector developer just needs to write the logic of
>>> translating an Expression to a Parquet FilterPredicate. They don't have to
>>> understand the entire set of interfaces that we want to promote. Just like
>>> they only need to know how to say "Hello" without learning ancient Chinese
>>> poems.
>>>
>>> Again, I am not saying this is necessarily the best approach. But so far
>> it
>>> seems a reasonable design principle to tell the developers.
>>>
>>> Thanks,
>>>
>>> Jiangjie (becket) Qin
>>>
>>>
>>>
>>> On Wed, Mar 25, 2020 at 11:53 AM Kurt Young <yk...@gmail.com> wrote:
>>>
>>>> Hi Becket,
>>>>
>>>> I don't think we should discuss this in pure engineering aspects. Your
>>>> proposal is trying to let SQL connector developers understand as few SQL
>>>> concepts as possible. But quite the opposite, we are designing those
>>>> interfaces to emphasize the SQL concepts, to bridge high-level concepts
>>>> into real interfaces and classes.
>>>>
>>>> We keep talking about time-varying relations and dynamic tables when we
>>>> introduce SQL concepts, and sources and sinks are the most critical parts
>>>> playing with those concepts. It's essential to let Flink SQL developers
>>>> learn these concepts and connect them with real code by introducing these
>>>> connector interfaces, so that they can further write *correct* connectors
>>>> based on such domain knowledge.
>>>>
>>>> So this FLIP is a very important chance to express these concepts and get
>>>> most SQL developers aligned with the concepts and on the same page. It's
>>>> mostly about different levels of abstraction, and for domains like SQL
>>>> this is becoming more important. It helps Flink SQL go smoothly in the
>>>> future, and also makes it easier for new contributors. But I would admit
>>>> this is not that obvious for others who don't work with SQL frequently.
>>>>
>>>> Best,
>>>> Kurt
>>>>
>>>>
>>>> On Wed, Mar 25, 2020 at 11:07 AM Becket Qin <be...@gmail.com>
>> wrote:
>>>>
>>>>> Hi Jark,
>>>>>
>>>>> It is good to know that we do not expect the end users to touch those
>>>>> interfaces.
>>>>>
>>>>> Then the question boils down to whether the connector developers should
>>>> be
>>>>> aware of the interfaces that are only used by the SQL optimizer. It
>>>> seems a
>>>>> win if we can avoid that.
>>>>>
>>>>> Two potential solutions off the top of my head are:
>>>>> 1. An internal helper class that does the instanceOf check based on the
>>>>> DataStream source interface and creates pluggables for that DataStream
>>>>> source.
>>>>> 2. Codegen the set of TableSource interfaces given a DataStream Source
>>>>> and its corresponding TablePluggablesFactory.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Jiangjie (Becket) Qin
>>>>>
>>>>> On Wed, Mar 25, 2020 at 10:07 AM Jark Wu <im...@gmail.com> wrote:
>>>>>
>>>>>> Hi Becket,
>>>>>>
>>>>>> Regarding Flavor1 and Flavor2, I want to clarify that users will never
>>>>>> use a table source like this:
>>>>>>
>>>>>> {
>>>>>>      MyTableSource myTableSource = MyTableSourceFactory.create();
>>>>>>      myTableSource.setSchema(mySchema);
>>>>>>      myTableSource.applyFilterPredicate(expression);
>>>>>>      ...
>>>>>> }
>>>>>>
>>>>>> TableFactory and TableSource are not directly exposed to end users; all
>>>>>> the methods are called by the planner, not by users.
>>>>>> Users always use DDL or a descriptor to register a table, and the
>>>>>> planner will find the factory and create sources according to the
>>>>>> properties.
>>>>>> All the optimizations are applied automatically, e.g. filter/projection
>>>>>> pushdown; users don't need to call `applyFilterPredicate` explicitly.
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Wed, 25 Mar 2020 at 09:25, Becket Qin <be...@gmail.com>
>> wrote:
>>>>>>
>>>>>>> Hi Timo and Dawid,
>>>>>>>
>>>>>>> Thanks for the clarification. They really help. You are right that we
>>>>>>> are on the same page regarding the hierarchy. I think the only
>>>>>>> difference between our views is the flavor of the interfaces. There are
>>>>>>> two flavors of the source interface for DataStream and Table source.
>>>>>>>
>>>>>>> *Flavor 1. Table Sources are some wrapper interfaces around
>>>> DataStream
>>>>>>> source.*
>>>>>>> Following this way, we will reach the design of the current proposal,
>>>>>> i.e.
>>>>>>> each pluggable exposed in the DataStream source will have a
>>>>> corresponding
>>>>>>> TableSource interface counterpart, which are at the Factory level.
>>>>> Users
>>>>>>> will write code like this:
>>>>>>>
>>>>>>> {
>>>>>>>       MyTableSource myTableSource = MyTableSourceFactory.create();
>>>>>>>       myTableSource.setSchema(mySchema);
>>>>>>>       myTableSource.applyFilterPredicate(expression);
>>>>>>>       ...
>>>>>>> }
>>>>>>>
>>>>>>> The good thing for this flavor is that from the SQL / Table's
>>>>>> perspective,
>>>>>>> there is a dedicated set of Table oriented interface.
>>>>>>>
>>>>>>> The downsides are:
>>>>>>> A. From the user's perspective, DataStream Source and Table Source
>>>> are
>>>>>> just
>>>>>>> two different sets of interfaces, regardless of how they are the same
>>>>>>> internally.
>>>>>>> B. The source developers have to develop for those two sets of
>>>>> interfaces
>>>>>>> in order to support both DataStream and Table.
>>>>>>> C. It is not explicit that DataStream can actually share the
>>>> pluggable
>>>>> in
>>>>>>> Table / SQL. For example, in order to provide a filter pluggable with
>>>>> SQL
>>>>>>> expression, users will have to know the actual converter class that
>>>>>>> converts the expression to the filter predicate and construct that
>>>>>>> converter by themselves.
>>>>>>>
>>>>>>> ---------------
>>>>>>>
>>>>>>> *Flavor 2. A TableSource is a DataStream source with a bunch of
>>>>>> pluggables.
>>>>>>> No Table specific interfaces at all.*
>>>>>>> Following this way, we will reach another design where you have a
>>>>>>> SourceFactory and a single Pluggable factory for all the table
>>>>>> pluggables.
>>>>>>> And users will write something like:
>>>>>>>
>>>>>>> {
>>>>>>>       Deserializer<Row> myTableDeserializer =
>>>>>>>           MyTablePluggableFactory.createDeserializer(schema);
>>>>>>>       MySource<Row> mySource = MySourceFactory.create(properties,
>>>>>>>           myTableDeserializer);
>>>>>>>       mySource.applyPredicate(
>>>>>>>           MyTablePluggableFactory.createFilterPredicate(expression));
>>>>>>> }
>>>>>>>
>>>>>>> The good thing for this flavor is that there is just one set of
>>>>> interface
>>>>>>> that works for both Table and DataStream. There is no difference
>>>>> between
>>>>>>> creating a DataStream source and creating a Table source. DataStream
>>>>> can
>>>>>>> easily reuse the pluggables from the Table sources.
>>>>>>>
>>>>>>> The downside is that Table / SQL won't have a dedicated API for
>>>>>>> optimization. Instead of writing:
>>>>>>>
>>>>>>> if (myTableSource instanceof FilterableTableSource) {
>>>>>>>        // Some filter push down logic.
>>>>>>>        myTableSource.applyPredicate(expression);
>>>>>>> }
>>>>>>>
>>>>>>> One have to write:
>>>>>>>
>>>>>>> if (mySource instanceof FilterableSource) {
>>>>>>>      // Some filter push down logic.
>>>>>>>      mySource.applyPredicate(
>>>>>>>          MyTablePluggableFactory.createFilterPredicate(expression));
>>>>>>> }
>>>>>>>
>>>>>>> -------------------------
>>>>>>>
>>>>>>> Just to be clear, I am not saying flavor 2 is necessarily better than
>>>>>>> flavor 1, but I want to make sure flavor 2 is also considered and
>>>>>>> discussed.
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> Jiangjie (Becket) Qin.
>>>>>>>
>>>>>>> On Tue, Mar 24, 2020 at 10:53 PM Dawid Wysakowicz <
>>>>>> dwysakowicz@apache.org>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi Becket,
>>>>>>>>
>>>>>>>> I really think we don't have a differing opinions. We might not see
>>>>> the
>>>>>>>> changes in the same way yet. Personally I think of the
>>>>>> DynamicTableSource
>>>>>>>> as of a factory for a Source implemented for the DataStream API.
>>>> The
>>>>>>>> important fact about the DynamicTableSource and all feature traits
>>>>>>>> (SupportsFilterablePushDown, SupportsProjectPushDown etc.) work
>>>> with
>>>>>>> Table
>>>>>>>> API concepts such as e.g. Expressions, SQL specific types etc. In
>>>> the
>>>>>> end
>>>>>>>> what the implementation would resemble is (bear in mind I
>>>>> tremendously
>>>>>>>> simplified the example, just to show the relation between the two
>>>>>> APIs):
>>>>>>>>
>>>>>>>> SupportsFilterablePushDown {
>>>>>>>>
>>>>>>>>     applyFilters(List<ResolvedExpression> filters) {
>>>>>>>>
>>>>>>>>       this.filters = convertToDataStreamFilters(filters);
>>>>>>>>
>>>>>>>>     }
>>>>>>>>
>>>>>>>>     Source createSource() {
>>>>>>>>
>>>>>>>>           return Source.create()
>>>>>>>>
>>>>>>>>             .applyFilters(this.filters);
>>>>>>>>
>>>>>>>>      }
>>>>>>>>
>>>>>>>> }
>>>>>>>>
>>>>>>>> or exactly as you said for the computed columns:
>>>>>>>>
>>>>>>>>
>>>>>>>> SupportsComputedColumnsPushDown {
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>     applyComputedColumn(ComputedColumnConverter converter) {
>>>>>>>>
>>>>>>>>       this.deserializationSchema = new DeserializationSchema<Row> {
>>>>>>>>
>>>>>>>>         Row deserialize(...) {
>>>>>>>>
>>>>>>>>           RowData row = format.deserialize(bytes); // original
>>>>>>>>           // format, e.g. json, avro, etc.
>>>>>>>>
>>>>>>>>           RowData enriched = converter(row)
>>>>>>>>
>>>>>>>>         }
>>>>>>>>
>>>>>>>>       }
>>>>>>>>
>>>>>>>>     }
>>>>>>>>
>>>>>>>>     Source createSource() {
>>>>>>>>
>>>>>>>>           return Source.create()
>>>>>>>>
>>>>>>>>             .withDeserialization(deserializationSchema);
>>>>>>>>
>>>>>>>>      }
>>>>>>>>
>>>>>>>> }
>>>>>>>>
>>>>>>>> So to sum it up again, all those interfaces are factories that
>>>>>> configure
>>>>>>>> appropriate parts of the DataStream API using Table API concepts.
>>>>>> Finally
>>>>>>>> to answer you question for particular comparisons:
>>>>>>>>
>>>>>>>> DynamicTableSource v.s. Source<Row, SourceSplitT, EnumChkT>
>>>>>>>> SupportsFilterablePushDown v.s. FilterableSource
>>>>>>>> SupportsProjectablePushDown v.s. ProjectableSource
>>>>>>>> SupportsWatermarkPushDown v.s. WithWatermarkAssigner
>>>>>>>> SupportsComputedColumnPushDown v.s. ComputedColumnDeserializer
>>>>>>>> ScanTableSource v.s. ChangeLogDeserializer.
>>>>>>>>
>>>>>>>> pretty much you can think of all on the left as factories for the
>>>>> right
>>>>>>>> side, left side works with Table API classes (Expressions,
>>>>> DataTypes).
>>>>>> I
>>>>>>>> hope this clarifies it a bit.
>>>>>>>>
>>>>>>>> Best,
>>>>>>>>
>>>>>>>> Dawid
>>>>>>>> On 24/03/2020 15:03, Becket Qin wrote:
>>>>>>>>
>>>>>>>> Hey Kurt,
>>>>>>>>
>>>>>>>> I don't think DataStream should see some SQL-specific concepts such
>>>>>>>> as Filtering or ComputedColumn.
>>>>>>>>
>>>>>>>> Projectable and Filterable seem not necessarily SQL concepts, but
>>>>>>>> could be applicable to a DataStream source as well to reduce the
>>>>>>>> network load. For example, ORC and Parquet should probably also be
>>>>>>>> readable from DataStream, right?
>>>>>>>>
>>>>>>>> ComputedColumn is not part of the Source; it is an interface extending
>>>>>>>> the Deserializer, which is a pluggable for the Source. From the SQL
>>>>>>>> perspective it has the concept of computed column, but from the Source
>>>>>>>> perspective, it is essentially a Deserializer which also converts the
>>>>>>>> records internally, assuming we allow some conversion to be embedded
>>>>>>>> into the source in addition to just deserialization.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> Jiangjie (Becket) Qin
>>>>>>>>
>>>>>>>> On Tue, Mar 24, 2020 at 9:36 PM Jark Wu <im...@gmail.com> <
>>>>>>> imjark@gmail.com> wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> Thanks Timo for updating the formats section. That would be very
>>>>>>>> helpful for changelog support (FLIP-105).
>>>>>>>>
>>>>>>>> I just left 2 minor comments about some method names. In general, I'm
>>>>>>>> +1 to start a vote.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>> --------------------------------------------------------------------------------------------------
>>>>>>>>
>>>>>>>> Hi Becket,
>>>>>>>>
>>>>>>>> I agree we shouldn't duplicate code, especially the runtime
>>>>>>>> implementations.
>>>>>>>> However, the interfaces proposed by FLIP-95 are mainly used during
>>>>>>>> optimization (compiling), not at runtime.
>>>>>>>> I don't think there is much to share for this, because Table/SQL
>>>>>>>> is declarative, but DataStream is imperative.
>>>>>>>> Take filter push down as an example: a DataStream FilterableSource may
>>>>>>>> accept a FilterFunction (which is a black box for the source).
>>>>>>>> However, table sources should pick the pushed filter expressions; some
>>>>>>>> sources may only support "=", "<", ">" conditions.
>>>>>>>> Pushing a FilterFunction doesn't work in the table ecosystem. That
>>>>>>>> means the connectors have to have some table-specific implementations.
>>>>>>>>
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> Jark
>>>>>>>>
>>>>>>>> On Tue, 24 Mar 2020 at 20:41, Kurt Young <yk...@gmail.com> <
>>>>>>> ykt836@gmail.com> wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> Hi Becket,
>>>>>>>>
>>>>>>>> I don't think DataStream should see some SQL-specific concepts such
>>>>>>>> as Filtering or ComputedColumn. It's better to stay within the SQL
>>>>>>>> area and translate to more generic concepts when translating to the
>>>>>>>> DataStream/runtime layer, such as using a MapFunction to represent
>>>>>>>> computed column logic.
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> Kurt
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, Mar 24, 2020 at 5:47 PM Becket Qin <be...@gmail.com>
>>>> <
>>>>>>> becket.qin@gmail.com> wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> Hi Timo and Dawid,
>>>>>>>>
>>>>>>>> It's really great that we have the same goal. I am actually wondering
>>>>>>>> if we can go one step further to avoid some of the interfaces in Table
>>>>>>>> as well.
>>>>>>>>
>>>>>>>> For example, if we have the FilterableSource, do we still need the
>>>>>>>> FilterableTableSource? Should DynamicTableSource just become a
>>>>>>>> Source<*Row*, SourceSplitT, EnumChkT>?
>>>>>>>>
>>>>>>>> Can you help me understand a bit more about the reason we need the
>>>>>>>> following relational representation / wrapper interfaces v.s. the
>>>>>>>> interfaces that we could put to the Source in FLIP-27?
>>>>>>>>
>>>>>>>> DynamicTableSource v.s. Source<Row, SourceSplitT, EnumChkT>
>>>>>>>> SupportsFilterablePushDown v.s. FilterableSource
>>>>>>>> SupportsProjectablePushDown v.s. ProjectableSource
>>>>>>>> SupportsWatermarkPushDown v.s. WithWatermarkAssigner
>>>>>>>> SupportsComputedColumnPushDown v.s. ComputedColumnDeserializer
>>>>>>>> ScanTableSource v.s. ChangeLogDeserializer
>>>>>>>> LookUpTableSource v.s. LookUpSource
>>>>>>>>
>>>>>>>> Assuming we have all the interfaces on the right side, do we still
>>>>>>>> need the interfaces on the left side? Note that the interfaces on the
>>>>>>>> right can be used by both DataStream and Table. If we do this, there
>>>>>>>> will only be one set of Source interfaces for Table and DataStream;
>>>>>>>> the only difference is that the Source for Table will have some
>>>>>>>> specific plugins and configurations. An omnipotent Source can
>>>>>>>> implement all of the above interfaces and take a Deserializer that
>>>>>>>> implements both ComputedColumnDeserializer and ChangeLogDeserializer.
>>>>>>>>
>>>>>>>> Would the SQL planner work with that?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> Jiangjie (Becket) Qin
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, Mar 24, 2020 at 5:03 PM Jingsong Li <
>>>> jingsonglee0@gmail.com>
>>>>> <
>>>>>>> jingsonglee0@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> +1. Thanks Timo for the design doc.
>>>>>>>>
>>>>>>>> We can also consider @Experimental too. But I am +1 to
>>>>>>>> @PublicEvolving; we should be confident in the current change.
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> Jingsong Lee
>>>>>>>>
>>>>>>>> On Tue, Mar 24, 2020 at 4:30 PM Timo Walther <tw...@apache.org>
>>>> <
>>>>>>> twalthr@apache.org>
>>>>>>>>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>> @Becket: We totally agree that we don't need table-specific
>>>>>>>> connectors during runtime. As Dawid said, the interfaces proposed
>>>>>>>> here are just for communication with the planner. Once the properties
>>>>>>>> (watermarks, computed columns, filters, projections etc.) are
>>>>>>>> negotiated, we can configure a regular Flink connector.
>>>>>>>>
>>>>>>>> E.g. setting the watermark assigner and deserialization schema of a
>>>>>>>> Kafka connector.
>>>>>>>>
>>>>>>>> For better separation of concerns, Flink connectors should not
>>>>>>>> include relational interfaces and depend on flink-table. This is the
>>>>>>>> responsibility of table source/sink.
>>>>>>>>
>>>>>>>> @Kurt: I would like to mark them @PublicEvolving already because we
>>>>>>>> need to deprecate the old interfaces as early as possible. We cannot
>>>>>>>> redirect to @Internal interfaces. They are not marked @Public, so we
>>>>>>>> can still evolve them. But a core design shift should not happen
>>>>>>>> again; it would leave a bad impression if we redesign over and over
>>>>>>>> again. Instead we should be confident in the current change.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Timo
>>>>>>>>
>>>>>>>>
>>>>>>>> On 24.03.20 09:20, Dawid Wysakowicz wrote:
>>>>>>>>
>>>>>>>> Hi Becket,
>>>>>>>>
>>>>>>>> Answering your question, we have the same intention not to duplicate
>>>>>>>> connectors between the DataStream and Table APIs. The interfaces
>>>>>>>> proposed in the FLIP are a way to describe relational properties of a
>>>>>>>> source. The intention is, as you described, to translate all of those
>>>>>>>> expressed as expressions or other Table-specific structures into a
>>>>>>>> DataStream source.
>>>>>>>>
>>>>>>>> In other words, I think what we are doing here is in line with what
>>>>>>>> you described.
>>>>>>>>
>>>>>>>> Best,
>>>>>>>>
>>>>>>>> Dawid
>>>>>>>>
>>>>>>>> On 24/03/2020 02:23, Becket Qin wrote:
>>>>>>>>
>>>>>>>> Hi Timo,
>>>>>>>>
>>>>>>>> Thanks for the proposal. I completely agree that the current Table
>>>>>>>> connectors could be simplified quite a bit. I haven't finished
>>>>>>>> reading everything, but here are some quick thoughts.
>>>>>>>>
>>>>>>>> Actually to me the biggest question is why should there be two
>>>>>>>> different connector systems for DataStream and Table? What is the
>>>>>>>> fundamental reason that is preventing us from merging them into one?
>>>>>>>>
>>>>>>>> The basic functionality of a connector is to provide capabilities to
>>>>>>>> do IO and Serde. Conceptually, Table connectors should just be
>>>>>>>> DataStream connectors that are dealing with Rows. It seems that quite
>>>>>>>> a few of the special connector requirements are just a specific way
>>>>>>>> to do IO / Serde.
>>>>>>>>
>>>>>>>> Taking SupportsFilterPushDown as an example, imagine we have the
>>>>>>>> following interface:
>>>>>>>>
>>>>>>>> interface FilterableSource<PREDICATE> {
>>>>>>>>        void applyFilterable(Supplier<PREDICATE> predicate);
>>>>>>>> }
>>>>>>>>
>>>>>>>> And if a ParquetSource would like to support filtering, it will
>>>>>>>> become:
>>>>>>>>
>>>>>>>> class ParquetSource implements Source,
>>>>>>>> FilterableSource<FilterPredicate> {
>>>>>>>>        ...
>>>>>>>> }
>>>>>>>>
>>>>>>>> For Table, one just needs to provide a predicate supplier that
>>>>>>>> converts an Expression to the specified predicate type. This has a
>>>>>>>> few benefits:
>>>>>>>>
>>>>>>>> 1. The same unified API for filterable sources, regardless of
>>>>>>>> DataStream or Table.
>>>>>>>> 2. The DataStream users now can also use the ExpressionToPredicate
>>>>>>>> supplier if they want to.
>>>>>>>>
>>>>>>>> To summarize, my main point is that I am wondering if it is possible
>>>>>>>> to have a single set of connector interfaces for both Table and
>>>>>>>> DataStream, rather than having two hierarchies. I am not 100% sure if
>>>>>>>> this would work, but if it works, this would be a huge win from both
>>>>>>>> a code maintenance and a user experience perspective.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> Jiangjie (Becket) Qin
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, Mar 24, 2020 at 2:03 AM Dawid Wysakowicz <
>>>>>>>>
>>>>>>>> dwysakowicz@apache.org>
>>>>>>>>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> Hi Timo,
>>>>>>>>
>>>>>>>> Thank you for the proposal. I think it is an important improvement
>>>>>>>> that will benefit many parts of the Table API. The proposal looks
>>>>>>>> really good to me and personally I would be comfortable with voting
>>>>>>>> on the current state.
>>>>>>>>
>>>>>>>> Best,
>>>>>>>>
>>>>>>>> Dawid
>>>>>>>>
>>>>>>>> On 23/03/2020 18:53, Timo Walther wrote:
>>>>>>>>
>>>>>>>> Hi everyone,
>>>>>>>>
>>>>>>>> I received some questions around how the new interfaces play
>>>>>>>> together with formats and their factories.
>>>>>>>>
>>>>>>>> Furthermore, for MySQL or Postgres CDC logs, the format should be
>>>>>>>> able to return a `ChangelogMode`.
>>>>>>>>
>>>>>>>> Also, I incorporated the feedback around the factory design in
>>>>>>>> general. I added a new section `Factory Interfaces` to the design
>>>>>>>> document. This should be helpful for understanding the big picture
>>>>>>>> and connecting the concepts.
>>>>>>>>
>>>>>>>> Please let me know what you think.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Timo
>>>>>>>>
>>>>>>>>
>>>>>>>> On 18.03.20 13:43, Timo Walther wrote:
>>>>>>>>
>>>>>>>> Hi Benchao,
>>>>>>>>
>>>>>>>> this is a very good question. I will update the FLIP about this.
>>>>>>>>
>>>>>>>> The legacy planner will not support the new interfaces. It will only
>>>>>>>> support the old interfaces. With the next release, I think the Blink
>>>>>>>> planner is stable enough to be the default one as well.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Timo
>>>>>>>>
>>>>>>>> On 18.03.20 08:45, Benchao Li wrote:
>>>>>>>>
>>>>>>>> Hi Timo,
>>>>>>>>
>>>>>>>> Thank you and others for the efforts to prepare this FLIP.
>>>>>>>>
>>>>>>>> The FLIP LGTM generally.
>>>>>>>>
>>>>>>>> +1 for moving the Blink data structures to table-common; they will
>>>>>>>> be useful for UDFs too in the future.
>>>>>>>> A little question is, do we plan to support the new interfaces and
>>>>>>>> data types in the legacy planner? Or do we only plan to support these
>>>>>>>> new interfaces in the Blink planner?
>>>>>>>>
>>>>>>>> And using primary keys from the DDL instead of key information
>>>>>>>> derived from each query is also a good idea; we met some use cases
>>>>>>>> where this did not work very well before.
>>>>>>>>
>>>>>>>> This FLIP also makes the dependencies of the table modules more
>>>>>>>> clear; I like it very much.
>>>>>>>>
> 


Re: [DISCUSS] FLIP-95: New TableSource and TableSink interfaces

Posted by Becket Qin <be...@gmail.com>.
Hi Timo,

Thanks for the reply. I totally agree that there must be something new
added to the connector in order to make it work for SQL / Table. My concern
is mostly over what those additions should be, and how to add them. To be
honest, I was kind of lost when looking at interfaces such as
DataStructureConverter, RuntimeConverter and their internal context. Also I
believe most connector developers do not care about the concept of
"PushDown" / "NestedPushDown", which is internal to the optimizer and not
even exposed to SQL writers.

Therefore I am trying to see if we can:
A) Keep those additions to a minimum for the connector developers, if they
don't have to know the details.
B) Expose as few high-level concepts as possible. More specifically, try to
speak the connector language and expose the general mechanism instead of
binding it to use-case semantics.

If we can achieve the above two goals, we could avoid adding unnecessary
burden to the connector developers, and also make the connectors more
generic.

It might be worth thinking about what additional work is necessary for the
connector developers. Here is what I am thinking of; please correct me if
I miss something.

   1. A Factory interface that allows a high-level use case, in this case
   SQL, to find a matching source using a service provider mechanism (see
   the sketch after this list).
   2. Allows the high level use case to specify the plugins that are
   supported by the underneath DataStream Source.
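
As a rough illustration of mechanism 1 (the SourceFactory SPI shown here is
hypothetical, but Flink's factory discovery works along these lines - the
planner looks up factories via Java's ServiceLoader and matches them against
the declared connector):

ServiceLoader<SourceFactory> factories = ServiceLoader.load(SourceFactory.class);
for (SourceFactory factory : factories) {
    // factoryIdentifier() is the identifier proposed for the Factory
    // interface in the FLIP, e.g. "kafka" or "filesystem"
    if (factory.factoryIdentifier().equals(options.get("connector"))) {
        return factory;
    }
}
throw new ValidationException(
    "No factory found for connector: " + options.get("connector"));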

If the Table connector can work with the above two mechanisms, maybe we can
make
some slight modifications to the interfaces in the current FLIP.

   - A *SourceFactory* which extends the Factory interface in the FLIP,
   with one more method:
      - *Source getSource();*
   - Some decorative interfaces to the SourceFactory such as:
      - *FilterFactory<PREDICATE, T extends Supplier<PREDICATE>>*, with the
      following method
         - T getFilter();
      - *ProjectorFactory<PROJECTOR, T extends Supplier<PROJECTOR>>*, with
      the following method:
         - T getProjector();
      - *DeserializerFactory<INPUT, OUTPUT>*

With this set of API, a ParquetTableSourceFactory may become:

class ParquetTableSourceFactory implements
           SourceFactory,
           DeserializerFactory<ParquetRecords, Row>,
           FilterFactory<ParquetFilter, ExpressionToParquetFilter> {
    @Override
    ParquetSource getSource() { ... }

    @Override
    ExpressionToParquetFilter getFilterSupplier() { ... }
}

The ExpressionToParquetFilter will have an *applyPredicate(Expression)*
method.
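
A rough sketch of that class (the ParquetFilter type and the translate()
helper are placeholders for connector-specific code):

class ExpressionToParquetFilter implements Supplier<ParquetFilter> {

    private ParquetFilter filter;

    // called by the Table planner with a resolved SQL expression
    public void applyPredicate(Expression expression) {
        this.filter = translate(expression); // Expression -> ParquetFilter
    }

    // called by the native source while reading Parquet files
    @Override
    public ParquetFilter get() {
        return filter;
    }
}

This way, the source itself never sees a SQL Expression; it only consumes
the supplier.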

I know it does not look like a perfect interface from the pure SQL
perspective. And I am not even sure if this would meet all the requirements
for SQL, but the benefit is that the connector developers just need to know
how to write an ExpressionToParquetFilter in order to make it work for
Table, without having to understand the entire SQL concept.

Thanks,

Jiangjie (Becket) Qin



On Wed, Mar 25, 2020 at 5:57 PM Timo Walther <tw...@apache.org> wrote:

> Hi Becket,
>
> Let me clarify a few things first: Historically we thought of Table
> API/SQL as a library on top of DataStream API. Similar to Gelly or CEP.
> We used TypeInformation in Table API to integrate nicely with DataStream
> API. However, the last years have shown that SQL is not just a library.
> It is an entire ecosystem that defines data types, submission behavior,
> execution behavior, and highly optimized SerDes. SQL is a way to declare
> data processing end-to-end such that the planner has the full control
> over the execution.
>
> But I totally agree with your concerns around connectors. There is no
> big difference between your concerns and the current design.
>
> 1. "native connector interface is a generic abstraction of doing IO and
> Serde":
>
> This is the case in our design. We are using SourceFunction,
> DeserializationSchema, WatermarkAssigner, etc. all pluggable interfaces
> that the DataStream API offers for performing runtime operations.
>
> 2. "advanced features ... could be provided in a semantic free way":
>
> I agree here. But this is an orthogonal topic that each connector
> implementer should keep in mind. If a new connector is developed, it
> should *not* be developed with only SQL in mind but with good abstractions
> such that DataStream API users can also use it. A connector should have
> a builder pattern to plug in all capabilities like Parquet filters etc.
> There should be no table-specific native/runtime connectors. I think
> this discussion is related to the discussion of FLIP-115.
>
> However, as I mentioned before: This FLIP only discusses the interfaces
> for communication between planner and connector factory. As Dawid said
> earlier, a DynamicTableSource can be seen more as a factory that calls
> pluggable interfaces of a native connector in the end:
>
> KafkaConnector.builder()
>    .watermarkAssigner(...)
>    .keyDeser(...)
>    .valueDeser(...)
>    ....
>    .build()
>
> Regards,
> Timo
>
>
> On 25.03.20 09:05, Becket Qin wrote:
> > Hi Kurt,
> >
> > I do not object to promoting the concepts of SQL, but I don't think we
> > should do that by introducing a new dedicated set of connector public
> > interfaces that is only for SQL. The same argument can be applied to
> > Gelly, CEP, and Machine Learning, claiming that they need to introduce a
> > dedicated public set of interfaces that fits their own concept and ask
> > the connector developers to learn and follow their design. As an analogy,
> > if we want to promote Chinese, we don't want to force people to learn
> > ancient Chinese poems while they only need to know a few words like
> > "hello" and "goodbye".
> >
> > As for design principles, here is what I think Flink connectors should
> > look like:
> >
> > 1. The native connector interface is a generic abstraction for doing IO
> > and Serde, without semantics for high-level use cases such as SQL, Gelly,
> > CEP, etc.
> >
> > 2. Some advanced features that may help accelerate the IO and Serde could
> > be provided in the native connector interfaces in a semantic-free way, so
> > that all the high-level use cases can leverage them.
> >
> > 3. Additional semantics can be built on top of the native source
> > interface by providing different plugins. These plugins could be aware of
> > the high-level use case. For example, to provide a filter to the source,
> > we can do the following:
> >
> > // An interface for all the filters that take an expression.
> > interface ExpressionFilter {
> >      FilterResult applyFilterExpression();
> > }
> >
> > // A filter plugin implementation that translates the SQL Expression to
> > // a ParquetFilterPredicate.
> > class ParquetExpressionFilter implements Supplier<ParquetFilterPredicate>,
> > ExpressionFilter {
> >      // Called by the high level use case.
> >      FilterResult applyFilterExpression() { ... }
> >
> >      // Used by the native Source interface.
> >      ParquetFilterPredicate get() { ... }
> > }
> >
> > In this case, the connector developer just needs to write the logic of
> > translating an Expression to a Parquet FilterPredicate. They don't have to
> > understand the entire set of interfaces that we want to promote. Just like
> > they only need to know how to say "Hello" without learning ancient Chinese
> > poems.
> >
> > Again, I am not saying this is necessarily the best approach. But so far
> it
> > seems a reasonable design principle to tell the developers.
> >
> > Thanks,
> >
> > Jiangjie (becket) Qin
> >
> >
> >
> > On Wed, Mar 25, 2020 at 11:53 AM Kurt Young <yk...@gmail.com> wrote:
> >
> >> Hi Becket,
> >>
> >> I don't think we should discuss this in pure engineering aspects. Your
> >> proposal is trying to let SQL connector developers understand as few SQL
> >> concepts as possible. But quite the opposite, we are designing those
> >> interfaces to emphasize the SQL concepts, to bridge high-level concepts
> >> into real interfaces and classes.
> >>
> >> We keep talking about time-varying relations and dynamic tables when we
> >> introduce SQL concepts, and sources and sinks are the most critical parts
> >> playing with those concepts. It's essential to let Flink SQL developers
> >> learn these concepts and connect them with real code by introducing these
> >> connector interfaces, so that they can further write *correct* connectors
> >> based on such domain knowledge.
> >>
> >> So this FLIP is a very important chance to express these concepts and get
> >> most SQL developers aligned with the concepts and on the same page. It's
> >> mostly about different levels of abstraction, and for domains like SQL
> >> this is becoming more important. It helps Flink SQL go smoothly in the
> >> future, and also makes it easier for new contributors. But I would admit
> >> this is not that obvious for others who don't work with SQL frequently.
> >>
> >> Best,
> >> Kurt
> >>
> >>
> >> On Wed, Mar 25, 2020 at 11:07 AM Becket Qin <be...@gmail.com>
> wrote:
> >>
> >>> Hi Jark,
> >>>
> >>> It is good to know that we do not expect the end users to touch those
> >>> interfaces.
> >>>
> >>> Then the question boils down to whether the connector developers should
> >> be
> >>> aware of the interfaces that are only used by the SQL optimizer. It
> >> seems a
> >>> win if we can avoid that.
> >>>
> >>> Two potential solutions off the top of my head are:
> >>> 1. An internal helper class that does the instanceOf check based on the
> >>> DataStream source interface and creates pluggables for that DataStream
> >>> source.
> >>> 2. Codegen the set of TableSource interfaces given a DataStream Source
> >>> and its corresponding TablePluggablesFactory.
> >>>
> >>> Thanks,
> >>>
> >>> Jiangjie (Becket) Qin
> >>>
> >>> On Wed, Mar 25, 2020 at 10:07 AM Jark Wu <im...@gmail.com> wrote:
> >>>
> >>>> Hi Becket,
> >>>>
> >>>> Regarding Flavor1 and Flavor2, I want to clarify that users will never
> >>>> use a table source like this:
> >>>>
> >>>> {
> >>>>     MyTableSource myTableSource = MyTableSourceFactory.create();
> >>>>     myTableSource.setSchema(mySchema);
> >>>>     myTableSource.applyFilterPredicate(expression);
> >>>>     ...
> >>>> }
> >>>>
> >>>> TableFactory and TableSource are not directly exposed to end users; all
> >>>> the methods are called by the planner, not by users.
> >>>> Users always use DDL or a descriptor to register a table, and the
> >>>> planner will find the factory and create sources according to the
> >>>> properties.
> >>>> All the optimizations are applied automatically, e.g. filter/projection
> >>>> pushdown; users don't need to call `applyFilterPredicate` explicitly.
> >>>>
> >>>>
> >>>>
> >>>> On Wed, 25 Mar 2020 at 09:25, Becket Qin <be...@gmail.com>
> wrote:
> >>>>
> >>>>> Hi Timo and Dawid,
> >>>>>
> >>>>> Thanks for the clarification. They really help. You are right that we
> >>>>> are on the same page regarding the hierarchy. I think the only
> >>>>> difference between our views is the flavor of the interfaces. There
> >>>>> are two flavors of the source interface for DataStream and Table
> >>>>> source.
> >>>>>
> >>>>> *Flavor 1. Table Sources are some wrapper interfaces around
> >> DataStream
> >>>>> source.*
> >>>>> Following this way, we will reach the design of the current proposal,
> >>>> i.e.
> >>>>> each pluggable exposed in the DataStream source will have a
> >>> corresponding
> >>>>> TableSource interface counterpart, which are at the Factory level.
> >>> Users
> >>>>> will write code like this:
> >>>>>
> >>>>> {
> >>>>>      MyTableSource myTableSource = MyTableSourceFactory.create();
> >>>>>      myTableSource.setSchema(mySchema);
> >>>>>      myTableSource.applyFilterPredicate(expression);
> >>>>>      ...
> >>>>> }
> >>>>>
> >>>>> The good thing for this flavor is that from the SQL / Table's
> >>>> perspective,
> >>>>> there is a dedicated set of Table oriented interface.
> >>>>>
> >>>>> The downsides are:
> >>>>> A. From the user's perspective, DataStream Source and Table Source
> >> are
> >>>> just
> >>>>> two different sets of interfaces, regardless of how they are the same
> >>>>> internally.
> >>>>> B. The source developers have to develop for those two sets of
> >>> interfaces
> >>>>> in order to support both DataStream and Table.
> >>>>> C. It is not explicit that DataStream can actually share the
> >> pluggable
> >>> in
> >>>>> Table / SQL. For example, in order to provide a filter pluggable with
> >>> SQL
> >>>>> expression, users will have to know the actual converter class that
> >>>>> converts the expression to the filter predicate and construct that
> >>>>> converter by themselves.
> >>>>>
> >>>>> ---------------
> >>>>>
> >>>>> *Flavor 2. A TableSource is a DataStream source with a bunch of
> >>>> pluggables.
> >>>>> No Table specific interfaces at all.*
> >>>>> Following this way, we will reach another design where you have a
> >>>>> SourceFactory and a single Pluggable factory for all the table
> >>>> pluggables.
> >>>>> And users will write something like:
> >>>>>
> >>>>> {
> >>>>>      Deserializer<Row> myTableDeserializer =
> >>>>>          MyTablePluggableFactory.createDeserializer(schema);
> >>>>>      MySource<Row> mySource = MySourceFactory.create(properties,
> >>>>>          myTableDeserializer);
> >>>>>      mySource.applyPredicate(
> >>>>>          MyTablePluggableFactory.createFilterPredicate(expression));
> >>>>> }
> >>>>>
> >>>>> The good thing for this flavor is that there is just one set of
> >>> interface
> >>>>> that works for both Table and DataStream. There is no difference
> >>> between
> >>>>> creating a DataStream source and creating a Table source. DataStream
> >>> can
> >>>>> easily reuse the pluggables from the Table sources.
> >>>>>
> >>>>> The downside is that Table / SQL won't have a dedicated API for
> >>>>> optimization. Instead of writing:
> >>>>>
> >>>>> if (myTableSource instanceof FilterableTableSource) {
> >>>>>       // Some filter push down logic.
> >>>>>       myTableSource.applyPredicate(expression);
> >>>>> }
> >>>>>
> >>>>> One have to write:
> >>>>>
> >>>>> if (mySource instanceof FilterableSource) {
> >>>>>      // Some filter push down logic.
> >>>>>      mySource.applyPredicate(
> >>>>>          MyTablePluggableFactory.createFilterPredicate(expression));
> >>>>> }
> >>>>>
> >>>>> -------------------------
> >>>>>
> >>>>> Just to be clear, I am not saying flavor 2 is necessarily better than
> >>>>> flavor 1, but I want to make sure flavor 2 is also considered and
> >>>>> discussed.
> >>>>>
> >>>>> Thanks,
> >>>>>
> >>>>> Jiangjie (Becket) Qin.
> >>>>>
> >>>>> On Tue, Mar 24, 2020 at 10:53 PM Dawid Wysakowicz <
> >>>> dwysakowicz@apache.org>
> >>>>> wrote:
> >>>>>
> >>>>>> Hi Becket,
> >>>>>>
> >>>>>> I really think we don't have differing opinions. We might not see
> >>> the
> >>>>>> changes in the same way yet. Personally I think of the
> >>>> DynamicTableSource
> >>>>>> as of a factory for a Source implemented for the DataStream API.
> >> The
> >>>>>> important fact about the DynamicTableSource and all feature traits
> >>>>>> (SupportsFilterablePushDown, SupportsProjectPushDown etc.) is that they work
> >> with
> >>>>> Table
> >>>>>> API concepts such as e.g. Expressions, SQL specific types etc. In
> >> the
> >>>> end
> >>>>>> what the implementation would resemble is (bear in mind I
> >>> tremendously
> >>>>>> simplified the example, just to show the relation between the two
> >>>> APIs):
> >>>>>>
> >>>>>> SupportsFilterablePushDown {
> >>>>>>
> >>>>>>    applyFilters(List<ResolvedExpression> filters) {
> >>>>>>
> >>>>>>      this.filters = convertToDataStreamFilters(filters);
> >>>>>>
> >>>>>>    }
> >>>>>>
> >>>>>>    Source createSource() {
> >>>>>>
> >>>>>>          return Source.create()
> >>>>>>
> >>>>>>            .applyFilters(this.filters);
> >>>>>>
> >>>>>>     }
> >>>>>>
> >>>>>> }
> >>>>>>
> >>>>>> or exactly as you said for the computed columns:
> >>>>>>
> >>>>>>
> >>>>>> SupportsComputedColumnsPushDown {
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>    applyComputedColumn(ComputedColumnConverter converter) {
> >>>>>>
> >>>>>>      this.deserializationSchema = new DeserializationSchema<Row> {
> >>>>>>
> >>>>>>        Row deserialize(...) {
> >>>>>>
> >>>>>>          RowData row = format.deserialize(bytes); // original
> >> format,
> >>>> e.g
> >>>>>> json, avro, etc.
> >>>>>>
> >>>>>>          RowData enriched = converter(row)
> >>>>>>
> >>>>>>        }
> >>>>>>
> >>>>>>      }
> >>>>>>
> >>>>>>    }
> >>>>>>
> >>>>>>    Source createSource() {
> >>>>>>
> >>>>>>          return Source.create()
> >>>>>>
> >>>>>>            .withDeserialization(deserializationSchema);
> >>>>>>
> >>>>>>     }
> >>>>>>
> >>>>>> }
> >>>>>>
> >>>>>> So to sum it up again, all those interfaces are factories that
> >>>> configure
> >>>>>> appropriate parts of the DataStream API using Table API concepts.
> >>>> Finally
> >>>>>> to answer your question for particular comparisons:
> >>>>>>
> >>>>>> DynamicTableSource v.s. Source<Row, SourceSplitT, EnumChkT>
> >>>>>> SupportsFilterablePushDown v.s. FilterableSource
> >>>>>> SupportsProjectablePushDown v.s. ProjectableSource
> >>>>>> SupportsWatermarkPushDown v.s. WithWatermarkAssigner
> >>>>>> SupportsComputedColumnPushDown v.s. ComputedColumnDeserializer
> >>>>>> ScanTableSource v.s. ChangeLogDeserializer.
> >>>>>>
> >>>>>> pretty much you can think of all on the left as factories for the
> >>> right
> >>>>>> side, left side works with Table API classes (Expressions,
> >>> DataTypes).
> >>>> I
> >>>>>> hope this clarifies it a bit.
> >>>>>>
> >>>>>> Best,
> >>>>>>
> >>>>>> Dawid
> >>>>>> On 24/03/2020 15:03, Becket Qin wrote:
> >>>>>>
> >>>>>> Hey Kurt,
> >>>>>>
> >>>>>> I don't think DataStream should see some SQL specific concepts such
> >>> as
> >>>>>>
> >>>>>> Filtering or ComputedColumn.
> >>>>>>
> >>>>>> Projectable and Filterable do not seem to be necessarily SQL concepts, but
> >>>> could
> >>>>> be
> >>>>>> applicable to DataStream source as well to reduce the network load.
> >>> For
> >>>>>> example ORC and Parquet should probably also be readable from
> >>>> DataStream,
> >>>>>> right?
> >>>>>>
> >>>>>> ComputedColumn is not part of the Source, it is an interface
> >> extending
> >>>> the
> >>>>>> Deserializer, which is a pluggable for the Source. From the SQL's
> >>>>>> perspective it has the concept of computed column, but from the
> >>> Source
> >>>>>> perspective, it is essentially a Deserializer which also converts
> >> the
> >>>>>> records internally, assuming we allow some conversion to be
> >> embedded
> >>> to
> >>>>>> the source in addition to just deserialization.
> >>>>>>
> >>>>>> Thanks,
> >>>>>>
> >>>>>> Jiangjie (Becket) Qin
> >>>>>>
> >>>>>> On Tue, Mar 24, 2020 at 9:36 PM Jark Wu <im...@gmail.com> <
> >>>>> imjark@gmail.com> wrote:
> >>>>>>
> >>>>>>
> >>>>>> Thanks Timo for updating the formats section. That would be very
> >>>> helpful
> >>>>>> for changelog supporting (FLIP-105).
> >>>>>>
> >>>>>> I just left 2 minor comments about some method names. In general,
> >> I'm
> >>> +1
> >>>>> to
> >>>>>> start a voting.
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
> --------------------------------------------------------------------------------------------------
> >>>>>>
> >>>>>> Hi Becket,
> >>>>>>
> >>>>>> I agree we shouldn't duplicate code, especially the runtime
> >>>>>> implementations.
> >>>>>> However, the interfaces proposed by FLIP-95 are mainly used during
> >>>>>> optimization (compiling), not runtime.
> >>>>>> I don't think there is much to share for this. Because table/sql
> >>>>>> is declarative, but DataStream is imperative.
> >>>>>> For example, filter push down, DataStream FilterableSource may
> >> allow
> >>> to
> >>>>>> accept a FilterFunction (which is a black box for the source).
> >>>>>> However, table sources should pick the pushed filter expressions,
> >>> some
> >>>>>> sources may only support "=", "<", ">" conditions.
> >>>>>> Pushing a FilterFunction doesn't work in table ecosystem. That
> >> means,
> >>>> the
> >>>>>> connectors have to have some table-specific implementations.
> >>>>>>
> >>>>>>
> >>>>>> Best,
> >>>>>> Jark
> >>>>>>
> >>>>>> On Tue, 24 Mar 2020 at 20:41, Kurt Young <yk...@gmail.com> <
> >>>>> ykt836@gmail.com> wrote:
> >>>>>>
> >>>>>>
> >>>>>> Hi Becket,
> >>>>>>
> >>>>>> I don't think DataStream should see some SQL specific concepts such
> >>> as
> >>>>>> Filtering or ComputedColumn. It's
> >>>>>> better to stay within the SQL area and translate to a more generic
> >> concept
> >>>> when
> >>>>>> translating to DataStream/Runtime
> >>>>>> layer, such as using a MapFunction to represent computed column logic.
> >>>>>>
> >>>>>> Best,
> >>>>>> Kurt
> >>>>>>
> >>>>>>
> >>>>>> On Tue, Mar 24, 2020 at 5:47 PM Becket Qin <be...@gmail.com>
> >> <
> >>>>> becket.qin@gmail.com> wrote:
> >>>>>>
> >>>>>>
> >>>>>> Hi Timo and Dawid,
> >>>>>>
> >>>>>> It's really great that we have the same goal. I am actually
> >> wondering
> >>>>>>
> >>>>>> if
> >>>>>>
> >>>>>> we
> >>>>>>
> >>>>>> can go one step further to avoid some of the interfaces in Table as
> >>>>>>
> >>>>>> well.
> >>>>>>
> >>>>>> For example, if we have the FilterableSource, do we still need the
> >>>>>> FilterableTableSource? Should DynamicTableSource just become a
> >>>>>> Source<*Row*,
> >>>>>> SourceSplitT, EnumChkT>?
> >>>>>>
> >>>>>> Can you help me understand a bit more about the reason we need the
> >>>>>> following relational representation / wrapper interfaces v.s. the
> >>>>>> interfaces that we could put to the Source in FLIP-27?
> >>>>>>
> >>>>>> DynamicTableSource v.s. Source<Row, SourceSplitT, EnumChkT>
> >>>>>> SupportsFilterablePushDown v.s. FilterableSource
> >>>>>> SupportsProjectablePushDown v.s. ProjectableSource
> >>>>>> SupportsWatermarkPushDown v.s. WithWatermarkAssigner
> >>>>>> SupportsComputedColumnPushDown v.s. ComputedColumnDeserializer
> >>>>>> ScanTableSource v.s. ChangeLogDeserializer.
> >>>>>> LookUpTableSource v.s. LookUpSource
> >>>>>>
> >>>>>> Assuming we have all the interfaces on the right side, do we still
> >>> need
> >>>>>>
> >>>>>> the
> >>>>>>
> >>>>>> interfaces on the left side? Note that the interfaces on the right
> >>> can
> >>>>>>
> >>>>>> be
> >>>>>>
> >>>>>> used by both DataStream and Table. If we do this, there will only
> >> be
> >>>>>>
> >>>>>> one
> >>>>>>
> >>>>>> set of Source interfaces Table and DataStream, the only difference
> >> is
> >>>>>>
> >>>>>> that
> >>>>>>
> >>>>>> the Source for table will have some specific plugins and
> >>>>>>
> >>>>>> configurations.
> >>>>>>
> >>>>>> An
> >>>>>>
> >>>>>> omnipotent Source can implement all the above interfaces and
> >>> take a
> >>>>>> Deserializer that implements both ComputedColumnDeserializer and
> >>>>>> ChangeLogDeserializer.
> >>>>>>
> >>>>>> Would the SQL planner work with that?
> >>>>>>
> >>>>>> Thanks,
> >>>>>>
> >>>>>> Jiangjie (Becket) Qin
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> On Tue, Mar 24, 2020 at 5:03 PM Jingsong Li <
> >> jingsonglee0@gmail.com>
> >>> <
> >>>>> jingsonglee0@gmail.com>
> >>>>>> wrote:
> >>>>>>
> >>>>>>
> >>>>>> +1. Thanks Timo for the design doc.
> >>>>>>
> >>>>>> We can also consider @Experimental. But I am +1 to
> >>>>>>
> >>>>>> @PublicEvolving,
> >>>>>>
> >>>>>> we
> >>>>>>
> >>>>>> should be confident in the current change.
> >>>>>>
> >>>>>> Best,
> >>>>>> Jingsong Lee
> >>>>>>
> >>>>>> On Tue, Mar 24, 2020 at 4:30 PM Timo Walther <tw...@apache.org>
> >> <
> >>>>> twalthr@apache.org>
> >>>>>>
> >>>>>> wrote:
> >>>>>>
> >>>>>> @Becket: We totally agree that we don't need table specific
> >>>>>>
> >>>>>> connectors
> >>>>>>
> >>>>>> during runtime. As Dawid said, the interfaces proposed here are
> >>>>>>
> >>>>>> just
> >>>>>>
> >>>>>> for
> >>>>>>
> >>>>>> communication with the planner. Once the properties (watermarks,
> >>>>>> computed column, filters, projection etc.) are negotiated, we can
> >>>>>> configure a regular Flink connector.
> >>>>>>
> >>>>>> E.g. setting the watermark assigner and deserialization schema of a
> >>>>>> Kafka connector.
> >>>>>>
> >>>>>> For better separation of concerns, Flink connectors should not
> >>>>>>
> >>>>>> include
> >>>>>>
> >>>>>> relational interfaces and depend on flink-table. This is the
> >>>>>> responsibility of table source/sink.
> >>>>>>
> >>>>>> @Kurt: I would like to mark them @PublicEvolving already because we
> >>>>>>
> >>>>>> need
> >>>>>>
> >>>>>> to deprecate the old interfaces as early as possible. We cannot
> >>>>>>
> >>>>>> redirect
> >>>>>>
> >>>>>> to @Internal interfaces. They are not marked @Public, so we can
> >>>>>>
> >>>>>> still
> >>>>>>
> >>>>>> evolve them. But a core design shift should not happen again, it
> >>>>>>
> >>>>>> would
> >>>>>>
> >>>>>> leave a bad impression if we redesign over and over again.
> >>>>>>
> >>>>>> Instead
> >>>>>>
> >>>>>> we should be confident in the current change.
> >>>>>>
> >>>>>> Regards,
> >>>>>> Timo
> >>>>>>
> >>>>>>
> >>>>>> On 24.03.20 09:20, Dawid Wysakowicz wrote:
> >>>>>>
> >>>>>> Hi Becket,
> >>>>>>
> >>>>>> Answering your question, we have the same intention not to
> >>>>>>
> >>>>>> duplicate
> >>>>>>
> >>>>>> connectors between datastream and table apis. The interfaces
> >>>>>>
> >>>>>> proposed
> >>>>>>
> >>>>>> in
> >>>>>>
> >>>>>> the FLIP are a way to describe relational properties of a source.
> >>>>>>
> >>>>>> The
> >>>>>>
> >>>>>> intention is as you described to translate all of those expressed
> >>>>>>
> >>>>>> as
> >>>>>>
> >>>>>> expressions or other Table specific structures into a DataStream
> >>>>>>
> >>>>>> source.
> >>>>>>
> >>>>>> In other words I think what we are doing here is in line with
> >>>>>>
> >>>>>> what
> >>>>>>
> >>>>>> you
> >>>>>>
> >>>>>> described.
> >>>>>>
> >>>>>> Best,
> >>>>>>
> >>>>>> Dawid
> >>>>>>
> >>>>>> On 24/03/2020 02:23, Becket Qin wrote:
> >>>>>>
> >>>>>> Hi Timo,
> >>>>>>
> >>>>>> Thanks for the proposal. I completely agree that the current
> >>>>>>
> >>>>>> Table
> >>>>>>
> >>>>>> connectors could be simplified quite a bit. I haven't finished
> >>>>>>
> >>>>>> reading
> >>>>>>
> >>>>>> everything, but here are some quick thoughts.
> >>>>>>
> >>>>>> Actually to me the biggest question is why should there be two
> >>>>>>
> >>>>>> different
> >>>>>>
> >>>>>> connector systems for DataStream and Table? What is the
> >>>>>>
> >>>>>> fundamental
> >>>>>>
> >>>>>> reason
> >>>>>>
> >>>>>> that is preventing us from merging them to one?
> >>>>>>
> >>>>>> The basic functionality of a connector is to provide
> >>>>>>
> >>>>>> capabilities
> >>>>>>
> >>>>>> to
> >>>>>>
> >>>>>> do
> >>>>>>
> >>>>>> IO
> >>>>>>
> >>>>>> and Serde. Conceptually, Table connectors should just be
> >>>>>>
> >>>>>> DataStream
> >>>>>>
> >>>>>> connectors that are dealing with Rows. It seems that quite a few
> >>>>>>
> >>>>>> of
> >>>>>>
> >>>>>> the
> >>>>>>
> >>>>>> special connector requirements are just a specific way to do IO
> >>>>>>
> >>>>>> /
> >>>>>>
> >>>>>> Serde.
> >>>>>>
> >>>>>> Taking SupportsFilterPushDown as an example, imagine we have the
> >>>>>>
> >>>>>> following
> >>>>>>
> >>>>>> interface:
> >>>>>>
> >>>>>> interface FilterableSource<PREDICATE> {
> >>>>>>       void applyFilterable(Supplier<PREDICATE> predicate);
> >>>>>> }
> >>>>>>
> >>>>>> And if a ParquetSource would like to support filterable, it will
> >>>>>>
> >>>>>> become:
> >>>>>>
> >>>>>> class ParquetSource implements Source,
> >>>>>>
> >>>>>> FilterableSource<FilterPredicate> {
> >>>>>>
> >>>>>>       ...
> >>>>>> }
> >>>>>>
> >>>>>> For Table, one just needs to provide a predicate supplier that
> >>>>>>
> >>>>>> converts
> >>>>>>
> >>>>>> an
> >>>>>>
> >>>>>> Expression to the specified predicate type. This has a few
> >>>>>>
> >>>>>> benefits:
> >>>>>>
> >>>>>> 1. Same unified API for filterable for sources, regardless of
> >>>>>>
> >>>>>> DataStream or
> >>>>>>
> >>>>>> Table.
> >>>>>> 2. The DataStream users can now also use the
> >>>>>>
> >>>>>> ExpressionToPredicate
> >>>>>>
> >>>>>> supplier if they want to.
> >>>>>>
> >>>>>> To summarize, my main point is that I am wondering if it is
> >>>>>>
> >>>>>> possible
> >>>>>>
> >>>>>> to
> >>>>>>
> >>>>>> have a single set of connector interface for both Table and
> >>>>>>
> >>>>>> DataStream,
> >>>>>>
> >>>>>> rather than having two hierarchies. I am not 100% sure if this
> >>>>>>
> >>>>>> would
> >>>>>>
> >>>>>> work,
> >>>>>>
> >>>>>> but if it works, this would be a huge win from both code
> >>>>>>
> >>>>>> maintenance
> >>>>>>
> >>>>>> and
> >>>>>>
> >>>>>> user experience perspective.
> >>>>>>
> >>>>>> Thanks,
> >>>>>>
> >>>>>> Jiangjie (Becket) Qin
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> On Tue, Mar 24, 2020 at 2:03 AM Dawid Wysakowicz <
> >>>>>>
> >>>>>> dwysakowicz@apache.org>
> >>>>>>
> >>>>>> wrote:
> >>>>>>
> >>>>>>
> >>>>>> Hi Timo,
> >>>>>>
> >>>>>> Thank you for the proposal. I think it is an important
> >>>>>>
> >>>>>> improvement
> >>>>>>
> >>>>>> that
> >>>>>>
> >>>>>> will benefit many parts of the Table API. The proposal looks
> >>>>>>
> >>>>>> really
> >>>>>>
> >>>>>> good
> >>>>>>
> >>>>>> to me and personally I would be comfortable with voting on the
> >>>>>>
> >>>>>> current
> >>>>>>
> >>>>>> state.
> >>>>>>
> >>>>>> Best,
> >>>>>>
> >>>>>> Dawid
> >>>>>>
> >>>>>> On 23/03/2020 18:53, Timo Walther wrote:
> >>>>>>
> >>>>>> Hi everyone,
> >>>>>>
> >>>>>> I received some questions around how the new interfaces play
> >>>>>>
> >>>>>> together
> >>>>>>
> >>>>>> with formats and their factories.
> >>>>>>
> >>>>>> Furthermore, for MySQL or Postgres CDC logs, the format should
> >>>>>>
> >>>>>> be
> >>>>>>
> >>>>>> able
> >>>>>>
> >>>>>> to return a `ChangelogMode`.
> >>>>>>
> >>>>>> Also, I incorporated the feedback around the factory design in
> >>>>>>
> >>>>>> general.
> >>>>>>
> >>>>>> I added a new section `Factory Interfaces` to the design
> >>>>>>
> >>>>>> document.
> >>>>>>
> >>>>>> This should be helpful to understand the big picture and
> >>>>>>
> >>>>>> connecting
> >>>>>>
> >>>>>> the concepts.
> >>>>>>
> >>>>>> Please let me know what you think.
> >>>>>>
> >>>>>> Thanks,
> >>>>>> Timo
> >>>>>>
> >>>>>>
> >>>>>> On 18.03.20 13:43, Timo Walther wrote:
> >>>>>>
> >>>>>> Hi Benchao,
> >>>>>>
> >>>>>> this is a very good question. I will update the FLIP about
> >>>>>>
> >>>>>> this.
> >>>>>>
> >>>>>> The legacy planner will not support the new interfaces. It
> >>>>>>
> >>>>>> will
> >>>>>>
> >>>>>> only
> >>>>>>
> >>>>>> support the old interfaces. With the next release, I think
> >>>>>>
> >>>>>> the
> >>>>>>
> >>>>>> Blink
> >>>>>>
> >>>>>> planner is stable enough to be the default one as well.
> >>>>>>
> >>>>>> Regards,
> >>>>>> Timo
> >>>>>>
> >>>>>> On 18.03.20 08:45, Benchao Li wrote:
> >>>>>>
> >>>>>> Hi Timo,
> >>>>>>
> >>>>>> Thank you and others for the efforts to prepare this FLIP.
> >>>>>>
> >>>>>> The FLIP LGTM generally.
> >>>>>>
> >>>>>> +1 for moving blink data structures to table-common, it's
> >>>>>>
> >>>>>> useful
> >>>>>>
> >>>>>> to
> >>>>>>
> >>>>>> udf too
> >>>>>> in the future.
> >>>>>> A little question is, do we plan to support the new
> >>>>>>
> >>>>>> interfaces
> >>>>>>
> >>>>>> and
> >>>>>>
> >>>>>> data
> >>>>>>
> >>>>>> types in legacy planner?
> >>>>>> Or we only plan to support these new interfaces in blink
> >>>>>>
> >>>>>> planner.
> >>>>>>
> >>>>>> And using primary keys from DDL instead of derived key
> >>>>>>
> >>>>>> information
> >>>>>>
> >>>>>> from
> >>>>>>
> >>>>>> each query is also a good idea,
> >>>>>> we met some use cases where this did not work very well
> >>>>>>
> >>>>>> before.
> >>>>>>
> >>>>>> This FLIP also makes the dependencies of table modules more
> >>>>>>
> >>>>>> clear, I
> >>>>>>
> >>>>>> like
> >>>>>> it very much.
> >>>>>>

Re: [DISCUSS] FLIP-95: New TableSource and TableSink interfaces

Posted by Timo Walther <tw...@apache.org>.
Hi Becket,

Let me clarify a few things first: Historically, we thought of the Table 
API/SQL as a library on top of the DataStream API, similar to Gelly or CEP. 
We used TypeInformation in the Table API to integrate nicely with the 
DataStream API. However, recent years have shown that SQL is not just a library. 
It is an entire ecosystem that defines data types, submission behavior, 
execution behavior, and highly optimized SerDes. SQL is a way to declare 
data processing end-to-end such that the planner has the full control 
over the execution.

But I totally agree with your concerns around connectors. There is no 
big difference between your concerns and the current design.

1. "native connector interface is a generic abstraction of doing IO and 
Serde":

This is the case in our design. We are using SourceFunction, 
DeserializationSchema, WatermarkAssigner, etc., i.e. all the pluggable interfaces 
that the DataStream API offers for performing runtime operations.
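
As a minimal sketch (assuming the interface names proposed in the FLIP 
document, which may still evolve), such a planner-facing source only 
assembles these pluggables and hands them back as a runtime provider:

class MyTableSource implements ScanTableSource {

     @Override
     public ScanRuntimeProvider getScanRuntimeProvider(ScanContext context) {
         // plain DataStream API pluggables, no table dependency at runtime;
         // MySourceFunction and createDeserialization() are placeholders
         DeserializationSchema<RowData> deser = createDeserialization();
         SourceFunction<RowData> fn = new MySourceFunction(deser);
         return SourceFunctionProvider.of(fn, false); // false = unbounded
     }

     // getChangelogMode() and other required methods omitted for brevity
}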

2. "advanced features ... could be provided in a semantic free way":

I agree here. But this is an orthogonal topic that each connector 
implementer should keep in mind. If a new connector is developed, it 
should *not* be developed with only SQL in mind but with good abstractions 
such that DataStream API users can also use it. A connector should have 
a builder pattern to plug in all capabilities like Parquet filters etc. 
There should be no table-specific native/runtime connectors. I think 
this discussion is related to the discussion of FLIP-115.

However, as I mentioned before: This FLIP only discusses the interfaces 
for communication between planner and connector factory. As Dawid said 
earlier, a DynamicTableSource can be seen more as a factory that calls the 
pluggable interfaces of a native connector in the end:

KafkaConnector.builder()
   .watermarkAssigner(...)
   .keyDeser(...)
   .valueDeser(...)
   ....
   .build()
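
For example, a watermark assigner pushed down by the planner would end up 
in exactly that builder call. A rough sketch (SupportsWatermarkPushDown is 
the ability interface proposed in the FLIP; its signature is simplified 
here and the builder methods above are illustrative):

class KafkaDynamicSource implements ScanTableSource, SupportsWatermarkPushDown {

     private WatermarkAssigner assigner;

     // called by the planner during optimization
     public void applyWatermark(WatermarkAssigner assigner) {
         this.assigner = assigner;
     }

     // afterwards, the negotiated pieces configure the regular connector
     public ScanRuntimeProvider getScanRuntimeProvider(ScanContext context) {
         return SourceFunctionProvider.of(
             KafkaConnector.builder()
                 .watermarkAssigner(assigner)
                 // ... key/value deserializers etc. as above ...
                 .build(),
             false);
     }
}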

Regards,
Timo


On 25.03.20 09:05, Becket Qin wrote:
> Hi Kurt,
> 
> I do not object to promote the concepts of SQL, but I don't think we should
> do that by introducing a new dedicated set of connector public interfaces
> that is only for SQL. The same argument can be applied to Gelly, CEP, and
> Machine Learning, claiming that they need to introduce a dedicated public
> set of interfaces that fits their own concept and ask the connector
> developers to learn and follow their design. As an analogy, if we want to
> promote Chinese, we don't want to force people to learn ancient Chinese
> poetry while they only need to know a few words like "hello" and "goodbye".
> 
> As some design principles, here is what I think Flink connectors
> should look like:
> 
> 1. The native connector interface is a generic abstraction of doing IO and
> Serde, without semantics for high-level use cases such as SQL, Gelly, CEP,
> etc.
> 
> 2. Some advanced features that may help accelerate the IO and Serde could
> be provided in the native connector interfaces in a semantics-free way so
> that all the high-level use cases can leverage them.
> 
> 3. Additional semantics can be built on top of the native source interface
> by providing different plugins. These plugins could be high-level use
> case aware. For example, to provide a filter to the source, we can do the
> following
> 
> // An interface for all the filters that take an expression.
> interface ExpressionFilter {
>      FilterResult applyFilterExpression();
> }
> 
> // A filter plugin implementation that translates the SQL Expression to a
> ParquetFilterPredicate.
> class ParquetExpressionFilter implements Supplier<ParquetFilterPredicate>,
> ExpressionFilter {
>      // Called by the high-level use case.
>      FilterResult applyFilterExpression() { ... }
> 
>      // Used by the native Source interface.
>      ParquetFilterPredicate get() { ... }
> }
> 
> In this case, the connector developer just needs to write the logic of
> translating an Expression to Parquet FilterPredicate. They don't have to
> understand the entire set of interfaces that we want to promote. Just like
> they only need to know how to say "Hello" without learning ancient Chinese
> poem.
> 
> Again, I am not saying this is necessarily the best approach. But so far it
> seems a reasonable design principle to tell the developers.
> 
> Thanks,
> 
> Jiangjie (Becket) Qin
> 
> 
> 
> On Wed, Mar 25, 2020 at 11:53 AM Kurt Young <yk...@gmail.com> wrote:
> 
>> Hi Becket,
>>
>> I don't think we should discuss this in pure engineering aspects. Your
>> proposal is trying
>> to let SQL connector developers understand as less SQL concepts as
>> possible. But quite
>> the opposite, we are designing those interfaces to emphasize the SQL
>> concept, to bridge
>> high level concepts into real interfaces and classes.
>>
>> We keep talking about time-varying relations and dynamic tables when
>> introducing SQL concepts;
>> sources and sinks are the most critical parts playing with those concepts. It's
>> essential to let
>> Flink SQL developers learn these concepts and connect them with real
>> code by introducing
>> these connector interfaces and can further write *correct* connectors based
>> on such domain
>> knowledge.
>>
>> So this FLIP is a very important chance to express these concepts and make
>> most SQL developers
>> be aligned with the concepts and on the same page. It's mostly about different levels of
>> abstraction, and for domains
>> like SQL, this is becoming more important. It helps Flink SQL go smoothly in
>> the future, and also
>> makes it easier for new contributors. But I would admit this is not that
>> obvious for others who don't work
>> with SQL frequently.
>>
>> Best,
>> Kurt
>>
>>
>> On Wed, Mar 25, 2020 at 11:07 AM Becket Qin <be...@gmail.com> wrote:
>>
>>> Hi Jark,
>>>
>>> It is good to know that we do not expect the end users to touch those
>>> interfaces.
>>>
>>> Then the question boils down to whether the connector developers should
>> be
>>> aware of the interfaces that are only used by the SQL optimizer. It
>> seems a
>>> win if we can avoid that.
>>>
>>> Two potential solutions off the top of my head are:
>>> 1. An internal helper class doing the instanceof check based on the DataStream
>> source
>>> interface and creating pluggables for that DataStream source.
>>> 2. codegen the set of TableSource interfaces given a DataStream Source
>> and
>>> its corresponding TablePluggablesFactory.
>>>
>>> Thanks,
>>>
>>> Jiangjie (Becket) Qin
>>>
>>> On Wed, Mar 25, 2020 at 10:07 AM Jark Wu <im...@gmail.com> wrote:
>>>
>>>> Hi Becket,
>>>>
>>>> Regarding Flavor1 and Flavor2, I want to clarify that users will
>> never
>>>> use a table source like this:
>>>>
>>>> {
>>>>     MyTableSource myTableSource = MyTableSourceFactory.create();
>>>>     myTableSource.setSchema(mySchema);
>>>>     myTableSource.applyFilterPredicate(expression);
>>>>     ...
>>>> }
>>>>
>>>> TableFactory and TableSource are not directly exposed to end users, all
>>> the
>>>> methods are called by the planner, not users.
>>>> Users always use DDL or a descriptor to register a table, and the planner
>> will
>>>> find the factory and create sources according to the properties.
>>>> All the optimizations are applied automatically, e.g. filter/projection
>>>> pushdown, users don't need to call `applyFilterPredicate` explicitly.
>>>>
>>>>
>>>>
>>>> On Wed, 25 Mar 2020 at 09:25, Becket Qin <be...@gmail.com> wrote:
>>>>
>>>>> Hi Timo and Dawid,
>>>>>
>>>>> Thanks for the clarification. They really help. You are right that we
>>> are
>>>>> on the same page regarding the hierarchy. I think the only difference
>>>>> between our view is the flavor of the interfaces. There are two
>> flavors
>>>> of
>>>>> the source interface for DataStream and Table source.
>>>>>
>>>>> *Flavor 1. Table Sources are some wrapper interfaces around
>> DataStream
>>>>> source.*
>>>>> Following this way, we will reach the design of the current proposal,
>>>> i.e.
>>>>> each pluggable exposed in the DataStream source will have a
>>> corresponding
>>>>> TableSource interface counterpart, which are at the Factory level.
>>> Users
>>>>> will write code like this:
>>>>>
>>>>> {
>>>>>      MyTableSource myTableSource = MyTableSourceFactory.create();
>>>>>      myTableSource.setSchema(mySchema);
>>>>>      myTableSource.applyFilterPredicate(expression);
>>>>>      ...
>>>>> }
>>>>>
>>>>> The good thing for this flavor is that from the SQL / Table's
>>>> perspective,
>>>>> there is a dedicated set of Table oriented interface.
>>>>>
>>>>> The downsides are:
>>>>> A. From the user's perspective, DataStream Source and Table Source
>> are
>>>> just
>>>>> two different sets of interfaces, even though they are the same
>>>>> internally.
>>>>> B. The source developers have to develop for those two sets of
>>> interfaces
>>>>> in order to support both DataStream and Table.
>>>>> C. It is not explicit that DataStream can actually share the
>> pluggable
>>> in
>>>>> Table / SQL. For example, in order to provide a filter pluggable with
>>> SQL
>>>>> expression, users will have to know the actual converter class that
>>>>> converts the expression to the filter predicate and construct that
>>>>> converter by themselves.
>>>>>
>>>>> ---------------
>>>>>
>>>>> *Flavor 2. A TableSource is a DataStream source with a bunch of
>>>> pluggables.
>>>>> No Table specific interfaces at all.*
>>>>> Following this way, we will reach another design where you have a
>>>>> SourceFactory and a single Pluggable factory for all the table
>>>> pluggables.
>>>>> And users will write something like:
>>>>>
>>>>> {
>>>>>      Deserializer<Row> myTableDeserializer =
>>>>> MyTablePluggableFactory.createDeserializer(schema)
>>>>>      MySource<Row> mySource = MySourceFactory.create(properties,
>>>>> myTableDeserializer);
>>>>>
>>>>>
>>>>>
>>>>
>>>
>> mySource.applyPredicate(MyTablePluggableFactory.createFilterPredicate(expression));
>>>>> }
>>>>>
>>>>> The good thing for this flavor is that there is just one set of
>>> interface
>>>>> that works for both Table and DataStream. There is no difference
>>> between
>>>>> creating a DataStream source and creating a Table source. DataStream
>>> can
>>>>> easily reuse the pluggables from the Table sources.
>>>>>
>>>>> The downside is that Table / SQL won't have a dedicated API for
>>>>> optimization. Instead of writing:
>>>>>
>>>>> if (MyTableSource instanceof FilterableTableSource) {
>>>>>       // Some filter push down logic.
>>>>>       MyTableSource.applyPredicate(expression)
>>>>> }
>>>>>
>>>>> One has to write:
>>>>>
>>>>> if (MySource instanceof FilterableSource) {
>>>>>      // Some filter push down logic.
>>>>>
>>>>>
>>>>>
>>>>
>>>
>> mySource.applyPredicate(MyTablePluggableFactory.createFilterPredicate(expression));
>>>>> }
>>>>>
>>>>> -------------------------
>>>>>
>>>>> Just to be clear, I am not saying flavor 2 is necessarily better than
>>>>> flavor 1, but I want to make sure flavor 2 is also considered and
>>>>> discussed.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Jiangjie (Becket) Qin.
>>>>>
>>>>> On Tue, Mar 24, 2020 at 10:53 PM Dawid Wysakowicz <
>>>> dwysakowicz@apache.org>
>>>>> wrote:
>>>>>
>>>>>> Hi Becket,
>>>>>>
>>>>>> I really think we don't have differing opinions. We might not see
>>> the
>>>>>> changes in the same way yet. Personally I think of the
>>>> DynamicTableSource
>>>>>> as of a factory for a Source implemented for the DataStream API.
>> The
>>>>>> important fact about the DynamicTableSource and all feature traits
>>>>>> (SupportsFilterablePushDown, SupportsProjectPushDown etc.) is that they work
>> with
>>>>> Table
>>>>>> API concepts such as e.g. Expressions, SQL specific types etc. In
>> the
>>>> end
>>>>>> what the implementation would resemble is (bear in mind I
>>> tremendously
>>>>>> simplified the example, just to show the relation between the two
>>>> APIs):
>>>>>>
>>>>>> SupportsFilterablePushDown {
>>>>>>
>>>>>>    applyFilters(List<ResolvedExpression> filters) {
>>>>>>
>>>>>>      this.filters = convertToDataStreamFilters(filters);
>>>>>>
>>>>>>    }
>>>>>>
>>>>>>    Source createSource() {
>>>>>>
>>>>>>          return Source.create()
>>>>>>
>>>>>>            .applyFilters(this.filters);
>>>>>>
>>>>>>     }
>>>>>>
>>>>>> }
>>>>>>
>>>>>> or exactly as you said for the computed columns:
>>>>>>
>>>>>>
>>>>>> SupportsComputedColumnsPushDown {
>>>>>>
>>>>>>
>>>>>>
>>>>>>    applyComputedColumn(ComputedColumnConverter converter) {
>>>>>>
>>>>>>      this.deserializationSchema = new DeserializationSchema<Row> {
>>>>>>
>>>>>>        Row deserialize(...) {
>>>>>>
>>>>>>          RowData row = format.deserialize(bytes); // original
>> format,
>>>> e.g
>>>>>> json, avro, etc.
>>>>>>
>>>>>>          RowData enriched = converter(row)
>>>>>>
>>>>>>        }
>>>>>>
>>>>>>      }
>>>>>>
>>>>>>    }
>>>>>>
>>>>>>    Source createSource() {
>>>>>>
>>>>>>          return Source.create()
>>>>>>
>>>>>>            .withDeserialization(deserializationSchema);
>>>>>>
>>>>>>     }
>>>>>>
>>>>>> }
>>>>>>
>>>>>> So to sum it up again, all those interfaces are factories that
>>>> configure
>>>>>> appropriate parts of the DataStream API using Table API concepts.
>>>> Finally
>>>>>> to answer your question for particular comparisons:
>>>>>>
>>>>>> DynamicTableSource v.s. Source<Row, SourceSplitT, EnumChkT>
>>>>>> SupportsFilterablePushDown v.s. FilterableSource
>>>>>> SupportsProjectablePushDown v.s. ProjectableSource
>>>>>> SupportsWatermarkPushDown v.s. WithWatermarkAssigner
>>>>>> SupportsComputedColumnPushDown v.s. ComputedColumnDeserializer
>>>>>> ScanTableSource v.s. ChangeLogDeserializer.
>>>>>>
>>>>>> pretty much you can think of all on the left as factories for the
>>> right
>>>>>> side, left side works with Table API classes (Expressions,
>>> DataTypes).
>>>> I
>>>>>> hope this clarifies it a bit.
>>>>>>
>>>>>> Best,
>>>>>>
>>>>>> Dawid
>>>>>> On 24/03/2020 15:03, Becket Qin wrote:
>>>>>>
>>>>>> Hey Kurt,
>>>>>>
>>>>>> I don't think DataStream should see some SQL specific concepts such
>>> as
>>>>>>
>>>>>> Filtering or ComputedColumn.
>>>>>>
>>>>>> Projectable and Filterable do not seem to be necessarily SQL concepts, but
>>>> could
>>>>> be
>>>>>> applicable to DataStream source as well to reduce the network load.
>>> For
>>>>>> example ORC and Parquet should probably also be readable from
>>>> DataStream,
>>>>>> right?
>>>>>>
>>>>>> ComputedColumn is not part of the Source, it is an interface
>> extending
>>>> the
>>>>>> Deserializer, which is a pluggable for the Source. From the SQL's
>>>>>> perspective it has the concept of computed column, but from the
>>> Source
>>>>>> perspective, it is essentially a Deserializer which also converts
>> the
>>>>>> records internally, assuming we allow some conversion to be
>> embedded
>>> to
>>>>>> the source in addition to just deserialization.
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Jiangjie (Becket) Qin
>>>>>>
>>>>>> On Tue, Mar 24, 2020 at 9:36 PM Jark Wu <im...@gmail.com> <
>>>>> imjark@gmail.com> wrote:
>>>>>>
>>>>>>
>>>>>> Thanks Timo for updating the formats section. That would be very
>>>> helpful
>>>>>> for changelog supporting (FLIP-105).
>>>>>>
>>>>>> I just left 2 minor comments about some method names. In general,
>> I'm
>>> +1
>>>>> to
>>>>>> start a voting.
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>> --------------------------------------------------------------------------------------------------
>>>>>>
>>>>>> Hi Becket,
>>>>>>
>>>>>> I agree we shouldn't duplicate code, especially the runtime
>>>>>> implementations.
>>>>>> However, the interfaces proposed by FLIP-95 are mainly used during
>>>>>> optimization (compiling), not runtime.
>>>>>> I don't think there is much to share for this. Because table/sql
>>>>>> is declarative, but DataStream is imperative.
>>>>>> For example, filter push down, DataStream FilterableSource may
>> allow
>>> to
>>>>>> accept a FilterFunction (which is a black box for the source).
>>>>>> However, table sources should pick the pushed filter expressions,
>>> some
>>>>>> sources may only support "=", "<", ">" conditions.
>>>>>> Pushing a FilterFunction doesn't work in table ecosystem. That
>> means,
>>>> the
>>>>>> connectors have to have some table-specific implementations.
>>>>>>
>>>>>>
>>>>>> Best,
>>>>>> Jark
>>>>>>
>>>>>> On Tue, 24 Mar 2020 at 20:41, Kurt Young <yk...@gmail.com> <
>>>>> ykt836@gmail.com> wrote:
>>>>>>
>>>>>>
>>>>>> Hi Becket,
>>>>>>
>>>>>> I don't think DataStream should see some SQL specific concepts such
>>> as
>>>>>> Filtering or ComputedColumn. It's
>>>>>> better to stay within the SQL area and translate to a more generic
>> concept
>>>> when
>>>>>> translating to DataStream/Runtime
>>>>>> layer, such as using a MapFunction to represent computed column logic.
>>>>>>
>>>>>> Best,
>>>>>> Kurt
>>>>>>
>>>>>>
>>>>>> On Tue, Mar 24, 2020 at 5:47 PM Becket Qin <be...@gmail.com>
>> <
>>>>> becket.qin@gmail.com> wrote:
>>>>>>
>>>>>>
>>>>>> Hi Timo and Dawid,
>>>>>>
>>>>>> It's really great that we have the same goal. I am actually
>> wondering
>>>>>>
>>>>>> if
>>>>>>
>>>>>> we
>>>>>>
>>>>>> can go one step further to avoid some of the interfaces in Table as
>>>>>>
>>>>>> well.
>>>>>>
>>>>>> For example, if we have the FilterableSource, do we still need the
>>>>>> FilterableTableSource? Should DynamicTableSource just become a
>>>>>> Source<*Row*,
>>>>>> SourceSplitT, EnumChkT>?
>>>>>>
>>>>>> Can you help me understand a bit more about the reason we need the
>>>>>> following relational representation / wrapper interfaces v.s. the
>>>>>> interfaces that we could put to the Source in FLIP-27?
>>>>>>
>>>>>> DynamicTableSource v.s. Source<Row, SourceSplitT, EnumChkT>
>>>>>> SupportsFilterablePushDown v.s. FilterableSource
>>>>>> SupportsProjectablePushDown v.s. ProjectableSource
>>>>>> SupportsWatermarkPushDown v.s. WithWatermarkAssigner
>>>>>> SupportsComputedColumnPushDown v.s. ComputedColumnDeserializer
>>>>>> ScanTableSource v.s. ChangeLogDeserializer.
>>>>>> LookUpTableSource v.s. LookUpSource
>>>>>>
>>>>>> Assuming we have all the interfaces on the right side, do we still
>>> need
>>>>>>
>>>>>> the
>>>>>>
>>>>>> interfaces on the left side? Note that the interfaces on the right
>>> can
>>>>>>
>>>>>> be
>>>>>>
>>>>>> used by both DataStream and Table. If we do this, there will only
>> be
>>>>>>
>>>>>> one
>>>>>>
>>>>>> set of Source interfaces Table and DataStream, the only difference
>> is
>>>>>>
>>>>>> that
>>>>>>
>>>>>> the Source for table will have some specific plugins and
>>>>>>
>>>>>> configurations.
>>>>>>
>>>>>> An
>>>>>>
>>>>>> omnipotent Source can implement all the above interfaces and
>>> take a
>>>>>> Deserializer that implements both ComputedColumnDeserializer and
>>>>>> ChangeLogDeserializer.
>>>>>>
>>>>>> Would the SQL planner work with that?
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Jiangjie (Becket) Qin
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Tue, Mar 24, 2020 at 5:03 PM Jingsong Li <
>> jingsonglee0@gmail.com>
>>> <
>>>>> jingsonglee0@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>
>>>>>> +1. Thanks Timo for the design doc.
>>>>>>
>>>>>> We can also consider @Experimental. But I am +1 to
>>>>>>
>>>>>> @PublicEvolving,
>>>>>>
>>>>>> we
>>>>>>
>>>>>> should be confident in the current change.
>>>>>>
>>>>>> Best,
>>>>>> Jingsong Lee
>>>>>>
>>>>>> On Tue, Mar 24, 2020 at 4:30 PM Timo Walther <tw...@apache.org>
>> <
>>>>> twalthr@apache.org>
>>>>>>
>>>>>> wrote:
>>>>>>
>>>>>> @Becket: We totally agree that we don't need table specific
>>>>>>
>>>>>> connectors
>>>>>>
>>>>>> during runtime. As Dawid said, the interfaces proposed here are
>>>>>>
>>>>>> just
>>>>>>
>>>>>> for
>>>>>>
>>>>>> communication with the planner. Once the properties (watermarks,
>>>>>> computed column, filters, projection etc.) are negotiated, we can
>>>>>> configure a regular Flink connector.
>>>>>>
>>>>>> E.g. setting the watermark assigner and deserialization schema of a
>>>>>> Kafka connector.
>>>>>>
>>>>>> For better separation of concerns, Flink connectors should not
>>>>>>
>>>>>> include
>>>>>>
>>>>>> relational interfaces and depend on flink-table. This is the
>>>>>> responsibility of table source/sink.
>>>>>>
>>>>>> @Kurt: I would like to mark them @PublicEvolving already because we
>>>>>>
>>>>>> need
>>>>>>
>>>>>> to deprecate the old interfaces as early as possible. We cannot
>>>>>>
>>>>>> redirect
>>>>>>
>>>>>> to @Internal interfaces. They are not marked @Public, so we can
>>>>>>
>>>>>> still
>>>>>>
>>>>>> evolve them. But a core design shift should not happen again, it
>>>>>>
>>>>>> would
>>>>>>
>>>>>> leave a bad impression if we redesign over and over again.
>>>>>>
>>>>>> Instead
>>>>>>
>>>>>> we should be confident in the current change.
>>>>>>
>>>>>> Regards,
>>>>>> Timo
>>>>>>
>>>>>>
>>>>>> On 24.03.20 09:20, Dawid Wysakowicz wrote:
>>>>>>
>>>>>> Hi Becket,
>>>>>>
>>>>>> Answering your question, we have the same intention not to
>>>>>>
>>>>>> duplicate
>>>>>>
>>>>>> connectors between datastream and table apis. The interfaces
>>>>>>
>>>>>> proposed
>>>>>>
>>>>>> in
>>>>>>
>>>>>> the FLIP are a way to describe relational properties of a source.
>>>>>>
>>>>>> The
>>>>>>
>>>>>> intention is as you described to translate all of those expressed
>>>>>>
>>>>>> as
>>>>>>
>>>>>> expressions or other Table specific structures into a DataStream
>>>>>>
>>>>>> source.
>>>>>>
>>>>>> In other words I think what we are doing here is in line with
>>>>>>
>>>>>> what
>>>>>>
>>>>>> you
>>>>>>
>>>>>> described.
>>>>>>
>>>>>> Best,
>>>>>>
>>>>>> Dawid
>>>>>>
>>>>>> On 24/03/2020 02:23, Becket Qin wrote:
>>>>>>
>>>>>> Hi Timo,
>>>>>>
>>>>>> Thanks for the proposal. I completely agree that the current
>>>>>>
>>>>>> Table
>>>>>>
>>>>>> connectors could be simplified quite a bit. I haven't finished
>>>>>>
>>>>>> reading
>>>>>>
>>>>>> everything, but here are some quick thoughts.
>>>>>>
>>>>>> Actually to me the biggest question is why should there be two
>>>>>>
>>>>>> different
>>>>>>
>>>>>> connector systems for DataStream and Table? What is the
>>>>>>
>>>>>> fundamental
>>>>>>
>>>>>> reason
>>>>>>
>>>>>> that is preventing us from merging them to one?
>>>>>>
>>>>>> The basic functionality of a connector is to provide
>>>>>>
>>>>>> capabilities
>>>>>>
>>>>>> to
>>>>>>
>>>>>> do
>>>>>>
>>>>>> IO
>>>>>>
>>>>>> and Serde. Conceptually, Table connectors should just be
>>>>>>
>>>>>> DataStream
>>>>>>
>>>>>> connectors that are dealing with Rows. It seems that quite a few
>>>>>>
>>>>>> of
>>>>>>
>>>>>> the
>>>>>>
>>>>>> special connector requirements are just a specific way to do IO
>>>>>>
>>>>>> /
>>>>>>
>>>>>> Serde.
>>>>>>
>>>>>> Taking SupportsFilterPushDown as an example, imagine we have the
>>>>>>
>>>>>> following
>>>>>>
>>>>>> interface:
>>>>>>
>>>>>> interface FilterableSource<PREDICATE> {
>>>>>>       void applyFilterable(Supplier<PREDICATE> predicate);
>>>>>> }
>>>>>>
>>>>>> And if a ParquetSource would like to support filterable, it will
>>>>>>
>>>>>> become:
>>>>>>
>>>>>> class ParquetSource implements Source,
>>>>>>
>>>>>> FilterableSource<FilterPredicate> {
>>>>>>
>>>>>>       ...
>>>>>> }
>>>>>>
>>>>>> For Table, one just needs to provide a predicate supplier that
>>>>>>
>>>>>> converts
>>>>>>
>>>>>> an
>>>>>>
>>>>>> Expression to the specified predicate type. This has a few
>>>>>>
>>>>>> benefits:
>>>>>>
>>>>>> 1. Same unified API for filterable for sources, regardless of
>>>>>>
>>>>>> DataStream or
>>>>>>
>>>>>> Table.
>>>>>> 2. The DataStream users can now also use the
>>>>>>
>>>>>> ExpressionToPredicate
>>>>>>
>>>>>> supplier if they want to.
>>>>>>
>>>>>> To summarize, my main point is that I am wondering if it is
>>>>>>
>>>>>> possible
>>>>>>
>>>>>> to
>>>>>>
>>>>>> have a single set of connector interface for both Table and
>>>>>>
>>>>>> DataStream,
>>>>>>
>>>>>> rather than having two hierarchies. I am not 100% sure if this
>>>>>>
>>>>>> would
>>>>>>
>>>>>> work,
>>>>>>
>>>>>> but if it works, this would be a huge win from both code
>>>>>>
>>>>>> maintenance
>>>>>>
>>>>>> and
>>>>>>
>>>>>> user experience perspective.
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Jiangjie (Becket) Qin
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Tue, Mar 24, 2020 at 2:03 AM Dawid Wysakowicz <
>>>>>>
>>>>>> dwysakowicz@apache.org>
>>>>>>
>>>>>> wrote:
>>>>>>
>>>>>>
>>>>>> Hi Timo,
>>>>>>
>>>>>> Thank you for the proposal. I think it is an important
>>>>>>
>>>>>> improvement
>>>>>>
>>>>>> that
>>>>>>
>>>>>> will benefit many parts of the Table API. The proposal looks
>>>>>>
>>>>>> really
>>>>>>
>>>>>> good
>>>>>>
>>>>>> to me and personally I would be comfortable with voting on the
>>>>>>
>>>>>> current
>>>>>>
>>>>>> state.
>>>>>>
>>>>>> Best,
>>>>>>
>>>>>> Dawid
>>>>>>
>>>>>> On 23/03/2020 18:53, Timo Walther wrote:
>>>>>>
>>>>>> Hi everyone,
>>>>>>
>>>>>> I received some questions around how the new interfaces play
>>>>>>
>>>>>> together
>>>>>>
>>>>>> with formats and their factories.
>>>>>>
>>>>>> Furthermore, for MySQL or Postgres CDC logs, the format should
>>>>>>
>>>>>> be
>>>>>>
>>>>>> able
>>>>>>
>>>>>> to return a `ChangelogMode`.
>>>>>>
>>>>>> Also, I incorporated the feedback around the factory design in
>>>>>>
>>>>>> general.
>>>>>>
>>>>>> I added a new section `Factory Interfaces` to the design
>>>>>>
>>>>>> document.
>>>>>>
>>>>>> This should be helpful to understand the big picture and
>>>>>>
>>>>>> connecting
>>>>>>
>>>>>> the concepts.
>>>>>>
>>>>>> Please let me know what you think.
>>>>>>
>>>>>> Thanks,
>>>>>> Timo
>>>>>>
>>>>>>
>>>>>> On 18.03.20 13:43, Timo Walther wrote:
>>>>>>
>>>>>> Hi Benchao,
>>>>>>
>>>>>> this is a very good question. I will update the FLIP about
>>>>>>
>>>>>> this.
>>>>>>
>>>>>> The legacy planner will not support the new interfaces. It
>>>>>>
>>>>>> will
>>>>>>
>>>>>> only
>>>>>>
>>>>>> support the old interfaces. With the next release, I think
>>>>>>
>>>>>> the
>>>>>>
>>>>>> Blink
>>>>>>
>>>>>> planner is stable enough to be the default one as well.
>>>>>>
>>>>>> Regards,
>>>>>> Timo
>>>>>>
>>>>>> On 18.03.20 08:45, Benchao Li wrote:
>>>>>>
>>>>>> Hi Timo,
>>>>>>
>>>>>> Thank you and others for the efforts to prepare this FLIP.
>>>>>>
>>>>>> The FLIP LGTM generally.
>>>>>>
>>>>>> +1 for moving blink data structures to table-common, it's
>>>>>>
>>>>>> useful
>>>>>>
>>>>>> to
>>>>>>
>>>>>> udf too
>>>>>> in the future.
>>>>>> A little question is, do we plan to support the new
>>>>>>
>>>>>> interfaces
>>>>>>
>>>>>> and
>>>>>>
>>>>>> data
>>>>>>
>>>>>> types in legacy planner?
>>>>>> Or we only plan to support these new interfaces in blink
>>>>>>
>>>>>> planner.
>>>>>>
>>>>>> And using primary keys from DDL instead of derived key
>>>>>>
>>>>>> information
>>>>>>
>>>>>> from
>>>>>>
>>>>>> each query is also a good idea,
>>>>>> we met some use cases where this did not work very well
>>>>>>
>>>>>> before.
>>>>>>
>>>>>> This FLIP also makes the dependencies of table modules more
>>>>>>
>>>>>> clear, I
>>>>>>
>>>>>> like
>>>>>> it very much.
>>>>>>


Re: [DISCUSS] FLIP-95: New TableSource and TableSink interfaces

Posted by Becket Qin <be...@gmail.com>.
Hi Kurt,

I do not object to promote the concepts of SQL, but I don't think we should
do that by introducing a new dedicated set of connector public interfaces
that is only for SQL. The same argument can be applied to Gelly, CEP, and
Machine Learning, claiming that they need to introduce a dedicated public
set of interfaces that fits their own concept and ask the connector
developers to learn and follow their design. As an analogy, if we want to
promote Chinese, we don't want to force people to learn ancient Chinese
poetry while they only need to know a few words like "hello" and "goodbye".

As some design principles, here is what I think Flink connectors
should look like:

1. The native connector interface is a generic abstraction of doing IO and
Serde, without semantics for high-level use cases such as SQL, Gelly, CEP,
etc.

2. Some advanced features that may help accelerate the IO and Serde could
be provided in the native connector interfaces in a semantics-free way so
that all the high-level use cases can leverage them.

3. Additional semantics can be built on top of the native source interface
by providing different plugins. These plugins could be high-level use
case aware. For example, to provide a filter to the source, we can do the
following

// An interface for all the filters that take an expression.
interface ExpressionFilter {
    FilterResult applyFilterExpression();
}

// A filter plugin implementation that translates the SQL Expression to a
ParquetFilterPredicate.
class ParquetExpressionFilter implements Supplier<ParquetFilterPredicate>,
ExpressionFilter {
    // Called by the high-level use case.
    FilterResult applyFilterExpression() { ... }

    // Used by the native Source interface.
    ParquetFilterPredicate get() { ... }
}

In this case, the connector developer just needs to write the logic of
translating an Expression to Parquet FilterPredicate. They don't have to
understand the entire set of interfaces that we want to promote. Just like
they only need to know how to say "Hello" without learning ancient Chinese
poetry.
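
To make the division of labor concrete, here is a hypothetical usage sketch
(FilterableSource is the interface from my earlier mail; the constructor
argument and ParquetSource are illustrative assumptions):

// SQL side: the pushed-down Expression goes into the plugin (assumed here
// to be injected via the constructor) and the optimizer reads the result.
ParquetExpressionFilter filter = new ParquetExpressionFilter(expression);
FilterResult result = filter.applyFilterExpression();

// Native side: the very same object is consumed as a plain Supplier, so
// the source keeps zero dependencies on SQL classes.
FilterableSource<ParquetFilterPredicate> source = new ParquetSource();
source.applyFilterable(filter);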

Again, I am not saying this is necessarily the best approach. But so far it
seems a reasonable design principle to tell the developers.

Thanks,

Jiangjie (Becket) Qin



On Wed, Mar 25, 2020 at 11:53 AM Kurt Young <yk...@gmail.com> wrote:

> Hi Becket,
>
> I don't think we should discuss this in purely engineering terms. Your
> proposal is trying
> to let SQL connector developers understand as few SQL concepts as
> possible. But quite
> the opposite, we are designing those interfaces to emphasize the SQL
> concept, to bridge
> high level concepts into real interfaces and classes.
>
> We keep talking about time-varying relations and dynamic tables when
> introducing SQL concepts;
> sources and sinks are the most critical parts playing with those concepts. It's
> essential to let
> Flink SQL developers learn these concepts and connect them with real
> code by introducing
> these connector interfaces and can further write *correct* connectors based
> on such domain
> knowledge.
>
> So this FLIP is a very important chance to express these concepts and make
> most SQL developers
> be aligned with the concepts and on the same page. It's mostly about different levels of
> abstraction, and for domains
> like SQL, this is becoming more important. It helps Flink SQL go smoothly in
> the future, and also
> makes it easier for new contributors. But I would admit this is not that
> obvious for others who don't work
> with SQL frequently.
>
> Best,
> Kurt
>
>
> On Wed, Mar 25, 2020 at 11:07 AM Becket Qin <be...@gmail.com> wrote:
>
> > Hi Jark,
> >
> > It is good to know that we do not expect the end users to touch those
> > interfaces.
> >
> > Then the question boils down to whether the connector developers should
> be
> > aware of the interfaces that are only used by the SQL optimizer. It
> seems a
> > win if we can avoid that.
> >
> > Two potential solutions off the top of my head are:
> > 1. An internal helper class doing the instanceof check based on the DataStream
> source
> > interface and creating pluggables for that DataStream source.
> > 2. codegen the set of TableSource interfaces given a DataStream Source
> and
> > its corresponding TablePluggablesFactory.
> >
> > Thanks,
> >
> > Jiangjie (Becket) Qin
> >
> > On Wed, Mar 25, 2020 at 10:07 AM Jark Wu <im...@gmail.com> wrote:
> >
> > > Hi Becket,
> > >
> > > Regarding Flavor1 and Flavor2, I want to clarify that users will
> never
> > > use a table source like this:
> > >
> > > {
> > >    MyTableSource myTableSource = MyTableSourceFactory.create();
> > >    myTableSource.setSchema(mySchema);
> > >    myTableSource.applyFilterPredicate(expression);
> > >    ...
> > > }
> > >
> > > TableFactory and TableSource are not directly exposed to end users, all
> > the
> > > methods are called by the planner, not users.
> > > Users always use DDL or a descriptor to register a table, and the planner
> will
> > > find the factory and create sources according to the properties.
> > > All the optimizations are applied automatically, e.g. filter/projection
> > > pushdown, users don't need to call `applyFilterPredicate` explicitly.
> > >
> > >
> > >
> > > On Wed, 25 Mar 2020 at 09:25, Becket Qin <be...@gmail.com> wrote:
> > >
> > > > Hi Timo and Dawid,
> > > >
> > > > Thanks for the clarification. They really help. You are right that we
> > are
> > > > on the same page regarding the hierarchy. I think the only difference
> > > > between our view is the flavor of the interfaces. There are two
> flavors
> > > of
> > > > the source interface for DataStream and Table source.
> > > >
> > > > *Flavor 1. Table Sources are some wrapper interfaces around
> DataStream
> > > > source.*
> > > > Following this way, we will reach the design of the current proposal,
> > > i.e.
> > > > each pluggable exposed in the DataStream source will have a
> > corresponding
> > > > TableSource interface counterpart, which are at the Factory level.
> > Users
> > > > will write code like this:
> > > >
> > > > {
> > > >     MyTableSource myTableSource = MyTableSourceFactory.create();
> > > >     myTableSource.setSchema(mySchema);
> > > >     myTableSource.applyFilterPredicate(expression);
> > > >     ...
> > > > }
> > > >
> > > > The good thing for this flavor is that from the SQL / Table's perspective,
> > > > there is a dedicated set of Table oriented interfaces.
> > > >
> > > > The downsides are:
> > > > A. From the user's perspective, DataStream Source and Table Source are
> > > > just two different sets of interfaces, regardless of how they are the
> > > > same internally.
> > > > B. The source developers have to develop against those two sets of
> > > > interfaces in order to support both DataStream and Table.
> > > > C. It is not explicit that DataStream can actually share the pluggables in
> > > > Table / SQL. For example, in order to provide a filter pluggable with a
> > > > SQL expression, users will have to know the actual converter class that
> > > > converts the expression to the filter predicate and construct that
> > > > converter by themselves.
> > > >
> > > > ---------------
> > > >
> > > > *Flavor 2. A TableSource is a DataStream source with a bunch of
> > > > pluggables. No Table specific interfaces at all.*
> > > > Following this way, we will reach another design where you have a
> > > > SourceFactory and a single Pluggable factory for all the table
> > > > pluggables.
> > > > And users will write something like:
> > > >
> > > > {
> > > >     Deserializer<Row> myTableDeserializer =
> > > >         MyTablePluggableFactory.createDeserializer(schema);
> > > >     MySource<Row> mySource = MySourceFactory.create(properties,
> > > >         myTableDeserializer);
> > > >     mySource.applyPredicate(
> > > >         MyTablePluggableFactory.createFilterPredicate(expression));
> > > > }
> > > >
> > > > The good thing for this flavor is that there is just one set of
> > > > interfaces that works for both Table and DataStream. There is no
> > > > difference between creating a DataStream source and creating a Table
> > > > source. DataStream can easily reuse the pluggables from the Table
> > > > sources.
> > > >
> > > > The downside is that Table / SQL won't have a dedicated API for
> > > > optimization. Instead of writing:
> > > >
> > > > if (MyTableSource instanceOf FilterableTableSource) {
> > > >      // Some filter push down logic.
> > > >      MyTableSource.applyPredicate(expression)
> > > > }
> > > >
> > > > One would have to write:
> > > >
> > > > if (MySource instanceOf FilterableSource) {
> > > >     // Some filter push down logic.
> > > >     mySource.applyPredicate(
> > > >         MyTablePluggableFactory.createFilterPredicate(expression));
> > > > }
> > > >
> > > > -------------------------
> > > >
> > > > Just to be clear, I am not saying flavor 2 is necessarily better than
> > > > flavor 1, but I want to make sure flavor 2 is also considered and
> > > > discussed.
> > > >
> > > > Thanks,
> > > >
> > > > Jiangjie (Becket) Qin.
> > > >
> > > > On Tue, Mar 24, 2020 at 10:53 PM Dawid Wysakowicz <
> > > dwysakowicz@apache.org>
> > > > wrote:
> > > >
> > > > > Hi Becket,
> > > > >
> > > > > I really think we don't have differing opinions. We might not see the
> > > > > changes in the same way yet. Personally I think of the
> > > > > DynamicTableSource as a factory for a Source implemented for the
> > > > > DataStream API. The important fact is that the DynamicTableSource and
> > > > > all feature traits (SupportsFilterablePushDown, SupportsProjectPushDown
> > > > > etc.) work with Table API concepts such as e.g. Expressions, SQL
> > > > > specific types etc. In the end what the implementation would resemble
> > > > > is (bear in mind I tremendously simplified the example, just to show
> > > > > the relation between the two APIs):
> > > > >
> > > > > SupportsFilterablePushDown {
> > > > >
> > > > >   applyFilters(List<ResolvedExpression> filters) {
> > > > >
> > > > >     this.filters = convertToDataStreamFilters(filters);
> > > > >
> > > > >   }
> > > > >
> > > > >   Source createSource() {
> > > > >
> > > > >         return Source.create()
> > > > >
> > > > >           .applyFilters(this.filters);
> > > > >
> > > > >    }
> > > > >
> > > > > }
> > > > >
> > > > > or exactly as you said for the computed columns:
> > > > >
> > > > >
> > > > > SupportsComputedColumnsPushDown {
> > > > >
> > > > >
> > > > >
> > > > >   applyComputedColumn(ComputedColumnConverter converter) {
> > > > >
> > > > >     this.deserializationSchema = new DeserializationSchema<RowData> {
> > > > >
> > > > >       RowData deserialize(...) {
> > > > >
> > > > >         RowData row = format.deserialize(bytes); // original format,
> > > > >         // e.g. json, avro, etc.
> > > > >
> > > > >         return converter(row); // row enriched with computed columns
> > > > >
> > > > >       }
> > > > >
> > > > >     }
> > > > >
> > > > >   }
> > > > >
> > > > >   Source createSource() {
> > > > >
> > > > >         return Source.create()
> > > > >
> > > > >           .withDeserialization(deserializationSchema);
> > > > >
> > > > >    }
> > > > >
> > > > > }
> > > > >
> > > > > So to sum it up again, all those interfaces are factories that
> > > > > configure appropriate parts of the DataStream API using Table API
> > > > > concepts. Finally, to answer your question for the particular
> > > > > comparisons:
> > > > >
> > > > > DynamicTableSource v.s. Source<Row, SourceSplitT, EnumChkT>
> > > > > SupportsFilterablePushDown v.s. FilterableSource
> > > > > SupportsProjectablePushDown v.s. ProjectableSource
> > > > > SupportsWatermarkPushDown v.s. WithWatermarkAssigner
> > > > > SupportsComputedColumnPushDown v.s. ComputedColumnDeserializer
> > > > > ScanTableSource v.s. ChangeLogDeserializer.
> > > > >
> > > > > Pretty much you can think of everything on the left as factories for
> > > > > the right side; the left side works with Table API classes
> > > > > (Expressions, DataTypes). I hope this clarifies it a bit.
> > > > >
> > > > > Best,
> > > > >
> > > > > Dawid
> > > > > On 24/03/2020 15:03, Becket Qin wrote:
> > > > >
> > > > > Hey Kurt,
> > > > >
> > > > > I don't think DataStream should see some SQL specific concepts such
> > > > > as Filtering or ComputedColumn.
> > > > >
> > > > > Projectable and Filterable seem not necessarily SQL concepts, but
> > > > > could be applicable to DataStream sources as well to reduce the
> > > > > network load. For example ORC and Parquet should probably also be
> > > > > readable from DataStream, right?
> > > > >
> > > > > ComputedColumn is not part of the Source; it is an interface
> > > > > extending the Deserializer, which is a pluggable for the Source. From
> > > > > the SQL perspective it has the concept of a computed column, but from
> > > > > the Source perspective, it is essentially a Deserializer which also
> > > > > converts the records internally, assuming we allow some conversion to
> > > > > be embedded into the source in addition to just deserialization.
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Jiangjie (Becket) Qin
> > > > >
> > > > > On Tue, Mar 24, 2020 at 9:36 PM Jark Wu <im...@gmail.com> wrote:
> > > > >
> > > > >
> > > > > Thanks Timo for updating the formats section. That would be very
> > > > > helpful for changelog support (FLIP-105).
> > > > >
> > > > > I just left 2 minor comments about some method names. In general, I'm
> > > > > +1 to start a vote.
> > > > >
> > > > > --------------------------------------------------------------------------------------------------
> > > > >
> > > > > Hi Becket,
> > > > >
> > > > > I agree we shouldn't duplicate code, especially the runtime
> > > > > implementations.
> > > > > However, the interfaces proposed by FLIP-95 are mainly used during
> > > > > optimization (compiling), not at runtime.
> > > > > I don't think there is much to share here, because table/sql
> > > > > is declarative, but DataStream is imperative.
> > > > > Take filter push down as an example: a DataStream FilterableSource may
> > > > > allow accepting a FilterFunction (which is a black box for the source).
> > > > > However, table sources should pick up the pushed filter expressions;
> > > > > some sources may only support "=", "<", ">" conditions.
> > > > > Pushing a FilterFunction doesn't work in the table ecosystem. That
> > > > > means the connectors have to have some table-specific implementations.
> > > > >
> > > > >
> > > > > Best,
> > > > > Jark
> > > > >
> > > > > On Tue, 24 Mar 2020 at 20:41, Kurt Young <yk...@gmail.com> wrote:
> > > > >
> > > > >
> > > > > Hi Becket,
> > > > >
> > > > > I don't think DataStream should see some SQL specific concepts such
> > > > > as Filtering or ComputedColumn. It's better to stay within the SQL
> > > > > area and translate to more generic concepts when translating to the
> > > > > DataStream/Runtime layer, such as using a MapFunction to represent
> > > > > computed column logic.
> > > > >
> > > > > Best,
> > > > > Kurt
> > > > >
> > > > >
> > > > > On Tue, Mar 24, 2020 at 5:47 PM Becket Qin <be...@gmail.com> wrote:
> > > > >
> > > > >
> > > > > Hi Timo and Dawid,
> > > > >
> > > > > It's really great that we have the same goal. I am actually wondering
> > > > > if we can go one step further to avoid some of the interfaces in Table
> > > > > as well. For example, if we have the FilterableSource, do we still
> > > > > need the FilterableTableSource? Should DynamicTableSource just become
> > > > > a Source<*Row*, SourceSplitT, EnumChkT>?
> > > > >
> > > > > Can you help me understand a bit more about the reason we need the
> > > > > following relational representation / wrapper interfaces v.s. the
> > > > > interfaces that we could put to the Source in FLIP-27?
> > > > >
> > > > > DynamicTableSource v.s. Source<Row, SourceSplitT, EnumChkT>
> > > > > SupportsFilterablePushDown v.s. FilterableSource
> > > > > SupportsProjectablePushDown v.s. ProjectableSource
> > > > > SupportsWatermarkPushDown v.s. WithWatermarkAssigner
> > > > > SupportsComputedColumnPushDown v.s. ComputedColumnDeserializer
> > > > > ScanTableSource v.s. ChangeLogDeserializer.
> > > > > LookUpTableSource v.s. LookUpSource
> > > > >
> > > > > Assuming we have all the interfaces on the right side, do we still
> > > > > need the interfaces on the left side? Note that the interfaces on the
> > > > > right can be used by both DataStream and Table. If we do this, there
> > > > > will only be one set of Source interfaces for Table and DataStream;
> > > > > the only difference is that the Source for Table will have some
> > > > > specific plugins and configurations. An omnipotent Source can
> > > > > implement all the above interfaces and take a Deserializer that
> > > > > implements both ComputedColumnDeserializer and ChangeLogDeserializer.
> > > > >
> > > > > Would the SQL planner work with that?
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Jiangjie (Becket) Qin
> > > > >
> > > > > On Tue, Mar 24, 2020 at 5:03 PM Jingsong Li <jingsonglee0@gmail.com>
> > > > > wrote:
> > > > >
> > > > >
> > > > > +1. Thanks Timo for the design doc.
> > > > >
> > > > > We can also consider @Experimental too. But I am +1 to
> > > > > @PublicEvolving; we should be confident in the current change.
> > > > >
> > > > > Best,
> > > > > Jingsong Lee
> > > > >
> > > > > On Tue, Mar 24, 2020 at 4:30 PM Timo Walther <tw...@apache.org>
> > > > > wrote:
> > > > >
> > > > > @Becket: We totally agree that we don't need table specific
> > > > > connectors during runtime. As Dawid said, the interfaces proposed here
> > > > > are just for communication with the planner. Once the properties
> > > > > (watermarks, computed column, filters, projection etc.) are
> > > > > negotiated, we can configure a regular Flink connector.
> > > > >
> > > > > E.g. setting the watermark assigner and deserialization schema of a
> > > > > Kafka connector.
> > > > >
> > > > > For better separation of concerns, Flink connectors should not
> > > > > include relational interfaces and depend on flink-table. This is the
> > > > > responsibility of table source/sink.
> > > > >
> > > > > @Kurt: I would like to mark them @PublicEvolving already because we
> > > > > need to deprecate the old interfaces as early as possible. We cannot
> > > > > redirect to @Internal interfaces. They are not marked @Public, so we
> > > > > can still evolve them. But a core design shift should not happen
> > > > > again; it would leave a bad impression if we redesign over and over
> > > > > again. Instead we should be confident in the current change.
> > > > >
> > > > > Regards,
> > > > > Timo
> > > > >
> > > > >
> > > > > On 24.03.20 09:20, Dawid Wysakowicz wrote:
> > > > >
> > > > > Hi Becket,
> > > > >
> > > > > Answering your question, we have the same intention not to duplicate
> > > > > connectors between the DataStream and Table APIs. The interfaces
> > > > > proposed in the FLIP are a way to describe relational properties of a
> > > > > source. The intention is, as you described, to translate all of those
> > > > > expressed as expressions or other Table specific structures into a
> > > > > DataStream source. In other words, I think what we are doing here is
> > > > > in line with what you described.
> > > > >
> > > > > Best,
> > > > >
> > > > > Dawid
> > > > >
> > > > > On 24/03/2020 02:23, Becket Qin wrote:
> > > > >
> > > > > Hi Timo,
> > > > >
> > > > > Thanks for the proposal. I completely agree that the current Table
> > > > > connectors could be simplified quite a bit. I haven't finished reading
> > > > > everything, but here are some quick thoughts.
> > > > >
> > > > > Actually to me the biggest question is why should there be two
> > > > > different connector systems for DataStream and Table? What is the
> > > > > fundamental reason that is preventing us from merging them into one?
> > > > >
> > > > > The basic functionality of a connector is to provide capabilities to
> > > > > do IO and Serde. Conceptually, Table connectors should just be
> > > > > DataStream connectors that are dealing with Rows. It seems that quite
> > > > > a few of the special connector requirements are just a specific way
> > > > > to do IO / Serde.
> > > > >
> > > > > Taking SupportsFilterPushDown as an example, imagine we have the
> > > > > following interface:
> > > > >
> > > > > interface FilterableSource<PREDICATE> {
> > > > >      void applyFilterable(Supplier<PREDICATE> predicate);
> > > > > }
> > > > >
> > > > > And if a ParquetSource would like to support filterable, it will
> > > > > become:
> > > > >
> > > > > class ParquetSource implements Source,
> > > > > FilterableSource<FilterPredicate> {
> > > > >      ...
> > > > > }
> > > > >
> > > > > For Table, one just needs to provide a predicate supplier that
> > > > > converts an Expression to the specified predicate type. This has a
> > > > > few benefits:
> > > > >
> > > > > 1. Same unified API for filterable sources, regardless of DataStream
> > > > > or Table.
> > > > > 2. The DataStream users now can also use the ExpressionToPredicate
> > > > > supplier if they want to.
> > > > >
> > > > > To summarize, my main point is that I am wondering if it is possible
> > > > > to have a single set of connector interfaces for both Table and
> > > > > DataStream, rather than having two hierarchies. I am not 100% sure if
> > > > > this would work, but if it works, this would be a huge win from both
> > > > > a code maintenance and a user experience perspective.
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Jiangjie (Becket) Qin
> > > > >
> > > > >
> > > > >
> > > > > On Tue, Mar 24, 2020 at 2:03 AM Dawid Wysakowicz
> > > > > <dwysakowicz@apache.org> wrote:
> > > > >
> > > > >
> > > > > Hi Timo,
> > > > >
> > > > > Thank you for the proposal. I think it is an important improvement
> > > > > that will benefit many parts of the Table API. The proposal looks
> > > > > really good to me and personally I would be comfortable with voting
> > > > > on the current state.
> > > > >
> > > > > Best,
> > > > >
> > > > > Dawid
> > > > >
> > > > > On 23/03/2020 18:53, Timo Walther wrote:
> > > > >
> > > > > Hi everyone,
> > > > >
> > > > > I received some questions around how the new interfaces play together
> > > > > with formats and their factories.
> > > > >
> > > > > Furthermore, for MySQL or Postgres CDC logs, the format should be
> > > > > able to return a `ChangelogMode`.
> > > > >
> > > > > Also, I incorporated the feedback around the factory design in
> > > > > general. I added a new section `Factory Interfaces` to the design
> > > > > document. This should be helpful for understanding the big picture
> > > > > and connecting the concepts.
> > > > >
> > > > > Please let me know what you think.
> > > > >
> > > > > Thanks,
> > > > > Timo
> > > > >
> > > > >
> > > > > On 18.03.20 13:43, Timo Walther wrote:
> > > > >
> > > > > Hi Benchao,
> > > > >
> > > > > this is a very good question. I will update the FLIP about this.
> > > > >
> > > > > The legacy planner will not support the new interfaces. It will only
> > > > > support the old interfaces. With the next release, I think the Blink
> > > > > planner is stable enough to be the default one as well.
> > > > >
> > > > > Regards,
> > > > > Timo
> > > > >
> > > > > On 18.03.20 08:45, Benchao Li wrote:
> > > > >
> > > > > Hi Timo,
> > > > >
> > > > > Thank you and others for the efforts to prepare this FLIP.
> > > > >
> > > > > The FLIP LGTM generally.
> > > > >
> > > > > +1 for moving the Blink data structures to table-common; it's useful
> > > > > for UDFs too in the future.
> > > > > A little question is, do we plan to support the new interfaces and
> > > > > data types in the legacy planner? Or do we only plan to support these
> > > > > new interfaces in the Blink planner?
> > > > > And using primary keys from the DDL instead of key information
> > > > > derived from each query is also a good idea;
> > > > > we met some use cases where this does not work very well before.
> > > > >
> > > > > This FLIP also makes the dependencies of the table modules much
> > > > > clearer; I like
> > > > > it very much.
> > > > >
> > > > >
> > > > > --
> > > > > Best, Jingsong Lee
> > > > >
> > > > >
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] FLIP-95: New TableSource and TableSink interfaces

Posted by Kurt Young <yk...@gmail.com>.
Hi Becket,

I don't think we should discuss this purely from an engineering perspective.
Your proposal tries to let SQL connector developers understand as few SQL
concepts as possible. But quite the opposite: we are designing those
interfaces to emphasize the SQL concepts, to bridge high-level concepts into
real interfaces and classes.

We keep talking about time-varying relations and dynamic tables when
introducing SQL concepts; sources and sinks are the most critical parts
playing with those concepts. It's essential to let Flink SQL developers learn
these concepts and connect them with real code through these connector
interfaces, so they can further write *correct* connectors based on such
domain knowledge.

So this FLIP is a very important chance to express these concepts and get
most SQL developers aligned with them and on the same page. It's mostly about
different levels of abstraction, and for domains like SQL this is becoming
more important. It helps Flink SQL go smoothly in the future, and also makes
it easier for new contributors. But I would admit this is not that obvious
for others who don't work with SQL frequently.
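
A rough sketch of the kind of concept-to-interface bridge meant here, assuming
simplified stand-ins inspired by the FLIP's ScanTableSource and ChangelogMode
(these are not the proposed signatures): a scan source has to declare which
changelog it produces, which is exactly the time-varying relation concept
surfacing in code.

enum RowKind { INSERT, UPDATE_BEFORE, UPDATE_AFTER, DELETE }

// Simplified stand-in for the FLIP's ChangelogMode: the set of row kinds a
// source may emit for its dynamic table.
final class ChangelogMode {
    private final java.util.Set<RowKind> kinds;

    private ChangelogMode(java.util.Set<RowKind> kinds) {
        this.kinds = kinds;
    }

    static ChangelogMode insertOnly() {
        return new ChangelogMode(java.util.EnumSet.of(RowKind.INSERT));
    }

    static ChangelogMode all() {
        return new ChangelogMode(java.util.EnumSet.allOf(RowKind.class));
    }

    boolean contains(RowKind kind) {
        return kinds.contains(kind);
    }
}

// Simplified stand-in for the FLIP's ScanTableSource: the connector developer
// answers a SQL-level question ("which changes does this source produce?")
// and the planner validates the query plan against that answer.
interface ScanTableSource {
    ChangelogMode getChangelogMode();
}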

Best,
Kurt



Re: [DISCUSS] FLIP-95: New TableSource and TableSink interfaces

Posted by Becket Qin <be...@gmail.com>.
Hi Jark,

It is good to know that we do not expect the end users to touch those
interfaces.

Then the question boils down to whether the connector developers should be
aware of the interfaces that are only used by the SQL optimizer. It seems a
win if we can avoid that.

Two potential solutions off the top of my head are:
1. An internal helper class that does the instanceOf check based on the
DataStream source interface and creates the pluggables for that DataStream
source (see the sketch below).
2. Codegen the set of TableSource interfaces given a DataStream Source and
its corresponding TablePluggablesFactory.
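
A minimal sketch of what option 1 could look like. All names below are
illustrative stand-ins for the FilterableSource, TablePluggablesFactory and
resolved filter expressions discussed in this thread (resolved expressions
are simplified to strings); none of them are the actual FLIP-27/FLIP-95
classes.

import java.util.List;
import java.util.function.Supplier;

// Illustrative stand-in for a FLIP-27 style source capability.
interface FilterableSource<P> {
    void applyFilterable(Supplier<P> predicate);
}

// Illustrative stand-in for a factory that turns Table API expressions
// (simplified to strings here) into the source's own predicate type.
interface TablePluggablesFactory<P> {
    P createFilterPredicate(List<String> resolvedExpressions);
}

final class TableSourceHelper {

    // Option 1 from above: the helper does the instanceOf check against the
    // DataStream source interface and creates the table-specific pluggable,
    // so connector developers never implement any SQL-only interface.
    static <P> void pushDownFilters(
            Object source,
            List<String> resolvedExpressions,
            TablePluggablesFactory<P> factory) {
        if (source instanceof FilterableSource) {
            @SuppressWarnings("unchecked")
            FilterableSource<P> filterable = (FilterableSource<P>) source;
            P predicate = factory.createFilterPredicate(resolvedExpressions);
            filterable.applyFilterable(() -> predicate);
        }
        // If the source is not filterable, nothing is pushed down and the
        // planner keeps the filter in the plan.
    }
}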

Thanks,

Jiangjie (Becket) Qin

On Wed, Mar 25, 2020 at 10:07 AM Jark Wu <im...@gmail.com> wrote:

> Hi Becket,
>
> Regarding Flavor 1 and Flavor 2, I want to clarify that users will never
> use a table source like this:
>
> {
>    MyTableSource myTableSource = MyTableSourceFactory.create();
>    myTableSource.setSchema(mySchema);
>    myTableSource.applyFilterPredicate(expression);
>    ...
> }
>
> TableFactory and TableSource are not directly exposed to end users; all the
> methods are called by the planner, not by users.
> Users always use DDL or a descriptor to register a table, and the planner will
> find the factory and create sources according to the properties.
> All the optimizations are applied automatically, e.g. filter/projection
> pushdown; users don't need to call `applyFilterPredicate` explicitly.
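
A sketch of the planner-side flow described above, reusing the
applyFilters(List<ResolvedExpression>) signature that appears later in this
thread; SupportsFilterablePushDown is from the FLIP, while pushDownIfPossible
and the surrounding names are illustrative only:

// Inside the planner's optimization phase, never in user code:
void pushDownIfPossible(DynamicTableSource source,
                        List<ResolvedExpression> resolvedFilters) {
    if (source instanceof SupportsFilterablePushDown) {
        // The planner, not the end user, hands the resolved filter
        // expressions to the source while compiling the query.
        ((SupportsFilterablePushDown) source).applyFilters(resolvedFilters);
    }
    // Otherwise nothing is pushed down and the filter stays in the plan.
}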
>
>
>
> On Wed, 25 Mar 2020 at 09:25, Becket Qin <be...@gmail.com> wrote:
>
> > Hi Timo and Dawid,
> >
> > Thanks for the clarification. They really help. You are right that we are
> > on the same page regarding the hierarchy. I think the only difference
> > between our views is the flavor of the interfaces. There are two flavors
> > of the source interface for DataStream and Table source.
> >
> > *Flavor 1. Table Sources are some wrapper interfaces around DataStream
> > source.*
> > Following this way, we will reach the design of the current proposal, i.e.
> > each pluggable exposed in the DataStream source will have a corresponding
> > TableSource interface counterpart, which is at the Factory level. Users
> > will write code like this:
> >
> > {
> >     MyTableSource myTableSource = MyTableSourceFactory.create();
> >     myTableSource.setSchema(mySchema);
> >     myTableSource.applyFilterPredicate(expression);
> >     ...
> > }
> >
> > The good thing for this flavor is that from the SQL / Table's perspective,
> > there is a dedicated set of Table oriented interfaces.
> >
> > The downsides are:
> > A. From the user's perspective, DataStream Source and Table Source are
> > just two different sets of interfaces, regardless of how they are the same
> > internally.
> > B. The source developers have to develop against those two sets of
> > interfaces in order to support both DataStream and Table.
> > C. It is not explicit that DataStream can actually share the pluggables in
> > Table / SQL. For example, in order to provide a filter pluggable with a
> > SQL expression, users will have to know the actual converter class that
> > converts the expression to the filter predicate and construct that
> > converter by themselves.
> >
> > ---------------
> >
> > *Flavor 2. A TableSource is a DataStream source with a bunch of
> > pluggables. No Table specific interfaces at all.*
> > Following this way, we will reach another design where you have a
> > SourceFactory and a single Pluggable factory for all the table
> > pluggables.
> > And users will write something like:
> >
> > {
> >     Deserializer<Row> myTableDeserializer =
> >         MyTablePluggableFactory.createDeserializer(schema);
> >     MySource<Row> mySource = MySourceFactory.create(properties,
> >         myTableDeserializer);
> >     mySource.applyPredicate(
> >         MyTablePluggableFactory.createFilterPredicate(expression));
> > }
> >
> > The good thing for this flavor is that there is just one set of interfaces
> > that works for both Table and DataStream. There is no difference between
> > creating a DataStream source and creating a Table source. DataStream can
> > easily reuse the pluggables from the Table sources.
> >
> > The downside is that Table / SQL won't have a dedicated API for
> > optimization. Instead of writing:
> >
> > if (MyTableSource instanceOf FilterableTableSource) {
> >      // Some filter push down logic.
> >      MyTableSource.applyPredicate(expression)
> > }
> >
> > One would have to write:
> >
> > if (MySource instanceOf FilterableSource) {
> >     // Some filter push down logic.
> >     mySource.applyPredicate(
> >         MyTablePluggableFactory.createFilterPredicate(expression));
> > }
> >
> > -------------------------
> >
> > Just to be clear, I am not saying flavor 2 is necessarily better than
> > flavor 1, but I want to make sure flavor 2 is also considered and
> > discussed.
> >
> > Thanks,
> >
> > Jiangjie (Becket) Qin.
> >
> > On Tue, Mar 24, 2020 at 10:53 PM Dawid Wysakowicz <
> dwysakowicz@apache.org>
> > wrote:
> >
> > > Hi Becket,
> > >
> > > I really think we don't have differing opinions. We might not see the
> > > changes in the same way yet. Personally I think of the DynamicTableSource
> > > as a factory for a Source implemented for the DataStream API. The
> > > important fact is that the DynamicTableSource and all feature traits
> > > (SupportsFilterablePushDown, SupportsProjectPushDown etc.) work with
> > > Table API concepts such as e.g. Expressions, SQL specific types etc. In
> > > the end what the implementation would resemble is (bear in mind I
> > > tremendously simplified the example, just to show the relation between
> > > the two APIs):
> > >
> > > SupportsFilterablePushDown {
> > >
> > >   applyFilters(List<ResolvedExpression> filters) {
> > >
> > >     this.filters = convertToDataStreamFilters(filters);
> > >
> > >   }
> > >
> > >   Source createSource() {
> > >
> > >         return Source.create()
> > >
> > >           .applyFilters(this.filters);
> > >
> > >    }
> > >
> > > }
> > >
> > > or exactly as you said for the computed columns:
> > >
> > >
> > > SupportsComputedColumnsPushDown {
> > >
> > >
> > >
> > >   applyComputedColumn(ComputedColumnConverter converter) {
> > >
> > >     this.deserializationSchema = new DeserializationSchema<RowData> {
> > >
> > >       RowData deserialize(...) {
> > >
> > >         RowData row = format.deserialize(bytes); // original format,
> > >         // e.g. json, avro, etc.
> > >
> > >         return converter(row); // row enriched with computed columns
> > >
> > >       }
> > >
> > >     }
> > >
> > >   }
> > >
> > >   Source createSource() {
> > >
> > >         return Source.create()
> > >
> > >           .withDeserialization(deserializationSchema);
> > >
> > >    }
> > >
> > > }
> > >
> > > So to sum it up again, all those interfaces are factories that configure
> > > appropriate parts of the DataStream API using Table API concepts.
> > > Finally, to answer your question for the particular comparisons:
> > >
> > > DynamicTableSource v.s. Source<Row, SourceSplitT, EnumChkT>
> > > SupportsFilterablePushDown v.s. FilterableSource
> > > SupportsProjectablePushDown v.s. ProjectableSource
> > > SupportsWatermarkPushDown v.s. WithWatermarkAssigner
> > > SupportsComputedColumnPushDown v.s. ComputedColumnDeserializer
> > > ScanTableSource v.s. ChangeLogDeserializer.
> > >
> > > pretty much you can think of all on the left as factories for the right
> > > side, left side works with Table API classes (Expressions, DataTypes).
> I
> > > hope this clarifies it a bit.
> > >
> > > Best,
> > >
> > > Dawid
> > > On 24/03/2020 15:03, Becket Qin wrote:
> > >
> > > Hey Kurt,
> > >
> > > I don't think DataStream should see some SQL specific concepts such as
> > >
> > > Filtering or ComputedColumn.
> > >
> > > Projectable and Filterable seem not necessarily SQL concepts, but could
> > > be applicable to a DataStream source as well to reduce the network load.
> > > For example ORC and Parquet should probably also be readable from
> > > DataStream, right?
> > >
> > > ComputedColumn is not part of the Source, it is an interface extending
> > > the Deserializer, which is a pluggable for the Source. From the SQL's
> > > perspective it has the concept of computed column, but from the Source
> > > perspective, it is essentially a Deserializer which also converts the
> > > records internally, assuming we allow some conversion to be embedded
> > > into the source in addition to just deserialization.
> > >
> > > Thanks,
> > >
> > > Jiangjie (Becket) Qin
> > >
> > > On Tue, Mar 24, 2020 at 9:36 PM Jark Wu <im...@gmail.com> wrote:
> > >
> > >
> > > Thanks Timo for updating the formats section. That would be very helpful
> > > for changelog supporting (FLIP-105).
> > >
> > > I just left 2 minor comments about some method names. In general, I'm +1
> > > to start a voting.
> > >
> > >
> > >
> >
> --------------------------------------------------------------------------------------------------
> > >
> > > Hi Becket,
> > >
> > > I agree we shouldn't duplicate code, especially the runtime
> > > implementations.
> > > However, the interfaces proposed by FLIP-95 are mainly used during
> > > optimization (compiling), not runtime.
> > > I don't think there is much to share for this, because table/sql
> > > is declarative, but DataStream is imperative.
> > > For example, for filter push down, a DataStream FilterableSource may
> > > allow to accept a FilterFunction (which is a black box for the source).
> > > However, table sources should pick the pushed filter expressions; some
> > > sources may only support "=", "<", ">" conditions.
> > > Pushing a FilterFunction doesn't work in the table ecosystem. That means
> > > the connectors have to have some table-specific implementations.
> > >
> > >
> > > Best,
> > > Jark
> > >
> > > On Tue, 24 Mar 2020 at 20:41, Kurt Young <yk...@gmail.com> wrote:
> > >
> > >
> > > Hi Becket,
> > >
> > > I don't think DataStream should see some SQL specific concepts such as
> > > Filtering or ComputedColumn. It's better to stay within the SQL area and
> > > translate to a more generic concept when translating to the
> > > DataStream/Runtime layer, such as using a MapFunction to represent
> > > computed column logic.
> > >
> > > Best,
> > > Kurt
> > >
> > >
> > > On Tue, Mar 24, 2020 at 5:47 PM Becket Qin <be...@gmail.com> wrote:
> > >
> > >
> > > Hi Timo and Dawid,
> > >
> > > It's really great that we have the same goal. I am actually wondering if
> > > we can go one step further to avoid some of the interfaces in Table as
> > > well.
> > >
> > > For example, if we have the FilterableSource, do we still need the
> > > FilterableTableSource? Should DynamicTableSource just become a
> > > Source<Row,
> > > SourceSplitT, EnumChkT>?
> > >
> > > Can you help me understand a bit more about the reason we need the
> > > following relational representation / wrapper interfaces v.s. the
> > > interfaces that we could put to the Source in FLIP-27?
> > >
> > > DynamicTableSource v.s. Source<Row, SourceSplitT, EnumChkT>
> > > SupportsFilterablePushDown v.s. FilterableSource
> > > SupportsProjectablePushDown v.s. ProjectableSource
> > > SupportsWatermarkPushDown v.s. WithWatermarkAssigner
> > > SupportsComputedColumnPushDown v.s. ComputedColumnDeserializer
> > > ScanTableSource v.s. ChangeLogDeserializer.
> > > LookUpTableSource v.s. LookUpSource
> > >
> > > Assuming we have all the interfaces on the right side, do we still need
> > > the interfaces on the left side? Note that the interfaces on the right
> > > can be used by both DataStream and Table. If we do this, there will only
> > > be one set of Source interfaces for Table and DataStream, the only
> > > difference is that the Source for table will have some specific plugins
> > > and configurations. An omnipotent Source can implement all the above
> > > interfaces and take a Deserializer that implements both
> > > ComputedColumnDeserializer and ChangeLogDeserializer.
> > >
> > > Would the SQL planner work with that?
> > >
> > > Thanks,
> > >
> > > Jiangjie (Becket) Qin
> > >
> > >
> > >
> > >
> > >
> > >
> > > On Tue, Mar 24, 2020 at 5:03 PM Jingsong Li <ji...@gmail.com> wrote:
> > >
> > >
> > > +1. Thanks Timo for the design doc.
> > >
> > > We can also consider @Experimental too. But I am +1 to @PublicEvolving;
> > > we should be confident in the current change.
> > >
> > > Best,
> > > Jingsong Lee
> > >
> > > On Tue, Mar 24, 2020 at 4:30 PM Timo Walther <tw...@apache.org> wrote:
> > >
> > > @Becket: We totally agree that we don't need table specific connectors
> > > during runtime. As Dawid said, the interfaces proposed here are just for
> > > communication with the planner. Once the properties (watermarks,
> > > computed column, filters, projection etc.) are negotiated, we can
> > > configure a regular Flink connector.
> > >
> > > E.g. setting the watermark assigner and deserialization schema of a
> > > Kafka connector.
> > >
> > > For better separation of concerns, Flink connectors should not include
> > > relational interfaces and depend on flink-table. This is the
> > > responsibility of table source/sink.
> > >
> > > @Kurt: I would like to mark them @PublicEvolving already because we need
> > > to deprecate the old interfaces as early as possible. We cannot redirect
> > > to @Internal interfaces. They are not marked @Public, so we can still
> > > evolve them. But a core design shift should not happen again; it would
> > > leave a bad impression if we redesign over and over again. Instead
> > > we should be confident in the current change.
> > >
> > > Regards,
> > > Timo
> > >
> > >
> > > On 24.03.20 09:20, Dawid Wysakowicz wrote:
> > >
> > > Hi Becket,
> > >
> > > Answering your question, we have the same intention not to duplicate
> > > connectors between datastream and table apis. The interfaces proposed in
> > > the FLIP are a way to describe relational properties of a source. The
> > > intention is, as you described, to translate all of those expressed as
> > > expressions or other Table specific structures into a DataStream source.
> > > In other words I think what we are doing here is in line with what you
> > > described.
> > >
> > > Best,
> > >
> > > Dawid
> > >
> > > On 24/03/2020 02:23, Becket Qin wrote:
> > >
> > > Hi Timo,
> > >
> > > Thanks for the proposal. I completely agree that the current Table
> > > connectors could be simplified quite a bit. I haven't finished reading
> > > everything, but here are some quick thoughts.
> > >
> > > Actually to me the biggest question is why should there be two different
> > > connector systems for DataStream and Table? What is the fundamental
> > > reason that is preventing us from merging them to one?
> > >
> > > The basic functionality of a connector is to provide capabilities to do
> > > IO and Serde. Conceptually, Table connectors should just be DataStream
> > > connectors that are dealing with Rows. It seems that quite a few of the
> > > special connector requirements are just a specific way to do IO / Serde.
> > >
> > > Taking SupportsFilterPushDown as an example, imagine we have the
> > > following interface:
> > >
> > > interface FilterableSource<PREDICATE> {
> > >      void applyFilterable(Supplier<PREDICATE> predicate);
> > > }
> > >
> > > And if a ParquetSource would like to support filterable, it will become:
> > >
> > > class ParquetSource implements Source,
> > >
> > > FilterableSource<FilterPredicate> {
> > >
> > >      ...
> > > }
> > >
> > > For Table, one just needs to provide a predicate supplier that converts
> > > an Expression to the specified predicate type. This has a few benefits:
> > >
> > > 1. Same unified API for filterable for sources, regardless of DataStream
> > > or Table.
> > > 2. The DataStream users now can also use the ExpressionToPredicate
> > > supplier if they want to.
> > >
> > > To summarize, my main point is that I am wondering if it is possible to
> > > have a single set of connector interfaces for both Table and DataStream,
> > > rather than having two hierarchies. I am not 100% sure if this would
> > > work, but if it works, this would be a huge win from both a code
> > > maintenance and a user experience perspective.
> > >
> > > Thanks,
> > >
> > > Jiangjie (Becket) Qin
> > >
> > >
> > >
> > > On Tue, Mar 24, 2020 at 2:03 AM Dawid Wysakowicz <dwysakowicz@apache.org> wrote:
> > >
> > >
> > > Hi Timo,
> > >
> > > Thank you for the proposal. I think it is an important improvement that
> > > will benefit many parts of the Table API. The proposal looks really good
> > > to me and personally I would be comfortable with voting on the current
> > > state.
> > >
> > > Best,
> > >
> > > Dawid
> > >
> > > On 23/03/2020 18:53, Timo Walther wrote:
> > >
> > > Hi everyone,
> > >
> > > I received some questions around how the new interfaces play together
> > > with formats and their factories.
> > >
> > > Furthermore, for MySQL or Postgres CDC logs, the format should be able
> > > to return a `ChangelogMode`.
> > >
> > > Also, I incorporated the feedback around the factory design in general.
> > > I added a new section `Factory Interfaces` to the design document.
> > > This should be helpful for understanding the big picture and connecting
> > > the concepts.
> > >
> > > Please let me know what you think.
> > >
> > > Thanks,
> > > Timo
> > >
> > >
> > > On 18.03.20 13:43, Timo Walther wrote:
> > >
> > > Hi Benchao,
> > >
> > > this is a very good question. I will update the FLIP about this.
> > >
> > > The legacy planner will not support the new interfaces. It will only
> > > support the old interfaces. With the next release, I think the Blink
> > > planner is stable enough to be the default one as well.
> > >
> > > Regards,
> > > Timo
> > >
> > > On 18.03.20 08:45, Benchao Li wrote:
> > >
> > > Hi Timo,
> > >
> > > Thank you and others for the efforts to prepare this FLIP.
> > >
> > > The FLIP LGTM generally.
> > >
> > > +1 for moving blink data structures to table-common, it's useful to
> > > udf too in the future.
> > > A little question is, do we plan to support the new interfaces and data
> > > types in the legacy planner? Or do we only plan to support these new
> > > interfaces in the blink planner?
> > > And using primary keys from DDL instead of derived key information from
> > > each query is also a good idea;
> > > we met some use cases where this does not work very well before.
> > > This FLIP also makes the dependencies of table modules more clear, I
> > > like it very much.
> > >
> > > Timo Walther <tw...@apache.org> wrote on Tue, Mar 17, 2020 at 1:36 AM:
> > >
> > >
> > > Hi everyone,
> > >
> > > I'm happy to present the results of long discussions that
> > >
> > > we
> > >
> > > had
> > >
> > > internally. Jark, Dawid, Aljoscha, Kurt, Jingsong, me, and
> > >
> > > many
> > >
> > > more
> > >
> > > have contributed to this design document.
> > >
> > > We would like to propose new long-term table source and
> > >
> > > table
> > >
> > > sink
> > >
> > > interfaces:
> > >
> > >
> > >
> > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-95%3A+New+TableSource+and+TableSink+interfaces
> > >
> > > This is a requirement for FLIP-105 and finalizing FLIP-32.
> > >
> > > The goals of this FLIP are:
> > >
> > > - Simplify the current interface architecture:
> > >        - Merge upsert, retract, and append sinks.
> > >        - Unify batch and streaming sources.
> > >        - Unify batch and streaming sinks.
> > >
> > > - Allow sources to produce a changelog:
> > >        - UpsertTableSources have been requested a lot by
> > >
> > > users.
> > >
> > > Now
> > >
> > > is the
> > > time to open the internal planner capabilities via the new
> > >
> > > interfaces.
> > >
> > >        - According to FLIP-105, we would like to support
> > >
> > > changelogs for
> > >
> > > processing formats such as Debezium.
> > >
> > > - Don't rely on DataStream API for source and sinks:
> > >        - According to FLIP-32, the Table API and SQL should
> > >
> > > be
> > >
> > > independent
> > > of the DataStream API which is why the `table-common`
> > >
> > > module
> > >
> > > has
> > >
> > > no
> > >
> > > dependencies on `flink-streaming-java`.
> > >        - Source and sink implementations should only depend
> > >
> > > on
> > >
> > > the
> > >
> > > `table-common` module after FLIP-27.
> > >        - Until FLIP-27 is ready, we still put most of the
> > >
> > > interfaces in
> > >
> > > `table-common` and strictly separate interfaces that
> > >
> > > communicate
> > >
> > > with a
> > > planner and actual runtime reader/writers.
> > >
> > > - Implement efficient sources and sinks without planner
> > >
> > > dependencies:
> > >
> > >        - Make Blink's internal data structures available to
> > >
> > > connectors.
> > >
> > >        - Introduce stable interfaces for data structures
> > >
> > > that
> > >
> > > can
> > >
> > > be
> > >
> > > marked as `@PublicEvolving`.
> > >        - Only require dependencies on `flink-table-common`
> > >
> > > in
> > >
> > > the
> > >
> > > future
> > >
> > > It finalizes the concept of dynamic tables and considers how all
> > > source/sink related classes play together.
> > >
> > > We look forward to your feedback.
> > >
> > > Regards,
> > > Timo
> > >
> > >
> > > --
> > > Best, Jingsong Lee
> > >
> > >
> > >
> >
>

Re: [DISCUSS] FLIP-95: New TableSource and TableSink interfaces

Posted by Jark Wu <im...@gmail.com>.
Hi Becket,

Regarding Flavor 1 and Flavor 2, I want to clarify that users will never
use a table source like this:

{
   MyTableSource myTableSource = MyTableSourceFactory.create();
   myTableSource.setSchema(mySchema);
   myTableSource.applyFilterPredicate(expression);
   ...
}

TableFactory and TableSource are not directly exposed to end users; all the
methods are called by the planner, not by users.
Users always use DDL or a descriptor to register a table, and the planner will
find the factory and create sources according to the properties.
All the optimizations are applied automatically, e.g. filter/projection
pushdown, so users don't need to call `applyFilterPredicate` explicitly.
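
To illustrate, a minimal sketch of what the user actually writes (assuming a
TableEnvironment called `tableEnv`; the connector properties below are
placeholders, not a complete table definition):

// Users only declare the table in DDL; the planner finds the factory,
// creates the source, and applies optimizations during planning.
tableEnv.sqlUpdate(
    "CREATE TABLE orders (" +
    "  order_id BIGINT," +
    "  amount DOUBLE" +
    ") WITH (" +
    "  'connector.type' = 'kafka'," +
    "  'format.type' = 'json'" +
    ")");

// The WHERE condition is pushed into the source automatically if the
// source supports it; the user never calls applyFilterPredicate.
Table result = tableEnv.sqlQuery(
    "SELECT order_id FROM orders WHERE amount > 10");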



On Wed, 25 Mar 2020 at 09:25, Becket Qin <be...@gmail.com> wrote:

> Hi Timo and Dawid,
>
> Thanks for the clarification. They really help. You are right that we are
> on the same page regarding the hierarchy. I think the only difference
> between our views is the flavor of the interfaces. There are two flavors of
> the source interface for DataStream and Table source.
>
> *Flavor 1. Table Sources are some wrapper interfaces around DataStream
> source.*
> Following this way, we will reach the design of the current proposal, i.e.
> each pluggable exposed in the DataStream source will have a corresponding
> TableSource interface counterpart at the Factory level. Users
> will write code like this:
>
> {
>     MyTableSource myTableSource = MyTableSourceFactory.create();
>     myTableSource.setSchema(mySchema);
>     myTableSource.applyFilterPredicate(expression);
>     ...
> }
>
> The good thing for this flavor is that from the SQL / Table's perspective,
> there is a dedicated set of Table oriented interface.
>
> The downsides are:
> A. From the user's perspective, DataStream Source and Table Source are just
> two different sets of interfaces, regardless of how similar they are
> internally.
> B. The source developers have to develop for those two sets of interfaces
> in order to support both DataStream and Table.
> C. It is not explicit that DataStream can actually share the pluggable in
> Table / SQL. For example, in order to provide a filter pluggable with SQL
> expression, users will have to know the actual converter class that
> converts the expression to the filter predicate and construct that
> converter by themselves.
>
> ---------------
>
> *Flavor 2. A TableSource is a DataStream source with a bunch of pluggables.
> No Table specific interfaces at all.*
> Following this way, we will reach another design where you have a
> SourceFactory and a single Pluggable factory for all the table pluggables.
> And users will write something like:
>
> {
>     Deserializer<Row> myTableDeserializer =
>         MyTablePluggableFactory.createDeserializer(schema);
>     MySource<Row> mySource =
>         MySourceFactory.create(properties, myTableDeserializer);
>
>     mySource.applyPredicate(
>         MyTablePluggableFactory.createFilterPredicate(expression));
> }
>
> The good thing for this flavor is that there is just one set of interface
> that works for both Table and DataStream. There is no difference between
> creating a DataStream source and creating a Table source. DataStream can
> easily reuse the pluggables from the Table sources.
>
> The downside is that Table / SQL won't have a dedicated API for
> optimization. Instead of writing:
>
> if (myTableSource instanceof FilterableTableSource) {
>      // Some filter push down logic.
>      myTableSource.applyPredicate(expression);
> }
>
> One has to write:
>
> if (mySource instanceof FilterableSource) {
>     // Some filter push down logic.
>     mySource.applyPredicate(
>         MyTablePluggableFactory.createFilterPredicate(expression));
> }
>
> -------------------------
>
> Just to be clear, I am not saying flavor 2 is necessarily better than
> flavor 1, but I want to make sure flavor 2 is also considered and
> discussed.
>
> Thanks,
>
> Jiangjie (Becket) Qin.
>
> On Tue, Mar 24, 2020 at 10:53 PM Dawid Wysakowicz <dw...@apache.org>
> wrote:
>
> > Hi Becket,
> >
> > I really think we don't have differing opinions. We might not see the
> > changes in the same way yet. Personally I think of the DynamicTableSource
> > as a factory for a Source implemented for the DataStream API. The
> > important fact is that the DynamicTableSource and all feature traits
> Table
> > API concepts such as e.g. Expressions, SQL specific types etc. In the end
> > what the implementation would resemble is (bear in mind I tremendously
> > simplified the example, just to show the relation between the two APIs):
> >
> > SupportsFilterablePushDown {
> >
> >   applyFilters(List<ResolvedExpression> filters) {
> >
> >     this.filters = convertToDataStreamFilters(filters);
> >
> >   }
> >
> >   Source createSource() {
> >
> >         return Source.create()
> >
> >           .applyFilters(this.filters);
> >
> >    }
> >
> > }
> >
> > or exactly as you said for the computed columns:
> >
> >
> > SupportsComputedColumnsPushDown {
> >
> >
> >
> >   applyComputedColumn(ComputedColumnConverter converter) {
> >
> >     this.deserializationSchema = new DeserializationSchema<RowData> {
> >
> >       RowData deserialize(...) {
> >
> >         RowData row = format.deserialize(bytes); // original format,
> >         // e.g. json, avro, etc.
> >
> >         return converter(row); // enrich the row with the computed columns
> >
> >       }
> >
> >     }
> >
> >   }
> >
> >   Source createSource() {
> >
> >         return Source.create()
> >
> >           .withDeserialization(deserializationSchema);
> >
> >    }
> >
> > }
> >
> > So to sum it up again, all those interfaces are factories that configure
> > appropriate parts of the DataStream API using Table API concepts. Finally
> > to answer your question for particular comparisons:
> >
> > DynamicTableSource v.s. Source<Row, SourceSplitT, EnumChkT>
> > SupportsFilterablePushDown v.s. FilterableSource
> > SupportsProjectablePushDown v.s. ProjectableSource
> > SupportsWatermarkPushDown v.s. WithWatermarkAssigner
> > SupportsComputedColumnPushDown v.s. ComputedColumnDeserializer
> > ScanTableSource v.s. ChangeLogDeserializer.
> >
> > pretty much you can think of all on the left as factories for the right
> > side, left side works with Table API classes (Expressions, DataTypes). I
> > hope this clarifies it a bit.
> >
> > Best,
> >
> > Dawid
> > On 24/03/2020 15:03, Becket Qin wrote:
> >
> > Hey Kurt,
> >
> > I don't think DataStream should see some SQL specific concepts such as
> >
> > Filtering or ComputedColumn.
> >
> > Projectable and Filterable seem not necessarily SQL concepts, but could
> > be applicable to a DataStream source as well to reduce the network load. For
> > example ORC and Parquet should probably also be readable from DataStream,
> > right?
> >
> > ComputedColumn is not part of the Source, it is an interface extending the
> > Deserializer, which is a pluggable for the Source. From the SQL's
> > perspective it has the concept of computed column, but from the Source
> > perspective, it is essentially a Deserializer which also converts the
> > records internally, assuming we allow some conversion to be embedded to
> > the source in addition to just deserialization.
> >
> > Thanks,
> >
> > Jiangjie (Becket) Qin
> >
> > On Tue, Mar 24, 2020 at 9:36 PM Jark Wu <im...@gmail.com> wrote:
> >
> >
> > Thanks Timo for updating the formats section. That would be very helpful
> > for changelog supporting (FLIP-105).
> >
> > I just left 2 minor comments about some method names. In general, I'm +1
> > to start a voting.
> >
> >
> >
> --------------------------------------------------------------------------------------------------
> >
> > Hi Becket,
> >
> > I agree we shouldn't duplicate code, especially the runtime
> > implementations.
> > However, the interfaces proposed by FLIP-95 are mainly used during
> > optimization (compiling), not runtime.
> > I don't think there is much to share for this. Because table/sql
> > is declarative, but DataStream is imperative.
> > For example, filter push down, DataStream FilterableSource may allow to
> > accept a FilterFunction (which is a black box for the source).
> > However, table sources should pick the pushed filter expressions, some
> > sources may only support "=", "<", ">" conditions.
> > Pushing a FilterFunction doesn't work in the table ecosystem. That means the
> > connectors have to have some table-specific implementations.
> >
> >
> > Best,
> > Jark
> >
> > On Tue, 24 Mar 2020 at 20:41, Kurt Young <yk...@gmail.com> wrote:
> >
> >
> > Hi Becket,
> >
> > I don't think DataStream should see some SQL specific concepts such as
> > Filtering or ComputedColumn. It's
> > better to stay within SQL area and translate to more generic concept when
> > translating to DataStream/Runtime
> > layer, such as use MapFunction to represent computed column logic.
> >
> > Best,
> > Kurt
> >
> >
> > On Tue, Mar 24, 2020 at 5:47 PM Becket Qin <be...@gmail.com> wrote:
> >
> >
> > Hi Timo and Dawid,
> >
> > It's really great that we have the same goal. I am actually wondering
> >
> > if
> >
> > we
> >
> > can go one step further to avoid some of the interfaces in Table as
> >
> > well.
> >
> > For example, if we have the FilterableSource, do we still need the
> > FilterableTableSource? Should DynamicTableSource just become a
> > Source<Row,
> > SourceSplitT, EnumChkT>?
> >
> > Can you help me understand a bit more about the reason we need the
> > following relational representation / wrapper interfaces v.s. the
> > interfaces that we could put to the Source in FLIP-27?
> >
> > DynamicTableSource v.s. Source<Row, SourceSplitT, EnumChkT>
> > SupportsFilterablePushDown v.s. FilterableSource
> > SupportsProjectablePushDown v.s. ProjectableSource
> > SupportsWatermarkPushDown v.s. WithWatermarkAssigner
> > SupportsComputedColumnPushDown v.s. ComputedColumnDeserializer
> > ScanTableSource v.s. ChangeLogDeserializer.
> > LookUpTableSource v.s. LookUpSource
> >
> > Assuming we have all the interfaces on the right side, do we still need
> >
> > the
> >
> > interfaces on the left side? Note that the interfaces on the right can
> >
> > be
> >
> > used by both DataStream and Table. If we do this, there will only be
> >
> > one
> >
> > set of Source interfaces for Table and DataStream, the only difference is
> >
> > that
> >
> > the Source for table will have some specific plugins and
> >
> > configurations.
> >
> > An
> >
> > omnipotent Source can implement all the above interfaces and take a
> > Deserializer that implements both ComputedColumnDeserializer and
> > ChangeLogDeserializer.
> >
> > Would the SQL planner work with that?
> >
> > Thanks,
> >
> > Jiangjie (Becket) Qin
> >
> >
> >
> >
> >
> >
> > On Tue, Mar 24, 2020 at 5:03 PM Jingsong Li <ji...@gmail.com> wrote:
> >
> >
> > +1. Thanks Timo for the design doc.
> >
> > We can also consider @Experimental too. But I am +1 to
> >
> > @PublicEvolving,
> >
> > we
> >
> > should be confident in the current change.
> >
> > Best,
> > Jingsong Lee
> >
> > On Tue, Mar 24, 2020 at 4:30 PM Timo Walther <tw...@apache.org> wrote:
> >
> > @Becket: We totally agree that we don't need table specific
> >
> > connectors
> >
> > during runtime. As Dawid said, the interfaces proposed here are
> >
> > just
> >
> > for
> >
> > communication with the planner. Once the properties (watermarks,
> > computed column, filters, projection etc.) are negotiated, we can
> > configure a regular Flink connector.
> >
> > E.g. setting the watermark assigner and deserialization schema of a
> > Kafka connector.
> >
> > For better separation of concerns, Flink connectors should not
> >
> > include
> >
> > relational interfaces and depend on flink-table. This is the
> > responsibility of table source/sink.
> >
> > @Kurt: I would like to mark them @PublicEvolving already because we
> >
> > need
> >
> > to deprecate the old interfaces as early as possible. We cannot
> >
> > redirect
> >
> > to @Internal interfaces. They are not marked @Public, so we can
> >
> > still
> >
> > evolve them. But a core design shift should not happen again, it
> >
> > would
> >
> > leave a bad impression if we redesign over and over again.
> >
> > Instead
> >
> > we should be confident in the current change.
> >
> > Regards,
> > Timo
> >
> >
> > On 24.03.20 09:20, Dawid Wysakowicz wrote:
> >
> > Hi Becket,
> >
> > Answering your question, we have the same intention not to
> >
> > duplicate
> >
> > connectors between datastream and table apis. The interfaces
> >
> > proposed
> >
> > in
> >
> > the FLIP are a way to describe relational properties of a source.
> >
> > The
> >
> > intention is as you described to translate all of those expressed
> >
> > as
> >
> > expressions or other Table specific structures into a DataStream
> >
> > source.
> >
> > In other words I think what we are doing here is in line with
> >
> > what
> >
> > you
> >
> > described.
> >
> > Best,
> >
> > Dawid
> >
> > On 24/03/2020 02:23, Becket Qin wrote:
> >
> > Hi Timo,
> >
> > Thanks for the proposal. I completely agree that the current
> >
> > Table
> >
> > connectors could be simplified quite a bit. I haven't finished
> >
> > reading
> >
> > everything, but here are some quick thoughts.
> >
> > Actually to me the biggest question is why should there be two
> >
> > different
> >
> > connector systems for DataStream and Table? What is the
> >
> > fundamental
> >
> > reason
> >
> > that is preventing us from merging them to one?
> >
> > The basic functionality of a connector is to provide
> >
> > capabilities
> >
> > to
> >
> > do
> >
> > IO
> >
> > and Serde. Conceptually, Table connectors should just be
> >
> > DataStream
> >
> > connectors that are dealing with Rows. It seems that quite a few
> >
> > of
> >
> > the
> >
> > special connector requirements are just a specific way to do IO
> >
> > /
> >
> > Serde.
> >
> > Taking SupportsFilterPushDown as an example, imagine we have the
> >
> > following
> >
> > interface:
> >
> > interface FilterableSource<PREDICATE> {
> >      void applyFilterable(Supplier<PREDICATE> predicate);
> > }
> >
> > And if a ParquetSource would like to support filterable, it will
> >
> > become:
> >
> > class ParquetSource implements Source,
> >
> > FilterableSource<FilterPredicate> {
> >
> >      ...
> > }
> >
> > For Table, one just needs to provide a predicate supplier that
> >
> > converts
> >
> > an
> >
> > Expression to the specified predicate type. This has a few
> >
> > benefits:
> >
> > 1. Same unified API for filterable for sources, regardless of
> >
> > DataStream or
> >
> > Table.
> > 2. The DataStream users now can also use the
> >
> > ExpressionToPredicate
> >
> > supplier if they want to.
> >
> > To summarize, my main point is that I am wondering if it is
> >
> > possible
> >
> > to
> >
> > have a single set of connector interface for both Table and
> >
> > DataStream,
> >
> > rather than having two hierarchies. I am not 100% sure if this
> >
> > would
> >
> > work,
> >
> > but if it works, this would be a huge win from both code
> >
> > maintenance
> >
> > and
> >
> > user experience perspective.
> >
> > Thanks,
> >
> > Jiangjie (Becket) Qin
> >
> >
> >
> > On Tue, Mar 24, 2020 at 2:03 AM Dawid Wysakowicz <
> >
> > dwysakowicz@apache.org>
> >
> > wrote:
> >
> >
> > Hi Timo,
> >
> > Thank you for the proposal. I think it is an important
> >
> > improvement
> >
> > that
> >
> > will benefit many parts of the Table API. The proposal looks
> >
> > really
> >
> > good
> >
> > to me and personally I would be comfortable with voting on the
> >
> > current
> >
> > state.
> >
> > Best,
> >
> > Dawid
> >
> > On 23/03/2020 18:53, Timo Walther wrote:
> >
> > Hi everyone,
> >
> > I received some questions around how the new interfaces play
> >
> > together
> >
> > with formats and their factories.
> >
> > Furthermore, for MySQL or Postgres CDC logs, the format should
> >
> > be
> >
> > able
> >
> > to return a `ChangelogMode`.
> >
> > Also, I incorporated the feedback around the factory design in
> >
> > general.
> >
> > I added a new section `Factory Interfaces` to the design
> >
> > document.
> >
> > This should be helpful to understand the big picture and
> >
> > connecting
> >
> > the concepts.
> >
> > Please let me know what you think?
> >
> > Thanks,
> > Timo
> >
> >
> > On 18.03.20 13:43, Timo Walther wrote:
> >
> > Hi Benchao,
> >
> > this is a very good question. I will update the FLIP about
> >
> > this.
> >
> > The legacy planner will not support the new interfaces. It
> >
> > will
> >
> > only
> >
> > support the old interfaces. With the next release, I think
> >
> > the
> >
> > Blink
> >
> > planner is stable enough to be the default one as well.
> >
> > Regards,
> > Timo
> >
> > On 18.03.20 08:45, Benchao Li wrote:
> >
> > Hi Timo,
> >
> > Thank you and others for the efforts to prepare this FLIP.
> >
> > The FLIP LGTM generally.
> >
> > +1 for moving blink data structures to table-common, it's
> >
> > useful
> >
> > to
> >
> > udf too
> > in the future.
> > A little question is, do we plan to support the new
> >
> > interfaces
> >
> > and
> >
> > data
> >
> > types in legacy planner?
> > Or we only plan to support these new interfaces in blink
> >
> > planner.
> >
> > And using primary keys from DDL instead of derived key
> >
> > information
> >
> > from
> >
> > each query is also a good idea,
> > we met some use cases where this does not work very well
> >
> > before.
> >
> > This FLIP also makes the dependencies of table modules more
> >
> > clear, I
> >
> > like
> > it very much.
> >
> > Timo Walther <tw...@apache.org> wrote on Tue, Mar 17, 2020 at 1:36 AM:
> >
> >
> > Hi everyone,
> >
> > I'm happy to present the results of long discussions that
> >
> > we
> >
> > had
> >
> > internally. Jark, Dawid, Aljoscha, Kurt, Jingsong, me, and
> >
> > many
> >
> > more
> >
> > have contributed to this design document.
> >
> > We would like to propose new long-term table source and
> >
> > table
> >
> > sink
> >
> > interfaces:
> >
> >
> >
> >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-95%3A+New+TableSource+and+TableSink+interfaces
> >
> > This is a requirement for FLIP-105 and finalizing FLIP-32.
> >
> > The goals of this FLIP are:
> >
> > - Simplify the current interface architecture:
> >        - Merge upsert, retract, and append sinks.
> >        - Unify batch and streaming sources.
> >        - Unify batch and streaming sinks.
> >
> > - Allow sources to produce a changelog:
> >        - UpsertTableSources have been requested a lot by
> >
> > users.
> >
> > Now
> >
> > is the
> > time to open the internal planner capabilities via the new
> >
> > interfaces.
> >
> >        - According to FLIP-105, we would like to support
> >
> > changelogs for
> >
> > processing formats such as Debezium.
> >
> > - Don't rely on DataStream API for source and sinks:
> >        - According to FLIP-32, the Table API and SQL should
> >
> > be
> >
> > independent
> > of the DataStream API which is why the `table-common`
> >
> > module
> >
> > has
> >
> > no
> >
> > dependencies on `flink-streaming-java`.
> >        - Source and sink implementations should only depend
> >
> > on
> >
> > the
> >
> > `table-common` module after FLIP-27.
> >        - Until FLIP-27 is ready, we still put most of the
> >
> > interfaces in
> >
> > `table-common` and strictly separate interfaces that
> >
> > communicate
> >
> > with a
> > planner and actual runtime reader/writers.
> >
> > - Implement efficient sources and sinks without planner
> >
> > dependencies:
> >
> >        - Make Blink's internal data structures available to
> >
> > connectors.
> >
> >        - Introduce stable interfaces for data structures
> >
> > that
> >
> > can
> >
> > be
> >
> > marked as `@PublicEvolving`.
> >        - Only require dependencies on `flink-table-common`
> >
> > in
> >
> > the
> >
> > future
> >
> > It finalizes the concept of dynamic tables and considers how all
> > source/sink related classes play together.
> >
> > We look forward to your feedback.
> >
> > Regards,
> > Timo
> >
> >
> > --
> > Best, Jingsong Lee
> >
> >
> >
>

Re: [DISCUSS] FLIP-95: New TableSource and TableSink interfaces

Posted by Becket Qin <be...@gmail.com>.
Hi Timo and Dawid,

Thanks for the clarification. They really help. You are right that we are
on the same page regarding the hierarchy. I think the only difference
between our views is the flavor of the interfaces. There are two flavors of
the source interface for DataStream and Table source.

*Flavor 1. Table Sources are some wrapper interfaces around DataStream
source.*
Following this way, we will reach the design of the current proposal, i.e.
each pluggable exposed in the DataStream source will have a corresponding
TableSource interface counterpart at the Factory level. Users
will write code like this:

{
    MyTableSource myTableSource = MyTableSourceFactory.create();
    myTableSource.setSchema(mySchema);
    myTableSource.applyFilterPredicate(expression);
    ...
}

The good thing for this flavor is that from the SQL / Table's perspective,
there is a dedicated set of Table oriented interface.

The downsides are:
A. From the user's perspective, DataStream Source and Table Source are just
two different sets of interfaces, regardless of how similar they are
internally.
B. The source developers have to develop for those two sets of interfaces
in order to support both DataStream and Table.
C. It is not explicit that DataStream can actually share the pluggable in
Table / SQL. For example, in order to provide a filter pluggable with SQL
expression, users will have to know the actual converter class that
converts the expression to the filter predicate and construct that
converter by themselves.
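
For illustration, downside C means a DataStream user would have to write
something like the following (all names here are hypothetical, just to make
the point):

// The converter from SQL Expressions to the connector's predicate type
// is an internal detail of the table connector, so the DataStream user
// has to find and instantiate it manually.
ExpressionConverter<FilterPredicate> converter =
    new MyConnectorExpressionConverter(); // internal, not discoverable
FilterPredicate predicate = converter.convert(expression);
mySource.applyPredicate(predicate);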

---------------

*Flavor 2. A TableSource is a DataStream source with a bunch of pluggables.
No Table specific interfaces at all.*
Following this way, we will reach another design where you have a
SourceFactory and a single Pluggable factory for all the table pluggables.
And users will write something like:

{
    Deserializer<Row> myTableDeserializer =
        MyTablePluggableFactory.createDeserializer(schema);
    MySource<Row> mySource =
        MySourceFactory.create(properties, myTableDeserializer);

    mySource.applyPredicate(
        MyTablePluggableFactory.createFilterPredicate(expression));
}

The good thing for this flavor is that there is just one set of interfaces
that works for both Table and DataStream. There is no difference between
creating a DataStream source and creating a Table source. DataStream can
easily reuse the pluggables from the Table sources.

The downside is that Table / SQL won't have a dedicated API for
optimization. Instead of writing:

if (myTableSource instanceof FilterableTableSource) {
     // Some filter push down logic.
     myTableSource.applyPredicate(expression);
}

One has to write:

if (mySource instanceof FilterableSource) {
    // Some filter push down logic.
    mySource.applyPredicate(
        MyTablePluggableFactory.createFilterPredicate(expression));
}

-------------------------

Just to be clear, I am not saying flavor 2 is necessarily better than
flavor 1, but I want to make sure flavor 2 is also considered and discussed.

Thanks,

Jiangjie (Becket) Qin.

On Tue, Mar 24, 2020 at 10:53 PM Dawid Wysakowicz <dw...@apache.org>
wrote:

> Hi Becket,
>
> I really think we don't have differing opinions. We might not see the
> changes in the same way yet. Personally I think of the DynamicTableSource
> as a factory for a Source implemented for the DataStream API. The
> important fact is that the DynamicTableSource and all feature traits
> (SupportsFilterablePushDown, SupportsProjectPushDown etc.) work with Table
> API concepts such as e.g. Expressions, SQL specific types etc. In the end
> what the implementation would resemble is (bear in mind I tremendously
> simplified the example, just to show the relation between the two APIs):
>
> SupportsFilterablePushDown {
>
>   applyFilters(List<ResolvedExpression> filters) {
>
>     this.filters = convertToDataStreamFilters(filters);
>
>   }
>
>   Source createSource() {
>
>         return Source.create()
>
>           .applyFilters(this.filters);
>
>    }
>
> }
>
> or exactly as you said for the computed columns:
>
>
> SupportsComputedColumnsPushDown {
>
>
>
>   applyComputedColumn(ComputedColumnConverter converter) {
>
>     this.deserializationSchema = new DeserializationSchema<RowData> {
>
>       RowData deserialize(...) {
>
>         RowData row = format.deserialize(bytes); // original format,
>         // e.g. json, avro, etc.
>
>         return converter(row); // enrich the row with the computed columns
>
>       }
>
>     }
>
>   }
>
>   Source createSource() {
>
>         return Source.create()
>
>           .withDeserialization(deserializationSchema);
>
>    }
>
> }
>
> So to sum it up again, all those interfaces are factories that configure
> appropriate parts of the DataStream API using Table API concepts. Finally
> to answer your question for particular comparisons:
>
> DynamicTableSource v.s. Source<Row, SourceSplitT, EnumChkT>
> SupportsFilterablePushDown v.s. FilterableSource
> SupportsProjectablePushDown v.s. ProjectableSource
> SupportsWatermarkPushDown v.s. WithWatermarkAssigner
> SupportsComputedColumnPushDown v.s. ComputedColumnDeserializer
> ScanTableSource v.s. ChangeLogDeserializer.
>
> pretty much you can think of all on the left as factories for the right
> side, left side works with Table API classes (Expressions, DataTypes). I
> hope this clarifies it a bit.
>
> Best,
>
> Dawid
> On 24/03/2020 15:03, Becket Qin wrote:
>
> Hey Kurt,
>
> I don't think DataStream should see some SQL specific concepts such as
>
> Filtering or ComputedColumn.
>
> Projectable and Filterable seem not necessarily SQL concepts, but could be
> applicable to DataStream source as well to reduce the network load. For
> example ORC and Parquet should probably also be readable from DataStream,
> right?
>
> ComputedColumn is not part of the Source, it is an interface extending the
> Deserializer, which is a pluggable for the Source. From the SQL's
> perspective it has the concept of computed column, but from the Source
> perspective, it is essentially a Deserializer which also converts the
> records internally, assuming we allow some conversion to be embedded to
> the source in addition to just deserialization.
>
> Thanks,
>
> Jiangjie (Becket) Qin
>
> On Tue, Mar 24, 2020 at 9:36 PM Jark Wu <im...@gmail.com> wrote:
>
>
> Thanks Timo for updating the formats section. That would be very helpful
> for changelog supporting (FLIP-105).
>
> I just left 2 minor comments about some method names. In general, I'm +1 to
> start a voting.
>
>
> --------------------------------------------------------------------------------------------------
>
> Hi Becket,
>
> I agree we shouldn't duplicate code, especially the runtime
> implementations.
> However, the interfaces proposed by FLIP-95 are mainly used during
> optimization (compiling), not runtime.
> I don't think there is much to share for this. Because table/sql
> is declarative, but DataStream is imperative.
> For example, filter push down, DataStream FilterableSource may allow to
> accept a FilterFunction (which is a black box for the source).
> However, table sources should pick the pushed filter expressions, some
> sources may only support "=", "<", ">" conditions.
> Pushing a FilterFunction doesn't work in the table ecosystem. That means the
> connectors have to have some table-specific implementations.
>
>
> Best,
> Jark
>
> On Tue, 24 Mar 2020 at 20:41, Kurt Young <yk...@gmail.com> wrote:
>
>
> Hi Becket,
>
> I don't think DataStream should see some SQL specific concepts such as
> Filtering or ComputedColumn. It's
> better to stay within SQL area and translate to more generic concept when
> translating to DataStream/Runtime
> layer, such as use MapFunction to represent computed column logic.
>
> Best,
> Kurt
>
>
> On Tue, Mar 24, 2020 at 5:47 PM Becket Qin <be...@gmail.com> wrote:
>
>
> Hi Timo and Dawid,
>
> It's really great that we have the same goal. I am actually wondering
>
> if
>
> we
>
> can go one step further to avoid some of the interfaces in Table as
>
> well.
>
> For example, if we have the FilterableSource, do we still need the
> FilterableTableSource? Should DynamicTableSource just become a
> Source<Row,
> SourceSplitT, EnumChkT>?
>
> Can you help me understand a bit more about the reason we need the
> following relational representation / wrapper interfaces v.s. the
> interfaces that we could put to the Source in FLIP-27?
>
> DynamicTableSource v.s. Source<Row, SourceSplitT, EnumChkT>
> SupportsFilterablePushDown v.s. FilterableSource
> SupportsProjectablePushDown v.s. ProjectableSource
> SupportsWatermarkPushDown v.s. WithWatermarkAssigner
> SupportsComputedColumnPushDown v.s. ComputedColumnDeserializer
> ScanTableSource v.s. ChangeLogDeserializer.
> LookUpTableSource v.s. LookUpSource
>
> Assuming we have all the interfaces on the right side, do we still need
>
> the
>
> interfaces on the left side? Note that the interfaces on the right can
>
> be
>
> used by both DataStream and Table. If we do this, there will only be
>
> one
>
> set of Source interfaces for Table and DataStream, the only difference is
>
> that
>
> the Source for table will have some specific plugins and
>
> configurations.
>
> An
>
> omnipotent Source can implement all the above interfaces and take a
> Deserializer that implements both ComputedColumnDeserializer and
> ChangeLogDeserializer.
>
> Would the SQL planner work with that?
>
> Thanks,
>
> Jiangjie (Becket) Qin
>
>
>
>
>
>
> On Tue, Mar 24, 2020 at 5:03 PM Jingsong Li <ji...@gmail.com> wrote:
>
>
> +1. Thanks Timo for the design doc.
>
> We can also consider @Experimental too. But I am +1 to
>
> @PublicEvolving,
>
> we
>
> should be confident in the current change.
>
> Best,
> Jingsong Lee
>
> On Tue, Mar 24, 2020 at 4:30 PM Timo Walther <tw...@apache.org> wrote:
>
> @Becket: We totally agree that we don't need table specific
>
> connectors
>
> during runtime. As Dawid said, the interfaces proposed here are
>
> just
>
> for
>
> communication with the planner. Once the properties (watermarks,
> computed column, filters, projection etc.) are negotiated, we can
> configure a regular Flink connector.
>
> E.g. setting the watermark assigner and deserialization schema of a
> Kafka connector.
>
> For better separation of concerns, Flink connectors should not
>
> include
>
> relational interfaces and depend on flink-table. This is the
> responsibility of table source/sink.
>
> @Kurt: I would like to mark them @PublicEvolving already because we
>
> need
>
> to deprecate the old interfaces as early as possible. We cannot
>
> redirect
>
> to @Internal interfaces. They are not marked @Public, so we can
>
> still
>
> evolve them. But a core design shift should not happen again, it
>
> would
>
> leave a bad impression if we redesign over and over again.
>
> Instead
>
> we should be confident in the current change.
>
> Regards,
> Timo
>
>
> On 24.03.20 09:20, Dawid Wysakowicz wrote:
>
> Hi Becket,
>
> Answering your question, we have the same intention not to
>
> duplicate
>
> connectors between datastream and table apis. The interfaces
>
> proposed
>
> in
>
> the FLIP are a way to describe relational properties of a source.
>
> The
>
> intention is as you described to translate all of those expressed
>
> as
>
> expressions or other Table specific structures into a DataStream
>
> source.
>
> In other words I think what we are doing here is in line with
>
> what
>
> you
>
> described.
>
> Best,
>
> Dawid
>
> On 24/03/2020 02:23, Becket Qin wrote:
>
> Hi Timo,
>
> Thanks for the proposal. I completely agree that the current
>
> Table
>
> connectors could be simplified quite a bit. I haven't finished
>
> reading
>
> everything, but here are some quick thoughts.
>
> Actually to me the biggest question is why should there be two
>
> different
>
> connector systems for DataStream and Table? What is the
>
> fundamental
>
> reason
>
> that is preventing us from merging them to one?
>
> The basic functionality of a connector is to provide
>
> capabilities
>
> to
>
> do
>
> IO
>
> and Serde. Conceptually, Table connectors should just be
>
> DataStream
>
> connectors that are dealing with Rows. It seems that quite a few
>
> of
>
> the
>
> special connector requirements are just a specific way to do IO
>
> /
>
> Serde.
>
> Taking SupportsFilterPushDown as an example, imagine we have the
>
> following
>
> interface:
>
> interface FilterableSource<PREDICATE> {
>      void applyFilterable(Supplier<PREDICATE> predicate);
> }
>
> And if a ParquetSource would like to support filterable, it will
>
> become:
>
> class ParquetSource implements Source,
>
> FilterableSource<FilterPredicate> {
>
>      ...
> }
>
> For Table, one just needs to provide a predicate supplier that
>
> converts
>
> an
>
> Expression to the specified predicate type. This has a few
>
> benefits:
>
> 1. Same unified API for filterable for sources, regardless of
>
> DataStream or
>
> Table.
> 2. The DataStream users now can also use the
>
> ExpressionToPredicate
>
> supplier if they want to.
>
> To summarize, my main point is that I am wondering if it is
>
> possible
>
> to
>
> have a single set of connector interface for both Table and
>
> DataStream,
>
> rather than having two hierarchies. I am not 100% sure if this
>
> would
>
> work,
>
> but if it works, this would be a huge win from both code
>
> maintenance
>
> and
>
> user experience perspective.
>
> Thanks,
>
> Jiangjie (Becket) Qin
>
>
>
> On Tue, Mar 24, 2020 at 2:03 AM Dawid Wysakowicz <
>
> dwysakowicz@apache.org>
>
> wrote:
>
>
> Hi Timo,
>
> Thank you for the proposal. I think it is an important
>
> improvement
>
> that
>
> will benefit many parts of the Table API. The proposal looks
>
> really
>
> good
>
> to me and personally I would be comfortable with voting on the
>
> current
>
> state.
>
> Best,
>
> Dawid
>
> On 23/03/2020 18:53, Timo Walther wrote:
>
> Hi everyone,
>
> I received some questions around how the new interfaces play
>
> together
>
> with formats and their factories.
>
> Furthermore, for MySQL or Postgres CDC logs, the format should
>
> be
>
> able
>
> to return a `ChangelogMode`.
>
> Also, I incorporated the feedback around the factory design in
>
> general.
>
> I added a new section `Factory Interfaces` to the design
>
> document.
>
> This should be helpful to understand the big picture and
>
> connecting
>
> the concepts.
>
> Please let me know what you think?
>
> Thanks,
> Timo
>
>
> On 18.03.20 13:43, Timo Walther wrote:
>
> Hi Benchao,
>
> this is a very good question. I will update the FLIP about
>
> this.
>
> The legacy planner will not support the new interfaces. It
>
> will
>
> only
>
> support the old interfaces. With the next release, I think
>
> the
>
> Blink
>
> planner is stable enough to be the default one as well.
>
> Regards,
> Timo
>
> On 18.03.20 08:45, Benchao Li wrote:
>
> Hi Timo,
>
> Thank you and others for the efforts to prepare this FLIP.
>
> The FLIP LGTM generally.
>
> +1 for moving blink data structures to table-common, it's
>
> useful
>
> to
>
> udf too
> in the future.
> A little question is, do we plan to support the new
>
> interfaces
>
> and
>
> data
>
> types in legacy planner?
> Or we only plan to support these new interfaces in blink
>
> planner.
>
> And using primary keys from DDL instead of derived key
>
> information
>
> from
>
> each query is also a good idea,
> we met some use cases where this does not work very well
>
> before.
>
> This FLIP also makes the dependencies of table modules more
>
> clear, I
>
> like
> it very much.
>
> Timo Walther <tw...@apache.org> wrote on Tue, Mar 17, 2020 at 1:36 AM:
>
>
> Hi everyone,
>
> I'm happy to present the results of long discussions that
>
> we
>
> had
>
> internally. Jark, Dawid, Aljoscha, Kurt, Jingsong, me, and
>
> many
>
> more
>
> have contributed to this design document.
>
> We would like to propose new long-term table source and
>
> table
>
> sink
>
> interfaces:
>
>
>
>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-95%3A+New+TableSource+and+TableSink+interfaces
>
> This is a requirement for FLIP-105 and finalizing FLIP-32.
>
> The goals of this FLIP are:
>
> - Simplify the current interface architecture:
>        - Merge upsert, retract, and append sinks.
>        - Unify batch and streaming sources.
>        - Unify batch and streaming sinks.
>
> - Allow sources to produce a changelog:
>        - UpsertTableSources have been requested a lot by
>
> users.
>
> Now
>
> is the
> time to open the internal planner capabilities via the new
>
> interfaces.
>
>        - According to FLIP-105, we would like to support
>
> changelogs for
>
> processing formats such as Debezium.
>
> - Don't rely on DataStream API for source and sinks:
>        - According to FLIP-32, the Table API and SQL should
>
> be
>
> independent
> of the DataStream API which is why the `table-common`
>
> module
>
> has
>
> no
>
> dependencies on `flink-streaming-java`.
>        - Source and sink implementations should only depend
>
> on
>
> the
>
> `table-common` module after FLIP-27.
>        - Until FLIP-27 is ready, we still put most of the
>
> interfaces in
>
> `table-common` and strictly separate interfaces that
>
> communicate
>
> with a
> planner and actual runtime reader/writers.
>
> - Implement efficient sources and sinks without planner
>
> dependencies:
>
>        - Make Blink's internal data structures available to
>
> connectors.
>
>        - Introduce stable interfaces for data structures
>
> that
>
> can
>
> be
>
> marked as `@PublicEvolving`.
>        - Only require dependencies on `flink-table-common`
>
> in
>
> the
>
> future
>
> It finalizes the concept of dynamic tables and considers how all
> source/sink related classes play together.
>
> We look forward to your feedback.
>
> Regards,
> Timo
>
>
> --
> Best, Jingsong Lee
>
>
>

Re: [DISCUSS] FLIP-95: New TableSource and TableSink interfaces

Posted by Dawid Wysakowicz <dw...@apache.org>.
Hi Becket,

I really think we don't have differing opinions. We might not see the
changes in the same way yet. Personally I think of the
DynamicTableSource as a factory for a Source implemented for the
DataStream API. The important fact is that the DynamicTableSource and all
feature traits (SupportsFilterablePushDown, SupportsProjectPushDown
etc.) work with Table API concepts such as e.g. Expressions, SQL
specific types etc. In the end what the implementation would resemble is
(bear in mind I tremendously simplified the example, just to show the
relation between the two APIs):

SupportsFilterablePushDown {

  applyFilters(List<ResolvedExpression> filters) {

    this.filters = convertToDataStreamFilters(filters);

  }

  Source createSource() {

        return Source.create()

          .applyFilters(this.filters);

   }

}

or exactly as you said for the computed columns:


SupportsComputedColumnsPushDown {

  applyComputedColumn(ComputedColumnConverter converter) {

    this.deserializationSchema = new DeserializationSchema<RowData> {

      RowData deserialize(...) {

        RowData row = format.deserialize(bytes); // original format,
        // e.g. json, avro, etc.

        return converter(row); // enrich the row with the computed columns

      }

    }

  }

  Source createSource() {

        return Source.create()

          .withDeserialization(deserializationSchema);

   }

}

So to sum it up again, all those interfaces are factories that configure
appropriate parts of the DataStream API using Table API concepts.
Finally, to answer your question for particular comparisons:

DynamicTableSource v.s. Source<Row, SourceSplitT, EnumChkT>
SupportsFilterablePushDown v.s. FilterableSource
SupportsProjectablePushDown v.s. ProjectableSource
SupportsWatermarkPushDown v.s. WithWatermarkAssigner
SupportsComputedColumnPushDown v.s. ComputedColumnDeserializer
ScanTableSource v.s. ChangeLogDeserializer.

Pretty much, you can think of everything on the left as a factory for the
right side; the left side works with Table API classes (Expressions,
DataTypes). I hope this clarifies it a bit.
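
As one last illustration, a minimal end-to-end sketch of this factory
relation (same simplifications as the examples above; MyFilter,
MyFilterableSource and convertToDataStreamFilters are hypothetical):

// Planner-facing side: works with Table API concepts (Expressions, DataTypes).
MyDynamicTableSource implements ScanTableSource, SupportsFilterablePushDown {

  applyFilters(List<ResolvedExpression> filters) {

    // translate the declarative expressions into connector-native filters
    this.filters = convertToDataStreamFilters(filters);

  }

  // Runtime-facing side: builds the actual DataStream API source.
  Source createSource() {

        return MyFilterableSource.create()

          .applyFilters(this.filters);

   }

}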

Best,

Dawid

On 24/03/2020 15:03, Becket Qin wrote:
> Hey Kurt,
>
> I don't think DataStream should see some SQL specific concepts such as
>> Filtering or ComputedColumn.
>
> Projectable and Filterable seem not necessarily SQL concepts, but could be
> applicable to DataStream source as well to reduce the network load. For
> example ORC and Parquet should probably also be readable from DataStream,
> right?
>
> ComputedColumn is not part of the Source, it is an interface extending the
> Deserializer, which is a pluggable for the Source. From the SQL's
> perspective it has the concept of computed column, but from the Source
> perspective, it is essentially a Deserializer which also converts the
> records internally, assuming we allow some conversion to be embedded to
> the source in addition to just deserialization.
>
> Thanks,
>
> Jiangjie (Becket) Qin

Re: [DISCUSS] FLIP-95: New TableSource and TableSink interfaces

Posted by Timo Walther <tw...@apache.org>.
Hi Becket,

it is true that concepts such as projection and filtering are worth
having in the DataStream API as well. And a SourceFunction can provide
interfaces for those concepts. In the table-related classes we will
generate runtime classes that adhere to those interfaces and deal with
RowData (possibly code generated and operating on binary data).

We consider everything in the DataStream API as "runtime" classes. Which
components a SourceFunction needs is an orthogonal discussion.

Currently, I could imagine something like:

SourceFunction(FlatMapFunction, WatermarkAssigner)

Once we have completed this discussion, the table sources and sinks will
generate a FlatMapFunction (out of projection, filters, and computed
columns) and a WatermarkAssigner (out of computed columns). Until then,
we keep using the old interfaces.
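
As an illustration only, the assembled source could then look roughly
like this (the generate* helpers and the constructor are assumptions,
not an existing API):

// Hypothetical composition: the planner generates the runtime pieces,
// the connector only wires them together.
FlatMapFunction<RowData, RowData> flatMap =
    generateFlatMap(projection, filters, computedColumns); // code-generated
WatermarkAssigner assigner =
    generateWatermarkAssigner(computedColumns); // code-generated
SourceFunction<RowData> source = new SourceFunction(flatMap, assigner);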

Regards,
Timo


On 24.03.20 15:03, Becket Qin wrote:
> Hey Kurt,
> 
>> I don't think DataStream should see some SQL specific concepts such as
>> Filtering or ComputedColumn.
> 
> 
> Projectable and Filterable seems not necessarily SQL concepts, but could be
> applicable to DataStream source as well to reduce the network load. For
> example ORC and Parquet should probably also be readable from DataStream,
> right?
> 
> ComputedColumn is not part of the Source, it is an interface extends the
> Deserializer, which is a pluggable for the Source. From the SQL's
> perspective it has the concept of computed column, but from the Source
> perspective, It is essentially a Deserializer which also converts the
> records internally, assuming we allow some conversion to be embedded to
> the source in addition to just deserialization.
> 
> Thanks,
> 
> Jiangjie (Becket) Qin


Re: [DISCUSS] FLIP-95: New TableSource and TableSink interfaces

Posted by Becket Qin <be...@gmail.com>.
Hi Jark,

> However, the interfaces proposed by FLIP-95 are mainly used during
> optimization (compiling), not runtime.


Yes, I am aware of that. I am wondering whether the SQL planner can use
the counterpart interfaces in the Source to apply the optimizations. It
seems that should also work, right?

If we have another set of interfaces for the Table Source, every
connector must expose two different sets of interfaces if it wants to
work for both DataStream and Table. This will be per-connector work,
regardless of how differently Flink uses these interfaces internally. And
we will face the problem that sometimes a Source is available for
DataStream but not for Table, which sounds weird.

If we can make the single Source interface work for SQL, the story is
much simpler. Once you have that Source implemented, the minimum thing
needed to make it work for Table is pretty much a
SchemaDeserializer<Row>; anything else would be an optional additional
feature.

> For example, filter push down, DataStream FilterableSource may allow to
> accept a FilterFunction (which is a black box for the source).
> However, table sources should pick the pushed filter expressions, some
> sources may only support "=", "<", ">" conditions.
> Pushing a FilterFunction doesn't work in table ecosystem. That means, the
> connectors have to have some table-specific implementations.


Can you elaborate a bit more on this? For filtering there might be two
cases:
1. The filter predicate is source-specific, i.e. the source needs to
understand it. Such filtering usually improves performance by
leveraging some index / reducing the network traffic.
2. The filter predicate is source-agnostic, i.e. the source does not
understand it. Such filtering happens in the Source itself after the
records have been read from the remote system.

By FilterableSource, I meant the first case. So a Supplier<FilterPredicate>
can be provided to convert from a SQL expression to the required
FilterPredicate, and that FilterPredicate will be used by the Source.
For the second case, I'd say it should be part of the Deserializer, again
assuming the Deserializer can also have some parsing logic in addition to
its basic function.
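
To make case 1 concrete, a rough sketch along the lines of the
FilterableSource idea from earlier in the thread (the converter type
ExpressionToParquetPredicate is a made-up name for illustration):

// Case 1: the source understands its own native predicate type; a
// supplier converts the planner's SQL expressions into that type.
Supplier<FilterPredicate> predicateSupplier =
    new ExpressionToParquetPredicate(resolvedExpressions); // hypothetical converter
ParquetSource source = new ParquetSource(...); // implements FilterableSource<FilterPredicate>
source.applyFilterable(predicateSupplier); // the source evaluates the predicate itself,
                                           // e.g. against Parquet row-group statistics
// Case 2 would instead live inside the Deserializer, after the records
// have been read from the remote system.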

Thanks,

Jiangjie (Becket) Qin


Re: [DISCUSS] FLIP-95: New TableSource and TableSink interfaces

Posted by Becket Qin <be...@gmail.com>.
Hey Kurt,

> I don't think DataStream should see some SQL specific concepts such as
> Filtering or ComputedColumn.


Projectable and Filterable do not seem to be necessarily SQL concepts,
but could be applicable to a DataStream source as well to reduce the
network load. For example, ORC and Parquet should probably also be
readable from DataStream, right?

ComputedColumn is not part of the Source; it is an interface extending
the Deserializer, which is a pluggable component of the Source. From
SQL's perspective there is the concept of a computed column, but from the
Source's perspective it is essentially a Deserializer which also converts
the records internally, assuming we allow some conversion to be embedded
into the source in addition to just deserialization.
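
For illustration, a minimal sketch of what such a wrapping deserializer
could look like (the interface and method names are assumptions, not an
existing API):

// Hypothetical: a Deserializer that appends computed columns right after
// the format's plain deserialization, keeping the Source SQL-agnostic.
class ComputedColumnDeserializer implements Deserializer<RowData> {
  private final Deserializer<RowData> format;      // e.g. a JSON or Avro format
  private final ComputedColumnConverter converter; // generated from the SQL expressions

  ComputedColumnDeserializer(Deserializer<RowData> format,
                             ComputedColumnConverter converter) {
    this.format = format;
    this.converter = converter;
  }

  public RowData deserialize(byte[] bytes) {
    RowData row = format.deserialize(bytes); // basic deserialization
    return converter.convert(row);           // append the computed columns
  }
}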

Thanks,

Jiangjie (Becket) Qin

> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-95%3A+New+TableSource+and+TableSink+interfaces
> > > > > >>>>>>>
> > > > > >>>>>>> This is a requirement for FLIP-105 and finalizing FLIP-32.
> > > > > >>>>>>>
> > > > > >>>>>>> The goals of this FLIP are:
> > > > > >>>>>>>
> > > > > >>>>>>> - Simplify the current interface architecture:
> > > > > >>>>>>>        - Merge upsert, retract, and append sinks.
> > > > > >>>>>>>        - Unify batch and streaming sources.
> > > > > >>>>>>>        - Unify batch and streaming sinks.
> > > > > >>>>>>>
> > > > > >>>>>>> - Allow sources to produce a changelog:
> > > > > >>>>>>>        - UpsertTableSources have been requested a lot by
> > users.
> > > > Now
> > > > > >>>>>>> is the
> > > > > >>>>>>> time to open the internal planner capabilities via the new
> > > > > interfaces.
> > > > > >>>>>>>        - According to FLIP-105, we would like to support
> > > > > changelogs for
> > > > > >>>>>>> processing formats such as Debezium.
> > > > > >>>>>>>
> > > > > >>>>>>> - Don't rely on DataStream API for source and sinks:
> > > > > >>>>>>>        - According to FLIP-32, the Table API and SQL should
> > be
> > > > > >>>>>>> independent
> > > > > >>>>>>> of the DataStream API which is why the `table-common`
> module
> > > has
> > > > no
> > > > > >>>>>>> dependencies on `flink-streaming-java`.
> > > > > >>>>>>>        - Source and sink implementations should only depend
> > on
> > > > the
> > > > > >>>>>>> `table-common` module after FLIP-27.
> > > > > >>>>>>>        - Until FLIP-27 is ready, we still put most of the
> > > > > interfaces in
> > > > > >>>>>>> `table-common` and strictly separate interfaces that
> > > communicate
> > > > > >>>>>>> with a
> > > > > >>>>>>> planner and actual runtime reader/writers.
> > > > > >>>>>>>
> > > > > >>>>>>> - Implement efficient sources and sinks without planner
> > > > > dependencies:
> > > > > >>>>>>>        - Make Blink's internal data structures available to
> > > > > connectors.
> > > > > >>>>>>>        - Introduce stable interfaces for data structures
> that
> > > can
> > > > > be
> > > > > >>>>>>> marked as `@PublicEvolving`.
> > > > > >>>>>>>        - Only require dependencies on `flink-table-common`
> in
> > > the
> > > > > >>>>>>> future
> > > > > >>>>>>>
> > > > > >>>>>>> It finalizes the concept of dynamic tables and consideres
> how
> > > all
> > > > > >>>>>>> source/sink related classes play together.
> > > > > >>>>>>>
> > > > > >>>>>>> We look forward to your feedback.
> > > > > >>>>>>>
> > > > > >>>>>>> Regards,
> > > > > >>>>>>> Timo
> > > > > >>>>>>>
> > > > > >>>>>>
> > > > > >>>
> > > > > >
> > > > >
> > > > >
> > > >
> > > > --
> > > > Best, Jingsong Lee
> > > >
> > >
> >
>

Re: [DISCUSS] FLIP-95: New TableSource and TableSink interfaces

Posted by Jark Wu <im...@gmail.com>.
Thanks Timo for updating the formats section. That will be very helpful
for changelog support (FLIP-105).

I just left two minor comments about some method names. In general, I'm +1
to start a vote.

--------------------------------------------------------------------------------------------------

Hi Becket,

I agree we shouldn't duplicate code, especially the runtime
implementations.
However, the interfaces proposed by FLIP-95 are mainly used during
optimization (compiling), not at runtime.
I don't think there is much to share there, because Table/SQL is
declarative while DataStream is imperative.
For example, for filter push down, a DataStream FilterableSource may
accept a FilterFunction (which is a black box for the source).
However, table sources should pick from the pushed filter expressions,
and some sources may only support "=", "<", ">" conditions.
Pushing a FilterFunction doesn't work in the table ecosystem. That means
the connectors have to have some table-specific implementations.
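
To make that concrete, here is a minimal sketch of the expression-based
variant. It uses the ability interface as proposed in the FLIP (exact
names and packages may still change), and `isSimpleComparison` is a
helper invented just for this example; a real source would also
implement ScanTableSource etc.:

import java.util.ArrayList;
import java.util.List;
import org.apache.flink.table.connector.source.abilities.SupportsFilterPushDown;
import org.apache.flink.table.expressions.CallExpression;
import org.apache.flink.table.expressions.ResolvedExpression;
import org.apache.flink.table.functions.BuiltInFunctionDefinitions;
import org.apache.flink.table.functions.FunctionDefinition;

public class FilterAwareParquetSource implements SupportsFilterPushDown {

    @Override
    public Result applyFilters(List<ResolvedExpression> filters) {
        List<ResolvedExpression> accepted = new ArrayList<>();
        List<ResolvedExpression> remaining = new ArrayList<>();
        for (ResolvedExpression filter : filters) {
            // take only the "=", "<", ">" conditions this source can
            // translate into Parquet predicates; the planner keeps
            // evaluating the remaining ones after the scan
            if (isSimpleComparison(filter)) {
                accepted.add(filter);
            } else {
                remaining.add(filter);
            }
        }
        return Result.of(accepted, remaining);
    }

    private static boolean isSimpleComparison(ResolvedExpression e) {
        if (!(e instanceof CallExpression)) {
            return false;
        }
        FunctionDefinition fn = ((CallExpression) e).getFunctionDefinition();
        return fn == BuiltInFunctionDefinitions.EQUALS
                || fn == BuiltInFunctionDefinitions.LESS_THAN
                || fn == BuiltInFunctionDefinitions.GREATER_THAN;
    }
}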


Best,
Jark

On Tue, 24 Mar 2020 at 20:41, Kurt Young <yk...@gmail.com> wrote:

> Hi Becket,
>
> I don't think DataStream should see some SQL specific concepts such as
> Filtering or ComputedColumn. It's
> better to stay within SQL area and translate to more generic concept when
> translating to DataStream/Runtime
> layer, such as use MapFunction to represent computed column logic.
>
> Best,
> Kurt
>
>
> On Tue, Mar 24, 2020 at 5:47 PM Becket Qin <be...@gmail.com> wrote:
>
> > Hi Timo and Dawid,
> >
> > It's really great that we have the same goal. I am actually wondering if
> we
> > can go one step further to avoid some of the interfaces in Table as well.
> >
> > For example, if we have the FilterableSource, do we still need the
> > FilterableTableSource? Should DynamicTableSource just become a
> > Source<*Row*,
> > SourceSplitT, EnumChkT>?
> >
> > Can you help me understand a bit more about the reason we need the
> > following relational representation / wrapper interfaces v.s. the
> > interfaces that we could put to the Source in FLIP-27?
> >
> > DynamicTableSource v.s. Source<Row, SourceSplitT, EnumChkT>
> > SupportsFilterablePushDown v.s. FilterableSource
> > SupportsProjectablePushDown v.s. ProjectableSource
> > SupportsWatermarkPushDown v.s. WithWatermarkAssigner
> > SupportsComputedColumnPushDown v.s. ComputedColumnDeserializer
> > ScanTableSource v.s. ChangeLogDeserializer.
> > LookUpTableSource v.s. LookUpSource
> >
> > Assuming we have all the interfaces on the right side, do we still need
> the
> > interfaces on the left side? Note that the interfaces on the right can be
> > used by both DataStream and Table. If we do this, there will only be one
> > set of Source interfaces Table and DataStream, the only difference is
> that
> > the Source for table will have some specific plugins and configurations.
> An
> > omnipotent Source can implement all the the above interfaces and take a
> > Deserializer that implements both ComputedColumnDeserializer and
> > ChangeLogDeserializer.
> >
> > Would the SQL planner work with that?
> >
> > Thanks,
> >
> > Jiangjie (Becket) Qin
> >
> >
> >
> >
> >
> >
> > On Tue, Mar 24, 2020 at 5:03 PM Jingsong Li <ji...@gmail.com>
> > wrote:
> >
> > > +1. Thanks Timo for the design doc.
> > >
> > > We can also consider @Experimental too. But I am +1 to @PublicEvolving,
> > we
> > > should be confident in the current change.
> > >
> > > Best,
> > > Jingsong Lee
> > >
> > > On Tue, Mar 24, 2020 at 4:30 PM Timo Walther <tw...@apache.org>
> wrote:
> > >
> > > > @Becket: We totally agree that we don't need table specific
> connectors
> > > > during runtime. As Dawid said, the interfaces proposed here are just
> > for
> > > > communication with the planner. Once the properties (watermarks,
> > > > computed column, filters, projecttion etc.) are negotiated, we can
> > > > configure a regular Flink connector.
> > > >
> > > > E.g. setting the watermark assigner and deserialization schema of a
> > > > Kafka connector.
> > > >
> > > > For better separation of concerns, Flink connectors should not
> include
> > > > relational interfaces and depend on flink-table. This is the
> > > > responsibility of table source/sink.
> > > >
> > > > @Kurt: I would like to mark them @PublicEvolving already because we
> > need
> > > > to deprecate the old interfaces as early as possible. We cannot
> > redirect
> > > > to @Internal interfaces. They are not marked @Public, so we can still
> > > > evolve them. But a core design shift should not happen again, it
> would
> > > > leave a bad impression if we are redesign over and over again.
> Instead
> > > > we should be confident in the current change.
> > > >
> > > > Regards,
> > > > Timo
> > > >
> > > >
> > > > On 24.03.20 09:20, Dawid Wysakowicz wrote:
> > > > > Hi Becket,
> > > > >
> > > > > Answering your question, we have the same intention not to
> duplicate
> > > > > connectors between datastream and table apis. The interfaces
> proposed
> > > in
> > > > > the FLIP are a way to describe relational properties of a source.
> The
> > > > > intention is as you described to translate all of those expressed
> as
> > > > > expressions or other Table specific structures into a DataStream
> > > source.
> > > > > In other words I think what we are doing here is in line with what
> > you
> > > > > described.
> > > > >
> > > > > Best,
> > > > >
> > > > > Dawid
> > > > >
> > > > > On 24/03/2020 02:23, Becket Qin wrote:
> > > > >> Hi Timo,
> > > > >>
> > > > >> Thanks for the proposal. I completely agree that the current Table
> > > > >> connectors could be simplified quite a bit. I haven't finished
> > reading
> > > > >> everything, but here are some quick thoughts.
> > > > >>
> > > > >> Actually to me the biggest question is why should there be two
> > > different
> > > > >> connector systems for DataStream and Table? What is the
> fundamental
> > > > reason
> > > > >> that is preventing us from merging them to one?
> > > > >>
> > > > >> The basic functionality of a connector is to provide capabilities
> to
> > > do
> > > > IO
> > > > >> and Serde. Conceptually, Table connectors should just be
> DataStream
> > > > >> connectors that are dealing with Rows. It seems that quite a few
> of
> > > the
> > > > >> special connector requirements are just a specific way to do IO /
> > > Serde.
> > > > >> Taking SupportsFilterPushDown as an example, imagine we have the
> > > > following
> > > > >> interface:
> > > > >>
> > > > >> interface FilterableSource<PREDICATE> {
> > > > >>      void applyFilterable(Supplier<PREDICATE> predicate);
> > > > >> }
> > > > >>
> > > > >> And if a ParquetSource would like to support filterable, it will
> > > become:
> > > > >>
> > > > >> class ParquetSource implements Source,
> > > > FilterableSource(FilterPredicate> {
> > > > >>      ...
> > > > >> }
> > > > >>
> > > > >> For Table, one just need to provide an predicate supplier that
> > > converts
> > > > an
> > > > >> Expression to the specified predicate type. This has a few
> benefit:
> > > > >> 1. Same unified API for filterable for sources, regardless of
> > > > DataStream or
> > > > >> Table.
> > > > >> 2. The  DataStream users now can also use the
> ExpressionToPredicate
> > > > >> supplier if they want to.
> > > > >>
> > > > >> To summarize, my main point is that I am wondering if it is
> possible
> > > to
> > > > >> have a single set of connector interface for both Table and
> > > DataStream,
> > > > >> rather than having two hierarchies. I am not 100% sure if this
> would
> > > > work,
> > > > >> but if it works, this would be a huge win from both code
> maintenance
> > > and
> > > > >> user experience perspective.
> > > > >>
> > > > >> Thanks,
> > > > >>
> > > > >> Jiangjie (Becket) Qin
> > > > >>
> > > > >>
> > > > >>
> > > > >> On Tue, Mar 24, 2020 at 2:03 AM Dawid Wysakowicz <
> > > > dwysakowicz@apache.org>
> > > > >> wrote:
> > > > >>
> > > > >>> Hi Timo,
> > > > >>>
> > > > >>> Thank you for the proposal. I think it is an important
> improvement
> > > that
> > > > >>> will benefit many parts of the Table API. The proposal looks
> really
> > > > good
> > > > >>> to me and personally I would be comfortable with voting on the
> > > current
> > > > >>> state.
> > > > >>>
> > > > >>> Best,
> > > > >>>
> > > > >>> Dawid
> > > > >>>
> > > > >>> On 23/03/2020 18:53, Timo Walther wrote:
> > > > >>>> Hi everyone,
> > > > >>>>
> > > > >>>> I received some questions around how the new interfaces play
> > > together
> > > > >>>> with formats and their factories.
> > > > >>>>
> > > > >>>> Furthermore, for MySQL or Postgres CDC logs, the format should
> be
> > > able
> > > > >>>> to return a `ChangelogMode`.
> > > > >>>>
> > > > >>>> Also, I incorporated the feedback around the factory design in
> > > > general.
> > > > >>>>
> > > > >>>> I added a new section `Factory Interfaces` to the design
> document.
> > > > >>>> This should be helpful to understand the big picture and
> > connecting
> > > > >>>> the concepts.
> > > > >>>>
> > > > >>>> Please let me know what you think?
> > > > >>>>
> > > > >>>> Thanks,
> > > > >>>> Timo
> > > > >>>>
> > > > >>>>
> > > > >>>> On 18.03.20 13:43, Timo Walther wrote:
> > > > >>>>> Hi Benchao,
> > > > >>>>>
> > > > >>>>> this is a very good question. I will update the FLIP about
> this.
> > > > >>>>>
> > > > >>>>> The legacy planner will not support the new interfaces. It will
> > > only
> > > > >>>>> support the old interfaces. With the next release, I think the
> > > Blink
> > > > >>>>> planner is stable enough to be the default one as well.
> > > > >>>>>
> > > > >>>>> Regards,
> > > > >>>>> Timo
> > > > >>>>>
> > > > >>>>> On 18.03.20 08:45, Benchao Li wrote:
> > > > >>>>>> Hi Timo,
> > > > >>>>>>
> > > > >>>>>> Thank you and others for the efforts to prepare this FLIP.
> > > > >>>>>>
> > > > >>>>>> The FLIP LGTM generally.
> > > > >>>>>>
> > > > >>>>>> +1 for moving blink data structures to table-common, it's
> useful
> > > to
> > > > >>>>>> udf too
> > > > >>>>>> in the future.
> > > > >>>>>> A little question is, do we plan to support the new interfaces
> > and
> > > > data
> > > > >>>>>> types in legacy planner?
> > > > >>>>>> Or we only plan to support these new interfaces in blink
> > planner.
> > > > >>>>>>
> > > > >>>>>> And using primary keys from DDL instead of derived key
> > information
> > > > from
> > > > >>>>>> each query is also a good idea,
> > > > >>>>>> we met some use cases where this does not works very well
> > before.
> > > > >>>>>>
> > > > >>>>>> This FLIP also makes the dependencies of table modules more
> > > clear, I
> > > > >>>>>> like
> > > > >>>>>> it very much.
> > > > >>>>>>
> > > > >>>>>> Timo Walther <tw...@apache.org> 于2020年3月17日周二 上午1:36写道:
> > > > >>>>>>
> > > > >>>>>>> Hi everyone,
> > > > >>>>>>>
> > > > >>>>>>> I'm happy to present the results of long discussions that we
> > had
> > > > >>>>>>> internally. Jark, Dawid, Aljoscha, Kurt, Jingsong, me, and
> many
> > > > more
> > > > >>>>>>> have contributed to this design document.
> > > > >>>>>>>
> > > > >>>>>>> We would like to propose new long-term table source and table
> > > sink
> > > > >>>>>>> interfaces:
> > > > >>>>>>>
> > > > >>>>>>>
> > > > >>>>>>>
> > > > >>>
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-95%3A+New+TableSource+and+TableSink+interfaces
> > > > >>>>>>>
> > > > >>>>>>> This is a requirement for FLIP-105 and finalizing FLIP-32.
> > > > >>>>>>>
> > > > >>>>>>> The goals of this FLIP are:
> > > > >>>>>>>
> > > > >>>>>>> - Simplify the current interface architecture:
> > > > >>>>>>>        - Merge upsert, retract, and append sinks.
> > > > >>>>>>>        - Unify batch and streaming sources.
> > > > >>>>>>>        - Unify batch and streaming sinks.
> > > > >>>>>>>
> > > > >>>>>>> - Allow sources to produce a changelog:
> > > > >>>>>>>        - UpsertTableSources have been requested a lot by
> users.
> > > Now
> > > > >>>>>>> is the
> > > > >>>>>>> time to open the internal planner capabilities via the new
> > > > interfaces.
> > > > >>>>>>>        - According to FLIP-105, we would like to support
> > > > changelogs for
> > > > >>>>>>> processing formats such as Debezium.
> > > > >>>>>>>
> > > > >>>>>>> - Don't rely on DataStream API for source and sinks:
> > > > >>>>>>>        - According to FLIP-32, the Table API and SQL should
> be
> > > > >>>>>>> independent
> > > > >>>>>>> of the DataStream API which is why the `table-common` module
> > has
> > > no
> > > > >>>>>>> dependencies on `flink-streaming-java`.
> > > > >>>>>>>        - Source and sink implementations should only depend
> on
> > > the
> > > > >>>>>>> `table-common` module after FLIP-27.
> > > > >>>>>>>        - Until FLIP-27 is ready, we still put most of the
> > > > interfaces in
> > > > >>>>>>> `table-common` and strictly separate interfaces that
> > communicate
> > > > >>>>>>> with a
> > > > >>>>>>> planner and actual runtime reader/writers.
> > > > >>>>>>>
> > > > >>>>>>> - Implement efficient sources and sinks without planner
> > > > dependencies:
> > > > >>>>>>>        - Make Blink's internal data structures available to
> > > > connectors.
> > > > >>>>>>>        - Introduce stable interfaces for data structures that
> > can
> > > > be
> > > > >>>>>>> marked as `@PublicEvolving`.
> > > > >>>>>>>        - Only require dependencies on `flink-table-common` in
> > the
> > > > >>>>>>> future
> > > > >>>>>>>
> > > > >>>>>>> It finalizes the concept of dynamic tables and consideres how
> > all
> > > > >>>>>>> source/sink related classes play together.
> > > > >>>>>>>
> > > > >>>>>>> We look forward to your feedback.
> > > > >>>>>>>
> > > > >>>>>>> Regards,
> > > > >>>>>>> Timo
> > > > >>>>>>>
> > > > >>>>>>
> > > > >>>
> > > > >
> > > >
> > > >
> > >
> > > --
> > > Best, Jingsong Lee
> > >
> >
>

Re: [DISCUSS] FLIP-95: New TableSource and TableSink interfaces

Posted by Kurt Young <yk...@gmail.com>.
Hi Becket,

I don't think DataStream should see SQL-specific concepts such as
Filtering or ComputedColumn. It's better to stay within the SQL area
and translate to more generic concepts when translating to the
DataStream/runtime layer, e.g. using a MapFunction to represent
computed column logic.
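
A tiny sketch of that translation (hand-written here; in reality the
planner would generate it): a computed column like
`total AS price * quantity` becomes an ordinary MapFunction once the
plan reaches the DataStream/runtime layer:

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.types.Row;

// appends the derived `total` field to each (price, quantity) row;
// at this layer it is a generic mapper, nothing SQL-specific anymore
public class ComputedColumnMapper implements MapFunction<Row, Row> {

    @Override
    public Row map(Row row) {
        double price = (double) row.getField(0);
        long quantity = (long) row.getField(1);
        return Row.of(price, quantity, price * quantity);
    }
}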

Best,
Kurt


On Tue, Mar 24, 2020 at 5:47 PM Becket Qin <be...@gmail.com> wrote:

> Hi Timo and Dawid,
>
> It's really great that we have the same goal. I am actually wondering if we
> can go one step further to avoid some of the interfaces in Table as well.
>
> For example, if we have the FilterableSource, do we still need the
> FilterableTableSource? Should DynamicTableSource just become a
> Source<*Row*,
> SourceSplitT, EnumChkT>?
>
> Can you help me understand a bit more about the reason we need the
> following relational representation / wrapper interfaces v.s. the
> interfaces that we could put to the Source in FLIP-27?
>
> DynamicTableSource v.s. Source<Row, SourceSplitT, EnumChkT>
> SupportsFilterablePushDown v.s. FilterableSource
> SupportsProjectablePushDown v.s. ProjectableSource
> SupportsWatermarkPushDown v.s. WithWatermarkAssigner
> SupportsComputedColumnPushDown v.s. ComputedColumnDeserializer
> ScanTableSource v.s. ChangeLogDeserializer.
> LookUpTableSource v.s. LookUpSource
>
> Assuming we have all the interfaces on the right side, do we still need the
> interfaces on the left side? Note that the interfaces on the right can be
> used by both DataStream and Table. If we do this, there will only be one
> set of Source interfaces Table and DataStream, the only difference is that
> the Source for table will have some specific plugins and configurations. An
> omnipotent Source can implement all the the above interfaces and take a
> Deserializer that implements both ComputedColumnDeserializer and
> ChangeLogDeserializer.
>
> Would the SQL planner work with that?
>
> Thanks,
>
> Jiangjie (Becket) Qin
>
>
>
>
>
>
> On Tue, Mar 24, 2020 at 5:03 PM Jingsong Li <ji...@gmail.com>
> wrote:
>
> > +1. Thanks Timo for the design doc.
> >
> > We can also consider @Experimental too. But I am +1 to @PublicEvolving,
> we
> > should be confident in the current change.
> >
> > Best,
> > Jingsong Lee
> >
> > On Tue, Mar 24, 2020 at 4:30 PM Timo Walther <tw...@apache.org> wrote:
> >
> > > @Becket: We totally agree that we don't need table specific connectors
> > > during runtime. As Dawid said, the interfaces proposed here are just
> for
> > > communication with the planner. Once the properties (watermarks,
> > > computed column, filters, projecttion etc.) are negotiated, we can
> > > configure a regular Flink connector.
> > >
> > > E.g. setting the watermark assigner and deserialization schema of a
> > > Kafka connector.
> > >
> > > For better separation of concerns, Flink connectors should not include
> > > relational interfaces and depend on flink-table. This is the
> > > responsibility of table source/sink.
> > >
> > > @Kurt: I would like to mark them @PublicEvolving already because we
> need
> > > to deprecate the old interfaces as early as possible. We cannot
> redirect
> > > to @Internal interfaces. They are not marked @Public, so we can still
> > > evolve them. But a core design shift should not happen again, it would
> > > leave a bad impression if we are redesign over and over again. Instead
> > > we should be confident in the current change.
> > >
> > > Regards,
> > > Timo
> > >
> > >
> > > On 24.03.20 09:20, Dawid Wysakowicz wrote:
> > > > Hi Becket,
> > > >
> > > > Answering your question, we have the same intention not to duplicate
> > > > connectors between datastream and table apis. The interfaces proposed
> > in
> > > > the FLIP are a way to describe relational properties of a source. The
> > > > intention is as you described to translate all of those expressed as
> > > > expressions or other Table specific structures into a DataStream
> > source.
> > > > In other words I think what we are doing here is in line with what
> you
> > > > described.
> > > >
> > > > Best,
> > > >
> > > > Dawid
> > > >
> > > > On 24/03/2020 02:23, Becket Qin wrote:
> > > >> Hi Timo,
> > > >>
> > > >> Thanks for the proposal. I completely agree that the current Table
> > > >> connectors could be simplified quite a bit. I haven't finished
> reading
> > > >> everything, but here are some quick thoughts.
> > > >>
> > > >> Actually to me the biggest question is why should there be two
> > different
> > > >> connector systems for DataStream and Table? What is the fundamental
> > > reason
> > > >> that is preventing us from merging them to one?
> > > >>
> > > >> The basic functionality of a connector is to provide capabilities to
> > do
> > > IO
> > > >> and Serde. Conceptually, Table connectors should just be DataStream
> > > >> connectors that are dealing with Rows. It seems that quite a few of
> > the
> > > >> special connector requirements are just a specific way to do IO /
> > Serde.
> > > >> Taking SupportsFilterPushDown as an example, imagine we have the
> > > following
> > > >> interface:
> > > >>
> > > >> interface FilterableSource<PREDICATE> {
> > > >>      void applyFilterable(Supplier<PREDICATE> predicate);
> > > >> }
> > > >>
> > > >> And if a ParquetSource would like to support filterable, it will
> > become:
> > > >>
> > > >> class ParquetSource implements Source,
> > > FilterableSource(FilterPredicate> {
> > > >>      ...
> > > >> }
> > > >>
> > > >> For Table, one just need to provide an predicate supplier that
> > converts
> > > an
> > > >> Expression to the specified predicate type. This has a few benefit:
> > > >> 1. Same unified API for filterable for sources, regardless of
> > > DataStream or
> > > >> Table.
> > > >> 2. The  DataStream users now can also use the ExpressionToPredicate
> > > >> supplier if they want to.
> > > >>
> > > >> To summarize, my main point is that I am wondering if it is possible
> > to
> > > >> have a single set of connector interface for both Table and
> > DataStream,
> > > >> rather than having two hierarchies. I am not 100% sure if this would
> > > work,
> > > >> but if it works, this would be a huge win from both code maintenance
> > and
> > > >> user experience perspective.
> > > >>
> > > >> Thanks,
> > > >>
> > > >> Jiangjie (Becket) Qin
> > > >>
> > > >>
> > > >>
> > > >> On Tue, Mar 24, 2020 at 2:03 AM Dawid Wysakowicz <
> > > dwysakowicz@apache.org>
> > > >> wrote:
> > > >>
> > > >>> Hi Timo,
> > > >>>
> > > >>> Thank you for the proposal. I think it is an important improvement
> > that
> > > >>> will benefit many parts of the Table API. The proposal looks really
> > > good
> > > >>> to me and personally I would be comfortable with voting on the
> > current
> > > >>> state.
> > > >>>
> > > >>> Best,
> > > >>>
> > > >>> Dawid
> > > >>>
> > > >>> On 23/03/2020 18:53, Timo Walther wrote:
> > > >>>> Hi everyone,
> > > >>>>
> > > >>>> I received some questions around how the new interfaces play
> > together
> > > >>>> with formats and their factories.
> > > >>>>
> > > >>>> Furthermore, for MySQL or Postgres CDC logs, the format should be
> > able
> > > >>>> to return a `ChangelogMode`.
> > > >>>>
> > > >>>> Also, I incorporated the feedback around the factory design in
> > > general.
> > > >>>>
> > > >>>> I added a new section `Factory Interfaces` to the design document.
> > > >>>> This should be helpful to understand the big picture and
> connecting
> > > >>>> the concepts.
> > > >>>>
> > > >>>> Please let me know what you think?
> > > >>>>
> > > >>>> Thanks,
> > > >>>> Timo
> > > >>>>
> > > >>>>
> > > >>>> On 18.03.20 13:43, Timo Walther wrote:
> > > >>>>> Hi Benchao,
> > > >>>>>
> > > >>>>> this is a very good question. I will update the FLIP about this.
> > > >>>>>
> > > >>>>> The legacy planner will not support the new interfaces. It will
> > only
> > > >>>>> support the old interfaces. With the next release, I think the
> > Blink
> > > >>>>> planner is stable enough to be the default one as well.
> > > >>>>>
> > > >>>>> Regards,
> > > >>>>> Timo
> > > >>>>>
> > > >>>>> On 18.03.20 08:45, Benchao Li wrote:
> > > >>>>>> Hi Timo,
> > > >>>>>>
> > > >>>>>> Thank you and others for the efforts to prepare this FLIP.
> > > >>>>>>
> > > >>>>>> The FLIP LGTM generally.
> > > >>>>>>
> > > >>>>>> +1 for moving blink data structures to table-common, it's useful
> > to
> > > >>>>>> udf too
> > > >>>>>> in the future.
> > > >>>>>> A little question is, do we plan to support the new interfaces
> and
> > > data
> > > >>>>>> types in legacy planner?
> > > >>>>>> Or we only plan to support these new interfaces in blink
> planner.
> > > >>>>>>
> > > >>>>>> And using primary keys from DDL instead of derived key
> information
> > > from
> > > >>>>>> each query is also a good idea,
> > > >>>>>> we met some use cases where this does not works very well
> before.
> > > >>>>>>
> > > >>>>>> This FLIP also makes the dependencies of table modules more
> > clear, I
> > > >>>>>> like
> > > >>>>>> it very much.
> > > >>>>>>
> > > >>>>>> Timo Walther <tw...@apache.org> 于2020年3月17日周二 上午1:36写道:
> > > >>>>>>
> > > >>>>>>> Hi everyone,
> > > >>>>>>>
> > > >>>>>>> I'm happy to present the results of long discussions that we
> had
> > > >>>>>>> internally. Jark, Dawid, Aljoscha, Kurt, Jingsong, me, and many
> > > more
> > > >>>>>>> have contributed to this design document.
> > > >>>>>>>
> > > >>>>>>> We would like to propose new long-term table source and table
> > sink
> > > >>>>>>> interfaces:
> > > >>>>>>>
> > > >>>>>>>
> > > >>>>>>>
> > > >>>
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-95%3A+New+TableSource+and+TableSink+interfaces
> > > >>>>>>>
> > > >>>>>>> This is a requirement for FLIP-105 and finalizing FLIP-32.
> > > >>>>>>>
> > > >>>>>>> The goals of this FLIP are:
> > > >>>>>>>
> > > >>>>>>> - Simplify the current interface architecture:
> > > >>>>>>>        - Merge upsert, retract, and append sinks.
> > > >>>>>>>        - Unify batch and streaming sources.
> > > >>>>>>>        - Unify batch and streaming sinks.
> > > >>>>>>>
> > > >>>>>>> - Allow sources to produce a changelog:
> > > >>>>>>>        - UpsertTableSources have been requested a lot by users.
> > Now
> > > >>>>>>> is the
> > > >>>>>>> time to open the internal planner capabilities via the new
> > > interfaces.
> > > >>>>>>>        - According to FLIP-105, we would like to support
> > > changelogs for
> > > >>>>>>> processing formats such as Debezium.
> > > >>>>>>>
> > > >>>>>>> - Don't rely on DataStream API for source and sinks:
> > > >>>>>>>        - According to FLIP-32, the Table API and SQL should be
> > > >>>>>>> independent
> > > >>>>>>> of the DataStream API which is why the `table-common` module
> has
> > no
> > > >>>>>>> dependencies on `flink-streaming-java`.
> > > >>>>>>>        - Source and sink implementations should only depend on
> > the
> > > >>>>>>> `table-common` module after FLIP-27.
> > > >>>>>>>        - Until FLIP-27 is ready, we still put most of the
> > > interfaces in
> > > >>>>>>> `table-common` and strictly separate interfaces that
> communicate
> > > >>>>>>> with a
> > > >>>>>>> planner and actual runtime reader/writers.
> > > >>>>>>>
> > > >>>>>>> - Implement efficient sources and sinks without planner
> > > dependencies:
> > > >>>>>>>        - Make Blink's internal data structures available to
> > > connectors.
> > > >>>>>>>        - Introduce stable interfaces for data structures that
> can
> > > be
> > > >>>>>>> marked as `@PublicEvolving`.
> > > >>>>>>>        - Only require dependencies on `flink-table-common` in
> the
> > > >>>>>>> future
> > > >>>>>>>
> > > >>>>>>> It finalizes the concept of dynamic tables and consideres how
> all
> > > >>>>>>> source/sink related classes play together.
> > > >>>>>>>
> > > >>>>>>> We look forward to your feedback.
> > > >>>>>>>
> > > >>>>>>> Regards,
> > > >>>>>>> Timo
> > > >>>>>>>
> > > >>>>>>
> > > >>>
> > > >
> > >
> > >
> >
> > --
> > Best, Jingsong Lee
> >
>

Re: [DISCUSS] FLIP-95: New TableSource and TableSink interfaces

Posted by Becket Qin <be...@gmail.com>.
Hi Timo and Dawid,

It's really great that we have the same goal. I am actually wondering if we
can go one step further to avoid some of the interfaces in Table as well.

For example, if we have the FilterableSource, do we still need the
FilterableTableSource? Should DynamicTableSource just become a Source<*Row*,
SourceSplitT, EnumChkT>?

Can you help me understand a bit more about the reason we need the
following relational representation / wrapper interfaces vs. the
interfaces that we could put on the Source in FLIP-27?

DynamicTableSource vs. Source<Row, SourceSplitT, EnumChkT>
SupportsFilterablePushDown vs. FilterableSource
SupportsProjectablePushDown vs. ProjectableSource
SupportsWatermarkPushDown vs. WithWatermarkAssigner
SupportsComputedColumnPushDown vs. ComputedColumnDeserializer
ScanTableSource vs. ChangeLogDeserializer
LookUpTableSource vs. LookUpSource

Assuming we have all the interfaces on the right side, do we still need the
interfaces on the left side? Note that the interfaces on the right can be
used by both DataStream and Table. If we do this, there will only be one
set of Source interfaces for Table and DataStream; the only difference is
that the Source for a table will have some specific plugins and
configurations. An omnipotent Source can implement all of the above
interfaces and take a Deserializer that implements both
ComputedColumnDeserializer and ChangeLogDeserializer.
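
Roughly, I picture something like the following. Everything here is
hypothetical and taken from the right-hand column above; the interfaces
are defined inline so the sketch is self-contained, and none of this is
existing Flink API:

import java.util.function.Supplier;

// hypothetical capability interfaces from the mapping above
interface FilterableSource<PREDICATE> {
    void applyFilterable(Supplier<PREDICATE> predicate);
}

interface ComputedColumnDeserializer<T> { /* derives computed fields */ }

interface ChangeLogDeserializer<T> { /* attaches change kinds to records */ }

// a single deserializer object can cover both roles
interface TableDeserializer<T>
        extends ComputedColumnDeserializer<T>, ChangeLogDeserializer<T> {}

// the "omnipotent" source: one source class that carries the relational
// capabilities directly instead of going through wrapper interfaces
class OmnipotentSource<ROW, PREDICATE>
        implements FilterableSource<PREDICATE> {

    private final TableDeserializer<ROW> deserializer;
    private Supplier<PREDICATE> predicate;

    OmnipotentSource(TableDeserializer<ROW> deserializer) {
        this.deserializer = deserializer;
    }

    @Override
    public void applyFilterable(Supplier<PREDICATE> predicate) {
        // Table passes a supplier converting Expressions to PREDICATE;
        // DataStream users can pass any supplier they like
        this.predicate = predicate;
    }
}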

Would the SQL planner work with that?

Thanks,

Jiangjie (Becket) Qin






On Tue, Mar 24, 2020 at 5:03 PM Jingsong Li <ji...@gmail.com> wrote:

> +1. Thanks Timo for the design doc.
>
> We can also consider @Experimental too. But I am +1 to @PublicEvolving, we
> should be confident in the current change.
>
> Best,
> Jingsong Lee
>
> On Tue, Mar 24, 2020 at 4:30 PM Timo Walther <tw...@apache.org> wrote:
>
> > @Becket: We totally agree that we don't need table specific connectors
> > during runtime. As Dawid said, the interfaces proposed here are just for
> > communication with the planner. Once the properties (watermarks,
> > computed column, filters, projecttion etc.) are negotiated, we can
> > configure a regular Flink connector.
> >
> > E.g. setting the watermark assigner and deserialization schema of a
> > Kafka connector.
> >
> > For better separation of concerns, Flink connectors should not include
> > relational interfaces and depend on flink-table. This is the
> > responsibility of table source/sink.
> >
> > @Kurt: I would like to mark them @PublicEvolving already because we need
> > to deprecate the old interfaces as early as possible. We cannot redirect
> > to @Internal interfaces. They are not marked @Public, so we can still
> > evolve them. But a core design shift should not happen again, it would
> > leave a bad impression if we are redesign over and over again. Instead
> > we should be confident in the current change.
> >
> > Regards,
> > Timo
> >
> >
> > On 24.03.20 09:20, Dawid Wysakowicz wrote:
> > > Hi Becket,
> > >
> > > Answering your question, we have the same intention not to duplicate
> > > connectors between datastream and table apis. The interfaces proposed
> in
> > > the FLIP are a way to describe relational properties of a source. The
> > > intention is as you described to translate all of those expressed as
> > > expressions or other Table specific structures into a DataStream
> source.
> > > In other words I think what we are doing here is in line with what you
> > > described.
> > >
> > > Best,
> > >
> > > Dawid
> > >
> > > On 24/03/2020 02:23, Becket Qin wrote:
> > >> Hi Timo,
> > >>
> > >> Thanks for the proposal. I completely agree that the current Table
> > >> connectors could be simplified quite a bit. I haven't finished reading
> > >> everything, but here are some quick thoughts.
> > >>
> > >> Actually to me the biggest question is why should there be two
> different
> > >> connector systems for DataStream and Table? What is the fundamental
> > reason
> > >> that is preventing us from merging them to one?
> > >>
> > >> The basic functionality of a connector is to provide capabilities to
> do
> > IO
> > >> and Serde. Conceptually, Table connectors should just be DataStream
> > >> connectors that are dealing with Rows. It seems that quite a few of
> the
> > >> special connector requirements are just a specific way to do IO /
> Serde.
> > >> Taking SupportsFilterPushDown as an example, imagine we have the
> > following
> > >> interface:
> > >>
> > >> interface FilterableSource<PREDICATE> {
> > >>      void applyFilterable(Supplier<PREDICATE> predicate);
> > >> }
> > >>
> > >> And if a ParquetSource would like to support filterable, it will
> become:
> > >>
> > >> class ParquetSource implements Source,
> > FilterableSource(FilterPredicate> {
> > >>      ...
> > >> }
> > >>
> > >> For Table, one just need to provide an predicate supplier that
> converts
> > an
> > >> Expression to the specified predicate type. This has a few benefit:
> > >> 1. Same unified API for filterable for sources, regardless of
> > DataStream or
> > >> Table.
> > >> 2. The  DataStream users now can also use the ExpressionToPredicate
> > >> supplier if they want to.
> > >>
> > >> To summarize, my main point is that I am wondering if it is possible
> to
> > >> have a single set of connector interface for both Table and
> DataStream,
> > >> rather than having two hierarchies. I am not 100% sure if this would
> > work,
> > >> but if it works, this would be a huge win from both code maintenance
> and
> > >> user experience perspective.
> > >>
> > >> Thanks,
> > >>
> > >> Jiangjie (Becket) Qin
> > >>
> > >>
> > >>
> > >> On Tue, Mar 24, 2020 at 2:03 AM Dawid Wysakowicz <
> > dwysakowicz@apache.org>
> > >> wrote:
> > >>
> > >>> Hi Timo,
> > >>>
> > >>> Thank you for the proposal. I think it is an important improvement
> that
> > >>> will benefit many parts of the Table API. The proposal looks really
> > good
> > >>> to me and personally I would be comfortable with voting on the
> current
> > >>> state.
> > >>>
> > >>> Best,
> > >>>
> > >>> Dawid
> > >>>
> > >>> On 23/03/2020 18:53, Timo Walther wrote:
> > >>>> Hi everyone,
> > >>>>
> > >>>> I received some questions around how the new interfaces play
> together
> > >>>> with formats and their factories.
> > >>>>
> > >>>> Furthermore, for MySQL or Postgres CDC logs, the format should be
> able
> > >>>> to return a `ChangelogMode`.
> > >>>>
> > >>>> Also, I incorporated the feedback around the factory design in
> > general.
> > >>>>
> > >>>> I added a new section `Factory Interfaces` to the design document.
> > >>>> This should be helpful to understand the big picture and connecting
> > >>>> the concepts.
> > >>>>
> > >>>> Please let me know what you think?
> > >>>>
> > >>>> Thanks,
> > >>>> Timo
> > >>>>
> > >>>>
> > >>>> On 18.03.20 13:43, Timo Walther wrote:
> > >>>>> Hi Benchao,
> > >>>>>
> > >>>>> this is a very good question. I will update the FLIP about this.
> > >>>>>
> > >>>>> The legacy planner will not support the new interfaces. It will
> only
> > >>>>> support the old interfaces. With the next release, I think the
> Blink
> > >>>>> planner is stable enough to be the default one as well.
> > >>>>>
> > >>>>> Regards,
> > >>>>> Timo
> > >>>>>
> > >>>>> On 18.03.20 08:45, Benchao Li wrote:
> > >>>>>> Hi Timo,
> > >>>>>>
> > >>>>>> Thank you and others for the efforts to prepare this FLIP.
> > >>>>>>
> > >>>>>> The FLIP LGTM generally.
> > >>>>>>
> > >>>>>> +1 for moving blink data structures to table-common, it's useful
> to
> > >>>>>> udf too
> > >>>>>> in the future.
> > >>>>>> A little question is, do we plan to support the new interfaces and
> > data
> > >>>>>> types in legacy planner?
> > >>>>>> Or we only plan to support these new interfaces in blink planner.
> > >>>>>>
> > >>>>>> And using primary keys from DDL instead of derived key information
> > from
> > >>>>>> each query is also a good idea,
> > >>>>>> we met some use cases where this does not works very well before.
> > >>>>>>
> > >>>>>> This FLIP also makes the dependencies of table modules more
> clear, I
> > >>>>>> like
> > >>>>>> it very much.
> > >>>>>>
> > >>>>>> Timo Walther <tw...@apache.org> 于2020年3月17日周二 上午1:36写道:
> > >>>>>>
> > >>>>>>> Hi everyone,
> > >>>>>>>
> > >>>>>>> I'm happy to present the results of long discussions that we had
> > >>>>>>> internally. Jark, Dawid, Aljoscha, Kurt, Jingsong, me, and many
> > more
> > >>>>>>> have contributed to this design document.
> > >>>>>>>
> > >>>>>>> We would like to propose new long-term table source and table
> sink
> > >>>>>>> interfaces:
> > >>>>>>>
> > >>>>>>>
> > >>>>>>>
> > >>>
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-95%3A+New+TableSource+and+TableSink+interfaces
> > >>>>>>>
> > >>>>>>> This is a requirement for FLIP-105 and finalizing FLIP-32.
> > >>>>>>>
> > >>>>>>> The goals of this FLIP are:
> > >>>>>>>
> > >>>>>>> - Simplify the current interface architecture:
> > >>>>>>>        - Merge upsert, retract, and append sinks.
> > >>>>>>>        - Unify batch and streaming sources.
> > >>>>>>>        - Unify batch and streaming sinks.
> > >>>>>>>
> > >>>>>>> - Allow sources to produce a changelog:
> > >>>>>>>        - UpsertTableSources have been requested a lot by users.
> Now
> > >>>>>>> is the
> > >>>>>>> time to open the internal planner capabilities via the new
> > interfaces.
> > >>>>>>>        - According to FLIP-105, we would like to support
> > changelogs for
> > >>>>>>> processing formats such as Debezium.
> > >>>>>>>
> > >>>>>>> - Don't rely on DataStream API for source and sinks:
> > >>>>>>>        - According to FLIP-32, the Table API and SQL should be
> > >>>>>>> independent
> > >>>>>>> of the DataStream API which is why the `table-common` module has
> no
> > >>>>>>> dependencies on `flink-streaming-java`.
> > >>>>>>>        - Source and sink implementations should only depend on
> the
> > >>>>>>> `table-common` module after FLIP-27.
> > >>>>>>>        - Until FLIP-27 is ready, we still put most of the
> > interfaces in
> > >>>>>>> `table-common` and strictly separate interfaces that communicate
> > >>>>>>> with a
> > >>>>>>> planner and actual runtime reader/writers.
> > >>>>>>>
> > >>>>>>> - Implement efficient sources and sinks without planner
> > dependencies:
> > >>>>>>>        - Make Blink's internal data structures available to
> > connectors.
> > >>>>>>>        - Introduce stable interfaces for data structures that can
> > be
> > >>>>>>> marked as `@PublicEvolving`.
> > >>>>>>>        - Only require dependencies on `flink-table-common` in the
> > >>>>>>> future
> > >>>>>>>
> > >>>>>>> It finalizes the concept of dynamic tables and consideres how all
> > >>>>>>> source/sink related classes play together.
> > >>>>>>>
> > >>>>>>> We look forward to your feedback.
> > >>>>>>>
> > >>>>>>> Regards,
> > >>>>>>> Timo
> > >>>>>>>
> > >>>>>>
> > >>>
> > >
> >
> >
>
> --
> Best, Jingsong Lee
>

Re: [DISCUSS] FLIP-95: New TableSource and TableSink interfaces

Posted by Jingsong Li <ji...@gmail.com>.
+1. Thanks Timo for the design doc.

We could also consider @Experimental. But I am +1 to @PublicEvolving; we
should be confident in the current change.
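
For reference, the markers under discussion come from flink-annotations;
the question is only which one the new interfaces carry at first, e.g.:

import org.apache.flink.annotation.PublicEvolving;

// public API that may still change between releases
@PublicEvolving
public interface DynamicTableSource {
    // ...
}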

Best,
Jingsong Lee

On Tue, Mar 24, 2020 at 4:30 PM Timo Walther <tw...@apache.org> wrote:

> @Becket: We totally agree that we don't need table specific connectors
> during runtime. As Dawid said, the interfaces proposed here are just for
> communication with the planner. Once the properties (watermarks,
> computed column, filters, projecttion etc.) are negotiated, we can
> configure a regular Flink connector.
>
> E.g. setting the watermark assigner and deserialization schema of a
> Kafka connector.
>
> For better separation of concerns, Flink connectors should not include
> relational interfaces and depend on flink-table. This is the
> responsibility of table source/sink.
>
> @Kurt: I would like to mark them @PublicEvolving already because we need
> to deprecate the old interfaces as early as possible. We cannot redirect
> to @Internal interfaces. They are not marked @Public, so we can still
> evolve them. But a core design shift should not happen again, it would
> leave a bad impression if we are redesign over and over again. Instead
> we should be confident in the current change.
>
> Regards,
> Timo
>
>
> On 24.03.20 09:20, Dawid Wysakowicz wrote:
> > Hi Becket,
> >
> > Answering your question, we have the same intention not to duplicate
> > connectors between datastream and table apis. The interfaces proposed in
> > the FLIP are a way to describe relational properties of a source. The
> > intention is as you described to translate all of those expressed as
> > expressions or other Table specific structures into a DataStream source.
> > In other words I think what we are doing here is in line with what you
> > described.
> >
> > Best,
> >
> > Dawid
> >
> > On 24/03/2020 02:23, Becket Qin wrote:
> >> Hi Timo,
> >>
> >> Thanks for the proposal. I completely agree that the current Table
> >> connectors could be simplified quite a bit. I haven't finished reading
> >> everything, but here are some quick thoughts.
> >>
> >> Actually to me the biggest question is why should there be two different
> >> connector systems for DataStream and Table? What is the fundamental
> reason
> >> that is preventing us from merging them to one?
> >>
> >> The basic functionality of a connector is to provide capabilities to do
> IO
> >> and Serde. Conceptually, Table connectors should just be DataStream
> >> connectors that are dealing with Rows. It seems that quite a few of the
> >> special connector requirements are just a specific way to do IO / Serde.
> >> Taking SupportsFilterPushDown as an example, imagine we have the
> following
> >> interface:
> >>
> >> interface FilterableSource<PREDICATE> {
> >>      void applyFilterable(Supplier<PREDICATE> predicate);
> >> }
> >>
> >> And if a ParquetSource would like to support filterable, it will become:
> >>
> >> class ParquetSource implements Source,
> FilterableSource(FilterPredicate> {
> >>      ...
> >> }
> >>
> >> For Table, one just need to provide an predicate supplier that converts
> an
> >> Expression to the specified predicate type. This has a few benefit:
> >> 1. Same unified API for filterable for sources, regardless of
> DataStream or
> >> Table.
> >> 2. The  DataStream users now can also use the ExpressionToPredicate
> >> supplier if they want to.
> >>
> >> To summarize, my main point is that I am wondering if it is possible to
> >> have a single set of connector interface for both Table and DataStream,
> >> rather than having two hierarchies. I am not 100% sure if this would
> work,
> >> but if it works, this would be a huge win from both code maintenance and
> >> user experience perspective.
> >>
> >> Thanks,
> >>
> >> Jiangjie (Becket) Qin
> >>
> >>
> >>
> >> On Tue, Mar 24, 2020 at 2:03 AM Dawid Wysakowicz <
> dwysakowicz@apache.org>
> >> wrote:
> >>
> >>> Hi Timo,
> >>>
> >>> Thank you for the proposal. I think it is an important improvement that
> >>> will benefit many parts of the Table API. The proposal looks really
> good
> >>> to me and personally I would be comfortable with voting on the current
> >>> state.
> >>>
> >>> Best,
> >>>
> >>> Dawid
> >>>
> >>> On 23/03/2020 18:53, Timo Walther wrote:
> >>>> Hi everyone,
> >>>>
> >>>> I received some questions around how the new interfaces play together
> >>>> with formats and their factories.
> >>>>
> >>>> Furthermore, for MySQL or Postgres CDC logs, the format should be able
> >>>> to return a `ChangelogMode`.
> >>>>
> >>>> Also, I incorporated the feedback around the factory design in
> general.
> >>>>
> >>>> I added a new section `Factory Interfaces` to the design document.
> >>>> This should be helpful to understand the big picture and connecting
> >>>> the concepts.
> >>>>
> >>>> Please let me know what you think?
> >>>>
> >>>> Thanks,
> >>>> Timo
> >>>>
> >>>>
> >>>> On 18.03.20 13:43, Timo Walther wrote:
> >>>>> Hi Benchao,
> >>>>>
> >>>>> this is a very good question. I will update the FLIP about this.
> >>>>>
> >>>>> The legacy planner will not support the new interfaces. It will only
> >>>>> support the old interfaces. With the next release, I think the Blink
> >>>>> planner is stable enough to be the default one as well.
> >>>>>
> >>>>> Regards,
> >>>>> Timo
> >>>>>
> >>>>> On 18.03.20 08:45, Benchao Li wrote:
> >>>>>> Hi Timo,
> >>>>>>
> >>>>>> Thank you and others for the efforts to prepare this FLIP.
> >>>>>>
> >>>>>> The FLIP LGTM generally.
> >>>>>>
> >>>>>> +1 for moving blink data structures to table-common, it's useful to
> >>>>>> udf too
> >>>>>> in the future.
> >>>>>> A little question is, do we plan to support the new interfaces and
> data
> >>>>>> types in legacy planner?
> >>>>>> Or we only plan to support these new interfaces in blink planner.
> >>>>>>
> >>>>>> And using primary keys from DDL instead of derived key information
> from
> >>>>>> each query is also a good idea,
> >>>>>> we met some use cases where this does not works very well before.
> >>>>>>
> >>>>>> This FLIP also makes the dependencies of table modules more clear, I
> >>>>>> like
> >>>>>> it very much.
> >>>>>>
> >>>>>> Timo Walther <tw...@apache.org> 于2020年3月17日周二 上午1:36写道:
> >>>>>>
> >>>>>>> Hi everyone,
> >>>>>>>
> >>>>>>> I'm happy to present the results of long discussions that we had
> >>>>>>> internally. Jark, Dawid, Aljoscha, Kurt, Jingsong, me, and many
> more
> >>>>>>> have contributed to this design document.
> >>>>>>>
> >>>>>>> We would like to propose new long-term table source and table sink
> >>>>>>> interfaces:
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-95%3A+New+TableSource+and+TableSink+interfaces
> >>>>>>>
> >>>>>>> This is a requirement for FLIP-105 and finalizing FLIP-32.
> >>>>>>>
> >>>>>>> The goals of this FLIP are:
> >>>>>>>
> >>>>>>> - Simplify the current interface architecture:
> >>>>>>>        - Merge upsert, retract, and append sinks.
> >>>>>>>        - Unify batch and streaming sources.
> >>>>>>>        - Unify batch and streaming sinks.
> >>>>>>>
> >>>>>>> - Allow sources to produce a changelog:
> >>>>>>>        - UpsertTableSources have been requested a lot by users. Now
> >>>>>>> is the
> >>>>>>> time to open the internal planner capabilities via the new
> interfaces.
> >>>>>>>        - According to FLIP-105, we would like to support
> changelogs for
> >>>>>>> processing formats such as Debezium.
> >>>>>>>
> >>>>>>> - Don't rely on DataStream API for source and sinks:
> >>>>>>>        - According to FLIP-32, the Table API and SQL should be
> >>>>>>> independent
> >>>>>>> of the DataStream API which is why the `table-common` module has no
> >>>>>>> dependencies on `flink-streaming-java`.
> >>>>>>>        - Source and sink implementations should only depend on the
> >>>>>>> `table-common` module after FLIP-27.
> >>>>>>>        - Until FLIP-27 is ready, we still put most of the
> interfaces in
> >>>>>>> `table-common` and strictly separate interfaces that communicate
> >>>>>>> with a
> >>>>>>> planner and actual runtime reader/writers.
> >>>>>>>
> >>>>>>> - Implement efficient sources and sinks without planner
> dependencies:
> >>>>>>>        - Make Blink's internal data structures available to
> connectors.
> >>>>>>>        - Introduce stable interfaces for data structures that can
> be
> >>>>>>> marked as `@PublicEvolving`.
> >>>>>>>        - Only require dependencies on `flink-table-common` in the
> >>>>>>> future
> >>>>>>>
> >>>>>>> It finalizes the concept of dynamic tables and consideres how all
> >>>>>>> source/sink related classes play together.
> >>>>>>>
> >>>>>>> We look forward to your feedback.
> >>>>>>>
> >>>>>>> Regards,
> >>>>>>> Timo
> >>>>>>>
> >>>>>>
> >>>
> >
>
>

-- 
Best, Jingsong Lee

Re: [DISCUSS] FLIP-95: New TableSource and TableSink interfaces

Posted by Timo Walther <tw...@apache.org>.
@Becket: We totally agree that we don't need table-specific connectors 
during runtime. As Dawid said, the interfaces proposed here are just for 
communication with the planner. Once the properties (watermarks, 
computed columns, filters, projection, etc.) are negotiated, we can 
configure a regular Flink connector.

E.g. setting the watermark assigner and deserialization schema of a 
Kafka connector.
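
As a rough sketch (assuming `pushedDeserializer` and
`pushedWatermarkAssigner` stand for whatever the planner negotiated
through the new interfaces; the rest is the ordinary Kafka connector
API we have today):

import java.util.Properties;
import org.apache.flink.api.common.serialization.DeserializationSchema;
import org.apache.flink.streaming.api.functions.AssignerWithPeriodicWatermarks;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;
import org.apache.flink.types.Row;

public class KafkaRuntimeConfiguration {

    public static FlinkKafkaConsumer<Row> configure(
            DeserializationSchema<Row> pushedDeserializer,
            AssignerWithPeriodicWatermarks<Row> pushedWatermarkAssigner,
            Properties kafkaProps) {
        // an entirely regular DataStream connector; all relational
        // negotiation happened earlier, inside the planner
        FlinkKafkaConsumer<Row> consumer =
                new FlinkKafkaConsumer<>("topic", pushedDeserializer, kafkaProps);
        consumer.assignTimestampsAndWatermarks(pushedWatermarkAssigner);
        return consumer;
    }
}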

For better separation of concerns, Flink connectors should not include 
relational interfaces and depend on flink-table. This is the 
responsibility of table source/sink.

@Kurt: I would like to mark them @PublicEvolving already because we need 
to deprecate the old interfaces as early as possible. We cannot redirect 
to @Internal interfaces. They are not marked @Public, so we can still 
evolve them. But a core design shift should not happen again; it would 
leave a bad impression if we redesign over and over again. Instead, 
we should be confident in the current change.

Regards,
Timo


On 24.03.20 09:20, Dawid Wysakowicz wrote:
> Hi Becket,
> 
> Answering your question, we have the same intention not to duplicate
> connectors between datastream and table apis. The interfaces proposed in
> the FLIP are a way to describe relational properties of a source. The
> intention is as you described to translate all of those expressed as
> expressions or other Table specific structures into a DataStream source.
> In other words I think what we are doing here is in line with what you
> described.
> 
> Best,
> 
> Dawid
> 
> On 24/03/2020 02:23, Becket Qin wrote:
>> Hi Timo,
>>
>> Thanks for the proposal. I completely agree that the current Table
>> connectors could be simplified quite a bit. I haven't finished reading
>> everything, but here are some quick thoughts.
>>
>> Actually to me the biggest question is why should there be two different
>> connector systems for DataStream and Table? What is the fundamental reason
>> that is preventing us from merging them to one?
>>
>> The basic functionality of a connector is to provide capabilities to do IO
>> and Serde. Conceptually, Table connectors should just be DataStream
>> connectors that are dealing with Rows. It seems that quite a few of the
>> special connector requirements are just a specific way to do IO / Serde.
>> Taking SupportsFilterPushDown as an example, imagine we have the following
>> interface:
>>
>> interface FilterableSource<PREDICATE> {
>>      void applyFilterable(Supplier<PREDICATE> predicate);
>> }
>>
>> And if a ParquetSource would like to support filterable, it will become:
>>
>> class ParquetSource implements Source, FilterableSource(FilterPredicate> {
>>      ...
>> }
>>
>> For Table, one just need to provide an predicate supplier that converts an
>> Expression to the specified predicate type. This has a few benefit:
>> 1. Same unified API for filterable for sources, regardless of DataStream or
>> Table.
>> 2. The  DataStream users now can also use the ExpressionToPredicate
>> supplier if they want to.
>>
>> To summarize, my main point is that I am wondering if it is possible to
>> have a single set of connector interface for both Table and DataStream,
>> rather than having two hierarchies. I am not 100% sure if this would work,
>> but if it works, this would be a huge win from both code maintenance and
>> user experience perspective.
>>
>> Thanks,
>>
>> Jiangjie (Becket) Qin
>>
>>
>>
>> On Tue, Mar 24, 2020 at 2:03 AM Dawid Wysakowicz <dw...@apache.org>
>> wrote:
>>
>>> Hi Timo,
>>>
>>> Thank you for the proposal. I think it is an important improvement that
>>> will benefit many parts of the Table API. The proposal looks really good
>>> to me and personally I would be comfortable with voting on the current
>>> state.
>>>
>>> Best,
>>>
>>> Dawid
>>>
>>> On 23/03/2020 18:53, Timo Walther wrote:
>>>> Hi everyone,
>>>>
>>>> I received some questions around how the new interfaces play together
>>>> with formats and their factories.
>>>>
>>>> Furthermore, for MySQL or Postgres CDC logs, the format should be able
>>>> to return a `ChangelogMode`.
>>>>
>>>> Also, I incorporated the feedback around the factory design in general.
>>>>
>>>> I added a new section `Factory Interfaces` to the design document.
>>>> This should be helpful to understand the big picture and connecting
>>>> the concepts.
>>>>
>>>> Please let me know what you think?
>>>>
>>>> Thanks,
>>>> Timo
>>>>
>>>>
>>>> On 18.03.20 13:43, Timo Walther wrote:
>>>>> Hi Benchao,
>>>>>
>>>>> this is a very good question. I will update the FLIP about this.
>>>>>
>>>>> The legacy planner will not support the new interfaces. It will only
>>>>> support the old interfaces. With the next release, I think the Blink
>>>>> planner is stable enough to be the default one as well.
>>>>>
>>>>> Regards,
>>>>> Timo
>>>>>
>>>>> On 18.03.20 08:45, Benchao Li wrote:
>>>>>> Hi Timo,
>>>>>>
>>>>>> Thank you and others for the efforts to prepare this FLIP.
>>>>>>
>>>>>> The FLIP LGTM generally.
>>>>>>
>>>>>> +1 for moving the Blink data structures to table-common; they will be
>>>>>> useful for UDFs too in the future.
>>>>>> A little question: do we plan to support the new interfaces and data
>>>>>> types in the legacy planner, or only in the Blink planner?
>>>>>>
>>>>>> And using primary keys from the DDL instead of key information derived
>>>>>> from each query is also a good idea;
>>>>>> we have met some use cases where this did not work very well before.
>>>>>>
>>>>>> This FLIP also makes the dependencies of the table modules much clearer;
>>>>>> I like it very much.
>>>>>>
>>>>>> Timo Walther <tw...@apache.org> wrote on Tue, Mar 17, 2020 at 1:36 AM:
>>>>>>
>>>>>>> Hi everyone,
>>>>>>>
>>>>>>> I'm happy to present the results of long discussions that we had
>>>>>>> internally. Jark, Dawid, Aljoscha, Kurt, Jingsong, me, and many more
>>>>>>> have contributed to this design document.
>>>>>>>
>>>>>>> We would like to propose new long-term table source and table sink
>>>>>>> interfaces:
>>>>>>>
>>>>>>>
>>>>>>>
>>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-95%3A+New+TableSource+and+TableSink+interfaces
>>>>>>>
>>>>>>> This is a requirement for FLIP-105 and finalizing FLIP-32.
>>>>>>>
>>>>>>> The goals of this FLIP are:
>>>>>>>
>>>>>>> - Simplify the current interface architecture:
>>>>>>>        - Merge upsert, retract, and append sinks.
>>>>>>>        - Unify batch and streaming sources.
>>>>>>>        - Unify batch and streaming sinks.
>>>>>>>
>>>>>>> - Allow sources to produce a changelog:
>>>>>>>        - UpsertTableSources have been requested a lot by users. Now
>>>>>>> is the
>>>>>>> time to open the internal planner capabilities via the new interfaces.
>>>>>>>        - According to FLIP-105, we would like to support changelogs for
>>>>>>> processing formats such as Debezium.
>>>>>>>
>>>>>>> - Don't rely on DataStream API for sources and sinks:
>>>>>>>        - According to FLIP-32, the Table API and SQL should be
>>>>>>> independent
>>>>>>> of the DataStream API which is why the `table-common` module has no
>>>>>>> dependencies on `flink-streaming-java`.
>>>>>>>        - Source and sink implementations should only depend on the
>>>>>>> `table-common` module after FLIP-27.
>>>>>>>        - Until FLIP-27 is ready, we still put most of the interfaces in
>>>>>>> `table-common` and strictly separate interfaces that communicate
>>>>>>> with a
>>>>>>> planner and actual runtime reader/writers.
>>>>>>>
>>>>>>> - Implement efficient sources and sinks without planner dependencies:
>>>>>>>        - Make Blink's internal data structures available to connectors.
>>>>>>>        - Introduce stable interfaces for data structures that can be
>>>>>>> marked as `@PublicEvolving`.
>>>>>>>        - Only require dependencies on `flink-table-common` in the
>>>>>>> future
>>>>>>>
>>>>>>> It finalizes the concept of dynamic tables and considers how all
>>>>>>> source/sink related classes play together.
>>>>>>>
>>>>>>> We look forward to your feedback.
>>>>>>>
>>>>>>> Regards,
>>>>>>> Timo
>>>>>>>
>>>>>>
>>>
> 


Re: [DISCUSS] FLIP-95: New TableSource and TableSink interfaces

Posted by Dawid Wysakowicz <dw...@apache.org>.
Hi Becket,

Answering your question, we have the same intention not to duplicate
connectors between the DataStream and Table APIs. The interfaces proposed in
the FLIP are a way to describe the relational properties of a source. The
intention is, as you described, to translate all of those, expressed as
expressions or other Table-specific structures, into a DataStream source.
In other words, I think what we are doing here is in line with what you
described.
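
As a rough sketch of that translation (interface, method, and Result shape as
proposed in the FLIP; the toNativePredicate() helper is hypothetical), a
source could accept the filters it can express natively and hand the rest
back to the planner:

import java.util.ArrayList;
import java.util.List;
import org.apache.flink.table.expressions.ResolvedExpression;

public class MyTableSource implements SupportsFilterPushDown {

    @Override
    public Result applyFilters(List<ResolvedExpression> filters) {
        List<ResolvedExpression> accepted = new ArrayList<>();
        List<ResolvedExpression> remaining = new ArrayList<>();
        for (ResolvedExpression filter : filters) {
            // Hypothetical translation into whatever predicate type the
            // underlying DataStream source understands.
            if (toNativePredicate(filter) != null) {
                accepted.add(filter);
            } else {
                remaining.add(filter);
            }
        }
        // The planner keeps evaluating the remaining filters itself.
        return Result.of(accepted, remaining);
    }

    // ... other source methods and toNativePredicate() omitted ...
}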

Best,

Dawid

On 24/03/2020 02:23, Becket Qin wrote:
> Hi Timo,
>
> Thanks for the proposal. I completely agree that the current Table
> connectors could be simplified quite a bit. I haven't finished reading
> everything, but here are some quick thoughts.
>
> Actually to me the biggest question is why should there be two different
> connector systems for DataStream and Table? What is the fundamental reason
> that is preventing us from merging them to one?
>
> The basic functionality of a connector is to provide capabilities to do IO
> and Serde. Conceptually, Table connectors should just be DataStream
> connectors that are dealing with Rows. It seems that quite a few of the
> special connector requirements are just a specific way to do IO / Serde.
> Taking SupportsFilterPushDown as an example, imagine we have the following
> interface:
>
> interface FilterableSource<PREDICATE> {
>     void applyFilterable(Supplier<PREDICATE> predicate);
> }
>
> And if a ParquetSource would like to be filterable, it would become:
>
> class ParquetSource implements Source, FilterableSource<FilterPredicate> {
>     ...
> }
>
> For Table, one just needs to provide a predicate supplier that converts an
> Expression to the specified predicate type. This has a few benefits:
> 1. The same unified filterable API for sources, regardless of DataStream or
> Table.
> 2. DataStream users can now also use the ExpressionToPredicate
> supplier if they want to.
>
> To summarize, my main point is that I am wondering if it is possible to
> have a single set of connector interfaces for both Table and DataStream,
> rather than having two hierarchies. I am not 100% sure if this would work,
> but if it works, this would be a huge win from both a code maintenance and
> a user experience perspective.
>
> Thanks,
>
> Jiangjie (Becket) Qin
>
>
>
> On Tue, Mar 24, 2020 at 2:03 AM Dawid Wysakowicz <dw...@apache.org>
> wrote:
>
>> Hi Timo,
>>
>> Thank you for the proposal. I think it is an important improvement that
>> will benefit many parts of the Table API. The proposal looks really good
>> to me and personally I would be comfortable with voting on the current
>> state.
>>
>> Best,
>>
>> Dawid
>>
>> On 23/03/2020 18:53, Timo Walther wrote:
>>> Hi everyone,
>>>
>>> I received some questions around how the new interfaces play together
>>> with formats and their factories.
>>>
>>> Furthermore, for MySQL or Postgres CDC logs, the format should be able
>>> to return a `ChangelogMode`.
>>>
>>> Also, I incorporated the feedback around the factory design in general.
>>>
>>> I added a new section `Factory Interfaces` to the design document.
>>> This should be helpful for understanding the big picture and how
>>> the concepts connect.
>>>
>>> Please let me know what you think.
>>>
>>> Thanks,
>>> Timo
>>>
>>>
>>> On 18.03.20 13:43, Timo Walther wrote:
>>>> Hi Benchao,
>>>>
>>>> this is a very good question. I will update the FLIP about this.
>>>>
>>>> The legacy planner will not support the new interfaces. It will only
>>>> support the old interfaces. With the next release, I think the Blink
>>>> planner is stable enough to be the default one as well.
>>>>
>>>> Regards,
>>>> Timo
>>>>
>>>> On 18.03.20 08:45, Benchao Li wrote:
>>>>> Hi Timo,
>>>>>
>>>>> Thank you and others for the efforts to prepare this FLIP.
>>>>>
>>>>> The FLIP LGTM generally.
>>>>>
>>>>> +1 for moving the Blink data structures to table-common; they will be
>>>>> useful for UDFs too in the future.
>>>>> A little question: do we plan to support the new interfaces and data
>>>>> types in the legacy planner, or only in the Blink planner?
>>>>>
>>>>> And using primary keys from the DDL instead of key information derived
>>>>> from each query is also a good idea;
>>>>> we have met some use cases where this did not work very well before.
>>>>>
>>>>> This FLIP also makes the dependencies of the table modules much clearer;
>>>>> I like it very much.
>>>>>
>>>>> Timo Walther <tw...@apache.org> wrote on Tue, Mar 17, 2020 at 1:36 AM:
>>>>>
>>>>>> Hi everyone,
>>>>>>
>>>>>> I'm happy to present the results of long discussions that we had
>>>>>> internally. Jark, Dawid, Aljoscha, Kurt, Jingsong, me, and many more
>>>>>> have contributed to this design document.
>>>>>>
>>>>>> We would like to propose new long-term table source and table sink
>>>>>> interfaces:
>>>>>>
>>>>>>
>>>>>>
>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-95%3A+New+TableSource+and+TableSink+interfaces
>>>>>>
>>>>>> This is a requirement for FLIP-105 and finalizing FLIP-32.
>>>>>>
>>>>>> The goals of this FLIP are:
>>>>>>
>>>>>> - Simplify the current interface architecture:
>>>>>>       - Merge upsert, retract, and append sinks.
>>>>>>       - Unify batch and streaming sources.
>>>>>>       - Unify batch and streaming sinks.
>>>>>>
>>>>>> - Allow sources to produce a changelog:
>>>>>>       - UpsertTableSources have been requested a lot by users. Now
>>>>>> is the
>>>>>> time to open the internal planner capabilities via the new interfaces.
>>>>>>       - According to FLIP-105, we would like to support changelogs for
>>>>>> processing formats such as Debezium.
>>>>>>
>>>>>> - Don't rely on DataStream API for sources and sinks:
>>>>>>       - According to FLIP-32, the Table API and SQL should be
>>>>>> independent
>>>>>> of the DataStream API which is why the `table-common` module has no
>>>>>> dependencies on `flink-streaming-java`.
>>>>>>       - Source and sink implementations should only depend on the
>>>>>> `table-common` module after FLIP-27.
>>>>>>       - Until FLIP-27 is ready, we still put most of the interfaces in
>>>>>> `table-common` and strictly separate interfaces that communicate
>>>>>> with a
>>>>>> planner and actual runtime reader/writers.
>>>>>>
>>>>>> - Implement efficient sources and sinks without planner dependencies:
>>>>>>       - Make Blink's internal data structures available to connectors.
>>>>>>       - Introduce stable interfaces for data structures that can be
>>>>>> marked as `@PublicEvolving`.
>>>>>>       - Only require dependencies on `flink-table-common` in the
>>>>>> future
>>>>>>
>>>>>> It finalizes the concept of dynamic tables and considers how all
>>>>>> source/sink related classes play together.
>>>>>>
>>>>>> We look forward to your feedback.
>>>>>>
>>>>>> Regards,
>>>>>> Timo
>>>>>>
>>>>>
>>


Re: [DISCUSS] FLIP-95: New TableSource and TableSink interfaces

Posted by Becket Qin <be...@gmail.com>.
Hi Timo,

Thanks for the proposal. I completely agree that the current Table
connectors could be simplified quite a bit. I haven't finished reading
everything, but here are some quick thoughts.

Actually to me the biggest question is why should there be two different
connector systems for DataStream and Table? What is the fundamental reason
that is preventing us from merging them to one?

The basic functionality of a connector is to provide capabilities to do IO
and Serde. Conceptually, Table connectors should just be DataStream
connectors that are dealing with Rows. It seems that quite a few of the
special connector requirements are just a specific way to do IO / Serde.
Taking SupportsFilterPushDown as an example, imagine we have the following
interface:

interface FilterableSource<PREDICATE> {
    void applyFilterable(Supplier<PREDICATE> predicate);
}

And if a ParquetSource would like to be filterable, it would become:

class ParquetSource implements Source, FilterableSource<FilterPredicate> {
    ...
}

For Table, one just needs to provide a predicate supplier that converts an
Expression to the specified predicate type. This has a few benefits:
1. The same unified filterable API for sources, regardless of DataStream or
Table.
2. DataStream users can now also use the ExpressionToPredicate
supplier if they want to.
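
To make the supplier idea concrete, here is a minimal sketch (the class name
ExpressionToPredicate is made up, and the actual Expression-to-FilterPredicate
conversion is only stubbed out):

import java.util.List;
import java.util.function.Supplier;
import org.apache.flink.table.expressions.Expression;
import org.apache.parquet.filter2.predicate.FilterPredicate;

class ExpressionToPredicate implements Supplier<FilterPredicate> {

    private final List<Expression> filters;

    ExpressionToPredicate(List<Expression> filters) {
        this.filters = filters;
    }

    @Override
    public FilterPredicate get() {
        // Lazily translate the relational expressions into the connector's
        // native predicate type when the source actually needs them.
        return convert(filters);
    }

    private FilterPredicate convert(List<Expression> filters) {
        // The real translation logic would live here; omitted in this sketch.
        throw new UnsupportedOperationException("sketch only");
    }
}

// The planner (or a DataStream program) would then simply call:
// parquetSource.applyFilterable(new ExpressionToPredicate(filters));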

To summarize, my main point is that I am wondering if it is possible to
have a single set of connector interfaces for both Table and DataStream,
rather than having two hierarchies. I am not 100% sure if this would work,
but if it works, this would be a huge win from both a code maintenance and
a user experience perspective.

Thanks,

Jiangjie (Becket) Qin



On Tue, Mar 24, 2020 at 2:03 AM Dawid Wysakowicz <dw...@apache.org>
wrote:

> Hi Timo,
>
> Thank you for the proposal. I think it is an important improvement that
> will benefit many parts of the Table API. The proposal looks really good
> to me and personally I would be comfortable with voting on the current
> state.
>
> Best,
>
> Dawid
>
> On 23/03/2020 18:53, Timo Walther wrote:
> > Hi everyone,
> >
> > I received some questions around how the new interfaces play together
> > with formats and their factories.
> >
> > Furthermore, for MySQL or Postgres CDC logs, the format should be able
> > to return a `ChangelogMode`.
> >
> > Also, I incorporated the feedback around the factory design in general.
> >
> > I added a new section `Factory Interfaces` to the design document.
> > This should be helpful for understanding the big picture and how
> > the concepts connect.
> >
> > Please let me know what you think.
> >
> > Thanks,
> > Timo
> >
> >
> > On 18.03.20 13:43, Timo Walther wrote:
> >> Hi Benchao,
> >>
> >> this is a very good question. I will update the FLIP about this.
> >>
> >> The legacy planner will not support the new interfaces. It will only
> >> support the old interfaces. With the next release, I think the Blink
> >> planner is stable enough to be the default one as well.
> >>
> >> Regards,
> >> Timo
> >>
> >> On 18.03.20 08:45, Benchao Li wrote:
> >>> Hi Timo,
> >>>
> >>> Thank you and others for the efforts to prepare this FLIP.
> >>>
> >>> The FLIP LGTM generally.
> >>>
> >>> +1 for moving the Blink data structures to table-common; they will be
> >>> useful for UDFs too in the future.
> >>> A little question: do we plan to support the new interfaces and data
> >>> types in the legacy planner, or only in the Blink planner?
> >>>
> >>> And using primary keys from the DDL instead of key information derived
> >>> from each query is also a good idea;
> >>> we have met some use cases where this did not work very well before.
> >>>
> >>> This FLIP also makes the dependencies of the table modules much clearer;
> >>> I like it very much.
> >>>
> >>> Timo Walther <tw...@apache.org> wrote on Tue, Mar 17, 2020 at 1:36 AM:
> >>>
> >>>> Hi everyone,
> >>>>
> >>>> I'm happy to present the results of long discussions that we had
> >>>> internally. Jark, Dawid, Aljoscha, Kurt, Jingsong, me, and many more
> >>>> have contributed to this design document.
> >>>>
> >>>> We would like to propose new long-term table source and table sink
> >>>> interfaces:
> >>>>
> >>>>
> >>>>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-95%3A+New+TableSource+and+TableSink+interfaces
> >>>>
> >>>>
> >>>> This is a requirement for FLIP-105 and finalizing FLIP-32.
> >>>>
> >>>> The goals of this FLIP are:
> >>>>
> >>>> - Simplify the current interface architecture:
> >>>>       - Merge upsert, retract, and append sinks.
> >>>>       - Unify batch and streaming sources.
> >>>>       - Unify batch and streaming sinks.
> >>>>
> >>>> - Allow sources to produce a changelog:
> >>>>       - UpsertTableSources have been requested a lot by users. Now
> >>>> is the
> >>>> time to open the internal planner capabilities via the new interfaces.
> >>>>       - According to FLIP-105, we would like to support changelogs for
> >>>> processing formats such as Debezium.
> >>>>
> >>>> - Don't rely on DataStream API for sources and sinks:
> >>>>       - According to FLIP-32, the Table API and SQL should be
> >>>> independent
> >>>> of the DataStream API which is why the `table-common` module has no
> >>>> dependencies on `flink-streaming-java`.
> >>>>       - Source and sink implementations should only depend on the
> >>>> `table-common` module after FLIP-27.
> >>>>       - Until FLIP-27 is ready, we still put most of the interfaces in
> >>>> `table-common` and strictly separate interfaces that communicate
> >>>> with a
> >>>> planner and actual runtime reader/writers.
> >>>>
> >>>> - Implement efficient sources and sinks without planner dependencies:
> >>>>       - Make Blink's internal data structures available to connectors.
> >>>>       - Introduce stable interfaces for data structures that can be
> >>>> marked as `@PublicEvolving`.
> >>>>       - Only require dependencies on `flink-table-common` in the
> >>>> future
> >>>>
> >>>> It finalizes the concept of dynamic tables and considers how all
> >>>> source/sink related classes play together.
> >>>>
> >>>> We look forward to your feedback.
> >>>>
> >>>> Regards,
> >>>> Timo
> >>>>
> >>>
> >>>
> >
>
>

Re: [DISCUSS] FLIP-95: New TableSource and TableSink interfaces

Posted by Dawid Wysakowicz <dw...@apache.org>.
Hi Timo,

Thank you for the proposal. I think it is an important improvement that
will benefit many parts of the Table API. The proposal looks really good
to me and personally I would be comfortable with voting on the current
state.

Best,

Dawid

On 23/03/2020 18:53, Timo Walther wrote:
> Hi everyone,
>
> I received some questions around how the new interfaces play together
> with formats and their factories.
>
> Furthermore, for MySQL or Postgres CDC logs, the format should be able
> to return a `ChangelogMode`.
>
> Also, I incorporated the feedback around the factory design in general.
>
> I added a new section `Factory Interfaces` to the design document.
> This should be helpful for understanding the big picture and how
> the concepts connect.
>
> Please let me know what you think.
>
> Thanks,
> Timo
>
>
> On 18.03.20 13:43, Timo Walther wrote:
>> Hi Benchao,
>>
>> this is a very good question. I will update the FLIP about this.
>>
>> The legacy planner will not support the new interfaces. It will only
>> support the old interfaces. With the next release, I think the Blink
>> planner is stable enough to be the default one as well.
>>
>> Regards,
>> Timo
>>
>> On 18.03.20 08:45, Benchao Li wrote:
>>> Hi Timo,
>>>
>>> Thank you and others for the efforts to prepare this FLIP.
>>>
>>> The FLIP LGTM generally.
>>>
>>> +1 for moving the Blink data structures to table-common; they will be
>>> useful for UDFs too in the future.
>>> A little question: do we plan to support the new interfaces and data
>>> types in the legacy planner, or only in the Blink planner?
>>>
>>> And using primary keys from the DDL instead of key information derived
>>> from each query is also a good idea;
>>> we have met some use cases where this did not work very well before.
>>>
>>> This FLIP also makes the dependencies of the table modules much clearer;
>>> I like it very much.
>>>
>>> Timo Walther <tw...@apache.org> wrote on Tue, Mar 17, 2020 at 1:36 AM:
>>>
>>>> Hi everyone,
>>>>
>>>> I'm happy to present the results of long discussions that we had
>>>> internally. Jark, Dawid, Aljoscha, Kurt, Jingsong, me, and many more
>>>> have contributed to this design document.
>>>>
>>>> We would like to propose new long-term table source and table sink
>>>> interfaces:
>>>>
>>>>
>>>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-95%3A+New+TableSource+and+TableSink+interfaces
>>>>
>>>>
>>>> This is a requirement for FLIP-105 and finalizing FLIP-32.
>>>>
>>>> The goals of this FLIP are:
>>>>
>>>> - Simplify the current interface architecture:
>>>>       - Merge upsert, retract, and append sinks.
>>>>       - Unify batch and streaming sources.
>>>>       - Unify batch and streaming sinks.
>>>>
>>>> - Allow sources to produce a changelog:
>>>>       - UpsertTableSources have been requested a lot by users. Now
>>>> is the
>>>> time to open the internal planner capabilities via the new interfaces.
>>>>       - According to FLIP-105, we would like to support changelogs for
>>>> processing formats such as Debezium.
>>>>
>>>> - Don't rely on DataStream API for sources and sinks:
>>>>       - According to FLIP-32, the Table API and SQL should be
>>>> independent
>>>> of the DataStream API which is why the `table-common` module has no
>>>> dependencies on `flink-streaming-java`.
>>>>       - Source and sink implementations should only depend on the
>>>> `table-common` module after FLIP-27.
>>>>       - Until FLIP-27 is ready, we still put most of the interfaces in
>>>> `table-common` and strictly separate interfaces that communicate
>>>> with a
>>>> planner and actual runtime reader/writers.
>>>>
>>>> - Implement efficient sources and sinks without planner dependencies:
>>>>       - Make Blink's internal data structures available to connectors.
>>>>       - Introduce stable interfaces for data structures that can be
>>>> marked as `@PublicEvolving`.
>>>>       - Only require dependencies on `flink-table-common` in the
>>>> future
>>>>
>>>> It finalizes the concept of dynamic tables and considers how all
>>>> source/sink related classes play together.
>>>>
>>>> We look forward to your feedback.
>>>>
>>>> Regards,
>>>> Timo
>>>>
>>>
>>>
>


Re: [DISCUSS] FLIP-95: New TableSource and TableSink interfaces

Posted by Timo Walther <tw...@apache.org>.
Hi everyone,

I received some questions around how the new interfaces play together 
with formats and their factories.

Furthermore, for MySQL or Postgres CDC logs, the format should be able 
to return a `ChangelogMode`.
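
For example, a Debezium-style decoding format could declare the kinds of
changes it emits roughly like this (a sketch using the ChangelogMode builder
and RowKind enum proposed in the FLIP):

import org.apache.flink.table.connector.ChangelogMode;
import org.apache.flink.types.RowKind;

// Inside the format: a CDC log contains inserts, both sides of an update,
// and deletes.
public ChangelogMode getChangelogMode() {
    return ChangelogMode.newBuilder()
        .addContainedKind(RowKind.INSERT)
        .addContainedKind(RowKind.UPDATE_BEFORE)
        .addContainedKind(RowKind.UPDATE_AFTER)
        .addContainedKind(RowKind.DELETE)
        .build();
}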

Also, I incorporated the feedback around the factory design in general.

I added a new section `Factory Interfaces` to the design document. This
should be helpful for understanding the big picture and how the concepts connect.

Please let me know what you think.

Thanks,
Timo


On 18.03.20 13:43, Timo Walther wrote:
> Hi Benchao,
> 
> this is a very good question. I will update the FLIP about this.
> 
> The legacy planner will not support the new interfaces. It will only 
> support the old interfaces. With the next release, I think the Blink 
> planner is stable enough to be the default one as well.
> 
> Regards,
> Timo
> 
> On 18.03.20 08:45, Benchao Li wrote:
>> Hi Timo,
>>
>> Thank you and others for the efforts to prepare this FLIP.
>>
>> The FLIP LGTM generally.
>>
>> +1 for moving the Blink data structures to table-common; they will be
>> useful for UDFs too in the future.
>> A little question: do we plan to support the new interfaces and data
>> types in the legacy planner, or only in the Blink planner?
>>
>> And using primary keys from the DDL instead of key information derived from
>> each query is also a good idea;
>> we have met some use cases where this did not work very well before.
>>
>> This FLIP also makes the dependencies of the table modules much clearer;
>> I like it very much.
>>
>> Timo Walther <tw...@apache.org> wrote on Tue, Mar 17, 2020 at 1:36 AM:
>>
>>> Hi everyone,
>>>
>>> I'm happy to present the results of long discussions that we had
>>> internally. Jark, Dawid, Aljoscha, Kurt, Jingsong, me, and many more
>>> have contributed to this design document.
>>>
>>> We would like to propose new long-term table source and table sink
>>> interfaces:
>>>
>>>
>>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-95%3A+New+TableSource+and+TableSink+interfaces 
>>>
>>>
>>> This is a requirement for FLIP-105 and finalizing FLIP-32.
>>>
>>> The goals of this FLIP are:
>>>
>>> - Simplify the current interface architecture:
>>>       - Merge upsert, retract, and append sinks.
>>>       - Unify batch and streaming sources.
>>>       - Unify batch and streaming sinks.
>>>
>>> - Allow sources to produce a changelog:
>>>       - UpsertTableSources have been requested a lot by users. Now is 
>>> the
>>> time to open the internal planner capabilities via the new interfaces.
>>>       - According to FLIP-105, we would like to support changelogs for
>>> processing formats such as Debezium.
>>>
>>> - Don't rely on DataStream API for sources and sinks:
>>>       - According to FLIP-32, the Table API and SQL should be 
>>> independent
>>> of the DataStream API which is why the `table-common` module has no
>>> dependencies on `flink-streaming-java`.
>>>       - Source and sink implementations should only depend on the
>>> `table-common` module after FLIP-27.
>>>       - Until FLIP-27 is ready, we still put most of the interfaces in
>>> `table-common` and strictly separate interfaces that communicate with a
>>> planner and actual runtime reader/writers.
>>>
>>> - Implement efficient sources and sinks without planner dependencies:
>>>       - Make Blink's internal data structures available to connectors.
>>>       - Introduce stable interfaces for data structures that can be
>>> marked as `@PublicEvolving`.
>>>       - Only require dependencies on `flink-table-common` in the future
>>>
>>> It finalizes the concept of dynamic tables and considers how all
>>> source/sink related classes play together.
>>>
>>> We look forward to your feedback.
>>>
>>> Regards,
>>> Timo
>>>
>>
>>


Re: [DISCUSS] FLIP-95: New TableSource and TableSink interfaces

Posted by Timo Walther <tw...@apache.org>.
Hi Benchao,

this is a very good question. I will update the FLIP about this.

The legacy planner will not support the new interfaces. It will only 
support the old interfaces. With the next release, I think the Blink 
planner is stable enough to be the default one as well.

Regards,
Timo

On 18.03.20 08:45, Benchao Li wrote:
> Hi Timo,
> 
> Thank you and others for the efforts to prepare this FLIP.
> 
> The FLIP LGTM generally.
> 
> +1 for moving the Blink data structures to table-common; they will be useful
> for UDFs too in the future.
> A little question: do we plan to support the new interfaces and data
> types in the legacy planner, or only in the Blink planner?
> 
> And using primary keys from the DDL instead of key information derived from
> each query is also a good idea;
> we have met some use cases where this did not work very well before.
> 
> This FLIP also makes the dependencies of the table modules much clearer; I like
> it very much.
> 
> Timo Walther <tw...@apache.org> wrote on Tue, Mar 17, 2020 at 1:36 AM:
> 
>> Hi everyone,
>>
>> I'm happy to present the results of long discussions that we had
>> internally. Jark, Dawid, Aljoscha, Kurt, Jingsong, me, and many more
>> have contributed to this design document.
>>
>> We would like to propose new long-term table source and table sink
>> interfaces:
>>
>>
>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-95%3A+New+TableSource+and+TableSink+interfaces
>>
>> This is a requirement for FLIP-105 and finalizing FLIP-32.
>>
>> The goals of this FLIP are:
>>
>> - Simplify the current interface architecture:
>>       - Merge upsert, retract, and append sinks.
>>       - Unify batch and streaming sources.
>>       - Unify batch and streaming sinks.
>>
>> - Allow sources to produce a changelog:
>>       - UpsertTableSources have been requested a lot by users. Now is the
>> time to open the internal planner capabilities via the new interfaces.
>>       - According to FLIP-105, we would like to support changelogs for
>> processing formats such as Debezium.
>>
>> - Don't rely on DataStream API for sources and sinks:
>>       - According to FLIP-32, the Table API and SQL should be independent
>> of the DataStream API which is why the `table-common` module has no
>> dependencies on `flink-streaming-java`.
>>       - Source and sink implementations should only depend on the
>> `table-common` module after FLIP-27.
>>       - Until FLIP-27 is ready, we still put most of the interfaces in
>> `table-common` and strictly separate interfaces that communicate with a
>> planner and actual runtime reader/writers.
>>
>> - Implement efficient sources and sinks without planner dependencies:
>>       - Make Blink's internal data structures available to connectors.
>>       - Introduce stable interfaces for data structures that can be
>> marked as `@PublicEvolving`.
>>       - Only require dependencies on `flink-table-common` in the future
>>
>> It finalizes the concept of dynamic tables and considers how all
>> source/sink related classes play together.
>>
>> We look forward to your feedback.
>>
>> Regards,
>> Timo
>>
> 
> 


Re: [DISCUSS] FLIP-95: New TableSource and TableSink interfaces

Posted by Benchao Li <li...@gmail.com>.
Hi Timo,

Thank you and others for the efforts to prepare this FLIP.

The FLIP LGTM generally.

+1 for moving the Blink data structures to table-common; they will be useful
for UDFs too in the future.
A little question: do we plan to support the new interfaces and data
types in the legacy planner, or only in the Blink planner?

And using primary keys from the DDL instead of key information derived from
each query is also a good idea;
we have met some use cases where this did not work very well before.
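
For example, the key could then be declared once in the DDL (table name,
columns, and connector options below are made up, and the NOT ENFORCED
syntax is the one discussed alongside this FLIP):

// Assuming a TableEnvironment `tableEnv`:
tableEnv.sqlUpdate(
    "CREATE TABLE Users ("
        + "  user_id BIGINT,"
        + "  user_name STRING,"
        + "  PRIMARY KEY (user_id) NOT ENFORCED"
        + ") WITH ("
        + "  'connector.type' = '...'"  // real connector options go here
        + ")");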

This FLIP also makes the dependencies of the table modules much clearer; I like
it very much.

Timo Walther <tw...@apache.org> wrote on Tue, Mar 17, 2020 at 1:36 AM:

> Hi everyone,
>
> I'm happy to present the results of long discussions that we had
> internally. Jark, Dawid, Aljoscha, Kurt, Jingsong, me, and many more
> have contributed to this design document.
>
> We would like to propose new long-term table source and table sink
> interfaces:
>
>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-95%3A+New+TableSource+and+TableSink+interfaces
>
> This is a requirement for FLIP-105 and finalizing FLIP-32.
>
> The goals of this FLIP are:
>
> - Simplify the current interface architecture:
>      - Merge upsert, retract, and append sinks.
>      - Unify batch and streaming sources.
>      - Unify batch and streaming sinks.
>
> - Allow sources to produce a changelog:
>      - UpsertTableSources have been requested a lot by users. Now is the
> time to open the internal planner capabilities via the new interfaces.
>      - According to FLIP-105, we would like to support changelogs for
> processing formats such as Debezium.
>
> - Don't rely on DataStream API for sources and sinks:
>      - According to FLIP-32, the Table API and SQL should be independent
> of the DataStream API which is why the `table-common` module has no
> dependencies on `flink-streaming-java`.
>      - Source and sink implementations should only depend on the
> `table-common` module after FLIP-27.
>      - Until FLIP-27 is ready, we still put most of the interfaces in
> `table-common` and strictly separate interfaces that communicate with a
> planner and actual runtime reader/writers.
>
> - Implement efficient sources and sinks without planner dependencies:
>      - Make Blink's internal data structures available to connectors.
>      - Introduce stable interfaces for data structures that can be
> marked as `@PublicEvolving`.
>      - Only require dependencies on `flink-table-common` in the future
>
> It finalizes the concept of dynamic tables and considers how all
> source/sink related classes play together.
>
> We look forward to your feedback.
>
> Regards,
> Timo
>


-- 

Benchao Li
School of Electronics Engineering and Computer Science, Peking University
Tel:+86-15650713730
Email: libenchao@gmail.com; libenchao@pku.edu.cn