Posted to dev@flink.apache.org by Александр Смирнов <sm...@gmail.com> on 2022/05/03 08:15:11 UTC

Re: [DISCUSS] FLIP-221 Abstraction for lookup source cache and metric

Hi Qingsheng, Leonard and Jark,

Thanks for your detailed feedback! However, I have questions about
some of your statements (maybe I didn't get something?).

> Caching actually breaks the semantic of "FOR SYSTEM_TIME AS OF proc_time”

I agree that the semantics of "FOR SYSTEM_TIME AS OF proc_time" is not
fully preserved with caching, but as you said, users accept this
trade-off consciously to achieve better performance (no one has
proposed enabling caching by default, etc.). Or by users do you mean
other developers of connectors? In that case developers explicitly
specify whether their connector supports caching (in the list of
supported options); no one forces them to do so if they don't want to.
So what exactly is the difference, from this point of view, between
implementing caching in flink-table-runtime versus flink-table-common?
How does it affect whether the semantics of "FOR SYSTEM_TIME AS OF
proc_time" are broken or not?

> confront a situation that allows table options in DDL to control the behavior of the framework, which has never happened previously and should be cautious

If we talk about the main semantic difference between DDL options and
config options ("table.exec.xxx"), isn't it about the scope of the
options and their importance for the user's business logic, rather
than the specific location of the corresponding logic in the
framework? In my design, for example, putting the lookup cache
strategy into configurations would be the wrong decision, because it
directly affects the user's business logic (not just performance
optimization) and touches only several functions of ONE table (there
can be multiple tables with different caches). Does it really matter
to the user (or anyone else) where the logic affected by the applied
option is located?
Also I can recall the DDL option 'sink.parallelism', which in some way
"controls the behavior of the framework", and I don't see any problem
there.

> introduce a new interface for this all-caching scenario and the design would become more complex

This is a subject for a separate discussion, but in our internal
version we actually solved this problem quite easily - we reused the
InputFormat class (so there is no need for a new API). The point is
that currently all lookup connectors use InputFormat for scanning the
data in batch mode: HBase, JDBC and even Hive - it uses the class
PartitionReader, which is actually just a wrapper around InputFormat.
The advantage of this solution is the ability to reload cache data in
parallel (the number of threads depends on the number of InputSplits,
but has an upper limit). As a result, cache reload time is
significantly reduced (as is the time the input stream is blocked). I
know that we usually try to avoid explicit concurrency in Flink code,
but maybe this one can be an exception. BTW I don't claim it's an
ideal solution; maybe there are better ones.

> Providing the cache in the framework might introduce compatibility issues

This is possible only if the developer of a connector doesn't
properly refactor their code and uses the new cache options
incorrectly (i.e. explicitly provides the same options in two
different code places). For correct behavior, all they need to do is
redirect the existing options to the framework's LookupConfig (plus
maybe add an alias for options if the naming differed); everything
will be transparent for users. If the developer doesn't refactor at
all, nothing changes for the connector thanks to backward
compatibility. Also, if a developer wants their own cache logic, they
can simply decline to pass some of the configs to the framework and
instead provide their own implementation using the already existing
configs and metrics (but I actually think that's a rare case).
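The redirection could look roughly like the sketch below; LookupConfig and its fields are assumed shapes for illustration (the FLIP's actual class may differ), while the option keys follow the ones JDBC-like connectors already use:

```java
import java.util.Map;

// Hypothetical sketch: a connector maps its existing DDL option keys onto a
// framework-level LookupConfig instead of building its own cache.
public class LookupConfigRedirect {

    // Assumed shape of the framework config, for illustration only.
    static class LookupConfig {
        final long maxRows;
        final long ttlMillis;
        LookupConfig(long maxRows, long ttlMillis) {
            this.maxRows = maxRows;
            this.ttlMillis = ttlMillis;
        }
    }

    // The connector keeps its old option names and just forwards the values,
    // so nothing changes from the user's point of view.
    static LookupConfig fromOptions(Map<String, String> options) {
        long maxRows = Long.parseLong(
                options.getOrDefault("lookup.cache.max-rows", "10000"));
        long ttlMillis = Long.parseLong(
                options.getOrDefault("lookup.cache.ttl", "600000"));
        return new LookupConfig(maxRows, ttlMillis);
    }

    public static void main(String[] args) {
        LookupConfig cfg = fromOptions(Map.of("lookup.cache.max-rows", "500"));
        System.out.println(cfg.maxRows);
    }
}
```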

> filters and projections should be pushed all the way down to the table function, like what we do in the scan source

That is a great goal. But the truth is that the ONLY connector that
currently supports filter pushdown is FileSystemTableSource
(no database connector supports it at the moment). Also, for some
databases it is simply impossible to push down filters as complex as
the ones we have in Flink.

>  only applying these optimizations to the cache seems not quite useful

Filters can cut off an arbitrarily large amount of data from the
dimension table. As a simple example, suppose the dimension table
'users' has a column 'age' with values from 20 to 40, and the input
stream 'clicks' is roughly uniformly distributed by user age. With the
filter 'age > 30', there will be half as much data in the cache. This
means the user can increase 'lookup.cache.max-rows' by almost 2 times,
which yields a big performance boost. Moreover, this optimization
really starts to shine with the 'ALL' cache, where tables that don't
fit in memory without filters and projections can fit with them. This
opens up additional possibilities for users, and it doesn't sound like
'not quite useful' to me.
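The arithmetic above can be checked with a tiny sketch (plain Java with invented helper names, not FLIP classes): rows pruned by the filter never occupy cache capacity, so the same max-rows budget covers more lookup keys.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Sketch of applying a pushed-down filter before the cache: only rows that
// pass the filter are counted against the cache capacity.
public class FilterBeforeCache {
    static int cachedRows(List<Integer> ages, Predicate<Integer> filter) {
        int n = 0;
        for (int age : ages) {
            if (filter.test(age)) n++; // only filtered-in rows are stored
        }
        return n;
    }

    public static void main(String[] args) {
        // Ages uniformly distributed over [20, 40), filter 'age > 30'
        List<Integer> ages = new ArrayList<>();
        for (int i = 0; i < 1000; i++) ages.add(20 + i % 20);
        int withFilter = cachedRows(ages, age -> age > 30);
        int withoutFilter = cachedRows(ages, age -> true);
        // Roughly half the rows survive the filter, halving cache usage.
        System.out.println(withoutFilter + " vs " + withFilter);
    }
}
```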

It would be great to hear other voices on this topic! We have quite a
few contentious points, and I think with the help of others it will be
easier for us to come to a consensus.

Best regards,
Smirnov Alexander


Fri, Apr 29, 2022 at 22:33, Qingsheng Ren <re...@gmail.com>:
>
> Hi Alexander and Arvid,
>
> Thanks for the discussion and sorry for my late response! We had an internal discussion together with Jark and Leonard and I’d like to summarize our ideas. Instead of implementing the cache logic in the table runtime layer or wrapping around the user-provided table function, we prefer to introduce some new APIs extending TableFunction with these concerns:
>
> 1. Caching actually breaks the semantic of "FOR SYSTEM_TIME AS OF proc_time”, because it couldn’t truly reflect the content of the lookup table at the moment of querying. If users choose to enable caching on the lookup table, they implicitly indicate that this breakage is acceptable in exchange for the performance. So we prefer not to provide caching on the table runtime level.
>
> 2. If we make the cache implementation in the framework (whether in a runner or a wrapper around TableFunction), we have to confront a situation that allows table options in DDL to control the behavior of the framework, which has never happened previously and should be cautious. Under the current design the behavior of the framework should only be specified by configurations (“table.exec.xxx”), and it’s hard to apply these general configs to a specific table.
>
> 3. We have use cases that lookup source loads and refresh all records periodically into the memory to achieve high lookup performance (like Hive connector in the community, and also widely used by our internal connectors). Wrapping the cache around the user’s TableFunction works fine for LRU caches, but I think we have to introduce a new interface for this all-caching scenario and the design would become more complex.
>
> 4. Providing the cache in the framework might introduce compatibility issues to existing lookup sources like there might exist two caches with totally different strategies if the user incorrectly configures the table (one in the framework and another implemented by the lookup source).
>
> As for the optimization mentioned by Alexander, I think filters and projections should be pushed all the way down to the table function, like what we do in the scan source, instead of the runner with the cache. The goal of using cache is to reduce the network I/O and pressure on the external system, and only applying these optimizations to the cache seems not quite useful.
>
> I made some updates to the FLIP[1] to reflect our ideas. We prefer to keep the cache implementation as a part of TableFunction, and we could provide some helper classes (CachingTableFunction, AllCachingTableFunction, CachingAsyncTableFunction) to developers and regulate metrics of the cache. Also, I made a POC[2] for your reference.
>
> Looking forward to your ideas!
>
> [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-221+Abstraction+for+lookup+source+cache+and+metric
> [2] https://github.com/PatrickRen/flink/tree/FLIP-221
>
> Best regards,
>
> Qingsheng
>
> On Tue, Apr 26, 2022 at 4:45 PM Александр Смирнов <sm...@gmail.com> wrote:
>>
>> Thanks for the response, Arvid!
>>
>> I have few comments on your message.
>>
>> > but could also live with an easier solution as the first step:
>>
>> I think these 2 ways are mutually exclusive (the one originally
>> proposed by Qingsheng and mine), because conceptually they pursue the
>> same goal but differ in implementation details. If we go one way,
>> moving to the other in the future will mean deleting existing code
>> and once again changing the API for connectors. So I think we should
>> reach a consensus with the community about that and then work together
>> on this FLIP, i.e. divide the work into tasks for different parts of
>> the FLIP (for example, LRU cache unification / introducing the
>> proposed set of metrics / further work…). WDYT, Qingsheng?
>>
>> > as the source will only receive the requests after filter
>>
>> Actually, if filters are applied to fields of the lookup table, we
>> must first make the requests, and only after that can we filter the
>> responses, because lookup connectors don't have filter pushdown. So if
>> filtering is done before caching, there will be far fewer rows in the
>> cache.
>>
>> > @Alexander unfortunately, your architecture is not shared. I don't know the
>>
>> > solution to share images to be honest.
>>
>> Sorry for that, I’m a bit new to such kinds of conversations :)
>> I have no write access to the confluence, so I made a Jira issue,
>> where described the proposed changes in more details -
>> https://issues.apache.org/jira/browse/FLINK-27411.
>>
>> I'll be happy to get more feedback!
>>
>> Best,
>> Smirnov Alexander
>>
>> Mon, Apr 25, 2022 at 19:49, Arvid Heise <ar...@apache.org>:
>> >
>> > Hi Qingsheng,
>> >
>> > Thanks for driving this; the inconsistency was not satisfying for me.
>> >
>> > I second Alexander's idea though but could also live with an easier
>> > solution as the first step: Instead of making caching an implementation
>> > detail of TableFunction X, rather devise a caching layer around X. So the
>> > proposal would be a CachingTableFunction that delegates to X in case of
>> > misses and else manages the cache. Lifting it into the operator model as
>> > proposed would be even better but is probably unnecessary in the first step
>> > for a lookup source (as the source will only receive the requests after
>> > filter; applying projection may be more interesting to save memory).
>> >
>> > Another advantage is that all the changes of this FLIP would be limited to
>> > options, no need for new public interfaces. Everything else remains an
>> > implementation of Table runtime. That means we can easily incorporate the
>> > optimization potential that Alexander pointed out later.
>> >
>> > @Alexander unfortunately, your architecture is not shared. I don't know the
>> > solution to share images to be honest.
>> >
>> > On Fri, Apr 22, 2022 at 5:04 PM Александр Смирнов <sm...@gmail.com>
>> > wrote:
>> >
>> > > Hi Qingsheng! My name is Alexander, I'm not a committer yet, but I'd
>> > > really like to become one. And this FLIP really interested me.
>> > > Actually I have worked on a similar feature in my company’s Flink
>> > > fork, and we would like to share our thoughts on this and make code
>> > > open source.
>> > >
>> > > I think there is a better alternative than introducing an abstract
>> > > class for TableFunction (CachingTableFunction). As you know,
>> > > TableFunction lives in the flink-table-common module, which provides
>> > > only an API for working with tables – very convenient to import in
>> > > connectors. In turn, CachingTableFunction contains logic for
>> > > runtime execution, so this class and everything connected with it
>> > > should be located in another module, probably flink-table-runtime.
>> > > But this would require connectors to depend on another module, one
>> > > that contains a lot of runtime logic, which doesn’t sound good.
>> > >
>> > > I suggest adding a new method ‘getLookupConfig’ to LookupTableSource
>> > > or LookupRuntimeProvider to allow connectors to only pass
>> > > configurations to the planner, so that they won’t depend on the
>> > > runtime implementation. Based on these configs the planner will
>> > > construct a lookup join operator with the corresponding runtime logic
>> > > (ProcessFunctions in the flink-table-runtime module). The
>> > > architecture looks like the pinned image (the LookupConfig class
>> > > there is actually your CacheConfig).
>> > >
>> > > The classes in flink-table-planner that will be responsible for this
>> > > are CommonPhysicalLookupJoin and its subclasses.
>> > > The current classes for lookup join in flink-table-runtime are
>> > > LookupJoinRunner, AsyncLookupJoinRunner, LookupJoinRunnerWithCalc and
>> > > AsyncLookupJoinRunnerWithCalc.
>> > >
>> > > I suggest adding classes LookupJoinCachingRunner,
>> > > LookupJoinCachingRunnerWithCalc, etc.
>> > >
>> > > And here comes another, more powerful advantage of such a solution.
>> > > If we have the caching logic at a lower level, we can apply some
>> > > optimizations to it. LookupJoinRunnerWithCalc was named like this
>> > > because it uses the ‘calc’ function, which actually consists mostly
>> > > of filters and projections.
>> > >
>> > > For example, when joining table A with lookup table B using the
>> > > condition ‘JOIN … ON A.id = B.id AND A.age = B.age + 10 WHERE
>> > > B.salary > 1000’, the ‘calc’ function will contain the filters
>> > > A.age = B.age + 10 and B.salary > 1000.
>> > >
>> > > If we apply this function before storing records in the cache, the
>> > > size of the cache will be significantly reduced: filters = avoid
>> > > storing useless records in the cache, projections = reduce the
>> > > records’ size. So the initial max number of records in the cache can
>> > > be increased by the user.
>> > >
>> > > What do you think about it?
>> > >
>> > >
>> > > On 2022/04/19 02:47:11 Qingsheng Ren wrote:
>> > > > Hi devs,
>> > > >
>> > > > Yuan and I would like to start a discussion about FLIP-221[1], which
>> > > introduces an abstraction of lookup table cache and its standard metrics.
>> > > >
>> > > > Currently each lookup table source has to implement its own cache to
>> > > store lookup results, and there isn’t a standard set of metrics for users and
>> > > developers to tune their jobs with lookup joins, which is a quite common
>> > > use case in Flink table / SQL.
>> > > >
>> > > > Therefore we propose some new APIs including cache, metrics, wrapper
>> > > classes of TableFunction and new table options. Please take a look at the
>> > > FLIP page [1] to get more details. Any suggestions and comments would be
>> > > appreciated!
>> > > >
>> > > > [1]
>> > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-221+Abstraction+for+lookup+source+cache+and+metric
>> > > >
>> > > > Best regards,
>> > > >
>> > > > Qingsheng
>> > > >
>> > > >
>> > >
>
>
>
> --
> Best Regards,
>
> Qingsheng Ren
>
> Real-time Computing Team
> Alibaba Cloud
>
> Email: renqschn@gmail.com

Re: [DISCUSS] FLIP-221 Abstraction for lookup source cache and metric

Posted by Jark Wu <im...@gmail.com>.
Thanks to Qingsheng for updating the design.

I just have a minor comment.
Changing RescanRuntimeProvider to the builder pattern seems unnecessary.
The builder pattern is usually used when there are several optional
parameters.
However, the ScanRuntimeProvider and the rescan interval are both required
when defining a RescanRuntimeProvider, so I think the "of" method is more
concise.
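For illustration (the names RescanRuntimeProvider and ScanRuntimeProvider come from the FLIP draft, but the signatures here are assumed): with both parameters mandatory, a single static factory is enough, whereas a builder adds ceremony without benefit.

```java
import java.time.Duration;

// Sketch of the "of" factory style for a provider whose parameters are all
// required. Names are assumed from the FLIP discussion, not final API.
public class RescanProviderSketch {
    interface ScanRuntimeProvider {}

    static class RescanRuntimeProvider {
        final ScanRuntimeProvider provider;
        final Duration rescanInterval;

        private RescanRuntimeProvider(ScanRuntimeProvider p, Duration i) {
            this.provider = p;
            this.rescanInterval = i;
        }

        // Both arguments are mandatory, so one "of" method suffices.
        static RescanRuntimeProvider of(ScanRuntimeProvider p, Duration i) {
            return new RescanRuntimeProvider(p, i);
        }
    }

    public static void main(String[] args) {
        RescanRuntimeProvider r = RescanRuntimeProvider.of(
                new ScanRuntimeProvider() {}, Duration.ofMinutes(10));
        System.out.println(r.rescanInterval.toMinutes());
    }
}
```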

Best,
Jark

On Tue, 17 May 2022 at 20:16, Qingsheng Ren <re...@gmail.com> wrote:

> Hi devs,
>
> I just updated FLIP-221 [1] according to discussions addressed above. This
> version made minor changes to make interfaces clearer:
>
> 1. The argument type of LookupFunction#lookup and
> AsyncLookupFunction#asyncLookup are changed from Object... to RowData, in
> order to be symmetric with output type and be more descriptive.
> 2. “set” is removed from method names in LookupCacheMetricGroup to align
> with methods in MetricGroup.
> 3. Add statements to deprecate TableFunctionProvider and
> AsyncTableFunctionProvider
> 4. Add builder class for LookupFunctionProvider,
> AsyncLookupFunctionProvider and RescanRuntimeProvider.
> 5. “invalidateAll” and “putAll” methods are removed from LookupCache
>
> Looking forward to your ideas!
>
> [1]
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-221+Abstraction+for+lookup+source+cache+and+metric
>
> Best,
>
> Qingsheng
>
> > On May 17, 2022, at 18:10, Qingsheng Ren <re...@gmail.com> wrote:
> >
> > Hi Alexander,
> >
> > Thanks for the review and glad to see we are on the same page! I think
> you forgot to cc the dev mailing list so I’m also quoting your reply under
> this email.
> >
> >> We can add 'maxRetryTimes' option into this class
> >
> > In my opinion the retry logic should be implemented in lookup() instead
> of in LookupFunction#eval(). Retrying is only meaningful for certain
> retriable failures, and there might be custom logic before making a
> retry, such as re-establishing the connection (JdbcRowDataLookupFunction
> is an example), so it's more practical to leave it to the connector.
> >
> >> I don't see DDL options, that were in previous version of FLIP. Do you
> have any special plans for them?
> >
> > We decided not to provide common DDL options and to let developers define
> their own options per connector, as we do now.
> >
> > The rest of comments sound great and I’ll update the FLIP. Hope we can
> finalize our proposal soon!
> >
> > Best,
> >
> > Qingsheng
> >
> >
> >> On May 17, 2022, at 13:46, Александр Смирнов <sm...@gmail.com>
> wrote:
> >>
> >> Hi Qingsheng and devs!
> >>
> >> I like the overall design of updated FLIP, however I have several
> >> suggestions and questions.
> >>
> >> 1) Introducing LookupFunction as a subclass of TableFunction is a good
> >> idea. We can add 'maxRetryTimes' option into this class. 'eval' method
> >> of new LookupFunction is great for this purpose. The same is for
> >> 'async' case.
> >>
> >> 2) There might be other configs in future, such as 'cacheMissingKey'
> >> in LookupFunctionProvider or 'rescanInterval' in ScanRuntimeProvider.
> >> Maybe use Builder pattern in LookupFunctionProvider and
> >> RescanRuntimeProvider for more flexibility (use one 'build' method
> >> instead of many 'of' methods in future)?
> >>
> >> 3) What are the plans for existing TableFunctionProvider and
> >> AsyncTableFunctionProvider? I think they should be deprecated.
> >>
> >> 4) Am I right that the current design does not assume usage of a
> >> user-provided LookupCache in re-scanning? In that case, it is not very
> >> clear why we need methods such as 'invalidate' or 'putAll' in
> >> LookupCache.
> >>
> >> 5) I don't see DDL options, that were in previous version of FLIP. Do
> >> you have any special plans for them?
> >>
> >> If you don't mind, I would be glad to make small adjustments to the
> >> FLIP document too. I think it's worth mentioning what exactly the
> >> optimizations planned for the future are.
> >>
> >> Best regards,
> >> Smirnov Alexander
> >>
> >> Fri, May 13, 2022 at 20:27, Qingsheng Ren <re...@gmail.com>:
> >>>
> >>> Hi Alexander and devs,
> >>>
> >>> Thank you very much for the in-depth discussion! As Jark mentioned we
> were inspired by Alexander's idea and made a refactor on our design.
> FLIP-221 [1] has been updated to reflect our design now and we are happy to
> hear more suggestions from you!
> >>>
> >>> Compared to the previous design:
> >>> 1. The lookup cache serves at table runtime level and is integrated as
> a component of LookupJoinRunner as discussed previously.
> >>> 2. Interfaces are renamed and re-designed to reflect the new design.
> >>> 3. We separate the all-caching case individually and introduce a new
> RescanRuntimeProvider to reuse the ability of scanning. We are planning to
> support SourceFunction / InputFormat for now considering the complexity of
> FLIP-27 Source API.
> >>> 4. A new interface LookupFunction is introduced to make the semantic
> of lookup more straightforward for developers.
> >>>
> >>> For replying to Alexander:
> >>>> However I'm a little confused whether InputFormat is deprecated or
> not. Am I right that it will be so in the future, but currently it's not?
> >>> Yes you are right. InputFormat is not deprecated for now. I think it
> will be deprecated in the future but we don't have a clear plan for that.
> >>>
> >>> Thanks again for the discussion on this FLIP and looking forward to
> cooperating with you after we finalize the design and interfaces!
> >>>
> >>> [1]
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-221+Abstraction+for+lookup+source+cache+and+metric
> >>>
> >>> Best regards,
> >>>
> >>> Qingsheng
> >>>
> >>>
> >>> On Fri, May 13, 2022 at 12:12 AM Александр Смирнов <
> smiralexan@gmail.com> wrote:
> >>>>
> >>>> Hi Jark, Qingsheng and Leonard!
> >>>>
> >>>> Glad to see that we came to a consensus on almost all points!
> >>>>
> >>>> However I'm a little confused whether InputFormat is deprecated or
> >>>> not. Am I right that it will be so in the future, but currently it's
> >>>> not? Actually I also think that for the first version it's OK to use
> >>>> InputFormat in ALL cache realization, because supporting rescan
> >>>> ability seems like a very distant prospect. But for this decision we
> >>>> need a consensus among all discussion participants.
> >>>>
> >>>> In general, I don't have anything to argue with in your statements.
> >>>> All of them correspond to my ideas. Looking ahead, it would be nice
> >>>> to work on this FLIP cooperatively. I've already done a lot of work
> >>>> on lookup join caching, with an implementation very close to the one
> >>>> we are discussing, and want to share the results of this work.
> >>>> Anyway, looking forward to the FLIP update!
> >>>>
> >>>> Best regards,
> >>>> Smirnov Alexander
> >>>>
> >>>> Thu, May 12, 2022 at 17:38, Jark Wu <im...@gmail.com>:
> >>>>>
> >>>>> Hi Alex,
> >>>>>
> >>>>> Thanks for summarizing your points.
> >>>>>
> >>>>> In the past week, Qingsheng, Leonard, and I have discussed it
> several times
> >>>>> and we have totally refactored the design.
> >>>>> I'm glad to say we have reached a consensus on many of your points!
> >>>>> Qingsheng is still working on updating the design docs and maybe can
> be
> >>>>> available in the next few days.
> >>>>> I will share some conclusions from our discussions:
> >>>>>
> >>>>> 1) we have refactored the design towards to "cache in framework" way.
> >>>>>
> >>>>> 2) a "LookupCache" interface for users to customize and a default
> >>>>> implementation with builder for users to easy-use.
> >>>>> This can both make it possible to both have flexibility and
> conciseness.
> >>>>>
> >>>>> 3) Filter pushdown is important for ALL and LRU lookup cache, esp
> reducing
> >>>>> IO.
> >>>>> Filter pushdown should be the final state and the unified way to both
> >>>>> support pruning ALL cache and LRU cache,
> >>>>> so I think we should make effort in this direction. If we need to
> support
> >>>>> filter pushdown for ALL cache anyway, why not use
> >>>>> it for LRU cache as well? Either way, as we decide to implement the
> cache
> >>>>> in the framework, we have the chance to support
> >>>>> filter on cache anytime. This is an optimization and it doesn't
> affect the
> >>>>> public API. I think we can create a JIRA issue to
> >>>>> discuss it when the FLIP is accepted.
> >>>>>
> >>>>> 4) The idea to support ALL cache is similar to your proposal.
> >>>>> In the first version, we will only support InputFormat,
> SourceFunction for
> >>>>> cache all (invoke InputFormat in join operator).
> >>>>> For FLIP-27 source, we need to join a true source operator instead of
> >>>>> calling it embedded in the join operator.
> >>>>> However, this needs another FLIP to support the re-scan ability for
> FLIP-27
> >>>>> Source, and this can be a large work.
> >>>>> In order to not block this issue, we can put the effort of FLIP-27
> source
> >>>>> integration into future work and integrate
> >>>>> InputFormat&SourceFunction for now.
> >>>>>
> >>>>> I think it's fine to use InputFormat&SourceFunction, as they are not
> >>>>> deprecated; otherwise we would have to introduce another function
> >>>>> similar to them, which would be pointless. We need to plan FLIP-27 source
> >>>>> integration ASAP before InputFormat & SourceFunction are deprecated.
> >>>>>
> >>>>> Best,
> >>>>> Jark
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>> On Thu, 12 May 2022 at 15:46, Александр Смирнов <
> smiralexan@gmail.com>
> >>>>> wrote:
> >>>>>
> >>>>>> Hi Martijn!
> >>>>>>
> >>>>>> Got it. Therefore, the realization with InputFormat is not
> considered.
> >>>>>> Thanks for clearing that up!
> >>>>>>
> >>>>>> Best regards,
> >>>>>> Smirnov Alexander
> >>>>>>
> >>>>>> Thu, May 12, 2022 at 14:23, Martijn Visser <ma...@ververica.com>:
> >>>>>>>
> >>>>>>> Hi,
> >>>>>>>
> >>>>>>> With regards to:
> >>>>>>>
> >>>>>>>> But if there are plans to refactor all connectors to FLIP-27
> >>>>>>>
> >>>>>>> Yes, FLIP-27 is the target for all connectors. The old interfaces
> will be
> >>>>>>> deprecated and connectors will either be refactored to use the new
> ones
> >>>>>> or
> >>>>>>> dropped.
> >>>>>>>
> >>>>>>> The caching should work for connectors that are using FLIP-27
> interfaces,
> >>>>>>> we should not introduce new features for old interfaces.
> >>>>>>>
> >>>>>>> Best regards,
> >>>>>>>
> >>>>>>> Martijn
> >>>>>>>
> >>>>>>> On Thu, 12 May 2022 at 06:19, Александр Смирнов <
> smiralexan@gmail.com>
> >>>>>>> wrote:
> >>>>>>>
> >>>>>>>> Hi Jark!
> >>>>>>>>
> >>>>>>>> Sorry for the late response. I would like to make some comments
> and
> >>>>>>>> clarify my points.
> >>>>>>>>
> >>>>>>>> 1) I agree with your first statement. I think we can achieve both
> >>>>>>>> advantages this way: put the Cache interface in
> flink-table-common,
> >>>>>>>> but have implementations of it in flink-table-runtime. Therefore
> if a
> >>>>>>>> connector developer wants to use existing cache strategies and
> their
> >>>>>>>> implementations, he can just pass lookupConfig to the planner,
> but if
> >>>>>>>> he wants to have its own cache implementation in his
> TableFunction, it
> >>>>>>>> will be possible for him to use the existing interface for this
> >>>>>>>> purpose (we can explicitly point this out in the documentation).
> In
> >>>>>>>> this way all configs and metrics will be unified. WDYT?
> >>>>>>>>
> >>>>>>>>> If a filter can prune 90% of data in the cache, we will have 90%
> of
> >>>>>>>> lookup requests that can never be cached
> >>>>>>>>
> >>>>>>>> 2) Let me clarify the logic filters optimization in case of LRU
> cache.
> >>>>>>>> It looks like Cache<RowData, Collection<RowData>>. Here we always
> >>>>>>>> store the response of the dimension table in cache, even after
> >>>>>>>> applying calc function. I.e. if there are no rows after applying
> >>>>>>>> filters to the result of the 'eval' method of TableFunction, we
> store
> >>>>>>>> the empty list by lookup keys. Therefore the cache line will be
> >>>>>>>> filled, but will require much less memory (in bytes). I.e. we
> don't
> >>>>>>>> completely filter keys, by which result was pruned, but
> significantly
> >>>>>>>> reduce required memory to store this result. If the user knows
> about
> >>>>>>>> this behavior, he can increase the 'max-rows' option before the
> start
> >>>>>>>> of the job. But actually I came up with the idea that we can do
> this
> >>>>>>>> automatically by using the 'maximumWeight' and 'weigher' methods
> of
> >>>>>>>> GuavaCache [1]. Weight can be the size of the collection of rows
> >>>>>>>> (value of cache). Therefore cache can automatically fit much more
> >>>>>>>> records than before.
> >>>>>>>>
> >>>>>>>>> Flink SQL has provided a standard way to do filters and projects
> >>>>>>>> pushdown, i.e., SupportsFilterPushDown and
> SupportsProjectionPushDown.
> >>>>>>>>> Jdbc/hive/HBase haven't implemented the interfaces, don't mean
> it's
> >>>>>> hard
> >>>>>>>> to implement.
> >>>>>>>>
> >>>>>>>> It's debatable how difficult it will be to implement filter
> pushdown.
> >>>>>>>> But I think the fact that currently there is no database connector
> >>>>>>>> with filter pushdown at least means that this feature won't be
> >>>>>>>> supported soon in connectors. Moreover, if we talk about other
> >>>>>>>> connectors (not in Flink repo), their databases might not support
> all
> >>>>>>>> Flink filters (or not support filters at all). I think users are
> >>>>>>>> interested in supporting cache filters optimization
> independently of
> >>>>>>>> supporting other features and solving more complex problems (or
> >>>>>>>> unsolvable at all).
> >>>>>>>>
> >>>>>>>> 3) I agree with your third statement. Actually in our internal
> version
> >>>>>>>> I also tried to unify the logic of scanning and reloading data
> from
> >>>>>>>> connectors. But unfortunately, I didn't find a way to unify the
> logic
> >>>>>>>> of all ScanRuntimeProviders (InputFormat, SourceFunction,
> Source,...)
> >>>>>>>> and reuse it in reloading ALL cache. As a result I settled on
> using
> >>>>>>>> InputFormat, because it was used for scanning in all lookup
> >>>>>>>> connectors. (I didn't know that there are plans to deprecate
> >>>>>>>> InputFormat in favor of FLIP-27 Source). IMO usage of FLIP-27
> source
> >>>>>>>> in ALL caching is not good idea, because this source was designed
> to
> >>>>>>>> work in distributed environment (SplitEnumerator on JobManager and
> >>>>>>>> SourceReaders on TaskManagers), not in one operator (lookup join
> >>>>>>>> operator in our case). There is even no direct way to pass splits
> from
> >>>>>>>> SplitEnumerator to SourceReader (this logic works through
> >>>>>>>> SplitEnumeratorContext, which requires
> >>>>>>>> OperatorCoordinator.SubtaskGateway to send AddSplitEvents). Usage
> of
> >>>>>>>> InputFormat for ALL cache seems much more clearer and easier. But
> if
> >>>>>>>> there are plans to refactor all connectors to FLIP-27, I have the
> >>>>>>>> following ideas: maybe we can refuse from lookup join ALL cache in
> >>>>>>>> favor of simple join with multiple scanning of batch source? The
> point
> >>>>>>>> is that the only difference between lookup join ALL cache and
> simple
> >>>>>>>> join with batch source is that in the first case scanning is
> performed
> >>>>>>>> multiple times, in between which state (cache) is cleared
> (correct me
> >>>>>>>> if I'm wrong). So what if we extend the functionality of simple
> join
> >>>>>>>> to support state reloading + extend the functionality of scanning
> >>>>>>>> batch source multiple times (this one should be easy with new
> FLIP-27
> >>>>>>>> source, that unifies streaming/batch reading - we will need to
> change
> >>>>>>>> only SplitEnumerator, which will pass splits again after some
> TTL).
> >>>>>>>> WDYT? I must say that this looks like a long-term goal and will
> make
> >>>>>>>> the scope of this FLIP even larger than you said. Maybe we can
> limit
> >>>>>>>> ourselves to a simpler solution now (InputFormats).
> >>>>>>>>
> >>>>>>>> So, to sum up, my points are as follows:
> >>>>>>>> 1) There is a way to make both concise and flexible interfaces for
> >>>>>>>> caching in lookup join.
> >>>>>>>> 2) Cache filters optimization is important both in LRU and ALL
> caches.
> >>>>>>>> 3) It is unclear when filter pushdown will be supported in Flink
> >>>>>>>> connectors, some of the connectors might not have the opportunity
> to
> >>>>>>>> support filter pushdown + as I know, currently filter pushdown
> works
> >>>>>>>> only for scanning (not lookup). So cache filters + projections
> >>>>>>>> optimization should be independent from other features.
> >>>>>>>> 4) ALL cache realization is a complex topic that involves multiple
> >>>>>>>> aspects of how Flink is developing. Dropping InputFormat in
> >>>>>>>> favor of the FLIP-27 Source will make the ALL cache
> >>>>>>>> implementation really complex and unclear, so maybe instead we
> >>>>>>>> can extend the functionality of simple join, or keep
> >>>>>>>> InputFormat for the lookup join ALL cache?
> >>>>>>>>
> >>>>>>>> Best regards,
> >>>>>>>> Smirnov Alexander
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> [1]
> >>>>>>>>
> >>>>>>
> https://guava.dev/releases/18.0/api/docs/com/google/common/cache/CacheBuilder.html#weigher(com.google.common.cache.Weigher)
> >>>>>>>>
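[Editor's aside on the weigher referenced in [1]: Guava's CacheBuilder can bound a cache by the summed weight of its entries rather than by entry count. A stdlib-only sketch of that idea, under stated assumptions — the class and method names below are illustrative, not Guava or Flink API:]

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.ToIntFunction;

// Illustrative weight-bounded LRU cache: evicts least-recently-used
// entries until the sum of per-entry weights fits under maxWeight.
class WeightBoundedCache<K, V> {
    // Access-order LinkedHashMap: iteration starts at the LRU entry.
    private final LinkedHashMap<K, V> map = new LinkedHashMap<>(16, 0.75f, true);
    private final ToIntFunction<V> weigher;
    private final int maxWeight;
    private int currentWeight = 0;

    WeightBoundedCache(int maxWeight, ToIntFunction<V> weigher) {
        this.maxWeight = maxWeight;
        this.weigher = weigher;
    }

    public void put(K key, V value) {
        V old = map.remove(key);
        if (old != null) currentWeight -= weigher.applyAsInt(old);
        map.put(key, value);
        currentWeight += weigher.applyAsInt(value);
        evictIfNeeded();
    }

    public V get(K key) { return map.get(key); }

    public int weight() { return currentWeight; }

    private void evictIfNeeded() {
        Iterator<Map.Entry<K, V>> it = map.entrySet().iterator();
        while (currentWeight > maxWeight && it.hasNext()) {
            Map.Entry<K, V> eldest = it.next();
            currentWeight -= weigher.applyAsInt(eldest.getValue());
            it.remove();
        }
    }
}
```

A weigher like `String::length` then bounds the cache by total characters stored instead of number of entries.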
> >>>>>>>> чт, 5 мая 2022 г. в 20:34, Jark Wu <im...@gmail.com>:
> >>>>>>>>>
> >>>>>>>>> It's great to see the active discussion! I want to share my
> ideas:
> >>>>>>>>>
> >>>>>>>>> 1) implement the cache in framework vs. connectors base
> >>>>>>>>> I don't have a strong opinion on this. Both ways should work
> (e.g.,
> >>>>>> cache
> >>>>>>>>> pruning, compatibility).
> >>>>>>>>> The framework way can provide more concise interfaces.
> >>>>>>>>> The connector base way can define more flexible cache
> >>>>>>>>> strategies/implementations.
> >>>>>>>>> We are still investigating a way to see if we can have both
> >>>>>> advantages.
> >>>>>>>>> We should reach a consensus that the way should be a final state,
> >>>>>> and we
> >>>>>>>>> are on the path to it.
> >>>>>>>>>
> >>>>>>>>> 2) filters and projections pushdown:
> >>>>>>>>> I agree with Alex that the filter pushdown into cache can
> benefit a
> >>>>>> lot
> >>>>>>>> for
> >>>>>>>>> ALL cache.
> >>>>>>>>> However, this is not true for LRU cache. Connectors use cache to
> >>>>>> reduce
> >>>>>>>> IO
> >>>>>>>>> requests to databases for better throughput.
> >>>>>>>>> If a filter can prune 90% of data in the cache, we will have 90%
> of
> >>>>>>>> lookup
> >>>>>>>>> requests that can never be cached
> >>>>>>>>> and hit the databases directly. That means the cache is
> >>>>>> meaningless in
> >>>>>>>>> this case.
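[Editor's aside: the hit-rate argument above can be simulated directly. The sketch below is hypothetical and stdlib-only; it contrasts a cache that stores empty ("negative") results for filtered-out keys with one that does not.]

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import java.util.function.IntPredicate;

// Illustrative simulation: if filtered-out keys are never cached (no
// "negative" entries), a selective filter makes most requests bypass the
// cache and hit the database every time.
class NegativeCachingDemo {
    static int backendCalls(int[] requests, IntPredicate filter, boolean cacheEmptyResults) {
        Map<Integer, Optional<Integer>> cache = new HashMap<>();
        int calls = 0;
        for (int key : requests) {
            Optional<Integer> hit = cache.get(key);
            if (hit != null) continue; // cache hit (possibly an empty result)
            calls++; // simulated database round trip
            Optional<Integer> result = filter.test(key) ? Optional.of(key) : Optional.empty();
            if (result.isPresent() || cacheEmptyResults) cache.put(key, result);
        }
        return calls;
    }
}
```

With 10 distinct keys requested 10 times each and a filter passing only 10% of them, caching empty results needs 10 backend calls, while skipping them needs 91.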
> >>>>>>>>>
> >>>>>>>>> IMO, Flink SQL has provided a standard way to do filters and
> projects
> >>>>>>>>> pushdown, i.e., SupportsFilterPushDown and
> >>>>>> SupportsProjectionPushDown.
> >>>>>>>>> Jdbc/hive/HBase haven't implemented the interfaces, but that doesn't mean
> it's
> >>>>>> hard
> >>>>>>>> to
> >>>>>>>>> implement.
> >>>>>>>>> They should implement the pushdown interfaces to reduce IO and
> the
> >>>>>> cache
> >>>>>>>>> size.
> >>>>>>>>> That should be a final state that the scan source and lookup
> source
> >>>>>> share
> >>>>>>>>> the exact pushdown implementation.
> >>>>>>>>> I don't see why we need to duplicate the pushdown logic in
> caches,
> >>>>>> which
> >>>>>>>>> will complicate the lookup join design.
> >>>>>>>>>
> >>>>>>>>> 3) ALL cache abstraction
> >>>>>>>>> All cache might be the most challenging part of this FLIP. We
> have
> >>>>>> never
> >>>>>>>>> provided a reload-lookup public interface.
> >>>>>>>>> Currently, we put the reload logic in the "eval" method of
> >>>>>> TableFunction.
> >>>>>>>>> That's hard for some sources (e.g., Hive).
> >>>>>>>>> Ideally, connector implementation should share the logic of
> reload
> >>>>>> and
> >>>>>>>>> scan, i.e. ScanTableSource with
> InputFormat/SourceFunction/FLIP-27
> >>>>>>>> Source.
> >>>>>>>>> However, InputFormat/SourceFunction are deprecated, and the
> FLIP-27
> >>>>>>>> source
> >>>>>>>>> is deeply coupled with SourceOperator.
> >>>>>>>>> If we want to invoke the FLIP-27 source in LookupJoin, this may
> make
> >>>>>> the
> >>>>>>>>> scope of this FLIP much larger.
> >>>>>>>>> We are still investigating how to abstract the ALL cache logic
> and
> >>>>>> reuse
> >>>>>>>>> the existing source interfaces.
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Best,
> >>>>>>>>> Jark
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> On Thu, 5 May 2022 at 20:22, Roman Boyko <ro...@gmail.com>
> >>>>>> wrote:
> >>>>>>>>>
> >>>>>>>>>> It's a much more complicated activity and lies outside the scope
> of
> >>>>>> this
> >>>>>>>>>> improvement. Because such pushdowns should be done for all
> >>>>>>>> ScanTableSource
> >>>>>>>>>> implementations (not only for Lookup ones).
> >>>>>>>>>>
> >>>>>>>>>> On Thu, 5 May 2022 at 19:02, Martijn Visser <
> >>>>>> martijnvisser@apache.org>
> >>>>>>>>>> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> Hi everyone,
> >>>>>>>>>>>
> >>>>>>>>>>> One question regarding "And Alexander correctly mentioned that
> >>>>>> filter
> >>>>>>>>>>> pushdown still is not implemented for jdbc/hive/hbase." ->
> Would
> >>>>>> an
> >>>>>>>>>>> alternative solution be to actually implement these filter
> >>>>>> pushdowns?
> >>>>>>>> I
> >>>>>>>>>>> can
> >>>>>>>>>>> imagine that there are many more benefits to doing that,
> outside
> >>>>>> of
> >>>>>>>> lookup
> >>>>>>>>>>> caching and metrics.
> >>>>>>>>>>>
> >>>>>>>>>>> Best regards,
> >>>>>>>>>>>
> >>>>>>>>>>> Martijn Visser
> >>>>>>>>>>> https://twitter.com/MartijnVisser82
> >>>>>>>>>>> https://github.com/MartijnVisser
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> On Thu, 5 May 2022 at 13:58, Roman Boyko <ro.v.boyko@gmail.com
> >
> >>>>>>>> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>> Hi everyone!
> >>>>>>>>>>>>
> >>>>>>>>>>>> Thanks for driving such a valuable improvement!
> >>>>>>>>>>>>
> >>>>>>>>>>>> I do think that single cache implementation would be a nice
> >>>>>>>> opportunity
> >>>>>>>>>>> for
> >>>>>>>>>>>> users. And it will break the "FOR SYSTEM_TIME AS OF proc_time"
> >>>>>>>> semantics
> >>>>>>>>>>>> anyway - no matter how it is implemented.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Putting myself in the user's shoes, I can say that:
> >>>>>>>>>>>> 1) I would prefer to have the opportunity to cut off the cache
> >>>>>> size
> >>>>>>>> by
> >>>>>>>>>>>> simply filtering unnecessary data. And the most handy way to
> do
> >>>>>> it
> >>>>>>>> is
> >>>>>>>>>>> apply
> >>>>>>>>>>>> it inside LookupRunners. It would be a bit harder to pass it
> >>>>>>>> through the
> >>>>>>>>>>>> LookupJoin node to TableFunction. And Alexander correctly
> >>>>>> mentioned
> >>>>>>>> that
> >>>>>>>>>>>> filter pushdown still is not implemented for jdbc/hive/hbase.
> >>>>>>>>>>>> 2) The ability to set the different caching parameters for
> >>>>>> different
> >>>>>>>>>>> tables
> >>>>>>>>>>>> is quite important. So I would prefer to set it through DDL
> >>>>>> rather
> >>>>>>>> than
> >>>>>>>>>>>> have the same TTL, strategy and other options for all lookup
> >>>>>> tables.
> >>>>>>>>>>>> 3) Moving the cache into the framework really deprives us
> of
> >>>>>>>>>>>> extensibility (users won't be able to implement their own
> >>>>>> cache).
> >>>>>>>> But
> >>>>>>>>>>> most
> >>>>>>>>>>>> probably it might be solved by creating more different cache
> >>>>>>>> strategies
> >>>>>>>>>>> and
> >>>>>>>>>>>> a wider set of configurations.
> >>>>>>>>>>>>
> >>>>>>>>>>>> All these points are much closer to the schema proposed by
> >>>>>>>> Alexander.
> >>>>>>>>>>>> Qingsheng Ren, please correct me if I'm not right and all these
> >>>>>>>>>>> facilities
> >>>>>>>>>>>> might be simply implemented in your architecture?
> >>>>>>>>>>>>
> >>>>>>>>>>>> Best regards,
> >>>>>>>>>>>> Roman Boyko
> >>>>>>>>>>>> e.: ro.v.boyko@gmail.com
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Wed, 4 May 2022 at 21:01, Martijn Visser <
> >>>>>>>> martijnvisser@apache.org>
> >>>>>>>>>>>> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>>> Hi everyone,
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> I don't have much to chip in, but just wanted to express that
> >>>>>> I
> >>>>>>>> really
> >>>>>>>>>>>>> appreciate the in-depth discussion on this topic and I hope
> >>>>>> that
> >>>>>>>>>>> others
> >>>>>>>>>>>>> will join the conversation.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Best regards,
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Martijn
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> On Tue, 3 May 2022 at 10:15, Александр Смирнов <
> >>>>>>>> smiralexan@gmail.com>
> >>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> Hi Qingsheng, Leonard and Jark,
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Thanks for your detailed feedback! However, I have questions
> >>>>>>>> about
> >>>>>>>>>>>>>> some of your statements (maybe I didn't get something?).
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Caching actually breaks the semantic of "FOR SYSTEM_TIME
> >>>>>> AS OF
> >>>>>>>>>>>>> proc_time”
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> I agree that the semantics of "FOR SYSTEM_TIME AS OF
> >>>>>> proc_time"
> >>>>>>>> is
> >>>>>>>>>>> not
> >>>>>>>>>>>>>> fully implemented with caching, but as you said, users accept
> >>>>>>>>>>>>>> this
> >>>>>>>>>>>>>> consciously to achieve better performance (no one proposed
> >>>>>> to
> >>>>>>>> enable
> >>>>>>>>>>>>>> caching by default, etc.). Or by users do you mean other
> >>>>>>>> developers
> >>>>>>>>>>> of
> >>>>>>>>>>>>>> connectors? In this case developers explicitly specify
> >>>>>> whether
> >>>>>>>> their
> >>>>>>>>>>>>>> connector supports caching or not (in the list of supported
> >>>>>>>>>>> options),
> >>>>>>>>>>>>>> no one makes them do that if they don't want to. So what
> >>>>>>>> exactly is
> >>>>>>>>>>>>>> the difference between implementing caching in modules
> >>>>>>>>>>>>>> flink-table-runtime and in flink-table-common from the
> >>>>>>>> considered
> >>>>>>>>>>>>>> point of view? How does it affect breaking/non-breaking
> >>>>>> the
> >>>>>>>>>>>>>> semantics of "FOR SYSTEM_TIME AS OF proc_time"?
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> confront a situation that allows table options in DDL to
> >>>>>>>> control
> >>>>>>>>>>> the
> >>>>>>>>>>>>>> behavior of the framework, which has never happened
> >>>>>> previously
> >>>>>>>> and
> >>>>>>>>>>>> should
> >>>>>>>>>>>>>> be cautious
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> If we talk about main differences of semantics of DDL
> >>>>>> options
> >>>>>>>> and
> >>>>>>>>>>>>>> config options("table.exec.xxx"), isn't it about limiting
> >>>>>> the
> >>>>>>>> scope
> >>>>>>>>>>> of
> >>>>>>>>>>>>>> the options + importance for the user business logic rather
> >>>>>> than
> >>>>>>>>>>>>>> specific location of corresponding logic in the framework? I
> >>>>>>>> mean
> >>>>>>>>>>> that
> >>>>>>>>>>>>>> in my design, for example, putting an option with lookup
> >>>>>> cache
> >>>>>>>>>>>>>> strategy in configurations would be the wrong decision,
> >>>>>>>> because it
> >>>>>>>>>>>>>> directly affects the user's business logic (not just
> >>>>>> performance
> >>>>>>>>>>>>>> optimization) + touches just several functions of ONE table
> >>>>>>>> (there
> >>>>>>>>>>> can
> >>>>>>>>>>>>>> be multiple tables with different caches). Does it really
> >>>>>>>> matter for
> >>>>>>>>>>>>>> the user (or someone else) where the logic is located,
> >>>>>> which is
> >>>>>>>>>>>>>> affected by the applied option?
> >>>>>>>>>>>>>> Also I can remember DDL option 'sink.parallelism', which in
> >>>>>>>> some way
> >>>>>>>>>>>>>> "controls the behavior of the framework" and I don't see any
> >>>>>>>> problem
> >>>>>>>>>>>>>> here.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> introduce a new interface for this all-caching scenario
> >>>>>> and
> >>>>>>>> the
> >>>>>>>>>>>> design
> >>>>>>>>>>>>>> would become more complex
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> This is a subject for a separate discussion, but actually
> >>>>>> in our
> >>>>>>>>>>>>>> internal version we solved this problem quite easily - we
> >>>>>> reused
> >>>>>>>>>>>>>> InputFormat class (so there is no need for a new API). The
> >>>>>>>> point is
> >>>>>>>>>>>>>> that currently all lookup connectors use InputFormat for
> >>>>>>>> scanning
> >>>>>>>>>>> the
> >>>>>>>>>>>>>> data in batch mode: HBase, JDBC and even Hive - it uses
> >>>>>> class
> >>>>>>>>>>>>>> PartitionReader, which is actually just a wrapper around
> >>>>>>>> InputFormat.
> >>>>>>>>>>>>>> The advantage of this solution is the ability to reload
> >>>>>> cache
> >>>>>>>> data
> >>>>>>>>>>> in
> >>>>>>>>>>>>>> parallel (number of threads depends on number of
> >>>>>> InputSplits,
> >>>>>>>> but
> >>>>>>>>>>> has
> >>>>>>>>>>>>>> an upper limit). As a result, cache reload time is
> >>>>>>>>>>>>>> significantly reduced
> >>>>>>>>>>>>>> (as well as time of input stream blocking). I know that
> >>>>>> usually
> >>>>>>>> we
> >>>>>>>>>>> try
> >>>>>>>>>>>>>> to avoid usage of concurrency in Flink code, but maybe this
> >>>>>> one
> >>>>>>>> can
> >>>>>>>>>>> be
> >>>>>>>>>>>>>> an exception. BTW I don't say that it's an ideal solution,
> >>>>>> maybe
> >>>>>>>>>>> there
> >>>>>>>>>>>>>> are better ones.
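[Editor's aside: the periodic full-reload ("ALL cache") behavior discussed above can be sketched as follows. AllCacheSketch and its members are hypothetical names, with an injected clock so the TTL logic is testable; a real implementation would do the full scan via an InputFormat or a FLIP-27 source, possibly in parallel.]

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Supplier;

// Illustrative ALL cache: the whole dimension table is loaded into memory
// and reloaded once the current snapshot is older than a TTL.
class AllCacheSketch<K, R> {
    private final Supplier<Map<K, List<R>>> fullScan; // e.g. a batch scan of the table
    private final long ttlMillis;
    private final Supplier<Long> clock;               // injected for testability
    private Map<K, List<R>> snapshot;
    private long loadedAt;

    AllCacheSketch(Supplier<Map<K, List<R>>> fullScan, long ttlMillis, Supplier<Long> clock) {
        this.fullScan = fullScan;
        this.ttlMillis = ttlMillis;
        this.clock = clock;
    }

    List<R> lookup(K key) {
        long now = clock.get();
        if (snapshot == null || now - loadedAt >= ttlMillis) {
            snapshot = fullScan.get(); // blocking reload of the whole table
            loadedAt = now;
        }
        return snapshot.getOrDefault(key, List.of());
    }
}
```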
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Providing the cache in the framework might introduce
> >>>>>>>> compatibility
> >>>>>>>>>>>>> issues
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> It's possible only in cases when the developer of the
> >>>>>> connector
> >>>>>>>>>>> doesn't
> >>>>>>>>>>>>>> properly refactor his code and will use new cache options
> >>>>>>>>>>> incorrectly
> >>>>>>>>>>>>>> (i.e. explicitly provide the same options into 2 different
> >>>>>> code
> >>>>>>>>>>>>>> places). For correct behavior all he will need to do is to
> >>>>>>>> redirect
> >>>>>>>>>>>>>> existing options to the framework's LookupConfig (+ maybe
> >>>>>> add an
> >>>>>>>>>>> alias
> >>>>>>>>>>>>>> for options, if there was different naming), everything
> >>>>>> will be
> >>>>>>>>>>>>>> transparent for users. If the developer doesn't do any
> >>>>>>>>>>>>>> refactoring at all,
> >>>>>>>>>>>>>> nothing will change for the connector because of
> >>>>>> backward
> >>>>>>>>>>>>>> compatibility. Also if a developer wants to use his own
> >>>>>> cache
> >>>>>>>> logic,
> >>>>>>>>>>>>>> he just can refuse to pass some of the configs into the
> >>>>>>>> framework,
> >>>>>>>>>>> and
> >>>>>>>>>>>>>> instead make his own implementation with already existing
> >>>>>>>> configs
> >>>>>>>>>>> and
> >>>>>>>>>>>>>> metrics (but actually I think that it's a rare case).
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> filters and projections should be pushed all the way down
> >>>>>> to
> >>>>>>>> the
> >>>>>>>>>>>> table
> >>>>>>>>>>>>>> function, like what we do in the scan source
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> It's the great purpose. But the truth is that the ONLY
> >>>>>> connector
> >>>>>>>>>>> that
> >>>>>>>>>>>>>> supports filter pushdown is FileSystemTableSource
> >>>>>>>>>>>>>> (no database connector supports it currently). Also for some
> >>>>>>>>>>> databases
> >>>>>>>>>>>>>> it's simply impossible to pushdown such complex filters
> >>>>>> that we
> >>>>>>>> have
> >>>>>>>>>>>>>> in Flink.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> only applying these optimizations to the cache seems not
> >>>>>>>> quite
> >>>>>>>>>>>> useful
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Filters can cut off an arbitrarily large amount of data
> >>>>>> from the
> >>>>>>>>>>>>>> dimension table. For a simple example, suppose in dimension
> >>>>>>>> table
> >>>>>>>>>>>>>> 'users'
> >>>>>>>>>>>>>> we have column 'age' with values from 20 to 40, and input
> >>>>>> stream
> >>>>>>>>>>>>>> 'clicks' that is ~uniformly distributed by age of users. If
> >>>>>> we
> >>>>>>>> have
> >>>>>>>>>>>>>> filter 'age > 30',
> >>>>>>>>>>>>>> there will be half as much data in the cache. This means the user
> >>>>>> can
> >>>>>>>>>>>>>> increase 'lookup.cache.max-rows' by almost 2 times. It will
> >>>>>>>> gain a
> >>>>>>>>>>>>>> huge
> >>>>>>>>>>>>>> performance boost. Moreover, this optimization starts to
> >>>>>> really
> >>>>>>>>>>> shine
> >>>>>>>>>>>>>> in 'ALL' cache, where tables without filters and projections
> >>>>>>>> can't
> >>>>>>>>>>> fit
> >>>>>>>>>>>>>> in memory, but with them - can. This opens up additional
> >>>>>>>>>>> possibilities
> >>>>>>>>>>>>>> for users. And this doesn't sound like 'not quite useful'.
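[Editor's aside: the filter-before-caching idea can be sketched like this. FilteringLookupCache and the other names are hypothetical, not Flink API; note that empty results are cached too, which also addresses the repeated-miss concern raised earlier in the thread.]

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.function.Predicate;

// Illustrative sketch: apply the join's residual filter to lookup results
// *before* they are stored, so rows the join would discard anyway never
// occupy cache space.
class FilteringLookupCache<K, R> {
    private final Function<K, List<R>> backendLookup; // e.g. a JDBC query
    private final Predicate<R> residualFilter;        // e.g. salary > 1000
    private final Map<K, List<R>> cache = new HashMap<>();

    FilteringLookupCache(Function<K, List<R>> backendLookup, Predicate<R> residualFilter) {
        this.backendLookup = backendLookup;
        this.residualFilter = residualFilter;
    }

    List<R> lookup(K key) {
        return cache.computeIfAbsent(key, k -> {
            List<R> filtered = new ArrayList<>();
            for (R row : backendLookup.apply(k)) {
                if (residualFilter.test(row)) filtered.add(row);
            }
            return filtered; // empty lists are cached too, avoiding re-queries
        });
    }

    int cachedRowCount() {
        int n = 0;
        for (List<R> rows : cache.values()) n += rows.size();
        return n;
    }
}
```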
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> It would be great to hear other voices regarding this topic!
> >>>>>>>> Because
> >>>>>>>>>>>>>> we have quite a lot of controversial points, and I think
> >>>>>> with
> >>>>>>>> the
> >>>>>>>>>>> help
> >>>>>>>>>>>>>> of others it will be easier for us to come to a consensus.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Best regards,
> >>>>>>>>>>>>>> Smirnov Alexander
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> пт, 29 апр. 2022 г. в 22:33, Qingsheng Ren <
> >>>>>> renqschn@gmail.com
> >>>>>>>>> :
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Hi Alexander and Arvid,
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Thanks for the discussion and sorry for my late response!
> >>>>>> We
> >>>>>>>> had
> >>>>>>>>>>> an
> >>>>>>>>>>>>>> internal discussion together with Jark and Leonard and I’d
> >>>>>> like
> >>>>>>>> to
> >>>>>>>>>>>>>> summarize our ideas. Instead of implementing the cache
> >>>>>> logic in
> >>>>>>>> the
> >>>>>>>>>>>> table
> >>>>>>>>>>>>>> runtime layer or wrapping around the user-provided table
> >>>>>>>> function,
> >>>>>>>>>>> we
> >>>>>>>>>>>>>> prefer to introduce some new APIs extending TableFunction
> >>>>>> with
> >>>>>>>> these
> >>>>>>>>>>>>>> concerns:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> 1. Caching actually breaks the semantic of "FOR
> >>>>>> SYSTEM_TIME
> >>>>>>>> AS OF
> >>>>>>>>>>>>>> proc_time”, because it couldn’t truly reflect the content
> >>>>>> of the
> >>>>>>>>>>> lookup
> >>>>>>>>>>>>>> table at the moment of querying. If users choose to enable
> >>>>>>>> caching
> >>>>>>>>>>> on
> >>>>>>>>>>>> the
> >>>>>>>>>>>>>> lookup table, they implicitly indicate that this breakage is
> >>>>>>>>>>> acceptable
> >>>>>>>>>>>>> in
> >>>>>>>>>>>>>> exchange for the performance. So we prefer not to provide
> >>>>>>>> caching on
> >>>>>>>>>>>> the
> >>>>>>>>>>>>>> table runtime level.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> 2. If we make the cache implementation in the framework
> >>>>>>>> (whether
> >>>>>>>>>>> in a
> >>>>>>>>>>>>>> runner or a wrapper around TableFunction), we have to
> >>>>>> confront a
> >>>>>>>>>>>>> situation
> >>>>>>>>>>>>>> that allows table options in DDL to control the behavior of
> >>>>>> the
> >>>>>>>>>>>>> framework,
> >>>>>>>>>>>>>> which has never happened previously and should be cautious.
> >>>>>>>> Under
> >>>>>>>>>>> the
> >>>>>>>>>>>>>> current design the behavior of the framework should only be
> >>>>>>>>>>> specified
> >>>>>>>>>>>> by
> >>>>>>>>>>>>>> configurations (“table.exec.xxx”), and it’s hard to apply
> >>>>>> these
> >>>>>>>>>>> general
> >>>>>>>>>>>>>> configs to a specific table.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> 3. We have use cases where the lookup source loads and
> >>>>>>>>>>>>>>> refreshes all records
> >>>>>>>>>>>>>> periodically into memory to achieve high lookup
> >>>>>> performance
> >>>>>>>>>>> (like
> >>>>>>>>>>>>> Hive
> >>>>>>>>>>>>>> connector in the community, and also widely used by our
> >>>>>> internal
> >>>>>>>>>>>>>> connectors). Wrapping the cache around the user’s
> >>>>>> TableFunction
> >>>>>>>>>>> works
> >>>>>>>>>>>>> fine
> >>>>>>>>>>>>>> for LRU caches, but I think we have to introduce a new
> >>>>>>>> interface for
> >>>>>>>>>>>> this
> >>>>>>>>>>>>>> all-caching scenario and the design would become more
> >>>>>> complex.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> 4. Providing the cache in the framework might introduce
> >>>>>>>>>>> compatibility
> >>>>>>>>>>>>>> issues to existing lookup sources like there might exist two
> >>>>>>>> caches
> >>>>>>>>>>>> with
> >>>>>>>>>>>>>> totally different strategies if the user incorrectly
> >>>>>> configures
> >>>>>>>> the
> >>>>>>>>>>>> table
> >>>>>>>>>>>>>> (one in the framework and another implemented by the lookup
> >>>>>>>> source).
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> As for the optimization mentioned by Alexander, I think
> >>>>>>>> filters
> >>>>>>>>>>> and
> >>>>>>>>>>>>>> projections should be pushed all the way down to the table
> >>>>>>>> function,
> >>>>>>>>>>>> like
> >>>>>>>>>>>>>> what we do in the scan source, instead of the runner with
> >>>>>> the
> >>>>>>>> cache.
> >>>>>>>>>>>> The
> >>>>>>>>>>>>>> goal of using cache is to reduce the network I/O and
> >>>>>> pressure
> >>>>>>>> on the
> >>>>>>>>>>>>>> external system, and only applying these optimizations to
> >>>>>> the
> >>>>>>>> cache
> >>>>>>>>>>>> seems
> >>>>>>>>>>>>>> not quite useful.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> I made some updates to the FLIP[1] to reflect our ideas.
> >>>>>> We
> >>>>>>>>>>> prefer to
> >>>>>>>>>>>>>> keep the cache implementation as a part of TableFunction,
> >>>>>> and we
> >>>>>>>>>>> could
> >>>>>>>>>>>>>> provide some helper classes (CachingTableFunction,
> >>>>>>>>>>>>> AllCachingTableFunction,
> >>>>>>>>>>>>>> CachingAsyncTableFunction) to developers and regulate
> >>>>>> metrics
> >>>>>>>> of the
> >>>>>>>>>>>>> cache.
> >>>>>>>>>>>>>> Also, I made a POC[2] for your reference.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Looking forward to your ideas!
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> [1]
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>
> >>>>>>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-221+Abstraction+for+lookup+source+cache+and+metric
> >>>>>>>>>>>>>>> [2] https://github.com/PatrickRen/flink/tree/FLIP-221
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Best regards,
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Qingsheng
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> On Tue, Apr 26, 2022 at 4:45 PM Александр Смирнов <
> >>>>>>>>>>>>> smiralexan@gmail.com>
> >>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Thanks for the response, Arvid!
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> I have few comments on your message.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> but could also live with an easier solution as the
> >>>>>> first
> >>>>>>>> step:
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> I think that these 2 ways are mutually exclusive
> >>>>>> (originally
> >>>>>>>>>>>> proposed
> >>>>>>>>>>>>>>>> by Qingsheng and mine), because conceptually they follow
> >>>>>> the
> >>>>>>>> same
> >>>>>>>>>>>>>>>> goal, but implementation details are different. If we
> >>>>>> will
> >>>>>>>> go one
> >>>>>>>>>>>> way,
> >>>>>>>>>>>>>>>> moving to another way in the future will mean deleting
> >>>>>>>> existing
> >>>>>>>>>>> code
> >>>>>>>>>>>>>>>> and once again changing the API for connectors. So I
> >>>>>> think we
> >>>>>>>>>>> should
> >>>>>>>>>>>>>>>> reach a consensus with the community about that and then
> >>>>>> work
> >>>>>>>>>>>> together
> >>>>>>>>>>>>>>>> on this FLIP, i.e. divide the work on tasks for different
> >>>>>>>> parts
> >>>>>>>>>>> of
> >>>>>>>>>>>> the
> >>>>>>>>>>>>>>>> flip (for example, LRU cache unification / introducing
> >>>>>>>> proposed
> >>>>>>>>>>> set
> >>>>>>>>>>>> of
> >>>>>>>>>>>>>>>> metrics / further work…). WDYT, Qingsheng?
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> as the source will only receive the requests after
> >>>>>> filter
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Actually if filters are applied to fields of the lookup
> >>>>>>>> table, we
> >>>>>>>>>>>>>>>> firstly must do requests, and only after that we can
> >>>>>> filter
> >>>>>>>>>>>> responses,
> >>>>>>>>>>>>>>>> because lookup connectors don't have filter pushdown. So
> >>>>>> if
> >>>>>>>>>>>> filtering
> >>>>>>>>>>>>>>>> is done before caching, there will be far fewer rows in
> >>>>>>>> cache.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> @Alexander unfortunately, your architecture is not
> >>>>>> shared.
> >>>>>>>> I
> >>>>>>>>>>> don't
> >>>>>>>>>>>>>> know the
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> solution to share images to be honest.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Sorry for that, I’m a bit new to such kinds of
> >>>>>> conversations
> >>>>>>>> :)
> >>>>>>>>>>>>>>>> I have no write access to the confluence, so I made a
> >>>>>> Jira
> >>>>>>>> issue,
> >>>>>>>>>>>>>>>> where described the proposed changes in more details -
> >>>>>>>>>>>>>>>> https://issues.apache.org/jira/browse/FLINK-27411.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Will happy to get more feedback!
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Best,
> >>>>>>>>>>>>>>>> Smirnov Alexander
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> пн, 25 апр. 2022 г. в 19:49, Arvid Heise <
> >>>>>> arvid@apache.org>:
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Hi Qingsheng,
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Thanks for driving this; the inconsistency was not
> >>>>>>>> satisfying
> >>>>>>>>>>> for
> >>>>>>>>>>>>> me.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> I second Alexander's idea though but could also live
> >>>>>> with
> >>>>>>>> an
> >>>>>>>>>>>> easier
> >>>>>>>>>>>>>>>>> solution as the first step: Instead of making caching
> >>>>>> an
> >>>>>>>>>>>>>> implementation
> >>>>>>>>>>>>>>>>> detail of TableFunction X, rather devise a caching
> >>>>>> layer
> >>>>>>>>>>> around X.
> >>>>>>>>>>>>> So
> >>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>> proposal would be a CachingTableFunction that
> >>>>>> delegates to
> >>>>>>>> X in
> >>>>>>>>>>>> case
> >>>>>>>>>>>>>> of
> >>>>>>>>>>>>>>>>> misses and else manages the cache. Lifting it into the
> >>>>>>>> operator
> >>>>>>>>>>>>> model
> >>>>>>>>>>>>>> as
> >>>>>>>>>>>>>>>>> proposed would be even better but is probably
> >>>>>> unnecessary
> >>>>>>>> in
> >>>>>>>>>>> the
> >>>>>>>>>>>>>> first step
> >>>>>>>>>>>>>>>>> for a lookup source (as the source will only receive
> >>>>>> the
> >>>>>>>>>>> requests
> >>>>>>>>>>>>>> after
> >>>>>>>>>>>>>>>>> filter; applying projection may be more interesting to
> >>>>>> save
> >>>>>>>>>>>> memory).
> >>>>>>>>>>>>>>>>>
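[Editor's aside: a minimal sketch of the delegating caching layer described above — delegate to the wrapped function on misses, serve hits from a count-bounded LRU map. All names are hypothetical, not the FLIP's actual interfaces.]

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Illustrative wrapper: a caching layer around the "real" lookup function X.
class CachingLookupWrapper<K, R> {
    private final Function<K, List<R>> delegate; // the wrapped TableFunction X
    private final Map<K, List<R>> cache;

    CachingLookupWrapper(Function<K, List<R>> delegate, int maxEntries) {
        this.delegate = delegate;
        // Access-order LinkedHashMap evicting the LRU entry past maxEntries.
        this.cache = new LinkedHashMap<K, List<R>>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, List<R>> eldest) {
                return size() > maxEntries;
            }
        };
    }

    List<R> lookup(K key) {
        List<R> cached = cache.get(key);
        if (cached != null) return cached;  // hit: no call to the delegate
        List<R> rows = delegate.apply(key); // miss: delegate to X
        cache.put(key, rows);
        return rows;
    }
}
```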
> >>>>>>>>>>>>>>>>> Another advantage is that all the changes of this FLIP
> >>>>>>>> would be
> >>>>>>>>>>>>>> limited to
> >>>>>>>>>>>>>>>>> options, no need for new public interfaces. Everything
> >>>>>> else
> >>>>>>>>>>>> remains
> >>>>>>>>>>>>> an
> >>>>>>>>>>>>>>>>> implementation of Table runtime. That means we can
> >>>>>> easily
> >>>>>>>>>>>>> incorporate
> >>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>> optimization potential that Alexander pointed out
> >>>>>> later.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> @Alexander unfortunately, your architecture is not
> >>>>>> shared.
> >>>>>>>> I
> >>>>>>>>>>> don't
> >>>>>>>>>>>>>> know the
> >>>>>>>>>>>>>>>>> solution to share images to be honest.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> On Fri, Apr 22, 2022 at 5:04 PM Александр Смирнов <
> >>>>>>>>>>>>>> smiralexan@gmail.com>
> >>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Hi Qingsheng! My name is Alexander, I'm not a
> >>>>>> committer
> >>>>>>>> yet,
> >>>>>>>>>>> but
> >>>>>>>>>>>>> I'd
> >>>>>>>>>>>>>>>>>> really like to become one. And this FLIP really
> >>>>>>>> interested
> >>>>>>>>>>> me.
> >>>>>>>>>>>>>>>>>> Actually I have worked on a similar feature in my
> >>>>>>>> company’s
> >>>>>>>>>>>> Flink
> >>>>>>>>>>>>>>>>>> fork, and we would like to share our thoughts on
> >>>>>> this and
> >>>>>>>>>>> make
> >>>>>>>>>>>>> code
> >>>>>>>>>>>>>>>>>> open source.
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> I think there is a better alternative than
> >>>>>> introducing an
> >>>>>>>>>>>> abstract
> >>>>>>>>>>>>>>>>>> class for TableFunction (CachingTableFunction). As
> >>>>>> you
> >>>>>>>> know,
> >>>>>>>>>>>>>>>>>> TableFunction exists in the flink-table-common
> >>>>>> module,
> >>>>>>>> which
> >>>>>>>>>>>>>> provides
> >>>>>>>>>>>>>>>>>> only an API for working with tables – it’s very
> >>>>>>>> convenient
> >>>>>>>>>>> for
> >>>>>>>>>>>>>> importing
> >>>>>>>>>>>>>>>>>> in connectors. In turn, CachingTableFunction contains
> >>>>>>>> logic
> >>>>>>>>>>> for
> >>>>>>>>>>>>>>>>>> runtime execution,  so this class and everything
> >>>>>>>> connected
> >>>>>>>>>>> with
> >>>>>>>>>>>> it
> >>>>>>>>>>>>>>>>>> should be located in another module, probably in
> >>>>>>>>>>>>>> flink-table-runtime.
> >>>>>>>>>>>>>>>>>> But this will require connectors to depend on another
> >>>>>>>> module,
> >>>>>>>>>>>>> which
> >>>>>>>>>>>>>>>>>> contains a lot of runtime logic, which doesn’t sound
> >>>>>>>> good.
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> I suggest adding a new method ‘getLookupConfig’ to
> >>>>>>>>>>>>> LookupTableSource
> >>>>>>>>>>>>>>>>>> or LookupRuntimeProvider to allow connectors to only
> >>>>>> pass
> >>>>>>>>>>>>>>>>>> configurations to the planner, therefore they won’t
> >>>>>>>> depend on
> >>>>>>>>>>>>>> runtime
> >>>>>>>>>>>>>>>>>> implementation. Based on these configs the planner will
> >>>>>>>> construct a
> >>>>>>>>>>>>> lookup
> >>>>>>>>>>>>>>>>>> join operator with corresponding runtime logic
> >>>>>>>>>>> (ProcessFunctions
> >>>>>>>>>>>>> in
> >>>>>>>>>>>>>>>>>> module flink-table-runtime). Architecture looks like
> >>>>>> in
> >>>>>>>> the
> >>>>>>>>>>>> pinned
> >>>>>>>>>>>>>>>>>> image (the LookupConfig class there is actually your
> >>>>>>>>>>> CacheConfig).
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Classes in flink-table-planner, that will be
> >>>>>> responsible
> >>>>>>>> for
> >>>>>>>>>>>> this
> >>>>>>>>>>>>> –
> >>>>>>>>>>>>>>>>>> CommonPhysicalLookupJoin and its subclasses.
> >>>>>>>>>>>>>>>>>> Current classes for lookup join in
> >>>>>> flink-table-runtime
> >>>>>>>> -
> >>>>>>>>>>>>>>>>>> LookupJoinRunner, AsyncLookupJoinRunner,
> >>>>>>>>>>>> LookupJoinRunnerWithCalc,
> >>>>>>>>>>>>>>>>>> AsyncLookupJoinRunnerWithCalc.
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> I suggest adding classes LookupJoinCachingRunner,
> >>>>>>>>>>>>>>>>>> LookupJoinCachingRunnerWithCalc, etc.
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> And here comes another more powerful advantage of
> >>>>>> such a
> >>>>>>>>>>>> solution.
> >>>>>>>>>>>>>> If
> >>>>>>>>>>>>>>>>>> we have caching logic on a lower level, we can apply
> >>>>>> some
> >>>>>>>>>>>>>>>>>> optimizations to it. LookupJoinRunnerWithCalc was
> >>>>>> named
> >>>>>>>> like
> >>>>>>>>>>>> this
> >>>>>>>>>>>>>>>>>> because it uses the ‘calc’ function, which actually
> >>>>>>>> mostly
> >>>>>>>>>>>>> consists
> >>>>>>>>>>>>>> of
> >>>>>>>>>>>>>>>>>> filters and projections.
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> For example, in join table A with lookup table B
> >>>>>>>> condition
> >>>>>>>>>>>> ‘JOIN …
> >>>>>>>>>>>>>> ON
> >>>>>>>>>>>>>>>>>> A.id = B.id AND A.age = B.age + 10 WHERE B.salary >
> >>>>>> 1000’
> >>>>>>>>>>>> ‘calc’
> >>>>>>>>>>>>>>>>>> function will contain filters A.age = B.age + 10 and
> >>>>>>>>>>> B.salary >
> >>>>>>>>>>>>>> 1000.
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> If we apply this function before storing records in
> >>>>>>>> cache,
> >>>>>>>>>>> size
> >>>>>>>>>>>> of
> >>>>>>>>>>>>>>>>>> cache will be significantly reduced: filters = avoid
> >>>>>>>> storing
> >>>>>>>>>>>>> useless
> >>>>>>>>>>>>>>>>>> records in cache, projections = reduce records’
> >>>>>> size. So
> >>>>>>>> the
> >>>>>>>>>>>>> initial
> >>>>>>>>>>>>>>>>>> max number of records in cache can be increased by
> >>>>>> the
> >>>>>>>> user.
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> What do you think about it?
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> On 2022/04/19 02:47:11 Qingsheng Ren wrote:
> >>>>>>>>>>>>>>>>>>> Hi devs,
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Yuan and I would like to start a discussion about
> >>>>>>>>>>> FLIP-221[1],
> >>>>>>>>>>>>>> which
> >>>>>>>>>>>>>>>>>> introduces an abstraction of lookup table cache and
> >>>>>> its
> >>>>>>>>>>> standard
> >>>>>>>>>>>>>> metrics.
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Currently each lookup table source has to implement
> >>>>>>>> its
> >>>>>>>>>>> own
> >>>>>>>>>>>>>> cache to
> >>>>>>>>>>>>>>>>>> store lookup results, and there isn’t a standard of
> >>>>>>>> metrics
> >>>>>>>>>>> for
> >>>>>>>>>>>>>> users and
> >>>>>>>>>>>>>>>>>> developers to tune their jobs with lookup joins,
> >>>>>> which
> >>>>>>>> is a
> >>>>>>>>>>>>> quite
> >>>>>>>>>>>>>> common
> >>>>>>>>>>>>>>>>>> use case in Flink table / SQL.
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Therefore we propose some new APIs including cache,
> >>>>>>>>>>> metrics,
> >>>>>>>>>>>>>> wrapper
> >>>>>>>>>>>>>>>>>> classes of TableFunction and new table options.
> >>>>>> Please
> >>>>>>>> take a
> >>>>>>>>>>>> look
> >>>>>>>>>>>>>> at the
> >>>>>>>>>>>>>>>>>> FLIP page [1] to get more details. Any suggestions
> >>>>>> and
> >>>>>>>>>>> comments
> >>>>>>>>>>>>>> would be
> >>>>>>>>>>>>>>>>>> appreciated!
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-221+Abstraction+for+lookup+source+cache+and+metric
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Best regards,
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Qingsheng
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> --
> >>>>>>>>>>>>>>> Best Regards,
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Qingsheng Ren
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Real-time Computing Team
> >>>>>>>>>>>>>>> Alibaba Cloud
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Email: renqschn@gmail.com
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> --
> >>>>>>>>>> Best regards,
> >>>>>>>>>> Roman Boyko
> >>>>>>>>>> e.: ro.v.boyko@gmail.com
> >>>>>>>>>>
> >>>>>>>>
> >>>>>>
> >>>>>>
> >>>>
> >>>
> >>>
> >>> --
> >>> Best Regards,
> >>>
> >>> Qingsheng Ren
> >>>
> >>> Real-time Computing Team
> >>> Alibaba Cloud
> >>>
> >>> Email: renqschn@gmail.com
> >
>
>

Re: [DISCUSS] FLIP-221 Abstraction for lookup source cache and metric

Posted by Qingsheng Ren <re...@gmail.com>.
Hi devs, 

I just updated FLIP-221 [1] according to discussions addressed above. This version made minor changes to make interfaces clearer:

1. The argument types of LookupFunction#lookup and AsyncLookupFunction#asyncLookup are changed from Object... to RowData, in order to be symmetric with the output type and be more descriptive.
2. “set” is removed from method names in LookupCacheMetricGroup to align with methods in MetricGroup.
3. Statements are added to deprecate TableFunctionProvider and AsyncTableFunctionProvider.
4. Builder classes are added for LookupFunctionProvider, AsyncLookupFunctionProvider and RescanRuntimeProvider.
5. “invalidateAll” and “putAll” methods are removed from LookupCache
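To make the reshaped lookup path concrete, here is a minimal, self-contained sketch; a stand-in Row type replaces Flink's RowData, and the class shapes are illustrative rather than the final API:

```java
import java.io.IOException;
import java.util.*;

public class LookupFunctionSketch {
    // Stand-in for Flink's RowData, for illustration only.
    public record Row(List<Object> fields) {
        public static Row of(Object... f) { return new Row(List.of(f)); }
    }

    // Sketch of the revised LookupFunction: lookup() takes the key row and
    // returns the matching rows, symmetric with the output type.
    public abstract static class LookupFunction {
        public abstract Collection<Row> lookup(Row keyRow) throws IOException;

        // In the real API eval() would wrap the keys into a RowData and
        // collect() each looked-up row downstream; simplified here.
        public final Collection<Row> eval(Object... keys) throws IOException {
            return lookup(Row.of(keys));
        }
    }

    // A toy dimension table backed by a map, keyed on a single id column.
    public static class MapLookupFunction extends LookupFunction {
        private final Map<Row, List<Row>> table = Map.of(
                Row.of(1), List.of(Row.of(1, "alice")),
                Row.of(2), List.of(Row.of(2, "bob")));

        @Override
        public Collection<Row> lookup(Row keyRow) {
            return table.getOrDefault(keyRow, List.of());
        }
    }

    public static void main(String[] args) throws IOException {
        LookupFunction fn = new MapLookupFunction();
        System.out.println(fn.eval(1).size()); // 1: key found
        System.out.println(fn.eval(3).size()); // 0: key not in table
    }
}
```

The one-row-in, rows-out shape is what makes a cache keyed on the lookup row straightforward to bolt on in the framework.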

Looking forward to your ideas!

[1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-221+Abstraction+for+lookup+source+cache+and+metric

Best,

Qingsheng

> On May 17, 2022, at 18:10, Qingsheng Ren <re...@gmail.com> wrote:
> 
> Hi Alexander, 
> 
> Thanks for the review and glad to see we are on the same page! I think you forgot to cc the dev mailing list so I’m also quoting your reply under this email. 
> 
>> We can add 'maxRetryTimes' option into this class
> 
> In my opinion the retry logic should be implemented in lookup() instead of in LookupFunction#eval(). Retrying is only meaningful for specific retriable failures, and there might be custom logic before retrying, such as re-establishing the connection (JdbcRowDataLookupFunction is an example), so it's handier to leave it to the connector.
> 
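A rough sketch of the retry-inside-lookup() pattern described above; the store, failure simulation, and helper names are hypothetical stand-ins, not the actual JDBC connector code:

```java
import java.io.IOException;
import java.util.*;

public class RetryingLookupSketch {
    // Hypothetical external store; failuresLeft simulates transient errors.
    private final Map<Integer, String> store = new HashMap<>(Map.of(1, "alice"));
    private int failuresLeft;
    private final int maxRetryTimes;

    public RetryingLookupSketch(int maxRetryTimes, int transientFailures) {
        this.maxRetryTimes = maxRetryTimes;
        this.failuresLeft = transientFailures;
    }

    // Stand-in for the real external query, failing a few times first.
    private String doQuery(int key) throws IOException {
        if (failuresLeft > 0) {
            failuresLeft--;
            throw new IOException("transient connection error");
        }
        return store.get(key);
    }

    // Connector-specific recovery before the next attempt, e.g. reopening
    // a JDBC connection; a no-op in this sketch.
    private void reestablishConnection() {}

    // The retry loop lives inside lookup(), not in the framework's eval():
    // only retriable failures are retried, and recovery logic can run
    // between attempts.
    public Optional<String> lookup(int key) throws IOException {
        IOException last = null;
        for (int attempt = 0; attempt <= maxRetryTimes; attempt++) {
            try {
                return Optional.ofNullable(doQuery(key));
            } catch (IOException e) {
                last = e;
                reestablishConnection();
            }
        }
        throw last;
    }

    public static void main(String[] args) throws IOException {
        RetryingLookupSketch fn = new RetryingLookupSketch(3, 2);
        System.out.println(fn.lookup(1).orElse("<miss>")); // succeeds after retries
    }
}
```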
>> I don't see DDL options, that were in previous version of FLIP. Do you have any special plans for them?
> 
> We decided not to provide common DDL options and to let developers define their own options per connector, as we do now.
> 
> The rest of comments sound great and I’ll update the FLIP. Hope we can finalize our proposal soon!
> 
> Best, 
> 
> Qingsheng
> 
> 
>> On May 17, 2022, at 13:46, Александр Смирнов <sm...@gmail.com> wrote:
>> 
>> Hi Qingsheng and devs!
>> 
>> I like the overall design of updated FLIP, however I have several
>> suggestions and questions.
>> 
>> 1) Introducing LookupFunction as a subclass of TableFunction is a good
>> idea. We can add a 'maxRetryTimes' option to this class. The 'eval'
>> method of the new LookupFunction is well suited for this purpose. The
>> same applies to the 'async' case.
>> 
>> 2) There might be other configs in future, such as 'cacheMissingKey'
>> in LookupFunctionProvider or 'rescanInterval' in ScanRuntimeProvider.
>> Maybe use the Builder pattern in LookupFunctionProvider and
>> RescanRuntimeProvider for more flexibility (one 'build' method
>> instead of many 'of' methods in the future)?
>> 
>> 3) What are the plans for existing TableFunctionProvider and
>> AsyncTableFunctionProvider? I think they should be deprecated.
>> 
>> 4) Am I right that the current design does not assume usage of a
>> user-provided LookupCache in re-scanning? In that case, it is not very
>> clear why we need methods such as 'invalidate' or 'putAll' in
>> LookupCache.
>> 
>> 5) I don't see DDL options, that were in previous version of FLIP. Do
>> you have any special plans for them?
>> 
>> If you don't mind, I would be glad to make small adjustments to the
>> FLIP document too. I think it's worth mentioning exactly which
>> optimizations are planned for the future.
>> 
>> Best regards,
>> Smirnov Alexander
>> 
>> пт, 13 мая 2022 г. в 20:27, Qingsheng Ren <re...@gmail.com>:
>>> 
>>> Hi Alexander and devs,
>>> 
>>> Thank you very much for the in-depth discussion! As Jark mentioned, we were inspired by Alexander's idea and have refactored our design. FLIP-221 [1] has been updated to reflect the new design and we are happy to hear more suggestions from you!
>>> 
>>> Compared to the previous design:
>>> 1. The lookup cache serves at table runtime level and is integrated as a component of LookupJoinRunner as discussed previously.
>>> 2. Interfaces are renamed and re-designed to reflect the new design.
>>> 3. We handle the all-caching case separately and introduce a new RescanRuntimeProvider to reuse the scanning ability. We are planning to support SourceFunction / InputFormat for now, considering the complexity of the FLIP-27 Source API.
>>> 4. A new interface LookupFunction is introduced to make the semantic of lookup more straightforward for developers.
>>> 
>>> For replying to Alexander:
>>>> However I'm a little confused whether InputFormat is deprecated or not. Am I right that it will be so in the future, but currently it's not?
>>> Yes you are right. InputFormat is not deprecated for now. I think it will be deprecated in the future but we don't have a clear plan for that.
>>> 
>>> Thanks again for the discussion on this FLIP and looking forward to cooperating with you after we finalize the design and interfaces!
>>> 
>>> [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-221+Abstraction+for+lookup+source+cache+and+metric
>>> 
>>> Best regards,
>>> 
>>> Qingsheng
>>> 
>>> 
>>> On Fri, May 13, 2022 at 12:12 AM Александр Смирнов <sm...@gmail.com> wrote:
>>>> 
>>>> Hi Jark, Qingsheng and Leonard!
>>>> 
>>>> Glad to see that we came to a consensus on almost all points!
>>>> 
>>>> However I'm a little confused whether InputFormat is deprecated or
>>>> not. Am I right that it will be so in the future, but currently it's
>>>> not? Actually I also think that for the first version it's OK to use
>>>> InputFormat in ALL cache realization, because supporting rescan
>>>> ability seems like a very distant prospect. But for this decision we
>>>> need a consensus among all discussion participants.
>>>> 
>>>> In general, I don't have anything to argue with in your statements. All
>>>> of them correspond to my ideas. Looking ahead, it would be nice to work
>>>> on this FLIP cooperatively. I've already done a lot of work on lookup
>>>> join caching with a realization very close to the one we are discussing,
>>>> and want to share the results of this work. Anyway, looking forward to
>>>> the FLIP update!
>>>> 
>>>> Best regards,
>>>> Smirnov Alexander
>>>> 
>>>> чт, 12 мая 2022 г. в 17:38, Jark Wu <im...@gmail.com>:
>>>>> 
>>>>> Hi Alex,
>>>>> 
>>>>> Thanks for summarizing your points.
>>>>> 
>>>>> In the past week, Qingsheng, Leonard, and I have discussed it several times
>>>>> and we have totally refactored the design.
>>>>> I'm glad to say we have reached a consensus on many of your points!
>>>>> Qingsheng is still working on updating the design docs and maybe can be
>>>>> available in the next few days.
>>>>> I will share some conclusions from our discussions:
>>>>> 
>>>>> 1) we have refactored the design towards to "cache in framework" way.
>>>>> 
>>>>> 2) a "LookupCache" interface for users to customize and a default
>>>>> implementation with a builder for ease of use.
>>>>> This makes it possible to have both flexibility and conciseness.
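A dependency-free sketch of what such a customizable cache interface plus a builder-configured default (a size-bounded LRU here) might look like; all names are illustrative, not the final API:

```java
import java.util.*;

public class LookupCacheSketch {
    // Sketch of a customizable LookupCache-style interface; keys and values
    // are plain strings/lists here instead of RowData.
    public interface LookupCache {
        List<String> getIfPresent(String key);
        void put(String key, List<String> rows);
        void invalidate(String key);
        long size();
    }

    // Default implementation: an LRU cache bounded by entry count,
    // configured through a small builder.
    public static class DefaultLookupCache implements LookupCache {
        private final LinkedHashMap<String, List<String>> map;

        private DefaultLookupCache(int maximumSize) {
            // Access-order LinkedHashMap evicts the least recently used entry.
            this.map = new LinkedHashMap<>(16, 0.75f, true) {
                @Override
                protected boolean removeEldestEntry(Map.Entry<String, List<String>> e) {
                    return size() > maximumSize;
                }
            };
        }

        public static Builder newBuilder() { return new Builder(); }

        public static class Builder {
            private int maximumSize = 10_000;
            public Builder maximumSize(int n) { this.maximumSize = n; return this; }
            public DefaultLookupCache build() { return new DefaultLookupCache(maximumSize); }
        }

        @Override public List<String> getIfPresent(String key) { return map.get(key); }
        @Override public void put(String key, List<String> rows) { map.put(key, rows); }
        @Override public void invalidate(String key) { map.remove(key); }
        @Override public long size() { return map.size(); }
    }

    public static void main(String[] args) {
        LookupCache cache = DefaultLookupCache.newBuilder().maximumSize(2).build();
        cache.put("a", List.of("row1"));
        cache.put("b", List.of("row2"));
        cache.getIfPresent("a");              // touch "a" so "b" becomes eldest
        cache.put("c", List.of("row3"));      // evicts "b"
        System.out.println(cache.getIfPresent("b")); // null
        System.out.println(cache.size());            // 2
    }
}
```

Connector developers who need custom behavior implement the interface; everyone else configures the default through the builder.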
>>>>> 
>>>>> 3) Filter pushdown is important for ALL and LRU lookup cache, esp reducing
>>>>> IO.
>>>>> Filter pushdown should be the final state and the unified way to both
>>>>> support pruning ALL cache and LRU cache,
>>>>> so I think we should make effort in this direction. If we need to support
>>>>> filter pushdown for ALL cache anyway, why not use
>>>>> it for LRU cache as well? Either way, as we decide to implement the cache
>>>>> in the framework, we have the chance to support
>>>>> filter on cache anytime. This is an optimization and it doesn't affect the
>>>>> public API. I think we can create a JIRA issue to
>>>>> discuss it when the FLIP is accepted.
>>>>> 
>>>>> 4) The idea to support ALL cache is similar to your proposal.
>>>>> In the first version, we will only support InputFormat, SourceFunction for
>>>>> cache all (invoke InputFormat in join operator).
>>>>> For FLIP-27 source, we need to join a true source operator instead of
>>>>> calling it embedded in the join operator.
>>>>> However, this needs another FLIP to support the re-scan ability for FLIP-27
>>>>> Source, and this can be a large work.
>>>>> In order not to block this issue, we can put the effort of FLIP-27 source
>>>>> integration into future work and integrate InputFormat & SourceFunction
>>>>> for now.
>>>>>
>>>>> I think it's fine to use InputFormat & SourceFunction, as they are not
>>>>> deprecated; otherwise, we would have to introduce another function
>>>>> similar to them, which is meaningless. We need to plan FLIP-27 source
>>>>> integration ASAP before InputFormat & SourceFunction are deprecated.
>>>>> 
>>>>> Best,
>>>>> Jark
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> On Thu, 12 May 2022 at 15:46, Александр Смирнов <sm...@gmail.com>
>>>>> wrote:
>>>>> 
>>>>>> Hi Martijn!
>>>>>> 
>>>>>> Got it. Therefore, the realization with InputFormat is not considered.
>>>>>> Thanks for clearing that up!
>>>>>> 
>>>>>> Best regards,
>>>>>> Smirnov Alexander
>>>>>> 
>>>>>> чт, 12 мая 2022 г. в 14:23, Martijn Visser <ma...@ververica.com>:
>>>>>>> 
>>>>>>> Hi,
>>>>>>> 
>>>>>>> With regards to:
>>>>>>> 
>>>>>>>> But if there are plans to refactor all connectors to FLIP-27
>>>>>>> 
>>>>>>> Yes, FLIP-27 is the target for all connectors. The old interfaces will be
>>>>>>> deprecated and connectors will either be refactored to use the new ones or
>>>>>>> dropped.
>>>>>>> 
>>>>>>> The caching should work for connectors that are using FLIP-27 interfaces,
>>>>>>> we should not introduce new features for old interfaces.
>>>>>>> 
>>>>>>> Best regards,
>>>>>>> 
>>>>>>> Martijn
>>>>>>> 
>>>>>>> On Thu, 12 May 2022 at 06:19, Александр Смирнов <sm...@gmail.com>
>>>>>>> wrote:
>>>>>>> 
>>>>>>>> Hi Jark!
>>>>>>>> 
>>>>>>>> Sorry for the late response. I would like to make some comments and
>>>>>>>> clarify my points.
>>>>>>>> 
>>>>>>>> 1) I agree with your first statement. I think we can achieve both
>>>>>>>> advantages this way: put the Cache interface in flink-table-common,
>>>>>>>> but have implementations of it in flink-table-runtime. Therefore if a
>>>>>>>> connector developer wants to use existing cache strategies and their
>>>>>>>> implementations, he can just pass lookupConfig to the planner, but if
>>>>>>>> he wants to have its own cache implementation in his TableFunction, it
>>>>>>>> will be possible for him to use the existing interface for this
>>>>>>>> purpose (we can explicitly point this out in the documentation). In
>>>>>>>> this way all configs and metrics will be unified. WDYT?
>>>>>>>> 
>>>>>>>>> If a filter can prune 90% of data in the cache, we will have 90% of
>>>>>>>> lookup requests that can never be cached
>>>>>>>> 
>>>>>>>> 2) Let me clarify the logic of the filters optimization in the case of LRU cache.
>>>>>>>> It looks like Cache<RowData, Collection<RowData>>. Here we always
>>>>>>>> store the response of the dimension table in cache, even after
>>>>>>>> applying calc function. I.e. if there are no rows after applying
>>>>>>>> filters to the result of the 'eval' method of TableFunction, we store
>>>>>>>> the empty list by lookup keys. Therefore the cache line will be
>>>>>>>> filled, but will require much less memory (in bytes). I.e. we don't
>>>>>>>> completely filter keys, by which result was pruned, but significantly
>>>>>>>> reduce required memory to store this result. If the user knows about
>>>>>>>> this behavior, he can increase the 'max-rows' option before the start
>>>>>>>> of the job. But actually I came up with the idea that we can do this
>>>>>>>> automatically by using the 'maximumWeight' and 'weigher' methods of
>>>>>>>> GuavaCache [1]. Weight can be the size of the collection of rows
>>>>>>>> (value of cache). Therefore cache can automatically fit much more
>>>>>>>> records than before.
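The maximumWeight/weigher idea above in a dependency-free sketch; Guava's CacheBuilder provides this out of the box, and the mini cache below only illustrates bounding by total cached rows instead of by key count:

```java
import java.util.*;

public class WeightBoundedCacheSketch {
    // LRU cache bounded by total weight (here: total number of cached rows)
    // rather than by number of keys - the maximumWeight/weigher idea.
    public static class RowWeightedCache {
        private final long maximumWeight;
        private long currentWeight = 0;
        private final LinkedHashMap<String, List<String>> map =
                new LinkedHashMap<>(16, 0.75f, true); // access order for LRU

        public RowWeightedCache(long maximumWeight) { this.maximumWeight = maximumWeight; }

        public void put(String key, List<String> rows) {
            List<String> old = map.put(key, rows);
            if (old != null) currentWeight -= old.size();
            currentWeight += rows.size();   // weigher: weight = number of rows
            // Evict least recently used entries until within the weight bound.
            Iterator<Map.Entry<String, List<String>>> it = map.entrySet().iterator();
            while (currentWeight > maximumWeight && it.hasNext()) {
                Map.Entry<String, List<String>> eldest = it.next();
                if (eldest.getKey().equals(key)) continue; // keep the new entry
                currentWeight -= eldest.getValue().size();
                it.remove();
            }
        }

        public List<String> get(String key) { return map.get(key); }
        public int keyCount() { return map.size(); }
    }

    public static void main(String[] args) {
        RowWeightedCache cache = new RowWeightedCache(4);
        cache.put("k1", List.of("r1", "r2", "r3"));
        cache.put("k2", List.of());            // empty result weighs nothing
        cache.put("k3", List.of("r4", "r5"));  // weight 5 > 4, evicts LRU "k1"
        System.out.println(cache.get("k1"));   // null: evicted by weight
        System.out.println(cache.keyCount());  // 2: k2 and k3 remain
    }
}
```

With weight = row count, empty lookup results cost almost nothing, which matches the point above about storing filtered-out keys cheaply.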
>>>>>>>> 
>>>>>>>>> Flink SQL has provided a standard way to do filters and projects
>>>>>>>>> pushdown, i.e., SupportsFilterPushDown and SupportsProjectionPushDown.
>>>>>>>>> Jdbc/hive/HBase haven't implemented the interfaces, don't mean it's hard
>>>>>>>>> to implement.
>>>>>>>> 
>>>>>>>> It's debatable how difficult it will be to implement filter pushdown.
>>>>>>>> But I think the fact that currently there is no database connector
>>>>>>>> with filter pushdown at least means that this feature won't be
>>>>>>>> supported soon in connectors. Moreover, if we talk about other
>>>>>>>> connectors (not in Flink repo), their databases might not support all
>>>>>>>> Flink filters (or not support filters at all). I think users are
>>>>>>>> interested in supporting cache filters optimization  independently of
>>>>>>>> supporting other features and solving more complex problems (or
>>>>>>>> unsolvable at all).
>>>>>>>> 
>>>>>>>> 3) I agree with your third statement. Actually in our internal version
>>>>>>>> I also tried to unify the logic of scanning and reloading data from
>>>>>>>> connectors. But unfortunately, I didn't find a way to unify the logic
>>>>>>>> of all ScanRuntimeProviders (InputFormat, SourceFunction, Source,...)
>>>>>>>> and reuse it in reloading ALL cache. As a result I settled on using
>>>>>>>> InputFormat, because it was used for scanning in all lookup
>>>>>>>> connectors. (I didn't know that there are plans to deprecate
>>>>>>>> InputFormat in favor of FLIP-27 Source). IMO usage of FLIP-27 source
>>>>>>>> in ALL caching is not good idea, because this source was designed to
>>>>>>>> work in distributed environment (SplitEnumerator on JobManager and
>>>>>>>> SourceReaders on TaskManagers), not in one operator (lookup join
>>>>>>>> operator in our case). There is even no direct way to pass splits from
>>>>>>>> SplitEnumerator to SourceReader (this logic works through
>>>>>>>> SplitEnumeratorContext, which requires
>>>>>>>> OperatorCoordinator.SubtaskGateway to send AddSplitEvents). Usage of
>>>>>>>> InputFormat for ALL cache seems much more clearer and easier. But if
>>>>>>>> there are plans to refactor all connectors to FLIP-27, I have the
>>>>>>>> following ideas: maybe we can refuse from lookup join ALL cache in
>>>>>>>> favor of simple join with multiple scanning of batch source? The point
>>>>>>>> is that the only difference between lookup join ALL cache and simple
>>>>>>>> join with batch source is that in the first case scanning is performed
>>>>>>>> multiple times, in between which state (cache) is cleared (correct me
>>>>>>>> if I'm wrong). So what if we extend the functionality of simple join
>>>>>>>> to support state reloading + extend the functionality of scanning
>>>>>>>> batch source multiple times (this one should be easy with new FLIP-27
>>>>>>>> source, that unifies streaming/batch reading - we will need to change
>>>>>>>> only SplitEnumerator, which will pass splits again after some TTL).
>>>>>>>> WDYT? I must say that this looks like a long-term goal and will make
>>>>>>>> the scope of this FLIP even larger than you said. Maybe we can limit
>>>>>>>> ourselves to a simpler solution now (InputFormats).
>>>>>>>> 
>>>>>>>> So to sum up, my points is like this:
>>>>>>>> 1) There is a way to make both concise and flexible interfaces for
>>>>>>>> caching in lookup join.
>>>>>>>> 2) Cache filters optimization is important both in LRU and ALL caches.
>>>>>>>> 3) It is unclear when filter pushdown will be supported in Flink
>>>>>>>> connectors, some of the connectors might not have the opportunity to
>>>>>>>> support filter pushdown + as I know, currently filter pushdown works
>>>>>>>> only for scanning (not lookup). So cache filters + projections
>>>>>>>> optimization should be independent from other features.
>>>>>>>> 4) ALL cache realization is a complex topic that involves multiple
>>>>>>>> aspects of how Flink is developing. Refusing from InputFormat in favor
>>>>>>>> of FLIP-27 Source will make ALL cache realization really complex and
>>>>>>>> not clear, so maybe instead of that we can extend the functionality of
>>>>>>>> simple join or not refuse from InputFormat in case of lookup join ALL
>>>>>>>> cache?
>>>>>>>> 
>>>>>>>> Best regards,
>>>>>>>> Smirnov Alexander
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> [1]
>>>>>>>> 
>>>>>> https://guava.dev/releases/18.0/api/docs/com/google/common/cache/CacheBuilder.html#weigher(com.google.common.cache.Weigher)
>>>>>>>> 
>>>>>>>> чт, 5 мая 2022 г. в 20:34, Jark Wu <im...@gmail.com>:
>>>>>>>>> 
>>>>>>>>> It's great to see the active discussion! I want to share my ideas:
>>>>>>>>> 
>>>>>>>>> 1) implement the cache in the framework vs. the connector base
>>>>>>>>> I don't have a strong opinion on this. Both ways should work (e.g., cache
>>>>>>>>> pruning, compatibility). The framework way can provide more concise
>>>>>>>>> interfaces. The connector-base way can define more flexible cache
>>>>>>>>> strategies/implementations. We are still investigating whether we can
>>>>>>>>> have both advantages. We should reach a consensus that the chosen way
>>>>>>>>> should be a final state, and that we are on the path to it.
>>>>>>>>> 
>>>>>>>>> 2) filters and projections pushdown:
>>>>>>>>> I agree with Alex that the filter pushdown into the cache can benefit ALL
>>>>>>>>> cache a lot. However, this is not true for LRU cache. Connectors use the
>>>>>>>>> cache to reduce IO requests to databases for better throughput. If a
>>>>>>>>> filter can prune 90% of data in the cache, we will have 90% of lookup
>>>>>>>>> requests that can never be cached and hit the databases directly. That
>>>>>>>>> means the cache is meaningless in this case.
>>>>>>>>> 
>>>>>>>>> IMO, Flink SQL has provided a standard way to do filter and projection
>>>>>>>>> pushdown, i.e., SupportsFilterPushDown and SupportsProjectionPushDown.
>>>>>>>>> Jdbc/Hive/HBase haven't implemented the interfaces, but that doesn't mean
>>>>>>>>> it's hard to implement. They should implement the pushdown interfaces to
>>>>>>>>> reduce IO and the cache size. The final state should be that the scan
>>>>>>>>> source and lookup source share the exact same pushdown implementation.
>>>>>>>>> I don't see why we need to duplicate the pushdown logic in caches, which
>>>>>>>>> would complicate the lookup join design.
>>>>>>>>> 
>>>>>>>>> 3) ALL cache abstraction
>>>>>>>>> All cache might be the most challenging part of this FLIP. We have never
>>>>>>>>> provided a reload-lookup public interface. Currently, we put the reload
>>>>>>>>> logic in the "eval" method of TableFunction. That's hard for some
>>>>>>>>> sources (e.g., Hive). Ideally, connector implementation should share the
>>>>>>>>> logic of reload and scan, i.e. ScanTableSource with
>>>>>>>>> InputFormat/SourceFunction/FLIP-27 Source. However,
>>>>>>>>> InputFormat/SourceFunction are deprecated, and the FLIP-27 source is
>>>>>>>>> deeply coupled with SourceOperator. If we want to invoke the FLIP-27
>>>>>>>>> source in LookupJoin, this may make the scope of this FLIP much larger.
>>>>>>>>> We are still investigating how to abstract the ALL cache logic and reuse
>>>>>>>>> the existing source interfaces.
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> Best,
>>>>>>>>> Jark
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On Thu, 5 May 2022 at 20:22, Roman Boyko <ro...@gmail.com>
>>>>>> wrote:
>>>>>>>>> 
>>>>>>>>>> It's a much more complicated activity and lies outside the scope of
>>>>>>>>>> this improvement, because such pushdowns should be done for all
>>>>>>>>>> ScanTableSource implementations (not only for Lookup ones).
>>>>>>>>>> 
>>>>>>>>>> On Thu, 5 May 2022 at 19:02, Martijn Visser <
>>>>>> martijnvisser@apache.org>
>>>>>>>>>> wrote:
>>>>>>>>>> 
>>>>>>>>>>> Hi everyone,
>>>>>>>>>>> 
>>>>>>>>>>> One question regarding "And Alexander correctly mentioned that filter
>>>>>>>>>>> pushdown still is not implemented for jdbc/hive/hbase." -> Would an
>>>>>>>>>>> alternative solution be to actually implement these filter pushdowns?
>>>>>>>>>>> I can imagine that there are many more benefits to doing that, outside
>>>>>>>>>>> of lookup caching and metrics.
>>>>>>>>>>> 
>>>>>>>>>>> Best regards,
>>>>>>>>>>> 
>>>>>>>>>>> Martijn Visser
>>>>>>>>>>> https://twitter.com/MartijnVisser82
>>>>>>>>>>> https://github.com/MartijnVisser
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> On Thu, 5 May 2022 at 13:58, Roman Boyko <ro...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>>>> 
>>>>>>>>>>>> Hi everyone!
>>>>>>>>>>>> 
>>>>>>>>>>>> Thanks for driving such a valuable improvement!
>>>>>>>>>>>> 
>>>>>>>>>>>> I do think that a single cache implementation would be a nice
>>>>>>>>>>>> opportunity for users. And it will break the "FOR SYSTEM_TIME AS OF
>>>>>>>>>>>> proc_time" semantics anyway - no matter how it is implemented.
>>>>>>>>>>>> 
>>>>>>>>>>>> Putting myself in the user's shoes, I can say that:
>>>>>>>>>>>> 1) I would prefer to have the opportunity to cut off the cache size by
>>>>>>>>>>>> simply filtering unnecessary data. And the most handy way to do it is
>>>>>>>>>>>> to apply it inside LookupRunners. It would be a bit harder to pass it
>>>>>>>>>>>> through the LookupJoin node to TableFunction. And Alexander correctly
>>>>>>>>>>>> mentioned that filter pushdown still is not implemented for
>>>>>>>>>>>> jdbc/hive/hbase.
>>>>>>>>>>>> 2) The ability to set different caching parameters for different
>>>>>>>>>>>> tables is quite important. So I would prefer to set them through DDL
>>>>>>>>>>>> rather than have similar ttl, strategy and other options for all
>>>>>>>>>>>> lookup tables.
>>>>>>>>>>>> 3) Providing the cache in the framework really deprives us of
>>>>>>>>>>>> extensibility (users won't be able to implement their own cache). But
>>>>>>>>>>>> most probably this might be solved by creating more different cache
>>>>>>>>>>>> strategies and a wider set of configurations.
>>>>>>>>>>>> 
>>>>>>>>>>>> All these points are much closer to the schema proposed by Alexander.
>>>>>>>>>>>> Qingsheng Ren, please correct me if I'm wrong and all these facilities
>>>>>>>>>>>> might be simply implemented in your architecture?
>>>>>>>>>>>> 
>>>>>>>>>>>> Best regards,
>>>>>>>>>>>> Roman Boyko
>>>>>>>>>>>> e.: ro.v.boyko@gmail.com
>>>>>>>>>>>> 
>>>>>>>>>>>> On Wed, 4 May 2022 at 21:01, Martijn Visser <
>>>>>>>> martijnvisser@apache.org>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>>> Hi everyone,
>>>>>>>>>>>>> 
>>>>>>>>>>>>> I don't have much to chip in, but just wanted to express that I
>>>>>>>>>>>>> really appreciate the in-depth discussion on this topic and I hope
>>>>>>>>>>>>> that others will join the conversation.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Martijn
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On Tue, 3 May 2022 at 10:15, Александр Смирнов <
>>>>>>>> smiralexan@gmail.com>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Hi Qingsheng, Leonard and Jark,
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Thanks for your detailed feedback! However, I have questions
>>>>>>>> about
>>>>>>>>>>>>>> some of your statements (maybe I didn't get something?).
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Caching actually breaks the semantic of "FOR SYSTEM_TIME AS OF
>>>>>>>>>>>>>>> proc_time”
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> I agree that the semantics of "FOR SYSTEM_TIME AS OF proc_time" is
>>>>>>>>>>>>>> not fully implemented with caching, but as you said, users go on it
>>>>>>>>>>>>>> consciously to achieve better performance (no one proposed to
>>>>>>>>>>>>>> enable caching by default, etc.). Or by users do you mean other
>>>>>>>>>>>>>> developers of connectors? In this case developers explicitly
>>>>>>>>>>>>>> specify whether their connector supports caching or not (in the
>>>>>>>>>>>>>> list of supported options), no one makes them do that if they don't
>>>>>>>>>>>>>> want to. So what exactly is the difference between implementing
>>>>>>>>>>>>>> caching in modules flink-table-runtime and in flink-table-common
>>>>>>>>>>>>>> from the considered point of view? How does it affect the
>>>>>>>>>>>>>> breaking/non-breaking of the semantics of "FOR SYSTEM_TIME AS OF
>>>>>>>>>>>>>> proc_time"?
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> confront a situation that allows table options in DDL to control
>>>>>>>>>>>>>>> the behavior of the framework, which has never happened previously
>>>>>>>>>>>>>>> and should be cautious
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> If we talk about the main differences in semantics between DDL
>>>>>>>>>>>>>> options and config options ("table.exec.xxx"), isn't it about
>>>>>>>>>>>>>> limiting the scope of the options + their importance for the user's
>>>>>>>>>>>>>> business logic, rather than the specific location of the
>>>>>>>>>>>>>> corresponding logic in the framework? I mean that in my design, for
>>>>>>>>>>>>>> example, putting an option with the lookup cache strategy in
>>>>>>>>>>>>>> configurations would be the wrong decision, because it directly
>>>>>>>>>>>>>> affects the user's business logic (not just performance
>>>>>>>>>>>>>> optimization) + touches just several functions of ONE table (there
>>>>>>>>>>>>>> can be multiple tables with different caches). Does it really
>>>>>>>>>>>>>> matter for the user (or someone else) where the logic affected by
>>>>>>>>>>>>>> the applied option is located?
>>>>>>>>>>>>>> Also I can remember the DDL option 'sink.parallelism', which in
>>>>>>>>>>>>>> some way "controls the behavior of the framework", and I don't see
>>>>>>>>>>>>>> any problem here.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> introduce a new interface for this all-caching scenario
>>>>>> and
>>>>>>>> the
>>>>>>>>>>>> design
>>>>>>>>>>>>>> would become more complex
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> This is a subject for a separate discussion, but actually
>>>>>> in our
>>>>>>>>>>>>>> internal version we solved this problem quite easily - we
>>>>>> reused
>>>>>>>>>>>>>> InputFormat class (so there is no need for a new API). The
>>>>>>>> point is
>>>>>>>>>>>>>> that currently all lookup connectors use InputFormat for
>>>>>>>> scanning
>>>>>>>>>>> the
>>>>>>>>>>>>>> data in batch mode: HBase, JDBC and even Hive - it uses
>>>>>> class
>>>>>>>>>>>>>> PartitionReader, that is actually just a wrapper around
>>>>>>>> InputFormat.
>>>>>>>>>>>>>> The advantage of this solution is the ability to reload
>>>>>> cache
>>>>>>>> data
>>>>>>>>>>> in
>>>>>>>>>>>>>> parallel (number of threads depends on number of
>>>>>> InputSplits,
>>>>>>>> but
>>>>>>>>>>> has
>>>>>>>>>>>>>> an upper limit). As a result cache reload time significantly
>>>>>>>> reduces
>>>>>>>>>>>>>> (as well as time of input stream blocking). I know that
>>>>>> usually
>>>>>>>> we
>>>>>>>>>>> try
>>>>>>>>>>>>>> to avoid usage of concurrency in Flink code, but maybe this
>>>>>> one
>>>>>>>> can
>>>>>>>>>>> be
>>>>>>>>>>>>>> an exception. BTW I don't say that it's an ideal solution,
>>>>>> maybe
>>>>>>>>>>> there
>>>>>>>>>>>>>> are better ones.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Providing the cache in the framework might introduce
>>>>>>>> compatibility
>>>>>>>>>>>>> issues
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> It's possible only in cases when the developer of the
>>>>>> connector
>>>>>>>>>>> won't
>>>>>>>>>>>>>> properly refactor his code and will use new cache options
>>>>>>>>>>> incorrectly
>>>>>>>>>>>>>> (i.e. explicitly provide the same options into 2 different
>>>>>> code
>>>>>>>>>>>>>> places). For correct behavior all he will need to do is to
>>>>>>>> redirect
>>>>>>>>>>>>>> existing options to the framework's LookupConfig (+ maybe
>>>>>> add an
>>>>>>>>>>> alias
>>>>>>>>>>>>>> for options, if there was different naming), everything
>>>>>> will be
>>>>>>>>>>>>>> transparent for users. If the developer won't do
>>>>>> refactoring at
>>>>>>>> all,
>>>>>>>>>>>>>> nothing will be changed for the connector because of
>>>>>> backward
>>>>>>>>>>>>>> compatibility. Also if a developer wants to use his own
>>>>>> cache
>>>>>>>> logic,
>>>>>>>>>>>>>> he just can refuse to pass some of the configs into the
>>>>>>>> framework,
>>>>>>>>>>> and
>>>>>>>>>>>>>> instead make his own implementation with already existing
>>>>>>>> configs
>>>>>>>>>>> and
>>>>>>>>>>>>>> metrics (but actually I think that it's a rare case).
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> filters and projections should be pushed all the way down
>>>>>> to
>>>>>>>> the
>>>>>>>>>>>> table
>>>>>>>>>>>>>> function, like what we do in the scan source
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> It's the great purpose. But the truth is that the ONLY
>>>>>> connector
>>>>>>>>>>> that
>>>>>>>>>>>>>> supports filter pushdown is FileSystemTableSource
>>>>>>>>>>>>>> (no database connector supports it currently). Also for some
>>>>>>>>>>> databases
>>>>>>>>>>>>>> it's simply impossible to pushdown such complex filters
>>>>>> that we
>>>>>>>> have
>>>>>>>>>>>>>> in Flink.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> only applying these optimizations to the cache seems not
>>>>>>>> quite
>>>>>>>>>>>> useful
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Filters can cut off an arbitrarily large amount of data
>>>>>> from the
>>>>>>>>>>>>>> dimension table. For a simple example, suppose in dimension
>>>>>>>> table
>>>>>>>>>>>>>> 'users'
>>>>>>>>>>>>>> we have column 'age' with values from 20 to 40, and input
>>>>>> stream
>>>>>>>>>>>>>> 'clicks' that is ~uniformly distributed by age of users. If
>>>>>> we
>>>>>>>> have
>>>>>>>>>>>>>> filter 'age > 30',
>>>>>>>>>>>>>> there will be twice less data in cache. This means the user
>>>>>> can
>>>>>>>>>>>>>> increase 'lookup.cache.max-rows' by almost 2 times. It will
>>>>>>>> gain a
>>>>>>>>>>>>>> huge
>>>>>>>>>>>>>> performance boost. Moreover, this optimization starts to
>>>>>> really
>>>>>>>>>>> shine
>>>>>>>>>>>>>> in 'ALL' cache, where tables without filters and projections
>>>>>>>> can't
>>>>>>>>>>> fit
>>>>>>>>>>>>>> in memory, but with them - can. This opens up additional
>>>>>>>>>>> possibilities
>>>>>>>>>>>>>> for users. And this doesn't sound as 'not quite useful'.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> It would be great to hear other voices regarding this topic!
>>>>>>>> Because
>>>>>>>>>>>>>> we have quite a lot of controversial points, and I think
>>>>>> with
>>>>>>>> the
>>>>>>>>>>> help
>>>>>>>>>>>>>> of others it will be easier for us to come to a consensus.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>> Smirnov Alexander
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> пт, 29 апр. 2022 г. в 22:33, Qingsheng Ren <
>>>>>> renqschn@gmail.com
>>>>>>>>> :
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Hi Alexander and Arvid,
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Thanks for the discussion and sorry for my late response!
>>>>>> We
>>>>>>>> had
>>>>>>>>>>> an
>>>>>>>>>>>>>> internal discussion together with Jark and Leonard and I’d
>>>>>> like
>>>>>>>> to
>>>>>>>>>>>>>> summarize our ideas. Instead of implementing the cache
>>>>>> logic in
>>>>>>>> the
>>>>>>>>>>>> table
>>>>>>>>>>>>>> runtime layer or wrapping around the user-provided table
>>>>>>>> function,
>>>>>>>>>>> we
>>>>>>>>>>>>>> prefer to introduce some new APIs extending TableFunction
>>>>>> with
>>>>>>>> these
>>>>>>>>>>>>>> concerns:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 1. Caching actually breaks the semantic of "FOR
>>>>>> SYSTEM_TIME
>>>>>>>> AS OF
>>>>>>>>>>>>>> proc_time”, because it couldn’t truly reflect the content
>>>>>> of the
>>>>>>>>>>> lookup
>>>>>>>>>>>>>> table at the moment of querying. If users choose to enable
>>>>>>>> caching
>>>>>>>>>>> on
>>>>>>>>>>>> the
>>>>>>>>>>>>>> lookup table, they implicitly indicate that this breakage is
>>>>>>>>>>> acceptable
>>>>>>>>>>>>> in
>>>>>>>>>>>>>> exchange for the performance. So we prefer not to provide
>>>>>>>> caching on
>>>>>>>>>>>> the
>>>>>>>>>>>>>> table runtime level.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 2. If we make the cache implementation in the framework
>>>>>>>> (whether
>>>>>>>>>>> in a
>>>>>>>>>>>>>> runner or a wrapper around TableFunction), we have to
>>>>>> confront a
>>>>>>>>>>>>> situation
>>>>>>>>>>>>>> that allows table options in DDL to control the behavior of
>>>>>> the
>>>>>>>>>>>>> framework,
>>>>>>>>>>>>>> which has never happened previously and should be cautious.
>>>>>>>> Under
>>>>>>>>>>> the
>>>>>>>>>>>>>> current design the behavior of the framework should only be
>>>>>>>>>>> specified
>>>>>>>>>>>> by
>>>>>>>>>>>>>> configurations (“table.exec.xxx”), and it’s hard to apply
>>>>>> these
>>>>>>>>>>> general
>>>>>>>>>>>>>> configs to a specific table.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 3. We have use cases that lookup source loads and refresh
>>>>>> all
>>>>>>>>>>> records
>>>>>>>>>>>>>> periodically into the memory to achieve high lookup
>>>>>> performance
>>>>>>>>>>> (like
>>>>>>>>>>>>> Hive
>>>>>>>>>>>>>> connector in the community, and also widely used by our
>>>>>> internal
>>>>>>>>>>>>>> connectors). Wrapping the cache around the user’s
>>>>>> TableFunction
>>>>>>>>>>> works
>>>>>>>>>>>>> fine
>>>>>>>>>>>>>> for LRU caches, but I think we have to introduce a new
>>>>>>>> interface for
>>>>>>>>>>>> this
>>>>>>>>>>>>>> all-caching scenario and the design would become more
>>>>>> complex.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 4. Providing the cache in the framework might introduce
>>>>>>>>>>> compatibility
>>>>>>>>>>>>>> issues to existing lookup sources like there might exist two
>>>>>>>> caches
>>>>>>>>>>>> with
>>>>>>>>>>>>>> totally different strategies if the user incorrectly
>>>>>> configures
>>>>>>>> the
>>>>>>>>>>>> table
>>>>>>>>>>>>>> (one in the framework and another implemented by the lookup
>>>>>>>> source).
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> As for the optimization mentioned by Alexander, I think
>>>>>>>> filters
>>>>>>>>>>> and
>>>>>>>>>>>>>> projections should be pushed all the way down to the table
>>>>>>>> function,
>>>>>>>>>>>> like
>>>>>>>>>>>>>> what we do in the scan source, instead of the runner with
>>>>>> the
>>>>>>>> cache.
>>>>>>>>>>>> The
>>>>>>>>>>>>>> goal of using cache is to reduce the network I/O and
>>>>>> pressure
>>>>>>>> on the
>>>>>>>>>>>>>> external system, and only applying these optimizations to
>>>>>> the
>>>>>>>> cache
>>>>>>>>>>>> seems
>>>>>>>>>>>>>> not quite useful.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> I made some updates to the FLIP[1] to reflect our ideas.
>>>>>> We
>>>>>>>>>>> prefer to
>>>>>>>>>>>>>> keep the cache implementation as a part of TableFunction,
>>>>>> and we
>>>>>>>>>>> could
>>>>>>>>>>>>>> provide some helper classes (CachingTableFunction,
>>>>>>>>>>>>> AllCachingTableFunction,
>>>>>>>>>>>>>> CachingAsyncTableFunction) to developers and regulate
>>>>>> metrics
>>>>>>>> of the
>>>>>>>>>>>>> cache.
>>>>>>>>>>>>>> Also, I made a POC[2] for your reference.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Looking forward to your ideas!
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> [1]
>>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>> 
>>>>>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-221+Abstraction+for+lookup+source+cache+and+metric
>>>>>>>>>>>>>>> [2] https://github.com/PatrickRen/flink/tree/FLIP-221
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Qingsheng
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> On Tue, Apr 26, 2022 at 4:45 PM Александр Смирнов <
>>>>>>>>>>>>> smiralexan@gmail.com>
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Thanks for the response, Arvid!
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> I have few comments on your message.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> but could also live with an easier solution as the
>>>>>> first
>>>>>>>> step:
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> I think that these 2 ways are mutually exclusive
>>>>>> (originally
>>>>>>>>>>>> proposed
>>>>>>>>>>>>>>>> by Qingsheng and mine), because conceptually they follow
>>>>>> the
>>>>>>>> same
>>>>>>>>>>>>>>>> goal, but implementation details are different. If we
>>>>>> will
>>>>>>>> go one
>>>>>>>>>>>> way,
>>>>>>>>>>>>>>>> moving to another way in the future will mean deleting
>>>>>>>> existing
>>>>>>>>>>> code
>>>>>>>>>>>>>>>> and once again changing the API for connectors. So I
>>>>>> think we
>>>>>>>>>>> should
>>>>>>>>>>>>>>>> reach a consensus with the community about that and then
>>>>>> work
>>>>>>>>>>>> together
>>>>>>>>>>>>>>>> on this FLIP, i.e. divide the work on tasks for different
>>>>>>>> parts
>>>>>>>>>>> of
>>>>>>>>>>>> the
>>>>>>>>>>>>>>>> flip (for example, LRU cache unification / introducing
>>>>>>>> proposed
>>>>>>>>>>> set
>>>>>>>>>>>> of
>>>>>>>>>>>>>>>> metrics / further work…). WDYT, Qingsheng?
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> as the source will only receive the requests after
>>>>>> filter
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Actually if filters are applied to fields of the lookup
>>>>>>>> table, we
>>>>>>>>>>>>>>>> firstly must do requests, and only after that we can
>>>>>> filter
>>>>>>>>>>>> responses,
>>>>>>>>>>>>>>>> because lookup connectors don't have filter pushdown. So
>>>>>> if
>>>>>>>>>>>> filtering
>>>>>>>>>>>>>>>> is done before caching, there will be much less rows in
>>>>>>>> cache.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> @Alexander unfortunately, your architecture is not
>>>>>> shared.
>>>>>>>> I
>>>>>>>>>>> don't
>>>>>>>>>>>>>> know the
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> solution to share images to be honest.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Sorry for that, I’m a bit new to such kinds of
>>>>>> conversations
>>>>>>>> :)
>>>>>>>>>>>>>>>> I have no write access to the confluence, so I made a
>>>>>> Jira
>>>>>>>> issue,
>>>>>>>>>>>>>>>> where described the proposed changes in more details -
>>>>>>>>>>>>>>>> https://issues.apache.org/jira/browse/FLINK-27411.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Will happy to get more feedback!
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>> Smirnov Alexander
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> пн, 25 апр. 2022 г. в 19:49, Arvid Heise <
>>>>>> arvid@apache.org>:
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Hi Qingsheng,
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Thanks for driving this; the inconsistency was not
>>>>>>>> satisfying
>>>>>>>>>>> for
>>>>>>>>>>>>> me.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> I second Alexander's idea though but could also live
>>>>>> with
>>>>>>>> an
>>>>>>>>>>>> easier
>>>>>>>>>>>>>>>>> solution as the first step: Instead of making caching
>>>>>> an
>>>>>>>>>>>>>> implementation
>>>>>>>>>>>>>>>>> detail of TableFunction X, rather devise a caching
>>>>>> layer
>>>>>>>>>>> around X.
>>>>>>>>>>>>> So
>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>> proposal would be a CachingTableFunction that
>>>>>> delegates to
>>>>>>>> X in
>>>>>>>>>>>> case
>>>>>>>>>>>>>> of
>>>>>>>>>>>>>>>>> misses and else manages the cache. Lifting it into the
>>>>>>>> operator
>>>>>>>>>>>>> model
>>>>>>>>>>>>>> as
>>>>>>>>>>>>>>>>> proposed would be even better but is probably
>>>>>> unnecessary
>>>>>>>> in
>>>>>>>>>>> the
>>>>>>>>>>>>>> first step
>>>>>>>>>>>>>>>>> for a lookup source (as the source will only receive
>>>>>> the
>>>>>>>>>>> requests
>>>>>>>>>>>>>> after
>>>>>>>>>>>>>>>>> filter; applying projection may be more interesting to
>>>>>> save
>>>>>>>>>>>> memory).
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Another advantage is that all the changes of this FLIP
>>>>>>>> would be
>>>>>>>>>>>>>> limited to
>>>>>>>>>>>>>>>>> options, no need for new public interfaces. Everything
>>>>>> else
>>>>>>>>>>>> remains
>>>>>>>>>>>>> an
>>>>>>>>>>>>>>>>> implementation of Table runtime. That means we can
>>>>>> easily
>>>>>>>>>>>>> incorporate
>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>> optimization potential that Alexander pointed out
>>>>>> later.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> @Alexander unfortunately, your architecture is not
>>>>>> shared.
>>>>>>>> I
>>>>>>>>>>> don't
>>>>>>>>>>>>>> know the
>>>>>>>>>>>>>>>>> solution to share images to be honest.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> On Fri, Apr 22, 2022 at 5:04 PM Александр Смирнов <
>>>>>>>>>>>>>> smiralexan@gmail.com>
>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Hi Qingsheng! My name is Alexander, I'm not a
>>>>>> committer
>>>>>>>> yet,
>>>>>>>>>>> but
>>>>>>>>>>>>> I'd
>>>>>>>>>>>>>>>>>> really like to become one. And this FLIP really
>>>>>>>> interested
>>>>>>>>>>> me.
>>>>>>>>>>>>>>>>>> Actually I have worked on a similar feature in my
>>>>>>>> company’s
>>>>>>>>>>>> Flink
>>>>>>>>>>>>>>>>>> fork, and we would like to share our thoughts on
>>>>>> this and
>>>>>>>>>>> make
>>>>>>>>>>>>> code
>>>>>>>>>>>>>>>>>> open source.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> I think there is a better alternative than
>>>>>> introducing an
>>>>>>>>>>>> abstract
>>>>>>>>>>>>>>>>>> class for TableFunction (CachingTableFunction). As
>>>>>> you
>>>>>>>> know,
>>>>>>>>>>>>>>>>>> TableFunction exists in the flink-table-common
>>>>>> module,
>>>>>>>> which
>>>>>>>>>>>>>> provides
>>>>>>>>>>>>>>>>>> only an API for working with tables – it’s very
>>>>>>>> convenient
>>>>>>>>>>> for
>>>>>>>>>>>>>> importing
>>>>>>>>>>>>>>>>>> in connectors. In turn, CachingTableFunction contains
>>>>>>>> logic
>>>>>>>>>>> for
>>>>>>>>>>>>>>>>>> runtime execution,  so this class and everything
>>>>>>>> connected
>>>>>>>>>>> with
>>>>>>>>>>>> it
>>>>>>>>>>>>>>>>>> should be located in another module, probably in
>>>>>>>>>>>>>> flink-table-runtime.
>>>>>>>>>>>>>>>>>> But this will require connectors to depend on another
>>>>>>>> module,
>>>>>>>>>>>>> which
>>>>>>>>>>>>>>>>>> contains a lot of runtime logic, which doesn’t sound
>>>>>>>> good.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> I suggest adding a new method ‘getLookupConfig’ to
>>>>>>>>>>>>> LookupTableSource
>>>>>>>>>>>>>>>>>> or LookupRuntimeProvider to allow connectors to only
>>>>>> pass
>>>>>>>>>>>>>>>>>> configurations to the planner, therefore they won’t
>>>>>>>> depend on
>>>>>>>>>>>>>> runtime
>>>>>>>>>>>>>>>>>> realization. Based on these configs planner will
>>>>>>>> construct a
>>>>>>>>>>>>> lookup
>>>>>>>>>>>>>>>>>> join operator with corresponding runtime logic
>>>>>>>>>>> (ProcessFunctions
>>>>>>>>>>>>> in
>>>>>>>>>>>>>>>>>> module flink-table-runtime). Architecture looks like
>>>>>> in
>>>>>>>> the
>>>>>>>>>>>> pinned
>>>>>>>>>>>>>>>>>> image (LookupConfig class there is actually yours
>>>>>>>>>>> CacheConfig).
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Classes in flink-table-planner, that will be
>>>>>> responsible
>>>>>>>> for
>>>>>>>>>>>> this
>>>>>>>>>>>>> –
>>>>>>>>>>>>>>>>>> CommonPhysicalLookupJoin and his inheritors.
>>>>>>>>>>>>>>>>>> Current classes for lookup join in
>>>>>> flink-table-runtime
>>>>>>>> -
>>>>>>>>>>>>>>>>>> LookupJoinRunner, AsyncLookupJoinRunner,
>>>>>>>>>>>> LookupJoinRunnerWithCalc,
>>>>>>>>>>>>>>>>>> AsyncLookupJoinRunnerWithCalc.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> I suggest adding classes LookupJoinCachingRunner,
>>>>>>>>>>>>>>>>>> LookupJoinCachingRunnerWithCalc, etc.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> And here comes another more powerful advantage of
>>>>>> such a
>>>>>>>>>>>> solution.
>>>>>>>>>>>>>> If
>>>>>>>>>>>>>>>>>> we have caching logic on a lower level, we can apply
>>>>>> some
>>>>>>>>>>>>>>>>>> optimizations to it. LookupJoinRunnerWithCalc was
>>>>>> named
>>>>>>>> like
>>>>>>>>>>>> this
>>>>>>>>>>>>>>>>>> because it uses the ‘calc’ function, which actually
>>>>>>>> mostly
>>>>>>>>>>>>> consists
>>>>>>>>>>>>>> of
>>>>>>>>>>>>>>>>>> filters and projections.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> For example, in join table A with lookup table B
>>>>>>>> condition
>>>>>>>>>>>> ‘JOIN …
>>>>>>>>>>>>>> ON
>>>>>>>>>>>>>>>>>> A.id = B.id AND A.age = B.age + 10 WHERE B.salary >
>>>>>> 1000’
>>>>>>>>>>>> ‘calc’
>>>>>>>>>>>>>>>>>> function will contain filters A.age = B.age + 10 and
>>>>>>>>>>> B.salary >
>>>>>>>>>>>>>> 1000.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> If we apply this function before storing records in
>>>>>>>> cache,
>>>>>>>>>>> size
>>>>>>>>>>>> of
>>>>>>>>>>>>>>>>>> cache will be significantly reduced: filters = avoid
>>>>>>>> storing
>>>>>>>>>>>>> useless
>>>>>>>>>>>>>>>>>> records in cache, projections = reduce records’
>>>>>> size. So
>>>>>>>> the
>>>>>>>>>>>>> initial
>>>>>>>>>>>>>>>>>> max number of records in cache can be increased by
>>>>>> the
>>>>>>>> user.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> What do you think about it?
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> On 2022/04/19 02:47:11 Qingsheng Ren wrote:
>>>>>>>>>>>>>>>>>>> Hi devs,
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Yuan and I would like to start a discussion about
>>>>>>>>>>> FLIP-221[1],
>>>>>>>>>>>>>> which
>>>>>>>>>>>>>>>>>> introduces an abstraction of lookup table cache and
>>>>>> its
>>>>>>>>>>> standard
>>>>>>>>>>>>>> metrics.
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Currently each lookup table source should implement
>>>>>>>> their
>>>>>>>>>>> own
>>>>>>>>>>>>>> cache to
>>>>>>>>>>>>>>>>>> store lookup results, and there isn’t a standard of
>>>>>>>> metrics
>>>>>>>>>>> for
>>>>>>>>>>>>>> users and
>>>>>>>>>>>>>>>>>> developers to tuning their jobs with lookup joins,
>>>>>> which
>>>>>>>> is a
>>>>>>>>>>>>> quite
>>>>>>>>>>>>>> common
>>>>>>>>>>>>>>>>>> use case in Flink table / SQL.
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Therefore we propose some new APIs including cache,
>>>>>>>>>>> metrics,
>>>>>>>>>>>>>> wrapper
>>>>>>>>>>>>>>>>>> classes of TableFunction and new table options.
>>>>>> Please
>>>>>>>> take a
>>>>>>>>>>>> look
>>>>>>>>>>>>>> at the
>>>>>>>>>>>>>>>>>> FLIP page [1] to get more details. Any suggestions
>>>>>> and
>>>>>>>>>>> comments
>>>>>>>>>>>>>> would be
>>>>>>>>>>>>>>>>>> appreciated!
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> [1]
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>> 
>>>>>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-221+Abstraction+for+lookup+source+cache+and+metric
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Qingsheng
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>> Best Regards,
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Qingsheng Ren
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Real-time Computing Team
>>>>>>>>>>>>>>> Alibaba Cloud
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Email: renqschn@gmail.com
>>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> --
>>>>>>>>>> Best regards,
>>>>>>>>>> Roman Boyko
>>>>>>>>>> e.: ro.v.boyko@gmail.com
>>>>>>>>>> 
>>>>>>>> 
>>>>>> 
>>>>>> 
>>>> 
>>> 
>>> 
>>> --
>>> Best Regards,
>>> 
>>> Qingsheng Ren
>>> 
>>> Real-time Computing Team
>>> Alibaba Cloud
>>> 
>>> Email: renqschn@gmail.com
> 



Re: [DISCUSS] FLIP-221 Abstraction for lookup source cache and metric

Posted by Qingsheng Ren <re...@gmail.com>.
Hi devs,

We recently updated FLIP-221 [1] in order to make the concept of caching clearer and introduce some new common lookup table options. The changes are listed below:

1. "LRU cache" and “all cache” have been renamed to “partial cache” and “full cache”. It is not necessary for the cache to use only a size-based LRU eviction policy, so “partial” is more precise. This change was inspired by the definition in Microsoft SQL Server [2][3]. “RescanRuntimeProvider” has been renamed to “FullCachingLookupProvider” accordingly.

2. Common lookup options are introduced in the “Table Options for Lookup Cache” section to unify the usage of lookup tables.

Looking forward to your feedback!

[1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-221%3A+Abstraction+for+lookup+source+cache+and+metric
[2] https://docs.microsoft.com/en-us/sql/integration-services/connection-manager/lookup-transformation-full-cache-mode-cache-connection-manager
[3] https://docs.microsoft.com/en-us/sql/integration-services/data-flow/transformations/implement-a-lookup-in-no-cache-or-partial-cache-mode

Best regards, 

Qingsheng

> On May 18, 2022, at 17:04, Qingsheng Ren <re...@gmail.com> wrote:
> 
> Hi Jark and Alexander, 
> 
> Thanks for your comments! I’m also OK to introduce common table options. I prefer to introduce a new DefaultLookupCacheOptions class for holding these option definitions because putting all options into FactoryUtil would make it a bit “crowded” and not well categorized. 
> 
> FLIP has been updated according to suggestions above: 
> 1. Use static “of” method for constructing RescanRuntimeProvider considering both arguments are required.
> 2. Introduce new table options matching DefaultLookupCacheFactory
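Since both arguments of RescanRuntimeProvider are required, a static factory that fails fast on missing values communicates that better than a builder. A minimal sketch of the idea (class names and the String placeholder for the scan provider are illustrative stand-ins, not the actual FLIP-221 API):

```java
import java.time.Duration;
import java.util.Objects;

// Illustrative stand-in for the proposal's RescanRuntimeProvider: both
// arguments are mandatory, so a single static "of" factory is enough and
// rejects missing values eagerly. The String "scanProvider" field is a
// placeholder for the real scan runtime provider type.
final class RescanProvider {
    private final String scanProvider;
    private final Duration rescanInterval;

    private RescanProvider(String scanProvider, Duration rescanInterval) {
        this.scanProvider = scanProvider;
        this.rescanInterval = rescanInterval;
    }

    static RescanProvider of(String scanProvider, Duration rescanInterval) {
        return new RescanProvider(
                Objects.requireNonNull(scanProvider, "scan provider is required"),
                Objects.requireNonNull(rescanInterval, "rescan interval is required"));
    }

    Duration getRescanInterval() {
        return rescanInterval;
    }
}

public class RescanProviderDemo {
    public static void main(String[] args) {
        RescanProvider p = RescanProvider.of("jdbc-scan", Duration.ofHours(1));
        System.out.println("rescan every " + p.getRescanInterval());
    }
}
```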
> 
> Best,
> Qingsheng
> 
> On Wed, May 18, 2022 at 2:57 PM Jark Wu <im...@gmail.com> wrote:
> Hi Alex,
> 
> 1) retry logic
> I think we can extract some common retry logic into utilities, e.g. RetryUtils#tryTimes(times, call). 
> This seems independent of this FLIP and can be reused by DataStream users. 
> Maybe we can open an issue to discuss this and where to put it. 
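As a rough illustration of what such a shared utility could look like ("RetryUtils" and "tryTimes" are only the hypothetical names floated above, not an existing Flink API; a real version might add backoff or a retriable-failure predicate):

```java
import java.util.concurrent.Callable;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical shared retry helper: runs the callable up to "times" times
// and rethrows the last failure if every attempt fails.
final class RetryUtils {
    private RetryUtils() {}

    static <T> T tryTimes(int times, Callable<T> call) {
        Exception last = null;
        for (int attempt = 1; attempt <= times; attempt++) {
            try {
                return call.call();
            } catch (Exception e) {
                last = e; // a real impl could re-establish the connection here
            }
        }
        throw new RuntimeException("all " + times + " attempts failed", last);
    }
}

public class RetryDemo {
    public static void main(String[] args) {
        AtomicInteger attempts = new AtomicInteger();
        String result = RetryUtils.tryTimes(3, () -> {
            if (attempts.incrementAndGet() < 3) {
                throw new RuntimeException("transient failure");
            }
            return "ok";
        });
        // Succeeds on the third attempt.
        System.out.println(result + " after " + attempts.get() + " attempts");
    }
}
```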
> 
> 2) cache ConfigOptions
> I'm fine with defining cache config options in the framework. 
> A candidate place to put is FactoryUtil which also includes "sink.parallelism", "format" options.
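To make the "define the options once in the framework" idea concrete, here is a toy sketch. This ConfigOption holder is a simplified stand-in for Flink's real ConfigOption/FactoryUtil machinery; only 'lookup.cache.max-rows' is an option name actually mentioned in this thread, the other keys and defaults are hypothetical:

```java
import java.time.Duration;

// Minimal ConfigOption-like holder, for illustration only.
final class ConfigOption<T> {
    final String key;
    final T defaultValue;

    private ConfigOption(String key, T defaultValue) {
        this.key = key;
        this.defaultValue = defaultValue;
    }

    static <T> ConfigOption<T> of(String key, T defaultValue) {
        return new ConfigOption<>(key, defaultValue);
    }
}

// Shared option definitions that every lookup connector could reuse,
// avoiding divergent option names across connectors.
final class LookupOptions {
    static final ConfigOption<Long> CACHE_MAX_ROWS =
            ConfigOption.of("lookup.cache.max-rows", 10_000L);
    static final ConfigOption<Duration> CACHE_TTL =
            ConfigOption.of("lookup.cache.ttl", Duration.ofMinutes(10));
    static final ConfigOption<Integer> MAX_RETRIES =
            ConfigOption.of("lookup.max-retries", 3);

    private LookupOptions() {}
}

public class LookupOptionsDemo {
    public static void main(String[] args) {
        System.out.println(LookupOptions.CACHE_MAX_ROWS.key + " defaults to "
                + LookupOptions.CACHE_MAX_ROWS.defaultValue);
    }
}
```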
> 
> Best,
> Jark
> 
> 
> On Wed, 18 May 2022 at 13:52, Александр Смирнов <sm...@gmail.com> wrote:
> Hi Qingsheng,
> 
> Thank you for considering my comments.
> 
> >  there might be custom logic before making retry, such as re-establish the connection
> 
> Yes, I understand that. I meant that such logic can be placed in a
> separate function that can be implemented by connectors. Just moving
> the retry logic would make the connector's LookupFunction more concise
> and avoid duplicate code. However, it's a minor change. The decision is up
> to you.
> 
> > We decided not to provide common DDL options and to let developers define their own options per connector, as we do now.
> 
> What is the reason for that? One of the main goals of this FLIP was to
> unify the configs, wasn't it? I understand that the current cache design
> no longer depends on ConfigOptions as it did before. But we can still put
> these options into the framework so that connectors can reuse them and
> avoid code duplication and, more significantly, avoid inconsistent
> option naming. This point could be highlighted in the documentation
> for connector developers.
> 
> Best regards,
> Alexander
> 
> вт, 17 мая 2022 г. в 17:11, Qingsheng Ren <re...@gmail.com>:
> >
> > Hi Alexander,
> >
> > Thanks for the review and glad to see we are on the same page! I think you forgot to cc the dev mailing list so I’m also quoting your reply under this email.
> >
> > >  We can add 'maxRetryTimes' option into this class
> >
> > In my opinion the retry logic should be implemented in lookup() instead of in LookupFunction#eval(). Retrying is only meaningful for certain retriable failures, and there might be custom logic before retrying, such as re-establishing the connection (JdbcRowDataLookupFunction is an example), so it's handier to leave it to the connector.
> >
> > > I don't see DDL options, that were in previous version of FLIP. Do you have any special plans for them?
> >
> > We decided not to provide common DDL options and to let developers define their own options per connector, as we do now.
> >
> > The rest of comments sound great and I’ll update the FLIP. Hope we can finalize our proposal soon!
> >
> > Best,
> >
> > Qingsheng
> >
> >
> > > On May 17, 2022, at 13:46, Александр Смирнов <sm...@gmail.com> wrote:
> > >
> > > Hi Qingsheng and devs!
> > >
> > > I like the overall design of updated FLIP, however I have several
> > > suggestions and questions.
> > >
> > > 1) Introducing LookupFunction as a subclass of TableFunction is a good
> > > idea. We can add 'maxRetryTimes' option into this class. 'eval' method
> > > of new LookupFunction is great for this purpose. The same is for
> > > 'async' case.
> > >
> > > 2) There might be other configs in future, such as 'cacheMissingKey'
> > > in LookupFunctionProvider or 'rescanInterval' in ScanRuntimeProvider.
> > > Maybe use Builder pattern in LookupFunctionProvider and
> > > RescanRuntimeProvider for more flexibility (use one 'build' method
> > > instead of many 'of' methods in future)?
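A sketch of what the suggested builder could look like, assuming the 'cacheMissingKey' flag mentioned above as the first optional config (all names here are illustrative, not the final FLIP API):

```java
// Illustrative builder sketch: one build() call instead of a growing family
// of of(...) overloads as optional configs (cacheMissingKey, ...) are added.
final class LookupProvider {
    private final String lookupFunction; // placeholder for the real LookupFunction type
    private final boolean cacheMissingKey;

    private LookupProvider(Builder b) {
        this.lookupFunction = b.lookupFunction;
        this.cacheMissingKey = b.cacheMissingKey;
    }

    boolean isCacheMissingKey() {
        return cacheMissingKey;
    }

    static Builder newBuilder() {
        return new Builder();
    }

    static final class Builder {
        private String lookupFunction;
        private boolean cacheMissingKey = true; // hypothetical default

        Builder withLookupFunction(String f) {
            this.lookupFunction = f;
            return this;
        }

        Builder cacheMissingKey(boolean v) {
            this.cacheMissingKey = v;
            return this;
        }

        LookupProvider build() {
            if (lookupFunction == null) {
                throw new IllegalStateException("a lookup function is required");
            }
            return new LookupProvider(this);
        }
    }
}

public class LookupProviderDemo {
    public static void main(String[] args) {
        LookupProvider p = LookupProvider.newBuilder()
                .withLookupFunction("jdbc-lookup")
                .cacheMissingKey(false)
                .build();
        System.out.println("cacheMissingKey = " + p.isCacheMissingKey());
    }
}
```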
> > >
> > > 3) What are the plans for existing TableFunctionProvider and
> > > AsyncTableFunctionProvider? I think they should be deprecated.
> > >
> > > 4) Am I right that the current design does not assume usage of
> > > user-provided LookupCache in re-scanning? In this case, it is not very
> > > clear why we need methods such as 'invalidate' or 'putAll' in
> > > LookupCache.
> > >
> > > 5) I don't see DDL options, that were in previous version of FLIP. Do
> > > you have any special plans for them?
> > >
> > > If you don't mind, I would be glad to be able to make small
> > > adjustments to the FLIP document too. I think it's worth mentioning
> > > what optimizations exactly are planned for the future.
> > >
> > > Best regards,
> > > Smirnov Alexander
> > >
> > > пт, 13 мая 2022 г. в 20:27, Qingsheng Ren <re...@gmail.com>:
> > >>
> > >> Hi Alexander and devs,
> > >>
> > >> Thank you very much for the in-depth discussion! As Jark mentioned we were inspired by Alexander's idea and made a refactor on our design. FLIP-221 [1] has been updated to reflect our design now and we are happy to hear more suggestions from you!
> > >>
> > >> Compared to the previous design:
> > >> 1. The lookup cache serves at table runtime level and is integrated as a component of LookupJoinRunner as discussed previously.
> > >> 2. Interfaces are renamed and re-designed to reflect the new design.
> > >> 3. We separate the all-caching case individually and introduce a new RescanRuntimeProvider to reuse the ability of scanning. We are planning to support SourceFunction / InputFormat for now considering the complexity of FLIP-27 Source API.
> > >> 4. A new interface LookupFunction is introduced to make the semantic of lookup more straightforward for developers.
> > >>
> > >> For replying to Alexander:
> > >>> However I'm a little confused whether InputFormat is deprecated or not. Am I right that it will be so in the future, but currently it's not?
> > >> Yes you are right. InputFormat is not deprecated for now. I think it will be deprecated in the future but we don't have a clear plan for that.
> > >>
> > >> Thanks again for the discussion on this FLIP and looking forward to cooperating with you after we finalize the design and interfaces!
> > >>
> > >> [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-221+Abstraction+for+lookup+source+cache+and+metric
> > >>
> > >> Best regards,
> > >>
> > >> Qingsheng
> > >>
> > >>
> > >> On Fri, May 13, 2022 at 12:12 AM Александр Смирнов <sm...@gmail.com> wrote:
> > >>>
> > >>> Hi Jark, Qingsheng and Leonard!
> > >>>
> > >>> Glad to see that we came to a consensus on almost all points!
> > >>>
> > >>> However I'm a little confused whether InputFormat is deprecated or
> > >>> not. Am I right that it will be so in the future, but currently it's
> > >>> not? Actually I also think that for the first version it's OK to use
> > >>> InputFormat in ALL cache realization, because supporting rescan
> > >>> ability seems like a very distant prospect. But for this decision we
> > >>> need a consensus among all discussion participants.
> > >>>
> > >>> In general, I don't have something to argue with your statements. All
> > >>> of them correspond my ideas. Looking ahead, it would be nice to work
> > >>> on this FLIP cooperatively. I've already done a lot of work on lookup
> > >>> join caching with realization very close to the one we are discussing,
> > >>> and want to share the results of this work. Anyway looking forward for
> > >>> the FLIP update!
> > >>>
> > >>> Best regards,
> > >>> Smirnov Alexander
> > >>>
> > >>> чт, 12 мая 2022 г. в 17:38, Jark Wu <im...@gmail.com>:
> > >>>>
> > >>>> Hi Alex,
> > >>>>
> > >>>> Thanks for summarizing your points.
> > >>>>
> > >>>> In the past week, Qingsheng, Leonard, and I have discussed it several times
> > >>>> and we have totally refactored the design.
> > >>>> I'm glad to say we have reached a consensus on many of your points!
> > >>>> Qingsheng is still working on updating the design docs and maybe can be
> > >>>> available in the next few days.
> > >>>> I will share some conclusions from our discussions:
> > >>>>
> > >>>> 1) we have refactored the design towards to "cache in framework" way.
> > >>>>
> > >>>> 2) a "LookupCache" interface for users to customize, and a default
> > >>>> implementation with a builder that is easy to use.
> > >>>> This makes it possible to have both flexibility and conciseness.
> > >>>>
> > >>>> 3) Filter pushdown is important for ALL and LRU lookup cache, esp reducing
> > >>>> IO.
> > >>>> Filter pushdown should be the final state and the unified way to both
> > >>>> support pruning ALL cache and LRU cache,
> > >>>> so I think we should make effort in this direction. If we need to support
> > >>>> filter pushdown for ALL cache anyway, why not use
> > >>>> it for LRU cache as well? Either way, as we decide to implement the cache
> > >>>> in the framework, we have the chance to support
> > >>>> filter on cache anytime. This is an optimization and it doesn't affect the
> > >>>> public API. I think we can create a JIRA issue to
> > >>>> discuss it when the FLIP is accepted.
> > >>>>
> > >>>> 4) The idea to support ALL cache is similar to your proposal.
> > >>>> In the first version, we will only support InputFormat, SourceFunction for
> > >>>> cache all (invoke InputFormat in join operator).
> > >>>> For FLIP-27 source, we need to join a true source operator instead of
> > >>>> calling it embedded in the join operator.
> > >>>> However, this needs another FLIP to support the re-scan ability for FLIP-27
> > >>>> Source, and this can be a large work.
> > >>>> In order to not block this issue, we can put the effort of FLIP-27 source
> > >>>> integration into future work and integrate
> > >>>> InputFormat&SourceFunction for now.
> > >>>>
> > >>>> I think it's fine to use InputFormat&SourceFunction, as they are not
> > >>>> deprecated, otherwise, we have to introduce another function
> > >>>> similar to them which is meaningless. We need to plan FLIP-27 source
> > >>>> integration ASAP before InputFormat & SourceFunction are deprecated.
> > >>>>
> > >>>> Best,
> > >>>> Jark
> > >>>>
> > >>>>
> > >>>>
> > >>>>
> > >>>>
> > >>>> On Thu, 12 May 2022 at 15:46, Александр Смирнов <sm...@gmail.com>
> > >>>> wrote:
> > >>>>
> > >>>>> Hi Martijn!
> > >>>>>
> > >>>>> Got it. Therefore, the realization with InputFormat is not considered.
> > >>>>> Thanks for clearing that up!
> > >>>>>
> > >>>>> Best regards,
> > >>>>> Smirnov Alexander
> > >>>>>
> > >>>>> чт, 12 мая 2022 г. в 14:23, Martijn Visser <ma...@ververica.com>:
> > >>>>>>
> > >>>>>> Hi,
> > >>>>>>
> > >>>>>> With regards to:
> > >>>>>>
> > >>>>>>> But if there are plans to refactor all connectors to FLIP-27
> > >>>>>>
> > >>>>>> Yes, FLIP-27 is the target for all connectors. The old interfaces will be
> > >>>>>> deprecated and connectors will either be refactored to use the new ones
> > >>>>> or
> > >>>>>> dropped.
> > >>>>>>
> > >>>>>> The caching should work for connectors that are using FLIP-27 interfaces,
> > >>>>>> we should not introduce new features for old interfaces.
> > >>>>>>
> > >>>>>> Best regards,
> > >>>>>>
> > >>>>>> Martijn
> > >>>>>>
> > >>>>>> On Thu, 12 May 2022 at 06:19, Александр Смирнов <sm...@gmail.com>
> > >>>>>> wrote:
> > >>>>>>
> > >>>>>>> Hi Jark!
> > >>>>>>>
> > >>>>>>> Sorry for the late response. I would like to make some comments and
> > >>>>>>> clarify my points.
> > >>>>>>>
> > >>>>>>> 1) I agree with your first statement. I think we can achieve both
> > >>>>>>> advantages this way: put the Cache interface in flink-table-common,
> > >>>>>>> but have implementations of it in flink-table-runtime. Therefore if a
> > >>>>>>> connector developer wants to use existing cache strategies and their
> > >>>>>>> implementations, he can just pass lookupConfig to the planner, but if
> > >>>>>>> he wants to have its own cache implementation in his TableFunction, it
> > >>>>>>> will be possible for him to use the existing interface for this
> > >>>>>>> purpose (we can explicitly point this out in the documentation). In
> > >>>>>>> this way all configs and metrics will be unified. WDYT?
> > >>>>>>>
> > >>>>>>>> If a filter can prune 90% of data in the cache, we will have 90% of
> > >>>>>>> lookup requests that can never be cached
> > >>>>>>>
> > >>>>>>> 2) Let me clarify the logic filters optimization in case of LRU cache.
> > >>>>>>> It looks like Cache<RowData, Collection<RowData>>. Here we always
> > >>>>>>> store the response of the dimension table in cache, even after
> > >>>>>>> applying calc function. I.e. if there are no rows after applying
> > >>>>>>> filters to the result of the 'eval' method of TableFunction, we store
> > >>>>>>> the empty list by lookup keys. Therefore the cache line will be
> > >>>>>>> filled, but will require much less memory (in bytes). I.e. we don't
> > >>>>>>> completely filter keys, by which result was pruned, but significantly
> > >>>>>>> reduce required memory to store this result. If the user knows about
> > >>>>>>> this behavior, he can increase the 'max-rows' option before the start
> > >>>>>>> of the job. But actually I came up with the idea that we can do this
> > >>>>>>> automatically by using the 'maximumWeight' and 'weigher' methods of
> > >>>>>>> GuavaCache [1]. Weight can be the size of the collection of rows
> > >>>>>>> (value of cache). Therefore cache can automatically fit much more
> > >>>>>>> records than before.
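The effect described above can be sketched with a toy row-weighted LRU cache. This is a plain-Java stand-in for Guava's maximumWeight/weigher idea, not the proposed API: bounding by the total number of cached rows instead of the number of keys means a fully filtered-out key, which stores an empty list, consumes none of the row budget, so far more keys fit.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Access-ordered LRU cache bounded by the TOTAL number of cached rows
// (the "weight") rather than by the number of keys.
final class RowWeightedCache<K, R> {
    private final long maxTotalRows;
    private long totalRows;
    private final LinkedHashMap<K, List<R>> map =
            new LinkedHashMap<>(16, 0.75f, true); // true = access order

    RowWeightedCache(long maxTotalRows) {
        this.maxTotalRows = maxTotalRows;
    }

    List<R> get(K key) {
        return map.get(key);
    }

    void put(K key, List<R> rows) {
        List<R> old = map.put(key, rows);
        totalRows += rows.size() - (old == null ? 0 : old.size());
        // Evict least-recently-used entries until back under the row budget.
        Iterator<Map.Entry<K, List<R>>> it = map.entrySet().iterator();
        while (totalRows > maxTotalRows && it.hasNext()) {
            totalRows -= it.next().getValue().size();
            it.remove();
        }
    }

    int size() {
        return map.size();
    }
}

public class RowWeightedCacheDemo {
    public static void main(String[] args) {
        RowWeightedCache<String, String> cache = new RowWeightedCache<>(4);
        cache.put("key-a", Arrays.asList("row1", "row2"));
        cache.put("key-b", Collections.emptyList()); // fully filtered out: weight 0
        cache.put("key-c", Arrays.asList("row3", "row4", "row5")); // over budget, evicts key-a
        System.out.println("entries=" + cache.size() + ", key-a=" + cache.get("key-a"));
    }
}
```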
> > >>>>>>>
> > >>>>>>>> Flink SQL has provided a standard way to do filters and projects
> > >>>>>>> pushdown, i.e., SupportsFilterPushDown and SupportsProjectionPushDown.
> > >>>>>>>> Jdbc/hive/HBase haven't implemented the interfaces, don't mean it's
> > >>>>> hard
> > >>>>>>> to implement.
> > >>>>>>>
> > >>>>>>> It's debatable how difficult it will be to implement filter pushdown.
> > >>>>>>> But I think the fact that currently there is no database connector
> > >>>>>>> with filter pushdown at least means that this feature won't be
> > >>>>>>> supported soon in connectors. Moreover, if we talk about other
> > >>>>>>> connectors (not in Flink repo), their databases might not support all
> > >>>>>>> Flink filters (or not support filters at all). I think users are
> > >>>>>>> interested in the cache filter optimization independently of
> > >>>>>>> support for other features that require solving more complex (or
> > >>>>>>> even unsolvable) problems.
> > >>>>>>>
> > >>>>>>> 3) I agree with your third statement. Actually in our internal version
> > >>>>>>> I also tried to unify the logic of scanning and reloading data from
> > >>>>>>> connectors. But unfortunately, I didn't find a way to unify the logic
> > >>>>>>> of all ScanRuntimeProviders (InputFormat, SourceFunction, Source,...)
> > >>>>>>> and reuse it in reloading ALL cache. As a result I settled on using
> > >>>>>>> InputFormat, because it was used for scanning in all lookup
> > >>>>>>> connectors. (I didn't know that there are plans to deprecate
> > >>>>>>> InputFormat in favor of FLIP-27 Source). IMO usage of FLIP-27 source
> > >>>>>>> in ALL caching is not a good idea, because this source was designed to
> > >>>>>>> work in distributed environment (SplitEnumerator on JobManager and
> > >>>>>>> SourceReaders on TaskManagers), not in one operator (lookup join
> > >>>>>>> operator in our case). There is even no direct way to pass splits from
> > >>>>>>> SplitEnumerator to SourceReader (this logic works through
> > >>>>>>> SplitEnumeratorContext, which requires
> > >>>>>>> OperatorCoordinator.SubtaskGateway to send AddSplitEvents). Usage of
> > >>>>>>> InputFormat for the ALL cache seems much clearer and easier. But if
> > >>>>>>> there are plans to refactor all connectors to FLIP-27, I have the
> > >>>>>>> following ideas: maybe we can abandon the lookup join ALL cache in
> > >>>>>>> favor of a simple join with multiple scans of the batch source? The point
> > >>>>>>> is that the only difference between lookup join ALL cache and simple
> > >>>>>>> join with batch source is that in the first case scanning is performed
> > >>>>>>> multiple times, in between which state (cache) is cleared (correct me
> > >>>>>>> if I'm wrong). So what if we extend the functionality of simple join
> > >>>>>>> to support state reloading + extend the functionality of scanning
> > >>>>>>> batch source multiple times (this one should be easy with new FLIP-27
> > >>>>>>> source, that unifies streaming/batch reading - we will need to change
> > >>>>>>> only SplitEnumerator, which will pass splits again after some TTL).
> > >>>>>>> WDYT? I must say that this looks like a long-term goal and will make
> > >>>>>>> the scope of this FLIP even larger than you said. Maybe we can limit
> > >>>>>>> ourselves to a simpler solution now (InputFormats).
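The scan-then-clear cycle described above can be sketched without any Flink types. All names here are illustrative, not Flink API; the point is only that an ALL cache is a full snapshot that is atomically replaced after each TTL-triggered rescan:

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

/** ALL cache skeleton: rescan the whole dimension table after a TTL and swap the snapshot. */
class AllCache<K, V> {
    private final AtomicReference<Map<K, List<V>>> snapshot = new AtomicReference<>(Map.of());

    /** One full scan; the previous snapshot (the "state") is dropped atomically. */
    void reload(Supplier<Map<K, List<V>>> scanAll) {
        snapshot.set(scanAll.get());
    }

    /** Periodic variant: re-run the full scan after each TTL, the way a
     *  SplitEnumerator could re-pass its splits after some TTL. */
    ScheduledExecutorService schedule(Supplier<Map<K, List<V>>> scanAll, long ttlMillis) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleWithFixedDelay(() -> reload(scanAll), 0, ttlMillis, TimeUnit.MILLISECONDS);
        return scheduler;
    }

    List<V> lookup(K key) {
        return snapshot.get().getOrDefault(key, List.of());
    }
}
```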
> > >>>>>>>
> > >>>>>>> So to sum up, my points is like this:
> > >>>>>>> 1) There is a way to make both concise and flexible interfaces for
> > >>>>>>> caching in lookup join.
> > >>>>>>> 2) Cache filters optimization is important both in LRU and ALL caches.
> > >>>>>>> 3) It is unclear when filter pushdown will be supported in Flink
> > >>>>>>> connectors, some of the connectors might not have the opportunity to
> > >>>>>>> support filter pushdown + as far as I know, currently filter pushdown works
> > >>>>>>> only for scanning (not lookup). So cache filters + projections
> > >>>>>>> optimization should be independent from other features.
> > >>>>>>> 4) The ALL cache implementation is a complex topic that touches multiple
> > >>>>>>> aspects of how Flink is evolving. Dropping InputFormat in favor
> > >>>>>>> of the FLIP-27 Source will make the ALL cache implementation really
> > >>>>>>> complex and unclear, so maybe instead we can extend the functionality
> > >>>>>>> of the simple join, or keep InputFormat for the lookup join ALL
> > >>>>>>> cache?
> > >>>>>>>
> > >>>>>>> Best regards,
> > >>>>>>> Smirnov Alexander
> > >>>>>>>
> > >>>>>>>
> > >>>>>>>
> > >>>>>>>
> > >>>>>>>
> > >>>>>>>
> > >>>>>>>
> > >>>>>>> [1]
> > >>>>>>>
> > >>>>> https://guava.dev/releases/18.0/api/docs/com/google/common/cache/CacheBuilder.html#weigher(com.google.common.cache.Weigher)
> > >>>>>>>
> > >>>>>>> чт, 5 мая 2022 г. в 20:34, Jark Wu <im...@gmail.com>:
> > >>>>>>>>
> > >>>>>>>> It's great to see the active discussion! I want to share my ideas:
> > >>>>>>>>
> > >>>>>>>> 1) implement the cache in framework vs. connectors base
> > >>>>>>>> I don't have a strong opinion on this. Both ways should work (e.g.,
> > >>>>> cache
> > >>>>>>>> pruning, compatibility).
> > >>>>>>>> The framework way can provide more concise interfaces.
> > >>>>>>>> The connector base way can define more flexible cache
> > >>>>>>>> strategies/implementations.
> > >>>>>>>> We are still investigating a way to see if we can have both
> > >>>>> advantages.
> > >>>>>>>> We should reach a consensus that the way should be a final state,
> > >>>>> and we
> > >>>>>>>> are on the path to it.
> > >>>>>>>>
> > >>>>>>>> 2) filters and projections pushdown:
> > >>>>>>>> I agree with Alex that the filter pushdown into cache can benefit a
> > >>>>> lot
> > >>>>>>> for
> > >>>>>>>> ALL cache.
> > >>>>>>>> However, this is not true for LRU cache. Connectors use cache to
> > >>>>> reduce
> > >>>>>>> IO
> > >>>>>>>> requests to databases for better throughput.
> > >>>>>>>> If a filter can prune 90% of data in the cache, we will have 90% of
> > >>>>>>> lookup
> > >>>>>>>> requests that can never be cached
> > >>>>>>>> and hit the databases directly. That means the cache is
> > >>>>> meaningless in
> > >>>>>>>> this case.
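The trade-off in this paragraph, versus caching empty results for pruned keys as proposed earlier in the thread, can be made concrete with a small accounting sketch (class and names are made up for illustration):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.Set;

/** Counts database requests for a stream of lookup keys under two caching policies. */
class MissAccounting {
    static int dbRequests(List<Integer> keys, Set<Integer> passesFilter, boolean cacheEmptyResults) {
        Map<Integer, Optional<String>> cache = new HashMap<>();
        int requests = 0;
        for (int k : keys) {
            if (cache.containsKey(k)) {
                continue; // served from cache, including cached empty results
            }
            requests++; // this lookup goes to the database
            boolean kept = passesFilter.contains(k);
            if (kept || cacheEmptyResults) {
                cache.put(k, kept ? Optional.of("row") : Optional.empty());
            }
        }
        return requests;
    }
}
```

If pruned keys are simply not cached, every repeat of such a key is a fresh database request; if an empty result is cached per key, each pruned key costs one request and then hits the cache.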
> > >>>>>>>>
> > >>>>>>>> IMO, Flink SQL has provided a standard way to do filters and projects
> > >>>>>>>> pushdown, i.e., SupportsFilterPushDown and
> > >>>>> SupportsProjectionPushDown.
> > >>>>>>>> That Jdbc/hive/HBase haven't implemented the interfaces doesn't mean it's
> > >>>>> hard
> > >>>>>>> to
> > >>>>>>>> implement.
> > >>>>>>>> They should implement the pushdown interfaces to reduce IO and the
> > >>>>> cache
> > >>>>>>>> size.
> > >>>>>>>> That should be a final state that the scan source and lookup source
> > >>>>> share
> > >>>>>>>> the exact pushdown implementation.
> > >>>>>>>> I don't see why we need to duplicate the pushdown logic in caches,
> > >>>>> which
> > >>>>>>>> will complicate the lookup join design.
> > >>>>>>>>
> > >>>>>>>> 3) ALL cache abstraction
> > >>>>>>>> All cache might be the most challenging part of this FLIP. We have
> > >>>>> never
> > >>>>>>>> provided a reload-lookup public interface.
> > >>>>>>>> Currently, we put the reload logic in the "eval" method of
> > >>>>> TableFunction.
> > >>>>>>>> That's hard for some sources (e.g., Hive).
> > >>>>>>>> Ideally, connector implementation should share the logic of reload
> > >>>>> and
> > >>>>>>>> scan, i.e. ScanTableSource with InputFormat/SourceFunction/FLIP-27
> > >>>>>>> Source.
> > >>>>>>>> However, InputFormat/SourceFunction are deprecated, and the FLIP-27
> > >>>>>>> source
> > >>>>>>>> is deeply coupled with SourceOperator.
> > >>>>>>>> If we want to invoke the FLIP-27 source in LookupJoin, this may make
> > >>>>> the
> > >>>>>>>> scope of this FLIP much larger.
> > >>>>>>>> We are still investigating how to abstract the ALL cache logic and
> > >>>>> reuse
> > >>>>>>>> the existing source interfaces.
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>> Best,
> > >>>>>>>> Jark
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>> On Thu, 5 May 2022 at 20:22, Roman Boyko <ro...@gmail.com>
> > >>>>> wrote:
> > >>>>>>>>
> > >>>>>>>>> It's a much more complicated activity and lies outside the scope of
> > >>>>> this
> > >>>>>>>>> improvement. Because such pushdowns should be done for all
> > >>>>>>> ScanTableSource
> > >>>>>>>>> implementations (not only for Lookup ones).
> > >>>>>>>>>
> > >>>>>>>>> On Thu, 5 May 2022 at 19:02, Martijn Visser <
> > >>>>> martijnvisser@apache.org>
> > >>>>>>>>> wrote:
> > >>>>>>>>>
> > >>>>>>>>>> Hi everyone,
> > >>>>>>>>>>
> > >>>>>>>>>> One question regarding "And Alexander correctly mentioned that
> > >>>>> filter
> > >>>>>>>>>> pushdown still is not implemented for jdbc/hive/hbase." -> Would
> > >>>>> an
> > >>>>>>>>>> alternative solution be to actually implement these filter
> > >>>>> pushdowns?
> > >>>>>>> I
> > >>>>>>>>>> can
> > >>>>>>>>>> imagine that there are many more benefits to doing that, outside
> > >>>>> of
> > >>>>>>> lookup
> > >>>>>>>>>> caching and metrics.
> > >>>>>>>>>>
> > >>>>>>>>>> Best regards,
> > >>>>>>>>>>
> > >>>>>>>>>> Martijn Visser
> > >>>>>>>>>> https://twitter.com/MartijnVisser82
> > >>>>>>>>>> https://github.com/MartijnVisser
> > >>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>> On Thu, 5 May 2022 at 13:58, Roman Boyko <ro...@gmail.com>
> > >>>>>>> wrote:
> > >>>>>>>>>>
> > >>>>>>>>>>> Hi everyone!
> > >>>>>>>>>>>
> > >>>>>>>>>>> Thanks for driving such a valuable improvement!
> > >>>>>>>>>>>
> > >>>>>>>>>>> I do think that single cache implementation would be a nice
> > >>>>>>> opportunity
> > >>>>>>>>>> for
> > >>>>>>>>>>> users. And it will break the "FOR SYSTEM_TIME AS OF proc_time"
> > >>>>>>> semantics
> > >>>>>>>>>>> anyway - no matter how it is implemented.
> > >>>>>>>>>>>
> > >>>>>>>>>>> Putting myself in the user's shoes, I can say that:
> > >>>>>>>>>>> 1) I would prefer to have the opportunity to cut off the cache
> > >>>>> size
> > >>>>>>> by
> > >>>>>>>>>>> simply filtering unnecessary data. And the most handy way to do
> > >>>>> it
> > >>>>>>> is
> > >>>>>>>>>> to apply
> > >>>>>>>>>>> it inside LookupRunners. It would be a bit harder to pass it
> > >>>>>>> through the
> > >>>>>>>>>>> LookupJoin node to TableFunction. And Alexander correctly
> > >>>>> mentioned
> > >>>>>>> that
> > >>>>>>>>>>> filter pushdown still is not implemented for jdbc/hive/hbase.
> > >>>>>>>>>>> 2) The ability to set the different caching parameters for
> > >>>>> different
> > >>>>>>>>>> tables
> > >>>>>>>>>>> is quite important. So I would prefer to set it through DDL
> > >>>>> rather
> > >>>>>>> than
> > >>>>>>>>>>> have the same TTL, strategy and other options for all lookup
> > >>>>> tables.
> > >>>>>>>>>>> 3) Putting the cache into the framework really deprives us of
> > >>>>>>>>>>> extensibility (users won't be able to implement their own
> > >>>>> cache).
> > >>>>>>> But
> > >>>>>>>>>> most
> > >>>>>>>>>>> probably it might be solved by creating more different cache
> > >>>>>>> strategies
> > >>>>>>>>>> and
> > >>>>>>>>>>> a wider set of configurations.
> > >>>>>>>>>>>
> > >>>>>>>>>>> All these points are much closer to the schema proposed by
> > >>>>>>> Alexander.
> > >>>>>>>>>>> Qingsheng Ren, please correct me if I'm wrong - can all these
> > >>>>>>>>>> facilities
> > >>>>>>>>>>> be easily implemented in your architecture?
> > >>>>>>>>>>>
> > >>>>>>>>>>> Best regards,
> > >>>>>>>>>>> Roman Boyko
> > >>>>>>>>>>> e.: ro.v.boyko@gmail.com
> > >>>>>>>>>>>
> > >>>>>>>>>>> On Wed, 4 May 2022 at 21:01, Martijn Visser <
> > >>>>>>> martijnvisser@apache.org>
> > >>>>>>>>>>> wrote:
> > >>>>>>>>>>>
> > >>>>>>>>>>>> Hi everyone,
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> I don't have much to chip in, but just wanted to express that
> > >>>>> I
> > >>>>>>> really
> > >>>>>>>>>>>> appreciate the in-depth discussion on this topic and I hope
> > >>>>> that
> > >>>>>>>>>> others
> > >>>>>>>>>>>> will join the conversation.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Best regards,
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Martijn
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> On Tue, 3 May 2022 at 10:15, Александр Смирнов <
> > >>>>>>> smiralexan@gmail.com>
> > >>>>>>>>>>>> wrote:
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>> Hi Qingsheng, Leonard and Jark,
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> Thanks for your detailed feedback! However, I have questions
> > >>>>>>> about
> > >>>>>>>>>>>>> some of your statements (maybe I didn't get something?).
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>> Caching actually breaks the semantic of "FOR SYSTEM_TIME
> > >>>>> AS OF
> > >>>>>>>>>>>> proc_time”
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> I agree that the semantics of "FOR SYSTEM_TIME AS OF
> > >>>>> proc_time"
> > >>>>>>> is
> > >>>>>>>>>> not
> > >>>>>>>>>>>>> fully implemented with caching, but as you said, users go
> > >>>>> on it
> > >>>>>>>>>>>>> consciously to achieve better performance (no one proposed
> > >>>>> to
> > >>>>>>> enable
> > >>>>>>>>>>>>> caching by default, etc.). Or by users do you mean other
> > >>>>>>> developers
> > >>>>>>>>>> of
> > >>>>>>>>>>>>> connectors? In this case developers explicitly specify
> > >>>>> whether
> > >>>>>>> their
> > >>>>>>>>>>>>> connector supports caching or not (in the list of supported
> > >>>>>>>>>> options),
> > >>>>>>>>>>>>> no one makes them do that if they don't want to. So what
> > >>>>>>> exactly is
> > >>>>>>>>>>>>> the difference between implementing caching in modules
> > >>>>>>>>>>>>> flink-table-runtime and in flink-table-common from the
> > >>>>>>> considered
> > >>>>>>>>>>>>> point of view? How does it affect on breaking/non-breaking
> > >>>>> the
> > >>>>>>>>>>>>> semantics of "FOR SYSTEM_TIME AS OF proc_time"?
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>> confront a situation that allows table options in DDL to
> > >>>>>>> control
> > >>>>>>>>>> the
> > >>>>>>>>>>>>> behavior of the framework, which has never happened
> > >>>>> previously
> > >>>>>>> and
> > >>>>>>>>>>> should
> > >>>>>>>>>>>>> be cautious
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> If we talk about main differences of semantics of DDL
> > >>>>> options
> > >>>>>>> and
> > >>>>>>>>>>>>> config options("table.exec.xxx"), isn't it about limiting
> > >>>>> the
> > >>>>>>> scope
> > >>>>>>>>>> of
> > >>>>>>>>>>>>> the options + importance for the user business logic rather
> > >>>>> than
> > >>>>>>>>>>>>> specific location of corresponding logic in the framework? I
> > >>>>>>> mean
> > >>>>>>>>>> that
> > >>>>>>>>>>>>> in my design, for example, putting an option with lookup
> > >>>>> cache
> > >>>>>>>>>>>>> strategy in configurations would be the wrong decision,
> > >>>>>>> because it
> > >>>>>>>>>>>>> directly affects the user's business logic (not just
> > >>>>> performance
> > >>>>>>>>>>>>> optimization) + touches just several functions of ONE table
> > >>>>>>> (there
> > >>>>>>>>>> can
> > >>>>>>>>>>>>> be multiple tables with different caches). Does it really
> > >>>>>>> matter for
> > >>>>>>>>>>>>> the user (or someone else) where the logic is located,
> > >>>>> which is
> > >>>>>>>>>>>>> affected by the applied option?
> > >>>>>>>>>>>>> Also I can remember DDL option 'sink.parallelism', which in
> > >>>>>>> some way
> > >>>>>>>>>>>>> "controls the behavior of the framework" and I don't see any
> > >>>>>>> problem
> > >>>>>>>>>>>>> here.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>> introduce a new interface for this all-caching scenario
> > >>>>> and
> > >>>>>>> the
> > >>>>>>>>>>> design
> > >>>>>>>>>>>>> would become more complex
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> This is a subject for a separate discussion, but actually
> > >>>>> in our
> > >>>>>>>>>>>>> internal version we solved this problem quite easily - we
> > >>>>> reused
> > >>>>>>>>>>>>> InputFormat class (so there is no need for a new API). The
> > >>>>>>> point is
> > >>>>>>>>>>>>> that currently all lookup connectors use InputFormat for
> > >>>>>>> scanning
> > >>>>>>>>>> the
> > >>>>>>>>>>>>> data in batch mode: HBase, JDBC and even Hive - it uses
> > >>>>> class
> > >>>>>>>>>>>>> PartitionReader, which is actually just a wrapper around
> > >>>>>>> InputFormat.
> > >>>>>>>>>>>>> The advantage of this solution is the ability to reload
> > >>>>> cache
> > >>>>>>> data
> > >>>>>>>>>> in
> > >>>>>>>>>>>>> parallel (number of threads depends on number of
> > >>>>> InputSplits,
> > >>>>>>> but
> > >>>>>>>>>> has
> > >>>>>>>>>>>>> an upper limit). As a result cache reload time significantly
> > >>>>>>> reduces
> > >>>>>>>>>>>>> (as well as the time the input stream is blocked). I know that
> > >>>>> usually
> > >>>>>>> we
> > >>>>>>>>>> try
> > >>>>>>>>>>>>> to avoid usage of concurrency in Flink code, but maybe this
> > >>>>> one
> > >>>>>>> can
> > >>>>>>>>>> be
> > >>>>>>>>>>>>> an exception. BTW I don't say that it's an ideal solution,
> > >>>>> maybe
> > >>>>>>>>>> there
> > >>>>>>>>>>>>> are better ones.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>> Providing the cache in the framework might introduce
> > >>>>>>> compatibility
> > >>>>>>>>>>>> issues
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> It's possible only in cases when the developer of the
> > >>>>> connector
> > >>>>>>>>>> won't
> > >>>>>>>>>>>>> properly refactor his code and will use new cache options
> > >>>>>>>>>> incorrectly
> > >>>>>>>>>>>>> (i.e. explicitly provide the same options into 2 different
> > >>>>> code
> > >>>>>>>>>>>>> places). For correct behavior all he will need to do is to
> > >>>>>>> redirect
> > >>>>>>>>>>>>> existing options to the framework's LookupConfig (+ maybe
> > >>>>> add an
> > >>>>>>>>>> alias
> > >>>>>>>>>>>>> for options, if there was different naming), everything
> > >>>>> will be
> > >>>>>>>>>>>>> transparent for users. If the developer won't do
> > >>>>> refactoring at
> > >>>>>>> all,
> > >>>>>>>>>>>>> nothing will be changed for the connector because of
> > >>>>> backward
> > >>>>>>>>>>>>> compatibility. Also if a developer wants to use his own
> > >>>>> cache
> > >>>>>>> logic,
> > >>>>>>>>>>>>> he just can refuse to pass some of the configs into the
> > >>>>>>> framework,
> > >>>>>>>>>> and
> > >>>>>>>>>>>>> instead make his own implementation with already existing
> > >>>>>>> configs
> > >>>>>>>>>> and
> > >>>>>>>>>>>>> metrics (but actually I think that it's a rare case).
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>> filters and projections should be pushed all the way down
> > >>>>> to
> > >>>>>>> the
> > >>>>>>>>>>> table
> > >>>>>>>>>>>>> function, like what we do in the scan source
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> It's the great purpose. But the truth is that the ONLY
> > >>>>> connector
> > >>>>>>>>>> that
> > >>>>>>>>>>>>> supports filter pushdown is FileSystemTableSource
> > >>>>>>>>>>>>> (no database connector supports it currently). Also for some
> > >>>>>>>>>> databases
> > >>>>>>>>>>>>> it's simply impossible to pushdown such complex filters
> > >>>>> that we
> > >>>>>>> have
> > >>>>>>>>>>>>> in Flink.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>> only applying these optimizations to the cache seems not
> > >>>>>>> quite
> > >>>>>>>>>>> useful
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> Filters can cut off an arbitrarily large amount of data
> > >>>>> from the
> > >>>>>>>>>>>>> dimension table. For a simple example, suppose in dimension
> > >>>>>>> table
> > >>>>>>>>>>>>> 'users'
> > >>>>>>>>>>>>> we have column 'age' with values from 20 to 40, and input
> > >>>>> stream
> > >>>>>>>>>>>>> 'clicks' that is ~uniformly distributed by age of users. If
> > >>>>> we
> > >>>>>>> have
> > >>>>>>>>>>>>> filter 'age > 30',
> > >>>>>>>>>>>>> there will be twice less data in cache. This means the user
> > >>>>> can
> > >>>>>>>>>>>>> increase 'lookup.cache.max-rows' by almost 2 times. It will
> > >>>>>>> gain a
> > >>>>>>>>>>>>> huge
> > >>>>>>>>>>>>> performance boost. Moreover, this optimization starts to
> > >>>>> really
> > >>>>>>>>>> shine
> > >>>>>>>>>>>>> in 'ALL' cache, where tables without filters and projections
> > >>>>>>> can't
> > >>>>>>>>>> fit
> > >>>>>>>>>>>>> in memory, but with them - can. This opens up additional
> > >>>>>>>>>> possibilities
> > >>>>>>>>>>>>> for users. And that doesn't sound like 'not quite useful'.
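The filter-before-cache step argued for above can be sketched minimally. The User record and helper names here are made up for illustration; the idea is only that rows failing the join's 'calc' filter can never contribute to a result, so dropping them before caching shrinks every cache entry:

```java
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

/** Applies the join's 'calc' filter to looked-up rows before they are cached. */
class CalcBeforeCache {
    record User(int id, int age, long salary) {}

    static List<User> filterForCache(List<User> lookedUp, Predicate<User> calcFilter) {
        // rows failing the filter can never contribute to a join result,
        // so dropping them here shrinks every cache entry
        return lookedUp.stream().filter(calcFilter).collect(Collectors.toList());
    }
}
```

With a filter like 'age > 30' that keeps half the rows, the same 'lookup.cache.max-rows' budget effectively covers twice as many lookup keys.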
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> It would be great to hear other voices regarding this topic!
> > >>>>>>> Because
> > >>>>>>>>>>>>> we have quite a lot of controversial points, and I think
> > >>>>> with
> > >>>>>>> the
> > >>>>>>>>>> help
> > >>>>>>>>>>>>> of others it will be easier for us to come to a consensus.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> Best regards,
> > >>>>>>>>>>>>> Smirnov Alexander
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> пт, 29 апр. 2022 г. в 22:33, Qingsheng Ren <
> > >>>>> renqschn@gmail.com
> > >>>>>>>> :
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> Hi Alexander and Arvid,
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> Thanks for the discussion and sorry for my late response!
> > >>>>> We
> > >>>>>>> had
> > >>>>>>>>>> an
> > >>>>>>>>>>>>> internal discussion together with Jark and Leonard and I’d
> > >>>>> like
> > >>>>>>> to
> > >>>>>>>>>>>>> summarize our ideas. Instead of implementing the cache
> > >>>>> logic in
> > >>>>>>> the
> > >>>>>>>>>>> table
> > >>>>>>>>>>>>> runtime layer or wrapping around the user-provided table
> > >>>>>>> function,
> > >>>>>>>>>> we
> > >>>>>>>>>>>>> prefer to introduce some new APIs extending TableFunction
> > >>>>> with
> > >>>>>>> these
> > >>>>>>>>>>>>> concerns:
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> 1. Caching actually breaks the semantic of "FOR
> > >>>>> SYSTEM_TIME
> > >>>>>>> AS OF
> > >>>>>>>>>>>>> proc_time”, because it couldn’t truly reflect the content
> > >>>>> of the
> > >>>>>>>>>> lookup
> > >>>>>>>>>>>>> table at the moment of querying. If users choose to enable
> > >>>>>>> caching
> > >>>>>>>>>> on
> > >>>>>>>>>>> the
> > >>>>>>>>>>>>> lookup table, they implicitly indicate that this breakage is
> > >>>>>>>>>> acceptable
> > >>>>>>>>>>>> in
> > >>>>>>>>>>>>> exchange for the performance. So we prefer not to provide
> > >>>>>>> caching on
> > >>>>>>>>>>> the
> > >>>>>>>>>>>>> table runtime level.
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> 2. If we make the cache implementation in the framework
> > >>>>>>> (whether
> > >>>>>>>>>> in a
> > >>>>>>>>>>>>> runner or a wrapper around TableFunction), we have to
> > >>>>> confront a
> > >>>>>>>>>>>> situation
> > >>>>>>>>>>>>> that allows table options in DDL to control the behavior of
> > >>>>> the
> > >>>>>>>>>>>> framework,
> > >>>>>>>>>>>>> which has never happened previously and should be cautious.
> > >>>>>>> Under
> > >>>>>>>>>> the
> > >>>>>>>>>>>>> current design the behavior of the framework should only be
> > >>>>>>>>>> specified
> > >>>>>>>>>>> by
> > >>>>>>>>>>>>> configurations (“table.exec.xxx”), and it’s hard to apply
> > >>>>> these
> > >>>>>>>>>> general
> > >>>>>>>>>>>>> configs to a specific table.
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> 3. We have use cases that lookup source loads and refresh
> > >>>>> all
> > >>>>>>>>>> records
> > >>>>>>>>>>>>> periodically into the memory to achieve high lookup
> > >>>>> performance
> > >>>>>>>>>> (like
> > >>>>>>>>>>>> Hive
> > >>>>>>>>>>>>> connector in the community, and also widely used by our
> > >>>>> internal
> > >>>>>>>>>>>>> connectors). Wrapping the cache around the user’s
> > >>>>> TableFunction
> > >>>>>>>>>> works
> > >>>>>>>>>>>> fine
> > >>>>>>>>>>>>> for LRU caches, but I think we have to introduce a new
> > >>>>>>> interface for
> > >>>>>>>>>>> this
> > >>>>>>>>>>>>> all-caching scenario and the design would become more
> > >>>>> complex.
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> 4. Providing the cache in the framework might introduce
> > >>>>>>>>>> compatibility
> > >>>>>>>>>>>>> issues to existing lookup sources like there might exist two
> > >>>>>>> caches
> > >>>>>>>>>>> with
> > >>>>>>>>>>>>> totally different strategies if the user incorrectly
> > >>>>> configures
> > >>>>>>> the
> > >>>>>>>>>>> table
> > >>>>>>>>>>>>> (one in the framework and another implemented by the lookup
> > >>>>>>> source).
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> As for the optimization mentioned by Alexander, I think
> > >>>>>>> filters
> > >>>>>>>>>> and
> > >>>>>>>>>>>>> projections should be pushed all the way down to the table
> > >>>>>>> function,
> > >>>>>>>>>>> like
> > >>>>>>>>>>>>> what we do in the scan source, instead of the runner with
> > >>>>> the
> > >>>>>>> cache.
> > >>>>>>>>>>> The
> > >>>>>>>>>>>>> goal of using cache is to reduce the network I/O and
> > >>>>> pressure
> > >>>>>>> on the
> > >>>>>>>>>>>>> external system, and only applying these optimizations to
> > >>>>> the
> > >>>>>>> cache
> > >>>>>>>>>>> seems
> > >>>>>>>>>>>>> not quite useful.
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> I made some updates to the FLIP[1] to reflect our ideas.
> > >>>>> We
> > >>>>>>>>>> prefer to
> > >>>>>>>>>>>>> keep the cache implementation as a part of TableFunction,
> > >>>>> and we
> > >>>>>>>>>> could
> > >>>>>>>>>>>>> provide some helper classes (CachingTableFunction,
> > >>>>>>>>>>>> AllCachingTableFunction,
> > >>>>>>>>>>>>> CachingAsyncTableFunction) to developers and regulate
> > >>>>> metrics
> > >>>>>>> of the
> > >>>>>>>>>>>> cache.
> > >>>>>>>>>>>>> Also, I made a POC[2] for your reference.
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> Looking forward to your ideas!
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> [1]
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>
> > >>>>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-221+Abstraction+for+lookup+source+cache+and+metric
> > >>>>>>>>>>>>>> [2] https://github.com/PatrickRen/flink/tree/FLIP-221
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> Best regards,
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> Qingsheng
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> On Tue, Apr 26, 2022 at 4:45 PM Александр Смирнов <
> > >>>>>>>>>>>> smiralexan@gmail.com>
> > >>>>>>>>>>>>> wrote:
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> Thanks for the response, Arvid!
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> I have few comments on your message.
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> but could also live with an easier solution as the
> > >>>>> first
> > >>>>>>> step:
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> I think that these 2 ways are mutually exclusive
> > >>>>> (originally
> > >>>>>>>>>>> proposed
> > >>>>>>>>>>>>>>> by Qingsheng and mine), because conceptually they follow
> > >>>>> the
> > >>>>>>> same
> > >>>>>>>>>>>>>>> goal, but implementation details are different. If we
> > >>>>> will
> > >>>>>>> go one
> > >>>>>>>>>>> way,
> > >>>>>>>>>>>>>>> moving to another way in the future will mean deleting
> > >>>>>>> existing
> > >>>>>>>>>> code
> > >>>>>>>>>>>>>>> and once again changing the API for connectors. So I
> > >>>>> think we
> > >>>>>>>>>> should
> > >>>>>>>>>>>>>>> reach a consensus with the community about that and then
> > >>>>> work
> > >>>>>>>>>>> together
> > >>>>>>>>>>>>>>> on this FLIP, i.e. divide the work on tasks for different
> > >>>>>>> parts
> > >>>>>>>>>> of
> > >>>>>>>>>>> the
> > >>>>>>>>>>>>>>> flip (for example, LRU cache unification / introducing
> > >>>>>>> proposed
> > >>>>>>>>>> set
> > >>>>>>>>>>> of
> > >>>>>>>>>>>>>>> metrics / further work…). WDYT, Qingsheng?
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> as the source will only receive the requests after
> > >>>>> filter
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> Actually if filters are applied to fields of the lookup
> > >>>>>>> table, we
> > >>>>>>>>>>>>>>> first have to make the requests, and only after that we can
> > >>>>> filter
> > >>>>>>>>>>> responses,
> > >>>>>>>>>>>>>>> because lookup connectors don't have filter pushdown. So
> > >>>>> if
> > >>>>>>>>>>> filtering
> > >>>>>>>>>>>>>>> is done before caching, there will be far fewer rows in
> > >>>>>>> cache.
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> @Alexander unfortunately, your architecture is not
> > >>>>> shared.
> > >>>>>>> I
> > >>>>>>>>>> don't
> > >>>>>>>>>>>>> know the
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> solution to share images to be honest.
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> Sorry for that, I’m a bit new to such kinds of
> > >>>>> conversations
> > >>>>>>> :)
> > >>>>>>>>>>>>>>> I have no write access to the confluence, so I made a
> > >>>>> Jira
> > >>>>>>> issue,
> > >>>>>>>>>>>>>>> where described the proposed changes in more details -
> > >>>>>>>>>>>>>>> https://issues.apache.org/jira/browse/FLINK-27411.
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> Will happy to get more feedback!
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> Best,
> > >>>>>>>>>>>>>>> Smirnov Alexander
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> пн, 25 апр. 2022 г. в 19:49, Arvid Heise <
> > >>>>> arvid@apache.org>:
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> Hi Qingsheng,
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> Thanks for driving this; the inconsistency was not
> > >>>>>>> satisfying
> > >>>>>>>>>> for
> > >>>>>>>>>>>> me.
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> I second Alexander's idea though but could also live
> > >>>>> with
> > >>>>>>> an
> > >>>>>>>>>>> easier
> > >>>>>>>>>>>>>>>> solution as the first step: Instead of making caching
> > >>>>> an
> > >>>>>>>>>>>>> implementation
> > >>>>>>>>>>>>>>>> detail of TableFunction X, rather devise a caching
> > >>>>> layer
> > >>>>>>>>>> around X.
> > >>>>>>>>>>>> So
> > >>>>>>>>>>>>> the
> > >>>>>>>>>>>>>>>> proposal would be a CachingTableFunction that
> > >>>>> delegates to
> > >>>>>>> X in
> > >>>>>>>>>>> case
> > >>>>>>>>>>>>> of
> > >>>>>>>>>>>>>>>> misses and else manages the cache. Lifting it into the
> > >>>>>>> operator
> > >>>>>>>>>>>> model
> > >>>>>>>>>>>>> as
> > >>>>>>>>>>>>>>>> proposed would be even better but is probably
> > >>>>> unnecessary
> > >>>>>>> in
> > >>>>>>>>>> the
> > >>>>>>>>>>>>> first step
> > >>>>>>>>>>>>>>>> for a lookup source (as the source will only receive
> > >>>>> the
> > >>>>>>>>>> requests
> > >>>>>>>>>>>>> after
> > >>>>>>>>>>>>>>>> filter; applying projection may be more interesting to
> > >>>>> save
> > >>>>>>>>>>> memory).
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> Another advantage is that all the changes of this FLIP
> > >>>>>>> would be
> > >>>>>>>>>>>>> limited to
> > >>>>>>>>>>>>>>>> options, no need for new public interfaces. Everything
> > >>>>> else
> > >>>>>>>>>>> remains
> > >>>>>>>>>>>> an
> > >>>>>>>>>>>>>>>> implementation of Table runtime. That means we can
> > >>>>> easily
> > >>>>>>>>>>>> incorporate
> > >>>>>>>>>>>>> the
> > >>>>>>>>>>>>>>>> optimization potential that Alexander pointed out
> > >>>>> later.
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> @Alexander unfortunately, your architecture is not
> > >>>>> shared.
> > >>>>>>> I
> > >>>>>>>>>> don't
> > >>>>>>>>>>>>> know the
> > >>>>>>>>>>>>>>>> solution to share images to be honest.
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> On Fri, Apr 22, 2022 at 5:04 PM Александр Смирнов <
> > >>>>>>>>>>>>> smiralexan@gmail.com>
> > >>>>>>>>>>>>>>>> wrote:
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> Hi Qingsheng! My name is Alexander, I'm not a
> > >>>>> committer
> > >>>>>>> yet,
> > >>>>>>>>>> but
> > >>>>>>>>>>>> I'd
> > >>>>>>>>>>>>>>>>> really like to become one. And this FLIP really
> > >>>>>>> interested
> > >>>>>>>>>> me.
> > >>>>>>>>>>>>>>>>> Actually I have worked on a similar feature in my
> > >>>>>>> company’s
> > >>>>>>>>>>> Flink
> > >>>>>>>>>>>>>>>>> fork, and we would like to share our thoughts on
> > >>>>> this and
> > >>>>>>>>>> make
> > >>>>>>>>>>>> code
> > >>>>>>>>>>>>>>>>> open source.
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> I think there is a better alternative than
> > >>>>> introducing an
> > >>>>>>>>>>> abstract
> > >>>>>>>>>>>>>>>>> class for TableFunction (CachingTableFunction). As
> > >>>>> you
> > >>>>>>> know,
> > >>>>>>>>>>>>>>>>> TableFunction exists in the flink-table-common
> > >>>>> module,
> > >>>>>>> which
> > >>>>>>>>>>>>> provides
> > >>>>>>>>>>>>>>>>> only an API for working with tables – it’s very
> > >>>>>>> convenient
> > >>>>>>>>>> for
> > >>>>>>>>>>>>> importing
> > >>>>>>>>>>>>>>>>> in connectors. In turn, CachingTableFunction contains
> > >>>>>>> logic
> > >>>>>>>>>> for
> > >>>>>>>>>>>>>>>>> runtime execution, so this class and everything
> > >>>>>>> connected
> > >>>>>>>>>> with
> > >>>>>>>>>>> it
> > >>>>>>>>>>>>>>>>> should be located in another module, probably in
> > >>>>>>>>>>>>> flink-table-runtime.
> > >>>>>>>>>>>>>>>>> But this will require connectors to depend on another
> > >>>>>>> module,
> > >>>>>>>>>>>> which
> > >>>>>>>>>>>>>>>>> contains a lot of runtime logic, which doesn’t sound
> > >>>>>>> good.
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> I suggest adding a new method ‘getLookupConfig’ to
> > >>>>>>>>>>>> LookupTableSource
> > >>>>>>>>>>>>>>>>> or LookupRuntimeProvider to allow connectors to only
> > >>>>> pass
> > >>>>>>>>>>>>>>>>> configurations to the planner, therefore they won’t
> > >>>>>>> depend on
> > >>>>>>>>>>>>> runtime
> > >>>>>>>>>>>>>>>>> implementation. Based on these configs the planner will
> > >>>>>>> construct a
> > >>>>>>>>>>>> lookup
> > >>>>>>>>>>>>>>>>> join operator with corresponding runtime logic
> > >>>>>>>>>> (ProcessFunctions
> > >>>>>>>>>>>> in
> > >>>>>>>>>>>>>>>>> module flink-table-runtime). Architecture looks like
> > >>>>> in
> > >>>>>>> the
> > >>>>>>>>>>> pinned
> > >>>>>>>>>>>>>>>>> image (the LookupConfig class there is actually your
> > >>>>>>>>>> CacheConfig).
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> Classes in flink-table-planner, that will be
> > >>>>> responsible
> > >>>>>>> for
> > >>>>>>>>>>> this
> > >>>>>>>>>>>> –
> > >>>>>>>>>>>>>>>>> CommonPhysicalLookupJoin and its subclasses.
> > >>>>>>>>>>>>>>>>> Current classes for lookup join in
> > >>>>> flink-table-runtime
> > >>>>>>> -
> > >>>>>>>>>>>>>>>>> LookupJoinRunner, AsyncLookupJoinRunner,
> > >>>>>>>>>>> LookupJoinRunnerWithCalc,
> > >>>>>>>>>>>>>>>>> AsyncLookupJoinRunnerWithCalc.
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> I suggest adding classes LookupJoinCachingRunner,
> > >>>>>>>>>>>>>>>>> LookupJoinCachingRunnerWithCalc, etc.
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> And here comes another more powerful advantage of
> > >>>>> such a
> > >>>>>>>>>>> solution.
> > >>>>>>>>>>>>> If
> > >>>>>>>>>>>>>>>>> we have caching logic on a lower level, we can apply
> > >>>>> some
> > >>>>>>>>>>>>>>>>> optimizations to it. LookupJoinRunnerWithCalc was
> > >>>>> named
> > >>>>>>> like
> > >>>>>>>>>>> this
> > >>>>>>>>>>>>>>>>> because it uses the ‘calc’ function, which actually
> > >>>>>>> mostly
> > >>>>>>>>>>>> consists
> > >>>>>>>>>>>>> of
> > >>>>>>>>>>>>>>>>> filters and projections.
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> For example, in join table A with lookup table B
> > >>>>>>> condition
> > >>>>>>>>>>> ‘JOIN …
> > >>>>>>>>>>>>> ON
> > >>>>>>>>>>>>>>>>> A.id = B.id AND A.age = B.age + 10 WHERE B.salary >
> > >>>>> 1000’
> > >>>>>>>>>>> ‘calc’
> > >>>>>>>>>>>>>>>>> function will contain filters A.age = B.age + 10 and
> > >>>>>>>>>> B.salary >
> > >>>>>>>>>>>>> 1000.
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> If we apply this function before storing records in
> > >>>>>>> cache,
> > >>>>>>>>>> size
> > >>>>>>>>>>> of
> > >>>>>>>>>>>>>>>>> cache will be significantly reduced: filters = avoid
> > >>>>>>> storing
> > >>>>>>>>>>>> useless
> > >>>>>>>>>>>>>>>>> records in cache, projections = reduce records’
> > >>>>> size. So
> > >>>>>>> the
> > >>>>>>>>>>>> initial
> > >>>>>>>>>>>>>>>>> max number of records in cache can be increased by
> > >>>>> the
> > >>>>>>> user.
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> What do you think about it?
> > >>>>>>>>>>>>>>>>>
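A minimal sketch of the optimization described above (all class and method names here are hypothetical, not Flink's real runtime classes): the calc function's filters and projections are applied to the raw lookup result before it is stored, so pruned rows never occupy cache memory and projected-away columns are not kept.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;
import java.util.function.Predicate;

/** Toy illustration of applying the join's calc function before caching. */
public class CalcBeforeCacheExample {

    /** Returns the reduced row set that would be cached instead of the raw result. */
    public static List<List<Object>> applyCalcBeforeCaching(
            List<List<Object>> lookedUpRows,
            Predicate<List<Object>> filter,
            Function<List<Object>, List<Object>> projection) {
        List<List<Object>> reduced = new ArrayList<>();
        for (List<Object> row : lookedUpRows) {
            if (filter.test(row)) {                 // e.g. WHERE B.salary > 1000
                reduced.add(projection.apply(row)); // keep only the needed columns
            }
        }
        // Even an empty list is cached under the lookup key, so repeated
        // lookups of a fully-filtered key still hit the cache.
        return reduced;
    }
}
```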
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> On 2022/04/19 02:47:11 Qingsheng Ren wrote:
> > >>>>>>>>>>>>>>>>>> Hi devs,
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> Yuan and I would like to start a discussion about
> > >>>>>>>>>> FLIP-221[1],
> > >>>>>>>>>>>>> which
> > >>>>>>>>>>>>>>>>> introduces an abstraction of lookup table cache and
> > >>>>> its
> > >>>>>>>>>> standard
> > >>>>>>>>>>>>> metrics.
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> Currently each lookup table source should implement
> > >>>>>>> its
> > >>>>>>>>>> own
> > >>>>>>>>>>>>> cache to
> > >>>>>>>>>>>>>>>>> store lookup results, and there isn’t a standard set of
> > >>>>>>> metrics
> > >>>>>>>>>> for
> > >>>>>>>>>>>>> users and
> > >>>>>>>>>>>>>>>>> developers to tune their jobs with lookup joins,
> > >>>>> which
> > >>>>>>> is a
> > >>>>>>>>>>>> quite
> > >>>>>>>>>>>>> common
> > >>>>>>>>>>>>>>>>> use case in Flink table / SQL.
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> Therefore we propose some new APIs including cache,
> > >>>>>>>>>> metrics,
> > >>>>>>>>>>>>> wrapper
> > >>>>>>>>>>>>>>>>> classes of TableFunction and new table options.
> > >>>>> Please
> > >>>>>>> take a
> > >>>>>>>>>>> look
> > >>>>>>>>>>>>> at the
> > >>>>>>>>>>>>>>>>> FLIP page [1] to get more details. Any suggestions
> > >>>>> and
> > >>>>>>>>>> comments
> > >>>>>>>>>>>>> would be
> > >>>>>>>>>>>>>>>>> appreciated!
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> [1]
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>
> > >>>>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-221+Abstraction+for+lookup+source+cache+and+metric
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> Best regards,
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> Qingsheng
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> --
> > >>>>>>>>>>>>>> Best Regards,
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> Qingsheng Ren
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> Real-time Computing Team
> > >>>>>>>>>>>>>> Alibaba Cloud
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> Email: renqschn@gmail.com
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>> --
> > >>>>>>>>> Best regards,
> > >>>>>>>>> Roman Boyko
> > >>>>>>>>> e.: ro.v.boyko@gmail.com
> > >>>>>>>>>
> > >>>>>>>
> > >>>>>
> > >>>>>
> > >>>
> > >>
> > >>
> > >> --
> > >> Best Regards,
> > >>
> > >> Qingsheng Ren
> > >>
> > >> Real-time Computing Team
> > >> Alibaba Cloud
> > >>
> > >> Email: renqschn@gmail.com
> >


Re: [DISCUSS] FLIP-221 Abstraction for lookup source cache and metric

Posted by Leonard Xu <xb...@gmail.com>.
Thanks Qingsheng and Alexander for the update. The current API and options design of this FLIP looks good enough from my side.
If there are no more concerns on this thread, I think we can start a VOTE thread later.

Best,
Leonard


> On May 18, 2022, at 17:04, Qingsheng Ren <re...@gmail.com> wrote:
> 
> Hi Jark and Alexander,
> 
> Thanks for your comments! I’m also OK with introducing common table options. I
> prefer to introduce a new DefaultLookupCacheOptions class for holding these
> option definitions, because putting all options into FactoryUtil would make
> it a bit “crowded” and not well categorized.
> 
> FLIP has been updated according to suggestions above:
> 1. Use static “of” method for constructing RescanRuntimeProvider
> considering both arguments are required.
> 2. Introduce new table options matching DefaultLookupCacheFactory
> 
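A rough sketch of what such a holder class could look like (keys and defaults below are illustrative assumptions, not the FLIP's final definitions; in Flink these would be `ConfigOption<?>` constants built with `ConfigOptions.key(...)` rather than plain constants):

```java
import java.time.Duration;

/** Hypothetical holder of common lookup-cache table options (illustrative only). */
public final class DefaultLookupCacheOptionsSketch {
    private DefaultLookupCacheOptionsSketch() {}

    // In Flink these would be ConfigOption<?> constants, e.g.
    // ConfigOptions.key("lookup.cache.max-rows").longType().defaultValue(...).
    public static final String MAX_ROWS_KEY = "lookup.cache.max-rows";
    public static final long MAX_ROWS_DEFAULT = 10_000L;

    public static final String EXPIRE_AFTER_WRITE_KEY = "lookup.cache.expire-after-write";
    public static final Duration EXPIRE_AFTER_WRITE_DEFAULT = Duration.ofMinutes(10);

    public static final String CACHE_MISSING_KEY_KEY = "lookup.cache.cache-missing-key";
    public static final boolean CACHE_MISSING_KEY_DEFAULT = true;
}
```

Centralizing the definitions this way is what prevents connectors from drifting toward differently named but equivalent options.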
> Best,
> Qingsheng
> 
> On Wed, May 18, 2022 at 2:57 PM Jark Wu <im...@gmail.com> wrote:
> 
>> Hi Alex,
>> 
>> 1) retry logic
>> I think we can extract some common retry logic into utilities, e.g.
>> RetryUtils#tryTimes(times, call).
>> This seems independent of this FLIP and can be reused by DataStream users.
>> Maybe we can open an issue to discuss this and where to put it.
>> 
>> 2) cache ConfigOptions
>> I'm fine with defining cache config options in the framework.
>> A candidate place to put is FactoryUtil which also includes
>> "sink.parallelism", "format" options.
>> 
>> Best,
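The `RetryUtils#tryTimes(times, call)` utility mentioned in point 1 could look roughly like this (a hypothetical API sketch; a real version would likely add backoff and a retriable-exception filter):

```java
import java.util.concurrent.Callable;

/** Sketch of a generic retry helper in the spirit of RetryUtils#tryTimes(times, call). */
public final class RetryUtils {
    private RetryUtils() {}

    /** Invokes {@code call} up to {@code times} times, rethrowing the last failure. */
    public static <T> T tryTimes(int times, Callable<T> call) throws Exception {
        if (times <= 0) {
            throw new IllegalArgumentException("times must be positive, got " + times);
        }
        Exception last = null;
        for (int attempt = 1; attempt <= times; attempt++) {
            try {
                return call.call();
            } catch (Exception e) {
                last = e; // a real version might back off or re-open a connection here
            }
        }
        throw last;
    }
}
```

A connector would then call something like `RetryUtils.tryTimes(3, () -> queryBackend(key))`, where `queryBackend` stands in for the connector-specific lookup.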
>> Jark
>> 
>> 
>> On Wed, 18 May 2022 at 13:52, Александр Смирнов <sm...@gmail.com>
>> wrote:
>> 
>>> Hi Qingsheng,
>>> 
>>> Thank you for considering my comments.
>>> 
>>>> there might be custom logic before making retry, such as re-establish
>>> the connection
>>> 
>>> Yes, I understand that. I meant that such logic can be placed in a
>>> separate function, that can be implemented by connectors. Just moving
>>> the retry logic would make connector's LookupFunction more concise +
>>> avoid duplicate code. However, it's a minor change. The decision is up
>>> to you.
>>> 
>>>> We decided not to provide common DDL options and to let developers
>>> define their own options per connector, as we do now.
>>> 
>>> What is the reason for that? One of the main goals of this FLIP was to
>>> unify the configs, wasn't it? I understand that current cache design
>>> doesn't depend on ConfigOptions, like was before. But still we can put
>>> these options into the framework, so connectors can reuse them and
>>> avoid code duplication, and, what is more significant, avoid possible
>>> different options naming. This moment can be pointed out in
>>> documentation for connector developers.
>>> 
>>> Best regards,
>>> Alexander
>>> 
>>> On Tue, May 17, 2022 at 17:11, Qingsheng Ren <re...@gmail.com> wrote:
>>>> 
>>>> Hi Alexander,
>>>> 
>>>> Thanks for the review and glad to see we are on the same page! I think
>>> you forgot to cc the dev mailing list so I’m also quoting your reply under
>>> this email.
>>>> 
>>>>> We can add 'maxRetryTimes' option into this class
>>>> 
>>>> In my opinion the retry logic should be implemented in lookup() instead
>>> of in LookupFunction#eval(). Retrying is only meaningful under some
>>> specific retriable failures, and there might be custom logic before making
>>> retry, such as re-establish the connection (JdbcRowDataLookupFunction is an
>>> example), so it's more handy to leave it to the connector.
>>>> 
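A simplified model of that division of labor (not the actual FLIP-221 classes, which work on RowData and collect into the join): eval() simply delegates to the abstract lookup(), and any retry or reconnect logic stays inside the connector's lookup() implementation.

```java
import java.io.IOException;
import java.util.Collection;

/** Simplified model of the proposed LookupFunction: eval() delegates to lookup(). */
public abstract class SimpleLookupFunction<K, V> {

    /** Connector-specific lookup; may retry or re-establish connections internally. */
    public abstract Collection<V> lookup(K key) throws IOException;

    /** Framework entry point; in Flink this would emit RowData instead of returning. */
    public final Collection<V> eval(K key) {
        try {
            return lookup(key);
        } catch (IOException e) {
            throw new RuntimeException("Failed to look up key: " + key, e);
        }
    }
}
```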
>>>>> I don't see DDL options, that were in previous version of FLIP. Do
>>> you have any special plans for them?
>>>> 
>>>> We decided not to provide common DDL options and to let developers
>>> define their own options per connector, as we do now.
>>>> 
>>>> The rest of comments sound great and I’ll update the FLIP. Hope we can
>>> finalize our proposal soon!
>>>> 
>>>> Best,
>>>> 
>>>> Qingsheng
>>>> 
>>>> 
>>>>> On May 17, 2022, at 13:46, Александр Смирнов <sm...@gmail.com>
>>> wrote:
>>>>> 
>>>>> Hi Qingsheng and devs!
>>>>> 
>>>>> I like the overall design of updated FLIP, however I have several
>>>>> suggestions and questions.
>>>>> 
>>>>> 1) Introducing LookupFunction as a subclass of TableFunction is a good
>>>>> idea. We can add 'maxRetryTimes' option into this class. 'eval' method
>>>>> of new LookupFunction is great for this purpose. The same is for
>>>>> 'async' case.
>>>>> 
>>>>> 2) There might be other configs in future, such as 'cacheMissingKey'
>>>>> in LookupFunctionProvider or 'rescanInterval' in ScanRuntimeProvider.
>>>>> Maybe use Builder pattern in LookupFunctionProvider and
>>>>> RescanRuntimeProvider for more flexibility (use one 'build' method
>>>>> instead of many 'of' methods in future)?
>>>>> 
>>>>> 3) What are the plans for existing TableFunctionProvider and
>>>>> AsyncTableFunctionProvider? I think they should be deprecated.
>>>>> 
>>>>> 4) Am I right that the current design does not assume usage of
>>>>> user-provided LookupCache in re-scanning? In this case, it is not very
>>>>> clear why do we need methods such as 'invalidate' or 'putAll' in
>>>>> LookupCache.
>>>>> 
>>>>> 5) I don't see DDL options, that were in previous version of FLIP. Do
>>>>> you have any special plans for them?
>>>>> 
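The builder pattern proposed in suggestion 2 could look roughly like this (names and defaults are illustrative only, not the FLIP's API):

```java
/** Hypothetical builder-style provider: one build() instead of many of() overloads. */
public final class LookupFunctionProviderSketch {
    private final boolean cacheMissingKey;
    private final int maxRetryTimes;

    private LookupFunctionProviderSketch(Builder b) {
        this.cacheMissingKey = b.cacheMissingKey;
        this.maxRetryTimes = b.maxRetryTimes;
    }

    public boolean isCacheMissingKey() { return cacheMissingKey; }
    public int getMaxRetryTimes() { return maxRetryTimes; }

    public static Builder newBuilder() { return new Builder(); }

    public static final class Builder {
        private boolean cacheMissingKey = true; // illustrative defaults
        private int maxRetryTimes = 3;

        public Builder cacheMissingKey(boolean v) { cacheMissingKey = v; return this; }
        public Builder maxRetryTimes(int v) { maxRetryTimes = v; return this; }

        public LookupFunctionProviderSketch build() {
            return new LookupFunctionProviderSketch(this);
        }
    }
}
```

New options can then be added later without breaking existing call sites, which is the flexibility argument made above.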
>>>>> If you don't mind, I would be glad to be able to make small
>>>>> adjustments to the FLIP document too. I think it's worth mentioning
>>>>> about what exactly optimizations are planning in the future.
>>>>> 
>>>>> Best regards,
>>>>> Smirnov Alexander
>>>>> 
>>>>> On Fri, May 13, 2022 at 20:27, Qingsheng Ren <re...@gmail.com> wrote:
>>>>>> 
>>>>>> Hi Alexander and devs,
>>>>>> 
>>>>>> Thank you very much for the in-depth discussion! As Jark mentioned
>>> we were inspired by Alexander's idea and made a refactor on our design.
>>> FLIP-221 [1] has been updated to reflect our design now and we are happy to
>>> hear more suggestions from you!
>>>>>> 
>>>>>> Compared to the previous design:
>>>>>> 1. The lookup cache serves at table runtime level and is integrated
>>> as a component of LookupJoinRunner as discussed previously.
>>>>>> 2. Interfaces are renamed and re-designed to reflect the new design.
>>>>>> 3. We handle the all-caching case separately and introduce a new
>>> RescanRuntimeProvider to reuse the scanning ability. We are planning to
>>> support SourceFunction / InputFormat for now considering the complexity of
>>> FLIP-27 Source API.
>>>>>> 4. A new interface LookupFunction is introduced to make the semantics
>>> of lookup more straightforward for developers.
>>>>>> 
>>>>>> For replying to Alexander:
>>>>>>> However I'm a little confused whether InputFormat is deprecated or
>>> not. Am I right that it will be so in the future, but currently it's not?
>>>>>> Yes you are right. InputFormat is not deprecated for now. I think it
>>> will be deprecated in the future but we don't have a clear plan for that.
>>>>>> 
>>>>>> Thanks again for the discussion on this FLIP and looking forward to
>>> cooperating with you after we finalize the design and interfaces!
>>>>>> 
>>>>>> [1]
>>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-221+Abstraction+for+lookup+source+cache+and+metric
>>>>>> 
>>>>>> Best regards,
>>>>>> 
>>>>>> Qingsheng
>>>>>> 
>>>>>> 
>>>>>> On Fri, May 13, 2022 at 12:12 AM Александр Смирнов <
>>> smiralexan@gmail.com> wrote:
>>>>>>> 
>>>>>>> Hi Jark, Qingsheng and Leonard!
>>>>>>> 
>>>>>>> Glad to see that we came to a consensus on almost all points!
>>>>>>> 
>>>>>>> However I'm a little confused whether InputFormat is deprecated or
>>>>>>> not. Am I right that it will be so in the future, but currently it's
>>>>>>> not? Actually I also think that for the first version it's OK to use
>>>>>>> InputFormat in ALL cache realization, because supporting rescan
>>>>>>> ability seems like a very distant prospect. But for this decision we
>>>>>>> need a consensus among all discussion participants.
>>>>>>> 
>>>>>>> In general, I don't have something to argue with your statements.
>>> All
>>>>>>> of them correspond my ideas. Looking ahead, it would be nice to work
>>>>>>> on this FLIP cooperatively. I've already done a lot of work on
>>> lookup
>>>>>>> join caching with an implementation very close to the one we are
>>> discussing,
>>>>>>> and want to share the results of this work. Anyway looking forward
>>> for
>>>>>>> the FLIP update!
>>>>>>> 
>>>>>>> Best regards,
>>>>>>> Smirnov Alexander
>>>>>>> 
>>>>>>>> On Thu, May 12, 2022 at 17:38, Jark Wu <im...@gmail.com> wrote:
>>>>>>>> 
>>>>>>>> Hi Alex,
>>>>>>>> 
>>>>>>>> Thanks for summarizing your points.
>>>>>>>> 
>>>>>>>> In the past week, Qingsheng, Leonard, and I have discussed it
>>> several times
>>>>>>>> and we have totally refactored the design.
>>>>>>>> I'm glad to say we have reached a consensus on many of your points!
>>>>>>>> Qingsheng is still working on updating the design docs, which may
>>> be
>>>>>>>> available in the next few days.
>>>>>>>> I will share some conclusions from our discussions:
>>>>>>>> 
>>>>>>>> 1) we have refactored the design towards to "cache in framework"
>>> way.
>>>>>>>> 
>>>>>>>> 2) a "LookupCache" interface for users to customize and a default
>>>>>>>> implementation with builder for users to easy-use.
>>>>>>>> This can both make it possible to both have flexibility and
>>> conciseness.
>>>>>>>> 
>>>>>>>> 3) Filter pushdown is important for ALL and LRU lookup cache, esp
>>> reducing
>>>>>>>> IO.
>>>>>>>> Filter pushdown should be the final state and the unified way to
>>> both
>>>>>>>> support pruning ALL cache and LRU cache,
>>>>>>>> so I think we should make effort in this direction. If we need to
>>> support
>>>>>>>> filter pushdown for ALL cache anyway, why not use
>>>>>>>> it for LRU cache as well? Either way, as we decide to implement
>>> the cache
>>>>>>>> in the framework, we have the chance to support
>>>>>>>> filter on cache anytime. This is an optimization and it doesn't
>>> affect the
>>>>>>>> public API. I think we can create a JIRA issue to
>>>>>>>> discuss it when the FLIP is accepted.
>>>>>>>> 
>>>>>>>> 4) The idea to support ALL cache is similar to your proposal.
>>>>>>>> In the first version, we will only support InputFormat,
>>> SourceFunction for
>>>>>>>> cache all (invoke InputFormat in join operator).
>>>>>>>> For FLIP-27 source, we need to join a true source operator instead
>>> of
>>>>>>>> calling it embedded in the join operator.
>>>>>>>> However, this needs another FLIP to support the re-scan ability
>>> for FLIP-27
>>>>>>>> Source, and this can be a large work.
>>>>>>>> In order to not block this issue, we can put the effort of FLIP-27
>>> source
>>>>>>>> integration into future work and integrate
>>>>>>>> InputFormat&SourceFunction for now.
>>>>>>>> 
>>>>>>>> I think it's fine to use InputFormat&SourceFunction, as they are
>>> not
>>>>>>>> deprecated; otherwise, we would have to introduce another function
>>>>>>>> similar to them, which would be meaningless. We need to plan FLIP-27
>>> source
>>>>>>>> integration ASAP before InputFormat & SourceFunction are
>>> deprecated.
>>>>>>>> 
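A minimal model of conclusion 2 above, a customizable cache interface with a default LRU implementation (the shape and names are illustrative, not the proposed public API):

```java
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.Map;

/** Minimal model of a customizable lookup cache plus a default LRU implementation. */
public interface LookupCacheSketch<K, V> {
    Collection<V> getIfPresent(K key);
    void put(K key, Collection<V> rows);

    /** Default LRU implementation backed by an access-ordered LinkedHashMap. */
    static <K, V> LookupCacheSketch<K, V> lru(int maxEntries) {
        Map<K, Collection<V>> map = new LinkedHashMap<K, Collection<V>>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, Collection<V>> eldest) {
                return size() > maxEntries; // evict the least recently used entry
            }
        };
        return new LookupCacheSketch<K, V>() {
            @Override public Collection<V> getIfPresent(K key) { return map.get(key); }
            @Override public void put(K key, Collection<V> rows) { map.put(key, rows); }
        };
    }
}
```

Users who need a different strategy implement the interface themselves; everyone else configures the default, which is how flexibility and conciseness can coexist.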
>>>>>>>> Best,
>>>>>>>> Jark
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Thu, 12 May 2022 at 15:46, Александр Смирнов <
>>> smiralexan@gmail.com>
>>>>>>>> wrote:
>>>>>>>> 
>>>>>>>>> Hi Martijn!
>>>>>>>>> 
>>>>>>>>> Got it. Therefore, the InputFormat-based implementation is not
>>> considered.
>>>>>>>>> Thanks for clearing that up!
>>>>>>>>> 
>>>>>>>>> Best regards,
>>>>>>>>> Smirnov Alexander
>>>>>>>>> 
>>>>>>>>> On Thu, May 12, 2022 at 14:23, Martijn Visser <martijn@ververica.com
>>>> :
>>>>>>>>>> 
>>>>>>>>>> Hi,
>>>>>>>>>> 
>>>>>>>>>> With regards to:
>>>>>>>>>> 
>>>>>>>>>>> But if there are plans to refactor all connectors to FLIP-27
>>>>>>>>>> 
>>>>>>>>>> Yes, FLIP-27 is the target for all connectors. The old
>>> interfaces will be
>>>>>>>>>> deprecated and connectors will either be refactored to use the
>>> new ones
>>>>>>>>> or
>>>>>>>>>> dropped.
>>>>>>>>>> 
>>>>>>>>>> The caching should work for connectors that are using FLIP-27
>>> interfaces,
>>>>>>>>>> we should not introduce new features for old interfaces.
>>>>>>>>>> 
>>>>>>>>>> Best regards,
>>>>>>>>>> 
>>>>>>>>>> Martijn
>>>>>>>>>> 
>>>>>>>>>> On Thu, 12 May 2022 at 06:19, Александр Смирнов <
>>> smiralexan@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>> 
>>>>>>>>>>> Hi Jark!
>>>>>>>>>>> 
>>>>>>>>>>> Sorry for the late response. I would like to make some comments
>>> and
>>>>>>>>>>> clarify my points.
>>>>>>>>>>> 
>>>>>>>>>>> 1) I agree with your first statement. I think we can achieve
>>> both
>>>>>>>>>>> advantages this way: put the Cache interface in
>>> flink-table-common,
>>>>>>>>>>> but have implementations of it in flink-table-runtime.
>>> Therefore if a
>>>>>>>>>>> connector developer wants to use existing cache strategies and
>>> their
>>>>>>>>>>> implementations, he can just pass lookupConfig to the planner,
>>> but if
>>>>>>>>>>> he wants to have its own cache implementation in his
>>> TableFunction, it
>>>>>>>>>>> will be possible for him to use the existing interface for this
>>>>>>>>>>> purpose (we can explicitly point this out in the
>>> documentation). In
>>>>>>>>>>> this way all configs and metrics will be unified. WDYT?
>>>>>>>>>>> 
>>>>>>>>>>>> If a filter can prune 90% of data in the cache, we will have
>>> 90% of
>>>>>>>>>>> lookup requests that can never be cached
>>>>>>>>>>> 
>>>>>>>>>>> 2) Let me clarify the logic filters optimization in case of LRU
>>> cache.
>>>>>>>>>>> It looks like Cache<RowData, Collection<RowData>>. Here we
>>> always
>>>>>>>>>>> store the response of the dimension table in cache, even after
>>>>>>>>>>> applying calc function. I.e. if there are no rows after applying
>>>>>>>>>>> filters to the result of the 'eval' method of TableFunction, we
>>> store
>>>>>>>>>>> the empty list by lookup keys. Therefore the cache line will be
>>>>>>>>>>> filled, but will require much less memory (in bytes). I.e. we
>>> don't
>>>>>>>>>>> completely filter keys, by which result was pruned, but
>>> significantly
>>>>>>>>>>> reduce required memory to store this result. If the user knows
>>> about
>>>>>>>>>>> this behavior, he can increase the 'max-rows' option before the
>>> start
>>>>>>>>>>> of the job. But actually I came up with the idea that we can do
>>> this
>>>>>>>>>>> automatically by using the 'maximumWeight' and 'weigher'
>>> methods of
>>>>>>>>>>> GuavaCache [1]. Weight can be the size of the collection of rows
>>>>>>>>>>> (value of cache). Therefore cache can automatically fit much
>>> more
>>>>>>>>>>> records than before.
>>>>>>>>>>> 
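The weigher idea above, rewritten with plain JDK collections instead of Guava so it is self-contained (a toy sketch, not production code): the capacity budget counts stored rows rather than keys, so keys whose results were pruned by the calc function cost almost nothing.

```java
import java.util.ArrayDeque;
import java.util.Collection;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

/** Toy weight-bounded cache: capacity counts stored rows, not keys. */
public class RowWeightedCache<K, V> {
    private final long maxTotalRows;
    private long totalRows = 0;
    private final Map<K, Collection<V>> map = new HashMap<>();
    private final Deque<K> insertionOrder = new ArrayDeque<>();

    public RowWeightedCache(long maxTotalRows) { this.maxTotalRows = maxTotalRows; }

    private static int weight(Collection<?> rows) {
        return Math.max(1, rows.size()); // like a Guava Weigher; empty results cost 1
    }

    public void put(K key, Collection<V> rows) {
        Collection<V> old = map.put(key, rows);
        if (old != null) {
            totalRows -= weight(old);
        } else {
            insertionOrder.addLast(key);
        }
        totalRows += weight(rows);
        // Evict oldest entries until the total row weight fits the budget again.
        while (totalRows > maxTotalRows && insertionOrder.size() > 1) {
            K eldest = insertionOrder.removeFirst();
            totalRows -= weight(map.remove(eldest));
        }
    }

    public Collection<V> get(K key) { return map.get(key); }

    public long totalRows() { return totalRows; }
}
```

With Guava this is exactly `CacheBuilder.newBuilder().maximumWeight(...).weigher((k, rows) -> rows.size())`, as the message's reference [1] describes.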
>>>>>>>>>>>> Flink SQL has provided a standard way to do filters and
>>> projections
>>>>>>>>>>> pushdown, i.e., SupportsFilterPushDown and
>>> SupportsProjectionPushDown.
>>>>>>>>>>>> Jdbc/hive/HBase haven't implemented the interfaces; that doesn't mean
>>> it's
>>>>>>>>> hard
>>>>>>>>>>> to implement.
>>>>>>>>>>> 
>>>>>>>>>>> It's debatable how difficult it will be to implement filter
>>> pushdown.
>>>>>>>>>>> But I think the fact that currently there is no database
>>> connector
>>>>>>>>>>> with filter pushdown at least means that this feature won't be
>>>>>>>>>>> supported soon in connectors. Moreover, if we talk about other
>>>>>>>>>>> connectors (not in Flink repo), their databases might not
>>> support all
>>>>>>>>>>> Flink filters (or not support filters at all). I think users are
>>>>>>>>>>> interested in supporting cache filters optimization
>>> independently of
>>>>>>>>>>> supporting other features and solving more complex problems (or
>>>>>>>>>>> unsolvable at all).
>>>>>>>>>>> 
>>>>>>>>>>> 3) I agree with your third statement. Actually in our internal
>>> version
>>>>>>>>>>> I also tried to unify the logic of scanning and reloading data
>>> from
>>>>>>>>>>> connectors. But unfortunately, I didn't find a way to unify the
>>> logic
>>>>>>>>>>> of all ScanRuntimeProviders (InputFormat, SourceFunction,
>>> Source,...)
>>>>>>>>>>> and reuse it in reloading ALL cache. As a result I settled on
>>> using
>>>>>>>>>>> InputFormat, because it was used for scanning in all lookup
>>>>>>>>>>> connectors. (I didn't know that there are plans to deprecate
>>>>>>>>>>> InputFormat in favor of FLIP-27 Source). IMO usage of FLIP-27
>>> source
>>>>>>>>>>> in ALL caching is not good idea, because this source was
>>> designed to
>>>>>>>>>>> work in distributed environment (SplitEnumerator on JobManager
>>> and
>>>>>>>>>>> SourceReaders on TaskManagers), not in one operator (lookup join
>>>>>>>>>>> operator in our case). There is not even a direct way to pass
>>> splits from
>>>>>>>>>>> SplitEnumerator to SourceReader (this logic works through
>>>>>>>>>>> SplitEnumeratorContext, which requires
>>>>>>>>>>> OperatorCoordinator.SubtaskGateway to send AddSplitEvents).
>>> Usage of
>>>>>>>>>>> InputFormat for ALL cache seems much more clearer and easier.
>>> But if
>>>>>>>>>>> there are plans to refactor all connectors to FLIP-27, I have
>>> the
>>>>>>>>>>> following ideas: maybe we can drop the lookup join ALL cache
>>> in
>>>>>>>>>>> favor of a simple join with repeated scanning of the batch source?
>>> The point
>>>>>>>>>>> is that the only difference between lookup join ALL cache and
>>> simple
>>>>>>>>>>> join with batch source is that in the first case scanning is
>>> performed
>>>>>>>>>>> multiple times, in between which state (cache) is cleared
>>> (correct me
>>>>>>>>>>> if I'm wrong). So what if we extend the functionality of simple
>>> join
>>>>>>>>>>> to support state reloading + extend the functionality of
>>> scanning
>>>>>>>>>>> batch source multiple times (this one should be easy with new
>>> FLIP-27
>>>>>>>>>>> source, that unifies streaming/batch reading - we will need to
>>> change
>>>>>>>>>>> only SplitEnumerator, which will pass splits again after some
>>> TTL).
>>>>>>>>>>> WDYT? I must say that this looks like a long-term goal and will
>>> make
>>>>>>>>>>> the scope of this FLIP even larger than you said. Maybe we can
>>> limit
>>>>>>>>>>> ourselves to a simpler solution now (InputFormats).
>>>>>>>>>>> 
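A toy model of the ALL-cache semantics discussed above, i.e. a periodic full rescan that clears and replaces the previous snapshot (a scheduler invoking reload() on a TTL is omitted for brevity; all names are hypothetical):

```java
import java.util.Map;
import java.util.function.Supplier;

/** Toy model of an ALL cache: a full rescan replaces the previous snapshot. */
public class AllCacheSketch<K, V> {
    private volatile Map<K, V> snapshot = Map.of();
    private final Supplier<Map<K, V>> scanAll; // stands in for an InputFormat full scan

    public AllCacheSketch(Supplier<Map<K, V>> scanAll) { this.scanAll = scanAll; }

    /** Clear-and-replace reload, as a TTL-driven rescan would do. */
    public void reload() { snapshot = scanAll.get(); }

    public V lookup(K key) { return snapshot.get(key); }
}
```

The "simple join with multiple scanning" idea above differs mainly in that the rescan would be a real source operator instead of the embedded `scanAll` call here.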
>>>>>>>>>>> So to sum up, my points is like this:
>>>>>>>>>>> 1) There is a way to make both concise and flexible interfaces
>>> for
>>>>>>>>>>> caching in lookup join.
>>>>>>>>>>> 2) Cache filters optimization is important both in LRU and ALL
>>> caches.
>>>>>>>>>>> 3) It is unclear when filter pushdown will be supported in Flink
>>>>>>>>>>> connectors, some of the connectors might not have the
>>> opportunity to
>>>>>>>>>>> support filter pushdown + as I know, currently filter pushdown
>>> works
>>>>>>>>>>> only for scanning (not lookup). So cache filters + projections
>>>>>>>>>>> optimization should be independent from other features.
>>>>>>>>>>> 4) The ALL cache implementation is a complex topic that involves
>>> multiple
>>>>>>>>>>> aspects of how Flink is developing. Abandoning InputFormat
>>> in favor
>>>>>>>>>>> of FLIP-27 Source will make the ALL cache implementation really
>>> complex and
>>>>>>>>>>> not clear, so maybe instead of that we can extend the
>>> functionality of
>>>>>>>>>>> simple join, or keep InputFormat in the case of the lookup
>>> join ALL
>>>>>>>>>>> cache?
>>>>>>>>>>> 
>>>>>>>>>>> Best regards,
>>>>>>>>>>> Smirnov Alexander
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> [1]
>>>>>>>>>>> 
>>>>>>>>> 
>>> https://guava.dev/releases/18.0/api/docs/com/google/common/cache/CacheBuilder.html#weigher(com.google.common.cache.Weigher)
>>>>>>>>>>> 
>>>>>>>>>>>> On Thu, May 5, 2022 at 20:34, Jark Wu <im...@gmail.com> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>> It's great to see the active discussion! I want to share my
>>> ideas:
>>>>>>>>>>>> 
>>>>>>>>>>>> 1) implement the cache in framework vs. connectors base
>>>>>>>>>>>> I don't have a strong opinion on this. Both ways should work
>>> (e.g.,
>>>>>>>>> cache
>>>>>>>>>>>> pruning, compatibility).
>>>>>>>>>>>> The framework way can provide more concise interfaces.
>>>>>>>>>>>> The connector base way can define more flexible cache
>>>>>>>>>>>> strategies/implementations.
>>>>>>>>>>>> We are still investigating a way to see if we can have both
>>>>>>>>> advantages.
>>>>>>>>>>>> We should reach a consensus that the way should be a final
>>> state,
>>>>>>>>> and we
>>>>>>>>>>>> are on the path to it.
>>>>>>>>>>>> 
>>>>>>>>>>>> 2) filters and projections pushdown:
>>>>>>>>>>>> I agree with Alex that the filter pushdown into cache can
>>> benefit a
>>>>>>>>> lot
>>>>>>>>>>> for
>>>>>>>>>>>> ALL cache.
>>>>>>>>>>>> However, this is not true for LRU cache. Connectors use cache
>>> to
>>>>>>>>> reduce
>>>>>>>>>>> IO
>>>>>>>>>>>> requests to databases for better throughput.
>>>>>>>>>>>> If a filter can prune 90% of data in the cache, we will have
>>> 90% of
>>>>>>>>>>> lookup
>>>>>>>>>>>> requests that can never be cached
>>>>>>>>>>>> and hit directly to the databases. That means the cache is
>>>>>>>>> meaningless in
>>>>>>>>>>>> this case.
>>>>>>>>>>>> 
>>>>>>>>>>>> IMO, Flink SQL has provided a standard way to do filters and
>>> projections
>>>>>>>>>>>> pushdown, i.e., SupportsFilterPushDown and
>>>>>>>>> SupportsProjectionPushDown.
>>>>>>>>>>>> Jdbc/hive/HBase haven't implemented the interfaces; that doesn't mean
>>> it's
>>>>>>>>> hard
>>>>>>>>>>> to
>>>>>>>>>>>> implement.
>>>>>>>>>>>> They should implement the pushdown interfaces to reduce IO and
>>> the
>>>>>>>>> cache
>>>>>>>>>>>> size.
>>>>>>>>>>>> That should be a final state that the scan source and lookup
>>> source
>>>>>>>>> share
>>>>>>>>>>>> the exact pushdown implementation.
>>>>>>>>>>>> I don't see why we need to duplicate the pushdown logic in
>>> caches,
>>>>>>>>> which
>>>>>>>>>>>> will complex the lookup join design.
>>>>>>>>>>>> 
>>>>>>>>>>>> 3) ALL cache abstraction
>>>>>>>>>>>> All cache might be the most challenging part of this FLIP. We
>>> have
>>>>>>>>> never
>>>>>>>>>>>> provided a reload-lookup public interface.
>>>>>>>>>>>> Currently, we put the reload logic in the "eval" method of
>>>>>>>>> TableFunction.
>>>>>>>>>>>> That's hard for some sources (e.g., Hive).
>>>>>>>>>>>> Ideally, connector implementation should share the logic of
>>> reload
>>>>>>>>> and
>>>>>>>>>>>> scan, i.e. ScanTableSource with
>>> InputFormat/SourceFunction/FLIP-27
>>>>>>>>>>> Source.
>>>>>>>>>>>> However, InputFormat/SourceFunction are deprecated, and the
>>> FLIP-27
>>>>>>>>>>> source
>>>>>>>>>>>> is deeply coupled with SourceOperator.
>>>>>>>>>>>> If we want to invoke the FLIP-27 source in LookupJoin, this
>>> may make
>>>>>>>>> the
>>>>>>>>>>>> scope of this FLIP much larger.
>>>>>>>>>>>> We are still investigating how to abstract the ALL cache logic
>>> and
>>>>>>>>> reuse
>>>>>>>>>>>> the existing source interfaces.
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> Best,
>>>>>>>>>>>> Jark
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> On Thu, 5 May 2022 at 20:22, Roman Boyko <ro.v.boyko@gmail.com
>>>> 
>>>>>>>>> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>>> It's a much more complicated effort and lies outside the
>>> scope of
>>>>>>>>> this
>>>>>>>>>>>>> improvement. Because such pushdowns should be done for all
>>>>>>>>>>> ScanTableSource
>>>>>>>>>>>>> implementations (not only for Lookup ones).
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On Thu, 5 May 2022 at 19:02, Martijn Visser <
>>>>>>>>> martijnvisser@apache.org>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Hi everyone,
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> One question regarding "And Alexander correctly mentioned
>>> that
>>>>>>>>> filter
>>>>>>>>>>>>>> pushdown still is not implemented for jdbc/hive/hbase." ->
>>> Would
>>>>>>>>> an
>>>>>>>>>>>>>> alternative solution be to actually implement these filter
>>>>>>>>> pushdowns?
>>>>>>>>>>> I
>>>>>>>>>>>>>> can
>>>>>>>>>>>>>> imagine that there are many more benefits to doing that,
>>> outside
>>>>>>>>> of
>>>>>>>>>>> lookup
>>>>>>>>>>>>>> caching and metrics.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Martijn Visser
>>>>>>>>>>>>>> https://twitter.com/MartijnVisser82
>>>>>>>>>>>>>> https://github.com/MartijnVisser
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On Thu, 5 May 2022 at 13:58, Roman Boyko <
>>> ro.v.boyko@gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Hi everyone!
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Thanks for driving such a valuable improvement!
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> I do think that a single cache implementation would be a nice
>>>>>>>>>>> opportunity
>>>>>>>>>>>>>> for
>>>>>>>>>>>>>>> users. And it will break the "FOR SYSTEM_TIME AS OF
>>> proc_time"
>>>>>>>>>>> semantics
>>>>>>>>>>>>>>> anyway, no matter how it is implemented.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Putting myself in the user's shoes, I can say that:
>>>>>>>>>>>>>>> 1) I would prefer to have the opportunity to cut off the
>>> cache
>>>>>>>>> size
>>>>>>>>>>> by
>>>>>>>>>>>>>>> simply filtering unnecessary data. And the most handy way
>>> to do
>>>>>>>>> it
>>>>>>>>>>> is
>>>>>>>>>>>>>> apply
>>>>>>>>>>>>>>> it inside LookupRunners. It would be a bit harder to pass it
>>>>>>>>>>> through the
>>>>>>>>>>>>>>> LookupJoin node to TableFunction. And Alexander correctly
>>>>>>>>> mentioned
>>>>>>>>>>> that
>>>>>>>>>>>>>>> filter pushdown still is not implemented for
>>> jdbc/hive/hbase.
>>>>>>>>>>>>>>> 2) The ability to set the different caching parameters for
>>>>>>>>> different
>>>>>>>>>>>>>> tables
>>>>>>>>>>>>>>> is quite important. So I would prefer to set it through DDL
>>>>>>>>> rather
>>>>>>>>>>> than
>>>>>>>>>>>>>>> have the same TTL, strategy and other options for all lookup
>>>>>>>>> tables.
>>>>>>>>>>>>>>> 3) Putting the cache into the framework really deprives
>>> us of
>>>>>>>>>>>>>>> extensibility (users won't be able to implement their own
>>>>>>>>> cache).
>>>>>>>>>>> But
>>>>>>>>>>>>>> most
>>>>>>>>>>>>>>> probably it might be solved by creating more different cache
>>>>>>>>>>> strategies
>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>> a wider set of configurations.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> All these points are much closer to the schema proposed by
>>>>>>>>>>> Alexander.
>>>>>>>>>>>>>>> Qingsheng Ren, please correct me if I'm wrong and all
>>> these
>>>>>>>>>>>>>> facilities
>>>>>>>>>>>>>>> might be simply implemented in your architecture?
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>>> Roman Boyko
>>>>>>>>>>>>>>> e.: ro.v.boyko@gmail.com
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> On Wed, 4 May 2022 at 21:01, Martijn Visser <
>>>>>>>>>>> martijnvisser@apache.org>
>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Hi everyone,
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> I don't have much to chip in, but just wanted to express
>>> that
>>>>>>>>> I
>>>>>>>>>>> really
>>>>>>>>>>>>>>>> appreciate the in-depth discussion on this topic and I hope
>>>>>>>>> that
>>>>>>>>>>>>>> others
>>>>>>>>>>>>>>>> will join the conversation.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Martijn
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> On Tue, 3 May 2022 at 10:15, Александр Смирнов <
>>>>>>>>>>> smiralexan@gmail.com>
>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Hi Qingsheng, Leonard and Jark,
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Thanks for your detailed feedback! However, I have
>>> questions
>>>>>>>>>>> about
>>>>>>>>>>>>>>>>> some of your statements (maybe I didn't get something?).
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Caching actually breaks the semantic of "FOR SYSTEM_TIME
>>>>>>>>> AS OF
>>>>>>>>>>>>>>>> proc_time”
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> I agree that the semantics of "FOR SYSTEM_TIME AS OF
>>>>>>>>> proc_time"
>>>>>>>>>>> is
>>>>>>>>>>>>>> not
>>>>>>>>>>>>>>>>> fully implemented with caching, but as you said, users go
>>>>>>>>> on it
>>>>>>>>>>>>>>>>> consciously to achieve better performance (no one proposed
>>>>>>>>> to
>>>>>>>>>>> enable
>>>>>>>>>>>>>>>>> caching by default, etc.). Or by users do you mean other
>>>>>>>>>>> developers
>>>>>>>>>>>>>> of
>>>>>>>>>>>>>>>>> connectors? In this case developers explicitly specify
>>>>>>>>> whether
>>>>>>>>>>> their
>>>>>>>>>>>>>>>>> connector supports caching or not (in the list of
>>> supported
>>>>>>>>>>>>>> options),
>>>>>>>>>>>>>>>>> no one makes them do that if they don't want to. So what
>>>>>>>>>>> exactly is
>>>>>>>>>>>>>>>>> the difference between implementing caching in modules
>>>>>>>>>>>>>>>>> flink-table-runtime and in flink-table-common from the
>>>>>>>>>>> considered
>>>>>>>>>>>>>>>>> point of view? How does it affect whether we break
>>>>>>>>> the
>>>>>>>>>>>>>>>>> semantics of "FOR SYSTEM_TIME AS OF proc_time"?
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> confront a situation that allows table options in DDL to
>>>>>>>>>>> control
>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>> behavior of the framework, which has never happened
>>>>>>>>> previously
>>>>>>>>>>> and
>>>>>>>>>>>>>>> should
>>>>>>>>>>>>>>>>> be cautious
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> If we talk about main differences of semantics of DDL
>>>>>>>>> options
>>>>>>>>>>> and
>>>>>>>>>>>>>>>>> config options("table.exec.xxx"), isn't it about limiting
>>>>>>>>> the
>>>>>>>>>>> scope
>>>>>>>>>>>>>> of
>>>>>>>>>>>>>>>>> the options + importance for the user business logic
>>> rather
>>>>>>>>> than
>>>>>>>>>>>>>>>>> specific location of corresponding logic in the
>>> framework? I
>>>>>>>>>>> mean
>>>>>>>>>>>>>> that
>>>>>>>>>>>>>>>>> in my design, for example, putting an option with lookup
>>>>>>>>> cache
>>>>>>>>>>>>>>>>> strategy in configurations would be the wrong decision,
>>>>>>>>>>> because it
>>>>>>>>>>>>>>>>> directly affects the user's business logic (not just
>>>>>>>>> performance
>>>>>>>>>>>>>>>>> optimization) + touches just several functions of ONE
>>> table
>>>>>>>>>>> (there
>>>>>>>>>>>>>> can
>>>>>>>>>>>>>>>>> be multiple tables with different caches). Does it really
>>>>>>>>>>> matter for
>>>>>>>>>>>>>>>>> the user (or someone else) where the logic is located,
>>>>>>>>> which is
>>>>>>>>>>>>>>>>> affected by the applied option?
>>>>>>>>>>>>>>>>> Also I can remember DDL option 'sink.parallelism', which
>>> in
>>>>>>>>>>> some way
>>>>>>>>>>>>>>>>> "controls the behavior of the framework" and I don't see
>>> any
>>>>>>>>>>> problem
>>>>>>>>>>>>>>>>> here.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> introduce a new interface for this all-caching scenario
>>>>>>>>> and
>>>>>>>>>>> the
>>>>>>>>>>>>>>> design
>>>>>>>>>>>>>>>>> would become more complex
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> This is a subject for a separate discussion, but actually
>>>>>>>>> in our
>>>>>>>>>>>>>>>>> internal version we solved this problem quite easily - we
>>>>>>>>> reused
>>>>>>>>>>>>>>>>> InputFormat class (so there is no need for a new API). The
>>>>>>>>>>> point is
>>>>>>>>>>>>>>>>> that currently all lookup connectors use InputFormat for
>>>>>>>>>>> scanning
>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>> data in batch mode: HBase, JDBC and even Hive - it uses
>>>>>>>>> class
>>>>>>>>>>>>>>>>> PartitionReader, that is actually just a wrapper around
>>>>>>>>>>> InputFormat.
>>>>>>>>>>>>>>>>> The advantage of this solution is the ability to reload
>>>>>>>>> cache
>>>>>>>>>>> data
>>>>>>>>>>>>>> in
>>>>>>>>>>>>>>>>> parallel (number of threads depends on number of
>>>>>>>>> InputSplits,
>>>>>>>>>>> but
>>>>>>>>>>>>>> has
>>>>>>>>>>>>>>>>> an upper limit). As a result cache reload time
>>> significantly
>>>>>>>>>>> reduces
>>>>>>>>>>>>>>>>> (as well as time of input stream blocking). I know that
>>>>>>>>> usually
>>>>>>>>>>> we
>>>>>>>>>>>>>> try
>>>>>>>>>>>>>>>>> to avoid usage of concurrency in Flink code, but maybe
>>> this
>>>>>>>>> one
>>>>>>>>>>> can
>>>>>>>>>>>>>> be
>>>>>>>>>>>>>>>>> an exception. BTW I don't say that it's an ideal solution,
>>>>>>>>> maybe
>>>>>>>>>>>>>> there
>>>>>>>>>>>>>>>>> are better ones.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Providing the cache in the framework might introduce
>>>>>>>>>>> compatibility
>>>>>>>>>>>>>>>> issues
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> It's possible only in cases when the developer of the
>>>>>>>>> connector
>>>>>>>>>>>>>> won't
>>>>>>>>>>>>>>>>> properly refactor his code and will use new cache options
>>>>>>>>>>>>>> incorrectly
>>>>>>>>>>>>>>>>> (i.e. explicitly provide the same options into 2 different
>>>>>>>>> code
>>>>>>>>>>>>>>>>> places). For correct behavior all he will need to do is to
>>>>>>>>>>> redirect
>>>>>>>>>>>>>>>>> existing options to the framework's LookupConfig (+ maybe
>>>>>>>>> add an
>>>>>>>>>>>>>> alias
>>>>>>>>>>>>>>>>> for options, if there was different naming), everything
>>>>>>>>> will be
>>>>>>>>>>>>>>>>> transparent for users. If the developer won't do
>>>>>>>>> refactoring at
>>>>>>>>>>> all,
>>>>>>>>>>>>>>>>> nothing will be changed for the connector because of
>>>>>>>>> backward
>>>>>>>>>>>>>>>>> compatibility. Also if a developer wants to use his own
>>>>>>>>> cache
>>>>>>>>>>> logic,
>>>>>>>>>>>>>>>>> he just can refuse to pass some of the configs into the
>>>>>>>>>>> framework,
>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>> instead make his own implementation with already existing
>>>>>>>>>>> configs
>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>> metrics (but actually I think that it's a rare case).
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> filters and projections should be pushed all the way down
>>>>>>>>> to
>>>>>>>>>>> the
>>>>>>>>>>>>>>> table
>>>>>>>>>>>>>>>>> function, like what we do in the scan source
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> It's the great purpose. But the truth is that the ONLY
>>>>>>>>> connector
>>>>>>>>>>>>>> that
>>>>>>>>>>>>>>>>> supports filter pushdown is FileSystemTableSource
>>>>>>>>>>>>>>>>> (no database connector supports it currently). Also for
>>> some
>>>>>>>>>>>>>> databases
>>>>>>>>>>>>>>>>> it's simply impossible to pushdown such complex filters
>>>>>>>>> that we
>>>>>>>>>>> have
>>>>>>>>>>>>>>>>> in Flink.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> only applying these optimizations to the cache seems not
>>>>>>>>>>> quite
>>>>>>>>>>>>>>> useful
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Filters can cut off an arbitrarily large amount of data
>>>>>>>>> from the
>>>>>>>>>>>>>>>>> dimension table. For a simple example, suppose in
>>> dimension
>>>>>>>>>>> table
>>>>>>>>>>>>>>>>> 'users'
>>>>>>>>>>>>>>>>> we have column 'age' with values from 20 to 40, and input
>>>>>>>>> stream
>>>>>>>>>>>>>>>>> 'clicks' that is ~uniformly distributed by age of users.
>>> If
>>>>>>>>> we
>>>>>>>>>>> have
>>>>>>>>>>>>>>>>> filter 'age > 30',
>>>>>>>>>>>>>>>>> there will be twice less data in cache. This means the
>>> user
>>>>>>>>> can
>>>>>>>>>>>>>>>>> increase 'lookup.cache.max-rows' by almost 2 times. It
>>> will
>>>>>>>>>>> gain a
>>>>>>>>>>>>>>>>> huge
>>>>>>>>>>>>>>>>> performance boost. Moreover, this optimization starts to
>>>>>>>>> really
>>>>>>>>>>>>>> shine
>>>>>>>>>>>>>>>>> in 'ALL' cache, where tables without filters and
>>> projections
>>>>>>>>>>> can't
>>>>>>>>>>>>>> fit
>>>>>>>>>>>>>>>>> in memory, but with them - can. This opens up additional
>>>>>>>>>>>>>> possibilities
>>>>>>>>>>>>>>>>> for users. And this doesn't sound as 'not quite useful'.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> It would be great to hear other voices regarding this
>>> topic!
>>>>>>>>>>> Because
>>>>>>>>>>>>>>>>> we have quite a lot of controversial points, and I think
>>>>>>>>> with
>>>>>>>>>>> the
>>>>>>>>>>>>>> help
>>>>>>>>>>>>>>>>> of others it will be easier for us to come to a consensus.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>>>>> Smirnov Alexander
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Fri, Apr 29, 2022 at 22:33, Qingsheng Ren <
>>>>>>>>> renqschn@gmail.com
>>>>>>>>>>>> :
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Hi Alexander and Arvid,
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Thanks for the discussion and sorry for my late response!
>>>>>>>>> We
>>>>>>>>>>> had
>>>>>>>>>>>>>> an
>>>>>>>>>>>>>>>>> internal discussion together with Jark and Leonard and I’d
>>>>>>>>> like
>>>>>>>>>>> to
>>>>>>>>>>>>>>>>> summarize our ideas. Instead of implementing the cache
>>>>>>>>> logic in
>>>>>>>>>>> the
>>>>>>>>>>>>>>> table
>>>>>>>>>>>>>>>>> runtime layer or wrapping around the user-provided table
>>>>>>>>>>> function,
>>>>>>>>>>>>>> we
>>>>>>>>>>>>>>>>> prefer to introduce some new APIs extending TableFunction
>>>>>>>>> with
>>>>>>>>>>> these
>>>>>>>>>>>>>>>>> concerns:
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 1. Caching actually breaks the semantic of "FOR
>>>>>>>>> SYSTEM_TIME
>>>>>>>>>>> AS OF
>>>>>>>>>>>>>>>>> proc_time”, because it couldn’t truly reflect the content
>>>>>>>>> of the
>>>>>>>>>>>>>> lookup
>>>>>>>>>>>>>>>>> table at the moment of querying. If users choose to enable
>>>>>>>>>>> caching
>>>>>>>>>>>>>> on
>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>> lookup table, they implicitly indicate that this breakage
>>> is
>>>>>>>>>>>>>> acceptable
>>>>>>>>>>>>>>>> in
>>>>>>>>>>>>>>>>> exchange for the performance. So we prefer not to provide
>>>>>>>>>>> caching on
>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>> table runtime level.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 2. If we make the cache implementation in the framework
>>>>>>>>>>> (whether
>>>>>>>>>>>>>> in a
>>>>>>>>>>>>>>>>> runner or a wrapper around TableFunction), we have to
>>>>>>>>> confront a
>>>>>>>>>>>>>>>> situation
>>>>>>>>>>>>>>>>> that allows table options in DDL to control the behavior
>>> of
>>>>>>>>> the
>>>>>>>>>>>>>>>> framework,
>>>>>>>>>>>>>>>>> which has never happened previously and should be
>>> cautious.
>>>>>>>>>>> Under
>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>> current design the behavior of the framework should only
>>> be
>>>>>>>>>>>>>> specified
>>>>>>>>>>>>>>> by
>>>>>>>>>>>>>>>>> configurations (“table.exec.xxx”), and it’s hard to apply
>>>>>>>>> these
>>>>>>>>>>>>>> general
>>>>>>>>>>>>>>>>> configs to a specific table.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 3. We have use cases that lookup source loads and refresh
>>>>>>>>> all
>>>>>>>>>>>>>> records
>>>>>>>>>>>>>>>>> periodically into the memory to achieve high lookup
>>>>>>>>> performance
>>>>>>>>>>>>>> (like
>>>>>>>>>>>>>>>> Hive
>>>>>>>>>>>>>>>>> connector in the community, and also widely used by our
>>>>>>>>> internal
>>>>>>>>>>>>>>>>> connectors). Wrapping the cache around the user’s
>>>>>>>>> TableFunction
>>>>>>>>>>>>>> works
>>>>>>>>>>>>>>>> fine
>>>>>>>>>>>>>>>>> for LRU caches, but I think we have to introduce a new
>>>>>>>>>>> interface for
>>>>>>>>>>>>>>> this
>>>>>>>>>>>>>>>>> all-caching scenario and the design would become more
>>>>>>>>> complex.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 4. Providing the cache in the framework might introduce
>>>>>>>>>>>>>> compatibility
>>>>>>>>>>>>>>>>> issues to existing lookup sources like there might exist
>>> two
>>>>>>>>>>> caches
>>>>>>>>>>>>>>> with
>>>>>>>>>>>>>>>>> totally different strategies if the user incorrectly
>>>>>>>>> configures
>>>>>>>>>>> the
>>>>>>>>>>>>>>> table
>>>>>>>>>>>>>>>>> (one in the framework and another implemented by the
>>> lookup
>>>>>>>>>>> source).
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> As for the optimization mentioned by Alexander, I think
>>>>>>>>>>> filters
>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>> projections should be pushed all the way down to the table
>>>>>>>>>>> function,
>>>>>>>>>>>>>>> like
>>>>>>>>>>>>>>>>> what we do in the scan source, instead of the runner with
>>>>>>>>> the
>>>>>>>>>>> cache.
>>>>>>>>>>>>>>> The
>>>>>>>>>>>>>>>>> goal of using cache is to reduce the network I/O and
>>>>>>>>> pressure
>>>>>>>>>>> on the
>>>>>>>>>>>>>>>>> external system, and only applying these optimizations to
>>>>>>>>> the
>>>>>>>>>>> cache
>>>>>>>>>>>>>>> seems
>>>>>>>>>>>>>>>>> not quite useful.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> I made some updates to the FLIP[1] to reflect our ideas.
>>>>>>>>> We
>>>>>>>>>>>>>> prefer to
>>>>>>>>>>>>>>>>> keep the cache implementation as a part of TableFunction,
>>>>>>>>> and we
>>>>>>>>>>>>>> could
>>>>>>>>>>>>>>>>> provide some helper classes (CachingTableFunction,
>>>>>>>>>>>>>>>> AllCachingTableFunction,
>>>>>>>>>>>>>>>>> CachingAsyncTableFunction) to developers and regulate
>>>>>>>>> metrics
>>>>>>>>>>> of the
>>>>>>>>>>>>>>>> cache.
>>>>>>>>>>>>>>>>> Also, I made a POC[2] for your reference.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Looking forward to your ideas!
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> [1]
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>> 
>>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-221+Abstraction+for+lookup+source+cache+and+metric
>>>>>>>>>>>>>>>>>> [2] https://github.com/PatrickRen/flink/tree/FLIP-221
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Qingsheng
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> On Tue, Apr 26, 2022 at 4:45 PM Александр Смирнов <
>>>>>>>>>>>>>>>> smiralexan@gmail.com>
>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Thanks for the response, Arvid!
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> I have few comments on your message.
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> but could also live with an easier solution as the
>>>>>>>>> first
>>>>>>>>>>> step:
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> I think that these 2 ways are mutually exclusive
>>>>>>>>> (originally
>>>>>>>>>>>>>>> proposed
>>>>>>>>>>>>>>>>>>> by Qingsheng and mine), because conceptually they follow
>>>>>>>>> the
>>>>>>>>>>> same
>>>>>>>>>>>>>>>>>>> goal, but implementation details are different. If we
>>>>>>>>> will
>>>>>>>>>>> go one
>>>>>>>>>>>>>>> way,
>>>>>>>>>>>>>>>>>>> moving to another way in the future will mean deleting
>>>>>>>>>>> existing
>>>>>>>>>>>>>> code
>>>>>>>>>>>>>>>>>>> and once again changing the API for connectors. So I
>>>>>>>>> think we
>>>>>>>>>>>>>> should
>>>>>>>>>>>>>>>>>>> reach a consensus with the community about that and then
>>>>>>>>> work
>>>>>>>>>>>>>>> together
>>>>>>>>>>>>>>>>>>> on this FLIP, i.e. divide the work on tasks for
>>> different
>>>>>>>>>>> parts
>>>>>>>>>>>>>> of
>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>> flip (for example, LRU cache unification / introducing
>>>>>>>>>>> proposed
>>>>>>>>>>>>>> set
>>>>>>>>>>>>>>> of
>>>>>>>>>>>>>>>>>>> metrics / further work…). WDYT, Qingsheng?
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> as the source will only receive the requests after
>>>>>>>>> filter
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Actually if filters are applied to fields of the lookup
>>>>>>>>>>> table, we
>>>>>>>>>>>>>>>>>>> firstly must do requests, and only after that we can
>>>>>>>>> filter
>>>>>>>>>>>>>>> responses,
>>>>>>>>>>>>>>>>>>> because lookup connectors don't have filter pushdown. So
>>>>>>>>> if
>>>>>>>>>>>>>>> filtering
>>>>>>>>>>>>>>>>>>> is done before caching, there will be much less rows in
>>>>>>>>>>> cache.
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> @Alexander unfortunately, your architecture is not
>>>>>>>>> shared.
>>>>>>>>>>> I
>>>>>>>>>>>>>> don't
>>>>>>>>>>>>>>>>> know the
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> solution to share images to be honest.
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Sorry for that, I’m a bit new to such kinds of
>>>>>>>>> conversations
>>>>>>>>>>> :)
>>>>>>>>>>>>>>>>>>> I have no write access to the confluence, so I made a
>>>>>>>>> Jira
>>>>>>>>>>> issue,
>>>>>>>>>>>>>>>>>>> where described the proposed changes in more details -
>>>>>>>>>>>>>>>>>>> https://issues.apache.org/jira/browse/FLINK-27411.
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Will be happy to get more feedback!
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>>>>> Smirnov Alexander
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Mon, Apr 25, 2022 at 19:49, Arvid Heise <
>>>>>>>>> arvid@apache.org>:
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> Hi Qingsheng,
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> Thanks for driving this; the inconsistency was not
>>>>>>>>>>> satisfying
>>>>>>>>>>>>>> for
>>>>>>>>>>>>>>>> me.
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> I second Alexander's idea though but could also live
>>>>>>>>> with
>>>>>>>>>>> an
>>>>>>>>>>>>>>> easier
>>>>>>>>>>>>>>>>>>>> solution as the first step: Instead of making caching
>>>>>>>>> an
>>>>>>>>>>>>>>>>> implementation
>>>>>>>>>>>>>>>>>>>> detail of TableFunction X, rather devise a caching
>>>>>>>>> layer
>>>>>>>>>>>>>> around X.
>>>>>>>>>>>>>>>> So
>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>> proposal would be a CachingTableFunction that
>>>>>>>>> delegates to
>>>>>>>>>>> X in
>>>>>>>>>>>>>>> case
>>>>>>>>>>>>>>>>> of
>>>>>>>>>>>>>>>>>>>> misses and else manages the cache. Lifting it into the
>>>>>>>>>>> operator
>>>>>>>>>>>>>>>> model
>>>>>>>>>>>>>>>>> as
>>>>>>>>>>>>>>>>>>>> proposed would be even better but is probably
>>>>>>>>> unnecessary
>>>>>>>>>>> in
>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>> first step
>>>>>>>>>>>>>>>>>>>> for a lookup source (as the source will only receive
>>>>>>>>> the
>>>>>>>>>>>>>> requests
>>>>>>>>>>>>>>>>> after
>>>>>>>>>>>>>>>>>>>> filter; applying projection may be more interesting to
>>>>>>>>> save
>>>>>>>>>>>>>>> memory).
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> Another advantage is that all the changes of this FLIP
>>>>>>>>>>> would be
>>>>>>>>>>>>>>>>> limited to
>>>>>>>>>>>>>>>>>>>> options, no need for new public interfaces. Everything
>>>>>>>>> else
>>>>>>>>>>>>>>> remains
>>>>>>>>>>>>>>>> an
>>>>>>>>>>>>>>>>>>>> implementation of Table runtime. That means we can
>>>>>>>>> easily
>>>>>>>>>>>>>>>> incorporate
>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>> optimization potential that Alexander pointed out
>>>>>>>>> later.
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> @Alexander unfortunately, your architecture is not
>>>>>>>>> shared.
>>>>>>>>>>> I
>>>>>>>>>>>>>> don't
>>>>>>>>>>>>>>>>> know the
>>>>>>>>>>>>>>>>>>>> solution to share images to be honest.
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> On Fri, Apr 22, 2022 at 5:04 PM Александр Смирнов <
>>>>>>>>>>>>>>>>> smiralexan@gmail.com>
>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> Hi Qingsheng! My name is Alexander, I'm not a
>>>>>>>>> committer
>>>>>>>>>>> yet,
>>>>>>>>>>>>>> but
>>>>>>>>>>>>>>>> I'd
>>>>>>>>>>>>>>>>>>>>> really like to become one. And this FLIP really
>>>>>>>>>>> interested
>>>>>>>>>>>>>> me.
>>>>>>>>>>>>>>>>>>>>> Actually I have worked on a similar feature in my
>>>>>>>>>>> company’s
>>>>>>>>>>>>>>> Flink
>>>>>>>>>>>>>>>>>>>>> fork, and we would like to share our thoughts on
>>>>>>>>> this and
>>>>>>>>>>>>>> make
>>>>>>>>>>>>>>>> code
>>>>>>>>>>>>>>>>>>>>> open source.
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> I think there is a better alternative than
>>>>>>>>> introducing an
>>>>>>>>>>>>>>> abstract
>>>>>>>>>>>>>>>>>>>>> class for TableFunction (CachingTableFunction). As
>>>>>>>>> you
>>>>>>>>>>> know,
>>>>>>>>>>>>>>>>>>>>> TableFunction exists in the flink-table-common
>>>>>>>>> module,
>>>>>>>>>>> which
>>>>>>>>>>>>>>>>> provides
>>>>>>>>>>>>>>>>>>>>> only an API for working with tables – it’s very
>>>>>>>>>>> convenient
>>>>>>>>>>>>>> for
>>>>>>>>>>>>>>>>> importing
>>>>>>>>>>>>>>>>>>>>> in connectors. In turn, CachingTableFunction contains
>>>>>>>>>>> logic
>>>>>>>>>>>>>> for
>>>>>>>>>>>>>>>>>>>>> runtime execution,  so this class and everything
>>>>>>>>>>> connected
>>>>>>>>>>>>>> with
>>>>>>>>>>>>>>> it
>>>>>>>>>>>>>>>>>>>>> should be located in another module, probably in
>>>>>>>>>>>>>>>>> flink-table-runtime.
>>>>>>>>>>>>>>>>>>>>> But this will require connectors to depend on another
>>>>>>>>>>> module,
>>>>>>>>>>>>>>>> which
>>>>>>>>>>>>>>>>>>>>> contains a lot of runtime logic, which doesn’t sound
>>>>>>>>>>> good.
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> I suggest adding a new method ‘getLookupConfig’ to
>>>>>>>>>>>>>>>> LookupTableSource
>>>>>>>>>>>>>>>>>>>>> or LookupRuntimeProvider to allow connectors to only
>>>>>>>>> pass
>>>>>>>>>>>>>>>>>>>>> configurations to the planner, therefore they won’t
>>>>>>>>>>> depend on
>>>>>>>>>>>>>>>>> runtime
>>>>>>>>>>>>>>>>>>>>> realization. Based on these configs planner will
>>>>>>>>>>> construct a
>>>>>>>>>>>>>>>> lookup
>>>>>>>>>>>>>>>>>>>>> join operator with corresponding runtime logic
>>>>>>>>>>>>>> (ProcessFunctions
>>>>>>>>>>>>>>>> in
>>>>>>>>>>>>>>>>>>>>> module flink-table-runtime). Architecture looks like
>>>>>>>>> in
>>>>>>>>>>> the
>>>>>>>>>>>>>>> pinned
>>>>>>>>>>>>>>>>>>>>> image (LookupConfig class there is actually yours
>>>>>>>>>>>>>> CacheConfig).
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> Classes in flink-table-planner, that will be
>>>>>>>>> responsible
>>>>>>>>>>> for
>>>>>>>>>>>>>>> this
>>>>>>>>>>>>>>>> –
>>>>>>>>>>>>>>>>>>>>> CommonPhysicalLookupJoin and his inheritors.
>>>>>>>>>>>>>>>>>>>>> Current classes for lookup join in
>>>>>>>>> flink-table-runtime
>>>>>>>>>>> -
>>>>>>>>>>>>>>>>>>>>> LookupJoinRunner, AsyncLookupJoinRunner,
>>>>>>>>>>>>>>> LookupJoinRunnerWithCalc,
>>>>>>>>>>>>>>>>>>>>> AsyncLookupJoinRunnerWithCalc.
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> I suggest adding classes LookupJoinCachingRunner,
>>>>>>>>>>>>>>>>>>>>> LookupJoinCachingRunnerWithCalc, etc.
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> And here comes another more powerful advantage of
>>>>>>>>> such a
>>>>>>>>>>>>>>> solution.
>>>>>>>>>>>>>>>>> If
>>>>>>>>>>>>>>>>>>>>> we have caching logic on a lower level, we can apply
>>>>>>>>> some
>>>>>>>>>>>>>>>>>>>>> optimizations to it. LookupJoinRunnerWithCalc was
>>>>>>>>> named
>>>>>>>>>>> like
>>>>>>>>>>>>>>> this
>>>>>>>>>>>>>>>>>>>>> because it uses the ‘calc’ function, which actually
>>>>>>>>>>> mostly
>>>>>>>>>>>>>>>> consists
>>>>>>>>>>>>>>>>> of
>>>>>>>>>>>>>>>>>>>>> filters and projections.
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> For example, in join table A with lookup table B
>>>>>>>>>>> condition
>>>>>>>>>>>>>>> ‘JOIN …
>>>>>>>>>>>>>>>>> ON
>>>>>>>>>>>>>>>>>>>>> A.id = B.id AND A.age = B.age + 10 WHERE B.salary >
>>>>>>>>> 1000’
>>>>>>>>>>>>>>> ‘calc’
>>>>>>>>>>>>>>>>>>>>> function will contain filters A.age = B.age + 10 and
>>>>>>>>>>>>>> B.salary >
>>>>>>>>>>>>>>>>> 1000.
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> If we apply this function before storing records in
>>>>>>>>>>> cache,
>>>>>>>>>>>>>> size
>>>>>>>>>>>>>>> of
>>>>>>>>>>>>>>>>>>>>> cache will be significantly reduced: filters = avoid
>>>>>>>>>>> storing
>>>>>>>>>>>>>>>> useless
>>>>>>>>>>>>>>>>>>>>> records in cache, projections = reduce records’
>>>>>>>>> size. So
>>>>>>>>>>> the
>>>>>>>>>>>>>>>> initial
>>>>>>>>>>>>>>>>>>>>> max number of records in cache can be increased by
>>>>>>>>> the
>>>>>>>>>>> user.
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> What do you think about it?
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> On 2022/04/19 02:47:11 Qingsheng Ren wrote:
>>>>>>>>>>>>>>>>>>>>>> Hi devs,
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> Yuan and I would like to start a discussion about
>>>>>>>>>>>>>> FLIP-221[1],
>>>>>>>>>>>>>>>>> which
>>>>>>>>>>>>>>>>>>>>> introduces an abstraction of lookup table cache and
>>>>>>>>> its
>>>>>>>>>>>>>> standard
>>>>>>>>>>>>>>>>> metrics.
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> Currently each lookup table source should implement
>>>>>>>>>>> their
>>>>>>>>>>>>>> own
>>>>>>>>>>>>>>>>> cache to
>>>>>>>>>>>>>>>>>>>>> store lookup results, and there isn’t a standard of
>>>>>>>>>>> metrics
>>>>>>>>>>>>>> for
>>>>>>>>>>>>>>>>> users and
>>>>>>>>>>>>>>>>>>>>> developers to tuning their jobs with lookup joins,
>>>>>>>>> which
>>>>>>>>>>> is a
>>>>>>>>>>>>>>>> quite
>>>>>>>>>>>>>>>>> common
>>>>>>>>>>>>>>>>>>>>> use case in Flink table / SQL.
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> Therefore we propose some new APIs including cache,
>>>>>>>>>>>>>> metrics,
>>>>>>>>>>>>>>>>> wrapper
>>>>>>>>>>>>>>>>>>>>> classes of TableFunction and new table options.
>>>>>>>>> Please
>>>>>>>>>>> take a
>>>>>>>>>>>>>>> look
>>>>>>>>>>>>>>>>> at the
>>>>>>>>>>>>>>>>>>>>> FLIP page [1] to get more details. Any suggestions
>>>>>>>>> and
>>>>>>>>>>>>>> comments
>>>>>>>>>>>>>>>>> would be
>>>>>>>>>>>>>>>>>>>>> appreciated!
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> [1]
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>> 
>>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-221+Abstraction+for+lookup+source+cache+and+metric
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> Qingsheng
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>>> Best Regards,
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Qingsheng Ren
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Real-time Computing Team
>>>>>>>>>>>>>>>>>> Alibaba Cloud
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Email: renqschn@gmail.com
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> --
>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>> Roman Boyko
>>>>>>>>>>>>> e.: ro.v.boyko@gmail.com
>>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> --
>>>>>> Best Regards,
>>>>>> 
>>>>>> Qingsheng Ren
>>>>>> 
>>>>>> Real-time Computing Team
>>>>>> Alibaba Cloud
>>>>>> 
>>>>>> Email: renqschn@gmail.com
>>>> 
>>> 
>> 


Re: [DISCUSS] FLIP-221 Abstraction for lookup source cache and metric

Posted by Alexander Smirnov <sm...@gmail.com>.
Hi Qingsheng and devs,

I like the new concept of partial / full caching. Thanks for your
efforts, Qingsheng!

> The schedule-with-delay idea looks reasonable to me, but I think we need to redesign the builder API of full caching to make it more descriptive for developers.

I've got access to Confluence (thanks, Jark!) and added the API changes
for the new 'reload-start-time' option. At first I thought to limit the
design to a simple builder API, but then realized that the behavior of
various combinations of 'reload-interval' and 'reload-start-time' would
be unclear to developers, so this behavior is now specified in the API.
The type of the DDL option 'reload-start-time' is String, so connectors
will need to parse the UTC time (e.g. 01:30:30) into a LocalTime
themselves. It's possible to add a 'timeType' to ConfigOptions (like the
existing 'durationType'), but I think it's redundant for only one option
(maybe there are other similar ones that I don't know about).
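As a rough sketch of how a connector could turn such a parsed start time
into the initial delay for ScheduledExecutorService#scheduleWithFixedDelay
(plain-Java illustration only — the class and helper names below are mine,
and the one-day default interval follows the proposal above):

```java
import java.time.Duration;
import java.time.LocalTime;
import java.time.ZoneOffset;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

class RescanScheduleSketch {

    /** Delay from 'now' until the next occurrence of 'startTime', both in UTC. */
    static Duration initialDelay(LocalTime now, LocalTime startTime) {
        Duration d = Duration.between(now, startTime);
        // If the start time has already passed today, schedule for tomorrow.
        return d.isNegative() ? d.plusDays(1) : d;
    }

    public static void main(String[] args) {
        // The 'reload-start-time' DDL option arrives as a String, e.g. "01:30:30",
        // and the connector parses it into a LocalTime itself.
        LocalTime start = LocalTime.parse("01:30:30");
        Duration delay = initialDelay(LocalTime.now(ZoneOffset.UTC), start);

        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        // Without an explicit 'reload-interval', the interval defaults to one day.
        scheduler.scheduleWithFixedDelay(
                () -> System.out.println("reloading lookup cache"),
                delay.toMillis(),
                Duration.ofDays(1).toMillis(),
                TimeUnit.MILLISECONDS);
        scheduler.shutdown(); // demo only: don't actually run the reload here
    }
}
```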

About the 'lookup.async' option: if there are plans to let the framework
decide whether to use sync / async mode, then this option is redundant.
My proposal was made because currently the connector explicitly provides
only one of the two modes to the planner, and the user option is the only
way for the user to decide which one is provided.
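For illustration, here is a minimal sketch of that mode choice — the
stand-in provider types and the handling of a 'lookup.async' option below
are my own simplification, not Flink's actual API (in Flink the providers
roughly correspond to TableFunctionProvider / AsyncTableFunctionProvider):

```java
import java.util.Map;

class AsyncOptionSketch {

    // Minimal stand-ins for the planner-facing provider types.
    interface LookupRuntimeProvider {}
    static final class SyncProvider implements LookupRuntimeProvider {}
    static final class AsyncProvider implements LookupRuntimeProvider {}

    // The connector hands exactly one of the two modes to the planner, so a
    // user-facing option such as 'lookup.async' is currently the only way
    // for the user to pick which mode is provided.
    static LookupRuntimeProvider getLookupRuntimeProvider(Map<String, String> options) {
        boolean async = Boolean.parseBoolean(options.getOrDefault("lookup.async", "false"));
        return async ? new AsyncProvider() : new SyncProvider();
    }

    public static void main(String[] args) {
        System.out.println(
            getLookupRuntimeProvider(Map.of("lookup.async", "true")).getClass().getSimpleName());
    }
}
```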

Best regards,
Smirnov Alexander

Mon, May 23, 2022 at 13:53, Qingsheng Ren <re...@gmail.com>:
>
> Hi Alexander,
>
> Thanks for the review! We recently updated the FLIP and you can find those changes in my latest email. Since some terminology has changed, I'll use the new concepts when replying to your comments.
>
> 1. Builder vs ‘of’
> I’m OK to use builder pattern if we have additional optional parameters for full caching mode (“rescan” previously). The schedule-with-delay idea looks reasonable to me, but I think we need to redesign the builder API of full caching to make it more descriptive for developers. Would you mind sharing your ideas about the API? For accessing the FLIP workspace you can just provide your account ID and ping any PMC member including Jark.
>
> 2. Common table options
> We have some discussions these days and propose to introduce 8 common table options about caching. It has been updated on the FLIP.
>
> 3. Retries
> I think we are on the same page :-)
>
> For your additional concerns:
> 1) The table option has been updated.
> 2) We got “lookup.cache” back for configuring whether to use partial or full caching mode.
>
> Best regards,
>
> Qingsheng
>
>
>
> > On May 19, 2022, at 17:25, Александр Смирнов <sm...@gmail.com> wrote:
> >
> > Also I have a few additions:
> > 1) maybe rename 'lookup.cache.maximum-size' to
> > 'lookup.cache.max-rows'? I think it will be more clear that we talk
> > not about bytes, but about the number of rows. Plus it fits more,
> > considering my optimization with filters.
> > 2) How will users enable rescanning? Are we going to separate caching
> > and rescanning from the options point of view? Like initially we had
> > one option 'lookup.cache' with values LRU / ALL. I think now we can
> > make a boolean option 'lookup.rescan'. RescanInterval can be
> > 'lookup.rescan.interval', etc.
> >
> > Best regards,
> > Alexander
> >
> Thu, May 19, 2022 at 14:50, Александр Смирнов <sm...@gmail.com>:
> >>
> >> Hi Qingsheng and Jark,
> >>
> >> 1. Builders vs 'of'
> >> I understand that builders are used when we have multiple parameters.
> >> I suggested them because we could add parameters later. To prevent
> >> Builder for ScanRuntimeProvider from looking redundant I can suggest
> >> one more config now - "rescanStartTime".
> >> It's a time in UTC (LocalTime class) when the first reload of cache
> >> starts. This parameter can be thought of as 'initialDelay' (diff
> >> between current time and rescanStartTime) in method
> >> ScheduleExecutorService#scheduleWithFixedDelay [1] . It can be very
> >> useful when the dimension table is updated by some other scheduled job
> >> at a certain time. Or when the user simply wants a second scan (first
> >> cache reload) be delayed. This option can be used even without
> >> 'rescanInterval' - in this case 'rescanInterval' will be one day.
> >> If you are fine with this option, I would be very glad if you would
> >> give me access to edit FLIP page, so I could add it myself
> >>
> >> 2. Common table options
> >> I also think that FactoryUtil would be overloaded by all cache
> >> options. But maybe unify all suggested options, not only for default
> >> cache? I.e. class 'LookupOptions', that unifies default cache options,
> >> rescan options, 'async', 'maxRetries'. WDYT?
> >>
> >> 3. Retries
> >> I'm fine with suggestion close to RetryUtils#tryTimes(times, call)
> >>
> >> [1] https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/ScheduledExecutorService.html#scheduleWithFixedDelay-java.lang.Runnable-long-long-java.util.concurrent.TimeUnit-
> >>
> >> Best regards,
> >> Alexander
> >>
> >> ср, 18 мая 2022 г. в 16:04, Qingsheng Ren <re...@gmail.com>:
> >>>
> >>> Hi Jark and Alexander,
> >>>
> >>> Thanks for your comments! I’m also OK to introduce common table options. I prefer to introduce a new DefaultLookupCacheOptions class for holding these option definitions because putting all options into FactoryUtil would make it a bit ”crowded” and not well categorized.
> >>>
> >>> FLIP has been updated according to suggestions above:
> >>> 1. Use static “of” method for constructing RescanRuntimeProvider considering both arguments are required.
> >>> 2. Introduce new table options matching DefaultLookupCacheFactory
> >>>
> >>> Best,
> >>> Qingsheng
> >>>
> >>> On Wed, May 18, 2022 at 2:57 PM Jark Wu <im...@gmail.com> wrote:
> >>>>
> >>>> Hi Alex,
> >>>>
> >>>> 1) retry logic
> >>>> I think we can extract some common retry logic into utilities, e.g. RetryUtils#tryTimes(times, call).
> >>>> This seems independent of this FLIP and can be reused by DataStream users.
> >>>> Maybe we can open an issue to discuss this and where to put it.
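A `RetryUtils#tryTimes(times, call)` utility of the shape mentioned above could look like the sketch below. This utility does not exist in Flink yet, so the class name and signature are assumptions based on this thread.

```java
import java.util.concurrent.Callable;

final class RetryUtils {
    private RetryUtils() {}

    /**
     * Invokes 'call' up to 'times' times, returning the first successful
     * result. If every attempt fails, rethrows the last failure.
     */
    static <T> T tryTimes(int times, Callable<T> call) throws Exception {
        Exception lastFailure = null;
        for (int attempt = 1; attempt <= times; attempt++) {
            try {
                return call.call();
            } catch (Exception e) {
                // A connector could hook custom recovery here, e.g.
                // re-establishing a connection before the next attempt.
                lastFailure = e;
            }
        }
        throw lastFailure;
    }
}
```

A connector's lookup function could then wrap its per-key query in `tryTimes` instead of duplicating the retry loop.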
> >>>>
> >>>> 2) cache ConfigOptions
> >>>> I'm fine with defining cache config options in the framework.
> >>>> A candidate place to put is FactoryUtil which also includes "sink.parallelism", "format" options.
> >>>>
> >>>> Best,
> >>>> Jark
> >>>>
> >>>>
> >>>> On Wed, 18 May 2022 at 13:52, Александр Смирнов <sm...@gmail.com> wrote:
> >>>>>
> >>>>> Hi Qingsheng,
> >>>>>
> >>>>> Thank you for considering my comments.
> >>>>>
> >>>>>> there might be custom logic before making retry, such as re-establish the connection
> >>>>>
> >>>>> Yes, I understand that. I meant that such logic can be placed in a
> >>>>> separate function, that can be implemented by connectors. Just moving
> >>>>> the retry logic would make connector's LookupFunction more concise +
> >>>>> avoid duplicate code. However, it's a minor change. The decision is up
> >>>>> to you.
> >>>>>
> >>>>>> We decided not to provide common DDL options and to let developers define their own options, as we do now per connector.
> >>>>>
> >>>>> What is the reason for that? One of the main goals of this FLIP was to
> >>>>> unify the configs, wasn't it? I understand that the current cache design
> >>>>> doesn't depend on ConfigOptions, like it did before. But we can still put
> >>>>> these options into the framework, so connectors can reuse them and
> >>>>> avoid code duplication and, more significantly, avoid inconsistent
> >>>>> option naming. This could be pointed out in the
> >>>>> documentation for connector developers.
> >>>>>
> >>>>> Best regards,
> >>>>> Alexander
> >>>>>
> >>>>> вт, 17 мая 2022 г. в 17:11, Qingsheng Ren <re...@gmail.com>:
> >>>>>>
> >>>>>> Hi Alexander,
> >>>>>>
> >>>>>> Thanks for the review and glad to see we are on the same page! I think you forgot to cc the dev mailing list so I’m also quoting your reply under this email.
> >>>>>>
> >>>>>>> We can add 'maxRetryTimes' option into this class
> >>>>>>
> >>>>>> In my opinion the retry logic should be implemented in lookup() instead of in LookupFunction#eval(). Retrying is only meaningful under some specific retriable failures, and there might be custom logic before making retry, such as re-establish the connection (JdbcRowDataLookupFunction is an example), so it's more handy to leave it to the connector.
> >>>>>>
> >>>>>>> I don't see DDL options, that were in previous version of FLIP. Do you have any special plans for them?
> >>>>>>
> >>>>>> We decided not to provide common DDL options and to let developers define their own options, as we do now per connector.
> >>>>>>
> >>>>>> The rest of comments sound great and I’ll update the FLIP. Hope we can finalize our proposal soon!
> >>>>>>
> >>>>>> Best,
> >>>>>>
> >>>>>> Qingsheng
> >>>>>>
> >>>>>>
> >>>>>>> On May 17, 2022, at 13:46, Александр Смирнов <sm...@gmail.com> wrote:
> >>>>>>>
> >>>>>>> Hi Qingsheng and devs!
> >>>>>>>
> >>>>>>> I like the overall design of the updated FLIP, however I have several
> >>>>>>> suggestions and questions.
> >>>>>>>
> >>>>>>> 1) Introducing LookupFunction as a subclass of TableFunction is a good
> >>>>>>> idea. We can add 'maxRetryTimes' option into this class. 'eval' method
> >>>>>>> of new LookupFunction is great for this purpose. The same is for
> >>>>>>> 'async' case.
> >>>>>>>
> >>>>>>> 2) There might be other configs in future, such as 'cacheMissingKey'
> >>>>>>> in LookupFunctionProvider or 'rescanInterval' in ScanRuntimeProvider.
> >>>>>>> Maybe use Builder pattern in LookupFunctionProvider and
> >>>>>>> RescanRuntimeProvider for more flexibility (use one 'build' method
> >>>>>>> instead of many 'of' methods in future)?
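The builder suggestion above could take the shape sketched below. This is purely illustrative: the class name, the `Runnable` stand-in for the real lookup function, and the `cacheMissingKey` default are assumptions, not the FLIP's final API.

```java
// Hypothetical builder-style provider: new options become new setters,
// so there is no need for an ever-growing set of 'of' overloads.
class LookupFunctionProviderSketch {
    private final Runnable lookupFunction; // stand-in for the real LookupFunction
    private final boolean cacheMissingKey;

    private LookupFunctionProviderSketch(Runnable lookupFunction, boolean cacheMissingKey) {
        this.lookupFunction = lookupFunction;
        this.cacheMissingKey = cacheMissingKey;
    }

    boolean isCacheMissingKey() {
        return cacheMissingKey;
    }

    static Builder builder() {
        return new Builder();
    }

    static class Builder {
        private Runnable lookupFunction;
        private boolean cacheMissingKey = true; // assumed default

        Builder withLookupFunction(Runnable lookupFunction) {
            this.lookupFunction = lookupFunction;
            return this;
        }

        Builder cacheMissingKey(boolean cacheMissingKey) {
            this.cacheMissingKey = cacheMissingKey;
            return this;
        }

        LookupFunctionProviderSketch build() {
            return new LookupFunctionProviderSketch(lookupFunction, cacheMissingKey);
        }
    }
}
```

The trade-off is the usual one: a builder keeps the public surface stable when options like 'cacheMissingKey' or 'rescanInterval' are added later, at the cost of slightly more verbose call sites than a static `of` method.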
> >>>>>>>
> >>>>>>> 3) What are the plans for existing TableFunctionProvider and
> >>>>>>> AsyncTableFunctionProvider? I think they should be deprecated.
> >>>>>>>
> >>>>>>> 4) Am I right that the current design does not assume usage of
> >>>>>>> user-provided LookupCache in re-scanning? In this case, it is not very
> >>>>>>> clear why we need methods such as 'invalidate' or 'putAll' in
> >>>>>>> LookupCache.
> >>>>>>>
> >>>>>>> 5) I don't see DDL options, that were in previous version of FLIP. Do
> >>>>>>> you have any special plans for them?
> >>>>>>>
> >>>>>>> If you don't mind, I would be glad to be able to make small
> >>>>>>> adjustments to the FLIP document too. I think it's worth mentioning
> >>>>>>> exactly what optimizations are planned for the future.
> >>>>>>>
> >>>>>>> Best regards,
> >>>>>>> Smirnov Alexander
> >>>>>>>
> >>>>>>> пт, 13 мая 2022 г. в 20:27, Qingsheng Ren <re...@gmail.com>:
> >>>>>>>>
> >>>>>>>> Hi Alexander and devs,
> >>>>>>>>
> >>>>>>>> Thank you very much for the in-depth discussion! As Jark mentioned we were inspired by Alexander's idea and made a refactor on our design. FLIP-221 [1] has been updated to reflect our design now and we are happy to hear more suggestions from you!
> >>>>>>>>
> >>>>>>>> Compared to the previous design:
> >>>>>>>> 1. The lookup cache serves at table runtime level and is integrated as a component of LookupJoinRunner as discussed previously.
> >>>>>>>> 2. Interfaces are renamed and re-designed to reflect the new design.
> >>>>>>>> 3. We separate the all-caching case individually and introduce a new RescanRuntimeProvider to reuse the ability of scanning. We are planning to support SourceFunction / InputFormat for now considering the complexity of FLIP-27 Source API.
> >>>>>>>> 4. A new interface LookupFunction is introduced to make the semantic of lookup more straightforward for developers.
> >>>>>>>>
> >>>>>>>> For replying to Alexander:
> >>>>>>>>> However I'm a little confused whether InputFormat is deprecated or not. Am I right that it will be so in the future, but currently it's not?
> >>>>>>>> Yes you are right. InputFormat is not deprecated for now. I think it will be deprecated in the future but we don't have a clear plan for that.
> >>>>>>>>
> >>>>>>>> Thanks again for the discussion on this FLIP and looking forward to cooperating with you after we finalize the design and interfaces!
> >>>>>>>>
> >>>>>>>> [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-221+Abstraction+for+lookup+source+cache+and+metric
> >>>>>>>>
> >>>>>>>> Best regards,
> >>>>>>>>
> >>>>>>>> Qingsheng
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On Fri, May 13, 2022 at 12:12 AM Александр Смирнов <sm...@gmail.com> wrote:
> >>>>>>>>>
> >>>>>>>>> Hi Jark, Qingsheng and Leonard!
> >>>>>>>>>
> >>>>>>>>> Glad to see that we came to a consensus on almost all points!
> >>>>>>>>>
> >>>>>>>>> However I'm a little confused whether InputFormat is deprecated or
> >>>>>>>>> not. Am I right that it will be so in the future, but currently it's
> >>>>>>>>> not? Actually I also think that for the first version it's OK to use
> >>>>>>>>> InputFormat in the ALL cache implementation, because supporting rescan
> >>>>>>>>> ability seems like a very distant prospect. But for this decision we
> >>>>>>>>> need a consensus among all discussion participants.
> >>>>>>>>>
> >>>>>>>>> In general, I don't have anything to argue with in your statements. All
> >>>>>>>>> of them correspond to my ideas. Looking ahead, it would be nice to work
> >>>>>>>>> on this FLIP cooperatively. I've already done a lot of work on lookup
> >>>>>>>>> join caching with an implementation very close to the one we are
> >>>>>>>>> discussing, and want to share the results of this work. Anyway, looking
> >>>>>>>>> forward to the FLIP update!
> >>>>>>>>>
> >>>>>>>>> Best regards,
> >>>>>>>>> Smirnov Alexander
> >>>>>>>>>
> >>>>>>>>> чт, 12 мая 2022 г. в 17:38, Jark Wu <im...@gmail.com>:
> >>>>>>>>>>
> >>>>>>>>>> Hi Alex,
> >>>>>>>>>>
> >>>>>>>>>> Thanks for summarizing your points.
> >>>>>>>>>>
> >>>>>>>>>> In the past week, Qingsheng, Leonard, and I have discussed it several times
> >>>>>>>>>> and we have totally refactored the design.
> >>>>>>>>>> I'm glad to say we have reached a consensus on many of your points!
> >>>>>>>>>> Qingsheng is still working on updating the design docs and maybe can be
> >>>>>>>>>> available in the next few days.
> >>>>>>>>>> I will share some conclusions from our discussions:
> >>>>>>>>>>
> >>>>>>>>>> 1) we have refactored the design towards to "cache in framework" way.
> >>>>>>>>>>
> >>>>>>>>>> 2) a "LookupCache" interface for users to customize and a default
> >>>>>>>>>> implementation with builder for users to easy-use.
> >>>>>>>>>> This can both make it possible to both have flexibility and conciseness.
> >>>>>>>>>>
> >>>>>>>>>> 3) Filter pushdown is important for ALL and LRU lookup cache, esp reducing
> >>>>>>>>>> IO.
> >>>>>>>>>> Filter pushdown should be the final state and the unified way to both
> >>>>>>>>>> support pruning ALL cache and LRU cache,
> >>>>>>>>>> so I think we should make effort in this direction. If we need to support
> >>>>>>>>>> filter pushdown for ALL cache anyway, why not use
> >>>>>>>>>> it for LRU cache as well? Either way, as we decide to implement the cache
> >>>>>>>>>> in the framework, we have the chance to support
> >>>>>>>>>> filter on cache anytime. This is an optimization and it doesn't affect the
> >>>>>>>>>> public API. I think we can create a JIRA issue to
> >>>>>>>>>> discuss it when the FLIP is accepted.
> >>>>>>>>>>
> >>>>>>>>>> 4) The idea to support ALL cache is similar to your proposal.
> >>>>>>>>>> In the first version, we will only support InputFormat, SourceFunction for
> >>>>>>>>>> cache all (invoke InputFormat in join operator).
> >>>>>>>>>> For FLIP-27 source, we need to join a true source operator instead of
> >>>>>>>>>> calling it embedded in the join operator.
> >>>>>>>>>> However, this needs another FLIP to support the re-scan ability for FLIP-27
> >>>>>>>>>> Source, and this can be a large work.
> >>>>>>>>>> In order to not block this issue, we can put the effort of FLIP-27 source
> >>>>>>>>>> integration into future work and integrate
> >>>>>>>>>> InputFormat&SourceFunction for now.
> >>>>>>>>>>
> >>>>>>>>>> I think it's fine to use InputFormat & SourceFunction, as they are not
> >>>>>>>>>> deprecated; otherwise, we would have to introduce another function
> >>>>>>>>>> similar to them, which is meaningless. We need to plan FLIP-27 source
> >>>>>>>>>> integration ASAP, before InputFormat & SourceFunction are deprecated.
> >>>>>>>>>>
> >>>>>>>>>> Best,
> >>>>>>>>>> Jark
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> On Thu, 12 May 2022 at 15:46, Александр Смирнов <sm...@gmail.com>
> >>>>>>>>>> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> Hi Martijn!
> >>>>>>>>>>>
> >>>>>>>>>>> Got it. Therefore, the implementation with InputFormat is not considered.
> >>>>>>>>>>> Thanks for clearing that up!
> >>>>>>>>>>>
> >>>>>>>>>>> Best regards,
> >>>>>>>>>>> Smirnov Alexander
> >>>>>>>>>>>
> >>>>>>>>>>> чт, 12 мая 2022 г. в 14:23, Martijn Visser <ma...@ververica.com>:
> >>>>>>>>>>>>
> >>>>>>>>>>>> Hi,
> >>>>>>>>>>>>
> >>>>>>>>>>>> With regards to:
> >>>>>>>>>>>>
> >>>>>>>>>>>>> But if there are plans to refactor all connectors to FLIP-27
> >>>>>>>>>>>>
> >>>>>>>>>>>> Yes, FLIP-27 is the target for all connectors. The old interfaces will be
> >>>>>>>>>>>> deprecated and connectors will either be refactored to use the new ones
> >>>>>>>>>>> or
> >>>>>>>>>>>> dropped.
> >>>>>>>>>>>>
> >>>>>>>>>>>> The caching should work for connectors that are using FLIP-27 interfaces,
> >>>>>>>>>>>> we should not introduce new features for old interfaces.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Best regards,
> >>>>>>>>>>>>
> >>>>>>>>>>>> Martijn
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Thu, 12 May 2022 at 06:19, Александр Смирнов <sm...@gmail.com>
> >>>>>>>>>>>> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>>> Hi Jark!
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Sorry for the late response. I would like to make some comments and
> >>>>>>>>>>>>> clarify my points.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> 1) I agree with your first statement. I think we can achieve both
> >>>>>>>>>>>>> advantages this way: put the Cache interface in flink-table-common,
> >>>>>>>>>>>>> but have implementations of it in flink-table-runtime. Therefore if a
> >>>>>>>>>>>>> connector developer wants to use existing cache strategies and their
> >>>>>>>>>>>>> implementations, he can just pass lookupConfig to the planner, but if
> >>>>>>>>>>>>> he wants to have its own cache implementation in his TableFunction, it
> >>>>>>>>>>>>> will be possible for him to use the existing interface for this
> >>>>>>>>>>>>> purpose (we can explicitly point this out in the documentation). In
> >>>>>>>>>>>>> this way all configs and metrics will be unified. WDYT?
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> If a filter can prune 90% of data in the cache, we will have 90% of
> >>>>>>>>>>>>> lookup requests that can never be cached
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> 2) Let me clarify the logic of the filters optimization in the case of the LRU cache.
> >>>>>>>>>>>>> It looks like Cache<RowData, Collection<RowData>>. Here we always
> >>>>>>>>>>>>> store the response of the dimension table in cache, even after
> >>>>>>>>>>>>> applying calc function. I.e. if there are no rows after applying
> >>>>>>>>>>>>> filters to the result of the 'eval' method of TableFunction, we store
> >>>>>>>>>>>>> the empty list by lookup keys. Therefore the cache line will be
> >>>>>>>>>>>>> filled, but will require much less memory (in bytes). I.e. we don't
> >>>>>>>>>>>>> completely filter keys, by which result was pruned, but significantly
> >>>>>>>>>>>>> reduce required memory to store this result. If the user knows about
> >>>>>>>>>>>>> this behavior, he can increase the 'max-rows' option before the start
> >>>>>>>>>>>>> of the job. But actually I came up with the idea that we can do this
> >>>>>>>>>>>>> automatically by using the 'maximumWeight' and 'weigher' methods of
> >>>>>>>>>>>>> GuavaCache [1]. Weight can be the size of the collection of rows
> >>>>>>>>>>>>> (value of cache). Therefore cache can automatically fit much more
> >>>>>>>>>>>>> records than before.
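The weigher idea above can be sketched without the Guava dependency. The class below is an illustrative stand-in (neither Guava itself nor any proposed FLIP API) showing the same semantics as Guava's CacheBuilder maximumWeight/weigher: each cached value, a list of joined rows, weighs as many units as it has rows, so empty results for filtered-out keys are nearly free to keep.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.ToLongFunction;

class WeightedLruCache<K, V> {
    private final long maxWeight;
    private final ToLongFunction<V> weigher;
    private long currentWeight = 0;
    // access-order LinkedHashMap iterates least-recently-used entries first
    private final LinkedHashMap<K, V> entries = new LinkedHashMap<>(16, 0.75f, true);

    WeightedLruCache(long maxWeight, ToLongFunction<V> weigher) {
        this.maxWeight = maxWeight;
        this.weigher = weigher;
    }

    V get(K key) {
        return entries.get(key);
    }

    void put(K key, V value) {
        V previous = entries.put(key, value);
        if (previous != null) {
            currentWeight -= weigher.applyAsLong(previous);
        }
        currentWeight += weigher.applyAsLong(value);
        // evict least-recently-used entries until back under the weight budget
        Iterator<Map.Entry<K, V>> lruFirst = entries.entrySet().iterator();
        while (currentWeight > maxWeight && lruFirst.hasNext()) {
            Map.Entry<K, V> victim = lruFirst.next();
            currentWeight -= weigher.applyAsLong(victim.getValue());
            lruFirst.remove();
        }
    }
}
```

With a weigher of `rows.size()`, a pruned key caching an empty list consumes zero weight, which is exactly why a weight bound can fit far more entries than a plain 'max-rows' bound.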
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> Flink SQL has provided a standard way to do filters and projects
> >>>>>>>>>>>>> pushdown, i.e., SupportsFilterPushDown and SupportsProjectionPushDown.
> >>>>>>>>>>>>>> Jdbc/hive/HBase haven't implemented the interfaces, don't mean it's
> >>>>>>>>>>> hard
> >>>>>>>>>>>>> to implement.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> It's debatable how difficult it will be to implement filter pushdown.
> >>>>>>>>>>>>> But I think the fact that currently there is no database connector
> >>>>>>>>>>>>> with filter pushdown at least means that this feature won't be
> >>>>>>>>>>>>> supported soon in connectors. Moreover, if we talk about other
> >>>>>>>>>>>>> connectors (not in Flink repo), their databases might not support all
> >>>>>>>>>>>>> Flink filters (or not support filters at all). I think users are
> >>>>>>>>>>>>> interested in the cache filters optimization being supported
> >>>>>>>>>>>>> independently of other features and of solving more complex (or even
> >>>>>>>>>>>>> unsolvable) problems.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> 3) I agree with your third statement. Actually in our internal version
> >>>>>>>>>>>>> I also tried to unify the logic of scanning and reloading data from
> >>>>>>>>>>>>> connectors. But unfortunately, I didn't find a way to unify the logic
> >>>>>>>>>>>>> of all ScanRuntimeProviders (InputFormat, SourceFunction, Source,...)
> >>>>>>>>>>>>> and reuse it in reloading ALL cache. As a result I settled on using
> >>>>>>>>>>>>> InputFormat, because it was used for scanning in all lookup
> >>>>>>>>>>>>> connectors. (I didn't know that there are plans to deprecate
> >>>>>>>>>>>>> InputFormat in favor of FLIP-27 Source). IMO usage of FLIP-27 source
> >>>>>>>>>>>>> in ALL caching is not a good idea, because this source was designed to
> >>>>>>>>>>>>> work in distributed environment (SplitEnumerator on JobManager and
> >>>>>>>>>>>>> SourceReaders on TaskManagers), not in one operator (lookup join
> >>>>>>>>>>>>> operator in our case). There is even no direct way to pass splits from
> >>>>>>>>>>>>> SplitEnumerator to SourceReader (this logic works through
> >>>>>>>>>>>>> SplitEnumeratorContext, which requires
> >>>>>>>>>>>>> OperatorCoordinator.SubtaskGateway to send AddSplitEvents). Usage of
> >>>>>>>>>>>>> InputFormat for ALL cache seems much more clearer and easier. But if
> >>>>>>>>>>>>> there are plans to refactor all connectors to FLIP-27, I have the
> >>>>>>>>>>>>> following ideas: maybe we can abandon the lookup join ALL cache in
> >>>>>>>>>>>>> favor of a simple join with multiple scans of the batch source? The point
> >>>>>>>>>>>>> is that the only difference between lookup join ALL cache and simple
> >>>>>>>>>>>>> join with batch source is that in the first case scanning is performed
> >>>>>>>>>>>>> multiple times, in between which state (cache) is cleared (correct me
> >>>>>>>>>>>>> if I'm wrong). So what if we extend the functionality of simple join
> >>>>>>>>>>>>> to support state reloading + extend the functionality of scanning
> >>>>>>>>>>>>> batch source multiple times (this one should be easy with new FLIP-27
> >>>>>>>>>>>>> source, that unifies streaming/batch reading - we will need to change
> >>>>>>>>>>>>> only SplitEnumerator, which will pass splits again after some TTL).
> >>>>>>>>>>>>> WDYT? I must say that this looks like a long-term goal and will make
> >>>>>>>>>>>>> the scope of this FLIP even larger than you said. Maybe we can limit
> >>>>>>>>>>>>> ourselves to a simpler solution now (InputFormats).
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> So to sum up, my points is like this:
> >>>>>>>>>>>>> 1) There is a way to make both concise and flexible interfaces for
> >>>>>>>>>>>>> caching in lookup join.
> >>>>>>>>>>>>> 2) Cache filters optimization is important both in LRU and ALL caches.
> >>>>>>>>>>>>> 3) It is unclear when filter pushdown will be supported in Flink
> >>>>>>>>>>>>> connectors, some of the connectors might not have the opportunity to
> >>>>>>>>>>>>> support filter pushdown + as I know, currently filter pushdown works
> >>>>>>>>>>>>> only for scanning (not lookup). So cache filters + projections
> >>>>>>>>>>>>> optimization should be independent from other features.
> >>>>>>>>>>>>> 4) The ALL cache implementation is a complex topic that involves
> >>>>>>>>>>>>> multiple aspects of how Flink is evolving. Dropping InputFormat in favor
> >>>>>>>>>>>>> of the FLIP-27 Source will make the ALL cache implementation really
> >>>>>>>>>>>>> complex and unclear, so maybe instead we can extend the functionality of
> >>>>>>>>>>>>> the simple join, or keep InputFormat in the case of the lookup join ALL
> >>>>>>>>>>>>> cache?
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Best regards,
> >>>>>>>>>>>>> Smirnov Alexander
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> [1]
> >>>>>>>>>>>>>
> >>>>>>>>>>> https://guava.dev/releases/18.0/api/docs/com/google/common/cache/CacheBuilder.html#weigher(com.google.common.cache.Weigher)
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> чт, 5 мая 2022 г. в 20:34, Jark Wu <im...@gmail.com>:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> It's great to see the active discussion! I want to share my ideas:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> 1) implement the cache in framework vs. connectors base
> >>>>>>>>>>>>>> I don't have a strong opinion on this. Both ways should work (e.g.,
> >>>>>>>>>>> cache
> >>>>>>>>>>>>>> pruning, compatibility).
> >>>>>>>>>>>>>> The framework way can provide more concise interfaces.
> >>>>>>>>>>>>>> The connector base way can define more flexible cache
> >>>>>>>>>>>>>> strategies/implementations.
> >>>>>>>>>>>>>> We are still investigating a way to see if we can have both
> >>>>>>>>>>> advantages.
> >>>>>>>>>>>>>> We should reach a consensus that the way should be a final state,
> >>>>>>>>>>> and we
> >>>>>>>>>>>>>> are on the path to it.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> 2) filters and projections pushdown:
> >>>>>>>>>>>>>> I agree with Alex that the filter pushdown into cache can benefit a
> >>>>>>>>>>> lot
> >>>>>>>>>>>>> for
> >>>>>>>>>>>>>> ALL cache.
> >>>>>>>>>>>>>> However, this is not true for LRU cache. Connectors use cache to
> >>>>>>>>>>> reduce
> >>>>>>>>>>>>> IO
> >>>>>>>>>>>>>> requests to databases for better throughput.
> >>>>>>>>>>>>>> If a filter can prune 90% of data in the cache, we will have 90% of
> >>>>>>>>>>>>> lookup
> >>>>>>>>>>>>>> requests that can never be cached
> >>>>>>>>>>>>>> and hit directly to the databases. That means the cache is
> >>>>>>>>>>> meaningless in
> >>>>>>>>>>>>>> this case.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> IMO, Flink SQL has provided a standard way to do filters and projects
> >>>>>>>>>>>>>> pushdown, i.e., SupportsFilterPushDown and
> >>>>>>>>>>> SupportsProjectionPushDown.
> >>>>>>>>>>>>>> Jdbc/hive/HBase haven't implemented the interfaces, don't mean it's
> >>>>>>>>>>> hard
> >>>>>>>>>>>>> to
> >>>>>>>>>>>>>> implement.
> >>>>>>>>>>>>>> They should implement the pushdown interfaces to reduce IO and the
> >>>>>>>>>>> cache
> >>>>>>>>>>>>>> size.
> >>>>>>>>>>>>>> That should be a final state that the scan source and lookup source
> >>>>>>>>>>> share
> >>>>>>>>>>>>>> the exact pushdown implementation.
> >>>>>>>>>>>>>> I don't see why we need to duplicate the pushdown logic in caches,
> >>>>>>>>>>> which
> >>>>>>>>>>>>>> will complex the lookup join design.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> 3) ALL cache abstraction
> >>>>>>>>>>>>>> All cache might be the most challenging part of this FLIP. We have
> >>>>>>>>>>> never
> >>>>>>>>>>>>>> provided a reload-lookup public interface.
> >>>>>>>>>>>>>> Currently, we put the reload logic in the "eval" method of
> >>>>>>>>>>> TableFunction.
> >>>>>>>>>>>>>> That's hard for some sources (e.g., Hive).
> >>>>>>>>>>>>>> Ideally, connector implementation should share the logic of reload
> >>>>>>>>>>> and
> >>>>>>>>>>>>>> scan, i.e. ScanTableSource with InputFormat/SourceFunction/FLIP-27
> >>>>>>>>>>>>> Source.
> >>>>>>>>>>>>>> However, InputFormat/SourceFunction are deprecated, and the FLIP-27
> >>>>>>>>>>>>> source
> >>>>>>>>>>>>>> is deeply coupled with SourceOperator.
> >>>>>>>>>>>>>> If we want to invoke the FLIP-27 source in LookupJoin, this may make
> >>>>>>>>>>> the
> >>>>>>>>>>>>>> scope of this FLIP much larger.
> >>>>>>>>>>>>>> We are still investigating how to abstract the ALL cache logic and
> >>>>>>>>>>> reuse
> >>>>>>>>>>>>>> the existing source interfaces.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Best,
> >>>>>>>>>>>>>> Jark
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> On Thu, 5 May 2022 at 20:22, Roman Boyko <ro...@gmail.com>
> >>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> It's a much more complicated activity and lies out of the scope of
> >>>>>>>>>>> this
> >>>>>>>>>>>>>>> improvement. Because such pushdowns should be done for all
> >>>>>>>>>>>>> ScanTableSource
> >>>>>>>>>>>>>>> implementations (not only for Lookup ones).
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> On Thu, 5 May 2022 at 19:02, Martijn Visser <
> >>>>>>>>>>> martijnvisser@apache.org>
> >>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Hi everyone,
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> One question regarding "And Alexander correctly mentioned that
> >>>>>>>>>>> filter
> >>>>>>>>>>>>>>>> pushdown still is not implemented for jdbc/hive/hbase." -> Would
> >>>>>>>>>>> an
> >>>>>>>>>>>>>>>> alternative solution be to actually implement these filter
> >>>>>>>>>>> pushdowns?
> >>>>>>>>>>>>> I
> >>>>>>>>>>>>>>>> can
> >>>>>>>>>>>>>>>> imagine that there are many more benefits to doing that, outside
> >>>>>>>>>>> of
> >>>>>>>>>>>>> lookup
> >>>>>>>>>>>>>>>> caching and metrics.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Best regards,
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Martijn Visser
> >>>>>>>>>>>>>>>> https://twitter.com/MartijnVisser82
> >>>>>>>>>>>>>>>> https://github.com/MartijnVisser
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> On Thu, 5 May 2022 at 13:58, Roman Boyko <ro...@gmail.com>
> >>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Hi everyone!
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Thanks for driving such a valuable improvement!
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> I do think that single cache implementation would be a nice
> >>>>>>>>>>>>> opportunity
> >>>>>>>>>>>>>>>> for
> >>>>>>>>>>>>>>>>> users. And it will break the "FOR SYSTEM_TIME AS OF proc_time"
> >>>>>>>>>>>>> semantics
> >>>>>>>>>>>>>>>>> anyway - doesn't matter how it will be implemented.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Putting myself in the user's shoes, I can say that:
> >>>>>>>>>>>>>>>>> 1) I would prefer to have the opportunity to cut off the cache
> >>>>>>>>>>> size
> >>>>>>>>>>>>> by
> >>>>>>>>>>>>>>>>> simply filtering unnecessary data. And the most handy way to do
> >>>>>>>>>>> it
> >>>>>>>>>>>>> is
> >>>>>>>>>>>>>>>> apply
> >>>>>>>>>>>>>>>>> it inside LookupRunners. It would be a bit harder to pass it
> >>>>>>>>>>>>> through the
> >>>>>>>>>>>>>>>>> LookupJoin node to TableFunction. And Alexander correctly
> >>>>>>>>>>> mentioned
> >>>>>>>>>>>>> that
> >>>>>>>>>>>>>>>>> filter pushdown still is not implemented for jdbc/hive/hbase.
> >>>>>>>>>>>>>>>>> 2) The ability to set the different caching parameters for
> >>>>>>>>>>> different
> >>>>>>>>>>>>>>>> tables
> >>>>>>>>>>>>>>>>> is quite important. So I would prefer to set it through DDL
> >>>>>>>>>>> rather
> >>>>>>>>>>>>> than
> >>>>>>>>>>>>>>>>> have similar ttl, strategy and other options for all lookup
> >>>>>>>>>>> tables.
> >>>>>>>>>>>>>>>>> 3) Providing the cache into the framework really deprives us of
> >>>>>>>>>>>>>>>>> extensibility (users won't be able to implement their own
> >>>>>>>>>>> cache).
> >>>>>>>>>>>>> But
> >>>>>>>>>>>>>>>> most
> >>>>>>>>>>>>>>>>> probably it might be solved by creating more different cache
> >>>>>>>>>>>>> strategies
> >>>>>>>>>>>>>>>> and
> >>>>>>>>>>>>>>>>> a wider set of configurations.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> All these points are much closer to the schema proposed by
> >>>>>>>>>>>>> Alexander.
> >>>>>>>>>>>>>>>>> Qingshen Ren, please correct me if I'm not right and all these
> >>>>>>>>>>>>>>>> facilities
> >>>>>>>>>>>>>>>>> might be simply implemented in your architecture?
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Best regards,
> >>>>>>>>>>>>>>>>> Roman Boyko
> >>>>>>>>>>>>>>>>> e.: ro.v.boyko@gmail.com
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> On Wed, 4 May 2022 at 21:01, Martijn Visser <
> >>>>>>>>>>>>> martijnvisser@apache.org>
> >>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Hi everyone,
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> I don't have much to chip in, but just wanted to express that
> >>>>>>>>>>> I
> >>>>>>>>>>>>> really
> >>>>>>>>>>>>>>>>>> appreciate the in-depth discussion on this topic and I hope
> >>>>>>>>>>> that
> >>>>>>>>>>>>>>>> others
> >>>>>>>>>>>>>>>>>> will join the conversation.
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Best regards,
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Martijn
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> On Tue, 3 May 2022 at 10:15, Александр Смирнов <
> >>>>>>>>>>>>> smiralexan@gmail.com>
> >>>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Hi Qingsheng, Leonard and Jark,
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Thanks for your detailed feedback! However, I have questions
> >>>>>>>>>>>>> about
> >>>>>>>>>>>>>>>>>>> some of your statements (maybe I didn't get something?).
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> Caching actually breaks the semantic of "FOR SYSTEM_TIME
> >>>>>>>>>>> AS OF
> >>>>>>>>>>>>>>>>>> proc_time”
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> I agree that the semantics of "FOR SYSTEM_TIME AS OF
> >>>>>>>>>>> proc_time"
> >>>>>>>>>>>>> is
> >>>>>>>>>>>>>>>> not
> >>>>>>>>>>>>>>>>>>> fully implemented with caching, but as you said, users go
> >>>>>>>>>>> on it
> >>>>>>>>>>>>>>>>>>> consciously to achieve better performance (no one proposed
> >>>>>>>>>>> to
> >>>>>>>>>>>>> enable
> >>>>>>>>>>>>>>>>>>> caching by default, etc.). Or by users do you mean other
> >>>>>>>>>>>>> developers
> >>>>>>>>>>>>>>>> of
> >>>>>>>>>>>>>>>>>>> connectors? In this case developers explicitly specify
> >>>>>>>>>>> whether
> >>>>>>>>>>>>> their
> >>>>>>>>>>>>>>>>>>> connector supports caching or not (in the list of supported
> >>>>>>>>>>>>>>>> options),
> >>>>>>>>>>>>>>>>>>> no one makes them do that if they don't want to. So what
> >>>>>>>>>>>>> exactly is
> >>>>>>>>>>>>>>>>>>> the difference between implementing caching in modules
> >>>>>>>>>>>>>>>>>>> flink-table-runtime and in flink-table-common from the
> >>>>>>>>>>>>> considered
> >>>>>>>>>>>>>>>>>>> point of view? How does it affect on breaking/non-breaking
> >>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>> semantics of "FOR SYSTEM_TIME AS OF proc_time"?
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> confront a situation that allows table options in DDL to
> >>>>>>>>>>>>> control
> >>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>> behavior of the framework, which has never happened
> >>>>>>>>>>> previously
> >>>>>>>>>>>>> and
> >>>>>>>>>>>>>>>>> should
> >>>>>>>>>>>>>>>>>>> be cautious
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> If we talk about main differences of semantics of DDL
> >>>>>>>>>>> options
> >>>>>>>>>>>>> and
> >>>>>>>>>>>>>>>>>>> config options("table.exec.xxx"), isn't it about limiting
> >>>>>>>>>>> the
> >>>>>>>>>>>>> scope
> >>>>>>>>>>>>>>>> of
> >>>>>>>>>>>>>>>>>>> the options + importance for the user business logic rather
> >>>>>>>>>>> than
> >>>>>>>>>>>>>>>>>>> specific location of corresponding logic in the framework? I
> >>>>>>>>>>>>> mean
> >>>>>>>>>>>>>>>> that
> >>>>>>>>>>>>>>>>>>> in my design, for example, putting an option with lookup
> >>>>>>>>>>> cache
> >>>>>>>>>>>>>>>>>>> strategy in configurations would  be the wrong decision,
> >>>>>>>>>>>>> because it
> >>>>>>>>>>>>>>>>>>> directly affects the user's business logic (not just
> >>>>>>>>>>> performance
> >>>>>>>>>>>>>>>>>>> optimization) + touches just several functions of ONE table
> >>>>>>>>>>>>> (there
> >>>>>>>>>>>>>>>> can
> >>>>>>>>>>>>>>>>>>> be multiple tables with different caches). Does it really
> >>>>>>>>>>>>> matter for
> >>>>>>>>>>>>>>>>>>> the user (or someone else) where the logic is located,
> >>>>>>>>>>> which is
> >>>>>>>>>>>>>>>>>>> affected by the applied option?
> >>>>>>>>>>>>>>>>>>> Also I can remember DDL option 'sink.parallelism', which in
> >>>>>>>>>>>>> some way
> >>>>>>>>>>>>>>>>>>> "controls the behavior of the framework" and I don't see any
> >>>>>>>>>>>>> problem
> >>>>>>>>>>>>>>>>>>> here.
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> introduce a new interface for this all-caching scenario
> >>>>>>>>>>> and
> >>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>> design
> >>>>>>>>>>>>>>>>>>> would become more complex
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> This is a subject for a separate discussion, but actually
> >>>>>>>>>>> in our
> >>>>>>>>>>>>>>>>>>> internal version we solved this problem quite easily - we
> >>>>>>>>>>> reused
> >>>>>>>>>>>>>>>>>>> InputFormat class (so there is no need for a new API). The
> >>>>>>>>>>>>> point is
> >>>>>>>>>>>>>>>>>>> that currently all lookup connectors use InputFormat for
> >>>>>>>>>>>>> scanning
> >>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>> data in batch mode: HBase, JDBC and even Hive - it uses
> >>>>>>>>>>> class
> >>>>>>>>>>>>>>>>>>> PartitionReader, that is actually just a wrapper around
> >>>>>>>>>>>>> InputFormat.
> >>>>>>>>>>>>>>>>>>> The advantage of this solution is the ability to reload
> >>>>>>>>>>> cache
> >>>>>>>>>>>>> data
> >>>>>>>>>>>>>>>> in
> >>>>>>>>>>>>>>>>>>> parallel (number of threads depends on number of
> >>>>>>>>>>> InputSplits,
> >>>>>>>>>>>>> but
> >>>>>>>>>>>>>>>> has
> >>>>>>>>>>>>>>>>>>> an upper limit). As a result cache reload time significantly
> >>>>>>>>>>>>> reduces
> >>>>>>>>>>>>>>>>>>> (as well as time of input stream blocking). I know that
> >>>>>>>>>>> usually
> >>>>>>>>>>>>> we
> >>>>>>>>>>>>>>>> try
> >>>>>>>>>>>>>>>>>>> to avoid usage of concurrency in Flink code, but maybe this
> >>>>>>>>>>> one
> >>>>>>>>>>>>> can
> >>>>>>>>>>>>>>>> be
> >>>>>>>>>>>>>>>>>>> an exception. BTW I don't say that it's an ideal solution,
> >>>>>>>>>>> maybe
> >>>>>>>>>>>>>>>> there
> >>>>>>>>>>>>>>>>>>> are better ones.
> >>>>>>>>>>>>>>>>>>>
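As an illustration of the parallel reload described above (splitting the full scan into InputSplit-like chunks and loading them concurrently), a minimal sketch could look like the following. All class and method names here are hypothetical, not actual Flink API:

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch: reloads an "ALL" cache in parallel, one task per split,
// mirroring how InputFormat's InputSplits could be scanned concurrently.
class ParallelCacheReloader {

    /** Stand-in for an InputSplit: here just a half-open range of keys. */
    record Split(int from, int to) {}

    private final ExecutorService pool;

    ParallelCacheReloader(int maxThreads) {
        // upper limit on threads, as suggested in the thread
        this.pool = Executors.newFixedThreadPool(maxThreads);
    }

    /** Loads all splits concurrently into a fresh map; the caller swaps it in
     *  atomically so the input stream is blocked for a shorter time. */
    Map<Integer, String> reload(List<Split> splits) throws Exception {
        ConcurrentMap<Integer, String> fresh = new ConcurrentHashMap<>();
        List<Future<?>> tasks = splits.stream()
                .map(s -> pool.submit(() -> loadSplit(s, fresh)))
                .toList();
        for (Future<?> t : tasks) {
            t.get(); // propagate any failure before publishing the new cache
        }
        return fresh;
    }

    /** Pretends to scan one split of the dimension table. */
    private void loadSplit(Split split, ConcurrentMap<Integer, String> out) {
        for (int key = split.from(); key < split.to(); key++) {
            out.put(key, "row-" + key);
        }
    }

    void close() {
        pool.shutdown();
    }
}
```

A real implementation would of course read rows through the connector's InputFormat rather than generate them.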
> >>>>>>>>>>>>>>>>>>>> Providing the cache in the framework might introduce
> >>>>>>>>>>>>> compatibility
> >>>>>>>>>>>>>>>>>> issues
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> It's possible only in cases when the developer of the
> >>>>>>>>>>> connector
> >>>>>>>>>>>>>>>> won't
> >>>>>>>>>>>>>>>>>>> properly refactor his code and will use new cache options
> >>>>>>>>>>>>>>>> incorrectly
> >>>>>>>>>>>>>>>>>>> (i.e. explicitly provide the same options into 2 different
> >>>>>>>>>>> code
> >>>>>>>>>>>>>>>>>>> places). For correct behavior all he will need to do is to
> >>>>>>>>>>>>> redirect
> >>>>>>>>>>>>>>>>>>> existing options to the framework's LookupConfig (+ maybe
> >>>>>>>>>>> add an
> >>>>>>>>>>>>>>>> alias
> >>>>>>>>>>>>>>>>>>> for options, if there was different naming), everything
> >>>>>>>>>>> will be
> >>>>>>>>>>>>>>>>>>> transparent for users. If the developer doesn't do
> >>>>>>>>>>> refactoring at
> >>>>>>>>>>>>> all,
> >>>>>>>>>>>>>>>>>>> nothing will be changed for the connector because of
> >>>>>>>>>>> backward
> >>>>>>>>>>>>>>>>>>> compatibility. Also if a developer wants to use his own
> >>>>>>>>>>> cache
> >>>>>>>>>>>>> logic,
> >>>>>>>>>>>>>>>>>>> he can simply decline to pass some of the configs into the
> >>>>>>>>>>>>> framework,
> >>>>>>>>>>>>>>>> and
> >>>>>>>>>>>>>>>>>>> instead make his own implementation with already existing
> >>>>>>>>>>>>> configs
> >>>>>>>>>>>>>>>> and
> >>>>>>>>>>>>>>>>>>> metrics (but actually I think that it's a rare case).
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> filters and projections should be pushed all the way down
> >>>>>>>>>>> to
> >>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>> table
> >>>>>>>>>>>>>>>>>>> function, like what we do in the scan source
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> That's a great goal. But the truth is that the ONLY
> >>>>>>>>>>> connector
> >>>>>>>>>>>>>>>> that
> >>>>>>>>>>>>>>>>>>> supports filter pushdown is FileSystemTableSource
> >>>>>>>>>>>>>>>>>>> (no database connector supports it currently). Also for some
> >>>>>>>>>>>>>>>> databases
> >>>>>>>>>>>>>>>>>>> it's simply impossible to push down such complex filters
> >>>>>>>>>>> that we
> >>>>>>>>>>>>> have
> >>>>>>>>>>>>>>>>>>> in Flink.
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> only applying these optimizations to the cache seems not
> >>>>>>>>>>>>> quite
> >>>>>>>>>>>>>>>>> useful
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Filters can cut off an arbitrarily large amount of data
> >>>>>>>>>>> from the
> >>>>>>>>>>>>>>>>>>> dimension table. For a simple example, suppose in dimension
> >>>>>>>>>>>>> table
> >>>>>>>>>>>>>>>>>>> 'users'
> >>>>>>>>>>>>>>>>>>> we have column 'age' with values from 20 to 40, and input
> >>>>>>>>>>> stream
> >>>>>>>>>>>>>>>>>>> 'clicks' that is ~uniformly distributed by age of users. If
> >>>>>>>>>>> we
> >>>>>>>>>>>>> have
> >>>>>>>>>>>>>>>>>>> filter 'age > 30',
> >>>>>>>>>>>>>>>>>>> there will be half as much data in the cache. This means the user
> >>>>>>>>>>> can
> >>>>>>>>>>>>>>>>>>> increase 'lookup.cache.max-rows' by almost 2 times. It will
> >>>>>>>>>>>>> gain a
> >>>>>>>>>>>>>>>>>>> huge
> >>>>>>>>>>>>>>>>>>> performance boost. Moreover, this optimization starts to
> >>>>>>>>>>> really
> >>>>>>>>>>>>>>>> shine
> >>>>>>>>>>>>>>>>>>> in 'ALL' cache, where tables without filters and projections
> >>>>>>>>>>>>> can't
> >>>>>>>>>>>>>>>> fit
> >>>>>>>>>>>>>>>>>>> in memory, but with them - can. This opens up additional
> >>>>>>>>>>>>>>>> possibilities
> >>>>>>>>>>>>>>>>>>> for users. And this doesn't sound like 'not quite useful'.
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> It would be great to hear other voices regarding this topic!
> >>>>>>>>>>>>> Because
> >>>>>>>>>>>>>>>>>>> we have quite a lot of controversial points, and I think
> >>>>>>>>>>> with
> >>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>> help
> >>>>>>>>>>>>>>>>>>> of others it will be easier for us to come to a consensus.
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Best regards,
> >>>>>>>>>>>>>>>>>>> Smirnov Alexander
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> пт, 29 апр. 2022 г. в 22:33, Qingsheng Ren <
> >>>>>>>>>>> renqschn@gmail.com
> >>>>>>>>>>>>>> :
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> Hi Alexander and Arvid,
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> Thanks for the discussion and sorry for my late response!
> >>>>>>>>>>> We
> >>>>>>>>>>>>> had
> >>>>>>>>>>>>>>>> an
> >>>>>>>>>>>>>>>>>>> internal discussion together with Jark and Leonard and I’d
> >>>>>>>>>>> like
> >>>>>>>>>>>>> to
> >>>>>>>>>>>>>>>>>>> summarize our ideas. Instead of implementing the cache
> >>>>>>>>>>> logic in
> >>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>> table
> >>>>>>>>>>>>>>>>>>> runtime layer or wrapping around the user-provided table
> >>>>>>>>>>>>> function,
> >>>>>>>>>>>>>>>> we
> >>>>>>>>>>>>>>>>>>> prefer to introduce some new APIs extending TableFunction
> >>>>>>>>>>> with
> >>>>>>>>>>>>> these
> >>>>>>>>>>>>>>>>>>> concerns:
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> 1. Caching actually breaks the semantic of "FOR
> >>>>>>>>>>> SYSTEM_TIME
> >>>>>>>>>>>>> AS OF
> >>>>>>>>>>>>>>>>>>> proc_time”, because it couldn’t truly reflect the content
> >>>>>>>>>>> of the
> >>>>>>>>>>>>>>>> lookup
> >>>>>>>>>>>>>>>>>>> table at the moment of querying. If users choose to enable
> >>>>>>>>>>>>> caching
> >>>>>>>>>>>>>>>> on
> >>>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>> lookup table, they implicitly indicate that this breakage is
> >>>>>>>>>>>>>>>> acceptable
> >>>>>>>>>>>>>>>>>> in
> >>>>>>>>>>>>>>>>>>> exchange for the performance. So we prefer not to provide
> >>>>>>>>>>>>> caching on
> >>>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>> table runtime level.
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> 2. If we make the cache implementation in the framework
> >>>>>>>>>>>>> (whether
> >>>>>>>>>>>>>>>> in a
> >>>>>>>>>>>>>>>>>>> runner or a wrapper around TableFunction), we have to
> >>>>>>>>>>> confront a
> >>>>>>>>>>>>>>>>>> situation
> >>>>>>>>>>>>>>>>>>> that allows table options in DDL to control the behavior of
> >>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>> framework,
> >>>>>>>>>>>>>>>>>>> which has never happened previously and should be cautious.
> >>>>>>>>>>>>> Under
> >>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>> current design the behavior of the framework should only be
> >>>>>>>>>>>>>>>> specified
> >>>>>>>>>>>>>>>>> by
> >>>>>>>>>>>>>>>>>>> configurations (“table.exec.xxx”), and it’s hard to apply
> >>>>>>>>>>> these
> >>>>>>>>>>>>>>>> general
> >>>>>>>>>>>>>>>>>>> configs to a specific table.
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> 3. We have use cases that lookup source loads and refresh
> >>>>>>>>>>> all
> >>>>>>>>>>>>>>>> records
> >>>>>>>>>>>>>>>>>>> periodically into the memory to achieve high lookup
> >>>>>>>>>>> performance
> >>>>>>>>>>>>>>>> (like
> >>>>>>>>>>>>>>>>>> Hive
> >>>>>>>>>>>>>>>>>>> connector in the community, and also widely used by our
> >>>>>>>>>>> internal
> >>>>>>>>>>>>>>>>>>> connectors). Wrapping the cache around the user’s
> >>>>>>>>>>> TableFunction
> >>>>>>>>>>>>>>>> works
> >>>>>>>>>>>>>>>>>> fine
> >>>>>>>>>>>>>>>>>>> for LRU caches, but I think we have to introduce a new
> >>>>>>>>>>>>> interface for
> >>>>>>>>>>>>>>>>> this
> >>>>>>>>>>>>>>>>>>> all-caching scenario and the design would become more
> >>>>>>>>>>> complex.
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> 4. Providing the cache in the framework might introduce
> >>>>>>>>>>>>>>>> compatibility
> >>>>>>>>>>>>>>>>>>> issues to existing lookup sources like there might exist two
> >>>>>>>>>>>>> caches
> >>>>>>>>>>>>>>>>> with
> >>>>>>>>>>>>>>>>>>> totally different strategies if the user incorrectly
> >>>>>>>>>>> configures
> >>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>> table
> >>>>>>>>>>>>>>>>>>> (one in the framework and another implemented by the lookup
> >>>>>>>>>>>>> source).
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> As for the optimization mentioned by Alexander, I think
> >>>>>>>>>>>>> filters
> >>>>>>>>>>>>>>>> and
> >>>>>>>>>>>>>>>>>>> projections should be pushed all the way down to the table
> >>>>>>>>>>>>> function,
> >>>>>>>>>>>>>>>>> like
> >>>>>>>>>>>>>>>>>>> what we do in the scan source, instead of the runner with
> >>>>>>>>>>> the
> >>>>>>>>>>>>> cache.
> >>>>>>>>>>>>>>>>> The
> >>>>>>>>>>>>>>>>>>> goal of using cache is to reduce the network I/O and
> >>>>>>>>>>> pressure
> >>>>>>>>>>>>> on the
> >>>>>>>>>>>>>>>>>>> external system, and only applying these optimizations to
> >>>>>>>>>>> the
> >>>>>>>>>>>>> cache
> >>>>>>>>>>>>>>>>> seems
> >>>>>>>>>>>>>>>>>>> not quite useful.
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> I made some updates to the FLIP[1] to reflect our ideas.
> >>>>>>>>>>> We
> >>>>>>>>>>>>>>>> prefer to
> >>>>>>>>>>>>>>>>>>> keep the cache implementation as a part of TableFunction,
> >>>>>>>>>>> and we
> >>>>>>>>>>>>>>>> could
> >>>>>>>>>>>>>>>>>>> provide some helper classes (CachingTableFunction,
> >>>>>>>>>>>>>>>>>> AllCachingTableFunction,
> >>>>>>>>>>>>>>>>>>> CachingAsyncTableFunction) to developers and regulate
> >>>>>>>>>>> metrics
> >>>>>>>>>>>>> of the
> >>>>>>>>>>>>>>>>>> cache.
> >>>>>>>>>>>>>>>>>>> Also, I made a POC[2] for your reference.
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> Looking forward to your ideas!
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> [1]
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-221+Abstraction+for+lookup+source+cache+and+metric
> >>>>>>>>>>>>>>>>>>>> [2] https://github.com/PatrickRen/flink/tree/FLIP-221
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> Best regards,
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> Qingsheng
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> On Tue, Apr 26, 2022 at 4:45 PM Александр Смирнов <
> >>>>>>>>>>>>>>>>>> smiralexan@gmail.com>
> >>>>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> Thanks for the response, Arvid!
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> I have few comments on your message.
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> but could also live with an easier solution as the
> >>>>>>>>>>> first
> >>>>>>>>>>>>> step:
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> I think that these 2 ways are mutually exclusive
> >>>>>>>>>>> (originally
> >>>>>>>>>>>>>>>>> proposed
> >>>>>>>>>>>>>>>>>>>>> by Qingsheng and mine), because conceptually they follow
> >>>>>>>>>>> the
> >>>>>>>>>>>>> same
> >>>>>>>>>>>>>>>>>>>>> goal, but implementation details are different. If we
> >>>>>>>>>>> will
> >>>>>>>>>>>>> go one
> >>>>>>>>>>>>>>>>> way,
> >>>>>>>>>>>>>>>>>>>>> moving to another way in the future will mean deleting
> >>>>>>>>>>>>> existing
> >>>>>>>>>>>>>>>> code
> >>>>>>>>>>>>>>>>>>>>> and once again changing the API for connectors. So I
> >>>>>>>>>>> think we
> >>>>>>>>>>>>>>>> should
> >>>>>>>>>>>>>>>>>>>>> reach a consensus with the community about that and then
> >>>>>>>>>>> work
> >>>>>>>>>>>>>>>>> together
> >>>>>>>>>>>>>>>>>>>>> on this FLIP, i.e. divide the work on tasks for different
> >>>>>>>>>>>>> parts
> >>>>>>>>>>>>>>>> of
> >>>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>>>> flip (for example, LRU cache unification / introducing
> >>>>>>>>>>>>> proposed
> >>>>>>>>>>>>>>>> set
> >>>>>>>>>>>>>>>>> of
> >>>>>>>>>>>>>>>>>>>>> metrics / further work…). WDYT, Qingsheng?
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> as the source will only receive the requests after
> >>>>>>>>>>> filter
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> Actually if filters are applied to fields of the lookup
> >>>>>>>>>>>>> table, we
> >>>>>>>>>>>>>>>>>>>>> must first make requests, and only after that we can
> >>>>>>>>>>> filter
> >>>>>>>>>>>>>>>>> responses,
> >>>>>>>>>>>>>>>>>>>>> because lookup connectors don't have filter pushdown. So
> >>>>>>>>>>> if
> >>>>>>>>>>>>>>>>> filtering
> >>>>>>>>>>>>>>>>>>>>> is done before caching, there will be far fewer rows in the
> >>>>>>>>>>>>> cache.
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> @Alexander unfortunately, your architecture is not
> >>>>>>>>>>> shared.
> >>>>>>>>>>>>> I
> >>>>>>>>>>>>>>>> don't
> >>>>>>>>>>>>>>>>>>> know the
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> solution to share images to be honest.
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> Sorry for that, I’m a bit new to such kinds of
> >>>>>>>>>>> conversations
> >>>>>>>>>>>>> :)
> >>>>>>>>>>>>>>>>>>>>> I have no write access to the confluence, so I made a
> >>>>>>>>>>> Jira
> >>>>>>>>>>>>> issue,
> >>>>>>>>>>>>>>>>>>>>> where described the proposed changes in more details -
> >>>>>>>>>>>>>>>>>>>>> https://issues.apache.org/jira/browse/FLINK-27411.
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> Will happy to get more feedback!
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> Best,
> >>>>>>>>>>>>>>>>>>>>> Smirnov Alexander
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> пн, 25 апр. 2022 г. в 19:49, Arvid Heise <
> >>>>>>>>>>> arvid@apache.org>:
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> Hi Qingsheng,
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> Thanks for driving this; the inconsistency was not
> >>>>>>>>>>>>> satisfying
> >>>>>>>>>>>>>>>> for
> >>>>>>>>>>>>>>>>>> me.
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> I second Alexander's idea though but could also live
> >>>>>>>>>>> with
> >>>>>>>>>>>>> an
> >>>>>>>>>>>>>>>>> easier
> >>>>>>>>>>>>>>>>>>>>>> solution as the first step: Instead of making caching
> >>>>>>>>>>> an
> >>>>>>>>>>>>>>>>>>> implementation
> >>>>>>>>>>>>>>>>>>>>>> detail of TableFunction X, rather devise a caching
> >>>>>>>>>>> layer
> >>>>>>>>>>>>>>>> around X.
> >>>>>>>>>>>>>>>>>> So
> >>>>>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>>>>> proposal would be a CachingTableFunction that
> >>>>>>>>>>> delegates to
> >>>>>>>>>>>>> X in
> >>>>>>>>>>>>>>>>> case
> >>>>>>>>>>>>>>>>>>> of
> >>>>>>>>>>>>>>>>>>>>>> misses and else manages the cache. Lifting it into the
> >>>>>>>>>>>>> operator
> >>>>>>>>>>>>>>>>>> model
> >>>>>>>>>>>>>>>>>>> as
> >>>>>>>>>>>>>>>>>>>>>> proposed would be even better but is probably
> >>>>>>>>>>> unnecessary
> >>>>>>>>>>>>> in
> >>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>> first step
> >>>>>>>>>>>>>>>>>>>>>> for a lookup source (as the source will only receive
> >>>>>>>>>>> the
> >>>>>>>>>>>>>>>> requests
> >>>>>>>>>>>>>>>>>>> after
> >>>>>>>>>>>>>>>>>>>>>> filter; applying projection may be more interesting to
> >>>>>>>>>>> save
> >>>>>>>>>>>>>>>>> memory).
> >>>>>>>>>>>>>>>>>>>>>>
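A minimal sketch of the delegating cache layer described here, with a plain java.util.function.Function standing in for the connector's TableFunction and an access-ordered LinkedHashMap providing simple LRU eviction (purely illustrative, not the proposed API):

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Sketch of a caching layer around a lookup function: the delegate is called
// only on a cache miss, and hit/miss counters model the proposed metrics.
class CachingLookup<K, V> {
    private final Function<K, V> delegate;
    private final Map<K, V> cache;
    private long hits;
    private long misses;

    CachingLookup(Function<K, V> delegate, int maxRows) {
        this.delegate = delegate;
        // access-order LinkedHashMap gives a simple LRU eviction policy
        this.cache = new LinkedHashMap<>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                return size() > maxRows;
            }
        };
    }

    V lookup(K key) {
        V value = cache.get(key);
        if (value != null) {
            hits++;
            return value;
        }
        misses++;
        value = delegate.apply(key); // miss: ask the external system
        cache.put(key, value);
        return value;
    }

    long hits() { return hits; }
    long misses() { return misses; }
}
```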
> >>>>>>>>>>>>>>>>>>>>>> Another advantage is that all the changes of this FLIP
> >>>>>>>>>>>>> would be
> >>>>>>>>>>>>>>>>>>> limited to
> >>>>>>>>>>>>>>>>>>>>>> options, no need for new public interfaces. Everything
> >>>>>>>>>>> else
> >>>>>>>>>>>>>>>>> remains
> >>>>>>>>>>>>>>>>>> an
> >>>>>>>>>>>>>>>>>>>>>> implementation of Table runtime. That means we can
> >>>>>>>>>>> easily
> >>>>>>>>>>>>>>>>>> incorporate
> >>>>>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>>>>> optimization potential that Alexander pointed out
> >>>>>>>>>>> later.
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> @Alexander unfortunately, your architecture is not
> >>>>>>>>>>> shared.
> >>>>>>>>>>>>> I
> >>>>>>>>>>>>>>>> don't
> >>>>>>>>>>>>>>>>>>> know the
> >>>>>>>>>>>>>>>>>>>>>> solution to share images to be honest.
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> On Fri, Apr 22, 2022 at 5:04 PM Александр Смирнов <
> >>>>>>>>>>>>>>>>>>> smiralexan@gmail.com>
> >>>>>>>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> Hi Qingsheng! My name is Alexander, I'm not a
> >>>>>>>>>>> committer
> >>>>>>>>>>>>> yet,
> >>>>>>>>>>>>>>>> but
> >>>>>>>>>>>>>>>>>> I'd
> >>>>>>>>>>>>>>>>>>>>>>> really like to become one. And this FLIP really
> >>>>>>>>>>>>> interested
> >>>>>>>>>>>>>>>> me.
> >>>>>>>>>>>>>>>>>>>>>>> Actually I have worked on a similar feature in my
> >>>>>>>>>>>>> company’s
> >>>>>>>>>>>>>>>>> Flink
> >>>>>>>>>>>>>>>>>>>>>>> fork, and we would like to share our thoughts on
> >>>>>>>>>>> this and
> >>>>>>>>>>>>>>>> make
> >>>>>>>>>>>>>>>>>> code
> >>>>>>>>>>>>>>>>>>>>>>> open source.
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> I think there is a better alternative than
> >>>>>>>>>>> introducing an
> >>>>>>>>>>>>>>>>> abstract
> >>>>>>>>>>>>>>>>>>>>>>> class for TableFunction (CachingTableFunction). As
> >>>>>>>>>>> you
> >>>>>>>>>>>>> know,
> >>>>>>>>>>>>>>>>>>>>>>> TableFunction exists in the flink-table-common
> >>>>>>>>>>> module,
> >>>>>>>>>>>>> which
> >>>>>>>>>>>>>>>>>>> provides
> >>>>>>>>>>>>>>>>>>>>>>> only an API for working with tables – it’s very
> >>>>>>>>>>>>> convenient
> >>>>>>>>>>>>>>>> for
> >>>>>>>>>>>>>>>>>>> importing
> >>>>>>>>>>>>>>>>>>>>>>> in connectors. In turn, CachingTableFunction contains
> >>>>>>>>>>>>> logic
> >>>>>>>>>>>>>>>> for
> >>>>>>>>>>>>>>>>>>>>>>> runtime execution,  so this class and everything
> >>>>>>>>>>>>> connected
> >>>>>>>>>>>>>>>> with
> >>>>>>>>>>>>>>>>> it
> >>>>>>>>>>>>>>>>>>>>>>> should be located in another module, probably in
> >>>>>>>>>>>>>>>>>>> flink-table-runtime.
> >>>>>>>>>>>>>>>>>>>>>>> But this will require connectors to depend on another
> >>>>>>>>>>>>> module,
> >>>>>>>>>>>>>>>>>> which
> >>>>>>>>>>>>>>>>>>>>>>> contains a lot of runtime logic, which doesn’t sound
> >>>>>>>>>>>>> good.
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> I suggest adding a new method ‘getLookupConfig’ to
> >>>>>>>>>>>>>>>>>> LookupTableSource
> >>>>>>>>>>>>>>>>>>>>>>> or LookupRuntimeProvider to allow connectors to only
> >>>>>>>>>>> pass
> >>>>>>>>>>>>>>>>>>>>>>> configurations to the planner, therefore they won’t
> >>>>>>>>>>>>> depend on
> >>>>>>>>>>>>>>>>>>> runtime
> >>>>>>>>>>>>>>>>>>>>>>> realization. Based on these configs planner will
> >>>>>>>>>>>>> construct a
> >>>>>>>>>>>>>>>>>> lookup
> >>>>>>>>>>>>>>>>>>>>>>> join operator with corresponding runtime logic
> >>>>>>>>>>>>>>>> (ProcessFunctions
> >>>>>>>>>>>>>>>>>> in
> >>>>>>>>>>>>>>>>>>>>>>> module flink-table-runtime). Architecture looks like
> >>>>>>>>>>> in
> >>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>> pinned
> >>>>>>>>>>>>>>>>>>>>>>> image (LookupConfig class there is actually yours
> >>>>>>>>>>>>>>>> CacheConfig).
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> Classes in flink-table-planner, that will be
> >>>>>>>>>>> responsible
> >>>>>>>>>>>>> for
> >>>>>>>>>>>>>>>>> this
> >>>>>>>>>>>>>>>>>> –
> >>>>>>>>>>>>>>>>>>>>>>> CommonPhysicalLookupJoin and his inheritors.
> >>>>>>>>>>>>>>>>>>>>>>> Current classes for lookup join in
> >>>>>>>>>>> flink-table-runtime
> >>>>>>>>>>>>> -
> >>>>>>>>>>>>>>>>>>>>>>> LookupJoinRunner, AsyncLookupJoinRunner,
> >>>>>>>>>>>>>>>>> LookupJoinRunnerWithCalc,
> >>>>>>>>>>>>>>>>>>>>>>> AsyncLookupJoinRunnerWithCalc.
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> I suggest adding classes LookupJoinCachingRunner,
> >>>>>>>>>>>>>>>>>>>>>>> LookupJoinCachingRunnerWithCalc, etc.
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> And here comes another more powerful advantage of
> >>>>>>>>>>> such a
> >>>>>>>>>>>>>>>>> solution.
> >>>>>>>>>>>>>>>>>>> If
> >>>>>>>>>>>>>>>>>>>>>>> we have caching logic on a lower level, we can apply
> >>>>>>>>>>> some
> >>>>>>>>>>>>>>>>>>>>>>> optimizations to it. LookupJoinRunnerWithCalc was
> >>>>>>>>>>> named
> >>>>>>>>>>>>> like
> >>>>>>>>>>>>>>>>> this
> >>>>>>>>>>>>>>>>>>>>>>> because it uses the ‘calc’ function, which actually
> >>>>>>>>>>>>> mostly
> >>>>>>>>>>>>>>>>>> consists
> >>>>>>>>>>>>>>>>>>> of
> >>>>>>>>>>>>>>>>>>>>>>> filters and projections.
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> For example, in join table A with lookup table B
> >>>>>>>>>>>>> condition
> >>>>>>>>>>>>>>>>> ‘JOIN …
> >>>>>>>>>>>>>>>>>>> ON
> >>>>>>>>>>>>>>>>>>>>>>> A.id = B.id AND A.age = B.age + 10 WHERE B.salary >
> >>>>>>>>>>> 1000’
> >>>>>>>>>>>>>>>>> ‘calc’
> >>>>>>>>>>>>>>>>>>>>>>> function will contain filters A.age = B.age + 10 and
> >>>>>>>>>>>>>>>> B.salary >
> >>>>>>>>>>>>>>>>>>> 1000.
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> If we apply this function before storing records in
> >>>>>>>>>>>>> cache,
> >>>>>>>>>>>>>>>> size
> >>>>>>>>>>>>>>>>> of
> >>>>>>>>>>>>>>>>>>>>>>> cache will be significantly reduced: filters = avoid
> >>>>>>>>>>>>> storing
> >>>>>>>>>>>>>>>>>> useless
> >>>>>>>>>>>>>>>>>>>>>>> records in cache, projections = reduce records’
> >>>>>>>>>>> size. So
> >>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>> initial
> >>>>>>>>>>>>>>>>>>>>>>> max number of records in cache can be increased by
> >>>>>>>>>>> the
> >>>>>>>>>>>>> user.
> >>>>>>>>>>>>>>>>>>>>>>>
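The "apply calc before caching" idea above can be sketched roughly like this; the row and projection types are made up for illustration (a real runner would operate on Flink's internal row type):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;
import java.util.function.Predicate;

// Sketch: rows failing the pushed-down filter are never cached, and surviving
// rows are projected down before storage, so the cache holds fewer and smaller
// entries than a cache of raw lookup responses.
class FilteringCache<K, R, P> {
    private final Predicate<R> filter;       // e.g. B.salary > 1000
    private final Function<R, P> projection; // keep only the needed columns
    private final Map<K, P> cache = new HashMap<>();

    FilteringCache(Predicate<R> filter, Function<R, P> projection) {
        this.filter = filter;
        this.projection = projection;
    }

    /** Applies the calc to a fetched row and caches only what survives. */
    void put(K key, R row) {
        if (filter.test(row)) {
            cache.put(key, projection.apply(row));
        }
        // filtered-out rows are simply not cached, shrinking the cache
    }

    P get(K key) {
        return cache.get(key);
    }

    int size() {
        return cache.size();
    }
}
```

With the uniform age distribution from the earlier example (ages 21..40, filter age > 30), only half of the rows end up in the cache.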
> >>>>>>>>>>>>>>>>>>>>>>> What do you think about it?
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> On 2022/04/19 02:47:11 Qingsheng Ren wrote:
> >>>>>>>>>>>>>>>>>>>>>>>> Hi devs,
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> Yuan and I would like to start a discussion about
> >>>>>>>>>>>>>>>> FLIP-221[1],
> >>>>>>>>>>>>>>>>>>> which
> >>>>>>>>>>>>>>>>>>>>>>> introduces an abstraction of lookup table cache and
> >>>>>>>>>>> its
> >>>>>>>>>>>>>>>> standard
> >>>>>>>>>>>>>>>>>>> metrics.
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> Currently each lookup table source should implement
> >>>>>>>>>>>>> their
> >>>>>>>>>>>>>>>> own
> >>>>>>>>>>>>>>>>>>> cache to
> >>>>>>>>>>>>>>>>>>>>>>> store lookup results, and there isn’t a standard of
> >>>>>>>>>>>>> metrics
> >>>>>>>>>>>>>>>> for
> >>>>>>>>>>>>>>>>>>> users and
> >>>>>>>>>>>>>>>>>>>>>>> developers to tuning their jobs with lookup joins,
> >>>>>>>>>>> which
> >>>>>>>>>>>>> is a
> >>>>>>>>>>>>>>>>>> quite
> >>>>>>>>>>>>>>>>>>> common
> >>>>>>>>>>>>>>>>>>>>>>> use case in Flink table / SQL.
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> Therefore we propose some new APIs including cache,
> >>>>>>>>>>>>>>>> metrics,
> >>>>>>>>>>>>>>>>>>> wrapper
> >>>>>>>>>>>>>>>>>>>>>>> classes of TableFunction and new table options.
> >>>>>>>>>>> Please
> >>>>>>>>>>>>> take a
> >>>>>>>>>>>>>>>>> look
> >>>>>>>>>>>>>>>>>>> at the
> >>>>>>>>>>>>>>>>>>>>>>> FLIP page [1] to get more details. Any suggestions
> >>>>>>>>>>> and
> >>>>>>>>>>>>>>>> comments
> >>>>>>>>>>>>>>>>>>> would be
> >>>>>>>>>>>>>>>>>>>>>>> appreciated!
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> [1]
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-221+Abstraction+for+lookup+source+cache+and+metric
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> Best regards,
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> Qingsheng
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> --
> >>>>>>>>>>>>>>>>>>>> Best Regards,
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> Qingsheng Ren
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> Real-time Computing Team
> >>>>>>>>>>>>>>>>>>>> Alibaba Cloud
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> Email: renqschn@gmail.com
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> --
> >>>>>>>>>>>>>>> Best regards,
> >>>>>>>>>>>>>>> Roman Boyko
> >>>>>>>>>>>>>>> e.: ro.v.boyko@gmail.com
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> --
> >>>>>>>> Best Regards,
> >>>>>>>>
> >>>>>>>> Qingsheng Ren
> >>>>>>>>
> >>>>>>>> Real-time Computing Team
> >>>>>>>> Alibaba Cloud
> >>>>>>>>
> >>>>>>>> Email: renqschn@gmail.com
> >>>>>>
>

Re: [DISCUSS] FLIP-221 Abstraction for lookup source cache and metric

Posted by Qingsheng Ren <re...@gmail.com>.
Hi Becket, 

Thanks for your comments! 

1. We have removed the LookupCacheFactory in the latest design and added open/close methods to the LookupCache for initialization.

2. Custom reload strategy is a great idea! We added a new interface FullCachingReloadTrigger for developers to implement their own reload strategies.

3. Fixed in the latest version.

4. cacheMissingKey should only be meaningful if the cache is supplied, so we made it an Optional<Boolean> to align with Optional<LookupCache>. To make this easier to understand, we improved the builder of LookupFunctionProvider (now renamed PartialCachingLookupProvider) so that Builder#withCache requires both cache and cacheMissingKey as its arguments.
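For readers following along, the shapes described in points 1 and 4 might be sketched like this (simplified stand-ins, not the actual Flink interfaces):

```java
import java.util.Optional;

// Simplified sketch of the described API shapes: a cache with lifecycle
// methods, and a provider whose builder-style factory requires the cache and
// the cacheMissingKey flag together.
class ProviderSketch {

    /** Cache abstraction with open/close lifecycle methods (point 1). */
    interface LookupCache {
        void open();
        String getIfPresent(String key);
        void put(String key, String value);
        void close();
    }

    /** Provider carrying an optional cache plus the missing-key flag. */
    static final class PartialCachingProvider {
        private final Optional<LookupCache> cache;
        private final Optional<Boolean> cacheMissingKey;

        private PartialCachingProvider(LookupCache cache, Boolean cacheMissingKey) {
            this.cache = Optional.ofNullable(cache);
            this.cacheMissingKey = Optional.ofNullable(cacheMissingKey);
        }

        /** Both arguments are required together, as point 4 explains. */
        static PartialCachingProvider withCache(LookupCache cache, boolean cacheMissingKey) {
            return new PartialCachingProvider(cache, cacheMissingKey);
        }

        /** No cache implies no meaningful cacheMissingKey. */
        static PartialCachingProvider withoutCache() {
            return new PartialCachingProvider(null, null);
        }

        Optional<LookupCache> cache() { return cache; }
        Optional<Boolean> cacheMissingKey() { return cacheMissingKey; }
    }
}
```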

Best regards, 

Qingsheng

> On May 26, 2022, at 11:52, Becket Qin <be...@gmail.com> wrote:
> 
> Hi Qingsheng,
> 
> Thanks for updating the FLIP. A few comments / questions below:
> 
> 1. Is there a reason that we have both "XXXFactory" and "XXXProvider"? What
> is the difference between them? If they are the same, can we just use
> XXXFactory everywhere?
> 
> 2. Regarding the FullCachingLookupProvider, should the reloading policy
> also be pluggable? Periodic reloading can sometimes be tricky in
> practice. For example, if user uses 24 hours as the cache refresh interval
> and some nightly batch job is delayed, the cache update may still see the
> stale data.
> 
> 3. In DefaultLookupCacheFactory, it looks like InitialCapacity should be
> removed.
> 
> 4. The purpose of LookupFunctionProvider#cacheMissingKey() seems a little
> confusing to me. If Optional<LookupCacheFactory> getCacheFactory() returns
> a non-empty factory, doesn't that already indicates the framework to cache
> the missing keys? Also, why is this method returning an Optional<Boolean>
> instead of boolean?
> 
> Thanks,
> 
> Jiangjie (Becket) Qin
> 
> 
> 
> On Wed, May 25, 2022 at 5:07 PM Qingsheng Ren <re...@gmail.com> wrote:
> 
>> Hi Lincoln and Jark,
>> 
>> Thanks for the comments! If the community reaches a consensus that we use
>> SQL hint instead of table options to decide whether to use sync or async
>> mode, it’s indeed not necessary to introduce the “lookup.async” option.
>> 
>> I think it’s a good idea to let the decision about async be made at the
>> query level, which could enable better optimization with more information
>> gathered by the planner. Is there any FLIP describing the issue in FLINK-27625? I thought
>> FLIP-234 is only proposing adding a SQL hint for retry on missing, instead of
>> the entire async mode to be controlled by hint.
>> 
>> Best regards,
>> 
>> Qingsheng
>> 
>>> On May 25, 2022, at 15:13, Lincoln Lee <li...@gmail.com> wrote:
>>> 
>>> Hi Jark,
>>> 
>>> Thanks for your reply!
>>> 
>>> Currently 'lookup.async' only exists in the HBase connector, and I have no idea
>>> whether or when to remove it (we can discuss it in another issue for the
>>> HBase connector after FLINK-27625 is done), just not add it into a common
>>> option now.
>>> 
>>> Best,
>>> Lincoln Lee
>>> 
>>> 
>>> Jark Wu <im...@gmail.com> 于2022年5月24日周二 20:14写道:
>>> 
>>>> Hi Lincoln,
>>>> 
>>>> I have taken a look at FLIP-234, and I agree with you that the
>> connectors
>>>> can
>>>> provide both async and sync runtime providers simultaneously instead of
>> one
>>>> of them.
>>>> At that point, "lookup.async" looks redundant. If this option is
>> planned to
>>>> be removed
>>>> in the long term, I think it makes sense not to introduce it in this
>> FLIP.
>>>> 
>>>> Best,
>>>> Jark
>>>> 
>>>> On Tue, 24 May 2022 at 11:08, Lincoln Lee <li...@gmail.com>
>> wrote:
>>>> 
>>>>> Hi Qingsheng,
>>>>> 
>>>>> Sorry for jumping into the discussion so late. It's a good idea that we
>>>> can
>>>>> have a common table option. I have a minor comment on 'lookup.async' -
>>>>> that we should not make it a common option:
>>>>> 
>>>>> The table layer abstracts both sync and async lookup capabilities,
>>>>> connector implementers can choose one or both; in the case of
>>>> implementing
>>>>> only one capability (the status of most existing built-in connectors),
>>>>> 'lookup.async' will not be used.  And when a connector has both
>>>>> capabilities, I think this choice is more suitable for making decisions
>>>> at
>>>>> the query level, for example, table planner can choose the physical
>>>>> implementation of async lookup or sync lookup based on its cost model,
>> or
>>>>> users can give query hint based on their own better understanding.  If
>>>>> there is another common table option 'lookup.async', it may confuse the
>>>>> users in the long run.
>>>>> 
>>>>> So, I prefer to leave the 'lookup.async' option in a private place (for
>> the
>>>>> current HBase connector) and not turn it into a common option.
>>>>> 
>>>>> WDYT?
>>>>> 
>>>>> Best,
>>>>> Lincoln Lee
>>>>> 
>>>>> 
>>>>> Qingsheng Ren <re...@gmail.com> 于2022年5月23日周一 14:54写道:
>>>>> 
>>>>>> Hi Alexander,
>>>>>> 
>>>>>> Thanks for the review! We recently updated the FLIP and you can find
>>>>> those
>>>>>> changes in my latest email. Since some terminology has changed,
>>>>> I’ll
>>>>>> use the new concepts when replying to your comments.
>>>>>> 
>>>>>> 1. Builder vs ‘of’
>>>>>> I’m OK to use builder pattern if we have additional optional
>> parameters
>>>>>> for full caching mode (“rescan” previously). The schedule-with-delay
>>>> idea
>>>>>> looks reasonable to me, but I think we need to redesign the builder
>> API
>>>>> of
>>>>>> full caching to make it more descriptive for developers. Would you
>> mind
>>>>>> sharing your ideas about the API? For accessing the FLIP workspace you
>>>>> can
>>>>>> just provide your account ID and ping any PMC member including Jark.
>>>>>> 
>>>>>> 2. Common table options
>>>>>> We have some discussions these days and propose to introduce 8 common
>>>>>> table options about caching. It has been updated on the FLIP.
>>>>>> 
>>>>>> 3. Retries
>>>>>> I think we are on the same page :-)
>>>>>> 
>>>>>> For your additional concerns:
>>>>>> 1) The table option has been updated.
>>>>>> 2) We got “lookup.cache” back for configuring whether to use partial
>> or
>>>>>> full caching mode.
>>>>>> 
>>>>>> Best regards,
>>>>>> 
>>>>>> Qingsheng
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>>> On May 19, 2022, at 17:25, Александр Смирнов <sm...@gmail.com>
>>>>>> wrote:
>>>>>>> 
>>>>>>> Also I have a few additions:
>>>>>>> 1) maybe rename 'lookup.cache.maximum-size' to
>>>>>>> 'lookup.cache.max-rows'? I think it will be more clear that we talk
>>>>>>> not about bytes, but about the number of rows. Plus it fits better,
>>>>>>> considering my optimization with filters.
>>>>>>> 2) How will users enable rescanning? Are we going to separate caching
>>>>>>> and rescanning from the options point of view? Like initially we had
>>>>>>> one option 'lookup.cache' with values LRU / ALL. I think now we can
>>>>>>> make a boolean option 'lookup.rescan'. RescanInterval can be
>>>>>>> 'lookup.rescan.interval', etc.
>>>>>>> 
>>>>>>> Best regards,
>>>>>>> Alexander
>>>>>>> 
>>>>>>> чт, 19 мая 2022 г. в 14:50, Александр Смирнов <smiralexan@gmail.com
>>>>> :
>>>>>>>> 
>>>>>>>> Hi Qingsheng and Jark,
>>>>>>>> 
>>>>>>>> 1. Builders vs 'of'
>>>>>>>> I understand that builders are used when we have multiple
>>>> parameters.
>>>>>>>> I suggested them because we could add parameters later. To prevent
>>>>>>>> Builder for ScanRuntimeProvider from looking redundant I can suggest
>>>>>>>> one more config now - "rescanStartTime".
>>>>>>>> It's a time in UTC (LocalTime class) when the first reload of cache
>>>>>>>> starts. This parameter can be thought of as 'initialDelay' (diff
>>>>>>>> between current time and rescanStartTime) in method
>>>>>>>> ScheduledExecutorService#scheduleWithFixedDelay [1]. It can be very
>>>>>>>> useful when the dimension table is updated by some other scheduled
>>>> job
>>>>>>>> at a certain time. Or when the user simply wants a second scan
>>>> (first
>>>>>>>> cache reload) to be delayed. This option can be used even without
>>>>>>>> 'rescanInterval' - in this case 'rescanInterval' will be one day.
>>>>>>>> If you are fine with this option, I would be very glad if you would
>>>>>>>> give me access to edit FLIP page, so I could add it myself
>>>>>>>> 
>>>>>>>> 2. Common table options
>>>>>>>> I also think that FactoryUtil would be overloaded by all cache
>>>>>>>> options. But maybe unify all suggested options, not only for default
>>>>>>>> cache? I.e. a 'LookupOptions' class that unifies default cache
>>>> options,
>>>>>>>> rescan options, 'async', 'maxRetries'. WDYT?
>>>>>>>> 
>>>>>>>> 3. Retries
>>>>>>>> I'm fine with suggestion close to RetryUtils#tryTimes(times, call)
>>>>>>>> 
>>>>>>>> [1]
>>>>>> 
>>>>> 
>>>> 
>> https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/ScheduledExecutorService.html#scheduleWithFixedDelay-java.lang.Runnable-long-long-java.util.concurrent.TimeUnit-
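[Editor's note] The 'rescanStartTime' / initialDelay idea described in point 1 above can be sketched in plain JDK code. The class and method names below are hypothetical illustrations, not part of the FLIP:

```java
import java.time.Duration;
import java.time.LocalTime;

// Hypothetical sketch: derive the initial delay for
// ScheduledExecutorService#scheduleWithFixedDelay from a
// 'rescanStartTime' given as a UTC wall-clock time (LocalTime).
public class RescanDelaySketch {

    /** Millis from 'now' until the next occurrence of 'startTime'. */
    public static long initialDelayMillis(LocalTime now, LocalTime startTime) {
        long diff = Duration.between(now, startTime).toMillis();
        // If startTime has already passed today, schedule the first
        // cache reload for the same time tomorrow.
        return diff >= 0 ? diff : diff + Duration.ofDays(1).toMillis();
    }
}
```

A reload task would then be scheduled with scheduleWithFixedDelay(reloadTask, initialDelay, rescanInterval, TimeUnit.MILLISECONDS).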
>>>>>>>> 
>>>>>>>> Best regards,
>>>>>>>> Alexander
>>>>>>>> 
>>>>>>>> ср, 18 мая 2022 г. в 16:04, Qingsheng Ren <re...@gmail.com>:
>>>>>>>>> 
>>>>>>>>> Hi Jark and Alexander,
>>>>>>>>> 
>>>>>>>>> Thanks for your comments! I’m also OK to introduce common table
>>>>>> options. I prefer to introduce a new DefaultLookupCacheOptions class
>>>> for
>>>>>> holding these option definitions because putting all options into
>>>>>> FactoryUtil would make it a bit ”crowded” and not well categorized.
>>>>>>>>> 
>>>>>>>>> FLIP has been updated according to suggestions above:
>>>>>>>>> 1. Use static “of” method for constructing RescanRuntimeProvider
>>>>>> considering both arguments are required.
>>>>>>>>> 2. Introduce new table options matching DefaultLookupCacheFactory
>>>>>>>>> 
>>>>>>>>> Best,
>>>>>>>>> Qingsheng
>>>>>>>>> 
>>>>>>>>> On Wed, May 18, 2022 at 2:57 PM Jark Wu <im...@gmail.com> wrote:
>>>>>>>>>> 
>>>>>>>>>> Hi Alex,
>>>>>>>>>> 
>>>>>>>>>> 1) retry logic
>>>>>>>>>> I think we can extract some common retry logic into utilities,
>>>> e.g.
>>>>>> RetryUtils#tryTimes(times, call).
>>>>>>>>>> This seems independent of this FLIP and can be reused by
>>>> DataStream
>>>>>> users.
>>>>>>>>>> Maybe we can open an issue to discuss this and where to put it.
>>>>>>>>>> 
>>>>>>>>>> 2) cache ConfigOptions
>>>>>>>>>> I'm fine with defining cache config options in the framework.
>>>>>>>>>> A candidate place to put is FactoryUtil which also includes
>>>>>> "sink.parallelism", "format" options.
>>>>>>>>>> 
>>>>>>>>>> Best,
>>>>>>>>>> Jark
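[Editor's note] The RetryUtils#tryTimes(times, call) helper Jark mentions does not exist in Flink yet; a rough JDK-only sketch of the idea could look like this:

```java
import java.util.concurrent.Callable;

// Hypothetical sketch of the proposed RetryUtils#tryTimes(times, call):
// retry a call a fixed number of times and rethrow the last failure as
// an unchecked exception.
public class RetryUtils {

    public static <T> T tryTimes(int times, Callable<T> call) {
        RuntimeException last = null;
        for (int attempt = 1; attempt <= times; attempt++) {
            try {
                return call.call();
            } catch (Exception e) {
                // Connector-specific recovery (e.g. re-establishing a
                // connection) could be hooked in here before the next attempt.
                last = (e instanceof RuntimeException)
                        ? (RuntimeException) e
                        : new RuntimeException(e);
            }
        }
        throw last;
    }
}
```

A connector's lookup function could wrap its database call in tryTimes instead of duplicating the retry loop per connector.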
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> On Wed, 18 May 2022 at 13:52, Александр Смирнов <
>>>>> smiralexan@gmail.com>
>>>>>> wrote:
>>>>>>>>>>> 
>>>>>>>>>>> Hi Qingsheng,
>>>>>>>>>>> 
>>>>>>>>>>> Thank you for considering my comments.
>>>>>>>>>>> 
>>>>>>>>>>>> there might be custom logic before retrying, such as
>>>>>> re-establishing the connection
>>>>>>>>>>> 
>>>>>>>>>>> Yes, I understand that. I meant that such logic can be placed in
>>>> a
>>>>>>>>>>> separate function that can be implemented by connectors. Just
>>>>> moving
>>>>>>>>>>> the retry logic would make connector's LookupFunction more
>>>> concise
>>>>> +
>>>>>>>>>>> avoid duplicate code. However, it's a minor change. The decision
>>>> is
>>>>>> up
>>>>>>>>>>> to you.
>>>>>>>>>>> 
>>>>>>>>>>>>> We decided not to provide common DDL options and to let developers
>>>>>> define their own options per connector, as we do now.
>>>>>>>>>>> 
>>>>>>>>>>> What is the reason for that? One of the main goals of this FLIP
>>>> was
>>>>>> to
>>>>>>>>>>> unify the configs, wasn't it? I understand that current cache
>>>>> design
>>>>>>>>>>> doesn't depend on ConfigOptions, like was before. But still we
>>>> can
>>>>>> put
>>>>>>>>>>> these options into the framework, so connectors can reuse them
>>>>>>>>>>> and avoid code duplication and, more importantly, avoid
>>>>>>>>>>> inconsistent option naming. This can be pointed out in the
>>>>>>>>>>> documentation for connector developers.
>>>>>>>>>>> 
>>>>>>>>>>> Best regards,
>>>>>>>>>>> Alexander
>>>>>>>>>>> 
>>>>>>>>>>> вт, 17 мая 2022 г. в 17:11, Qingsheng Ren <re...@gmail.com>:
>>>>>>>>>>>> 
>>>>>>>>>>>> Hi Alexander,
>>>>>>>>>>>> 
>>>>>>>>>>>> Thanks for the review and glad to see we are on the same page! I
>>>>>> think you forgot to cc the dev mailing list so I’m also quoting your
>>>>> reply
>>>>>> under this email.
>>>>>>>>>>>> 
>>>>>>>>>>>>> We can add 'maxRetryTimes' option into this class
>>>>>>>>>>>> 
>>>>>>>>>>>> In my opinion the retry logic should be implemented in lookup()
>>>>>> instead of in LookupFunction#eval(). Retrying is only meaningful under
>>>>>> specific retriable failures, and there might be custom logic before
>>>>>> retrying, such as re-establishing the connection
>>>>>> (JdbcRowDataLookupFunction is an example), so it is more convenient to
>>>>>> leave it to the connector.
>>>>>>>>>>>> 
>>>>>>>>>>>>> I don't see DDL options, that were in previous version of FLIP.
>>>>> Do
>>>>>> you have any special plans for them?
>>>>>>>>>>>> 
>>>>>>>>>>>> We decided not to provide common DDL options and to let developers
>>>>>> define their own options per connector, as we do now.
>>>>>>>>>>>> 
>>>>>>>>>>>> The rest of comments sound great and I’ll update the FLIP. Hope
>>>> we
>>>>>> can finalize our proposal soon!
>>>>>>>>>>>> 
>>>>>>>>>>>> Best,
>>>>>>>>>>>> 
>>>>>>>>>>>> Qingsheng
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>>> On May 17, 2022, at 13:46, Александр Смирнов <
>>>>> smiralexan@gmail.com>
>>>>>> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Hi Qingsheng and devs!
>>>>>>>>>>>>> 
>>>>>>>>>>>>> I like the overall design of updated FLIP, however I have
>>>> several
>>>>>>>>>>>>> suggestions and questions.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 1) Introducing LookupFunction as a subclass of TableFunction
>>>> is a
>>>>>> good
>>>>>>>>>>>>> idea. We can add 'maxRetryTimes' option into this class. 'eval'
>>>>>> method
>>>>>>>>>>>>> of new LookupFunction is great for this purpose. The same is
>>>> for
>>>>>>>>>>>>> 'async' case.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 2) There might be other configs in future, such as
>>>>>> 'cacheMissingKey'
>>>>>>>>>>>>> in LookupFunctionProvider or 'rescanInterval' in
>>>>>> ScanRuntimeProvider.
>>>>>>>>>>>>> Maybe use Builder pattern in LookupFunctionProvider and
>>>>>>>>>>>>> RescanRuntimeProvider for more flexibility (use one 'build'
>>>>> method
>>>>>>>>>>>>> instead of many 'of' methods in future)?
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 3) What are the plans for existing TableFunctionProvider and
>>>>>>>>>>>>> AsyncTableFunctionProvider? I think they should be deprecated.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 4) Am I right that the current design does not assume usage of
>>>>>>>>>>>>> user-provided LookupCache in re-scanning? In this case, it is
>>>> not
>>>>>> very
>>>>>>>>>>>>> clear why we need methods such as 'invalidate' or 'putAll'
>>>> in
>>>>>>>>>>>>> LookupCache.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 5) I don't see DDL options, that were in previous version of
>>>>> FLIP.
>>>>>> Do
>>>>>>>>>>>>> you have any special plans for them?
>>>>>>>>>>>>> 
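[Editor's note] The builder pattern suggested in point 2 can be illustrated with a standalone sketch. The class, option names, and the 'cacheMissingKey' default below are illustrative assumptions, not the final FLIP API:

```java
// Hypothetical sketch of a builder-style provider: optional settings such
// as cacheMissingKey can be added later without multiplying static "of"
// factory overloads. A String stands in for the real lookup function.
public class LookupFunctionProviderSketch {
    private final String functionName;
    private final boolean cacheMissingKey;

    private LookupFunctionProviderSketch(String functionName, boolean cacheMissingKey) {
        this.functionName = functionName;
        this.cacheMissingKey = cacheMissingKey;
    }

    public static Builder builder() {
        return new Builder();
    }

    public String getFunctionName() {
        return functionName;
    }

    public boolean isCacheMissingKey() {
        return cacheMissingKey;
    }

    public static class Builder {
        private String functionName;
        private boolean cacheMissingKey = true; // assumed default

        public Builder withFunctionName(String functionName) {
            this.functionName = functionName;
            return this;
        }

        public Builder cacheMissingKey(boolean enable) {
            this.cacheMissingKey = enable;
            return this;
        }

        public LookupFunctionProviderSketch build() {
            return new LookupFunctionProviderSketch(functionName, cacheMissingKey);
        }
    }
}
```

With a single build() entry point, new options only add builder methods rather than new overloads of a static factory.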
>>>>>>>>>>>>> If you don't mind, I would be glad to be able to make small
>>>>>>>>>>>>> adjustments to the FLIP document too. I think it's worth
>>>>> mentioning
>>>>>>>>>>>>> which optimizations are planned for the future.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>> Smirnov Alexander
>>>>>>>>>>>>> 
>>>>>>>>>>>>> пт, 13 мая 2022 г. в 20:27, Qingsheng Ren <renqschn@gmail.com
>>>>> :
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Hi Alexander and devs,
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Thank you very much for the in-depth discussion! As Jark
>>>>>> mentioned we were inspired by Alexander's idea and made a refactor on
>>>> our
>>>>>> design. FLIP-221 [1] has been updated to reflect our design now and we
>>>>> are
>>>>>> happy to hear more suggestions from you!
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Compared to the previous design:
>>>>>>>>>>>>>> 1. The lookup cache serves at table runtime level and is
>>>>>> integrated as a component of LookupJoinRunner as discussed previously.
>>>>>>>>>>>>>> 2. Interfaces are renamed and re-designed to reflect the new
>>>>>> design.
>>>>>>>>>>>>>> 3. We handle the all-caching case separately and
>>>> introduce a
>>>>>> new RescanRuntimeProvider to reuse the ability of scanning. We are
>>>>> planning
>>>>>> to support SourceFunction / InputFormat for now considering the
>>>>> complexity
>>>>>> of FLIP-27 Source API.
>>>>>>>>>>>>>> 4. A new interface LookupFunction is introduced to make the
>>>>>> semantic of lookup more straightforward for developers.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> For replying to Alexander:
>>>>>>>>>>>>>>> However I'm a little confused whether InputFormat is
>>>> deprecated
>>>>>> or not. Am I right that it will be so in the future, but currently
>> it's
>>>>> not?
>>>>>>>>>>>>>> Yes you are right. InputFormat is not deprecated for now. I
>>>>> think
>>>>>> it will be deprecated in the future but we don't have a clear plan for
>>>>> that.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Thanks again for the discussion on this FLIP and looking
>>>> forward
>>>>>> to cooperating with you after we finalize the design and interfaces!
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> [1]
>>>>>> 
>>>>> 
>>>> 
>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-221+Abstraction+for+lookup+source+cache+and+metric
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Qingsheng
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On Fri, May 13, 2022 at 12:12 AM Александр Смирнов <
>>>>>> smiralexan@gmail.com> wrote:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Hi Jark, Qingsheng and Leonard!
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Glad to see that we came to a consensus on almost all points!
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> However I'm a little confused whether InputFormat is
>>>> deprecated
>>>>>> or
>>>>>>>>>>>>>>> not. Am I right that it will be so in the future, but
>>>> currently
>>>>>> it's
>>>>>>>>>>>>>>> not? Actually I also think that for the first version it's OK
>>>>> to
>>>>>> use
>>>>>>>>>>>>>>> InputFormat in ALL cache realization, because supporting
>>>> rescan
>>>>>>>>>>>>>>> ability seems like a very distant prospect. But for this
>>>>>> decision we
>>>>>>>>>>>>>>> need a consensus among all discussion participants.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> In general, I don't have something to argue with your
>>>>>> statements. All
>>>>>>>>>>>>>>> of them correspond my ideas. Looking ahead, it would be nice
>>>> to
>>>>>> work
>>>>>>>>>>>>>>> on this FLIP cooperatively. I've already done a lot of work
>>>> on
>>>>>> lookup
>>>>>>>>>>>>>>> join caching, with an implementation very close to the one we
>>>>>> are discussing,
>>>>>>>>>>>>>>> and want to share the results of this work. Anyway, looking
>>>>>> forward to
>>>>>>>>>>>>>>> the FLIP update!
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>>> Smirnov Alexander
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> чт, 12 мая 2022 г. в 17:38, Jark Wu <im...@gmail.com>:
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Hi Alex,
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Thanks for summarizing your points.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> In the past week, Qingsheng, Leonard, and I have discussed
>>>> it
>>>>>> several times
>>>>>>>>>>>>>>>> and we have totally refactored the design.
>>>>>>>>>>>>>>>> I'm glad to say we have reached a consensus on many of your
>>>>>> points!
>>>>>>>>>>>>>>>> Qingsheng is still working on updating the design docs and
>>>>>> maybe can be
>>>>>>>>>>>>>>>> available in the next few days.
>>>>>>>>>>>>>>>> I will share some conclusions from our discussions:
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 1) we have refactored the design towards to "cache in
>>>>>> framework" way.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 2) a "LookupCache" interface for users to customize and a
>>>>>> default
>>>>>>>>>>>>>>>> implementation with builder for users to easy-use.
>>>>>>>>>>>>>>>> This can both make it possible to both have flexibility and
>>>>>> conciseness.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 3) Filter pushdown is important for ALL and LRU lookup
>>>> cache,
>>>>>> esp reducing
>>>>>>>>>>>>>>>> IO.
>>>>>>>>>>>>>>>> Filter pushdown should be the final state and the unified
>>>> way
>>>>>> to both
>>>>>>>>>>>>>>>> support pruning ALL cache and LRU cache,
>>>>>>>>>>>>>>>> so I think we should make effort in this direction. If we
>>>> need
>>>>>> to support
>>>>>>>>>>>>>>>> filter pushdown for ALL cache anyway, why not use
>>>>>>>>>>>>>>>> it for LRU cache as well? Either way, as we decide to
>>>>> implement
>>>>>> the cache
>>>>>>>>>>>>>>>> in the framework, we have the chance to support
>>>>>>>>>>>>>>>> filter on cache anytime. This is an optimization and it
>>>>> doesn't
>>>>>> affect the
>>>>>>>>>>>>>>>> public API. I think we can create a JIRA issue to
>>>>>>>>>>>>>>>> discuss it when the FLIP is accepted.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 4) The idea to support ALL cache is similar to your
>>>> proposal.
>>>>>>>>>>>>>>>> In the first version, we will only support InputFormat,
>>>>>> SourceFunction for
>>>>>>>>>>>>>>>> cache all (invoke InputFormat in join operator).
>>>>>>>>>>>>>>>> For FLIP-27 source, we need to join a true source operator
>>>>>> instead of
>>>>>>>>>>>>>>>> calling it embedded in the join operator.
>>>>>>>>>>>>>>>> However, this needs another FLIP to support the re-scan
>>>>> ability
>>>>>> for FLIP-27
>>>>>>>>>>>>>>>> Source, and this can be a large work.
>>>>>>>>>>>>>>>> In order to not block this issue, we can put the effort of
>>>>>> FLIP-27 source
>>>>>>>>>>>>>>>> integration into future work and integrate
>>>>>>>>>>>>>>>> InputFormat&SourceFunction for now.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> I think it's fine to use InputFormat&SourceFunction, as they
>>>>>> are not
>>>>>>>>>>>>>>>> deprecated, otherwise, we have to introduce another function
>>>>>>>>>>>>>>>> deprecated; otherwise we would have to introduce another function
>>>> FLIP-27
>>>>>> source
>>>>>>>>>>>>>>>> integration ASAP before InputFormat & SourceFunction are
>>>>>> deprecated.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>> Jark
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> On Thu, 12 May 2022 at 15:46, Александр Смирнов <
>>>>>> smiralexan@gmail.com>
>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Hi Martijn!
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Got it. Therefore, the realization with InputFormat is not
>>>>>> considered.
>>>>>>>>>>>>>>>>> Thanks for clearing that up!
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>>>>> Smirnov Alexander
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> чт, 12 мая 2022 г. в 14:23, Martijn Visser <
>>>>>> martijn@ververica.com>:
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> With regards to:
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> But if there are plans to refactor all connectors to
>>>>> FLIP-27
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Yes, FLIP-27 is the target for all connectors. The old
>>>>>> interfaces will be
>>>>>>>>>>>>>>>>>> deprecated and connectors will either be refactored to use
>>>>>> the new ones
>>>>>>>>>>>>>>>>> or
>>>>>>>>>>>>>>>>>> dropped.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> The caching should work for connectors that are using
>>>>> FLIP-27
>>>>>> interfaces,
>>>>>>>>>>>>>>>>>> we should not introduce new features for old interfaces.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Martijn
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> On Thu, 12 May 2022 at 06:19, Александр Смирнов <
>>>>>> smiralexan@gmail.com>
>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Hi Jark!
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Sorry for the late response. I would like to make some
>>>>>> comments and
>>>>>>>>>>>>>>>>>>> clarify my points.
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> 1) I agree with your first statement. I think we can
>>>>> achieve
>>>>>> both
>>>>>>>>>>>>>>>>>>> advantages this way: put the Cache interface in
>>>>>> flink-table-common,
>>>>>>>>>>>>>>>>>>> but have implementations of it in flink-table-runtime.
>>>>>> Therefore if a
>>>>>>>>>>>>>>>>>>> connector developer wants to use existing cache
>>>> strategies
>>>>>> and their
>>>>>>>>>>>>>>>>>>> implementations, he can just pass lookupConfig to the
>>>>>> planner, but if
>>>>>>>>>>>>>>>>>>> he wants to have his own cache implementation in his
>>>>>> TableFunction, it
>>>>>>>>>>>>>>>>>>> will be possible for him to use the existing interface
>>>> for
>>>>>> this
>>>>>>>>>>>>>>>>>>> purpose (we can explicitly point this out in the
>>>>>> documentation). In
>>>>>>>>>>>>>>>>>>> this way all configs and metrics will be unified. WDYT?
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> If a filter can prune 90% of data in the cache, we will
>>>>>> have 90% of
>>>>>>>>>>>>>>>>>>> lookup requests that can never be cached
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> 2) Let me clarify the logic of the filters optimization in case
>>>> of
>>>>>> LRU cache.
>>>>>>>>>>>>>>>>>>> It looks like Cache<RowData, Collection<RowData>>. Here
>>>> we
>>>>>> always
>>>>>>>>>>>>>>>>>>> store the response of the dimension table in cache, even
>>>>>> after
>>>>>>>>>>>>>>>>>>> applying calc function. I.e. if there are no rows after
>>>>>> applying
>>>>>>>>>>>>>>>>>>> filters to the result of the 'eval' method of
>>>>> TableFunction,
>>>>>> we store
>>>>>>>>>>>>>>>>>>> the empty list by lookup keys. Therefore the cache line
>>>>> will
>>>>>> be
>>>>>>>>>>>>>>>>>>> filled, but will require much less memory (in bytes).
>>>> I.e.
>>>>>> we don't
>>>>>>>>>>>>>>>>>>> completely drop keys whose results were pruned, but we
>>>>>> significantly
>>>>>>>>>>>>>>>>>>> reduce the memory required to store those results. If the user
>>>>>> knows about
>>>>>>>>>>>>>>>>>>> this behavior, he can increase the 'max-rows' option
>>>> before
>>>>>> the start
>>>>>>>>>>>>>>>>>>> of the job. But actually I came up with the idea that we
>>>>> can
>>>>>> do this
>>>>>>>>>>>>>>>>>>> automatically by using the 'maximumWeight' and 'weigher'
>>>>>> methods of
>>>>>>>>>>>>>>>>>>> GuavaCache [1]. Weight can be the size of the collection
>>>> of
>>>>>> rows
>>>>>>>>>>>>>>>>>>> (value of cache). Therefore cache can automatically fit
>>>>> much
>>>>>> more
>>>>>>>>>>>>>>>>>>> records than before.
>>>>>>>>>>>>>>>>>>> 
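[Editor's note] The 'maximumWeight'/'weigher' idea can be illustrated with a small standalone sketch. This is not Guava's API, just the same principle on top of the JDK: the cache budget counts cached rows rather than keys, so keys whose results were pruned down to empty lists cost almost nothing:

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch (not Guava): a cache bounded by the total number of
// cached rows (the "weight") instead of the number of keys, so keys that
// map to empty or heavily filtered results occupy almost no budget.
// Re-inserting an existing key is not handled in this sketch.
public class RowWeightedCache<K, V> {
    private final long maxRows;
    private long currentRows = 0;
    private final Map<K, List<V>> entries = new HashMap<>();
    private final ArrayDeque<K> insertionOrder = new ArrayDeque<>();

    public RowWeightedCache(long maxRows) {
        this.maxRows = maxRows;
    }

    public void put(K key, List<V> rows) {
        entries.put(key, rows);
        insertionOrder.addLast(key);
        currentRows += rows.size();
        // Evict the oldest entries until the row budget is respected again,
        // always keeping at least the newest entry.
        while (currentRows > maxRows && insertionOrder.size() > 1) {
            List<V> evicted = entries.remove(insertionOrder.pollFirst());
            if (evicted != null) {
                currentRows -= evicted.size();
            }
        }
    }

    public List<V> get(K key) {
        return entries.get(key);
    }

    public int keyCount() {
        return entries.size();
    }
}
```

With row-count weighting, a cached empty result (a pruned key) consumes no budget, so the cache can hold far more keys than a plain per-entry 'max-rows' limit would suggest.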
>>>>>>>>>>>>>>>>>>>> Flink SQL has provided a standard way to do filters and
>>>>>> projects
>>>>>>>>>>>>>>>>>>> pushdown, i.e., SupportsFilterPushDown and
>>>>>> SupportsProjectionPushDown.
>>>>>>>>>>>>>>>>>>>> Jdbc/hive/HBase haven't implemented the interfaces; that
>>>>>>>>>>>>>>>>>>> doesn't mean they are hard to implement.
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> It's debatable how difficult it will be to implement
>>>> filter
>>>>>> pushdown.
>>>>>>>>>>>>>>>>>>> But I think the fact that currently there is no database
>>>>>> connector
>>>>>>>>>>>>>>>>>>> with filter pushdown at least means that this feature
>>>> won't
>>>>>> be
>>>>>>>>>>>>>>>>>>> supported soon in connectors. Moreover, if we talk about
>>>>>> other
>>>>>>>>>>>>>>>>>>> connectors (not in Flink repo), their databases might not
>>>>>> support all
>>>>>>>>>>>>>>>>>>> Flink filters (or not support filters at all). I think
>>>>> users
>>>>>> are
>>>>>>>>>>>>>>>>>>> interested in supporting cache filters optimization
>>>>>> independently of
>>>>>>>>>>>>>>>>>>> supporting other features and solving more complex
>>>> problems
>>>>>> (or
>>>>>>>>>>>>>>>>>>> unsolvable at all).
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> 3) I agree with your third statement. Actually in our
>>>>>> internal version
>>>>>>>>>>>>>>>>>>> I also tried to unify the logic of scanning and reloading
>>>>>> data from
>>>>>>>>>>>>>>>>>>> connectors. But unfortunately, I didn't find a way to
>>>> unify
>>>>>> the logic
>>>>>>>>>>>>>>>>>>> of all ScanRuntimeProviders (InputFormat, SourceFunction,
>>>>>> Source,...)
>>>>>>>>>>>>>>>>>>> and reuse it in reloading ALL cache. As a result I
>>>> settled
>>>>>> on using
>>>>>>>>>>>>>>>>>>> InputFormat, because it was used for scanning in all
>>>> lookup
>>>>>>>>>>>>>>>>>>> connectors. (I didn't know that there are plans to
>>>>> deprecate
>>>>>>>>>>>>>>>>>>> InputFormat in favor of FLIP-27 Source). IMO usage of
>>>>>> FLIP-27 source
>>>>>>>>>>>>>>>>>>> in ALL caching is not a good idea, because this source was
>>>>>> designed to
>>>>>>>>>>>>>>>>>>> work in a distributed environment (SplitEnumerator on
>>>>>> JobManager and
>>>>>>>>>>>>>>>>>>> SourceReaders on TaskManagers), not in one operator
>>>> (lookup
>>>>>> join
>>>>>>>>>>>>>>>>>>> operator in our case). There is even no direct way to
>>>> pass
>>>>>> splits from
>>>>>>>>>>>>>>>>>>> SplitEnumerator to SourceReader (this logic works through
>>>>>>>>>>>>>>>>>>> SplitEnumeratorContext, which requires
>>>>>>>>>>>>>>>>>>> OperatorCoordinator.SubtaskGateway to send
>>>> AddSplitEvents).
>>>>>> Usage of
>>>>>>>>>>>>>>>>>>> InputFormat for ALL cache seems much more clearer and
>>>>>> easier. But if
>>>>>>>>>>>>>>>>>>> there are plans to refactor all connectors to FLIP-27, I
>>>>>> have the
>>>>>>>>>>>>>>>>>>> following ideas: maybe we can give up the lookup join ALL
>>>>>> cache in
>>>>>>>>>>>>>>>>>>> favor of simple join with multiple scanning of batch
>>>>> source?
>>>>>> The point
>>>>>>>>>>>>>>>>>>> is that the only difference between lookup join ALL cache
>>>>>> and simple
>>>>>>>>>>>>>>>>>>> join with batch source is that in the first case scanning
>>>>> is
>>>>>> performed
>>>>>>>>>>>>>>>>>>> multiple times, in between which state (cache) is cleared
>>>>>> (correct me
>>>>>>>>>>>>>>>>>>> if I'm wrong). So what if we extend the functionality of
>>>>>> simple join
>>>>>>>>>>>>>>>>>>> to support state reloading + extend the functionality of
>>>>>> scanning
>>>>>>>>>>>>>>>>>>> batch source multiple times (this one should be easy with
>>>>>> new FLIP-27
>>>>>>>>>>>>>>>>>>> source, that unifies streaming/batch reading - we will
>>>> need
>>>>>> to change
>>>>>>>>>>>>>>>>>>> only SplitEnumerator, which will pass splits again after
>>>>>> some TTL).
>>>>>>>>>>>>>>>>>>> WDYT? I must say that this looks like a long-term goal
>>>> and
>>>>>> will make
>>>>>>>>>>>>>>>>>>> the scope of this FLIP even larger than you said. Maybe
>>>> we
>>>>>> can limit
>>>>>>>>>>>>>>>>>>> ourselves to a simpler solution now (InputFormats).
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> So to sum up, my points is like this:
>>>>>>>>>>>>>>>>>>> 1) There is a way to make both concise and flexible
>>>>>> interfaces for
>>>>>>>>>>>>>>>>>>> caching in lookup join.
>>>>>>>>>>>>>>>>>>> 2) Cache filters optimization is important both in LRU
>>>> and
>>>>>> ALL caches.
>>>>>>>>>>>>>>>>>>> 3) It is unclear when filter pushdown will be supported
>>>> in
>>>>>> Flink
>>>>>>>>>>>>>>>>>>> connectors, some of the connectors might not have the
>>>>>> opportunity to
>>>>>>>>>>>>>>>>>>> support filter pushdown + as far as I know, currently filter
>>>>>> pushdown works
>>>>>>>>>>>>>>>>>>> only for scanning (not lookup). So cache filters +
>>>>>> projections
>>>>>>>>>>>>>>>>>>> optimization should be independent from other features.
>>>>>>>>>>>>>>>>>>> 4) ALL cache realization is a complex topic that involves
>>>>>> multiple
>>>>>>>>>>>>>>>>>>> aspects of how Flink is evolving. Moving away from
>>>>>> InputFormat in favor
>>>>>>>>>>>>>>>>>>> of FLIP-27 Source will make ALL cache realization really
>>>>>> complex and
>>>>>>>>>>>>>>>>>>> not clear, so maybe instead we can extend the functionality
>>>>>> of the simple
>>>>>>>>>>>>>>>>>>> join, or keep InputFormat for the lookup join ALL cache?
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>>>>>>> Smirnov Alexander
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> [1]
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>> https://guava.dev/releases/18.0/api/docs/com/google/common/cache/CacheBuilder.html#weigher(com.google.common.cache.Weigher)
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> чт, 5 мая 2022 г. в 20:34, Jark Wu <im...@gmail.com>:
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> It's great to see the active discussion! I want to share
>>>>> my
>>>>>> ideas:
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> 1) implement the cache in framework vs. connectors base
>>>>>>>>>>>>>>>>>>>> I don't have a strong opinion on this. Both ways should
>>>>>> work (e.g.,
>>>>>>>>>>>>>>>>> cache
>>>>>>>>>>>>>>>>>>>> pruning, compatibility).
>>>>>>>>>>>>>>>>>>>> The framework way can provide more concise interfaces.
>>>>>>>>>>>>>>>>>>>> The connector base way can define more flexible cache
>>>>>>>>>>>>>>>>>>>> strategies/implementations.
>>>>>>>>>>>>>>>>>>>> We are still investigating a way to see if we can have
>>>>> both
>>>>>>>>>>>>>>>>> advantages.
>>>>>>>>>>>>>>>>>>>> We should reach a consensus that the way should be a
>>>> final
>>>>>> state,
>>>>>>>>>>>>>>>>> and we
>>>>>>>>>>>>>>>>>>>> are on the path to it.
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> 2) filters and projections pushdown:
>>>>>>>>>>>>>>>>>>>> I agree with Alex that the filter pushdown into cache
>>>> can
>>>>>> benefit a
>>>>>>>>>>>>>>>>> lot
>>>>>>>>>>>>>>>>>>> for
>>>>>>>>>>>>>>>>>>>> ALL cache.
>>>>>>>>>>>>>>>>>>>> However, this is not true for LRU cache. Connectors use
>>>>>> cache to
>>>>>>>>>>>>>>>>> reduce
>>>>>>>>>>>>>>>>>>> IO
>>>>>>>>>>>>>>>>>>>> requests to databases for better throughput.
>>>>>>>>>>>>>>>>>>>> If a filter can prune 90% of data in the cache, we will
>>>>>> have 90% of
>>>>>>>>>>>>>>>>>>> lookup
>>>>>>>>>>>>>>>>>>>> requests that can never be cached
>>>>>>>>>>>>>>>>>>>> and will hit the databases directly. That means the cache
>>>> is
>>>>>>>>>>>>>>>>> meaningless in
>>>>>>>>>>>>>>>>>>>> this case.
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> IMO, Flink SQL has provided a standard way to do filters
>>>>>> and projects
>>>>>>>>>>>>>>>>>>>> pushdown, i.e., SupportsFilterPushDown and
>>>>>>>>>>>>>>>>> SupportsProjectionPushDown.
>>>>>>>>>>>>>>>>>>>> Jdbc/hive/HBase haven't implemented the interfaces; that
>>>>>>>>>>>>>>>>>>>> doesn't mean they are hard to implement.
>>>>>>>>>>>>>>>>>>>> They should implement the pushdown interfaces to reduce
>>>> IO
>>>>>> and the
>>>>>>>>>>>>>>>>> cache
>>>>>>>>>>>>>>>>>>>> size.
>>>>>>>>>>>>>>>>>>>> That should be a final state that the scan source and
>>>>>> lookup source
>>>>>>>>>>>>>>>>> share
>>>>>>>>>>>>>>>>>>>> the exact pushdown implementation.
>>>>>>>>>>>>>>>>>>>> I don't see why we need to duplicate the pushdown logic in
>>>>>>>>>>>>>>>>>>>> caches, which would complicate the lookup join design.
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> 3) ALL cache abstraction
>>>>>>>>>>>>>>>>>>>> All cache might be the most challenging part of this
>>>> FLIP.
>>>>>> We have
>>>>>>>>>>>>>>>>> never
>>>>>>>>>>>>>>>>>>>> provided a reload-lookup public interface.
>>>>>>>>>>>>>>>>>>>> Currently, we put the reload logic in the "eval" method
>>>> of
>>>>>>>>>>>>>>>>> TableFunction.
>>>>>>>>>>>>>>>>>>>> That's hard for some sources (e.g., Hive).
>>>>>>>>>>>>>>>>>>>> Ideally, connector implementation should share the logic
>>>>> of
>>>>>> reload
>>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>>>> scan, i.e. ScanTableSource with
>>>>>> InputFormat/SourceFunction/FLIP-27
>>>>>>>>>>>>>>>>>>> Source.
>>>>>>>>>>>>>>>>>>>> However, InputFormat/SourceFunction are deprecated, and
>>>>> the
>>>>>> FLIP-27
>>>>>>>>>>>>>>>>>>> source
>>>>>>>>>>>>>>>>>>>> is deeply coupled with SourceOperator.
>>>>>>>>>>>>>>>>>>>> If we want to invoke the FLIP-27 source in LookupJoin,
>>>>> this
>>>>>> may make
>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>> scope of this FLIP much larger.
>>>>>>>>>>>>>>>>>>>> We are still investigating how to abstract the ALL cache
>>>>>> logic and
>>>>>>>>>>>>>>>>> reuse
>>>>>>>>>>>>>>>>>>>> the existing source interfaces.
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>>>>>> Jark
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> On Thu, 5 May 2022 at 20:22, Roman Boyko <
>>>>>> ro.v.boyko@gmail.com>
>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> It's a much more complicated activity and lies outside
>>>> the
>>>>>> scope of
>>>>>>>>>>>>>>>>> this
>>>>>>>>>>>>>>>>>>>>> improvement. Because such pushdowns should be done for
>>>>> all
>>>>>>>>>>>>>>>>>>> ScanTableSource
>>>>>>>>>>>>>>>>>>>>> implementations (not only for Lookup ones).
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> On Thu, 5 May 2022 at 19:02, Martijn Visser <
>>>>>>>>>>>>>>>>> martijnvisser@apache.org>
>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> Hi everyone,
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> One question regarding "And Alexander correctly
>>>>> mentioned
>>>>>> that
>>>>>>>>>>>>>>>>> filter
>>>>>>>>>>>>>>>>>>>>>> pushdown still is not implemented for
>>>> jdbc/hive/hbase."
>>>>>> -> Would
>>>>>>>>>>>>>>>>> an
>>>>>>>>>>>>>>>>>>>>>> alternative solution be to actually implement these
>>>>> filter
>>>>>>>>>>>>>>>>> pushdowns?
>>>>>>>>>>>>>>>>>>> I
>>>>>>>>>>>>>>>>>>>>>> can
>>>>>>>>>>>>>>>>>>>>>> imagine that there are many more benefits to doing
>>>> that,
>>>>>> outside
>>>>>>>>>>>>>>>>> of
>>>>>>>>>>>>>>>>>>> lookup
>>>>>>>>>>>>>>>>>>>>>> caching and metrics.
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> Martijn Visser
>>>>>>>>>>>>>>>>>>>>>> https://twitter.com/MartijnVisser82
>>>>>>>>>>>>>>>>>>>>>> https://github.com/MartijnVisser
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> On Thu, 5 May 2022 at 13:58, Roman Boyko <
>>>>>> ro.v.boyko@gmail.com>
>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> Hi everyone!
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> Thanks for driving such a valuable improvement!
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> I do think that single cache implementation would be
>>>> a
>>>>>> nice
>>>>>>>>>>>>>>>>>>> opportunity
>>>>>>>>>>>>>>>>>>>>>> for
>>>>>>>>>>>>>>>>>>>>>>> users. And it will break the "FOR SYSTEM_TIME AS OF
>>>>>> proc_time"
>>>>>>>>>>>>>>>>>>> semantics
>>>>>>>>>>>>>>>>>>>>>>> anyway - no matter how it is implemented.
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> Putting myself in the user's shoes, I can say that:
>>>>>>>>>>>>>>>>>>>>>>> 1) I would prefer to have the opportunity to cut down the
>>>>>>>>>>>>>>>>>>>>>>> cache size by simply filtering unnecessary data. And the
>>>>>>>>>>>>>>>>>>>>>>> handiest way to do it is to apply
>>>>>>>>>>>>>>>>>>>>>>> it inside LookupRunners. It would be a bit harder to
>>>>>> pass it
>>>>>>>>>>>>>>>>>>> through the
>>>>>>>>>>>>>>>>>>>>>>> LookupJoin node to TableFunction. And Alexander
>>>>> correctly
>>>>>>>>>>>>>>>>> mentioned
>>>>>>>>>>>>>>>>>>> that
>>>>>>>>>>>>>>>>>>>>>>> filter pushdown still is not implemented for
>>>>>> jdbc/hive/hbase.
>>>>>>>>>>>>>>>>>>>>>>> 2) The ability to set the different caching
>>>> parameters
>>>>>> for
>>>>>>>>>>>>>>>>> different
>>>>>>>>>>>>>>>>>>>>>> tables
>>>>>>>>>>>>>>>>>>>>>>> is quite important. So I would prefer to set it
>>>> through
>>>>>> DDL
>>>>>>>>>>>>>>>>> rather
>>>>>>>>>>>>>>>>>>> than
>>>>>>>>>>>>>>>>>>>>>>> have the same TTL, strategy and other options for all
>>>>>> lookup
>>>>>>>>>>>>>>>>> tables.
>>>>>>>>>>>>>>>>>>>>>>> 3) Putting the cache into the framework really
>>>>>> deprives us of
>>>>>>>>>>>>>>>>>>>>>>> extensibility (users won't be able to implement their
>>>>> own
>>>>>>>>>>>>>>>>> cache).
>>>>>>>>>>>>>>>>>>> But
>>>>>>>>>>>>>>>>>>>>>> most
>>>>>>>>>>>>>>>>>>>>>>> probably it might be solved by creating more
>>>> different
>>>>>> cache
>>>>>>>>>>>>>>>>>>> strategies
>>>>>>>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>>>>>>> a wider set of configurations.
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> All these points are much closer to the schema
>>>> proposed
>>>>>> by
>>>>>>>>>>>>>>>>>>> Alexander.
>>>>>>>>>>>>>>>>>>>>>>> Qingshen Ren, please correct me if I'm not right and
>>>>> all
>>>>>> these
>>>>>>>>>>>>>>>>>>>>>> facilities
>>>>>>>>>>>>>>>>>>>>>>> might be simply implemented in your architecture?
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>>>>>>>>>>> Roman Boyko
>>>>>>>>>>>>>>>>>>>>>>> e.: ro.v.boyko@gmail.com
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> On Wed, 4 May 2022 at 21:01, Martijn Visser <
>>>>>>>>>>>>>>>>>>> martijnvisser@apache.org>
>>>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> Hi everyone,
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> I don't have much to chip in, but just wanted to express that I really
>>>>>>>>>>>>>>>>>>>>>>>> appreciate the in-depth discussion on this topic and I hope that others
>>>>>>>>>>>>>>>>>>>>>>>> will join the conversation.
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> Martijn
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> On Tue, 3 May 2022 at 10:15, Александр Смирнов <
>>>>>>>>>>>>>>>>>>> smiralexan@gmail.com>
>>>>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> Hi Qingsheng, Leonard and Jark,
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for your detailed feedback! However, I have questions about
>>>>>>>>>>>>>>>>>>>>>>>>> some of your statements (maybe I didn't get something?).
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> Caching actually breaks the semantic of "FOR SYSTEM_TIME AS OF
>>>>>>>>>>>>>>>>>>>>>>>>>> proc_time”
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> I agree that the semantics of "FOR SYSTEM_TIME AS OF proc_time" is not
>>>>>>>>>>>>>>>>>>>>>>>>> fully implemented with caching, but as you said, users opt into it
>>>>>>>>>>>>>>>>>>>>>>>>> consciously to achieve better performance (no one proposed to enable
>>>>>>>>>>>>>>>>>>>>>>>>> caching by default, etc.). Or by users do you mean other developers of
>>>>>>>>>>>>>>>>>>>>>>>>> connectors? In that case developers explicitly specify whether their
>>>>>>>>>>>>>>>>>>>>>>>>> connector supports caching (in the list of supported options); no one
>>>>>>>>>>>>>>>>>>>>>>>>> makes them do that if they don't want to. So what exactly is the
>>>>>>>>>>>>>>>>>>>>>>>>> difference between implementing caching in the flink-table-runtime
>>>>>>>>>>>>>>>>>>>>>>>>> module and in flink-table-common from this point of view? How does it
>>>>>>>>>>>>>>>>>>>>>>>>> affect breaking or preserving the semantics of "FOR SYSTEM_TIME AS OF
>>>>>>>>>>>>>>>>>>>>>>>>> proc_time"?
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> confront a situation that allows table options in DDL to control the
>>>>>>>>>>>>>>>>>>>>>>>>>> behavior of the framework, which has never happened previously and
>>>>>>>>>>>>>>>>>>>>>>>>>> should be cautious
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> If we talk about the main difference between the semantics of DDL
>>>>>>>>>>>>>>>>>>>>>>>>> options and config options ("table.exec.xxx"), isn't it about limiting
>>>>>>>>>>>>>>>>>>>>>>>>> the scope of the options and their importance for the user's business
>>>>>>>>>>>>>>>>>>>>>>>>> logic, rather than the specific location of the corresponding logic in
>>>>>>>>>>>>>>>>>>>>>>>>> the framework? I mean that in my design, for example, putting an
>>>>>>>>>>>>>>>>>>>>>>>>> option with the lookup cache strategy into configurations would be the
>>>>>>>>>>>>>>>>>>>>>>>>> wrong decision, because it directly affects the user's business logic
>>>>>>>>>>>>>>>>>>>>>>>>> (not just performance optimization) and touches just several functions
>>>>>>>>>>>>>>>>>>>>>>>>> of ONE table (there can be multiple tables with different caches).
>>>>>>>>>>>>>>>>>>>>>>>>> Does it really matter for the user (or someone else) where the logic
>>>>>>>>>>>>>>>>>>>>>>>>> affected by the applied option is located?
>>>>>>>>>>>>>>>>>>>>>>>>> Also I can recall the DDL option 'sink.parallelism', which in some way
>>>>>>>>>>>>>>>>>>>>>>>>> "controls the behavior of the framework", and I don't see any problem
>>>>>>>>>>>>>>>>>>>>>>>>> here.
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> introduce a new interface for this all-caching scenario and the
>>>>>>>>>>>>>>>>>>>>>>>>>> design would become more complex
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> This is a subject for a separate discussion, but actually in our
>>>>>>>>>>>>>>>>>>>>>>>>> internal version we solved this problem quite easily - we reused the
>>>>>>>>>>>>>>>>>>>>>>>>> InputFormat class (so there is no need for a new API). The point is
>>>>>>>>>>>>>>>>>>>>>>>>> that currently all lookup connectors use InputFormat for scanning the
>>>>>>>>>>>>>>>>>>>>>>>>> data in batch mode: HBase, JDBC and even Hive - it uses the class
>>>>>>>>>>>>>>>>>>>>>>>>> PartitionReader, which is actually just a wrapper around InputFormat.
>>>>>>>>>>>>>>>>>>>>>>>>> The advantage of this solution is the ability to reload cache data in
>>>>>>>>>>>>>>>>>>>>>>>>> parallel (the number of threads depends on the number of InputSplits,
>>>>>>>>>>>>>>>>>>>>>>>>> but has an upper limit). As a result the cache reload time is reduced
>>>>>>>>>>>>>>>>>>>>>>>>> significantly (as well as the time the input stream is blocked). I
>>>>>>>>>>>>>>>>>>>>>>>>> know that we usually try to avoid concurrency in Flink code, but maybe
>>>>>>>>>>>>>>>>>>>>>>>>> this one can be an exception. BTW I don't say that it's an ideal
>>>>>>>>>>>>>>>>>>>>>>>>> solution, maybe there are better ones.
>>>>>>>>>>>>>>>>>>>>>>>>> 
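The parallel reload idea above can be sketched in plain Java as follows. This is only an illustration: `InputSplit` and `reload` here are hypothetical stand-ins, not Flink's actual InputFormat API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ParallelCacheReload {
    // Hypothetical stand-in for one partition of the dimension table.
    record InputSplit(int id, Map<String, String> rows) {}

    // Reloads all splits into a shared cache; thread count depends on the
    // number of splits but has an upper limit, as described above.
    static Map<String, String> reload(List<InputSplit> splits, int maxThreads)
            throws InterruptedException {
        Map<String, String> cache = new ConcurrentHashMap<>();
        ExecutorService pool = Executors.newFixedThreadPool(
                Math.max(1, Math.min(splits.size(), maxThreads)));
        for (InputSplit split : splits) {
            pool.submit(() -> cache.putAll(split.rows())); // read one split
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
        return cache;
    }

    public static void main(String[] args) throws InterruptedException {
        List<InputSplit> splits = new ArrayList<>();
        splits.add(new InputSplit(0, Map.of("u1", "alice")));
        splits.add(new InputSplit(1, Map.of("u2", "bob")));
        Map<String, String> cache = reload(splits, 4);
        System.out.println(cache.size()); // 2
    }
}
```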
>>>>>>>>>>>>>>>>>>>>>>>>>> Providing the cache in the framework might introduce compatibility
>>>>>>>>>>>>>>>>>>>>>>>>>> issues
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> That's possible only if the developer of the connector doesn't
>>>>>>>>>>>>>>>>>>>>>>>>> properly refactor his code and uses the new cache options incorrectly
>>>>>>>>>>>>>>>>>>>>>>>>> (i.e. explicitly passes the same options into 2 different code
>>>>>>>>>>>>>>>>>>>>>>>>> places). For correct behavior all he needs to do is redirect the
>>>>>>>>>>>>>>>>>>>>>>>>> existing options to the framework's LookupConfig (+ maybe add an alias
>>>>>>>>>>>>>>>>>>>>>>>>> for options, if there was different naming); everything will be
>>>>>>>>>>>>>>>>>>>>>>>>> transparent for users. If the developer doesn't refactor at all,
>>>>>>>>>>>>>>>>>>>>>>>>> nothing will change for the connector because of backward
>>>>>>>>>>>>>>>>>>>>>>>>> compatibility. Also if a developer wants to use his own cache logic,
>>>>>>>>>>>>>>>>>>>>>>>>> he can simply refuse to pass some of the configs to the framework, and
>>>>>>>>>>>>>>>>>>>>>>>>> instead make his own implementation with the already existing configs
>>>>>>>>>>>>>>>>>>>>>>>>> and metrics (but actually I think that's a rare case).
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> filters and projections should be pushed all the way down to the
>>>>>>>>>>>>>>>>>>>>>>>>>> table function, like what we do in the scan source
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> That's a great goal. But the truth is that the ONLY connector that
>>>>>>>>>>>>>>>>>>>>>>>>> supports filter pushdown is FileSystemTableSource (no database
>>>>>>>>>>>>>>>>>>>>>>>>> connector supports it currently). Also for some databases it's simply
>>>>>>>>>>>>>>>>>>>>>>>>> impossible to push down filters as complex as the ones we have in
>>>>>>>>>>>>>>>>>>>>>>>>> Flink.
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> only applying these optimizations to the cache seems not quite
>>>>>>>>>>>>>>>>>>>>>>>>>> useful
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> Filters can cut off an arbitrarily large amount of data from the
>>>>>>>>>>>>>>>>>>>>>>>>> dimension table. For a simple example, suppose in the dimension table
>>>>>>>>>>>>>>>>>>>>>>>>> 'users' we have a column 'age' with values from 20 to 40, and an input
>>>>>>>>>>>>>>>>>>>>>>>>> stream 'clicks' that is roughly uniformly distributed by age of users.
>>>>>>>>>>>>>>>>>>>>>>>>> If we have the filter 'age > 30', there will be half as much data in
>>>>>>>>>>>>>>>>>>>>>>>>> the cache. This means the user can increase 'lookup.cache.max-rows' by
>>>>>>>>>>>>>>>>>>>>>>>>> almost 2 times, which will give a huge performance boost. Moreover,
>>>>>>>>>>>>>>>>>>>>>>>>> this optimization really starts to shine with the 'ALL' cache, where
>>>>>>>>>>>>>>>>>>>>>>>>> tables without filters and projections can't fit in memory, but with
>>>>>>>>>>>>>>>>>>>>>>>>> them - can. This opens up additional possibilities for users. And that
>>>>>>>>>>>>>>>>>>>>>>>>> doesn't sound like 'not quite useful'.
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> It would be great to hear other voices regarding this topic! Because
>>>>>>>>>>>>>>>>>>>>>>>>> we have quite a lot of controversial points, and I think with the
>>>>>>>>>>>>>>>>>>>>>>>>> help of others it will be easier for us to come to a consensus.
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>>>>>>>>>>>>> Smirnov Alexander
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> Fri, 29 Apr 2022 at 22:33, Qingsheng Ren <renqschn@gmail.com>:
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> Hi Alexander and Arvid,
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for the discussion and sorry for my late response! We had an
>>>>>>>>>>>>>>>>>>>>>>>>>> internal discussion together with Jark and Leonard and I’d like to
>>>>>>>>>>>>>>>>>>>>>>>>>> summarize our ideas. Instead of implementing the cache logic in the
>>>>>>>>>>>>>>>>>>>>>>>>>> table runtime layer or wrapping around the user-provided table
>>>>>>>>>>>>>>>>>>>>>>>>>> function, we prefer to introduce some new APIs extending
>>>>>>>>>>>>>>>>>>>>>>>>>> TableFunction, with these concerns:
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> 1. Caching actually breaks the semantic of "FOR SYSTEM_TIME AS OF
>>>>>>>>>>>>>>>>>>>>>>>>>> proc_time”, because it couldn’t truly reflect the content of the
>>>>>>>>>>>>>>>>>>>>>>>>>> lookup table at the moment of querying. If users choose to enable
>>>>>>>>>>>>>>>>>>>>>>>>>> caching on the lookup table, they implicitly indicate that this
>>>>>>>>>>>>>>>>>>>>>>>>>> breakage is acceptable in exchange for the performance. So we prefer
>>>>>>>>>>>>>>>>>>>>>>>>>> not to provide caching on the table runtime level.
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> 2. If we make the cache implementation in the framework (whether in
>>>>>>>>>>>>>>>>>>>>>>>>>> a runner or a wrapper around TableFunction), we have to confront a
>>>>>>>>>>>>>>>>>>>>>>>>>> situation that allows table options in DDL to control the behavior
>>>>>>>>>>>>>>>>>>>>>>>>>> of the framework, which has never happened previously and should be
>>>>>>>>>>>>>>>>>>>>>>>>>> cautious. Under the current design the behavior of the framework
>>>>>>>>>>>>>>>>>>>>>>>>>> should only be specified by configurations (“table.exec.xxx”), and
>>>>>>>>>>>>>>>>>>>>>>>>>> it’s hard to apply these general configs to a specific table.
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> 3. We have use cases where the lookup source loads and periodically
>>>>>>>>>>>>>>>>>>>>>>>>>> refreshes all records into memory to achieve high lookup performance
>>>>>>>>>>>>>>>>>>>>>>>>>> (like the Hive connector in the community, and also widely used by
>>>>>>>>>>>>>>>>>>>>>>>>>> our internal connectors). Wrapping the cache around the user’s
>>>>>>>>>>>>>>>>>>>>>>>>>> TableFunction works fine for LRU caches, but I think we would have
>>>>>>>>>>>>>>>>>>>>>>>>>> to introduce a new interface for this all-caching scenario and the
>>>>>>>>>>>>>>>>>>>>>>>>>> design would become more complex.
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> 4. Providing the cache in the framework might introduce
>>>>>>>>>>>>>>>>>>>>>>>>>> compatibility issues to existing lookup sources: there might exist
>>>>>>>>>>>>>>>>>>>>>>>>>> two caches with totally different strategies if the user incorrectly
>>>>>>>>>>>>>>>>>>>>>>>>>> configures the table (one in the framework and another implemented
>>>>>>>>>>>>>>>>>>>>>>>>>> by the lookup source).
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> As for the optimization mentioned by Alexander, I think filters and
>>>>>>>>>>>>>>>>>>>>>>>>>> projections should be pushed all the way down to the table function,
>>>>>>>>>>>>>>>>>>>>>>>>>> like what we do in the scan source, instead of into the runner with
>>>>>>>>>>>>>>>>>>>>>>>>>> the cache. The goal of using a cache is to reduce the network I/O
>>>>>>>>>>>>>>>>>>>>>>>>>> and the pressure on the external system, and only applying these
>>>>>>>>>>>>>>>>>>>>>>>>>> optimizations to the cache seems not quite useful.
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> I made some updates to the FLIP[1] to reflect our ideas. We prefer
>>>>>>>>>>>>>>>>>>>>>>>>>> to keep the cache implementation as a part of TableFunction, and we
>>>>>>>>>>>>>>>>>>>>>>>>>> could provide some helper classes (CachingTableFunction,
>>>>>>>>>>>>>>>>>>>>>>>>>> AllCachingTableFunction, CachingAsyncTableFunction) to developers
>>>>>>>>>>>>>>>>>>>>>>>>>> and regulate the metrics of the cache. Also, I made a POC[2] for
>>>>>>>>>>>>>>>>>>>>>>>>>> your reference.
>>>>>>>>>>>>>>>>>>>>>>>>>> 
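To make the delegation idea concrete, here is a minimal, self-contained sketch of what a CachingTableFunction-style wrapper could look like. The `LookupFunction` interface, the row type, and the hit/miss counters are simplified stand-ins invented for illustration, not the classes from the POC.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class CachingLookupSketch {
    // Simplified stand-in for a lookup TableFunction: key -> matching rows.
    interface LookupFunction {
        List<String> lookup(String key);
    }

    // Wraps a delegate with a bounded LRU cache; delegates only on a miss.
    static class CachingLookupFunction implements LookupFunction {
        private final LookupFunction delegate;
        private final Map<String, List<String>> cache;
        long hits, misses; // the kind of standard metrics the FLIP proposes

        CachingLookupFunction(LookupFunction delegate, int maxRows) {
            this.delegate = delegate;
            // access-order LinkedHashMap gives a simple LRU eviction policy
            this.cache = new LinkedHashMap<>(16, 0.75f, true) {
                @Override
                protected boolean removeEldestEntry(Map.Entry<String, List<String>> e) {
                    return size() > maxRows;
                }
            };
        }

        @Override
        public List<String> lookup(String key) {
            List<String> cached = cache.get(key);
            if (cached != null) { hits++; return cached; }
            misses++;
            List<String> rows = delegate.lookup(key); // external system call
            cache.put(key, rows);
            return rows;
        }
    }

    public static void main(String[] args) {
        LookupFunction db = key -> List.of(key + "-row");
        CachingLookupFunction f = new CachingLookupFunction(db, 100);
        f.lookup("k1");
        f.lookup("k1");
        System.out.println(f.hits + " " + f.misses); // 1 1
    }
}
```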
>>>>>>>>>>>>>>>>>>>>>>>>>> Looking forward to your ideas!
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> [1]
>>>>>>>>>>>>>>>>>>>>>>>>>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-221+Abstraction+for+lookup+source+cache+and+metric
>>>>>>>>>>>>>>>>>>>>>>>>>> [2] https://github.com/PatrickRen/flink/tree/FLIP-221
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> Qingsheng
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> On Tue, Apr 26, 2022 at 4:45 PM Александр Смирнов
>>>> <
>>>>>>>>>>>>>>>>>>>>>>>> smiralexan@gmail.com>
>>>>>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for the response, Arvid!
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> I have a few comments on your message.
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> but could also live with an easier solution as the first step
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> I think these 2 ways are mutually exclusive (the one originally
>>>>>>>>>>>>>>>>>>>>>>>>>>> proposed by Qingsheng and mine), because conceptually they follow
>>>>>>>>>>>>>>>>>>>>>>>>>>> the same goal, but the implementation details are different. If we
>>>>>>>>>>>>>>>>>>>>>>>>>>> go one way, moving to the other in the future will mean deleting
>>>>>>>>>>>>>>>>>>>>>>>>>>> existing code and once again changing the API for connectors. So I
>>>>>>>>>>>>>>>>>>>>>>>>>>> think we should reach a consensus with the community about that and
>>>>>>>>>>>>>>>>>>>>>>>>>>> then work together on this FLIP, i.e. divide the work into tasks
>>>>>>>>>>>>>>>>>>>>>>>>>>> for different parts of the FLIP (for example, LRU cache unification
>>>>>>>>>>>>>>>>>>>>>>>>>>> / introducing the proposed set of metrics / further work…). WDYT,
>>>>>>>>>>>>>>>>>>>>>>>>>>> Qingsheng?
>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> as the source will only receive the requests after filter
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> Actually if filters are applied to fields of the lookup table, we
>>>>>>>>>>>>>>>>>>>>>>>>>>> first must make the requests, and only after that can we filter the
>>>>>>>>>>>>>>>>>>>>>>>>>>> responses, because lookup connectors don't have filter pushdown. So
>>>>>>>>>>>>>>>>>>>>>>>>>>> if filtering is done before caching, there will be far fewer rows
>>>>>>>>>>>>>>>>>>>>>>>>>>> in the cache.
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> @Alexander unfortunately, your architecture is not shared. I don't
>>>>>>>>>>>>>>>>>>>>>>>>>>>> know the solution to share images to be honest.
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> Sorry for that, I’m a bit new to these kinds of conversations :)
>>>>>>>>>>>>>>>>>>>>>>>>>>> I have no write access to the confluence, so I made a Jira issue
>>>>>>>>>>>>>>>>>>>>>>>>>>> where I described the proposed changes in more detail -
>>>>>>>>>>>>>>>>>>>>>>>>>>> https://issues.apache.org/jira/browse/FLINK-27411.
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> Will be happy to get more feedback!
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>>>>>>>>>>>>> Smirnov Alexander
>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>> Mon, 25 Apr 2022 at 19:49, Arvid Heise <arvid@apache.org>:
>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi Qingsheng,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for driving this; the inconsistency was not satisfying for
>>>>>>>>>>>>>>>>>>>>>>>>>>>> me.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> I second Alexander's idea though but could also live with an
>>>>>>>>>>>>>>>>>>>>>>>>>>>> easier solution as the first step: Instead of making caching an
>>>>>>>>>>>>>>>>>>>>>>>>>>>> implementation detail of TableFunction X, rather devise a caching
>>>>>>>>>>>>>>>>>>>>>>>>>>>> layer around X. So the proposal would be a CachingTableFunction
>>>>>>>>>>>>>>>>>>>>>>>>>>>> that delegates to X in case of misses and else manages the cache.
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Lifting it into the operator model as proposed would be even
>>>>>>>>>>>>>>>>>>>>>>>>>>>> better but is probably unnecessary in the first step for a lookup
>>>>>>>>>>>>>>>>>>>>>>>>>>>> source (as the source will only receive the requests after the
>>>>>>>>>>>>>>>>>>>>>>>>>>>> filter; applying the projection may be more interesting to save
>>>>>>>>>>>>>>>>>>>>>>>>>>>> memory).
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Another advantage is that all the changes of this FLIP would be
>>>>>>>>>>>>>>>>>>>>>>>>>>>> limited to options, no need for new public interfaces. Everything
>>>>>>>>>>>>>>>>>>>>>>>>>>>> else remains an implementation detail of the Table runtime. That
>>>>>>>>>>>>>>>>>>>>>>>>>>>> means we can easily incorporate the optimization potential that
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Alexander pointed out later.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> @Alexander unfortunately, your architecture is not shared. I don't
>>>>>>>>>>>>>>>>>>>>>>>>>>>> know the solution to share images to be honest.
>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Fri, Apr 22, 2022 at 5:04 PM Александр
>>>> Смирнов
>>>>> <
>>>>>>>>>>>>>>>>>>>>>>>>> smiralexan@gmail.com>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi Qingsheng! My name is Alexander, I'm not a committer yet, but
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I'd really like to become one. And this FLIP really interested
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> me. Actually I have worked on a similar feature in my company’s
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Flink fork, and we would like to share our thoughts on this and
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> make the code open source.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I think there is a better alternative to introducing an abstract
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> class for TableFunction (CachingTableFunction). As you know,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> TableFunction lives in the flink-table-common module, which
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> provides only an API for working with tables - that's very
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> convenient for importing in connectors. In turn,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> CachingTableFunction contains logic for runtime execution, so
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> this class and everything connected with it should be located in
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> another module, probably in flink-table-runtime. But this would
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> require connectors to depend on another module that contains a
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> lot of runtime logic, which doesn't sound good.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I suggest adding a new method ‘getLookupConfig’ to
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> LookupTableSource or LookupRuntimeProvider to allow connectors to
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> only pass configurations to the planner, so they won’t depend on
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> a runtime realization. Based on these configs the planner will
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> construct a lookup join operator with the corresponding runtime
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> logic (ProcessFunctions in the module flink-table-runtime). The
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> architecture looks like in the pinned image (the LookupConfig
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> class there is actually your CacheConfig).
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
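The 'getLookupConfig' proposal above could be sketched roughly as follows. All names here (`LookupConfig`, its fields, the `getLookupConfig` hook) are illustrative guesses at the proposed shape, not the actual patch.

```java
public class LookupConfigSketch {
    enum CacheStrategy { NONE, LRU, ALL }

    // Hypothetical config object a connector hands to the planner.
    static final class LookupConfig {
        final CacheStrategy strategy;
        final long maxRows;
        final long ttlMillis;

        LookupConfig(CacheStrategy strategy, long maxRows, long ttlMillis) {
            this.strategy = strategy;
            this.maxRows = maxRows;
            this.ttlMillis = ttlMillis;
        }
    }

    // Hypothetical source-side hook: the connector only declares its config;
    // the planner then picks a matching runner (e.g. a caching one), so the
    // connector never depends on flink-table-runtime classes.
    interface LookupTableSource {
        LookupConfig getLookupConfig();
    }

    public static void main(String[] args) {
        LookupTableSource jdbcLike =
                () -> new LookupConfig(CacheStrategy.LRU, 10_000, 60_000);
        LookupConfig c = jdbcLike.getLookupConfig();
        System.out.println(c.strategy + " " + c.maxRows); // LRU 10000
    }
}
```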
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> The classes in flink-table-planner that will be responsible for
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> this are CommonPhysicalLookupJoin and its subclasses.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> The current classes for lookup join in flink-table-runtime are
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> LookupJoinRunner, AsyncLookupJoinRunner,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> LookupJoinRunnerWithCalc, AsyncLookupJoinRunnerWithCalc.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I suggest adding classes LookupJoinCachingRunner,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> LookupJoinCachingRunnerWithCalc, etc.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> And here comes another, more powerful advantage of such a
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> solution. If we have the caching logic on a lower level, we can
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> apply some optimizations to it. LookupJoinRunnerWithCalc was
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> named like this because it uses the ‘calc’ function, which
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> actually mostly consists of filters and projections.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> For example, in a join of table A with lookup table B with the
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> condition ‘JOIN … ON A.id = B.id AND A.age = B.age + 10 WHERE
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> B.salary > 1000’ the ‘calc’ function will contain the filters
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> A.age = B.age + 10 and B.salary > 1000.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> If we apply this function before storing records in the cache,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the size of the cache will be significantly reduced: filters =
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> avoid storing useless records in the cache, projections = reduce
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the records’ size. So the initial max number of records in the
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> cache can be increased by the user.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
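The "apply the calc function before caching" idea above can be sketched like this. The row types and the single salary filter are hypothetical illustrations (the probe-side filter A.age = B.age + 10 depends on the input row and is left out here); this is not the proposed implementation itself.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Predicate;

public class CalcBeforeCacheSketch {
    // Hypothetical row of lookup table B.
    record Row(long id, int age, int salary) {}
    // Projected form actually stored in the cache (salary column dropped).
    record CachedRow(long id, int age) {}

    static final Map<Long, CachedRow> CACHE = new HashMap<>();

    // Stores a looked-up row only if it passes the calc filter, and stores
    // only the projected columns: filters cut the row count, projections
    // cut the row size.
    static void cache(Row row, Predicate<Row> calcFilter) {
        if (calcFilter.test(row)) {
            CACHE.put(row.id(), new CachedRow(row.id(), row.age()));
        }
    }

    public static void main(String[] args) {
        // WHERE B.salary > 1000 from the example above
        Predicate<Row> filter = r -> r.salary() > 1000;
        cache(new Row(1, 25, 500), filter);  // filtered out, not cached
        cache(new Row(2, 35, 2000), filter); // cached, without salary
        System.out.println(CACHE.size()); // 1
    }
}
```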
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> What do you think about it?
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On 2022/04/19 02:47:11 Qingsheng Ren wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi devs,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Yuan and I would like to start a discussion about FLIP-221[1],
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> which introduces an abstraction of the lookup table cache and
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> its standard metrics.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Currently each lookup table source has to implement its own
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> cache to store lookup results, and there isn’t a standard set
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> of metrics for users and developers to tune their jobs with
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> lookup joins, which is a quite common use case in Flink table
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> / SQL.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Therefore we propose some new APIs including a cache, metrics,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> wrapper classes of TableFunction and new table options. Please
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> take a look at the FLIP page [1] to get more details. Any
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> suggestions and comments would be appreciated!
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> [1]
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-221+Abstraction+for+lookup+source+cache+and+metric
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Qingsheng
>>>> 
>>>> --
>>>> Best Regards,
>>>> Qingsheng Ren
>>>> Real-time Computing Team
>>>> Alibaba Cloud
>>>> Email: renqschn@gmail.com
>>> 
>>> --
>>> Best regards,
>>> Roman Boyko
>>> e.: ro.v.boyko@gmail.com


Re: [DISCUSS] FLIP-221 Abstraction for lookup source cache and metric

Posted by Qingsheng Ren <re...@gmail.com>.
Hi devs,

Thanks for the in-depth discussion! We recently updated some designs in FLIP-221 [1]:

1. Introduced a new interface "FullCachingReloadTrigger" for developers to customize the reload strategy. The previous design was only time-based and not flexible enough. Developers can implement any logic on this interface to trigger a full cache reload.

2. LookupFunctionProvider / AsyncLookupFunctionProvider are renamed to PartialCachingLookupProvider / AsyncPartialCachingLookupProvider, in order to be symmetric with "FullCachingLookupProvider".

3. Removed the lookup option "lookup.async", because FLIP-234 is planning to move the decision of whether to use async mode into the planner.

4. LookupCacheMetricGroup is renamed to CacheMetricGroup, because it lives under flink-metrics-core and all caching behaviors can reuse it.

A POC has been pushed to my GitHub repo [2] to reflect the update. Some implementations of FullCachingReloadTrigger may be quite naive and exist only to show how the interface works.

Looking forward to your comments!

[1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-221%3A+Abstraction+for+lookup+source+cache+and+metric
[2] https://github.com/PatrickRen/flink/tree/FLIP-221-framework
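To make the trigger idea concrete: the snippet below is a hedged sketch of a purely time-based reload trigger in the spirit of the "FullCachingReloadTrigger" described above. The interface names and signatures here are illustrative stand-ins, not the FLIP's actual API.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class PeriodicReloadTriggerSketch {

    /** Hypothetical stand-in for the FLIP's reload callback: reloads the full cache. */
    interface ReloadTask {
        void reload();
    }

    /** Hypothetical time-based trigger: fires the reload task at a fixed interval. */
    static final class PeriodicReloadTrigger implements AutoCloseable {
        private final ScheduledExecutorService scheduler =
                Executors.newSingleThreadScheduledExecutor();

        void open(ReloadTask task, long intervalMillis) {
            // Run one reload immediately, then repeat with the given delay between runs.
            scheduler.scheduleWithFixedDelay(
                    task::reload, 0, intervalMillis, TimeUnit.MILLISECONDS);
        }

        @Override
        public void close() {
            scheduler.shutdownNow();
        }
    }

    /** Runs the trigger until at least three reloads have happened; returns the count. */
    static int runDemo() throws InterruptedException {
        AtomicInteger reloads = new AtomicInteger();
        CountDownLatch threeReloads = new CountDownLatch(3);
        try (PeriodicReloadTrigger trigger = new PeriodicReloadTrigger()) {
            trigger.open(() -> {
                reloads.incrementAndGet();
                threeReloads.countDown();
            }, 20);
            threeReloads.await(5, TimeUnit.SECONDS);
        }
        return reloads.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("reload count: " + runDemo());
    }
}
```

A non-time-based trigger (e.g. watching for a completed Hive partition) would implement the same callback contract but fire on an external event instead of a schedule.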

> On Jun 1, 2022, at 04:58, Jing Ge <ji...@ververica.com> wrote:
> 
> Hi Jark,
> 
> Thanks for clarifying it. It would be fine as long as we could provide the
> no-cache solution. I was just wondering if the client-side cache could
> really help when HBase is used, since the data to look up should be huge.
> Depending on how much data will be cached on the client side, the data that
> should be LRU in e.g. LruBlockCache will not be LRU anymore. In the
> worst-case scenario, once the cached data at the client side expires, the
> request will hit the disk, which will temporarily cause extra latency, if I
> am not mistaken.
> 
> Best regards,
> Jing
> 
> On Mon, May 30, 2022 at 9:59 AM Jark Wu <im...@gmail.com> wrote:
> 
>> Hi Jing Ge,
>> 
>> What do you mean by the "impact on the block cache used by HBase"?
>> In my understanding, the connector cache and HBase cache are totally two
>> things.
>> The connector cache is a local/client cache, and the HBase cache is a
>> server cache.
>> 
>>> does it make sense to have a no-cache solution as one of the
>> default solutions so that customers will have no effort for the migration
>> if they want to stick with Hbase cache
>> 
>> The implementation migration should be transparent to users. Take the HBase
>> connector as
>> an example,  it already supports lookup cache but is disabled by default.
>> After migration, the
>> connector still disables cache by default (i.e. no-cache solution). No
>> migration effort for users.
>> 
>> HBase cache and connector cache are two different things. The HBase cache
>> can't simply replace the connector cache, because one of the most important
>> usages of the connector cache is reducing I/O requests/responses and
>> improving the throughput, which can't be achieved by just using a
>> server-side cache.
>> 
>> Best,
>> Jark
>> 
>> 
>> 
>> 
>> On Fri, 27 May 2022 at 22:42, Jing Ge <ji...@ververica.com> wrote:
>> 
>>> Thanks all for the valuable discussion. The new feature looks very
>>> interesting.
>>> 
>>> According to the FLIP description: "*Currently we have JDBC, Hive and
>>> HBase connector implemented lookup table source. All existing
>>> implementations will be migrated to the current design and the migration
>>> will be transparent to end users*." I was only wondering whether we should
>>> pay attention to HBase and similar DBs. Since the lookup data is commonly
>>> huge when using HBase, partial caching will be used in this case, if I am
>>> not mistaken, which might have an impact on the block cache used by HBase,
>>> e.g. LruBlockCache.
>>> Another question: since HBase provides a sophisticated cache solution,
>>> does it make sense to have a no-cache solution as one of the default
>>> solutions, so that customers would have no migration effort if they want
>>> to stick with the HBase cache?
>>> 
>>> Best regards,
>>> Jing
>>> 
>>> On Fri, May 27, 2022 at 11:19 AM Jingsong Li <ji...@gmail.com>
>>> wrote:
>>> 
>>>> Hi all,
>>>> 
>>>> I think the problems now are the following:
>>>> 1. The AllCache and PartialCache interfaces are non-uniform: one needs to
>>>> provide a LookupProvider, the other a CacheBuilder.
>>>> 2. The AllCache definition is not flexible. For example, PartialCache can
>>>> use any custom storage while AllCache cannot; AllCache could also store to
>>>> memory or disk, so it needs a flexible strategy as well.
>>>> 3. AllCache cannot customize its ReloadStrategy; currently there is only
>>>> ScheduledReloadStrategy.
>>>> 
>>>> To solve the above problems, here are my ideas.
>>>> 
>>>> ## Top level cache interfaces:
>>>> 
>>>> ```
>>>> 
>>>> public interface CacheLookupProvider extends
>>>> LookupTableSource.LookupRuntimeProvider {
>>>> 
>>>>    CacheBuilder createCacheBuilder();
>>>> }
>>>> 
>>>> 
>>>> public interface CacheBuilder {
>>>>    Cache create();
>>>> }
>>>> 
>>>> 
>>>> public interface Cache {
>>>> 
>>>>    /**
>>>>     * Returns the value associated with key in this cache, or null if
>>>> there is no cached value for
>>>>     * key.
>>>>     */
>>>>    @Nullable
>>>>    Collection<RowData> getIfPresent(RowData key);
>>>> 
>>>>    /** Returns the number of key-value mappings in the cache. */
>>>>    long size();
>>>> }
>>>> 
>>>> ```
>>>> 
>>>> ## Partial cache
>>>> 
>>>> ```
>>>> 
>>>> public interface PartialCacheLookupFunction extends
>> CacheLookupProvider {
>>>> 
>>>>    @Override
>>>>    PartialCacheBuilder createCacheBuilder();
>>>> 
>>>> /** Creates an {@link LookupFunction} instance. */
>>>> LookupFunction createLookupFunction();
>>>> }
>>>> 
>>>> 
>>>> public interface PartialCacheBuilder extends CacheBuilder {
>>>> 
>>>>    PartialCache create();
>>>> }
>>>> 
>>>> 
>>>> public interface PartialCache extends Cache {
>>>> 
>>>>    /**
>>>>     * Associates the specified value rows with the specified key row in
>>>>     * the cache. If the cache previously contained a value associated
>>>>     * with the key, the old value is replaced by the specified value.
>>>>     *
>>>>     * @param key key row with which the specified value is to be associated
>>>>     * @param value value rows to be associated with the specified key
>>>>     * @return the previous value rows associated with the key, or null if
>>>>     *     there was no mapping for the key
>>>>     */
>>>>    Collection<RowData> put(RowData key, Collection<RowData> value);
>>>> 
>>>>    /** Discards any cached value for the specified key. */
>>>>    void invalidate(RowData key);
>>>> }
>>>> 
>>>> ```
>>>> 
>>>> ## All cache
>>>> ```
>>>> 
>>>> public interface AllCacheLookupProvider extends CacheLookupProvider {
>>>> 
>>>>    void registerReloadStrategy(ScheduledExecutorService
>>>> executorService, Reloader reloader);
>>>> 
>>>>    ScanTableSource.ScanRuntimeProvider getScanRuntimeProvider();
>>>> 
>>>>    @Override
>>>>    AllCacheBuilder createCacheBuilder();
>>>> }
>>>> 
>>>> 
>>>> public interface AllCacheBuilder extends CacheBuilder {
>>>> 
>>>>    AllCache create();
>>>> }
>>>> 
>>>> 
>>>> public interface AllCache extends Cache {
>>>> 
>>>>    void putAll(Iterator<Map<RowData, RowData>> allEntries);
>>>> 
>>>>    void clearAll();
>>>> }
>>>> 
>>>> 
>>>> public interface Reloader {
>>>> 
>>>>    void reload();
>>>> }
>>>> 
>>>> ```
>>>> 
>>>> Best,
>>>> Jingsong
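As a thought experiment on the PartialCache contract proposed above, here is a self-contained sketch of how a bounded partial cache honoring getIfPresent/put/invalidate/size might behave. String stands in for Flink's RowData, eviction is a plain LinkedHashMap-based LRU, and everything beyond the interface's method names is assumed for illustration.

```java
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class PartialCacheSketch {

    /** Minimal LRU partial cache; String stands in for Flink's RowData. */
    static final class LruPartialCache {
        private final LinkedHashMap<String, Collection<String>> entries;

        LruPartialCache(int maxRows) {
            // accessOrder=true gives LRU iteration order; evict the eldest
            // entry once the cache grows past maxRows.
            this.entries = new LinkedHashMap<>(16, 0.75f, true) {
                @Override
                protected boolean removeEldestEntry(
                        Map.Entry<String, Collection<String>> eldest) {
                    return size() > maxRows;
                }
            };
        }

        Collection<String> getIfPresent(String key) {
            return entries.get(key);
        }

        Collection<String> put(String key, Collection<String> value) {
            return entries.put(key, value);
        }

        void invalidate(String key) {
            entries.remove(key);
        }

        long size() {
            return entries.size();
        }
    }

    public static void main(String[] args) {
        LruPartialCache cache = new LruPartialCache(2);
        cache.put("k1", List.of("row-a"));
        cache.put("k2", List.of("row-b"));
        cache.getIfPresent("k1");           // touch k1 so k2 becomes eldest
        cache.put("k3", List.of("row-c"));  // evicts k2
        System.out.println(cache.getIfPresent("k2")); // null
        System.out.println(cache.size());             // 2
    }
}
```

In the proposed design the framework, not the connector, would own such a cache instance; the connector's LookupFunction is only consulted on a miss.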
>>>> 
>>>> On Fri, May 27, 2022 at 11:10 AM Jingsong Li <ji...@gmail.com>
>>>> wrote:
>>>> 
>>>>> Thanks Qingsheng and all for your discussion.
>>>>> 
>>>>> Very sorry to jump in so late.
>>>>> 
>>>>> Maybe I missed something?
>>>>> My first impression when I saw the cache interface was: why don't we
>>>>> provide an interface similar to the Guava cache [1]? On top of the Guava
>>>>> cache, Caffeine also adds extensions for asynchronous calls [2], and
>>>>> Caffeine supports bulk loading as well.
>>>>> 
>>>>> I am also confused about why we first go from LookupCacheFactory.Builder
>>>>> and then to a Factory to create the Cache.
>>>>> 
>>>>> [1] https://github.com/google/guava
>>>>> [2] https://github.com/ben-manes/caffeine/wiki/Population
>>>>> 
>>>>> Best,
>>>>> Jingsong
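For context, the Guava-style cache API referenced in the message above centers on a get(key, loader) call that computes and caches a value on a miss. A minimal hand-rolled sketch of that shape (this is not Guava itself; the class and method names are illustrative):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.Callable;

public class LoadingCacheSketch {

    /** Minimal Guava-style loading cache: computes the value via the loader on a miss. */
    static final class SimpleLoadingCache<K, V> {
        private final Map<K, V> store = new HashMap<>();
        private int loads = 0;

        V get(K key, Callable<V> loader) throws Exception {
            V value = store.get(key);
            if (value == null) {
                value = loader.call(); // cache miss: load and remember
                store.put(key, value);
                loads++;
            }
            return value;
        }

        int loadCount() {
            return loads;
        }
    }

    public static void main(String[] args) throws Exception {
        SimpleLoadingCache<String, String> cache = new SimpleLoadingCache<>();
        cache.get("id-1", () -> "row-for-id-1"); // loads via the loader
        cache.get("id-1", () -> "row-for-id-1"); // served from cache
        System.out.println("loads: " + cache.loadCount()); // loads: 1
    }
}
```

The appeal of this shape for lookup joins is that the "loader" slot maps naturally onto the connector's lookup call, which is presumably why the Guava/Caffeine API is raised here as a reference point.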
>>>>> 
>>>>> On Thu, May 26, 2022 at 11:17 PM Jark Wu <im...@gmail.com> wrote:
>>>>> 
>>>>>> After looking at the newly introduced ReloadTime and Becket's comment,
>>>>>> I agree with Becket that we should have a pluggable reloading strategy.
>>>>>> We can provide some common implementations, e.g., periodic reloading and
>>>>>> daily reloading.
>>>>>> But there will definitely be some connector- or business-specific
>>>>>> reloading strategies, e.g. being notified by a ZooKeeper watcher, or
>>>>>> reloading once a new Hive partition is complete.
>>>>>> 
>>>>>> Best,
>>>>>> Jark
>>>>>> 
>>>>>> On Thu, 26 May 2022 at 11:52, Becket Qin <be...@gmail.com>
>>> wrote:
>>>>>> 
>>>>>>> Hi Qingsheng,
>>>>>>> 
>>>>>>> Thanks for updating the FLIP. A few comments / questions below:
>>>>>>> 
>>>>>>> 1. Is there a reason that we have both "XXXFactory" and
>>> "XXXProvider".
>>>>>>> What is the difference between them? If they are the same, can we
>>> just
>>>>>> use
>>>>>>> XXXFactory everywhere?
>>>>>>> 
>>>>>>> 2. Regarding the FullCachingLookupProvider, should the reloading policy
>>>>>>> also be pluggable? Periodic reloading can sometimes be tricky in
>>>>>>> practice. For example, if a user sets 24 hours as the cache refresh
>>>>>>> interval and some nightly batch job is delayed, the cache update may
>>>>>>> still see stale data.
>>>>>>> 
>>>>>>> 3. In DefaultLookupCacheFactory, it looks like InitialCapacity
>>> should
>>>> be
>>>>>>> removed.
>>>>>>> 
>>>>>>> 4. The purpose of LookupFunctionProvider#cacheMissingKey() seems a
>>>>>>> little confusing to me. If Optional<LookupCacheFactory>
>>>>>>> getCacheFactory() returns a non-empty factory, doesn't that already
>>>>>>> indicate that the framework should cache the missing keys? Also, why is
>>>>>>> this method returning an Optional<Boolean> instead of a boolean?
>>>>>>> 
>>>>>>> Thanks,
>>>>>>> 
>>>>>>> Jiangjie (Becket) Qin
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> On Wed, May 25, 2022 at 5:07 PM Qingsheng Ren <renqschn@gmail.com
>>> 
>>>>>> wrote:
>>>>>>> 
>>>>>>>> Hi Lincoln and Jark,
>>>>>>>> 
>>>>>>>> Thanks for the comments! If the community reaches a consensus
>> that
>>> we
>>>>>> use
>>>>>>>> SQL hint instead of table options to decide whether to use sync
>> or
>>>>>> async
>>>>>>>> mode, it’s indeed not necessary to introduce the “lookup.async”
>>>> option.
>>>>>>>> 
>>>>>>>> I think it's a good idea to let the decision about async mode be made
>>>>>>>> at the query level, which could enable better optimization with more
>>>>>>>> information gathered by the planner. Is there any FLIP describing the
>>>>>>>> issue in FLINK-27625? I thought FLIP-234 only proposes adding a SQL
>>>>>>>> hint for retry on missing keys, rather than having the entire async
>>>>>>>> mode controlled by hints.
>>>>>>>> 
>>>>>>>> Best regards,
>>>>>>>> 
>>>>>>>> Qingsheng
>>>>>>>> 
>>>>>>>>> On May 25, 2022, at 15:13, Lincoln Lee <lincoln.86xy@gmail.com
>>> 
>>>>>> wrote:
>>>>>>>>> 
>>>>>>>>> Hi Jark,
>>>>>>>>> 
>>>>>>>>> Thanks for your reply!
>>>>>>>>> 
>>>>>>>>> Currently 'lookup.async' only exists in the HBase connector. I have
>>>>>>>>> no idea whether or when to remove it (we can discuss that in another
>>>>>>>>> issue for the HBase connector after FLINK-27625 is done); let's just
>>>>>>>>> not add it as a common option now.
>>>>>>>>> 
>>>>>>>>> Best,
>>>>>>>>> Lincoln Lee
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> Jark Wu <im...@gmail.com> 于2022年5月24日周二 20:14写道:
>>>>>>>>> 
>>>>>>>>>> Hi Lincoln,
>>>>>>>>>> 
>>>>>>>>>> I have taken a look at FLIP-234, and I agree with you that the
>>>>>>>> connectors
>>>>>>>>>> can
>>>>>>>>>> provide both async and sync runtime providers simultaneously
>>>> instead
>>>>>>>> of one
>>>>>>>>>> of them.
>>>>>>>>>> At that point, "lookup.async" looks redundant. If this option
>> is
>>>>>>>> planned to
>>>>>>>>>> be removed
>>>>>>>>>> in the long term, I think it makes sense not to introduce it
>> in
>>>> this
>>>>>>>> FLIP.
>>>>>>>>>> 
>>>>>>>>>> Best,
>>>>>>>>>> Jark
>>>>>>>>>> 
>>>>>>>>>> On Tue, 24 May 2022 at 11:08, Lincoln Lee <
>>> lincoln.86xy@gmail.com
>>>>> 
>>>>>>>> wrote:
>>>>>>>>>> 
>>>>>>>>>>> Hi Qingsheng,
>>>>>>>>>>> 
>>>>>>>>>>> Sorry for jumping into the discussion so late. It's a good
>> idea
>>>>>> that
>>>>>>>> we
>>>>>>>>>> can
>>>>>>>>>>> have a common table option. I have a minor comments on
>>>>>> 'lookup.async'
>>>>>>>>>> that
>>>>>>>>>>> not make it a common option:
>>>>>>>>>>> 
>>>>>>>>>>> The table layer abstracts both sync and async lookup capabilities,
>>>>>>>>>>> and connector implementers can choose one or both. In the case of
>>>>>>>>>>> implementing only one capability (the status of most existing
>>>>>>>>>>> built-in connectors), 'lookup.async' will not be used. And when a
>>>>>>>>>>> connector has both capabilities, I think this choice is better made
>>>>>>>>>>> at the query level: for example, the table planner can choose the
>>>>>>>>>>> physical implementation of async or sync lookup based on its cost
>>>>>>>>>>> model, or users can give a query hint based on their own better
>>>>>>>>>>> understanding. If there is another common table option
>>>>>>>>>>> 'lookup.async', it may confuse users in the long run.
>>>>>>>>>>> So, I prefer to leave the 'lookup.async' option in a private place
>>>>>>>>>>> (for the current HBase connector) and not turn it into a common
>>>>>>>>>>> option.
>>>>>>>>>>> 
>>>>>>>>>>> WDYT?
>>>>>>>>>>> 
>>>>>>>>>>> Best,
>>>>>>>>>>> Lincoln Lee
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> Qingsheng Ren <re...@gmail.com> 于2022年5月23日周一 14:54写道:
>>>>>>>>>>> 
>>>>>>>>>>>> Hi Alexander,
>>>>>>>>>>>> 
>>>>>>>>>>>> Thanks for the review! We recently updated the FLIP and you can
>>>>>>>>>>>> find those changes from my latest email. Since some terminology
>>>>>>>>>>>> has changed, I'll use the new concepts when replying to your
>>>>>>>>>>>> comments.
>>>>>>>>>>>> 
>>>>>>>>>>>> 1. Builder vs ‘of’
>>>>>>>>>>>> I’m OK to use builder pattern if we have additional optional
>>>>>>>> parameters
>>>>>>>>>>>> for full caching mode (“rescan” previously). The
>>>>>> schedule-with-delay
>>>>>>>>>> idea
>>>>>>>>>>>> looks reasonable to me, but I think we need to redesign the
>>>>>> builder
>>>>>>>> API
>>>>>>>>>>> of
>>>>>>>>>>>> full caching to make it more descriptive for developers.
>> Would
>>>> you
>>>>>>>> mind
>>>>>>>>>>>> sharing your ideas about the API? For accessing the FLIP
>>>> workspace
>>>>>>>> you
>>>>>>>>>>> can
>>>>>>>>>>>> just provide your account ID and ping any PMC member
>> including
>>>>>> Jark.
>>>>>>>>>>>> 
>>>>>>>>>>>> 2. Common table options
>>>>>>>>>>>> We have some discussions these days and propose to
>> introduce 8
>>>>>> common
>>>>>>>>>>>> table options about caching. It has been updated on the
>> FLIP.
>>>>>>>>>>>> 
>>>>>>>>>>>> 3. Retries
>>>>>>>>>>>> I think we are on the same page :-)
>>>>>>>>>>>> 
>>>>>>>>>>>> For your additional concerns:
>>>>>>>>>>>> 1) The table option has been updated.
>>>>>>>>>>>> 2) We got “lookup.cache” back for configuring whether to use
>>>>>> partial
>>>>>>>> or
>>>>>>>>>>>> full caching mode.
>>>>>>>>>>>> 
>>>>>>>>>>>> Best regards,
>>>>>>>>>>>> 
>>>>>>>>>>>> Qingsheng
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>>> On May 19, 2022, at 17:25, Александр Смирнов <
>>>>>> smiralexan@gmail.com>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Also I have a few additions:
>>>>>>>>>>>>> 1) maybe rename 'lookup.cache.maximum-size' to
>>>>>>>>>>>>> 'lookup.cache.max-rows'? I think it will be clearer that we are
>>>>>>>>>>>>> talking not about bytes but about the number of rows. Plus it
>>>>>>>>>>>>> fits better, considering my optimization with filters.
>>>>>>>>>>>>> 2) How will users enable rescanning? Are we going to
>> separate
>>>>>>>> caching
>>>>>>>>>>>>> and rescanning from the options point of view? Like
>> initially
>>>> we
>>>>>> had
>>>>>>>>>>>>> one option 'lookup.cache' with values LRU / ALL. I think
>> now
>>> we
>>>>>> can
>>>>>>>>>>>>> make a boolean option 'lookup.rescan'. RescanInterval can
>> be
>>>>>>>>>>>>> 'lookup.rescan.interval', etc.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>> Alexander
>>>>>>>>>>>>> 
>>>>>>>>>>>>> чт, 19 мая 2022 г. в 14:50, Александр Смирнов <
>>>>>> smiralexan@gmail.com
>>>>>>>>>>> :
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Hi Qingsheng and Jark,
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 1. Builders vs 'of'
>>>>>>>>>>>>>> I understand that builders are used when we have multiple
>>>>>>>>>> parameters.
>>>>>>>>>>>>>> I suggested them because we could add parameters later. To
>>>>>> prevent
>>>>>>>>>>>>>> Builder for ScanRuntimeProvider from looking redundant I
>> can
>>>>>>>> suggest
>>>>>>>>>>>>>> one more config now - "rescanStartTime".
>>>>>>>>>>>>>> It's a time in UTC (LocalTime class) at which the first reload of
>>>>>>>>>>>>>> the cache starts. This parameter can be thought of as the
>>>>>>>>>>>>>> 'initialDelay' (the difference between the current time and
>>>>>>>>>>>>>> rescanStartTime) in the method
>>>>>>>>>>>>>> ScheduledExecutorService#scheduleWithFixedDelay [1]. It can be
>>>>>>>>>>>>>> very useful when the dimension table is updated by some other
>>>>>>>>>>>>>> scheduled job at a certain time, or when the user simply wants
>>>>>>>>>>>>>> the second scan (the first cache reload) to be delayed. This
>>>>>>>>>>>>>> option can be used even without 'rescanInterval' - in this case
>>>>>>>>>>>>>> 'rescanInterval' will be one day.
>>>>>>>>>>>>>> If you are fine with this option, I would be very glad if
>>> you
>>>>>> would
>>>>>>>>>>>>>> give me access to edit FLIP page, so I could add it myself
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 2. Common table options
>>>>>>>>>>>>>> I also think that FactoryUtil would be overloaded by all
>>> cache
>>>>>>>>>>>>>> options. But maybe unify all suggested options, not only
>> for
>>>>>>>> default
>>>>>>>>>>>>>> cache? I.e. class 'LookupOptions', that unifies default
>>> cache
>>>>>>>>>> options,
>>>>>>>>>>>>>> rescan options, 'async', 'maxRetries'. WDYT?
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 3. Retries
>>>>>>>>>>>>>> I'm fine with suggestion close to
>> RetryUtils#tryTimes(times,
>>>>>> call)
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> [1]
>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>> 
>>>>>> 
>>>> 
>>> 
>> https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/ScheduledExecutorService.html#scheduleWithFixedDelay-java.lang.Runnable-long-long-java.util.concurrent.TimeUnit-
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>> Alexander
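The "rescanStartTime" idea discussed above amounts to computing an initial delay from the current UTC time, then feeding it into a fixed-delay schedule. A hedged sketch of that computation (the method and class names here are illustrative, not the FLIP's API):

```java
import java.time.Duration;
import java.time.LocalTime;
import java.time.ZoneOffset;

public class RescanStartTimeSketch {

    /**
     * Computes the delay from 'now' (UTC) until the next occurrence of startTime.
     * If startTime has already passed today, schedule it for tomorrow.
     */
    static Duration initialDelay(LocalTime now, LocalTime startTime) {
        Duration delay = Duration.between(now, startTime);
        if (delay.isNegative()) {
            delay = delay.plusDays(1);
        }
        return delay;
    }

    public static void main(String[] args) {
        LocalTime now = LocalTime.now(ZoneOffset.UTC);
        System.out.println("delay until 02:00 UTC: " + initialDelay(now, LocalTime.of(2, 0)));
        // The computed delay would then feed scheduleWithFixedDelay(reloadTask,
        // initialDelayMillis, rescanIntervalMillis, TimeUnit.MILLISECONDS).
    }
}
```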
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> ср, 18 мая 2022 г. в 16:04, Qingsheng Ren <
>>> renqschn@gmail.com
>>>>> :
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Hi Jark and Alexander,
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Thanks for your comments! I’m also OK to introduce common
>>>> table
>>>>>>>>>>>> options. I prefer to introduce a new
>> DefaultLookupCacheOptions
>>>>>> class
>>>>>>>>>> for
>>>>>>>>>>>> holding these option definitions because putting all options
>>>> into
>>>>>>>>>>>> FactoryUtil would make it a bit ”crowded” and not well
>>>>>> categorized.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> FLIP has been updated according to suggestions above:
>>>>>>>>>>>>>>> 1. Use static “of” method for constructing
>>>>>> RescanRuntimeProvider
>>>>>>>>>>>> considering both arguments are required.
>>>>>>>>>>>>>>> 2. Introduce new table options matching
>>>>>> DefaultLookupCacheFactory
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>> Qingsheng
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> On Wed, May 18, 2022 at 2:57 PM Jark Wu <
>> imjark@gmail.com>
>>>>>> wrote:
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Hi Alex,
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 1) retry logic
>>>>>>>>>>>>>>>> I think we can extract some common retry logic into
>>>> utilities,
>>>>>>>>>> e.g.
>>>>>>>>>>>> RetryUtils#tryTimes(times, call).
>>>>>>>>>>>>>>>> This seems independent of this FLIP and can be reused by
>>>>>>>>>> DataStream
>>>>>>>>>>>> users.
>>>>>>>>>>>>>>>> Maybe we can open an issue to discuss this and where to
>>> put
>>>>>> it.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 2) cache ConfigOptions
>>>>>>>>>>>>>>>> I'm fine with defining cache config options in the
>>>> framework.
>>>>>>>>>>>>>>>> A candidate place to put is FactoryUtil which also
>>> includes
>>>>>>>>>>>> "sink.parallelism", "format" options.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>> Jark
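The RetryUtils#tryTimes(times, call) helper discussed here does not exist yet; a minimal sketch of what such a utility could look like (all names are assumed):

```java
import java.util.concurrent.Callable;
import java.util.concurrent.atomic.AtomicInteger;

public class RetryUtilsSketch {

    /** Runs 'call' up to 'times' attempts; rethrows the last failure if all attempts fail. */
    static <T> T tryTimes(int times, Callable<T> call) throws Exception {
        Exception last = null;
        for (int attempt = 1; attempt <= times; attempt++) {
            try {
                return call.call();
            } catch (Exception e) {
                last = e; // a real utility might also back off or re-establish a connection here
            }
        }
        if (last == null) {
            throw new IllegalArgumentException("times must be >= 1");
        }
        throw last;
    }

    public static void main(String[] args) throws Exception {
        AtomicInteger attempts = new AtomicInteger();
        String result = tryTimes(3, () -> {
            if (attempts.incrementAndGet() < 3) {
                throw new RuntimeException("transient failure");
            }
            return "ok";
        });
        System.out.println(result + " after " + attempts.get() + " attempts"); // ok after 3 attempts
    }
}
```

Keeping the retry loop in a shared helper like this, rather than in each connector's eval(), is exactly the duplication-removal argument made earlier in the thread.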
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> On Wed, 18 May 2022 at 13:52, Александр Смирнов <
>>>>>>>>>>> smiralexan@gmail.com>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Hi Qingsheng,
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Thank you for considering my comments.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> there might be custom logic before making retry, such
>> as
>>>>>>>>>>>> re-establish the connection
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Yes, I understand that. I meant that such logic can be placed
>>>>>>>>>>>>>>>>> in a separate function that can be implemented by connectors.
>>>>>>>>>>>>>>>>> Just moving the retry logic would make connectors'
>>>>>>>>>>>>>>>>> LookupFunction more concise and avoid duplicate code. However,
>>>>>>>>>>>>>>>>> it's a minor change; the decision is up to you.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> We decide not to provide common DDL options and let
>>>>>> developers
>>>>>>>>>> to
>>>>>>>>>>>> define their own options as we do now per connector.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> What is the reason for that? One of the main goals of this
>>>>>>>>>>>>>>>>> FLIP was to unify the configs, wasn't it? I understand that
>>>>>>>>>>>>>>>>> the current cache design doesn't depend on ConfigOptions, as
>>>>>>>>>>>>>>>>> it did before. But we can still put these options into the
>>>>>>>>>>>>>>>>> framework so that connectors can reuse them and avoid code
>>>>>>>>>>>>>>>>> duplication and, more significantly, avoid inconsistent
>>>>>>>>>>>>>>>>> option naming. This point can be highlighted in the
>>>>>>>>>>>>>>>>> documentation for connector developers.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>>>>> Alexander
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> вт, 17 мая 2022 г. в 17:11, Qingsheng Ren <
>>>>>> renqschn@gmail.com>:
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Hi Alexander,
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Thanks for the review and glad to see we are on the
>> same
>>>>>> page!
>>>>>>>> I
>>>>>>>>>>>> think you forgot to cc the dev mailing list so I’m also
>>> quoting
>>>>>> your
>>>>>>>>>>> reply
>>>>>>>>>>>> under this email.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> We can add 'maxRetryTimes' option into this class
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> In my opinion the retry logic should be implemented in
>>>>>>>>>>>>>>>>>> lookup() instead of in LookupFunction#eval(). Retrying is
>>>>>>>>>>>>>>>>>> only meaningful under some specific retriable failures, and
>>>>>>>>>>>>>>>>>> there might be custom logic before retrying, such as
>>>>>>>>>>>>>>>>>> re-establishing the connection (JdbcRowDataLookupFunction is
>>>>>>>>>>>>>>>>>> an example), so it's handier to leave it to the connector.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> I don't see DDL options, that were in previous
>> version
>>> of
>>>>>>>> FLIP.
>>>>>>>>>>> Do
>>>>>>>>>>>> you have any special plans for them?
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> We decide not to provide common DDL options and let
>>>>>> developers
>>>>>>>>>> to
>>>>>>>>>>>> define their own options as we do now per connector.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> The rest of comments sound great and I’ll update the
>>> FLIP.
>>>>>> Hope
>>>>>>>>>> we
>>>>>>>>>>>> can finalize our proposal soon!
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Qingsheng
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> On May 17, 2022, at 13:46, Александр Смирнов <
>>>>>>>>>>> smiralexan@gmail.com>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Hi Qingsheng and devs!
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> I like the overall design of updated FLIP, however I
>>> have
>>>>>>>>>> several
>>>>>>>>>>>>>>>>>>> suggestions and questions.
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> 1) Introducing LookupFunction as a subclass of
>>>>>>>>>>>>>>>>>>> TableFunction is a good idea. We can add a 'maxRetryTimes'
>>>>>>>>>>>>>>>>>>> option to this class. The 'eval' method of the new
>>>>>>>>>>>>>>>>>>> LookupFunction is great for this purpose. The same goes for
>>>>>>>>>>>>>>>>>>> the 'async' case.
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> 2) There might be other configs in future, such as
>>>>>>>>>>>> 'cacheMissingKey'
>>>>>>>>>>>>>>>>>>> in LookupFunctionProvider or 'rescanInterval' in
>>>>>>>>>>>> ScanRuntimeProvider.
>>>>>>>>>>>>>>>>>>> Maybe use Builder pattern in LookupFunctionProvider
>> and
>>>>>>>>>>>>>>>>>>> RescanRuntimeProvider for more flexibility (use one
>>>> 'build'
>>>>>>>>>>> method
>>>>>>>>>>>>>>>>>>> instead of many 'of' methods in future)?
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> 3) What are the plans for existing
>>> TableFunctionProvider
>>>>>> and
>>>>>>>>>>>>>>>>>>> AsyncTableFunctionProvider? I think they should be
>>>>>> deprecated.
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> 4) Am I right that the current design does not assume usage
>>>>>>>>>>>>>>>>>>> of a user-provided LookupCache in re-scanning? In this case,
>>>>>>>>>>>>>>>>>>> it is not very clear why we need methods such as
>>>>>>>>>>>>>>>>>>> 'invalidate' or 'putAll' in LookupCache.
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> 5) I don't see DDL options, that were in previous
>>> version
>>>>>> of
>>>>>>>>>>> FLIP.
>>>>>>>>>>>> Do
>>>>>>>>>>>>>>>>>>> you have any special plans for them?
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> If you don't mind, I would be glad to be able to make
>>>> small
>>>>>>>>>>>>>>>>>>> adjustments to the FLIP document too. I think it's
>>> worth
>>>>>>>>>>> mentioning
>>>>>>>>>>>>>>>>>>> about what exactly optimizations are planning in the
>>>>>> future.
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>>>>>>> Smirnov Alexander
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> пт, 13 мая 2022 г. в 20:27, Qingsheng Ren <
>>>>>> renqschn@gmail.com
>>>>>>>>>>> :
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> Hi Alexander and devs,
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> Thank you very much for the in-depth discussion! As
>>> Jark
>>>>>>>>>>>> mentioned we were inspired by Alexander's idea and made a
>>>>>> refactor on
>>>>>>>>>> our
>>>>>>>>>>>> design. FLIP-221 [1] has been updated to reflect our design
>>> now
>>>>>> and
>>>>>>>> we
>>>>>>>>>>> are
>>>>>>>>>>>> happy to hear more suggestions from you!
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> Compared to the previous design:
>>>>>>>>>>>>>>>>>>>> 1. The lookup cache serves at table runtime level
>> and
>>> is
>>>>>>>>>>>> integrated as a component of LookupJoinRunner as discussed
>>>>>>>> previously.
>>>>>>>>>>>>>>>>>>>> 2. Interfaces are renamed and re-designed to reflect
>>> the
>>>>>> new
>>>>>>>>>>>> design.
>>>>>>>>>>>>>>>>>>>> 3. We separate the all-caching case individually and
>>>>>>>>>> introduce a
>>>>>>>>>>>> new RescanRuntimeProvider to reuse the ability of scanning.
>> We
>>>> are
>>>>>>>>>>> planning
>>>>>>>>>>>> to support SourceFunction / InputFormat for now considering
>>> the
>>>>>>>>>>> complexity
>>>>>>>>>>>> of FLIP-27 Source API.
>>>>>>>>>>>>>>>>>>>> 4. A new interface LookupFunction is introduced to
>>> make
>>>>>> the
>>>>>>>>>>>> semantic of lookup more straightforward for developers.
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> For replying to Alexander:
>>>>>>>>>>>>>>>>>>>>> However I'm a little confused whether InputFormat
>> is
>>>>>>>>>> deprecated
>>>>>>>>>>>> or not. Am I right that it will be so in the future, but
>>>> currently
>>>>>>>> it's
>>>>>>>>>>> not?
>>>>>>>>>>>>>>>>>>>> Yes you are right. InputFormat is not deprecated for
>>>> now.
>>>>>> I
>>>>>>>>>>> think
>>>>>>>>>>>> it will be deprecated in the future but we don't have a
>> clear
>>>> plan
>>>>>>>> for
>>>>>>>>>>> that.
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> Thanks again for the discussion on this FLIP and
>>> looking
>>>>>>>>>> forward
>>>>>>>>>>>> to cooperating with you after we finalize the design and
>>>>>> interfaces!
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> [1]
>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>> 
>>>>>> 
>>>> 
>>> 
>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-221+Abstraction+for+lookup+source+cache+and+metric
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> Qingsheng
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> On Fri, May 13, 2022 at 12:12 AM Александр Смирнов <
>>>>>>>>>>>> smiralexan@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> Hi Jark, Qingsheng and Leonard!
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> Glad to see that we came to a consensus on almost
>> all
>>>>>>>> points!
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> However I'm a little confused whether InputFormat
>> is
>>>>>>>>>> deprecated
>>>>>>>>>>>> or
>>>>>>>>>>>>>>>>>>>>> not. Am I right that it will be so in the future,
>> but
>>>>>>>>>> currently
>>>>>>>>>>>> it's
>>>>>>>>>>>>>>>>>>>>> not? Actually I also think that for the first
>> version
>>>>>> it's
>>>>>>>> OK
>>>>>>>>>>> to
>>>>>>>>>>>> use
>>>>>>>>>>>>>>>>>>>>> InputFormat in ALL cache realization, because
>>>> supporting
>>>>>>>>>> rescan
>>>>>>>>>>>>>>>>>>>>> ability seems like a very distant prospect. But for
>>>> this
>>>>>>>>>>>> decision we
>>>>>>>>>>>>>>>>>>>>> need a consensus among all discussion participants.
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> In general, I don't have something to argue with
>> your
>>>>>>>>>>>> statements. All
>>>>>>>>>>>>>>>>>>>>> of them correspond my ideas. Looking ahead, it
>> would
>>> be
>>>>>> nice
>>>>>>>>>> to
>>>>>>>>>>>> work
>>>>>>>>>>>>>>>>>>>>> on this FLIP cooperatively. I've already done a lot
>>> of
>>>>>> work
>>>>>>>>>> on
>>>>>>>>>>>> lookup
>>>>>>>>>>>>>>>>>>>>> join caching with realization very close to the one
>>> we
>>>>>> are
>>>>>>>>>>>> discussing,
>>>>>>>>>>>>>>>>>>>>> and want to share the results of this work. Anyway,
>>>>>>>>>>>>>>>>>>>>> looking forward to the FLIP update!
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>>>>>>>>> Smirnov Alexander
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> чт, 12 мая 2022 г. в 17:38, Jark Wu <
>>> imjark@gmail.com
>>>>> :
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> Hi Alex,
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> Thanks for summarizing your points.
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> In the past week, Qingsheng, Leonard, and I have
>>>>>> discussed
>>>>>>>>>> it
>>>>>>>>>>>> several times
>>>>>>>>>>>>>>>>>>>>>> and we have totally refactored the design.
>>>>>>>>>>>>>>>>>>>>>> I'm glad to say we have reached a consensus on
>> many
>>> of
>>>>>> your
>>>>>>>>>>>> points!
>>>>>>>>>>>>>>>>>>>>>> Qingsheng is still working on updating the design
>>> docs
>>>>>> and
>>>>>>>>>>>> maybe can be
>>>>>>>>>>>>>>>>>>>>>> available in the next few days.
>>>>>>>>>>>>>>>>>>>>>> I will share some conclusions from our
>> discussions:
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> 1) we have refactored the design towards to "cache
>>> in
>>>>>>>>>>>> framework" way.
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> 2) a "LookupCache" interface for users to
>> customize
>>>> and
>>>>>> a
>>>>>>>>>>>> default
>>>>>>>>>>>>>>>>>>>>>> implementation with builder for users to easy-use.
>>>>>>>>>>>>>>>>>>>>>> This makes it possible to have both flexibility and conciseness.
>>>>>>>>>>>>>>>>>>>>>> 
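[Editorial sketch] To make point 2 concrete, below is a minimal sketch of what a framework-owned cache interface plus a builder-based default implementation could look like. All names here (LookupCache, DefaultLookupCache, maximumRows) are illustrative assumptions for this discussion, not the final FLIP-221 API, and the LRU policy is just the simplest possible default.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// NOTE: all names below are assumptions for illustration, not the final FLIP API.
interface LookupCache<K, V> {
    V getIfPresent(K key);
    void put(K key, V value);
    long size();
}

// Default implementation owned by the framework: a size-bounded LRU cache
// configured through a small builder, so most connectors never need to
// implement the interface themselves.
final class DefaultLookupCache<K, V> implements LookupCache<K, V> {
    private final Map<K, V> map;

    private DefaultLookupCache(int maxRows) {
        // access-ordered LinkedHashMap gives a simple LRU eviction policy
        this.map = new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                return size() > maxRows;
            }
        };
    }

    static <K, V> Builder<K, V> newBuilder() {
        return new Builder<>();
    }

    static final class Builder<K, V> {
        private int maxRows = 10_000;

        Builder<K, V> maximumRows(int maxRows) {
            this.maxRows = maxRows;
            return this;
        }

        DefaultLookupCache<K, V> build() {
            return new DefaultLookupCache<>(maxRows);
        }
    }

    @Override
    public V getIfPresent(K key) {
        return map.get(key);
    }

    @Override
    public void put(K key, V value) {
        map.put(key, value);
    }

    @Override
    public long size() {
        return map.size();
    }
}
```

A connector that needs a custom policy could implement the interface directly, while everyone else just uses the builder.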
>>>>>>>>>>>>>>>>>>>>>> 3) Filter pushdown is important for ALL and LRU
>>> lookup
>>>>>>>>>> cache,
>>>>>>>>>>>> esp reducing
>>>>>>>>>>>>>>>>>>>>>> IO.
>>>>>>>>>>>>>>>>>>>>>> Filter pushdown should be the final state and the
>>>>>> unified
>>>>>>>>>> way
>>>>>>>>>>>> to both
>>>>>>>>>>>>>>>>>>>>>> support pruning ALL cache and LRU cache,
>>>>>>>>>>>>>>>>>>>>>> so I think we should make effort in this
>> direction.
>>> If
>>>>>> we
>>>>>>>>>> need
>>>>>>>>>>>> to support
>>>>>>>>>>>>>>>>>>>>>> filter pushdown for ALL cache anyway, why not use
>>>>>>>>>>>>>>>>>>>>>> it for LRU cache as well? Either way, as we decide
>>> to
>>>>>>>>>>> implement
>>>>>>>>>>>> the cache
>>>>>>>>>>>>>>>>>>>>>> in the framework, we have the chance to support
>>>>>>>>>>>>>>>>>>>>>> filter on cache anytime. This is an optimization
>> and
>>>> it
>>>>>>>>>>> doesn't
>>>>>>>>>>>> affect the
>>>>>>>>>>>>>>>>>>>>>> public API. I think we can create a JIRA issue to
>>>>>>>>>>>>>>>>>>>>>> discuss it when the FLIP is accepted.
>>>>>>>>>>>>>>>>>>>>>> 
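[Editorial sketch] As a rough illustration of why a pushed-down filter shrinks an ALL cache: rows that can never satisfy the join's filter are dropped while the snapshot is loaded, so they occupy no cache memory at all. The helper below is a hypothetical stand-in, not Flink API.

```java
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Hypothetical helper (not Flink API): apply a pushed-down filter while a
// full snapshot of the dimension table is loaded, so non-matching rows are
// never stored in the ALL cache.
final class FilteredAllCacheLoader {
    static <K, R> Map<K, List<R>> load(
            Map<K, List<R>> scannedRows, Predicate<R> pushedDownFilter) {
        return scannedRows.entrySet().stream()
                .collect(Collectors.toMap(
                        Map.Entry::getKey,
                        e -> e.getValue().stream()
                                .filter(pushedDownFilter)
                                .collect(Collectors.toList())));
    }
}
```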
>>>>>>>>>>>>>>>>>>>>>> 4) The idea to support ALL cache is similar to
>> your
>>>>>>>>>> proposal.
>>>>>>>>>>>>>>>>>>>>>> In the first version, we will only support
>>>> InputFormat,
>>>>>>>>>>>> SourceFunction for
>>>>>>>>>>>>>>>>>>>>>> cache all (invoke InputFormat in join operator).
>>>>>>>>>>>>>>>>>>>>>> For FLIP-27 source, we need to join a true source
>>>>>> operator
>>>>>>>>>>>> instead of
>>>>>>>>>>>>>>>>>>>>>> calling it embedded in the join operator.
>>>>>>>>>>>>>>>>>>>>>> However, this needs another FLIP to support the
>>>> re-scan
>>>>>>>>>>> ability
>>>>>>>>>>>> for FLIP-27
>>>>>>>>>>>>>>>>>>>>>> Source, and this can be a large work.
>>>>>>>>>>>>>>>>>>>>>> In order to not block this issue, we can put the
>>>> effort
>>>>>> of
>>>>>>>>>>>> FLIP-27 source
>>>>>>>>>>>>>>>>>>>>>> integration into future work and integrate
>>>>>>>>>>>>>>>>>>>>>> InputFormat&SourceFunction for now.
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> I think it's fine to use
>> InputFormat&SourceFunction,
>>>> as
>>>>>>>> they
>>>>>>>>>>>> are not
>>>>>>>>>>>>>>>>>>>>>> deprecated, otherwise, we have to introduce
>> another
>>>>>>>> function
>>>>>>>>>>>>>>>>>>>>>> similar to them which is meaningless. We need to
>>> plan
>>>>>>>>>> FLIP-27
>>>>>>>>>>>> source
>>>>>>>>>>>>>>>>>>>>>> integration ASAP before InputFormat &
>> SourceFunction
>>>> are
>>>>>>>>>>>> deprecated.
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>>>>>>>> Jark
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> On Thu, 12 May 2022 at 15:46, Александр Смирнов <
>>>>>>>>>>>> smiralexan@gmail.com>
>>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> Hi Martijn!
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> Got it. Therefore, the realization with
>> InputFormat
>>>> is
>>>>>> not
>>>>>>>>>>>> considered.
>>>>>>>>>>>>>>>>>>>>>>> Thanks for clearing that up!
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>>>>>>>>>>> Smirnov Alexander
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> чт, 12 мая 2022 г. в 14:23, Martijn Visser <
>>>>>>>>>>>> martijn@ververica.com>:
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> With regards to:
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> But if there are plans to refactor all
>> connectors
>>>> to
>>>>>>>>>>> FLIP-27
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> Yes, FLIP-27 is the target for all connectors.
>> The
>>>> old
>>>>>>>>>>>> interfaces will be
>>>>>>>>>>>>>>>>>>>>>>>> deprecated and connectors will either be
>>> refactored
>>>> to
>>>>>>>> use
>>>>>>>>>>>> the new ones
>>>>>>>>>>>>>>>>>>>>>>> or
>>>>>>>>>>>>>>>>>>>>>>>> dropped.
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> The caching should work for connectors that are
>>>> using
>>>>>>>>>>> FLIP-27
>>>>>>>>>>>> interfaces,
>>>>>>>>>>>>>>>>>>>>>>>> we should not introduce new features for old
>>>>>> interfaces.
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> Martijn
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> On Thu, 12 May 2022 at 06:19, Александр Смирнов
>> <
>>>>>>>>>>>> smiralexan@gmail.com>
>>>>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> Hi Jark!
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> Sorry for the late response. I would like to
>> make
>>>>>> some
>>>>>>>>>>>> comments and
>>>>>>>>>>>>>>>>>>>>>>>>> clarify my points.
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> 1) I agree with your first statement. I think
>> we
>>>> can
>>>>>>>>>>> achieve
>>>>>>>>>>>> both
>>>>>>>>>>>>>>>>>>>>>>>>> advantages this way: put the Cache interface in
>>>>>>>>>>>> flink-table-common,
>>>>>>>>>>>>>>>>>>>>>>>>> but have implementations of it in
>>>>>> flink-table-runtime.
>>>>>>>>>>>> Therefore if a
>>>>>>>>>>>>>>>>>>>>>>>>> connector developer wants to use existing cache
>>>>>>>>>> strategies
>>>>>>>>>>>> and their
>>>>>>>>>>>>>>>>>>>>>>>>> implementations, he can just pass lookupConfig
>> to
>>>> the
>>>>>>>>>>>> planner, but if
>>>>>>>>>>>>>>>>>>>>>>>>> he wants to have its own cache implementation
>> in
>>>> his
>>>>>>>>>>>> TableFunction, it
>>>>>>>>>>>>>>>>>>>>>>>>> will be possible for him to use the existing
>>>>>> interface
>>>>>>>>>> for
>>>>>>>>>>>> this
>>>>>>>>>>>>>>>>>>>>>>>>> purpose (we can explicitly point this out in
>> the
>>>>>>>>>>>> documentation). In
>>>>>>>>>>>>>>>>>>>>>>>>> this way all configs and metrics will be
>> unified.
>>>>>> WDYT?
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> If a filter can prune 90% of data in the
>> cache,
>>> we
>>>>>> will
>>>>>>>>>>>> have 90% of
>>>>>>>>>>>>>>>>>>>>>>>>> lookup requests that can never be cached
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> 2) Let me clarify the logic filters
>> optimization
>>> in
>>>>>> case
>>>>>>>>>> of
>>>>>>>>>>>> LRU cache.
>>>>>>>>>>>>>>>>>>>>>>>>> It looks like Cache<RowData,
>>> Collection<RowData>>.
>>>>>> Here
>>>>>>>>>> we
>>>>>>>>>>>> always
>>>>>>>>>>>>>>>>>>>>>>>>> store the response of the dimension table in
>>> cache,
>>>>>> even
>>>>>>>>>>>> after
>>>>>>>>>>>>>>>>>>>>>>>>> applying calc function. I.e. if there are no
>> rows
>>>>>> after
>>>>>>>>>>>> applying
>>>>>>>>>>>>>>>>>>>>>>>>> filters to the result of the 'eval' method of
>>>>>>>>>>> TableFunction,
>>>>>>>>>>>> we store
>>>>>>>>>>>>>>>>>>>>>>>>> the empty list under the lookup keys. Therefore the
>>> cache
>>>>>> line
>>>>>>>>>>> will
>>>>>>>>>>>> be
>>>>>>>>>>>>>>>>>>>>>>>>> filled, but will require much less memory (in
>>>> bytes).
>>>>>>>>>> I.e.
>>>>>>>>>>>> we don't
>>>>>>>>>>>>>>>>>>>>>>>>> completely drop the keys whose result was pruned, but significantly
>>>>>>>>>>>>>>>>>>>>>>>>> reduce required memory to store this result. If
>>> the
>>>>>> user
>>>>>>>>>>>> knows about
>>>>>>>>>>>>>>>>>>>>>>>>> this behavior, he can increase the 'max-rows'
>>>> option
>>>>>>>>>> before
>>>>>>>>>>>> the start
>>>>>>>>>>>>>>>>>>>>>>>>> of the job. But actually I came up with the
>> idea
>>>>>> that we
>>>>>>>>>>> can
>>>>>>>>>>>> do this
>>>>>>>>>>>>>>>>>>>>>>>>> automatically by using the 'maximumWeight' and
>>>>>> 'weigher'
>>>>>>>>>>>> methods of
>>>>>>>>>>>>>>>>>>>>>>>>> GuavaCache [1]. Weight can be the size of the
>>>>>> collection
>>>>>>>>>> of
>>>>>>>>>>>> rows
>>>>>>>>>>>>>>>>>>>>>>>>> (value of cache). Therefore cache can
>>> automatically
>>>>>> fit
>>>>>>>>>>> much
>>>>>>>>>>>> more
>>>>>>>>>>>>>>>>>>>>>>>>> records than before.
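[Editorial sketch] The maximumWeight/weigher idea from [1] can be sketched in self-contained plain Java (the class and method names below are hypothetical, not Guava or Flink API): bound the cache by the total number of cached rows, the "weight", instead of the number of lookup keys, so the empty results left behind by filters cost almost nothing.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the weigher idea: evict by total row count rather
// than by entry count, so keys whose rows were all pruned stay cheap.
final class RowWeightedCache<K, V> {
    private final long maxTotalRows;
    private long totalRows = 0;
    // access-ordered map: iteration starts at the least recently used entry
    private final LinkedHashMap<K, List<V>> map =
            new LinkedHashMap<>(16, 0.75f, true);

    RowWeightedCache(long maxTotalRows) {
        this.maxTotalRows = maxTotalRows;
    }

    List<V> get(K key) {
        return map.get(key);
    }

    void put(K key, List<V> rows) {
        List<V> old = map.put(key, rows);
        if (old != null) {
            totalRows -= weigh(old);
        }
        totalRows += weigh(rows);
        evictIfNeeded();
    }

    int entryCount() {
        return map.size();
    }

    // Weight of an entry = number of rows it holds; an empty result (all rows
    // pruned by filters) still weighs 1 so the key itself is accounted for.
    private long weigh(List<V> rows) {
        return Math.max(1, rows.size());
    }

    private void evictIfNeeded() {
        Iterator<Map.Entry<K, List<V>>> it = map.entrySet().iterator();
        while (totalRows > maxTotalRows && it.hasNext()) {
            totalRows -= weigh(it.next().getValue());
            it.remove();
        }
    }
}
```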
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> Flink SQL has provided a standard way to do
>>>> filters
>>>>>> and
>>>>>>>>>>>> projects
>>>>>>>>>>>>>>>>>>>>>>>>> pushdown, i.e., SupportsFilterPushDown and
>>>>>>>>>>>> SupportsProjectionPushDown.
>>>>>>>>>>>>>>>>>>>>>>>>>> Jdbc/hive/HBase haven't implemented the
>>>> interfaces,
>>>>>>>>>> don't
>>>>>>>>>>>> mean it's
>>>>>>>>>>>>>>>>>>>>>>> hard
>>>>>>>>>>>>>>>>>>>>>>>>> to implement.
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> It's debatable how difficult it will be to
>>>> implement
>>>>>>>>>> filter
>>>>>>>>>>>> pushdown.
>>>>>>>>>>>>>>>>>>>>>>>>> But I think the fact that currently there is no
>>>>>> database
>>>>>>>>>>>> connector
>>>>>>>>>>>>>>>>>>>>>>>>> with filter pushdown at least means that this
>>>> feature
>>>>>>>>>> won't
>>>>>>>>>>>> be
>>>>>>>>>>>>>>>>>>>>>>>>> supported soon in connectors. Moreover, if we
>>> talk
>>>>>> about
>>>>>>>>>>>> other
>>>>>>>>>>>>>>>>>>>>>>>>> connectors (not in Flink repo), their databases
>>>> might
>>>>>>>> not
>>>>>>>>>>>> support all
>>>>>>>>>>>>>>>>>>>>>>>>> Flink filters (or not support filters at all).
>> I
>>>>>> think
>>>>>>>>>>> users
>>>>>>>>>>>> are
>>>>>>>>>>>>>>>>>>>>>>>>> interested in supporting cache filters
>>> optimization
>>>>>>>>>>>> independently of
>>>>>>>>>>>>>>>>>>>>>>>>> supporting other features and solving more
>>> complex
>>>>>>>>>> problems
>>>>>>>>>>>> (or
>>>>>>>>>>>>>>>>>>>>>>>>> unsolvable at all).
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> 3) I agree with your third statement. Actually
>> in
>>>> our
>>>>>>>>>>>> internal version
>>>>>>>>>>>>>>>>>>>>>>>>> I also tried to unify the logic of scanning and
>>>>>>>> reloading
>>>>>>>>>>>> data from
>>>>>>>>>>>>>>>>>>>>>>>>> connectors. But unfortunately, I didn't find a
>>> way
>>>> to
>>>>>>>>>> unify
>>>>>>>>>>>> the logic
>>>>>>>>>>>>>>>>>>>>>>>>> of all ScanRuntimeProviders (InputFormat,
>>>>>>>> SourceFunction,
>>>>>>>>>>>> Source,...)
>>>>>>>>>>>>>>>>>>>>>>>>> and reuse it in reloading ALL cache. As a
>> result
>>> I
>>>>>>>>>> settled
>>>>>>>>>>>> on using
>>>>>>>>>>>>>>>>>>>>>>>>> InputFormat, because it was used for scanning
>> in
>>>> all
>>>>>>>>>> lookup
>>>>>>>>>>>>>>>>>>>>>>>>> connectors. (I didn't know that there are plans
>>> to
>>>>>>>>>>> deprecate
>>>>>>>>>>>>>>>>>>>>>>>>> InputFormat in favor of FLIP-27 Source). IMO
>>> usage
>>>> of
>>>>>>>>>>>> FLIP-27 source
>>>>>>>>>>>>>>>>>>>>>>>>> in ALL caching is not good idea, because this
>>>> source
>>>>>> was
>>>>>>>>>>>> designed to
>>>>>>>>>>>>>>>>>>>>>>>>> work in distributed environment
>> (SplitEnumerator
>>> on
>>>>>>>>>>>> JobManager and
>>>>>>>>>>>>>>>>>>>>>>>>> SourceReaders on TaskManagers), not in one
>>> operator
>>>>>>>>>> (lookup
>>>>>>>>>>>> join
>>>>>>>>>>>>>>>>>>>>>>>>> operator in our case). There is even no direct
>>> way
>>>> to
>>>>>>>>>> pass
>>>>>>>>>>>> splits from
>>>>>>>>>>>>>>>>>>>>>>>>> SplitEnumerator to SourceReader (this logic
>> works
>>>>>>>> through
>>>>>>>>>>>>>>>>>>>>>>>>> SplitEnumeratorContext, which requires
>>>>>>>>>>>>>>>>>>>>>>>>> OperatorCoordinator.SubtaskGateway to send
>>>>>>>>>> AddSplitEvents).
>>>>>>>>>>>> Usage of
>>>>>>>>>>>>>>>>>>>>>>>>> InputFormat for ALL cache seems much more
>> clearer
>>>> and
>>>>>>>>>>>> easier. But if
>>>>>>>>>>>>>>>>>>>>>>>>> there are plans to refactor all connectors to
>>>>>> FLIP-27, I
>>>>>>>>>>>> have the
>>>>>>>>>>>>>>>>>>>>>>>>> following ideas: maybe we can drop the lookup join ALL cache in
>>>>>>>>>>>>>>>>>>>>>>>>> favor of simple join with multiple scanning of
>>>> batch
>>>>>>>>>>> source?
>>>>>>>>>>>> The point
>>>>>>>>>>>>>>>>>>>>>>>>> is that the only difference between lookup join
>>> ALL
>>>>>>>> cache
>>>>>>>>>>>> and simple
>>>>>>>>>>>>>>>>>>>>>>>>> join with batch source is that in the first
>> case
>>>>>>>> scanning
>>>>>>>>>>> is
>>>>>>>>>>>> performed
>>>>>>>>>>>>>>>>>>>>>>>>> multiple times, in between which state (cache)
>> is
>>>>>>>> cleared
>>>>>>>>>>>> (correct me
>>>>>>>>>>>>>>>>>>>>>>>>> if I'm wrong). So what if we extend the
>>>>>> functionality of
>>>>>>>>>>>> simple join
>>>>>>>>>>>>>>>>>>>>>>>>> to support state reloading + extend the
>>>>>> functionality of
>>>>>>>>>>>> scanning
>>>>>>>>>>>>>>>>>>>>>>>>> batch source multiple times (this one should be
>>>> easy
>>>>>>>> with
>>>>>>>>>>>> new FLIP-27
>>>>>>>>>>>>>>>>>>>>>>>>> source, that unifies streaming/batch reading -
>> we
>>>>>> will
>>>>>>>>>> need
>>>>>>>>>>>> to change
>>>>>>>>>>>>>>>>>>>>>>>>> only SplitEnumerator, which will pass splits
>>> again
>>>>>> after
>>>>>>>>>>>> some TTL).
>>>>>>>>>>>>>>>>>>>>>>>>> WDYT? I must say that this looks like a
>> long-term
>>>>>> goal
>>>>>>>>>> and
>>>>>>>>>>>> will make
>>>>>>>>>>>>>>>>>>>>>>>>> the scope of this FLIP even larger than you
>> said.
>>>>>> Maybe
>>>>>>>>>> we
>>>>>>>>>>>> can limit
>>>>>>>>>>>>>>>>>>>>>>>>> ourselves to a simpler solution now
>>> (InputFormats).
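[Editorial sketch] The "reload the whole snapshot periodically" behavior discussed above can be sketched as follows. The Supplier stands in for whatever scan mechanism (InputFormat, or later a rescannable FLIP-27 source) actually produces the snapshot; all names are assumptions for illustration, not Flink API.

```java
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical sketch: an ALL cache that rescans the dimension table via a
// Supplier and swaps in the fresh snapshot once the TTL expires. Between
// reloads every lookup is served from the in-memory snapshot.
final class AllCache<K, V> {
    private final Supplier<Map<K, V>> scanner;
    private final long ttlMillis;
    private volatile Map<K, V> snapshot = Map.of();
    private long lastReload = -1;

    AllCache(Supplier<Map<K, V>> scanner, long ttlMillis) {
        this.scanner = scanner;
        this.ttlMillis = ttlMillis;
    }

    V lookup(K key) {
        long now = System.currentTimeMillis();
        if (lastReload < 0 || now - lastReload >= ttlMillis) {
            snapshot = scanner.get(); // full rescan; old snapshot is dropped
            lastReload = now;
        }
        return snapshot.get(key);
    }
}
```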
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> So to sum up, my points is like this:
>>>>>>>>>>>>>>>>>>>>>>>>> 1) There is a way to make both concise and
>>> flexible
>>>>>>>>>>>> interfaces for
>>>>>>>>>>>>>>>>>>>>>>>>> caching in lookup join.
>>>>>>>>>>>>>>>>>>>>>>>>> 2) Cache filters optimization is important both
>>> in
>>>>>> LRU
>>>>>>>>>> and
>>>>>>>>>>>> ALL caches.
>>>>>>>>>>>>>>>>>>>>>>>>> 3) It is unclear when filter pushdown will be
>>>>>> supported
>>>>>>>>>> in
>>>>>>>>>>>> Flink
>>>>>>>>>>>>>>>>>>>>>>>>> connectors, some of the connectors might not
>> have
>>>> the
>>>>>>>>>>>> opportunity to
>>>>>>>>>>>>>>>>>>>>>>>>> support filter pushdown + as I know, currently
>>>> filter
>>>>>>>>>>>> pushdown works
>>>>>>>>>>>>>>>>>>>>>>>>> only for scanning (not lookup). So cache
>> filters
>>> +
>>>>>>>>>>>> projections
>>>>>>>>>>>>>>>>>>>>>>>>> optimization should be independent from other
>>>>>> features.
>>>>>>>>>>>>>>>>>>>>>>>>> 4) ALL cache realization is a complex topic
>> that
>>>>>>>> involves
>>>>>>>>>>>> multiple
>>>>>>>>>>>>>>>>>>>>>>>>> aspects of how Flink is developing. Moving away from
>> from
>>>>>>>>>>>> InputFormat in favor
>>>>>>>>>>>>>>>>>>>>>>>>> of FLIP-27 Source will make ALL cache
>> realization
>>>>>> really
>>>>>>>>>>>> complex and
>>>>>>>>>>>>>>>>>>>>>>>>> not clear, so maybe instead of that we can
>> extend
>>>> the
>>>>>>>>>>>> functionality of
>>>>>>>>>>>>>>>>>>>>>>>>> simple join, or keep using InputFormat in
>>> case
>>>> of
>>>>>>>>>>> lookup
>>>>>>>>>>>> join ALL
>>>>>>>>>>>>>>>>>>>>>>>>> cache?
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>>>>>>>>>>>>> Smirnov Alexander
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> [1]
>>>>>>>>>>>>>>>>>>>>>>>>> https://guava.dev/releases/18.0/api/docs/com/google/common/cache/CacheBuilder.html#weigher(com.google.common.cache.Weigher)
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> чт, 5 мая 2022 г. в 20:34, Jark Wu <
>>>> imjark@gmail.com
>>>>>>> :
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> It's great to see the active discussion! I
>> want
>>> to
>>>>>>>> share
>>>>>>>>>>> my
>>>>>>>>>>>> ideas:
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> 1) implement the cache in framework vs.
>>> connectors
>>>>>> base
>>>>>>>>>>>>>>>>>>>>>>>>>> I don't have a strong opinion on this. Both
>> ways
>>>>>> should
>>>>>>>>>>>> work (e.g.,
>>>>>>>>>>>>>>>>>>>>>>> cache
>>>>>>>>>>>>>>>>>>>>>>>>>> pruning, compatibility).
>>>>>>>>>>>>>>>>>>>>>>>>>> The framework way can provide more concise
>>>>>> interfaces.
>>>>>>>>>>>>>>>>>>>>>>>>>> The connector base way can define more
>> flexible
>>>>>> cache
>>>>>>>>>>>>>>>>>>>>>>>>>> strategies/implementations.
>>>>>>>>>>>>>>>>>>>>>>>>>> We are still investigating a way to see if we
>>> can
>>>>>> have
>>>>>>>>>>> both
>>>>>>>>>>>>>>>>>>>>>>> advantages.
>>>>>>>>>>>>>>>>>>>>>>>>>> We should reach a consensus that the way
>> should
>>>> be a
>>>>>>>>>> final
>>>>>>>>>>>> state,
>>>>>>>>>>>>>>>>>>>>>>> and we
>>>>>>>>>>>>>>>>>>>>>>>>>> are on the path to it.
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> 2) filters and projections pushdown:
>>>>>>>>>>>>>>>>>>>>>>>>>> I agree with Alex that the filter pushdown
>> into
>>>>>> cache
>>>>>>>>>> can
>>>>>>>>>>>> benefit a
>>>>>>>>>>>>>>>>>>>>>>> lot
>>>>>>>>>>>>>>>>>>>>>>>>> for
>>>>>>>>>>>>>>>>>>>>>>>>>> ALL cache.
>>>>>>>>>>>>>>>>>>>>>>>>>> However, this is not true for LRU cache.
>>>> Connectors
>>>>>> use
>>>>>>>>>>>> cache to
>>>>>>>>>>>>>>>>>>>>>>> reduce
>>>>>>>>>>>>>>>>>>>>>>>>> IO
>>>>>>>>>>>>>>>>>>>>>>>>>> requests to databases for better throughput.
>>>>>>>>>>>>>>>>>>>>>>>>>> If a filter can prune 90% of data in the
>> cache,
>>> we
>>>>>> will
>>>>>>>>>>>> have 90% of
>>>>>>>>>>>>>>>>>>>>>>>>> lookup
>>>>>>>>>>>>>>>>>>>>>>>>>> requests that can never be cached
>>>>>>>>>>>>>>>>>>>>>>>>>> and hit directly to the databases. That means
>>> the
>>>>>> cache
>>>>>>>>>> is
>>>>>>>>>>>>>>>>>>>>>>> meaningless in
>>>>>>>>>>>>>>>>>>>>>>>>>> this case.
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> IMO, Flink SQL has provided a standard way to
>> do
>>>>>>>> filters
>>>>>>>>>>>> and projects
>>>>>>>>>>>>>>>>>>>>>>>>>> pushdown, i.e., SupportsFilterPushDown and
>>>>>>>>>>>>>>>>>>>>>>> SupportsProjectionPushDown.
>>>>>>>>>>>>>>>>>>>>>>>>>> Jdbc/hive/HBase haven't implemented the
>>>> interfaces,
>>>>>>>>>> don't
>>>>>>>>>>>> mean it's
>>>>>>>>>>>>>>>>>>>>>>> hard
>>>>>>>>>>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>>>>>>>>> implement.
>>>>>>>>>>>>>>>>>>>>>>>>>> They should implement the pushdown interfaces
>> to
>>>>>> reduce
>>>>>>>>>> IO
>>>>>>>>>>>> and the
>>>>>>>>>>>>>>>>>>>>>>> cache
>>>>>>>>>>>>>>>>>>>>>>>>>> size.
>>>>>>>>>>>>>>>>>>>>>>>>>> That should be a final state that the scan
>>> source
>>>>>> and
>>>>>>>>>>>> lookup source
>>>>>>>>>>>>>>>>>>>>>>> share
>>>>>>>>>>>>>>>>>>>>>>>>>> the exact pushdown implementation.
>>>>>>>>>>>>>>>>>>>>>>>>>> I don't see why we need to duplicate the
>>> pushdown
>>>>>> logic
>>>>>>>>>> in
>>>>>>>>>>>> caches,
>>>>>>>>>>>>>>>>>>>>>>> which
>>>>>>>>>>>>>>>>>>>>>>>>>> will complex the lookup join design.
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> 3) ALL cache abstraction
>>>>>>>>>>>>>>>>>>>>>>>>>> All cache might be the most challenging part
>> of
>>>> this
>>>>>>>>>> FLIP.
>>>>>>>>>>>> We have
>>>>>>>>>>>>>>>>>>>>>>> never
>>>>>>>>>>>>>>>>>>>>>>>>>> provided a reload-lookup public interface.
>>>>>>>>>>>>>>>>>>>>>>>>>> Currently, we put the reload logic in the
>> "eval"
>>>>>> method
>>>>>>>>>> of
>>>>>>>>>>>>>>>>>>>>>>> TableFunction.
>>>>>>>>>>>>>>>>>>>>>>>>>> That's hard for some sources (e.g., Hive).
>>>>>>>>>>>>>>>>>>>>>>>>>> Ideally, connector implementation should share
>>> the
>>>>>>>> logic
>>>>>>>>>>> of
>>>>>>>>>>>> reload
>>>>>>>>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>>>>>>>>>> scan, i.e. ScanTableSource with
>>>>>>>>>>>> InputFormat/SourceFunction/FLIP-27
>>>>>>>>>>>>>>>>>>>>>>>>> Source.
>>>>>>>>>>>>>>>>>>>>>>>>>> However, InputFormat/SourceFunction are
>>>> deprecated,
>>>>>> and
>>>>>>>>>>> the
>>>>>>>>>>>> FLIP-27
>>>>>>>>>>>>>>>>>>>>>>>>> source
>>>>>>>>>>>>>>>>>>>>>>>>>> is deeply coupled with SourceOperator.
>>>>>>>>>>>>>>>>>>>>>>>>>> If we want to invoke the FLIP-27 source in
>>>>>> LookupJoin,
>>>>>>>>>>> this
>>>>>>>>>>>> may make
>>>>>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>>>>> scope of this FLIP much larger.
>>>>>>>>>>>>>>>>>>>>>>>>>> We are still investigating how to abstract the
>>> ALL
>>>>>>>> cache
>>>>>>>>>>>> logic and
>>>>>>>>>>>>>>>>>>>>>>> reuse
>>>>>>>>>>>>>>>>>>>>>>>>>> the existing source interfaces.
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>>>>>>>>>>>> Jark
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> On Thu, 5 May 2022 at 20:22, Roman Boyko <
>>>>>>>>>>>> ro.v.boyko@gmail.com>
>>>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>> It's a much more complicated activity and
>> lies
>>>> out
>>>>>> of
>>>>>>>>>> the
>>>>>>>>>>>> scope of
>>>>>>>>>>>>>>>>>>>>>>> this
>>>>>>>>>>>>>>>>>>>>>>>>>>> improvement. Because such pushdowns should be
>>>> done
>>>>>> for
>>>>>>>>>>> all
>>>>>>>>>>>>>>>>>>>>>>>>> ScanTableSource
>>>>>>>>>>>>>>>>>>>>>>>>>>> implementations (not only for Lookup ones).
>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>> On Thu, 5 May 2022 at 19:02, Martijn Visser <
>>>>>>>>>>>>>>>>>>>>>>> martijnvisser@apache.org>
>>>>>>>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi everyone,
>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> One question regarding "And Alexander
>>> correctly
>>>>>>>>>>> mentioned
>>>>>>>>>>>> that
>>>>>>>>>>>>>>>>>>>>>>> filter
>>>>>>>>>>>>>>>>>>>>>>>>>>>> pushdown still is not implemented for
>>>>>>>>>> jdbc/hive/hbase."
>>>>>>>>>>>> -> Would
>>>>>>>>>>>>>>>>>>>>>>> an
>>>>>>>>>>>>>>>>>>>>>>>>>>>> alternative solution be to actually
>> implement
>>>>>> these
>>>>>>>>>>> filter
>>>>>>>>>>>>>>>>>>>>>>> pushdowns?
>>>>>>>>>>>>>>>>>>>>>>>>> I
>>>>>>>>>>>>>>>>>>>>>>>>>>>> can
>>>>>>>>>>>>>>>>>>>>>>>>>>>> imagine that there are many more benefits to
>>>> doing
>>>>>>>>>> that,
>>>>>>>>>>>> outside
>>>>>>>>>>>>>>>>>>>>>>> of
>>>>>>>>>>>>>>>>>>>>>>>>> lookup
>>>>>>>>>>>>>>>>>>>>>>>>>>>> caching and metrics.
>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Martijn Visser
>>>>>>>>>>>>>>>>>>>>>>>>>>>> https://twitter.com/MartijnVisser82
>>>>>>>>>>>>>>>>>>>>>>>>>>>> https://github.com/MartijnVisser
>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Thu, 5 May 2022 at 13:58, Roman Boyko <
>>>>>>>>>>>> ro.v.boyko@gmail.com>
>>>>>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi everyone!
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for driving such a valuable
>>> improvement!
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I do think that single cache implementation
>>>>>> would be
>>>>>>>>>> a
>>>>>>>>>>>> nice
>>>>>>>>>>>>>>>>>>>>>>>>> opportunity
>>>>>>>>>>>>>>>>>>>>>>>>>>>> for
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> users. And it will break the "FOR
>> SYSTEM_TIME
>>>> AS
>>>>>> OF
>>>>>>>>>>>> proc_time"
>>>>>>>>>>>>>>>>>>>>>>>>> semantics
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> anyway - doesn't matter how it will be
>>>>>> implemented.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Putting myself in the user's shoes, I can
>> say
>>>>>> that:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 1) I would prefer to have the opportunity
>> to
>>>> cut
>>>>>> off
>>>>>>>>>>> the
>>>>>>>>>>>> cache
>>>>>>>>>>>>>>>>>>>>>>> size
>>>>>>>>>>>>>>>>>>>>>>>>> by
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> simply filtering unnecessary data. And the
>>> most
>>>>>>>> handy
>>>>>>>>>>>> way to do
>>>>>>>>>>>>>>>>>>>>>>> it
>>>>>>>>>>>>>>>>>>>>>>>>> is
>>>>>>>>>>>>>>>>>>>>>>>>>>>> apply
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> it inside LookupRunners. It would be a bit
>>>>>> harder to
>>>>>>>>>>>> pass it
>>>>>>>>>>>>>>>>>>>>>>>>> through the
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> LookupJoin node to TableFunction. And
>>> Alexander
>>>>>>>>>>> correctly
>>>>>>>>>>>>>>>>>>>>>>> mentioned
>>>>>>>>>>>>>>>>>>>>>>>>> that
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> filter pushdown still is not implemented
>> for
>>>>>>>>>>>> jdbc/hive/hbase.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 2) The ability to set the different caching
>>>>>>>>>> parameters
>>>>>>>>>>>> for
>>>>>>>>>>>>>>>>>>>>>>> different
>>>>>>>>>>>>>>>>>>>>>>>>>>>> tables
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> is quite important. So I would prefer to
>> set
>>> it
>>>>>>>>>> through
>>>>>>>>>>>> DDL
>>>>>>>>>>>>>>>>>>>>>>> rather
>>>>>>>>>>>>>>>>>>>>>>>>> than
>>>>>>>>>>>>>>>>>>>>>>>>> have similar ttl, strategy and other options
>> options
>>>> for
>>>>>>>> all
>>>>>>>>>>>> lookup
>>>>>>>>>>>>>>>>>>>>>>> tables.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 3) Providing the cache into the framework
>>>> really
>>>>>>>>>>>> deprives us of
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> extensibility (users won't be able to
>>> implement
>>>>>>>> their
>>>>>>>>>>> own
>>>>>>>>>>>>>>>>>>>>>>> cache).
>>>>>>>>>>>>>>>>>>>>>>>>> But
>>>>>>>>>>>>>>>>>>>>>>>>>>>> most
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> probably it might be solved by creating
>> more
>>>>>>>>>> different
>>>>>>>>>>>> cache
>>>>>>>>>>>>>>>>>>>>>>>>> strategies
>>>>>>>>>>>>>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> a wider set of configurations.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> All these points are much closer to the
>>> schema
>>>>>>>>>> proposed
>>>>>>>>>>>> by
>>>>>>>>>>>>>>>>>>>>>>>>> Alexander.
>>>>>>>>>>>>>>>>>>>>>>>>> Qingsheng Ren, please correct me if I'm not
>>>> right
>>>>>> and
>>>>>>>>>>> all
>>>>>>>>>>>> these
>>>>>>>>>>>>>>>>>>>>>>>>>>>> facilities
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> might be simply implemented in your
>>>> architecture?
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Roman Boyko
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> e.: ro.v.boyko@gmail.com
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Wed, 4 May 2022 at 21:01, Martijn
>> Visser <
>>>>>>>>>>>>>>>>>>>>>>>>> martijnvisser@apache.org>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi everyone,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I don't have much to chip in, but just
>>> wanted
>>>> to
>>>>>>>>>>>> express that
>>>>>>>>>>>>>>>>>>>>>>> I
>>>>>>>>>>>>>>>>>>>>>>>>> really
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> appreciate the in-depth discussion on this
>>>> topic
>>>>>>>>>> and I
>>>>>>>>>>>> hope
>>>>>>>>>>>>>>>>>>>>>>> that
>>>>>>>>>>>>>>>>>>>>>>>>>>>> others
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> will join the conversation.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Martijn
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Tue, 3 May 2022 at 10:15, Александр
>>>> Смирнов <
>>>>>>>>>>>>>>>>>>>>>>>>> smiralexan@gmail.com>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi Qingsheng, Leonard and Jark,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for your detailed feedback!
>>> However, I
>>>>>> have
>>>>>>>>>>>> questions
>>>>>>>>>>>>>>>>>>>>>>>>> about
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> some of your statements (maybe I didn't
>> get
>>>>>>>>>>>> something?).
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Caching actually breaks the semantic of
>>> "FOR
>>>>>>>>>>>> SYSTEM_TIME
>>>>>>>>>>>>>>>>>>>>>>> AS OF
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> proc_time”
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I agree that the semantics of "FOR
>>>> SYSTEM_TIME
>>>>>> AS
>>>>>>>>>> OF
>>>>>>>>>>>>>>>>>>>>>>> proc_time"
>>>>>>>>>>>>>>>>>>>>>>>>> is
>>>>>>>>>>>>>>>>>>>>>>>>>>>> not
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> fully implemented with caching, but as
>> you
>>>>>> said,
>>>>>>>>>>> users
>>>>>>>>>>>> go
>>>>>>>>>>>>>>>>>>>>>>> on it
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> consciously to achieve better performance
>>> (no
>>>>>> one
>>>>>>>>>>>> proposed
>>>>>>>>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>>>>>>>> enable
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> caching by default, etc.). Or by users do
>>> you
>>>>>> mean
>>>>>>>>>>>> other
>>>>>>>>>>>>>>>>>>>>>>>>> developers
>>>>>>>>>>>>>>>>>>>>>>>>>>>> of
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> connectors? In this case developers
>>>> explicitly
>>>>>>>>>>> specify
>>>>>>>>>>>>>>>>>>>>>>> whether
>>>>>>>>>>>>>>>>>>>>>>>>> their
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> connector supports caching or not (in the
>>>> list
>>>>>> of
>>>>>>>>>>>> supported
>>>>>>>>>>>>>>>>>>>>>>>>>>>> options),
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> no one makes them do that if they don't
>>> want
>>>>>> to.
>>>>>>>> So
>>>>>>>>>>>> what
>>>>>>>>>>>>>>>>>>>>>>>>> exactly is
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the difference between implementing
>> caching
>>>> in
>>>>>>>>>>> modules
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> flink-table-runtime and in
>>> flink-table-common
>>>>>> from
>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>>>> considered
>>>>>>>>>>>>>>>>>>>>>>>>>>> point of view? How does it affect
>>>>>>>>>>>> breaking/non-breaking
>>>>>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> semantics of "FOR SYSTEM_TIME AS OF
>>>> proc_time"?
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> confront a situation that allows table
>>>>>> options in
>>>>>>>>>>> DDL
>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>>>>>>>> control
>>>>>>>>>>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> behavior of the framework, which has
>> never
>>>>>>>> happened
>>>>>>>>>>>>>>>>>>>>>>> previously
>>>>>>>>>>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> should
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> be cautious
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> If we talk about main differences of
>>>> semantics
>>>>>> of
>>>>>>>>>> DDL
>>>>>>>>>>>>>>>>>>>>>>> options
>>>>>>>>>>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> config options("table.exec.xxx"), isn't
>> it
>>>>>> about
>>>>>>>>>>>> limiting
>>>>>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>>>> scope
>>>>>>>>>>>>>>>>>>>>>>>>>>>> of
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the options + importance for the user
>>>> business
>>>>>>>>>> logic
>>>>>>>>>>>> rather
>>>>>>>>>>>>>>>>>>>>>>> than
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> specific location of corresponding logic
>> in
>>>> the
>>>>>>>>>>>> framework? I
>>>>>>>>>>>>>>>>>>>>>>>>> mean
>>>>>>>>>>>>>>>>>>>>>>>>>>>> that
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> in my design, for example, putting an
>>> option
>>>>>> with
>>>>>>>>>>>> lookup
>>>>>>>>>>>>>>>>>>>>>>> cache
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> strategy in configurations would  be the
>>>> wrong
>>>>>>>>>>>> decision,
>>>>>>>>>>>>>>>>>>>>>>>>> because it
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> directly affects the user's business
>> logic
>>>> (not
>>>>>>>>>> just
>>>>>>>>>>>>>>>>>>>>>>> performance
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> optimization) + touches just several
>>>> functions
>>>>>> of
>>>>>>>>>> ONE
>>>>>>>>>>>> table
>>>>>>>>>>>>>>>>>>>>>>>>> (there
>>>>>>>>>>>>>>>>>>>>>>>>>>>> can
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> be multiple tables with different
>> caches).
>>>>>> Does it
>>>>>>>>>>>> really
>>>>>>>>>>>>>>>>>>>>>>>>> matter for
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the user (or someone else) where the
>> logic
>>> is
>>>>>>>>>>> located,
>>>>>>>>>>>>>>>>>>>>>>> which is
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> affected by the applied option?
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Also I can remember DDL option
>>>>>> 'sink.parallelism',
>>>>>>>>>>>> which in
>>>>>>>>>>>>>>>>>>>>>>>>> some way
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> "controls the behavior of the framework"
>>> and
>>>> I
>>>>>>>>>> don't
>>>>>>>>>>>> see any
>>>>>>>>>>>>>>>>>>>>>>>>> problem
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> here.
>
>> introduce a new interface for this all-caching scenario and the
>> design would become more complex
>
> This is a subject for a separate discussion, but actually in our
> internal version we solved this problem quite easily - we reused the
> InputFormat class (so there is no need for a new API). The point is
> that currently all lookup connectors use InputFormat for scanning the
> data in batch mode: HBase, JDBC and even Hive - it uses the class
> PartitionReader, which is actually just a wrapper around InputFormat.
> The advantage of this solution is the ability to reload cache data in
> parallel (the number of threads depends on the number of InputSplits,
> but has an upper limit). As a result, the cache reload time is
> significantly reduced (as well as the time the input stream is
> blocked). I know that usually we try to avoid the use of concurrency
> in Flink code, but maybe this one can be an exception. BTW I don't say
> that it's an ideal solution; maybe there are better ones.
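The parallel reload described above can be sketched with plain JDK types only. This is a hedged illustration, not Flink code: the InputFormat/InputSplit pair is stood in for by a list of split-loader suppliers, and all names in it are made up for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Supplier;

public class ParallelCacheReload {

    // Rebuilds the lookup cache by loading all "splits" concurrently,
    // mirroring the idea of reloading one InputSplit per thread.
    static Map<String, String> reload(
            List<Supplier<Map<String, String>>> splitLoaders, int maxThreads)
            throws Exception {
        ExecutorService pool =
                Executors.newFixedThreadPool(Math.min(maxThreads, splitLoaders.size()));
        try {
            ConcurrentHashMap<String, String> cache = new ConcurrentHashMap<>();
            List<Future<?>> futures = new ArrayList<>();
            for (Supplier<Map<String, String>> loader : splitLoaders) {
                futures.add(pool.submit(() -> cache.putAll(loader.get())));
            }
            for (Future<?> f : futures) {
                f.get(); // propagate failures instead of serving a half-built cache
            }
            return cache;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        List<Supplier<Map<String, String>>> splits = Arrays.asList(
                () -> Map.of("1", "alice"),
                () -> Map.of("2", "bob"));
        System.out.println(reload(splits, 4).size()); // 2
    }
}
```

The `maxThreads` cap mirrors the "has an upper limit" remark: the pool never grows past it regardless of how many splits exist.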
>
>> Providing the cache in the framework might introduce compatibility
>> issues
>
> That's possible only in cases where the developer of the connector
> doesn't properly refactor his code and uses the new cache options
> incorrectly (i.e. explicitly provides the same options in 2 different
> places in the code). For correct behavior all he will need to do is
> redirect the existing options to the framework's LookupConfig (+ maybe
> add an alias for options, if there was different naming); everything
> will be transparent for users. If the developer doesn't do the
> refactoring at all, nothing will change for the connector because of
> backward compatibility. Also, if a developer wants to use his own
> cache logic, he can simply refuse to pass some of the configs into the
> framework, and instead make his own implementation with the already
> existing configs and metrics (but actually I think that's a rare
> case).
>
>> filters and projections should be pushed all the way down to the
>> table function, like what we do in the scan source
>
> That's a great goal. But the truth is that the ONLY connector that
> currently supports filter pushdown is FileSystemTableSource
> (no database connector supports it). Also, for some databases it's
> simply impossible to push down filters as complex as the ones we have
> in Flink.
>
>> only applying these optimizations to the cache seems not quite useful
>
> Filters can cut off an arbitrarily large amount of data from the
> dimension table. For a simple example, suppose in the dimension table
> 'users' we have a column 'age' with values from 20 to 40, and an input
> stream 'clicks' that is ~uniformly distributed by the age of users. If
> we have the filter 'age > 30', there will be half as much data in the
> cache. This means the user can increase 'lookup.cache.max-rows' by
> almost 2 times, which will give a huge performance boost. Moreover,
> this optimization starts to really shine with the 'ALL' cache, where
> tables without filters and projections can't fit in memory, but with
> them - can. This opens up additional possibilities for users. And that
> doesn't sound like 'not quite useful'.
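To make the arithmetic concrete, here is a minimal sketch in plain Java (illustrative names only, not the FLIP's API) of admitting rows to an LRU cache only after the filter has run, so rejected rows never take up 'lookup.cache.max-rows' slots:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Predicate;

/** A tiny LRU cache that admits a looked-up row only if it passes a filter. */
public class FilteredLruCache<K, V> {

    private final Predicate<V> filter;
    private final LinkedHashMap<K, V> cache;

    public FilteredLruCache(int maxRows, Predicate<V> filter) {
        this.filter = filter;
        // access-order LinkedHashMap gives a minimal LRU eviction policy
        this.cache = new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                return size() > maxRows;
            }
        };
    }

    /** Rows failing the filter are dropped, so they never occupy cache slots. */
    public void put(K key, V row) {
        if (filter.test(row)) {
            cache.put(key, row);
        }
    }

    public V get(K key) {
        return cache.get(key);
    }

    public int size() {
        return cache.size();
    }

    public static void main(String[] args) {
        // Ages 20..40 uniformly; the filter 'age > 30' admits only ages 31..40.
        FilteredLruCache<Integer, Integer> cache =
                new FilteredLruCache<>(100, age -> age > 30);
        for (int userId = 0; userId <= 20; userId++) {
            cache.put(userId, 20 + userId);
        }
        System.out.println(cache.size()); // 10, instead of 21 without the filter
    }
}
```

In the 'users'/'clicks' example above, the same `maxRows` budget now holds roughly twice as many relevant rows.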
>
> It would be great to hear other voices regarding this topic! Because
> we have quite a lot of controversial points, and I think with the
> help of others it will be easier for us to come to a consensus.
>
> Best regards,
> Smirnov Alexander
>
>
> On Fri, Apr 29, 2022 at 22:33, Qingsheng Ren <renqschn@gmail.com>
> wrote:
>>
>> Hi Alexander and Arvid,
>>
>> Thanks for the discussion and sorry for my late response! We had an
>> internal discussion together with Jark and Leonard and I'd like to
>> summarize our ideas. Instead of implementing the cache logic in the
>> table runtime layer or wrapping it around the user-provided table
>> function, we prefer to introduce some new APIs extending
>> TableFunction, with these concerns:
>>
>> 1. Caching actually breaks the semantic of "FOR SYSTEM_TIME AS OF
>> proc_time", because it couldn't truly reflect the content of the
>> lookup table at the moment of querying. If users choose to enable
>> caching on the lookup table, they implicitly indicate that this
>> breakage is acceptable in exchange for the performance. So we prefer
>> not to provide caching on the table runtime level.
>>
>> 2. If we make the cache implementation in the framework (whether in
>> a runner or a wrapper around TableFunction), we have to confront a
>> situation that allows table options in DDL to control the behavior
>> of the framework, which has never happened previously and should be
>> treated cautiously. Under the current design the behavior of the
>> framework should only be specified by configurations
>> ("table.exec.xxx"), and it's hard to apply these general configs to
>> a specific table.
>>
>> 3. We have use cases where the lookup source loads and refreshes all
>> records periodically into memory to achieve high lookup performance
>> (like the Hive connector in the community; this pattern is also
>> widely used by our internal connectors). Wrapping the cache around
>> the user's TableFunction works fine for LRU caches, but I think we
>> would have to introduce a new interface for this all-caching
>> scenario, and the design would become more complex.
>>
>> 4. Providing the cache in the framework might introduce
>> compatibility issues to existing lookup sources: there might exist
>> two caches with totally different strategies if the user incorrectly
>> configures the table (one in the framework and another implemented
>> by the lookup source).
>>
>> As for the optimization mentioned by Alexander, I think filters and
>> projections should be pushed all the way down to the table function,
>> like what we do in the scan source, instead of to the runner with
>> the cache. The goal of using a cache is to reduce the network I/O
>> and the pressure on the external system, and only applying these
>> optimizations to the cache seems not quite useful.
>>
>> I made some updates to the FLIP[1] to reflect our ideas. We prefer
>> to keep the cache implementation as a part of TableFunction, and we
>> could provide some helper classes (CachingTableFunction,
>> AllCachingTableFunction, CachingAsyncTableFunction) to developers
>> and regulate the metrics of the cache.
>> Also, I made a POC[2] for your reference.
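The delegation idea behind such helper classes can be sketched roughly as follows. This is a simplified stand-in in plain Java: a Function replaces the TableFunction's eval, and the class and metric names here are illustrative, not the POC's actual API.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Serves repeated lookup keys from an LRU cache, delegating on misses. */
public class CachingLookup<K, V> {

    private final Function<K, List<V>> delegate; // stand-in for the user's lookup function
    private final LinkedHashMap<K, List<V>> cache;
    private long hits;
    private long misses;

    public CachingLookup(Function<K, List<V>> delegate, int maxRows) {
        this.delegate = delegate;
        // access-order LinkedHashMap gives a minimal LRU eviction policy
        this.cache = new LinkedHashMap<K, List<V>>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, List<V>> eldest) {
                return size() > maxRows;
            }
        };
    }

    public List<V> lookup(K key) {
        List<V> rows = cache.get(key);
        if (rows != null) {
            hits++;   // would feed a standardized hit-count metric
            return rows;
        }
        misses++;     // would feed a standardized miss-count metric
        rows = delegate.apply(key);
        cache.put(key, rows); // caching empty results avoids repeated misses
        return rows;
    }

    public long hits() { return hits; }

    public long misses() { return misses; }

    public static void main(String[] args) {
        CachingLookup<Integer, String> lookup =
                new CachingLookup<>(key -> List.of("name-" + key), 1000);
        lookup.lookup(42);
        lookup.lookup(42); // second call never reaches the delegate
        System.out.println(lookup.hits() + "/" + lookup.misses()); // 1/1
    }
}
```

A real implementation would also need an expiration policy (TTL) and thread-safety; both are omitted here for brevity.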
>>
>> Looking forward to your ideas!
>>
>> [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-221+Abstraction+for+lookup+source+cache+and+metric
>> [2] https://github.com/PatrickRen/flink/tree/FLIP-221
>>
>> Best regards,
>>
>> Qingsheng
>>
>> On Tue, Apr 26, 2022 at 4:45 PM Александр Смирнов
>> <smiralexan@gmail.com> wrote:
>>>
>>> Thanks for the response, Arvid!
>>>
>>> I have a few comments on your message.
>>>
>>>> but could also live with an easier solution as the first step:
>>>
>>> I think that these 2 ways (the one originally proposed by Qingsheng
>>> and mine) are mutually exclusive, because conceptually they follow
>>> the same goal, but the implementation details are different. If we
>>> go one way, moving to the other way in the future will mean deleting
>>> existing code and once again changing the API for connectors. So I
>>> think we should reach a consensus with the community about that and
>>> then work together on this FLIP, i.e. divide the work into tasks for
>>> different parts of the FLIP (for example, LRU cache unification /
>>> introducing the proposed set of metrics / further work…). WDYT,
>>> Qingsheng?
>>>
>>>> as the source will only receive the requests after filter
>>>
>>> Actually, if filters are applied to fields of the lookup table, we
>>> first have to make the requests, and only after that can we filter
>>> the responses, because lookup connectors don't have filter pushdown.
>>> So if filtering is done before caching, there will be far fewer rows
>>> in the cache.
>>>
>>>> @Alexander unfortunately, your architecture is not shared. I don't
>>>> know the solution to share images to be honest.
>>>
>>> Sorry for that, I'm a bit new to such kinds of conversations :)
>>> I have no write access to the confluence, so I made a Jira issue
>>> where I described the proposed changes in more detail -
>>> https://issues.apache.org/jira/browse/FLINK-27411.
>>>
>>> Will be happy to get more feedback!
>>>
>>> Best,
>>> Smirnov Alexander
>>>
>>> On Mon, Apr 25, 2022 at 19:49, Arvid Heise <arvid@apache.org> wrote:
>>>>
>>>> Hi Qingsheng,
>>>>
>>>> Thanks for driving this; the inconsistency was not satisfying for
>>>> me.
>>>>
>>>> I second Alexander's idea, though I could also live with an easier
>>>> solution as the first step: instead of making caching an
>>>> implementation detail of TableFunction X, rather devise a caching
>>>> layer around X. So the proposal would be a CachingTableFunction
>>>> that delegates to X in case of misses and otherwise manages the
>>>> cache. Lifting it into the operator model as proposed would be even
>>>> better, but is probably unnecessary in the first step for a lookup
>>>> source (as the source will only receive the requests after the
>>>> filter; applying the projection may be more interesting, to save
>>>> memory).
>>>>
>>>> Another advantage is that all the changes of this FLIP would be
>>>> limited to options; no need for new public interfaces. Everything
>>>> else remains an implementation detail of the Table runtime. That
>>>> means we can easily incorporate the optimization potential that
>>>> Alexander pointed out later.
>>>>
>>>> @Alexander unfortunately, your architecture is not shared. I don't
>>>> know the solution to share images to be honest.
>>>>
>>>> On Fri, Apr 22, 2022 at 5:04 PM Александр Смирнов
>>>> <smiralexan@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi Qingsheng! My name is Alexander,
>> I'm
>>>>>> not a
>>>>>>>>>>>>>>>>>>>>>>> committer
>>>>>>>>>>>>>>>>>>>>>>>>> yet,
>>>>>>>>>>>>>>>>>>>>>>>>>>>> but
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I'd
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> really like to become one. And this
>>> FLIP
>>>>>>>> really
>>>>>>>>>>>>>>>>>>>>>>>>> interested
>>>>>>>>>>>>>>>>>>>>>>>>>>>> me.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Actually I have worked on a similar
>>>>>> feature in
>>>>>>>>>> my
>>>>>>>>>>>>>>>>>>>>>>>>> company’s
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Flink
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> fork, and we would like to share our
>>>>>> thoughts
>>>>>>>>>> on
>>>>>>>>>>>>>>>>>>>>>>> this and
>>>>>>>>>>>>>>>>>>>>>>>>>>>> make
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> code
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> open source.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I think there is a better alternative
>>>> than
>>>>>>>>>>>>>>>>>>>>>>> introducing an
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> abstract
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> class for TableFunction
>>>>>>>> (CachingTableFunction).
>>>>>>>>>>> As
>>>>>>>>>>>>>>>>>>>>>>> you
>>>>>>>>>>>>>>>>>>>>>>>>> know,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> TableFunction exists in the
>>>>>> flink-table-common
>>>>>>>>>>>>>>>>>>>>>>> module,
>>>>>>>>>>>>>>>>>>>>>>>>> which
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> provides
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> only an API for working with tables –
>>>> it’s
>>>>>>>> very
>>>>>>>>>>>>>>>>>>>>>>>>> convenient
>>>>>>>>>>>>>>>>>>>>>>>>>>>> for
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> importing
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> in connectors. In turn,
>>>>>> CachingTableFunction
>>>>>>>>>>>> contains
>>>>>>>>>>>>>>>>>>>>>>>>> logic
>>>>>>>>>>>>>>>>>>>>>>>>>>>> for
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> runtime execution,  so this class and
>>>>>>>>>> everything
>>>>>>>>>>>>>>>>>>>>>>>>> connected
>>>>>>>>>>>>>>>>>>>>>>>>>>>> with
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> it
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> should be located in another module,
>>>>>> probably
>>>>>>>>>> in
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> flink-table-runtime.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> But this will require connectors to
>>>> depend
>>>>>> on
>>>>>>>>>>>> another
>>>>>>>>>>>>>>>>>>>>>>>>> module,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> which
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> contains a lot of runtime logic,
>> which
>>>>>> doesn’t
>>>>>>>>>>>> sound
>>>>>>>>>>>>>>>>>>>>>>>>> good.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I suggest adding a new method
>>>>>>>> ‘getLookupConfig’
>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> LookupTableSource
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> or LookupRuntimeProvider to allow
>>>>>> connectors
>>>>>>>> to
>>>>>>>>>>>> only
>>>>>>>>>>>>>>>>>>>>>>> pass
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> configurations to the planner,
>>> therefore
>>>>>> they
>>>>>>>>>>> won’t
>>>>>>>>>>>>>>>>>>>>>>>>> depend on
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> runtime
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> realization. Based on these configs
>>>> planner
>>>>>>>>>> will
>>>>>>>>>>>>>>>>>>>>>>>>> construct a
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> lookup
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> join operator with corresponding
>>> runtime
>>>>>> logic
>>>>>>>>>>>>>>>>>>>>>>>>>>>> (ProcessFunctions
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> in
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> module flink-table-runtime).
>>> Architecture
>>>>>>>> looks
>>>>>>>>>>>> like
>>>>>>>>>>>>>>>>>>>>>>> in
>>>>>>>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> pinned
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> image (LookupConfig class there is
>>>> actually
>>>>>>>>>> yours
>>>>>>>>>>>>>>>>>>>>>>>>>>>> CacheConfig).
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
> Classes in flink-table-planner that will be responsible for this –
> CommonPhysicalLookupJoin and its inheritors. The current classes for
> lookup join in flink-table-runtime are LookupJoinRunner,
> AsyncLookupJoinRunner, LookupJoinRunnerWithCalc and
> AsyncLookupJoinRunnerWithCalc.
>
> I suggest adding classes LookupJoinCachingRunner,
> LookupJoinCachingRunnerWithCalc, etc.
>
> And here comes another, more powerful advantage of such a solution. If
> we have the caching logic on a lower level, we can apply some
> optimizations to it. LookupJoinRunnerWithCalc was named like this
> because it uses the ‘calc’ function, which mostly consists of filters
> and projections.
>
> For example, when joining table A with lookup table B using the
> condition ‘JOIN … ON A.id = B.id AND A.age = B.age + 10 WHERE
> B.salary > 1000’, the ‘calc’ function will contain the filters
> A.age = B.age + 10 and B.salary > 1000.
>
> If we apply this function before storing records in the cache, the
> cache size will be significantly reduced: filters avoid storing useless
> records, and projections reduce each record’s size. So the initial
> maximum number of records in the cache can be increased by the user.
>
> What do you think about it?
>
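The filter-before-caching idea above can be sketched in plain Java. This is an illustrative stand-in, not the real LookupJoinCachingRunner API: the class name, the Predicate used as the filter, and the Function used as the projection are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.function.Predicate;

/** Applies the join's 'calc' (filters + projections) before rows enter the cache. */
public class CalcBeforeCache<K, R> {

    private final Map<K, List<R>> cache = new HashMap<>();
    private final Predicate<R> filter;       // e.g. B.salary > 1000
    private final Function<R, R> projection; // keep only the columns the query needs

    public CalcBeforeCache(Predicate<R> filter, Function<R, R> projection) {
        this.filter = filter;
        this.projection = projection;
    }

    /** Filter and project looked-up rows, then store only what survives. */
    public void put(K key, List<R> lookedUpRows) {
        List<R> reduced = new ArrayList<>();
        for (R row : lookedUpRows) {
            if (filter.test(row)) {
                reduced.add(projection.apply(row));
            }
        }
        cache.put(key, reduced);
    }

    public List<R> get(K key) {
        return cache.get(key);
    }
}
```

With the B.salary > 1000 example, a row that fails the filter is dropped before caching and never occupies a cache slot, which is why the user can afford a larger maximum record count.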
> On 2022/04/19 02:47:11 Qingsheng Ren wrote:
>> Hi devs,
>>
>> Yuan and I would like to start a discussion about FLIP-221 [1], which
>> introduces an abstraction of lookup table cache and its standard
>> metrics.
>>
>> Currently each lookup table source has to implement its own cache to
>> store lookup results, and there isn't a standard set of metrics for
>> users and developers to tune their jobs with lookup joins, which is a
>> quite common use case in Flink table / SQL.
>>
>> Therefore we propose some new APIs including a cache, metrics, wrapper
>> classes of TableFunction and new table options. Please take a look at
>> the FLIP page [1] to get more details. Any suggestions and comments
>> would be appreciated!
>>
>> [1]
>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-221+Abstraction+for+lookup+source+cache+and+metric
>>
>> Best regards,
>>
>> Qingsheng
>
> --
> Best Regards,
>
> Qingsheng Ren
>
> Real-time Computing Team
> Alibaba Cloud
>
> Email: renqschn@gmail.com
>
> --
> Best regards,
> Roman Boyko
> e.: ro.v.boyko@gmail.com


Re: [DISCUSS] FLIP-221 Abstraction for lookup source cache and metric

Posted by Jark Wu <im...@gmail.com>.
The current FLIP looks good to me.
+1 to start the vote.
Thank you so much for the effort, Qingsheng!

Best,
Jark

On Wed, 22 Jun 2022 at 17:03, Jingsong Li <ji...@gmail.com> wrote:

> Thanks for the update, Qingsheng,
>
> Looks good to me!
>
> Best,
> Jingsong
>
> On Wed, Jun 22, 2022 at 5:00 PM Qingsheng Ren <re...@apache.org> wrote:
> >
> > Hi Jingsong,
> >
> > 1. Updated and thanks for the reminder!
> >
> > 2. We could do so for the implementation, but as a public interface I
> > prefer not to introduce another layer and expose too much, since this
> > FLIP is already a huge one with a bunch of classes and interfaces.
> >
> > Best,
> > Qingsheng
> >
> > > On Jun 22, 2022, at 11:16, Jingsong Li <ji...@gmail.com> wrote:
> > >
> > > Thanks Qingsheng and all.
> > >
> > > I like this design.
> > >
> > > Some comments:
> > >
> > > 1. LookupCache implements Serializable?
> > >
> > > 2. Minor: After FLIP-234 [1], there should be many connectors that
> > > implement both PartialCachingLookupProvider and
> > > PartialCachingAsyncLookupProvider. Can we extract a common interface
> > > for `LookupCache getCache();` to ensure consistency?
> > >
> > > [1]
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-234%3A+Support+Retryable+Lookup+Join+To+Solve+Delayed+Updates+Issue+In+External+Systems
> > >
> > > Best,
> > > Jingsong
> > >
> > > On Tue, Jun 21, 2022 at 4:09 PM Qingsheng Ren <re...@apache.org>
> wrote:
> > >>
> > >> Hi devs,
> > >>
> > >> I’d like to push FLIP-221 forward a little bit. Recently we had some
> offline discussions and updated the FLIP. Here’s the diff compared to the
> previous version:
> > >>
> > >> 1. (Async)LookupFunctionProvider is designed as a base interface for
> constructing lookup functions.
> > >> 2. From the LookupFunction we extend PartialCaching /
> FullCachingLookupProvider for partial and full caching mode.
> > >> 3. Introduce CacheReloadTrigger for specifying reload strategy in
> full caching mode, and provide 2 default implementations (Periodic /
> TimedCacheReloadTrigger)
> > >>
> > >> Looking forward to your replies~
> > >>
> > >> Best,
> > >> Qingsheng
> > >>
> > >>> On Jun 2, 2022, at 17:15, Qingsheng Ren <re...@gmail.com> wrote:
> > >>>
> > >>> Hi Becket,
> > >>>
> > >>> Thanks for your feedback!
> > >>>
> > >>> 1. An alternative way is to let the implementation of cache to decide
> > >>> whether to store a missing key in the cache instead of the framework.
> > >>> This sounds more reasonable and makes the LookupProvider interface
> > >>> cleaner. I can update the FLIP and clarify in the JavaDoc of
> > >>> LookupCache#put that the cache should decide whether to store an
> empty
> > >>> collection.
> > >>>
> > >>> 2. Initially the builder pattern is for the extensibility of
> > >>> LookupProvider interfaces that we could need to add more
> > >>> configurations in the future. We can remove the builder now as we
> have
> > >>> resolved the issue in 1. As for the builder in DefaultLookupCache I
> > >>> prefer to keep it because we have a lot of arguments in the
> > >>> constructor.
> > >>>
> > >>> 3. I think this might overturn the overall design. I agree with
> > >>> Becket's idea that the API design should be layered considering
> > >>> extensibility and it'll be great to have one unified interface
> > >>> supporting both partial, full and even mixed custom strategies, but
> we
> > >>> have some issues to resolve. The original purpose of treating full
> > >>> caching separately is that we'd like to reuse the ability of
> > >>> ScanRuntimeProvider. Developers just need to hand over Source /
> > >>> SourceFunction / InputFormat so that the framework could be able to
> > >>> compose the underlying topology and control the reload (maybe in a
> > >>> distributed way). Under your design we leave the reload operation
> > >>> totally to the CacheStrategy and I think it will be hard for
> > >>> developers to reuse the source in the initializeCache method.
> > >>>
> > >>> Best regards,
> > >>>
> > >>> Qingsheng
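
The contract in point 1 — the cache implementation itself decides whether an empty lookup result (a miss) gets stored — can be sketched as below. The class and field names are illustrative, and String stands in for Flink's RowData to keep the sketch self-contained.

```java
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;

/** Cache that decides for itself whether to remember lookup misses. */
public class MissAwareCache {

    private final Map<String, Collection<String>> store = new HashMap<>();
    private final boolean cacheMissingKey;

    public MissAwareCache(boolean cacheMissingKey) {
        this.cacheMissingKey = cacheMissingKey;
    }

    /** An empty collection means the key was looked up and not found. */
    public void put(String key, Collection<String> rows) {
        if (rows.isEmpty() && !cacheMissingKey) {
            return; // skip caching the miss; the next lookup goes to the backend again
        }
        store.put(key, rows);
    }

    /** null = never looked up; empty collection = cached miss. */
    public Collection<String> getIfPresent(String key) {
        return store.get(key);
    }
}
```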
> > >>>
> > >>> On Thu, Jun 2, 2022 at 1:50 PM Becket Qin <be...@gmail.com>
> wrote:
> > >>>>
> > >>>> Thanks for updating the FLIP, Qingsheng. A few more comments:
> > >>>>
> > >>>> 1. I am still not sure about what is the use case for
> cacheMissingKey().
> > >>>> More specifically, when would users want to have getCache() return a
> > >>>> non-empty value and cacheMissingKey() returns false?
> > >>>>
> > >>>> 2. The builder pattern. Usually the builder pattern is used when
> there are
> > >>>> a lot of variations of constructors. For example, if a class has
> three
> > >>>> variables and all of them are optional, so there could potentially
> be many
> > >>>> combinations of the variables. But in this FLIP, I don't see such
> case.
> > >>>> What is the reason we have builders for all the classes?
> > >>>>
> > >>>> 3. Should the caching strategy be excluded from the top level
> provider API?
> > >>>> Technically speaking, the Flink framework should only have two
> interfaces
> > >>>> to deal with:
> > >>>>   A) LookupFunction
> > >>>>   B) AsyncLookupFunction
> > >>>> Orthogonally, we *believe* there are two different strategies
> people can do
> > >>>> caching. Note that the Flink framework does not care what is the
> caching
> > >>>> strategy here.
> > >>>>   a) partial caching
> > >>>>   b) full caching
> > >>>>
> > >>>> Putting them together, we end up with 3 combinations that we think
> are
> > >>>> valid:
> > >>>>    Aa) PartialCachingLookupFunctionProvider
> > >>>>    Ba) PartialCachingAsyncLookupFunctionProvider
> > >>>>    Ab) FullCachingLookupFunctionProvider
> > >>>>
> > >>>> However, the caching strategy could actually be quite flexible.
> E.g. an
> > >>>> initial full cache load followed by some partial updates. Also, I
> am not
> > >>>> 100% sure if the full caching will always use ScanTableSource.
> Including
> > >>>> the caching strategy in the top level provider API would make it
> harder to
> > >>>> extend.
> > >>>>
> > >>>> One possible solution is to just have *LookupFunctionProvider* and
> > >>>> *AsyncLookupFunctionProvider
> > >>>> *as the top level API, both with a getCacheStrategy() method
> returning an
> > >>>> optional CacheStrategy. The CacheStrategy class would have the
> following
> > >>>> methods:
> > >>>> 1. void open(Context), the context exposes some of the resources
> that may
> > >>>> be useful for the the caching strategy, e.g. an ExecutorService
> that is
> > >>>> synchronized with the data processing, or a cache refresh trigger
> which
> > >>>> blocks data processing and refresh the cache.
> > >>>> 2. void initializeCache(), a blocking method allows users to
> pre-populate
> > >>>> the cache before processing any data if they wish.
> > >>>> 3. void maybeCache(RowData key, Collection<RowData> value),
> blocking or
> > >>>> non-blocking method.
> > >>>> 4. void refreshCache(), a blocking / non-blocking method that is
> invoked by
> > >>>> the Flink framework when the cache refresh trigger is pulled.
> > >>>>
> > >>>> In the above design, partial caching and full caching would be
> > >>>> implementations of the CachingStrategy. And it is OK for users to
> implement
> > >>>> their own CachingStrategy if they want to.
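
The four methods Becket describes could look roughly like this as an interface. All names below are either taken from the email or invented for illustration; this is not the interface that ended up in the FLIP.

```java
import java.util.Collection;

/** Sketch of the pluggable caching strategy described above. */
public interface CacheStrategy<K, V> {

    /** 1. Receive framework resources (executor, cache refresh trigger, ...). */
    void open(Context context);

    /** 2. Optionally pre-populate the cache before any data is processed. */
    void initializeCache();

    /** 3. Decide whether and how to cache a looked-up key/value pair. */
    void maybeCache(K key, Collection<V> value);

    /** 4. Invoked by the framework when the cache refresh trigger fires. */
    void refreshCache();

    /** Placeholder for the resources the framework would expose. */
    interface Context {}
}
```

Under this shape, partial caching and full caching become two implementations of the same interface, and a custom strategy (e.g. full load followed by partial updates) needs no new top-level provider.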
> > >>>>
> > >>>> Thanks,
> > >>>>
> > >>>> Jiangjie (Becket) Qin
> > >>>>
> > >>>>
> > >>>> On Thu, Jun 2, 2022 at 12:14 PM Jark Wu <im...@gmail.com> wrote:
> > >>>>
> > >>>>> Thank Qingsheng for the detailed summary and updates,
> > >>>>>
> > >>>>> The changes look good to me in general. I just have one minor
> improvement
> > >>>>> comment.
> > >>>>> Could we add a static util method to the "FullCachingReloadTrigger"
> > >>>>> interface for quick usage?
> > >>>>>
> > >>>>> #periodicReloadAtFixedRate(Duration)
> > >>>>> #periodicReloadWithFixedDelay(Duration)
> > >>>>>
> > >>>>> I think we can also do this for LookupCache, because users may not
> > >>>>> know where the default
> > >>>>> implementations are or how to use them.
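
Jark's static shortcut methods might be sketched like this. Only the two method names come from the email; the Periodic record and the atFixedRate flag are assumptions made for the example.

```java
import java.time.Duration;

/** Sketch of static convenience factories on the reload-trigger interface. */
public interface FullCachingReloadTrigger {

    /** How long to wait between two cache reloads. */
    Duration period();

    /** true = schedule at a fixed rate; false = fixed delay after each reload completes. */
    boolean atFixedRate();

    static FullCachingReloadTrigger periodicReloadAtFixedRate(Duration period) {
        return new Periodic(period, true);
    }

    static FullCachingReloadTrigger periodicReloadWithFixedDelay(Duration period) {
        return new Periodic(period, false);
    }

    /** Minimal default implementation backing both shortcuts. */
    record Periodic(Duration period, boolean atFixedRate)
            implements FullCachingReloadTrigger {}
}
```

The point of the shortcuts is discoverability: a user can type the interface name and find the default behaviors without knowing where the implementation classes live.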
> > >>>>>
> > >>>>> Best,
> > >>>>> Jark
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>> On Wed, 1 Jun 2022 at 18:32, Qingsheng Ren <re...@gmail.com>
> wrote:
> > >>>>>
> > >>>>>> Hi Jingsong,
> > >>>>>>
> > >>>>>> Thanks for your comments!
> > >>>>>>
> > >>>>>>> The AllCache definition is not flexible. For example, PartialCache
> > >>>>>>> can use any custom storage while AllCache cannot; AllCache could
> > >>>>>>> also store to memory or disk, which also needs a flexible strategy.
> > >>>>>> considered to store memory or disk, also need a flexible strategy.
> > >>>>>>
> > >>>>>> We had an offline discussion with Jark and Leonard. Basically we
> think
> > >>>>>> exposing the interface of full cache storage to connector
> developers
> > >>>>> might
> > >>>>>> limit our future optimizations. The storage of full caching
> shouldn’t
> > >>>>> have
> > >>>>>> too many variations for different lookup tables so making it
> pluggable
> > >>>>>> might not help a lot. Also I think it is not quite easy for
> connector
> > >>>>>> developers to implement such an optimized storage. We can keep
> optimizing
> > >>>>>> this storage in the future and all full caching lookup tables
> would
> > >>>>> benefit
> > >>>>>> from this.
> > >>>>>>
> > >>>>>>> We are more inclined to deprecate the connector `async` option
> when
> > >>>>>> discussing FLIP-234. Can we remove this option from this FLIP?
> > >>>>>>
> > >>>>>> Thanks for the reminder! This option has been removed in the
> latest
> > >>>>>> version.
> > >>>>>>
> > >>>>>> Best regards,
> > >>>>>>
> > >>>>>> Qingsheng
> > >>>>>>
> > >>>>>>
> > >>>>>>> On Jun 1, 2022, at 15:28, Jingsong Li <ji...@gmail.com>
> wrote:
> > >>>>>>>
> > >>>>>>> Thanks Alexander for your reply. We can discuss the new
> interface when
> > >>>>> it
> > >>>>>>> comes out.
> > >>>>>>>
> > >>>>>>> We are more inclined to deprecate the connector `async` option
> when
> > >>>>>>> discussing FLIP-234 [1]. We should use hint to let planner
> decide.
> > >>>>>>> Although the discussion has not yet produced a conclusion, can we
> > >>>>> remove
> > >>>>>>> this option from this FLIP? It doesn't seem to be related to
> this FLIP,
> > >>>>>> but
> > >>>>>>> more to FLIP-234, and we can form a conclusion over there.
> > >>>>>>>
> > >>>>>>> [1]
> https://lists.apache.org/thread/9k1sl2519kh2n3yttwqc00p07xdfns3h
> > >>>>>>>
> > >>>>>>> Best,
> > >>>>>>> Jingsong
> > >>>>>>>
> > >>>>>>> On Wed, Jun 1, 2022 at 4:59 AM Jing Ge <ji...@ververica.com>
> wrote:
> > >>>>>>>
> > >>>>>>>> Hi Jark,
> > >>>>>>>>
> > >>>>>>>> Thanks for clarifying it. It would be fine. as long as we could
> > >>>>> provide
> > >>>>>> the
> > >>>>>>>> no-cache solution. I was just wondering if the client side
> cache could
> > >>>>>>>> really help when HBase is used, since the data to look up
> should be
> > >>>>>> huge.
> > >>>>>>>> Depending how much data will be cached on the client side, the
> data
> > >>>>> that
> > >>>>>>>> should be lru in e.g. LruBlockCache will not be lru anymore. In
> the
> > >>>>>> worst
> > >>>>>>>> case scenario, once the cached data at client side is expired,
> the
> > >>>>>> request
> > >>>>>>>> will hit disk which will cause extra latency temporarily, if I
> am not
> > >>>>>>>> mistaken.
> > >>>>>>>>
> > >>>>>>>> Best regards,
> > >>>>>>>> Jing
> > >>>>>>>>
> > >>>>>>>> On Mon, May 30, 2022 at 9:59 AM Jark Wu <im...@gmail.com>
> wrote:
> > >>>>>>>>
> > >>>>>>>>> Hi Jing Ge,
> > >>>>>>>>>
> > >>>>>>>>> What do you mean by the "impact on the block cache used by HBase"?
> HBase"?
> > >>>>>>>>> In my understanding, the connector cache and HBase cache are
> totally
> > >>>>>> two
> > >>>>>>>>> things.
> > >>>>>>>>> The connector cache is a local/client cache, and the HBase
> cache is a
> > >>>>>>>>> server cache.
> > >>>>>>>>>
> > >>>>>>>>>> does it make sense to have a no-cache solution as one of the
> > >>>>>>>>> default solutions so that customers will have no effort for the
> > >>>>>> migration
> > >>>>>>>>> if they want to stick with Hbase cache
> > >>>>>>>>>
> > >>>>>>>>> The implementation migration should be transparent to users.
> Take the
> > >>>>>>>> HBase
> > >>>>>>>>> connector as
> > >>>>>>>>> an example,  it already supports lookup cache but is disabled
> by
> > >>>>>> default.
> > >>>>>>>>> After migration, the
> > >>>>>>>>> connector still disables cache by default (i.e. no-cache
> solution).
> > >>>>> No
> > >>>>>>>>> migration effort for users.
> > >>>>>>>>>
> > >>>>>>>>> HBase cache and connector cache are two different things.
> HBase cache
> > >>>>>>>> can't
> > >>>>>>>>> simply replace
> > >>>>>>>>> connector cache. Because one of the most important usages for
> > >>>>> connector
> > >>>>>>>>> cache is reducing
> > >>>>>>>>> the I/O request/response and improving the throughput, which
> can
> > >>>>>> achieve
> > >>>>>>>>> by just using a server cache.
> > >>>>>>>>>
> > >>>>>>>>> Best,
> > >>>>>>>>> Jark
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>> On Fri, 27 May 2022 at 22:42, Jing Ge <ji...@ververica.com>
> wrote:
> > >>>>>>>>>
> > >>>>>>>>>> Thanks all for the valuable discussion. The new feature looks
> very
> > >>>>>>>>>> interesting.
> > >>>>>>>>>>
> > >>>>>>>>>> According to the FLIP description: "*Currently we have JDBC,
> Hive
> > >>>>> and
> > >>>>>>>>> HBase
> > >>>>>>>>>> connector implemented lookup table source. All existing
> > >>>>>> implementations
> > >>>>>>>>>> will be migrated to the current design and the migration will
> be
> > >>>>>>>>>> transparent to end users*." I was only wondering if we should
> pay
> > >>>>>>>>> attention
> > >>>>>>>>>> to HBase and similar DBs. Since, commonly, the lookup data
> will be
> > >>>>>> huge
> > >>>>>>>>>> while using HBase, partial caching will be used in this case,
> if I
> > >>>>> am
> > >>>>>>>> not
> > >>>>>>>>>> mistaken, which might have an impact on the block cache used
> by
> > >>>>> HBase,
> > >>>>>>>>> e.g.
> > >>>>>>>>>> LruBlockCache.
> > >>>>>>>>>> Another question is that, since HBase provides a
> sophisticated cache
> > >>>>>>>>>> solution, does it make sense to have a no-cache solution as
> one of
> > >>>>> the
> > >>>>>>>>>> default solutions so that customers will have no effort for
> the
> > >>>>>>>> migration
> > >>>>>>>>>> if they want to stick with Hbase cache?
> > >>>>>>>>>>
> > >>>>>>>>>> Best regards,
> > >>>>>>>>>> Jing
> > >>>>>>>>>>
> > >>>>>>>>>> On Fri, May 27, 2022 at 11:19 AM Jingsong Li <
> > >>>>> jingsonglee0@gmail.com>
> > >>>>>>>>>> wrote:
> > >>>>>>>>>>
> > >>>>>>>>>>> Hi all,
> > >>>>>>>>>>>
> > >>>>>>>>>>> I think the problem now is below:
> > >>>>>>>>>>> 1. AllCache and PartialCache interface on the non-uniform,
> one
> > >>>>> needs
> > >>>>>>>> to
> > >>>>>>>>>>> provide LookupProvider, the other needs to provide
> CacheBuilder.
> > >>>>>>>>>>> 2. AllCache definition is not flexible, for example,
> PartialCache
> > >>>>> can
> > >>>>>>>>> use
> > >>>>>>>>>>> any custom storage, while the AllCache can not, AllCache can
> also
> > >>>>> be
> > >>>>>>>>>>> considered to store memory or disk, also need a flexible
> strategy.
> > >>>>>>>>>>> 3. AllCache can not customize ReloadStrategy, currently only
> > >>>>>>>>>>> ScheduledReloadStrategy.
> > >>>>>>>>>>>
> > >>>>>>>>>>> In order to solve the above problems, the following are my
> ideas.
> > >>>>>>>>>>>
> > >>>>>>>>>>> ## Top level cache interfaces:
> > >>>>>>>>>>>
> > >>>>>>>>>>> ```
> > >>>>>>>>>>>
> > >>>>>>>>>>> public interface CacheLookupProvider extends
> > >>>>>>>>>>> LookupTableSource.LookupRuntimeProvider {
> > >>>>>>>>>>>
> > >>>>>>>>>>>  CacheBuilder createCacheBuilder();
> > >>>>>>>>>>> }
> > >>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>> public interface CacheBuilder {
> > >>>>>>>>>>>  Cache create();
> > >>>>>>>>>>> }
> > >>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>> public interface Cache {
> > >>>>>>>>>>>
> > >>>>>>>>>>>  /**
> > >>>>>>>>>>>   * Returns the value associated with key in this cache, or
> null
> > >>>>>>>> if
> > >>>>>>>>>>> there is no cached value for
> > >>>>>>>>>>>   * key.
> > >>>>>>>>>>>   */
> > >>>>>>>>>>>  @Nullable
> > >>>>>>>>>>>  Collection<RowData> getIfPresent(RowData key);
> > >>>>>>>>>>>
> > >>>>>>>>>>>  /** Returns the number of key-value mappings in the cache.
> */
> > >>>>>>>>>>>  long size();
> > >>>>>>>>>>> }
> > >>>>>>>>>>>
> > >>>>>>>>>>> ```
> > >>>>>>>>>>>
> > >>>>>>>>>>> ## Partial cache
> > >>>>>>>>>>>
> > >>>>>>>>>>> ```
> > >>>>>>>>>>>
> > >>>>>>>>>>> public interface PartialCacheLookupFunction extends
> > >>>>>>>>> CacheLookupProvider {
> > >>>>>>>>>>>
> > >>>>>>>>>>>  @Override
> > >>>>>>>>>>>  PartialCacheBuilder createCacheBuilder();
> > >>>>>>>>>>>
> > >>>>>>>>>>> /** Creates an {@link LookupFunction} instance. */
> > >>>>>>>>>>> LookupFunction createLookupFunction();
> > >>>>>>>>>>> }
> > >>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>> public interface PartialCacheBuilder extends CacheBuilder {
> > >>>>>>>>>>>
> > >>>>>>>>>>>  PartialCache create();
> > >>>>>>>>>>> }
> > >>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>> public interface PartialCache extends Cache {
> > >>>>>>>>>>>
> > >>>>>>>>>>>  /**
> > >>>>>>>>>>>   * Associates the specified value rows with the specified
> key
> > >>>>> row
> > >>>>>>>>>>> in the cache. If the cache
> > >>>>>>>>>>>   * previously contained value associated with the key, the
> old
> > >>>>>>>>>>> value is replaced by the
> > >>>>>>>>>>>   * specified value.
> > >>>>>>>>>>>   *
> > >>>>>>>>>>>   * @return the previous value rows associated with key, or
> null
> > >>>>>>>> if
> > >>>>>>>>>>> there was no mapping for key.
> > >>>>>>>>>>>   * @param key - key row with which the specified value is
> to be
> > >>>>>>>>>>> associated
> > >>>>>>>>>>>   * @param value – value rows to be associated with the
> specified
> > >>>>>>>>> key
> > >>>>>>>>>>>   */
> > >>>>>>>>>>>  Collection<RowData> put(RowData key, Collection<RowData>
> value);
> > >>>>>>>>>>>
> > >>>>>>>>>>>  /** Discards any cached value for the specified key. */
> > >>>>>>>>>>>  void invalidate(RowData key);
> > >>>>>>>>>>> }
> > >>>>>>>>>>>
> > >>>>>>>>>>> ```
> > >>>>>>>>>>>
> > >>>>>>>>>>> ## All cache
> > >>>>>>>>>>> ```
> > >>>>>>>>>>>
> > >>>>>>>>>>> public interface AllCacheLookupProvider extends
> > >>>>> CacheLookupProvider {
> > >>>>>>>>>>>
> > >>>>>>>>>>>  void registerReloadStrategy(ScheduledExecutorService
> > >>>>>>>>>>> executorService, Reloader reloader);
> > >>>>>>>>>>>
> > >>>>>>>>>>>  ScanTableSource.ScanRuntimeProvider
> getScanRuntimeProvider();
> > >>>>>>>>>>>
> > >>>>>>>>>>>  @Override
> > >>>>>>>>>>>  AllCacheBuilder createCacheBuilder();
> > >>>>>>>>>>> }
> > >>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>> public interface AllCacheBuilder extends CacheBuilder {
> > >>>>>>>>>>>
> > >>>>>>>>>>>  AllCache create();
> > >>>>>>>>>>> }
> > >>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>> public interface AllCache extends Cache {
> > >>>>>>>>>>>
> > >>>>>>>>>>>  void putAll(Iterator<Map<RowData, RowData>> allEntries);
> > >>>>>>>>>>>
> > >>>>>>>>>>>  void clearAll();
> > >>>>>>>>>>> }
> > >>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>> public interface Reloader {
> > >>>>>>>>>>>
> > >>>>>>>>>>>  void reload();
> > >>>>>>>>>>> }
> > >>>>>>>>>>>
> > >>>>>>>>>>> ```
> > >>>>>>>>>>>
> > >>>>>>>>>>> Best,
> > >>>>>>>>>>> Jingsong
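
The PartialCache shape proposed above can be exercised with a trivial map-backed implementation. This sketch uses String in place of Flink's RowData so it stays self-contained; it is an illustration of the proposed contract, not a production cache (no eviction, no size limit).

```java
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;

/** Map-backed cache in the shape of the PartialCache interface above. */
public class MapPartialCache {

    private final Map<String, Collection<String>> cache = new HashMap<>();

    /** Returns cached rows for the key, or null if nothing is cached. */
    public Collection<String> getIfPresent(String key) {
        return cache.get(key);
    }

    /** Stores rows, returning the previously cached rows or null. */
    public Collection<String> put(String key, Collection<String> value) {
        return cache.put(key, value);
    }

    /** Discards any cached value for the key. */
    public void invalidate(String key) {
        cache.remove(key);
    }

    /** Number of key-value mappings in the cache. */
    public long size() {
        return cache.size();
    }
}
```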
> > >>>>>>>>>>>
> > >>>>>>>>>>> On Fri, May 27, 2022 at 11:10 AM Jingsong Li <
> > >>>>> jingsonglee0@gmail.com
> > >>>>>>>>>
> > >>>>>>>>>>> wrote:
> > >>>>>>>>>>>
> > >>>>>>>>>>>> Thanks Qingsheng and all for your discussion.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Very sorry to jump in so late.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Maybe I missed something?
> > >>>>>>>>>>>> My first impression when I saw the cache interface was, why
> don't
> > >>>>>>>> we
> > >>>>>>>>>>>> provide an interface similar to guava cache [1], on top of
> guava
> > >>>>>>>>> cache,
> > >>>>>>>>>>>> caffeine also makes extensions for asynchronous calls.[2]
> > >>>>>>>>>>>> There is also the bulk load in caffeine too.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> I am also more confused why first from
> LookupCacheFactory.Builder
> > >>>>>>>> and
> > >>>>>>>>>>> then
> > >>>>>>>>>>>> to Factory to create Cache.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> [1] https://github.com/google/guava
> > >>>>>>>>>>>> [2] https://github.com/ben-manes/caffeine/wiki/Population
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Best,
> > >>>>>>>>>>>> Jingsong
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> On Thu, May 26, 2022 at 11:17 PM Jark Wu <im...@gmail.com>
> > >>>>> wrote:
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>> After looking at the new introduced ReloadTime and Becket's
> > >>>>>>>> comment,
> > >>>>>>>>>>>>> I agree with Becket we should have a pluggable reloading
> > >>>>> strategy.
> > >>>>>>>>>>>>> We can provide some common implementations, e.g., periodic
> > >>>>>>>>> reloading,
> > >>>>>>>>>>> and
> > >>>>>>>>>>>>> daily reloading.
> > >>>>>>>>>>>>> But there definitely be some connector- or
> business-specific
> > >>>>>>>>> reloading
> > >>>>>>>>>>>>> strategies, e.g.
> > >>>>>>>>>>>>> notify by a zookeeper watcher, reload once a new Hive
> partition
> > >>>>> is
> > >>>>>>>>>>>>> complete.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> Best,
> > >>>>>>>>>>>>> Jark
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> On Thu, 26 May 2022 at 11:52, Becket Qin <
> becket.qin@gmail.com>
> > >>>>>>>>>> wrote:
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>> Hi Qingsheng,
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> Thanks for updating the FLIP. A few comments / questions
> below:
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> 1. Is there a reason that we have both "XXXFactory" and
> > >>>>>>>>>> "XXXProvider".
> > >>>>>>>>>>>>>> What is the difference between them? If they are the
> same, can
> > >>>>>>>> we
> > >>>>>>>>>> just
> > >>>>>>>>>>>>> use
> > >>>>>>>>>>>>>> XXXFactory everywhere?
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> 2. Regarding the FullCachingLookupProvider, should the reloading
> > >>>>>>>>>>>>>> policy also be pluggable? Periodic reloading can sometimes be
> > >>>>>>>>>>>>>> tricky in practice. For example, if a user sets 24 hours as the
> > >>>>>>>>>>>>>> cache refresh interval and some nightly batch job is delayed,
> > >>>>>>>>>>>>>> the cache update may still see the stale data.
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> 3. In DefaultLookupCacheFactory, it looks like
> InitialCapacity
> > >>>>>>>>>> should
> > >>>>>>>>>>> be
> > >>>>>>>>>>>>>> removed.
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> 4. The purpose of LookupFunctionProvider#cacheMissingKey()
> > >>>>>>>> seems a
> > >>>>>>>>>>>>> little
> > >>>>>>>>>>>>>> confusing to me. If Optional<LookupCacheFactory>
> > >>>>>>>> getCacheFactory()
> > >>>>>>>>>>>>> returns
> > >>>>>>>>>>>>>> a non-empty factory, doesn't that already indicates the
> > >>>>>>>> framework
> > >>>>>>>>> to
> > >>>>>>>>>>>>> cache
> > >>>>>>>>>>>>>> the missing keys? Also, why is this method returning an
> > >>>>>>>>>>>>> Optional<Boolean>
> > >>>>>>>>>>>>>> instead of boolean?
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> Thanks,
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> Jiangjie (Becket) Qin
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> On Wed, May 25, 2022 at 5:07 PM Qingsheng Ren <
> > >>>>>>>> renqschn@gmail.com
> > >>>>>>>>>>
> > >>>>>>>>>>>>> wrote:
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> Hi Lincoln and Jark,
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> Thanks for the comments! If the community reaches a
> consensus
> > >>>>>>>>> that
> > >>>>>>>>>> we
> > >>>>>>>>>>>>> use
> > >>>>>>>>>>>>>>> SQL hint instead of table options to decide whether to
> use sync
> > >>>>>>>>> or
> > >>>>>>>>>>>>> async
> > >>>>>>>>>>>>>>> mode, it’s indeed not necessary to introduce the
> “lookup.async”
> > >>>>>>>>>>> option.
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> I think it’s a good idea to let the decision of async
> made on
> > >>>>>>>>> query
> > >>>>>>>>>>>>>>> level, which could make better optimization with more
> > >>>>>>>> infomation
> > >>>>>>>>>>>>> gathered
> > >>>>>>>>>>>>>>> by planner. Is there any FLIP describing the issue in
> > >>>>>>>>> FLINK-27625?
> > >>>>>>>>>> I
> > >>>>>>>>>>>>>>> thought FLIP-234 is only proposing adding SQL hint for
> retry on
> > >>>>>>>>>>> missing
> > >>>>>>>>>>>>>>> instead of the entire async mode to be controlled by
> hint.
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> Best regards,
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> Qingsheng
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> On May 25, 2022, at 15:13, Lincoln Lee <
> > >>>>>>>> lincoln.86xy@gmail.com
> > >>>>>>>>>>
> > >>>>>>>>>>>>> wrote:
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> Hi Jark,
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> Thanks for your reply!
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> Currently 'lookup.async' just lies in HBase connector,
> I have
> > >>>>>>>>> no
> > >>>>>>>>>>> idea
> > >>>>>>>>>>>>>>>> whether or when to remove it (we can discuss it in
> another
> > >>>>>>>>> issue
> > >>>>>>>>>>> for
> > >>>>>>>>>>>>> the
> > >>>>>>>>>>>>>>>> HBase connector after FLINK-27625 is done), just not
> add it
> > >>>>>>>>> into
> > >>>>>>>>>> a
> > >>>>>>>>>>>>>>> common
> > >>>>>>>>>>>>>>>> option now.
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> Best,
> > >>>>>>>>>>>>>>>> Lincoln Lee
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> Jark Wu <im...@gmail.com> 于2022年5月24日周二 20:14写道:
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> Hi Lincoln,
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> I have taken a look at FLIP-234, and I agree with you
> that
> > >>>>>>>> the
> > >>>>>>>>>>>>>>> connectors
> > >>>>>>>>>>>>>>>>> can
> > >>>>>>>>>>>>>>>>> provide both async and sync runtime providers
> simultaneously
> > >>>>>>>>>>> instead
> > >>>>>>>>>>>>>>> of one
> > >>>>>>>>>>>>>>>>> of them.
> > >>>>>>>>>>>>>>>>> At that point, "lookup.async" looks redundant. If this
> > >>>>>>>> option
> > >>>>>>>>> is
> > >>>>>>>>>>>>>>> planned to
> > >>>>>>>>>>>>>>>>> be removed
> > >>>>>>>>>>>>>>>>> in the long term, I think it makes sense not to
> introduce it
> > >>>>>>>>> in
> > >>>>>>>>>>> this
> > >>>>>>>>>>>>>>> FLIP.
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> Best,
> > >>>>>>>>>>>>>>>>> Jark
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> On Tue, 24 May 2022 at 11:08, Lincoln Lee <
> > >>>>>>>>>> lincoln.86xy@gmail.com
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>>>> wrote:
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> Hi Qingsheng,
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> Sorry for jumping into the discussion so late. It's a
> good
> > >>>>>>>>> idea
> > >>>>>>>>>>>>> that
> > >>>>>>>>>>>>>>> we
> > >>>>>>>>>>>>>>>>> can
> > >>>>>>>>>>>>>>>>>> have a common table option. I have a minor comments on
> > >>>>>>>>>>>>> 'lookup.async'
> > >>>>>>>>>>>>>>>>> that
> > >>>>>>>>>>>>>>>>>> not make it a common option:
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> The table layer abstracts both sync and async lookup
> > >>>>>>>>>>> capabilities,
> > >>>>>>>>>>>>>>>>>> connectors implementers can choose one or both, in
> the case
> > >>>>>>>>> of
> > >>>>>>>>>>>>>>>>> implementing
> > >>>>>>>>>>>>>>>>>> only one capability(status of the most of existing
> builtin
> > >>>>>>>>>>>>> connectors)
> > >>>>>>>>>>>>>>>>>> 'lookup.async' will not be used.  And when a
> connector has
> > >>>>>>>>> both
> > >>>>>>>>>>>>>>>>>> capabilities, I think this choice is more suitable for
> > >>>>>>>> making
> > >>>>>>>>>>>>>>> decisions
> > >>>>>>>>>>>>>>>>> at
> > >>>>>>>>>>>>>>>>>> the query level, for example, table planner can
> choose the
> > >>>>>>>>>>> physical
> > >>>>>>>>>>>>>>>>>> implementation of async lookup or sync lookup based
> on its
> > >>>>>>>>> cost
> > >>>>>>>>>>>>>>> model, or
> > >>>>>>>>>>>>>>>>>> users can give query hint based on their own better
> > >>>>>>>>>>>>> understanding.  If
> > >>>>>>>>>>>>>>>>>> there is another common table option 'lookup.async',
> it may
> > >>>>>>>>>>> confuse
> > >>>>>>>>>>>>>>> the
> > >>>>>>>>>>>>>>>>>> users in the long run.
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> So, I prefer to leave the 'lookup.async' option in
> private
> > >>>>>>>>>> place
> > >>>>>>>>>>>>> (for
> > >>>>>>>>>>>>>>> the
> > >>>>>>>>>>>>>>>>>> current hbase connector) and not turn it into a common
> > >>>>>>>>> option.
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> WDYT?
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> Best,
> > >>>>>>>>>>>>>>>>>> Lincoln Lee
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> Qingsheng Ren <re...@gmail.com> wrote on Mon, 23 May 2022 at
> 14:54:
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> Hi Alexander,
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> Thanks for the review! We recently updated the FLIP
> and
> > >>>>>>>> you
> > >>>>>>>>>> can
> > >>>>>>>>>>>>> find
> > >>>>>>>>>>>>>>>>>> those
> > >>>>>>>>>>>>>>>>>>> changes from my latest email. Since some terminology has changed,
> > >>>>>>>>>>>>>>>>>>> I'll use the new concepts when replying to your comments.
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> 1. Builder vs ‘of’
> > >>>>>>>>>>>>>>>>>>> I’m OK to use builder pattern if we have additional
> > >>>>>>>> optional
> > >>>>>>>>>>>>>>> parameters
> > >>>>>>>>>>>>>>>>>>> for full caching mode (“rescan” previously). The
> > >>>>>>>>>>>>> schedule-with-delay
> > >>>>>>>>>>>>>>>>> idea
> > >>>>>>>>>>>>>>>>>>> looks reasonable to me, but I think we need to
> redesign
> > >>>>>>>> the
> > >>>>>>>>>>>>> builder
> > >>>>>>>>>>>>>>> API
> > >>>>>>>>>>>>>>>>>> of
> > >>>>>>>>>>>>>>>>>>> full caching to make it more descriptive for
> developers.
> > >>>>>>>>> Would
> > >>>>>>>>>>> you
> > >>>>>>>>>>>>>>> mind
> > >>>>>>>>>>>>>>>>>>> sharing your ideas about the API? For accessing the
> FLIP
> > >>>>>>>>>>> workspace
> > >>>>>>>>>>>>>>> you
> > >>>>>>>>>>>>>>>>>> can
> > >>>>>>>>>>>>>>>>>>> just provide your account ID and ping any PMC member
> > >>>>>>>>> including
> > >>>>>>>>>>>>> Jark.
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> 2. Common table options
> > >>>>>>>>>>>>>>>>>>> We have some discussions these days and propose to
> > >>>>>>>>> introduce 8
> > >>>>>>>>>>>>> common
> > >>>>>>>>>>>>>>>>>>> table options about caching. It has been updated on
> the
> > >>>>>>>>> FLIP.
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> 3. Retries
> > >>>>>>>>>>>>>>>>>>> I think we are on the same page :-)
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> For your additional concerns:
> > >>>>>>>>>>>>>>>>>>> 1) The table option has been updated.
> > >>>>>>>>>>>>>>>>>>> 2) We got “lookup.cache” back for configuring
> whether to
> > >>>>>>>> use
> > >>>>>>>>>>>>> partial
> > >>>>>>>>>>>>>>> or
> > >>>>>>>>>>>>>>>>>>> full caching mode.
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> Best regards,
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> Qingsheng
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>> On May 19, 2022, at 17:25, Александр Смирнов <
> > >>>>>>>>>>>>> smiralexan@gmail.com>
> > >>>>>>>>>>>>>>>>>>> wrote:
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>> Also I have a few additions:
> > >>>>>>>>>>>>>>>>>>>> 1) maybe rename 'lookup.cache.maximum-size' to
> > >>>>>>>>>>>>>>>>>>>> 'lookup.cache.max-rows'? I think it will be clearer that we are
> > >>>>>>>>>>>>>>>>>>>> talking about the number of rows, not about bytes. Plus it fits
> > >>>>>>>>>>>>>>>>>>>> better, considering my optimization with filters.
> > >>>>>>>>>>>>>>>>>>>> 2) How will users enable rescanning? Are we going to
> > >>>>>>>>> separate
> > >>>>>>>>>>>>>>> caching
> > >>>>>>>>>>>>>>>>>>>> and rescanning from the options point of view? Like
> > >>>>>>>>> initially
> > >>>>>>>>>>> we
> > >>>>>>>>>>>>> had
> > >>>>>>>>>>>>>>>>>>>> one option 'lookup.cache' with values LRU / ALL. I
> think
> > >>>>>>>>> now
> > >>>>>>>>>> we
> > >>>>>>>>>>>>> can
> > >>>>>>>>>>>>>>>>>>>> make a boolean option 'lookup.rescan'.
> RescanInterval can
> > >>>>>>>>> be
> > >>>>>>>>>>>>>>>>>>>> 'lookup.rescan.interval', etc.
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>> Best regards,
> > >>>>>>>>>>>>>>>>>>>> Alexander
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>> Thu, 19 May 2022 at 14:50, Александр Смирнов <
> > >>>>>>>>>>>>> smiralexan@gmail.com
> > >>>>>>>>>>>>>>>>>> :
> > >>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>> Hi Qingsheng and Jark,
> > >>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>> 1. Builders vs 'of'
> > >>>>>>>>>>>>>>>>>>>>> I understand that builders are used when we have
> > >>>>>>>> multiple
> > >>>>>>>>>>>>>>>>> parameters.
> > >>>>>>>>>>>>>>>>>>>>> I suggested them because we could add parameters
> later.
> > >>>>>>>> To
> > >>>>>>>>>>>>> prevent
> > >>>>>>>>>>>>>>>>>>>>> Builder for ScanRuntimeProvider from looking
> redundant I
> > >>>>>>>>> can
> > >>>>>>>>>>>>>>> suggest
> > >>>>>>>>>>>>>>>>>>>>> one more config now - "rescanStartTime".
> > >>>>>>>>>>>>>>>>>>>>> It's a time in UTC (LocalTime class) when the first
> > >>>>>>>> reload
> > >>>>>>>>>> of
> > >>>>>>>>>>>>> cache
> > >>>>>>>>>>>>>>>>>>>>> starts. This parameter can be thought of as
> > >>>>>>>> 'initialDelay'
> > >>>>>>>>>>> (diff
> > >>>>>>>>>>>>>>>>>>>>> between current time and rescanStartTime) in method
> > >>>>>>>>>>>>>>>>>>>>> ScheduleExecutorService#scheduleWithFixedDelay [1]
> . It
> > >>>>>>>>> can
> > >>>>>>>>>> be
> > >>>>>>>>>>>>> very
> > >>>>>>>>>>>>>>>>>>>>> useful when the dimension table is updated by some
> other
> > >>>>>>>>>>>>> scheduled
> > >>>>>>>>>>>>>>>>> job
> > >>>>>>>>>>>>>>>>>>>>> at a certain time. Or when the user simply wants a
> > >>>>>>>> second
> > >>>>>>>>>> scan
> > >>>>>>>>>>>>>>>>> (first
> > >>>>>>>>>>>>>>>>>>>>> cache reload) be delayed. This option can be used
> even
> > >>>>>>>>>> without
> > >>>>>>>>>>>>>>>>>>>>> 'rescanInterval' - in this case 'rescanInterval'
> will be
> > >>>>>>>>> one
> > >>>>>>>>>>>>> day.
> > >>>>>>>>>>>>>>>>>>>>> If you are fine with this option, I would be very
> glad
> > >>>>>>>> if
> > >>>>>>>>>> you
> > >>>>>>>>>>>>> would
> > >>>>>>>>>>>>>>>>>>>>> give me access to edit FLIP page, so I could add it
> > >>>>>>>> myself
> > >>>>>>>>>>>>>>>>>>>>>
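The rescanStartTime-to-initialDelay computation described above can be sketched as follows. This is only a minimal illustration using `java.time`; the class and method names are assumptions for this thread, not part of the FLIP:

```java
import java.time.Duration;
import java.time.LocalTime;

public class RescanDelay {
    // Initial delay until the next occurrence of rescanStartTime (UTC),
    // suitable as the initialDelay argument of
    // ScheduledExecutorService#scheduleWithFixedDelay.
    static Duration initialDelay(LocalTime now, LocalTime rescanStartTime) {
        Duration delay = Duration.between(now, rescanStartTime);
        // If the start time has already passed today, schedule for tomorrow.
        return delay.isNegative() ? delay.plusDays(1) : delay;
    }
}
```

With this, the full-cache reload can fire at a fixed wall-clock time each day instead of a fixed offset from job start.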
> > >>>>>>>>>>>>>>>>>>>>> 2. Common table options
> > >>>>>>>>>>>>>>>>>>>>> I also think that FactoryUtil would be overloaded
> by all
> > >>>>>>>>>> cache
> > >>>>>>>>>>>>>>>>>>>>> options. But maybe unify all suggested options,
> not only
> > >>>>>>>>> for
> > >>>>>>>>>>>>>>> default
> > >>>>>>>>>>>>>>>>>>>>> cache? I.e. class 'LookupOptions', that unifies
> default
> > >>>>>>>>>> cache
> > >>>>>>>>>>>>>>>>> options,
> > >>>>>>>>>>>>>>>>>>>>> rescan options, 'async', 'maxRetries'. WDYT?
> > >>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>> 3. Retries
> > >>>>>>>>>>>>>>>>>>>>> I'm fine with suggestion close to
> > >>>>>>>>> RetryUtils#tryTimes(times,
> > >>>>>>>>>>>>> call)
> > >>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>> [1]
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>
> > >>>>>>
> > >>>>>
> https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/ScheduledExecutorService.html#scheduleWithFixedDelay-java.lang.Runnable-long-long-java.util.concurrent.TimeUnit-
> > >>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>> Best regards,
> > >>>>>>>>>>>>>>>>>>>>> Alexander
> > >>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>> Wed, 18 May 2022 at 16:04, Qingsheng Ren <
> > >>>>>>>>>> renqschn@gmail.com
> > >>>>>>>>>>>> :
> > >>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>> Hi Jark and Alexander,
> > >>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>> Thanks for your comments! I’m also OK to introduce
> > >>>>>>>> common
> > >>>>>>>>>>> table
> > >>>>>>>>>>>>>>>>>>> options. I prefer to introduce a new
> > >>>>>>>>> DefaultLookupCacheOptions
> > >>>>>>>>>>>>> class
> > >>>>>>>>>>>>>>>>> for
> > >>>>>>>>>>>>>>>>>>> holding these option definitions because putting all
> > >>>>>>>> options
> > >>>>>>>>>>> into
> > >>>>>>>>>>>>>>>>>>> FactoryUtil would make it a bit ”crowded” and not
> well
> > >>>>>>>>>>>>> categorized.
> > >>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>> FLIP has been updated according to suggestions
> above:
> > >>>>>>>>>>>>>>>>>>>>>> 1. Use static “of” method for constructing
> > >>>>>>>>>>>>> RescanRuntimeProvider
> > >>>>>>>>>>>>>>>>>>> considering both arguments are required.
> > >>>>>>>>>>>>>>>>>>>>>> 2. Introduce new table options matching
> > >>>>>>>>>>>>> DefaultLookupCacheFactory
> > >>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>> Best,
> > >>>>>>>>>>>>>>>>>>>>>> Qingsheng
> > >>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>> On Wed, May 18, 2022 at 2:57 PM Jark Wu <
> > >>>>>>>>> imjark@gmail.com>
> > >>>>>>>>>>>>> wrote:
> > >>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>> Hi Alex,
> > >>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>> 1) retry logic
> > >>>>>>>>>>>>>>>>>>>>>>> I think we can extract some common retry logic
> into
> > >>>>>>>>>>> utilities,
> > >>>>>>>>>>>>>>>>> e.g.
> > >>>>>>>>>>>>>>>>>>> RetryUtils#tryTimes(times, call).
> > >>>>>>>>>>>>>>>>>>>>>>> This seems independent of this FLIP and can be
> reused
> > >>>>>>>> by
> > >>>>>>>>>>>>>>>>> DataStream
> > >>>>>>>>>>>>>>>>>>> users.
> > >>>>>>>>>>>>>>>>>>>>>>> Maybe we can open an issue to discuss this and
> where
> > >>>>>>>> to
> > >>>>>>>>>> put
> > >>>>>>>>>>>>> it.
> > >>>>>>>>>>>>>>>>>>>>>>>
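A common retry utility like the RetryUtils#tryTimes(times, call) suggested above might look roughly like the sketch below; since the utility was only proposed here and never specified, the class name and signature are assumptions:

```java
import java.util.concurrent.Callable;

public final class RetryUtils {
    private RetryUtils() {}

    // Invoke `call` up to `times` times, returning the first successful
    // result and rethrowing the last failure if every attempt fails.
    public static <T> T tryTimes(int times, Callable<T> call) throws Exception {
        Exception last = null;
        for (int attempt = 0; attempt < times; attempt++) {
            try {
                return call.call();
            } catch (Exception e) {
                last = e;
            }
        }
        throw last;
    }
}
```

Connector-specific recovery (e.g. re-opening a connection) would still live inside the `Callable` passed in by the connector.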
> > >>>>>>>>>>>>>>>>>>>>>>> 2) cache ConfigOptions
> > >>>>>>>>>>>>>>>>>>>>>>> I'm fine with defining cache config options in
> the
> > >>>>>>>>>>> framework.
> > >>>>>>>>>>>>>>>>>>>>>>> A candidate place to put is FactoryUtil which
> also
> > >>>>>>>>>> includes
> > >>>>>>>>>>>>>>>>>>> "sink.parallelism", "format" options.
> > >>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>> Best,
> > >>>>>>>>>>>>>>>>>>>>>>> Jark
> > >>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>> On Wed, 18 May 2022 at 13:52, Александр Смирнов <
> > >>>>>>>>>>>>>>>>>> smiralexan@gmail.com>
> > >>>>>>>>>>>>>>>>>>> wrote:
> > >>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>> Hi Qingsheng,
> > >>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>> Thank you for considering my comments.
> > >>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>> there might be custom logic before making
> retry,
> > >>>>>>>> such
> > >>>>>>>>> as
> > >>>>>>>>>>>>>>>>>>> re-establish the connection
> > >>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>> Yes, I understand that. I meant that such logic
> can
> > >>>>>>>> be
> > >>>>>>>>>>>>> placed in
> > >>>>>>>>>>>>>>>>> a
> > >>>>>>>>>>>>>>>>>>>>>>>> separate function, that can be implemented by
> > >>>>>>>>> connectors.
> > >>>>>>>>>>>>> Just
> > >>>>>>>>>>>>>>>>>> moving
> > >>>>>>>>>>>>>>>>>>>>>>>> the retry logic would make connector's
> LookupFunction
> > >>>>>>>>>> more
> > >>>>>>>>>>>>>>>>> concise
> > >>>>>>>>>>>>>>>>>> +
> > >>>>>>>>>>>>>>>>>>>>>>>> avoid duplicate code. However, it's a minor
> change.
> > >>>>>>>> The
> > >>>>>>>>>>>>> decision
> > >>>>>>>>>>>>>>>>> is
> > >>>>>>>>>>>>>>>>>>> up
> > >>>>>>>>>>>>>>>>>>>>>>>> to you.
> > >>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>> We decided not to provide common DDL options and to let
> > >>>>>>>>>>>>>>>>>>>>>>>>> developers define their own options per connector, as we do now.
> > >>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>> What is the reason for that? One of the main
> goals of
> > >>>>>>>>>> this
> > >>>>>>>>>>>>> FLIP
> > >>>>>>>>>>>>>>>>> was
> > >>>>>>>>>>>>>>>>>>> to
> > >>>>>>>>>>>>>>>>>>>>>>>> unify the configs, wasn't it? I understand that
> > >>>>>>>> current
> > >>>>>>>>>>> cache
> > >>>>>>>>>>>>>>>>>> design
> > >>>>>>>>>>>>>>>>>>>>>>>> doesn't depend on ConfigOptions, like was
> before. But
> > >>>>>>>>>> still
> > >>>>>>>>>>>>> we
> > >>>>>>>>>>>>>>>>> can
> > >>>>>>>>>>>>>>>>>>> put
> > >>>>>>>>>>>>>>>>>>>>>>>> these options into the framework, so connectors
> can
> > >>>>>>>>> reuse
> > >>>>>>>>>>>>> them
> > >>>>>>>>>>>>>>>>> and
> > >>>>>>>>>>>>>>>>>>>>>>>> avoid code duplication, and, what is more
> > >>>>>>>> significant,
> > >>>>>>>>>>> avoid
> > >>>>>>>>>>>>>>>>>> possible
> > >>>>>>>>>>>>>>>>>>>>>>>> different options naming. This moment can be
> pointed
> > >>>>>>>>> out
> > >>>>>>>>>> in
> > >>>>>>>>>>>>>>>>>>>>>>>> documentation for connector developers.
> > >>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>> Best regards,
> > >>>>>>>>>>>>>>>>>>>>>>>> Alexander
> > >>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>> Tue, 17 May 2022 at 17:11, Qingsheng Ren <
> > >>>>>>>>>>>>> renqschn@gmail.com>:
> > >>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>> Hi Alexander,
> > >>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>> Thanks for the review and glad to see we are
> on the
> > >>>>>>>>> same
> > >>>>>>>>>>>>> page!
> > >>>>>>>>>>>>>>> I
> > >>>>>>>>>>>>>>>>>>> think you forgot to cc the dev mailing list so I’m
> also
> > >>>>>>>>>> quoting
> > >>>>>>>>>>>>> your
> > >>>>>>>>>>>>>>>>>> reply
> > >>>>>>>>>>>>>>>>>>> under this email.
> > >>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>> We can add 'maxRetryTimes' option into this
> class
> > >>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>> In my opinion the retry logic should be
> implemented
> > >>>>>>>> in
> > >>>>>>>>>>>>> lookup()
> > >>>>>>>>>>>>>>>>>>> instead of in LookupFunction#eval(). Retrying is only
> > >>>>>>>>>> meaningful
> > >>>>>>>>>>>>>>> under
> > >>>>>>>>>>>>>>>>>> some
> > >>>>>>>>>>>>>>>>>>> specific retriable failures, and there might be
> custom
> > >>>>>>>> logic
> > >>>>>>>>>>>>> before
> > >>>>>>>>>>>>>>>>>> making
> > >>>>>>>>>>>>>>>>>>> retry, such as re-establish the connection
> > >>>>>>>>>>>>> (JdbcRowDataLookupFunction
> > >>>>>>>>>>>>>>>>> is
> > >>>>>>>>>>>>>>>>>> an
> > >>>>>>>>>>>>>>>>>>> example), so it's more handy to leave it to the
> connector.
> > >>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>> I don't see DDL options, that were in previous
> > >>>>>>>>> version
> > >>>>>>>>>> of
> > >>>>>>>>>>>>>>> FLIP.
> > >>>>>>>>>>>>>>>>>> Do
> > >>>>>>>>>>>>>>>>>>> you have any special plans for them?
> > >>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>> We decided not to provide common DDL options and to let
> > >>>>>>>>>>>>>>>>>>>>>>>>> developers define their own options per connector, as we do now.
> > >>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>> The rest of comments sound great and I’ll
> update the
> > >>>>>>>>>> FLIP.
> > >>>>>>>>>>>>> Hope
> > >>>>>>>>>>>>>>>>> we
> > >>>>>>>>>>>>>>>>>>> can finalize our proposal soon!
> > >>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>> Best,
> > >>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>> Qingsheng
> > >>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>> On May 17, 2022, at 13:46, Александр Смирнов <
> > >>>>>>>>>>>>>>>>>> smiralexan@gmail.com>
> > >>>>>>>>>>>>>>>>>>> wrote:
> > >>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>> Hi Qingsheng and devs!
> > >>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>> I like the overall design of updated FLIP,
> however
> > >>>>>>>> I
> > >>>>>>>>>> have
> > >>>>>>>>>>>>>>>>> several
> > >>>>>>>>>>>>>>>>>>>>>>>>>> suggestions and questions.
> > >>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>> 1) Introducing LookupFunction as a subclass of
> > >>>>>>>>>>>>> TableFunction
> > >>>>>>>>>>>>>>>>> is a
> > >>>>>>>>>>>>>>>>>>> good
> > >>>>>>>>>>>>>>>>>>>>>>>>>> idea. We can add 'maxRetryTimes' option into
> this
> > >>>>>>>>>> class.
> > >>>>>>>>>>>>>>> 'eval'
> > >>>>>>>>>>>>>>>>>>> method
> > >>>>>>>>>>>>>>>>>>>>>>>>>> of new LookupFunction is great for this
> purpose.
> > >>>>>>>> The
> > >>>>>>>>>> same
> > >>>>>>>>>>>>> is
> > >>>>>>>>>>>>>>>>> for
> > >>>>>>>>>>>>>>>>>>>>>>>>>> 'async' case.
> > >>>>>>>>>>>>>>>>>>>>>>>>>>
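The idea in point 1 -- a framework-level LookupFunction whose eval() wraps a connector-implemented lookup() with a maxRetryTimes loop -- can be sketched as below. All names and the String-based signature are simplifications for illustration (the actual FLIP interfaces operate on RowData), not the real API:

```java
import java.util.List;

public abstract class RetryingLookupFunction {
    private final int maxRetryTimes;

    protected RetryingLookupFunction(int maxRetryTimes) {
        this.maxRetryTimes = maxRetryTimes;
    }

    // Connector-specific lookup; custom recovery such as re-establishing
    // a connection can still happen inside this method before throwing.
    protected abstract List<String> lookup(String key) throws Exception;

    // Framework-side eval() owns the retry loop, so every connector gets
    // 'maxRetryTimes' behavior without duplicating it.
    public List<String> eval(String key) {
        Exception last = null;
        for (int attempt = 0; attempt <= maxRetryTimes; attempt++) {
            try {
                return lookup(key);
            } catch (Exception e) {
                last = e;
            }
        }
        throw new RuntimeException("Lookup failed after retries", last);
    }
}
```

Whether the loop belongs in the framework or in the connector is exactly the point debated later in this thread.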
> > >>>>>>>>>>>>>>>>>>>>>>>>>> 2) There might be other configs in future,
> such as
> > >>>>>>>>>>>>>>>>>>> 'cacheMissingKey'
> > >>>>>>>>>>>>>>>>>>>>>>>>>> in LookupFunctionProvider or 'rescanInterval'
> in
> > >>>>>>>>>>>>>>>>>>> ScanRuntimeProvider.
> > >>>>>>>>>>>>>>>>>>>>>>>>>> Maybe use Builder pattern in
> LookupFunctionProvider
> > >>>>>>>>> and
> > >>>>>>>>>>>>>>>>>>>>>>>>>> RescanRuntimeProvider for more flexibility
> (use one
> > >>>>>>>>>>> 'build'
> > >>>>>>>>>>>>>>>>>> method
> > >>>>>>>>>>>>>>>>>>>>>>>>>> instead of many 'of' methods in future)?
> > >>>>>>>>>>>>>>>>>>>>>>>>>>
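The builder-vs-'of' trade-off from point 2 can be shown generically. This sketch is purely illustrative (the names are assumptions, not the FLIP's API); it also encodes the convention mentioned later in the thread that the interval defaults to one day when only a start time is given:

```java
import java.time.Duration;
import java.time.LocalTime;

public class RescanOptions {
    final Duration rescanInterval;
    final LocalTime rescanStartTime; // optional; null when unset

    private RescanOptions(Duration interval, LocalTime startTime) {
        this.rescanInterval = interval;
        this.rescanStartTime = startTime;
    }

    // Static factory: fine while every parameter is required.
    public static RescanOptions of(Duration interval) {
        return new RescanOptions(interval, null);
    }

    // Builder: the single build() method stays stable as optional
    // parameters (like rescanStartTime) are added later.
    public static Builder builder() {
        return new Builder();
    }

    public static class Builder {
        private Duration interval = Duration.ofDays(1); // default: one day
        private LocalTime startTime;

        public Builder rescanInterval(Duration interval) {
            this.interval = interval;
            return this;
        }

        public Builder rescanStartTime(LocalTime startTime) {
            this.startTime = startTime;
            return this;
        }

        public RescanOptions build() {
            return new RescanOptions(interval, startTime);
        }
    }
}
```

Adding a new optional knob to the builder is backward compatible, whereas each new combination of required/optional arguments would need another overloaded 'of' method.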
> > >>>>>>>>>>>>>>>>>>>>>>>>>> 3) What are the plans for existing
> > >>>>>>>>>> TableFunctionProvider
> > >>>>>>>>>>>>> and
> > >>>>>>>>>>>>>>>>>>>>>>>>>> AsyncTableFunctionProvider? I think they
> should be
> > >>>>>>>>>>>>> deprecated.
> > >>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>> 4) Am I right that the current design does not
> > >>>>>>>> assume
> > >>>>>>>>>>>>> usage of
> > >>>>>>>>>>>>>>>>>>>>>>>>>> user-provided LookupCache in re-scanning? In
> this
> > >>>>>>>>> case,
> > >>>>>>>>>>> it
> > >>>>>>>>>>>>> is
> > >>>>>>>>>>>>>>>>> not
> > >>>>>>>>>>>>>>>>>>> very
> > >>>>>>>>>>>>>>>>>>>>>>>>>> clear why do we need methods such as
> 'invalidate'
> > >>>>>>>> or
> > >>>>>>>>>>>>> 'putAll'
> > >>>>>>>>>>>>>>>>> in
> > >>>>>>>>>>>>>>>>>>>>>>>>>> LookupCache.
> > >>>>>>>>>>>>>>>>>>>>>>>>>>
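For reference, a minimal shape of the LookupCache interface under discussion, with a trivial map-backed implementation. Only the method names 'invalidate' and 'putAll' come from this thread; the rest (String-keyed signatures, the implementation) are assumptions for illustration:

```java
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;

interface LookupCache {
    Collection<String> getIfPresent(String key);

    void put(String key, Collection<String> rows);

    // 'invalidate' is mainly useful for partial (LRU-style) caching,
    // where individual entries expire or get evicted...
    void invalidate(String key);

    // ...while 'putAll' is handy for full-cache (re-scan) reloads that
    // replace or bulk-load many entries at once.
    void putAll(Map<String, Collection<String>> rows);
}

class MapLookupCache implements LookupCache {
    private final Map<String, Collection<String>> store = new HashMap<>();

    public Collection<String> getIfPresent(String key) { return store.get(key); }

    public void put(String key, Collection<String> rows) { store.put(key, rows); }

    public void invalidate(String key) { store.remove(key); }

    public void putAll(Map<String, Collection<String>> rows) { store.putAll(rows); }
}
```

Seen this way, the question above is whether the re-scanning path should go through the same interface at all, since it only ever needs the bulk-load half.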
> > >>>>>>>>>>>>>>>>>>>>>>>>>> 5) I don't see DDL options, that were in
> previous
> > >>>>>>>>>> version
> > >>>>>>>>>>>>> of
> > >>>>>>>>>>>>>>>>>> FLIP.
> > >>>>>>>>>>>>>>>>>>> Do
> > >>>>>>>>>>>>>>>>>>>>>>>>>> you have any special plans for them?
> > >>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>> If you don't mind, I would be glad to be able
> to
> > >>>>>>>> make
> > >>>>>>>>>>> small
> > >>>>>>>>>>>>>>>>>>>>>>>>>> adjustments to the FLIP document too. I think
> it's
> > >>>>>>>>>> worth
> > >>>>>>>>>>>>>>>>>> mentioning
> > >>>>>>>>>>>>>>>>>>>>>>>>>> about what exactly optimizations are planning
> in
> > >>>>>>>> the
> > >>>>>>>>>>>>> future.
> > >>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>> Best regards,
> > >>>>>>>>>>>>>>>>>>>>>>>>>> Smirnov Alexander
> > >>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>> Fri, 13 May 2022 at 20:27, Qingsheng Ren <
> > >>>>>>>>>>>>> renqschn@gmail.com
> > >>>>>>>>>>>>>>>>>> :
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>> Hi Alexander and devs,
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>> Thank you very much for the in-depth
> discussion!
> > >>>>>>>> As
> > >>>>>>>>>> Jark
> > >>>>>>>>>>>>>>>>>>> mentioned we were inspired by Alexander's idea and
> made a
> > >>>>>>>>>>>>> refactor on
> > >>>>>>>>>>>>>>>>> our
> > >>>>>>>>>>>>>>>>>>> design. FLIP-221 [1] has been updated to reflect our
> > >>>>>>>> design
> > >>>>>>>>>> now
> > >>>>>>>>>>>>> and
> > >>>>>>>>>>>>>>> we
> > >>>>>>>>>>>>>>>>>> are
> > >>>>>>>>>>>>>>>>>>> happy to hear more suggestions from you!
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>> Compared to the previous design:
> > >>>>>>>>>>>>>>>>>>>>>>>>>>> 1. The lookup cache serves at table runtime
> level
> > >>>>>>>>> and
> > >>>>>>>>>> is
> > >>>>>>>>>>>>>>>>>>> integrated as a component of LookupJoinRunner as
> discussed
> > >>>>>>>>>>>>>>> previously.
> > >>>>>>>>>>>>>>>>>>>>>>>>>>> 2. Interfaces are renamed and re-designed to
> > >>>>>>>> reflect
> > >>>>>>>>>> the
> > >>>>>>>>>>>>> new
> > >>>>>>>>>>>>>>>>>>> design.
> > >>>>>>>>>>>>>>>>>>>>>>>>>>> 3. We separate the all-caching case
> individually
> > >>>>>>>> and
> > >>>>>>>>>>>>>>>>> introduce a
> > >>>>>>>>>>>>>>>>>>> new RescanRuntimeProvider to reuse the ability of
> > >>>>>>>> scanning.
> > >>>>>>>>> We
> > >>>>>>>>>>> are
> > >>>>>>>>>>>>>>>>>> planning
> > >>>>>>>>>>>>>>>>>>> to support SourceFunction / InputFormat for now
> > >>>>>>>> considering
> > >>>>>>>>>> the
> > >>>>>>>>>>>>>>>>>> complexity
> > >>>>>>>>>>>>>>>>>>> of FLIP-27 Source API.
> > >>>>>>>>>>>>>>>>>>>>>>>>>>> 4. A new interface LookupFunction is
> introduced to
> > >>>>>>>>>> make
> > >>>>>>>>>>>>> the
> > >>>>>>>>>>>>>>>>>>> semantic of lookup more straightforward for
> developers.
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>> For replying to Alexander:
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>> However I'm a little confused whether
> InputFormat
> > >>>>>>>>> is
> > >>>>>>>>>>>>>>>>> deprecated
> > >>>>>>>>>>>>>>>>>>> or not. Am I right that it will be so in the future,
> but
> > >>>>>>>>>>> currently
> > >>>>>>>>>>>>>>> it's
> > >>>>>>>>>>>>>>>>>> not?
> > >>>>>>>>>>>>>>>>>>>>>>>>>>> Yes you are right. InputFormat is not
> deprecated
> > >>>>>>>> for
> > >>>>>>>>>>> now.
> > >>>>>>>>>>>>> I
> > >>>>>>>>>>>>>>>>>> think
> > >>>>>>>>>>>>>>>>>>> it will be deprecated in the future but we don't
> have a
> > >>>>>>>>> clear
> > >>>>>>>>>>> plan
> > >>>>>>>>>>>>>>> for
> > >>>>>>>>>>>>>>>>>> that.
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks again for the discussion on this FLIP
> and
> > >>>>>>>>>> looking
> > >>>>>>>>>>>>>>>>> forward
> > >>>>>>>>>>>>>>>>>>> to cooperating with you after we finalize the design
> and
> > >>>>>>>>>>>>> interfaces!
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>> [1]
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>
> > >>>>>>
> > >>>>>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-221+Abstraction+for+lookup+source+cache+and+metric
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>> Best regards,
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>> Qingsheng
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>> On Fri, May 13, 2022 at 12:12 AM Александр
> > >>>>>>>> Смирнов <
> > >>>>>>>>>>>>>>>>>>> smiralexan@gmail.com> wrote:
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi Jark, Qingsheng and Leonard!
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>> Glad to see that we came to a consensus on
> almost
> > >>>>>>>>> all
> > >>>>>>>>>>>>>>> points!
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>> However I'm a little confused whether
> InputFormat
> > >>>>>>>>> is
> > >>>>>>>>>>>>>>>>> deprecated
> > >>>>>>>>>>>>>>>>>>> or
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>> not. Am I right that it will be so in the
> future,
> > >>>>>>>>> but
> > >>>>>>>>>>>>>>>>> currently
> > >>>>>>>>>>>>>>>>>>> it's
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>> not? Actually I also think that for the
> first
> > >>>>>>>>> version
> > >>>>>>>>>>>>> it's
> > >>>>>>>>>>>>>>> OK
> > >>>>>>>>>>>>>>>>>> to
> > >>>>>>>>>>>>>>>>>>> use
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>> InputFormat in ALL cache realization,
> because
> > >>>>>>>>>>> supporting
> > >>>>>>>>>>>>>>>>> rescan
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>> ability seems like a very distant prospect.
> But
> > >>>>>>>> for
> > >>>>>>>>>>> this
> > >>>>>>>>>>>>>>>>>>> decision we
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>> need a consensus among all discussion
> > >>>>>>>> participants.
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>> In general, I don't have something to argue
> with
> > >>>>>>>>> your
> > >>>>>>>>>>>>>>>>>>> statements. All
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>> of them correspond my ideas. Looking ahead,
> it
> > >>>>>>>>> would
> > >>>>>>>>>> be
> > >>>>>>>>>>>>> nice
> > >>>>>>>>>>>>>>>>> to
> > >>>>>>>>>>>>>>>>>>> work
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>> on this FLIP cooperatively. I've already
> done a
> > >>>>>>>> lot
> > >>>>>>>>>> of
> > >>>>>>>>>>>>> work
> > >>>>>>>>>>>>>>>>> on
> > >>>>>>>>>>>>>>>>>>> lookup
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>> join caching with realization very close to
> the
> > >>>>>>>> one
> > >>>>>>>>>> we
> > >>>>>>>>>>>>> are
> > >>>>>>>>>>>>>>>>>>> discussing,
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>> and want to share the results of this work.
> > >>>>>>>> Anyway
> > >>>>>>>>>>>>> looking
> > >>>>>>>>>>>>>>>>>>> forward for
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>> the FLIP update!
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>> Best regards,
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>> Smirnov Alexander
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>> Thu, 12 May 2022 at 17:38, Jark Wu <
> > >>>>>>>>>> imjark@gmail.com
> > >>>>>>>>>>>> :
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi Alex,
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for summarizing your points.
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> In the past week, Qingsheng, Leonard, and
> I have
> > >>>>>>>>>>>>> discussed
> > >>>>>>>>>>>>>>>>> it
> > >>>>>>>>>>>>>>>>>>> several times
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> and we have totally refactored the design.
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> I'm glad to say we have reached a
> consensus on
> > >>>>>>>>> many
> > >>>>>>>>>> of
> > >>>>>>>>>>>>> your
> > >>>>>>>>>>>>>>>>>>> points!
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> Qingsheng is still working on updating the
> > >>>>>>>> design
> > >>>>>>>>>> docs
> > >>>>>>>>>>>>> and
> > >>>>>>>>>>>>>>>>>>> maybe can be
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> available in the next few days.
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> I will share some conclusions from our
> > >>>>>>>>> discussions:
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> 1) we have refactored the design towards to
> > >>>>>>>> "cache
> > >>>>>>>>>> in
> > >>>>>>>>>>>>>>>>>>> framework" way.
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> 2) a "LookupCache" interface for users to
> > >>>>>>>>> customize
> > >>>>>>>>>>> and
> > >>>>>>>>>>>>> a
> > >>>>>>>>>>>>>>>>>>> default
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> implementation with builder for users to
> > >>>>>>>> easy-use.
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> This can both make it possible to both have
> > >>>>>>>>>>> flexibility
> > >>>>>>>>>>>>> and
> > >>>>>>>>>>>>>>>>>>> conciseness.
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> 3) Filter pushdown is important for ALL
> and LRU
> > >>>>>>>>>> lookup
> > >>>>>>>>>>>>>>>>> cache,
> > >>>>>>>>>>>>>>>>>>> esp reducing
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> IO.
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> Filter pushdown should be the final state
> and
> > >>>>>>>> the
> > >>>>>>>>>>>>> unified
> > >>>>>>>>>>>>>>>>> way
> > >>>>>>>>>>>>>>>>>>> to both
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> support pruning ALL cache and LRU cache,
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> so I think we should make effort in this
> > >>>>>>>>> direction.
> > >>>>>>>>>> If
> > >>>>>>>>>>>>> we
> > >>>>>>>>>>>>>>>>> need
> > >>>>>>>>>>>>>>>>>>> to support
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> filter pushdown for ALL cache anyway, why
> not
> > >>>>>>>> use
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> it for LRU cache as well? Either way, as we
> > >>>>>>>> decide
> > >>>>>>>>>> to
> > >>>>>>>>>>>>>>>>>> implement
> > >>>>>>>>>>>>>>>>>>> the cache
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> in the framework, we have the chance to
> support
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> filter on cache anytime. This is an
> optimization
> > >>>>>>>>> and
> > >>>>>>>>>>> it
> > >>>>>>>>>>>>>>>>>> doesn't
> > >>>>>>>>>>>>>>>>>>> affect the
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> public API. I think we can create a JIRA
> issue
> > >>>>>>>> to
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> discuss it when the FLIP is accepted.
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> 4) The idea to support ALL cache is
> similar to
> > >>>>>>>>> your
> > >>>>>>>>>>>>>>>>> proposal.
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> In the first version, we will only support
> > >>>>>>>>>>> InputFormat,
> > >>>>>>>>>>>>>>>>>>> SourceFunction for
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> cache all (invoke InputFormat in join
> operator).
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> For FLIP-27 source, we need to join a true
> > >>>>>>>> source
> > >>>>>>>>>>>>> operator
> > >>>>>>>>>>>>>>>>>>> instead of
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> calling it embedded in the join operator.
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> However, this needs another FLIP to
> support the
> > >>>>>>>>>>> re-scan
> > >>>>>>>>>>>>>>>>>> ability
> > >>>>>>>>>>>>>>>>>>> for FLIP-27
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> Source, and this can be a large work.
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> In order to not block this issue, we can
> put the
> > >>>>>>>>>>> effort
> > >>>>>>>>>>>>> of
> > >>>>>>>>>>>>>>>>>>> FLIP-27 source
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> integration into future work and integrate
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> InputFormat&SourceFunction for now.
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> I think it's fine to use
> > >>>>>>>>> InputFormat&SourceFunction,
> > >>>>>>>>>>> as
> > >>>>>>>>>>>>>>> they
> > >>>>>>>>>>>>>>>>>>> are not
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> deprecated, otherwise, we have to introduce
> > >>>>>>>>> another
> > >>>>>>>>>>>>>>> function
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> similar to them which is meaningless. We
> need to
> > >>>>>>>>>> plan
> > >>>>>>>>>>>>>>>>> FLIP-27
> > >>>>>>>>>>>>>>>>>>> source
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> integration ASAP before InputFormat &
> > >>>>>>>>> SourceFunction
> > >>>>>>>>>>> are
> > >>>>>>>>>>>>>>>>>>> deprecated.
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> Best,
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> Jark
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Thu, 12 May 2022 at 15:46, Александр
> Смирнов
> > >>>>>>>> <
> > >>>>>>>>>>>>>>>>>>> smiralexan@gmail.com>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> wrote:
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi Martijn!
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Got it. Therefore, the realization with
> > >>>>>>>>> InputFormat
> > >>>>>>>>>>> is
> > >>>>>>>>>>>>> not
> > >>>>>>>>>>>>>>>>>>> considered.
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for clearing that up!
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Best regards,
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Smirnov Alexander
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>> Thu, 12 May 2022 at 14:23, Martijn Visser <
> Visser <
> > >>>>>>>>>>>>>>>>>>> martijn@ververica.com>:
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
Hi,

With regards to:

> But if there are plans to refactor all connectors to FLIP-27

Yes, FLIP-27 is the target for all connectors. The old interfaces will
be deprecated and connectors will either be refactored to use the new
ones or dropped.

The caching should work for connectors that are using FLIP-27
interfaces; we should not introduce new features for old interfaces.

Best regards,

Martijn

On Thu, 12 May 2022 at 06:19, Александр Смирнов <smiralexan@gmail.com> wrote:
Hi Jark!

Sorry for the late response. I would like to make some comments and
clarify my points.

1) I agree with your first statement. I think we can achieve both
advantages this way: put the Cache interface in flink-table-common,
but have its implementations in flink-table-runtime. Then, if a
connector developer wants to use the existing cache strategies and
their implementations, he can just pass a lookupConfig to the planner;
but if he wants to have his own cache implementation in his
TableFunction, he can use the existing interface for this purpose (we
can explicitly point this out in the documentation). In this way all
configs and metrics will be unified. WDYT?
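For illustration, a minimal sketch of what such a split could look like. None of these names are the actual FLIP-221 API; `LookupCache` and `LruLookupCache` are hypothetical stand-ins for an interface living in flink-table-common and a default count-bounded LRU implementation living in flink-table-runtime:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical cache interface, as it might live in flink-table-common.
interface LookupCache<K, V> {
    V getIfPresent(K key);
    void put(K key, V value);
    int size();
}

// Hypothetical default implementation, as it might live in
// flink-table-runtime: an LRU cache bounded by entry count.
final class LruLookupCache<K, V> implements LookupCache<K, V> {
    private final Map<K, V> map;

    LruLookupCache(int maxRows) {
        // accessOrder=true: iteration order is least-recently-used first,
        // and removeEldestEntry evicts once the bound is exceeded.
        this.map = new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                return size() > maxRows;
            }
        };
    }

    @Override public V getIfPresent(K key) { return map.get(key); }
    @Override public void put(K key, V value) { map.put(key, value); }
    @Override public int size() { return map.size(); }
}
```

A connector that is happy with the built-in strategies would just hand its cache options to the planner; one that needs custom behavior would implement the interface itself inside its TableFunction.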

> If a filter can prune 90% of data in the cache, we will have 90% of
> lookup requests that can never be cached

2) Let me clarify the logic of the filter optimization in the case of
an LRU cache. It looks like Cache<RowData, Collection<RowData>>. Here
we always store the response of the dimension table in the cache, even
after applying the calc function. I.e. if there are no rows left after
applying the filters to the result of the 'eval' method of
TableFunction, we store an empty list under the lookup keys. Therefore
the cache line will still be filled, but will require much less memory
(in bytes). I.e. we don't completely filter out the keys whose result
was pruned, but we significantly reduce the memory required to store
that result. If the user knows about this behavior, he can increase
the 'max-rows' option before the start of the job. But actually I came
up with the idea that we can do this automatically by using the
'maximumWeight' and 'weigher' methods of the Guava cache [1]. The
weight can be the size of the collection of rows (the cache value).
Therefore the cache can automatically fit many more records than
before.
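As a standalone illustration of the weigher idea: bound the cache by total cached rows instead of by entry count, so keys whose result was pruned to an empty list cost almost nothing. Guava itself is deliberately left out here so the sketch has no dependencies; with Guava this would roughly be `CacheBuilder.newBuilder().maximumWeight(maxRows).weigher((k, v) -> v.size()).build()`. All names below are illustrative, and giving empty results a weight of 1 (rather than Guava's 0) is a deliberate tweak so the map cannot grow without bound.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Weight-bounded LRU cache: the bound counts cached rows, not entries,
// so fully filtered (empty) lookup results are nearly free to keep.
final class WeightedRowCache<K> {
    private final long maxWeight;
    private long currentWeight = 0;
    private final LinkedHashMap<K, List<Object[]>> map =
            new LinkedHashMap<>(16, 0.75f, true); // access-order = LRU

    WeightedRowCache(long maxWeight) { this.maxWeight = maxWeight; }

    static long weigh(List<Object[]> rows) {
        return Math.max(1, rows.size()); // empty results still weigh 1
    }

    List<Object[]> getIfPresent(K key) { return map.get(key); }

    void put(K key, List<Object[]> rows) {
        List<Object[]> old = map.put(key, rows);
        currentWeight += weigh(rows) - (old == null ? 0 : weigh(old));
        // Evict least-recently-used entries until under the weight bound.
        Iterator<Map.Entry<K, List<Object[]>>> it = map.entrySet().iterator();
        while (currentWeight > maxWeight && it.hasNext()) {
            Map.Entry<K, List<Object[]>> eldest = it.next();
            if (eldest.getKey().equals(key)) continue; // keep the fresh entry
            currentWeight -= weigh(eldest.getValue());
            it.remove();
        }
    }

    int entryCount() { return map.size(); }
}
```

With this weighting, a 'max-rows'-style bound automatically stretches to hold many more keys when most of their results are pruned by filters.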

> Flink SQL has provided a standard way to do filters and projects
> pushdown, i.e., SupportsFilterPushDown and
> SupportsProjectionPushDown. Jdbc/hive/HBase haven't implemented the
> interfaces, doesn't mean it's hard to implement.

It's debatable how difficult it will be to implement filter pushdown.
But I think the fact that currently there is no database connector
with filter pushdown at least means that this feature won't be
supported in connectors soon. Moreover, if we talk about other
connectors (outside the Flink repo), their databases might not support
all Flink filters (or not support filters at all). I think users are
interested in having the cache filter optimization independently of
support for other features and of solving more complex problems (or
problems unsolvable at all).
3) I agree with your third statement. Actually, in our internal
version I also tried to unify the logic of scanning and reloading data
from connectors. But unfortunately, I didn't find a way to unify the
logic of all ScanRuntimeProviders (InputFormat, SourceFunction,
Source, ...) and reuse it for reloading the ALL cache. As a result I
settled on using InputFormat, because it was used for scanning in all
lookup connectors. (I didn't know that there are plans to deprecate
InputFormat in favor of the FLIP-27 Source.) IMO, using the FLIP-27
source for ALL caching is not a good idea, because this source was
designed to work in a distributed environment (SplitEnumerator on the
JobManager and SourceReaders on TaskManagers), not inside one operator
(the lookup join operator in our case). There is not even a direct way
to pass splits from the SplitEnumerator to a SourceReader (this logic
works through SplitEnumeratorContext, which requires
OperatorCoordinator.SubtaskGateway to send AddSplitEvents). Using
InputFormat for the ALL cache seems much clearer and easier. But if
there are plans to refactor all connectors to FLIP-27, I have the
following idea: maybe we can give up the lookup join ALL cache in
favor of a simple join with multiple scans of the batch source? The
point is that the only difference between a lookup join with ALL cache
and a simple join with a batch source is that in the first case
scanning is performed multiple times, and in between the state (cache)
is cleared (correct me if I'm wrong). So what if we extend the
functionality of the simple join to support state reloading, and
extend the functionality of the batch source to be scanned multiple
times (this should be easy with the new FLIP-27 source, which unifies
streaming/batch reading: we would only need to change the
SplitEnumerator so that it passes the splits again after some TTL)?
WDYT? I must say that this looks like a long-term goal and would make
the scope of this FLIP even larger than you said. Maybe we can limit
ourselves to a simpler solution for now (InputFormats).
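The reload behavior under discussion can be sketched in isolation. The snippet below is not any proposed Flink API; it just models an ALL cache whose contents are rebuilt from a full scan (an arbitrary Supplier standing in for an InputFormat read) after a TTL has elapsed, which is the part that would have to be mapped onto whichever scan abstraction survives:

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

// Model of an ALL cache: the whole dimension table is loaded into memory
// and replaced wholesale after a TTL. The Supplier stands in for a full
// scan of the source (e.g. via InputFormat). All names are illustrative.
final class AllCache<K, V> {
    private final Supplier<Map<K, V>> fullScan;
    private final long ttlMillis;
    private volatile Map<K, V> snapshot;
    private volatile long loadedAt;

    AllCache(Supplier<Map<K, V>> fullScan, long ttlMillis) {
        this.fullScan = fullScan;
        this.ttlMillis = ttlMillis;
        reload();
    }

    private void reload() {
        // Build the new snapshot first, then swap: lookups never observe
        // a half-cleared cache.
        snapshot = Collections.unmodifiableMap(new HashMap<>(fullScan.get()));
        loadedAt = System.currentTimeMillis();
    }

    V lookup(K key) {
        if (System.currentTimeMillis() - loadedAt >= ttlMillis) {
            reload(); // in a real operator this would run on a timer thread
        }
        return snapshot.get(key);
    }
}
```

The "simple join with repeated scans" alternative would move exactly this swap-on-TTL step out of the operator and into the source side (a SplitEnumerator re-emitting its splits after the TTL).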
So to sum up, my points are these:
1) There is a way to make both concise and flexible interfaces for
caching in lookup join.
2) The cache filter optimization is important both in LRU and ALL
caches.
3) It is unclear when filter pushdown will be supported in Flink
connectors, some of the connectors might not have the opportunity to
support filter pushdown, and as far as I know, currently filter
pushdown works only for scanning (not lookup). So the cache filters +
projections optimization should be independent from other features.
4) ALL cache realization is a complex topic that involves multiple
aspects of how Flink is developing. Moving from InputFormat to the
FLIP-27 Source would make the ALL cache realization really complex and
unclear, so maybe instead of that we can extend the functionality of
the simple join, or keep InputFormat in the case of the lookup join
ALL cache?

Best regards,
Smirnov Alexander

[1]
https://guava.dev/releases/18.0/api/docs/com/google/common/cache/CacheBuilder.html#weigher(com.google.common.cache.Weigher)

On Thu, 5 May 2022 at 20:34, Jark Wu <imjark@gmail.com> wrote:
It's great to see the active discussion! I want to share my ideas:

1) implement the cache in the framework vs. in the connector base
I don't have a strong opinion on this. Both ways should work (e.g.,
cache pruning, compatibility). The framework way can provide more
concise interfaces. The connector base way can define more flexible
cache strategies/implementations. We are still investigating a way to
see if we can have both advantages. We should reach a consensus that
the chosen way is a final state, and that we are on the path to it.

2) filters and projections pushdown:
I agree with Alex that the filter pushdown into the cache can benefit
the ALL cache a lot. However, this is not true for the LRU cache.
Connectors use a cache to reduce IO requests to databases for better
throughput. If a filter can prune 90% of the data in the cache, we
will have 90% of lookup requests that can never be cached and hit the
databases directly. That means the cache is meaningless in this case.

IMO, Flink SQL has provided a standard way to do filter and projection
pushdown, i.e., SupportsFilterPushDown and
SupportsProjectionPushDown. That Jdbc/hive/HBase haven't implemented
the interfaces doesn't mean it's hard to implement. They should
implement the pushdown interfaces to reduce IO and the cache size. The
final state should be that the scan source and the lookup source share
the exact same pushdown implementation. I don't see why we need to
duplicate the pushdown logic in caches, which would complicate the
lookup join design.

3) ALL cache abstraction
The ALL cache might be the most challenging part of this FLIP. We have
never provided a public reload-lookup interface. Currently, we put the
reload logic in the "eval" method of TableFunction. That's hard for
some sources (e.g., Hive). Ideally, a connector implementation should
share the logic of reload and scan, i.e. ScanTableSource with
InputFormat/SourceFunction/FLIP-27 Source. However,
InputFormat/SourceFunction are deprecated, and the FLIP-27 source is
deeply coupled with SourceOperator. If we want to invoke the FLIP-27
source in LookupJoin, this may make the scope of this FLIP much
larger. We are still investigating how to abstract the ALL cache logic
and reuse the existing source interfaces.

Best,
Jark

On Thu, 5 May 2022 at 20:22, Roman Boyko <ro.v.boyko@gmail.com> wrote:
It's a much more complicated activity and lies outside the scope of
this improvement, because such pushdowns would have to be done for all
ScanTableSource implementations (not only for the lookup ones).

On Thu, 5 May 2022 at 19:02, Martijn Visser <martijnvisser@apache.org> wrote:
Hi everyone,

One question regarding "And Alexander correctly mentioned that filter
pushdown still is not implemented for jdbc/hive/hbase." -> Would an
alternative solution be to actually implement these filter pushdowns?
I can imagine that there are many more benefits to doing that, outside
of lookup caching and metrics.

Best regards,

Martijn Visser
https://twitter.com/MartijnVisser82
https://github.com/MartijnVisser

On Thu, 5 May 2022 at 13:58, Roman Boyko <ro.v.boyko@gmail.com> wrote:
Hi everyone!

Thanks for driving such a valuable improvement!

I do think that a single cache implementation would be a nice
opportunity for users. And it will break the "FOR SYSTEM_TIME AS OF
proc_time" semantics anyway, no matter how it is implemented.

Putting myself in the user's shoes, I can say that:
1) I would prefer to have the opportunity to cut down the cache size
by simply filtering out unnecessary data. And the most handy way to do
that is to apply it inside the LookupRunners. It would be a bit harder
to pass it through the LookupJoin node to the TableFunction. And
Alexander correctly mentioned that filter pushdown still is not
implemented for jdbc/hive/hbase.
2) The ability to set different caching parameters for different
tables is quite important. So I would prefer to set them through DDL
rather than have the same TTL, strategy and other options for all
lookup tables.
3) Providing the cache in the framework really deprives us of
extensibility (users won't be able to implement their own cache). But
most probably this can be solved by creating more cache strategies and
a wider set of configuration options.

All these points are much closer to the schema proposed by Alexander.
Qingsheng Ren, please correct me if I'm wrong and all these facilities
can be simply implemented in your architecture?

Best regards,
Roman Boyko
e.: ro.v.boyko@gmail.com

On Wed, 4 May 2022 at 21:01, Martijn Visser <martijnvisser@apache.org> wrote:
Hi everyone,

I don't have much to chip in, but just wanted to express that I really
appreciate the in-depth discussion on this topic, and I hope that
others will join the conversation.

Best regards,

Martijn

On Tue, 3 May 2022 at 10:15, Александр Смирнов <smiralexan@gmail.com> wrote:
Hi Qingsheng, Leonard and Jark,

Thanks for your detailed feedback! However, I have questions about
some of your statements (maybe I didn't get something?).

> Caching actually breaks the semantic of "FOR SYSTEM_TIME AS OF proc_time"

I agree that the semantics of "FOR SYSTEM_TIME AS OF proc_time" is not
fully implemented with caching, but as you said, users accept that
consciously to achieve better performance (no one proposed to enable
caching by default, etc.). Or by users do you mean other developers of
connectors? In that case developers explicitly specify whether their
connector supports caching or not (in the list of supported options);
no one makes them do that if they don't want to. So what exactly is
the difference between implementing caching in the flink-table-runtime
module versus the flink-table-common module from this point of view?
How does it affect whether the semantics of "FOR SYSTEM_TIME AS OF
proc_time" are broken or preserved?

> confront a situation that allows table options in DDL to control the
> behavior of the framework, which has never happened previously and
> should be cautious

If we talk about the main semantic differences between DDL options and
config options ("table.exec.xxx"), isn't it about limiting the scope of
the options, plus their importance for the user's business logic, rather
than the specific location of the corresponding logic in the framework?
I mean that in my design, for example, putting an option with the lookup
cache strategy into configurations would be the wrong decision, because
it directly affects the user's business logic (not just performance
optimization) and touches just several functions of ONE table (there can
be multiple tables with different caches). Does it really matter for
the user (or someone else) where the logic affected by the applied
option is located?
Also I can remember the DDL option 'sink.parallelism', which in some way
"controls the behavior of the framework", and I don't see any problem
here.

> introduce a new interface for this all-caching scenario and the design
> would become more complex

This is a subject for a separate discussion, but actually in our
internal version we solved this problem quite easily - we reused the
InputFormat class (so there is no need for a new API). The point is
that currently all lookup connectors use InputFormat for scanning the
data in batch mode: HBase, JDBC and even Hive - it uses the
PartitionReader class, which is actually just a wrapper around
InputFormat. The advantage of this solution is the ability to reload
cache data in parallel (the number of threads depends on the number of
InputSplits, but has an upper limit). As a result, the cache reload time
is significantly reduced (as well as the time the input stream is
blocked). I know that usually we try to avoid concurrency in Flink
code, but maybe this one can be an exception. BTW I don't say that it's
an ideal solution; maybe there are better ones.
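For illustration, the parallel reload described above can be sketched without any Flink APIs. This is a minimal sketch under stated assumptions: the table fits in memory, InputSplits are simulated as key ranges, and the class and method names are invented for the example (not part of FLIP-221 or Flink):

```java
import java.util.*;
import java.util.concurrent.*;

public class ParallelCacheReload {

    /** Reload all rows of a (simulated) dimension table into a fresh cache,
     *  splitting the work across several loader threads, one per "split". */
    static Map<Integer, String> reload(Map<Integer, String> source, int parallelism) {
        List<Integer> keys = new ArrayList<>(source.keySet());
        ConcurrentMap<Integer, String> cache = new ConcurrentHashMap<>();
        ExecutorService pool = Executors.newFixedThreadPool(parallelism);
        int chunk = (keys.size() + parallelism - 1) / parallelism;
        for (int s = 0; s < parallelism; s++) {
            final List<Integer> split = keys.subList(
                    Math.min(s * chunk, keys.size()),
                    Math.min((s + 1) * chunk, keys.size()));
            // each task plays the role of one InputSplit being read
            pool.submit(() -> split.forEach(k -> cache.put(k, source.get(k))));
        }
        pool.shutdown();
        try {
            // block until every split is loaded, then the new cache can be swapped in
            if (!pool.awaitTermination(30, TimeUnit.SECONDS)) {
                throw new IllegalStateException("cache reload timed out");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return cache;
    }

    public static void main(String[] args) {
        Map<Integer, String> table = new HashMap<>();
        for (int i = 0; i < 40; i++) table.put(i, "user-" + i);
        System.out.println(reload(table, 4).size()); // prints 40
    }
}
```

The point of the sketch is only the shape of the trade-off: reload latency drops roughly with the number of splits, at the cost of introducing threads into the reload path.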
> Providing the cache in the framework might introduce compatibility
> issues

It's possible only in cases when the developer of the connector doesn't
properly refactor his code and uses the new cache options incorrectly
(i.e. explicitly provides the same options in two different code
places). For correct behavior, all he will need to do is to redirect
the existing options to the framework's LookupConfig (+ maybe add an
alias for options, if there was different naming); everything will be
transparent for users. If the developer doesn't do any refactoring at
all, nothing will change for the connector because of backward
compatibility. Also, if a developer wants to use his own cache logic,
he can simply refuse to pass some of the configs into the framework,
and instead make his own implementation with the already existing
configs and metrics (but actually I think that's a rare case).
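A sketch of that redirection, under loudly stated assumptions: the LookupConfig class, its fields, and the option keys ('lookup.cache.max-rows', the legacy alias 'lookup.cache.max-size', 'lookup.cache.ttl-ms') are all invented for illustration and are not the actual FLIP-221 API. A connector would translate its existing DDL options into the framework's config in one place:

```java
import java.util.Map;

/** Hypothetical framework-side cache config; names are illustrative only. */
final class LookupConfig {
    final long maxRows;
    final long ttlMs;
    LookupConfig(long maxRows, long ttlMs) { this.maxRows = maxRows; this.ttlMs = ttlMs; }
}

public class OptionRedirect {
    /** A connector redirecting its existing DDL options (including a
     *  hypothetical legacy alias) into the framework's LookupConfig,
     *  instead of building its own cache from them. */
    static LookupConfig fromOptions(Map<String, String> tableOptions) {
        String maxRows = tableOptions.getOrDefault("lookup.cache.max-rows",
                // legacy alias kept for backward compatibility with old DDLs
                tableOptions.getOrDefault("lookup.cache.max-size", "10000"));
        String ttl = tableOptions.getOrDefault("lookup.cache.ttl-ms", "600000");
        return new LookupConfig(Long.parseLong(maxRows), Long.parseLong(ttl));
    }

    public static void main(String[] args) {
        LookupConfig c = fromOptions(Map.of("lookup.cache.max-size", "5000"));
        System.out.println(c.maxRows); // prints 5000, resolved via the alias
    }
}
```

With this shape, users keep writing the options they already know, and the framework is the single owner of the cache they configure.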
> filters and projections should be pushed all the way down to the
> table function, like what we do in the scan source

That's a great goal. But the truth is that the ONLY connector that
supports filter pushdown is FileSystemTableSource (no database
connector supports it currently). Also, for some databases it's simply
impossible to push down such complex filters as we have in Flink.

> only applying these optimizations to the cache seems not quite useful

Filters can cut off an arbitrarily large amount of data from the
dimension table. For a simple example, suppose in the dimension table
'users' we have a column 'age' with values from 20 to 40, and an input
stream 'clicks' that is ~uniformly distributed by age of users. If we
have the filter 'age > 30', there will be half as much data in the
cache. This means the user can increase 'lookup.cache.max-rows' by
almost 2 times, which will give a huge performance boost. Moreover,
this optimization starts to really shine in the 'ALL' cache, where
tables without filters and projections can't fit in memory, but with
them - can. This opens up additional possibilities for users. And that
doesn't sound like 'not quite useful'.
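The 'age > 30' example can be sketched as a minimal lookup-then-cache loop (all names are invented for the sketch; the external table and the filter are simulated). Applying the filter before insertion keeps non-matching rows out of the cache entirely, so the same row budget holds twice as many usable entries:

```java
import java.util.*;
import java.util.function.Predicate;
import java.util.stream.*;

public class FilteredCache {
    /** Look up a key, apply the (non-pushed-down) filter *before* caching,
     *  so rows the query can never use don't occupy cache slots. */
    static Optional<int[]> lookup(int key, Map<Integer, int[]> externalTable,
                                  Predicate<int[]> filter, Map<Integer, int[]> cache) {
        if (cache.containsKey(key)) return Optional.of(cache.get(key));
        int[] row = externalTable.get(key);               // simulated remote lookup
        if (row == null || !filter.test(row)) return Optional.empty();
        cache.put(key, row);                              // only matching rows are cached
        return Optional.of(row);
    }

    public static void main(String[] args) {
        // dimension table 'users': row = {userId, age}, ages uniform in [20, 40)
        Map<Integer, int[]> users = IntStream.range(0, 1000).boxed()
                .collect(Collectors.toMap(i -> i, i -> new int[]{i, 20 + i % 20}));
        Map<Integer, int[]> cache = new HashMap<>();
        Predicate<int[]> ageOver30 = row -> row[1] > 30;  // the filter 'age > 30'
        for (int key = 0; key < 1000; key++) lookup(key, users, ageOver30, cache);
        System.out.println(cache.size()); // 450: only rows with age 31..39 are cached
    }
}
```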

It would be great to hear other voices regarding this topic! Because we
have quite a lot of controversial points, and I think with the help of
others it will be easier for us to come to a consensus.

Best regards,
Smirnov Alexander


On Fri, 29 Apr 2022 at 22:33, Qingsheng Ren <renqschn@gmail.com> wrote:

Hi Alexander and Arvid,

Thanks for the discussion and sorry for my late response! We had an
internal discussion together with Jark and Leonard, and I'd like to
summarize our ideas. Instead of implementing the cache logic in the
table runtime layer or wrapping it around the user-provided table
function, we prefer to introduce some new APIs extending TableFunction,
with these concerns:

1. Caching actually breaks the semantic of "FOR SYSTEM_TIME AS OF
proc_time", because it can't truly reflect the content of the lookup
table at the moment of querying. If users choose to enable caching on
the lookup table, they implicitly indicate that this breakage is
acceptable in exchange for the performance. So we prefer not to provide
caching at the table runtime level.

2. If we put the cache implementation in the framework (whether in a
runner or a wrapper around TableFunction), we have to confront a
situation that allows table options in DDL to control the behavior of
the framework, which has never happened previously and should be
treated cautiously. Under the current design, the behavior of the
framework should only be specified by configurations ("table.exec.xxx"),
and it's hard to apply these general configs to a specific table.

3. We have use cases where the lookup source loads and refreshes all
records periodically into memory to achieve high lookup performance
(like the Hive connector in the community; this is also widely used by
our internal connectors). Wrapping the cache around the user's
TableFunction works fine for LRU caches, but I think we would have to
introduce a new interface for this all-caching scenario and the design
would become more complex.

4. Providing the cache in the framework might introduce compatibility
issues to existing lookup sources: there might exist two caches with
totally different strategies if the user incorrectly configures the
table (one in the framework and another implemented by the lookup
source).

As for the optimization mentioned by Alexander, I think filters and
projections should be pushed all the way down to the table function,
like what we do in the scan source, instead of to the runner with the
cache. The goal of using a cache is to reduce the network I/O and the
pressure on the external system, and only applying these optimizations
to the cache seems not quite useful.

I made some updates to the FLIP[1] to reflect our ideas. We prefer to
keep the cache implementation as a part of TableFunction, and we could
provide some helper classes (CachingTableFunction,
AllCachingTableFunction, CachingAsyncTableFunction) to developers and
regulate the metrics of the cache. Also, I made a POC[2] for your
reference.
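As a rough illustration of the LRU-wrapper idea (this is not the API from the POC; the class name and signatures here are invented), a cache around a user's lookup function could look like the sketch below, using an access-ordered LinkedHashMap for eviction:

```java
import java.util.*;
import java.util.function.Function;

public class CachingLookup<K, V> {
    private final Function<K, V> userLookup;   // stands in for the user's lookup/eval
    private final Map<K, V> lruCache;

    CachingLookup(Function<K, V> userLookup, int maxRows) {
        this.userLookup = userLookup;
        // accessOrder = true makes iteration order least-recently-used first,
        // and removeEldestEntry caps the cache at maxRows entries
        this.lruCache = new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                return size() > maxRows;
            }
        };
    }

    V eval(K key) {
        V v = lruCache.get(key);        // get() refreshes the LRU order on a hit
        if (v == null) {
            v = userLookup.apply(key);  // cache miss: delegate to the user's lookup
            lruCache.put(key, v);       // put() may evict the eldest entry
        }
        return v;
    }

    public static void main(String[] args) {
        int[] misses = {0};
        CachingLookup<Integer, String> fn =
                new CachingLookup<>(k -> { misses[0]++; return "row-" + k; }, 2);
        fn.eval(1); fn.eval(2); fn.eval(1); // keys 1 and 2 cached; second eval(1) hits
        fn.eval(3);                          // evicts the least-recently-used key 2
        fn.eval(1);                          // still cached
        System.out.println(misses[0]);       // prints 3
    }
}
```

A framework-provided helper of this shape is what lets metrics (hits, misses, evictions) be reported uniformly across connectors instead of per-connector.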

Looking forward to your ideas!

[1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-221+Abstraction+for+lookup+source+cache+and+metric
[2] https://github.com/PatrickRen/flink/tree/FLIP-221

Best regards,

Qingsheng

On Tue, Apr 26, 2022 at 4:45 PM Александр Смирнов <smiralexan@gmail.com> wrote:
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for the response, Arvid!
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I have few comments on your
> message.
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> but could also live with an
> easier
> > >>>>>>>>>>> solution
> > >>>>>>>>>>>>> as
> > >>>>>>>>>>>>>>>>> the
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> first
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> step:
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I think that these 2 ways are
> > >>>>>>>> mutually
> > >>>>>>>>>>>>> exclusive
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> (originally
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> proposed
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> by Qingsheng and mine), because
> > >>>>>>>>>>> conceptually
> > >>>>>>>>>>>>>>> they
> > >>>>>>>>>>>>>>>>>>> follow
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> same
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> goal, but implementation
> details are
> > >>>>>>>>>>>>> different.
> > >>>>>>>>>>>>>>>>> If
> > >>>>>>>>>>>>>>>>>> we
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> will
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> go one
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> way,
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> moving to another way in the
> future
> > >>>>>>>>> will
> > >>>>>>>>>>> mean
> > >>>>>>>>>>>>>>>>>>> deleting
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> existing
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> code
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> and once again changing the API
> for
> > >>>>>>>>>>>>> connectors.
> > >>>>>>>>>>>>>>>>> So
> > >>>>>>>>>>>>>>>>>> I
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> think we
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> should
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> reach a consensus with the
> community
> > >>>>>>>>>> about
> > >>>>>>>>>>>>> that
> > >>>>>>>>>>>>>>>>> and
> > >>>>>>>>>>>>>>>>>>> then
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> work
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> together
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> on this FLIP, i.e. divide the
> work on
> > >>>>>>>>>> tasks
> > >>>>>>>>>>>>> for
> > >>>>>>>>>>>>>>>>>>> different
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> parts
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> of
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> flip (for example, LRU cache
> > >>>>>>>>> unification
> > >>>>>>>>>> /
> > >>>>>>>>>>>>>>>>>>> introducing
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> proposed
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> set
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> of
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> metrics / further work…). WDYT,
> > >>>>>>>>>> Qingsheng?
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> as the source will only receive the requests after filter

Actually if filters are applied to fields of the lookup table, we must
first do requests, and only after that can we filter the responses, because
lookup connectors don't have filter pushdown. So if filtering is done
before caching, there will be far fewer rows in the cache.
> @Alexander unfortunately, your architecture is not shared. I don't know
> the solution to share images to be honest.

Sorry for that, I'm a bit new to such kinds of conversations :)
I have no write access to Confluence, so I made a Jira issue where I
described the proposed changes in more detail -
https://issues.apache.org/jira/browse/FLINK-27411.

Will be happy to get more feedback!

Best,
Smirnov Alexander
Mon, 25 Apr 2022 at 19:49, Arvid Heise <arvid@apache.org>:

> Hi Qingsheng,
>
> Thanks for driving this; the inconsistency was not satisfying for me.
>
> I second Alexander's idea, though I could also live with an easier
> solution as the first step: instead of making caching an implementation
> detail of TableFunction X, rather devise a caching layer around X. So the
> proposal would be a CachingTableFunction that delegates to X in case of
> misses and otherwise manages the cache. Lifting it into the operator
> model as proposed would be even better but is probably unnecessary in the
> first step for a lookup source (as the source will only receive the
> requests after filter; applying projection may be more interesting to
> save memory).
>
> Another advantage is that all the changes of this FLIP would be limited
> to options, with no need for new public interfaces. Everything else
> remains an implementation detail of the Table runtime. That means we can
> easily incorporate the optimization potential that Alexander pointed out
> later.
>
> @Alexander unfortunately, your architecture is not shared. I don't know
> the solution to share images to be honest.
>
> On Fri, Apr 22, 2022 at 5:04 PM Александр Смирнов
> <smiralexan@gmail.com> wrote:
> > Hi Qingsheng! My name is Alexander, I'm not a committer yet, but I'd
> > really like to become one. And this FLIP really interested me.
> > Actually I have worked on a similar feature in my company's Flink
> > fork, and we would like to share our thoughts on this and make the
> > code open source.
> >
> > I think there is a better alternative than introducing an abstract
> > class for TableFunction (CachingTableFunction). As you know,
> > TableFunction exists in the flink-table-common module, which provides
> > only an API for working with tables – it's very convenient for
> > importing in connectors. In turn, CachingTableFunction contains logic
> > for runtime execution, so this class and everything connected with it
> > should be located in another module, probably in flink-table-runtime.
> > But this would require connectors to depend on another module, which
> > contains a lot of runtime logic, and that doesn't sound good.
> >
> > I suggest adding a new method 'getLookupConfig' to LookupTableSource
> > or LookupRuntimeProvider to allow connectors to only pass
> > configurations to the planner, so they won't depend on the runtime
> > realization. Based on these configs the planner will construct a
> > lookup join operator with the corresponding runtime logic
> > (ProcessFunctions in module flink-table-runtime). The architecture
> > looks like in the pinned image (the LookupConfig class there is
> > actually your CacheConfig).
> >
> > Classes in flink-table-planner that will be responsible for this –
> > CommonPhysicalLookupJoin and its inheritors.
> > Current classes for lookup join in flink-table-runtime -
> > LookupJoinRunner, AsyncLookupJoinRunner, LookupJoinRunnerWithCalc,
> > AsyncLookupJoinRunnerWithCalc.
> >
> > I suggest adding classes LookupJoinCachingRunner,
> > LookupJoinCachingRunnerWithCalc, etc.
> >
> > And here comes another, more powerful advantage of such a solution.
> > If we have caching logic on a lower level, we can apply some
> > optimizations to it. LookupJoinRunnerWithCalc was named like this
> > because it uses the 'calc' function, which actually mostly consists
> > of filters and projections.
> >
> > For example, in joining table A with lookup table B with condition
> > 'JOIN … ON A.id = B.id AND A.age = B.age + 10 WHERE B.salary > 1000',
> > the 'calc' function will contain the filters A.age = B.age + 10 and
> > B.salary > 1000.
> >
> > If we apply this function before storing records in the cache, the
> > size of the cache will be significantly reduced: filters = avoid
> > storing useless records in the cache, projections = reduce the
> > records' size. So the initial max number of records in the cache can
> > be increased by the user.
> >
> > What do you think about it?
> >
> > On 2022/04/19 02:47:11 Qingsheng Ren wrote:
> > > Hi devs,
> > >
> > > Yuan and I would like to start a discussion about FLIP-221[1], which
> > > introduces an abstraction of lookup table cache and its standard
> > > metrics.
> > >
> > > Currently each lookup table source has to implement its own cache to
> > > store lookup results, and there isn't a standard of metrics for
> > > users and developers to tune their jobs with lookup joins, which is
> > > a quite common use case in Flink Table / SQL.
> > >
> > > Therefore we propose some new APIs including cache, metrics, wrapper
> > > classes of TableFunction and new table options. Please take a look
> > > at the FLIP page [1] to get more details. Any suggestions and
> > > comments would be appreciated!
> > >
> > > [1]
> > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-221+Abstraction+for+lookup+source+cache+and+metric
> > >
> > > Best regards,
> > >
> > > Qingsheng
> >
> > --
> > Best Regards,
> >
> > Qingsheng Ren
> >
> > Real-time Computing Team
> > Alibaba Cloud
> >
> > Email: renqschn@gmail.com
>
> --
> Best regards,
> Roman Boyko
> e.: ro.v.boyko@gmail.com
>
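The "apply the calc before caching" optimization described above can be sketched in plain Java. Everything below is illustrative: maps stand in for Flink's RowData, and none of the class or method names are real Flink APIs.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.function.Predicate;

/** Illustrative sketch (not a Flink API): apply the join's "calc"
 *  (filter + projection) before storing lookup results in the cache. */
public class CalcBeforeCache {

    // Hypothetical filter derived from: ... WHERE B.salary > 1000
    static final Predicate<Map<String, Object>> FILTER =
            row -> (Integer) row.get("salary") > 1000;

    // Projection: keep only the columns the join actually needs.
    static final Function<Map<String, Object>, Map<String, Object>> PROJECT =
            row -> {
                Map<String, Object> projected = new HashMap<>();
                projected.put("id", row.get("id"));
                projected.put("age", row.get("age"));
                return projected;
            };

    static final Map<Object, List<Map<String, Object>>> CACHE = new HashMap<>();

    /** Store only rows that survive the calc; dropped rows never occupy the cache. */
    static void cacheLookupResult(Object key, List<Map<String, Object>> rows) {
        List<Map<String, Object>> reduced = new ArrayList<>();
        for (Map<String, Object> row : rows) {
            if (FILTER.test(row)) {
                reduced.add(PROJECT.apply(row));
            }
        }
        CACHE.put(key, reduced);
    }

    public static void main(String[] args) {
        Map<String, Object> keep = new HashMap<>();
        keep.put("id", 1); keep.put("age", 30); keep.put("salary", 2000); keep.put("name", "a");
        Map<String, Object> drop = new HashMap<>();
        drop.put("id", 1); drop.put("age", 40); drop.put("salary", 500); drop.put("name", "b");

        cacheLookupResult(1, List.of(keep, drop));

        // Only one row survives the filter, and it is stored without the
        // unused "salary"/"name" columns.
        System.out.println(CACHE.get(1).size());                 // 1
        System.out.println(CACHE.get(1).get(0).keySet().size()); // 2
    }
}
```

Both effects Alexander describes show up directly: the filter keeps useless rows out of the cache entirely, and the projection shrinks each stored row.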

Re: [DISCUSS] FLIP-221 Abstraction for lookup source cache and metric

Posted by Jingsong Li <ji...@gmail.com>.
Thanks for the update, Qingsheng,

Looks good to me!

Best,
Jingsong

On Wed, Jun 22, 2022 at 5:00 PM Qingsheng Ren <re...@apache.org> wrote:
>
> Hi Jingsong,
>
> 1. Updated and thanks for the reminder!
>
> 2. We could do so for the implementation, but as a public interface I prefer not to introduce another layer or expose too much, since this FLIP is already a huge one with a bunch of classes and interfaces.
>
> Best,
> Qingsheng
>
> > On Jun 22, 2022, at 11:16, Jingsong Li <ji...@gmail.com> wrote:
> >
> > Thanks Qingsheng and all.
> >
> > I like this design.
> >
> > Some comments:
> >
> > 1. LookupCache implements Serializable?
> >
> > 2. Minor: After FLIP-234 [1], there should be many connectors that
> > implement both PartialCachingLookupProvider and
> > PartialCachingAsyncLookupProvider. Can we extract a common interface
> > for `LookupCache getCache();` to ensure consistency?
> >
> > [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-234%3A+Support+Retryable+Lookup+Join+To+Solve+Delayed+Updates+Issue+In+External+Systems
> >
> > Best,
> > Jingsong
> >
> > On Tue, Jun 21, 2022 at 4:09 PM Qingsheng Ren <re...@apache.org> wrote:
> >>
> >> Hi devs,
> >>
> >> I’d like to push FLIP-221 forward a little bit. Recently we had some offline discussions and updated the FLIP. Here’s the diff compared to the previous version:
> >>
> >> 1. (Async)LookupFunctionProvider is designed as a base interface for constructing lookup functions.
> >> 2. From the LookupFunction we extend PartialCaching / FullCachingLookupProvider for partial and full caching mode.
> >> 3. Introduce CacheReloadTrigger for specifying the reload strategy in full caching mode, and provide 2 default implementations (Periodic / TimedCacheReloadTrigger)
> >>
> >> Looking forward to your replies~
> >>
> >> Best,
> >> Qingsheng
> >>
> >>> On Jun 2, 2022, at 17:15, Qingsheng Ren <re...@gmail.com> wrote:
> >>>
> >>> Hi Becket,
> >>>
> >>> Thanks for your feedback!
> >>>
> >>> 1. An alternative way is to let the implementation of the cache decide
> >>> whether to store a missing key in the cache, instead of the framework.
> >>> This sounds more reasonable and makes the LookupProvider interface
> >>> cleaner. I can update the FLIP and clarify in the JavaDoc of
> >>> LookupCache#put that the cache should decide whether to store an empty
> >>> collection.
> >>>
> >>> 2. Initially the builder pattern is for the extensibility of
> >>> LookupProvider interfaces that we could need to add more
> >>> configurations in the future. We can remove the builder now as we have
> >>> resolved the issue in 1. As for the builder in DefaultLookupCache I
> >>> prefer to keep it because we have a lot of arguments in the
> >>> constructor.
> >>>
> >>> 3. I think this might overturn the overall design. I agree with
> >>> Becket's idea that the API design should be layered considering
> >>> extensibility and it'll be great to have one unified interface
> >>> supporting both partial, full and even mixed custom strategies, but we
> >>> have some issues to resolve. The original purpose of treating full
> >>> caching separately is that we'd like to reuse the ability of
> >>> ScanRuntimeProvider. Developers just need to hand over Source /
> >>> SourceFunction / InputFormat so that the framework could be able to
> >>> compose the underlying topology and control the reload (maybe in a
> >>> distributed way). Under your design we leave the reload operation
> >>> totally to the CacheStrategy and I think it will be hard for
> >>> developers to reuse the source in the initializeCache method.
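On the builder question in point 2 above: with several independent optional settings, a builder avoids a combinatorial explosion of constructors. A minimal self-contained sketch follows; the names mirror the FLIP draft's DefaultLookupCache but this is not the actual Flink class.

```java
import java.time.Duration;

/** Minimal sketch of the builder pattern discussed for DefaultLookupCache
 *  (method names follow the FLIP draft; not the final Flink API). */
public class LookupCacheConfig {
    final Duration expireAfterWrite;
    final Duration expireAfterAccess;
    final long maximumSize;
    final boolean cacheMissingKey;

    private LookupCacheConfig(Builder b) {
        this.expireAfterWrite = b.expireAfterWrite;
        this.expireAfterAccess = b.expireAfterAccess;
        this.maximumSize = b.maximumSize;
        this.cacheMissingKey = b.cacheMissingKey;
    }

    public static Builder newBuilder() { return new Builder(); }

    public static class Builder {
        private Duration expireAfterWrite;   // null = no write TTL
        private Duration expireAfterAccess;  // null = no access TTL
        private long maximumSize = Long.MAX_VALUE;
        private boolean cacheMissingKey = true;

        public Builder expireAfterWrite(Duration d)  { this.expireAfterWrite = d;  return this; }
        public Builder expireAfterAccess(Duration d) { this.expireAfterAccess = d; return this; }
        public Builder maximumSize(long n)           { this.maximumSize = n;       return this; }
        public Builder cacheMissingKey(boolean b)    { this.cacheMissingKey = b;   return this; }
        public LookupCacheConfig build()             { return new LookupCacheConfig(this); }
    }

    public static void main(String[] args) {
        // Only the settings the user cares about are named; the rest default.
        LookupCacheConfig c = newBuilder()
                .expireAfterWrite(Duration.ofMinutes(10))
                .maximumSize(10_000)
                .build();
        System.out.println(c.maximumSize); // 10000
    }
}
```

With four optional knobs, positional constructors would need up to sixteen overloads or ambiguous null arguments; the builder makes each call site self-describing.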
> >>>
> >>> Best regards,
> >>>
> >>> Qingsheng
> >>>
> >>> On Thu, Jun 2, 2022 at 1:50 PM Becket Qin <be...@gmail.com> wrote:
> >>>>
> >>>> Thanks for updating the FLIP, Qingsheng. A few more comments:
> >>>>
> >>>> 1. I am still not sure about what is the use case for cacheMissingKey().
> >>>> More specifically, when would users want to have getCache() return a
> >>>> non-empty value and cacheMissingKey() returns false?
> >>>>
> >>>> 2. The builder pattern. Usually the builder pattern is used when there are
> >>>> a lot of variations of constructors. For example, if a class has three
> >>>> variables and all of them are optional, so there could potentially be many
> >>>> combinations of the variables. But in this FLIP, I don't see such a case.
> >>>> What is the reason we have builders for all the classes?
> >>>>
> >>>> 3. Should the caching strategy be excluded from the top level provider API?
> >>>> Technically speaking, the Flink framework should only have two interfaces
> >>>> to deal with:
> >>>>   A) LookupFunction
> >>>>   B) AsyncLookupFunction
> >>>> Orthogonally, we *believe* there are two different strategies people use
> >>>> for caching. Note that the Flink framework does not care what the
> >>>> caching strategy is here.
> >>>>   a) partial caching
> >>>>   b) full caching
> >>>>
> >>>> Putting them together, we end up with 3 combinations that we think are
> >>>> valid:
> >>>>    Aa) PartialCachingLookupFunctionProvider
> >>>>    Ba) PartialCachingAsyncLookupFunctionProvider
> >>>>    Ab) FullCachingLookupFunctionProvider
> >>>>
> >>>> However, the caching strategy could actually be quite flexible. E.g. an
> >>>> initial full cache load followed by some partial updates. Also, I am not
> >>>> 100% sure if the full caching will always use ScanTableSource. Including
> >>>> the caching strategy in the top level provider API would make it harder to
> >>>> extend.
> >>>>
> >>>> One possible solution is to just have *LookupFunctionProvider* and
> >>>> *AsyncLookupFunctionProvider* as the top level API, both with a
> >>>> getCacheStrategy() method returning an
> >>>> optional CacheStrategy. The CacheStrategy class would have the following
> >>>> methods:
> >>>> 1. void open(Context), the context exposes some of the resources that may
> >>>> be useful for the the caching strategy, e.g. an ExecutorService that is
> >>>> synchronized with the data processing, or a cache refresh trigger which
> >>>> blocks data processing and refresh the cache.
> >>>> 2. void initializeCache(), a blocking method allows users to pre-populate
> >>>> the cache before processing any data if they wish.
> >>>> 3. void maybeCache(RowData key, Collection<RowData> value), blocking or
> >>>> non-blocking method.
> >>>> 4. void refreshCache(), a blocking / non-blocking method that is invoked by
> >>>> the Flink framework when the cache refresh trigger is pulled.
> >>>>
> >>>> In the above design, partial caching and full caching would be
> >>>> implementations of the CachingStrategy. And it is OK for users to implement
> >>>> their own CachingStrategy if they want to.
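Becket's proposed CacheStrategy can be sketched as a small interface plus a trivial partial-caching implementation. This is a simplified illustration, not the FLIP API: String stands in for RowData, the Context parameter is omitted, and a getIfPresent method is added so the sketch is self-contained.

```java
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

/** Sketch of the proposed CacheStrategy (hypothetical, simplified). */
interface CacheStrategy {
    void initializeCache();                                // pre-populate (full caching)
    void maybeCache(String key, Collection<String> value); // called per lookup result
    void refreshCache();                                   // invoked on the reload trigger
    Optional<Collection<String>> getIfPresent(String key);
}

/** Partial caching = cache on miss, drop everything on refresh. */
class PartialCacheStrategy implements CacheStrategy {
    private final Map<String, Collection<String>> cache = new HashMap<>();

    public void initializeCache() { /* nothing to pre-load in partial mode */ }
    public void maybeCache(String key, Collection<String> value) { cache.put(key, value); }
    public void refreshCache() { cache.clear(); }
    public Optional<Collection<String>> getIfPresent(String key) {
        return Optional.ofNullable(cache.get(key));
    }
}

public class CacheStrategyDemo {
    public static void main(String[] args) {
        CacheStrategy s = new PartialCacheStrategy();
        s.maybeCache("k", List.of("v1", "v2"));
        System.out.println(s.getIfPresent("k").isPresent()); // true
        s.refreshCache();
        System.out.println(s.getIfPresent("k").isPresent()); // false
    }
}
```

Under this shape, a full-caching strategy would do its bulk load in initializeCache() and refreshCache(), while the framework only sees the one interface.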
> >>>>
> >>>> Thanks,
> >>>>
> >>>> Jiangjie (Becket) Qin
> >>>>
> >>>>
> >>>> On Thu, Jun 2, 2022 at 12:14 PM Jark Wu <im...@gmail.com> wrote:
> >>>>
> >>>>> Thank Qingsheng for the detailed summary and updates,
> >>>>>
> >>>>> The changes look good to me in general. I just have one minor improvement
> >>>>> comment.
> >>>>> Could we add a static util method to the "FullCachingReloadTrigger"
> >>>>> interface for quick usage?
> >>>>>
> >>>>> #periodicReloadAtFixedRate(Duration)
> >>>>> #periodicReloadWithFixedDelay(Duration)
> >>>>>
> >>>>> I think we can also do this for LookupCache, because users may not know
> >>>>> where the default implementations are or how to use them.
> >>>>>
> >>>>> Best,
> >>>>> Jark
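Jark's suggestion amounts to static factory shortcuts on the trigger interface itself, roughly like the following sketch (hypothetical names; the real FLIP signatures may differ, and a real implementation would differ in scheduling semantics between the two factories).

```java
import java.time.Duration;

/** Sketch: static factory shortcuts on the reload-trigger interface,
 *  mirroring Jark's #periodicReloadAtFixedRate suggestion. */
interface FullCachingReloadTrigger {
    Duration period();

    // Fixed-rate: next reload is scheduled relative to the previous start.
    static FullCachingReloadTrigger periodicReloadAtFixedRate(Duration period) {
        return () -> period;
    }

    // Fixed-delay: next reload is scheduled relative to the previous finish.
    static FullCachingReloadTrigger periodicReloadWithFixedDelay(Duration delay) {
        return () -> delay;
    }
}

public class TriggerDemo {
    public static void main(String[] args) {
        FullCachingReloadTrigger t =
                FullCachingReloadTrigger.periodicReloadAtFixedRate(Duration.ofHours(1));
        System.out.println(t.period().toHours()); // 1
    }
}
```

The point of the shortcut is discoverability: users can type the interface name and find the default implementations without hunting for a separate class.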
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>> On Wed, 1 Jun 2022 at 18:32, Qingsheng Ren <re...@gmail.com> wrote:
> >>>>>
> >>>>>> Hi Jingsong,
> >>>>>>
> >>>>>> Thanks for your comments!
> >>>>>>
> >>>>>>> The AllCache definition is not flexible; for example, the PartialCache
> >>>>>>> can use any custom storage, while the AllCache cannot. The AllCache can
> >>>>>>> also store to memory or disk, so it also needs a flexible strategy.
> >>>>>>
> >>>>>> We had an offline discussion with Jark and Leonard. Basically we think
> >>>>>> exposing the interface of the full cache storage to connector
> >>>>>> developers might limit our future optimizations. The storage of full
> >>>>>> caching shouldn't have too many variations for different lookup
> >>>>>> tables, so making it pluggable might not help a lot. Also I think it
> >>>>>> is not quite easy for connector developers to implement such an
> >>>>>> optimized storage. We can keep optimizing this storage in the future
> >>>>>> and all full caching lookup tables would benefit from it.
> >>>>>>
> >>>>>>> We are more inclined to deprecate the connector `async` option when
> >>>>>>> discussing FLIP-234. Can we remove this option from this FLIP?
> >>>>>>
> >>>>>> Thanks for the reminder! This option has been removed in the latest
> >>>>>> version.
> >>>>>>
> >>>>>> Best regards,
> >>>>>>
> >>>>>> Qingsheng
> >>>>>>
> >>>>>>
> >>>>>>> On Jun 1, 2022, at 15:28, Jingsong Li <ji...@gmail.com> wrote:
> >>>>>>>
> >>>>>>> Thanks Alexander for your reply. We can discuss the new interface
> >>>>>>> when it comes out.
> >>>>>>>
> >>>>>>> We are more inclined to deprecate the connector `async` option when
> >>>>>>> discussing FLIP-234 [1]. We should use a hint to let the planner
> >>>>>>> decide. Although the discussion has not yet produced a conclusion,
> >>>>>>> can we remove this option from this FLIP? It doesn't seem to be
> >>>>>>> related to this FLIP, but more to FLIP-234, and we can form a
> >>>>>>> conclusion over there.
> >>>>>>>
> >>>>>>> [1] https://lists.apache.org/thread/9k1sl2519kh2n3yttwqc00p07xdfns3h
> >>>>>>>
> >>>>>>> Best,
> >>>>>>> Jingsong
> >>>>>>>
> >>>>>>> On Wed, Jun 1, 2022 at 4:59 AM Jing Ge <ji...@ververica.com> wrote:
> >>>>>>>
> >>>>>>>> Hi Jark,
> >>>>>>>>
> >>>>>>>> Thanks for clarifying it. It would be fine, as long as we could
> >>>>> provide
> >>>>>> the
> >>>>>>>> no-cache solution. I was just wondering if the client side cache could
> >>>>>>>> really help when HBase is used, since the data to look up should be
> >>>>>> huge.
> >>>>>>>> Depending how much data will be cached on the client side, the data
> >>>>> that
> >>>>>>>> should be LRU in e.g. LruBlockCache will not be LRU anymore. In the
> >>>>>> worst
> >>>>>>>> case scenario, once the cached data at client side is expired, the
> >>>>>> request
> >>>>>>>> will hit disk which will cause extra latency temporarily, if I am not
> >>>>>>>> mistaken.
> >>>>>>>>
> >>>>>>>> Best regards,
> >>>>>>>> Jing
> >>>>>>>>
> >>>>>>>> On Mon, May 30, 2022 at 9:59 AM Jark Wu <im...@gmail.com> wrote:
> >>>>>>>>
> >>>>>>>>> Hi Jing Ge,
> >>>>>>>>>
> >>>>>>>>> What do you mean by the "impact on the block cache used by HBase"?
> >>>>>>>>> In my understanding, the connector cache and HBase cache are totally
> >>>>>> two
> >>>>>>>>> things.
> >>>>>>>>> The connector cache is a local/client cache, and the HBase cache is a
> >>>>>>>>> server cache.
> >>>>>>>>>
> >>>>>>>>>> does it make sense to have a no-cache solution as one of the
> >>>>>>>>> default solutions so that customers will have no effort for the
> >>>>>> migration
> >>>>>>>>> if they want to stick with HBase cache
> >>>>>>>>>
> >>>>>>>>> The implementation migration should be transparent to users. Take the
> >>>>>>>> HBase
> >>>>>>>>> connector as
> >>>>>>>>> an example, it already supports a lookup cache, which is disabled by
> >>>>>> default.
> >>>>>>>>> After migration, the
> >>>>>>>>> connector still disables cache by default (i.e. no-cache solution).
> >>>>> No
> >>>>>>>>> migration effort for users.
> >>>>>>>>>
> >>>>>>>>> HBase cache and connector cache are two different things. HBase cache
> >>>>>>>> can't
> >>>>>>>>> simply replace
> >>>>>>>>> connector cache. Because one of the most important usages for
> >>>>> connector
> >>>>>>>>> cache is reducing
> >>>>>>>>> the I/O request/response and improving the throughput, which cannot
> >>>>>>>>> be achieved by just using a server cache.
> >>>>>>>>>
> >>>>>>>>> Best,
> >>>>>>>>> Jark
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> On Fri, 27 May 2022 at 22:42, Jing Ge <ji...@ververica.com> wrote:
> >>>>>>>>>
> >>>>>>>>>> Thanks all for the valuable discussion. The new feature looks very
> >>>>>>>>>> interesting.
> >>>>>>>>>>
> >>>>>>>>>> According to the FLIP description: "*Currently we have JDBC, Hive
> >>>>> and
> >>>>>>>>> HBase
> >>>>>>>>>> connector implemented lookup table source. All existing
> >>>>>> implementations
> >>>>>>>>>> will be migrated to the current design and the migration will be
> >>>>>>>>>> transparent to end users*." I was only wondering if we should pay
> >>>>>>>>> attention
> >>>>>>>>>> to HBase and similar DBs. Since, commonly, the lookup data will be
> >>>>>> huge
> >>>>>>>>>> while using HBase, partial caching will be used in this case, if I
> >>>>> am
> >>>>>>>> not
> >>>>>>>>>> mistaken, which might have an impact on the block cache used by
> >>>>> HBase,
> >>>>>>>>> e.g.
> >>>>>>>>>> LruBlockCache.
> >>>>>>>>>> Another question is that, since HBase provides a sophisticated cache
> >>>>>>>>>> solution, does it make sense to have a no-cache solution as one of
> >>>>> the
> >>>>>>>>>> default solutions so that customers will have no effort for the
> >>>>>>>> migration
> >>>>>>>>>> if they want to stick with HBase cache?
> >>>>>>>>>>
> >>>>>>>>>> Best regards,
> >>>>>>>>>> Jing
> >>>>>>>>>>
> >>>>>>>>>> On Fri, May 27, 2022 at 11:19 AM Jingsong Li <
> >>>>> jingsonglee0@gmail.com>
> >>>>>>>>>> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> Hi all,
> >>>>>>>>>>>
> >>>>>>>>>>> I think the problem now is below:
> >>>>>>>>>>> 1. The AllCache and PartialCache interfaces are non-uniform: one
> >>>>>>>>>>> needs to provide a LookupProvider, the other a CacheBuilder.
> >>>>>>>>>>> 2. The AllCache definition is not flexible: for example, PartialCache
> >>>>>>>>>>> can use any custom storage while AllCache cannot; AllCache could also
> >>>>>>>>>>> store to memory or disk, which also needs a flexible strategy.
> >>>>>>>>>>> 3. AllCache cannot use a custom ReloadStrategy; currently only
> >>>>>>>>>>> ScheduledReloadStrategy is available.
> >>>>>>>>>>>
> >>>>>>>>>>> In order to solve the above problems, the following are my ideas.
> >>>>>>>>>>>
> >>>>>>>>>>> ## Top level cache interfaces:
> >>>>>>>>>>>
> >>>>>>>>>>> ```
> >>>>>>>>>>>
> >>>>>>>>>>> public interface CacheLookupProvider extends
> >>>>>>>>>>> LookupTableSource.LookupRuntimeProvider {
> >>>>>>>>>>>
> >>>>>>>>>>>  CacheBuilder createCacheBuilder();
> >>>>>>>>>>> }
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> public interface CacheBuilder {
> >>>>>>>>>>>  Cache create();
> >>>>>>>>>>> }
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> public interface Cache {
> >>>>>>>>>>>
> >>>>>>>>>>>  /**
> >>>>>>>>>>>   * Returns the value associated with key in this cache, or null
> >>>>>>>> if
> >>>>>>>>>>> there is no cached value for
> >>>>>>>>>>>   * key.
> >>>>>>>>>>>   */
> >>>>>>>>>>>  @Nullable
> >>>>>>>>>>>  Collection<RowData> getIfPresent(RowData key);
> >>>>>>>>>>>
> >>>>>>>>>>>  /** Returns the number of key-value mappings in the cache. */
> >>>>>>>>>>>  long size();
> >>>>>>>>>>> }
> >>>>>>>>>>>
> >>>>>>>>>>> ```
> >>>>>>>>>>>
> >>>>>>>>>>> ## Partial cache
> >>>>>>>>>>>
> >>>>>>>>>>> ```
> >>>>>>>>>>>
> >>>>>>>>>>> public interface PartialCacheLookupFunction extends
> >>>>>>>>> CacheLookupProvider {
> >>>>>>>>>>>
> >>>>>>>>>>>  @Override
> >>>>>>>>>>>  PartialCacheBuilder createCacheBuilder();
> >>>>>>>>>>>
> >>>>>>>>>>> /** Creates an {@link LookupFunction} instance. */
> >>>>>>>>>>> LookupFunction createLookupFunction();
> >>>>>>>>>>> }
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> public interface PartialCacheBuilder extends CacheBuilder {
> >>>>>>>>>>>
> >>>>>>>>>>>  PartialCache create();
> >>>>>>>>>>> }
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> public interface PartialCache extends Cache {
> >>>>>>>>>>>
> >>>>>>>>>>>  /**
> >>>>>>>>>>>   * Associates the specified value rows with the specified key
> >>>>> row
> >>>>>>>>>>> in the cache. If the cache
> >>>>>>>>>>>   * previously contained value associated with the key, the old
> >>>>>>>>>>> value is replaced by the
> >>>>>>>>>>>   * specified value.
> >>>>>>>>>>>   *
> >>>>>>>>>>>   * @return the previous value rows associated with key, or null
> >>>>>>>> if
> >>>>>>>>>>> there was no mapping for key.
> >>>>>>>>>>>   * @param key - key row with which the specified value is to be
> >>>>>>>>>>> associated
> >>>>>>>>>>>   * @param value – value rows to be associated with the specified
> >>>>>>>>> key
> >>>>>>>>>>>   */
> >>>>>>>>>>>  Collection<RowData> put(RowData key, Collection<RowData> value);
> >>>>>>>>>>>
> >>>>>>>>>>>  /** Discards any cached value for the specified key. */
> >>>>>>>>>>>  void invalidate(RowData key);
> >>>>>>>>>>> }
> >>>>>>>>>>>
> >>>>>>>>>>> ```
> >>>>>>>>>>>
> >>>>>>>>>>> ## All cache
> >>>>>>>>>>> ```
> >>>>>>>>>>>
> >>>>>>>>>>> public interface AllCacheLookupProvider extends
> >>>>> CacheLookupProvider {
> >>>>>>>>>>>
> >>>>>>>>>>>  void registerReloadStrategy(ScheduledExecutorService
> >>>>>>>>>>> executorService, Reloader reloader);
> >>>>>>>>>>>
> >>>>>>>>>>>  ScanTableSource.ScanRuntimeProvider getScanRuntimeProvider();
> >>>>>>>>>>>
> >>>>>>>>>>>  @Override
> >>>>>>>>>>>  AllCacheBuilder createCacheBuilder();
> >>>>>>>>>>> }
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> public interface AllCacheBuilder extends CacheBuilder {
> >>>>>>>>>>>
> >>>>>>>>>>>  AllCache create();
> >>>>>>>>>>> }
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> public interface AllCache extends Cache {
> >>>>>>>>>>>
> >>>>>>>>>>>  void putAll(Iterator<Map<RowData, RowData>> allEntries);
> >>>>>>>>>>>
> >>>>>>>>>>>  void clearAll();
> >>>>>>>>>>> }
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> public interface Reloader {
> >>>>>>>>>>>
> >>>>>>>>>>>  void reload();
> >>>>>>>>>>> }
> >>>>>>>>>>>
> >>>>>>>>>>> ```
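[Editorial note] To make the proposed contract concrete, here is a minimal, illustrative sketch (not part of the proposal) of a partial cache satisfying the getIfPresent/put/invalidate/size methods above; RowData is replaced by plain generics so the snippet is self-contained:

```java
import java.util.Collection;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative only: RowData is replaced by generic key/value types, and the
// surface is trimmed to the methods discussed in this thread.
class InMemoryPartialCache<K, V> {
    private final Map<K, Collection<V>> store = new ConcurrentHashMap<>();

    /** Returns the cached rows for the key, or null if there is no entry. */
    Collection<V> getIfPresent(K key) {
        return store.get(key);
    }

    /** Associates rows with the key; returns the previously cached rows. */
    Collection<V> put(K key, Collection<V> value) {
        return store.put(key, value);
    }

    /** Discards any cached rows for the key. */
    void invalidate(K key) {
        store.remove(key);
    }

    /** Number of key-to-rows mappings currently in the cache. */
    long size() {
        return store.size();
    }
}
```

A real implementation would additionally enforce an eviction policy (size or TTL based) and expose the cache metrics this FLIP standardizes.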
> >>>>>>>>>>>
> >>>>>>>>>>> Best,
> >>>>>>>>>>> Jingsong
> >>>>>>>>>>>
> >>>>>>>>>>> On Fri, May 27, 2022 at 11:10 AM Jingsong Li <
> >>>>> jingsonglee0@gmail.com
> >>>>>>>>>
> >>>>>>>>>>> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>> Thanks Qingsheng and all for your discussion.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Very sorry to jump in so late.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Maybe I missed something?
> >>>>>>>>>>>> My first impression when I saw the cache interface was, why don't
> >>>>>>>> we
> >>>>>>>>>>>> provide an interface similar to guava cache [1], on top of guava
> >>>>>>>>> cache,
> >>>>>>>>>>>> caffeine also makes extensions for asynchronous calls.[2]
> >>>>>>>>>>>> There is also the bulk load in caffeine too.
> >>>>>>>>>>>>
> >>>>>>>>>>>> I am also more confused why first from LookupCacheFactory.Builder
> >>>>>>>> and
> >>>>>>>>>>> then
> >>>>>>>>>>>> to Factory to create Cache.
> >>>>>>>>>>>>
> >>>>>>>>>>>> [1] https://github.com/google/guava
> >>>>>>>>>>>> [2] https://github.com/ben-manes/caffeine/wiki/Population
> >>>>>>>>>>>>
> >>>>>>>>>>>> Best,
> >>>>>>>>>>>> Jingsong
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Thu, May 26, 2022 at 11:17 PM Jark Wu <im...@gmail.com>
> >>>>> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>>> After looking at the newly introduced ReloadTime and Becket's
> >>>>>>>> comment,
> >>>>>>>>>>>>> I agree with Becket we should have a pluggable reloading
> >>>>> strategy.
> >>>>>>>>>>>>> We can provide some common implementations, e.g., periodic
> >>>>>>>>> reloading,
> >>>>>>>>>>> and
> >>>>>>>>>>>>> daily reloading.
> >>>>>>>>>>>>> But there will definitely be some connector- or business-specific
> >>>>>>>>> reloading
> >>>>>>>>>>>>> strategies, e.g.
> >>>>>>>>>>>>> notification via a ZooKeeper watcher, or reloading once a new Hive
> >>>>>>>>>>>>> partition is complete.
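[Editorial note] A pluggable reload strategy along these lines could be sketched as follows; `Reloader` mirrors the single-method interface proposed elsewhere in this thread, while the strategy class name is purely illustrative:

```java
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Sketch only: `Reloader` mirrors the single-method interface proposed in this
// thread; the strategy class below is one possible built-in implementation.
interface Reloader {
    void reload();
}

interface ReloadStrategy {
    void register(ScheduledExecutorService executor, Reloader reloader);
}

/** Periodic reloading, the common default. */
class FixedDelayReloadStrategy implements ReloadStrategy {
    private final long delayMs;

    FixedDelayReloadStrategy(long delayMs) {
        this.delayMs = delayMs;
    }

    @Override
    public void register(ScheduledExecutorService executor, Reloader reloader) {
        // A connector-specific strategy (e.g. one triggered by a ZooKeeper
        // watcher, or by the completion of a Hive partition) would instead
        // call reloader.reload() from its own trigger.
        executor.scheduleWithFixedDelay(reloader::reload, 0, delayMs, TimeUnit.MILLISECONDS);
    }
}
```

The framework would only depend on ReloadStrategy, so daily, periodic, or event-driven reloading all plug in behind the same Reloader callback.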
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Best,
> >>>>>>>>>>>>> Jark
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> On Thu, 26 May 2022 at 11:52, Becket Qin <be...@gmail.com>
> >>>>>>>>>> wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> Hi Qingsheng,
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Thanks for updating the FLIP. A few comments / questions below:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> 1. Is there a reason that we have both "XXXFactory" and
> >>>>>>>>>> "XXXProvider".
> >>>>>>>>>>>>>> What is the difference between them? If they are the same, can
> >>>>>>>> we
> >>>>>>>>>> just
> >>>>>>>>>>>>> use
> >>>>>>>>>>>>>> XXXFactory everywhere?
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> 2. Regarding the FullCachingLookupProvider, should the reloading
> >>>>>>>>>>> policy
> >>>>>>>>>>>>>> also be pluggable? Periodic reloading could sometimes be
> >>>>>>>>> tricky
> >>>>>>>>>>> in
> >>>>>>>>>>>>>> practice. For example, if user uses 24 hours as the cache
> >>>>>>>> refresh
> >>>>>>>>>>>>> interval
> >>>>>>>>>>>>>> and some nightly batch job delayed, the cache update may still
> >>>>>>>> see
> >>>>>>>>>> the
> >>>>>>>>>>>>>> stale data.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> 3. In DefaultLookupCacheFactory, it looks like InitialCapacity
> >>>>>>>>>> should
> >>>>>>>>>>> be
> >>>>>>>>>>>>>> removed.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> 4. The purpose of LookupFunctionProvider#cacheMissingKey()
> >>>>>>>> seems a
> >>>>>>>>>>>>> little
> >>>>>>>>>>>>>> confusing to me. If Optional<LookupCacheFactory>
> >>>>>>>> getCacheFactory()
> >>>>>>>>>>>>> returns
> >>>>>>>>>>>>>> a non-empty factory, doesn't that already indicate the
> >>>>>>>> framework
> >>>>>>>>> to
> >>>>>>>>>>>>> cache
> >>>>>>>>>>>>>> the missing keys? Also, why is this method returning an
> >>>>>>>>>>>>> Optional<Boolean>
> >>>>>>>>>>>>>> instead of boolean?
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Jiangjie (Becket) Qin
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> On Wed, May 25, 2022 at 5:07 PM Qingsheng Ren <
> >>>>>>>> renqschn@gmail.com
> >>>>>>>>>>
> >>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Hi Lincoln and Jark,
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Thanks for the comments! If the community reaches a consensus
> >>>>>>>>> that
> >>>>>>>>>> we
> >>>>>>>>>>>>> use
> >>>>>>>>>>>>>>> SQL hint instead of table options to decide whether to use sync
> >>>>>>>>> or
> >>>>>>>>>>>>> async
> >>>>>>>>>>>>>>> mode, it’s indeed not necessary to introduce the “lookup.async”
> >>>>>>>>>>> option.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> I think it’s a good idea to let the decision of async made on
> >>>>>>>>> query
> >>>>>>>>>>>>>>> level, which could make better optimization with more
> >>>>>>>> information
> >>>>>>>>>>>>> gathered
> >>>>>>>>>>>>>>> by planner. Is there any FLIP describing the issue in
> >>>>>>>>> FLINK-27625?
> >>>>>>>>>> I
> >>>>>>>>>>>>>> thought FLIP-234 only proposes adding a SQL hint for retry on missing
> >>>>>>>>>>>>>> keys, rather than having the entire async mode controlled by a hint.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Best regards,
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Qingsheng
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> On May 25, 2022, at 15:13, Lincoln Lee <
> >>>>>>>> lincoln.86xy@gmail.com
> >>>>>>>>>>
> >>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Hi Jark,
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Thanks for your reply!
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Currently 'lookup.async' just lies in HBase connector, I have
> >>>>>>>>> no
> >>>>>>>>>>> idea
> >>>>>>>>>>>>>>>> whether or when to remove it (we can discuss it in another
> >>>>>>>>> issue
> >>>>>>>>>>> for
> >>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>> HBase connector after FLINK-27625 is done), just not add it
> >>>>>>>>> into
> >>>>>>>>>> a
> >>>>>>>>>>>>>>> common
> >>>>>>>>>>>>>>>> option now.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Best,
> >>>>>>>>>>>>>>>> Lincoln Lee
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> On Tue, May 24, 2022 at 20:14, Jark Wu <im...@gmail.com> wrote:
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Hi Lincoln,
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> I have taken a look at FLIP-234, and I agree with you that
> >>>>>>>> the
> >>>>>>>>>>>>>>> connectors
> >>>>>>>>>>>>>>>>> can
> >>>>>>>>>>>>>>>>> provide both async and sync runtime providers simultaneously
> >>>>>>>>>>> instead
> >>>>>>>>>>>>>>> of one
> >>>>>>>>>>>>>>>>> of them.
> >>>>>>>>>>>>>>>>> At that point, "lookup.async" looks redundant. If this
> >>>>>>>> option
> >>>>>>>>> is
> >>>>>>>>>>>>>>> planned to
> >>>>>>>>>>>>>>>>> be removed
> >>>>>>>>>>>>>>>>> in the long term, I think it makes sense not to introduce it
> >>>>>>>>> in
> >>>>>>>>>>> this
> >>>>>>>>>>>>>>> FLIP.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Best,
> >>>>>>>>>>>>>>>>> Jark
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> On Tue, 24 May 2022 at 11:08, Lincoln Lee <
> >>>>>>>>>> lincoln.86xy@gmail.com
> >>>>>>>>>>>>
> >>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Hi Qingsheng,
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Sorry for jumping into the discussion so late. It's a good
> >>>>>>>>> idea
> >>>>>>>>>>>>> that
> >>>>>>>>>>>>>>> we
> >>>>>>>>>>>>>>>>> can
> >>>>>>>>>>>>>>>>>> have a common table option. I have a minor comments on
> >>>>>>>>>>>>> 'lookup.async'
> >>>>>>>>>>>>>>>>> that
> >>>>>>>>>>>>>>>>>> not make it a common option:
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> The table layer abstracts both sync and async lookup
> >>>>>>>>>>> capabilities,
> >>>>>>>>>>>>>>>>>> connector implementers can choose one or both; in the case of
> >>>>>>>>>>>>>>>>>> implementing only one capability (the status of most existing
> >>>>>>>>>>>>>>>>>> builtin connectors),
> >>>>>>>>>>>>>>>>>> 'lookup.async' will not be used.  And when a connector has
> >>>>>>>>> both
> >>>>>>>>>>>>>>>>>> capabilities, I think this decision is better made at
> >>>>>>>>>>>>>>>>>> the query level, for example, table planner can choose the
> >>>>>>>>>>> physical
> >>>>>>>>>>>>>>>>>> implementation of async lookup or sync lookup based on its
> >>>>>>>>> cost
> >>>>>>>>>>>>>>> model, or
> >>>>>>>>>>>>>>>>>> users can give query hint based on their own better
> >>>>>>>>>>>>> understanding.  If
> >>>>>>>>>>>>>>>>>> there is another common table option 'lookup.async', it may
> >>>>>>>>>>> confuse
> >>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>> users in the long run.
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> So, I prefer to leave the 'lookup.async' option in private
> >>>>>>>>>> place
> >>>>>>>>>>>>> (for
> >>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>> current hbase connector) and not turn it into a common
> >>>>>>>>> option.
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> WDYT?
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Best,
> >>>>>>>>>>>>>>>>>> Lincoln Lee
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> On Mon, May 23, 2022 at 14:54, Qingsheng Ren <re...@gmail.com> wrote:
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Hi Alexander,
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Thanks for the review! We recently updated the FLIP and
> >>>>>>>> you
> >>>>>>>>>> can
> >>>>>>>>>>>>> find
> >>>>>>>>>>>>>>>>>> those
> >>>>>>>>>>>>>>>>>> changes from my latest email. Since some terminology has
> >>>>>>>>>>>>> changed,
> >>>>>>>>>>>>>>>>>> I'll use the new concepts when replying to your comments.
> >>>>>>>>>>>>>>>>>>> use the new concept for replying your comments.
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> 1. Builder vs ‘of’
> >>>>>>>>>>>>>>>>>>> I’m OK to use builder pattern if we have additional
> >>>>>>>> optional
> >>>>>>>>>>>>>>> parameters
> >>>>>>>>>>>>>>>>>>> for full caching mode (“rescan” previously). The
> >>>>>>>>>>>>> schedule-with-delay
> >>>>>>>>>>>>>>>>> idea
> >>>>>>>>>>>>>>>>>>> looks reasonable to me, but I think we need to redesign
> >>>>>>>> the
> >>>>>>>>>>>>> builder
> >>>>>>>>>>>>>>> API
> >>>>>>>>>>>>>>>>>> of
> >>>>>>>>>>>>>>>>>>> full caching to make it more descriptive for developers.
> >>>>>>>>> Would
> >>>>>>>>>>> you
> >>>>>>>>>>>>>>> mind
> >>>>>>>>>>>>>>>>>>> sharing your ideas about the API? For accessing the FLIP
> >>>>>>>>>>> workspace
> >>>>>>>>>>>>>>> you
> >>>>>>>>>>>>>>>>>> can
> >>>>>>>>>>>>>>>>>>> just provide your account ID and ping any PMC member
> >>>>>>>>> including
> >>>>>>>>>>>>> Jark.
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> 2. Common table options
> >>>>>>>>>>>>>>>>>>> We have some discussions these days and propose to
> >>>>>>>>> introduce 8
> >>>>>>>>>>>>> common
> >>>>>>>>>>>>>>>>>>> table options about caching. It has been updated on the
> >>>>>>>>> FLIP.
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> 3. Retries
> >>>>>>>>>>>>>>>>>>> I think we are on the same page :-)
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> For your additional concerns:
> >>>>>>>>>>>>>>>>>>> 1) The table option has been updated.
> >>>>>>>>>>>>>>>>>>> 2) We got “lookup.cache” back for configuring whether to
> >>>>>>>> use
> >>>>>>>>>>>>> partial
> >>>>>>>>>>>>>>> or
> >>>>>>>>>>>>>>>>>>> full caching mode.
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Best regards,
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Qingsheng
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> On May 19, 2022, at 17:25, Александр Смирнов <
> >>>>>>>>>>>>> smiralexan@gmail.com>
> >>>>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> Also I have a few additions:
> >>>>>>>>>>>>>>>>>>>> 1) maybe rename 'lookup.cache.maximum-size' to
> >>>>>>>>>>>>>>>>>>>> 'lookup.cache.max-rows'? I think it will be more clear
> >>>>>>>> that
> >>>>>>>>>> we
> >>>>>>>>>>>>> talk
> >>>>>>>>>>>>>>>>>>>> not about bytes, but about the number of rows. Plus it
> >>>>>>>> fits
> >>>>>>>>>>> more,
> >>>>>>>>>>>>>>>>>>>> considering my optimization with filters.
> >>>>>>>>>>>>>>>>>>>> 2) How will users enable rescanning? Are we going to
> >>>>>>>>> separate
> >>>>>>>>>>>>>>> caching
> >>>>>>>>>>>>>>>>>>>> and rescanning from the options point of view? Like
> >>>>>>>>> initially
> >>>>>>>>>>> we
> >>>>>>>>>>>>> had
> >>>>>>>>>>>>>>>>>>>> one option 'lookup.cache' with values LRU / ALL. I think
> >>>>>>>>> now
> >>>>>>>>>> we
> >>>>>>>>>>>>> can
> >>>>>>>>>>>>>>>>>>>> make a boolean option 'lookup.rescan'. RescanInterval can
> >>>>>>>>> be
> >>>>>>>>>>>>>>>>>>>> 'lookup.rescan.interval', etc.
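[Editorial note] For point 1), a cache bounded by the number of cached keys (each mapping to a collection of rows) rather than by bytes can be sketched with a plain access-ordered LinkedHashMap; this only illustrates the 'max-rows' semantics, not how Flink's default cache is implemented:

```java
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.Map;

// Illustration of 'max-rows'-style semantics: the bound is a count of cached
// key-to-rows mappings, not bytes. Not how Flink's default cache is built.
class LruRowCache<K, V> extends LinkedHashMap<K, Collection<V>> {
    private final int maxEntries;

    LruRowCache(int maxEntries) {
        super(16, 0.75f, true); // access-order => least-recently-used eviction
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, Collection<V>> eldest) {
        return size() > maxEntries;
    }
}
```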
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> Best regards,
> >>>>>>>>>>>>>>>>>>>> Alexander
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> On Thu, May 19, 2022 at 14:50, Александр Смирнов <
> >>>>>>>>>>>>> smiralexan@gmail.com
> >>>>>>>>>>>>>>>>>> :
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> Hi Qingsheng and Jark,
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> 1. Builders vs 'of'
> >>>>>>>>>>>>>>>>>>>>> I understand that builders are used when we have
> >>>>>>>> multiple
> >>>>>>>>>>>>>>>>> parameters.
> >>>>>>>>>>>>>>>>>>>>> I suggested them because we could add parameters later.
> >>>>>>>> To
> >>>>>>>>>>>>> prevent
> >>>>>>>>>>>>>>>>>>>>> Builder for ScanRuntimeProvider from looking redundant I
> >>>>>>>>> can
> >>>>>>>>>>>>>>> suggest
> >>>>>>>>>>>>>>>>>>>>> one more config now - "rescanStartTime".
> >>>>>>>>>>>>>>>>>>>>> It's a time in UTC (LocalTime class) when the first
> >>>>>>>> reload
> >>>>>>>>>> of
> >>>>>>>>>>>>> cache
> >>>>>>>>>>>>>>>>>>>>> starts. This parameter can be thought of as
> >>>>>>>> 'initialDelay'
> >>>>>>>>>>> (diff
> >>>>>>>>>>>>>>>>>>>>> between current time and rescanStartTime) in method
> >>>>>>>>>>>>>>>>>>>>> ScheduledExecutorService#scheduleWithFixedDelay [1]. It
> >>>>>>>>> can
> >>>>>>>>>> be
> >>>>>>>>>>>>> very
> >>>>>>>>>>>>>>>>>>>>> useful when the dimension table is updated by some other
> >>>>>>>>>>>>> scheduled
> >>>>>>>>>>>>>>>>> job
> >>>>>>>>>>>>>>>>>>>>> at a certain time. Or when the user simply wants a
> >>>>>>>> second
> >>>>>>>>>> scan
> >>>>>>>>>>>>>>>>> (first
> >>>>>>>>>>>>>>>>>>>>> cache reload) be delayed. This option can be used even
> >>>>>>>>>> without
> >>>>>>>>>>>>>>>>>>>>> 'rescanInterval' - in this case 'rescanInterval' will be
> >>>>>>>>> one
> >>>>>>>>>>>>> day.
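[Editorial note] The initialDelay computation described above (the difference between the current UTC time and rescanStartTime, rolling over to the next day when the start time has already passed) could be sketched as a hypothetical helper:

```java
import java.time.Duration;
import java.time.LocalTime;

// Hypothetical helper (not part of the FLIP): computes the delay before the
// first cache reload from the current UTC time and the configured
// rescanStartTime. If the start time has already passed today, the first
// reload happens at that time tomorrow.
final class RescanDelays {
    static Duration initialDelay(LocalTime nowUtc, LocalTime rescanStartTime) {
        Duration delay = Duration.between(nowUtc, rescanStartTime);
        return delay.isNegative() ? delay.plusDays(1) : delay;
    }
}
```

The result maps directly to the initialDelay argument of ScheduledExecutorService#scheduleWithFixedDelay.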
> >>>>>>>>>>>>>>>>>>>>> If you are fine with this option, I would be very glad
> >>>>>>>> if
> >>>>>>>>>> you
> >>>>>>>>>>>>> would
> >>>>>>>>>>>>>>>>>>>>> give me access to edit FLIP page, so I could add it
> >>>>>>>> myself
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> 2. Common table options
> >>>>>>>>>>>>>>>>>>>>> I also think that FactoryUtil would be overloaded by all
> >>>>>>>>>> cache
> >>>>>>>>>>>>>>>>>>>>> options. But maybe unify all suggested options, not only
> >>>>>>>>> for
> >>>>>>>>>>>>>>> default
> >>>>>>>>>>>>>>>>>>>>> cache? I.e. class 'LookupOptions', that unifies default
> >>>>>>>>>> cache
> >>>>>>>>>>>>>>>>> options,
> >>>>>>>>>>>>>>>>>>>>> rescan options, 'async', 'maxRetries'. WDYT?
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> 3. Retries
> >>>>>>>>>>>>>>>>>>>>> I'm fine with suggestion close to
> >>>>>>>>> RetryUtils#tryTimes(times,
> >>>>>>>>>>>>> call)
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> [1]
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>
> >>>>> https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/ScheduledExecutorService.html#scheduleWithFixedDelay-java.lang.Runnable-long-long-java.util.concurrent.TimeUnit-
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> Best regards,
> >>>>>>>>>>>>>>>>>>>>> Alexander
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> On Wed, May 18, 2022 at 16:04, Qingsheng Ren <
> >>>>>>>>>> renqschn@gmail.com
> >>>>>>>>>>>> :
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> Hi Jark and Alexander,
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> Thanks for your comments! I’m also OK to introduce
> >>>>>>>> common
> >>>>>>>>>>> table
> >>>>>>>>>>>>>>>>>>> options. I prefer to introduce a new
> >>>>>>>>> DefaultLookupCacheOptions
> >>>>>>>>>>>>> class
> >>>>>>>>>>>>>>>>> for
> >>>>>>>>>>>>>>>>>>> holding these option definitions because putting all
> >>>>>>>> options
> >>>>>>>>>>> into
> >>>>>>>>>>>>>>>>>>> FactoryUtil would make it a bit ”crowded” and not well
> >>>>>>>>>>>>> categorized.
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> FLIP has been updated according to suggestions above:
> >>>>>>>>>>>>>>>>>>>>>> 1. Use static “of” method for constructing
> >>>>>>>>>>>>> RescanRuntimeProvider
> >>>>>>>>>>>>>>>>>>> considering both arguments are required.
> >>>>>>>>>>>>>>>>>>>>>> 2. Introduce new table options matching
> >>>>>>>>>>>>> DefaultLookupCacheFactory
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> Best,
> >>>>>>>>>>>>>>>>>>>>>> Qingsheng
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> On Wed, May 18, 2022 at 2:57 PM Jark Wu <
> >>>>>>>>> imjark@gmail.com>
> >>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> Hi Alex,
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> 1) retry logic
> >>>>>>>>>>>>>>>>>>>>>>> I think we can extract some common retry logic into
> >>>>>>>>>>> utilities,
> >>>>>>>>>>>>>>>>> e.g.
> >>>>>>>>>>>>>>>>>>> RetryUtils#tryTimes(times, call).
> >>>>>>>>>>>>>>>>>>>>>>> This seems independent of this FLIP and can be reused
> >>>>>>>> by
> >>>>>>>>>>>>>>>>> DataStream
> >>>>>>>>>>>>>>>>>>> users.
> >>>>>>>>>>>>>>>>>>>>>>> Maybe we can open an issue to discuss this and where
> >>>>>>>> to
> >>>>>>>>>> put
> >>>>>>>>>>>>> it.
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> 2) cache ConfigOptions
> >>>>>>>>>>>>>>>>>>>>>>> I'm fine with defining cache config options in the
> >>>>>>>>>>> framework.
> >>>>>>>>>>>>>>>>>>>>>>> A candidate place to put is FactoryUtil which also
> >>>>>>>>>> includes
> >>>>>>>>>>>>>>>>>>> "sink.parallelism", "format" options.
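[Editorial note] One possible shape of the RetryUtils#tryTimes(times, call) utility mentioned in 1) could look like the following; the name and signature are only what this thread suggests, not an existing Flink API:

```java
import java.util.concurrent.Callable;

// One possible shape for the suggested utility; a production version might
// log failures, back off between attempts, or retry only specific
// retriable exception types.
final class RetryUtils {
    static <T> T tryTimes(int times, Callable<T> call) throws Exception {
        Exception last = null;
        for (int attempt = 1; attempt <= times; attempt++) {
            try {
                return call.call();
            } catch (Exception e) {
                last = e;
            }
        }
        // All attempts failed: rethrow the last observed failure.
        throw last;
    }
}
```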
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> Best,
> >>>>>>>>>>>>>>>>>>>>>>> Jark
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> On Wed, 18 May 2022 at 13:52, Александр Смирнов <
> >>>>>>>>>>>>>>>>>> smiralexan@gmail.com>
> >>>>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> Hi Qingsheng,
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> Thank you for considering my comments.
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>> there might be custom logic before making retry,
> >>>>>>>> such
> >>>>>>>>> as
> >>>>>>>>>>>>>>>>>>> re-establish the connection
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> Yes, I understand that. I meant that such logic can
> >>>>>>>> be
> >>>>>>>>>>>>> placed in
> >>>>>>>>>>>>>>>>> a
> >>>>>>>>>>>>>>>>>>>>>>>> separate function, that can be implemented by
> >>>>>>>>> connectors.
> >>>>>>>>>>>>> Just
> >>>>>>>>>>>>>>>>>> moving
> >>>>>>>>>>>>>>>>>>>>>>>> the retry logic would make connector's LookupFunction
> >>>>>>>>>> more
> >>>>>>>>>>>>>>>>> concise
> >>>>>>>>>>>>>>>>>> +
> >>>>>>>>>>>>>>>>>>>>>>>> avoid duplicate code. However, it's a minor change.
> >>>>>>>> The
> >>>>>>>>>>>>> decision
> >>>>>>>>>>>>>>>>> is
> >>>>>>>>>>>>>>>>>>> up
> >>>>>>>>>>>>>>>>>>>>>>>> to you.
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>> We decide not to provide common DDL options and let
> >>>>>>>>>>>>> developers
> >>>>>>>>>>>>>>>>> define their own options, as we do now per connector.
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> What is the reason for that? One of the main goals of
> >>>>>>>>>> this
> >>>>>>>>>>>>> FLIP
> >>>>>>>>>>>>>>>>> was
> >>>>>>>>>>>>>>>>>>> to
> >>>>>>>>>>>>>>>>>>>>>>>> unify the configs, wasn't it? I understand that
> >>>>>>>> current
> >>>>>>>>>>> cache
> >>>>>>>>>>>>>>>>>> design
> >>>>>>>>>>>>>>>>>>>>>>>> doesn't depend on ConfigOptions, like was before. But
> >>>>>>>>>> still
> >>>>>>>>>>>>> we
> >>>>>>>>>>>>>>>>> can
> >>>>>>>>>>>>>>>>>>> put
> >>>>>>>>>>>>>>>>>>>>>>>> these options into the framework, so connectors can
> >>>>>>>>> reuse
> >>>>>>>>>>>>> them
> >>>>>>>>>>>>>>>>> and
> >>>>>>>>>>>>>>>>>>>>>>>> avoid code duplication, and, what is more
> >>>>>>>> significant,
> >>>>>>>>>>> avoid
> >>>>>>>>>>>>>>>>>> possible
> >>>>>>>>>>>>>>>>>>>>>>>> different option naming. This point can be noted in the
> >>>>>>>>>>>>>>>>>>>>>>>> documentation for connector developers.
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> Best regards,
> >>>>>>>>>>>>>>>>>>>>>>>> Alexander
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> On Tue, May 17, 2022 at 17:11, Qingsheng Ren <
> >>>>>>>>>>>>> renqschn@gmail.com>:
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>> Hi Alexander,
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>> Thanks for the review and glad to see we are on the
> >>>>>>>>> same
> >>>>>>>>>>>>> page!
> >>>>>>>>>>>>>>> I
> >>>>>>>>>>>>>>>>>>> think you forgot to cc the dev mailing list so I’m also
> >>>>>>>>>> quoting
> >>>>>>>>>>>>> your
> >>>>>>>>>>>>>>>>>> reply
> >>>>>>>>>>>>>>>>>>> under this email.
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> We can add 'maxRetryTimes' option into this class
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>> In my opinion the retry logic should be implemented
> >>>>>>>> in
> >>>>>>>>>>>>> lookup()
> >>>>>>>>>>>>>>>>>>> instead of in LookupFunction#eval(). Retrying is only
> >>>>>>>>>> meaningful
> >>>>>>>>>>>>>>> under
> >>>>>>>>>>>>>>>>>> some
> >>>>>>>>>>>>>>>>>>> specific retriable failures, and there might be custom
> >>>>>>>> logic
> >>>>>>>>>>>>> before
> >>>>>>>>>>>>>>>>>> making
> >>>>>>>>>>>>>>>>>>> retry, such as re-establish the connection
> >>>>>>>>>>>>> (JdbcRowDataLookupFunction
> >>>>>>>>>>>>>>>>> is
> >>>>>>>>>>>>>>>>>> an
> >>>>>>>>>>>>>>>>>>> example), so it's more handy to leave it to the connector.
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> I don't see DDL options, that were in previous
> >>>>>>>>> version
> >>>>>>>>>> of
> >>>>>>>>>>>>>>> FLIP.
> >>>>>>>>>>>>>>>>>> Do
> >>>>>>>>>>>>>>>>>>> you have any special plans for them?
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>> We decide not to provide common DDL options and let
> >>>>>>>>>>>>> developers
> >>>>>>>>>>>>>>>>>> define their own options, as we do now per connector.
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>> The rest of comments sound great and I’ll update the
> >>>>>>>>>> FLIP.
> >>>>>>>>>>>>> Hope
> >>>>>>>>>>>>>>>>> we
> >>>>>>>>>>>>>>>>>>> can finalize our proposal soon!
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>> Best,
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>> Qingsheng
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> On May 17, 2022, at 13:46, Александр Смирнов <
> >>>>>>>>>>>>>>>>>> smiralexan@gmail.com>
> >>>>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> Hi Qingsheng and devs!
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> I like the overall design of updated FLIP, however
> >>>>>>>> I
> >>>>>>>>>> have
> >>>>>>>>>>>>>>>>> several
> >>>>>>>>>>>>>>>>>>>>>>>>>> suggestions and questions.
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> 1) Introducing LookupFunction as a subclass of
> >>>>>>>>>>>>> TableFunction
> >>>>>>>>>>>>>>>>> is a
> >>>>>>>>>>>>>>>>>>> good
> >>>>>>>>>>>>>>>>>>>>>>>>>> idea. We can add 'maxRetryTimes' option into this
> >>>>>>>>>> class.
> >>>>>>>>>>>>>>> 'eval'
> >>>>>>>>>>>>>>>>>>> method
> >>>>>>>>>>>>>>>>>>>>>>>>>> of new LookupFunction is great for this purpose.
> >>>>>>>> The
> >>>>>>>>>> same
> >>>>>>>>>>>>> is
> >>>>>>>>>>>>>>>>> for
> >>>>>>>>>>>>>>>>>>>>>>>>>> 'async' case.
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> 2) There might be other configs in the future,
> >>>>>>>>>>>>>>>>>>>>>>>>>> such as 'cacheMissingKey' in LookupFunctionProvider
> >>>>>>>>>>>>>>>>>>>>>>>>>> or 'rescanInterval' in ScanRuntimeProvider. Maybe
> >>>>>>>>>>>>>>>>>>>>>>>>>> use the Builder pattern in LookupFunctionProvider
> >>>>>>>>>>>>>>>>>>>>>>>>>> and RescanRuntimeProvider for more flexibility
> >>>>>>>>>>>>>>>>>>>>>>>>>> (one 'build' method instead of many 'of' methods
> >>>>>>>>>>>>>>>>>>>>>>>>>> in the future)?
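The builder idea can be sketched like this. This is a hypothetical provider, assuming the option names 'cacheMissingKey' and 'rescanInterval' from the discussion; it is not the actual FLIP interface:

```java
// Hypothetical provider sketch: one build() with optional settings instead
// of a growing family of of(...) overloads. Names follow the thread but are
// illustrative, not the final Flink API.
public class LookupProvider {
    private final boolean cacheMissingKey;
    private final long rescanIntervalMs;

    private LookupProvider(Builder b) {
        this.cacheMissingKey = b.cacheMissingKey;
        this.rescanIntervalMs = b.rescanIntervalMs;
    }

    public boolean isCacheMissingKey() { return cacheMissingKey; }
    public long getRescanIntervalMs() { return rescanIntervalMs; }

    public static Builder builder() { return new Builder(); }

    public static class Builder {
        private boolean cacheMissingKey = true; // defaults can evolve without
        private long rescanIntervalMs = -1L;    // breaking existing callers

        public Builder cacheMissingKey(boolean v) { this.cacheMissingKey = v; return this; }
        public Builder rescanInterval(long ms) { this.rescanIntervalMs = ms; return this; }
        public LookupProvider build() { return new LookupProvider(this); }
    }
}
```

The design benefit is exactly the one raised above: new optional configs become one extra builder method rather than a combinatorial explosion of static 'of' factories.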
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> 3) What are the plans for the existing
> >>>>>>>>>>>>>>>>>>>>>>>>>> TableFunctionProvider and
> >>>>>>>>>>>>>>>>>>>>>>>>>> AsyncTableFunctionProvider? I think they should
> >>>>>>>>>>>>>>>>>>>>>>>>>> be deprecated.
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> 4) Am I right that the current design does not
> >>>>>>>>>>>>>>>>>>>>>>>>>> assume usage of a user-provided LookupCache in
> >>>>>>>>>>>>>>>>>>>>>>>>>> re-scanning? In this case, it is not very clear
> >>>>>>>>>>>>>>>>>>>>>>>>>> why we need methods such as 'invalidate' or
> >>>>>>>>>>>>>>>>>>>>>>>>>> 'putAll' in LookupCache.
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> 5) I don't see the DDL options that were in the
> >>>>>>>>>>>>>>>>>>>>>>>>>> previous version of the FLIP. Do you have any
> >>>>>>>>>>>>>>>>>>>>>>>>>> special plans for them?
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> If you don't mind, I would be glad to be able to
> >>>>>>>>>>>>>>>>>>>>>>>>>> make small adjustments to the FLIP document too.
> >>>>>>>>>>>>>>>>>>>>>>>>>> I think it's worth mentioning exactly what
> >>>>>>>>>>>>>>>>>>>>>>>>>> optimizations are planned for the future.
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> Best regards,
> >>>>>>>>>>>>>>>>>>>>>>>>>> Smirnov Alexander
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> On Fri, May 13, 2022 at 20:27, Qingsheng Ren <renqschn@gmail.com>:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>> Hi Alexander and devs,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>> Thank you very much for the in-depth discussion!
> >>>>>>>>>>>>>>>>>>>>>>>>>>> As Jark mentioned, we were inspired by
> >>>>>>>>>>>>>>>>>>>>>>>>>>> Alexander's idea and refactored our design.
> >>>>>>>>>>>>>>>>>>>>>>>>>>> FLIP-221 [1] has been updated to reflect the new
> >>>>>>>>>>>>>>>>>>>>>>>>>>> design and we are happy to hear more suggestions
> >>>>>>>>>>>>>>>>>>>>>>>>>>> from you!
> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>> Compared to the previous design:
> >>>>>>>>>>>>>>>>>>>>>>>>>>> 1. The lookup cache serves at the table runtime
> >>>>>>>>>>>>>>>>>>>>>>>>>>> level and is integrated as a component of
> >>>>>>>>>>>>>>>>>>>>>>>>>>> LookupJoinRunner, as discussed previously.
> >>>>>>>>>>>>>>>>>>>>>>>>>>> 2. Interfaces are renamed and re-designed to
> >>>>>>>>>>>>>>>>>>>>>>>>>>> reflect the new design.
> >>>>>>>>>>>>>>>>>>>>>>>>>>> 3. We handle the all-caching case separately and
> >>>>>>>>>>>>>>>>>>>>>>>>>>> introduce a new RescanRuntimeProvider to reuse
> >>>>>>>>>>>>>>>>>>>>>>>>>>> the scanning ability. We are planning to support
> >>>>>>>>>>>>>>>>>>>>>>>>>>> SourceFunction / InputFormat for now, considering
> >>>>>>>>>>>>>>>>>>>>>>>>>>> the complexity of the FLIP-27 Source API.
> >>>>>>>>>>>>>>>>>>>>>>>>>>> 4. A new interface, LookupFunction, is introduced
> >>>>>>>>>>>>>>>>>>>>>>>>>>> to make the semantics of lookup more
> >>>>>>>>>>>>>>>>>>>>>>>>>>> straightforward for developers.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>> For replying to Alexander:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> However I'm a little confused whether
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> InputFormat is deprecated or not. Am I right
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> that it will be so in the future, but currently
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> it's not?
> >>>>>>>>>>>>>>>>>>>>>>>>>>> Yes, you are right. InputFormat is not deprecated
> >>>>>>>>>>>>>>>>>>>>>>>>>>> for now. I think it will be deprecated in the
> >>>>>>>>>>>>>>>>>>>>>>>>>>> future, but we don't have a clear plan for that.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks again for the discussion on this FLIP and
> >>>>>>>>>> looking
> >>>>>>>>>>>>>>>>> forward
> >>>>>>>>>>>>>>>>>>> to cooperating with you after we finalize the design and
> >>>>>>>>>>>>> interfaces!
> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>> [1]
> >>>>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-221+Abstraction+for+lookup+source+cache+and+metric
> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>> Best regards,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>> Qingsheng
> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>> On Fri, May 13, 2022 at 12:12 AM Александр
> >>>>>>>> Смирнов <
> >>>>>>>>>>>>>>>>>>> smiralexan@gmail.com> wrote:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi Jark, Qingsheng and Leonard!
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> Glad to see that we came to a consensus on
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> almost all points!
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> However, I'm a little confused whether
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> InputFormat is deprecated or not. Am I right
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> that it will be so in the future, but currently
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> it's not? Actually, I also think that for the
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> first version it's OK to use InputFormat in the
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> ALL cache implementation, because supporting
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> the rescan ability seems like a very distant
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> prospect. But for this decision we need a
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> consensus among all discussion participants.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> In general, I don't have anything to argue
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> about in your statements; all of them correspond
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> to my ideas. Looking ahead, it would be nice to
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> work on this FLIP cooperatively. I've already
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> done a lot of work on lookup join caching, with
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> an implementation very close to the one we are
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> discussing, and I want to share the results of
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> this work. Anyway, looking forward to the FLIP
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> update!
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> Best regards,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> Smirnov Alexander
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> On Thu, May 12, 2022 at 17:38, Jark Wu <imjark@gmail.com>:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi Alex,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for summarizing your points.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> In the past week, Qingsheng, Leonard, and I have
> >>>>>>>>>>>>> discussed
> >>>>>>>>>>>>>>>>> it
> >>>>>>>>>>>>>>>>>>> several times
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> and we have totally refactored the design.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> I'm glad to say we have reached a consensus on
> >>>>>>>>> many
> >>>>>>>>>> of
> >>>>>>>>>>>>> your
> >>>>>>>>>>>>>>>>>>> points!
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> Qingsheng is still working on updating the
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> design docs, which may be available in the
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> next few days.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> I will share some conclusions from our
> >>>>>>>>> discussions:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> 1) We have refactored the design towards the
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> "cache in framework" way.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> 2) A "LookupCache" interface for users to
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> customize, and a default implementation with a
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> builder for ease of use. This makes it possible
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> to have both flexibility and conciseness.
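A minimal sketch of what such a customizable cache interface plus a default implementation might look like. The method names are illustrative, based on the 'invalidate'/'putAll' discussion earlier in the thread, and are not the final Flink API:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch, not the final Flink interface: the framework talks to
// this small interface, while users may plug in their own implementation.
interface LookupCache {
    List<String> getIfPresent(String key);
    void put(String key, List<String> rows);
    void invalidate(String key);
}

// One default implementation obtained via a factory/builder-style entry point,
// so most users never implement the interface themselves.
public class DefaultLookupCache implements LookupCache {
    private final Map<String, List<String>> store = new HashMap<>();

    private DefaultLookupCache() {}

    public static DefaultLookupCache newInstance() { return new DefaultLookupCache(); }

    @Override public List<String> getIfPresent(String key) { return store.get(key); }
    @Override public void put(String key, List<String> rows) { store.put(key, rows); }
    @Override public void invalidate(String key) { store.remove(key); }
}
```

The split is what gives both properties discussed above: the interface keeps the framework-side contract small, and the default implementation keeps the common case concise.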
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> 3) Filter pushdown is important for both the
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> ALL and LRU lookup caches, especially for
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> reducing IO. Filter pushdown should be the
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> final state and the unified way to support
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> pruning both the ALL cache and the LRU cache,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> so I think we should make an effort in this
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> direction. If we need to support filter
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> pushdown for the ALL cache anyway, why not use
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> it for the LRU cache as well? Either way, since
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> we decided to implement the cache in the
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> framework, we have the chance to support
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> filtering on the cache at any time. This is an
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> optimization and it doesn't affect the public
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> API. I think we can create a JIRA issue to
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> discuss it when the FLIP is accepted.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> 4) The idea to support the ALL cache is
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> similar to your proposal. In the first version,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> we will only support InputFormat and
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> SourceFunction for cache-all (invoking the
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> InputFormat in the join operator). For the
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> FLIP-27 source, we need to join with a true
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> source operator instead of calling it embedded
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> in the join operator. However, this needs
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> another FLIP to support the re-scan ability for
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> the FLIP-27 Source, and this can be a lot of
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> work. In order not to block this issue, we can
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> put the effort of FLIP-27 source integration
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> into future work and integrate InputFormat &
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> SourceFunction for now.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> I think it's fine to use InputFormat &
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> SourceFunction, as they are not deprecated;
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> otherwise, we would have to introduce another
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> function similar to them, which is meaningless.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> We need to plan FLIP-27 source integration
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> ASAP, before InputFormat & SourceFunction are
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> deprecated.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> Best,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> Jark
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Thu, 12 May 2022 at 15:46, Александр Смирнов
> >>>>>>>> <
> >>>>>>>>>>>>>>>>>>> smiralexan@gmail.com>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi Martijn!
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Got it. So the implementation with
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> InputFormat is not considered.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for clearing that up!
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Best regards,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Smirnov Alexander
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Thu, May 12, 2022 at 14:23, Martijn Visser <martijn@ververica.com>:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> With regards to:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> But if there are plans to refactor all
> >>>>>>>>> connectors
> >>>>>>>>>>> to
> >>>>>>>>>>>>>>>>>> FLIP-27
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Yes, FLIP-27 is the target for all
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> connectors. The old interfaces will be
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> deprecated and connectors will either be
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> refactored to use the new ones or dropped.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> The caching should work for connectors that
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> are using the FLIP-27 interfaces; we should
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> not introduce new features for old
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> interfaces.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Best regards,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Martijn
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Thu, 12 May 2022 at 06:19, Александр
> >>>>>>>> Смирнов
> >>>>>>>>> <
> >>>>>>>>>>>>>>>>>>> smiralexan@gmail.com>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi Jark!
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Sorry for the late response. I would like
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> to make some comments and clarify my points.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 1) I agree with your first statement. I
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> think we can achieve both advantages this
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> way: put the Cache interface in
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> flink-table-common, but have implementations
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> of it in flink-table-runtime. Then, if a
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> connector developer wants to use the
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> existing cache strategies and their
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> implementations, he can just pass a
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> lookupConfig to the planner; but if he wants
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> to have his own cache implementation in his
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> TableFunction, it will be possible for him
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> to use the existing interface for this
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> purpose (we can explicitly point this out in
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the documentation). In this way all configs
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> and metrics will be unified. WDYT?
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> If a filter can prune 90% of data in the
> >>>>>>>>> cache,
> >>>>>>>>>> we
> >>>>>>>>>>>>> will
> >>>>>>>>>>>>>>>>>>> have 90% of
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> lookup requests that can never be cached
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 2) Let me clarify the logic of the filters
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> optimization in the case of an LRU cache. It
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> looks like Cache<RowData,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Collection<RowData>>. Here we always store
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the response of the dimension table in the
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> cache, even after applying the calc
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> function. I.e., if there are no rows left
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> after applying filters to the result of the
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 'eval' method of the TableFunction, we store
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> an empty list for the lookup keys. The cache
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> line will therefore still be filled, but
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> will require much less memory (in bytes).
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I.e., we don't completely filter out the
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> keys whose result was pruned, but we
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> significantly reduce the memory required to
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> store that result. If the user knows about
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> this behavior, he can increase the
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 'max-rows' option before the start of the
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> job. But actually I came up with the idea
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> that we can do this automatically by using
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the 'maximumWeight' and 'weigher' methods of
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the Guava cache [1]. The weight can be the
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> size of the collection of rows (the value in
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the cache). The cache can therefore
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> automatically fit many more records than
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> before.
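The weigher idea can be sketched without Guava as a weight-bounded LRU map, where an entry's weight is its row count, clamped to at least 1 so that cached empty results still occupy a (cheap) slot. This is an illustration of the mechanism only, not Flink or Guava code:

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch: an LRU cache bounded by the total number of cached
// rows ("weight") rather than by the number of keys, mirroring Guava's
// maximumWeight/weigher idea from the thread.
public class WeightBoundedLookupCache {
    private final long maxWeight; // budget: max total rows across all entries
    private long currentWeight = 0;
    // access-order map: iteration goes from least- to most-recently used
    private final LinkedHashMap<String, List<String>> cache =
            new LinkedHashMap<>(16, 0.75f, true);

    public WeightBoundedLookupCache(long maxWeight) { this.maxWeight = maxWeight; }

    // Weight of an entry = row count, but at least 1 so empty results count too.
    private static long weigh(List<String> rows) { return Math.max(1, rows.size()); }

    public void put(String key, List<String> rows) {
        List<String> old = cache.remove(key);
        if (old != null) currentWeight -= weigh(old);
        cache.put(key, rows);
        currentWeight += weigh(rows);
        // Evict least-recently-used entries until back under the weight budget.
        Iterator<Map.Entry<String, List<String>>> it = cache.entrySet().iterator();
        while (currentWeight > maxWeight && it.hasNext()) {
            Map.Entry<String, List<String>> e = it.next();
            if (e.getKey().equals(key)) continue; // never evict the entry just added
            currentWeight -= weigh(e.getValue());
            it.remove();
        }
    }

    public List<String> get(String key) { return cache.get(key); }
}
```

With weights like this, an empty (filtered-out) result costs 1 instead of a full row collection, which is exactly why weighing lets the same memory budget hold many more keys.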
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Flink SQL has provided a standard way to do
> >>>>>>>>>>> filters
> >>>>>>>>>>>>> and
> >>>>>>>>>>>>>>>>>>> projects
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> pushdown, i.e., SupportsFilterPushDown and
> >>>>>>>>>>>>>>>>>>> SupportsProjectionPushDown.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Jdbc/hive/HBase haven't implemented the
> >>>>>>>>>>> interfaces,
> >>>>>>>>>>>>>>>>> don't
> >>>>>>>>>>>>>>>>>>> mean it's
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> hard
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> to implement.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> It's debatable how difficult it will be to
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> implement filter pushdown. But I think the
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> fact that currently there is no database
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> connector with filter pushdown at least
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> means that this feature won't be supported
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> in connectors soon. Moreover, if we talk
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> about other connectors (not in the Flink
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> repo), their databases might not support all
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Flink filters (or not support filters at
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> all). I think users are interested in having
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the cache filters optimization supported
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> independently of other features and of
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> solving more complex (or outright
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> unsolvable) problems.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 3) I agree with your third statement.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Actually, in our internal version I also
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> tried to unify the logic of scanning and of
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> reloading data from connectors. But
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> unfortunately, I didn't find a way to unify
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the logic of all the ScanRuntimeProviders
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> (InputFormat, SourceFunction, Source, ...)
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> and reuse it for reloading the ALL cache. As
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> a result I settled on using InputFormat,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> because it was used for scanning in all
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> lookup connectors. (I didn't know that there
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> are plans to deprecate InputFormat in favor
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> of the FLIP-27 Source.) IMO, usage of the
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> FLIP-27 source in ALL caching is not a good
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> idea, because this source was designed to
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> work in a distributed environment (the
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> SplitEnumerator on the JobManager and the
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> SourceReaders on the TaskManagers), not in
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> one operator (the lookup join operator in
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> our case). There is not even a direct way to
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> pass splits from the SplitEnumerator to a
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> SourceReader (this logic works through the
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> SplitEnumeratorContext, which requires
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> OperatorCoordinator.SubtaskGateway to send
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> AddSplitEvents). Usage of InputFormat for
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the ALL cache seems much clearer and easier.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> But if there are plans to refactor all
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> connectors to FLIP-27, I have the following
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> idea: maybe we can drop the lookup join ALL
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> cache in favor of a simple join with
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> multiple scans of the batch source? The
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> point is that the only difference between
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the lookup join ALL cache and a simple join
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> with a batch source is that in the first
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> case scanning is performed multiple times,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> in between which the state (cache) is
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> cleared (correct me if I'm wrong). So what
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> if we extend the functionality of the simple
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> join to support state reloading, and extend
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the functionality of the batch source to
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> support scanning multiple times (the latter
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> should be easy with the new FLIP-27 source,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> which unifies streaming/batch reading: we
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> would only need to change the
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> SplitEnumerator so that it passes the splits
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> again after some TTL)? WDYT? I must say that
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> this looks like a long-term goal and will
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> make the scope of this FLIP even larger than
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> you said. Maybe we can limit ourselves to a
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> simpler solution (InputFormats) for now.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> So to sum up, my points are as follows:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 1) There is a way to make both concise and
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> flexible interfaces for caching in lookup
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> joins.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 2) The cache filters optimization is
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> important in both the LRU and ALL caches.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 3) It is unclear when filter pushdown will
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> be supported in Flink connectors; some
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> connectors might not even have the
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> opportunity to support filter pushdown, and
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> as far as I know, filter pushdown currently
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> works only for scanning (not lookup). So the
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> cache filters + projections optimization
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> should be independent of other features.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 4) The ALL cache implementation is a complex
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> topic that touches multiple aspects of how
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Flink is developing. Dropping InputFormat in
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> favor of the FLIP-27 Source will make the
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> ALL cache implementation really complex and
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> unclear, so maybe instead we can extend the
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> functionality of the simple join, or keep
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> InputFormat in the case of the lookup join
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> ALL cache?
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Best regards,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Smirnov Alexander
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> [1]
> >>>>> https://guava.dev/releases/18.0/api/docs/com/google/common/cache/CacheBuilder.html#weigher(com.google.common.cache.Weigher)
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Thu, May 5, 2022 at 20:34, Jark Wu <imjark@gmail.com>:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> It's great to see the active discussion! I
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> want to share my ideas:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 1) Implement the cache in the framework vs.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> in the connector base.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I don't have a strong opinion on this. Both
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> ways should work (e.g., cache pruning,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> compatibility). The framework way can
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> provide more concise interfaces. The
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> connector-base way can define more flexible
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> cache strategies/implementations. We are
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> still investigating whether we can have
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> both advantages. We should reach a
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> consensus that the chosen way is a final
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> state, and that we are on the path to it.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 2) Filters and projections pushdown:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I agree with Alex that filter pushdown into
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the cache can benefit the ALL cache a lot.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> However, this is not true for the LRU
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> cache. Connectors use a cache to reduce IO
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> requests to databases for better
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> throughput. If a filter can prune 90% of
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the data in the cache, we will have 90% of
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> lookup requests that can never be cached
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> and hit the databases directly. That means
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the cache is meaningless in this case.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> IMO, Flink SQL has provided a standard way
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> to do filter and projection pushdown, i.e.,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> SupportsFilterPushDown and
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> SupportsProjectionPushDown. That
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Jdbc/Hive/HBase haven't implemented the
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> interfaces doesn't mean they are hard to
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> implement. They should implement the
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> pushdown interfaces to reduce IO and the
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> cache size. The final state should be that
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the scan source and the lookup source share
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the exact same pushdown implementation. I
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> don't see why we would need to duplicate
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the pushdown logic in the caches, which
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> would complicate the lookup join design.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
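[Editor's note: the split between filters a source can take over and filters the runtime keeps can be sketched as below. The function and data shapes are hypothetical, only loosely modeled on Flink's SupportsFilterPushDown, whose applyFilters() returns accepted and remaining filters.]

```python
# Illustrative sketch: a connector accepts the predicates it can translate
# into its own query language; the planner keeps applying the rest.

def apply_filters(supported_ops, filters):
    """Split filters into (accepted, remaining), like a connector would."""
    accepted = [f for f in filters if f["op"] in supported_ops]
    remaining = [f for f in filters if f["op"] not in supported_ops]
    return accepted, remaining

filters = [
    {"op": ">",    "field": "age",  "value": 30},
    {"op": "LIKE", "field": "name", "value": "A%"},   # not supported below
]

# A hypothetical JDBC-like source that can only push simple comparisons.
accepted, remaining = apply_filters({"=", "<", ">"}, filters)

# Accepted filters become part of the query sent to the database, so both
# the IO and the cache only ever see pre-filtered rows.
where = " AND ".join(f"{f['field']} {f['op']} {f['value']!r}" for f in accepted)
print(where)           # age > 30
print(len(remaining))  # 1: the LIKE filter stays in the Flink runtime
```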
> 3) ALL cache abstraction
> The ALL cache might be the most challenging part of this FLIP. We have
> never provided a public reload-lookup interface. Currently, we put the
> reload logic in the "eval" method of TableFunction. That's hard for
> some sources (e.g., Hive). Ideally, connector implementations should
> share the logic of reload and scan, i.e. ScanTableSource with
> InputFormat/SourceFunction/FLIP-27 Source. However,
> InputFormat/SourceFunction are deprecated, and the FLIP-27 source is
> deeply coupled with SourceOperator. If we want to invoke the FLIP-27
> source in LookupJoin, this may make the scope of this FLIP much larger.
> We are still investigating how to abstract the ALL cache logic and
> reuse the existing source interfaces.
>
> Best,
> Jark
>
> On Thu, 5 May 2022 at 20:22, Roman Boyko <ro.v.boyko@gmail.com> wrote:
>
> It's a much more complicated activity and lies outside the scope of
> this improvement, because such pushdowns should be done for all
> ScanTableSource implementations (not only for lookup ones).
>
> On Thu, 5 May 2022 at 19:02, Martijn Visser <martijnvisser@apache.org> wrote:
>
> Hi everyone,
>
> One question regarding "And Alexander correctly mentioned that filter
> pushdown still is not implemented for jdbc/hive/hbase." -> Would an
> alternative solution be to actually implement these filter pushdowns?
> I can imagine that there are many more benefits to doing that, outside
> of lookup caching and metrics.
>
> Best regards,
>
> Martijn Visser
> https://twitter.com/MartijnVisser82
> https://github.com/MartijnVisser
>
> On Thu, 5 May 2022 at 13:58, Roman Boyko <ro.v.boyko@gmail.com> wrote:
>
> Hi everyone!
>
> Thanks for driving such a valuable improvement!
>
> I do think that a single cache implementation would be a nice
> opportunity for users. And it will break the "FOR SYSTEM_TIME AS OF
> proc_time" semantics anyway - no matter how it is implemented.
>
> Putting myself in the user's shoes, I can say that:
> 1) I would prefer to have the opportunity to cut the cache size by
> simply filtering out unnecessary data. And the handiest way to do that
> is to apply it inside the LookupRunners. It would be a bit harder to
> pass it through the LookupJoin node to the TableFunction. And
> Alexander correctly mentioned that filter pushdown still is not
> implemented for jdbc/hive/hbase.
> 2) The ability to set different caching parameters for different
> tables is quite important. So I would prefer to set them through DDL
> rather than have the same TTL, strategy and other options for all
> lookup tables.
> 3) Putting the cache into the framework really deprives us of
> extensibility (users won't be able to implement their own cache). But
> most probably this might be solved by creating more cache strategies
> and a wider set of configurations.
>
> All these points are much closer to the schema proposed by Alexander.
> Qingsheng Ren, please correct me if I'm wrong - might all these
> facilities be simply implemented in your architecture?
>
> Best regards,
> Roman Boyko
> e.: ro.v.boyko@gmail.com
>
> On Wed, 4 May 2022 at 21:01, Martijn Visser <martijnvisser@apache.org> wrote:
>
> Hi everyone,
>
> I don't have much to chip in, but just wanted to express that I really
> appreciate the in-depth discussion on this topic and I hope that
> others will join the conversation.
>
> Best regards,
>
> Martijn
>
> On Tue, 3 May 2022 at 10:15, Александр Смирнов <smiralexan@gmail.com> wrote:
>
> Hi Qingsheng, Leonard and Jark,
>
> Thanks for your detailed feedback! However, I have questions about
> some of your statements (maybe I didn't get something?).
>
> > Caching actually breaks the semantic of "FOR SYSTEM_TIME AS OF
> > proc_time”
>
> I agree that the semantics of "FOR SYSTEM_TIME AS OF proc_time" is not
> fully implemented with caching, but as you said, users opt into it
> consciously to achieve better performance (no one proposed to enable
> caching by default, etc.). Or by users do you mean other developers of
> connectors? In that case developers explicitly specify whether their
> connector supports caching or not (in the list of supported options);
> no one makes them do that if they don't want to. So what exactly is
> the difference between implementing caching in the flink-table-runtime
> and flink-table-common modules from this point of view? How does it
> affect breaking or not breaking the semantics of "FOR SYSTEM_TIME AS
> OF proc_time"?
>
> > confront a situation that allows table options in DDL to control the
> > behavior of the framework, which has never happened previously and
> > should be cautious
>
> If we talk about the main difference in semantics between DDL options
> and config options ("table.exec.xxx"), isn't it about limiting the
> scope of the options plus their importance for the user's business
> logic, rather than the specific location of the corresponding logic in
> the framework? I mean that in my design, for example, putting an
> option with the lookup cache strategy in configurations would be the
> wrong decision, because it directly affects the user's business logic
> (not just performance optimization) and touches just several functions
> of ONE table (there can be multiple tables with different caches).
> Does it really matter to the user (or anyone else) where the logic
> affected by the applied option is located?
> Also I can remember the DDL option 'sink.parallelism', which in some
> way "controls the behavior of the framework", and I don't see any
> problem here.
>
> > introduce a new interface for this all-caching scenario and the
> > design would become more complex
>
> This is a subject for a separate discussion, but actually in our
> internal version we solved this problem quite easily - we reused the
> InputFormat class (so there is no need for a new API). The point is
> that currently all lookup connectors use InputFormat for scanning the
> data in batch mode: HBase, JDBC and even Hive - it uses the
> PartitionReader class, which is actually just a wrapper around
> InputFormat. The advantage of this solution is the ability to reload
> cache data in parallel (the number of threads depends on the number of
> InputSplits, but has an upper limit). As a result the cache reload
> time is significantly reduced (as well as the time the input stream is
> blocked). I know that we usually try to avoid concurrency in Flink
> code, but maybe this one can be an exception. BTW I don't say it's an
> ideal solution; maybe there are better ones.
>
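[Editor's note: the parallel reload idea described above - one reader per input split, merged into the cache under a bounded worker count - can be sketched as follows. All names and the fake "split" reader are illustrative; this is not Flink's InputFormat API.]

```python
# Rough sketch: rebuild the whole lookup cache by reading splits in
# parallel with a bounded thread pool, then merging on the caller thread.

from concurrent.futures import ThreadPoolExecutor

def read_split(split):
    """Stand-in for reading one input split of the dimension table."""
    return [(key, {"key": key}) for key in split]

def reload_cache(splits, max_threads=4):
    """Reload all splits concurrently; max_threads is the upper limit."""
    cache = {}
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        for rows in pool.map(read_split, splits):
            cache.update(rows)          # merge happens on the caller thread
    return cache

splits = [range(0, 100), range(100, 200), range(200, 300)]
cache = reload_cache(splits)
print(len(cache))  # 300
```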
> > Providing the cache in the framework might introduce compatibility
> > issues
>
> That's possible only in cases when the developer of the connector
> doesn't properly refactor their code and uses the new cache options
> incorrectly (i.e. explicitly provides the same options in 2 different
> code places). For correct behavior all they need to do is redirect the
> existing options to the framework's LookupConfig (+ maybe add an alias
> for options, if the naming differed); everything will be transparent
> for users. If the developer doesn't do the refactoring at all, nothing
> will change for the connector because of backward compatibility. Also,
> if a developer wants to use their own cache logic, they can just
> refuse to pass some of the configs into the framework, and instead
> make their own implementation with the already existing configs and
> metrics (but actually I think that's a rare case).
>
> > filters and projections should be pushed all the way down to the
> > table function, like what we do in the scan source
>
> That's a great goal. But the truth is that the ONLY connector that
> supports filter pushdown is FileSystemTableSource (no database
> connector supports it currently). Also, for some databases it's simply
> impossible to push down such complex filters as we have in Flink.
>
> > only applying these optimizations to the cache seems not quite
> > useful
>
> Filters can cut off an arbitrarily large amount of data from the
> dimension table. For a simple example, suppose in the dimension table
> 'users' we have a column 'age' with values from 20 to 40, and an input
> stream 'clicks' that is ~uniformly distributed by age of users. If we
> have the filter 'age > 30', there will be half as much data in the
> cache. This means the user can increase 'lookup.cache.max-rows' by
> almost 2 times, which gains a huge performance boost. Moreover, this
> optimization starts to really shine with the 'ALL' cache, where tables
> without filters and projections can't fit in memory, but with them -
> can. This opens up additional possibilities for users. And that
> doesn't sound like 'not quite useful'.
>
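[Editor's note: a back-of-the-envelope check of the example above, under the stated assumptions: 'age' uniform over 20..40 and the filter 'age > 30'.]

```python
# Fraction of dimension rows surviving the filter 'age > 30' when ages
# are uniformly distributed over the 21 values 20..40.

ages = list(range(20, 41))                # 21 distinct ages, uniform
kept = [a for a in ages if a > 30]        # ages 31..40

fraction_kept = len(kept) / len(ages)
print(f"{fraction_kept:.2f}")             # 0.48: roughly half the rows survive

# So for the same memory budget, 'lookup.cache.max-rows' can be raised
# by about 1/fraction_kept, i.e. roughly 2x, as argued in the message.
print(round(1 / fraction_kept, 1))        # 2.1
```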
> It would be great to hear other voices regarding this topic! Because
> we have quite a lot of controversial points, and I think with the help
> of others it will be easier for us to come to a consensus.
>
> Best regards,
> Smirnov Alexander
>
> On Fri, 29 Apr 2022 at 22:33, Qingsheng Ren <renqschn@gmail.com> wrote:
>
> Hi Alexander and Arvid,
>
> Thanks for the discussion and sorry for my late response! We had an
> internal discussion together with Jark and Leonard and I'd like to
> summarize our ideas. Instead of implementing the cache logic in the
> table runtime layer or wrapping it around the user-provided table
> function, we prefer to introduce some new APIs extending
> TableFunction, with these concerns:
>
> 1. Caching actually breaks the semantic of "FOR SYSTEM_TIME AS OF
> proc_time", because it couldn't truly reflect the content of the
> lookup table at the moment of querying. If users choose to enable
> caching on the lookup table, they implicitly indicate that this
> breakage is acceptable in exchange for the performance. So we prefer
> not to provide caching at the table runtime level.
>
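[Editor's note: the staleness argument in point 1 can be made concrete with a tiny TTL cache and a fake clock - a hypothetical illustration, not Flink code: within the TTL the join sees the cached row, not the row the table actually contains at lookup time.]

```python
# Sketch: a TTL-bounded lookup cache returns stale values until the TTL
# expires, which is exactly why it weakens "FOR SYSTEM_TIME AS OF
# proc_time" semantics.

class TtlCachedLookup:
    def __init__(self, table, ttl, clock):
        self.table, self.ttl, self.clock = table, ttl, clock
        self.cache = {}                      # key -> (row, cached_at)

    def lookup(self, key):
        entry = self.cache.get(key)
        if entry and self.clock() - entry[1] < self.ttl:
            return entry[0]                  # possibly stale within the TTL
        row = self.table[key]                # refresh from the real table
        self.cache[key] = (row, self.clock())
        return row

now = [0]                                    # fake clock, advanced by hand
table = {1: "v1"}
lookup = TtlCachedLookup(table, ttl=10, clock=lambda: now[0])

print(lookup.lookup(1))   # v1  (cached at t=0)
table[1] = "v2"           # the dimension table changes...
now[0] = 5
print(lookup.lookup(1))   # v1  (stale: TTL not expired, semantics broken)
now[0] = 15
print(lookup.lookup(1))   # v2  (TTL expired, fresh value)
```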
> 2. If we put the cache implementation in the framework (whether in a
> runner or a wrapper around TableFunction), we have to confront a
> situation that allows table options in DDL to control the behavior of
> the framework, which has never happened previously and should be
> treated cautiously. Under the current design the behavior of the
> framework should only be specified by configurations
> ("table.exec.xxx"), and it's hard to apply these general configs to a
> specific table.
>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 3. We have use cases that lookup
> >>>>>>>> source
> >>>>>>>>>>> loads
> >>>>>>>>>>>>> and
> >>>>>>>>>>>>>>>>>>> refresh
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> all
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> records
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> periodically into the memory to achieve
> >>>>>>>>>> high
> >>>>>>>>>>>>>>> lookup
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> performance
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> (like
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hive
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> connector in the community, and also
> >>>>>>>>> widely
> >>>>>>>>>>>>> used
> >>>>>>>>>>>>>>> by
> >>>>>>>>>>>>>>>>>> our
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> internal
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> connectors). Wrapping the cache around
> >>>>>>>>> the
> >>>>>>>>>>>>> user’s
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> TableFunction
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> works
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> fine
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> for LRU caches, but I think we have to
> >>>>>>>>>>>>> introduce a
> >>>>>>>>>>>>>>>>>> new
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> interface for
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> this
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> all-caching scenario and the design
> >>>>>>>> would
> >>>>>>>>>>>>> become
> >>>>>>>>>>>>>>>>> more
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> complex.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 4. Providing the cache in the
> >>>>>>>> framework
> >>>>>>>>>>> might
> >>>>>>>>>>>>>>>>>>> introduce
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> compatibility
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> issues to existing lookup sources like
> >>>>>>>>>> there
> >>>>>>>>>>>>> might
> >>>>>>>>>>>>>>>>>>> exist two
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> caches
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> with
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> totally different strategies if the
> >>>>>>>> user
> >>>>>>>>>>>>>>>>> incorrectly
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> configures
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> table
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> (one in the framework and another
> >>>>>>>>>> implemented
> >>>>>>>>>>>>> by
> >>>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>> lookup
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> source).
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> As for the optimization mentioned by
> >>>>>>>>>>>>> Alexander, I
> >>>>>>>>>>>>>>>>>>> think
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> filters
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> and
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> projections should be pushed all the
> >>>>>>>> way
> >>>>>>>>>> down
> >>>>>>>>>>>>> to
> >>>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>> table
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> function,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> like
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> what we do in the scan source, instead
> >>>>>>>> of
> >>>>>>>>>> the
> >>>>>>>>>>>>>>>>> runner
> >>>>>>>>>>>>>>>>>>> with
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> cache.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> The
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> goal of using cache is to reduce the
> >>>>>>>>>> network
> >>>>>>>>>>>>> I/O
> >>>>>>>>>>>>>>>>> and
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> pressure
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> on the
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> external system, and only applying
> >>>>>>>> these
> >>>>>>>>>>>>>>>>>> optimizations
> >>>>>>>>>>>>>>>>>>> to
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> cache
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> seems
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> not quite useful.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I made some updates to the FLIP[1] to
> >>>>>>>>>>> reflect
> >>>>>>>>>>>>> our
> >>>>>>>>>>>>>>>>>>> ideas.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> We
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> prefer to
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> keep the cache implementation as a part
> >>>>>>>>> of
> >>>>>>>>>>>>>>>>>>> TableFunction,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> and we
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> could
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> provide some helper classes
> >>>>>>>>>>>>> (CachingTableFunction,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> AllCachingTableFunction,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> CachingAsyncTableFunction) to
> >>>>>>>> developers
> >>>>>>>>>> and
> >>>>>>>>>>>>>>>>> regulate
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> metrics
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> of the
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> cache.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Also, I made a POC[2] for your
> >>>>>>>> reference.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Looking forward to your ideas!
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> [1]
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>
> >>>>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-221+Abstraction+for+lookup+source+cache+and+metric
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> [2]
> >>>>>>>>>>>>>>>>>> https://github.com/PatrickRen/flink/tree/FLIP-221
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Best regards,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Qingsheng
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> On Tue, Apr 26, 2022 at 4:45 PM Александр Смирнов
> <smiralexan@gmail.com> wrote:
>>
>> Thanks for the response, Arvid!
>>
>> I have a few comments on your message.
>>
>>> but could also live with an easier solution as the first step:
>>
>> I think that these 2 ways are mutually exclusive (originally
>> proposed by Qingsheng and mine), because conceptually they follow
>> the same goal, but implementation details are different. If we go
>> one way, moving to another way in the future will mean deleting
>> existing code and once again changing the API for connectors. So I
>> think we should reach a consensus with the community about that and
>> then work together on this FLIP, i.e. divide the work on tasks for
>> different parts of the FLIP (for example, LRU cache unification /
>> introducing the proposed set of metrics / further work…). WDYT,
>> Qingsheng?
>>
>>> as the source will only receive the requests after filter
>>
>> Actually if filters are applied to fields of the lookup table, we
>> firstly must do requests, and only after that can we filter the
>> responses, because lookup connectors don’t have filter pushdown. So
>> if filtering is done before caching, there will be many fewer rows
>> in the cache.
>>
>>> @Alexander unfortunately, your architecture is not shared. I don't
>>> know the solution to share images to be honest.
>>
>> Sorry for that, I’m a bit new to such kinds of conversations :)
>> I have no write access to the confluence, so I made a Jira issue,
>> where I described the proposed changes in more detail -
>> https://issues.apache.org/jira/browse/FLINK-27411.
>>
>> Will be happy to get more feedback!
>>
>> Best,
>> Smirnov Alexander
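The "LRU cache unification" Alexander lists as a work item boils down to a small amount of logic. As a rough sketch only (the class and method names below are invented for illustration and are not part of the FLIP or Flink's API), a `LinkedHashMap` in access order gives exactly the eviction behavior a per-connector LRU cache needs:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of an LRU lookup cache; names are illustrative, not the FLIP's API.
class LruLookupCache<K, V> {
    private final LinkedHashMap<K, V> entries;

    LruLookupCache(int maxSize) {
        // accessOrder = true: iteration order follows recency of access,
        // so the eldest entry is always the least recently used one.
        this.entries = new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                return size() > maxSize; // evict once the cap is exceeded
            }
        };
    }

    V getIfPresent(K key) {
        return entries.get(key); // a get counts as an access
    }

    void put(K key, V value) {
        entries.put(key, value);
    }

    int size() {
        return entries.size();
    }
}
```

With a cap of 2, inserting a third key evicts whichever of the first two was touched least recently.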
>> Mon, Apr 25, 2022 at 19:49, Arvid Heise <arvid@apache.org>:
>>>
>>> Hi Qingsheng,
>>>
>>> Thanks for driving this; the inconsistency was not satisfying for
>>> me.
>>>
>>> I second Alexander's idea though but could also live with an easier
>>> solution as the first step: Instead of making caching an
>>> implementation detail of TableFunction X, rather devise a caching
>>> layer around X. So the proposal would be a CachingTableFunction
>>> that delegates to X in case of misses and else manages the cache.
>>> Lifting it into the operator model as proposed would be even better
>>> but is probably unnecessary in the first step for a lookup source
>>> (as the source will only receive the requests after filter;
>>> applying projection may be more interesting to save memory).
>>>
>>> Another advantage is that all the changes of this FLIP would be
>>> limited to options, no need for new public interfaces. Everything
>>> else remains an implementation of Table runtime. That means we can
>>> easily incorporate the optimization potential that Alexander
>>> pointed out later.
>>>
>>> @Alexander unfortunately, your architecture is not shared. I don't
>>> know the solution to share images to be honest.
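Arvid's "caching layer around X" can be reduced to a few lines if a plain `Function` stands in for the wrapped `TableFunction` X. This is only a sketch under that simplification — `CachingLookup` is an invented name, and the real table runtime would work with row collections and report hit/miss counts through Flink's metric groups rather than plain fields:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Sketch of a caching wrapper that delegates to X only on cache misses.
class CachingLookup<K, V> {
    private final Function<K, V> delegate; // stands in for TableFunction X
    private final Map<K, V> cache = new HashMap<>();
    private long hits;
    private long misses;

    CachingLookup(Function<K, V> delegate) {
        this.delegate = delegate;
    }

    V lookup(K key) {
        V cached = cache.get(key);
        if (cached != null) {
            hits++;
            return cached; // served locally, no external I/O
        }
        misses++;
        V loaded = delegate.apply(key); // only misses reach the external system
        cache.put(key, loaded);
        return loaded;
    }

    long hits() { return hits; }
    long misses() { return misses; }
}
```

Repeated lookups of the same key hit the external system only once, which is the whole point of the wrapper: hit/miss counters fall out of the design for free, which is where the FLIP's standard cache metrics would attach.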
>>> On Fri, Apr 22, 2022 at 5:04 PM Александр Смирнов
>>> <smiralexan@gmail.com> wrote:
>>>>
>>>> Hi Qingsheng! My name is Alexander, I'm not a committer yet, but
>>>> I'd really like to become one. And this FLIP really interested me.
>>>> Actually I have worked on a similar feature in my company’s Flink
>>>> fork, and we would like to share our thoughts on this and make the
>>>> code open source.
>>>>
>>>> I think there is a better alternative than introducing an abstract
>>>> class for TableFunction (CachingTableFunction). As you know,
>>>> TableFunction exists in the flink-table-common module, which
>>>> provides only an API for working with tables – it’s very
>>>> convenient for importing in connectors. In turn,
>>>> CachingTableFunction contains logic for runtime execution, so this
>>>> class and everything connected with it should be located in
>>>> another module, probably in flink-table-runtime. But this will
>>>> require connectors to depend on another module, which contains a
>>>> lot of runtime logic, which doesn’t sound good.
>>>>
>>>> I suggest adding a new method ‘getLookupConfig’ to
>>>> LookupTableSource or LookupRuntimeProvider to allow connectors to
>>>> only pass configurations to the planner, so they won’t depend on
>>>> the runtime realization. Based on these configs the planner will
>>>> construct a lookup join operator with the corresponding runtime
>>>> logic (ProcessFunctions in module flink-table-runtime). The
>>>> architecture looks like in the pinned image (the LookupConfig
>>>> class there is actually your CacheConfig).
>>>>
>>>> Classes in flink-table-planner that will be responsible for this –
>>>> CommonPhysicalLookupJoin and its inheritors. Current classes for
>>>> lookup join in flink-table-runtime – LookupJoinRunner,
>>>> AsyncLookupJoinRunner, LookupJoinRunnerWithCalc,
>>>> AsyncLookupJoinRunnerWithCalc. I suggest adding classes
>>>> LookupJoinCachingRunner, LookupJoinCachingRunnerWithCalc, etc.
>>>>
>>>> And here comes another, more powerful advantage of such a
>>>> solution. If we have caching logic on a lower level, we can apply
>>>> some optimizations to it. LookupJoinRunnerWithCalc was named like
>>>> this because it uses the ‘calc’ function, which actually mostly
>>>> consists of filters and projections.
>>>>
>>>> For example, in a join of table A with lookup table B with the
>>>> condition ‘JOIN … ON A.id = B.id AND A.age = B.age + 10 WHERE
>>>> B.salary > 1000’, the ‘calc’ function will contain the filters
>>>> A.age = B.age + 10 and B.salary > 1000.
>>>>
>>>> If we apply this function before storing records in the cache, the
>>>> size of the cache will be significantly reduced: filters = avoid
>>>> storing useless records in the cache, projections = reduce the
>>>> records’ size. So the initial max number of records in the cache
>>>> can be increased by the user.
>>>>
>>>> What do you think about it?
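The saving Alexander describes can be seen in a toy version of the cache: if the cache-safe part of the 'calc' condition (here B.salary > 1000 from the example; the A.age = B.age + 10 part depends on the probe row and must still run at join time) is applied before insertion, rows that no join result can ever use never occupy cache memory. Everything below — the class name and the {id, age, salary} row layout — is an assumption for illustration, not a Flink class:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Sketch: filter fetched rows with the cache-safe part of the calc
// condition before they are stored, so useless rows never enter the cache.
class FilteringCache {
    // key: lookup key (B.id); value: rows fetched for that key,
    // each row encoded as {id, age, salary} (hypothetical layout)
    private final Map<Integer, List<int[]>> cache = new HashMap<>();
    private final Predicate<int[]> calcFilter;

    FilteringCache(Predicate<int[]> calcFilter) {
        this.calcFilter = calcFilter;
    }

    void put(int id, List<int[]> fetchedRows) {
        List<int[]> kept = new ArrayList<>();
        for (int[] row : fetchedRows) {
            if (calcFilter.test(row)) {
                kept.add(row); // only rows that can produce a join result
            }
        }
        cache.put(id, kept);
    }

    int cachedRowCount(int id) {
        return cache.get(id).size();
    }
}
```

With the salary predicate in place, a fetch that returns one qualifying and one non-qualifying row stores only the qualifying one, so the same memory budget holds more distinct keys.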
>>>> On 2022/04/19 02:47:11 Qingsheng Ren wrote:
>>>>> Hi devs,
>>>>>
>>>>> Yuan and I would like to start a discussion about FLIP-221[1],
>>>>> which introduces an abstraction of lookup table cache and its
>>>>> standard metrics.
>>>>>
>>>>> Currently each lookup table source should implement their own
>>>>> cache to store lookup results, and there isn’t a standard of
>>>>> metrics for users and developers to tuning their jobs with lookup
>>>>> joins, which is a quite common use case in Flink table / SQL.
>>>>>
>>>>> Therefore we propose some new APIs including cache, metrics,
>>>>> wrapper classes of TableFunction and new table options. Please
>>>>> take a look at the FLIP page [1] to get more details. Any
>>>>> suggestions and comments would be appreciated!
>>>>>
>>>>> [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-221+Abstraction+for+lookup+source+cache+and+metric
>>>>>
>>>>> Best regards,
>>>>>
>>>>> Qingsheng
>
> --
> Best Regards,
>
> Qingsheng Ren
>
> Real-time Computing Team
> Alibaba Cloud
>
> Email: renqschn@gmail.com
>
>> --
>> Best regards,
>> Roman Boyko
>> e.: ro.v.boyko@gmail.com

Re: [DISCUSS] FLIP-221 Abstraction for lookup source cache and metric

Posted by Alexander Smirnov <sm...@gmail.com>.
Hi Qingsheng,

I like the current design, thanks for your efforts! I have no
objections at the moment.
+1 to start the vote.

Best regards,
Alexander

Wed, Jun 22, 2022 at 15:59, Qingsheng Ren <re...@apache.org>:
>
> Hi Jingsong,
>
> 1. Updated and thanks for the reminder!
>
> 2. We could do so for the implementation, but as a public interface I prefer not to introduce another layer and expose too much, since this FLIP is already a huge one with a bunch of classes and interfaces.
>
> Best,
> Qingsheng
>
> > On Jun 22, 2022, at 11:16, Jingsong Li <ji...@gmail.com> wrote:
> >
> > Thanks Qingsheng and all.
> >
> > I like this design.
> >
> > Some comments:
> >
> > 1. LookupCache implements Serializable?
> >
> > 2. Minor: After FLIP-234 [1], there should be many connectors that
> > implement both PartialCachingLookupProvider and
> > PartialCachingAsyncLookupProvider. Can we extract a common interface
> > for `LookupCache getCache();` to ensure consistency?
> >
> > [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-234%3A+Support+Retryable+Lookup+Join+To+Solve+Delayed+Updates+Issue+In+External+Systems
> >
> > Best,
> > Jingsong
> >
> > On Tue, Jun 21, 2022 at 4:09 PM Qingsheng Ren <re...@apache.org> wrote:
> >>
> >> Hi devs,
> >>
> >> I’d like to push FLIP-221 forward a little bit. Recently we had some offline discussions and updated the FLIP. Here’s the diff compared to the previous version:
> >>
> >> 1. (Async)LookupFunctionProvider is designed as a base interface for constructing lookup functions.
> >> 2. From the LookupFunctionProvider we extend PartialCaching / FullCachingLookupProvider for partial and full caching modes.
> >> 3. Introduce CacheReloadTrigger for specifying the reload strategy in full caching mode, and provide 2 default implementations (Periodic / TimedCacheReloadTrigger)
> >>
> >> Looking forward to your replies~
> >>
> >> Best,
> >> Qingsheng
> >>
> >>> On Jun 2, 2022, at 17:15, Qingsheng Ren <re...@gmail.com> wrote:
> >>>
> >>> Hi Becket,
> >>>
> >>> Thanks for your feedback!
> >>>
> >>> 1. An alternative way is to let the cache implementation decide
> >>> whether to store a missing key in the cache instead of the framework.
> >>> This sounds more reasonable and makes the LookupProvider interface
> >>> cleaner. I can update the FLIP and clarify in the JavaDoc of
> >>> LookupCache#put that the cache should decide whether to store an empty
> >>> collection.
> >>>
> >>> 2. Initially the builder pattern is for the extensibility of
> >>> LookupProvider interfaces that we could need to add more
> >>> configurations in the future. We can remove the builder now as we have
> >>> resolved the issue in 1. As for the builder in DefaultLookupCache I
> >>> prefer to keep it because we have a lot of arguments in the
> >>> constructor.
> >>>
> >>> 3. I think this might overturn the overall design. I agree with
> >>> Becket's idea that the API design should be layered considering
> >>> extensibility and it'll be great to have one unified interface
> >>> supporting both partial, full and even mixed custom strategies, but we
> >>> have some issues to resolve. The original purpose of treating full
> >>> caching separately is that we'd like to reuse the ability of
> >>> ScanRuntimeProvider. Developers just need to hand over Source /
> >>> SourceFunction / InputFormat so that the framework could be able to
> >>> compose the underlying topology and control the reload (maybe in a
> >>> distributed way). Under your design we leave the reload operation
> >>> totally to the CacheStrategy and I think it will be hard for
> >>> developers to reuse the source in the initializeCache method.
> >>>
> >>> Best regards,
> >>>
> >>> Qingsheng
> >>>
> >>> On Thu, Jun 2, 2022 at 1:50 PM Becket Qin <be...@gmail.com> wrote:
> >>>>
> >>>> Thanks for updating the FLIP, Qingsheng. A few more comments:
> >>>>
> >>>> 1. I am still not sure about what is the use case for cacheMissingKey().
> >>>> More specifically, when would users want to have getCache() return a
> >>>> non-empty value and cacheMissingKey() returns false?
> >>>>
> >>>> 2. The builder pattern. Usually the builder pattern is used when there are
> >>>> a lot of variations of constructors. For example, if a class has three
> >>>> variables and all of them are optional, so there could potentially be many
> >>>> combinations of the variables. But in this FLIP, I don't see such case.
> >>>> What is the reason we have builders for all the classes?
> >>>>
> >>>> 3. Should the caching strategy be excluded from the top level provider API?
> >>>> Technically speaking, the Flink framework should only have two interfaces
> >>>> to deal with:
> >>>>   A) LookupFunction
> >>>>   B) AsyncLookupFunction
> >>>> Orthogonally, we *believe* there are two different strategies people can do
> >>>> caching. Note that the Flink framework does not care what is the caching
> >>>> strategy here.
> >>>>   a) partial caching
> >>>>   b) full caching
> >>>>
> >>>> Putting them together, we end up with 3 combinations that we think are
> >>>> valid:
> >>>>    Aa) PartialCachingLookupFunctionProvider
> >>>>    Ba) PartialCachingAsyncLookupFunctionProvider
> >>>>    Ab) FullCachingLookupFunctionProvider
> >>>>
> >>>> However, the caching strategy could actually be quite flexible. E.g. an
> >>>> initial full cache load followed by some partial updates. Also, I am not
> >>>> 100% sure if the full caching will always use ScanTableSource. Including
> >>>> the caching strategy in the top level provider API would make it harder to
> >>>> extend.
> >>>>
> >>>> One possible solution is to just have *LookupFunctionProvider* and
> >>>> *AsyncLookupFunctionProvider
> >>>> *as the top level API, both with a getCacheStrategy() method returning an
> >>>> optional CacheStrategy. The CacheStrategy class would have the following
> >>>> methods:
> >>>> 1. void open(Context), the context exposes some of the resources that may
> >>>> be useful for the caching strategy, e.g. an ExecutorService that is
> >>>> synchronized with the data processing, or a cache refresh trigger which
> >>>> blocks data processing and refreshes the cache.
> >>>> 2. void initializeCache(), a blocking method allows users to pre-populate
> >>>> the cache before processing any data if they wish.
> >>>> 3. void maybeCache(RowData key, Collection<RowData> value), blocking or
> >>>> non-blocking method.
> >>>> 4. void refreshCache(), a blocking / non-blocking method that is invoked by
> >>>> the Flink framework when the cache refresh trigger is pulled.
> >>>>
> >>>> In the above design, partial caching and full caching would be
> >>>> implementations of the CachingStrategy. And it is OK for users to implement
> >>>> their own CachingStrategy if they want to.
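To make the shape of this proposal concrete, here is a minimal, illustrative sketch of such a pluggable strategy. It is plain Java with `String` standing in for Flink's `RowData`, and the interface name and methods follow Becket's outline above, not any final FLIP API:

```java
import java.util.Collection;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Toy sketch of the pluggable CacheStrategy outlined above. String stands in
// for Flink's RowData so the sketch is self-contained; names are illustrative.
interface CacheStrategy {
    /** Blocking pre-population of the cache before any data is processed. */
    void initializeCache();

    /** Called on every lookup result; the strategy decides whether to cache it. */
    void maybeCache(String key, Collection<String> value);

    /** Returns cached rows for the key, or null on a cache miss. */
    Collection<String> getIfPresent(String key);

    /** Invoked by the framework when the cache refresh trigger fires. */
    void refreshCache();
}

// A "partial" strategy: cache lazily per lookup, drop everything on refresh.
class PartialCacheStrategy implements CacheStrategy {
    private final Map<String, Collection<String>> cache = new ConcurrentHashMap<>();

    @Override
    public void initializeCache() {
        // Nothing to pre-load in partial mode.
    }

    @Override
    public void maybeCache(String key, Collection<String> value) {
        cache.put(key, value);
    }

    @Override
    public Collection<String> getIfPresent(String key) {
        return cache.get(key);
    }

    @Override
    public void refreshCache() {
        cache.clear(); // entries are refilled lazily by subsequent lookups
    }
}
```

A full-caching strategy would instead do its bulk load in initializeCache() and again in refreshCache(), which is exactly where reusing a ScanRuntimeProvider becomes hard, as Qingsheng notes above.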
> >>>>
> >>>> Thanks,
> >>>>
> >>>> Jiangjie (Becket) Qin
> >>>>
> >>>>
> >>>> On Thu, Jun 2, 2022 at 12:14 PM Jark Wu <im...@gmail.com> wrote:
> >>>>
> >>>>> Thank Qingsheng for the detailed summary and updates,
> >>>>>
> >>>>> The changes look good to me in general. I just have one minor improvement
> >>>>> comment.
> >>>>> Could we add a static util method to the "FullCachingReloadTrigger"
> >>>>> interface for quick usage?
> >>>>>
> >>>>> #periodicReloadAtFixedRate(Duration)
> >>>>> #periodicReloadWithFixedDelay(Duration)
> >>>>>
> >>>>> I think we can also do this for LookupCache, because users may not know
> >>>>> where the default implementations are and how to use them.
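The static convenience factories suggested here could look roughly like the sketch below. The interface shape and the PeriodicTrigger class are assumptions for illustration, not the FLIP's final API:

```java
import java.time.Duration;

// Sketch of static convenience factories on the trigger interface, so users
// can construct the default implementations without knowing where they live.
interface FullCachingReloadTrigger {
    Duration interval();

    boolean atFixedRate();

    /** Reload every 'period', measured start-to-start. */
    static FullCachingReloadTrigger periodicReloadAtFixedRate(Duration period) {
        return new PeriodicTrigger(period, true);
    }

    /** Reload every 'delay', measured end-to-start. */
    static FullCachingReloadTrigger periodicReloadWithFixedDelay(Duration delay) {
        return new PeriodicTrigger(delay, false);
    }
}

// Illustrative default implementation backing both factory methods.
class PeriodicTrigger implements FullCachingReloadTrigger {
    private final Duration interval;
    private final boolean atFixedRate;

    PeriodicTrigger(Duration interval, boolean atFixedRate) {
        this.interval = interval;
        this.atFixedRate = atFixedRate;
    }

    @Override
    public Duration interval() {
        return interval;
    }

    @Override
    public boolean atFixedRate() {
        return atFixedRate;
    }
}
```

The same pattern would apply to LookupCache, e.g. a static method returning a builder for the default cache.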
> >>>>>
> >>>>> Best,
> >>>>> Jark
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>> On Wed, 1 Jun 2022 at 18:32, Qingsheng Ren <re...@gmail.com> wrote:
> >>>>>
> >>>>>> Hi Jingsong,
> >>>>>>
> >>>>>> Thanks for your comments!
> >>>>>>
> >>>>>>> The AllCache definition is not flexible: for example, PartialCache can use
> >>>>>>> any custom storage while AllCache cannot; AllCache could also store to
> >>>>>>> memory or disk, so it also needs a flexible strategy.
> >>>>>>
> >>>>>> We had an offline discussion with Jark and Leonard. Basically we think
> >>>>>> exposing the interface of full cache storage to connector developers
> >>>>> might
> >>>>>> limit our future optimizations. The storage of full caching shouldn’t
> >>>>> have
> >>>>>> too many variations for different lookup tables so making it pluggable
> >>>>>> might not help a lot. Also I think it is not quite easy for connector
> >>>>>> developers to implement such an optimized storage. We can keep optimizing
> >>>>>> this storage in the future and all full caching lookup tables would
> >>>>> benefit
> >>>>>> from this.
> >>>>>>
> >>>>>>> We are more inclined to deprecate the connector `async` option when
> >>>>>> discussing FLIP-234. Can we remove this option from this FLIP?
> >>>>>>
> >>>>>> Thanks for the reminder! This option has been removed in the latest
> >>>>>> version.
> >>>>>>
> >>>>>> Best regards,
> >>>>>>
> >>>>>> Qingsheng
> >>>>>>
> >>>>>>
> >>>>>>> On Jun 1, 2022, at 15:28, Jingsong Li <ji...@gmail.com> wrote:
> >>>>>>>
> >>>>>>> Thanks Alexander for your reply. We can discuss the new interface when
> >>>>> it
> >>>>>>> comes out.
> >>>>>>>
> >>>>>>> We are more inclined to deprecate the connector `async` option when
> >>>>>>> discussing FLIP-234 [1]. We should use a hint to let the planner decide.
> >>>>>>> Although the discussion has not yet produced a conclusion, can we remove
> >>>>>>> this option from this FLIP? It doesn't seem to be related to this FLIP,
> >>>>>>> but more to FLIP-234, and we can form a conclusion over there.
> >>>>>>>
> >>>>>>> [1] https://lists.apache.org/thread/9k1sl2519kh2n3yttwqc00p07xdfns3h
> >>>>>>>
> >>>>>>> Best,
> >>>>>>> Jingsong
> >>>>>>>
> >>>>>>> On Wed, Jun 1, 2022 at 4:59 AM Jing Ge <ji...@ververica.com> wrote:
> >>>>>>>
> >>>>>>>> Hi Jark,
> >>>>>>>>
> >>>>>>>> Thanks for clarifying it. It would be fine. as long as we could
> >>>>> provide
> >>>>>> the
> >>>>>>>> no-cache solution. I was just wondering if the client side cache could
> >>>>>>>> really help when HBase is used, since the data to look up should be
> >>>>>> huge.
> >>>>>>>> Depending how much data will be cached on the client side, the data
> >>>>> that
> >>>>>>>> should be lru in e.g. LruBlockCache will not be lru anymore. In the
> >>>>>> worst
> >>>>>>>> case scenario, once the cached data at client side is expired, the
> >>>>>> request
> >>>>>>>> will hit disk which will cause extra latency temporarily, if I am not
> >>>>>>>> mistaken.
> >>>>>>>>
> >>>>>>>> Best regards,
> >>>>>>>> Jing
> >>>>>>>>
> >>>>>>>> On Mon, May 30, 2022 at 9:59 AM Jark Wu <im...@gmail.com> wrote:
> >>>>>>>>
> >>>>>>>>> Hi Jing Ge,
> >>>>>>>>>
> >>>>>>>>> What do you mean about the "impact on the block cache used by HBase"?
> >>>>>>>>> In my understanding, the connector cache and HBase cache are totally
> >>>>>> two
> >>>>>>>>> things.
> >>>>>>>>> The connector cache is a local/client cache, and the HBase cache is a
> >>>>>>>>> server cache.
> >>>>>>>>>
> >>>>>>>>>> does it make sense to have a no-cache solution as one of the
> >>>>>>>>> default solutions so that customers will have no effort for the
> >>>>>> migration
> >>>>>>>>> if they want to stick with Hbase cache
> >>>>>>>>>
> >>>>>>>>> The implementation migration should be transparent to users. Take the
> >>>>>>>> HBase
> >>>>>>>>> connector as
> >>>>>>>>> an example,  it already supports lookup cache but is disabled by
> >>>>>> default.
> >>>>>>>>> After migration, the
> >>>>>>>>> connector still disables cache by default (i.e. no-cache solution).
> >>>>> No
> >>>>>>>>> migration effort for users.
> >>>>>>>>>
> >>>>>>>>> HBase cache and connector cache are two different things. The HBase
> >>>>>>>>> cache can't simply replace the connector cache, because one of the most
> >>>>>>>>> important uses of the connector cache is reducing I/O requests/responses
> >>>>>>>>> and improving the throughput, which cannot be achieved by just using a
> >>>>>>>>> server-side cache.
> >>>>>>>>>
> >>>>>>>>> Best,
> >>>>>>>>> Jark
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> On Fri, 27 May 2022 at 22:42, Jing Ge <ji...@ververica.com> wrote:
> >>>>>>>>>
> >>>>>>>>>> Thanks all for the valuable discussion. The new feature looks very
> >>>>>>>>>> interesting.
> >>>>>>>>>>
> >>>>>>>>>> According to the FLIP description: "*Currently we have JDBC, Hive
> >>>>> and
> >>>>>>>>> HBase
> >>>>>>>>>> connector implemented lookup table source. All existing
> >>>>>> implementations
> >>>>>>>>>> will be migrated to the current design and the migration will be
> >>>>>>>>>> transparent to end users*." I was only wondering if we should pay
> >>>>>>>>> attention
> >>>>>>>>>> to HBase and similar DBs. Since, commonly, the lookup data will be
> >>>>>> huge
> >>>>>>>>>> while using HBase, partial caching will be used in this case, if I
> >>>>> am
> >>>>>>>> not
> >>>>>>>>>> mistaken, which might have an impact on the block cache used by
> >>>>> HBase,
> >>>>>>>>> e.g.
> >>>>>>>>>> LruBlockCache.
> >>>>>>>>>> Another question is that, since HBase provides a sophisticated cache
> >>>>>>>>>> solution, does it make sense to have a no-cache solution as one of
> >>>>> the
> >>>>>>>>>> default solutions so that customers will have no effort for the
> >>>>>>>> migration
> >>>>>>>>>> if they want to stick with Hbase cache?
> >>>>>>>>>>
> >>>>>>>>>> Best regards,
> >>>>>>>>>> Jing
> >>>>>>>>>>
> >>>>>>>>>> On Fri, May 27, 2022 at 11:19 AM Jingsong Li <
> >>>>> jingsonglee0@gmail.com>
> >>>>>>>>>> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> Hi all,
> >>>>>>>>>>>
> >>>>>>>>>>> I think the problems now are as follows:
> >>>>>>>>>>> 1. The AllCache and PartialCache interfaces are not uniform: one needs
> >>>>>>>>>>> to provide a LookupProvider, the other a CacheBuilder.
> >>>>>>>>>>> 2. The AllCache definition is not flexible. For example, PartialCache
> >>>>>>>>>>> can use any custom storage while AllCache cannot; AllCache could also
> >>>>>>>>>>> store to memory or disk, so it also needs a flexible strategy.
> >>>>>>>>>>> 3. AllCache cannot customize its ReloadStrategy; currently there is
> >>>>>>>>>>> only ScheduledReloadStrategy.
> >>>>>>>>>>>
> >>>>>>>>>>> In order to solve the above problems, the following are my ideas.
> >>>>>>>>>>>
> >>>>>>>>>>> ## Top level cache interfaces:
> >>>>>>>>>>>
> >>>>>>>>>>> ```
> >>>>>>>>>>>
> >>>>>>>>>>> public interface CacheLookupProvider extends
> >>>>>>>>>>> LookupTableSource.LookupRuntimeProvider {
> >>>>>>>>>>>
> >>>>>>>>>>>  CacheBuilder createCacheBuilder();
> >>>>>>>>>>> }
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> public interface CacheBuilder {
> >>>>>>>>>>>  Cache create();
> >>>>>>>>>>> }
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> public interface Cache {
> >>>>>>>>>>>
> >>>>>>>>>>>  /**
> >>>>>>>>>>>   * Returns the value associated with key in this cache, or null
> >>>>>>>> if
> >>>>>>>>>>> there is no cached value for
> >>>>>>>>>>>   * key.
> >>>>>>>>>>>   */
> >>>>>>>>>>>  @Nullable
> >>>>>>>>>>>  Collection<RowData> getIfPresent(RowData key);
> >>>>>>>>>>>
> >>>>>>>>>>>  /** Returns the number of key-value mappings in the cache. */
> >>>>>>>>>>>  long size();
> >>>>>>>>>>> }
> >>>>>>>>>>>
> >>>>>>>>>>> ```
> >>>>>>>>>>>
> >>>>>>>>>>> ## Partial cache
> >>>>>>>>>>>
> >>>>>>>>>>> ```
> >>>>>>>>>>>
> >>>>>>>>>>> public interface PartialCacheLookupFunction extends
> >>>>>>>>> CacheLookupProvider {
> >>>>>>>>>>>
> >>>>>>>>>>>  @Override
> >>>>>>>>>>>  PartialCacheBuilder createCacheBuilder();
> >>>>>>>>>>>
> >>>>>>>>>>> /** Creates an {@link LookupFunction} instance. */
> >>>>>>>>>>> LookupFunction createLookupFunction();
> >>>>>>>>>>> }
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> public interface PartialCacheBuilder extends CacheBuilder {
> >>>>>>>>>>>
> >>>>>>>>>>>  PartialCache create();
> >>>>>>>>>>> }
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> public interface PartialCache extends Cache {
> >>>>>>>>>>>
> >>>>>>>>>>>  /**
> >>>>>>>>>>>   * Associates the specified value rows with the specified key
> >>>>> row
> >>>>>>>>>>> in the cache. If the cache
> >>>>>>>>>>>   * previously contained value associated with the key, the old
> >>>>>>>>>>> value is replaced by the
> >>>>>>>>>>>   * specified value.
> >>>>>>>>>>>   *
> >>>>>>>>>>>   * @return the previous value rows associated with key, or null
> >>>>>>>> if
> >>>>>>>>>>> there was no mapping for key.
> >>>>>>>>>>>   * @param key - key row with which the specified value is to be
> >>>>>>>>>>> associated
> >>>>>>>>>>>   * @param value – value rows to be associated with the specified
> >>>>>>>>> key
> >>>>>>>>>>>   */
> >>>>>>>>>>>  Collection<RowData> put(RowData key, Collection<RowData> value);
> >>>>>>>>>>>
> >>>>>>>>>>>  /** Discards any cached value for the specified key. */
> >>>>>>>>>>>  void invalidate(RowData key);
> >>>>>>>>>>> }
> >>>>>>>>>>>
> >>>>>>>>>>> ```
> >>>>>>>>>>>
> >>>>>>>>>>> ## All cache
> >>>>>>>>>>> ```
> >>>>>>>>>>>
> >>>>>>>>>>> public interface AllCacheLookupProvider extends
> >>>>> CacheLookupProvider {
> >>>>>>>>>>>
> >>>>>>>>>>>  void registerReloadStrategy(ScheduledExecutorService
> >>>>>>>>>>> executorService, Reloader reloader);
> >>>>>>>>>>>
> >>>>>>>>>>>  ScanTableSource.ScanRuntimeProvider getScanRuntimeProvider();
> >>>>>>>>>>>
> >>>>>>>>>>>  @Override
> >>>>>>>>>>>  AllCacheBuilder createCacheBuilder();
> >>>>>>>>>>> }
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> public interface AllCacheBuilder extends CacheBuilder {
> >>>>>>>>>>>
> >>>>>>>>>>>  AllCache create();
> >>>>>>>>>>> }
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> public interface AllCache extends Cache {
> >>>>>>>>>>>
> >>>>>>>>>>>  void putAll(Iterator<Map<RowData, RowData>> allEntries);
> >>>>>>>>>>>
> >>>>>>>>>>>  void clearAll();
> >>>>>>>>>>> }
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> public interface Reloader {
> >>>>>>>>>>>
> >>>>>>>>>>>  void reload();
> >>>>>>>>>>> }
> >>>>>>>>>>>
> >>>>>>>>>>> ```
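As a toy illustration of the PartialCache shape proposed above, the sketch below is plain Java with `String` standing in for `RowData`, so it is self-contained. It mirrors the proposed put/getIfPresent/invalidate/size contract, nothing more:

```java
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;

// A toy in-memory PartialCache matching the proposed shape. A real
// implementation would bound the size and expire entries.
class InMemoryPartialCache {
    private final Map<String, Collection<String>> store = new HashMap<>();

    /** Returns the cached rows for the key, or null if nothing is cached. */
    Collection<String> getIfPresent(String key) {
        return store.get(key);
    }

    /** Returns the number of key-value mappings in the cache. */
    long size() {
        return store.size();
    }

    /** Associates value rows with the key; returns the previous rows, if any. */
    Collection<String> put(String key, Collection<String> value) {
        return store.put(key, value);
    }

    /** Discards any cached value for the specified key. */
    void invalidate(String key) {
        store.remove(key);
    }
}
```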
> >>>>>>>>>>>
> >>>>>>>>>>> Best,
> >>>>>>>>>>> Jingsong
> >>>>>>>>>>>
> >>>>>>>>>>> On Fri, May 27, 2022 at 11:10 AM Jingsong Li <
> >>>>> jingsonglee0@gmail.com
> >>>>>>>>>
> >>>>>>>>>>> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>> Thanks Qingsheng and all for your discussion.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Very sorry to jump in so late.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Maybe I missed something?
> >>>>>>>>>>>> My first impression when I saw the cache interface was, why don't
> >>>>>>>> we
> >>>>>>>>>>>> provide an interface similar to guava cache [1], on top of guava
> >>>>>>>>> cache,
> >>>>>>>>>>>> caffeine also makes extensions for asynchronous calls.[2]
> >>>>>>>>>>>> There is also the bulk load in caffeine too.
> >>>>>>>>>>>>
> >>>>>>>>>>>> I am also confused about why we first go from
> >>>>>>>>>>>> LookupCacheFactory.Builder and then to a Factory to create the Cache.
> >>>>>>>>>>>>
> >>>>>>>>>>>> [1] https://github.com/google/guava
> >>>>>>>>>>>> [2] https://github.com/ben-manes/caffeine/wiki/Population
> >>>>>>>>>>>>
> >>>>>>>>>>>> Best,
> >>>>>>>>>>>> Jingsong
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Thu, May 26, 2022 at 11:17 PM Jark Wu <im...@gmail.com>
> >>>>> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>>> After looking at the newly introduced ReloadTime and Becket's comment,
> >>>>>>>>>>>>> I agree with Becket that we should have a pluggable reloading strategy.
> >>>>>>>>>>>>> We can provide some common implementations, e.g., periodic reloading
> >>>>>>>>>>>>> and daily reloading. But there will definitely be some connector- or
> >>>>>>>>>>>>> business-specific reloading strategies, e.g. being notified by a
> >>>>>>>>>>>>> ZooKeeper watcher, or reloading once a new Hive partition is complete.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Best,
> >>>>>>>>>>>>> Jark
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> On Thu, 26 May 2022 at 11:52, Becket Qin <be...@gmail.com>
> >>>>>>>>>> wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> Hi Qingsheng,
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Thanks for updating the FLIP. A few comments / questions below:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> 1. Is there a reason that we have both "XXXFactory" and
> >>>>>>>>>> "XXXProvider".
> >>>>>>>>>>>>>> What is the difference between them? If they are the same, can
> >>>>>>>> we
> >>>>>>>>>> just
> >>>>>>>>>>>>> use
> >>>>>>>>>>>>>> XXXFactory everywhere?
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> 2. Regarding the FullCachingLookupProvider, should the reloading
> >>>>>>>>>>>>>> policy also be pluggable? Periodic reloading can sometimes be tricky
> >>>>>>>>>>>>>> in practice. For example, if a user sets 24 hours as the cache refresh
> >>>>>>>>>>>>>> interval and some nightly batch job is delayed, the cache update may
> >>>>>>>>>>>>>> still see the stale data.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> 3. In DefaultLookupCacheFactory, it looks like InitialCapacity
> >>>>>>>>>> should
> >>>>>>>>>>> be
> >>>>>>>>>>>>>> removed.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> 4. The purpose of LookupFunctionProvider#cacheMissingKey()
> >>>>>>>> seems a
> >>>>>>>>>>>>> little
> >>>>>>>>>>>>>> confusing to me. If Optional<LookupCacheFactory>
> >>>>>>>> getCacheFactory()
> >>>>>>>>>>>>> returns
> >>>>>>>>>>>>>> a non-empty factory, doesn't that already indicates the
> >>>>>>>> framework
> >>>>>>>>> to
> >>>>>>>>>>>>> cache
> >>>>>>>>>>>>>> the missing keys? Also, why is this method returning an
> >>>>>>>>>>>>> Optional<Boolean>
> >>>>>>>>>>>>>> instead of boolean?
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Jiangjie (Becket) Qin
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> On Wed, May 25, 2022 at 5:07 PM Qingsheng Ren <
> >>>>>>>> renqschn@gmail.com
> >>>>>>>>>>
> >>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Hi Lincoln and Jark,
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Thanks for the comments! If the community reaches a consensus
> >>>>>>>>> that
> >>>>>>>>>> we
> >>>>>>>>>>>>> use
> >>>>>>>>>>>>>>> SQL hint instead of table options to decide whether to use sync
> >>>>>>>>> or
> >>>>>>>>>>>>> async
> >>>>>>>>>>>>>>> mode, it’s indeed not necessary to introduce the “lookup.async”
> >>>>>>>>>>> option.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> I think it’s a good idea to let the decision on async be made at the
> >>>>>>>>>>>>>>> query level, which could enable better optimization with more
> >>>>>>>>>>>>>>> information gathered by the planner. Is there any FLIP describing the
> >>>>>>>>>>>>>>> issue in FLINK-27625? I thought FLIP-234 was only proposing adding a
> >>>>>>>>>>>>>>> SQL hint for retry on miss, instead of having the entire async mode
> >>>>>>>>>>>>>>> controlled by a hint.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Best regards,
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Qingsheng
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> On May 25, 2022, at 15:13, Lincoln Lee <
> >>>>>>>> lincoln.86xy@gmail.com
> >>>>>>>>>>
> >>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Hi Jark,
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Thanks for your reply!
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Currently 'lookup.async' just lies in HBase connector, I have
> >>>>>>>>> no
> >>>>>>>>>>> idea
> >>>>>>>>>>>>>>>> whether or when to remove it (we can discuss it in another
> >>>>>>>>> issue
> >>>>>>>>>>> for
> >>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>> HBase connector after FLINK-27625 is done), just not add it
> >>>>>>>>> into
> >>>>>>>>>> a
> >>>>>>>>>>>>>>> common
> >>>>>>>>>>>>>>>> option now.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Best,
> >>>>>>>>>>>>>>>> Lincoln Lee
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Jark Wu <im...@gmail.com> wrote on Tue, May 24, 2022 at 20:14:
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Hi Lincoln,
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> I have taken a look at FLIP-234, and I agree with you that
> >>>>>>>> the
> >>>>>>>>>>>>>>> connectors
> >>>>>>>>>>>>>>>>> can
> >>>>>>>>>>>>>>>>> provide both async and sync runtime providers simultaneously
> >>>>>>>>>>> instead
> >>>>>>>>>>>>>>> of one
> >>>>>>>>>>>>>>>>> of them.
> >>>>>>>>>>>>>>>>> At that point, "lookup.async" looks redundant. If this
> >>>>>>>> option
> >>>>>>>>> is
> >>>>>>>>>>>>>>> planned to
> >>>>>>>>>>>>>>>>> be removed
> >>>>>>>>>>>>>>>>> in the long term, I think it makes sense not to introduce it
> >>>>>>>>> in
> >>>>>>>>>>> this
> >>>>>>>>>>>>>>> FLIP.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Best,
> >>>>>>>>>>>>>>>>> Jark
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> On Tue, 24 May 2022 at 11:08, Lincoln Lee <
> >>>>>>>>>> lincoln.86xy@gmail.com
> >>>>>>>>>>>>
> >>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Hi Qingsheng,
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Sorry for jumping into the discussion so late. It's a good
> >>>>>>>>> idea
> >>>>>>>>>>>>> that
> >>>>>>>>>>>>>>> we
> >>>>>>>>>>>>>>>>> can
> >>>>>>>>>>>>>>>>>> have a common table option. I have a minor comments on
> >>>>>>>>>>>>> 'lookup.async'
> >>>>>>>>>>>>>>>>> that
> >>>>>>>>>>>>>>>>>> not make it a common option:
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> The table layer abstracts both sync and async lookup
> >>>>>>>>>>> capabilities,
> >>>>>>>>>>>>>>>>>> connectors implementers can choose one or both, in the case
> >>>>>>>>> of
> >>>>>>>>>>>>>>>>> implementing
> >>>>>>>>>>>>>>>>>> only one capability(status of the most of existing builtin
> >>>>>>>>>>>>> connectors)
> >>>>>>>>>>>>>>>>>> 'lookup.async' will not be used.  And when a connector has
> >>>>>>>>> both
> >>>>>>>>>>>>>>>>>> capabilities, I think this choice is more suitable for
> >>>>>>>> making
> >>>>>>>>>>>>>>> decisions
> >>>>>>>>>>>>>>>>> at
> >>>>>>>>>>>>>>>>>> the query level, for example, table planner can choose the
> >>>>>>>>>>> physical
> >>>>>>>>>>>>>>>>>> implementation of async lookup or sync lookup based on its
> >>>>>>>>> cost
> >>>>>>>>>>>>>>> model, or
> >>>>>>>>>>>>>>>>>> users can give query hint based on their own better
> >>>>>>>>>>>>> understanding.  If
> >>>>>>>>>>>>>>>>>> there is another common table option 'lookup.async', it may
> >>>>>>>>>>> confuse
> >>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>> users in the long run.
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> So, I prefer to leave the 'lookup.async' option in private
> >>>>>>>>>> place
> >>>>>>>>>>>>> (for
> >>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>> current hbase connector) and not turn it into a common
> >>>>>>>>> option.
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> WDYT?
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Best,
> >>>>>>>>>>>>>>>>>> Lincoln Lee
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Qingsheng Ren <re...@gmail.com> wrote on Mon, May 23, 2022 at 14:54:
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Hi Alexander,
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Thanks for the review! We recently updated the FLIP and
> >>>>>>>> you
> >>>>>>>>>> can
> >>>>>>>>>>>>> find
> >>>>>>>>>>>>>>>>>> those
> >>>>>>>>>>>>>>>>>>> changes from my latest email. Since some terminologies has
> >>>>>>>>>>>>> changed so
> >>>>>>>>>>>>>>>>>> I’ll
> >>>>>>>>>>>>>>>>>>> use the new concept for replying your comments.
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> 1. Builder vs ‘of’
> >>>>>>>>>>>>>>>>>>> I’m OK to use builder pattern if we have additional
> >>>>>>>> optional
> >>>>>>>>>>>>>>> parameters
> >>>>>>>>>>>>>>>>>>> for full caching mode (“rescan” previously). The
> >>>>>>>>>>>>> schedule-with-delay
> >>>>>>>>>>>>>>>>> idea
> >>>>>>>>>>>>>>>>>>> looks reasonable to me, but I think we need to redesign
> >>>>>>>> the
> >>>>>>>>>>>>> builder
> >>>>>>>>>>>>>>> API
> >>>>>>>>>>>>>>>>>> of
> >>>>>>>>>>>>>>>>>>> full caching to make it more descriptive for developers.
> >>>>>>>>> Would
> >>>>>>>>>>> you
> >>>>>>>>>>>>>>> mind
> >>>>>>>>>>>>>>>>>>> sharing your ideas about the API? For accessing the FLIP
> >>>>>>>>>>> workspace
> >>>>>>>>>>>>>>> you
> >>>>>>>>>>>>>>>>>> can
> >>>>>>>>>>>>>>>>>>> just provide your account ID and ping any PMC member
> >>>>>>>>> including
> >>>>>>>>>>>>> Jark.
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> 2. Common table options
> >>>>>>>>>>>>>>>>>>> We have some discussions these days and propose to
> >>>>>>>>> introduce 8
> >>>>>>>>>>>>> common
> >>>>>>>>>>>>>>>>>>> table options about caching. It has been updated on the
> >>>>>>>>> FLIP.
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> 3. Retries
> >>>>>>>>>>>>>>>>>>> I think we are on the same page :-)
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> For your additional concerns:
> >>>>>>>>>>>>>>>>>>> 1) The table option has been updated.
> >>>>>>>>>>>>>>>>>>> 2) We got “lookup.cache” back for configuring whether to
> >>>>>>>> use
> >>>>>>>>>>>>> partial
> >>>>>>>>>>>>>>> or
> >>>>>>>>>>>>>>>>>>> full caching mode.
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Best regards,
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Qingsheng
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> On May 19, 2022, at 17:25, Alexander Smirnov <
> >>>>>>>>>>>>> smiralexan@gmail.com>
> >>>>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> Also I have a few additions:
> >>>>>>>>>>>>>>>>>>>> 1) maybe rename 'lookup.cache.maximum-size' to
> >>>>>>>>>>>>>>>>>>>> 'lookup.cache.max-rows'? I think it will be more clear
> >>>>>>>> that
> >>>>>>>>>> we
> >>>>>>>>>>>>> talk
> >>>>>>>>>>>>>>>>>>>> not about bytes, but about the number of rows. Plus it
> >>>>>>>> fits
> >>>>>>>>>>> more,
> >>>>>>>>>>>>>>>>>>>> considering my optimization with filters.
> >>>>>>>>>>>>>>>>>>>> 2) How will users enable rescanning? Are we going to
> >>>>>>>>> separate
> >>>>>>>>>>>>>>> caching
> >>>>>>>>>>>>>>>>>>>> and rescanning from the options point of view? Like
> >>>>>>>>> initially
> >>>>>>>>>>> we
> >>>>>>>>>>>>> had
> >>>>>>>>>>>>>>>>>>>> one option 'lookup.cache' with values LRU / ALL. I think
> >>>>>>>>> now
> >>>>>>>>>> we
> >>>>>>>>>>>>> can
> >>>>>>>>>>>>>>>>>>>> make a boolean option 'lookup.rescan'. RescanInterval can
> >>>>>>>>> be
> >>>>>>>>>>>>>>>>>>>> 'lookup.rescan.interval', etc.
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> Best regards,
> >>>>>>>>>>>>>>>>>>>> Alexander
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> Thu, 19 May 2022 at 14:50, Alexander Smirnov <
> >>>>>>>>>>>>> smiralexan@gmail.com
> >>>>>>>>>>>>>>>>>> :
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> Hi Qingsheng and Jark,
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> 1. Builders vs 'of'
> >>>>>>>>>>>>>>>>>>>>> I understand that builders are used when we have
> >>>>>>>> multiple
> >>>>>>>>>>>>>>>>> parameters.
> >>>>>>>>>>>>>>>>>>>>> I suggested them because we could add parameters later.
> >>>>>>>> To
> >>>>>>>>>>>>> prevent
> >>>>>>>>>>>>>>>>>>>>> Builder for ScanRuntimeProvider from looking redundant I
> >>>>>>>>> can
> >>>>>>>>>>>>>>> suggest
> >>>>>>>>>>>>>>>>>>>>> one more config now - "rescanStartTime".
> >>>>>>>>>>>>>>>>>>>>> It's a time in UTC (LocalTime class) at which the first reload
> >>>>>>>>>>>>>>>>>>>>> of the cache starts. This parameter can be thought of as the
> >>>>>>>>>>>>>>>>>>>>> 'initialDelay' (the difference between the current time and
> >>>>>>>>>>>>>>>>>>>>> rescanStartTime) in the method
> >>>>>>>>>>>>>>>>>>>>> ScheduledExecutorService#scheduleWithFixedDelay [1]. It can be
> >>>>>>>>>>>>>>>>>>>>> very useful when the dimension table is updated by some other
> >>>>>>>>>>>>>>>>>>>>> scheduled job at a certain time, or when the user simply wants
> >>>>>>>>>>>>>>>>>>>>> the second scan (first cache reload) to be delayed. This option
> >>>>>>>>>>>>>>>>>>>>> can be used even without 'rescanInterval' - in this case
> >>>>>>>>>>>>>>>>>>>>> 'rescanInterval' will be one day.
> >>>>>>>>>>>>>>>>>>>>> If you are fine with this option, I would be very glad if you
> >>>>>>>>>>>>>>>>>>>>> would give me access to edit the FLIP page, so I could add it
> >>>>>>>>>>>>>>>>>>>>> myself.
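The initialDelay computation described for 'rescanStartTime' can be sketched as follows. This is an illustrative helper, not proposed API; 'now' is taken as a parameter (rather than calling LocalTime.now()) only to keep the sketch deterministic:

```java
import java.time.Duration;
import java.time.LocalTime;

// Sketch of the 'rescanStartTime' idea: the initialDelay passed to
// ScheduledExecutorService#scheduleWithFixedDelay is the gap between the
// current UTC time and the configured start time, rolled over to the next
// day when the start time has already passed today.
class RescanDelay {
    static Duration initialDelay(LocalTime now, LocalTime rescanStartTime) {
        Duration delay = Duration.between(now, rescanStartTime);
        if (delay.isNegative()) {
            // Start time already passed today: schedule the first run tomorrow.
            delay = delay.plusDays(1);
        }
        return delay;
    }
}
```

For example, with the current UTC time at 04:00 and rescanStartTime at 03:00, the first reload would be scheduled 23 hours later.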
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> 2. Common table options
> >>>>>>>>>>>>>>>>>>>>> I also think that FactoryUtil would be overloaded by all
> >>>>>>>>>> cache
> >>>>>>>>>>>>>>>>>>>>> options. But maybe unify all suggested options, not only
> >>>>>>>>> for
> >>>>>>>>>>>>>>> default
> >>>>>>>>>>>>>>>>>>>>> cache? I.e. class 'LookupOptions', that unifies default
> >>>>>>>>>> cache
> >>>>>>>>>>>>>>>>> options,
> >>>>>>>>>>>>>>>>>>>>> rescan options, 'async', 'maxRetries'. WDYT?
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> 3. Retries
> >>>>>>>>>>>>>>>>>>>>> I'm fine with suggestion close to
> >>>>>>>>> RetryUtils#tryTimes(times,
> >>>>>>>>>>>>> call)
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> [1]
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>
> >>>>> https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/ScheduledExecutorService.html#scheduleWithFixedDelay-java.lang.Runnable-long-long-java.util.concurrent.TimeUnit-
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> Best regards,
> >>>>>>>>>>>>>>>>>>>>> Alexander
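[Editorial aside: the 'rescanStartTime' idea above — treating the option as the initialDelay of ScheduledExecutorService#scheduleWithFixedDelay — can be sketched in plain Java. The class name and the option values here are illustrative, not part of the FLIP.]

```java
import java.time.Duration;
import java.time.LocalTime;
import java.time.ZoneOffset;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class RescanScheduleSketch {

    /** Delay from 'now' until the next occurrence of rescanStartTime (both in UTC). */
    static Duration initialDelay(LocalTime now, LocalTime rescanStartTime) {
        Duration diff = Duration.between(now, rescanStartTime);
        // If the start time has already passed today, wait until it comes around tomorrow.
        return diff.isNegative() ? diff.plusDays(1) : diff;
    }

    public static void main(String[] args) {
        LocalTime now = LocalTime.now(ZoneOffset.UTC);
        LocalTime rescanStartTime = LocalTime.of(3, 0); // hypothetical 'rescanStartTime' value
        Duration rescanInterval = Duration.ofDays(1);   // default when only the start time is set

        ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor();
        executor.scheduleWithFixedDelay(
                () -> System.out.println("reloading ALL cache..."), // cache reload task
                initialDelay(now, rescanStartTime).toMillis(),
                rescanInterval.toMillis(),
                TimeUnit.MILLISECONDS);
        executor.shutdown(); // sketch only: don't keep the JVM alive
    }
}
```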
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> Wed, 18 May 2022 at 16:04, Qingsheng Ren <renqschn@gmail.com>:
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> Hi Jark and Alexander,
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> Thanks for your comments! I'm also OK with introducing common table
> >>>>>>>>>>>>>>>>>>>>>> options. I prefer to introduce a new DefaultLookupCacheOptions class
> >>>>>>>>>>>>>>>>>>>>>> for holding these option definitions, because putting all the options
> >>>>>>>>>>>>>>>>>>>>>> into FactoryUtil would make it a bit "crowded" and not well categorized.
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> The FLIP has been updated according to the suggestions above:
> >>>>>>>>>>>>>>>>>>>>>> 1. Use a static "of" method for constructing RescanRuntimeProvider,
> >>>>>>>>>>>>>>>>>>>>>> considering both arguments are required.
> >>>>>>>>>>>>>>>>>>>>>> 2. Introduce new table options matching DefaultLookupCacheFactory.
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> Best,
> >>>>>>>>>>>>>>>>>>>>>> Qingsheng
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> On Wed, May 18, 2022 at 2:57 PM Jark Wu <imjark@gmail.com> wrote:
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> Hi Alex,
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> 1) retry logic
> >>>>>>>>>>>>>>>>>>>>>>> I think we can extract some common retry logic into utilities, e.g.
> >>>>>>>>>>>>>>>>>>>>>>> RetryUtils#tryTimes(times, call). This seems independent of this FLIP
> >>>>>>>>>>>>>>>>>>>>>>> and can be reused by DataStream users. Maybe we can open an issue to
> >>>>>>>>>>>>>>>>>>>>>>> discuss this and where to put it.
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> 2) cache ConfigOptions
> >>>>>>>>>>>>>>>>>>>>>>> I'm fine with defining the cache config options in the framework.
> >>>>>>>>>>>>>>>>>>>>>>> A candidate place to put them is FactoryUtil, which also includes the
> >>>>>>>>>>>>>>>>>>>>>>> "sink.parallelism" and "format" options.
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> Best,
> >>>>>>>>>>>>>>>>>>>>>>> Jark
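[Editorial aside: a minimal sketch of the RetryUtils#tryTimes(times, call) utility discussed above — the name and signature come from the mailing-list suggestion, not from an existing Flink class.]

```java
import java.util.concurrent.Callable;

public class RetryUtils {

    /** Hypothetical tryTimes(times, call): run a call, retrying up to 'times' attempts. */
    public static <T> T tryTimes(int times, Callable<T> call) throws Exception {
        Exception last = null;
        for (int attempt = 1; attempt <= times; attempt++) {
            try {
                return call.call();
            } catch (Exception e) {
                last = e; // connector-specific recovery (e.g. reconnect) could hook in here
            }
        }
        throw last; // all attempts failed: surface the last exception
    }

    public static void main(String[] args) throws Exception {
        int[] failures = {2}; // simulate a lookup that fails twice, then succeeds
        String result = tryTimes(3, () -> {
            if (failures[0]-- > 0) {
                throw new RuntimeException("transient lookup failure");
            }
            return "row";
        });
        System.out.println(result); // prints "row"
    }
}
```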
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> On Wed, 18 May 2022 at 13:52, Александр Смирнов <smiralexan@gmail.com> wrote:
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> Hi Qingsheng,
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> Thank you for considering my comments.
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>> there might be custom logic before making retry, such as
> >>>>>>>>>>>>>>>>>>>>>>>>> re-establishing the connection
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> Yes, I understand that. I meant that such logic can be placed in a
> >>>>>>>>>>>>>>>>>>>>>>>> separate function that can be implemented by connectors. Just moving
> >>>>>>>>>>>>>>>>>>>>>>>> the retry logic would make a connector's LookupFunction more concise
> >>>>>>>>>>>>>>>>>>>>>>>> and avoid duplicate code. However, it's a minor change. The decision
> >>>>>>>>>>>>>>>>>>>>>>>> is up to you.
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>> We decided not to provide common DDL options and to let developers
> >>>>>>>>>>>>>>>>>>>>>>>>> define their own options as we do now per connector.
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> What is the reason for that? One of the main goals of this FLIP was
> >>>>>>>>>>>>>>>>>>>>>>>> to unify the configs, wasn't it? I understand that the current cache
> >>>>>>>>>>>>>>>>>>>>>>>> design doesn't depend on ConfigOptions, as it did before. But we can
> >>>>>>>>>>>>>>>>>>>>>>>> still put these options into the framework, so connectors can reuse
> >>>>>>>>>>>>>>>>>>>>>>>> them and avoid code duplication, and, what is more significant, avoid
> >>>>>>>>>>>>>>>>>>>>>>>> possibly different option naming. This can be pointed out in the
> >>>>>>>>>>>>>>>>>>>>>>>> documentation for connector developers.
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> Best regards,
> >>>>>>>>>>>>>>>>>>>>>>>> Alexander
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> Tue, 17 May 2022 at 17:11, Qingsheng Ren <renqschn@gmail.com>:
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>> Hi Alexander,
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>> Thanks for the review and glad to see we are on the same page! I
> >>>>>>>>>>>>>>>>>>>>>>>>> think you forgot to cc the dev mailing list, so I'm also quoting your
> >>>>>>>>>>>>>>>>>>>>>>>>> reply under this email.
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> We can add 'maxRetryTimes' option into this class
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>> In my opinion the retry logic should be implemented in lookup()
> >>>>>>>>>>>>>>>>>>>>>>>>> instead of in LookupFunction#eval(). Retrying is only meaningful
> >>>>>>>>>>>>>>>>>>>>>>>>> under some specific retriable failures, and there might be custom
> >>>>>>>>>>>>>>>>>>>>>>>>> logic before making a retry, such as re-establishing the connection
> >>>>>>>>>>>>>>>>>>>>>>>>> (JdbcRowDataLookupFunction is an example), so it's more handy to
> >>>>>>>>>>>>>>>>>>>>>>>>> leave it to the connector.
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> I don't see the DDL options that were in the previous version of
> >>>>>>>>>>>>>>>>>>>>>>>>>> the FLIP. Do you have any special plans for them?
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>> We decided not to provide common DDL options and to let developers
> >>>>>>>>>>>>>>>>>>>>>>>>> define their own options as we do now per connector.
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>> The rest of the comments sound great and I'll update the FLIP. Hope
> >>>>>>>>>>>>>>>>>>>>>>>>> we can finalize our proposal soon!
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>> Best,
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>> Qingsheng
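[Editorial aside: the split Qingsheng describes — the framework calls eval(), while retry (and e.g. reconnection) stays inside the connector's lookup() — can be sketched with plain-Java stand-ins. RowData is simplified here to a List of field values; the names are illustrative and not the FLIP's final API.]

```java
import java.util.Collection;
import java.util.Collections;
import java.util.List;

public class LookupSketch {

    /** Framework-facing base class: eval() simply delegates to the connector's lookup(). */
    abstract static class LookupFunction {
        abstract Collection<List<Object>> lookup(List<Object> keyRow) throws Exception;

        // The framework calls eval(); connectors only implement lookup().
        final Collection<List<Object>> eval(Object... keys) throws Exception {
            return lookup(List.of(keys));
        }
    }

    /** A connector keeping its retry (and potential reconnect) logic inside lookup(). */
    static class FlakyLookupFunction extends LookupFunction {
        private final int maxRetryTimes;
        private int failuresLeft;

        FlakyLookupFunction(int maxRetryTimes, int simulatedFailures) {
            this.maxRetryTimes = maxRetryTimes;
            this.failuresLeft = simulatedFailures;
        }

        @Override
        Collection<List<Object>> lookup(List<Object> keyRow) throws Exception {
            Exception last = null;
            for (int attempt = 0; attempt <= maxRetryTimes; attempt++) {
                try {
                    if (failuresLeft > 0) {
                        failuresLeft--;
                        throw new RuntimeException("transient failure");
                    }
                    return Collections.singletonList(List.of("value-for-" + keyRow));
                } catch (Exception e) {
                    last = e; // re-establish the connection here if the failure is retriable
                }
            }
            throw last;
        }
    }

    public static void main(String[] args) throws Exception {
        // Fails once, succeeds on retry; eval() itself stays retry-free.
        System.out.println(new FlakyLookupFunction(2, 1).eval("key"));
    }
}
```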
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> On May 17, 2022, at 13:46, Александр Смирнов <smiralexan@gmail.com> wrote:
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> Hi Qingsheng and devs!
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> I like the overall design of the updated FLIP, however I have
> >>>>>>>>>>>>>>>>>>>>>>>>>> several suggestions and questions.
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> 1) Introducing LookupFunction as a subclass of TableFunction is a
> >>>>>>>>>>>>>>>>>>>>>>>>>> good idea. We can add a 'maxRetryTimes' option into this class. The
> >>>>>>>>>>>>>>>>>>>>>>>>>> 'eval' method of the new LookupFunction is great for this purpose.
> >>>>>>>>>>>>>>>>>>>>>>>>>> The same goes for the 'async' case.
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> 2) There might be other configs in the future, such as
> >>>>>>>>>>>>>>>>>>>>>>>>>> 'cacheMissingKey' in LookupFunctionProvider or 'rescanInterval' in
> >>>>>>>>>>>>>>>>>>>>>>>>>> ScanRuntimeProvider. Maybe use the Builder pattern in
> >>>>>>>>>>>>>>>>>>>>>>>>>> LookupFunctionProvider and RescanRuntimeProvider for more
> >>>>>>>>>>>>>>>>>>>>>>>>>> flexibility (one 'build' method instead of many 'of' methods in the
> >>>>>>>>>>>>>>>>>>>>>>>>>> future)?
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> 3) What are the plans for the existing TableFunctionProvider and
> >>>>>>>>>>>>>>>>>>>>>>>>>> AsyncTableFunctionProvider? I think they should be deprecated.
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> 4) Am I right that the current design does not assume usage of a
> >>>>>>>>>>>>>>>>>>>>>>>>>> user-provided LookupCache in re-scanning? In this case, it is not
> >>>>>>>>>>>>>>>>>>>>>>>>>> very clear why we need methods such as 'invalidate' or 'putAll' in
> >>>>>>>>>>>>>>>>>>>>>>>>>> LookupCache.
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> 5) I don't see the DDL options that were in the previous version of
> >>>>>>>>>>>>>>>>>>>>>>>>>> the FLIP. Do you have any special plans for them?
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> If you don't mind, I would be glad to be able to make small
> >>>>>>>>>>>>>>>>>>>>>>>>>> adjustments to the FLIP document too. I think it's worth mentioning
> >>>>>>>>>>>>>>>>>>>>>>>>>> what exactly the optimizations planned for the future are.
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> Best regards,
> >>>>>>>>>>>>>>>>>>>>>>>>>> Smirnov Alexander
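[Editorial aside: point 2 above (a builder instead of many 'of' overloads) can be sketched as follows; this LookupFunctionProvider, the 'cacheMissingKey' flag, and its default value are hypothetical stand-ins for illustration.]

```java
public class LookupFunctionProviderSketch {

    /** Stand-in for the proposed LookupFunction interface. */
    interface LookupFunction { }

    static final class LookupFunctionProvider {
        final LookupFunction function;
        final boolean cacheMissingKey;

        private LookupFunctionProvider(Builder b) {
            this.function = b.function;
            this.cacheMissingKey = b.cacheMissingKey;
        }

        static Builder builder() { return new Builder(); }

        /** New optional configs only need a new builder method, not a new 'of' overload. */
        static final class Builder {
            private LookupFunction function;
            private boolean cacheMissingKey = true; // assumed default for the sketch

            Builder withLookupFunction(LookupFunction f) { this.function = f; return this; }
            Builder cacheMissingKey(boolean cache) { this.cacheMissingKey = cache; return this; }
            LookupFunctionProvider build() { return new LookupFunctionProvider(this); }
        }
    }

    public static void main(String[] args) {
        LookupFunctionProvider provider = LookupFunctionProvider.builder()
                .withLookupFunction(new LookupFunction() { })
                .cacheMissingKey(false)
                .build();
        System.out.println(provider.cacheMissingKey); // false
    }
}
```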
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> Fri, 13 May 2022 at 20:27, Qingsheng Ren <renqschn@gmail.com>:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>> Hi Alexander and devs,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>> Thank you very much for the in-depth discussion! As Jark mentioned,
> >>>>>>>>>>>>>>>>>>>>>>>>>>> we were inspired by Alexander's idea and refactored our design.
> >>>>>>>>>>>>>>>>>>>>>>>>>>> FLIP-221 [1] has been updated to reflect our design now and we are
> >>>>>>>>>>>>>>>>>>>>>>>>>>> happy to hear more suggestions from you!
> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>> Compared to the previous design:
> >>>>>>>>>>>>>>>>>>>>>>>>>>> 1. The lookup cache serves at the table runtime level and is
> >>>>>>>>>>>>>>>>>>>>>>>>>>> integrated as a component of LookupJoinRunner, as discussed
> >>>>>>>>>>>>>>>>>>>>>>>>>>> previously.
> >>>>>>>>>>>>>>>>>>>>>>>>>>> 2. Interfaces are renamed and re-designed to reflect the new design.
> >>>>>>>>>>>>>>>>>>>>>>>>>>> 3. We separate the all-caching case individually and introduce a new
> >>>>>>>>>>>>>>>>>>>>>>>>>>> RescanRuntimeProvider to reuse the ability of scanning. We are
> >>>>>>>>>>>>>>>>>>>>>>>>>>> planning to support SourceFunction / InputFormat for now,
> >>>>>>>>>>>>>>>>>>>>>>>>>>> considering the complexity of the FLIP-27 Source API.
> >>>>>>>>>>>>>>>>>>>>>>>>>>> 4. A new interface LookupFunction is introduced to make the
> >>>>>>>>>>>>>>>>>>>>>>>>>>> semantics of lookup more straightforward for developers.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>> Replying to Alexander:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> However I'm a little confused whether InputFormat is deprecated or
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> not. Am I right that it will be so in the future, but currently
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> it's not?
> >>>>>>>>>>>>>>>>>>>>>>>>>>> Yes, you are right. InputFormat is not deprecated for now. I think
> >>>>>>>>>>>>>>>>>>>>>>>>>>> it will be deprecated in the future, but we don't have a clear plan
> >>>>>>>>>>>>>>>>>>>>>>>>>>> for that.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks again for the discussion on this FLIP and looking forward to
> >>>>>>>>>>>>>>>>>>>>>>>>>>> cooperating with you after we finalize the design and interfaces!
> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>> [1]
> >>>>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-221+Abstraction+for+lookup+source+cache+and+metric
> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>> Best regards,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>> Qingsheng
> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>> On Fri, May 13, 2022 at 12:12 AM Александр Смирнов <smiralexan@gmail.com> wrote:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi Jark, Qingsheng and Leonard!
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> Glad to see that we came to a consensus on almost all points!
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> However, I'm a little confused whether InputFormat is deprecated
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> or not. Am I right that it will be so in the future, but currently
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> it's not? Actually I also think that for the first version it's OK
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> to use InputFormat in the ALL cache implementation, because
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> supporting the rescan ability seems like a very distant prospect.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> But for this decision we need a consensus among all discussion
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> participants.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> In general, I don't have anything to argue with in your
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> statements. All of them correspond to my ideas. Looking ahead, it
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> would be nice to work on this FLIP cooperatively. I've already
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> done a lot of work on lookup join caching with an implementation
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> very close to the one we are discussing, and want to share the
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> results of this work. Anyway, looking forward to the FLIP update!
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> Best regards,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> Smirnov Alexander
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> Thu, 12 May 2022 at 17:38, Jark Wu <imjark@gmail.com>:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi Alex,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for summarizing your points.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> In the past week, Qingsheng, Leonard, and I have discussed it
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> several times and we have totally refactored the design.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> I'm glad to say we have reached a consensus on many of your points!
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> Qingsheng is still working on updating the design docs, which may
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> be available in the next few days.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> I will share some conclusions from our discussions:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> 1) We have refactored the design towards the "cache in framework"
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> way.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> 2) A "LookupCache" interface for users to customize, and a default
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> implementation with a builder for ease of use.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> This makes it possible to have both flexibility and conciseness.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> 3) Filter pushdown is important for both the ALL and LRU lookup
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> caches, especially for reducing IO.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> Filter pushdown should be the final state and the unified way to
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> support pruning both the ALL cache and the LRU cache,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> so I think we should make an effort in this direction. If we need
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> to support filter pushdown for the ALL cache anyway, why not use
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> it for the LRU cache as well? Either way, as we decided to
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> implement the cache in the framework, we have the chance to
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> support filtering on the cache anytime. This is an optimization
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> and it doesn't affect the public API. I think we can create a JIRA
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> issue to discuss it when the FLIP is accepted.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> 4) The idea to support the ALL cache is similar to your proposal.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> In the first version, we will only support InputFormat and
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> SourceFunction for cache-all (invoking the InputFormat in the join
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> operator).
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> For a FLIP-27 source, we need to join a true source operator
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> instead of calling it embedded in the join operator.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> However, this needs another FLIP to support the re-scan ability
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> for the FLIP-27 Source, and this can be a large amount of work.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> In order not to block this issue, we can put the effort of FLIP-27
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> source integration into future work and integrate
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> InputFormat & SourceFunction for now.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> I think it's fine to use InputFormat & SourceFunction, as they are
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> not deprecated; otherwise, we would have to introduce another
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> function similar to them, which is meaningless. We need to plan
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> the FLIP-27 source integration ASAP, before InputFormat &
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> SourceFunction are deprecated.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> Best,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> Jark
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Thu, 12 May 2022 at 15:46, Александр Смирнов <smiralexan@gmail.com> wrote:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi Martijn!
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Got it. Therefore, the implementation with InputFormat is not
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> considered. Thanks for clearing that up!
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Best regards,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Smirnov Alexander
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thu, 12 May 2022 at 14:23, Martijn Visser <martijn@ververica.com>:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> With regards to:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> But if there are plans to refactor all connectors to FLIP-27
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Yes, FLIP-27 is the target for all connectors. The old
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> interfaces will be deprecated, and connectors will either be
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> refactored to use the new ones or dropped.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> The caching should work for connectors that are using the
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> FLIP-27 interfaces; we should not introduce new features for the
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> old interfaces.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Best regards,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Martijn
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Thu, 12 May 2022 at 06:19, Александр Смирнов <smiralexan@gmail.com> wrote:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi Jark!
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Sorry for the late response. I would like to make some comments
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> and clarify my points.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 1) I agree with your first statement. I think we can achieve
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> both advantages this way: put the Cache interface in
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> flink-table-common, but have the implementations of it in
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> flink-table-runtime. Then, if a connector developer wants to use
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the existing cache strategies and their implementations, he can
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> just pass a lookupConfig to the planner, but if he wants to have
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> his own cache implementation in his TableFunction, he can use
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the existing interface for this purpose (we can explicitly point
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> this out in the documentation). In this way all configs and
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> metrics will be unified. WDYT?
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> If a filter can prune 90% of data in the cache, we will have
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 90% of lookup requests that can never be cached
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 2) Let me clarify the logic of the filters optimization in the
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> case of the LRU cache. The cache looks like Cache<RowData,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Collection<RowData>>. Here we always store the response of the
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> dimension table in the cache, even after applying the calc
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> function. I.e. if no rows are left after applying the filters to
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the result of the 'eval' method of the TableFunction, we store
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> an empty list under the lookup keys. Therefore the cache line
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> will still be filled, but will require much less memory (in
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> bytes). I.e. we don't completely filter out the keys whose
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> result was pruned, but we significantly reduce the memory
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> required to store this result. If the user knows about this
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> behavior, he can increase the 'max-rows' option before the start
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> of the job. But actually I came up with the idea that we can do
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> this automatically by using the 'maximumWeight' and 'weigher'
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> methods of the Guava cache [1]. The weight can be the size of
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the collection of rows (the cache value). Therefore the cache
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> can automatically fit many more records than before.
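[Editorial aside: Guava's CacheBuilder does expose maximumWeight(long) and weigher(...). The effect described above — an empty result for a filtered-out key weighs almost nothing, so the cache fits many more keys — can be illustrated with a stdlib-only LRU sketch that measures capacity in total cached rows rather than in keys. This illustrates the idea only and is not the proposed implementation.]

```java
import java.util.Collection;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** LRU cache whose capacity is measured in total cached rows (the "weight"),
 *  so empty results for filtered-out keys cost almost nothing. */
public class WeightedLookupCache<K, V> {
    private final long maxWeight;
    private long currentWeight = 0;
    private final LinkedHashMap<K, Collection<V>> map =
            new LinkedHashMap<>(16, 0.75f, true); // access-order gives LRU iteration

    public WeightedLookupCache(long maxWeight) { this.maxWeight = maxWeight; }

    public void put(K key, Collection<V> rows) {
        Collection<V> old = map.put(key, rows);
        if (old != null) currentWeight -= weigh(old);
        currentWeight += weigh(rows);
        // Evict least-recently-used entries until the total weight fits.
        Iterator<Map.Entry<K, Collection<V>>> it = map.entrySet().iterator();
        while (currentWeight > maxWeight && it.hasNext()) {
            currentWeight -= weigh(it.next().getValue());
            it.remove();
        }
    }

    public Collection<V> get(K key) { return map.get(key); }
    public int size() { return map.size(); }

    private long weigh(Collection<V> rows) {
        return Math.max(1, rows.size()); // an empty result still weighs 1
    }

    public static void main(String[] args) {
        WeightedLookupCache<String, String> cache = new WeightedLookupCache<>(5);
        cache.put("k1", List.of("row1", "row2")); // weight 2
        cache.put("k2", List.of());               // empty result: weight 1
        System.out.println(cache.size());         // prints 2
    }
}
```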
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Flink SQL has provided a standard way to do filter and
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> projection pushdown, i.e., SupportsFilterPushDown and
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> SupportsProjectionPushDown. That Jdbc/hive/HBase haven't
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> implemented the interfaces doesn't mean they are hard to
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> implement.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> It's debatable how difficult it will be to implement filter
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> pushdown. But I think the fact that currently no database
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> connector has filter pushdown at least means that this feature
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> won't be supported in connectors soon. Moreover, if we talk
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> about other connectors (not in the Flink repo), their databases
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> might not support all Flink filters (or not support filters at
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> all). I think users are interested in having the cache filters
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> optimization independently of supporting other features and
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> solving more complex (or unsolvable) problems.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 3) I agree with your third statement. Actually, in our internal
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> version I also tried to unify the logic of scanning and
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> reloading data from connectors. But unfortunately, I didn't
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> find a way to unify the logic of all the ScanRuntimeProviders
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> (InputFormat, SourceFunction, Source, ...) and reuse it in
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> reloading the ALL cache. As a result I settled on using
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> InputFormat, because it was used for scanning in all lookup
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> connectors. (I didn't know that there are plans to deprecate
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> InputFormat in favor of the FLIP-27 Source.) IMO, usage of the
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> FLIP-27 source in ALL caching is not a good idea, because this
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> source was designed to work in a distributed environment
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> (SplitEnumerator on the JobManager and SourceReaders on the
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> TaskManagers), not in one operator (the lookup join operator in
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> our case). There is not even a direct way to pass splits from
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the SplitEnumerator to the SourceReader (this logic works
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> through SplitEnumeratorContext, which requires
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> OperatorCoordinator.SubtaskGateway to send AddSplitEvents).
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Usage of InputFormat for the ALL cache seems much clearer and
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> easier. But if there are plans to refactor all connectors to
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> FLIP-27, I have the following idea: maybe we can abandon the
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> lookup join ALL cache in favor of a simple join with multiple
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> scans of the batch source? The point is that the only difference
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> between the lookup join ALL cache and a simple join with a batch
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> source is that in the first case scanning is performed multiple
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> times, in between which the state (cache) is cleared (correct me
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> if I'm wrong). So what if we extend the functionality of simple
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> join to support state reloading + extend the functionality of
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> scanning
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> batch source multiple times (this one should
> >>>>>>>> be
> >>>>>>>>>>> easy
> >>>>>>>>>>>>>>> with
> >>>>>>>>>>>>>>>>>>> new FLIP-27
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> source, that unifies streaming/batch reading
> >>>>>>>> -
> >>>>>>>>> we
> >>>>>>>>>>>>> will
> >>>>>>>>>>>>>>>>> need
> >>>>>>>>>>>>>>>>>>> to change
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> only SplitEnumerator, which will pass splits
> >>>>>>>>>> again
> >>>>>>>>>>>>> after
> >>>>>>>>>>>>>>>>>>> some TTL).
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> WDYT? I must say that this looks like a
> >>>>>>>>> long-term
> >>>>>>>>>>>>> goal
> >>>>>>>>>>>>>>>>> and
> >>>>>>>>>>>>>>>>>>> will make
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the scope of this FLIP even larger than you
> >>>>>>>>> said.
> >>>>>>>>>>>>> Maybe
> >>>>>>>>>>>>>>>>> we
> >>>>>>>>>>>>>>>>>>> can limit
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> ourselves to a simpler solution now
> >>>>>>>>>> (InputFormats).
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
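The reload-and-swap approach sketched above (re-scan the batch source after a TTL, then replace the cache state) can be illustrated with plain JDK primitives. This is a hypothetical sketch only: the Supplier stands in for a connector scan, the fixed TTL schedule mimics a SplitEnumerator re-assigning its splits, and no Flink API is used.

```java
import java.util.Map;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

// Illustrative "ALL" cache: periodically re-run a full scan (the Supplier
// stands in for a connector scan) and atomically swap the snapshot, so
// lookups never observe a half-cleared state.
class AllCacheSketch {
    private final AtomicReference<Map<String, String>> snapshot;
    private final ScheduledExecutorService reloader =
            Executors.newSingleThreadScheduledExecutor();

    AllCacheSketch(Supplier<Map<String, String>> fullScan, long ttlMillis) {
        this.snapshot = new AtomicReference<>(fullScan.get());
        // Re-scan after every TTL, mimicking a SplitEnumerator that
        // re-assigns all splits once the TTL expires.
        reloader.scheduleAtFixedRate(
                () -> snapshot.set(fullScan.get()),
                ttlMillis, ttlMillis, TimeUnit.MILLISECONDS);
    }

    String lookup(String key) {
        return snapshot.get().get(key);
    }

    void close() {
        reloader.shutdownNow();
    }
}
```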
> So to sum up, my points are these:
> 1) There is a way to make interfaces for caching in lookup join that
> are both concise and flexible.
> 2) The cache filters optimization is important both for LRU and ALL
> caches.
> 3) It is unclear when filter pushdown will be supported in Flink
> connectors, some of the connectors might not have the opportunity to
> support filter pushdown + as far as I know, filter pushdown currently
> works only for scanning (not lookup). So the cache filters +
> projections optimization should be independent from other features.
> 4) The ALL cache realization is a complex topic that involves multiple
> aspects of how Flink is developing. Dropping InputFormat in favor of
> the FLIP-27 Source would make the ALL cache realization really complex
> and unclear, so maybe instead of that we can extend the functionality
> of the simple join, or keep InputFormat in the case of the lookup join
> ALL cache?
> 
> Best regards,
> Smirnov Alexander
> 
> [1]
> https://guava.dev/releases/18.0/api/docs/com/google/common/cache/CacheBuilder.html#weigher(com.google.common.cache.Weigher)
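The linked Guava weigher bounds a cache by entry weight; a minimal JDK-only approximation of a size-bounded LRU lookup cache (no weigher, no TTL, single-threaded - all simplifying assumptions for illustration) might look like:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative only: a tiny LRU cache in plain JDK, approximating what
// Guava's CacheBuilder (see [1]) provides via maximumSize/weigher. A real
// lookup cache would also need TTL handling and thread safety.
class LruCacheSketch<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;

    LruCacheSketch(int maxEntries) {
        super(16, 0.75f, true); // access-order: eldest = least recently used
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // Evict the least recently used entry once the bound is exceeded.
        return size() > maxEntries;
    }
}
```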
> 
> Thu, 5 May 2022 at 20:34, Jark Wu <imjark@gmail.com>:
>> 
>> It's great to see the active discussion! I want to share my ideas:
>> 
>> 1) Implement the cache in the framework vs. the connectors' base
>> I don't have a strong opinion on this. Both ways should work (e.g.,
>> cache pruning, compatibility). The framework way can provide more
>> concise interfaces. The connector base way can define more flexible
>> cache strategies/implementations. We are still investigating whether
>> we can have both advantages. We should reach a consensus that the
>> chosen way should be a final state, and that we are on the path to
>> it.
>> 
>> 2) Filters and projections pushdown
>> I agree with Alex that the filter pushdown into the cache can benefit
>> the ALL cache a lot. However, this is not true for the LRU cache.
>> Connectors use a cache to reduce IO requests to databases for better
>> throughput. If a filter can prune 90% of the data in the cache, we
>> will have 90% of lookup requests that can never be cached and hit the
>> databases directly. That means the cache is meaningless in this case.
>> 
>> IMO, Flink SQL has provided a standard way to do filter and
>> projection pushdown, i.e., SupportsFilterPushDown and
>> SupportsProjectionPushDown. That Jdbc/Hive/HBase haven't implemented
>> the interfaces doesn't mean they are hard to implement. They should
>> implement the pushdown interfaces to reduce IO and the cache size.
>> The final state should be that the scan source and the lookup source
>> share the exact same pushdown implementation. I don't see why we need
>> to duplicate the pushdown logic in caches, which would complicate the
>> lookup join design.
>> 
>> 3) ALL cache abstraction
>> The ALL cache might be the most challenging part of this FLIP. We
>> have never provided a public reload-lookup interface. Currently, we
>> put the reload logic in the "eval" method of TableFunction. That's
>> hard for some sources (e.g., Hive). Ideally, a connector
>> implementation should share the logic of reload and scan, i.e.
>> ScanTableSource with InputFormat/SourceFunction/FLIP-27 Source.
>> However, InputFormat/SourceFunction are deprecated, and the FLIP-27
>> source is deeply coupled with SourceOperator. If we want to invoke
>> the FLIP-27 source in LookupJoin, this may make the scope of this
>> FLIP much larger. We are still investigating how to abstract the ALL
>> cache logic and reuse the existing source interfaces.
>> 
>> Best,
>> Jark
>> 
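The LRU-plus-filter concern in point 2 can be sketched as follows. The FilteredCacheSketch class, its fake row encoding, and the request counter are all illustrative assumptions, not Flink code; the point is only that rows rejected by a cache-side filter are never cached, so every repeat lookup for them hits the database again.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical illustration: a filter applied on the cache side means
// keys whose rows fail the filter can never be served from the cache.
class FilteredCacheSketch {
    private final Map<Integer, Integer> cache = new HashMap<>();
    private final Predicate<Integer> filter;
    int databaseRequests = 0;

    FilteredCacheSketch(Predicate<Integer> filter) {
        this.filter = filter;
    }

    Integer lookup(int key) {
        Integer cached = cache.get(key);
        if (cached != null) {
            return cached;
        }
        databaseRequests++;      // simulated IO request to the database
        int row = key * 10;      // pretend this is the fetched row
        if (filter.test(row)) {
            cache.put(key, row); // only matching rows are cached
            return row;
        }
        return null;             // filtered out -> never cached
    }
}
```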
>> On Thu, 5 May 2022 at 20:22, Roman Boyko <ro.v.boyko@gmail.com> wrote:
>> 
>>> It's a much more complicated activity and lies outside the scope of
>>> this improvement, because such pushdowns should be done for all
>>> ScanTableSource implementations (not only for lookup ones).
>>> 
>>> On Thu, 5 May 2022 at 19:02, Martijn Visser <martijnvisser@apache.org> wrote:
>>> 
>>>> Hi everyone,
>>>> 
>>>> One question regarding "And Alexander correctly mentioned that
>>>> filter pushdown still is not implemented for jdbc/hive/hbase." ->
>>>> Would an alternative solution be to actually implement these filter
>>>> pushdowns? I can imagine that there are many more benefits to doing
>>>> that, outside of lookup caching and metrics.
>>>> 
>>>> Best regards,
>>>> 
>>>> Martijn Visser
>>>> https://twitter.com/MartijnVisser82
>>>> https://github.com/MartijnVisser
>>>> 
>>>> On Thu, 5 May 2022 at 13:58, Roman Boyko <ro.v.boyko@gmail.com> wrote:
>>>> 
>>>>> Hi everyone!
>>>>> 
>>>>> Thanks for driving such a valuable improvement!
>>>>> 
>>>>> I do think that a single cache implementation would be a nice
>>>>> opportunity for users. And it will break the "FOR SYSTEM_TIME AS
>>>>> OF proc_time" semantics anyway - no matter how it is implemented.
>>>>> 
>>>>> Putting myself in the user's shoes, I can say that:
>>>>> 1) I would prefer to have the opportunity to cut the cache size
>>>>> down by simply filtering out unnecessary data. And the most handy
>>>>> place to do it is inside the LookupRunners. It would be a bit
>>>>> harder to pass it through the LookupJoin node to the
>>>>> TableFunction. And Alexander correctly mentioned that filter
>>>>> pushdown still is not implemented for jdbc/hive/hbase.
>>>>> 2) The ability to set different caching parameters for different
>>>>> tables is quite important. So I would prefer to set it through DDL
>>>>> rather than have the same TTL, strategy and other options for all
>>>>> lookup tables.
>>>>> 3) Providing the cache inside the framework really deprives us of
>>>>> extensibility (users won't be able to implement their own cache).
>>>>> But most probably this can be solved by creating more cache
>>>>> strategies and a wider set of configurations.
>>>>> 
>>>>> All these points are much closer to the schema proposed by
>>>>> Alexander. Qingsheng Ren, please correct me if I'm wrong - can all
>>>>> these facilities be simply implemented in your architecture?
>>>>> 
>>>>> Best regards,
>>>>> Roman Boyko
>>>>> e.: ro.v.boyko@gmail.com
>>>>> 
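Point 2 above (per-table cache parameters carried in DDL) could be sketched like this. The option keys 'lookup.cache.max-rows' and 'lookup.cache.ttl' and their defaults are invented for illustration; they are not the names decided in FLIP-221.

```java
import java.time.Duration;
import java.util.Map;

// Hypothetical sketch: each table's DDL options map carries its own
// cache settings, so two lookup tables can cache differently.
class CacheOptionsSketch {
    final long maxRows;
    final Duration ttl;

    CacheOptionsSketch(Map<String, String> ddlOptions) {
        // Invented option keys and defaults, for illustration only.
        this.maxRows = Long.parseLong(
                ddlOptions.getOrDefault("lookup.cache.max-rows", "10000"));
        this.ttl = Duration.parse(
                ddlOptions.getOrDefault("lookup.cache.ttl", "PT10M"));
    }
}
```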
>>>>> On Wed, 4 May 2022 at 21:01, Martijn Visser <martijnvisser@apache.org> wrote:
>>>>> 
>>>>>> Hi everyone,
>>>>>> 
>>>>>> I don't have much to chip in, but just wanted to express that I
>>>>>> really appreciate the in-depth discussion on this topic and I
>>>>>> hope that others will join the conversation.
>>>>>> 
>>>>>> Best regards,
>>>>>> 
>>>>>> Martijn
>>>>>> 
>>>>>> On Tue, 3 May 2022 at 10:15, Александр Смирнов <smiralexan@gmail.com> wrote:
>>>>>> 
>>>>>>> Hi Qingsheng, Leonard and Jark,
>>>>>>> 
>>>>>>> Thanks for your detailed feedback! However, I have questions
>>>>>>> about some of your statements (maybe I didn't get something?).
>>>>>>> 
>>>>>>>> Caching actually breaks the semantic of "FOR SYSTEM_TIME AS OF
>>>>>>>> proc_time"
>>>>>>> 
>>>>>>> I agree that the semantics of "FOR SYSTEM_TIME AS OF proc_time"
>>>>>>> is not fully implemented with caching, but as you said, users
>>>>>>> opt into it consciously to achieve better performance (no one
>>>>>>> proposed to enable caching by default, etc.). Or by users do you
>>>>>>> mean other developers of connectors? In that case developers
>>>>>>> explicitly specify whether their connector supports caching or
>>>>>>> not (in the list of supported options); no one makes them do
>>>>>>> that if they don't want to. So what exactly is the difference
>>>>>>> between implementing caching in the flink-table-runtime module
>>>>>>> and in flink-table-common from this point of view? How does it
>>>>>>> affect breaking/not breaking the semantics of "FOR SYSTEM_TIME
>>>>>>> AS OF proc_time"?
>>>>>>> 
>>>>>>>> confront a situation that allows table options in DDL to
>>>>>>>> control the behavior of the framework, which has never happened
>>>>>>>> previously and should be cautious
>>>>>>> 
>>>>>>> If we talk about the main differences between the semantics of
>>>>>>> DDL options and config options ("table.exec.xxx"), isn't it
>>>>>>> about limiting the scope of the options + their importance for
>>>>>>> the user's business logic, rather than the specific location of
>>>>>>> the corresponding logic in the framework? I mean that in my
>>>>>>> design, for example, putting an option with the lookup cache
>>>>>>> strategy into the configurations would be the wrong decision,
>>>>>>> because it directly affects the user's business logic (not just
>>>>>>> performance optimization) + touches just several functions of
>>>>>>> ONE table (there can be multiple tables with different caches).
>>>>>>> Does it really matter for the user (or someone else) where the
>>>>>>> logic that is affected by the applied option is located?
>>>>>>> Also I can remember the DDL option 'sink.parallelism', which in
>>>>>>> some way "controls the behavior of the framework", and I don't
>>>>>>> see any problem there.
>>>>>>> 
>>>>>>>> introduce a new interface for this all-caching scenario and the
>>>>>>>> design would become more complex
>>>>>>> 
>>>>>>> This is a subject for a separate discussion, but actually in our
>>>>>>> internal version we solved this problem quite easily - we reused
>>>>>>> the InputFormat class (so there is no need for a new API). The
>>>>>>> point is that currently all lookup connectors use InputFormat
>>>>>>> for scanning the data in batch mode: HBase, JDBC and even Hive -
>>>>>>> it uses the class PartitionReader, which is actually just a
>>>>>>> wrapper around InputFormat. The advantage of this solution is
>>>>>>> the ability to reload cache data in parallel (the number of
>>>>>>> threads depends on the number of InputSplits, but has an upper
>>>>>>> limit). As a result the cache reload time is significantly
>>>>>>> reduced (as is the time the input stream is blocked). I know
>>>>>>> that we usually try to avoid concurrency in Flink code, but
>>>>>>> maybe this one can be an exception. BTW, I don't claim it's an
>>>>>>> ideal solution; maybe there are better ones.
>>>>>>> 
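The parallel reload described above (one thread per split, with an upper limit) can be sketched with a plain JDK thread pool. The split representation and reader function here are illustrative assumptions, not Flink's InputFormat/InputSplit API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Illustrative sketch: each "split" (standing in for an InputSplit) is
// read on a bounded pool and merged into one cache map.
class ParallelReloadSketch {
    static Map<String, String> reload(
            List<List<String>> splits,
            Function<String, String> readRow,
            int maxThreads) {
        ExecutorService pool = Executors.newFixedThreadPool(
                Math.min(splits.size(), maxThreads)); // upper thread limit
        Map<String, String> cache = new ConcurrentHashMap<>();
        try {
            List<Future<?>> futures = new ArrayList<>();
            for (List<String> split : splits) {
                futures.add(pool.submit(() -> {
                    for (String key : split) {
                        cache.put(key, readRow.apply(key));
                    }
                }));
            }
            for (Future<?> f : futures) {
                try {
                    f.get(); // wait until every split is loaded
                } catch (Exception e) {
                    throw new RuntimeException(e);
                }
            }
        } finally {
            pool.shutdown();
        }
        return cache;
    }
}
```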
>>>>>>>> Providing the cache in the framework might introduce
>>>>>>>> compatibility issues
>>>>>>> 
>>>>>>> That's possible only if the developer of the connector doesn't
>>>>>>> properly refactor his code and uses the new cache options
>>>>>>> incorrectly (i.e. explicitly provides the same options in 2
>>>>>>> different code places). For correct behavior all he needs to do
>>>>>>> is redirect the existing options to the framework's LookupConfig
>>>>>>> (+ maybe add an alias for options, if the naming differed), and
>>>>>>> everything will be transparent for users. If the developer does
>>>>>>> no refactoring at all, nothing changes for the connector because
>>>>>>> of backward compatibility. Also, if a developer wants to use his
>>>>>>> own cache logic, he can simply refuse to pass some of the
>>>>>>> configs to the framework, and instead provide his own
>>>>>>> implementation with the already existing configs and metrics
>>>>>>> (but actually I think that's a rare case).
>>>>>>> 
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> filters and projections should be
> >>>>>>>> pushed
> >>>>>>>>>> all
> >>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>> way
> >>>>>>>>>>>>>>>>>>> down
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> to
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> table
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> function, like what we do in the scan
> >>>>>>>>>> source
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> It's the great purpose. But the truth
> >>>>>>>> is
> >>>>>>>>>> that
> >>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>> ONLY
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> connector
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> that
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> supports filter pushdown is
> >>>>>>>>>>>>> FileSystemTableSource
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> (no database connector supports it
> >>>>>>>>>>> currently).
> >>>>>>>>>>>>>>> Also
> >>>>>>>>>>>>>>>>>>> for some
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> databases
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> it's simply impossible to pushdown such
> >>>>>>>>>>> complex
> >>>>>>>>>>>>>>>>>> filters
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> that we
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> have
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> in Flink.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> only applying these optimizations to
> >>>>>>>> the
> >>>>>>>>>>> cache
> >>>>>>>>>>>>>>>>> seems
> >>>>>>>>>>>>>>>>>>> not
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> quite
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> useful
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Filters can cut off an arbitrarily
> >>>>>>>> large
> >>>>>>>>>>>>> amount of
> >>>>>>>>>>>>>>>>>> data
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> from the
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> dimension table. For a simple example,
> >>>>>>>>>>> suppose
> >>>>>>>>>>>>> in
> >>>>>>>>>>>>>>>>>>> dimension
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> table
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 'users'
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> we have column 'age' with values from
> >>>>>>>> 20
> >>>>>>>>> to
> >>>>>>>>>>> 40,
> >>>>>>>>>>>>>>> and
> >>>>>>>>>>>>>>>>>>> input
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> stream
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 'clicks' that is ~uniformly distributed
> >>>>>>>>> by
> >>>>>>>>>>> age
> >>>>>>>>>>>>> of
> >>>>>>>>>>>>>>>>>>> users. If
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> we
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> have
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> filter 'age > 30',
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> there will be twice less data in cache.
> >>>>>>>>>> This
> >>>>>>>>>>>>> means
> >>>>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>> user
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> can
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> increase 'lookup.cache.max-rows' by
> >>>>>>>>> almost
> >>>>>>>>>> 2
> >>>>>>>>>>>>>>> times.
> >>>>>>>>>>>>>>>>>> It
> >>>>>>>>>>>>>>>>>>> will
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> gain a
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> huge
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> performance boost. Moreover, this
> >>>>>>>>>>> optimization
> >>>>>>>>>>>>>>>>> starts
> >>>>>>>>>>>>>>>>>>> to
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> really
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> shine
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> in 'ALL' cache, where tables without
> >>>>>>>>>> filters
> >>>>>>>>>>>>> and
> >>>>>>>>>>>>>>>>>>> projections
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> can't
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> fit
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> in memory, but with them - can. This
> >>>>>>>>> opens
> >>>>>>>>>> up
> >>>>>>>>>>>>>>>>>>> additional
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> possibilities
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> for users. And this doesn't sound as
> >>>>>>>> 'not
> >>>>>>>>>>> quite
> >>>>>>>>>>>>>>>>>>> useful'.
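The halving argument above can be sketched with a toy cache that applies the row filter before insertion. This is illustrative Java only — the `LookupCache` class and its methods are hypothetical, not Flink or FLIP-221 APIs:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.IntPredicate;

/**
 * Toy sketch (hypothetical names, not Flink API): a lookup cache that runs
 * the join's residual filter before rows are inserted, so rows the filter
 * would discard anyway never consume cache capacity.
 */
public class FilterBeforeCache {
    static class LookupCache {
        final Map<Integer, List<Integer>> entries = new HashMap<>();
        final IntPredicate rowFilter;

        LookupCache(IntPredicate rowFilter) {
            this.rowFilter = rowFilter;
        }

        // Cache only the rows that survive the filter.
        void put(int key, List<Integer> lookedUpAges) {
            List<Integer> kept = new ArrayList<>();
            for (int age : lookedUpAges) {
                if (rowFilter.test(age)) {
                    kept.add(age);
                }
            }
            entries.put(key, kept);
        }

        int cachedRows() {
            return entries.values().stream().mapToInt(List::size).sum();
        }
    }

    public static void main(String[] args) {
        // Dimension rows with ages 21..40, uniformly distributed.
        List<Integer> ages = new ArrayList<>();
        for (int age = 21; age <= 40; age++) {
            ages.add(age);
        }

        LookupCache unfiltered = new LookupCache(a -> true);
        LookupCache filtered = new LookupCache(a -> a > 30); // filter 'age > 30'
        unfiltered.put(1, ages);
        filtered.put(1, ages);

        // The filter cuts the cached rows in half: 20 vs 10.
        System.out.println(unfiltered.cachedRows() + " vs " + filtered.cachedRows());
    }
}
```

With the same `lookup.cache.max-rows` budget, the filtered cache can therefore hold twice as many distinct lookup keys.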
> It would be great to hear other voices regarding this topic! Because we have quite a lot of controversial points, and I think with the help of others it will be easier for us to come to a consensus.
>
> Best regards,
> Smirnov Alexander
>
> On Fri, Apr 29, 2022 at 22:33, Qingsheng Ren <renqschn@gmail.com> wrote:
> > Hi Alexander and Arvid,
> >
> > Thanks for the discussion and sorry for my late response! We had an internal discussion together with Jark and Leonard and I'd like to summarize our ideas. Instead of implementing the cache logic in the table runtime layer or wrapping around the user-provided table function, we prefer to introduce some new APIs extending TableFunction, with these concerns:
> >
> > 1. Caching actually breaks the semantic of "FOR SYSTEM_TIME AS OF proc_time", because it couldn't truly reflect the content of the lookup table at the moment of querying. If users choose to enable caching on the lookup table, they implicitly indicate that this breakage is acceptable in exchange for the performance. So we prefer not to provide caching at the table runtime level.
> >
> > 2. If we make the cache implementation in the framework (whether in a runner or a wrapper around TableFunction), we have to confront a situation that allows table options in DDL to control the behavior of the framework, which has never happened previously and should be treated cautiously. Under the current design the behavior of the framework should only be specified by configurations ("table.exec.xxx"), and it's hard to apply these general configs to a specific table.
> >
> > 3. We have use cases where the lookup source loads and refreshes all records periodically into memory to achieve high lookup performance (like the Hive connector in the community; this is also widely used by our internal connectors). Wrapping the cache around the user's TableFunction works fine for LRU caches, but I think we would have to introduce a new interface for this all-caching scenario and the design would become more complex.
> >
> > 4. Providing the cache in the framework might introduce compatibility issues to existing lookup sources: there might exist two caches with totally different strategies if the user incorrectly configures the table (one in the framework and another implemented by the lookup source).
> >
> > As for the optimization mentioned by Alexander, I think filters and projections should be pushed all the way down to the table function, like what we do in the scan source, instead of to the runner with the cache. The goal of using a cache is to reduce the network I/O and the pressure on the external system, and only applying these optimizations to the cache seems not quite useful.
> >
> > I made some updates to the FLIP[1] to reflect our ideas. We prefer to keep the cache implementation as a part of TableFunction, and we could provide some helper classes (CachingTableFunction, AllCachingTableFunction, CachingAsyncTableFunction) to developers and regulate the metrics of the cache. Also, I made a POC[2] for your reference.
> >
> > Looking forward to your ideas!
> >
> > [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-221+Abstraction+for+lookup+source+cache+and+metric
> > [2] https://github.com/PatrickRen/flink/tree/FLIP-221
> >
> > Best regards,
> >
> > Qingsheng
> >
> > On Tue, Apr 26, 2022 at 4:45 PM Александр Смирнов <smiralexan@gmail.com> wrote:
> > > Thanks for the response, Arvid!
> > >
> > > I have a few comments on your message.
> > >
> > > > but could also live with an easier solution as the first step
> > >
> > > I think that these 2 ways are mutually exclusive (the one originally proposed by Qingsheng and mine), because conceptually they follow the same goal, but the implementation details are different. If we go one way, moving to the other way in the future will mean deleting existing code and once again changing the API for connectors. So I think we should reach a consensus with the community about that and then work together on this FLIP, i.e. divide the work into tasks for different parts of the FLIP (for example, LRU cache unification / introducing the proposed set of metrics / further work…). WDYT, Qingsheng?
> > >
> > > > as the source will only receive the requests after filter
> > >
> > > Actually, if filters are applied to fields of the lookup table, we first must do the requests, and only after that can we filter the responses, because lookup connectors don't have filter pushdown. So if filtering is done before caching, there will be many fewer rows in the cache.
> > >
> > > > @Alexander unfortunately, your architecture is not shared. I don't know the
> > > > solution to share images to be honest.
> > >
> > > Sorry for that, I'm a bit new to such kinds of conversations :) I have no write access to the Confluence, so I made a Jira issue where I described the proposed changes in more detail - https://issues.apache.org/jira/browse/FLINK-27411.
> > >
> > > Will be happy to get more feedback!
> > >
> > > Best,
> > > Smirnov Alexander
> > >
> > > On Mon, Apr 25, 2022 at 19:49, Arvid Heise <arvid@apache.org> wrote:
> > > > Hi Qingsheng,
> > > >
> > > > Thanks for driving this; the inconsistency was not satisfying for me.
> > > >
> > > > I second Alexander's idea but could also live with an easier solution as the first step: instead of making caching an implementation detail of TableFunction X, rather devise a caching layer around X. So the proposal would be a CachingTableFunction that delegates to X in case of misses and otherwise manages the cache. Lifting it into the operator model as proposed would be even better but is probably unnecessary in the first step for a lookup source (as the source will only receive the requests after the filter; applying the projection may be more interesting, to save memory).
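A minimal sketch of this delegation pattern, in plain generic Java — the class below is hypothetical and is not the CachingTableFunction discussed in the FLIP; real Flink code would key on RowData and emit collected rows:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/**
 * Hypothetical sketch of a "caching layer around X": the wrapper consults
 * its cache first and delegates to the underlying lookup function only on
 * a miss. Plain generics are used to keep the delegation idea visible.
 */
public class CachingLookupWrapper<K, V> {
    private final Function<K, V> delegate; // the underlying lookup "X"
    private final Map<K, V> cache = new HashMap<>();
    private long hits;
    private long misses;

    public CachingLookupWrapper(Function<K, V> delegate) {
        this.delegate = delegate;
    }

    public V lookup(K key) {
        V cached = cache.get(key);
        if (cached != null) {
            hits++; // served from cache, no external request
            return cached;
        }
        misses++;
        V loaded = delegate.apply(key); // miss: ask the real source
        cache.put(key, loaded);
        return loaded;
    }

    public long hits() { return hits; }
    public long misses() { return misses; }
}
```

The hit/miss counters are included because a wrapper of this shape is also a natural place to report the cache metrics discussed in this thread.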
> > > > Another advantage is that all the changes of this FLIP would be limited to options, with no need for new public interfaces. Everything else remains an implementation detail of the Table runtime. That means we can easily incorporate the optimization potential that Alexander pointed out later.
> > > >
> > > > @Alexander unfortunately, your architecture is not shared. I don't know the solution to share images to be honest.
> > > >
> > > > On Fri, Apr 22, 2022 at 5:04 PM Александр Смирнов <smiralexan@gmail.com> wrote:
> > > > > Hi Qingsheng! My name is Alexander, I'm not a committer yet, but I'd really like to become one. And this FLIP really interested me. Actually, I have worked on a similar feature in my company's Flink fork, and we would like to share our thoughts on this and make the code open source.
> > > > >
> > > > > I think there is a better alternative than introducing an abstract class for TableFunction (CachingTableFunction). As you know, TableFunction lives in the flink-table-common module, which provides only an API for working with tables – it's very convenient for importing in connectors. In turn, CachingTableFunction contains logic for runtime execution, so this class and everything connected with it should be located in another module, probably in flink-table-runtime. But this would require connectors to depend on another module, which contains a lot of runtime logic, and that doesn't sound good.
> > > > >
> > > > > I suggest adding a new method 'getLookupConfig' to LookupTableSource or LookupRuntimeProvider to allow connectors to only pass configurations to the planner, so they won't depend on the runtime realization. Based on these configs the planner will construct a lookup join operator with the corresponding runtime logic (ProcessFunctions in the module flink-table-runtime). The architecture looks like in the pinned image (the LookupConfig class there is actually your CacheConfig).
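As a rough illustration of this "configuration only" idea, a value class of the following shape could be what the proposed getLookupConfig method returns. All names and fields here are illustrative guesses, not the FLIP's actual API; only LookupJoinRunner and LookupJoinCachingRunner are class names taken from this thread:

```java
import java.time.Duration;

/**
 * Illustrative sketch (names are guesses, not the FLIP's API): a plain
 * value class the connector hands to the planner; the planner then picks
 * a runtime runner, and the connector never touches runtime classes.
 */
public class LookupConfigSketch {
    public enum CacheStrategy { NONE, LRU, ALL }

    private final CacheStrategy strategy;
    private final long maxRows;
    private final Duration ttl;

    public LookupConfigSketch(CacheStrategy strategy, long maxRows, Duration ttl) {
        this.strategy = strategy;
        this.maxRows = maxRows;
        this.ttl = ttl;
    }

    public CacheStrategy getStrategy() { return strategy; }
    public long getMaxRows() { return maxRows; }
    public Duration getTtl() { return ttl; }

    /** How a planner might map the config to a runner class name. */
    public String plannedRunner() {
        switch (strategy) {
            case LRU: return "LookupJoinCachingRunner";
            case ALL: return "LookupJoinAllCachingRunner"; // hypothetical name
            default:  return "LookupJoinRunner";
        }
    }
}
```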
> > > > > The classes in flink-table-planner that will be responsible for this are CommonPhysicalLookupJoin and its inheritors. The current classes for lookup join in flink-table-runtime are LookupJoinRunner, AsyncLookupJoinRunner, LookupJoinRunnerWithCalc and AsyncLookupJoinRunnerWithCalc.
> > > > >
> > > > > I suggest adding classes LookupJoinCachingRunner, LookupJoinCachingRunnerWithCalc, etc.
> > > > >
> > > > > And here comes another, more powerful advantage of such a solution. If we have the caching logic on a lower level, we can apply some optimizations to it. LookupJoinRunnerWithCalc was named like this because it uses the 'calc' function, which actually mostly consists of filters and projections.
> > > > >
> > > > > For example, in a join of table A with lookup table B under the condition 'JOIN … ON A.id = B.id AND A.age = B.age + 10 WHERE B.salary > 1000', the 'calc' function will contain the filters A.age = B.age + 10 and B.salary > 1000.
> > > > >
> > > > > If we apply this function before
> >>>>>>>>>> storing
> >>>>>>>>>>>>>>>>> records
> >>>>>>>>>>>>>>>>>> in
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> cache,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> size
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> of
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> cache will be significantly
> >>>>>>>> reduced:
> >>>>>>>>>>>>> filters =
> >>>>>>>>>>>>>>>>>>> avoid
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> storing
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> useless
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> records in cache, projections =
> >>>>>>>>> reduce
> >>>>>>>>>>>>>>> records’
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> size. So
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> initial
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> max number of records in cache can
> >>>>>>>> be
> >>>>>>>>>>>>>>> increased
> >>>>>>>>>>>>>>>>>> by
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> user.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> What do you think about it?
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On 2022/04/19 02:47:11 Qingsheng
> >>>>>>>> Ren
> >>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi devs,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Yuan and I would like to start a
> >>>>>>>>>>>>> discussion
> >>>>>>>>>>>>>>>>>> about
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> FLIP-221[1],
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> which
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> introduces an abstraction of lookup
> >>>>>>>>>> table
> >>>>>>>>>>>>>>> cache
> >>>>>>>>>>>>>>>>>> and
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> its
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> standard
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> metrics.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Currently each lookup table source
> >>>>>>>>>>> should
> >>>>>>>>>>>>>>>>>>> implement
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> their
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> own
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> cache to
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> store lookup results, and there
> >>>>>>>>> isn’t a
> >>>>>>>>>>>>>>>>> standard
> >>>>>>>>>>>>>>>>>> of
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> metrics
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> for
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> users and
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> developers to tuning their jobs
> >>>>>>>> with
> >>>>>>>>>>> lookup
> >>>>>>>>>>>>>>>>>> joins,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> which
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> is a
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> quite
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> common
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> use case in Flink table / SQL.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Therefore we propose some new APIs
> >>>>>>>>>>>>> including
> >>>>>>>>>>>>>>>>>>> cache,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> metrics,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> wrapper
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> classes of TableFunction and new
> >>>>>>>>> table
> >>>>>>>>>>>>>>> options.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Please
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> take a
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> look
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> at the
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> FLIP page [1] to get more details.
> >>>>>>>>> Any
> >>>>>>>>>>>>>>>>>> suggestions
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> and
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> comments
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> would be
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> appreciated!
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> [1]
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>
> >>>>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-221+Abstraction+for+lookup+source+cache+and+metric
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Best regards,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Qingsheng
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> --
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Best Regards,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Qingsheng Ren
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Real-time Computing Team
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Alibaba Cloud
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Email: renqschn@gmail.com
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> --
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Best regards,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Roman Boyko
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> e.: ro.v.boyko@gmail.com
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>> --
> >>>>>>>>>>>>>>>>>>>>>>>>>>> Best Regards,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>> Qingsheng Ren
> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>> Real-time Computing Team
> >>>>>>>>>>>>>>>>>>>>>>>>>>> Alibaba Cloud
> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>> Email: renqschn@gmail.com
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>
> >>>>>>
> >>>>>
> >>
>
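The filter-before-cache optimization described above can be sketched as follows. This is a minimal, dependency-free illustration: generic Java types stand in for Flink's RowData, and all class and method names here are hypothetical, not part of any proposed API.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.function.Predicate;

// Sketch: apply the "calc" part of the lookup join (filter + projection)
// before caching, so filtered-out rows are never stored and stored rows
// are smaller.
public class CalcBeforeCache<K, R> {
    private final Map<K, List<R>> cache = new HashMap<>();
    private final Predicate<R> filter;        // e.g. B.salary > 1000
    private final Function<R, R> projection;  // keep only the needed fields

    public CalcBeforeCache(Predicate<R> filter, Function<R, R> projection) {
        this.filter = filter;
        this.projection = projection;
    }

    public void put(K key, List<R> lookedUpRows) {
        List<R> reduced = new ArrayList<>();
        for (R row : lookedUpRows) {
            if (filter.test(row)) {
                reduced.add(projection.apply(row)); // smaller record
            }
        }
        // rows that fail the filter never occupy cache space
        cache.put(key, reduced);
    }

    public List<R> getIfPresent(K key) {
        return cache.get(key);
    }
}
```

With the same memory budget, this lets the user raise the maximum number of cached records, exactly as the message argues.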


Re: [DISCUSS] FLIP-221 Abstraction for lookup source cache and metric

Posted by Qingsheng Ren <re...@apache.org>.
Hi Jingsong,

1. Updated and thanks for the reminder!

2. We could do so for the implementation, but as a public interface I prefer not to introduce another layer and expose too much, since this FLIP is already a huge one with a bunch of classes and interfaces.

Best,
Qingsheng

> On Jun 22, 2022, at 11:16, Jingsong Li <ji...@gmail.com> wrote:
> 
> Thanks Qingsheng and all.
> 
> I like this design.
> 
> Some comments:
> 
> 1. LookupCache implements Serializable?
> 
> 2. Minor: After FLIP-234 [1], there should be many connectors that
> implement both PartialCachingLookupProvider and
> PartialCachingAsyncLookupProvider. Can we extract a common interface
> for `LookupCache getCache();` to ensure consistency?
> 
> [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-234%3A+Support+Retryable+Lookup+Join+To+Solve+Delayed+Updates+Issue+In+External+Systems
> 
> Best,
> Jingsong
> 
> On Tue, Jun 21, 2022 at 4:09 PM Qingsheng Ren <re...@apache.org> wrote:
>> 
>> Hi devs,
>> 
>> I’d like to push FLIP-221 forward a little bit. Recently we had some offline discussions and updated the FLIP. Here’s the diff compared to the previous version:
>> 
>> 1. (Async)LookupFunctionProvider is designed as a base interface for constructing lookup functions.
>> 2. From the LookupFunction we extend PartialCaching / FullCachingLookupProvider for partial and full caching mode.
>> 3. Introduce CacheReloadTrigger for specifying the reload strategy in full caching mode, and provide two default implementations (PeriodicCacheReloadTrigger / TimedCacheReloadTrigger)
>> 
>> Looking forward to your replies~
>> 
>> Best,
>> Qingsheng
>> 
>>> On Jun 2, 2022, at 17:15, Qingsheng Ren <re...@gmail.com> wrote:
>>> 
>>> Hi Becket,
>>> 
>>> Thanks for your feedback!
>>> 
>>> 1. An alternative way is to let the cache implementation decide whether
>>> to store a missing key in the cache, instead of the framework.
>>> This sounds more reasonable and makes the LookupProvider interface
>>> cleaner. I can update the FLIP and clarify in the JavaDoc of
>>> LookupCache#put that the cache should decide whether to store an empty
>>> collection.
>>> 
>>> 2. Initially the builder pattern is for the extensibility of
>>> LookupProvider interfaces that we could need to add more
>>> configurations in the future. We can remove the builder now as we have
>>> resolved the issue in 1. As for the builder in DefaultLookupCache I
>>> prefer to keep it because we have a lot of arguments in the
>>> constructor.
>>> 
>>> 3. I think this might overturn the overall design. I agree with
>>> Becket's idea that the API design should be layered considering
>>> extensibility and it'll be great to have one unified interface
>>> supporting both partial, full and even mixed custom strategies, but we
>>> have some issues to resolve. The original purpose of treating full
>>> caching separately is that we'd like to reuse the ability of
>>> ScanRuntimeProvider. Developers just need to hand over Source /
>>> SourceFunction / InputFormat so that the framework could be able to
>>> compose the underlying topology and control the reload (maybe in a
>>> distributed way). Under your design we leave the reload operation
>>> totally to the CacheStrategy and I think it will be hard for
>>> developers to reuse the source in the initializeCache method.
>>> 
>>> Best regards,
>>> 
>>> Qingsheng
>>> 
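Point 1 in Qingsheng's reply above, letting the cache implementation itself decide whether a missing key is stored, can be sketched as follows. This is a hypothetical cache with generic types; the FLIP's actual LookupCache works on RowData and its constructor/builder differs.

```java
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;

// Sketch: the cache decides whether a lookup miss (empty result) is
// stored, instead of the framework making that call for it.
public class MissingKeyAwareCache<K, V> {
    private final Map<K, Collection<V>> store = new HashMap<>();
    private final boolean cacheMissingKey;

    public MissingKeyAwareCache(boolean cacheMissingKey) {
        this.cacheMissingKey = cacheMissingKey;
    }

    public void put(K key, Collection<V> value) {
        if (value.isEmpty() && !cacheMissingKey) {
            return; // skip storing misses; the next lookup hits the source again
        }
        // an empty collection marks a known-missing key
        store.put(key, value);
    }

    public Collection<V> getIfPresent(K key) {
        return store.get(key);
    }
}
```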
>>> On Thu, Jun 2, 2022 at 1:50 PM Becket Qin <be...@gmail.com> wrote:
>>>> 
>>>> Thanks for updating the FLIP, Qingsheng. A few more comments:
>>>> 
>>>> 1. I am still not sure about what is the use case for cacheMissingKey().
>>>> More specifically, when would users want to have getCache() return a
>>>> non-empty value and cacheMissingKey() returns false?
>>>> 
>>>> 2. The builder pattern. Usually the builder pattern is used when there are
>>>> a lot of variations of constructors. For example, if a class has three
>>>> variables and all of them are optional, so there could potentially be many
>>>> combinations of the variables. But in this FLIP, I don't see such a case.
>>>> What is the reason we have builders for all the classes?
>>>> 
>>>> 3. Should the caching strategy be excluded from the top level provider API?
>>>> Technically speaking, the Flink framework should only have two interfaces
>>>> to deal with:
>>>>   A) LookupFunction
>>>>   B) AsyncLookupFunction
>>>> Orthogonally, we *believe* there are two different strategies people can do
>>>> caching. Note that the Flink framework does not care what is the caching
>>>> strategy here.
>>>>   a) partial caching
>>>>   b) full caching
>>>> 
>>>> Putting them together, we end up with 3 combinations that we think are
>>>> valid:
>>>>    Aa) PartialCachingLookupFunctionProvider
>>>>    Ba) PartialCachingAsyncLookupFunctionProvider
>>>>    Ab) FullCachingLookupFunctionProvider
>>>> 
>>>> However, the caching strategy could actually be quite flexible. E.g. an
>>>> initial full cache load followed by some partial updates. Also, I am not
>>>> 100% sure if the full caching will always use ScanTableSource. Including
>>>> the caching strategy in the top level provider API would make it harder to
>>>> extend.
>>>> 
>>>> One possible solution is to just have *LookupFunctionProvider* and
>>>> *AsyncLookupFunctionProvider
>>>> *as the top level API, both with a getCacheStrategy() method returning an
>>>> optional CacheStrategy. The CacheStrategy class would have the following
>>>> methods:
>>>> 1. void open(Context), the context exposes some of the resources that may
>>>> be useful for the the caching strategy, e.g. an ExecutorService that is
>>>> synchronized with the data processing, or a cache refresh trigger which
>>>> blocks data processing and refresh the cache.
>>>> 2. void initializeCache(), a blocking method allows users to pre-populate
>>>> the cache before processing any data if they wish.
>>>> 3. void maybeCache(RowData key, Collection<RowData> value), blocking or
>>>> non-blocking method.
>>>> 4. void refreshCache(), a blocking / non-blocking method that is invoked by
>>>> the Flink framework when the cache refresh trigger is pulled.
>>>> 
>>>> In the above design, partial caching and full caching would be
>>>> implementations of the CachingStrategy. And it is OK for users to implement
>>>> their own CachingStrategy if they want to.
>>>> 
>>>> Thanks,
>>>> 
>>>> Jiangjie (Becket) Qin
>>>> 
>>>> 
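Becket's four-method CacheStrategy proposal above could be transcribed into code as follows. This is a compilable sketch of the alternative design under discussion, not an adopted Flink API; the contents of Context are an assumption based on his description.

```java
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ScheduledExecutorService;

// Sketch of the alternative design: one pluggable strategy interface
// instead of separate partial/full caching provider interfaces.
public interface CacheStrategy<K, V> {

    /** Resources useful to the strategy, e.g. an executor synchronized with processing. */
    interface Context {
        ScheduledExecutorService executorService();
    }

    /** Called once before any data is processed. */
    void open(Context context);

    /** Blocking pre-population of the cache, e.g. an initial full load. */
    void initializeCache();

    /** Offer a looked-up key/value pair to the cache (may be ignored). */
    void maybeCache(K key, Collection<V> value);

    /** Invoked by the framework when the cache refresh trigger fires. */
    void refreshCache();
}

// A trivial partial-caching strategy as one possible implementation:
// it caches every looked-up pair and clears everything on refresh.
class PartialStrategy<K, V> implements CacheStrategy<K, V> {
    final Map<K, Collection<V>> cache = new HashMap<>();

    public void open(Context context) {}
    public void initializeCache() {}
    public void maybeCache(K key, Collection<V> value) { cache.put(key, value); }
    public void refreshCache() { cache.clear(); }
}
```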
>>>> On Thu, Jun 2, 2022 at 12:14 PM Jark Wu <im...@gmail.com> wrote:
>>>> 
>>>>> Thank Qingsheng for the detailed summary and updates,
>>>>> 
>>>>> The changes look good to me in general. I just have one minor improvement
>>>>> comment.
>>>>> Could we add a static util method to the "FullCachingReloadTrigger"
>>>>> interface for quick usage?
>>>>> 
>>>>> #periodicReloadAtFixedRate(Duration)
>>>>> #periodicReloadWithFixedDelay(Duration)
>>>>> 
>>>>> I think we can also do this for LookupCache. Because users may not know
>>>>> where is the default
>>>>> implementations and how to use them.
>>>>> 
>>>>> Best,
>>>>> Jark
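The static util methods Jark suggests could look roughly like this. The sketch only illustrates the static-factory convenience being proposed; the real trigger interface in the FLIP has a different shape (open/close methods with a context), so the single-method interface below is an assumption for brevity.

```java
import java.time.Duration;

// Sketch: static factory methods on the trigger interface for quick
// usage, so users don't have to locate the default implementations.
public interface FullCachingReloadTrigger {

    /** How often the full cache should be reloaded. */
    Duration reloadInterval();

    static FullCachingReloadTrigger periodicReloadAtFixedRate(Duration interval) {
        return () -> interval; // fixed-rate scheduling in a real implementation
    }

    static FullCachingReloadTrigger periodicReloadWithFixedDelay(Duration delay) {
        return () -> delay; // fixed-delay scheduling in a real implementation
    }
}
```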
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> On Wed, 1 Jun 2022 at 18:32, Qingsheng Ren <re...@gmail.com> wrote:
>>>>> 
>>>>>> Hi Jingsong,
>>>>>> 
>>>>>> Thanks for your comments!
>>>>>> 
>>>>>>> AllCache definition is not flexible, for example, PartialCache can use
>>>>>> any custom storage, while the AllCache can not, AllCache can also be
>>>>>> considered to store memory or disk, also need a flexible strategy.
>>>>>> 
>>>>>> We had an offline discussion with Jark and Leonard. Basically we think
>>>>>> exposing the interface of full cache storage to connector developers
>>>>> might
>>>>>> limit our future optimizations. The storage of full caching shouldn’t
>>>>> have
>>>>>> too many variations for different lookup tables so making it pluggable
>>>>>> might not help a lot. Also I think it is not quite easy for connector
>>>>>> developers to implement such an optimized storage. We can keep optimizing
>>>>>> this storage in the future and all full caching lookup tables would
>>>>> benefit
>>>>>> from this.
>>>>>> 
>>>>>>> We are more inclined to deprecate the connector `async` option when
>>>>>> discussing FLIP-234. Can we remove this option from this FLIP?
>>>>>> 
>>>>>> Thanks for the reminder! This option has been removed in the latest
>>>>>> version.
>>>>>> 
>>>>>> Best regards,
>>>>>> 
>>>>>> Qingsheng
>>>>>> 
>>>>>> 
>>>>>>> On Jun 1, 2022, at 15:28, Jingsong Li <ji...@gmail.com> wrote:
>>>>>>> 
>>>>>>> Thanks Alexander for your reply. We can discuss the new interface when
>>>>> it
>>>>>>> comes out.
>>>>>>> 
>>>>>>> We are more inclined to deprecate the connector `async` option when
>>>>>>> discussing FLIP-234 [1]. We should use hints to let the planner decide.
>>>>>>> Although the discussion has not yet produced a conclusion, can we
>>>>> remove
>>>>>>> this option from this FLIP? It doesn't seem to be related to this FLIP,
>>>>>> but
>>>>>>> more to FLIP-234, and we can form a conclusion over there.
>>>>>>> 
>>>>>>> [1] https://lists.apache.org/thread/9k1sl2519kh2n3yttwqc00p07xdfns3h
>>>>>>> 
>>>>>>> Best,
>>>>>>> Jingsong
>>>>>>> 
>>>>>>> On Wed, Jun 1, 2022 at 4:59 AM Jing Ge <ji...@ververica.com> wrote:
>>>>>>> 
>>>>>>>> Hi Jark,
>>>>>>>> 
>>>>>>>> Thanks for clarifying it. It would be fine, as long as we could provide
>>>>>>>> the no-cache solution. I was just wondering if the client-side cache
>>>>>>>> could really help when HBase is used, since the data to look up should
>>>>>>>> be huge. Depending on how much data will be cached on the client side,
>>>>>>>> the data that should be LRU in e.g. LruBlockCache will not be LRU
>>>>>>>> anymore. In the worst-case scenario, once the cached data at the client
>>>>>>>> side is expired, the request will hit disk, which will cause extra
>>>>>>>> latency temporarily, if I am not mistaken.
>>>>>>>> 
>>>>>>>> Best regards,
>>>>>>>> Jing
>>>>>>>> 
>>>>>>>> On Mon, May 30, 2022 at 9:59 AM Jark Wu <im...@gmail.com> wrote:
>>>>>>>> 
>>>>>>>>> Hi Jing Ge,
>>>>>>>>> 
>>>>>>>>> What do you mean about the "impact on the block cache used by HBase"?
>>>>>>>>> In my understanding, the connector cache and HBase cache are totally
>>>>>> two
>>>>>>>>> things.
>>>>>>>>> The connector cache is a local/client cache, and the HBase cache is a
>>>>>>>>> server cache.
>>>>>>>>> 
>>>>>>>>>> does it make sense to have a no-cache solution as one of the
>>>>>>>>> default solutions so that customers will have no effort for the
>>>>>> migration
>>>>>>>>> if they want to stick with Hbase cache
>>>>>>>>> 
>>>>>>>>> The implementation migration should be transparent to users. Take the
>>>>>>>>> HBase connector as an example: it already supports a lookup cache, but
>>>>>>>>> it is disabled by default. After the migration, the connector still
>>>>>>>>> disables the cache by default (i.e. the no-cache solution). No
>>>>>>>>> migration effort for users.
>>>>>>>>>
>>>>>>>>> HBase cache and connector cache are two different things; the HBase
>>>>>>>>> cache can't simply replace the connector cache. One of the most
>>>>>>>>> important usages of the connector cache is reducing the I/O
>>>>>>>>> requests/responses and improving the throughput, which cannot be
>>>>>>>>> achieved by just using a server cache.
>>>>>>>>> 
>>>>>>>>> Best,
>>>>>>>>> Jark
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On Fri, 27 May 2022 at 22:42, Jing Ge <ji...@ververica.com> wrote:
>>>>>>>>> 
>>>>>>>>>> Thanks all for the valuable discussion. The new feature looks very
>>>>>>>>>> interesting.
>>>>>>>>>> 
>>>>>>>>>> According to the FLIP description: "*Currently we have JDBC, Hive
>>>>> and
>>>>>>>>> HBase
>>>>>>>>>> connector implemented lookup table source. All existing
>>>>>> implementations
>>>>>>>>>> will be migrated to the current design and the migration will be
>>>>>>>>>> transparent to end users*." I was only wondering if we should pay
>>>>>>>>>> attention to HBase and similar DBs. Since, commonly, the lookup data
>>>>>>>>>> will be huge while using HBase, partial caching will be used in this
>>>>>>>>>> case, if I am not mistaken, which might have an impact on the block
>>>>>>>>>> cache used by HBase, e.g. LruBlockCache.
>>>>>>>>>> Another question: since HBase provides a sophisticated cache solution,
>>>>>>>>>> does it make sense to have a no-cache solution as one of the default
>>>>>>>>>> solutions, so that customers will have no effort for the migration if
>>>>>>>>>> they want to stick with the HBase cache?
>>>>>>>>>> 
>>>>>>>>>> Best regards,
>>>>>>>>>> Jing
>>>>>>>>>> 
>>>>>>>>>> On Fri, May 27, 2022 at 11:19 AM Jingsong Li <
>>>>> jingsonglee0@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>> 
>>>>>>>>>>> Hi all,
>>>>>>>>>>> 
>>>>>>>>>>> I think the problems now are as below:
>>>>>>>>>>> 1. The AllCache and PartialCache interfaces are non-uniform: one
>>>>>>>>>>> needs to provide a LookupProvider, the other a CacheBuilder.
>>>>>>>>>>> 2. The AllCache definition is not flexible. For example, PartialCache
>>>>>>>>>>> can use any custom storage while AllCache cannot; AllCache could also
>>>>>>>>>>> store to memory or disk, so it also needs a flexible strategy.
>>>>>>>>>>> 3. AllCache cannot customize the ReloadStrategy; currently there is
>>>>>>>>>>> only ScheduledReloadStrategy.
>>>>>>>>>>> 
>>>>>>>>>>> In order to solve the above problems, the following are my ideas.
>>>>>>>>>>> 
>>>>>>>>>>> ## Top level cache interfaces:
>>>>>>>>>>> 
>>>>>>>>>>> ```
>>>>>>>>>>> 
>>>>>>>>>>> public interface CacheLookupProvider extends
>>>>>>>>>>> LookupTableSource.LookupRuntimeProvider {
>>>>>>>>>>> 
>>>>>>>>>>>  CacheBuilder createCacheBuilder();
>>>>>>>>>>> }
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> public interface CacheBuilder {
>>>>>>>>>>>  Cache create();
>>>>>>>>>>> }
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> public interface Cache {
>>>>>>>>>>> 
>>>>>>>>>>>  /**
>>>>>>>>>>>   * Returns the value associated with key in this cache, or null
>>>>>>>> if
>>>>>>>>>>> there is no cached value for
>>>>>>>>>>>   * key.
>>>>>>>>>>>   */
>>>>>>>>>>>  @Nullable
>>>>>>>>>>>  Collection<RowData> getIfPresent(RowData key);
>>>>>>>>>>> 
>>>>>>>>>>>  /** Returns the number of key-value mappings in the cache. */
>>>>>>>>>>>  long size();
>>>>>>>>>>> }
>>>>>>>>>>> 
>>>>>>>>>>> ```
>>>>>>>>>>> 
>>>>>>>>>>> ## Partial cache
>>>>>>>>>>> 
>>>>>>>>>>> ```
>>>>>>>>>>> 
>>>>>>>>>>> public interface PartialCacheLookupFunction extends
>>>>>>>>> CacheLookupProvider {
>>>>>>>>>>> 
>>>>>>>>>>>  @Override
>>>>>>>>>>>  PartialCacheBuilder createCacheBuilder();
>>>>>>>>>>> 
>>>>>>>>>>> /** Creates an {@link LookupFunction} instance. */
>>>>>>>>>>> LookupFunction createLookupFunction();
>>>>>>>>>>> }
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> public interface PartialCacheBuilder extends CacheBuilder {
>>>>>>>>>>> 
>>>>>>>>>>>  PartialCache create();
>>>>>>>>>>> }
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> public interface PartialCache extends Cache {
>>>>>>>>>>> 
>>>>>>>>>>>  /**
>>>>>>>>>>>   * Associates the specified value rows with the specified key
>>>>> row
>>>>>>>>>>> in the cache. If the cache
>>>>>>>>>>>   * previously contained value associated with the key, the old
>>>>>>>>>>> value is replaced by the
>>>>>>>>>>>   * specified value.
>>>>>>>>>>>   *
>>>>>>>>>>>   * @return the previous value rows associated with key, or null
>>>>>>>> if
>>>>>>>>>>> there was no mapping for key.
>>>>>>>>>>>   * @param key - key row with which the specified value is to be
>>>>>>>>>>> associated
>>>>>>>>>>>   * @param value – value rows to be associated with the specified
>>>>>>>>> key
>>>>>>>>>>>   */
>>>>>>>>>>>  Collection<RowData> put(RowData key, Collection<RowData> value);
>>>>>>>>>>> 
>>>>>>>>>>>  /** Discards any cached value for the specified key. */
>>>>>>>>>>>  void invalidate(RowData key);
>>>>>>>>>>> }
>>>>>>>>>>> 
>>>>>>>>>>> ```
>>>>>>>>>>> 
>>>>>>>>>>> ## All cache
>>>>>>>>>>> ```
>>>>>>>>>>> 
>>>>>>>>>>> public interface AllCacheLookupProvider extends
>>>>> CacheLookupProvider {
>>>>>>>>>>> 
>>>>>>>>>>>  void registerReloadStrategy(ScheduledExecutorService
>>>>>>>>>>> executorService, Reloader reloader);
>>>>>>>>>>> 
>>>>>>>>>>>  ScanTableSource.ScanRuntimeProvider getScanRuntimeProvider();
>>>>>>>>>>> 
>>>>>>>>>>>  @Override
>>>>>>>>>>>  AllCacheBuilder createCacheBuilder();
>>>>>>>>>>> }
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> public interface AllCacheBuilder extends CacheBuilder {
>>>>>>>>>>> 
>>>>>>>>>>>  AllCache create();
>>>>>>>>>>> }
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> public interface AllCache extends Cache {
>>>>>>>>>>> 
>>>>>>>>>>>  void putAll(Iterator<Map<RowData, RowData>> allEntries);
>>>>>>>>>>> 
>>>>>>>>>>>  void clearAll();
>>>>>>>>>>> }
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> public interface Reloader {
>>>>>>>>>>> 
>>>>>>>>>>>  void reload();
>>>>>>>>>>> }
>>>>>>>>>>> 
>>>>>>>>>>> ```
>>>>>>>>>>> 
>>>>>>>>>>> Best,
>>>>>>>>>>> Jingsong
>>>>>>>>>>> 
>>>>>>>>>>> On Fri, May 27, 2022 at 11:10 AM Jingsong Li <
>>>>> jingsonglee0@gmail.com
>>>>>>>>> 
>>>>>>>>>>> wrote:
>>>>>>>>>>> 
>>>>>>>>>>>> Thanks Qingsheng and all for your discussion.
>>>>>>>>>>>> 
>>>>>>>>>>>> Very sorry to jump in so late.
>>>>>>>>>>>> 
>>>>>>>>>>>> Maybe I missed something?
>>>>>>>>>>>> My first impression when I saw the cache interface was, why don't
>>>>>>>> we
>>>>>>>>>>>> provide an interface similar to guava cache [1], on top of guava
>>>>>>>>> cache,
>>>>>>>>>>>> caffeine also makes extensions for asynchronous calls.[2]
>>>>>>>>>>>> There is also the bulk load in caffeine too.
>>>>>>>>>>>> 
>>>>>>>>>>>> I am also more confused why first from LookupCacheFactory.Builder
>>>>>>>> and
>>>>>>>>>>> then
>>>>>>>>>>>> to Factory to create Cache.
>>>>>>>>>>>> 
>>>>>>>>>>>> [1] https://github.com/google/guava
>>>>>>>>>>>> [2] https://github.com/ben-manes/caffeine/wiki/Population
>>>>>>>>>>>> 
>>>>>>>>>>>> Best,
>>>>>>>>>>>> Jingsong
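The Guava/Caffeine-style cache surface Jingsong refers to boils down to a get-or-load API with size-based eviction. A dependency-free sketch using only java.util, illustrative only and not the Guava API itself:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Sketch: a Guava/Caffeine-like loading cache with LRU eviction,
// built on LinkedHashMap's access-order mode.
public class TinyLoadingCache<K, V> {
    private final Map<K, V> map;
    private final Function<K, V> loader;

    public TinyLoadingCache(int maximumSize, Function<K, V> loader) {
        this.loader = loader;
        // access-order LinkedHashMap evicts the least recently used entry
        this.map = new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                return size() > maximumSize;
            }
        };
    }

    /** get-or-load, analogous to a loading cache's get(key). */
    public V get(K key) {
        return map.computeIfAbsent(key, loader);
    }

    public int size() {
        return map.size();
    }
}
```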
>>>>>>>>>>>> 
>>>>>>>>>>>> On Thu, May 26, 2022 at 11:17 PM Jark Wu <im...@gmail.com>
>>>>> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>>> After looking at the new introduced ReloadTime and Becket's
>>>>>>>> comment,
>>>>>>>>>>>>> I agree with Becket we should have a pluggable reloading
>>>>> strategy.
>>>>>>>>>>>>> We can provide some common implementations, e.g., periodic
>>>>>>>>> reloading,
>>>>>>>>>>> and
>>>>>>>>>>>>> daily reloading.
>>>>>>>>>>>>> But there definitely be some connector- or business-specific
>>>>>>>>> reloading
>>>>>>>>>>>>> strategies, e.g.
>>>>>>>>>>>>> notify by a zookeeper watcher, reload once a new Hive partition
>>>>> is
>>>>>>>>>>>>> complete.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Best,
>>>>>>>>>>>>> Jark
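The pluggable reloading strategies mentioned above (a periodic timer versus an external notification such as a completed Hive partition or a ZooKeeper watch) can be sketched against one trigger interface. All names below are hypothetical and not part of the FLIP; the sketch only shows why the same reload action can be driven by very different schedules.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Sketch: the trigger owns the "when", the framework owns the "what".
public interface ReloadTrigger {
    /** Register the reload action; the trigger decides when to run it. */
    void open(Runnable reloadAction);
}

// Periodic reloading, the common default.
class PeriodicTrigger implements ReloadTrigger {
    private final long periodMillis;
    private final ScheduledExecutorService executor =
            Executors.newSingleThreadScheduledExecutor();

    PeriodicTrigger(long periodMillis) { this.periodMillis = periodMillis; }

    public void open(Runnable reloadAction) {
        executor.scheduleWithFixedDelay(
                reloadAction, periodMillis, periodMillis, TimeUnit.MILLISECONDS);
    }
}

// Event-driven reloading, e.g. fired when a new Hive partition is complete
// or a ZooKeeper watcher notifies (the notification source is external).
class EventDrivenTrigger implements ReloadTrigger {
    private Runnable reloadAction;

    public void open(Runnable reloadAction) { this.reloadAction = reloadAction; }

    /** Called by the external notifier when fresh data is ready. */
    public void onDataReady() { reloadAction.run(); }
}
```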
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On Thu, 26 May 2022 at 11:52, Becket Qin <be...@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Hi Qingsheng,
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Thanks for updating the FLIP. A few comments / questions below:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 1. Is there a reason that we have both "XXXFactory" and
>>>>>>>>>> "XXXProvider".
>>>>>>>>>>>>>> What is the difference between them? If they are the same, can
>>>>>>>> we
>>>>>>>>>> just
>>>>>>>>>>>>> use
>>>>>>>>>>>>>> XXXFactory everywhere?
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 2. Regarding the FullCachingLookupProvider, should the reloading
>>>>>>>>>>> policy
>>>>>>>>>>>>>> also be pluggable? Periodical reloading could be sometimes be
>>>>>>>>> tricky
>>>>>>>>>>> in
>>>>>>>>>>>>>> practice. For example, if user uses 24 hours as the cache
>>>>>>>> refresh
>>>>>>>>>>>>> interval
>>>>>>>>>>>>>> and some nightly batch job delayed, the cache update may still
>>>>>>>> see
>>>>>>>>>> the
>>>>>>>>>>>>>> stale data.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 3. In DefaultLookupCacheFactory, it looks like InitialCapacity
>>>>>>>>>> should
>>>>>>>>>>> be
>>>>>>>>>>>>>> removed.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 4. The purpose of LookupFunctionProvider#cacheMissingKey()
>>>>>>>> seems a
>>>>>>>>>>>>> little
>>>>>>>>>>>>>> confusing to me. If Optional<LookupCacheFactory>
>>>>>>>> getCacheFactory()
>>>>>>>>>>>>> returns
>>>>>>>>>>>>>> a non-empty factory, doesn't that already indicates the
>>>>>>>> framework
>>>>>>>>> to
>>>>>>>>>>>>> cache
>>>>>>>>>>>>>> the missing keys? Also, why is this method returning an
>>>>>>>>>>>>> Optional<Boolean>
>>>>>>>>>>>>>> instead of boolean?
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Jiangjie (Becket) Qin
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On Wed, May 25, 2022 at 5:07 PM Qingsheng Ren <
>>>>>>>> renqschn@gmail.com
>>>>>>>>>> 
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Hi Lincoln and Jark,
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Thanks for the comments! If the community reaches a consensus
>>>>>>>>> that
>>>>>>>>>> we
>>>>>>>>>>>>> use
>>>>>>>>>>>>>>> SQL hint instead of table options to decide whether to use sync
>>>>>>>>> or
>>>>>>>>>>>>> async
>>>>>>>>>>>>>>> mode, it’s indeed not necessary to introduce the “lookup.async”
>>>>>>>>>>> option.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> I think it’s a good idea to let the decision of async made on
>>>>>>>>> query
>>>>>>>>>>>>>>> level, which could make better optimization with more
>>>>>>>> infomation
>>>>>>>>>>>>> gathered
>>>>>>>>>>>>>>> by planner. Is there any FLIP describing the issue in
>>>>>>>>> FLINK-27625?
>>>>>>>>>> I
>>>>>>>>>>>>>>> thought FLIP-234 is only proposing adding SQL hint for retry on
>>>>>>>>>>> missing
>>>>>>>>>>>>>>> instead of the entire async mode to be controlled by hint.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Qingsheng
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> On May 25, 2022, at 15:13, Lincoln Lee <
>>>>>>>> lincoln.86xy@gmail.com
>>>>>>>>>> 
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Hi Jark,
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Thanks for your reply!
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Currently 'lookup.async' just lies in HBase connector, I have
>>>>>>>>> no
>>>>>>>>>>> idea
>>>>>>>>>>>>>>>> whether or when to remove it (we can discuss it in another
>>>>>>>>> issue
>>>>>>>>>>> for
>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>> HBase connector after FLINK-27625 is done), just not add it
>>>>>>>>> into
>>>>>>>>>> a
>>>>>>>>>>>>>>> common
>>>>>>>>>>>>>>>> option now.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>> Lincoln Lee
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Jark Wu <im...@gmail.com> 于2022年5月24日周二 20:14写道:
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Hi Lincoln,
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> I have taken a look at FLIP-234, and I agree with you that the
>>>>>>>>>>>>>>>>> connectors can provide both async and sync runtime providers
>>>>>>>>>>>>>>>>> simultaneously, instead of one of them.
>>>>>>>>>>>>>>>>> At that point, "lookup.async" looks redundant. If this
>>>>>>>> option
>>>>>>>>> is
>>>>>>>>>>>>>>> planned to
>>>>>>>>>>>>>>>>> be removed
>>>>>>>>>>>>>>>>> in the long term, I think it makes sense not to introduce it
>>>>>>>>> in
>>>>>>>>>>> this
>>>>>>>>>>>>>>> FLIP.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>>> Jark
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> On Tue, 24 May 2022 at 11:08, Lincoln Lee <
>>>>>>>>>> lincoln.86xy@gmail.com
>>>>>>>>>>>> 
>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Hi Qingsheng,
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Sorry for jumping into the discussion so late. It's a good
>>>>>>>>> idea
>>>>>>>>>>>>> that
>>>>>>>>>>>>>>> we
>>>>>>>>>>>>>>>>> can
>>>>>>>>>>>>>>>>>> have a common table option. I have a minor comments on
>>>>>>>>>>>>> 'lookup.async'
>>>>>>>>>>>>>>>>> that
>>>>>>>>>>>>>>>>>> not make it a common option:
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> The table layer abstracts both sync and async lookup
>>>>>>>>>>> capabilities,
>>>>>>>>>>>>>>>>>> connector implementers can choose one or both, in the case
>>>>>>>>> of
>>>>>>>>>>>>>>>>> implementing
>>>>>>>>>>>>>>>>>> only one capability(status of the most of existing builtin
>>>>>>>>>>>>> connectors)
>>>>>>>>>>>>>>>>>> 'lookup.async' will not be used.  And when a connector has
>>>>>>>>> both
>>>>>>>>>>>>>>>>>> capabilities, I think this choice is more suitable for
>>>>>>>> making
>>>>>>>>>>>>>>> decisions
>>>>>>>>>>>>>>>>> at
>>>>>>>>>>>>>>>>>> the query level, for example, table planner can choose the
>>>>>>>>>>> physical
>>>>>>>>>>>>>>>>>> implementation of async lookup or sync lookup based on its
>>>>>>>>> cost
>>>>>>>>>>>>>>> model, or
>>>>>>>>>>>>>>>>>> users can give query hint based on their own better
>>>>>>>>>>>>> understanding.  If
>>>>>>>>>>>>>>>>>> there is another common table option 'lookup.async', it may
>>>>>>>>>>> confuse
>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>> users in the long run.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> So, I prefer to leave the 'lookup.async' option in private
>>>>>>>>>> place
>>>>>>>>>>>>> (for
>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>> current hbase connector) and not turn it into a common
>>>>>>>>> option.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> WDYT?
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>>>> Lincoln Lee
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Qingsheng Ren <re...@gmail.com> 于2022年5月23日周一 14:54写道:
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Hi Alexander,
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Thanks for the review! We recently updated the FLIP and
>>>>>>>> you
>>>>>>>>>> can
>>>>>>>>>>>>> find
>>>>>>>>>>>>>>>>>> those
>>>>>>>>>>>>>>>>>>> changes from my latest email. Since some terminology has
>>>>>>>>>>>>> changed,
>>>>>>>>>>>>>>>>>> I’ll
>>>>>>>>>>>>>>>>>>> use the new concepts when replying to your comments.
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> 1. Builder vs ‘of’
>>>>>>>>>>>>>>>>>>> I’m OK to use builder pattern if we have additional
>>>>>>>> optional
>>>>>>>>>>>>>>> parameters
>>>>>>>>>>>>>>>>>>> for full caching mode (“rescan” previously). The
>>>>>>>>>>>>> schedule-with-delay
>>>>>>>>>>>>>>>>> idea
>>>>>>>>>>>>>>>>>>> looks reasonable to me, but I think we need to redesign
>>>>>>>> the
>>>>>>>>>>>>> builder
>>>>>>>>>>>>>>> API
>>>>>>>>>>>>>>>>>> of
>>>>>>>>>>>>>>>>>>> full caching to make it more descriptive for developers.
>>>>>>>>> Would
>>>>>>>>>>> you
>>>>>>>>>>>>>>> mind
>>>>>>>>>>>>>>>>>>> sharing your ideas about the API? For accessing the FLIP
>>>>>>>>>>> workspace
>>>>>>>>>>>>>>> you
>>>>>>>>>>>>>>>>>> can
>>>>>>>>>>>>>>>>>>> just provide your account ID and ping any PMC member
>>>>>>>>> including
>>>>>>>>>>>>> Jark.
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> 2. Common table options
>>>>>>>>>>>>>>>>>>> We have some discussions these days and propose to
>>>>>>>>> introduce 8
>>>>>>>>>>>>> common
>>>>>>>>>>>>>>>>>>> table options about caching. It has been updated on the
>>>>>>>>> FLIP.
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> 3. Retries
>>>>>>>>>>>>>>>>>>> I think we are on the same page :-)
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> For your additional concerns:
>>>>>>>>>>>>>>>>>>> 1) The table option has been updated.
>>>>>>>>>>>>>>>>>>> 2) We got “lookup.cache” back for configuring whether to
>>>>>>>> use
>>>>>>>>>>>>> partial
>>>>>>>>>>>>>>> or
>>>>>>>>>>>>>>>>>>> full caching mode.
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Qingsheng
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> On May 19, 2022, at 17:25, Александр Смирнов <
>>>>>>>>>>>>> smiralexan@gmail.com>
>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> Also I have a few additions:
>>>>>>>>>>>>>>>>>>>> 1) maybe rename 'lookup.cache.maximum-size' to
>>>>>>>>>>>>>>>>>>>> 'lookup.cache.max-rows'? I think it will be more clear
>>>>>>>> that
>>>>>>>>>> we
>>>>>>>>>>>>> talk
>>>>>>>>>>>>>>>>>>>> not about bytes, but about the number of rows. Plus it
>>>>>>>> fits
>>>>>>>>>>> more,
>>>>>>>>>>>>>>>>>>>> considering my optimization with filters.
>>>>>>>>>>>>>>>>>>>> 2) How will users enable rescanning? Are we going to
>>>>>>>>> separate
>>>>>>>>>>>>>>> caching
>>>>>>>>>>>>>>>>>>>> and rescanning from the options point of view? Like
>>>>>>>>> initially
>>>>>>>>>>> we
>>>>>>>>>>>>> had
>>>>>>>>>>>>>>>>>>>> one option 'lookup.cache' with values LRU / ALL. I think
>>>>>>>>> now
>>>>>>>>>> we
>>>>>>>>>>>>> can
>>>>>>>>>>>>>>>>>>>> make a boolean option 'lookup.rescan'. RescanInterval can
>>>>>>>>> be
>>>>>>>>>>>>>>>>>>>> 'lookup.rescan.interval', etc.
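[Editorial note: for illustration only, the option naming proposed in this email could be declared with Flink's `ConfigOptions` builder roughly as below. The keys, types, and defaults are the ones suggested in the message above, not a finalized FLIP-221 list.]

```java
import java.time.Duration;

import org.apache.flink.configuration.ConfigOption;
import org.apache.flink.configuration.ConfigOptions;

// Sketch of the suggested common option definitions; names are the ones
// proposed in this email thread, not a finalized set.
public class LookupOptionsSketch {

    public static final ConfigOption<Long> CACHE_MAX_ROWS =
            ConfigOptions.key("lookup.cache.max-rows")
                    .longType()
                    .noDefaultValue()
                    .withDescription("Maximum number of rows to keep in the lookup cache.");

    public static final ConfigOption<Boolean> RESCAN =
            ConfigOptions.key("lookup.rescan")
                    .booleanType()
                    .defaultValue(false)
                    .withDescription("Whether to periodically re-scan the whole dimension table.");

    public static final ConfigOption<Duration> RESCAN_INTERVAL =
            ConfigOptions.key("lookup.rescan.interval")
                    .durationType()
                    .noDefaultValue()
                    .withDescription("Interval between two full re-scans.");
}
```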
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>>>>>>>> Alexander
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> чт, 19 мая 2022 г. в 14:50, Александр Смирнов <
>>>>>>>>>>>>> smiralexan@gmail.com
>>>>>>>>>>>>>>>>>> :
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> Hi Qingsheng and Jark,
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> 1. Builders vs 'of'
>>>>>>>>>>>>>>>>>>>>> I understand that builders are used when we have
>>>>>>>> multiple
>>>>>>>>>>>>>>>>> parameters.
>>>>>>>>>>>>>>>>>>>>> I suggested them because we could add parameters later.
>>>>>>>> To
>>>>>>>>>>>>> prevent
>>>>>>>>>>>>>>>>>>>>> Builder for ScanRuntimeProvider from looking redundant I
>>>>>>>>> can
>>>>>>>>>>>>>>> suggest
>>>>>>>>>>>>>>>>>>>>> one more config now - "rescanStartTime".
>>>>>>>>>>>>>>>>>>>>> It's a time in UTC (LocalTime class) when the first
>>>>>>>> reload
>>>>>>>>>> of
>>>>>>>>>>>>> cache
>>>>>>>>>>>>>>>>>>>>> starts. This parameter can be thought of as
>>>>>>>> 'initialDelay'
>>>>>>>>>>> (diff
>>>>>>>>>>>>>>>>>>>>> between current time and rescanStartTime) in method
>>>>>>>>>>>>>>>>>>>>> ScheduleExecutorService#scheduleWithFixedDelay [1] . It
>>>>>>>>> can
>>>>>>>>>> be
>>>>>>>>>>>>> very
>>>>>>>>>>>>>>>>>>>>> useful when the dimension table is updated by some other
>>>>>>>>>>>>> scheduled
>>>>>>>>>>>>>>>>> job
>>>>>>>>>>>>>>>>>>>>> at a certain time. Or when the user simply wants a
>>>>>>>> second
>>>>>>>>>> scan
>>>>>>>>>>>>>>>>> (first
>>>>>>>>>>>>>>>>>>>>> cache reload) be delayed. This option can be used even
>>>>>>>>>> without
>>>>>>>>>>>>>>>>>>>>> 'rescanInterval' - in this case 'rescanInterval' will be
>>>>>>>>> one
>>>>>>>>>>>>> day.
>>>>>>>>>>>>>>>>>>>>> If you are fine with this option, I would be very glad
>>>>>>>> if
>>>>>>>>>> you
>>>>>>>>>>>>> would
>>>>>>>>>>>>>>>>>>>>> give me access to edit FLIP page, so I could add it
>>>>>>>> myself
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> 2. Common table options
>>>>>>>>>>>>>>>>>>>>> I also think that FactoryUtil would be overloaded by all
>>>>>>>>>> cache
>>>>>>>>>>>>>>>>>>>>> options. But maybe unify all suggested options, not only
>>>>>>>>> for
>>>>>>>>>>>>>>> default
>>>>>>>>>>>>>>>>>>>>> cache? I.e. class 'LookupOptions', that unifies default
>>>>>>>>>> cache
>>>>>>>>>>>>>>>>> options,
>>>>>>>>>>>>>>>>>>>>> rescan options, 'async', 'maxRetries'. WDYT?
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> 3. Retries
>>>>>>>>>>>>>>>>>>>>> I'm fine with suggestion close to
>>>>>>>>> RetryUtils#tryTimes(times,
>>>>>>>>>>>>> call)
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> [1]
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>> 
>>>>> https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/ScheduledExecutorService.html#scheduleWithFixedDelay-java.lang.Runnable-long-long-java.util.concurrent.TimeUnit-
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>>>>>>>>> Alexander
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> ср, 18 мая 2022 г. в 16:04, Qingsheng Ren <
>>>>>>>>>> renqschn@gmail.com
>>>>>>>>>>>> :
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> Hi Jark and Alexander,
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> Thanks for your comments! I’m also OK to introduce
>>>>>>>> common
>>>>>>>>>>> table
>>>>>>>>>>>>>>>>>>> options. I prefer to introduce a new
>>>>>>>>> DefaultLookupCacheOptions
>>>>>>>>>>>>> class
>>>>>>>>>>>>>>>>> for
>>>>>>>>>>>>>>>>>>> holding these option definitions because putting all
>>>>>>>> options
>>>>>>>>>>> into
>>>>>>>>>>>>>>>>>>> FactoryUtil would make it a bit ”crowded” and not well
>>>>>>>>>>>>> categorized.
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> FLIP has been updated according to suggestions above:
>>>>>>>>>>>>>>>>>>>>>> 1. Use static “of” method for constructing
>>>>>>>>>>>>> RescanRuntimeProvider
>>>>>>>>>>>>>>>>>>> considering both arguments are required.
>>>>>>>>>>>>>>>>>>>>>> 2. Introduce new table options matching
>>>>>>>>>>>>> DefaultLookupCacheFactory
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>>>>>>>> Qingsheng
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> On Wed, May 18, 2022 at 2:57 PM Jark Wu <
>>>>>>>>> imjark@gmail.com>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> Hi Alex,
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> 1) retry logic
>>>>>>>>>>>>>>>>>>>>>>> I think we can extract some common retry logic into
>>>>>>>>>>> utilities,
>>>>>>>>>>>>>>>>> e.g.
>>>>>>>>>>>>>>>>>>> RetryUtils#tryTimes(times, call).
>>>>>>>>>>>>>>>>>>>>>>> This seems independent of this FLIP and can be reused
>>>>>>>> by
>>>>>>>>>>>>>>>>> DataStream
>>>>>>>>>>>>>>>>>>> users.
>>>>>>>>>>>>>>>>>>>>>>> Maybe we can open an issue to discuss this and where
>>>>>>>> to
>>>>>>>>>> put
>>>>>>>>>>>>> it.
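[Editorial note: a minimal sketch of the `RetryUtils#tryTimes(times, call)` helper discussed here. The real signature and error handling may end up different; this only illustrates the shape of the utility.]

```java
import java.util.concurrent.Callable;

public class RetryUtils {
    // Retries the call up to `times` attempts and rethrows the last failure,
    // wrapped in a RuntimeException, if all attempts fail.
    public static <T> T tryTimes(int times, Callable<T> call) {
        Exception last = null;
        for (int attempt = 0; attempt < times; attempt++) {
            try {
                return call.call();
            } catch (Exception e) {
                last = e;
            }
        }
        throw new RuntimeException("All " + times + " attempts failed", last);
    }
}
```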
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> 2) cache ConfigOptions
>>>>>>>>>>>>>>>>>>>>>>> I'm fine with defining cache config options in the
>>>>>>>>>>> framework.
>>>>>>>>>>>>>>>>>>>>>>> A candidate place to put is FactoryUtil which also
>>>>>>>>>> includes
>>>>>>>>>>>>>>>>>>> "sink.parallelism", "format" options.
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>>>>>>>>> Jark
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> On Wed, 18 May 2022 at 13:52, Александр Смирнов <
>>>>>>>>>>>>>>>>>> smiralexan@gmail.com>
>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> Hi Qingsheng,
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> Thank you for considering my comments.
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> there might be custom logic before making retry,
>>>>>>>> such
>>>>>>>>> as
>>>>>>>>>>>>>>>>>>> re-establish the connection
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> Yes, I understand that. I meant that such logic can
>>>>>>>> be
>>>>>>>>>>>>> placed in
>>>>>>>>>>>>>>>>> a
>>>>>>>>>>>>>>>>>>>>>>>> separate function, that can be implemented by
>>>>>>>>> connectors.
>>>>>>>>>>>>> Just
>>>>>>>>>>>>>>>>>> moving
>>>>>>>>>>>>>>>>>>>>>>>> the retry logic would make connector's LookupFunction
>>>>>>>>>> more
>>>>>>>>>>>>>>>>> concise
>>>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>>>>>>>>>> avoid duplicate code. However, it's a minor change.
>>>>>>>> The
>>>>>>>>>>>>> decision
>>>>>>>>>>>>>>>>> is
>>>>>>>>>>>>>>>>>>> up
>>>>>>>>>>>>>>>>>>>>>>>> to you.
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> We decided not to provide common DDL options and let
>>>>>>>>>>>>> developers
>>>>>>>>>>>>>>>>>>> define their own options as we do now per connector.
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> What is the reason for that? One of the main goals of
>>>>>>>>>> this
>>>>>>>>>>>>> FLIP
>>>>>>>>>>>>>>>>> was
>>>>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>>>>>>> unify the configs, wasn't it? I understand that
>>>>>>>> current
>>>>>>>>>>> cache
>>>>>>>>>>>>>>>>>> design
>>>>>>>>>>>>>>>>>>>>>>>> doesn't depend on ConfigOptions, as it was before. But
>>>>>>>>>> still
>>>>>>>>>>>>> we
>>>>>>>>>>>>>>>>> can
>>>>>>>>>>>>>>>>>>> put
>>>>>>>>>>>>>>>>>>>>>>>> these options into the framework, so connectors can
>>>>>>>>> reuse
>>>>>>>>>>>>> them
>>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>>>>>>>> avoid code duplication, and, what is more
>>>>>>>> significant,
>>>>>>>>>>> avoid
>>>>>>>>>>>>>>>>>> possible
>>>>>>>>>>>>>>>>>>>>>>>> different options naming. This moment can be pointed
>>>>>>>>> out
>>>>>>>>>> in
>>>>>>>>>>>>>>>>>>>>>>>> documentation for connector developers.
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>>>>>>>>>>>> Alexander
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> вт, 17 мая 2022 г. в 17:11, Qingsheng Ren <
>>>>>>>>>>>>> renqschn@gmail.com>:
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> Hi Alexander,
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for the review and glad to see we are on the
>>>>>>>>> same
>>>>>>>>>>>>> page!
>>>>>>>>>>>>>>> I
>>>>>>>>>>>>>>>>>>> think you forgot to cc the dev mailing list so I’m also
>>>>>>>>>> quoting
>>>>>>>>>>>>> your
>>>>>>>>>>>>>>>>>> reply
>>>>>>>>>>>>>>>>>>> under this email.
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> We can add 'maxRetryTimes' option into this class
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> In my opinion the retry logic should be implemented
>>>>>>>> in
>>>>>>>>>>>>> lookup()
>>>>>>>>>>>>>>>>>>> instead of in LookupFunction#eval(). Retrying is only
>>>>>>>>>> meaningful
>>>>>>>>>>>>>>> under
>>>>>>>>>>>>>>>>>> some
>>>>>>>>>>>>>>>>>>> specific retriable failures, and there might be custom
>>>>>>>> logic
>>>>>>>>>>>>> before
>>>>>>>>>>>>>>>>>> making
>>>>>>>>>>>>>>>>>>> retry, such as re-establish the connection
>>>>>>>>>>>>> (JdbcRowDataLookupFunction
>>>>>>>>>>>>>>>>> is
>>>>>>>>>>>>>>>>>> an
>>>>>>>>>>>>>>>>>>> example), so it's more handy to leave it to the connector.
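[Editorial note: a self-contained sketch of the point above, with hypothetical class names rather than the actual FLIP-221 interfaces: the retry loop lives in the connector's lookup(), because only the connector knows which failures are retriable and how to recover, e.g. by re-establishing its connection as JdbcRowDataLookupFunction does.]

```java
import java.util.Collection;
import java.util.Collections;
import java.util.List;

// Framework-facing base class: eval() delegates to lookup() and carries no
// retry logic of its own.
abstract class SketchLookupFunction {
    public Collection<String> eval(String key) {
        return lookup(key);
    }

    protected abstract Collection<String> lookup(String key);
}

// Connector-side implementation: retries with connector-specific recovery.
class FlakyBackendLookup extends SketchLookupFunction {
    private final int maxRetryTimes;
    private int failuresLeft;

    FlakyBackendLookup(int maxRetryTimes, int failuresBeforeSuccess) {
        this.maxRetryTimes = maxRetryTimes;
        this.failuresLeft = failuresBeforeSuccess;
    }

    @Override
    protected Collection<String> lookup(String key) {
        for (int attempt = 0; attempt <= maxRetryTimes; attempt++) {
            try {
                return queryBackend(key);
            } catch (RuntimeException retriable) {
                reestablishConnection(); // connector-specific recovery
            }
        }
        return Collections.emptyList(); // all attempts failed
    }

    // Simulated backend that fails a fixed number of times before succeeding.
    private Collection<String> queryBackend(String key) {
        if (failuresLeft-- > 0) {
            throw new RuntimeException("transient failure");
        }
        return List.of(key + "-value");
    }

    private void reestablishConnection() {
        // a real connector would reopen its connection here
    }
}
```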
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> I don't see DDL options, that were in previous
>>>>>>>>> version
>>>>>>>>>> of
>>>>>>>>>>>>>>> FLIP.
>>>>>>>>>>>>>>>>>> Do
>>>>>>>>>>>>>>>>>>> you have any special plans for them?
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> We decided not to provide common DDL options and let
>>>>>>>>>>>>> developers
>>>>>>>>>>>>>>>>>>> define their own options as we do now per connector.
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> The rest of comments sound great and I’ll update the
>>>>>>>>>> FLIP.
>>>>>>>>>>>>> Hope
>>>>>>>>>>>>>>>>> we
>>>>>>>>>>>>>>>>>>> can finalize our proposal soon!
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> Qingsheng
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> On May 17, 2022, at 13:46, Александр Смирнов <
>>>>>>>>>>>>>>>>>> smiralexan@gmail.com>
>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> Hi Qingsheng and devs!
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> I like the overall design of updated FLIP, however
>>>>>>>> I
>>>>>>>>>> have
>>>>>>>>>>>>>>>>> several
>>>>>>>>>>>>>>>>>>>>>>>>>> suggestions and questions.
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> 1) Introducing LookupFunction as a subclass of
>>>>>>>>>>>>> TableFunction
>>>>>>>>>>>>>>>>> is a
>>>>>>>>>>>>>>>>>>> good
>>>>>>>>>>>>>>>>>>>>>>>>>> idea. We can add 'maxRetryTimes' option into this
>>>>>>>>>> class.
>>>>>>>>>>>>>>> 'eval'
>>>>>>>>>>>>>>>>>>> method
>>>>>>>>>>>>>>>>>>>>>>>>>> of new LookupFunction is great for this purpose.
>>>>>>>> The
>>>>>>>>>> same
>>>>>>>>>>>>> is
>>>>>>>>>>>>>>>>> for
>>>>>>>>>>>>>>>>>>>>>>>>>> 'async' case.
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> 2) There might be other configs in future, such as
>>>>>>>>>>>>>>>>>>> 'cacheMissingKey'
>>>>>>>>>>>>>>>>>>>>>>>>>> in LookupFunctionProvider or 'rescanInterval' in
>>>>>>>>>>>>>>>>>>> ScanRuntimeProvider.
>>>>>>>>>>>>>>>>>>>>>>>>>> Maybe use Builder pattern in LookupFunctionProvider
>>>>>>>>> and
>>>>>>>>>>>>>>>>>>>>>>>>>> RescanRuntimeProvider for more flexibility (use one
>>>>>>>>>>> 'build'
>>>>>>>>>>>>>>>>>> method
>>>>>>>>>>>>>>>>>>>>>>>>>> instead of many 'of' methods in future)?
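[Editorial note: a sketch of the builder-pattern suggestion above. A single build() call stays source-compatible when optional settings such as 'cacheMissingKey' or 'maxRetryTimes' are added later. All names are illustrative, not the final FLIP-221 API.]

```java
class SketchLookupFunctionProvider {
    final boolean cacheMissingKey;
    final int maxRetryTimes;

    private SketchLookupFunctionProvider(Builder b) {
        this.cacheMissingKey = b.cacheMissingKey;
        this.maxRetryTimes = b.maxRetryTimes;
    }

    static Builder newBuilder() {
        return new Builder();
    }

    static final class Builder {
        private boolean cacheMissingKey = true; // optional, with a default
        private int maxRetryTimes = 3;          // optional, with a default

        Builder cacheMissingKey(boolean cache) {
            this.cacheMissingKey = cache;
            return this;
        }

        Builder maxRetryTimes(int times) {
            this.maxRetryTimes = times;
            return this;
        }

        SketchLookupFunctionProvider build() {
            return new SketchLookupFunctionProvider(this);
        }
    }
}
```

Adding a new optional setting later only requires a new builder method, whereas a family of static `of(...)` overloads would keep growing with every combination of parameters.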
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> 3) What are the plans for existing
>>>>>>>>>> TableFunctionProvider
>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>>>>>>>>>> AsyncTableFunctionProvider? I think they should be
>>>>>>>>>>>>> deprecated.
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> 4) Am I right that the current design does not
>>>>>>>> assume
>>>>>>>>>>>>> usage of
>>>>>>>>>>>>>>>>>>>>>>>>>> user-provided LookupCache in re-scanning? In this
>>>>>>>>> case,
>>>>>>>>>>> it
>>>>>>>>>>>>> is
>>>>>>>>>>>>>>>>> not
>>>>>>>>>>>>>>>>>>> very
>>>>>>>>>>>>>>>>>>>>>>>>>> clear why do we need methods such as 'invalidate'
>>>>>>>> or
>>>>>>>>>>>>> 'putAll'
>>>>>>>>>>>>>>>>> in
>>>>>>>>>>>>>>>>>>>>>>>>>> LookupCache.
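[Editorial note: to make the question above concrete, here is a minimal sketch of a LookupCache-style contract. Method names follow the discussion, not the final FLIP-221 signatures: getIfPresent/put serve the per-key (LRU) path, while putAll/invalidate are mainly useful for full reloads, which is exactly what is being questioned.]

```java
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;

interface SketchLookupCache<K, V> {
    Collection<V> getIfPresent(K key);
    void put(K key, Collection<V> values);
    void putAll(Map<K, Collection<V>> entries);
    void invalidate(K key);
}

// Trivial map-backed implementation, just to make the contract concrete
// (no eviction policy, unlike a real LRU cache).
class MapBackedLookupCache<K, V> implements SketchLookupCache<K, V> {
    private final Map<K, Collection<V>> entries = new HashMap<>();

    @Override
    public Collection<V> getIfPresent(K key) {
        return entries.get(key);
    }

    @Override
    public void put(K key, Collection<V> values) {
        entries.put(key, values);
    }

    @Override
    public void putAll(Map<K, Collection<V>> newEntries) {
        entries.putAll(newEntries);
    }

    @Override
    public void invalidate(K key) {
        entries.remove(key);
    }
}
```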
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> 5) I don't see DDL options, that were in previous
>>>>>>>>>> version
>>>>>>>>>>>>> of
>>>>>>>>>>>>>>>>>> FLIP.
>>>>>>>>>>>>>>>>>>> Do
>>>>>>>>>>>>>>>>>>>>>>>>>> you have any special plans for them?
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> If you don't mind, I would be glad to be able to
>>>>>>>> make
>>>>>>>>>>> small
>>>>>>>>>>>>>>>>>>>>>>>>>> adjustments to the FLIP document too. I think it's
>>>>>>>>>> worth
>>>>>>>>>>>>>>>>>> mentioning
>>>>>>>>>>>>>>>>>>>>>>>>>> about what exactly optimizations are planning in
>>>>>>>> the
>>>>>>>>>>>>> future.
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>>>>>>>>>>>>>> Smirnov Alexander
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> пт, 13 мая 2022 г. в 20:27, Qingsheng Ren <
>>>>>>>>>>>>> renqschn@gmail.com
>>>>>>>>>>>>>>>>>> :
>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi Alexander and devs,
>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>> Thank you very much for the in-depth discussion!
>>>>>>>> As
>>>>>>>>>> Jark
>>>>>>>>>>>>>>>>>>> mentioned we were inspired by Alexander's idea and
>>>>>>>>>>>>> refactored
>>>>>>>>>>>>>>>>> our
>>>>>>>>>>>>>>>>>>> design. FLIP-221 [1] has been updated to reflect our
>>>>>>>> design
>>>>>>>>>> now
>>>>>>>>>>>>> and
>>>>>>>>>>>>>>> we
>>>>>>>>>>>>>>>>>> are
>>>>>>>>>>>>>>>>>>> happy to hear more suggestions from you!
>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>> Compared to the previous design:
>>>>>>>>>>>>>>>>>>>>>>>>>>> 1. The lookup cache serves at table runtime level
>>>>>>>>> and
>>>>>>>>>> is
>>>>>>>>>>>>>>>>>>> integrated as a component of LookupJoinRunner as discussed
>>>>>>>>>>>>>>> previously.
>>>>>>>>>>>>>>>>>>>>>>>>>>> 2. Interfaces are renamed and re-designed to
>>>>>>>> reflect
>>>>>>>>>> the
>>>>>>>>>>>>> new
>>>>>>>>>>>>>>>>>>> design.
>>>>>>>>>>>>>>>>>>>>>>>>>>> 3. We separate the all-caching case individually
>>>>>>>> and
>>>>>>>>>>>>>>>>> introduce a
>>>>>>>>>>>>>>>>>>> new RescanRuntimeProvider to reuse the ability of
>>>>>>>> scanning.
>>>>>>>>> We
>>>>>>>>>>> are
>>>>>>>>>>>>>>>>>> planning
>>>>>>>>>>>>>>>>>>> to support SourceFunction / InputFormat for now
>>>>>>>> considering
>>>>>>>>>> the
>>>>>>>>>>>>>>>>>> complexity
>>>>>>>>>>>>>>>>>>> of FLIP-27 Source API.
>>>>>>>>>>>>>>>>>>>>>>>>>>> 4. A new interface LookupFunction is introduced to
>>>>>>>>>> make
>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>> semantic of lookup more straightforward for developers.
>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>> For replying to Alexander:
>>>>>>>>>>>>>>>>>>>>>>>>>>>> However I'm a little confused whether InputFormat
>>>>>>>>> is
>>>>>>>>>>>>>>>>> deprecated
>>>>>>>>>>>>>>>>>>> or not. Am I right that it will be so in the future, but
>>>>>>>>>>> currently
>>>>>>>>>>>>>>> it's
>>>>>>>>>>>>>>>>>> not?
>>>>>>>>>>>>>>>>>>>>>>>>>>> Yes you are right. InputFormat is not deprecated
>>>>>>>> for
>>>>>>>>>>> now.
>>>>>>>>>>>>> I
>>>>>>>>>>>>>>>>>> think
>>>>>>>>>>>>>>>>>>> it will be deprecated in the future but we don't have a
>>>>>>>>> clear
>>>>>>>>>>> plan
>>>>>>>>>>>>>>> for
>>>>>>>>>>>>>>>>>> that.
>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks again for the discussion on this FLIP and
>>>>>>>>>> looking
>>>>>>>>>>>>>>>>> forward
>>>>>>>>>>>>>>>>>>> to cooperating with you after we finalize the design and
>>>>>>>>>>>>> interfaces!
>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>> [1]
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>> 
>>>>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-221+Abstraction+for+lookup+source+cache+and+metric
>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>> Qingsheng
>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>> On Fri, May 13, 2022 at 12:12 AM Александр
>>>>>>>> Смирнов <
>>>>>>>>>>>>>>>>>>> smiralexan@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi Jark, Qingsheng and Leonard!
>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Glad to see that we came to a consensus on almost
>>>>>>>>> all
>>>>>>>>>>>>>>> points!
>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> However I'm a little confused whether InputFormat
>>>>>>>>> is
>>>>>>>>>>>>>>>>> deprecated
>>>>>>>>>>>>>>>>>>> or
>>>>>>>>>>>>>>>>>>>>>>>>>>>> not. Am I right that it will be so in the future,
>>>>>>>>> but
>>>>>>>>>>>>>>>>> currently
>>>>>>>>>>>>>>>>>>> it's
>>>>>>>>>>>>>>>>>>>>>>>>>>>> not? Actually I also think that for the first
>>>>>>>>> version
>>>>>>>>>>>>> it's
>>>>>>>>>>>>>>> OK
>>>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>> use
>>>>>>>>>>>>>>>>>>>>>>>>>>>> InputFormat in ALL cache realization, because
>>>>>>>>>>> supporting
>>>>>>>>>>>>>>>>> rescan
>>>>>>>>>>>>>>>>>>>>>>>>>>>> ability seems like a very distant prospect. But
>>>>>>>> for
>>>>>>>>>>> this
>>>>>>>>>>>>>>>>>>> decision we
>>>>>>>>>>>>>>>>>>>>>>>>>>>> need a consensus among all discussion
>>>>>>>> participants.
>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> In general, I don't have something to argue with
>>>>>>>>> your
>>>>>>>>>>>>>>>>>>> statements. All
>>>>>>>>>>>>>>>>>>>>>>>>>>>> of them correspond my ideas. Looking ahead, it
>>>>>>>>> would
>>>>>>>>>> be
>>>>>>>>>>>>> nice
>>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>> work
>>>>>>>>>>>>>>>>>>>>>>>>>>>> on this FLIP cooperatively. I've already done a
>>>>>>>> lot
>>>>>>>>>> of
>>>>>>>>>>>>> work
>>>>>>>>>>>>>>>>> on
>>>>>>>>>>>>>>>>>>> lookup
>>>>>>>>>>>>>>>>>>>>>>>>>>>> join caching with realization very close to the
>>>>>>>> one
>>>>>>>>>> we
>>>>>>>>>>>>> are
>>>>>>>>>>>>>>>>>>> discussing,
>>>>>>>>>>>>>>>>>>>>>>>>>>>> and want to share the results of this work.
>>>>>>>> Anyway
>>>>>>>>>>>>> looking
>>>>>>>>>>>>>>>>>>> forward for
>>>>>>>>>>>>>>>>>>>>>>>>>>>> the FLIP update!
>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Smirnov Alexander
>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> чт, 12 мая 2022 г. в 17:38, Jark Wu <
>>>>>>>>>> imjark@gmail.com
>>>>>>>>>>>> :
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi Alex,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for summarizing your points.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> In the past week, Qingsheng, Leonard, and I have
>>>>>>>>>>>>> discussed
>>>>>>>>>>>>>>>>> it
>>>>>>>>>>>>>>>>>>> several times
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> and we have totally refactored the design.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I'm glad to say we have reached a consensus on
>>>>>>>>> many
>>>>>>>>>> of
>>>>>>>>>>>>> your
>>>>>>>>>>>>>>>>>>> points!
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Qingsheng is still working on updating the
>>>>>>>> design
>>>>>>>>>> docs
>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>>> maybe can be
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> available in the next few days.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I will share some conclusions from our
>>>>>>>>> discussions:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 1) we have refactored the design towards to
>>>>>>>> "cache
>>>>>>>>>> in
>>>>>>>>>>>>>>>>>>> framework" way.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 2) a "LookupCache" interface for users to
>>>>>>>>> customize
>>>>>>>>>>> and
>>>>>>>>>>>>> a
>>>>>>>>>>>>>>>>>>> default
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> implementation with builder for users to
>>>>>>>> easy-use.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> This can both make it possible to both have
>>>>>>>>>>> flexibility
>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>>> conciseness.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 3) Filter pushdown is important for ALL and LRU
>>>>>>>>>> lookup
>>>>>>>>>>>>>>>>> cache,
>>>>>>>>>>>>>>>>>>> esp reducing
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> IO.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Filter pushdown should be the final state and
>>>>>>>> the
>>>>>>>>>>>>> unified
>>>>>>>>>>>>>>>>> way
>>>>>>>>>>>>>>>>>>> to both
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> support pruning ALL cache and LRU cache,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> so I think we should make effort in this
>>>>>>>>> direction.
>>>>>>>>>> If
>>>>>>>>>>>>> we
>>>>>>>>>>>>>>>>> need
>>>>>>>>>>>>>>>>>>> to support
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> filter pushdown for ALL cache anyway, why not
>>>>>>>> use
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> it for LRU cache as well? Either way, as we
>>>>>>>> decide
>>>>>>>>>> to
>>>>>>>>>>>>>>>>>> implement
>>>>>>>>>>>>>>>>>>> the cache
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> in the framework, we have the chance to support
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> filter on cache anytime. This is an optimization
>>>>>>>>> and
>>>>>>>>>>> it
>>>>>>>>>>>>>>>>>> doesn't
>>>>>>>>>>>>>>>>>>> affect the
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> public API. I think we can create a JIRA issue
>>>>>>>> to
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> discuss it when the FLIP is accepted.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 4) The idea to support ALL cache is similar to
>>>>>>>>> your
>>>>>>>>>>>>>>>>> proposal.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> In the first version, we will only support
>>>>>>>>>>> InputFormat,
>>>>>>>>>>>>>>>>>>> SourceFunction for
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> cache all (invoke InputFormat in join operator).
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> For FLIP-27 source, we need to join a true
>>>>>>>> source
>>>>>>>>>>>>> operator
>>>>>>>>>>>>>>>>>>> instead of
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> calling it embedded in the join operator.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> However, this needs another FLIP to support the
>>>>>>>>>>> re-scan
>>>>>>>>>>>>>>>>>> ability
>>>>>>>>>>>>>>>>>>> for FLIP-27
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Source, and this can be a large work.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> In order to not block this issue, we can put the
>>>>>>>>>>> effort
>>>>>>>>>>>>> of
>>>>>>>>>>>>>>>>>>> FLIP-27 source
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> integration into future work and integrate
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> InputFormat&SourceFunction for now.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I think it's fine to use
>>>>>>>>> InputFormat&SourceFunction,
>>>>>>>>>>> as
>>>>>>>>>>>>>>> they
>>>>>>>>>>>>>>>>>>> are not
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> deprecated, otherwise, we have to introduce
>>>>>>>>> another
>>>>>>>>>>>>>>> function
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> similar to them which is meaningless. We need to
>>>>>>>>>> plan
>>>>>>>>>>>>>>>>> FLIP-27
>>>>>>>>>>>>>>>>>>> source
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> integration ASAP before InputFormat &
>>>>>>>>> SourceFunction
>>>>>>>>>>> are
>>>>>>>>>>>>>>>>>>> deprecated.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Jark
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Thu, 12 May 2022 at 15:46, Александр Смирнов
>>>>>>>> <
>>>>>>>>>>>>>>>>>>> smiralexan@gmail.com>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi Martijn!
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Got it. Therefore, the realization with
>>>>>>>>> InputFormat
>>>>>>>>>>> is
>>>>>>>>>>>>> not
>>>>>>>>>>>>>>>>>>> considered.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for clearing that up!
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Smirnov Alexander
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> чт, 12 мая 2022 г. в 14:23, Martijn Visser <
>>>>>>>>>>>>>>>>>>> martijn@ververica.com>:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> With regards to:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> But if there are plans to refactor all
>>>>>>>>> connectors
>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>> FLIP-27
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Yes, FLIP-27 is the target for all connectors.
>>>>>>>>> The
>>>>>>>>>>> old
>>>>>>>>>>>>>>>>>>> interfaces will be
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> deprecated and connectors will either be
>>>>>>>>>> refactored
>>>>>>>>>>> to
>>>>>>>>>>>>>>> use
>>>>>>>>>>>>>>>>>>> the new ones
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> or
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> dropped.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> The caching should work for connectors that
>>>>>>>> are
>>>>>>>>>>> using
>>>>>>>>>>>>>>>>>> FLIP-27
>>>>>>>>>>>>>>>>>>> interfaces,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> we should not introduce new features for old
>>>>>>>>>>>>> interfaces.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Martijn
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Thu, 12 May 2022 at 06:19, Александр
>>>>>>>> Смирнов
>>>>>>>>> <
>>>>>>>>>>>>>>>>>>> smiralexan@gmail.com>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi Jark!
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Sorry for the late response. I would like to
>>>>>>>>> make
>>>>>>>>>>>>> some
>>>>>>>>>>>>>>>>>>> comments and
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> clarify my points.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 1) I agree with your first statement. I think
>>>>>>>>> we
>>>>>>>>>>> can
>>>>>>>>>>>>>>>>>> achieve
>>>>>>>>>>>>>>>>>>> both
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> advantages this way: put the Cache interface
>>>>>>>> in
>>>>>>>>>>>>>>>>>>> flink-table-common,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> but have implementations of it in
>>>>>>>>>>>>> flink-table-runtime.
>>>>>>>>>>>>>>>>>>> Therefore if a
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> connector developer wants to use existing
>>>>>>>> cache
>>>>>>>>>>>>>>>>> strategies
>>>>>>>>>>>>>>>>>>> and their
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> implementations, he can just pass
>>>>>>>> lookupConfig
>>>>>>>>> to
>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>> planner, but if
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> he wants to have its own cache implementation
>>>>>>>>> in
>>>>>>>>>>> his
>>>>>>>>>>>>>>>>>>> TableFunction, it
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> will be possible for him to use the existing
>>>>>>>>>>>>> interface
>>>>>>>>>>>>>>>>> for
>>>>>>>>>>>>>>>>>>> this
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> purpose (we can explicitly point this out in
>>>>>>>>> the
>>>>>>>>>>>>>>>>>>> documentation). In
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> this way all configs and metrics will be
>>>>>>>>> unified.
>>>>>>>>>>>>> WDYT?
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> If a filter can prune 90% of data in the cache, we will have 90% of lookup
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> requests that can never be cached
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 2) Let me clarify the logic of the filter optimization in the case of an LRU cache.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> It looks like Cache<RowData, Collection<RowData>>. Here we always store the response
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> of the dimension table in the cache, even after applying the calc function. I.e. if
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> there are no rows left after applying filters to the result of the 'eval' method of
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the TableFunction, we store an empty list under the lookup keys. Therefore the cache
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> line will still be filled, but will require much less memory (in bytes). I.e. we don't
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> completely filter out the keys whose result was pruned, but we significantly reduce
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the memory required to store that result. If the user knows about this behavior, he
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> can increase the 'max-rows' option before the start of the job. But actually I came
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> up with the idea that we can do this automatically by using the 'maximumWeight' and
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 'weigher' methods of the Guava cache [1]. The weight can be the size of the
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> collection of rows (the value of a cache entry). Therefore the cache can
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> automatically fit many more records than before.
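To make the weigher idea concrete, here is a small self-contained sketch of bounding a cache by the total number of cached rows instead of the number of keys, so that keys whose results were fully filtered out (empty collections) consume almost none of the budget. It uses plain Java with simple FIFO eviction rather than Guava (whose `CacheBuilder.maximumWeight()`/`weigher()` provide the real mechanism); all class and method names here are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Collection;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

/**
 * Sketch of weight-based bounding: the budget counts cached rows, not cached
 * keys, mimicking Guava's maximumWeight()/weigher() with a weigher of
 * "number of rows in the entry". Eviction order is FIFO for simplicity.
 */
class WeightedRowCache<K, V> {
    private final long maxTotalRows; // total row budget across all entries
    private long totalRows = 0;
    private final Map<K, Collection<V>> entries = new HashMap<>();
    private final Deque<K> insertionOrder = new ArrayDeque<>();

    WeightedRowCache(long maxTotalRows) {
        this.maxTotalRows = maxTotalRows;
    }

    public void put(K key, Collection<V> rows) {
        Collection<V> old = entries.put(key, rows);
        if (old != null) {
            totalRows -= old.size(); // replacing an entry: give back its old weight
        } else {
            insertionOrder.addLast(key);
        }
        totalRows += rows.size();    // weight of an entry = its row count
        while (totalRows > maxTotalRows && !insertionOrder.isEmpty()) {
            K eldest = insertionOrder.removeFirst();
            Collection<V> evicted = entries.remove(eldest);
            if (evicted != null) {
                totalRows -= evicted.size();
            }
        }
    }

    public Collection<V> get(K key) {
        return entries.get(key);
    }

    public int keyCount() {
        return entries.size();
    }
}
```

The effect is what the paragraph describes: an empty result occupies a cache slot but weighs nothing, so far more filtered keys fit under the same memory budget.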
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Flink SQL has provided a standard way to do filters and projects pushdown, i.e.,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> SupportsFilterPushDown and SupportsProjectionPushDown. Jdbc/hive/HBase haven't
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> implemented the interfaces, doesn't mean it's hard to implement.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> It's debatable how difficult it will be to implement filter pushdown. But I think the
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> fact that currently there is no database connector with filter pushdown at least
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> means that this feature won't be supported in connectors soon. Moreover, if we talk
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> about other connectors (not in the Flink repo), their databases might not support all
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Flink filters (or might not support filters at all). I think users are interested in
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> supporting the cache filter optimization independently of supporting other features
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> and solving more complex (or outright unsolvable) problems.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 3) I agree with your third statement. Actually in our internal version I also tried
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> to unify the logic of scanning and of reloading data from connectors. But
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> unfortunately, I didn't find a way to unify the logic of all ScanRuntimeProviders
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> (InputFormat, SourceFunction, Source, ...) and reuse it for reloading the ALL cache.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> As a result I settled on using InputFormat, because it was used for scanning in all
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> lookup connectors. (I didn't know that there are plans to deprecate InputFormat in
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> favor of the FLIP-27 Source.) IMO using the FLIP-27 source for ALL caching is not a
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> good idea, because this source was designed to work in a distributed environment
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> (SplitEnumerator on the JobManager and SourceReaders on the TaskManagers), not inside
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> a single operator (the lookup join operator in our case). There is not even a direct
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> way to pass splits from the SplitEnumerator to a SourceReader (this logic works
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> through the SplitEnumeratorContext, which requires
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> OperatorCoordinator.SubtaskGateway to send AddSplitEvents). Using InputFormat for the
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> ALL cache seems much clearer and easier. But if there are plans to refactor all
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> connectors to FLIP-27, I have the following idea: maybe we can give up the lookup
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> join ALL cache in favor of a simple join with repeated scanning of the batch source?
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> The point is that the only difference between a lookup join with an ALL cache and a
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> simple join with a batch source is that in the first case scanning is performed
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> multiple times, in between which the state (cache) is cleared (correct me if I'm
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> wrong). So what if we extend the functionality of the simple join to support state
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> reloading + extend the functionality of scanning a batch source multiple times (the
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> latter should be easy with the new FLIP-27 source, which unifies streaming/batch
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> reading - we would only need to change the SplitEnumerator, so that it passes the
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> splits again after some TTL)? WDYT? I must say that this looks like a long-term goal
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> and will make the scope of this FLIP even larger than you said. Maybe we can limit
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> ourselves to a simpler solution now (InputFormats).
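As a rough illustration of the "scan repeatedly, clearing state in between" behavior discussed above, here is a minimal sketch of an ALL cache that rebuilds a full snapshot and swaps it in atomically. All names are hypothetical, and a `Supplier` stands in for whatever actually performs the full scan (an InputFormat or a FLIP-27 batch read); lookups between reloads read a consistent, possibly stale snapshot.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

/**
 * Sketch of an ALL cache: a full re-scan of the dimension table builds a new
 * snapshot off to the side, then an atomic swap replaces the old one, which
 * models "scanning multiple times, clearing the state in between".
 */
class AllCache<K, V> {
    private final Supplier<Map<K, List<V>>> scanAll; // stand-in for the batch scan
    private final AtomicReference<Map<K, List<V>>> snapshot =
            new AtomicReference<>(new HashMap<>());

    AllCache(Supplier<Map<K, List<V>>> scanAll) {
        this.scanAll = scanAll;
    }

    /** Called once on start, then again after each TTL expiry. */
    public void reload() {
        Map<K, List<V>> fresh = scanAll.get(); // build the new state aside
        snapshot.set(fresh);                   // atomic swap drops the old cache
    }

    /** Lookups read the current snapshot; results are stale until next reload. */
    public List<V> lookup(K key) {
        return snapshot.get().getOrDefault(key, List.of());
    }
}
```

In a real operator the `reload()` call would be driven by a TTL timer, and whether the scan comes from an InputFormat or a re-triggered FLIP-27 SplitEnumerator is exactly the open question in this thread.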
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> So to sum up, my points are like this:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 1) There is a way to make both concise and flexible interfaces for caching in lookup
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> join.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 2) The cache filter optimization is important both in LRU and ALL caches.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 3) It is unclear when filter pushdown will be supported in Flink connectors, some of
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the connectors might not have the opportunity to support filter pushdown + as far as
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I know, currently filter pushdown works only for scanning (not lookup). So the cache
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> filter + projection optimization should be independent of other features.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 4) The ALL cache implementation is a complex topic that involves multiple aspects of
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> how Flink is developing. Dropping InputFormat in favor of the FLIP-27 Source will
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> make the ALL cache implementation really complex and unclear, so maybe instead of
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> that we can extend the functionality of the simple join, or keep InputFormat in the
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> case of the lookup join ALL cache?
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Smirnov Alexander
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> [1]
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> https://guava.dev/releases/18.0/api/docs/com/google/common/cache/CacheBuilder.html#weigher(com.google.common.cache.Weigher)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Thu, 5 May 2022 at 20:34, Jark Wu <imjark@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> It's great to see the active discussion! I want to share my ideas:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 1) implement the cache in the framework vs. in the connector base
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I don't have a strong opinion on this. Both ways should work (e.g., cache pruning,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> compatibility). The framework way can provide more concise interfaces. The connector
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> base way can define more flexible cache strategies/implementations. We are still
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> investigating whether we can have both advantages. We should reach a consensus that
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the chosen way should be a final state, and that we are on the path to it.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 2) filters and projections pushdown:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I agree with Alex that the filter pushdown into the cache can benefit the ALL cache
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> a lot. However, this is not true for the LRU cache. Connectors use the cache to
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> reduce IO requests to databases for better throughput. If a filter can prune 90% of
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the data in the cache, we will have 90% of lookup requests that can never be cached
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> and hit the databases directly. That means the cache is meaningless in this case.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> IMO, Flink SQL has provided a standard way to do filter and projection pushdown,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> i.e., SupportsFilterPushDown and SupportsProjectionPushDown. That Jdbc/hive/HBase
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> haven't implemented the interfaces doesn't mean they are hard to implement. They
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> should implement the pushdown interfaces to reduce IO and the cache size. The final
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> state should be that the scan source and the lookup source share the exact same
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> pushdown implementation. I don't see why we need to duplicate the pushdown logic in
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> caches, which would complicate the lookup join design.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 3) ALL cache abstraction
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> The ALL cache might be the most challenging part of this FLIP. We have never
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> provided a reload-lookup public interface. Currently, we put the reload logic in the
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> "eval" method of the TableFunction. That's hard for some sources (e.g., Hive).
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Ideally, connector implementations should share the logic of reload and scan, i.e.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> ScanTableSource with an InputFormat/SourceFunction/FLIP-27 Source. However,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> InputFormat/SourceFunction are deprecated, and the FLIP-27 source is deeply coupled
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> with the SourceOperator. If we want to invoke the FLIP-27 source in LookupJoin,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> this may make the scope of this FLIP much larger. We are still investigating how to
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> abstract the ALL cache logic and reuse the existing source interfaces.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Jark
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Thu, 5 May 2022 at 20:22, Roman Boyko <ro.v.boyko@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> It's a much more complicated activity and lies outside the scope of this
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> improvement, because such pushdowns would have to be done for all ScanTableSource
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> implementations (not only for the lookup ones).
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Thu, 5 May 2022 at 19:02, Martijn Visser <martijnvisser@apache.org> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi everyone,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> One question regarding "And Alexander correctly mentioned that filter pushdown
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> still is not implemented for jdbc/hive/hbase." -> Would an alternative solution be
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> to actually implement these filter pushdowns? I can imagine that there are many
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> more benefits to doing that, outside of lookup caching and metrics.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Martijn Visser
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> https://twitter.com/MartijnVisser82
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> https://github.com/MartijnVisser
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Thu, 5 May 2022 at 13:58, Roman Boyko <ro.v.boyko@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi everyone!
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for driving such a valuable improvement!
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I do think that a single cache implementation would be a nice opportunity for
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> users. And it will break the "FOR SYSTEM_TIME AS OF proc_time" semantics anyway -
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> no matter how it is implemented.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Putting myself in the user's shoes, I can say that:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 1) I would prefer to have the opportunity to cut down the cache size by simply
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> filtering out unnecessary data. And the handiest way to do that is to apply it
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> inside the LookupRunners. It would be a bit harder to pass it through the
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> LookupJoin node to the TableFunction. And Alexander correctly mentioned that
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> filter pushdown still is not implemented for jdbc/hive/hbase.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 2) The ability to set different caching parameters for different tables is quite
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> important. So I would prefer to set them through DDL rather than have the same
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> TTL, strategy and other options for all lookup tables.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 3) Putting the cache into the framework really deprives us of extensibility
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> (users won't be able to implement their own cache). But most probably this can be
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> solved by creating more cache strategies and a wider set of configurations.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> All these points are much closer to the schema proposed by Alexander. Qingsheng
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Ren, please correct me if I'm wrong and all these facilities can be simply
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> implemented in your architecture?
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Roman Boyko
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> e.: ro.v.boyko@gmail.com
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Wed, 4 May 2022 at 21:01, Martijn Visser <martijnvisser@apache.org> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi everyone,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I don't have much to chip in, but I just wanted to express that I really
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> appreciate the in-depth discussion on this topic, and I hope that others will
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> join the conversation.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Martijn
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Tue, 3 May 2022 at 10:15, Alexander Smirnov <smiralexan@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi Qingsheng, Leonard and Jark,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for your detailed feedback!
>>>>>>>>>> However, I
>>>>>>>>>>>>> have
>>>>>>>>>>>>>>>>>>> questions
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> about
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> some of your statements (maybe I didn't
>>>>>>>>> get
>>>>>>>>>>>>>>>>>>> something?).
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Caching actually breaks the semantic
>>>>>>>> of
>>>>>>>>>> "FOR
>>>>>>>>>>>>>>>>>>> SYSTEM_TIME
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> AS OF
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> proc_time”
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I agree that the semantics of "FOR
>>>>>>>>>>> SYSTEM_TIME
>>>>>>>>>>>>> AS
>>>>>>>>>>>>>>>>> OF
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> proc_time"
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> is
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> not
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> fully implemented with caching, but as
>>>>>>>>> you
>>>>>>>>>>>>> said,
>>>>>>>>>>>>>>>>>> users
>>>>>>>>>>>>>>>>>>> go
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> on it
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> consciously to achieve better
>>>>>>>> performance
>>>>>>>>>> (no
>>>>>>>>>>>>> one
>>>>>>>>>>>>>>>>>>> proposed
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> enable
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> caching by default, etc.). Or by users
>>>>>>>> do
>>>>>>>>>> you
>>>>>>>>>>>>> mean
>>>>>>>>>>>>>>>>>>> other
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> developers
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> of
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> connectors? In this case developers
>>>>>>>>>>> explicitly
>>>>>>>>>>>>>>>>>> specify
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> whether
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> their
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> connector supports caching or not (in
>>>>>>>> the
>>>>>>>>>>> list
>>>>>>>>>>>>> of
>>>>>>>>>>>>>>>>>>> supported
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> options),
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> no one makes them do that if they don't
>>>>>>>>>> want
>>>>>>>>>>>>> to.
>>>>>>>>>>>>>>> So
>>>>>>>>>>>>>>>>>>> what
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> exactly is
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the difference between implementing
>>>>>>>>> caching
>>>>>>>>>>> in
>>>>>>>>>>>>>>>>>> modules
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> flink-table-runtime and in
>>>>>>>>>> flink-table-common
>>>>>>>>>>>>> from
>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> considered
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> point of view? How does it affect on
>>>>>>>>>>>>>>>>>>> breaking/non-breaking
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> semantics of "FOR SYSTEM_TIME AS OF
>>>>>>>>>>> proc_time"?
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> confront a situation that allows table
>>>>>>>>>>>>> options in
>>>>>>>>>>>>>>>>>> DDL
>>>>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> control
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> behavior of the framework, which has
>>>>>>>>> never
>>>>>>>>>>>>>>> happened
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> previously
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> should
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> be cautious
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> If we talk about main differences of
>>>>>>>>>>> semantics
>>>>>>>>>>>>> of
>>>>>>>>>>>>>>>>> DDL
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> options
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> config options("table.exec.xxx"), isn't
>>>>>>>>> it
>>>>>>>>>>>>> about
>>>>>>>>>>>>>>>>>>> limiting
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> scope
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> of
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the options + importance for the user
>>>>>>>>>>> business
>>>>>>>>>>>>>>>>> logic
>>>>>>>>>>>>>>>>>>> rather
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> than
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> specific location of corresponding
>>>>>>>> logic
>>>>>>>>> in
>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>> framework? I
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> mean
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> that
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> in my design, for example, putting an
>>>>>>>>>> option
>>>>>>>>>>>>> with
>>>>>>>>>>>>>>>>>>> lookup
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> cache
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> strategy in configurations would  be
>>>>>>>> the
>>>>>>>>>>> wrong
>>>>>>>>>>>>>>>>>>> decision,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> because it
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> directly affects the user's business
>>>>>>>>> logic
>>>>>>>>>>> (not
>>>>>>>>>>>>>>>>> just
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> performance
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> optimization) + touches just several
>>>>>>>>>>> functions
>>>>>>>>>>>>> of
>>>>>>>>>>>>>>>>> ONE
>>>>>>>>>>>>>>>>>>> table
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> (there
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> can
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> be multiple tables with different
>>>>>>>>> caches).
>>>>>>>>>>>>> Does it
>>>>>>>>>>>>>>>>>>> really
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> matter for
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the user (or someone else) where the
>>>>>>>>> logic
>>>>>>>>>> is
>>>>>>>>>>>>>>>>>> located,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> which is
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> affected by the applied option?
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Also I can remember DDL option
>>>>>>>>>>>>> 'sink.parallelism',
>>>>>>>>>>>>>>>>>>> which in
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> some way
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> "controls the behavior of the
>>>>>>>> framework"
>>>>>>>>>> and
>>>>>>>>>>> I
>>>>>>>>>>>>>>>>> don't
>>>>>>>>>>>>>>>>>>> see any
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> problem
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> here.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> introduce a new interface for this
>>>>>>>>>>> all-caching
>>>>>>>>>>>>>>>>>>> scenario
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> design
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> would become more complex
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
This is a subject for a separate discussion, but actually in our
internal version we solved this problem quite easily - we reused the
InputFormat class (so there is no need for a new API). The point is
that currently all lookup connectors use InputFormat for scanning the
data in batch mode: HBase, JDBC and even Hive - it uses the class
PartitionReader, which is actually just a wrapper around InputFormat.
The advantage of this solution is the ability to reload cache data in
parallel (the number of threads depends on the number of InputSplits,
but has an upper limit). As a result the cache reload time is
significantly reduced (as well as the time the input stream is
blocked). I know that usually we try to avoid usage of concurrency in
Flink code, but maybe this one can be an exception. BTW I don't say
that it's an ideal solution, maybe there are better ones.
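
The parallel reload idea can be sketched without any Flink dependency;
the `Split` interface below is only a stand-in for Flink's real
InputFormat/InputSplit pair, so this is an illustration of the
bounded thread-per-split loading described above, not the actual API:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Standalone sketch of a parallel "ALL" cache reload. A "split" here is
// just a supplier of key/value rows so the example stays self-contained.
public class ParallelCacheReload {

    public interface Split {
        List<Map.Entry<String, String>> readRows();
    }

    // Read all splits in parallel (bounded thread count) and build a fresh
    // cache snapshot; the caller swaps it in atomically when done.
    public static Map<String, String> reload(List<Split> splits, int maxThreads) {
        int threads = Math.min(maxThreads, Math.max(1, splits.size()));
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            ConcurrentHashMap<String, String> snapshot = new ConcurrentHashMap<>();
            List<Future<?>> futures = new ArrayList<>();
            for (Split split : splits) {
                futures.add(pool.submit(() -> {
                    for (Map.Entry<String, String> row : split.readRows()) {
                        snapshot.put(row.getKey(), row.getValue());
                    }
                }));
            }
            for (Future<?> f : futures) {
                f.get(); // surface any read failure
            }
            return snapshot;
        } catch (Exception e) {
            throw new RuntimeException("Cache reload failed", e);
        } finally {
            pool.shutdown();
        }
    }
}
```

With one split per InputSplit, the reload wall-clock time shrinks
roughly with the number of threads, which is the point made above.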

> Providing the cache in the framework might introduce compatibility
> issues

It's possible only in cases when the developer of the connector
doesn't properly refactor his code and uses the new cache options
incorrectly (i.e. explicitly provides the same options in 2 different
code places). For correct behavior all he will need to do is redirect
the existing options to the framework's LookupConfig (+ maybe add an
alias for options, if there was different naming); everything will be
transparent for users. If the developer doesn't do the refactoring at
all, nothing will change for the connector because of backward
compatibility. Also if a developer wants to use his own cache logic,
he can just refuse to pass some of the configs to the framework, and
instead make his own implementation with the already existing configs
and metrics (but actually I think that's a rare case).
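
A minimal sketch of that option-redirect idea. LookupConfig here is
the hypothetical class discussed in this thread, not an existing
Flink API; the option keys mirror the existing 'lookup.cache.*' DDL
option names, and the alias map shows how a connector could keep its
historical option name:

```java
import java.util.Map;
import java.util.Optional;

// Sketch: a connector parses its existing DDL options and hands them to
// one shared config object instead of implementing its own cache.
public class LookupConfigExample {

    public static final class LookupConfig {
        public final long cacheMaxRows;
        public final long cacheTtlMillis;

        private LookupConfig(long cacheMaxRows, long cacheTtlMillis) {
            this.cacheMaxRows = cacheMaxRows;
            this.cacheTtlMillis = cacheTtlMillis;
        }

        // Build from DDL options; aliases map a canonical key to the
        // connector's historical option name, if the naming differed.
        public static LookupConfig fromOptions(Map<String, String> options,
                                               Map<String, String> aliases) {
            long maxRows = Long.parseLong(
                    get(options, aliases, "lookup.cache.max-rows").orElse("-1"));
            long ttl = Long.parseLong(
                    get(options, aliases, "lookup.cache.ttl").orElse("0"));
            return new LookupConfig(maxRows, ttl);
        }

        private static Optional<String> get(Map<String, String> options,
                                            Map<String, String> aliases,
                                            String key) {
            String aliased = aliases.getOrDefault(key, key);
            String value = options.getOrDefault(key, options.get(aliased));
            return Optional.ofNullable(value);
        }
    }
}
```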

> filters and projections should be pushed all the way down to the
> table function, like what we do in the scan source

It's a great purpose. But the truth is that the ONLY connector that
supports filter pushdown is FileSystemTableSource (no database
connector supports it currently). Also for some databases it's simply
impossible to push down such complex filters as we have in Flink.

> only applying these optimizations to the cache seems not quite
> useful

Filters can cut off an arbitrarily large amount of data from the
dimension table. For a simple example, suppose in the dimension table
'users' we have a column 'age' with values from 20 to 40, and an
input stream 'clicks' that is ~uniformly distributed by age of users.
If we have the filter 'age > 30', there will be about half as much
data in the cache. This means the user can increase
'lookup.cache.max-rows' by almost 2 times, which will give a huge
performance boost. Moreover, this optimization really starts to shine
with the 'ALL' cache, where tables without filters and projections
can't fit in memory, but with them - can. This opens up additional
possibilities for users. And this doesn't sound like 'not quite
useful'.
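
The arithmetic of the example can be checked in a few lines; the age
range and the predicate are the ones from the paragraph above, and
rows that fail the pushed-down filter simply never enter the cache:

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

// Back-of-the-envelope check: ages uniform in [20, 40], filter age > 30.
public class FilterBeforeCache {

    // Only rows passing the filter are kept for the cache.
    public static List<Integer> cacheAfterFilter(List<Integer> ages) {
        return ages.stream().filter(age -> age > 30).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Integer> allAges = IntStream.rangeClosed(20, 40).boxed()
                .collect(Collectors.toList()); // 21 distinct ages
        List<Integer> cached = cacheAfterFilter(allAges); // ages 31..40
        System.out.println(allAges.size() + " -> " + cached.size()); // 21 -> 10
    }
}
```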

It would be great to hear other voices regarding this topic! Because
we have quite a lot of controversial points, and I think with the
help of others it will be easier for us to come to a consensus.

Best regards,
Smirnov Alexander

On Fri, Apr 29, 2022 at 10:33 PM Qingsheng Ren <renqschn@gmail.com> wrote:
>
> Hi Alexander and Arvid,
>
> Thanks for the discussion and sorry for my late response! We had an
> internal discussion together with Jark and Leonard and I'd like to
> summarize our ideas. Instead of implementing the cache logic in the
> table runtime layer or wrapping around the user-provided table
> function, we prefer to introduce some new APIs extending
> TableFunction, with these concerns:
>
> 1. Caching actually breaks the semantic of "FOR SYSTEM_TIME AS OF
> proc_time", because it couldn't truly reflect the content of the
> lookup table at the moment of querying. If users choose to enable
> caching on the lookup table, they implicitly indicate that this
> breakage is acceptable in exchange for the performance. So we prefer
> not to provide caching on the table runtime level.
>
> 2. If we make the cache implementation in the framework (whether in
> a runner or a wrapper around TableFunction), we have to confront a
> situation that allows table options in DDL to control the behavior
> of the framework, which has never happened previously and should be
> treated cautiously. Under the current design the behavior of the
> framework should only be specified by configurations
> ("table.exec.xxx"), and it's hard to apply these general configs to
> a specific table.
>
> 3. We have use cases where the lookup source loads and refreshes all
> records periodically into memory to achieve high lookup performance
> (like the Hive connector in the community; this is also widely used
> by our internal connectors). Wrapping the cache around the user's
> TableFunction works fine for LRU caches, but I think we would have
> to introduce a new interface for this all-caching scenario and the
> design would become more complex.
>
> 4. Providing the cache in the framework might introduce
> compatibility issues to existing lookup sources: there might exist
> two caches with totally different strategies if the user incorrectly
> configures the table (one in the framework and another implemented
> by the lookup source).
>
> As for the optimization mentioned by Alexander, I think filters and
> projections should be pushed all the way down to the table function,
> like what we do in the scan source, instead of to the runner with
> the cache. The goal of using the cache is to reduce the network I/O
> and the pressure on the external system, and only applying these
> optimizations to the cache seems not quite useful.
>
> I made some updates to the FLIP[1] to reflect our ideas. We prefer
> to keep the cache implementation as a part of TableFunction, and we
> could provide some helper classes (CachingTableFunction,
> AllCachingTableFunction, CachingAsyncTableFunction) to developers
> and regulate the metrics of the cache. Also, I made a POC[2] for
> your reference.
>
> Looking forward to your ideas!
>
> [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-221+Abstraction+for+lookup+source+cache+and+metric
> [2] https://github.com/PatrickRen/flink/tree/FLIP-221
>
> Best regards,
>
> Qingsheng
>
> On Tue, Apr 26, 2022 at 4:45 PM Александр Смирнов
> <smiralexan@gmail.com> wrote:
>>
>> Thanks for the response, Arvid!
>>
>> I have a few comments on your message.
>>
>>> but could also live with an easier solution as the first step:
>>
>> I think that these 2 ways are mutually exclusive (the one originally
>> proposed by Qingsheng and mine), because conceptually they follow
>> the same goal, but the implementation details are different. If we
>> go one way, moving to the other way in the future will mean deleting
>> existing code and once again changing the API for connectors. So I
>> think we should reach a consensus with the community about that and
>> then work together on this FLIP, i.e. divide the work into tasks for
>> different parts of the FLIP (for example, LRU cache unification /
>> introducing the proposed set of metrics / further work…). WDYT,
>> Qingsheng?
>>
>>> as the source will only receive the requests after filter
>>
>> Actually, if filters are applied to fields of the lookup table, we
>> first must do the requests, and only after that can we filter the
>> responses, because lookup connectors don't have filter pushdown. So
>> if filtering is done before caching, there will be far fewer rows in
>> the cache.
>>
>>> @Alexander unfortunately, your architecture is not shared. I don't
>>> know the solution to share images to be honest.
>>
>> Sorry for that, I'm a bit new to such kinds of conversations :)
>> I have no write access to the confluence, so I made a Jira issue
>> where I described the proposed changes in more detail -
>> https://issues.apache.org/jira/browse/FLINK-27411.
>>
>> Will be happy to get more feedback!
>>
>> Best,
>> Smirnov Alexander
>>
>> On Mon, Apr 25, 2022 at 7:49 PM Arvid Heise <arvid@apache.org> wrote:
>>>
>>> Hi Qingsheng,
>>>
>>> Thanks for driving this; the inconsistency was not satisfying for
>>> me.
>>>
>>> I second Alexander's idea though, but could also live with an
>>> easier solution as the first step: instead of making caching an
>>> implementation detail of TableFunction X, rather devise a caching
>>> layer around X. So the proposal would be a CachingTableFunction
>>> that delegates to X in case of misses and else manages the cache.
>>> Lifting it into the operator model as proposed would be even better
>>> but is probably unnecessary in the first step for a lookup source
>>> (as the source will only receive the requests after filter;
>>> applying projection may be more interesting to save memory).
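
A minimal sketch of the delegating wrapper described above, using
plain java.util types in place of Flink's TableFunction so it stays
self-contained; the LRU behavior comes from an access-ordered
LinkedHashMap, and the real helper class would differ:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Caching layer around a lookup function "X": answers from the LRU map
// when possible and delegates to X only on misses.
public class CachingLookup<K, V> {

    private final Function<K, List<V>> delegate; // the wrapped lookup "X"
    private final Map<K, List<V>> cache;

    public CachingLookup(Function<K, List<V>> delegate, int maxRows) {
        this.delegate = delegate;
        // access-order LinkedHashMap gives simple LRU eviction
        this.cache = new LinkedHashMap<K, List<V>>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, List<V>> eldest) {
                return size() > maxRows;
            }
        };
    }

    public List<V> lookup(K key) {
        List<V> hit = cache.get(key);
        if (hit != null) {
            return hit; // served from cache, no external I/O
        }
        List<V> rows = delegate.apply(key); // miss: delegate to X
        cache.put(key, rows);
        return rows;
    }

    public int cachedKeys() {
        return cache.size();
    }
}
```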
>>>
>>> Another advantage is that all the changes of this FLIP would be
>>> limited to options, no need for new public interfaces. Everything
>>> else remains an implementation detail of the Table runtime. That
>>> means we can easily incorporate the optimization potential that
>>> Alexander pointed out later.
>>>
>>> @Alexander unfortunately, your architecture is not shared. I don't
>>> know the solution to share images to be honest.
>>>
>>> On Fri, Apr 22, 2022 at 5:04 PM Александр Смирнов
>>> <smiralexan@gmail.com> wrote:
>>>>
>>>> Hi Qingsheng! My name is Alexander, I'm not a committer yet, but
>>>> I'd really like to become one. And this FLIP really interested me.
>>>> Actually I have worked on a similar feature in my company's Flink
>>>> fork, and we would like to share our thoughts on this and make the
>>>> code open source.
>>>>
>>>> I think there is a better alternative than introducing an abstract
>>>> class for TableFunction (CachingTableFunction). As you know,
>>>> TableFunction exists in the flink-table-common module, which
>>>> provides only an API for working with tables – it's very
>>>> convenient for importing in connectors. In turn,
>>>> CachingTableFunction contains logic for runtime execution, so this
>>>> class and everything connected with it should be located in
>>>> another module, probably in flink-table-runtime. But this would
>>>> require connectors to depend on another module that contains a lot
>>>> of runtime logic, which doesn't sound good.
>>>>
>>>> I suggest adding a new method 'getLookupConfig' to
>>>> LookupTableSource or LookupRuntimeProvider to allow connectors to
>>>> only pass configurations to the planner, so they won't depend on
>>>> the runtime realization. Based on these configs the planner will
>>>> construct a lookup join operator with the corresponding runtime
>>>> logic (ProcessFunctions in the module flink-table-runtime). The
>>>> architecture looks like in the pinned image (the LookupConfig
>>>> class there is actually your CacheConfig).
>>>>
>>>> The classes in flink-table-planner that will be responsible for
>>>> this are CommonPhysicalLookupJoin and its inheritors.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Current classes for lookup join in
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> flink-table-runtime
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> -
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> LookupJoinRunner,
>>>>>>>>>> AsyncLookupJoinRunner,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> LookupJoinRunnerWithCalc,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> AsyncLookupJoinRunnerWithCalc.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I suggest adding classes
>>>>>>>>>>>>>>>>> LookupJoinCachingRunner,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> LookupJoinCachingRunnerWithCalc,
>>>>>>>> etc.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> And here comes another more
>>>>>>>> powerful
>>>>>>>>>>>>> advantage
>>>>>>>>>>>>>>>>> of
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> such a
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> solution.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> If
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> we have caching logic on a lower
>>>>>>>>> level,
>>>>>>>>>>> we
>>>>>>>>>>>>> can
>>>>>>>>>>>>>>>>>>> apply
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> some
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> optimizations to it.
>>>>>>>>>>>>> LookupJoinRunnerWithCalc
>>>>>>>>>>>>>>>>> was
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> named
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> like
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> this
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> because it uses the ‘calc’
>>>>>>>> function,
>>>>>>>>>>> which
>>>>>>>>>>>>>>>>>> actually
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> mostly
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> consists
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> of
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> filters and projections.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> For example, in join table A with
>>>>>>>>>> lookup
>>>>>>>>>>>>> table
>>>>>>>>>>>>>>>>> B
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> condition
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> ‘JOIN …
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> ON
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> A.id = B.id AND A.age = B.age + 10
>>>>>>>>>> WHERE
>>>>>>>>>>>>>>>>>> B.salary >
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 1000’
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> ‘calc’
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> function will contain filters
>>>>>>>> A.age =
>>>>>>>>>>>>> B.age +
>>>>>>>>>>>>>>>>> 10
>>>>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> B.salary >
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 1000.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> If we apply this function before
>>>>>>>>>> storing
>>>>>>>>>>>>>>>>> records
>>>>>>>>>>>>>>>>>> in
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> cache,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> size
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> of
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> cache will be significantly
>>>>>>>> reduced:
>>>>>>>>>>>>> filters =
>>>>>>>>>>>>>>>>>>> avoid
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> storing
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> useless
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> records in cache, projections =
>>>>>>>>> reduce
>>>>>>>>>>>>>>> records’
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> size. So
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> initial
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> max number of records in cache can
>>>>>>>> be
>>>>>>>>>>>>>>> increased
>>>>>>>>>>>>>>>>>> by
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> user.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> What do you think about it?
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On 2022/04/19 02:47:11 Qingsheng
>>>>>>>> Ren
>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi devs,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Yuan and I would like to start a
>>>>>>>>>>>>> discussion
>>>>>>>>>>>>>>>>>> about
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> FLIP-221[1],
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> which
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> introduces an abstraction of lookup
>>>>>>>>>> table
>>>>>>>>>>>>>>> cache
>>>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> its
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> standard
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> metrics.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Currently each lookup table source
>>>>>>>>>>> should
>>>>>>>>>>>>>>>>>>> implement
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> their
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> own
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> cache to
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> store lookup results, and there
>>>>>>>>> isn’t a
>>>>>>>>>>>>>>>>> standard
>>>>>>>>>>>>>>>>>> of
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> metrics
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> for
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> users and
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> developers to tuning their jobs
>>>>>>>> with
>>>>>>>>>>> lookup
>>>>>>>>>>>>>>>>>> joins,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> which
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> is a
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> quite
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> common
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> use case in Flink table / SQL.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Therefore we propose some new APIs
>>>>>>>>>>>>> including
>>>>>>>>>>>>>>>>>>> cache,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> metrics,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> wrapper
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> classes of TableFunction and new
>>>>>>>>> table
>>>>>>>>>>>>>>> options.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Please
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> take a
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> look
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> at the
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> FLIP page [1] to get more details.
>>>>>>>>> Any
>>>>>>>>>>>>>>>>>> suggestions
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> comments
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> would be
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> appreciated!
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> [1]
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>> 
>>>>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-221+Abstraction+for+lookup+source+cache+and+metric
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Qingsheng
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Best Regards,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Qingsheng Ren
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Real-time Computing Team
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Alibaba Cloud
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Email: renqschn@gmail.com
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Roman Boyko
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> e.: ro.v.boyko@gmail.com
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>>>>>>>>>>>> Best Regards,
>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>> Qingsheng Ren
>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>> Real-time Computing Team
>>>>>>>>>>>>>>>>>>>>>>>>>>> Alibaba Cloud
>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>> Email: renqschn@gmail.com
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>> 
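The filter-before-cache optimization described in the quoted message can be sketched as follows. All names here are hypothetical illustrations (rows are modeled as plain maps), not the FLIP's actual API: the point is that applying the join's "calc" part (residual filters and projections) before a row enters the cache means rejected rows are never stored and surviving rows carry only the projected columns.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.function.Predicate;

/** Illustrative sketch: apply filter + projection BEFORE caching looked-up rows. */
public class CalcBeforeCacheSketch {

    /** A row is modeled as a column-name -> value map for illustration. */
    public static final class Cache {
        private final Map<String, List<Map<String, Object>>> store = new HashMap<>();
        private final Predicate<Map<String, Object>> filter;
        private final Function<Map<String, Object>, Map<String, Object>> projection;

        Cache(Predicate<Map<String, Object>> filter,
              Function<Map<String, Object>, Map<String, Object>> projection) {
            this.filter = filter;
            this.projection = projection;
        }

        /** Apply the calc first, then cache only what survives it. */
        void put(String key, List<Map<String, Object>> lookedUpRows) {
            List<Map<String, Object>> reduced = new ArrayList<>();
            for (Map<String, Object> row : lookedUpRows) {
                if (filter.test(row)) {
                    reduced.add(projection.apply(row));
                }
            }
            store.put(key, reduced);
        }

        List<Map<String, Object>> get(String key) {
            return store.get(key);
        }
    }

    /** Demo mirroring the example above: keep salary > 1000, project to "name". */
    public static int cachedRowCount() {
        Cache cache = new Cache(
                row -> ((Integer) row.get("salary")) > 1000,
                row -> Map.of("name", row.get("name")));
        Map<String, Object> low = new HashMap<>();
        low.put("name", "a");
        low.put("salary", 500);
        Map<String, Object> high = new HashMap<>();
        high.put("name", "b");
        high.put("salary", 2000);
        cache.put("id-1", List.of(low, high));
        return cache.get("id-1").size(); // only the salary-2000 row was stored
    }

    public static void main(String[] args) {
        System.out.println("rows cached: " + cachedRowCount());
    }
}
```

With the filter applied up front, the cache holds one row instead of two, which is exactly why the user could raise the configured maximum cache size.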


Re: [DISCUSS] FLIP-221 Abstraction for lookup source cache and metric

Posted by Jingsong Li <ji...@gmail.com>.
Thanks Qingsheng and all.

I like this design.

Some comments:

1. LookupCache implements Serializable?

2. Minor: After FLIP-234 [1], there should be many connectors that
implement both PartialCachingLookupProvider and
PartialCachingAsyncLookupProvider. Can we extract a common interface
for `LookupCache getCache();` to ensure consistency?

[1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-234%3A+Support+Retryable+Lookup+Join+To+Solve+Delayed+Updates+Issue+In+External+Systems

Best,
Jingsong

On Tue, Jun 21, 2022 at 4:09 PM Qingsheng Ren <re...@apache.org> wrote:
>
> Hi devs,
>
> I’d like to push FLIP-221 forward a little bit. Recently we had some offline discussions and updated the FLIP. Here’s the diff compared to the previous version:
>
> 1. (Async)LookupFunctionProvider is designed as a base interface for constructing lookup functions.
> 2. From the LookupFunction we extend PartialCaching / FullCachingLookupProvider for partial and full caching mode.
> 3. Introduce CacheReloadTrigger for specifying the reload strategy in full caching mode, and provide two default implementations (Periodic / TimedCacheReloadTrigger)
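As a rough, self-contained sketch of what the partial caching mode described above amounts to (the interfaces here are simplified stand-ins, not the actual Flink classes): the framework checks the cache first and only invokes the connector's lookup function on a miss, caching the result afterwards.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative sketch of partial caching around a lookup function. */
public class PartialCachingSketch {

    interface LookupFunction {
        List<String> lookup(String key);
    }

    interface LookupCache {
        List<String> getIfPresent(String key);
        void put(String key, List<String> value);
    }

    static final class SimpleCache implements LookupCache {
        private final Map<String, List<String>> map = new HashMap<>();
        @Override public List<String> getIfPresent(String key) { return map.get(key); }
        @Override public void put(String key, List<String> value) { map.put(key, value); }
    }

    /** What the framework would do around the connector's lookup function. */
    static final class CachingLookup implements LookupFunction {
        private final LookupFunction delegate;
        private final LookupCache cache;
        int delegateCalls = 0; // counts hits on the external system, for demonstration

        CachingLookup(LookupFunction delegate, LookupCache cache) {
            this.delegate = delegate;
            this.cache = cache;
        }

        @Override
        public List<String> lookup(String key) {
            List<String> cached = cache.getIfPresent(key);
            if (cached != null) {
                return cached; // cache hit: no call to the external system
            }
            delegateCalls++;
            List<String> rows = delegate.lookup(key);
            cache.put(key, rows); // an empty list would also be cached ("missing key")
            return rows;
        }
    }

    static int demoDelegateCalls() {
        CachingLookup lookup = new CachingLookup(k -> List.of(k + "-row"), new SimpleCache());
        lookup.lookup("id-1");
        lookup.lookup("id-1"); // second call is served from the cache
        return lookup.delegateCalls;
    }

    public static void main(String[] args) {
        System.out.println("delegate calls: " + demoDelegateCalls());
    }
}
```

Two lookups of the same key reach the external system only once, which is the throughput win the thread keeps returning to.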
>
> Looking forward to your replies~
>
> Best,
> Qingsheng
>
> > On Jun 2, 2022, at 17:15, Qingsheng Ren <re...@gmail.com> wrote:
> >
> > Hi Becket,
> >
> > Thanks for your feedback!
> >
> > 1. An alternative way is to let the cache implementation decide whether
> > to store a missing key in the cache, instead of the framework.
> > This sounds more reasonable and makes the LookupProvider interface
> > cleaner. I can update the FLIP and clarify in the JavaDoc of
> > LookupCache#put that the cache should decide whether to store an empty
> > collection.
> >
> > 2. Initially the builder pattern is for the extensibility of
> > LookupProvider interfaces that we could need to add more
> > configurations in the future. We can remove the builder now as we have
> > resolved the issue in 1. As for the builder in DefaultLookupCache I
> > prefer to keep it because we have a lot of arguments in the
> > constructor.
> >
> > 3. I think this might overturn the overall design. I agree with
> > Becket's idea that the API design should be layered considering
> > extensibility and it'll be great to have one unified interface
> > supporting both partial, full and even mixed custom strategies, but we
> > have some issues to resolve. The original purpose of treating full
> > caching separately is that we'd like to reuse the ability of
> > ScanRuntimeProvider. Developers just need to hand over Source /
> > SourceFunction / InputFormat so that the framework could be able to
> > compose the underlying topology and control the reload (maybe in a
> > distributed way). Under your design we leave the reload operation
> > totally to the CacheStrategy and I think it will be hard for
> > developers to reuse the source in the initializeCache method.
> >
> > Best regards,
> >
> > Qingsheng
> >
> > On Thu, Jun 2, 2022 at 1:50 PM Becket Qin <be...@gmail.com> wrote:
> >>
> >> Thanks for updating the FLIP, Qingsheng. A few more comments:
> >>
> >> 1. I am still not sure about what is the use case for cacheMissingKey().
> >> More specifically, when would users want to have getCache() return a
> >> non-empty value and cacheMissingKey() returns false?
> >>
> >> 2. The builder pattern. Usually the builder pattern is used when there are
> >> a lot of variations of constructors. For example, if a class has three
> >> variables and all of them are optional, there could potentially be many
> >> combinations of the variables. But in this FLIP, I don't see such a case.
> >> What is the reason we have builders for all the classes?
> >>
> >> 3. Should the caching strategy be excluded from the top level provider API?
> >> Technically speaking, the Flink framework should only have two interfaces
> >> to deal with:
> >>    A) LookupFunction
> >>    B) AsyncLookupFunction
> >> Orthogonally, we *believe* there are two different strategies people can
> >> use for caching. Note that the Flink framework does not care what the
> >> caching strategy is here.
> >>    a) partial caching
> >>    b) full caching
> >>
> >> Putting them together, we end up with 3 combinations that we think are
> >> valid:
> >>     Aa) PartialCachingLookupFunctionProvider
> >>     Ba) PartialCachingAsyncLookupFunctionProvider
> >>     Ab) FullCachingLookupFunctionProvider
> >>
> >> However, the caching strategy could actually be quite flexible. E.g. an
> >> initial full cache load followed by some partial updates. Also, I am not
> >> 100% sure if the full caching will always use ScanTableSource. Including
> >> the caching strategy in the top level provider API would make it harder to
> >> extend.
> >>
> >> One possible solution is to just have *LookupFunctionProvider* and
> >> *AsyncLookupFunctionProvider
> >> *as the top level API, both with a getCacheStrategy() method returning an
> >> optional CacheStrategy. The CacheStrategy class would have the following
> >> methods:
> >> 1. void open(Context), the context exposes some of the resources that may
> >> be useful for the the caching strategy, e.g. an ExecutorService that is
> >> synchronized with the data processing, or a cache refresh trigger which
> >> blocks data processing and refresh the cache.
> >> 2. void initializeCache(), a blocking method allows users to pre-populate
> >> the cache before processing any data if they wish.
> >> 3. void maybeCache(RowData key, Collection<RowData> value), blocking or
> >> non-blocking method.
> >> 4. void refreshCache(), a blocking / non-blocking method that is invoked by
> >> the Flink framework when the cache refresh trigger is pulled.
> >>
> >> In the above design, partial caching and full caching would be
> >> implementations of the CachingStrategy. And it is OK for users to implement
> >> their own CachingStrategy if they want to.
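The pluggable CacheStrategy proposed above can be made concrete with a small sketch. The method names follow the proposal in this message; the Context and refresh-trigger plumbing is omitted, and the "full caching" implementation below is a toy stand-in, not Flink code.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Supplier;

/** Sketch of a pluggable cache strategy, per the proposal above. */
public class CacheStrategySketch {

    interface CacheStrategy {
        /** Pre-populate the cache before any data is processed. */
        void initializeCache();
        /** Offer a looked-up key/value; the strategy decides whether to keep it. */
        void maybeCache(String key, List<String> value);
        /** Invoked by the framework when the cache refresh trigger fires. */
        void refreshCache();
    }

    /** Toy "full caching" strategy: load everything up front, refresh = reload all. */
    static final class FullReloadStrategy implements CacheStrategy {
        private final Supplier<Map<String, List<String>>> bulkLoader;
        final Map<String, List<String>> cache = new HashMap<>();

        FullReloadStrategy(Supplier<Map<String, List<String>>> bulkLoader) {
            this.bulkLoader = bulkLoader;
        }

        @Override public void initializeCache() { refreshCache(); }

        @Override public void maybeCache(String key, List<String> value) {
            // A full cache is refreshed as a whole; per-key inserts are ignored.
        }

        @Override public void refreshCache() {
            cache.clear();
            cache.putAll(bulkLoader.get());
        }
    }

    /** Demo: a refresh after the backing table changes picks up the new row. */
    static int demoSizeAfterRefresh() {
        Map<String, List<String>> backing = new HashMap<>();
        backing.put("a", List.of("row1"));
        FullReloadStrategy strategy = new FullReloadStrategy(() -> new HashMap<>(backing));
        strategy.initializeCache();        // cache = {a}
        backing.put("b", List.of("row2")); // the dimension table changed
        strategy.refreshCache();           // cache = {a, b}
        return strategy.cache.size();
    }

    public static void main(String[] args) {
        System.out.println(demoSizeAfterRefresh());
    }
}
```

A partial caching strategy would instead implement maybeCache and leave initializeCache empty, which is what makes the single interface cover both modes.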
> >>
> >> Thanks,
> >>
> >> Jiangjie (Becket) Qin
> >>
> >>
> >> On Thu, Jun 2, 2022 at 12:14 PM Jark Wu <im...@gmail.com> wrote:
> >>
> >>> Thank Qingsheng for the detailed summary and updates,
> >>>
> >>> The changes look good to me in general. I just have one minor improvement
> >>> comment.
> >>> Could we add a static util method to the "FullCachingReloadTrigger"
> >>> interface for quick usage?
> >>>
> >>> #periodicReloadAtFixedRate(Duration)
> >>> #periodicReloadWithFixedDelay(Duration)
> >>>
> >>> I think we can also do this for LookupCache, because users may not know
> >>> where the default implementations are and how to use them.
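The static-utility suggestion above is the familiar convenience-factory pattern. A minimal sketch, using the method names from this message but with an invented trigger body reduced to its two scheduling parameters (so this is illustrative, not the actual Flink API):

```java
import java.time.Duration;

/** Sketch: static factories on the interface hide the default implementation. */
public interface FullCachingReloadTrigger {

    Duration interval();
    boolean atFixedRate();

    /** Reload at a fixed rate, regardless of how long each reload takes. */
    static FullCachingReloadTrigger periodicReloadAtFixedRate(Duration interval) {
        return new Periodic(interval, true);
    }

    /** Wait a fixed delay after each reload completes before the next one. */
    static FullCachingReloadTrigger periodicReloadWithFixedDelay(Duration delay) {
        return new Periodic(delay, false);
    }

    /** Default implementation; callers never need to name it directly. */
    final class Periodic implements FullCachingReloadTrigger {
        private final Duration interval;
        private final boolean atFixedRate;

        Periodic(Duration interval, boolean atFixedRate) {
            this.interval = interval;
            this.atFixedRate = atFixedRate;
        }

        @Override public Duration interval() { return interval; }
        @Override public boolean atFixedRate() { return atFixedRate; }
    }
}
```

A caller can then write `FullCachingReloadTrigger.periodicReloadAtFixedRate(Duration.ofHours(1))` without hunting for the default implementation class, which is exactly the discoverability point being made.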
> >>>
> >>> Best,
> >>> Jark
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>
> >>> On Wed, 1 Jun 2022 at 18:32, Qingsheng Ren <re...@gmail.com> wrote:
> >>>
> >>>> Hi Jingsong,
> >>>>
> >>>> Thanks for your comments!
> >>>>
> >>>>> AllCache definition is not flexible, for example, PartialCache can
> >>>>> use any custom storage, while the AllCache can not, AllCache can also
> >>>>> be considered to store memory or disk, also need a flexible strategy.
> >>>>
> >>>> We had an offline discussion with Jark and Leonard. Basically we think
> >>>> exposing the interface of the full cache storage to connector
> >>>> developers might limit our future optimizations. The storage of full
> >>>> caching shouldn’t have too many variations for different lookup
> >>>> tables, so making it pluggable might not help a lot. Also I think it
> >>>> is not quite easy for connector developers to implement such an
> >>>> optimized storage. We can keep optimizing this storage in the future,
> >>>> and all full caching lookup tables would benefit from it.
> >>>>
> >>>>> We are more inclined to deprecate the connector `async` option when
> >>>>> discussing FLIP-234. Can we remove this option from this FLIP?
> >>>>
> >>>> Thanks for the reminder! This option has been removed in the latest
> >>>> version.
> >>>>
> >>>> Best regards,
> >>>>
> >>>> Qingsheng
> >>>>
> >>>>
> >>>>> On Jun 1, 2022, at 15:28, Jingsong Li <ji...@gmail.com> wrote:
> >>>>>
> >>>>> Thanks Alexander for your reply. We can discuss the new interface
> >>>>> when it comes out.
> >>>>>
> >>>>> We are more inclined to deprecate the connector `async` option when
> >>>>> discussing FLIP-234 [1]. We should use a hint and let the planner
> >>>>> decide. Although the discussion has not yet produced a conclusion,
> >>>>> can we remove this option from this FLIP? It doesn't seem to be
> >>>>> related to this FLIP, but more to FLIP-234, and we can form a
> >>>>> conclusion over there.
> >>>>>
> >>>>> [1] https://lists.apache.org/thread/9k1sl2519kh2n3yttwqc00p07xdfns3h
> >>>>>
> >>>>> Best,
> >>>>> Jingsong
> >>>>>
> >>>>> On Wed, Jun 1, 2022 at 4:59 AM Jing Ge <ji...@ververica.com> wrote:
> >>>>>
> >>>>>> Hi Jark,
> >>>>>>
> >>>>>> Thanks for clarifying it. It would be fine as long as we could
> >>>>>> provide the no-cache solution. I was just wondering if the
> >>>>>> client-side cache could really help when HBase is used, since the
> >>>>>> data to look up should be huge. Depending on how much data will be
> >>>>>> cached on the client side, the data that should be LRU in e.g.
> >>>>>> LruBlockCache will not be LRU anymore. In the worst-case scenario,
> >>>>>> once the cached data at the client side has expired, the request
> >>>>>> will hit disk, which will cause extra latency temporarily, if I am
> >>>>>> not mistaken.
> >>>>>>
> >>>>>> Best regards,
> >>>>>> Jing
> >>>>>>
> >>>>>> On Mon, May 30, 2022 at 9:59 AM Jark Wu <im...@gmail.com> wrote:
> >>>>>>
> >>>>>>> Hi Jing Ge,
> >>>>>>>
> >>>>>>> What do you mean by the "impact on the block cache used by HBase"?
> >>>>>>> In my understanding, the connector cache and the HBase cache are
> >>>>>>> totally two different things. The connector cache is a local/client
> >>>>>>> cache, and the HBase cache is a server cache.
> >>>>>>>
> >>>>>>>> does it make sense to have a no-cache solution as one of the
> >>>>>>>> default solutions so that customers will have no effort for the
> >>>>>>>> migration if they want to stick with HBase cache
> >>>>>>>
> >>>>>>> The implementation migration should be transparent to users. Take
> >>>>>>> the HBase connector as an example: it already supports a lookup
> >>>>>>> cache, but it is disabled by default. After the migration, the
> >>>>>>> connector still disables the cache by default (i.e. the no-cache
> >>>>>>> solution). No migration effort for users.
> >>>>>>>
> >>>>>>> The HBase cache and the connector cache are two different things.
> >>>>>>> The HBase cache can't simply replace the connector cache, because
> >>>>>>> one of the most important usages of the connector cache is reducing
> >>>>>>> the I/O requests/responses and improving the throughput, which
> >>>>>>> can't be achieved by just using a server cache.
> >>>>>>>
> >>>>>>> Best,
> >>>>>>> Jark
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> On Fri, 27 May 2022 at 22:42, Jing Ge <ji...@ververica.com> wrote:
> >>>>>>>
> >>>>>>>> Thanks all for the valuable discussion. The new feature looks very
> >>>>>>>> interesting.
> >>>>>>>>
> >>>>>>>> According to the FLIP description: "*Currently we have JDBC, Hive
> >>>>>>>> and HBase connector implemented lookup table source. All existing
> >>>>>>>> implementations will be migrated to the current design and the
> >>>>>>>> migration will be transparent to end users*." I was only wondering
> >>>>>>>> if we should pay attention to HBase and similar DBs. Since,
> >>>>>>>> commonly, the lookup data will be huge while using HBase, partial
> >>>>>>>> caching will be used in this case, if I am not mistaken, which
> >>>>>>>> might have an impact on the block cache used by HBase, e.g.
> >>>>>>>> LruBlockCache.
> >>>>>>>>
> >>>>>>>> Another question: since HBase provides a sophisticated cache
> >>>>>>>> solution, does it make sense to have a no-cache solution as one of
> >>>>>>>> the default solutions, so that customers will have no effort for
> >>>>>>>> the migration if they want to stick with the HBase cache?
> >>>>>>>>
> >>>>>>>> Best regards,
> >>>>>>>> Jing
> >>>>>>>>
> >>>>>>>> On Fri, May 27, 2022 at 11:19 AM Jingsong Li
> >>>>>>>> <jingsonglee0@gmail.com> wrote:
> >>>>>>>>
> >>>>>>>>> Hi all,
> >>>>>>>>>
> >>>>>>>>> I think the problems now are the following:
> >>>>>>>>> 1. The AllCache and PartialCache interfaces are non-uniform: one
> >>>>>>>>> needs to provide a LookupProvider, the other a CacheBuilder.
> >>>>>>>>> 2. The AllCache definition is not flexible. For example,
> >>>>>>>>> PartialCache can use any custom storage, while AllCache cannot;
> >>>>>>>>> AllCache could also store to memory or disk, so it also needs a
> >>>>>>>>> flexible strategy.
> >>>>>>>>> 3. AllCache cannot customize the ReloadStrategy; currently there
> >>>>>>>>> is only ScheduledReloadStrategy.
> >>>>>>>>>
> >>>>>>>>> In order to solve the above problems, the following are my ideas.
> >>>>>>>>>
> >>>>>>>>> ## Top level cache interfaces:
> >>>>>>>>>
> >>>>>>>>> ```
> >>>>>>>>>
> >>>>>>>>> public interface CacheLookupProvider extends
> >>>>>>>>> LookupTableSource.LookupRuntimeProvider {
> >>>>>>>>>
> >>>>>>>>>   CacheBuilder createCacheBuilder();
> >>>>>>>>> }
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> public interface CacheBuilder {
> >>>>>>>>>   Cache create();
> >>>>>>>>> }
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> public interface Cache {
> >>>>>>>>>
> >>>>>>>>>   /**
> >>>>>>>>>    * Returns the value associated with key in this cache, or null
> >>>>>> if
> >>>>>>>>> there is no cached value for
> >>>>>>>>>    * key.
> >>>>>>>>>    */
> >>>>>>>>>   @Nullable
> >>>>>>>>>   Collection<RowData> getIfPresent(RowData key);
> >>>>>>>>>
> >>>>>>>>>   /** Returns the number of key-value mappings in the cache. */
> >>>>>>>>>   long size();
> >>>>>>>>> }
> >>>>>>>>>
> >>>>>>>>> ```
> >>>>>>>>>
> >>>>>>>>> ## Partial cache
> >>>>>>>>>
> >>>>>>>>> ```
> >>>>>>>>>
> >>>>>>>>> public interface PartialCacheLookupFunction extends
> >>>>>>> CacheLookupProvider {
> >>>>>>>>>
> >>>>>>>>>   @Override
> >>>>>>>>>   PartialCacheBuilder createCacheBuilder();
> >>>>>>>>>
> >>>>>>>>> /** Creates an {@link LookupFunction} instance. */
> >>>>>>>>> LookupFunction createLookupFunction();
> >>>>>>>>> }
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> public interface PartialCacheBuilder extends CacheBuilder {
> >>>>>>>>>
> >>>>>>>>>   PartialCache create();
> >>>>>>>>> }
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> public interface PartialCache extends Cache {
> >>>>>>>>>
> >>>>>>>>>   /**
> >>>>>>>>>    * Associates the specified value rows with the specified key
> >>> row
> >>>>>>>>> in the cache. If the cache
> >>>>>>>>>    * previously contained value associated with the key, the old
> >>>>>>>>> value is replaced by the
> >>>>>>>>>    * specified value.
> >>>>>>>>>    *
> >>>>>>>>>    * @return the previous value rows associated with key, or null
> >>>>>> if
> >>>>>>>>> there was no mapping for key.
> >>>>>>>>>    * @param key - key row with which the specified value is to be
> >>>>>>>>> associated
> >>>>>>>>>    * @param value – value rows to be associated with the specified
> >>>>>>> key
> >>>>>>>>>    */
> >>>>>>>>>   Collection<RowData> put(RowData key, Collection<RowData> value);
> >>>>>>>>>
> >>>>>>>>>   /** Discards any cached value for the specified key. */
> >>>>>>>>>   void invalidate(RowData key);
> >>>>>>>>> }
> >>>>>>>>>
> >>>>>>>>> ```
> >>>>>>>>>
> >>>>>>>>> ## All cache
> >>>>>>>>> ```
> >>>>>>>>>
> >>>>>>>>> public interface AllCacheLookupProvider extends
> >>> CacheLookupProvider {
> >>>>>>>>>
> >>>>>>>>>   void registerReloadStrategy(ScheduledExecutorService
> >>>>>>>>> executorService, Reloader reloader);
> >>>>>>>>>
> >>>>>>>>>   ScanTableSource.ScanRuntimeProvider getScanRuntimeProvider();
> >>>>>>>>>
> >>>>>>>>>   @Override
> >>>>>>>>>   AllCacheBuilder createCacheBuilder();
> >>>>>>>>> }
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> public interface AllCacheBuilder extends CacheBuilder {
> >>>>>>>>>
> >>>>>>>>>   AllCache create();
> >>>>>>>>> }
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> public interface AllCache extends Cache {
> >>>>>>>>>
> >>>>>>>>>   void putAll(Iterator<Map<RowData, RowData>> allEntries);
> >>>>>>>>>
> >>>>>>>>>   void clearAll();
> >>>>>>>>> }
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> public interface Reloader {
> >>>>>>>>>
> >>>>>>>>>   void reload();
> >>>>>>>>> }
> >>>>>>>>>
> >>>>>>>>> ```
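The registerReloadStrategy / Reloader wiring sketched above can be exercised with a small runnable example. The Reloader interface is reduced to its single reload() method as in the listing; the scheduling shown here is one possible strategy (fixed rate), not a definitive implementation.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Sketch: a periodic reload strategy driving a Reloader via an executor. */
public class ReloadSchedulingSketch {

    interface Reloader {
        void reload();
    }

    /** Schedule reload() at a fixed rate; returns true once 'times' reloads ran. */
    static boolean runReloads(int times, long periodMillis) throws InterruptedException {
        CountDownLatch done = new CountDownLatch(times);
        Reloader reloader = done::countDown; // stands in for "reload the AllCache"
        ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor();
        try {
            executor.scheduleAtFixedRate(
                    reloader::reload, 0, periodMillis, TimeUnit.MILLISECONDS);
            return done.await(10, TimeUnit.SECONDS);
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("3 reloads completed: " + runReloads(3, 10));
    }
}
```

A custom ReloadStrategy (e.g. one triggered by a ZooKeeper watcher, as mentioned later in the thread) would call the same Reloader from its own event source instead of a timer.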
> >>>>>>>>>
> >>>>>>>>> Best,
> >>>>>>>>> Jingsong
> >>>>>>>>>
> >>>>>>>>> On Fri, May 27, 2022 at 11:10 AM Jingsong Li
> >>>>>>>>> <jingsonglee0@gmail.com> wrote:
> >>>>>>>>>
> >>>>>>>>>> Thanks Qingsheng and all for your discussion.
> >>>>>>>>>>
> >>>>>>>>>> Very sorry to jump in so late.
> >>>>>>>>>>
> >>>>>>>>>> Maybe I missed something?
> >>>>>>>>>> My first impression when I saw the cache interface was: why don't
> >>>>>>>>>> we provide an interface similar to the Guava cache [1]? On top of
> >>>>>>>>>> the Guava cache, Caffeine also adds extensions for asynchronous
> >>>>>>>>>> calls [2], and there is bulk loading in Caffeine too.
> >>>>>>>>>>
> >>>>>>>>>> I am also confused why we go from LookupCacheFactory.Builder to
> >>>>>>>>>> the Factory first and only then create the Cache.
> >>>>>>>>>>
> >>>>>>>>>> [1] https://github.com/google/guava
> >>>>>>>>>> [2] https://github.com/ben-manes/caffeine/wiki/Population
> >>>>>>>>>>
> >>>>>>>>>> Best,
> >>>>>>>>>> Jingsong
> >>>>>>>>>>
> >>>>>>>>>> On Thu, May 26, 2022 at 11:17 PM Jark Wu <im...@gmail.com> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> After looking at the newly introduced ReloadTime and Becket's
> >>>>>>>>>>> comment, I agree with Becket that we should have a pluggable
> >>>>>>>>>>> reloading strategy. We can provide some common implementations,
> >>>>>>>>>>> e.g., periodic reloading and daily reloading. But there will
> >>>>>>>>>>> definitely be some connector- or business-specific reloading
> >>>>>>>>>>> strategies, e.g. notify by a ZooKeeper watcher, or reload once a
> >>>>>>>>>>> new Hive partition is complete.
> >>>>>>>>>>>
> >>>>>>>>>>> Best,
> >>>>>>>>>>> Jark
> >>>>>>>>>>>
> >>>>>>>>>>> On Thu, 26 May 2022 at 11:52, Becket Qin <be...@gmail.com>
> >>>>>>>>>>> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>> Hi Qingsheng,
> >>>>>>>>>>>>
> >>>>>>>>>>>> Thanks for updating the FLIP. A few comments / questions below:
> >>>>>>>>>>>>
> >>>>>>>>>>>> 1. Is there a reason that we have both "XXXFactory" and
> >>>>>>>>>>>> "XXXProvider"? What is the difference between them? If they are
> >>>>>>>>>>>> the same, can we just use XXXFactory everywhere?
> >>>>>>>>>>>>
> >>>>>>>>>>>> 2. Regarding the FullCachingLookupProvider, should the reloading
> >>>>>>>>>>>> policy also be pluggable? Periodic reloading can sometimes be
> >>>>>>>>>>>> tricky in practice. For example, if a user uses 24 hours as the
> >>>>>>>>>>>> cache refresh interval and some nightly batch job is delayed, the
> >>>>>>>>>>>> cache update may still see the stale data.
> >>>>>>>>>>>>
> >>>>>>>>>>>> 3. In DefaultLookupCacheFactory, it looks like InitialCapacity
> >>>>>>>>>>>> should be removed.
> >>>>>>>>>>>>
> >>>>>>>>>>>> 4. The purpose of LookupFunctionProvider#cacheMissingKey() seems
> >>>>>>>>>>>> a little confusing to me. If Optional<LookupCacheFactory>
> >>>>>>>>>>>> getCacheFactory() returns a non-empty factory, doesn't that
> >>>>>>>>>>>> already indicate to the framework to cache the missing keys?
> >>>>>>>>>>>> Also, why is this method returning an Optional<Boolean> instead
> >>>>>>>>>>>> of a boolean?
> >>>>>>>>>>>>
> >>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>
> >>>>>>>>>>>> Jiangjie (Becket) Qin
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Wed, May 25, 2022 at 5:07 PM Qingsheng Ren
> >>>>>>>>>>>> <renqschn@gmail.com> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>>> Hi Lincoln and Jark,
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Thanks for the comments! If the community reaches a consensus
> >>>>>>>>>>>>> that we use SQL hints instead of table options to decide
> >>>>>>>>>>>>> whether to use sync or async mode, it’s indeed not necessary to
> >>>>>>>>>>>>> introduce the “lookup.async” option.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> I think it’s a good idea to let the decision about async mode
> >>>>>>>>>>>>> be made at the query level, which could enable better
> >>>>>>>>>>>>> optimization with more information gathered by the planner. Is
> >>>>>>>>>>>>> there any FLIP describing the issue in FLINK-27625? I thought
> >>>>>>>>>>>>> FLIP-234 was only proposing adding a SQL hint for retry on
> >>>>>>>>>>>>> missing, rather than having the entire async mode controlled by
> >>>>>>>>>>>>> hints.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Best regards,
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Qingsheng
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> On May 25, 2022, at 15:13, Lincoln Lee
> >>>>>>>>>>>>>> <lincoln.86xy@gmail.com> wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Hi Jark,
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Thanks for your reply!
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Currently 'lookup.async' just lies in the HBase connector. I
> >>>>>>>>>>>>>> have no idea whether or when to remove it (we can discuss it
> >>>>>>>>>>>>>> in another issue for the HBase connector after FLINK-27625 is
> >>>>>>>>>>>>>> done); let's just not add it as a common option now.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Best,
> >>>>>>>>>>>>>> Lincoln Lee
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>> Jark Wu <im...@gmail.com> wrote on Tue, 24 May 2022 at 20:14:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Hi Lincoln,
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> I have taken a look at FLIP-234, and I agree with you that
> >>>>>> the
> >>>>>>>>>>>>> connectors
> >>>>>>>>>>>>>>> can
> >>>>>>>>>>>>>>> provide both async and sync runtime providers simultaneously
> >>>>>>>>> instead
> >>>>>>>>>>>>> of one
> >>>>>>>>>>>>>>> of them.
> >>>>>>>>>>>>>>> At that point, "lookup.async" looks redundant. If this
> >>>>>> option
> >>>>>>> is
> >>>>>>>>>>>>> planned to
> >>>>>>>>>>>>>>> be removed
> >>>>>>>>>>>>>>> in the long term, I think it makes sense not to introduce it
> >>>>>>> in
> >>>>>>>>> this
> >>>>>>>>>>>>> FLIP.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Best,
> >>>>>>>>>>>>>>> Jark
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> On Tue, 24 May 2022 at 11:08, Lincoln Lee <
> >>>>>>>> lincoln.86xy@gmail.com
> >>>>>>>>>>
> >>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Hi Qingsheng,
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Sorry for jumping into the discussion so late. It's a good
> >>>>>>> idea
> >>>>>>>>>>> that
> >>>>>>>>>>>>> we
> >>>>>>>>>>>>>>> can
> >>>>>>>>>>>>>>>> have a common table option. I have a minor comment on
> >>>>>>>>>>> 'lookup.async'
> >>>>>>>>>>>>>>> that
> >>>>>>>>>>>>>>>> not make it a common option:
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> The table layer abstracts both sync and async lookup
> >>>>>>>>> capabilities,
> >>>>>>>>>>>>>>>> connector implementers can choose one or both; in the case
> >>>>>>> of
> >>>>>>>>>>>>>>> implementing
> >>>>>>>>>>>>>>>> only one capability (the status of most existing builtin
> >>>>>>>>>>> connectors)
> >>>>>>>>>>>>>>>> 'lookup.async' will not be used.  And when a connector has
> >>>>>>> both
> >>>>>>>>>>>>>>>> capabilities, I think this choice is more suitable for
> >>>>>> making
> >>>>>>>>>>>>> decisions
> >>>>>>>>>>>>>>> at
> >>>>>>>>>>>>>>>> the query level, for example, table planner can choose the
> >>>>>>>>> physical
> >>>>>>>>>>>>>>>> implementation of async lookup or sync lookup based on its
> >>>>>>> cost
> >>>>>>>>>>>>> model, or
> >>>>>>>>>>>>>>>> users can give a query hint based on their own better
> >>>>>>>>>>> understanding.  If
> >>>>>>>>>>>>>>>> there is another common table option 'lookup.async', it may
> >>>>>>>>> confuse
> >>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>> users in the long run.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> So, I prefer to leave the 'lookup.async' option in a private
> >>>>>>>> place
> >>>>>>>>>>> (for
> >>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>> current hbase connector) and not turn it into a common
> >>>>>>> option.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> WDYT?
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Best,
> >>>>>>>>>>>>>>>> Lincoln Lee
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Qingsheng Ren <re...@gmail.com> wrote on Mon, 23 May 2022 at 14:54:
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Hi Alexander,
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Thanks for the review! We recently updated the FLIP and
> >>>>>> you
> >>>>>>>> can
> >>>>>>>>>>> find
> >>>>>>>>>>>>>>>> those
> >>>>>>>>>>>>>>>>> changes from my latest email. Since some terminology has
> >>>>>>>>>>> changed,
> >>>>>>>>>>>>>> I’ll
> >>>>>>>>>>>>>>>>> use the new terms when replying to your comments.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> 1. Builder vs ‘of’
> >>>>>>>>>>>>>>>>> I’m OK to use builder pattern if we have additional
> >>>>>> optional
> >>>>>>>>>>>>> parameters
> >>>>>>>>>>>>>>>>> for full caching mode (“rescan” previously). The
> >>>>>>>>>>> schedule-with-delay
> >>>>>>>>>>>>>>> idea
> >>>>>>>>>>>>>>>>> looks reasonable to me, but I think we need to redesign
> >>>>>> the
> >>>>>>>>>>> builder
> >>>>>>>>>>>>> API
> >>>>>>>>>>>>>>>> of
> >>>>>>>>>>>>>>>>> full caching to make it more descriptive for developers.
> >>>>>>> Would
> >>>>>>>>> you
> >>>>>>>>>>>>> mind
> >>>>>>>>>>>>>>>>> sharing your ideas about the API? For accessing the FLIP
> >>>>>>>>> workspace
> >>>>>>>>>>>>> you
> >>>>>>>>>>>>>>>> can
> >>>>>>>>>>>>>>>>> just provide your account ID and ping any PMC member
> >>>>>>> including
> >>>>>>>>>>> Jark.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> 2. Common table options
> >>>>>>>>>>>>>>>>> We have some discussions these days and propose to
> >>>>>>> introduce 8
> >>>>>>>>>>> common
> >>>>>>>>>>>>>>>>> table options about caching. It has been updated on the
> >>>>>>> FLIP.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> 3. Retries
> >>>>>>>>>>>>>>>>> I think we are on the same page :-)
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> For your additional concerns:
> >>>>>>>>>>>>>>>>> 1) The table option has been updated.
> >>>>>>>>>>>>>>>>> 2) We got “lookup.cache” back for configuring whether to
> >>>>>> use
> >>>>>>>>>>> partial
> >>>>>>>>>>>>> or
> >>>>>>>>>>>>>>>>> full caching mode.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Best regards,
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Qingsheng
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> On May 19, 2022, at 17:25, Александр Смирнов <
> >>>>>>>>>>> smiralexan@gmail.com>
> >>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Also I have a few additions:
> >>>>>>>>>>>>>>>>>> 1) maybe rename 'lookup.cache.maximum-size' to
> >>>>>>>>>>>>>>>>>> 'lookup.cache.max-rows'? I think it will be more clear
> >>>>>> that
> >>>>>>>> we
> >>>>>>>>>>> talk
> >>>>>>>>>>>>>>>>>> not about bytes, but about the number of rows. Plus it
> >>>>>> fits
> >>>>>>>>> more,
> >>>>>>>>>>>>>>>>>> considering my optimization with filters.
> >>>>>>>>>>>>>>>>>> 2) How will users enable rescanning? Are we going to
> >>>>>>> separate
> >>>>>>>>>>>>> caching
> >>>>>>>>>>>>>>>>>> and rescanning from the options point of view? Like
> >>>>>>> initially
> >>>>>>>>> we
> >>>>>>>>>>> had
> >>>>>>>>>>>>>>>>>> one option 'lookup.cache' with values LRU / ALL. I think
> >>>>>>> now
> >>>>>>>> we
> >>>>>>>>>>> can
> >>>>>>>>>>>>>>>>>> make a boolean option 'lookup.rescan'. RescanInterval can
> >>>>>>> be
> >>>>>>>>>>>>>>>>>> 'lookup.rescan.interval', etc.
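[The 'lookup.cache.max-rows' rename discussed above is about bounding the cache by row count rather than by bytes. A minimal sketch of a row-count-bounded LRU cache in plain Java; the class name is hypothetical and this is not Flink's actual cache implementation:]

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: a lookup cache bounded by number of rows ('max-rows'),
// evicting the least recently used entry once the bound is exceeded.
class MaxRowsLruCache<K, V> extends LinkedHashMap<K, V> {
    private final int maxRows;

    MaxRowsLruCache(int maxRows) {
        super(16, 0.75f, true); // access-order iteration = LRU eviction order
        this.maxRows = maxRows;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // Evict the least recently used row when the row count is exceeded.
        return size() > maxRows;
    }
}
```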
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Best regards,
> >>>>>>>>>>>>>>>>>> Alexander
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Thu, 19 May 2022 at 14:50, Александр Смирнов <
> >>>>>>>>>>> smiralexan@gmail.com
> >>>>>>>>>>>>>>>> :
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Hi Qingsheng and Jark,
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> 1. Builders vs 'of'
> >>>>>>>>>>>>>>>>>>> I understand that builders are used when we have
> >>>>>> multiple
> >>>>>>>>>>>>>>> parameters.
> >>>>>>>>>>>>>>>>>>> I suggested them because we could add parameters later.
> >>>>>> To
> >>>>>>>>>>> prevent
> >>>>>>>>>>>>>>>>>>> Builder for ScanRuntimeProvider from looking redundant I
> >>>>>>> can
> >>>>>>>>>>>>> suggest
> >>>>>>>>>>>>>>>>>>> one more config now - "rescanStartTime".
> >>>>>>>>>>>>>>>>>>> It's a time in UTC (LocalTime class) when the first
> >>>>>> reload
> >>>>>>>> of
> >>>>>>>>>>> cache
> >>>>>>>>>>>>>>>>>>> starts. This parameter can be thought of as
> >>>>>> 'initialDelay'
> >>>>>>>>> (diff
> >>>>>>>>>>>>>>>>>>> between current time and rescanStartTime) in method
> >>>>>>>>>>>>>>>>>>> ScheduledExecutorService#scheduleWithFixedDelay [1]. It
> >>>>>>> can
> >>>>>>>> be
> >>>>>>>>>>> very
> >>>>>>>>>>>>>>>>>>> useful when the dimension table is updated by some other
> >>>>>>>>>>> scheduled
> >>>>>>>>>>>>>>> job
> >>>>>>>>>>>>>>>>>>> at a certain time. Or when the user simply wants a
> >>>>>> second
> >>>>>>>> scan
> >>>>>>>>>>>>>>> (first
> >>>>>>>>>>>>>>>>>>> cache reload) to be delayed. This option can be used even
> >>>>>>>> without
> >>>>>>>>>>>>>>>>>>> 'rescanInterval' - in this case 'rescanInterval' will be
> >>>>>>> one
> >>>>>>>>>>> day.
> >>>>>>>>>>>>>>>>>>> If you are fine with this option, I would be very glad
> >>>>>> if
> >>>>>>>> you
> >>>>>>>>>>> would
> >>>>>>>>>>>>>>>>>>> give me access to edit FLIP page, so I could add it
> >>>>>> myself
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> 2. Common table options
> >>>>>>>>>>>>>>>>>>> I also think that FactoryUtil would be overloaded by all
> >>>>>>>> cache
> >>>>>>>>>>>>>>>>>>> options. But maybe unify all suggested options, not only
> >>>>>>> for
> >>>>>>>>>>>>> default
> >>>>>>>>>>>>>>>>>>> cache? I.e. class 'LookupOptions', that unifies default
> >>>>>>>> cache
> >>>>>>>>>>>>>>> options,
> >>>>>>>>>>>>>>>>>>> rescan options, 'async', 'maxRetries'. WDYT?
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> 3. Retries
> >>>>>>>>>>>>>>>>>>> I'm fine with a suggestion close to
> >>>>>>> RetryUtils#tryTimes(times,
> >>>>>>>>>>> call)
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> [1]
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>
> >>> https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/ScheduledExecutorService.html#scheduleWithFixedDelay-java.lang.Runnable-long-long-java.util.concurrent.TimeUnit-
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Best regards,
> >>>>>>>>>>>>>>>>>>> Alexander
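[The 'rescanStartTime' idea above maps onto the initialDelay argument of ScheduledExecutorService#scheduleWithFixedDelay. A sketch of computing that delay from a configured UTC LocalTime; the class and method names are illustrative, assuming the first reload should happen at the next occurrence of the configured time:]

```java
import java.time.Duration;
import java.time.LocalTime;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;

// Hypothetical helper: turn a configured 'rescanStartTime' (a LocalTime in UTC)
// into the initial delay before the first cache reload.
class RescanDelay {
    static Duration initialDelay(LocalTime rescanStartTime, ZonedDateTime now) {
        ZonedDateTime nowUtc = now.withZoneSameInstant(ZoneOffset.UTC);
        ZonedDateTime firstReload = nowUtc.with(rescanStartTime);
        if (!firstReload.isAfter(nowUtc)) {
            // The configured time already passed today; schedule for tomorrow.
            firstReload = firstReload.plusDays(1);
        }
        return Duration.between(nowUtc, firstReload);
    }
}
```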
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Wed, 18 May 2022 at 16:04, Qingsheng Ren <
> >>>>>>>> renqschn@gmail.com
> >>>>>>>>>> :
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> Hi Jark and Alexander,
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> Thanks for your comments! I’m also OK to introduce
> >>>>>> common
> >>>>>>>>> table
> >>>>>>>>>>>>>>>>> options. I prefer to introduce a new
> >>>>>>> DefaultLookupCacheOptions
> >>>>>>>>>>> class
> >>>>>>>>>>>>>>> for
> >>>>>>>>>>>>>>>>> holding these option definitions because putting all
> >>>>>> options
> >>>>>>>>> into
> >>>>>>>>>>>>>>>>> FactoryUtil would make it a bit ”crowded” and not well
> >>>>>>>>>>> categorized.
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> FLIP has been updated according to suggestions above:
> >>>>>>>>>>>>>>>>>>>> 1. Use static “of” method for constructing
> >>>>>>>>>>> RescanRuntimeProvider
> >>>>>>>>>>>>>>>>> considering both arguments are required.
> >>>>>>>>>>>>>>>>>>>> 2. Introduce new table options matching
> >>>>>>>>>>> DefaultLookupCacheFactory
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> Best,
> >>>>>>>>>>>>>>>>>>>> Qingsheng
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> On Wed, May 18, 2022 at 2:57 PM Jark Wu <
> >>>>>>> imjark@gmail.com>
> >>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> Hi Alex,
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> 1) retry logic
> >>>>>>>>>>>>>>>>>>>>> I think we can extract some common retry logic into
> >>>>>>>>> utilities,
> >>>>>>>>>>>>>>> e.g.
> >>>>>>>>>>>>>>>>> RetryUtils#tryTimes(times, call).
> >>>>>>>>>>>>>>>>>>>>> This seems independent of this FLIP and can be reused
> >>>>>> by
> >>>>>>>>>>>>>>> DataStream
> >>>>>>>>>>>>>>>>> users.
> >>>>>>>>>>>>>>>>>>>>> Maybe we can open an issue to discuss this and where
> >>>>>> to
> >>>>>>>> put
> >>>>>>>>>>> it.
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> 2) cache ConfigOptions
> >>>>>>>>>>>>>>>>>>>>> I'm fine with defining cache config options in the
> >>>>>>>>> framework.
> >>>>>>>>>>>>>>>>>>>>> A candidate place to put is FactoryUtil which also
> >>>>>>>> includes
> >>>>>>>>>>>>>>>>> "sink.parallelism", "format" options.
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> Best,
> >>>>>>>>>>>>>>>>>>>>> Jark
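[The RetryUtils#tryTimes(times, call) utility mentioned above could look roughly like this sketch; a production version would likely retry only on retriable exceptions and allow a hook (such as re-establishing a connection) between attempts, which is exactly what the follow-up issue would need to settle:]

```java
import java.util.concurrent.Callable;

// Sketch of a generic retry helper: try the call up to 'times' times,
// rethrowing the last failure if every attempt fails.
class RetryUtils {
    static <T> T tryTimes(int times, Callable<T> call) {
        Exception last = null;
        for (int attempt = 1; attempt <= times; attempt++) {
            try {
                return call.call();
            } catch (Exception e) {
                last = e; // remember the failure and try again
            }
        }
        throw new RuntimeException("All " + times + " attempts failed", last);
    }
}
```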
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> On Wed, 18 May 2022 at 13:52, Александр Смирнов <
> >>>>>>>>>>>>>>>> smiralexan@gmail.com>
> >>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> Hi Qingsheng,
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> Thank you for considering my comments.
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> there might be custom logic before making retry,
> >>>>>> such
> >>>>>>> as
> >>>>>>>>>>>>>>>>> re-establish the connection
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> Yes, I understand that. I meant that such logic can
> >>>>>> be
> >>>>>>>>>>> placed in
> >>>>>>>>>>>>>>> a
> >>>>>>>>>>>>>>>>>>>>>> separate function, that can be implemented by
> >>>>>>> connectors.
> >>>>>>>>>>> Just
> >>>>>>>>>>>>>>>> moving
> >>>>>>>>>>>>>>>>>>>>>> the retry logic would make connector's LookupFunction
> >>>>>>>> more
> >>>>>>>>>>>>>>> concise
> >>>>>>>>>>>>>>>> +
> >>>>>>>>>>>>>>>>>>>>>> avoid duplicate code. However, it's a minor change.
> >>>>>> The
> >>>>>>>>>>> decision
> >>>>>>>>>>>>>>> is
> >>>>>>>>>>>>>>>>> up
> >>>>>>>>>>>>>>>>>>>>>> to you.
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> We decide not to provide common DDL options and let
> >>>>>>>>>>> developers
> >>>>>>>>>>>>>>> to
> >>>>>>>>>>>>>>>>> define their own options as we do now per connector.
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> What is the reason for that? One of the main goals of
> >>>>>>>> this
> >>>>>>>>>>> FLIP
> >>>>>>>>>>>>>>> was
> >>>>>>>>>>>>>>>>> to
> >>>>>>>>>>>>>>>>>>>>>> unify the configs, wasn't it? I understand that
> >>>>>> current
> >>>>>>>>> cache
> >>>>>>>>>>>>>>>> design
> >>>>>>>>>>>>>>>>>>>>>> doesn't depend on ConfigOptions, like was before. But
> >>>>>>>> still
> >>>>>>>>>>> we
> >>>>>>>>>>>>>>> can
> >>>>>>>>>>>>>>>>> put
> >>>>>>>>>>>>>>>>>>>>>> these options into the framework, so connectors can
> >>>>>>> reuse
> >>>>>>>>>>> them
> >>>>>>>>>>>>>>> and
> >>>>>>>>>>>>>>>>>>>>>> avoid code duplication, and, what is more
> >>>>>> significant,
> >>>>>>>>> avoid
> >>>>>>>>>>>>>>>> possible
> >>>>>>>>>>>>>>>>>>>>>> different options naming. This moment can be pointed
> >>>>>>> out
> >>>>>>>> in
> >>>>>>>>>>>>>>>>>>>>>> documentation for connector developers.
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> Best regards,
> >>>>>>>>>>>>>>>>>>>>>> Alexander
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> Tue, 17 May 2022 at 17:11, Qingsheng Ren <
> >>>>>>>>>>> renqschn@gmail.com>:
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> Hi Alexander,
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> Thanks for the review and glad to see we are on the
> >>>>>>> same
> >>>>>>>>>>> page!
> >>>>>>>>>>>>> I
> >>>>>>>>>>>>>>>>> think you forgot to cc the dev mailing list so I’m also
> >>>>>>>> quoting
> >>>>>>>>>>> your
> >>>>>>>>>>>>>>>> reply
> >>>>>>>>>>>>>>>>> under this email.
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> We can add 'maxRetryTimes' option into this class
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> In my opinion the retry logic should be implemented
> >>>>>> in
> >>>>>>>>>>> lookup()
> >>>>>>>>>>>>>>>>> instead of in LookupFunction#eval(). Retrying is only
> >>>>>>>> meaningful
> >>>>>>>>>>>>> under
> >>>>>>>>>>>>>>>> some
> >>>>>>>>>>>>>>>>> specific retriable failures, and there might be custom
> >>>>>> logic
> >>>>>>>>>>> before
> >>>>>>>>>>>>>>>> making
> >>>>>>>>>>>>>>>>> retry, such as re-establish the connection
> >>>>>>>>>>> (JdbcRowDataLookupFunction
> >>>>>>>>>>>>>>> is
> >>>>>>>>>>>>>>>> an
> >>>>>>>>>>>>>>>>> example), so it's more handy to leave it to the connector.
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> I don't see DDL options, that were in previous
> >>>>>>> version
> >>>>>>>> of
> >>>>>>>>>>>>> FLIP.
> >>>>>>>>>>>>>>>> Do
> >>>>>>>>>>>>>>>>> you have any special plans for them?
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> We decide not to provide common DDL options and let
> >>>>>>>>>>> developers
> >>>>>>>>>>>>>>> to
> >>>>>>>>>>>>>>>>> define their own options as we do now per connector.
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> The rest of the comments sound great and I’ll update the
> >>>>>>>> FLIP.
> >>>>>>>>>>> Hope
> >>>>>>>>>>>>>>> we
> >>>>>>>>>>>>>>>>> can finalize our proposal soon!
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> Best,
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> Qingsheng
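[The split described above, with retry living in the connector's lookup() rather than in LookupFunction#eval(), can be sketched as follows. These are simplified stand-in types; Flink's real LookupFunction works on RowData, and String keys/rows are used here only to keep the sketch self-contained:]

```java
import java.util.Collection;
import java.util.Collections;

// Simplified stand-in: eval() is what the framework calls, while lookup()
// holds the connector-specific logic, including any retry / re-connect
// handling the connector chooses to implement.
abstract class LookupFunctionSketch {
    public final Collection<String> eval(String key) {
        try {
            return lookup(key);
        } catch (Exception e) {
            throw new RuntimeException("Failed to look up key " + key, e);
        }
    }

    // Connectors implement this and decide themselves what is retriable.
    protected abstract Collection<String> lookup(String key) throws Exception;
}
```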
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> On May 17, 2022, at 13:46, Александр Смирнов <
> >>>>>>>>>>>>>>>> smiralexan@gmail.com>
> >>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> Hi Qingsheng and devs!
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> I like the overall design of updated FLIP, however
> >>>>>> I
> >>>>>>>> have
> >>>>>>>>>>>>>>> several
> >>>>>>>>>>>>>>>>>>>>>>>> suggestions and questions.
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> 1) Introducing LookupFunction as a subclass of
> >>>>>>>>>>> TableFunction
> >>>>>>>>>>>>>>> is a
> >>>>>>>>>>>>>>>>> good
> >>>>>>>>>>>>>>>>>>>>>>>> idea. We can add 'maxRetryTimes' option into this
> >>>>>>>> class.
> >>>>>>>>>>>>> 'eval'
> >>>>>>>>>>>>>>>>> method
> >>>>>>>>>>>>>>>>>>>>>>>> of new LookupFunction is great for this purpose.
> >>>>>> The
> >>>>>>>> same
> >>>>>>>>>>> is
> >>>>>>>>>>>>>>> for
> >>>>>>>>>>>>>>>>>>>>>>>> 'async' case.
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> 2) There might be other configs in future, such as
> >>>>>>>>>>>>>>>>> 'cacheMissingKey'
> >>>>>>>>>>>>>>>>>>>>>>>> in LookupFunctionProvider or 'rescanInterval' in
> >>>>>>>>>>>>>>>>> ScanRuntimeProvider.
> >>>>>>>>>>>>>>>>>>>>>>>> Maybe use Builder pattern in LookupFunctionProvider
> >>>>>>> and
> >>>>>>>>>>>>>>>>>>>>>>>> RescanRuntimeProvider for more flexibility (use one
> >>>>>>>>> 'build'
> >>>>>>>>>>>>>>>> method
> >>>>>>>>>>>>>>>>>>>>>>>> instead of many 'of' methods in future)?
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> 3) What are the plans for existing
> >>>>>>>> TableFunctionProvider
> >>>>>>>>>>> and
> >>>>>>>>>>>>>>>>>>>>>>>> AsyncTableFunctionProvider? I think they should be
> >>>>>>>>>>> deprecated.
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> 4) Am I right that the current design does not
> >>>>>> assume
> >>>>>>>>>>> usage of
> >>>>>>>>>>>>>>>>>>>>>>>> user-provided LookupCache in re-scanning? In this
> >>>>>>> case,
> >>>>>>>>> it
> >>>>>>>>>>> is
> >>>>>>>>>>>>>>> not
> >>>>>>>>>>>>>>>>> very
> >>>>>>>>>>>>>>>>>>>>>>>> clear why do we need methods such as 'invalidate'
> >>>>>> or
> >>>>>>>>>>> 'putAll'
> >>>>>>>>>>>>>>> in
> >>>>>>>>>>>>>>>>>>>>>>>> LookupCache.
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> 5) I don't see DDL options, that were in previous
> >>>>>>>> version
> >>>>>>>>>>> of
> >>>>>>>>>>>>>>>> FLIP.
> >>>>>>>>>>>>>>>>> Do
> >>>>>>>>>>>>>>>>>>>>>>>> you have any special plans for them?
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> If you don't mind, I would be glad to be able to
> >>>>>> make
> >>>>>>>>> small
> >>>>>>>>>>>>>>>>>>>>>>>> adjustments to the FLIP document too. I think it's
> >>>>>>>> worth
> >>>>>>>>>>>>>>>> mentioning
> >>>>>>>>>>>>>>>>>>>>>>>> which optimizations exactly are planned in
> >>>>>> the
> >>>>>>>>>>> future.
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> Best regards,
> >>>>>>>>>>>>>>>>>>>>>>>> Smirnov Alexander
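[The builder idea from point 2 above might look like this minimal sketch. The class name, option names, and defaults are all invented for illustration; the point is only that optional parameters such as cacheMissingKey or a rescan interval can be added later without piling up 'of(...)' overloads:]

```java
// Hypothetical builder-style provider, not Flink's actual API.
class LookupProviderSketch {
    final boolean cacheMissingKey;
    final long rescanIntervalMillis;

    private LookupProviderSketch(Builder b) {
        this.cacheMissingKey = b.cacheMissingKey;
        this.rescanIntervalMillis = b.rescanIntervalMillis;
    }

    static Builder builder() {
        return new Builder();
    }

    static class Builder {
        private boolean cacheMissingKey = true;                   // assumed default
        private long rescanIntervalMillis = 24 * 60 * 60 * 1000L; // one day, as in the thread

        Builder cacheMissingKey(boolean cacheMissingKey) {
            this.cacheMissingKey = cacheMissingKey;
            return this;
        }

        Builder rescanIntervalMillis(long rescanIntervalMillis) {
            this.rescanIntervalMillis = rescanIntervalMillis;
            return this;
        }

        LookupProviderSketch build() {
            return new LookupProviderSketch(this);
        }
    }
}
```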
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> Fri, 13 May 2022 at 20:27, Qingsheng Ren <
> >>>>>>>>>>> renqschn@gmail.com
> >>>>>>>>>>>>>>>> :
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>> Hi Alexander and devs,
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>> Thank you very much for the in-depth discussion!
> >>>>>> As
> >>>>>>>> Jark
> >>>>>>>>>>>>>>>>> mentioned we were inspired by Alexander's idea and made a
> >>>>>>>>>>> refactor on
> >>>>>>>>>>>>>>> our
> >>>>>>>>>>>>>>>>> design. FLIP-221 [1] has been updated to reflect our
> >>>>>> design
> >>>>>>>> now
> >>>>>>>>>>> and
> >>>>>>>>>>>>> we
> >>>>>>>>>>>>>>>> are
> >>>>>>>>>>>>>>>>> happy to hear more suggestions from you!
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>> Compared to the previous design:
> >>>>>>>>>>>>>>>>>>>>>>>>> 1. The lookup cache serves at table runtime level
> >>>>>>> and
> >>>>>>>> is
> >>>>>>>>>>>>>>>>> integrated as a component of LookupJoinRunner as discussed
> >>>>>>>>>>>>> previously.
> >>>>>>>>>>>>>>>>>>>>>>>>> 2. Interfaces are renamed and re-designed to
> >>>>>> reflect
> >>>>>>>> the
> >>>>>>>>>>> new
> >>>>>>>>>>>>>>>>> design.
> >>>>>>>>>>>>>>>>>>>>>>>>> 3. We separate the all-caching case individually
> >>>>>> and
> >>>>>>>>>>>>>>> introduce a
> >>>>>>>>>>>>>>>>> new RescanRuntimeProvider to reuse the ability of
> >>>>>> scanning.
> >>>>>>> We
> >>>>>>>>> are
> >>>>>>>>>>>>>>>> planning
> >>>>>>>>>>>>>>>>> to support SourceFunction / InputFormat for now
> >>>>>> considering
> >>>>>>>> the
> >>>>>>>>>>>>>>>> complexity
> >>>>>>>>>>>>>>>>> of FLIP-27 Source API.
> >>>>>>>>>>>>>>>>>>>>>>>>> 4. A new interface LookupFunction is introduced to
> >>>>>>>> make
> >>>>>>>>>>> the
> >>>>>>>>>>>>>>>>> semantic of lookup more straightforward for developers.
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>> For replying to Alexander:
> >>>>>>>>>>>>>>>>>>>>>>>>>> However I'm a little confused whether InputFormat
> >>>>>>> is
> >>>>>>>>>>>>>>> deprecated
> >>>>>>>>>>>>>>>>> or not. Am I right that it will be so in the future, but
> >>>>>>>>> currently
> >>>>>>>>>>>>> it's
> >>>>>>>>>>>>>>>> not?
> >>>>>>>>>>>>>>>>>>>>>>>>> Yes you are right. InputFormat is not deprecated
> >>>>>> for
> >>>>>>>>> now.
> >>>>>>>>>>> I
> >>>>>>>>>>>>>>>> think
> >>>>>>>>>>>>>>>>> it will be deprecated in the future but we don't have a
> >>>>>>> clear
> >>>>>>>>> plan
> >>>>>>>>>>>>> for
> >>>>>>>>>>>>>>>> that.
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>> Thanks again for the discussion on this FLIP and
> >>>>>>>> looking
> >>>>>>>>>>>>>>> forward
> >>>>>>>>>>>>>>>>> to cooperating with you after we finalize the design and
> >>>>>>>>>>> interfaces!
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>> [1]
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>
> >>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-221+Abstraction+for+lookup+source+cache+and+metric
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>> Best regards,
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>> Qingsheng
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>> On Fri, May 13, 2022 at 12:12 AM Александр
> >>>>>> Смирнов <
> >>>>>>>>>>>>>>>>> smiralexan@gmail.com> wrote:
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> Hi Jark, Qingsheng and Leonard!
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> Glad to see that we came to a consensus on almost
> >>>>>>> all
> >>>>>>>>>>>>> points!
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> However I'm a little confused whether InputFormat
> >>>>>>> is
> >>>>>>>>>>>>>>> deprecated
> >>>>>>>>>>>>>>>>> or
> >>>>>>>>>>>>>>>>>>>>>>>>>> not. Am I right that it will be so in the future,
> >>>>>>> but
> >>>>>>>>>>>>>>> currently
> >>>>>>>>>>>>>>>>> it's
> >>>>>>>>>>>>>>>>>>>>>>>>>> not? Actually I also think that for the first
> >>>>>>> version
> >>>>>>>>>>> it's
> >>>>>>>>>>>>> OK
> >>>>>>>>>>>>>>>> to
> >>>>>>>>>>>>>>>>> use
> >>>>>>>>>>>>>>>>>>>>>>>>>> InputFormat in ALL cache realization, because
> >>>>>>>>> supporting
> >>>>>>>>>>>>>>> rescan
> >>>>>>>>>>>>>>>>>>>>>>>>>> ability seems like a very distant prospect. But
> >>>>>> for
> >>>>>>>>> this
> >>>>>>>>>>>>>>>>> decision we
> >>>>>>>>>>>>>>>>>>>>>>>>>> need a consensus among all discussion
> >>>>>> participants.
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> In general, I don't have something to argue with
> >>>>>>> your
> >>>>>>>>>>>>>>>>> statements. All
> >>>>>>>>>>>>>>>>>>>>>>>>>> of them correspond my ideas. Looking ahead, it
> >>>>>>> would
> >>>>>>>> be
> >>>>>>>>>>> nice
> >>>>>>>>>>>>>>> to
> >>>>>>>>>>>>>>>>> work
> >>>>>>>>>>>>>>>>>>>>>>>>>> on this FLIP cooperatively. I've already done a
> >>>>>> lot
> >>>>>>>> of
> >>>>>>>>>>> work
> >>>>>>>>>>>>>>> on
> >>>>>>>>>>>>>>>>> lookup
> >>>>>>>>>>>>>>>>>>>>>>>>>> join caching with an implementation very close to the one
> >>>>>> one
> >>>>>>>> we
> >>>>>>>>>>> are
> >>>>>>>>>>>>>>>>> discussing,
> >>>>>>>>>>>>>>>>>>>>>>>>>> and want to share the results of this work.
> >>>>>> Anyway
> >>>>>>>>>>> looking
> >>>>>>>>>>>>>>>>> forward for
> >>>>>>>>>>>>>>>>>>>>>>>>>> the FLIP update!
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> Best regards,
> >>>>>>>>>>>>>>>>>>>>>>>>>> Smirnov Alexander
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>> Thu, 12 May 2022 at 17:38, Jark Wu <
> >>>>>>>> imjark@gmail.com
> >>>>>>>>>> :
> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>> Hi Alex,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for summarizing your points.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>> In the past week, Qingsheng, Leonard, and I have
> >>>>>>>>>>> discussed
> >>>>>>>>>>>>>>> it
> >>>>>>>>>>>>>>>>> several times
> >>>>>>>>>>>>>>>>>>>>>>>>>>> and we have totally refactored the design.
> >>>>>>>>>>>>>>>>>>>>>>>>>>> I'm glad to say we have reached a consensus on
> >>>>>>> many
> >>>>>>>> of
> >>>>>>>>>>> your
> >>>>>>>>>>>>>>>>> points!
> >>>>>>>>>>>>>>>>>>>>>>>>>>> Qingsheng is still working on updating the
> >>>>>> design
> >>>>>>>> docs
> >>>>>>>>>>> and
> >>>>>>>>>>>>>>>>> maybe can be
> >>>>>>>>>>>>>>>>>>>>>>>>>>> available in the next few days.
> >>>>>>>>>>>>>>>>>>>>>>>>>>> I will share some conclusions from our
> >>>>>>> discussions:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>> 1) we have refactored the design towards to
> >>>>>> "cache
> >>>>>>>> in
> >>>>>>>>>>>>>>>>> framework" way.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>> 2) a "LookupCache" interface for users to
> >>>>>>> customize
> >>>>>>>>> and
> >>>>>>>>>>> a
> >>>>>>>>>>>>>>>>> default
> >>>>>>>>>>>>>>>>>>>>>>>>>>> implementation with builder for users to
> >>>>>> easy-use.
> >>>>>>>>>>>>>>>>>>>>>>>>>>> This can both make it possible to both have
> >>>>>>>>> flexibility
> >>>>>>>>>>> and
> >>>>>>>>>>>>>>>>> conciseness.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>> 3) Filter pushdown is important for ALL and LRU
> >>>>>>>> lookup
> >>>>>>>>>>>>>>> cache,
> >>>>>>>>>>>>>>>>> esp reducing
> >>>>>>>>>>>>>>>>>>>>>>>>>>> IO.
> >>>>>>>>>>>>>>>>>>>>>>>>>>> Filter pushdown should be the final state and
> >>>>>> the
> >>>>>>>>>>> unified
> >>>>>>>>>>>>>>> way
> >>>>>>>>>>>>>>>>> to both
> >>>>>>>>>>>>>>>>>>>>>>>>>>> support pruning ALL cache and LRU cache,
> >>>>>>>>>>>>>>>>>>>>>>>>>>> so I think we should make effort in this
> >>>>>>> direction.
> >>>>>>>> If
> >>>>>>>>>>> we
> >>>>>>>>>>>>>>> need
> >>>>>>>>>>>>>>>>> to support
> >>>>>>>>>>>>>>>>>>>>>>>>>>> filter pushdown for ALL cache anyway, why not
> >>>>>> use
> >>>>>>>>>>>>>>>>>>>>>>>>>>> it for LRU cache as well? Either way, as we
> >>>>>> decide
> >>>>>>>> to
> >>>>>>>>>>>>>>>> implement
> >>>>>>>>>>>>>>>>> the cache
> >>>>>>>>>>>>>>>>>>>>>>>>>>> in the framework, we have the chance to support
> >>>>>>>>>>>>>>>>>>>>>>>>>>> filter on cache anytime. This is an optimization
> >>>>>>> and
> >>>>>>>>> it
> >>>>>>>>>>>>>>>> doesn't
> >>>>>>>>>>>>>>>>> affect the
> >>>>>>>>>>>>>>>>>>>>>>>>>>> public API. I think we can create a JIRA issue
> >>>>>> to
> >>>>>>>>>>>>>>>>>>>>>>>>>>> discuss it when the FLIP is accepted.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>> 4) The idea to support ALL cache is similar to
> >>>>>>> your
> >>>>>>>>>>>>>>> proposal.
> >>>>>>>>>>>>>>>>>>>>>>>>>>> In the first version, we will only support
> >>>>>>>>> InputFormat,
> >>>>>>>>>>>>>>>>> SourceFunction for
> >>>>>>>>>>>>>>>>>>>>>>>>>>> cache all (invoke InputFormat in join operator).
> >>>>>>>>>>>>>>>>>>>>>>>>>>> For FLIP-27 source, we need to join a true
> >>>>>> source
> >>>>>>>>>>> operator
> >>>>>>>>>>>>>>>>> instead of
> >>>>>>>>>>>>>>>>>>>>>>>>>>> calling it embedded in the join operator.
> >>>>>>>>>>>>>>>>>>>>>>>>>>> However, this needs another FLIP to support the
> >>>>>>>>> re-scan
> >>>>>>>>>>>>>>>> ability
> >>>>>>>>>>>>>>>>> for FLIP-27
> >>>>>>>>>>>>>>>>>>>>>>>>>>> Source, and this can be a large work.
> >>>>>>>>>>>>>>>>>>>>>>>>>>> In order to not block this issue, we can put the
> >>>>>>>>> effort
> >>>>>>>>>>> of
> >>>>>>>>>>>>>>>>> FLIP-27 source
> >>>>>>>>>>>>>>>>>>>>>>>>>>> integration into future work and integrate
> >>>>>>>>>>>>>>>>>>>>>>>>>>> InputFormat&SourceFunction for now.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>> I think it's fine to use
> >>>>>>> InputFormat&SourceFunction,
> >>>>>>>>> as
> >>>>>>>>>>>>> they
> >>>>>>>>>>>>>>>>> are not
> >>>>>>>>>>>>>>>>>>>>>>>>>>> deprecated, otherwise, we have to introduce
> >>>>>>> another
> >>>>>>>>>>>>> function
> >>>>>>>>>>>>>>>>>>>>>>>>>>> similar to them which is meaningless. We need to
> >>>>>>>> plan
> >>>>>>>>>>>>>>> FLIP-27
> >>>>>>>>>>>>>>>>> source
> >>>>>>>>>>>>>>>>>>>>>>>>>>> integration ASAP before InputFormat &
> >>>>>>> SourceFunction
> >>>>>>>>> are
> >>>>>>>>>>>>>>>>> deprecated.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>> Best,
> >>>>>>>>>>>>>>>>>>>>>>>>>>> Jark
> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>> On Thu, 12 May 2022 at 15:46, Александр Смирнов
> >>>>>> <
> >>>>>>>>>>>>>>>>> smiralexan@gmail.com>
> >>>>>>>>>>>>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi Martijn!
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> Got it. Therefore, the realization with
> >>>>>>> InputFormat
> >>>>>>>>> is
> >>>>>>>>>>> not
> >>>>>>>>>>>>>>>>> considered.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for clearing that up!
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> Best regards,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> Smirnov Alexander
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> Thu, 12 May 2022 at 14:23, Martijn Visser <
> >>>>>>>>>>>>>>>>> martijn@ververica.com>:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> With regards to:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> But if there are plans to refactor all
> >>>>>>> connectors
> >>>>>>>>> to
> >>>>>>>>>>>>>>>> FLIP-27
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> Yes, FLIP-27 is the target for all connectors.
> >>>>>>> The
> >>>>>>>>> old
> >>>>>>>>>>>>>>>>> interfaces will be
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> deprecated and connectors will either be
> >>>>>>>> refactored
> >>>>>>>>> to
> >>>>>>>>>>>>> use
> >>>>>>>>>>>>>>>>> the new ones
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> or
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> dropped.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> The caching should work for connectors that
> >>>>>> are
> >>>>>>>>> using
> >>>>>>>>>>>>>>>> FLIP-27
> >>>>>>>>>>>>>>>>> interfaces,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> we should not introduce new features for old
> >>>>>>>>>>> interfaces.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> Best regards,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> Martijn
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Thu, 12 May 2022 at 06:19, Александр
> >>>>>> Смирнов
> >>>>>>> <
> >>>>>>>>>>>>>>>>> smiralexan@gmail.com>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > Hi Jark!
> > >
> > > Sorry for the late response. I would like to make some comments and
> > > clarify my points.
> > >
> > > 1) I agree with your first statement. I think we can achieve both
> > > advantages this way: put the Cache interface in flink-table-common, but
> > > have implementations of it in flink-table-runtime. Therefore, if a
> > > connector developer wants to use existing cache strategies and their
> > > implementations, he can just pass the lookupConfig to the planner, but
> > > if he wants to have his own cache implementation in his TableFunction,
> > > it will be possible for him to use the existing interface for this
> > > purpose (we can explicitly point this out in the documentation). In
> > > this way all configs and metrics will be unified. WDYT?
> > >
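To make the proposed split concrete, here is a rough plain-Java sketch: a small cache interface that could live in flink-table-common, with a default LRU implementation that could live in flink-table-runtime. All names here (LookupCache, LruLookupCache) are invented for illustration and are not part of any Flink API.

```java
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical interface for flink-table-common: connectors that want their
// own caching could implement it; the planner-provided default lives below.
interface LookupCache<K, V> {
    Collection<V> getIfPresent(K key);   // null means "not cached yet"
    void put(K key, Collection<V> rows); // an empty collection is a valid cached miss
}

// Hypothetical default implementation for flink-table-runtime, bounded by a
// 'max-rows'-style option. accessOrder=true makes LinkedHashMap behave as LRU.
class LruLookupCache<K, V> implements LookupCache<K, V> {
    private final Map<K, Collection<V>> cache;

    LruLookupCache(int maxRows) {
        this.cache = new LinkedHashMap<K, Collection<V>>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, Collection<V>> eldest) {
                // Evict the least-recently-used entry once the bound is exceeded.
                return size() > maxRows;
            }
        };
    }

    public Collection<V> getIfPresent(K key) { return cache.get(key); }
    public void put(K key, Collection<V> rows) { cache.put(key, rows); }
    int size() { return cache.size(); }
}
```

A connector either hands its cache options to the planner (which builds something like LruLookupCache) or plugs its own LookupCache implementation into its TableFunction, so metrics and configs stay unified either way.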
> > > > If a filter can prune 90% of data in the cache, we will have 90% of
> > > > lookup requests that can never be cached
> > >
> > > 2) Let me clarify the logic of the filters optimization in the case of
> > > the LRU cache. It looks like Cache<RowData, Collection<RowData>>. Here
> > > we always store the response of the dimension table in the cache, even
> > > after applying the calc function. I.e., if there are no rows left after
> > > applying filters to the result of the 'eval' method of the
> > > TableFunction, we store an empty list under the lookup keys. The cache
> > > line will therefore still be filled, but will require much less memory
> > > (in bytes). I.e., we don't completely filter out the keys whose result
> > > was pruned, but we significantly reduce the memory required to store
> > > that result. If the user knows about this behavior, he can increase the
> > > 'max-rows' option before the start of the job. But actually I came up
> > > with the idea that we can do this automatically by using the
> > > 'maximumWeight' and 'weigher' methods of the Guava Cache [1]. The
> > > weight can be the size of the collection of rows (the cache value), so
> > > the cache can automatically fit many more records than before.
> > >
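The weigher idea can be sketched in plain Java. This is only an illustration of the weighting behavior (the class name is invented, and it is neither Flink nor Guava code); in the actual proposal Guava's CacheBuilder.maximumWeight(...).weigher(...) would provide the same semantics out of the box.

```java
import java.util.Collection;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

// Bound the cache by the TOTAL number of cached rows instead of the number of
// keys, so keys whose result was fully pruned by filters (empty collections)
// consume almost none of the budget.
class WeightedRowCache<K, V> {
    // accessOrder=true: iteration order is least-recently-used first
    private final LinkedHashMap<K, Collection<V>> cache = new LinkedHashMap<>(16, 0.75f, true);
    private final long maxWeight; // total rows allowed across all keys
    private long currentWeight = 0;

    WeightedRowCache(long maxWeight) { this.maxWeight = maxWeight; }

    void put(K key, Collection<V> rows) {
        Collection<V> old = cache.put(key, rows);
        currentWeight += rows.size() - (old == null ? 0 : old.size());
        // Evict least-recently-used entries until the row budget is respected,
        // but never the entry that was just inserted.
        Iterator<Map.Entry<K, Collection<V>>> it = cache.entrySet().iterator();
        while (currentWeight > maxWeight && it.hasNext()) {
            Map.Entry<K, Collection<V>> eldest = it.next();
            if (eldest.getKey().equals(key)) { continue; }
            currentWeight -= eldest.getValue().size();
            it.remove();
        }
    }

    Collection<V> get(K key) { return cache.get(key); }
    int keyCount() { return cache.size(); }
}
```

With this weighting, cached empty results (fully filtered lookups) are nearly free, so the cache holds far more distinct keys than a plain per-key 'max-rows' bound would allow.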
> > > > Flink SQL has provided a standard way to do filters and projects
> > > > pushdown, i.e., SupportsFilterPushDown and
> > > > SupportsProjectionPushDown. Jdbc/hive/HBase haven't implemented the
> > > > interfaces; that doesn't mean it's hard to implement.
> > >
> > > It's debatable how difficult it will be to implement filter pushdown.
> > > But I think the fact that currently there is no database connector with
> > > filter pushdown at least means that this feature won't be supported in
> > > connectors soon. Moreover, if we talk about other connectors (not in
> > > the Flink repo), their databases might not support all Flink filters
> > > (or not support filters at all). I think users are interested in
> > > supporting the cache filters optimization independently of supporting
> > > other features and solving more complex (or unsolvable) problems.
> > >
> > > 3) I agree with your third statement. Actually, in our internal
> > > version I also tried to unify the logic of scanning and reloading data
> > > from connectors. But unfortunately, I didn't find a way to unify the
> > > logic of all ScanRuntimeProviders (InputFormat, SourceFunction,
> > > Source, ...) and reuse it for reloading the ALL cache. As a result I
> > > settled on using InputFormat, because it was used for scanning in all
> > > lookup connectors. (I didn't know that there are plans to deprecate
> > > InputFormat in favor of the FLIP-27 Source.) IMO, using the FLIP-27
> > > source for ALL caching is not a good idea, because this source was
> > > designed to work in a distributed environment (the SplitEnumerator on
> > > the JobManager and the SourceReaders on the TaskManagers), not in one
> > > operator (the lookup join operator in our case). There is not even a
> > > direct way to pass splits from the SplitEnumerator to a SourceReader
> > > (this logic works through the SplitEnumeratorContext, which requires an
> > > OperatorCoordinator.SubtaskGateway to send AddSplitEvents). Using
> > > InputFormat for the ALL cache seems much clearer and easier. But if
> > > there are plans to refactor all connectors to FLIP-27, I have the
> > > following idea: maybe we can drop the lookup join ALL cache in favor of
> > > a simple join with repeated scanning of the batch source? The point is
> > > that the only difference between a lookup join with an ALL cache and a
> > > simple join with a batch source is that in the first case scanning is
> > > performed multiple times, in between which the state (cache) is cleared
> > > (correct me if I'm wrong). So what if we extend the functionality of
> > > the simple join to support state reloading + extend the functionality
> > > of scanning a batch source multiple times? (The latter should be easy
> > > with the new FLIP-27 source, which unifies streaming/batch reading: we
> > > would only need to change the SplitEnumerator so that it passes the
> > > splits again after some TTL.) WDYT? I must say that this looks like a
> > > long-term goal and will make the scope of this FLIP even larger than
> > > you said. Maybe we can limit ourselves to a simpler solution
> > > (InputFormats) for now.
> > >
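The ALL-cache reload cycle under discussion can be sketched in plain Java (names invented, not Flink code): the whole dimension table is loaded by a scan-style loader standing in for an InputFormat scan, kept as an immutable snapshot, and swapped atomically on each reload so lookups never observe a half-cleared state.

```java
import java.util.Map;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

// Hedged sketch of an ALL cache: full reload replaces the snapshot atomically.
class AllCacheSketch<K, V> {
    private final Supplier<Map<K, V>> scanLoader; // stands in for an InputFormat scan
    private final AtomicReference<Map<K, V>> snapshot = new AtomicReference<>(Map.of());

    AllCacheSketch(Supplier<Map<K, V>> scanLoader) {
        this.scanLoader = scanLoader;
        reload(); // initial full scan
    }

    // In a real implementation this would be driven by a timer firing every TTL.
    void reload() { snapshot.set(Map.copyOf(scanLoader.get())); }

    // Lookups read a consistent, immutable snapshot.
    V lookup(K key) { return snapshot.get().get(key); }
}
```

The simple-join alternative proposed above would effectively move this reload loop out of the lookup operator and into the source (the SplitEnumerator re-emitting splits after the TTL).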
> > > So to sum up, my points are as follows:
> > > 1) There is a way to make both concise and flexible interfaces for
> > > caching in lookup join.
> > > 2) The cache filters optimization is important both for the LRU and
> > > the ALL cache.
> > > 3) It is unclear when filter pushdown will be supported in the Flink
> > > connectors; some connectors might not have the opportunity to support
> > > filter pushdown. Also, as far as I know, filter pushdown currently
> > > works only for scanning (not lookup). So the cache filters +
> > > projections optimization should be independent of other features.
> > > 4) The ALL cache realization is a complex topic that involves multiple
> > > aspects of how Flink is developing. Dropping InputFormat in favor of
> > > the FLIP-27 Source would make the ALL cache realization really complex
> > > and unclear, so maybe instead of that we can extend the functionality
> > > of the simple join, or keep InputFormat in the case of the lookup join
> > > ALL cache?
> > >
> > > Best regards,
> > > Smirnov Alexander
> > >
> > > [1]
> > > https://guava.dev/releases/18.0/api/docs/com/google/common/cache/CacheBuilder.html#weigher(com.google.common.cache.Weigher)
> > >
> > > Thu, 5 May 2022 at 20:34, Jark Wu <imjark@gmail.com>:
> > > >
> > > > It's great to see the active discussion! I want to share my ideas:
> > > >
> > > > 1) implement the cache in the framework vs. in the connector base
> > > > I don't have a strong opinion on this. Both ways should work (e.g.,
> > > > cache pruning, compatibility). The framework way can provide more
> > > > concise interfaces. The connector-base way can define more flexible
> > > > cache strategies/implementations. We are still investigating whether
> > > > we can have both advantages. We should reach a consensus that the
> > > > chosen way is a final state, and that we are on the path to it.
> > > >
> > > > 2) filters and projections pushdown:
> > > > I agree with Alex that the filter pushdown into the cache can
> > > > benefit the ALL cache a lot. However, this is not true for the LRU
> > > > cache. Connectors use the cache to reduce IO requests to databases
> > > > for better throughput. If a filter can prune 90% of the data in the
> > > > cache, we will have 90% of lookup requests that can never be cached
> > > > and hit the databases directly. That means the cache is meaningless
> > > > in this case.
> > > >
> > > > IMO, Flink SQL has provided a standard way to do filters and
> > > > projects pushdown, i.e., SupportsFilterPushDown and
> > > > SupportsProjectionPushDown. That Jdbc/hive/HBase haven't implemented
> > > > the interfaces doesn't mean it's hard to implement. They should
> > > > implement the pushdown interfaces to reduce IO and the cache size.
> > > > The final state should be that the scan source and the lookup source
> > > > share the exact same pushdown implementation. I don't see why we
> > > > need to duplicate the pushdown logic in the caches, which would
> > > > complicate the lookup join design.
> > > >
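The pushdown contract referred to here can be illustrated with a small stand-in. This is not the real SupportsFilterPushDown interface (whose applyFilters works on ResolvedExpression trees, not strings); the String predicates and class name are simplifications invented for illustration: the source accepts the predicates the external system can evaluate and returns the rest for the planner to apply after the scan or lookup.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Stand-in for the accept/remain split that SupportsFilterPushDown expresses.
class FilterPushDownSketch {
    final List<String> accepted = new ArrayList<>();  // pushed to the database
    final List<String> remaining = new ArrayList<>(); // applied by Flink afterwards

    // 'canEvaluate' decides which predicates the external system understands,
    // e.g. simple comparisons for a given JDBC dialect.
    static FilterPushDownSketch applyFilters(List<String> filters, Predicate<String> canEvaluate) {
        FilterPushDownSketch result = new FilterPushDownSketch();
        for (String f : filters) {
            (canEvaluate.test(f) ? result.accepted : result.remaining).add(f);
        }
        return result;
    }
}
```

Because the split happens once in the source, both the scan path and a lookup cache would see the already-reduced row set, which is the deduplication Jark argues for.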
> > > > 3) ALL cache abstraction
> > > > The ALL cache might be the most challenging part of this FLIP. We
> > > > have never provided a public reload-lookup interface. Currently, we
> > > > put the reload logic in the "eval" method of the TableFunction.
> > > > That's hard for some sources (e.g., Hive). Ideally, connector
> > > > implementations should share the logic of reload and scan, i.e.,
> > > > ScanTableSource with InputFormat/SourceFunction/FLIP-27 Source.
> > > > However, InputFormat/SourceFunction are deprecated, and the FLIP-27
> > > > source is deeply coupled with the SourceOperator. If we want to
> > > > invoke the FLIP-27 source in LookupJoin, this may make the scope of
> > > > this FLIP much larger. We are still investigating how to abstract
> > > > the ALL cache logic and reuse the existing source interfaces.
> > > >
> > > > Best,
> > > > Jark
> > > >
> > > > On Thu, 5 May 2022 at 20:22, Roman Boyko <ro.v.boyko@gmail.com> wrote:
> > > >
> > > > > It's a much more complicated activity and lies outside the scope
> > > > > of this improvement, because such pushdowns should be done for all
> > > > > ScanTableSource implementations (not only for the Lookup ones).
> > > > >
> > > > > On Thu, 5 May 2022 at 19:02, Martijn Visser <martijnvisser@apache.org> wrote:
> > > > >
> > > > > > Hi everyone,
> > > > > >
> > > > > > One question regarding "And Alexander correctly mentioned that
> > > > > > filter pushdown still is not implemented for jdbc/hive/hbase."
> > > > > > -> Would an alternative solution be to actually implement these
> > > > > > filter pushdowns? I can imagine that there are many more
> > > > > > benefits to doing that, outside of lookup caching and metrics.
> > > > > >
> > > > > > Best regards,
> > > > > >
> > > > > > Martijn Visser
> > > > > > https://twitter.com/MartijnVisser82
> > > > > > https://github.com/MartijnVisser
> > > > > >
> > > > > > On Thu, 5 May 2022 at 13:58, Roman Boyko <ro.v.boyko@gmail.com> wrote:
> > > > > >
> > > > > > > Hi everyone!
> > > > > > >
> > > > > > > Thanks for driving such a valuable improvement!
> > > > > > >
> > > > > > > I do think that a single cache implementation would be a nice
> > > > > > > opportunity for users. And it will break the "FOR SYSTEM_TIME
> > > > > > > AS OF proc_time" semantics anyway, no matter how it is
> > > > > > > implemented.
> > > > > > >
> > > > > > > Putting myself in the user's shoes, I can say that:
> > > > > > > 1) I would prefer to have the opportunity to cut down the
> > > > > > > cache size by simply filtering out unnecessary data. And the
> > > > > > > handiest way to do that is to apply it inside the
> > > > > > > LookupRunners. It would be a bit harder to pass it through the
> > > > > > > LookupJoin node to the TableFunction. And Alexander correctly
> > > > > > > mentioned that filter pushdown still is not implemented for
> > > > > > > jdbc/hive/hbase.
> > > > > > > 2) The ability to set different caching parameters for
> > > > > > > different tables is quite important. So I would prefer to set
> > > > > > > them through DDL rather than have the same TTLs, strategy and
> > > > > > > other options for all lookup tables.
> > > > > > > 3) Putting the cache into the framework really deprives us of
> > > > > > > extensibility (users won't be able to implement their own
> > > > > > > cache). But most probably this can be solved by creating more
> > > > > > > cache strategies and a wider set of configurations.
> > > > > > >
> > > > > > > All these points are much closer to the schema proposed by
> > > > > > > Alexander. Qingsheng Ren, please correct me if I'm wrong and
> > > > > > > all these facilities can simply be implemented in your
> > > > > > > architecture?
> > > > > > >
> > > > > > > Best regards,
> > > > > > > Roman Boyko
> > > > > > > e.: ro.v.boyko@gmail.com
> > > > > > >
> > > > > > > On Wed, 4 May 2022 at 21:01, Martijn Visser <martijnvisser@apache.org> wrote:
> > > > > > >
> > > > > > > > Hi everyone,
> > > > > > > >
> > > > > > > > I don't have much to chip in, but just wanted to express
> > > > > > > > that I really appreciate the in-depth discussion on this
> > > > > > > > topic and I hope that others will join the conversation.
> > > > > > > >
> > > > > > > > Best regards,
> > > > > > > >
> > > > > > > > Martijn
> > > > > > > >
> > > > > > > > On Tue, 3 May 2022 at 10:15, Александр Смирнов <smiralexan@gmail.com> wrote:
> > > > > > > >
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi Qingsheng, Leonard and Jark,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for your detailed feedback!
> >>>>>>>> However, I
> >>>>>>>>>>> have
> >>>>>>>>>>>>>>>>> questions
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> about
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> some of your statements (maybe I didn't
> >>>>>>> get
> >>>>>>>>>>>>>>>>> something?).
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Caching actually breaks the semantic
> >>>>>> of
> >>>>>>>> "FOR
> >>>>>>>>>>>>>>>>> SYSTEM_TIME
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> AS OF
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> proc_time”
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I agree that the semantics of "FOR
> >>>>>>>>> SYSTEM_TIME
> >>>>>>>>>>> AS
> >>>>>>>>>>>>>>> OF
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> proc_time"
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> is
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> not
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> fully implemented with caching, but as
> >>>>>>> you
> >>>>>>>>>>> said,
> >>>>>>>>>>>>>>>> users
> >>>>>>>>>>>>>>>>> go
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> on it
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> consciously to achieve better
> >>>>>> performance
> >>>>>>>> (no
> >>>>>>>>>>> one
> >>>>>>>>>>>>>>>>> proposed
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> to
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> enable
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> caching by default, etc.). Or by users
> >>>>>> do
> >>>>>>>> you
> >>>>>>>>>>> mean
> >>>>>>>>>>>>>>>>> other
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> developers
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> of
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> connectors? In this case developers
> >>>>>>>>> explicitly
> >>>>>>>>>>>>>>>> specify
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> whether
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> their
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> connector supports caching or not (in
> >>>>>> the
> >>>>>>>>> list
> >>>>>>>>>>> of
> >>>>>>>>>>>>>>>>> supported
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> options),
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> no one makes them do that if they don't
> >>>>>>>> want
> >>>>>>>>>>> to.
> >>>>>>>>>>>>> So
> >>>>>>>>>>>>>>>>> what
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> exactly is
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the difference between implementing
> >>>>>>> caching
> >>>>>>>>> in
> >>>>>>>>>>>>>>>> modules
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> flink-table-runtime and in
> >>>>>>>> flink-table-common
> >>>>>>>>>>> from
> >>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> considered
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> point of view? How does it affect on
> >>>>>>>>>>>>>>>>> breaking/non-breaking
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> semantics of "FOR SYSTEM_TIME AS OF
> >>>>>>>>> proc_time"?
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> confront a situation that allows table
> >>>>>>>>>>> options in
> >>>>>>>>>>>>>>>> DDL
> >>>>>>>>>>>>>>>>> to
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> control
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> behavior of the framework, which has
> >>>>>>> never
> happened previously and should be cautious

If we talk about the main differences in semantics between DDL options
and config options ("table.exec.xxx"), isn't it about limiting the scope
of the options + their importance for the user's business logic, rather
than the specific location of the corresponding logic in the framework?
I mean that in my design, for example, putting an option with the lookup
cache strategy into configurations would be the wrong decision, because
it directly affects the user's business logic (not just performance
optimization) + touches just several functions of ONE table (there can
be multiple tables with different caches). Does it really matter for the
user (or someone else) where the logic affected by the applied option is
located?
Also I can recall the DDL option 'sink.parallelism', which in some way
"controls the behavior of the framework", and I don't see any problem
here.

> introduce a new interface for this all-caching scenario and the design
> would become more complex

This is a subject for a separate discussion, but actually in our
internal version we solved this problem quite easily - we reused the
InputFormat class (so there is no need for a new API). The point is that
currently all lookup connectors use InputFormat for scanning the data in
batch mode: HBase, JDBC and even Hive - it uses the class
PartitionReader, which is actually just a wrapper around InputFormat.
The advantage of this solution is the ability to reload cache data in
parallel (the number of threads depends on the number of InputSplits,
but has an upper limit). As a result, the cache reload time is
significantly reduced (as well as the time the input stream is blocked).
I know that usually we try to avoid the usage of concurrency in Flink
code, but maybe this one can be an exception. BTW I don't say that it's
an ideal solution, maybe there are better ones.

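To make the parallel reload idea above concrete, here is a minimal
standalone sketch in plain Java. ExampleSplit and the "key=value" row
format are invented stand-ins, not Flink's InputFormat/InputSplit API:
splits are read concurrently by a bounded thread pool into one shared
cache.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Standalone sketch of a parallel "ALL" cache reload across splits.
// ExampleSplit stands in for an InputSplit; no Flink types are used.
public class ParallelCacheReload {
    record ExampleSplit(int id, List<String> rows) {}

    // Reloads all splits concurrently; thread count is capped by maxThreads.
    static Map<String, String> reload(List<ExampleSplit> splits, int maxThreads) {
        Map<String, String> cache = new ConcurrentHashMap<>();
        ExecutorService pool = Executors.newFixedThreadPool(
                Math.max(1, Math.min(splits.size(), maxThreads)));
        for (ExampleSplit split : splits) {
            pool.submit(() -> {
                // Each "row" is key=value; a real reader would scan the source.
                for (String row : split.rows()) {
                    String[] kv = row.split("=", 2);
                    cache.put(kv[0], kv[1]);
                }
            });
        }
        pool.shutdown();
        try {
            pool.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return cache;
    }

    public static void main(String[] args) {
        List<ExampleSplit> splits = List.of(
                new ExampleSplit(0, List.of("u1=alice", "u2=bob")),
                new ExampleSplit(1, List.of("u3=carol")));
        System.out.println(reload(splits, 4).size()); // 3 rows from 2 splits
    }
}
```

With several splits the blocking reload window shrinks roughly with the
pool size, which is the effect described above.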
> Providing the cache in the framework might introduce compatibility
> issues

It's possible only in cases when the developer of the connector doesn't
properly refactor his code and uses the new cache options incorrectly
(i.e. explicitly provides the same options in 2 different places in the
code). For correct behavior, all he will need to do is redirect the
existing options to the framework's LookupConfig (+ maybe add aliases
for options, if there was different naming), and everything will be
transparent for users. If the developer doesn't do the refactoring at
all, nothing will change for the connector because of backward
compatibility. Also, if a developer wants to use his own cache logic, he
can simply refuse to pass some of the configs into the framework, and
instead make his own implementation with the already existing configs
and metrics (but actually I think that's a rare case).

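A tiny sketch of that option redirection, under the assumption of a
hypothetical LookupConfig class (the name is borrowed from the proposal;
the fields and defaults below are invented for illustration): the
connector keeps its existing option keys and merely forwards them.

```java
import java.util.Map;

// Hypothetical sketch: a connector forwards its existing cache options
// into a framework-level LookupConfig. LookupConfig here is an invented
// stand-in with made-up fields and defaults, not an actual Flink class.
public class LookupConfigRedirect {
    record LookupConfig(long maxRows, long ttlMillis) {}

    static LookupConfig fromConnectorOptions(Map<String, String> options) {
        // The connector keeps its old option names and simply redirects them.
        long maxRows = Long.parseLong(
                options.getOrDefault("lookup.cache.max-rows", "10000"));
        long ttl = Long.parseLong(
                options.getOrDefault("lookup.cache.ttl", "600000"));
        return new LookupConfig(maxRows, ttl);
    }

    public static void main(String[] args) {
        Map<String, String> ddlOptions = Map.of("lookup.cache.max-rows", "5000");
        System.out.println(fromConnectorOptions(ddlOptions).maxRows()); // 5000
    }
}
```

The point is only that no second cache appears: the DDL options the user
already writes end up in one place.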
> filters and projections should be pushed all the way down to the table
> function, like what we do in the scan source

It's a great goal. But the truth is that the ONLY connector that
supports filter pushdown is FileSystemTableSource (no database connector
supports it currently). Also, for some databases it's simply impossible
to push down such complex filters as we have in Flink.

> only applying these optimizations to the cache seems not quite useful

Filters can cut off an arbitrarily large amount of data from the
dimension table. For a simple example, suppose in the dimension table
'users' we have a column 'age' with values from 20 to 40, and an input
stream 'clicks' that is roughly uniformly distributed by the age of
users. If we have the filter 'age > 30', there will be half as much data
in the cache. This means the user can increase 'lookup.cache.max-rows'
by almost 2 times, which will give a huge performance boost. Moreover,
this optimization starts to really shine with the 'ALL' cache, where
tables without filters and projections can't fit in memory, but with
them they can. This opens up additional possibilities for users. And
this doesn't sound like 'not quite useful'.

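The cache-size arithmetic in that example can be checked with a small
standalone sketch (plain Java, invented names; it just counts the
distinct cached keys with and without the filter applied before
caching):

```java
import java.util.HashSet;
import java.util.Set;

// Standalone illustration of the argument above: applying the lookup
// filter before rows enter the cache shrinks cache occupancy. Dimension
// data is a uniform 'age' column in [20, 40]; the filter is 'age > 30'.
public class FilterBeforeCache {
    // Counts distinct keys that end up in the lookup cache.
    static int cachedKeys(boolean filterBeforeCache) {
        Set<Integer> cache = new HashSet<>();
        for (int age = 20; age <= 40; age++) {
            if (filterBeforeCache && age <= 30) {
                continue; // filtered out before ever reaching the cache
            }
            cache.add(age);
        }
        return cache.size();
    }

    public static void main(String[] args) {
        System.out.println(cachedKeys(false)); // 21 distinct keys, no filter
        System.out.println(cachedKeys(true));  // 10 keys with 'age > 30'
    }
}
```

So for the same 'lookup.cache.max-rows' budget, filtering before the
cache roughly doubles how much of the dimension table fits.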
It would be great to hear other voices regarding this topic! Because we
have quite a lot of controversial points, and I think with the help of
others it will be easier for us to come to a consensus.

Best regards,
Smirnov Alexander


On Fri, Apr 29, 2022 at 22:33 Qingsheng Ren <renqschn@gmail.com> wrote:

> Hi Alexander and Arvid,
>
> Thanks for the discussion and sorry for my late response! We had an
> internal discussion together with Jark and Leonard and I'd like to
> summarize our ideas. Instead of implementing the cache logic in the
> table runtime layer or wrapping around the user-provided table
> function, we prefer to introduce some new APIs extending TableFunction
> with these concerns:
>
> 1. Caching actually breaks the semantic of "FOR SYSTEM_TIME AS OF
> proc_time", because it couldn't truly reflect the content of the
> lookup table at the moment of querying. If users choose to enable
> caching on the lookup table, they implicitly indicate that this
> breakage is acceptable in exchange for the performance. So we prefer
> not to provide caching on the table runtime level.
>
> 2. If we make the cache implementation in the framework (whether in a
> runner or a wrapper around TableFunction), we have to confront a
> situation that allows table options in DDL to control the behavior of
> the framework, which has never happened previously and should be
> cautious. Under the current design the behavior of the framework
> should only be specified by configurations ("table.exec.xxx"), and
> it's hard to apply these general configs to a specific table.
>
> 3. We have use cases where the lookup source loads and refreshes all
> records periodically into the memory to achieve high lookup
> performance (like the Hive connector in the community, and also widely
> used by our internal connectors). Wrapping the cache around the user's
> TableFunction works fine for LRU caches, but I think we have to
> introduce a new interface for this all-caching scenario and the design
> would become more complex.
>
> 4. Providing the cache in the framework might introduce compatibility
> issues to existing lookup sources: there might exist two caches with
> totally different strategies if the user incorrectly configures the
> table (one in the framework and another implemented by the lookup
> source).
>
> As for the optimization mentioned by Alexander, I think filters and
> projections should be pushed all the way down to the table function,
> like what we do in the scan source, instead of the runner with the
> cache. The goal of using the cache is to reduce the network I/O and
> the pressure on the external system, and only applying these
> optimizations to the cache seems not quite useful.
>
> I made some updates to the FLIP[1] to reflect our ideas. We prefer to
> keep the cache implementation as a part of TableFunction, and we could
> provide some helper classes (CachingTableFunction,
> AllCachingTableFunction, CachingAsyncTableFunction) to developers and
> regulate the metrics of the cache. Also, I made a POC[2] for your
> reference.
>
> Looking forward to your ideas!
>
> [1]
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-221+Abstraction+for+lookup+source+cache+and+metric
> [2] https://github.com/PatrickRen/flink/tree/FLIP-221
>
> Best regards,
>
> Qingsheng
>
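The caching layer that delegates to the wrapped lookup function on
misses, as discussed in this thread, can be sketched in plain Java
roughly as follows. Class and method names are illustrative only, not
Flink's actual TableFunction API; the LRU policy is a plain
access-ordered LinkedHashMap.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Sketch of a CachingTableFunction-style wrapper: serve lookups from an
// LRU cache and delegate to the wrapped function "X" only on a miss.
public class CachingLookup<K, V> implements Function<K, V> {
    private final Function<K, V> delegate; // the wrapped lookup "X"
    private final Map<K, V> cache;
    int misses = 0; // visible for the demo below

    public CachingLookup(Function<K, V> delegate, int maxRows) {
        this.delegate = delegate;
        // An access-ordered LinkedHashMap gives a simple LRU eviction policy.
        this.cache = new LinkedHashMap<>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                return size() > maxRows;
            }
        };
    }

    @Override
    public V apply(K key) {
        if (!cache.containsKey(key)) {
            misses++; // only a miss reaches the external system
        }
        return cache.computeIfAbsent(key, delegate);
    }

    public static void main(String[] args) {
        CachingLookup<String, Integer> lookup =
                new CachingLookup<>(String::length, 100);
        lookup.apply("flink");
        lookup.apply("flink"); // the second call is served from the cache
        System.out.println(lookup.misses); // 1
    }
}
```

Whether this lives inside the connector (as a helper base class) or in
the table runtime is exactly the open question of the thread; the
delegation itself looks the same either way.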
> On Tue, Apr 26, 2022 at 4:45 PM Александр Смирнов
> <smiralexan@gmail.com> wrote:
> >
> > Thanks for the response, Arvid!
> >
> > I have a few comments on your message.
> >
> > > but could also live with an easier solution as the first step:
> >
> > I think that these 2 ways are mutually exclusive (the one originally
> > proposed by Qingsheng and mine), because conceptually they follow
> > the same goal, but the implementation details are different. If we
> > go one way, moving to the other way in the future will mean deleting
> > existing code and once again changing the API for connectors. So I
> > think we should reach a consensus with the community about that and
> > then work together on this FLIP, i.e. divide the work into tasks for
> > different parts of the FLIP (for example, LRU cache unification /
> > introducing the proposed set of metrics / further work…). WDYT,
> > Qingsheng?
> >
> > > as the source will only receive the requests after filter
> >
> > Actually, if filters are applied to fields of the lookup table, we
> > first must do the requests, and only after that can we filter the
> > responses, because lookup connectors don't have filter pushdown. So
> > if filtering is done before caching, there will be far fewer rows in
> > the cache.
> >
> > > @Alexander unfortunately, your architecture is not shared. I don't
> > > know the solution to share images to be honest.
> >
> > Sorry for that, I'm a bit new to such kinds of conversations :)
> > I have no write access to the confluence, so I made a Jira issue,
> > where I described the proposed changes in more detail -
> > https://issues.apache.org/jira/browse/FLINK-27411.
> >
> > Will be happy to get more feedback!
> >
> > Best,
> > Smirnov Alexander
> >
> > On Mon, Apr 25, 2022 at 19:49 Arvid Heise <arvid@apache.org> wrote:
> > >
> > > Hi Qingsheng,
> > >
> > > Thanks for driving this; the inconsistency was not satisfying for
> > > me.
> > >
> > > I second Alexander's idea, but could also live with an easier
> > > solution as the first step: instead of making caching an
> > > implementation detail of TableFunction X, rather devise a caching
> > > layer around X. So the proposal would be a CachingTableFunction
> > > that delegates to X in case of misses and otherwise manages the
> > > cache. Lifting it into the operator model as proposed would be
> > > even better, but is probably unnecessary in the first step for a
> > > lookup source (as the source will only receive the requests after
> > > the filter; applying the projection may be more interesting to
> > > save memory).
> > >
> > > Another advantage is that all the changes of this FLIP would be
> > > limited to options, with no need for new public interfaces.
> > > Everything else remains an implementation detail of the Table
> > > runtime. That means we can easily incorporate the optimization
> > > potential that Alexander pointed out later.
> > >
> > > @Alexander unfortunately, your architecture is not shared. I don't
> > > know the solution to share images, to be honest.
> > >
> > > On Fri, Apr 22, 2022 at 5:04 PM Александр Смирнов
> > > <smiralexan@gmail.com> wrote:
> > > >
> > > > Hi Qingsheng! My name is Alexander, I'm not a committer yet, but
> > > > I'd really like to become one. And this FLIP really interested
> > > > me. Actually, I have worked on a similar feature in my company's
> > > > Flink fork, and we would like to share our thoughts on this and
> > > > make the code open source.
> > > >
> > > > I think there is a better alternative than introducing an
> > > > abstract class for TableFunction (CachingTableFunction). As you
> > > > know, TableFunction lives in the flink-table-common module,
> > > > which provides only an API for working with tables - it's very
> > > > convenient for importing in connectors. In turn,
> > > > CachingTableFunction contains logic for runtime execution, so
> > > > this class and everything connected with it should be located in
> > > > another module, probably in flink-table-runtime. But this would
> > > > require connectors to depend on another module, which contains a
> > > > lot of runtime logic, and that doesn't sound good.
> > > >
> > > > I suggest adding a new method 'getLookupConfig' to
> > > > LookupTableSource or LookupRuntimeProvider to allow connectors
> > > > to only pass
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> configurations to the planner,
> >>>>>>>> therefore
> >>>>>>>>>>> they
> >>>>>>>>>>>>>>>> won’t
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> depend on
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> runtime
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> realization. Based on these configs
> >>>>>>>>> planner
> >>>>>>>>>>>>>>> will
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> construct a
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> lookup
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> join operator with corresponding
> >>>>>>>> runtime
> >>>>>>>>>>> logic
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> (ProcessFunctions
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> in
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> module flink-table-runtime).
> >>>>>>>> Architecture
> >>>>>>>>>>>>> looks
> >>>>>>>>>>>>>>>>> like
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> in
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> pinned
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> image (LookupConfig class there is
> >>>>>>>>> actually
> >>>>>>>>>>>>>>> yours
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> CacheConfig).
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Classes in flink-table-planner,
> >>>>>> that
> >>>>>>>> will
> >>>>>>>>>>> be
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> responsible
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> for
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> this
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> –
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> CommonPhysicalLookupJoin and his
> >>>>>>>>>>> inheritors.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Current classes for lookup join in
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> flink-table-runtime
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> -
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> LookupJoinRunner,
> >>>>>>>> AsyncLookupJoinRunner,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> LookupJoinRunnerWithCalc,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> AsyncLookupJoinRunnerWithCalc.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I suggest adding classes
> >>>>>>>>>>>>>>> LookupJoinCachingRunner,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> LookupJoinCachingRunnerWithCalc,
> >>>>>> etc.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> And here comes another more
> >>>>>> powerful
> >>>>>>>>>>> advantage
> >>>>>>>>>>>>>>> of
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> such a
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> solution.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> If
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> we have caching logic on a lower
> >>>>>>> level,
> >>>>>>>>> we
> >>>>>>>>>>> can
> >>>>>>>>>>>>>>>>> apply
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> some
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> optimizations to it.
> >>>>>>>>>>> LookupJoinRunnerWithCalc
> >>>>>>>>>>>>>>> was
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> named
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> like
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> this
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> because it uses the ‘calc’
> >>>>>> function,
> >>>>>>>>> which
> >>>>>>>>>>>>>>>> actually
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> mostly
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> consists
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> of
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> filters and projections.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> For example, in join table A with
> >>>>>>>> lookup
> >>>>>>>>>>> table
> >>>>>>>>>>>>>>> B
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> condition
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> ‘JOIN …
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> ON
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> A.id = B.id AND A.age = B.age + 10
> >>>>>>>> WHERE
> >>>>>>>>>>>>>>>> B.salary >
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> 1000’
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> ‘calc’
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> function will contain filters
> >>>>>> A.age =
> >>>>>>>>>>> B.age +
> >>>>>>>>>>>>>>> 10
> >>>>>>>>>>>>>>>>> and
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> B.salary >
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 1000.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> If we apply this function before
> >>>>>>>> storing
> >>>>>>>>>>>>>>> records
> >>>>>>>>>>>>>>>> in
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> cache,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> size
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> of
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> cache will be significantly
> >>>>>> reduced:
> >>>>>>>>>>> filters =
> >>>>>>>>>>>>>>>>> avoid
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> storing
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> useless
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> records in cache, projections =
> >>>>>>> reduce
> >>>>>>>>>>>>> records’
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> size. So
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> initial
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> max number of records in cache can
> >>>>>> be
> >>>>>>>>>>>>> increased
> >>>>>>>>>>>>>>>> by
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> user.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> What do you think about it?
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On 2022/04/19 02:47:11 Qingsheng
> >>>>>> Ren
> >>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi devs,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Yuan and I would like to start a
> >>>>>>>>>>> discussion
> >>>>>>>>>>>>>>>> about
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> FLIP-221[1],
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> which
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> introduces an abstraction of lookup
> >>>>>>>> table
> >>>>>>>>>>>>> cache
> >>>>>>>>>>>>>>>> and
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> its
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> standard
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> metrics.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Currently each lookup table source
> >>>>>>>>> should
> >>>>>>>>>>>>>>>>> implement
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> their
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> own
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> cache to
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> store lookup results, and there
> >>>>>>> isn’t a
> >>>>>>>>>>>>>>> standard
> >>>>>>>>>>>>>>>> of
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> metrics
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> for
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> users and
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> developers to tuning their jobs
> >>>>>> with
> >>>>>>>>> lookup
> >>>>>>>>>>>>>>>> joins,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> which
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> is a
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> quite
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> common
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> use case in Flink table / SQL.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Therefore we propose some new APIs
> >>>>>>>>>>> including
> >>>>>>>>>>>>>>>>> cache,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> metrics,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> wrapper
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> classes of TableFunction and new
> >>>>>>> table
> >>>>>>>>>>>>> options.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> Please
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> take a
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> look
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> at the
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> FLIP page [1] to get more details.
> >>>>>>> Any
> >>>>>>>>>>>>>>>> suggestions
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> and
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> comments
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> would be
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> appreciated!
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> [1]
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>
> >>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-221+Abstraction+for+lookup+source+cache+and+metric
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Best regards,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Qingsheng
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> --
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Best Regards,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Qingsheng Ren
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Real-time Computing Team
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Alibaba Cloud
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Email: renqschn@gmail.com
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> --
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Best regards,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Roman Boyko
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> e.: ro.v.boyko@gmail.com
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>> --
> >>>>>>>>>>>>>>>>>>>>>>>>> Best Regards,
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>> Qingsheng Ren
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>> Real-time Computing Team
> >>>>>>>>>>>>>>>>>>>>>>>>> Alibaba Cloud
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>> Email: renqschn@gmail.com
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>
> >>>>
> >>>
>
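[Editor's note] The "apply calc before caching" idea in the quoted message can be sketched as follows. This is a hedged, stdlib-only illustration: the row representation and helper names are made up, and in Flink the 'calc' function is actually code-generated by the planner over RowData. It only demonstrates why filtering and projecting before the cache write shrinks the cache.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;
import java.util.function.UnaryOperator;

/**
 * Illustration only: apply the join's filters and projections BEFORE caching,
 * so the cache holds fewer and smaller rows (hypothetical names, not a Flink API).
 */
class CalcBeforeCache {
    static List<Map<String, Object>> calcThenCache(
            List<Map<String, Object>> lookedUpRows,
            Predicate<Map<String, Object>> filter,         // e.g. B.salary > 1000
            UnaryOperator<Map<String, Object>> projection, // keep only needed fields
            Map<Object, List<Map<String, Object>>> cache,
            Object key) {
        List<Map<String, Object>> reduced = new ArrayList<>();
        for (Map<String, Object> row : lookedUpRows) {
            if (filter.test(row)) {
                reduced.add(projection.apply(row)); // smaller projected row
            }
        }
        cache.put(key, reduced); // fewer entries, reduced record size
        return reduced;
    }
}
```

With the example condition from the message, a row with salary 500 is never cached at all, and surviving rows keep only the projected fields.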

Re: [DISCUSS] FLIP-221 Abstraction for lookup source cache and metric

Posted by Qingsheng Ren <re...@apache.org>.
Hi devs,

I’d like to push FLIP-221 forward a little bit. Recently we had some offline discussions and updated the FLIP. Here’s the diff compared to the previous version:

1. (Async)LookupFunctionProvider is designed as a base interface for constructing lookup functions.
2. PartialCachingLookupProvider / FullCachingLookupProvider extend it for the partial and full caching modes.
3. Introduce CacheReloadTrigger for specifying the reload strategy in full caching mode, and provide 2 default implementations (Periodic / TimedCacheReloadTrigger)

Looking forward to your replies~

Best,
Qingsheng
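
[Editor's note] For readers following the thread, the partial-caching semantics discussed below (lookup on miss, optionally caching missing keys as empty collections, bounded size) can be sketched with plain JDK collections. This is an illustration only, not the FLIP-221 API; all names are made up.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/**
 * Illustration only -- NOT the FLIP-221 API. A tiny partial lookup cache:
 * load on miss, optionally cache missing keys as empty lists, evict in LRU order.
 */
class PartialLookupCache<K, V> {
    private final int maxEntries;
    private final boolean cacheMissingKey;
    private final LinkedHashMap<K, List<V>> entries;

    PartialLookupCache(int maxEntries, boolean cacheMissingKey) {
        this.maxEntries = maxEntries;
        this.cacheMissingKey = cacheMissingKey;
        // Access-order LinkedHashMap gives simple LRU eviction.
        this.entries = new LinkedHashMap<K, List<V>>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, List<V>> eldest) {
                return size() > PartialLookupCache.this.maxEntries;
            }
        };
    }

    /** Returns cached rows for the key, invoking the loader on a cache miss. */
    List<V> lookup(K key, Function<K, List<V>> loader) {
        List<V> cached = entries.get(key);
        if (cached != null) {
            return cached;
        }
        List<V> loaded = loader.apply(key);
        // Store empty results only if missing keys should be cached.
        if (!loaded.isEmpty() || cacheMissingKey) {
            entries.put(key, loaded);
        }
        return loaded;
    }

    long size() {
        return entries.size();
    }
}
```

The `cacheMissingKey` flag mirrors the question debated later in this thread: whether an empty lookup result should occupy a cache slot is a per-cache decision, not something the framework needs to impose.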

> On Jun 2, 2022, at 17:15, Qingsheng Ren <re...@gmail.com> wrote:
> 
> Hi Becket,
> 
> Thanks for your feedback!
> 
> 1. An alternative is to let the cache implementation decide
> whether to store a missing key in the cache instead of the framework.
> This sounds more reasonable and makes the LookupProvider interface
> cleaner. I can update the FLIP and clarify in the JavaDoc of
> LookupCache#put that the cache should decide whether to store an empty
> collection.
> 
> 2. Initially the builder pattern is for the extensibility of
> LookupProvider interfaces that we could need to add more
> configurations in the future. We can remove the builder now as we have
> resolved the issue in 1. As for the builder in DefaultLookupCache I
> prefer to keep it because we have a lot of arguments in the
> constructor.
> 
> 3. I think this might overturn the overall design. I agree with
> Becket's idea that the API design should be layered considering
> extensibility and it'll be great to have one unified interface
> supporting partial, full, and even mixed custom strategies, but we
> have some issues to resolve. The original purpose of treating full
> caching separately is that we'd like to reuse the ability of
> ScanRuntimeProvider. Developers just need to hand over Source /
> SourceFunction / InputFormat so that the framework could be able to
> compose the underlying topology and control the reload (maybe in a
> distributed way). Under your design we leave the reload operation
> totally to the CacheStrategy and I think it will be hard for
> developers to reuse the source in the initializeCache method.
> 
> Best regards,
> 
> Qingsheng
> 
> On Thu, Jun 2, 2022 at 1:50 PM Becket Qin <be...@gmail.com> wrote:
>> 
>> Thanks for updating the FLIP, Qingsheng. A few more comments:
>> 
>> 1. I am still not sure about what is the use case for cacheMissingKey().
>> More specifically, when would users want to have getCache() return a
>> non-empty value and cacheMissingKey() returns false?
>> 
>> 2. The builder pattern. Usually the builder pattern is used when there are
>> a lot of variations of constructors. For example, if a class has three
>> variables and all of them are optional, so there could potentially be many
>> combinations of the variables. But in this FLIP, I don't see such a case.
>> What is the reason we have builders for all the classes?
>> 
>> 3. Should the caching strategy be excluded from the top level provider API?
>> Technically speaking, the Flink framework should only have two interfaces
>> to deal with:
>>    A) LookupFunction
>>    B) AsyncLookupFunction
>> Orthogonally, we *believe* there are two different ways people can do
>> caching. Note that the Flink framework does not care what is the caching
>> strategy here.
>>    a) partial caching
>>    b) full caching
>> 
>> Putting them together, we end up with 3 combinations that we think are
>> valid:
>>     Aa) PartialCachingLookupFunctionProvider
>>     Ba) PartialCachingAsyncLookupFunctionProvider
>>     Ab) FullCachingLookupFunctionProvider
>> 
>> However, the caching strategy could actually be quite flexible. E.g. an
>> initial full cache load followed by some partial updates. Also, I am not
>> 100% sure if the full caching will always use ScanTableSource. Including
>> the caching strategy in the top level provider API would make it harder to
>> extend.
>> 
>> One possible solution is to just have *LookupFunctionProvider* and
>> *AsyncLookupFunctionProvider* as the top level API, both with a getCacheStrategy() method returning an
>> optional CacheStrategy. The CacheStrategy class would have the following
>> methods:
>> 1. void open(Context), the context exposes some of the resources that may
>> be useful for the caching strategy, e.g. an ExecutorService that is
>> synchronized with the data processing, or a cache refresh trigger which
>> blocks data processing and refresh the cache.
>> 2. void initializeCache(), a blocking method allows users to pre-populate
>> the cache before processing any data if they wish.
>> 3. void maybeCache(RowData key, Collection<RowData> value), blocking or
>> non-blocking method.
>> 4. void refreshCache(), a blocking / non-blocking method that is invoked by
>> the Flink framework when the cache refresh trigger is pulled.
>> 
>> In the above design, partial caching and full caching would be
>> implementations of the CachingStrategy. And it is OK for users to implement
>> their own CachingStrategy if they want to.
>> 
>> Thanks,
>> 
>> Jiangjie (Becket) Qin
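
[Editor's note] Becket's outline above can be made concrete with a toy, stdlib-only sketch. All names here are illustrative, not a proposed Flink API, and only a trivial full-reload variant is shown; a partial strategy would instead populate the cache inside maybeCache.

```java
import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.function.Supplier;

/** Hypothetical sketch of the pluggable strategy outlined above -- not a Flink API. */
interface CacheStrategy {
    /** Blocking pre-population before any data is processed. */
    void initializeCache();
    /** Offered every lookup result; a partial strategy would store it here. */
    void maybeCache(String key, Collection<String> value);
    /** Invoked when the framework pulls the cache-refresh trigger. */
    void refreshCache();
    /** Cache probe used by the join operator. */
    Collection<String> getIfPresent(String key);
}

/** Toy "full caching" strategy: load everything up front and again on each refresh. */
class FullReloadStrategy implements CacheStrategy {
    private final Supplier<Map<String, Collection<String>>> loader;
    private Map<String, Collection<String>> cache = Map.of();

    FullReloadStrategy(Supplier<Map<String, Collection<String>>> loader) {
        this.loader = loader;
    }

    @Override public void initializeCache() { cache = loader.get(); }
    @Override public void maybeCache(String key, Collection<String> value) {
        // Full caching ignores per-lookup inserts; a partial strategy would store here.
    }
    @Override public void refreshCache() { cache = loader.get(); }
    @Override public Collection<String> getIfPresent(String key) { return cache.get(key); }
}
```

The layering point is that the framework only sees CacheStrategy; whether the data arrives via bulk reload, per-lookup inserts, or a mix of both stays behind the interface.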
>> 
>> 
>> On Thu, Jun 2, 2022 at 12:14 PM Jark Wu <im...@gmail.com> wrote:
>> 
>>> Thank Qingsheng for the detailed summary and updates,
>>> 
>>> The changes look good to me in general. I just have one minor improvement
>>> comment.
>>> Could we add static util methods to the "FullCachingReloadTrigger"
>>> interface for quick usage?
>>> 
>>> #periodicReloadAtFixedRate(Duration)
>>> #periodicReloadWithFixedDelay(Duration)
>>> 
>>> I think we can also do this for LookupCache. Because users may not know
>>> where the default implementations are and how to use them.
>>> 
>>> Best,
>>> Jark
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> On Wed, 1 Jun 2022 at 18:32, Qingsheng Ren <re...@gmail.com> wrote:
>>> 
>>>> Hi Jingsong,
>>>> 
>>>> Thanks for your comments!
>>>> 
>>>>> AllCache definition is not flexible, for example, PartialCache can use
>>>> any custom storage, while the AllCache can not, AllCache can also be
>>>> considered to store memory or disk, also need a flexible strategy.
>>>> 
>>>> We had an offline discussion with Jark and Leonard. Basically we think
>>>> exposing the interface of full cache storage to connector developers
>>> might
>>>> limit our future optimizations. The storage of full caching shouldn’t
>>> have
>>>> too many variations for different lookup tables so making it pluggable
>>>> might not help a lot. Also I think it is not quite easy for connector
>>>> developers to implement such an optimized storage. We can keep optimizing
>>>> this storage in the future and all full caching lookup tables would
>>> benefit
>>>> from this.
>>>> 
>>>>> We are more inclined to deprecate the connector `async` option when
>>>> discussing FLIP-234. Can we remove this option from this FLIP?
>>>> 
>>>> Thanks for the reminder! This option has been removed in the latest
>>>> version.
>>>> 
>>>> Best regards,
>>>> 
>>>> Qingsheng
>>>> 
>>>> 
>>>>> On Jun 1, 2022, at 15:28, Jingsong Li <ji...@gmail.com> wrote:
>>>>> 
>>>>> Thanks Alexander for your reply. We can discuss the new interface when
>>> it
>>>>> comes out.
>>>>> 
>>>>> We are more inclined to deprecate the connector `async` option when
>>>>> discussing FLIP-234 [1]. We should use hints to let the planner decide.
>>>>> Although the discussion has not yet produced a conclusion, can we
>>> remove
>>>>> this option from this FLIP? It doesn't seem to be related to this FLIP,
>>>> but
>>>>> more to FLIP-234, and we can form a conclusion over there.
>>>>> 
>>>>> [1] https://lists.apache.org/thread/9k1sl2519kh2n3yttwqc00p07xdfns3h
>>>>> 
>>>>> Best,
>>>>> Jingsong
>>>>> 
>>>>> On Wed, Jun 1, 2022 at 4:59 AM Jing Ge <ji...@ververica.com> wrote:
>>>>> 
>>>>>> Hi Jark,
>>>>>> 
>>>>>> Thanks for clarifying it. It would be fine. as long as we could
>>> provide
>>>> the
>>>>>> no-cache solution. I was just wondering if the client side cache could
>>>>>> really help when HBase is used, since the data to look up should be
>>>> huge.
>>>>>> Depending how much data will be cached on the client side, the data
>>> that
>>>>>> should be LRU in e.g. LruBlockCache will not be LRU anymore. In the
>>>> worst
>>>>>> case scenario, once the cached data at client side is expired, the
>>>> request
>>>>>> will hit disk which will cause extra latency temporarily, if I am not
>>>>>> mistaken.
>>>>>> 
>>>>>> Best regards,
>>>>>> Jing
>>>>>> 
>>>>>> On Mon, May 30, 2022 at 9:59 AM Jark Wu <im...@gmail.com> wrote:
>>>>>> 
>>>>>>> Hi Jing Ge,
>>>>>>> 
>>>>>>> What do you mean about the "impact on the block cache used by HBase"?
>>>>>>> In my understanding, the connector cache and HBase cache are totally
>>>> two
>>>>>>> things.
>>>>>>> The connector cache is a local/client cache, and the HBase cache is a
>>>>>>> server cache.
>>>>>>> 
>>>>>>>> does it make sense to have a no-cache solution as one of the
>>>>>>> default solutions so that customers will have no effort for the
>>>> migration
>>>>>>> if they want to stick with Hbase cache
>>>>>>> 
>>>>>>> The implementation migration should be transparent to users. Take the
>>>>>> HBase
>>>>>>> connector as
>>>>>>> an example, it already supports a lookup cache, which is disabled by
>>>> default.
>>>>>>> After migration, the
>>>>>>> connector still disables cache by default (i.e. no-cache solution).
>>> No
>>>>>>> migration effort for users.
>>>>>>> 
>>>>>>> HBase cache and connector cache are two different things. HBase cache
>>>>>> can't
>>>>>>> simply replace
>>>>>>> connector cache. Because one of the most important usages for
>>> connector
>>>>>>> cache is reducing
>>>>>>> the I/O request/response and improving the throughput, which cannot
>>>> be achieved
>>>>>>> by just using a server cache.
>>>>>>> 
>>>>>>> Best,
>>>>>>> Jark
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> On Fri, 27 May 2022 at 22:42, Jing Ge <ji...@ververica.com> wrote:
>>>>>>> 
>>>>>>>> Thanks all for the valuable discussion. The new feature looks very
>>>>>>>> interesting.
>>>>>>>> 
>>>>>>>> According to the FLIP description: "*Currently we have JDBC, Hive
>>> and
>>>>>>> HBase
>>>>>>>> connector implemented lookup table source. All existing
>>>> implementations
>>>>>>>> will be migrated to the current design and the migration will be
>>>>>>>> transparent to end users*." I was only wondering if we should pay
>>>>>>> attention
>>>>>>>> to HBase and similar DBs. Since, commonly, the lookup data will be
>>>> huge
>>>>>>>> while using HBase, partial caching will be used in this case, if I
>>> am
>>>>>> not
>>>>>>>> mistaken, which might have an impact on the block cache used by
>>> HBase,
>>>>>>> e.g.
>>>>>>>> LruBlockCache.
>>>>>>>> Another question is that, since HBase provides a sophisticated cache
>>>>>>>> solution, does it make sense to have a no-cache solution as one of
>>> the
>>>>>>>> default solutions so that customers will have no effort for the
>>>>>> migration
>>>>>>>> if they want to stick with Hbase cache?
>>>>>>>> 
>>>>>>>> Best regards,
>>>>>>>> Jing
>>>>>>>> 
>>>>>>>> On Fri, May 27, 2022 at 11:19 AM Jingsong Li <
>>> jingsonglee0@gmail.com>
>>>>>>>> wrote:
>>>>>>>> 
>>>>>>>>> Hi all,
>>>>>>>>> 
>>>>>>>>> I think the problem now is below:
>>>>>>>>> 1. AllCache and PartialCache interface on the non-uniform, one
>>> needs
>>>>>> to
>>>>>>>>> provide LookupProvider, the other needs to provide CacheBuilder.
>>>>>>>>> 2. AllCache definition is not flexible, for example, PartialCache
>>> can
>>>>>>> use
>>>>>>>>> any custom storage, while the AllCache can not, AllCache can also
>>> be
>>>>>>>>> considered to store memory or disk, also need a flexible strategy.
>>>>>>>>> 3. AllCache can not customize ReloadStrategy, currently only
>>>>>>>>> ScheduledReloadStrategy.
>>>>>>>>> 
>>>>>>>>> In order to solve the above problems, the following are my ideas.
>>>>>>>>> 
>>>>>>>>> ## Top level cache interfaces:
>>>>>>>>> 
>>>>>>>>> ```
>>>>>>>>> 
>>>>>>>>> public interface CacheLookupProvider extends
>>>>>>>>> LookupTableSource.LookupRuntimeProvider {
>>>>>>>>> 
>>>>>>>>>   CacheBuilder createCacheBuilder();
>>>>>>>>> }
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> public interface CacheBuilder {
>>>>>>>>>   Cache create();
>>>>>>>>> }
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> public interface Cache {
>>>>>>>>> 
>>>>>>>>>   /**
>>>>>>>>>    * Returns the value associated with key in this cache, or null
>>>>>> if
>>>>>>>>> there is no cached value for
>>>>>>>>>    * key.
>>>>>>>>>    */
>>>>>>>>>   @Nullable
>>>>>>>>>   Collection<RowData> getIfPresent(RowData key);
>>>>>>>>> 
>>>>>>>>>   /** Returns the number of key-value mappings in the cache. */
>>>>>>>>>   long size();
>>>>>>>>> }
>>>>>>>>> 
>>>>>>>>> ```
>>>>>>>>> 
>>>>>>>>> ## Partial cache
>>>>>>>>> 
>>>>>>>>> ```
>>>>>>>>> 
>>>>>>>>> public interface PartialCacheLookupFunction extends
>>>>>>> CacheLookupProvider {
>>>>>>>>> 
>>>>>>>>>   @Override
>>>>>>>>>   PartialCacheBuilder createCacheBuilder();
>>>>>>>>> 
>>>>>>>>> /** Creates an {@link LookupFunction} instance. */
>>>>>>>>> LookupFunction createLookupFunction();
>>>>>>>>> }
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> public interface PartialCacheBuilder extends CacheBuilder {
>>>>>>>>> 
>>>>>>>>>   PartialCache create();
>>>>>>>>> }
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> public interface PartialCache extends Cache {
>>>>>>>>> 
>>>>>>>>>   /**
>>>>>>>>>    * Associates the specified value rows with the specified key
>>> row
>>>>>>>>> in the cache. If the cache
>>>>>>>>>    * previously contained value associated with the key, the old
>>>>>>>>> value is replaced by the
>>>>>>>>>    * specified value.
>>>>>>>>>    *
>>>>>>>>>    * @return the previous value rows associated with key, or null
>>>>>> if
>>>>>>>>> there was no mapping for key.
>>>>>>>>>    * @param key - key row with which the specified value is to be
>>>>>>>>> associated
>>>>>>>>>    * @param value – value rows to be associated with the specified
>>>>>>> key
>>>>>>>>>    */
>>>>>>>>>   Collection<RowData> put(RowData key, Collection<RowData> value);
>>>>>>>>> 
>>>>>>>>>   /** Discards any cached value for the specified key. */
>>>>>>>>>   void invalidate(RowData key);
>>>>>>>>> }
>>>>>>>>> 
>>>>>>>>> ```
>>>>>>>>> 
>>>>>>>>> ## All cache
>>>>>>>>> ```
>>>>>>>>> 
>>>>>>>>> public interface AllCacheLookupProvider extends
>>> CacheLookupProvider {
>>>>>>>>> 
>>>>>>>>>   void registerReloadStrategy(ScheduledExecutorService
>>>>>>>>> executorService, Reloader reloader);
>>>>>>>>> 
>>>>>>>>>   ScanTableSource.ScanRuntimeProvider getScanRuntimeProvider();
>>>>>>>>> 
>>>>>>>>>   @Override
>>>>>>>>>   AllCacheBuilder createCacheBuilder();
>>>>>>>>> }
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> public interface AllCacheBuilder extends CacheBuilder {
>>>>>>>>> 
>>>>>>>>>   AllCache create();
>>>>>>>>> }
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> public interface AllCache extends Cache {
>>>>>>>>> 
>>>>>>>>>   void putAll(Iterator<Map<RowData, RowData>> allEntries);
>>>>>>>>> 
>>>>>>>>>   void clearAll();
>>>>>>>>> }
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> public interface Reloader {
>>>>>>>>> 
>>>>>>>>>   void reload();
>>>>>>>>> }
>>>>>>>>> 
>>>>>>>>> ```
>>>>>>>>> 
>>>>>>>>> Best,
>>>>>>>>> Jingsong
>>>>>>>>> 
>>>>>>>>> On Fri, May 27, 2022 at 11:10 AM Jingsong Li <
>>> jingsonglee0@gmail.com
>>>>>>> 
>>>>>>>>> wrote:
>>>>>>>>> 
>>>>>>>>>> Thanks Qingsheng and all for your discussion.
>>>>>>>>>> 
>>>>>>>>>> Very sorry to jump in so late.
>>>>>>>>>> 
>>>>>>>>>> Maybe I missed something?
>>>>>>>>>> My first impression when I saw the cache interface was, why don't
>>>>>> we
>>>>>>>>>> provide an interface similar to guava cache [1], on top of guava
>>>>>>> cache,
>>>>>>>>>> caffeine also makes extensions for asynchronous calls.[2]
>>>>>>>>>> There is also the bulk load in caffeine too.
>>>>>>>>>> 
>>>>>>>>>> I am also confused about why we first go from LookupCacheFactory.Builder
>>>>>>>>>> to a Factory and only then create the Cache.
>>>>>>>>>> 
>>>>>>>>>> [1] https://github.com/google/guava
>>>>>>>>>> [2] https://github.com/ben-manes/caffeine/wiki/Population
>>>>>>>>>> 
>>>>>>>>>> Best,
>>>>>>>>>> Jingsong
>>>>>>>>>> 
>>>>>>>>>> On Thu, May 26, 2022 at 11:17 PM Jark Wu <im...@gmail.com>
>>> wrote:
>>>>>>>>>> 
>>>>>>>>>>> After looking at the new introduced ReloadTime and Becket's
>>>>>> comment,
>>>>>>>>>>> I agree with Becket we should have a pluggable reloading
>>> strategy.
>>>>>>>>>>> We can provide some common implementations, e.g., periodic
>>>>>>> reloading,
>>>>>>>>> and
>>>>>>>>>>> daily reloading.
>>>>>>>>>>> But there definitely be some connector- or business-specific
>>>>>>> reloading
>>>>>>>>>>> strategies, e.g.
>>>>>>>>>>> notify by a zookeeper watcher, reload once a new Hive partition
>>> is
>>>>>>>>>>> complete.
>>>>>>>>>>> 
>>>>>>>>>>> Best,
>>>>>>>>>>> Jark
>>>>>>>>>>> 
>>>>>>>>>>> On Thu, 26 May 2022 at 11:52, Becket Qin <be...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>>>> 
>>>>>>>>>>>> Hi Qingsheng,
>>>>>>>>>>>> 
>>>>>>>>>>>> Thanks for updating the FLIP. A few comments / questions below:
>>>>>>>>>>>> 
>>>>>>>>>>>> 1. Is there a reason that we have both "XXXFactory" and
>>>>>>>> "XXXProvider".
>>>>>>>>>>>> What is the difference between them? If they are the same, can
>>>>>> we
>>>>>>>> just
>>>>>>>>>>> use
>>>>>>>>>>>> XXXFactory everywhere?
>>>>>>>>>>>> 
>>>>>>>>>>>> 2. Regarding the FullCachingLookupProvider, should the reloading
>>>>>>>>> policy
>>>>>>>>>>>> also be pluggable? Periodic reloading can sometimes be tricky in
>>>>>>>>>>>> practice. For example, if a user sets 24 hours as the cache refresh
>>>>>>>>>>>> interval and a nightly batch job is delayed, the cache update may
>>>>>>>>>>>> still see stale data.
>>>>>>>>>>>> 
>>>>>>>>>>>> 3. In DefaultLookupCacheFactory, it looks like InitialCapacity
>>>>>>>> should
>>>>>>>>> be
>>>>>>>>>>>> removed.
>>>>>>>>>>>> 
>>>>>>>>>>>> 4. The purpose of LookupFunctionProvider#cacheMissingKey()
>>>>>> seems a
>>>>>>>>>>> little
>>>>>>>>>>>> confusing to me. If Optional<LookupCacheFactory>
>>>>>> getCacheFactory()
>>>>>>>>>>> returns
>>>>>>>>>>>> a non-empty factory, doesn't that already indicates the
>>>>>> framework
>>>>>>> to
>>>>>>>>>>> cache
>>>>>>>>>>>> the missing keys? Also, why is this method returning an
>>>>>>>>>>> Optional<Boolean>
>>>>>>>>>>>> instead of boolean?
>>>>>>>>>>>> 
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> 
>>>>>>>>>>>> Jiangjie (Becket) Qin
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> On Wed, May 25, 2022 at 5:07 PM Qingsheng Ren <
>>>>>> renqschn@gmail.com
>>>>>>>> 
>>>>>>>>>>> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>>> Hi Lincoln and Jark,
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Thanks for the comments! If the community reaches a consensus
>>>>>>> that
>>>>>>>> we
>>>>>>>>>>> use
>>>>>>>>>>>>> SQL hint instead of table options to decide whether to use sync
>>>>>>> or
>>>>>>>>>>> async
>>>>>>>>>>>>> mode, it’s indeed not necessary to introduce the “lookup.async”
>>>>>>>>> option.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> I think it’s a good idea to let the decision of async made on
>>>>>>> query
>>>>>>>>>>>>> level, which could make better optimization with more
>>>>>> infomation
>>>>>>>>>>> gathered
>>>>>>>>>>>>> by planner. Is there any FLIP describing the issue in
>>>>>>> FLINK-27625?
>>>>>>>> I
>>>>>>>>>>>>> thought FLIP-234 is only proposing adding SQL hint for retry on
>>>>>>>>> missing
>>>>>>>>>>>>> instead of the entire async mode to be controlled by hint.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Qingsheng
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On May 25, 2022, at 15:13, Lincoln Lee <
>>>>>> lincoln.86xy@gmail.com
>>>>>>>> 
>>>>>>>>>>> wrote:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Hi Jark,
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Thanks for your reply!
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Currently 'lookup.async' only lives in the HBase connector; I have
>>>>>>>>>>>>>> no idea whether or when to remove it (we can discuss that in another
>>>>>>>>>>>>>> issue for the HBase connector after FLINK-27625 is done); I just
>>>>>>>>>>>>>> suggest not adding it as a common option now.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>> Lincoln Lee
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Jark Wu <im...@gmail.com> 于2022年5月24日周二 20:14写道:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Hi Lincoln,
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> I have taken a look at FLIP-234, and I agree with you that
>>>>>> the
>>>>>>>>>>>>> connectors
>>>>>>>>>>>>>>> can
>>>>>>>>>>>>>>> provide both async and sync runtime providers simultaneously
>>>>>>>>> instead
>>>>>>>>>>>>> of one
>>>>>>>>>>>>>>> of them.
>>>>>>>>>>>>>>> At that point, "lookup.async" looks redundant. If this
>>>>>> option
>>>>>>> is
>>>>>>>>>>>>> planned to
>>>>>>>>>>>>>>> be removed
>>>>>>>>>>>>>>> in the long term, I think it makes sense not to introduce it
>>>>>>> in
>>>>>>>>> this
>>>>>>>>>>>>> FLIP.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>> Jark
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> On Tue, 24 May 2022 at 11:08, Lincoln Lee <
>>>>>>>> lincoln.86xy@gmail.com
>>>>>>>>>> 
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Hi Qingsheng,
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Sorry for jumping into the discussion so late. It's a good
>>>>>>> idea
>>>>>>>>>>> that
>>>>>>>>>>>>> we
>>>>>>>>>>>>>>> can
>>>>>>>>>>>>>>>> have a common table option. I have a minor comments on
>>>>>>>>>>> 'lookup.async'
>>>>>>>>>>>>>>> that
>>>>>>>>>>>>>>>> not make it a common option:
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> The table layer abstracts both sync and async lookup
>>>>>>>>>>>>>>>> capabilities; connector implementers can choose one or both. In
>>>>>>>>>>>>>>>> the case of implementing only one capability (the status of most
>>>>>>>>>>>>>>>> existing built-in connectors), 'lookup.async' will not be used.
>>>>>>>>>>>>>>>> And when a connector has both
>>>>>>>>>>>>>>>> capabilities, I think this choice is more suitable for
>>>>>> making
>>>>>>>>>>>>> decisions
>>>>>>>>>>>>>>> at
>>>>>>>>>>>>>>>> the query level, for example, table planner can choose the
>>>>>>>>> physical
>>>>>>>>>>>>>>>> implementation of async lookup or sync lookup based on its
>>>>>>> cost
>>>>>>>>>>>>> model, or
>>>>>>>>>>>>>>>> users can give query hint based on their own better
>>>>>>>>>>> understanding.  If
>>>>>>>>>>>>>>>> there is another common table option 'lookup.async', it may
>>>>>>>>> confuse
>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>> users in the long run.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> So, I prefer to leave the 'lookup.async' option in private
>>>>>>>> place
>>>>>>>>>>> (for
>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>> current hbase connector) and not turn it into a common
>>>>>>> option.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> WDYT?
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>> Lincoln Lee
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Qingsheng Ren <re...@gmail.com> 于2022年5月23日周一 14:54写道:
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Hi Alexander,
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Thanks for the review! We recently updated the FLIP and
>>>>>> you
>>>>>>>> can
>>>>>>>>>>> find
>>>>>>>>>>>>>>>> those
>>>>>>>>>>>>>>>>> changes from my latest email. Since some terminology has
>>>>>>>>>>>>>>>>> changed, I’ll use the new concepts when replying to your comments.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 1. Builder vs ‘of’
>>>>>>>>>>>>>>>>> I’m OK to use builder pattern if we have additional
>>>>>> optional
>>>>>>>>>>>>> parameters
>>>>>>>>>>>>>>>>> for full caching mode (“rescan” previously). The
>>>>>>>>>>> schedule-with-delay
>>>>>>>>>>>>>>> idea
>>>>>>>>>>>>>>>>> looks reasonable to me, but I think we need to redesign
>>>>>> the
>>>>>>>>>>> builder
>>>>>>>>>>>>> API
>>>>>>>>>>>>>>>> of
>>>>>>>>>>>>>>>>> full caching to make it more descriptive for developers.
>>>>>>> Would
>>>>>>>>> you
>>>>>>>>>>>>> mind
>>>>>>>>>>>>>>>>> sharing your ideas about the API? For accessing the FLIP
>>>>>>>>> workspace
>>>>>>>>>>>>> you
>>>>>>>>>>>>>>>> can
>>>>>>>>>>>>>>>>> just provide your account ID and ping any PMC member
>>>>>>> including
>>>>>>>>>>> Jark.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 2. Common table options
>>>>>>>>>>>>>>>>> We have some discussions these days and propose to
>>>>>>> introduce 8
>>>>>>>>>>> common
>>>>>>>>>>>>>>>>> table options about caching. It has been updated on the
>>>>>>> FLIP.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 3. Retries
>>>>>>>>>>>>>>>>> I think we are on the same page :-)
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> For your additional concerns:
>>>>>>>>>>>>>>>>> 1) The table option has been updated.
>>>>>>>>>>>>>>>>> 2) We got “lookup.cache” back for configuring whether to
>>>>>> use
>>>>>>>>>>> partial
>>>>>>>>>>>>> or
>>>>>>>>>>>>>>>>> full caching mode.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Qingsheng
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> On May 19, 2022, at 17:25, Александр Смирнов <
>>>>>>>>>>> smiralexan@gmail.com>
>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Also I have a few additions:
>>>>>>>>>>>>>>>>>> 1) maybe rename 'lookup.cache.maximum-size' to
>>>>>>>>>>>>>>>>>> 'lookup.cache.max-rows'? I think it will be clearer that we are
>>>>>>>>>>>>>>>>>> talking about the number of rows, not bytes. Plus it fits better,
>>>>>>>>>>>>>>>>>> considering my optimization with filters.
>>>>>>>>>>>>>>>>>> 2) How will users enable rescanning? Are we going to
>>>>>>> separate
>>>>>>>>>>>>> caching
>>>>>>>>>>>>>>>>>> and rescanning from the options point of view? Like
>>>>>>> initially
>>>>>>>>> we
>>>>>>>>>>> had
>>>>>>>>>>>>>>>>>> one option 'lookup.cache' with values LRU / ALL. I think
>>>>>>> now
>>>>>>>> we
>>>>>>>>>>> can
>>>>>>>>>>>>>>>>>> make a boolean option 'lookup.rescan'. RescanInterval can
>>>>>>> be
>>>>>>>>>>>>>>>>>> 'lookup.rescan.interval', etc.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>>>>>> Alexander
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> чт, 19 мая 2022 г. в 14:50, Александр Смирнов <
>>>>>>>>>>> smiralexan@gmail.com
>>>>>>>>>>>>>>>> :
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Hi Qingsheng and Jark,
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> 1. Builders vs 'of'
>>>>>>>>>>>>>>>>>>> I understand that builders are used when we have
>>>>>> multiple
>>>>>>>>>>>>>>> parameters.
>>>>>>>>>>>>>>>>>>> I suggested them because we could add parameters later.
>>>>>> To
>>>>>>>>>>> prevent
>>>>>>>>>>>>>>>>>>> Builder for ScanRuntimeProvider from looking redundant I
>>>>>>> can
>>>>>>>>>>>>> suggest
>>>>>>>>>>>>>>>>>>> one more config now - "rescanStartTime".
>>>>>>>>>>>>>>>>>>> It's a time in UTC (LocalTime class) when the first
>>>>>> reload
>>>>>>>> of
>>>>>>>>>>> cache
>>>>>>>>>>>>>>>>>>> starts. This parameter can be thought of as
>>>>>> 'initialDelay'
>>>>>>>>> (diff
>>>>>>>>>>>>>>>>>>> between current time and rescanStartTime) in method
>>>>>>>>>>>>>>>>>>> ScheduleExecutorService#scheduleWithFixedDelay [1] . It
>>>>>>> can
>>>>>>>> be
>>>>>>>>>>> very
>>>>>>>>>>>>>>>>>>> useful when the dimension table is updated by some other
>>>>>>>>>>> scheduled
>>>>>>>>>>>>>>> job
>>>>>>>>>>>>>>>>>>> at a certain time. Or when the user simply wants a
>>>>>> second
>>>>>>>> scan
>>>>>>>>>>>>>>> (first
>>>>>>>>>>>>>>>>>>> cache reload) be delayed. This option can be used even
>>>>>>>> without
>>>>>>>>>>>>>>>>>>> 'rescanInterval' - in this case 'rescanInterval' will be
>>>>>>> one
>>>>>>>>>>> day.
>>>>>>>>>>>>>>>>>>> If you are fine with this option, I would be very glad
>>>>>> if
>>>>>>>> you
>>>>>>>>>>> would
>>>>>>>>>>>>>>>>>>> give me access to edit FLIP page, so I could add it
>>>>>> myself
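The 'rescanStartTime' idea above boils down to computing the initial delay as the duration from the current UTC time to the next occurrence of the configured LocalTime, and passing it to ScheduledExecutorService#scheduleWithFixedDelay. A minimal sketch of that computation (the helper name is illustrative, not part of the FLIP):

```java
import java.time.Duration;
import java.time.LocalTime;

public class RescanDelayDemo {
    // Delay from 'now' until the next occurrence of 'startTime';
    // if startTime is earlier today, the first reload happens tomorrow.
    static Duration initialDelay(LocalTime now, LocalTime startTime) {
        Duration d = Duration.between(now, startTime);
        return d.isNegative() ? d.plusDays(1) : d;
    }

    public static void main(String[] args) {
        // 02:00 start time, current time 23:00 -> first reload in 3 hours.
        System.out.println(initialDelay(LocalTime.of(23, 0), LocalTime.of(2, 0)).toHours());
        // 02:00 start time, current time 01:00 -> first reload in 1 hour.
        System.out.println(initialDelay(LocalTime.of(1, 0), LocalTime.of(2, 0)).toHours());
    }
}
```

The resulting duration would then be the `initialDelay` argument of `scheduleWithFixedDelay`, with 'rescanInterval' (defaulting to one day) as the fixed delay.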
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> 2. Common table options
>>>>>>>>>>>>>>>>>>> I also think that FactoryUtil would be overloaded by all
>>>>>>>> cache
>>>>>>>>>>>>>>>>>>> options. But maybe unify all suggested options, not only
>>>>>>> for
>>>>>>>>>>>>> default
>>>>>>>>>>>>>>>>>>> cache? I.e. class 'LookupOptions', that unifies default
>>>>>>>> cache
>>>>>>>>>>>>>>> options,
>>>>>>>>>>>>>>>>>>> rescan options, 'async', 'maxRetries'. WDYT?
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> 3. Retries
>>>>>>>>>>>>>>>>>>> I'm fine with suggestion close to
>>>>>>> RetryUtils#tryTimes(times,
>>>>>>>>>>> call)
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> [1]
>>>>>>>>>>>>>>>>>>> https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/ScheduledExecutorService.html#scheduleWithFixedDelay-java.lang.Runnable-long-long-java.util.concurrent.TimeUnit-
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>>>>>>> Alexander
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> ср, 18 мая 2022 г. в 16:04, Qingsheng Ren <
>>>>>>>> renqschn@gmail.com
>>>>>>>>>> :
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> Hi Jark and Alexander,
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> Thanks for your comments! I’m also OK to introduce
>>>>>> common
>>>>>>>>> table
>>>>>>>>>>>>>>>>> options. I prefer to introduce a new
>>>>>>> DefaultLookupCacheOptions
>>>>>>>>>>> class
>>>>>>>>>>>>>>> for
>>>>>>>>>>>>>>>>> holding these option definitions because putting all
>>>>>> options
>>>>>>>>> into
>>>>>>>>>>>>>>>>> FactoryUtil would make it a bit ”crowded” and not well
>>>>>>>>>>> categorized.
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> FLIP has been updated according to suggestions above:
>>>>>>>>>>>>>>>>>>>> 1. Use static “of” method for constructing
>>>>>>>>>>> RescanRuntimeProvider
>>>>>>>>>>>>>>>>> considering both arguments are required.
>>>>>>>>>>>>>>>>>>>> 2. Introduce new table options matching
>>>>>>>>>>> DefaultLookupCacheFactory
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>>>>>> Qingsheng
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> On Wed, May 18, 2022 at 2:57 PM Jark Wu <
>>>>>>> imjark@gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> Hi Alex,
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> 1) retry logic
>>>>>>>>>>>>>>>>>>>>> I think we can extract some common retry logic into
>>>>>>>>> utilities,
>>>>>>>>>>>>>>> e.g.
>>>>>>>>>>>>>>>>> RetryUtils#tryTimes(times, call).
>>>>>>>>>>>>>>>>>>>>> This seems independent of this FLIP and can be reused
>>>>>> by
>>>>>>>>>>>>>>> DataStream
>>>>>>>>>>>>>>>>> users.
>>>>>>>>>>>>>>>>>>>>> Maybe we can open an issue to discuss this and where
>>>>>> to
>>>>>>>> put
>>>>>>>>>>> it.
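A utility along the lines of the proposed RetryUtils#tryTimes(times, call) could look roughly like this (a sketch only; no such class exists in Flink at this point, and the final signature may differ):

```java
import java.util.concurrent.Callable;

public class RetryUtilsSketch {
    // Invoke 'call' up to 'times' times, rethrowing the last failure
    // if every attempt fails.
    static <T> T tryTimes(int times, Callable<T> call) throws Exception {
        Exception last = null;
        for (int i = 0; i < times; i++) {
            try {
                return call.call();
            } catch (Exception e) {
                last = e;
            }
        }
        throw last;
    }

    public static void main(String[] args) throws Exception {
        final int[] attempts = {0};
        // Fails twice, then succeeds on the third attempt.
        String result = tryTimes(3, () -> {
            if (++attempts[0] < 3) {
                throw new RuntimeException("transient failure");
            }
            return "ok";
        });
        System.out.println(result + " after " + attempts[0] + " attempts");
    }
}
```

Being a plain static helper, it would indeed be reusable by DataStream users as well, which is why discussing where to put it makes sense.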
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> 2) cache ConfigOptions
>>>>>>>>>>>>>>>>>>>>> I'm fine with defining cache config options in the
>>>>>>>>> framework.
>>>>>>>>>>>>>>>>>>>>> A candidate place to put is FactoryUtil which also
>>>>>>>> includes
>>>>>>>>>>>>>>>>> "sink.parallelism", "format" options.
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>>>>>>> Jark
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> On Wed, 18 May 2022 at 13:52, Александр Смирнов <
>>>>>>>>>>>>>>>> smiralexan@gmail.com>
>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> Hi Qingsheng,
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> Thank you for considering my comments.
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> there might be custom logic before making retry,
>>>>>> such
>>>>>>> as
>>>>>>>>>>>>>>>>> re-establish the connection
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> Yes, I understand that. I meant that such logic can be
>>>>>>>>>>>>>>>>>>>>>> placed in a separate function that can be implemented by
>>>>>>>>>>>>>>>>>>>>>> connectors. Just moving the retry logic would make connectors'
>>>>>>>>>>>>>>>>>>>>>> LookupFunctions more concise and avoid duplicated code.
>>>>>>>>>>>>>>>>>>>>>> However, it's a minor change.
>>>>>> The
>>>>>>>>>>> decision
>>>>>>>>>>>>>>> is
>>>>>>>>>>>>>>>>> up
>>>>>>>>>>>>>>>>>>>>>> to you.
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> We decide not to provide common DDL options and let
>>>>>>>>>>> developers
>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>> define their own options as we do now per connector.
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> What is the reason for that? One of the main goals of
>>>>>>>> this
>>>>>>>>>>> FLIP
>>>>>>>>>>>>>>> was
>>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>>>>> unify the configs, wasn't it? I understand that
>>>>>> current
>>>>>>>>> cache
>>>>>>>>>>>>>>>> design
>>>>>>>>>>>>>>>>>>>>>> doesn't depend on ConfigOptions, like was before. But
>>>>>>>> still
>>>>>>>>>>> we
>>>>>>>>>>>>>>> can
>>>>>>>>>>>>>>>>> put
>>>>>>>>>>>>>>>>>>>>>> these options into the framework, so connectors can
>>>>>>> reuse
>>>>>>>>>>> them
>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>>>>>> avoid code duplication, and, more significantly, avoid
>>>>>>>>>>>>>>>>>>>>>> potentially inconsistent option naming. This point can be
>>>>>>>>>>>>>>>>>>>>>> called out in the documentation for connector developers.
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>>>>>>>>>> Alexander
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> вт, 17 мая 2022 г. в 17:11, Qingsheng Ren <
>>>>>>>>>>> renqschn@gmail.com>:
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> Hi Alexander,
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> Thanks for the review and glad to see we are on the
>>>>>>> same
>>>>>>>>>>> page!
>>>>>>>>>>>>> I
>>>>>>>>>>>>>>>>> think you forgot to cc the dev mailing list so I’m also
>>>>>>>> quoting
>>>>>>>>>>> your
>>>>>>>>>>>>>>>> reply
>>>>>>>>>>>>>>>>> under this email.
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> We can add 'maxRetryTimes' option into this class
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> In my opinion the retry logic should be implemented
>>>>>> in
>>>>>>>>>>> lookup()
>>>>>>>>>>>>>>>>> instead of in LookupFunction#eval(). Retrying is only
>>>>>>>> meaningful
>>>>>>>>>>>>> under
>>>>>>>>>>>>>>>> some
>>>>>>>>>>>>>>>>> specific retriable failures, and there might be custom
>>>>>> logic
>>>>>>>>>>> before
>>>>>>>>>>>>>>>> making
>>>>>>>>>>>>>>>>> retry, such as re-establish the connection
>>>>>>>>>>> (JdbcRowDataLookupFunction
>>>>>>>>>>>>>>> is
>>>>>>>>>>>>>>>> an
>>>>>>>>>>>>>>>>> example), so it's more handy to leave it to the connector.
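A rough sketch of that connector-side pattern, retrying inside lookup() and re-establishing the connection between attempts (the connection class and method names here are invented for illustration; the real analogue is JdbcRowDataLookupFunction):

```java
import java.util.Collection;
import java.util.Collections;

public class ConnectorRetryDemo {
    // Stand-in for a real database connection; throws when unhealthy.
    static class FlakyConnection {
        private final boolean healthy;
        FlakyConnection(boolean healthy) { this.healthy = healthy; }
        Collection<String> query(String key) {
            if (!healthy) throw new IllegalStateException("connection lost");
            return Collections.singletonList("row-for-" + key);
        }
    }

    private FlakyConnection connection = new FlakyConnection(false);
    private int reconnects = 0;

    // The retry loop lives inside lookup(): on failure, re-establish the
    // connection before the next attempt, as the discussion suggests.
    Collection<String> lookup(String key, int maxRetryTimes) {
        RuntimeException last = null;
        for (int attempt = 0; attempt <= maxRetryTimes; attempt++) {
            try {
                return connection.query(key);
            } catch (RuntimeException e) {
                last = e;
                connection = new FlakyConnection(true); // reconnect
                reconnects++;
            }
        }
        throw last;
    }

    public static void main(String[] args) {
        ConnectorRetryDemo fn = new ConnectorRetryDemo();
        System.out.println(fn.lookup("k1", 3) + " reconnects=" + fn.reconnects);
    }
}
```

A generic framework-level retry loop could not perform the reconnect step, which is the argument for leaving retries to the connector.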
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> I don't see DDL options, that were in previous
>>>>>>> version
>>>>>>>> of
>>>>>>>>>>>>> FLIP.
>>>>>>>>>>>>>>>> Do
>>>>>>>>>>>>>>>>> you have any special plans for them?
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> We decide not to provide common DDL options and let
>>>>>>>>>>> developers
>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>> define their own options as we do now per connector.
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> The rest of comments sound great and I’ll update the
>>>>>>>> FLIP.
>>>>>>>>>>> Hope
>>>>>>>>>>>>>>> we
>>>>>>>>>>>>>>>>> can finalize our proposal soon!
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> Qingsheng
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> On May 17, 2022, at 13:46, Александр Смирнов <
>>>>>>>>>>>>>>>> smiralexan@gmail.com>
>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> Hi Qingsheng and devs!
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> I like the overall design of updated FLIP, however
>>>>>> I
>>>>>>>> have
>>>>>>>>>>>>>>> several
>>>>>>>>>>>>>>>>>>>>>>>> suggestions and questions.
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> 1) Introducing LookupFunction as a subclass of
>>>>>>>>>>> TableFunction
>>>>>>>>>>>>>>> is a
>>>>>>>>>>>>>>>>> good
>>>>>>>>>>>>>>>>>>>>>>>> idea. We can add 'maxRetryTimes' option into this
>>>>>>>> class.
>>>>>>>>>>>>> 'eval'
>>>>>>>>>>>>>>>>> method
>>>>>>>>>>>>>>>>>>>>>>>> of new LookupFunction is great for this purpose.
>>>>>> The
>>>>>>>> same
>>>>>>>>>>> is
>>>>>>>>>>>>>>> for
>>>>>>>>>>>>>>>>>>>>>>>> 'async' case.
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> 2) There might be other configs in future, such as
>>>>>>>>>>>>>>>>> 'cacheMissingKey'
>>>>>>>>>>>>>>>>>>>>>>>> in LookupFunctionProvider or 'rescanInterval' in
>>>>>>>>>>>>>>>>> ScanRuntimeProvider.
>>>>>>>>>>>>>>>>>>>>>>>> Maybe use Builder pattern in LookupFunctionProvider
>>>>>>> and
>>>>>>>>>>>>>>>>>>>>>>>> RescanRuntimeProvider for more flexibility (use one
>>>>>>>>> 'build'
>>>>>>>>>>>>>>>> method
>>>>>>>>>>>>>>>>>>>>>>>> instead of many 'of' methods in future)?
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> 3) What are the plans for existing
>>>>>>>> TableFunctionProvider
>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>>>>>>>> AsyncTableFunctionProvider? I think they should be
>>>>>>>>>>> deprecated.
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> 4) Am I right that the current design does not
>>>>>> assume
>>>>>>>>>>> usage of
>>>>>>>>>>>>>>>>>>>>>>>> user-provided LookupCache in re-scanning? In this
>>>>>>> case,
>>>>>>>>> it
>>>>>>>>>>> is
>>>>>>>>>>>>>>> not
>>>>>>>>>>>>>>>>> very
>>>>>>>>>>>>>>>>>>>>>>>> clear why do we need methods such as 'invalidate'
>>>>>> or
>>>>>>>>>>> 'putAll'
>>>>>>>>>>>>>>> in
>>>>>>>>>>>>>>>>>>>>>>>> LookupCache.
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> 5) I don't see DDL options, that were in previous
>>>>>>>> version
>>>>>>>>>>> of
>>>>>>>>>>>>>>>> FLIP.
>>>>>>>>>>>>>>>>> Do
>>>>>>>>>>>>>>>>>>>>>>>> you have any special plans for them?
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> If you don't mind, I would be glad to be able to
>>>>>> make
>>>>>>>>> small
>>>>>>>>>>>>>>>>>>>>>>>> adjustments to the FLIP document too. I think it's
>>>>>>>> worth
>>>>>>>>>>>>>>>> mentioning
>>>>>>>>>>>>>>>>>>>>>>>> about what exactly optimizations are planning in
>>>>>> the
>>>>>>>>>>> future.
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>>>>>>>>>>>> Smirnov Alexander
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> пт, 13 мая 2022 г. в 20:27, Qingsheng Ren <
>>>>>>>>>>> renqschn@gmail.com
>>>>>>>>>>>>>>>> :
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> Hi Alexander and devs,
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> Thank you very much for the in-depth discussion!
>>>>>> As
>>>>>>>> Jark
>>>>>>>>>>>>>>>>> mentioned we were inspired by Alexander's idea and made a
>>>>>>>>>>> refactor on
>>>>>>>>>>>>>>> our
>>>>>>>>>>>>>>>>> design. FLIP-221 [1] has been updated to reflect our
>>>>>> design
>>>>>>>> now
>>>>>>>>>>> and
>>>>>>>>>>>>> we
>>>>>>>>>>>>>>>> are
>>>>>>>>>>>>>>>>> happy to hear more suggestions from you!
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> Compared to the previous design:
>>>>>>>>>>>>>>>>>>>>>>>>> 1. The lookup cache serves at table runtime level
>>>>>>> and
>>>>>>>> is
>>>>>>>>>>>>>>>>> integrated as a component of LookupJoinRunner as discussed
>>>>>>>>>>>>> previously.
>>>>>>>>>>>>>>>>>>>>>>>>> 2. Interfaces are renamed and re-designed to
>>>>>> reflect
>>>>>>>> the
>>>>>>>>>>> new
>>>>>>>>>>>>>>>>> design.
>>>>>>>>>>>>>>>>>>>>>>>>> 3. We separate the all-caching case individually
>>>>>> and
>>>>>>>>>>>>>>> introduce a
>>>>>>>>>>>>>>>>> new RescanRuntimeProvider to reuse the ability of
>>>>>> scanning.
>>>>>>> We
>>>>>>>>> are
>>>>>>>>>>>>>>>> planning
>>>>>>>>>>>>>>>>> to support SourceFunction / InputFormat for now
>>>>>> considering
>>>>>>>> the
>>>>>>>>>>>>>>>> complexity
>>>>>>>>>>>>>>>>> of FLIP-27 Source API.
>>>>>>>>>>>>>>>>>>>>>>>>> 4. A new interface LookupFunction is introduced to
>>>>>>>> make
>>>>>>>>>>> the
>>>>>>>>>>>>>>>>> semantic of lookup more straightforward for developers.
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> For replying to Alexander:
>>>>>>>>>>>>>>>>>>>>>>>>>> However I'm a little confused whether InputFormat
>>>>>>> is
>>>>>>>>>>>>>>> deprecated
>>>>>>>>>>>>>>>>> or not. Am I right that it will be so in the future, but
>>>>>>>>> currently
>>>>>>>>>>>>> it's
>>>>>>>>>>>>>>>> not?
>>>>>>>>>>>>>>>>>>>>>>>>> Yes you are right. InputFormat is not deprecated
>>>>>> for
>>>>>>>>> now.
>>>>>>>>>>> I
>>>>>>>>>>>>>>>> think
>>>>>>>>>>>>>>>>> it will be deprecated in the future but we don't have a
>>>>>>> clear
>>>>>>>>> plan
>>>>>>>>>>>>> for
>>>>>>>>>>>>>>>> that.
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> Thanks again for the discussion on this FLIP and
>>>>>>>> looking
>>>>>>>>>>>>>>> forward
>>>>>>>>>>>>>>>>> to cooperating with you after we finalize the design and
>>>>>>>>>>> interfaces!
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> [1]
>>>>>>>>>>>>>>>>>>>>>>>>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-221+Abstraction+for+lookup+source+cache+and+metric
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> Qingsheng
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> On Fri, May 13, 2022 at 12:12 AM Александр
>>>>>> Смирнов <
>>>>>>>>>>>>>>>>> smiralexan@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> Hi Jark, Qingsheng and Leonard!
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> Glad to see that we came to a consensus on almost
>>>>>>> all
>>>>>>>>>>>>> points!
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> However I'm a little confused whether InputFormat
>>>>>>> is
>>>>>>>>>>>>>>> deprecated
>>>>>>>>>>>>>>>>> or
>>>>>>>>>>>>>>>>>>>>>>>>>> not. Am I right that it will be so in the future,
>>>>>>> but
>>>>>>>>>>>>>>> currently
>>>>>>>>>>>>>>>>> it's
>>>>>>>>>>>>>>>>>>>>>>>>>> not? Actually I also think that for the first
>>>>>>> version
>>>>>>>>>>> it's
>>>>>>>>>>>>> OK
>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>> use
>>>>>>>>>>>>>>>>>>>>>>>>>> InputFormat in ALL cache realization, because
>>>>>>>>> supporting
>>>>>>>>>>>>>>> rescan
>>>>>>>>>>>>>>>>>>>>>>>>>> ability seems like a very distant prospect. But
>>>>>> for
>>>>>>>>> this
>>>>>>>>>>>>>>>>> decision we
>>>>>>>>>>>>>>>>>>>>>>>>>> need a consensus among all discussion
>>>>>> participants.
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> In general, I have nothing to argue with in your
>>>>>>>>>>>>>>>>>>>>>>>>>> statements. All of them correspond to my ideas. Looking ahead, it
>>>>>>> would
>>>>>>>> be
>>>>>>>>>>> nice
>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>> work
>>>>>>>>>>>>>>>>>>>>>>>>>> on this FLIP cooperatively. I've already done a
>>>>>> lot
>>>>>>>> of
>>>>>>>>>>> work
>>>>>>>>>>>>>>> on
>>>>>>>>>>>>>>>>> lookup
>>>>>>>>>>>>>>>>>>>>>>>>>> join caching with realization very close to the
>>>>>> one
>>>>>>>> we
>>>>>>>>>>> are
>>>>>>>>>>>>>>>>> discussing,
>>>>>>>>>>>>>>>>>>>>>>>>>> and want to share the results of this work.
>>>>>> Anyway
>>>>>>>>>>> looking
>>>>>>>>>>>>>>>>> forward for
>>>>>>>>>>>>>>>>>>>>>>>>>> the FLIP update!
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>>>>>>>>>>>>>> Smirnov Alexander
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> чт, 12 мая 2022 г. в 17:38, Jark Wu <
>>>>>>>> imjark@gmail.com
>>>>>>>>>> :
>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi Alex,
>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for summarizing your points.
>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>> In the past week, Qingsheng, Leonard, and I have
>>>>>>>>>>> discussed
>>>>>>>>>>>>>>> it
>>>>>>>>>>>>>>>>> several times
>>>>>>>>>>>>>>>>>>>>>>>>>>> and we have totally refactored the design.
>>>>>>>>>>>>>>>>>>>>>>>>>>> I'm glad to say we have reached a consensus on
>>>>>>> many
>>>>>>>> of
>>>>>>>>>>> your
>>>>>>>>>>>>>>>>> points!
>>>>>>>>>>>>>>>>>>>>>>>>>>> Qingsheng is still working on updating the
>>>>>> design
>>>>>>>> docs
>>>>>>>>>>> and
>>>>>>>>>>>>>>>>> maybe can be
>>>>>>>>>>>>>>>>>>>>>>>>>>> available in the next few days.
>>>>>>>>>>>>>>>>>>>>>>>>>>> I will share some conclusions from our
>>>>>>> discussions:
>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>> 1) we have refactored the design towards to
>>>>>> "cache
>>>>>>>> in
>>>>>>>>>>>>>>>>> framework" way.
>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>> 2) a "LookupCache" interface for users to
>>>>>>> customize
>>>>>>>>> and
>>>>>>>>>>> a
>>>>>>>>>>>>>>>>> default
>>>>>>>>>>>>>>>>>>>>>>>>>>> implementation with builder for users to
>>>>>> easy-use.
>>>>>>>>>>>>>>>>>>>>>>>>>>> This makes it possible to have both flexibility and conciseness.
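That "customizable interface plus default implementation with a builder" shape could be sketched like this, simplified with String keys instead of Flink's RowData (all names are illustrative, not the final FLIP API):

```java
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.Map;

// Minimal cache contract a user could implement themselves.
interface LookupCacheSketch {
    Collection<String> getIfPresent(String key);
    void put(String key, Collection<String> rows);
}

// Default implementation: LRU eviction with a configurable max-entries bound.
class DefaultLookupCacheSketch implements LookupCacheSketch {
    private final LinkedHashMap<String, Collection<String>> map;

    private DefaultLookupCacheSketch(int maxEntries) {
        // accessOrder=true turns LinkedHashMap into an LRU map.
        this.map = new LinkedHashMap<String, Collection<String>>(16, 0.75f, true) {
            protected boolean removeEldestEntry(Map.Entry<String, Collection<String>> e) {
                return size() > maxEntries;
            }
        };
    }

    public Collection<String> getIfPresent(String key) { return map.get(key); }
    public void put(String key, Collection<String> rows) { map.put(key, rows); }

    static Builder builder() { return new Builder(); }

    static class Builder {
        private int maxEntries = 10_000;
        Builder maximumEntries(int n) { this.maxEntries = n; return this; }
        DefaultLookupCacheSketch build() { return new DefaultLookupCacheSketch(maxEntries); }
    }
}

public class LookupCacheDemo {
    public static void main(String[] args) {
        LookupCacheSketch cache = DefaultLookupCacheSketch.builder().maximumEntries(2).build();
        cache.put("a", java.util.Collections.singletonList("r1"));
        cache.put("b", java.util.Collections.singletonList("r2"));
        cache.getIfPresent("a");                                     // touch "a" so "b" is eldest
        cache.put("c", java.util.Collections.singletonList("r3"));   // evicts "b"
        System.out.println(cache.getIfPresent("b") == null);
        System.out.println(cache.getIfPresent("a"));
    }
}
```

Power users implement the interface directly; everyone else goes through the builder, which is where the flexibility/conciseness trade-off mentioned above comes from.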
>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>> 3) Filter pushdown is important for ALL and LRU
>>>>>>>> lookup
>>>>>>>>>>>>>>> cache,
>>>>>>>>>>>>>>>>> esp reducing
>>>>>>>>>>>>>>>>>>>>>>>>>>> IO.
>>>>>>>>>>>>>>>>>>>>>>>>>>> Filter pushdown should be the final state and
>>>>>> the
>>>>>>>>>>> unified
>>>>>>>>>>>>>>> way
>>>>>>>>>>>>>>>>> to both
>>>>>>>>>>>>>>>>>>>>>>>>>>> support pruning ALL cache and LRU cache,
>>>>>>>>>>>>>>>>>>>>>>>>>>> so I think we should make effort in this
>>>>>>> direction.
>>>>>>>> If
>>>>>>>>>>> we
>>>>>>>>>>>>>>> need
>>>>>>>>>>>>>>>>> to support
>>>>>>>>>>>>>>>>>>>>>>>>>>> filter pushdown for ALL cache anyway, why not
>>>>>> use
>>>>>>>>>>>>>>>>>>>>>>>>>>> it for LRU cache as well? Either way, as we
>>>>>> decide
>>>>>>>> to
>>>>>>>>>>>>>>>> implement
>>>>>>>>>>>>>>>>> the cache
>>>>>>>>>>>>>>>>>>>>>>>>>>> in the framework, we have the chance to support
>>>>>>>>>>>>>>>>>>>>>>>>>>> filter on cache anytime. This is an optimization
>>>>>>> and
>>>>>>>>> it
>>>>>>>>>>>>>>>> doesn't
>>>>>>>>>>>>>>>>> affect the
>>>>>>>>>>>>>>>>>>>>>>>>>>> public API. I think we can create a JIRA issue
>>>>>> to
>>>>>>>>>>>>>>>>>>>>>>>>>>> discuss it when the FLIP is accepted.
>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>> 4) The idea to support ALL cache is similar to
>>>>>>> your
>>>>>>>>>>>>>>> proposal.
>>>>>>>>>>>>>>>>>>>>>>>>>>> In the first version, we will only support
>>>>>>>>> InputFormat,
>>>>>>>>>>>>>>>>> SourceFunction for
>>>>>>>>>>>>>>>>>>>>>>>>>>> cache all (invoke InputFormat in join operator).
>>>>>>>>>>>>>>>>>>>>>>>>>>> For FLIP-27 source, we need to join a true
>>>>>> source
>>>>>>>>>>> operator
>>>>>>>>>>>>>>>>> instead of
>>>>>>>>>>>>>>>>>>>>>>>>>>> calling it embedded in the join operator.
>>>>>>>>>>>>>>>>>>>>>>>>>>> However, this needs another FLIP to support the
>>>>>>>>> re-scan
>>>>>>>>>>>>>>>> ability
>>>>>>>>>>>>>>>>> for FLIP-27
>>>>>>>>>>>>>>>>>>>>>>>>>>> Source, and this can be a large work.
>>>>>>>>>>>>>>>>>>>>>>>>>>> In order to not block this issue, we can put the
>>>>>>>>> effort
>>>>>>>>>>> of
>>>>>>>>>>>>>>>>> FLIP-27 source
>>>>>>>>>>>>>>>>>>>>>>>>>>> integration into future work and integrate
>>>>>>>>>>>>>>>>>>>>>>>>>>> InputFormat&SourceFunction for now.
>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>> I think it's fine to use
>>>>>>> InputFormat&SourceFunction,
>>>>>>>>> as
>>>>>>>>>>>>> they
>>>>>>>>>>>>>>>>> are not
>>>>>>>>>>>>>>>>>>>>>>>>>>> deprecated, otherwise, we have to introduce
>>>>>>> another
>>>>>>>>>>>>> function
>>>>>>>>>>>>>>>>>>>>>>>>>>> similar to them which is meaningless. We need to
>>>>>>>> plan
>>>>>>>>>>>>>>> FLIP-27
>>>>>>>>>>>>>>>>> source
>>>>>>>>>>>>>>>>>>>>>>>>>>> integration ASAP before InputFormat &
>>>>>>> SourceFunction
>>>>>>>>> are
>>>>>>>>>>>>>>>>> deprecated.
>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>>>>>>>>>>>>> Jark
>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>> On Thu, 12 May 2022 at 15:46, Александр Смирнов
>>>>>> <
>>>>>>>>>>>>>>>>> smiralexan@gmail.com>
>>>>>>>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi Martijn!
>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Got it. Therefore, the realization with
>>>>>>> InputFormat
>>>>>>>>> is
>>>>>>>>>>> not
>>>>>>>>>>>>>>>>> considered.
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for clearing that up!
>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Smirnov Alexander
>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> чт, 12 мая 2022 г. в 14:23, Martijn Visser <
>>>>>>>>>>>>>>>>> martijn@ververica.com>:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> With regards to:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> But if there are plans to refactor all
>>>>>>> connectors
>>>>>>>>> to
>>>>>>>>>>>>>>>> FLIP-27
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Yes, FLIP-27 is the target for all connectors.
>>>>>>> The
>>>>>>>>> old
>>>>>>>>>>>>>>>>> interfaces will be
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> deprecated and connectors will either be
>>>>>>>> refactored
>>>>>>>>> to
>>>>>>>>>>>>> use
>>>>>>>>>>>>>>>>> the new ones
>>>>>>>>>>>>>>>>>>>>>>>>>>>> or
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> dropped.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> The caching should work for connectors that
>>>>>> are
>>>>>>>>> using
>>>>>>>>>>>>>>>> FLIP-27
>>>>>>>>>>>>>>>>> interfaces,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> we should not introduce new features for old
>>>>>>>>>>> interfaces.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Martijn
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Thu, 12 May 2022 at 06:19, Александр
>>>>>> Смирнов
>>>>>>> <
>>>>>>>>>>>>>>>>> smiralexan@gmail.com>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi Jark!
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Sorry for the late response. I would like to
>>>>>>> make
>>>>>>>>>>> some
>>>>>>>>>>>>>>>>> comments and
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> clarify my points.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 1) I agree with your first statement. I think
>>>>>>> we
>>>>>>>>> can
>>>>>>>>>>>>>>>> achieve
>>>>>>>>>>>>>>>>> both
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> advantages this way: put the Cache interface
>>>>>> in
>>>>>>>>>>>>>>>>> flink-table-common,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> but have implementations of it in
>>>>>>>>>>> flink-table-runtime.
>>>>>>>>>>>>>>>>> Therefore if a
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> connector developer wants to use existing
>>>>>> cache
>>>>>>>>>>>>>>> strategies
>>>>>>>>>>>>>>>>> and their
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> implementations, he can just pass
>>>>>> lookupConfig
>>>>>>> to
>>>>>>>>> the
>>>>>>>>>>>>>>>>> planner, but if
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> he wants to have its own cache implementation
>>>>>>> in
>>>>>>>>> his
>>>>>>>>>>>>>>>>> TableFunction, it
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> will be possible for him to use the existing
>>>>>>>>>>> interface
>>>>>>>>>>>>>>> for
>>>>>>>>>>>>>>>>> this
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> purpose (we can explicitly point this out in
>>>>>>> the
>>>>>>>>>>>>>>>>> documentation). In
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> this way all configs and metrics will be
>>>>>>> unified.
>>>>>>>>>>> WDYT?
>
>> If a filter can prune 90% of data in the cache, we will have 90% of
>> lookup requests that can never be cached
>
> 2) Let me clarify the logic of the filter optimization in the case of
> an LRU cache. It looks like Cache<RowData, Collection<RowData>>. Here
> we always store the response of the dimension table in the cache,
> even after applying the calc function. I.e. if there are no rows left
> after applying filters to the result of the 'eval' method of
> TableFunction, we store an empty list under the lookup keys.
> Therefore the cache line will be filled, but will require much less
> memory (in bytes). I.e. we don't completely filter out the keys whose
> result was pruned, but significantly reduce the memory required to
> store that result. If the user knows about this behavior, he can
> increase the 'max-rows' option before the start of the job. But
> actually I came up with the idea that we can do this automatically by
> using the 'maximumWeight' and 'weigher' methods of the Guava Cache
> [1]. The weight can be the size of the collection of rows (the cache
> value). Therefore the cache can automatically fit many more records
> than before.
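Guava's CacheBuilder.maximumWeight(w).weigher((k, v) -> v.size()) combination mentioned above caps the cache by total weight rather than by entry count. The stdlib sketch below mimics that semantics, weighing each entry by the size of its row collection, so that empty results from filtered-out lookups cost almost nothing; it is a simplified illustration (insertion-order eviction, each key inserted once), not Guava's actual implementation.

```java
import java.util.ArrayDeque;
import java.util.Collection;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Weight-bounded cache sketch: entries are evicted in insertion order
// once the summed weight (total number of cached rows) exceeds maxWeight.
// Simplification: assumes each key is put at most once.
class WeightedCache<K, V> {
    private final long maxWeight;
    private long currentWeight = 0;
    private final Map<K, Collection<V>> cache = new HashMap<>();
    private final Deque<K> insertionOrder = new ArrayDeque<>();

    WeightedCache(long maxWeight) {
        this.maxWeight = maxWeight;
    }

    void put(K key, Collection<V> rows) {
        cache.put(key, rows);
        insertionOrder.addLast(key);
        currentWeight += rows.size(); // the "weigher": weight = number of rows
        while (currentWeight > maxWeight && !insertionOrder.isEmpty()) {
            K evicted = insertionOrder.removeFirst();
            Collection<V> removed = cache.remove(evicted);
            if (removed != null) {
                currentWeight -= removed.size();
            }
        }
    }

    Collection<V> getIfPresent(K key) {
        return cache.get(key);
    }

    int size() {
        return cache.size();
    }
}
```

With weight-based sizing, an empty result (a pruned lookup) has weight 0, so many such keys fit alongside a few heavy entries.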
>
>> Flink SQL has provided a standard way to do filters and projects
>> pushdown, i.e., SupportsFilterPushDown and
>> SupportsProjectionPushDown. Jdbc/hive/HBase haven't implemented the
>> interfaces, doesn't mean it's hard to implement.
>
> It's debatable how difficult it will be to implement filter pushdown.
> But I think the fact that currently there is no database connector
> with filter pushdown at least means that this feature won't be
> supported soon in connectors. Moreover, if we talk about other
> connectors (not in the Flink repo), their databases might not support
> all Flink filters (or might not support filters at all). I think
> users are interested in having the cache filter optimization
> independently of support for other features and of solving more
> complex (or outright unsolvable) problems.
>
> 3) I agree with your third statement. Actually in our internal
> version I also tried to unify the logic of scanning and reloading
> data from connectors. But unfortunately, I didn't find a way to unify
> the logic of all ScanRuntimeProviders (InputFormat, SourceFunction,
> Source, ...) and reuse it for reloading the ALL cache. As a result I
> settled on using InputFormat, because it was used for scanning in all
> lookup connectors. (I didn't know that there are plans to deprecate
> InputFormat in favor of the FLIP-27 Source.) IMO usage of the FLIP-27
> source in ALL caching is not a good idea, because this source was
> designed to work in a distributed environment (SplitEnumerator on the
> JobManager and SourceReaders on the TaskManagers), not in one
> operator (the lookup join operator in our case). There is not even a
> direct way to pass splits from the SplitEnumerator to a SourceReader
> (this logic works through the SplitEnumeratorContext, which requires
> an OperatorCoordinator.SubtaskGateway to send AddSplitEvents). Usage
> of InputFormat for the ALL cache seems much clearer and easier. But
> if there are plans to refactor all connectors to FLIP-27, I have the
> following idea: maybe we can drop the lookup join ALL cache in favor
> of a simple join with repeated scanning of a batch source? The point
> is that the only difference between a lookup join ALL cache and a
> simple join with a batch source is that in the first case scanning is
> performed multiple times, in between which the state (cache) is
> cleared (correct me if I'm wrong). So what if we extend the
> functionality of the simple join to support state reloading + extend
> the functionality of scanning a batch source multiple times (this one
> should be easy with the new FLIP-27 source, which unifies
> streaming/batch reading - we would need to change only the
> SplitEnumerator, which would pass the splits again after some TTL)?
> WDYT? I must say that this looks like a long-term goal and would make
> the scope of this FLIP even larger than you said. Maybe we can limit
> ourselves to a simpler solution now (InputFormats).
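Independent of which scan API ends up underneath, the ALL-cache contract described above is simple: rescan the whole dimension table after a TTL and atomically swap the old snapshot for the new one. A minimal sketch of that contract, where the Supplier stands in for whatever InputFormat- or Source-based scan is chosen (the class name AllCache is illustrative):

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

// ALL-cache sketch: the whole dimension table is loaded into an
// immutable snapshot; reload() rescans and swaps the snapshot
// atomically, which is the "clear state between scans" behavior
// discussed above. The scan itself is abstracted behind a Supplier.
class AllCache<K, V> {
    private final Supplier<Map<K, List<V>>> scan;
    private final AtomicReference<Map<K, List<V>>> snapshot;

    AllCache(Supplier<Map<K, List<V>>> scan) {
        this.scan = scan;
        this.snapshot = new AtomicReference<>(Map.of());
    }

    // In the real operator this would be triggered by a TTL timer.
    void reload() {
        snapshot.set(Map.copyOf(scan.get()));
    }

    List<V> lookup(K key) {
        return snapshot.get().getOrDefault(key, List.of());
    }
}
```

Lookups between reloads always see one consistent snapshot; the swap makes the reload invisible to concurrent readers.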
>
> So to sum up, my points are like this:
> 1) There is a way to make both concise and flexible interfaces for
> caching in lookup join.
> 2) The cache filter optimization is important both in LRU and ALL
> caches.
> 3) It is unclear when filter pushdown will be supported in Flink
> connectors; some of the connectors might not have the opportunity to
> support filter pushdown + as far as I know, currently filter pushdown
> works only for scanning (not lookup). So the cache filters +
> projections optimization should be independent from other features.
> 4) The ALL cache realization is a complex topic that involves
> multiple aspects of how Flink is developing. Moving from InputFormat
> to the FLIP-27 Source would make the ALL cache realization really
> complex and unclear, so maybe instead of that we can extend the
> functionality of the simple join, or keep InputFormat in the case of
> the lookup join ALL cache?
>
> Best regards,
> Smirnov Alexander
>
> [1]
> https://guava.dev/releases/18.0/api/docs/com/google/common/cache/CacheBuilder.html#weigher(com.google.common.cache.Weigher)
>
> On Thu, 5 May 2022 at 20:34, Jark Wu <imjark@gmail.com> wrote:
>
>> It's great to see the active discussion! I want to share my ideas:
>>
>> 1) implement the cache in the framework vs. in the connector base
>> I don't have a strong opinion on this. Both ways should work (e.g.,
>> cache pruning, compatibility).
>> The framework way can provide more concise interfaces.
>> The connector base way can define more flexible cache
>> strategies/implementations.
>> We are still investigating a way to see if we can have both
>> advantages.
>> We should reach a consensus that the chosen way should be a final
>> state, and that we are on the path to it.
>>
>> 2) filters and projections pushdown:
>> I agree with Alex that the filter pushdown into the cache can benefit
>> the ALL cache a lot.
>> However, this is not true for the LRU cache. Connectors use a cache
>> to reduce IO requests to databases for better throughput.
>> If a filter can prune 90% of the data in the cache, we will have 90%
>> of lookup requests that can never be cached and that hit the
>> databases directly. That means the cache is meaningless in this case.
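The trade-off Jark describes here, versus caching the pruned (empty) results as in Alexander's point 2, can be quantified with a small simulation. The numbers and the countDbRequests helper below are purely illustrative, not from either design:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Compares two caching policies under a filter that prunes most keys:
// (a) don't cache empty (pruned) results -> pruned keys miss forever;
// (b) cache empty results too -> each pruned key misses only once.
class FilterCacheDemo {
    static int countDbRequests(List<String> lookups, Set<String> surviving,
                               boolean cacheEmptyResults) {
        Map<String, List<Integer>> cache = new HashMap<>();
        int dbRequests = 0;
        for (String key : lookups) {
            if (cache.containsKey(key)) {
                continue; // cache hit, no database access
            }
            dbRequests++; // cache miss: go to the database
            // the filter prunes every row for non-surviving keys
            List<Integer> rows = surviving.contains(key) ? List.of(1) : List.of();
            if (!rows.isEmpty() || cacheEmptyResults) {
                cache.put(key, rows);
            }
        }
        return dbRequests;
    }
}
```

If 90% of keys are pruned and results are only cached when non-empty, those keys generate a database request on every lookup; storing the empty result bounds each key to a single miss.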
>>
>> IMO, Flink SQL has provided a standard way to do filters and
>> projects pushdown, i.e., SupportsFilterPushDown and
>> SupportsProjectionPushDown.
>> Jdbc/hive/HBase haven't implemented the interfaces; that doesn't
>> mean it's hard to implement.
>> They should implement the pushdown interfaces to reduce IO and the
>> cache size.
>> The final state should be that the scan source and the lookup source
>> share the exact same pushdown implementation.
>> I don't see why we need to duplicate the pushdown logic in caches,
>> which would complicate the lookup join design.
>>
>> 3) ALL cache abstraction
>> The ALL cache might be the most challenging part of this FLIP. We
>> have never provided a reload-lookup public interface.
>> Currently, we put the reload logic in the "eval" method of
>> TableFunction. That's hard for some sources (e.g., Hive).
>> Ideally, connector implementations should share the logic of reload
>> and scan, i.e. a ScanTableSource with an
>> InputFormat/SourceFunction/FLIP-27 Source.
>> However, InputFormat/SourceFunction are deprecated, and the FLIP-27
>> source is deeply coupled with the SourceOperator.
>> If we want to invoke the FLIP-27 source in LookupJoin, this may make
>> the scope of this FLIP much larger.
>> We are still investigating how to abstract the ALL cache logic and
>> reuse the existing source interfaces.
>>
>> Best,
>> Jark
>>
>> On Thu, 5 May 2022 at 20:22, Roman Boyko <ro.v.boyko@gmail.com> wrote:
>>
>>> It's a much more complicated activity and lies outside the scope of
>>> this improvement, because such pushdowns should be done for all
>>> ScanTableSource implementations (not only for the Lookup ones).
>>>
>>> On Thu, 5 May 2022 at 19:02, Martijn Visser <martijnvisser@apache.org> wrote:
>>>
>>>> Hi everyone,
>>>>
>>>> One question regarding "And Alexander correctly mentioned that
>>>> filter pushdown still is not implemented for jdbc/hive/hbase." ->
>>>> Would an alternative solution be to actually implement these
>>>> filter pushdowns? I can imagine that there are many more benefits
>>>> to doing that, outside of lookup caching and metrics.
>>>>
>>>> Best regards,
>>>>
>>>> Martijn Visser
>>>> https://twitter.com/MartijnVisser82
>>>> https://github.com/MartijnVisser
>>>>
>>>> On Thu, 5 May 2022 at 13:58, Roman Boyko <ro.v.boyko@gmail.com> wrote:
>>>>
>>>>> Hi everyone!
>>>>>
>>>>> Thanks for driving such a valuable improvement!
>>>>>
>>>>> I do think that a single cache implementation would be a nice
>>>>> opportunity for users. And it will break the "FOR SYSTEM_TIME AS
>>>>> OF proc_time" semantics anyway - no matter how it is implemented.
>>>>>
>>>>> Putting myself in the user's shoes, I can say that:
>>>>> 1) I would prefer to have the opportunity to cut down the cache
>>>>> size by simply filtering out unnecessary data. And the handiest
>>>>> way to do that is to apply it inside the LookupRunners. It would
>>>>> be a bit harder to pass it through the LookupJoin node to the
>>>>> TableFunction. And Alexander correctly mentioned that filter
>>>>> pushdown still is not implemented for jdbc/hive/hbase.
>>>>> 2) The ability to set different caching parameters for different
>>>>> tables is quite important. So I would prefer to set it through DDL
>>>>> rather than have the same ttl, strategy and other options for all
>>>>> lookup tables.
>>>>> 3) Providing the cache in the framework really deprives us of
>>>>> extensibility (users won't be able to implement their own cache).
>>>>> But most probably this might be solved by creating more different
>>>>> cache strategies and a wider set of configurations.
>>>>>
>>>>> All these points are much closer to the schema proposed by
>>>>> Alexander. Qingsheng Ren, please correct me if I'm not right and
>>>>> all these facilities might be simply implemented in your
>>>>> architecture?
>>>>>
>>>>> Best regards,
>>>>> Roman Boyko
>>>>> e.: ro.v.boyko@gmail.com
>>>>>
>>>>> On Wed, 4 May 2022 at 21:01, Martijn Visser <martijnvisser@apache.org> wrote:
>>>>>
>>>>>> Hi everyone,
>>>>>>
>>>>>> I don't have much to chip in, but I just wanted to express that I
>>>>>> really appreciate the in-depth discussion on this topic, and I
>>>>>> hope that others will join the conversation.
>>>>>>
>>>>>> Best regards,
>>>>>>
>>>>>> Martijn
>>>>>>
>>>>>> On Tue, 3 May 2022 at 10:15, Александр Смирнов <smiralexan@gmail.com> wrote:
>>>>>>
>>>>>>> Hi Qingsheng, Leonard and Jark,
>>>>>>>
>>>>>>> Thanks for your detailed feedback! However, I have questions
>>>>>>> about some of your statements (maybe I didn't get something?).
>>>>>>>
>>>>>>>> Caching actually breaks the semantic of "FOR SYSTEM_TIME AS OF
>>>>>>>> proc_time"
>>>>>>>
>>>>>>> I agree that the semantics of "FOR SYSTEM_TIME AS OF proc_time"
>>>>>>> is not fully implemented with caching, but as you said, users
>>>>>>> opt into it consciously to achieve better performance (no one
>>>>>>> proposed to enable caching by default, etc.). Or by users do you
>>>>>>> mean other developers of connectors? In that case developers
>>>>>>> explicitly specify whether their connector supports caching or
>>>>>>> not (in the list of supported options); no one makes them do
>>>>>>> that if they don't want to. So what exactly is the difference
>>>>>>> between implementing caching in the flink-table-runtime module
>>>>>>> and in flink-table-common from this point of view? How does it
>>>>>>> affect breaking/not breaking the semantics of "FOR SYSTEM_TIME
>>>>>>> AS OF proc_time"?
>>>>>>>
>>>>>>>> confront a situation that allows table options in DDL to
>>>>>>>> control the behavior of the framework, which has never happened
>>>>>>>> previously and should be cautious
>>>>>>>
>>>>>>> If we talk about the main difference between the semantics of
>>>>>>> DDL options and config options ("table.exec.xxx"), isn't it
>>>>>>> about limiting the scope of the options + their importance for
>>>>>>> the user's business logic, rather than about the specific
>>>>>>> location of the corresponding logic in the framework? I mean
>>>>>>> that in my design, for example, putting an option with the
>>>>>>> lookup cache strategy in the configurations would be the wrong
>>>>>>> decision, because it directly affects the user's business logic
>>>>>>> (not just performance optimization) + touches just several
>>>>>>> functions of ONE table (there can be multiple tables with
>>>>>>> different caches). Does it really matter for the user (or
>>>>>>> someone else) where the logic affected by the applied option is
>>>>>>> located?
>>>>>>> Also I can recall the DDL option 'sink.parallelism', which in
>>>>>>> some way "controls the behavior of the framework", and I don't
>>>>>>> see any problem there.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> introduce a new interface for this
>>>>>>>>> all-caching
>>>>>>>>>>>>>>>>> scenario
>>>>>>>>>>>>>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> design
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> would become more complex
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> This is a subject for a separate
>>>>>>>> discussion,
>>>>>>>>>>> but
>>>>>>>>>>>>>>>>> actually
>>>>>>>>>>>>>>>>>>>>>>>>>>>> in our
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> internal version we solved this problem
>>>>>>>> quite
>>>>>>>>>>>>>>> easily
>>>>>>>>>>>>>>>> -
>>>>>>>>>>>>>>>>> we
reused InputFormat class (so there is no need for a new API). The point
is that currently all lookup connectors use InputFormat for scanning
the data in batch mode: HBase, JDBC and even Hive - it uses the class
PartitionReader, which is actually just a wrapper around InputFormat.
The advantage of this solution is the ability to reload cache data in
parallel (the number of threads depends on the number of InputSplits,
but has an upper limit). As a result, the cache reload time is
significantly reduced (as well as the time the input stream is
blocked). I know that we usually try to avoid concurrency in Flink
code, but maybe this one can be an exception. BTW, I don't claim it's
an ideal solution; there may be better ones.
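The parallel reload described above can be sketched in a few lines of
plain Java. This is only an illustration of the idea: `split` and
`reload` are hypothetical stand-ins for InputSplits and the per-split
read loop, not Flink's actual InputFormat API.

```java
import java.util.*;
import java.util.concurrent.*;

/** Sketch: reload a lookup cache from several "splits" in parallel. */
public class ParallelCacheReload {

    /** Stand-in for an InputSplit: a batch of (key, row) pairs. */
    public static List<Map.Entry<String, String>> split(String... kv) {
        List<Map.Entry<String, String>> rows = new ArrayList<>();
        for (int i = 0; i < kv.length; i += 2) {
            rows.add(Map.entry(kv[i], kv[i + 1]));
        }
        return rows;
    }

    /** Loads all splits concurrently; the thread count is capped, mirroring
     *  the "upper limit" on parallelism mentioned above. */
    public static Map<String, String> reload(
            List<List<Map.Entry<String, String>>> splits, int maxThreads) {
        int threads = Math.min(maxThreads, Math.max(1, splits.size()));
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        ConcurrentMap<String, String> cache = new ConcurrentHashMap<>();
        for (List<Map.Entry<String, String>> s : splits) {
            // one task per split, all writing into the shared cache
            pool.execute(() -> s.forEach(e -> cache.put(e.getKey(), e.getValue())));
        }
        pool.shutdown();
        try {
            pool.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return cache;
    }

    public static void main(String[] args) {
        List<List<Map.Entry<String, String>>> splits = List.of(
                split("1", "alice", "2", "bob"),
                split("3", "carol", "4", "dave"));
        System.out.println(reload(splits, 4).size()); // 4 rows from 2 splits
    }
}
```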

> Providing the cache in the framework might introduce compatibility
> issues

That's possible only when the developer of the connector doesn't
properly refactor his code and uses the new cache options incorrectly
(i.e. explicitly provides the same options in 2 different code places).
For correct behavior, all he needs to do is redirect the existing
options to the framework's LookupConfig (+ maybe add an alias for
options, if the naming differed); everything will be transparent to
users. If the developer doesn't refactor at all, nothing changes for
the connector because of backward compatibility. Also, if a developer
wants to use his own cache logic, he can simply decline to pass some of
the configs to the framework and instead provide his own implementation
with the already existing configs and metrics (but I actually think
that's a rare case).

> filters and projections should be pushed all the way down to the
> table function, like what we do in the scan source

That's a great goal. But the truth is that the ONLY connector that
currently supports filter pushdown is FileSystemTableSource (no
database connector supports it). Also, for some databases it's simply
impossible to push down filters as complex as the ones we have in
Flink.

> only applying these optimizations to the cache seems not quite useful

Filters can cut off an arbitrarily large amount of data from the
dimension table. For a simple example, suppose that in the dimension
table 'users' we have a column 'age' with values from 20 to 40, and an
input stream 'clicks' that is ~uniformly distributed by the age of
users. If we have the filter 'age > 30', there will be roughly half as
much data in the cache. This means the user can increase
'lookup.cache.max-rows' by almost 2 times, which gains a huge
performance boost. Moreover, this optimization really starts to shine
with the 'ALL' cache, where tables that can't fit in memory without
filters and projections can fit with them. This opens up additional
possibilities for users. And that doesn't sound like 'not quite
useful'.

It would be great to hear other voices regarding this topic! Because we
have quite a lot of controversial points, and I think with the help of
others it will be easier for us to come to a consensus.

Best regards,
Smirnov Alexander


On Fri, Apr 29, 2022 at 22:33, Qingsheng Ren <renqschn@gmail.com> wrote:
>
> Hi Alexander and Arvid,
>
> Thanks for the discussion and sorry for my late response! We had an
> internal discussion together with Jark and Leonard, and I'd like to
> summarize our ideas. Instead of implementing the cache logic in the
> table runtime layer or wrapping it around the user-provided table
> function, we prefer to introduce some new APIs extending
> TableFunction, with these concerns:
>
> 1. Caching actually breaks the semantic of "FOR SYSTEM_TIME AS OF
> proc_time", because it couldn't truly reflect the content of the
> lookup table at the moment of querying. If users choose to enable
> caching on the lookup table, they implicitly indicate that this
> breakage is acceptable in exchange for the performance. So we prefer
> not to provide caching on the table runtime level.
>
> 2. If we make the cache implementation in the framework (whether in a
> runner or a wrapper around TableFunction), we have to confront a
> situation that allows table options in DDL to control the behavior of
> the framework, which has never happened previously and should be
> treated cautiously. Under the current design, the behavior of the
> framework should only be specified by configurations
> ("table.exec.xxx"), and it's hard to apply these general configs to a
> specific table.
>
> 3. We have use cases where the lookup source loads and refreshes all
> records periodically into memory to achieve high lookup performance
> (like the Hive connector in the community, and also widely used by
> our internal connectors). Wrapping the cache around the user's
> TableFunction works fine for LRU caches, but I think we would have to
> introduce a new interface for this all-caching scenario, and the
> design would become more complex.
>
> 4. Providing the cache in the framework might introduce compatibility
> issues to existing lookup sources: there might exist two caches with
> totally different strategies if the user incorrectly configures the
> table (one in the framework and another implemented by the lookup
> source).
>
> As for the optimization mentioned by Alexander, I think filters and
> projections should be pushed all the way down to the table function,
> like what we do in the scan source, instead of into the runner with
> the cache. The goal of using a cache is to reduce the network I/O and
> the pressure on the external system, and only applying these
> optimizations to the cache seems not quite useful.
>
> I made some updates to the FLIP[1] to reflect our ideas. We prefer to
> keep the cache implementation as a part of TableFunction, and we
> could provide some helper classes (CachingTableFunction,
> AllCachingTableFunction, CachingAsyncTableFunction) for developers
> and regulate the metrics of the cache. Also, I made a POC[2] for your
> reference.
>
> Looking forward to your ideas!
>
> [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-221+Abstraction+for+lookup+source+cache+and+metric
> [2] https://github.com/PatrickRen/flink/tree/FLIP-221
>
> Best regards,
>
> Qingsheng
>
> On Tue, Apr 26, 2022 at 4:45 PM Александр Смирнов
> <smiralexan@gmail.com> wrote:
> >
> > Thanks for the response, Arvid!
> >
> > I have a few comments on your message.
> >
> > > but could also live with an easier solution as the first step
> >
> > I think that these 2 ways are mutually exclusive (the one originally
> > proposed by Qingsheng and mine), because conceptually they follow
> > the same goal, but the implementation details are different. If we
> > go one way, moving to the other way in the future will mean deleting
> > existing code and once again changing the API for connectors. So I
> > think we should reach a consensus with the community about that and
> > then work together on this FLIP, i.e. divide the work into tasks for
> > different parts of the FLIP (for example, LRU cache unification /
> > introducing the proposed set of metrics / further work...). WDYT,
> > Qingsheng?
> >
> > > as the source will only receive the requests after filter
> >
> > Actually, if filters are applied to fields of the lookup table, we
> > first must do the requests, and only after that can we filter the
> > responses, because lookup connectors don't have filter pushdown. So
> > if filtering is done before caching, there will be many fewer rows
> > in the cache.
> >
> > > @Alexander unfortunately, your architecture is not shared. I don't
> > > know the solution to share images to be honest.
> >
> > Sorry for that, I'm a bit new to such kinds of conversations :)
> > I have no write access to the Confluence, so I made a Jira issue
> > where I described the proposed changes in more detail -
> > https://issues.apache.org/jira/browse/FLINK-27411.
> >
> > Will be happy to get more feedback!
> >
> > Best,
> > Smirnov Alexander
> >
> > On Mon, Apr 25, 2022 at 19:49, Arvid Heise <arvid@apache.org> wrote:
> >
> > > Hi Qingsheng,
> > >
> > > Thanks for driving this; the inconsistency was not satisfying for
> > > me.
> > >
> > > I second Alexander's idea, but could also live with an easier
> > > solution as the first step: instead of making caching an
> > > implementation detail of TableFunction X, rather devise a caching
> > > layer around X. So the proposal would be a CachingTableFunction
> > > that delegates to X in case of misses and otherwise manages the
> > > cache. Lifting it into the operator model as proposed would be
> > > even better but is probably unnecessary in the first step for a
> > > lookup source (as the source will only receive the requests after
> > > the filter; applying projection may be more interesting to save
> > > memory).
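That delegation pattern can be sketched with a bounded LRU cache and
hit/miss counters. This is illustrative Java only; `CachingLookup` is
not the proposed CachingTableFunction API, just the shape of the idea.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

/** Sketch of a caching layer around a lookup function: delegate to the
 *  wrapped function only on cache misses. */
public class CachingLookup<K, V> {
    private final Function<K, V> delegate;   // the wrapped "TableFunction X"
    private final LinkedHashMap<K, V> cache; // LRU eviction via access order
    private long hits, misses;

    public CachingLookup(Function<K, V> delegate, int maxRows) {
        this.delegate = delegate;
        this.cache = new LinkedHashMap<>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                return size() > maxRows; // evict the least-recently-used row
            }
        };
    }

    public V lookup(K key) {
        V value = cache.get(key);
        if (value != null) {
            hits++;                    // hit: served from the cache
            return value;
        }
        misses++;                      // miss: delegate to the real lookup
        value = delegate.apply(key);
        if (value != null) {
            cache.put(key, value);
        }
        return value;
    }

    public long hits() { return hits; }
    public long misses() { return misses; }

    public static void main(String[] args) {
        Map<String, String> external = Map.of("1", "alice", "2", "bob");
        CachingLookup<String, String> f = new CachingLookup<>(external::get, 100);
        f.lookup("1"); // miss, goes to the external table
        f.lookup("1"); // hit, served from the cache
        System.out.println(f.hits() + " hit, " + f.misses() + " miss");
    }
}
```

Hit/miss counters like these are also where the FLIP's standardized
cache metrics would naturally hook in.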
> > >
> > > Another advantage is that all the changes of this FLIP would be
> > > limited to options; there is no need for new public interfaces.
> > > Everything else remains an implementation detail of the Table
> > > runtime. That means we can easily incorporate the optimization
> > > potential that Alexander pointed out later.
> > >
> > > @Alexander unfortunately, your architecture is not shared. I don't
> > > know the solution to share images to be honest.
> > >
> > > On Fri, Apr 22, 2022 at 5:04 PM Александр Смирнов
> > > <smiralexan@gmail.com> wrote:
> > > >
> > > > Hi Qingsheng! My name is Alexander; I'm not a committer yet, but
> > > > I'd really like to become one, and this FLIP really interested
> > > > me. Actually, I have worked on a similar feature in my company's
> > > > Flink fork, and we would like to share our thoughts on this and
> > > > make the code open source.
> > > >
> > > > I think there is a better alternative than introducing an
> > > > abstract class for TableFunction (CachingTableFunction). As you
> > > > know, TableFunction lives in the flink-table-common module,
> > > > which provides only an API for working with tables - it's very
> > > > convenient to import in connectors. In turn,
> > > > CachingTableFunction contains logic for runtime execution, so
> > > > this class and everything connected with it should be located in
> > > > another module, probably flink-table-runtime. But that would
> > > > require connectors to depend on another module that contains a
> > > > lot of runtime logic, which doesn't sound good.
> > > >
> > > > I suggest adding a new method 'getLookupConfig' to
> > > > LookupTableSource or LookupRuntimeProvider to allow connectors
> > > > to only pass configurations to the planner, so they won't depend
> > > > on the runtime implementation. Based on these configs, the
> > > > planner will construct a lookup join operator with the
> > > > corresponding runtime logic (ProcessFunctions in the module
> > > > flink-table-runtime). The architecture looks like the pinned
> > > > image (the LookupConfig class there is actually your
> > > > CacheConfig).
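A rough sketch of what such a config hand-off could look like. All
class, field and runner names here are hypothetical illustrations of
the proposal, not actual Flink API.

```java
/** Sketch of the proposed 'getLookupConfig' contract: the connector only
 *  declares its cache parameters; the planner picks the runtime operator. */
public class LookupConfigSketch {

    enum CacheStrategy { NONE, LRU, ALL }

    /** Immutable config handed from the connector to the planner. */
    static final class LookupConfig {
        final CacheStrategy strategy;
        final long maxRows;    // e.g. from 'lookup.cache.max-rows'
        final long ttlMillis;  // e.g. from a cache TTL option

        LookupConfig(CacheStrategy strategy, long maxRows, long ttlMillis) {
            this.strategy = strategy;
            this.maxRows = maxRows;
            this.ttlMillis = ttlMillis;
        }
    }

    /** What a connector would expose instead of implementing caching itself. */
    interface LookupTableSourceSketch {
        LookupConfig getLookupConfig();
    }

    /** Planner side: choose a runner based on the declared config. */
    static String planLookupJoin(LookupTableSourceSketch source) {
        switch (source.getLookupConfig().strategy) {
            case LRU: return "LookupJoinCachingRunner";
            case ALL: return "AllCacheLookupJoinRunner";
            default:  return "LookupJoinRunner";
        }
    }

    public static void main(String[] args) {
        LookupTableSourceSketch jdbcLike =
                () -> new LookupConfig(CacheStrategy.LRU, 10_000, 60_000);
        System.out.println(planLookupJoin(jdbcLike)); // LookupJoinCachingRunner
    }
}
```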
> > > >
> > > > The classes in flink-table-planner that will be responsible for
> > > > this are CommonPhysicalLookupJoin and its inheritors. The
> > > > current classes for lookup join in flink-table-runtime are
> > > > LookupJoinRunner, AsyncLookupJoinRunner,
> > > > LookupJoinRunnerWithCalc and AsyncLookupJoinRunnerWithCalc.
> > > > I suggest adding classes LookupJoinCachingRunner,
> > > > LookupJoinCachingRunnerWithCalc, etc.
> > > >
> > > > And here comes another, more powerful advantage of such a
> > > > solution. If we have the caching logic on a lower level, we can
> > > > apply some optimizations to it. LookupJoinRunnerWithCalc was
> > > > named like this because it uses the 'calc' function, which
> > > > actually consists mostly of filters and projections.
> > > >
> > > > For example, when joining table A with lookup table B on the
> > > > condition 'JOIN ... ON A.id = B.id AND A.age = B.age + 10 WHERE
> > > > B.salary > 1000', the 'calc' function will contain the filters
> > > > A.age = B.age + 10 and B.salary > 1000.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> If we apply this function before
>>>>>>>> storing
>>>>>>>>>>>>>>> records
>>>>>>>>>>>>>>>> in
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> cache,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> size
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> of
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> cache will be significantly
>>>>>> reduced:
>>>>>>>>>>> filters =
>>>>>>>>>>>>>>>>> avoid
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> storing
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> useless
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> records in cache, projections =
>>>>>>> reduce
>>>>>>>>>>>>> records’
>>>>>>>>>>>>>>>>>>>>>>>>>>>> size. So
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> initial
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> max number of records in cache can
>>>>>> be
>>>>>>>>>>>>> increased
>>>>>>>>>>>>>>>> by
>>>>>>>>>>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> user.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> What do you think about it?
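The space saving described in the quote above can be sketched as a toy example (plain Java, not Flink's actual runtime classes; the row shape and method names are assumptions for illustration):

```java
import java.util.List;
import java.util.function.Function;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Toy sketch: run the join's "calc" (filter + projection) on looked-up rows
// BEFORE they enter the lookup cache, so rows that can never match are not
// stored, and stored rows carry only the columns the query needs.
class CalcBeforeCache {
    // A looked-up row from table B: (id, age, salary, payload).
    record Row(long id, int age, long salary, String payload) {}

    static List<String> filterAndProject(
            List<Row> lookedUp,
            Predicate<Row> filter,           // e.g. B.salary > 1000
            Function<Row, String> project) { // keep only the needed columns
        return lookedUp.stream()
                .filter(filter)
                .map(project)
                .collect(Collectors.toList());
    }
}
```

With the example predicate B.salary > 1000, a row with salary 500 is dropped before it ever reaches the cache, which is exactly where the cache-size saving comes from.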
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On 2022/04/19 02:47:11 Qingsheng Ren wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi devs,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Yuan and I would like to start a discussion about FLIP-221 [1],
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> which introduces an abstraction of lookup table cache and its
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> standard metrics.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Currently each lookup table source has to implement its own
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> cache to store lookup results, and there isn't a standard set of
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> metrics for users and developers to tune their jobs with lookup
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> joins, which is a quite common use case in Flink table / SQL.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Therefore we propose some new APIs including cache, metrics,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> wrapper classes of TableFunction and new table options. Please
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> take a look at the FLIP page [1] to get more details. Any
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> suggestions and comments would be appreciated!
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-221+Abstraction+for+lookup+source+cache+and+metric
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Qingsheng
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Best Regards,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Qingsheng Ren
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Real-time Computing Team
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Alibaba Cloud
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Email: renqschn@gmail.com
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Roman Boyko
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> e.: ro.v.boyko@gmail.com
Re: [DISCUSS] FLIP-221 Abstraction for lookup source cache and metric

Posted by Qingsheng Ren <re...@gmail.com>.
Hi Becket,

Thanks for your feedback!

1. An alternative is to let the cache implementation decide whether to
store a missing key in the cache, instead of the framework. This sounds
more reasonable and makes the LookupProvider interface cleaner. I can
update the FLIP and clarify in the JavaDoc of LookupCache#put that the
cache implementation should decide whether to store an empty collection.
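That "cache decides" idea could look roughly like the following (a sketch with simplified types; this stand-in class mirrors LookupCache#put but is not the FLIP's exact interface):

```java
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;

// Sketch: the cache implementation itself decides whether to store lookup
// misses (represented as an empty row collection), instead of the framework
// consulting a separate cacheMissingKey() flag on the provider.
class MissDecidingCache {
    private final Map<String, Collection<String>> store = new HashMap<>();
    private final boolean cacheMissingKey;

    MissDecidingCache(boolean cacheMissingKey) {
        this.cacheMissingKey = cacheMissingKey;
    }

    // Mirrors LookupCache#put: an empty collection means the key had no match.
    void put(String key, Collection<String> rows) {
        if (rows.isEmpty() && !cacheMissingKey) {
            return; // this implementation chooses not to remember misses
        }
        store.put(key, rows);
    }

    // Returns the cached rows, or null if the key was never cached.
    Collection<String> getIfPresent(String key) {
        return store.get(key);
    }
}
```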

2. Initially the builder pattern was introduced for the extensibility
of the LookupProvider interfaces, since we might need to add more
configurations in the future. We can remove that builder now that we
have resolved the issue in 1. As for the builder in DefaultLookupCache,
I prefer to keep it because the constructor has a lot of arguments.
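For illustration, a cache configuration with several independent optional settings is exactly where a builder pays off (a sketch; the option names follow this discussion, but the exact DefaultLookupCache API is an assumption here):

```java
import java.time.Duration;

// Sketch of why a builder suits DefaultLookupCache-style configuration:
// several optional, independently combinable settings, where telescoping
// constructors would multiply quickly.
class CacheConfig {
    final Duration expireAfterWrite;
    final long maxRows;
    final boolean cacheMissingKey;

    private CacheConfig(Builder b) {
        this.expireAfterWrite = b.expireAfterWrite;
        this.maxRows = b.maxRows;
        this.cacheMissingKey = b.cacheMissingKey;
    }

    static Builder newBuilder() {
        return new Builder();
    }

    static class Builder {
        private Duration expireAfterWrite = Duration.ofHours(1); // defaults
        private long maxRows = 10_000L;
        private boolean cacheMissingKey = true;

        Builder expireAfterWrite(Duration d) { this.expireAfterWrite = d; return this; }
        Builder maxRows(long n) { this.maxRows = n; return this; }
        Builder cacheMissingKey(boolean b) { this.cacheMissingKey = b; return this; }

        CacheConfig build() { return new CacheConfig(this); }
    }
}
```

A caller only names the knobs it cares about, e.g. `CacheConfig.newBuilder().maxRows(500L).build()`.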

3. I think this might overturn the overall design. I agree with
Becket's idea that the API design should be layered with extensibility
in mind, and it would be great to have one unified interface supporting
partial, full and even mixed custom strategies, but we have some issues
to resolve. The original purpose of treating full caching separately is
that we'd like to reuse the ability of ScanRuntimeProvider: developers
just hand over a Source / SourceFunction / InputFormat so that the
framework can compose the underlying topology and control the reload
(maybe in a distributed way). Under your design the reload operation is
left entirely to the CacheStrategy, and I think it will be hard for
developers to reuse the source in the initializeCache method.

Best regards,

Qingsheng

On Thu, Jun 2, 2022 at 1:50 PM Becket Qin <be...@gmail.com> wrote:
>
> Thanks for updating the FLIP, Qingsheng. A few more comments:
>
> 1. I am still not sure what the use case for cacheMissingKey() is.
> More specifically, when would users want getCache() to return a
> non-empty value while cacheMissingKey() returns false?
>
> 2. The builder pattern. Usually the builder pattern is used when there are
> many variations of constructors. For example, if a class has three
> variables and all of them are optional, there could potentially be many
> combinations of the variables. But in this FLIP, I don't see such a case.
> What is the reason we have builders for all the classes?
>
> 3. Should the caching strategy be excluded from the top level provider API?
> Technically speaking, the Flink framework should only have two interfaces
> to deal with:
>     A) LookupFunction
>     B) AsyncLookupFunction
> Orthogonally, we *believe* there are two different strategies people can
> use for caching. Note that the Flink framework does not care what the
> caching strategy is here.
>     a) partial caching
>     b) full caching
>
> Putting them together, we end up with 3 combinations that we think are
> valid:
>      Aa) PartialCachingLookupFunctionProvider
>      Ba) PartialCachingAsyncLookupFunctionProvider
>      Ab) FullCachingLookupFunctionProvider
>
> However, the caching strategy could actually be quite flexible. E.g. an
> initial full cache load followed by some partial updates. Also, I am not
> 100% sure if the full caching will always use ScanTableSource. Including
> the caching strategy in the top level provider API would make it harder to
> extend.
>
> One possible solution is to just have *LookupFunctionProvider* and
> *AsyncLookupFunctionProvider
> *as the top level API, both with a getCacheStrategy() method returning an
> optional CacheStrategy. The CacheStrategy class would have the following
> methods:
> 1. void open(Context), where the context exposes some of the resources
> that may be useful for the caching strategy, e.g. an ExecutorService that
> is synchronized with the data processing, or a cache refresh trigger which
> blocks data processing and refreshes the cache.
> 2. void initializeCache(), a blocking method that allows users to
> pre-populate the cache before processing any data if they wish.
> 3. void maybeCache(RowData key, Collection<RowData> value), a blocking or
> non-blocking method.
> 4. void refreshCache(), a blocking / non-blocking method that is invoked by
> the Flink framework when the cache refresh trigger is pulled.
>
> In the above design, partial caching and full caching would be
> implementations of the CacheStrategy. And it is OK for users to implement
> their own CacheStrategy if they want to.
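A rough Java sketch of the CacheStrategy surface Becket describes (types simplified to String instead of RowData, the open(Context) hook omitted for brevity, plus a trivial partial-caching implementation; all names here are illustrative, not a committed API):

```java
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;

// Sketch of the pluggable strategy: the framework only calls these hooks,
// and partial / full / mixed caching become different implementations.
interface CacheStrategy {
    default void initializeCache() {}                     // optional pre-population
    void maybeCache(String key, Collection<String> value); // called per lookup result
    default void refreshCache() {}                         // invoked on refresh trigger
    Collection<String> getIfPresent(String key);
}

// A trivial "partial caching" implementation: cache every looked-up key,
// drop everything when the refresh trigger fires.
class PartialCacheStrategy implements CacheStrategy {
    private final Map<String, Collection<String>> cache = new HashMap<>();

    @Override
    public void maybeCache(String key, Collection<String> value) {
        cache.put(key, value);
    }

    @Override
    public void refreshCache() {
        cache.clear();
    }

    @Override
    public Collection<String> getIfPresent(String key) {
        return cache.get(key);
    }
}
```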
>
> Thanks,
>
> Jiangjie (Becket) Qin
>
>
> On Thu, Jun 2, 2022 at 12:14 PM Jark Wu <im...@gmail.com> wrote:
>
> > Thanks Qingsheng for the detailed summary and updates,
> >
> > The changes look good to me in general. I just have one minor improvement
> > comment.
> > Could we add a static util method to the "FullCachingReloadTrigger"
> > interface for quick usage?
> >
> > #periodicReloadAtFixedRate(Duration)
> > #periodicReloadWithFixedDelay(Duration)
> >
> > I think we can also do this for LookupCache, because users may not know
> > where the default implementations are and how to use them.
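Jark's suggestion of static convenience factories on the interface itself might look like this (a sketch only; FullCachingReloadTrigger's real contract is still under discussion in this thread, so the trigger is reduced here to its schedule):

```java
import java.time.Duration;

// Sketch: static factory methods on the trigger interface, so callers can
// write ReloadTrigger.periodicReloadAtFixedRate(...) without knowing any
// implementation class.
interface ReloadTrigger {
    Duration period();
    boolean atFixedRate();

    static ReloadTrigger periodicReloadAtFixedRate(Duration period) {
        return new Periodic(period, true);
    }

    static ReloadTrigger periodicReloadWithFixedDelay(Duration delay) {
        return new Periodic(delay, false);
    }
}

// Package-private default implementation backing both factories.
class Periodic implements ReloadTrigger {
    private final Duration period;
    private final boolean fixedRate;

    Periodic(Duration period, boolean fixedRate) {
        this.period = period;
        this.fixedRate = fixedRate;
    }

    @Override public Duration period() { return period; }
    @Override public boolean atFixedRate() { return fixedRate; }
}
```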
> >
> > Best,
> > Jark
> >
> >
> >
> >
> >
> >
> > On Wed, 1 Jun 2022 at 18:32, Qingsheng Ren <re...@gmail.com> wrote:
> >
> > > Hi Jingsong,
> > >
> > > Thanks for your comments!
> > >
> > > > The AllCache definition is not flexible; for example, PartialCache can
> > > > use any custom storage, while AllCache cannot. AllCache could also store
> > > > to memory or disk, so it also needs a flexible strategy.
> > >
> > > We had an offline discussion with Jark and Leonard. Basically we think
> > > exposing the interface of full cache storage to connector developers
> > might
> > > limit our future optimizations. The storage of full caching shouldn’t
> > have
> > > too many variations for different lookup tables so making it pluggable
> > > might not help a lot. Also I think it is not quite easy for connector
> > > developers to implement such an optimized storage. We can keep optimizing
> > > this storage in the future and all full caching lookup tables would
> > benefit
> > > from this.
> > >
> > > > We are more inclined to deprecate the connector `async` option when
> > > discussing FLIP-234. Can we remove this option from this FLIP?
> > >
> > > Thanks for the reminder! This option has been removed in the latest
> > > version.
> > >
> > > Best regards,
> > >
> > > Qingsheng
> > >
> > >
> > > > On Jun 1, 2022, at 15:28, Jingsong Li <ji...@gmail.com> wrote:
> > > >
> > > > Thanks Alexander for your reply. We can discuss the new interface when
> > it
> > > > comes out.
> > > >
> > > > We are more inclined to deprecate the connector `async` option when
> > > > discussing FLIP-234 [1]. We should use a hint to let the planner decide.
> > > > Although the discussion has not yet reached a conclusion, can we remove
> > > > this option from this FLIP? It doesn't seem to be related to this FLIP,
> > > > but rather to FLIP-234, and we can form a conclusion over there.
> > > >
> > > > [1] https://lists.apache.org/thread/9k1sl2519kh2n3yttwqc00p07xdfns3h
> > > >
> > > > Best,
> > > > Jingsong
> > > >
> > > > On Wed, Jun 1, 2022 at 4:59 AM Jing Ge <ji...@ververica.com> wrote:
> > > >
> > > >> Hi Jark,
> > > >>
> > > >> Thanks for clarifying it. That would be fine, as long as we could
> > > >> provide the no-cache solution. I was just wondering whether the
> > > >> client-side cache could really help when HBase is used, since the data
> > > >> to look up should be huge. Depending on how much data is cached on the
> > > >> client side, the data that should be LRU in e.g. LruBlockCache will not
> > > >> be LRU anymore. In the worst-case scenario, once the cached data at the
> > > >> client side has expired, the request will hit disk, which will cause
> > > >> extra latency temporarily, if I am not mistaken.
> > > >>
> > > >> Best regards,
> > > >> Jing
> > > >>
> > > >> On Mon, May 30, 2022 at 9:59 AM Jark Wu <im...@gmail.com> wrote:
> > > >>
> > > >>> Hi Jing Ge,
> > > >>>
> > > >>> What do you mean by the "impact on the block cache used by HBase"?
> > > >>> In my understanding, the connector cache and HBase cache are totally
> > > two
> > > >>> things.
> > > >>> The connector cache is a local/client cache, and the HBase cache is a
> > > >>> server cache.
> > > >>>
> > > >>>> does it make sense to have a no-cache solution as one of the
> > > >>> default solutions so that customers will have no effort for the
> > > migration
> > > >>> if they want to stick with Hbase cache
> > > >>>
> > > >>> The implementation migration should be transparent to users. Take the
> > > >> HBase
> > > >>> connector as
> > > >>> an example,  it already supports lookup cache but is disabled by
> > > default.
> > > >>> After migration, the
> > > >>> connector still disables cache by default (i.e. no-cache solution).
> > No
> > > >>> migration effort for users.
> > > >>>
> > > >>> HBase cache and connector cache are two different things. The HBase
> > > >>> cache can't simply replace the connector cache, because one of the most
> > > >>> important usages of the connector cache is reducing I/O
> > > >>> requests/responses and improving throughput, which cannot be achieved
> > > >>> by just using a server cache.
> > > >>>
> > > >>> Best,
> > > >>> Jark
> > > >>>
> > > >>>
> > > >>>
> > > >>>
> > > >>> On Fri, 27 May 2022 at 22:42, Jing Ge <ji...@ververica.com> wrote:
> > > >>>
> > > >>>> Thanks all for the valuable discussion. The new feature looks very
> > > >>>> interesting.
> > > >>>>
> > > >>>> According to the FLIP description: "*Currently we have JDBC, Hive
> > and
> > > >>> HBase
> > > >>>> connector implemented lookup table source. All existing
> > > implementations
> > > >>>> will be migrated to the current design and the migration will be
> > > >>>> transparent to end users*." I was only wondering if we should pay
> > > >>> attention
> > > >>>> to HBase and similar DBs. Since the lookup data will commonly be huge
> > > >>>> when using HBase, partial caching will be used in this case, if I am
> > > >>>> not mistaken, which might have an impact on the block cache used by
> > > >>>> HBase, e.g. LruBlockCache.
> > > >>>> Another question: since HBase provides a sophisticated cache
> > > >>>> solution, does it make sense to have a no-cache solution as one of the
> > > >>>> default solutions, so that customers will need no migration effort if
> > > >>>> they want to stick with the HBase cache?
> > > >>>>
> > > >>>> Best regards,
> > > >>>> Jing
> > > >>>>
> > > >>>> On Fri, May 27, 2022 at 11:19 AM Jingsong Li <
> > jingsonglee0@gmail.com>
> > > >>>> wrote:
> > > >>>>
> > > >>>>> Hi all,
> > > >>>>>
> > > >>>>> I think the current problems are as follows:
> > > >>>>> 1. The AllCache and PartialCache interfaces are not uniform: one
> > > >>>>> needs to provide a LookupProvider, the other a CacheBuilder.
> > > >>>>> 2. The AllCache definition is not flexible; for example, PartialCache
> > > >>>>> can use any custom storage, while AllCache cannot. AllCache could
> > > >>>>> also store to memory or disk, so it also needs a flexible strategy.
> > > >>>>> 3. AllCache cannot customize its ReloadStrategy; currently there is
> > > >>>>> only ScheduledReloadStrategy.
> > > >>>>> 
> > > >>>>> To solve the above problems, my ideas are as follows.
> > > >>>>>
> > > >>>>> ## Top level cache interfaces:
> > > >>>>>
> > > >>>>> ```
> > > >>>>>
> > > >>>>> public interface CacheLookupProvider extends
> > > >>>>> LookupTableSource.LookupRuntimeProvider {
> > > >>>>>
> > > >>>>>    CacheBuilder createCacheBuilder();
> > > >>>>> }
> > > >>>>>
> > > >>>>>
> > > >>>>> public interface CacheBuilder {
> > > >>>>>    Cache create();
> > > >>>>> }
> > > >>>>>
> > > >>>>>
> > > >>>>> public interface Cache {
> > > >>>>>
> > > >>>>>    /**
> > > >>>>>     * Returns the value associated with key in this cache, or null
> > > >> if
> > > >>>>> there is no cached value for
> > > >>>>>     * key.
> > > >>>>>     */
> > > >>>>>    @Nullable
> > > >>>>>    Collection<RowData> getIfPresent(RowData key);
> > > >>>>>
> > > >>>>>    /** Returns the number of key-value mappings in the cache. */
> > > >>>>>    long size();
> > > >>>>> }
> > > >>>>>
> > > >>>>> ```
> > > >>>>>
> > > >>>>> ## Partial cache
> > > >>>>>
> > > >>>>> ```
> > > >>>>>
> > > >>>>> public interface PartialCacheLookupFunction extends
> > > >>> CacheLookupProvider {
> > > >>>>>
> > > >>>>>    @Override
> > > >>>>>    PartialCacheBuilder createCacheBuilder();
> > > >>>>>
> > > >>>>> /** Creates an {@link LookupFunction} instance. */
> > > >>>>> LookupFunction createLookupFunction();
> > > >>>>> }
> > > >>>>>
> > > >>>>>
> > > >>>>> public interface PartialCacheBuilder extends CacheBuilder {
> > > >>>>>
> > > >>>>>    PartialCache create();
> > > >>>>> }
> > > >>>>>
> > > >>>>>
> > > >>>>> public interface PartialCache extends Cache {
> > > >>>>>
> > > >>>>>    /**
> > > >>>>>     * Associates the specified value rows with the specified key
> > row
> > > >>>>> in the cache. If the cache
> > > >>>>>     * previously contained value associated with the key, the old
> > > >>>>> value is replaced by the
> > > >>>>>     * specified value.
> > > >>>>>     *
> > > >>>>>     * @return the previous value rows associated with key, or null
> > > >> if
> > > >>>>> there was no mapping for key.
> > > >>>>>     * @param key - key row with which the specified value is to be
> > > >>>>> associated
> > > >>>>>     * @param value – value rows to be associated with the specified
> > > >>> key
> > > >>>>>     */
> > > >>>>>    Collection<RowData> put(RowData key, Collection<RowData> value);
> > > >>>>>
> > > >>>>>    /** Discards any cached value for the specified key. */
> > > >>>>>    void invalidate(RowData key);
> > > >>>>> }
> > > >>>>>
> > > >>>>> ```
> > > >>>>>
> > > >>>>> ## All cache
> > > >>>>> ```
> > > >>>>>
> > > >>>>> public interface AllCacheLookupProvider extends
> > CacheLookupProvider {
> > > >>>>>
> > > >>>>>    void registerReloadStrategy(ScheduledExecutorService
> > > >>>>> executorService, Reloader reloader);
> > > >>>>>
> > > >>>>>    ScanTableSource.ScanRuntimeProvider getScanRuntimeProvider();
> > > >>>>>
> > > >>>>>    @Override
> > > >>>>>    AllCacheBuilder createCacheBuilder();
> > > >>>>> }
> > > >>>>>
> > > >>>>>
> > > >>>>> public interface AllCacheBuilder extends CacheBuilder {
> > > >>>>>
> > > >>>>>    AllCache create();
> > > >>>>> }
> > > >>>>>
> > > >>>>>
> > > >>>>> public interface AllCache extends Cache {
> > > >>>>>
> > > >>>>>    void putAll(Iterator<Map<RowData, RowData>> allEntries);
> > > >>>>>
> > > >>>>>    void clearAll();
> > > >>>>> }
> > > >>>>>
> > > >>>>>
> > > >>>>> public interface Reloader {
> > > >>>>>
> > > >>>>>    void reload();
> > > >>>>> }
> > > >>>>>
> > > >>>>> ```
> > > >>>>>
> > > >>>>> Best,
> > > >>>>> Jingsong
> > > >>>>>
> > > >>>>> On Fri, May 27, 2022 at 11:10 AM Jingsong Li <
> > jingsonglee0@gmail.com
> > > >>>
> > > >>>>> wrote:
> > > >>>>>
> > > >>>>>> Thanks Qingsheng and all for your discussion.
> > > >>>>>>
> > > >>>>>> Very sorry to jump in so late.
> > > >>>>>>
> > > >>>>>> Maybe I missed something?
> > > >>>>>> My first impression when I saw the cache interface was: why don't we
> > > >>>>>> provide an interface similar to the Guava cache [1]? On top of the
> > > >>>>>> Guava cache, Caffeine also adds extensions for asynchronous calls
> > > >>>>>> [2], and Caffeine supports bulk loading too.
> > > >>>>>> 
> > > >>>>>> I am also confused about why we first go from
> > > >>>>>> LookupCacheFactory.Builder to a Factory and only then create the
> > > >>>>>> Cache.
> > > >>>>>>
> > > >>>>>> [1] https://github.com/google/guava
> > > >>>>>> [2] https://github.com/ben-manes/caffeine/wiki/Population
> > > >>>>>>
> > > >>>>>> Best,
> > > >>>>>> Jingsong
> > > >>>>>>
> > > >>>>>> On Thu, May 26, 2022 at 11:17 PM Jark Wu <im...@gmail.com>
> > wrote:
> > > >>>>>>
> > > >>>>>>> After looking at the newly introduced ReloadTime and Becket's
> > > >>>>>>> comment, I agree with Becket that we should have a pluggable
> > > >>>>>>> reloading strategy. We can provide some common implementations,
> > > >>>>>>> e.g., periodic reloading and daily reloading. But there will
> > > >>>>>>> definitely be some connector- or business-specific reloading
> > > >>>>>>> strategies, e.g. notification by a ZooKeeper watcher, or reloading
> > > >>>>>>> once a new Hive partition is complete.
> > > >>>>>>>
> > > >>>>>>> Best,
> > > >>>>>>> Jark
> > > >>>>>>>
> > > >>>>>>> On Thu, 26 May 2022 at 11:52, Becket Qin <be...@gmail.com>
> > > >>>> wrote:
> > > >>>>>>>
> > > >>>>>>>> Hi Qingsheng,
> > > >>>>>>>>
> > > >>>>>>>> Thanks for updating the FLIP. A few comments / questions below:
> > > >>>>>>>>
> > > >>>>>>>> 1. Is there a reason that we have both "XXXFactory" and
> > > >>>>>>>> "XXXProvider"? What is the difference between them? If they are the
> > > >>>>>>>> same, can we just use XXXFactory everywhere?
> > > >>>>>>>>
> > > >>>>>>>> 2. Regarding the FullCachingLookupProvider, should the reloading
> > > >>>>>>>> policy also be pluggable? Periodic reloading can sometimes be
> > > >>>>>>>> tricky in practice. For example, if a user sets 24 hours as the
> > > >>>>>>>> cache refresh interval and some nightly batch job is delayed, the
> > > >>>>>>>> cache update may still see stale data.
> > > >>>>>>>>
> > > >>>>>>>> 3. In DefaultLookupCacheFactory, it looks like InitialCapacity
> > > >>>> should
> > > >>>>> be
> > > >>>>>>>> removed.
> > > >>>>>>>>
> > > >>>>>>>> 4. The purpose of LookupFunctionProvider#cacheMissingKey() seems a
> > > >>>>>>>> little confusing to me. If Optional<LookupCacheFactory>
> > > >>>>>>>> getCacheFactory() returns a non-empty factory, doesn't that already
> > > >>>>>>>> indicate that the framework should cache the missing keys? Also,
> > > >>>>>>>> why does this method return an Optional<Boolean> instead of a
> > > >>>>>>>> boolean?
> > > >>>>>>>>
> > > >>>>>>>> Thanks,
> > > >>>>>>>>
> > > >>>>>>>> Jiangjie (Becket) Qin
> > > >>>>>>>>
> > > >>>>>>>>
> > > >>>>>>>>
> > > >>>>>>>> On Wed, May 25, 2022 at 5:07 PM Qingsheng Ren <
> > > >> renqschn@gmail.com
> > > >>>>
> > > >>>>>>> wrote:
> > > >>>>>>>>
> > > >>>>>>>>> Hi Lincoln and Jark,
> > > >>>>>>>>>
> > > >>>>>>>>> Thanks for the comments! If the community reaches a consensus
> > > >>> that
> > > >>>> we
> > > >>>>>>> use
> > > >>>>>>>>> SQL hint instead of table options to decide whether to use sync
> > > >>> or
> > > >>>>>>> async
> > > >>>>>>>>> mode, it’s indeed not necessary to introduce the “lookup.async”
> > > >>>>> option.
> > > >>>>>>>>>
> > > >>>>>>>>> I think it’s a good idea to let the decision of async made on
> > > >>> query
> > > >>>>>>>>> level, which could make better optimization with more
> > > >> infomation
> > > >>>>>>> gathered
> > > >>>>>>>>> by planner. Is there any FLIP describing the issue in
> > > >>> FLINK-27625?
> > > >>>> I
> > > >>>>>>>>> thought FLIP-234 is only proposing adding SQL hint for retry on
> > > >>>>> missing
> > > >>>>>>>>> instead of the entire async mode to be controlled by hint.
> > > >>>>>>>>>
> > > >>>>>>>>> Best regards,
> > > >>>>>>>>>
> > > >>>>>>>>> Qingsheng
> > > >>>>>>>>>
> > > >>>>>>>>>> On May 25, 2022, at 15:13, Lincoln Lee <
> > > >> lincoln.86xy@gmail.com
> > > >>>>
> > > >>>>>>> wrote:
> > > >>>>>>>>>>
> > > >>>>>>>>>> Hi Jark,
> > > >>>>>>>>>>
> > > >>>>>>>>>> Thanks for your reply!
> > > >>>>>>>>>>
> > > >>>>>>>>>> Currently 'lookup.async' just lies in HBase connector, I have
> > > >>> no
> > > >>>>> idea
> > > >>>>>>>>>> whether or when to remove it (we can discuss it in another
> > > >>> issue
> > > >>>>> for
> > > >>>>>>> the
> > > >>>>>>>>>> HBase connector after FLINK-27625 is done), just not add it
> > > >>> into
> > > >>>> a
> > > >>>>>>>>> common
> > > >>>>>>>>>> option now.
> > > >>>>>>>>>>
> > > >>>>>>>>>> Best,
> > > >>>>>>>>>> Lincoln Lee
> > > >>>>>>>>>>
> > > >>>>>>>>>>
> > > >>>>>>>>>> Jark Wu <im...@gmail.com> 于2022年5月24日周二 20:14写道:
> > > >>>>>>>>>>
> > > >>>>>>>>>>> Hi Lincoln,
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> I have taken a look at FLIP-234, and I agree with you that
> > > >> the
> > > >>>>>>>>> connectors
> > > >>>>>>>>>>> can
> > > >>>>>>>>>>> provide both async and sync runtime providers simultaneously
> > > >>>>> instead
> > > >>>>>>>>> of one
> > > >>>>>>>>>>> of them.
> > > >>>>>>>>>>> At that point, "lookup.async" looks redundant. If this
> > > >> option
> > > >>> is
> > > >>>>>>>>> planned to
> > > >>>>>>>>>>> be removed
> > > >>>>>>>>>>> in the long term, I think it makes sense not to introduce it
> > > >>> in
> > > >>>>> this
> > > >>>>>>>>> FLIP.
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> Best,
> > > >>>>>>>>>>> Jark
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> On Tue, 24 May 2022 at 11:08, Lincoln Lee <
> > > >>>> lincoln.86xy@gmail.com
> > > >>>>>>
> > > >>>>>>>>> wrote:
> > > >>>>>>>>>>>
> > > >>>>>>>>>>>> Hi Qingsheng,
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> Sorry for jumping into the discussion so late. It's a good
> > > >>> idea
> > > >>>>>>> that
> > > >>>>>>>>> we
> > > >>>>>>>>>>> can
> > > >>>>>>>>>>>> have a common table option. I have a minor comment on
> > > >>>>>>>>>>>> 'lookup.async': let's not make it a common option.
> > > >>>>>>>>>>>> 
> > > >>>>>>>>>>>> The table layer abstracts both sync and async lookup
> > > >>>>>>>>>>>> capabilities; connector implementers can choose one or both.
> > > >>>>>>>>>>>> When only one capability is implemented (the status of most
> > > >>>>>>>>>>>> existing built-in connectors), 'lookup.async' will not be used.
> > > >>>>>>>>>>>> And when a connector has both capabilities, I think this choice
> > > >>>>>>>>>>>> is better made at the query level: for example, the table
> > > >>>>>>>>>>>> planner can choose the physical implementation of async or sync
> > > >>>>>>>>>>>> lookup based on its cost model, or users can give a query hint
> > > >>>>>>>>>>>> based on their own better understanding. If there is another
> > > >>>>>>>>>>>> common table option 'lookup.async', it may confuse users in the
> > > >>>>>>>>>>>> long run.
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> So, I prefer to leave the 'lookup.async' option in private
> > > >>>> place
> > > >>>>>>> (for
> > > >>>>>>>>> the
> > > >>>>>>>>>>>> current hbase connector) and not turn it into a common
> > > >>> option.
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> WDYT?
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> Best,
> > > >>>>>>>>>>>> Lincoln Lee
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> Qingsheng Ren <re...@gmail.com> 于2022年5月23日周一 14:54写道:
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>>> Hi Alexander,
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>> Thanks for the review! We recently updated the FLIP and you
> > > >>>>>>>>>>>>> can find those changes in my latest email. Since some
> > > >>>>>>>>>>>>> terminology has changed, I'll use the new concepts when
> > > >>>>>>>>>>>>> replying to your comments.
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>> 1. Builder vs 'of'
> > > >>>>>>>>>>>>> I'm OK with using the builder pattern if we have additional
> > > >>>>>>>>>>>>> optional parameters for the full caching mode ("rescan"
> > > >>>>>>>>>>>>> previously). The schedule-with-delay idea looks reasonable to
> > > >>>>>>>>>>>>> me, but I think we need to redesign the builder API of full
> > > >>>>>>>>>>>>> caching to make it more descriptive for developers. Would you
> > > >>>>>>>>>>>>> mind sharing your ideas about the API? For access to the FLIP
> > > >>>>>>>>>>>>> workspace you can just provide your account ID and ping any
> > > >>>>>>>>>>>>> PMC member, including Jark.
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>> 2. Common table options
> > > >>>>>>>>>>>>> We have had some discussions these days and propose to
> > > >>>>>>>>>>>>> introduce 8 common table options related to caching. They have
> > > >>>>>>>>>>>>> been updated in the FLIP.
> > > >>> FLIP.
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>> 3. Retries
> > > >>>>>>>>>>>>> I think we are on the same page :-)
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>> For your additional concerns:
> > > >>>>>>>>>>>>> 1) The table option has been updated.
> > > >>>>>>>>>>>>> 2) We got “lookup.cache” back for configuring whether to
> > > >> use
> > > >>>>>>> partial
> > > >>>>>>>>> or
> > > >>>>>>>>>>>>> full caching mode.
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>> Best regards,
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>> Qingsheng
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> On May 19, 2022, at 17:25, Александр Смирнов <
> > > >>>>>>> smiralexan@gmail.com>
> > > >>>>>>>>>>>>> wrote:
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> Also I have a few additions:
> > > >>>>>>>>>>>>>> 1) maybe rename 'lookup.cache.maximum-size' to
> > > >>>>>>>>>>>>>> 'lookup.cache.max-rows'? I think it makes it clearer that we are
> > > >>>>>>>>>>>>>> talking about the number of rows, not bytes. Plus it fits better,
> > > >>>>>>>>>>>>>> considering my optimization with filters.
> > > >>>>>>>>>>>>>> 2) How will users enable rescanning? Are we going to
> > > >>> separate
> > > >>>>>>>>> caching
> > > >>>>>>>>>>>>>> and rescanning from the options point of view? Like
> > > >>> initially
> > > >>>>> we
> > > >>>>>>> had
> > > >>>>>>>>>>>>>> one option 'lookup.cache' with values LRU / ALL. I think
> > > >>> now
> > > >>>> we
> > > >>>>>>> can
> > > >>>>>>>>>>>>>> make a boolean option 'lookup.rescan'. RescanInterval can
> > > >>> be
> > > >>>>>>>>>>>>>> 'lookup.rescan.interval', etc.
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> Best regards,
> > > >>>>>>>>>>>>>> Alexander
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> чт, 19 мая 2022 г. в 14:50, Александр Смирнов <
> > > >>>>>>> smiralexan@gmail.com
> > > >>>>>>>>>>>> :
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>> Hi Qingsheng and Jark,
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>> 1. Builders vs 'of'
> > > >>>>>>>>>>>>>>> I understand that builders are used when we have
> > > >> multiple
> > > >>>>>>>>>>> parameters.
> > > >>>>>>>>>>>>>>> I suggested them because we could add parameters later.
> > > >> To
> > > >>>>>>> prevent
> > > >>>>>>>>>>>>>>> Builder for ScanRuntimeProvider from looking redundant I
> > > >>> can
> > > >>>>>>>>> suggest
> > > >>>>>>>>>>>>>>> one more config now - "rescanStartTime".
> > > >>>>>>>>>>>>>>> It's a time in UTC (LocalTime class) when the first
> > > >> reload
> > > >>>> of
> > > >>>>>>> cache
> > > >>>>>>>>>>>>>>> starts. This parameter can be thought of as
> > > >> 'initialDelay'
> > > >>>>> (diff
> > > >>>>>>>>>>>>>>> between current time and rescanStartTime) in method
> > > >>>>>>>>>>>>>> ScheduledExecutorService#scheduleWithFixedDelay [1]. It
> > > >>> can
> > > >>>> be
> > > >>>>>>> very
> > > >>>>>>>>>>>>>>> useful when the dimension table is updated by some other
> > > >>>>>>> scheduled
> > > >>>>>>>>>>> job
> > > >>>>>>>>>>>>>>> at a certain time. Or when the user simply wants a
> > > >> second
> > > >>>> scan
> > > >>>>>>>>>>> (first
> > > >>>>>>>>>>>>>> cache reload) to be delayed. This option can be used even
> > > >>>> without
> > > >>>>>>>>>>>>>>> 'rescanInterval' - in this case 'rescanInterval' will be
> > > >>> one
> > > >>>>>>> day.
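[Editor's note: the "rescanStartTime as initialDelay" idea above can be sketched as follows. The class and method names here are hypothetical illustrations, not part of the FLIP.]

```java
import java.time.Duration;
import java.time.LocalTime;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

class RescanScheduler {

    /** Delay from 'nowUtc' until the next occurrence of 'rescanStartTime' (UTC). */
    static Duration initialDelay(LocalTime rescanStartTime, ZonedDateTime nowUtc) {
        ZonedDateTime firstRun =
                nowUtc.toLocalDate().atTime(rescanStartTime).atZone(ZoneOffset.UTC);
        if (!firstRun.isAfter(nowUtc)) {
            firstRun = firstRun.plusDays(1); // already passed today -> first reload tomorrow
        }
        return Duration.between(nowUtc, firstRun);
    }

    public static void main(String[] args) {
        ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor();
        Duration delay = initialDelay(LocalTime.of(2, 0), ZonedDateTime.now(ZoneOffset.UTC));
        // 'rescanInterval' defaults to one day when only rescanStartTime is given
        executor.scheduleWithFixedDelay(
                () -> System.out.println("reloading ALL cache"),
                delay.toMillis(), TimeUnit.DAYS.toMillis(1), TimeUnit.MILLISECONDS);
        executor.shutdownNow(); // demo only: don't keep the scheduler alive
    }
}
```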
> > > >>>>>>>>>>>>>>> If you are fine with this option, I would be very glad
> > > >> if
> > > >>>> you
> > > >>>>>>> would
> > > >>>>>>>>>>>>>> give me access to edit the FLIP page, so I could add it myself.
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>> 2. Common table options
> > > >>>>>>>>>>>>>>> I also think that FactoryUtil would be overloaded by all
> > > >>>> cache
> > > >>>>>>>>>>>>>>> options. But maybe unify all suggested options, not only
> > > >>> for
> > > >>>>>>>>> default
> > > >>>>>>>>>>>>>>> cache? I.e. class 'LookupOptions', that unifies default
> > > >>>> cache
> > > >>>>>>>>>>> options,
> > > >>>>>>>>>>>>>>> rescan options, 'async', 'maxRetries'. WDYT?
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>> 3. Retries
> > > >>>>>>>>>>>>>>> I'm fine with suggestion close to
> > > >>> RetryUtils#tryTimes(times,
> > > >>>>>>> call)
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>> [1]
> > https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/ScheduledExecutorService.html#scheduleWithFixedDelay-java.lang.Runnable-long-long-java.util.concurrent.TimeUnit-
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>> Best regards,
> > > >>>>>>>>>>>>>>> Alexander
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>> ср, 18 мая 2022 г. в 16:04, Qingsheng Ren <
> > > >>>> renqschn@gmail.com
> > > >>>>>> :
> > > >>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>> Hi Jark and Alexander,
> > > >>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>> Thanks for your comments! I’m also OK to introduce
> > > >> common
> > > >>>>> table
> > > >>>>>>>>>>>>> options. I prefer to introduce a new
> > > >>> DefaultLookupCacheOptions
> > > >>>>>>> class
> > > >>>>>>>>>>> for
> > > >>>>>>>>>>>>> holding these option definitions because putting all
> > > >> options
> > > >>>>> into
> > > >>>>>>>>>>>> FactoryUtil would make it a bit “crowded” and not well
> > > >>>>>>> categorized.
> > > >>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>> FLIP has been updated according to suggestions above:
> > > >>>>>>>>>>>>>>>> 1. Use static “of” method for constructing
> > > >>>>>>> RescanRuntimeProvider
> > > >>>>>>>>>>>>> considering both arguments are required.
> > > >>>>>>>>>>>>>>>> 2. Introduce new table options matching
> > > >>>>>>> DefaultLookupCacheFactory
> > > >>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>> Best,
> > > >>>>>>>>>>>>>>>> Qingsheng
> > > >>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>> On Wed, May 18, 2022 at 2:57 PM Jark Wu <
> > > >>> imjark@gmail.com>
> > > >>>>>>> wrote:
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>> Hi Alex,
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>> 1) retry logic
> > > >>>>>>>>>>>>>>>>> I think we can extract some common retry logic into
> > > >>>>> utilities,
> > > >>>>>>>>>>> e.g.
> > > >>>>>>>>>>>>> RetryUtils#tryTimes(times, call).
> > > >>>>>>>>>>>>>>>>> This seems independent of this FLIP and can be reused
> > > >> by
> > > >>>>>>>>>>> DataStream
> > > >>>>>>>>>>>>> users.
> > > >>>>>>>>>>>>>>>>> Maybe we can open an issue to discuss this and where
> > > >> to
> > > >>>> put
> > > >>>>>>> it.
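[Editor's note: a minimal sketch of what the proposed RetryUtils#tryTimes(times, call) helper could look like. The signature is only the one floated in this thread; no such utility exists in Flink at this point, and a real implementation would likely retry only specific retriable failures.]

```java
import java.util.concurrent.Callable;

final class RetryUtils {

    private RetryUtils() {}

    /**
     * Invokes {@code call} up to {@code times} attempts and returns the first
     * successful result, rethrowing the last failure if all attempts fail.
     */
    static <T> T tryTimes(int times, Callable<T> call) throws Exception {
        Exception last = null;
        for (int attempt = 0; attempt < times; attempt++) {
            try {
                return call.call();
            } catch (Exception e) {
                last = e; // a real implementation might only retry retriable errors
            }
        }
        throw last != null ? last : new IllegalArgumentException("times must be > 0");
    }
}
```

A connector could then wrap the body of its lookup() in such a helper and keep any connection re-establishment logic inside the callable.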
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>> 2) cache ConfigOptions
> > > >>>>>>>>>>>>>>>>> I'm fine with defining cache config options in the
> > > >>>>> framework.
> > > >>>>>>>>>>>>>>>>> A candidate place to put is FactoryUtil which also
> > > >>>> includes
> > > >>>>>>>>>>>>> "sink.parallelism", "format" options.
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>> Best,
> > > >>>>>>>>>>>>>>>>> Jark
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>> On Wed, 18 May 2022 at 13:52, Александр Смирнов <
> > > >>>>>>>>>>>> smiralexan@gmail.com>
> > > >>>>>>>>>>>>> wrote:
> > > >>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>> Hi Qingsheng,
> > > >>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>> Thank you for considering my comments.
> > > >>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>> there might be custom logic before making retry,
> > > >> such
> > > >>> as
> > > >>>>>>>>>>>>> re-establish the connection
> > > >>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>> Yes, I understand that. I meant that such logic can
> > > >> be
> > > >>>>>>> placed in
> > > >>>>>>>>>>> a
> > > >>>>>>>>>>>>>>>>>> separate function, that can be implemented by
> > > >>> connectors.
> > > >>>>>>> Just
> > > >>>>>>>>>>>> moving
> > > >>>>>>>>>>>>>>>>>> the retry logic would make connector's LookupFunction
> > > >>>> more
> > > >>>>>>>>>>> concise
> > > >>>>>>>>>>>> +
> > > >>>>>>>>>>>>>>>>>> avoid duplicate code. However, it's a minor change.
> > > >> The
> > > >>>>>>> decision
> > > >>>>>>>>>>> is
> > > >>>>>>>>>>>>> up
> > > >>>>>>>>>>>>>>>>>> to you.
> > > >>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>> We decided not to provide common DDL options and to let
> > > >>>>>>>>>>>>>>>>>>>>> developers define their own options, as we do now per connector.
> > > >>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>> What is the reason for that? One of the main goals of
> > > >>>> this
> > > >>>>>>> FLIP
> > > >>>>>>>>>>> was
> > > >>>>>>>>>>>>> to
> > > >>>>>>>>>>>>>>>>>> unify the configs, wasn't it? I understand that
> > > >> current
> > > >>>>> cache
> > > >>>>>>>>>>>> design
> > > >>>>>>>>>>>>>>>>>> doesn't depend on ConfigOptions, unlike before. But
> > > >>>> still
> > > >>>>>>> we
> > > >>>>>>>>>>> can
> > > >>>>>>>>>>>>> put
> > > >>>>>>>>>>>>>>>>>> these options into the framework, so connectors can
> > > >>> reuse
> > > >>>>>>> them
> > > >>>>>>>>>>> and
> > > >>>>>>>>>>>>>>>>>> avoid code duplication, and, what is more
> > > >> significant,
> > > >>>>> avoid
> > > >>>>>>>>>>>> possible
> > > >>>>>>>>>>>>>>>>>> different option naming. This can be pointed out in the
> > > >>>>>>>>>>>>>>>>>> documentation for connector developers.
> > > >>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>> Best regards,
> > > >>>>>>>>>>>>>>>>>> Alexander
> > > >>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>> вт, 17 мая 2022 г. в 17:11, Qingsheng Ren <
> > > >>>>>>> renqschn@gmail.com>:
> > > >>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>> Hi Alexander,
> > > >>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>> Thanks for the review and glad to see we are on the
> > > >>> same
> > > >>>>>>> page!
> > > >>>>>>>>> I
> > > >>>>>>>>>>>>> think you forgot to cc the dev mailing list so I’m also
> > > >>>> quoting
> > > >>>>>>> your
> > > >>>>>>>>>>>> reply
> > > >>>>>>>>>>>>> under this email.
> > > >>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>> We can add 'maxRetryTimes' option into this class
> > > >>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>> In my opinion the retry logic should be implemented
> > > >> in
> > > >>>>>>> lookup()
> > > >>>>>>>>>>>>> instead of in LookupFunction#eval(). Retrying is only
> > > >>>> meaningful
> > > >>>>>>>>> under
> > > >>>>>>>>>>>> some
> > > >>>>>>>>>>>> specific retriable failures, and there might be custom logic
> > > >>>>>>>>>>>> before retrying, such as re-establishing the connection
> > > >>>>>>> (JdbcRowDataLookupFunction
> > > >>>>>>>>>>> is
> > > >>>>>>>>>>>> an
> > > >>>>>>>>>>>>> example), so it's more handy to leave it to the connector.
> > > >>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>> I don't see DDL options, that were in previous
> > > >>> version
> > > >>>> of
> > > >>>>>>>>> FLIP.
> > > >>>>>>>>>>>> Do
> > > >>>>>>>>>>>>> you have any special plans for them?
> > > >>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>> We decided not to provide common DDL options and to let
> > > >>>>>>>>>>>>>>>>>>> developers define their own options, as we do now per connector.
> > > >>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>> The rest of comments sound great and I’ll update the
> > > >>>> FLIP.
> > > >>>>>>> Hope
> > > >>>>>>>>>>> we
> > > >>>>>>>>>>>>> can finalize our proposal soon!
> > > >>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>> Best,
> > > >>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>> Qingsheng
> > > >>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>> On May 17, 2022, at 13:46, Александр Смирнов <
> > > >>>>>>>>>>>> smiralexan@gmail.com>
> > > >>>>>>>>>>>>> wrote:
> > > >>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>> Hi Qingsheng and devs!
> > > >>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>> I like the overall design of updated FLIP, however
> > > >> I
> > > >>>> have
> > > >>>>>>>>>>> several
> > > >>>>>>>>>>>>>>>>>>>> suggestions and questions.
> > > >>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>> 1) Introducing LookupFunction as a subclass of
> > > >>>>>>> TableFunction
> > > >>>>>>>>>>> is a
> > > >>>>>>>>>>>>> good
> > > >>>>>>>>>>>>>>>>>>>> idea. We can add 'maxRetryTimes' option into this
> > > >>>> class.
> > > >>>>>>>>> 'eval'
> > > >>>>>>>>>>>>> method
> > > >>>>>>>>>>>>>>>>>>>> of new LookupFunction is great for this purpose.
> > > >> The
> > > >>>> same
> > > >>>>>>> is
> > > >>>>>>>>>>> for
> > > >>>>>>>>>>>>>>>>>>>> 'async' case.
> > > >>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>> 2) There might be other configs in future, such as
> > > >>>>>>>>>>>>> 'cacheMissingKey'
> > > >>>>>>>>>>>>>>>>>>>> in LookupFunctionProvider or 'rescanInterval' in
> > > >>>>>>>>>>>>> ScanRuntimeProvider.
> > > >>>>>>>>>>>>>>>>>>>> Maybe use Builder pattern in LookupFunctionProvider
> > > >>> and
> > > >>>>>>>>>>>>>>>>>>>> RescanRuntimeProvider for more flexibility (use one
> > > >>>>> 'build'
> > > >>>>>>>>>>>> method
> > > >>>>>>>>>>>>>>>>>>>> instead of many 'of' methods in future)?
> > > >>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>> 3) What are the plans for existing
> > > >>>> TableFunctionProvider
> > > >>>>>>> and
> > > >>>>>>>>>>>>>>>>>>>> AsyncTableFunctionProvider? I think they should be
> > > >>>>>>> deprecated.
> > > >>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>> 4) Am I right that the current design does not
> > > >> assume
> > > >>>>>>> usage of
> > > >>>>>>>>>>>>>>>>>>>> user-provided LookupCache in re-scanning? In this
> > > >>> case,
> > > >>>>> it
> > > >>>>>>> is
> > > >>>>>>>>>>> not
> > > >>>>>>>>>>>>> very
> > > >>>>>>>>>>>>>>>>>>>> clear why do we need methods such as 'invalidate'
> > > >> or
> > > >>>>>>> 'putAll'
> > > >>>>>>>>>>> in
> > > >>>>>>>>>>>>>>>>>>>> LookupCache.
> > > >>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>> 5) I don't see DDL options, that were in previous
> > > >>>> version
> > > >>>>>>> of
> > > >>>>>>>>>>>> FLIP.
> > > >>>>>>>>>>>>> Do
> > > >>>>>>>>>>>>>>>>>>>> you have any special plans for them?
> > > >>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>> If you don't mind, I would be glad to be able to
> > > >> make
> > > >>>>> small
> > > >>>>>>>>>>>>>>>>>>>> adjustments to the FLIP document too. I think it's
> > > >>>> worth
> > > >>>>>>>>>>>> mentioning
> > > >>>>>>>>>>>>>>>>>>>> about what exactly optimizations are planning in
> > > >> the
> > > >>>>>>> future.
> > > >>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>> Best regards,
> > > >>>>>>>>>>>>>>>>>>>> Smirnov Alexander
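[Editor's note: for readers following along, the LookupCache shape debated above — including the 'invalidate'/'putAll' methods questioned in point 4 — might look roughly like this. This is a hypothetical sketch of the discussion, not the FLIP's final API.]

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical sketch of the LookupCache interface discussed in this thread;
 * method names follow the discussion, not a finalized Flink API.
 */
interface LookupCache<K, V> {
    V getIfPresent(K key);
    void put(K key, V value);
    void putAll(Map<K, V> entries); // mainly useful when (re)loading a full "ALL" cache
    void invalidate(K key);         // point 4 questions whether pure LRU lookup needs this
}

/** Trivial unbounded in-memory implementation, just to make the contract concrete. */
class MapLookupCache<K, V> implements LookupCache<K, V> {
    private final Map<K, V> store = new HashMap<>();

    @Override public V getIfPresent(K key) { return store.get(key); }
    @Override public void put(K key, V value) { store.put(key, value); }
    @Override public void putAll(Map<K, V> entries) { store.putAll(entries); }
    @Override public void invalidate(K key) { store.remove(key); }
}
```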
> > > >>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>> пт, 13 мая 2022 г. в 20:27, Qingsheng Ren <
> > > >>>>>>> renqschn@gmail.com
> > > >>>>>>>>>>>> :
> > > >>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>> Hi Alexander and devs,
> > > >>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>> Thank you very much for the in-depth discussion!
> > > >> As
> > > >>>> Jark
> > > >>>>>>>>>>>>> mentioned we were inspired by Alexander's idea and made a
> > > >>>>>>> refactor on
> > > >>>>>>>>>>> our
> > > >>>>>>>>>>>>> design. FLIP-221 [1] has been updated to reflect our
> > > >> design
> > > >>>> now
> > > >>>>>>> and
> > > >>>>>>>>> we
> > > >>>>>>>>>>>> are
> > > >>>>>>>>>>>>> happy to hear more suggestions from you!
> > > >>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>> Compared to the previous design:
> > > >>>>>>>>>>>>>>>>>>>>> 1. The lookup cache serves at table runtime level
> > > >>> and
> > > >>>> is
> > > >>>>>>>>>>>>> integrated as a component of LookupJoinRunner as discussed
> > > >>>>>>>>> previously.
> > > >>>>>>>>>>>>>>>>>>>>> 2. Interfaces are renamed and re-designed to
> > > >> reflect
> > > >>>> the
> > > >>>>>>> new
> > > >>>>>>>>>>>>> design.
> > > >>>>>>>>>>>>>>>>>>>>> 3. We handle the all-caching case separately
> > > >> and
> > > >>>>>>>>>>> introduce a
> > > >>>>>>>>>>>>> new RescanRuntimeProvider to reuse the ability of
> > > >> scanning.
> > > >>> We
> > > >>>>> are
> > > >>>>>>>>>>>> planning
> > > >>>>>>>>>>>>> to support SourceFunction / InputFormat for now
> > > >> considering
> > > >>>> the
> > > >>>>>>>>>>>> complexity
> > > >>>>>>>>>>>>> of FLIP-27 Source API.
> > > >>>>>>>>>>>>>>>>>>>>> 4. A new interface LookupFunction is introduced to
> > > >>>> make
> > > >>>>>>> the
> > > >>>>>>>>>>>>> semantic of lookup more straightforward for developers.
> > > >>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>> For replying to Alexander:
> > > >>>>>>>>>>>>>>>>>>>>>> However I'm a little confused whether InputFormat
> > > >>> is
> > > >>>>>>>>>>> deprecated
> > > >>>>>>>>>>>>> or not. Am I right that it will be so in the future, but
> > > >>>>> currently
> > > >>>>>>>>> it's
> > > >>>>>>>>>>>> not?
> > > >>>>>>>>>>>>>>>>>>>>> Yes you are right. InputFormat is not deprecated
> > > >> for
> > > >>>>> now.
> > > >>>>>>> I
> > > >>>>>>>>>>>> think
> > > >>>>>>>>>>>>> it will be deprecated in the future but we don't have a
> > > >>> clear
> > > >>>>> plan
> > > >>>>>>>>> for
> > > >>>>>>>>>>>> that.
> > > >>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>> Thanks again for the discussion on this FLIP and
> > > >>>> looking
> > > >>>>>>>>>>> forward
> > > >>>>>>>>>>>>> to cooperating with you after we finalize the design and
> > > >>>>>>> interfaces!
> > > >>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>> [1]
> > https://cwiki.apache.org/confluence/display/FLINK/FLIP-221+Abstraction+for+lookup+source+cache+and+metric
> > > >>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>> Best regards,
> > > >>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>> Qingsheng
> > > >>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>> On Fri, May 13, 2022 at 12:12 AM Александр
> > > >> Смирнов <
> > > >>>>>>>>>>>>> smiralexan@gmail.com> wrote:
> > > >>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>> Hi Jark, Qingsheng and Leonard!
> > > >>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>> Glad to see that we came to a consensus on almost
> > > >>> all
> > > >>>>>>>>> points!
> > > >>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>> However I'm a little confused whether InputFormat
> > > >>> is
> > > >>>>>>>>>>> deprecated
> > > >>>>>>>>>>>>> or
> > > >>>>>>>>>>>>>>>>>>>>>> not. Am I right that it will be so in the future,
> > > >>> but
> > > >>>>>>>>>>> currently
> > > >>>>>>>>>>>>> it's
> > > >>>>>>>>>>>>>>>>>>>>>> not? Actually I also think that for the first version it's OK to use
> > > >>>>>>>>>>>>>>>>>>>>>> InputFormat in the ALL cache implementation, because
> > > >>>>> supporting
> > > >>>>>>>>>>> rescan
> > > >>>>>>>>>>>>>>>>>>>>>> ability seems like a very distant prospect. But
> > > >> for
> > > >>>>> this
> > > >>>>>>>>>>>>> decision we
> > > >>>>>>>>>>>>>>>>>>>>>> need a consensus among all discussion
> > > >> participants.
> > > >>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>> In general, I don't have something to argue with
> > > >>> your
> > > >>>>>>>>>>>>> statements. All
> > > >>>>>>>>>>>>>>>>>>>>>> of them correspond my ideas. Looking ahead, it
> > > >>> would
> > > >>>> be
> > > >>>>>>> nice
> > > >>>>>>>>>>> to
> > > >>>>>>>>>>>>> work
> > > >>>>>>>>>>>>>>>>>>>>>> on this FLIP cooperatively. I've already done a
> > > >> lot
> > > >>>> of
> > > >>>>>>> work
> > > >>>>>>>>>>> on
> > > >>>>>>>>>>>>> lookup
> > > >>>>>>>>>>>>>>>>>>>>>> join caching with realization very close to the
> > > >> one
> > > >>>> we
> > > >>>>>>> are
> > > >>>>>>>>>>>>> discussing,
> > > >>>>>>>>>>>>>>>>>>>>>> and want to share the results of this work.
> > > >> Anyway
> > > >>>>>>> looking
> > > >>>>>>>>>>>>> forward for
> > > >>>>>>>>>>>>>>>>>>>>>> the FLIP update!
> > > >>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>> Best regards,
> > > >>>>>>>>>>>>>>>>>>>>>> Smirnov Alexander
> > > >>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>> чт, 12 мая 2022 г. в 17:38, Jark Wu <
> > > >>>> imjark@gmail.com
> > > >>>>>> :
> > > >>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>> Hi Alex,
> > > >>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>> Thanks for summarizing your points.
> > > >>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>> In the past week, Qingsheng, Leonard, and I have
> > > >>>>>>> discussed
> > > >>>>>>>>>>> it
> > > >>>>>>>>>>>>> several times
> > > >>>>>>>>>>>>>>>>>>>>>>> and we have totally refactored the design.
> > > >>>>>>>>>>>>>>>>>>>>>>> I'm glad to say we have reached a consensus on
> > > >>> many
> > > >>>> of
> > > >>>>>>> your
> > > >>>>>>>>>>>>> points!
> > > >>>>>>>>>>>>>>>>>>>>>>> Qingsheng is still working on updating the
> > > >> design
> > > >>>> docs
> > > >>>>>>> and
> > > >>>>>>>>>>>>> maybe can be
> > > >>>>>>>>>>>>>>>>>>>>>>> available in the next few days.
> > > >>>>>>>>>>>>>>>>>>>>>>> I will share some conclusions from our
> > > >>> discussions:
> > > >>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>> 1) we have refactored the design towards to
> > > >> "cache
> > > >>>> in
> > > >>>>>>>>>>>>> framework" way.
> > > >>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>> 2) a "LookupCache" interface for users to
> > > >>> customize
> > > >>>>> and
> > > >>>>>>> a
> > > >>>>>>>>>>>>> default
> > > >>>>>>>>>>>>>>>>>>>>>>> implementation with builder for users to
> > > >> easy-use.
> > > >>>>>>>>>>>>>>>>>>>>>>> This can both make it possible to both have
> > > >>>>> flexibility
> > > >>>>>>> and
> > > >>>>>>>>>>>>> conciseness.
> > > >>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>> 3) Filter pushdown is important for ALL and LRU
> > > >>>> lookup
> > > >>>>>>>>>>> cache,
> > > >>>>>>>>>>>>> esp reducing
> > > >>>>>>>>>>>>>>>>>>>>>>> IO.
> > > >>>>>>>>>>>>>>>>>>>>>>> Filter pushdown should be the final state and the unified way
> > > >>>>>>>>>>>>>>>>>>>>>>> to support pruning both the ALL cache and the LRU cache,
> > > >>>>>>>>>>>>>>>>>>>>>>> so I think we should make an effort in this
> > > >>> direction.
> > > >>>> If
> > > >>>>>>> we
> > > >>>>>>>>>>> need
> > > >>>>>>>>>>>>> to support
> > > >>>>>>>>>>>>>>>>>>>>>>> filter pushdown for ALL cache anyway, why not
> > > >> use
> > > >>>>>>>>>>>>>>>>>>>>>>> it for LRU cache as well? Either way, as we
> > > >> decide
> > > >>>> to
> > > >>>>>>>>>>>> implement
> > > >>>>>>>>>>>>> the cache
> > > >>>>>>>>>>>>>>>>>>>>>>> in the framework, we have the chance to support
> > > >>>>>>>>>>>>>>>>>>>>>>> filter on cache anytime. This is an optimization
> > > >>> and
> > > >>>>> it
> > > >>>>>>>>>>>> doesn't
> > > >>>>>>>>>>>>> affect the
> > > >>>>>>>>>>>>>>>>>>>>>>> public API. I think we can create a JIRA issue
> > > >> to
> > > >>>>>>>>>>>>>>>>>>>>>>> discuss it when the FLIP is accepted.
> > > >>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>> 4) The idea to support ALL cache is similar to
> > > >>> your
> > > >>>>>>>>>>> proposal.
> > > >>>>>>>>>>>>>>>>>>>>>>> In the first version, we will only support
> > > >>>>> InputFormat,
> > > >>>>>>>>>>>>> SourceFunction for
> > > >>>>>>>>>>>>>>>>>>>>>>> cache all (invoke InputFormat in join operator).
> > > >>>>>>>>>>>>>>>>>>>>>>> For FLIP-27 source, we need to join a true
> > > >> source
> > > >>>>>>> operator
> > > >>>>>>>>>>>>> instead of
> > > >>>>>>>>>>>>>>>>>>>>>>> calling it embedded in the join operator.
> > > >>>>>>>>>>>>>>>>>>>>>>> However, this needs another FLIP to support the
> > > >>>>> re-scan
> > > >>>>>>>>>>>> ability
> > > >>>>>>>>>>>>> for FLIP-27
> > > >>>>>>>>>>>>>>>>>>>>>>> Source, and this can be a large work.
> > > >>>>>>>>>>>>>>>>>>>>>>> In order to not block this issue, we can put the
> > > >>>>> effort
> > > >>>>>>> of
> > > >>>>>>>>>>>>> FLIP-27 source
> > > >>>>>>>>>>>>>>>>>>>>>>> integration into future work and integrate
> > > >>>>>>>>>>>>>>>>>>>>>>> InputFormat&SourceFunction for now.
> > > >>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>> I think it's fine to use
> > > >>> InputFormat&SourceFunction,
> > > >>>>> as
> > > >>>>>>>>> they
> > > >>>>>>>>>>>>> are not
> > > >>>>>>>>>>>>>>>>>>>>>>> deprecated, otherwise, we have to introduce
> > > >>> another
> > > >>>>>>>>> function
> > > >>>>>>>>>>>>>>>>>>>>>>> similar to them which is meaningless. We need to
> > > >>>> plan
> > > >>>>>>>>>>> FLIP-27
> > > >>>>>>>>>>>>> source
> > > >>>>>>>>>>>>>>>>>>>>>>> integration ASAP before InputFormat &
> > > >>> SourceFunction
> > > >>>>> are
> > > >>>>>>>>>>>>> deprecated.
> > > >>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>> Best,
> > > >>>>>>>>>>>>>>>>>>>>>>> Jark
> > > >>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>> On Thu, 12 May 2022 at 15:46, Александр Смирнов
> > > >> <
> > > >>>>>>>>>>>>> smiralexan@gmail.com>
> > > >>>>>>>>>>>>>>>>>>>>>>> wrote:
> > > >>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>> Hi Martijn!
> > > >>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>> Got it. Therefore, the realization with
> > > >>> InputFormat
> > > >>>>> is
> > > >>>>>>> not
> > > >>>>>>>>>>>>> considered.
> > > >>>>>>>>>>>>>>>>>>>>>>>> Thanks for clearing that up!
> > > >>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>> Best regards,
> > > >>>>>>>>>>>>>>>>>>>>>>>> Smirnov Alexander
> > > >>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>> чт, 12 мая 2022 г. в 14:23, Martijn Visser <
> > > >>>>>>>>>>>>> martijn@ververica.com>:
> > > >>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>> Hi,
> > > >>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>> With regards to:
> > > >>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>> But if there are plans to refactor all
> > > >>> connectors
> > > >>>>> to
> > > >>>>>>>>>>>> FLIP-27
> > > >>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>> Yes, FLIP-27 is the target for all connectors.
> > > >>> The
> > > >>>>> old
> > > >>>>>>>>>>>>> interfaces will be
> > > >>>>>>>>>>>>>>>>>>>>>>>>> deprecated and connectors will either be
> > > >>>> refactored
> > > >>>>> to
> > > >>>>>>>>> use
> > > >>>>>>>>>>>>> the new ones
> > > >>>>>>>>>>>>>>>>>>>>>>>> or
> > > >>>>>>>>>>>>>>>>>>>>>>>>> dropped.
> > > >>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>> The caching should work for connectors that
> > > >> are
> > > >>>>> using
> > > >>>>>>>>>>>> FLIP-27
> > > >>>>>>>>>>>>> interfaces,
> > > >>>>>>>>>>>>>>>>>>>>>>>>> we should not introduce new features for old
> > > >>>>>>> interfaces.
> > > >>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>> Best regards,
> > > >>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>> Martijn
> > > >>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>> On Thu, 12 May 2022 at 06:19, Александр
> > > >> Смирнов
> > > >>> <
> > > >>>>>>>>>>>>> smiralexan@gmail.com>
> > > >>>>>>>>>>>>>>>>>>>>>>>>> wrote:
> > > >>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>> Hi Jark!
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>> Sorry for the late response. I would like to
> > > >>> make
> > > >>>>>>> some
> > > >>>>>>>>>>>>> comments and
> > > >>>>>>>>>>>>>>>>>>>>>>>>>> clarify my points.
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>> 1) I agree with your first statement. I think
> > > >>> we
> > > >>>>> can
> > > >>>>>>>>>>>> achieve
> > > >>>>>>>>>>>>> both
> > > >>>>>>>>>>>>>>>>>>>>>>>>>> advantages this way: put the Cache interface
> > > >> in
> > > >>>>>>>>>>>>> flink-table-common,
> > > >>>>>>>>>>>>>>>>>>>>>>>>>> but have implementations of it in
> > > >>>>>>> flink-table-runtime.
> > > >>>>>>>>>>>>> Therefore if a
> > > >>>>>>>>>>>>>>>>>>>>>>>>>> connector developer wants to use existing
> > > >> cache
> > > >>>>>>>>>>> strategies
> > > >>>>>>>>>>>>> and their
> > > >>>>>>>>>>>>>>>>>>>>>>>>>> implementations, he can just pass
> > > >> lookupConfig
> > > >>> to
> > > >>>>> the
> > > >>>>>>>>>>>>> planner, but if
> > > >>>>>>>>>>>>>>>>>>>>>>>>>> he wants to have its own cache implementation
> > > >>> in
> > > >>>>> his
> > > >>>>>>>>>>>>> TableFunction, it
> > > >>>>>>>>>>>>>>>>>>>>>>>>>> will be possible for him to use the existing
> > > >>>>>>> interface
> > > >>>>>>>>>>> for
> > > >>>>>>>>>>>>> this
> > > >>>>>>>>>>>>>>>>>>>>>>>>>> purpose (we can explicitly point this out in
> > > >>> the
> > > >>>>>>>>>>>>> documentation). In
> > > >>>>>>>>>>>>>>>>>>>>>>>>>> this way all configs and metrics will be
> > > >>> unified.
> > > >>>>>>> WDYT?
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>> If a filter can prune 90% of data in the
> > > >>> cache,
> > > >>>> we
> > > >>>>>>> will
> > > >>>>>>>>>>>>> have 90% of
> > > >>>>>>>>>>>>>>>>>>>>>>>>>> lookup requests that can never be cached
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>> 2) Let me clarify the logic filters
> > > >>> optimization
> > > >>>> in
> > > >>>>>>> case
> > > >>>>>>>>>>> of
> > > >>>>>>>>>>>>> LRU cache.
> > > >>>>>>>>>>>>>>>>>>>>>>>>>> It looks like Cache<RowData,
> > > >>>> Collection<RowData>>.
> > > >>>>>>> Here
> > > >>>>>>>>>>> we
> > > >>>>>>>>>>>>> always
> > > >>>>>>>>>>>>>>>>>>>>>>>>>> store the response of the dimension table in
> > > >>>> cache,
> > > >>>>>>> even
> > > >>>>>>>>>>>>> after
> > > >>>>>>>>>>>>>>>>>>>>>>>>>> applying calc function. I.e. if there are no
> > > >>> rows
> > > >>>>>>> after
> > > >>>>>>>>>>>>> applying
> > > >>>>>>>>>>>>>>>>>>>>>>>>>> filters to the result of the 'eval' method of
> > > >>>>>>>>>>>> TableFunction,
> > > >>>>>>>>>>>>> we store
> > > >>>>>>>>>>>>>>>>>>>>>>>>>> the empty list by lookup keys. Therefore the
> > > >>>> cache
> > > >>>>>>> line
> > > >>>>>>>>>>>> will
> > > >>>>>>>>>>>>> be
> > > >>>>>>>>>>>>>>>>>>>>>>>>>> filled, but will require much less memory (in
> > > >>>>> bytes).
> > > >>>>>>>>>>> I.e.
> > > >>>>>>>>>>>>> we don't
> > > >>>>>>>>>>>>>>>>>>>>>>>>>> completely filter keys, by which result was
> > > >>>> pruned,
> > > >>>>>>> but
> > > >>>>>>>>>>>>> significantly
> > > >>>>>>>>>>>>>>>>>>>>>>>>>> reduce required memory to store this result.
> > > >> If
> > > >>>> the
> > > >>>>>>> user
> > > >>>>>>>>>>>>> knows about
> > > >>>>>>>>>>>>>>>>>>>>>>>>>> this behavior, he can increase the 'max-rows'
> > > >>>>> option
> > > >>>>>>>>>>> before
> > > >>>>>>>>>>>>> the start
> > > >>>>>>>>>>>>>>>>>>>>>>>>>> of the job. But actually I came up with the
> > > >>> idea
> > > >>>>>>> that we
> > > >>>>>>>>>>>> can
> > > >>>>>>>>>>>>> do this
> > > >>>>>>>>>>>>>>>>>>>>>>>>>> automatically by using the 'maximumWeight'
> > > >> and
> > > >>>>>>> 'weigher'
> > > >>>>>>>>>>>>> methods of
> > > >>>>>>>>>>>>>>>>>>>>>>>>>> GuavaCache [1]. Weight can be the size of the
> > > >>>>>>> collection
> > > >>>>>>>>>>> of
> > > >>>>>>>>>>>>> rows
> > > >>>>>>>>>>>>>>>>>>>>>>>>>> (value of cache). Therefore cache can
> > > >>>> automatically
> > > >>>>>>> fit
> > > >>>>>>>>>>>> much
> > > >>>>>>>>>>>>> more
> > > >>>>>>>>>>>>>>>>>>>>>>>>>> records than before.
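[Editor's note: the "weight = number of cached rows" idea can be illustrated without Guava as follows. This is a toy sketch with simple FIFO eviction; Guava's CacheBuilder provides this properly via maximumWeight()/weigher(), and the class here is purely illustrative.]

```java
import java.util.ArrayDeque;
import java.util.Collection;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/**
 * Bounds the cache by total number of cached rows rather than by entry count,
 * so an entry holding an empty result (a pruned lookup) weighs nothing.
 */
class RowWeightedCache<K> {
    private final long maxTotalRows;
    private long totalRows = 0;
    private final Map<K, Collection<?>> store = new HashMap<>();
    private final Deque<K> insertionOrder = new ArrayDeque<>();

    RowWeightedCache(long maxTotalRows) { this.maxTotalRows = maxTotalRows; }

    void put(K key, Collection<?> rows) {
        Collection<?> prev = store.put(key, rows);
        if (prev != null) {
            totalRows -= prev.size(); // replaced an existing entry
        } else {
            insertionOrder.addLast(key);
        }
        totalRows += rows.size();
        // evict oldest entries until the total row weight fits again
        while (totalRows > maxTotalRows && !insertionOrder.isEmpty()) {
            Collection<?> evicted = store.remove(insertionOrder.removeFirst());
            if (evicted != null) totalRows -= evicted.size();
        }
    }

    Collection<?> getIfPresent(K key) { return store.get(key); }
    long totalRows() { return totalRows; }
}
```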
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>> Flink SQL has provided a standard way to do
> > > >>>>> filters
> > > >>>>>>> and
> > > >>>>>>>>>>>>> projects
> > > >>>>>>>>>>>>>>>>>>>>>>>>>> pushdown, i.e., SupportsFilterPushDown and
> > > >>>>>>>>>>>>> SupportsProjectionPushDown.
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>> Jdbc/hive/HBase haven't implemented the
> > > >>>>> interfaces,
> > > >>>>>>>>>>> don't
> > > >>>>>>>>>>>>> mean it's
> > > >>>>>>>>>>>>>>>>>>>>>>>> hard
> > > >>>>>>>>>>>>>>>>>>>>>>>>>> to implement.
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>> It's debatable how difficult it will be to
> > > >>>>> implement
> > > >>>>>>>>>>> filter
> > > >>>>>>>>>>>>> pushdown.
> > > >>>>>>>>>>>>>>>>>>>>>>>>>> But I think the fact that currently there is
> > > >> no
> > > >>>>>>> database
> > > >>>>>>>>>>>>> connector
> > > >>>>>>>>>>>>>>>>>>>>>>>>>> with filter pushdown at least means that this
> > > >>>>> feature
> > > >>>>>>>>>>> won't
> > > >>>>>>>>>>>>> be
> > > >>>>>>>>>>>>>>>>>>>>>>>>>> supported soon in connectors. Moreover, if we
> > > >>>> talk
> > > >>>>>>> about
> > > >>>>>>>>>>>>> other
> > > >>>>>>>>>>>>>>>>>>>>>>>>>> connectors (not in Flink repo), their
> > > >> databases
> > > >>>>> might
> > > >>>>>>>>> not
> > > >>>>>>>>>>>>> support all
> > > >>>>>>>>>>>>>>>>>>>>>>>>>> Flink filters (or not support filters at
> > > >> all).
> > > >>> I
> > > >>>>>>> think
> > > >>>>>>>>>>>> users
> > > >>>>>>>>>>>>> are
> > > >>>>>>>>>>>>>>>>>>>>>>>>>> interested in supporting cache filters
> > > >>>> optimization
> > > >>>>>>>>>>>>> independently of
> > > >>>>>>>>>>>>>>>>>>>>>>>>>> supporting other features and solving more
> > > >>>> complex
> > > >>>>>>>>>>> problems
> > > >>>>>>>>>>>>> (or
> > > >>>>>>>>>>>>>>>>>>>>>>>>>> unsolvable at all).
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>
> 3) I agree with your third statement. Actually, in our internal
> version I also tried to unify the logic of scanning and reloading data
> from connectors. But unfortunately, I didn't find a way to unify the
> logic of all ScanRuntimeProviders (InputFormat, SourceFunction,
> Source, ...) and reuse it for reloading the ALL cache. As a result I
> settled on using InputFormat, because it was used for scanning in all
> lookup connectors. (I didn't know that there are plans to deprecate
> InputFormat in favor of the FLIP-27 Source.) IMO usage of the FLIP-27
> source in ALL caching is not a good idea, because this source was
> designed to work in a distributed environment (SplitEnumerator on the
> JobManager and SourceReaders on TaskManagers), not in one operator
> (the lookup join operator in our case). There is not even a direct way
> to pass splits from SplitEnumerator to SourceReader (this logic works
> through SplitEnumeratorContext, which requires
> OperatorCoordinator.SubtaskGateway to send AddSplitEvents). Usage of
> InputFormat for the ALL cache seems much clearer and easier. But if
> there are plans to refactor all connectors to FLIP-27, I have the
> following idea: maybe we can drop the lookup join ALL cache in favor
> of a simple join with repeated scanning of the batch source? The point
> is that the only difference between a lookup join ALL cache and a
> simple join with a batch source is that in the first case scanning is
> performed multiple times, in between which the state (cache) is
> cleared (correct me if I'm wrong). So what if we extend the
> functionality of the simple join to support state reloading + extend
> the functionality of scanning the batch source multiple times (the
> latter should be easy with the new FLIP-27 source, which unifies
> streaming/batch reading - we would only need to change the
> SplitEnumerator so that it passes the splits again after some TTL)?
> WDYT? I must say that this looks like a long-term goal and will make
> the scope of this FLIP even larger than you said. Maybe we can limit
> ourselves to a simpler solution now (InputFormats).
>
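[Editor's note] The "reload everything after some TTL" idea discussed above can be illustrated with a minimal, self-contained sketch. `AllCache` and its loader are hypothetical names, not Flink interfaces: the loader stands in for a full scan of the lookup table (e.g. via an InputFormat), and `get()` itself never does per-key IO.

```java
import java.util.Map;
import java.util.function.Supplier;

/**
 * Minimal sketch (hypothetical, not a Flink API) of an "ALL" cache that
 * fully rebuilds its snapshot from a loader once the TTL expires.
 */
public class AllCache<K, V> {
    private final Supplier<Map<K, V>> loader; // stand-in for a full table scan
    private final long ttlMillis;
    private Map<K, V> snapshot;               // null until the first load
    private long loadedAtMillis;
    int reloads = 0;                          // exposed for illustration only

    public AllCache(Supplier<Map<K, V>> loader, long ttlMillis) {
        this.loader = loader;
        this.ttlMillis = ttlMillis;
    }

    /** Looks up a key; reloads the whole snapshot first if the TTL expired. */
    public V get(K key, long nowMillis) {
        if (snapshot == null || nowMillis - loadedAtMillis >= ttlMillis) {
            snapshot = loader.get();          // blocking full reload
            loadedAtMillis = nowMillis;
            reloads++;
        }
        return snapshot.get(key);
    }
}
```

In a real operator the reload would block (or buffer) the probe stream for its duration, which is exactly the blocking-time concern raised elsewhere in this thread.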
> So to sum up, my points are:
> 1) There is a way to make interfaces for caching in lookup join that
> are both concise and flexible.
> 2) The cache filter optimization is important both for LRU and ALL
> caches.
> 3) It is unclear when filter pushdown will be supported in Flink
> connectors; some connectors might not even be able to support filter
> pushdown + as far as I know, filter pushdown currently works only for
> scanning (not lookup). So the cache filter + projection optimization
> should be independent from other features.
> 4) The ALL cache realization is a complex topic that involves multiple
> aspects of how Flink is developing. Dropping InputFormat in favor of
> the FLIP-27 Source would make the ALL cache realization really complex
> and unclear, so maybe instead of that we can extend the functionality
> of the simple join, or keep InputFormat for the lookup join ALL cache?
>
> Best regards,
> Smirnov Alexander
>
> [1]
> https://guava.dev/releases/18.0/api/docs/com/google/common/cache/CacheBuilder.html#weigher(com.google.common.cache.Weigher)
>
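[Editor's note] Reference [1] above points at Guava's weigher-based size bound. As a hedged illustration of the same idea without the Guava dependency, here is a plain-Java sketch of a weight-bounded LRU map; all names are illustrative and this is not the FLIP's proposed API:

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.ToIntFunction;

/**
 * Sketch of a size-by-weight bound similar in spirit to Guava's
 * CacheBuilder.weigher: entries are evicted in LRU order until the summed
 * weight fits under the maximum.
 */
public class WeightedLruCache<K, V> {
    // accessOrder=true makes iteration order = least-recently-used first
    private final LinkedHashMap<K, V> map = new LinkedHashMap<>(16, 0.75f, true);
    private final ToIntFunction<V> weigher;
    private final int maxWeight;
    private int totalWeight = 0;

    public WeightedLruCache(ToIntFunction<V> weigher, int maxWeight) {
        this.weigher = weigher;
        this.maxWeight = maxWeight;
    }

    public void put(K key, V value) {
        V old = map.put(key, value);
        if (old != null) totalWeight -= weigher.applyAsInt(old);
        totalWeight += weigher.applyAsInt(value);
        Iterator<Map.Entry<K, V>> it = map.entrySet().iterator();
        while (totalWeight > maxWeight && it.hasNext()) { // evict eldest first
            totalWeight -= weigher.applyAsInt(it.next().getValue());
            it.remove();
        }
    }

    public V get(K key) { return map.get(key); }
    public int size() { return map.size(); }
}
```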
> Thu, 5 May 2022 at 20:34, Jark Wu <imjark@gmail.com>:
>
> > It's great to see the active discussion! I want to share my ideas:
> >
> > 1) implement the cache in the framework vs. in the connector base
> > I don't have a strong opinion on this. Both ways should work (e.g.,
> > cache pruning, compatibility).
> > The framework way can provide more concise interfaces.
> > The connector base way can define more flexible cache
> > strategies/implementations.
> > We are still investigating a way to see if we can have both
> > advantages.
> > We should reach a consensus that the chosen way should be a final
> > state, and that we are on the path to it.
> >
> > 2) filters and projections pushdown:
> > I agree with Alex that the filter pushdown into the cache can
> > benefit the ALL cache a lot.
> > However, this is not true for the LRU cache. Connectors use the
> > cache to reduce IO requests to databases for better throughput.
> > If a filter can prune 90% of the data in the cache, we will have 90%
> > of lookup requests that can never be cached and hit the databases
> > directly. That means the cache is meaningless in this case.
> >
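[Editor's note] Jark's 90% argument above can be made concrete with a small simulation (all names are hypothetical, not Flink code): if only rows passing the filter are admitted to the cache, every key whose row is pruned misses forever and goes to the database on each lookup.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Illustrative sketch: filtering rows *before* caching hurts an LRU-style
 * lookup cache, because pruned keys are never cached and always hit the DB.
 */
public class LruFilterEffect {
    static int dbRequests = 0;

    // Hypothetical "database": every key exists, value == key.
    static int lookupInDb(int key) {
        dbRequests++;
        return key;
    }

    // The pushed filter keeps only 10% of rows.
    static boolean filter(int value) { return value % 10 == 0; }

    public static int countDbRequests(int distinctKeys, int repetitions) {
        dbRequests = 0;
        Map<Integer, Integer> cache = new HashMap<>();
        for (int r = 0; r < repetitions; r++) {
            for (int key = 0; key < distinctKeys; key++) {
                if (cache.containsKey(key)) continue;      // cache hit
                int value = lookupInDb(key);               // cache miss -> IO
                if (filter(value)) cache.put(key, value);  // only filtered rows cached
            }
        }
        return dbRequests;
    }

    public static void main(String[] args) {
        // 100 keys looked up 5 times: 10 cacheable keys cost 1 request each,
        // the other 90 keys cost 5 requests each.
        System.out.println(countDbRequests(100, 5)); // 10 + 90*5 = 460
    }
}
```

Without the filter every key would be cacheable, so the same workload would cost only 100 requests instead of 460.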
> > IMO, Flink SQL has provided a standard way to do filter and
> > projection pushdown, i.e., SupportsFilterPushDown and
> > SupportsProjectionPushDown.
> > That Jdbc/Hive/HBase haven't implemented the interfaces doesn't mean
> > they are hard to implement.
> > They should implement the pushdown interfaces to reduce IO and the
> > cache size.
> > The final state should be that the scan source and the lookup source
> > share the exact same pushdown implementation.
> > I don't see why we need to duplicate the pushdown logic in caches,
> > which would complicate the lookup join design.
> >
> > 3) ALL cache abstraction
> > The ALL cache might be the most challenging part of this FLIP. We
> > have never provided a reload-lookup public interface.
> > Currently, we put the reload logic in the "eval" method of
> > TableFunction. That's hard for some sources (e.g., Hive).
> > Ideally, connector implementations should share the logic of reload
> > and scan, i.e. ScanTableSource with
> > InputFormat/SourceFunction/FLIP-27 Source.
> > However, InputFormat/SourceFunction are deprecated, and the FLIP-27
> > source is deeply coupled with the SourceOperator.
> > If we want to invoke the FLIP-27 source in LookupJoin, this may make
> > the scope of this FLIP much larger.
> > We are still investigating how to abstract the ALL cache logic and
> > reuse the existing source interfaces.
> >
> > Best,
> > Jark
> >
> > On Thu, 5 May 2022 at 20:22, Roman Boyko <ro.v.boyko@gmail.com> wrote:
> >
> > > It's a much more complicated activity and lies outside the scope
> > > of this improvement, because such pushdowns would have to be done
> > > for all ScanTableSource implementations (not only the Lookup
> > > ones).
> > >
> > > On Thu, 5 May 2022 at 19:02, Martijn Visser <martijnvisser@apache.org> wrote:
> > >
> > > > Hi everyone,
> > > >
> > > > One question regarding "And Alexander correctly mentioned that
> > > > filter pushdown still is not implemented for jdbc/hive/hbase."
> > > > -> Would an alternative solution be to actually implement these
> > > > filter pushdowns? I can imagine that there are many more
> > > > benefits to doing that, outside of lookup caching and metrics.
> > > >
> > > > Best regards,
> > > >
> > > > Martijn Visser
> > > > https://twitter.com/MartijnVisser82
> > > > https://github.com/MartijnVisser
> > > >
> > > > On Thu, 5 May 2022 at 13:58, Roman Boyko <ro.v.boyko@gmail.com> wrote:
> > > >
> > > > > Hi everyone!
> > > > >
> > > > > Thanks for driving such a valuable improvement!
> > > > >
> > > > > I do think that a single cache implementation would be a nice
> > > > > opportunity for users. And it will break the "FOR SYSTEM_TIME
> > > > > AS OF proc_time" semantics anyway - no matter how it is
> > > > > implemented.
> > > > >
> > > > > Putting myself in the user's shoes, I can say that:
> > > > > 1) I would prefer to have the opportunity to cut down the
> > > > > cache size by simply filtering out unnecessary data. And the
> > > > > handiest way to do that is to apply it inside the
> > > > > LookupRunners. It would be a bit harder to pass it through the
> > > > > LookupJoin node to the TableFunction. And Alexander correctly
> > > > > mentioned that filter pushdown still is not implemented for
> > > > > jdbc/hive/hbase.
> > > > > 2) The ability to set different caching parameters for
> > > > > different tables is quite important. So I would prefer to set
> > > > > them through DDL rather than have the same ttl, strategy and
> > > > > other options for all lookup tables.
> > > > > 3) Putting the cache into the framework really deprives us of
> > > > > extensibility (users won't be able to implement their own
> > > > > cache). But most probably that can be solved by creating more
> > > > > cache strategies and a wider set of configurations.
> > > > >
> > > > > All these points are much closer to the schema proposed by
> > > > > Alexander. Qingsheng Ren, please correct me if I'm wrong and
> > > > > all these facilities can simply be implemented in your
> > > > > architecture?
> > > > >
> > > > > Best regards,
> > > > > Roman Boyko
> > > > > e.: ro.v.boyko@gmail.com
> > > > >
> > > > > On Wed, 4 May 2022 at 21:01, Martijn Visser <martijnvisser@apache.org> wrote:
> > > > >
> > > > > > Hi everyone,
> > > > > >
> > > > > > I don't have much to chip in, but just wanted to express
> > > > > > that I really appreciate the in-depth discussion on this
> > > > > > topic and I hope that others will join the conversation.
> > > > > >
> > > > > > Best regards,
> > > > > >
> > > > > > Martijn
> > > > > >
> > > > > > On Tue, 3 May 2022 at 10:15, Александр Смирнов <smiralexan@gmail.com> wrote:
> > > > > >
> > > > > > > Hi Qingsheng, Leonard and Jark,
> > > > > > >
> > > > > > > Thanks for your detailed feedback! However, I have
> > > > > > > questions about some of your statements (maybe I didn't
> > > > > > > get something?).
> > > > > > >
> > > > > > > > Caching actually breaks the semantic of "FOR SYSTEM_TIME
> > > > > > > > AS OF proc_time”
> > > > > > >
> > > > > > > I agree that the semantics of "FOR SYSTEM_TIME AS OF
> > > > > > > proc_time" is not fully implemented with caching, but as
> > > > > > > you said, users opt into it consciously to achieve better
> > > > > > > performance (no one proposed to enable caching by default,
> > > > > > > etc.). Or by users do you mean other developers of
> > > > > > > connectors? In that case developers explicitly specify
> > > > > > > whether their connector supports caching or not (in the
> > > > > > > list of supported options); no one makes them do that if
> > > > > > > they don't want to. So what exactly is the difference
> > > > > > > between implementing caching in the flink-table-runtime
> > > > > > > module and in flink-table-common from this point of view?
> > > > > > > How does it affect breaking (or not breaking) the
> > > > > > > semantics of "FOR SYSTEM_TIME AS OF proc_time"?
> > > > > > >
> > > > > > > > confront a situation that allows table options in DDL
> > > > > > > > to control the behavior of the framework, which has
> > > > > > > > never happened previously and should be cautious
> > > > > > >
> > > > > > > If we talk about the main difference between the semantics
> > > > > > > of DDL options and config options ("table.exec.xxx"),
> > > > > > > isn't it about limiting the scope of the options + their
> > > > > > > importance for the user's business logic, rather than the
> > > > > > > specific location of the corresponding logic in the
> > > > > > > framework? I mean that in my design, for example, putting
> > > > > > > an option with the lookup cache strategy in configurations
> > > > > > > would be the wrong decision, because it directly affects
> > > > > > > the user's business logic (not just performance
> > > > > > > optimization) + touches just several functions of ONE
> > > > > > > table (there can be multiple tables with different
> > > > > > > caches). Does it really matter for the user (or someone
> > > > > > > else) where the logic affected by the applied option is
> > > > > > > located?
> > > > > > > Also I can recall the DDL option 'sink.parallelism', which
> > > > > > > in some way "controls the behavior of the framework", and
> > > > > > > I don't see any problem there.
> > > > > > >
> > > > > > > > introduce a new interface for this all-caching scenario
> > > > > > > > and the design would become more complex
> > > > > > >
> > > > > > > This is a subject for a separate discussion, but actually
> > > > > > > in our internal version we solved this problem quite
> > > > > > > easily - we reused the InputFormat class (so there is no
> > > > > > > need for a new API). The point is that currently all
> > > > > > > lookup connectors use InputFormat for scanning the data in
> > > > > > > batch mode: HBase, JDBC and even Hive - it uses the
> > > > > > > PartitionReader class, which is actually just a wrapper
> > > > > > > around InputFormat. The advantage of this solution is the
> > > > > > > ability to reload the cache data in parallel (the number
> > > > > > > of threads depends on the number of InputSplits, but has
> > > > > > > an upper limit). As a result the cache reload time is
> > > > > > > significantly reduced (as well as the time the input
> > > > > > > stream is blocked). I know that we usually try to avoid
> > > > > > > concurrency in Flink code, but maybe this one can be an
> > > > > > > exception. BTW I don't say that it's an ideal solution;
> > > > > > > maybe there are better ones.
> > > > > > >
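[Editor's note] The parallel reload described above (one reader per split, bounded by an upper thread limit, results merged into one cache) can be sketched roughly as follows. The split and row types are placeholders, not Flink's InputSplit/InputFormat API:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/**
 * Sketch of a parallel ALL-cache reload: each "split" (a stand-in for an
 * InputSplit) is read on its own thread, up to an upper limit, and the
 * partial results are merged into one cache map.
 */
public class ParallelReload {
    public static Map<Integer, String> reload(List<List<Integer>> splits, int maxThreads)
            throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(
                Math.max(1, Math.min(maxThreads, splits.size())));
        try {
            List<Future<Map<Integer, String>>> futures = new ArrayList<>();
            for (List<Integer> split : splits) {
                futures.add(pool.submit(() -> {      // one reader per split
                    Map<Integer, String> part = new HashMap<>();
                    for (int key : split) part.put(key, "row-" + key);
                    return part;
                }));
            }
            Map<Integer, String> cache = new HashMap<>();
            for (Future<Map<Integer, String>> f : futures) {
                cache.putAll(f.get());               // merge when each split finishes
            }
            return cache;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        Map<Integer, String> cache =
                reload(List.of(List.of(1, 2), List.of(3), List.of(4, 5)), 2);
        System.out.println(cache.size()); // 5
    }
}
```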
> > > > > > > > Providing the cache in the framework might introduce
> > > > > > > > compatibility issues
> > > > > > >
> > > > > > > That is possible only if the developer of the connector
> > > > > > > doesn't properly refactor his code and uses the new cache
> > > > > > > options incorrectly (i.e. explicitly provides the same
> > > > > > > options in 2 different code places). For correct behavior
> > > > > > > all he needs to do is redirect the existing options to the
> > > > > > > framework's LookupConfig (+ maybe add an alias for
> > > > > > > options, if the naming differed); everything will be
> > > > > > > transparent for users. If the developer doesn't do any
> > > > > > > refactoring at all, nothing changes for the connector
> > > > > > > because of backward compatibility. Also, if a developer
> > > > > > > wants to use his own cache logic, he can simply decline to
> > > > > > > pass some of the configs to the framework, and instead
> > > > > > > make his own implementation with the already existing
> > > > > > > configs and metrics (but actually I think that's a rare
> > > > > > > case).
> > > > > > >

> filters and projections should be pushed all the way down to the table function, like what we do in the scan source

That's a great goal. But the truth is that the ONLY connector that supports filter pushdown is FileSystemTableSource (no database connector supports it currently). Also, for some databases it's simply impossible to push down filters as complex as the ones we have in Flink.

> only applying these optimizations to the cache seems not quite useful

Filters can cut off an arbitrarily large amount of data from the dimension table. As a simple example, suppose dimension table 'users' has a column 'age' with values from 20 to 40, and the input stream 'clicks' is roughly uniformly distributed by user age. If we have the filter 'age > 30', there will be half as much data in the cache. This means the user can increase 'lookup.cache.max-rows' by almost 2 times, which will give a huge performance boost. Moreover, this optimization really starts to shine with the 'ALL' cache, where tables that can't fit in memory without filters and projections can fit with them. This opens up additional possibilities for users, and that doesn't sound like something 'not quite useful'.
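To make the point concrete, here is a tiny sketch of filtering before caching (illustrative only; the class name is made up and this is not Flink code, only the 'lookup.cache.max-rows' option name is from the discussion): a bounded LRU cache that applies the pushed-down filter before admitting a row never wastes capacity on rows the join can't use.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical sketch: an LRU cache for lookup rows that applies a
// join/WHERE filter before storing a row, so rows that can never match
// the join are not cached at all.
class FilteringLruCache<K, V> {
    private final Map<K, V> cache;
    private final Predicate<V> filter;

    FilteringLruCache(final int maxRows, Predicate<V> filter) {
        this.filter = filter;
        // Access-ordered LinkedHashMap that evicts the eldest entry past maxRows.
        this.cache = new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                return size() > maxRows;
            }
        };
    }

    void put(K key, V row) {
        if (filter.test(row)) { // e.g. age > 30: filtered rows never enter the cache
            cache.put(key, row);
        }
    }

    V get(K key) {
        return cache.get(key);
    }

    int size() {
        return cache.size();
    }
}
```

With ages 20..40 and the filter age > 30, only the 10 matching rows occupy cache slots, so the same 'lookup.cache.max-rows' budget effectively stretches about twice as far.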

It would be great to hear other voices on this topic! We have quite a few controversial points, and I think that with the help of others it will be easier for us to come to a consensus.

Best regards,
Smirnov Alexander

On Fri, 29 Apr 2022 at 22:33, Qingsheng Ren <renqschn@gmail.com> wrote:

Hi Alexander and Arvid,

Thanks for the discussion and sorry for my late response! We had an internal discussion together with Jark and Leonard and I'd like to summarize our ideas. Instead of implementing the cache logic in the table runtime layer or wrapping around the user-provided table function, we prefer to introduce some new APIs extending TableFunction, with these concerns:

1. Caching actually breaks the semantic of "FOR SYSTEM_TIME AS OF proc_time", because it couldn't truly reflect the content of the lookup table at the moment of querying. If users choose to enable caching on the lookup table, they implicitly indicate that this breakage is acceptable in exchange for the performance. So we prefer not to provide caching on the table runtime level.

2. If we make the cache implementation in the framework (whether in a runner or a wrapper around TableFunction), we have to confront a situation that allows table options in DDL to control the behavior of the framework, which has never happened previously and should be cautious. Under the current design the behavior of the framework should only be specified by configurations ("table.exec.xxx"), and it's hard to apply these general configs to a specific table.

3. We have use cases in which the lookup source loads and refreshes all records periodically into memory to achieve high lookup performance (like the Hive connector in the community, and this is also widely used by our internal connectors). Wrapping the cache around the user's TableFunction works fine for LRU caches, but I think we have to introduce a new interface for this all-caching scenario, and the design would become more complex.

4. Providing the cache in the framework might introduce compatibility issues for existing lookup sources: if the user incorrectly configures the table, there might exist two caches with totally different strategies (one in the framework and another implemented by the lookup source).

As for the optimization mentioned by Alexander, I think filters and projections should be pushed all the way down to the table function, like what we do in the scan source, instead of into the runner with the cache. The goal of using cache is to reduce the network I/O and pressure on the external system, and only applying these optimizations to the cache seems not quite useful.

I made some updates to the FLIP[1] to reflect our ideas. We prefer to keep the cache implementation as a part of TableFunction, and we could provide some helper classes (CachingTableFunction, AllCachingTableFunction, CachingAsyncTableFunction) to developers and regulate the metrics of the cache. Also, I made a POC[2] for your reference.
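A rough sketch of what such a helper class could look like (simplified stand-ins only: the real TableFunction lives in flink-table-common and emits rows via collect(), and every name here except CachingTableFunction is invented for illustration): the base class owns the cache and its metrics, and the concrete connector implements only the actual external lookup.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Stand-in for Flink's TableFunction, which really emits rows via collect().
abstract class LookupFunctionBase {
    abstract List<String> lookup(String key);
}

// Sketch of the helper-class idea from the FLIP: CachingTableFunction owns
// the cache and its metrics; a connector only implements lookupFromExternal().
abstract class CachingTableFunction extends LookupFunctionBase {
    private final Map<String, List<String>> cache = new HashMap<>();
    private long hitCount = 0;   // would back a standard "hitCount" metric
    private long missCount = 0;  // would back a standard "missCount" metric

    protected abstract List<String> lookupFromExternal(String key);

    @Override
    List<String> lookup(String key) {
        List<String> cached = cache.get(key);
        if (cached != null) {
            hitCount++;
            return cached;
        }
        missCount++;
        List<String> rows = lookupFromExternal(key);
        cache.put(key, rows);
        return rows;
    }

    long hitCount() { return hitCount; }
    long missCount() { return missCount; }
}
```

Because the base class, not the connector, updates the counters, the metrics stay uniform across all connectors, which is one of the stated goals of the FLIP.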

Looking forward to your ideas!

[1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-221+Abstraction+for+lookup+source+cache+and+metric
[2] https://github.com/PatrickRen/flink/tree/FLIP-221

Best regards,

Qingsheng

On Tue, Apr 26, 2022 at 4:45 PM Александр Смирнов <smiralexan@gmail.com> wrote:

Thanks for the response, Arvid!

I have a few comments on your message.

> but could also live with an easier solution as the first step:

I think these 2 ways (the one originally proposed by Qingsheng and mine) are mutually exclusive, because conceptually they follow the same goal but the implementation details are different. If we go one way, moving to the other way in the future will mean deleting existing code and once again changing the API for connectors. So I think we should reach a consensus with the community about that and then work together on this FLIP, i.e. divide the work into tasks for different parts of the FLIP (for example, LRU cache unification / introducing the proposed set of metrics / further work…). WDYT, Qingsheng?

> as the source will only receive the requests after filter

Actually, if filters are applied to fields of the lookup table, we must first do the requests, and only after that can we filter the responses, because lookup connectors don't have filter pushdown. So if filtering is done before caching, there will be far fewer rows in the cache.

> @Alexander unfortunately, your architecture is not shared. I don't know the solution to share images to be honest.

Sorry for that, I'm a bit new to these kinds of conversations :) I have no write access to the Confluence, so I made a Jira issue where I described the proposed changes in more detail - https://issues.apache.org/jira/browse/FLINK-27411.

I'll be happy to get more feedback!

Best,
Smirnov Alexander

On Mon, 25 Apr 2022 at 19:49, Arvid Heise <arvid@apache.org> wrote:

Hi Qingsheng,

Thanks for driving this; the inconsistency was not satisfying for me.

I second Alexander's idea, though, but could also live with an easier solution as the first step: instead of making caching an implementation detail of TableFunction X, rather devise a caching layer around X. So the proposal would be a CachingTableFunction that delegates to X in case of misses and otherwise manages the cache. Lifting it into the operator model as proposed would be even better but is probably unnecessary in the first step for a lookup source (as the source will only receive the requests after filter; applying projection may be more interesting to save memory).
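The wrapper idea could be sketched roughly like this (all names and the stand-in function type are hypothetical, not Flink API; in real Flink the wrapped function would be the connector's TableFunction emitting rows via collect()): the cache lives in a decorator around the existing function, so the connector itself stays untouched.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Sketch of a caching layer wrapped *around* an existing lookup function,
// rather than baked into it. The wrapped function is any key -> rows
// mapping; the decorator manages the cache and delegates on a miss.
class CachingLayer {
    private final Function<String, List<String>> inner;
    private final Map<String, List<String>> cache = new HashMap<>();

    CachingLayer(Function<String, List<String>> inner) {
        this.inner = inner;
    }

    List<String> eval(String key) {
        // Serve from the cache; delegate to the wrapped function on a miss.
        return cache.computeIfAbsent(key, inner);
    }
}
```

The table runtime could decide purely from table options whether to insert this layer, which is why this variant would need no new public interface.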

Another advantage is that all the changes of this FLIP would be limited to options, with no need for new public interfaces. Everything else remains an implementation detail of the Table runtime. That means we can easily incorporate the optimization potential that Alexander pointed out later.

@Alexander unfortunately, your architecture is not shared. I don't know the solution to share images to be honest.

On Fri, Apr 22, 2022 at 5:04 PM Александр Смирнов <smiralexan@gmail.com> wrote:

Hi Qingsheng! My name is Alexander; I'm not a committer yet, but I'd really like to become one, and this FLIP really interested me. I have actually worked on a similar feature in my company's Flink fork, and we would like to share our thoughts on this and make the code open source.

I think there is a better alternative than introducing an abstract class for TableFunction (CachingTableFunction). As you know, TableFunction lives in the flink-table-common module, which provides only an API for working with tables - that makes it very convenient to import in connectors. In turn, CachingTableFunction contains logic for runtime execution, so this class and everything connected with it should be located in another module, probably flink-table-runtime. But that would require connectors to depend on another module that contains a lot of runtime logic, which doesn't sound good.

I suggest adding a new method 'getLookupConfig' to LookupTableSource or LookupRuntimeProvider to allow connectors to pass only configurations to the planner, so that they won't depend on the runtime implementation. Based on these configs, the planner will construct a lookup join operator with the corresponding runtime logic (ProcessFunctions in the flink-table-runtime module). The architecture looks like the pinned image (the LookupConfig class there is actually your CacheConfig).
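The proposed contract could look roughly like this (a sketch only; 'getLookupConfig', 'LookupConfig' and its fields are the proposal's hypothetical names from this thread and the Jira issue, not existing Flink API, and 'LookupJoinAllCachingRunner' is invented here): the connector only describes the cache it wants, and the planner chooses the runtime implementation.

```java
// Sketch of the proposed config-only contract between connector and planner.
// All names below are hypothetical; nothing here is existing Flink API.

// What the connector hands to the planner instead of a runtime cache class.
final class LookupConfig {
    enum CacheStrategy { NONE, LRU, ALL }

    final CacheStrategy strategy;
    final long maxRows;      // e.g. from the 'lookup.cache.max-rows' option
    final long ttlMillis;    // e.g. from a cache TTL option

    LookupConfig(CacheStrategy strategy, long maxRows, long ttlMillis) {
        this.strategy = strategy;
        this.maxRows = maxRows;
        this.ttlMillis = ttlMillis;
    }
}

// The connector-facing side: it only declares its desired caching behavior.
interface ConfigurableLookupSource {
    LookupConfig getLookupConfig();
}

// The planner-facing side: chooses a runner based on the declared config.
final class LookupJoinPlanner {
    static String chooseRunner(LookupConfig config) {
        switch (config.strategy) {
            case LRU:  return "LookupJoinCachingRunner";
            case ALL:  return "LookupJoinAllCachingRunner";
            default:   return "LookupJoinRunner";
        }
    }
}
```

The key design point is the direction of the dependency: the connector module only sees a plain config holder, while everything that executes lives in flink-table-runtime.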

The classes in flink-table-planner that will be responsible for this are CommonPhysicalLookupJoin and its subclasses. The current classes for lookup join in flink-table-runtime are LookupJoinRunner, AsyncLookupJoinRunner, LookupJoinRunnerWithCalc, and AsyncLookupJoinRunnerWithCalc.

I suggest adding classes LookupJoinCachingRunner, LookupJoinCachingRunnerWithCalc, etc.

And here comes another, more powerful advantage of such a solution. If we have the caching logic on a lower level, we can apply some optimizations to it. LookupJoinRunnerWithCalc was named like this because it uses the 'calc' function, which actually mostly consists of filters and projections.

For example, when joining table A with lookup table B with the condition 'JOIN … ON A.id = B.id AND A.age = B.age + 10 WHERE B.salary > 1000', the 'calc' function will contain the filters A.age = B.age + 10 and B.salary > 1000.
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> If we apply this function before
> > > >>>> storing
> > > >>>>>>>>>>> records
> > > >>>>>>>>>>>> in
> > > >>>>>>>>>>>>>>>>>>>>>>>>>> cache,
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> size
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> of
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> cache will be significantly
> > > >> reduced:
> > > >>>>>>> filters =
> > > >>>>>>>>>>>>> avoid
> > > >>>>>>>>>>>>>>>>>>>>>>>>>> storing
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> useless
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> records in cache, projections =
> > > >>> reduce
> > > >>>>>>>>> records’
> > > >>>>>>>>>>>>>>>>>>>>>>>> size. So
> > > >>>>>>>>>>>>>>>>>>>>>>>>>> the
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> initial
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> max number of records in cache can
> > > >> be
> > > >>>>>>>>> increased
> > > >>>>>>>>>>>> by
> > > >>>>>>>>>>>>>>>>>>>>>>>> the
> > > >>>>>>>>>>>>>>>>>>>>>>>>>> user.
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> What do you think about it?
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On 2022/04/19 02:47:11 Qingsheng
> > > >> Ren
> > > >>>>> wrote:
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi devs,
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Yuan and I would like to start a
> > > >>>>>>> discussion
> > > >>>>>>>>>>>> about
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> FLIP-221[1],
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> which
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> introduces an abstraction of lookup
> > > >>>> table
> > > >>>>>>>>> cache
> > > >>>>>>>>>>>> and
> > > >>>>>>>>>>>>>>>>>>>>>>>> its
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> standard
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> metrics.
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Currently each lookup table source
> > > >>>>> should
> > > >>>>>>>>>>>>> implement
> > > >>>>>>>>>>>>>>>>>>>>>>>>>> their
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> own
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> cache to
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> store lookup results, and there
> > > >>> isn’t a
> > > >>>>>>>>>>> standard
> > > >>>>>>>>>>>> of
> > > >>>>>>>>>>>>>>>>>>>>>>>>>> metrics
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> for
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> users and
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> developers to tuning their jobs
> > > >> with
> > > >>>>> lookup
> > > >>>>>>>>>>>> joins,
> > > >>>>>>>>>>>>>>>>>>>>>>>> which
> > > >>>>>>>>>>>>>>>>>>>>>>>>>> is a
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> quite
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> common
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> use case in Flink table / SQL.
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Therefore we propose some new APIs
> > > >>>>>>> including
> > > >>>>>>>>>>>>> cache,
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> metrics,
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> wrapper
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> classes of TableFunction and new
> > > >>> table
> > > >>>>>>>>> options.
> > > >>>>>>>>>>>>>>>>>>>>>>>> Please
> > > >>>>>>>>>>>>>>>>>>>>>>>>>> take a
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> look
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> at the
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> FLIP page [1] to get more details.
> > > >>> Any
> > > >>>>>>>>>>>> suggestions
> > > >>>>>>>>>>>>>>>>>>>>>>>> and
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> comments
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> would be
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> appreciated!
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> [1]
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>
> > > >>>>>>>>>
> > > >>>>>>>
> > > >>>>>
> > > >>>>
> > > >>>
> > > >>
> > >
> > https://cwiki.apache.org/confluence/display/FLINK/FLIP-221+Abstraction+for+lookup+source+cache+and+metric
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Best regards,
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Qingsheng
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> --
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Best Regards,
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Qingsheng Ren
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Real-time Computing Team
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Alibaba Cloud
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Email: renqschn@gmail.com
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>> --
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>> Best regards,
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>> Roman Boyko
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>> e.: ro.v.boyko@gmail.com
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>> --
> > > >>>>>>>>>>>>>>>>>>>>> Best Regards,
> > > >>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>> Qingsheng Ren
> > > >>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>> Real-time Computing Team
> > > >>>>>>>>>>>>>>>>>>>>> Alibaba Cloud
> > > >>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>> Email: renqschn@gmail.com
> > > >>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>
> > > >>>>>>>>>
> > > >>>>>>>>>
> > > >>>>>>>
> > > >>>>>>
> > > >>>>>
> > > >>>>
> > > >>>
> > > >>
> > >
> > >
> >


Re: [DISCUSS] FLIP-221 Abstraction for lookup source cache and metric

Posted by Becket Qin <be...@gmail.com>.
Thanks for updating the FLIP, Qingsheng. A few more comments:

1. I am still not sure what the use case is for cacheMissingKey(). More
specifically, when would users want getCache() to return a non-empty value
and cacheMissingKey() to return false?

2. The builder pattern. Usually the builder pattern is used when there are
a lot of variations of constructors. For example, if a class has three
variables and all of them are optional, there could potentially be many
combinations of the variables. But in this FLIP, I don't see such a case.
What is the reason we have builders for all the classes?
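For context, here is a minimal sketch of the trade-off being questioned. All class names, fields, and default values below are hypothetical illustrations, not FLIP-221 APIs: a static "of" factory reads cleanly when every argument is required, while a builder pays off mainly when several parameters are optional with defaults.

```java
// Hypothetical sketch contrasting a static "of" factory with a builder.
// None of these classes or defaults are actual FLIP-221 / Flink APIs.
public class BuilderVsOf {

    // Static "of" factory: a good fit when every argument is required.
    static final class RescanRuntimeProvider {
        final Runnable scanner;
        final long reloadIntervalMillis;

        private RescanRuntimeProvider(Runnable scanner, long reloadIntervalMillis) {
            this.scanner = scanner;
            this.reloadIntervalMillis = reloadIntervalMillis;
        }

        static RescanRuntimeProvider of(Runnable scanner, long reloadIntervalMillis) {
            return new RescanRuntimeProvider(scanner, reloadIntervalMillis);
        }
    }

    // Builder: pays off when several parameters are optional with defaults,
    // so callers set only what they need.
    static final class CacheConfig {
        final long maxRows;
        final long ttlMillis;

        private CacheConfig(long maxRows, long ttlMillis) {
            this.maxRows = maxRows;
            this.ttlMillis = ttlMillis;
        }

        static Builder builder() {
            return new Builder();
        }

        static final class Builder {
            private long maxRows = 10_000;   // hypothetical default
            private long ttlMillis = 60_000; // hypothetical default

            Builder maxRows(long maxRows) {
                this.maxRows = maxRows;
                return this;
            }

            Builder ttlMillis(long ttlMillis) {
                this.ttlMillis = ttlMillis;
                return this;
            }

            CacheConfig build() {
                return new CacheConfig(maxRows, ttlMillis);
            }
        }
    }

    public static void main(String[] args) {
        // Both arguments required: a static factory reads cleanly.
        RescanRuntimeProvider provider =
                RescanRuntimeProvider.of(() -> {}, 3_600_000L);

        // Only one of two optional parameters overridden: the builder avoids
        // a combinatorial explosion of constructor / "of" overloads.
        CacheConfig config = CacheConfig.builder().maxRows(50_000).build();

        System.out.println(provider.reloadIntervalMillis); // 3600000
        System.out.println(config.maxRows + " " + config.ttlMillis); // 50000 60000
    }
}
```

With no optional parameters in sight, the static factory is the simpler choice, which is the point being made above.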

3. Should the caching strategy be excluded from the top level provider API?
Technically speaking, the Flink framework should only have two interfaces
to deal with:
    A) LookupFunction
    B) AsyncLookupFunction
Orthogonally, we *believe* there are two different strategies people can use
for caching. Note that the Flink framework does not care what the caching
strategy is here.
    a) partial caching
    b) full caching

Putting them together, we end up with 3 combinations that we think are
valid:
     Aa) PartialCachingLookupFunctionProvider
     Ba) PartialCachingAsyncLookupFunctionProvider
     Ab) FullCachingLookupFunctionProvider

However, the caching strategy could actually be quite flexible. E.g. an
initial full cache load followed by some partial updates. Also, I am not
100% sure if the full caching will always use ScanTableSource. Including
the caching strategy in the top level provider API would make it harder to
extend.

One possible solution is to just have *LookupFunctionProvider* and
*AsyncLookupFunctionProvider* as the top level API, both with a
getCacheStrategy() method returning an optional CacheStrategy. The
CacheStrategy class would have the following methods:
1. void open(Context), where the context exposes some of the resources that
may be useful for the caching strategy, e.g. an ExecutorService that is
synchronized with the data processing, or a cache refresh trigger which
blocks data processing and refreshes the cache.
2. void initializeCache(), a blocking method that allows users to
pre-populate the cache before processing any data if they wish.
3. void maybeCache(RowData key, Collection<RowData> value), a blocking or
non-blocking method.
4. void refreshCache(), a blocking / non-blocking method that is invoked by
the Flink framework when the cache refresh trigger is pulled.

In the above design, partial caching and full caching would be
implementations of the CacheStrategy, and it is OK for users to implement
their own CacheStrategy if they want to.
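As an illustration of the pluggable strategy sketched in this email, here is one hypothetical reading of the four methods described above, plus a toy partial-caching implementation. The names and signatures are inferred from the prose and are NOT actual Flink or FLIP-221 APIs.

```java
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ScheduledExecutorService;

// Hypothetical shape of the CacheStrategy described in this email; the names,
// signatures and the toy Partial implementation are not actual Flink APIs.
public interface CacheStrategy<K, V> {

    /** Resources the framework would hand to the strategy. */
    interface Context {
        ScheduledExecutorService executorService();
    }

    /** Called once before any data is processed. */
    void open(Context context);

    /** Blocking pre-population of the cache, e.g. an initial full load. */
    void initializeCache();

    /** Optionally cache a looked-up key with its value rows. */
    void maybeCache(K key, Collection<V> value);

    /** Invoked when the framework's cache refresh trigger is pulled. */
    void refreshCache();

    /** Toy partial-caching strategy: cache on lookup, drop everything on refresh. */
    class Partial<K, V> implements CacheStrategy<K, V> {
        private final Map<K, Collection<V>> cache = new HashMap<>();

        @Override
        public void open(Context context) {
            // Nothing to acquire for this toy implementation.
        }

        @Override
        public void initializeCache() {
            // A partial cache starts empty; a full-caching strategy would
            // load all entries here instead.
        }

        @Override
        public void maybeCache(K key, Collection<V> value) {
            cache.put(key, value);
        }

        @Override
        public void refreshCache() {
            cache.clear();
        }

        public int size() {
            return cache.size();
        }
    }
}
```

A full-caching strategy would instead do its bulk load in initializeCache() and reload in refreshCache(), which is how this single interface could cover both strategies and hybrids such as a full load followed by partial updates.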

Thanks,

Jiangjie (Becket) Qin


On Thu, Jun 2, 2022 at 12:14 PM Jark Wu <im...@gmail.com> wrote:

> Thank Qingsheng for the detailed summary and updates,
>
> The changes look good to me in general. I just have one minor improvement
> comment.
> Could we add a static util method to the "FullCachingReloadTrigger"
> interface for quick usage?
>
> #periodicReloadAtFixedRate(Duration)
> #periodicReloadWithFixedDelay(Duration)
>
> I think we can also do this for LookupCache, because users may not know
> where the default implementations are and how to use them.
>
> Best,
> Jark
>
>
>
>
>
>
> On Wed, 1 Jun 2022 at 18:32, Qingsheng Ren <re...@gmail.com> wrote:
>
> > Hi Jingsong,
> >
> > Thanks for your comments!
> >
> > > AllCache definition is not flexible, for example, PartialCache can use
> > > any custom storage, while the AllCache can not, AllCache can also be
> > > considered to store memory or disk, also need a flexible strategy.
> >
> > We had an offline discussion with Jark and Leonard. Basically we think
> > exposing the interface of full cache storage to connector developers might
> > limit our future optimizations. The storage of full caching shouldn't have
> > too many variations for different lookup tables, so making it pluggable
> > might not help a lot. Also I think it is not quite easy for connector
> > developers to implement such an optimized storage. We can keep optimizing
> > this storage in the future and all full caching lookup tables would
> > benefit from this.
> >
> > > We are more inclined to deprecate the connector `async` option when
> > > discussing FLIP-234. Can we remove this option from this FLIP?
> >
> > Thanks for the reminder! This option has been removed in the latest
> > version.
> >
> > Best regards,
> >
> > Qingsheng
> >
> >
> > > On Jun 1, 2022, at 15:28, Jingsong Li <ji...@gmail.com> wrote:
> > >
> > > Thanks Alexander for your reply. We can discuss the new interface when it comes out.
> > >
> > > We are more inclined to deprecate the connector `async` option when discussing FLIP-234 [1]. We should use hints to let the planner decide. Although the discussion has not yet produced a conclusion, can we remove this option from this FLIP? It doesn't seem to be related to this FLIP, but more to FLIP-234, and we can form a conclusion over there.
> > >
> > > [1] https://lists.apache.org/thread/9k1sl2519kh2n3yttwqc00p07xdfns3h
> > >
> > > Best,
> > > Jingsong
> > >
> > > On Wed, Jun 1, 2022 at 4:59 AM Jing Ge <ji...@ververica.com> wrote:
> > >
> > >> Hi Jark,
> > >>
> > >> Thanks for clarifying it. It would be fine, as long as we could provide the no-cache solution. I was just wondering if the client-side cache could really help when HBase is used, since the data to look up should be huge. Depending on how much data will be cached on the client side, the data that should be LRU in e.g. LruBlockCache will not be LRU anymore. In the worst-case scenario, once the cached data at the client side is expired, the request will hit disk, which will cause extra latency temporarily, if I am not mistaken.
> > >>
> > >> Best regards,
> > >> Jing
> > >>
> > >> On Mon, May 30, 2022 at 9:59 AM Jark Wu <im...@gmail.com> wrote:
> > >>
> > >>> Hi Jing Ge,
> > >>>
> > >>> What do you mean about the "impact on the block cache used by HBase"?
> > >>> In my understanding, the connector cache and the HBase cache are totally two things.
> > >>> The connector cache is a local/client cache, and the HBase cache is a server cache.
> > >>>
> > >>>> does it make sense to have a no-cache solution as one of the
> > >>>> default solutions so that customers will have no effort for the migration
> > >>>> if they want to stick with Hbase cache
> > >>>
> > >>> The implementation migration should be transparent to users. Take the HBase connector as an example: it already supports lookup cache but it is disabled by default. After migration, the connector still disables the cache by default (i.e. the no-cache solution). No migration effort for users.
> > >>>
> > >>> HBase cache and connector cache are two different things. HBase cache can't simply replace connector cache, because one of the most important usages for connector cache is reducing the I/O requests/responses and improving the throughput, which cannot be achieved by just using a server cache.
> > >>>
> > >>> Best,
> > >>> Jark
> > >>>
> > >>>
> > >>>
> > >>>
> > >>> On Fri, 27 May 2022 at 22:42, Jing Ge <ji...@ververica.com> wrote:
> > >>>
> > >>>> Thanks all for the valuable discussion. The new feature looks very interesting.
> > >>>>
> > >>>> According to the FLIP description: "*Currently we have JDBC, Hive and HBase connector implemented lookup table source. All existing implementations will be migrated to the current design and the migration will be transparent to end users*." I was only wondering if we should pay attention to HBase and similar DBs. Since, commonly, the lookup data will be huge while using HBase, partial caching will be used in this case, if I am not mistaken, which might have an impact on the block cache used by HBase, e.g. LruBlockCache.
> > >>>> Another question is that, since HBase provides a sophisticated cache solution, does it make sense to have a no-cache solution as one of the default solutions so that customers will have no effort for the migration if they want to stick with Hbase cache?
> > >>>>
> > >>>> Best regards,
> > >>>> Jing
> > >>>>
> > >>>> On Fri, May 27, 2022 at 11:19 AM Jingsong Li <jingsonglee0@gmail.com> wrote:
> > >>>>
> > >>>>> Hi all,
> > >>>>>
> > >>>>> I think the problems now are:
> > >>>>> 1. The AllCache and PartialCache interfaces are not uniform: one needs to provide a LookupProvider, the other needs to provide a CacheBuilder.
> > >>>>> 2. The AllCache definition is not flexible. For example, PartialCache can use any custom storage while AllCache cannot; AllCache could also store to memory or disk, which also needs a flexible strategy.
> > >>>>> 3. AllCache cannot customize the ReloadStrategy; currently there is only ScheduledReloadStrategy.
> > >>>>>
> > >>>>> In order to solve the above problems, the following are my ideas.
> > >>>>>
> > >>>>> ## Top level cache interfaces:
> > >>>>>
> > >>>>> ```
> > >>>>> public interface CacheLookupProvider extends LookupTableSource.LookupRuntimeProvider {
> > >>>>>
> > >>>>>     CacheBuilder createCacheBuilder();
> > >>>>> }
> > >>>>>
> > >>>>> public interface CacheBuilder {
> > >>>>>     Cache create();
> > >>>>> }
> > >>>>>
> > >>>>> public interface Cache {
> > >>>>>
> > >>>>>     /**
> > >>>>>      * Returns the value associated with key in this cache, or null
> > >>>>>      * if there is no cached value for key.
> > >>>>>      */
> > >>>>>     @Nullable
> > >>>>>     Collection<RowData> getIfPresent(RowData key);
> > >>>>>
> > >>>>>     /** Returns the number of key-value mappings in the cache. */
> > >>>>>     long size();
> > >>>>> }
> > >>>>> ```
> > >>>>>
> > >>>>> ## Partial cache
> > >>>>>
> > >>>>> ```
> > >>>>> public interface PartialCacheLookupFunction extends CacheLookupProvider {
> > >>>>>
> > >>>>>     @Override
> > >>>>>     PartialCacheBuilder createCacheBuilder();
> > >>>>>
> > >>>>>     /** Creates an {@link LookupFunction} instance. */
> > >>>>>     LookupFunction createLookupFunction();
> > >>>>> }
> > >>>>>
> > >>>>> public interface PartialCacheBuilder extends CacheBuilder {
> > >>>>>
> > >>>>>     PartialCache create();
> > >>>>> }
> > >>>>>
> > >>>>> public interface PartialCache extends Cache {
> > >>>>>
> > >>>>>     /**
> > >>>>>      * Associates the specified value rows with the specified key row
> > >>>>>      * in the cache. If the cache previously contained value associated
> > >>>>>      * with the key, the old value is replaced by the specified value.
> > >>>>>      *
> > >>>>>      * @return the previous value rows associated with key, or null if
> > >>>>>      * there was no mapping for key.
> > >>>>>      * @param key - key row with which the specified value is to be associated
> > >>>>>      * @param value - value rows to be associated with the specified key
> > >>>>>      */
> > >>>>>     Collection<RowData> put(RowData key, Collection<RowData> value);
> > >>>>>
> > >>>>>     /** Discards any cached value for the specified key. */
> > >>>>>     void invalidate(RowData key);
> > >>>>> }
> > >>>>> ```
> > >>>>>
> > >>>>> ## All cache
> > >>>>>
> > >>>>> ```
> > >>>>> public interface AllCacheLookupProvider extends CacheLookupProvider {
> > >>>>>
> > >>>>>     void registerReloadStrategy(ScheduledExecutorService executorService, Reloader reloader);
> > >>>>>
> > >>>>>     ScanTableSource.ScanRuntimeProvider getScanRuntimeProvider();
> > >>>>>
> > >>>>>     @Override
> > >>>>>     AllCacheBuilder createCacheBuilder();
> > >>>>> }
> > >>>>>
> > >>>>> public interface AllCacheBuilder extends CacheBuilder {
> > >>>>>
> > >>>>>     AllCache create();
> > >>>>> }
> > >>>>>
> > >>>>> public interface AllCache extends Cache {
> > >>>>>
> > >>>>>     void putAll(Iterator<Map<RowData, RowData>> allEntries);
> > >>>>>
> > >>>>>     void clearAll();
> > >>>>> }
> > >>>>>
> > >>>>> public interface Reloader {
> > >>>>>
> > >>>>>     void reload();
> > >>>>> }
> > >>>>> ```
> > >>>>>
> > >>>>> Best,
> > >>>>> Jingsong
> > >>>>>
> > >>>>> On Fri, May 27, 2022 at 11:10 AM Jingsong Li <jingsonglee0@gmail.com> wrote:
> > >>>>>
> > >>>>>> Thanks Qingsheng and all for your discussion.
> > >>>>>>
> > >>>>>> Very sorry to jump in so late.
> > >>>>>>
> > >>>>>> Maybe I missed something?
> > >>>>>> My first impression when I saw the cache interface was: why don't we provide an interface similar to guava cache [1]? On top of guava cache, caffeine also makes extensions for asynchronous calls [2]. There is also the bulk load in caffeine too.
> > >>>>>>
> > >>>>>> I am also more confused why we first go from LookupCacheFactory.Builder and then to Factory to create the Cache.
> > >>>>>>
> > >>>>>> [1] https://github.com/google/guava
> > >>>>>> [2] https://github.com/ben-manes/caffeine/wiki/Population
> > >>>>>>
> > >>>>>> Best,
> > >>>>>> Jingsong
> > >>>>>>
> > >>>>>> On Thu, May 26, 2022 at 11:17 PM Jark Wu <im...@gmail.com> wrote:
> > >>>>>>
> > >>>>>>> After looking at the newly introduced ReloadTime and Becket's comment, I agree with Becket that we should have a pluggable reloading strategy. We can provide some common implementations, e.g., periodic reloading and daily reloading. But there will definitely be some connector- or business-specific reloading strategies, e.g. notified by a zookeeper watcher, or reloading once a new Hive partition is complete.
> > >>>>>>>
> > >>>>>>> Best,
> > >>>>>>> Jark
> > >>>>>>>
> > >>>>>>> On Thu, 26 May 2022 at 11:52, Becket Qin <be...@gmail.com> wrote:
> > >>>>>>>
> > >>>>>>>> Hi Qingsheng,
> > >>>>>>>>
> > >>>>>>>> Thanks for updating the FLIP. A few comments / questions below:
> > >>>>>>>>
> > >>>>>>>> 1. Is there a reason that we have both "XXXFactory" and "XXXProvider"? What is the difference between them? If they are the same, can we just use XXXFactory everywhere?
> > >>>>>>>>
> > >>>>>>>> 2. Regarding the FullCachingLookupProvider, should the reloading policy also be pluggable? Periodical reloading can sometimes be tricky in practice. For example, if a user uses 24 hours as the cache refresh interval and some nightly batch job is delayed, the cache update may still see the stale data.
> > >>>>>>>>
> > >>>>>>>> 3. In DefaultLookupCacheFactory, it looks like InitialCapacity should be removed.
> > >>>>>>>>
> > >>>>>>>> 4. The purpose of LookupFunctionProvider#cacheMissingKey() seems a little confusing to me. If Optional<LookupCacheFactory> getCacheFactory() returns a non-empty factory, doesn't that already indicate to the framework to cache the missing keys? Also, why is this method returning an Optional<Boolean> instead of a boolean?
> > >>>>>>>>
> > >>>>>>>> Thanks,
> > >>>>>>>>
> > >>>>>>>> Jiangjie (Becket) Qin
> > >>>>>>>>
> > >>>>>>>> On Wed, May 25, 2022 at 5:07 PM Qingsheng Ren <renqschn@gmail.com> wrote:
> > >>>>>>>>
> > >>>>>>>>> Hi Lincoln and Jark,
> > >>>>>>>>>
> > >>>>>>>>> Thanks for the comments! If the community reaches a consensus that we use SQL hints instead of table options to decide whether to use sync or async mode, it's indeed not necessary to introduce the "lookup.async" option.
> > >>>>>>>>>
> > >>>>>>>>> I think it's a good idea to let the decision about async be made at the query level, which could enable better optimization with more information gathered by the planner. Is there any FLIP describing the issue in FLINK-27625? I thought FLIP-234 was only proposing adding a SQL hint for retry on missing, instead of the entire async mode being controlled by a hint.
> > >>>>>>>>>
> > >>>>>>>>> Best regards,
> > >>>>>>>>>
> > >>>>>>>>> Qingsheng
> > >>>>>>>>>
> > >>>>>>>>>> On May 25, 2022, at 15:13, Lincoln Lee <lincoln.86xy@gmail.com> wrote:
> > >>>>>>>>>>
> > >>>>>>>>>> Hi Jark,
> > >>>>>>>>>>
> > >>>>>>>>>> Thanks for your reply!
> > >>>>>>>>>>
> > >>>>>>>>>> Currently 'lookup.async' just lies in the HBase connector. I have no idea whether or when to remove it (we can discuss it in another issue for the HBase connector after FLINK-27625 is done), so let's just not add it as a common option now.
> > >>>>>>>>>>
> > >>>>>>>>>> Best,
> > >>>>>>>>>> Lincoln Lee
> > >>>>>>>>>>
> > >>>>>>>>>> On Tue, May 24, 2022 at 20:14, Jark Wu <im...@gmail.com> wrote:
> > >>>>>>>>>>
> > >>>>>>>>>>> Hi Lincoln,
> > >>>>>>>>>>>
> > >>>>>>>>>>> I have taken a look at FLIP-234, and I agree with you that the connectors can provide both async and sync runtime providers simultaneously instead of one of them.
> > >>>>>>>>>>> At that point, "lookup.async" looks redundant. If this option is planned to be removed in the long term, I think it makes sense not to introduce it in this FLIP.
> > >>>>>>>>>>>
> > >>>>>>>>>>> Best,
> > >>>>>>>>>>> Jark
> > >>>>>>>>>>>
> > >>>>>>>>>>> On Tue, 24 May 2022 at 11:08, Lincoln Lee <lincoln.86xy@gmail.com> wrote:
> > >>>>>>>>>>>
> > >>>>>>>>>>>> Hi Qingsheng,
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Sorry for jumping into the discussion so late. It's a good idea that we can have a common table option. I have a minor comment on 'lookup.async': let's not make it a common option.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> The table layer abstracts both sync and async lookup capabilities; connector implementers can choose one or both. In the case of implementing only one capability (the status of most of the existing builtin connectors), 'lookup.async' will not be used. And when a connector has both capabilities, I think this choice is more suitable for making decisions at the query level: for example, the table planner can choose the physical implementation of async lookup or sync lookup based on its cost model, or users can give a query hint based on their own better understanding. If there is another common table option 'lookup.async', it may confuse users in the long run.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> So, I prefer to leave the 'lookup.async' option in a private place (for the current HBase connector) and not turn it into a common option.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> WDYT?
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Best,
> > >>>>>>>>>>>> Lincoln Lee
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> On Mon, May 23, 2022 at 14:54, Qingsheng Ren <re...@gmail.com> wrote:
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>> Hi Alexander,
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> Thanks for the review! We recently updated the FLIP and you can find those changes in my latest email. Since some terminologies have changed, I'll use the new concepts when replying to your comments.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> 1. Builder vs 'of'
> > >>>>>>>>>>>>> I'm OK to use the builder pattern if we have additional optional parameters for full caching mode ("rescan" previously). The schedule-with-delay idea looks reasonable to me, but I think we need to redesign the builder API of full caching to make it more descriptive for developers. Would you mind sharing your ideas about the API? For accessing the FLIP workspace you can just provide your account ID and ping any PMC member including Jark.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> 2. Common table options
> > >>>>>>>>>>>>> We have had some discussions these days and propose to introduce 8 common table options about caching. They have been updated on the FLIP.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> 3. Retries
> > >>>>>>>>>>>>> I think we are on the same page :-)
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> For your additional concerns:
> > >>>>>>>>>>>>> 1) The table option has been updated.
> > >>>>>>>>>>>>> 2) We got "lookup.cache" back for configuring whether to use partial or full caching mode.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> Best regards,
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> Qingsheng
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>> On May 19, 2022, at 17:25, Александр Смирнов <smiralexan@gmail.com> wrote:
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> Also I have a few additions:
> > >>>>>>>>>>>>>> 1) Maybe rename 'lookup.cache.maximum-size' to 'lookup.cache.max-rows'? I think it will be more clear that we are talking not about bytes, but about the number of rows. Plus it fits better, considering my optimization with filters.
> > >>>>>>>>>>>>>> 2) How will users enable rescanning? Are we going to separate caching and rescanning from the options point of view? Initially we had one option 'lookup.cache' with values LRU / ALL. I think now we can make a boolean option 'lookup.rescan'. RescanInterval can be 'lookup.rescan.interval', etc.
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> Best regards,
> > >>>>>>>>>>>>>> Alexander
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> On Thu, May 19, 2022 at 14:50, Александр Смирнов <smiralexan@gmail.com> wrote:
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> Hi Qingsheng and Jark,
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> 1. Builders vs 'of'
> > >>>>>>>>>>>>>>> I understand that builders are used when we have
> > >> multiple
> > >>>>>>>>>>> parameters.
> > >>>>>>>>>>>>>>> I suggested them because we could add parameters later.
> > >> To
> > >>>>>>> prevent
> > >>>>>>>>>>>>>>> Builder for ScanRuntimeProvider from looking redundant I
> > >>> can
> > >>>>>>>>> suggest
> > >>>>>>>>>>>>>>> one more config now - "rescanStartTime".
> > >>>>>>>>>>>>>>> It's a time in UTC (LocalTime class) when the first
> > >> reload
> > >>>> of
> > >>>>>>> cache
> > >>>>>>>>>>>>>>> starts. This parameter can be thought of as
> > >> 'initialDelay'
> > >>>>> (diff
> > >>>>>>>>>>>>>>> between current time and rescanStartTime) in method
> > >>>>>>>>>>>>>>> ScheduleExecutorService#scheduleWithFixedDelay [1] . It
> > >>> can
> > >>>> be
> > >>>>>>> very
> > >>>>>>>>>>>>>>> useful when the dimension table is updated by some other
> > >>>>>>> scheduled
> > >>>>>>>>>>> job
> > >>>>>>>>>>>>>>> at a certain time. Or when the user simply wants a
> > >> second
> > >>>> scan
> > >>>>>>>>>>> (first
> > >>>>>>>>>>>>>>> cache reload) be delayed. This option can be used even
> > >>>> without
> > >>>>>>>>>>>>>>> 'rescanInterval' - in this case 'rescanInterval' will be
> > >>> one
> > >>>>>>> day.
> > >>>>>>>>>>>>>>> If you are fine with this option, I would be very glad
> > >> if
> > >>>> you
> > >>>>>>> would
> > >>>>>>>>>>>>>>> give me access to edit FLIP page, so I could add it
> > >> myself
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> 2. Common table options
> > >>>>>>>>>>>>>>> I also think that FactoryUtil would be overloaded by all
> > >>>> cache
> > >>>>>>>>>>>>>>> options. But maybe unify all suggested options, not only
> > >>> for
> > >>>>>>>>> default
> > >>>>>>>>>>>>>>> cache? I.e. class 'LookupOptions', that unifies default
> > >>>> cache
> > >>>>>>>>>>> options,
> > >>>>>>>>>>>>>>> rescan options, 'async', 'maxRetries'. WDYT?
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> 3. Retries
> > >>>>>>>>>>>>>>> I'm fine with suggestion close to
> > >>> RetryUtils#tryTimes(times,
> > >>>>>>> call)
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> [1]
> https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/ScheduledExecutorService.html#scheduleWithFixedDelay-java.lang.Runnable-long-long-java.util.concurrent.TimeUnit-
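A rough sketch of the proposed 'rescanStartTime' idea: compute the initial delay as the difference between the current time and the configured start time (wrapping to the next day), then hand it to scheduleWithFixedDelay. Class and method names here are illustrative only, not part of the FLIP or any existing Flink API.

```java
import java.time.Duration;
import java.time.LocalTime;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical helper around the discussed 'rescanStartTime' option.
class RescanScheduler {

    // Delay until the next occurrence of rescanStartTime (UTC); if the time
    // has already passed today, wrap around to tomorrow.
    static Duration initialDelay(LocalTime now, LocalTime rescanStartTime) {
        Duration delay = Duration.between(now, rescanStartTime);
        return delay.isNegative() ? delay.plusDays(1) : delay;
    }

    // Schedule periodic cache reloads: first run at rescanStartTime, then
    // every rescanInterval (one day by default per the discussion).
    static void schedule(ScheduledExecutorService executor,
                         Runnable reloadCache,
                         LocalTime now,
                         LocalTime rescanStartTime,
                         Duration rescanInterval) {
        executor.scheduleWithFixedDelay(
                reloadCache,
                initialDelay(now, rescanStartTime).toMillis(),
                rescanInterval.toMillis(),
                TimeUnit.MILLISECONDS);
    }
}
```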
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> Best regards,
> > >>>>>>>>>>>>>>> Alexander
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> ср, 18 мая 2022 г. в 16:04, Qingsheng Ren <
> > >>>> renqschn@gmail.com
> > >>>>>> :
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> Hi Jark and Alexander,
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> Thanks for your comments! I’m also OK to introduce
> > >> common
> > >>>>> table
> > >>>>>>>>>>>>> options. I prefer to introduce a new
> > >>> DefaultLookupCacheOptions
> > >>>>>>> class
> > >>>>>>>>>>> for
> > >>>>>>>>>>>>> holding these option definitions because putting all
> > >> options
> > >>>>> into
> > >>>>>>>>>>>>> FactoryUtil would make it a bit ”crowded” and not well
> > >>>>>>> categorized.
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> FLIP has been updated according to suggestions above:
> > >>>>>>>>>>>>>>>> 1. Use static “of” method for constructing
> > >>>>>>> RescanRuntimeProvider
> > >>>>>>>>>>>>> considering both arguments are required.
> > >>>>>>>>>>>>>>>> 2. Introduce new table options matching
> > >>>>>>> DefaultLookupCacheFactory
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> Best,
> > >>>>>>>>>>>>>>>> Qingsheng
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> On Wed, May 18, 2022 at 2:57 PM Jark Wu <
> > >>> imjark@gmail.com>
> > >>>>>>> wrote:
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> Hi Alex,
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> 1) retry logic
> > >>>>>>>>>>>>>>>>> I think we can extract some common retry logic into
> > >>>>> utilities,
> > >>>>>>>>>>> e.g.
> > >>>>>>>>>>>>> RetryUtils#tryTimes(times, call).
> > >>>>>>>>>>>>>>>>> This seems independent of this FLIP and can be reused
> > >> by
> > >>>>>>>>>>> DataStream
> > >>>>>>>>>>>>> users.
> > >>>>>>>>>>>>>>>>> Maybe we can open an issue to discuss this and where
> > >> to
> > >>>> put
> > >>>>>>> it.
> > >>>>>>>>>>>>>>>>>
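A minimal sketch of what a RetryUtils#tryTimes(times, call) utility could look like; the name comes from the discussion above, and the implementation is an assumption, not an existing Flink class.

```java
import java.util.concurrent.Callable;

// Sketch of the discussed retry utility; wraps checked exceptions so callers
// don't have to declare them.
class RetryUtils {
    static <T> T tryTimes(int times, Callable<T> call) {
        RuntimeException last = null;
        for (int attempt = 1; attempt <= times; attempt++) {
            try {
                return call.call();
            } catch (Exception e) {
                // A connector could hook custom recovery between attempts here,
                // e.g. re-establishing a connection.
                last = new RuntimeException("Attempt " + attempt + " failed", e);
            }
        }
        throw last;
    }
}
```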
> > >>>>>>>>>>>>>>>>> 2) cache ConfigOptions
> > >>>>>>>>>>>>>>>>> I'm fine with defining cache config options in the
> > >>>>> framework.
> > >>>>>>>>>>>>>>>>> A candidate place to put is FactoryUtil which also
> > >>>> includes
> > >>>>>>>>>>>>> "sink.parallelism", "format" options.
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> Best,
> > >>>>>>>>>>>>>>>>> Jark
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> On Wed, 18 May 2022 at 13:52, Александр Смирнов <
> > >>>>>>>>>>>> smiralexan@gmail.com>
> > >>>>>>>>>>>>> wrote:
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> Hi Qingsheng,
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> Thank you for considering my comments.
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> there might be custom logic before making retry,
> > >> such
> > >>> as
> > >>>>>>>>>>>>> re-establish the connection
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> Yes, I understand that. I meant that such logic can
> > >> be
> > >>>>>>> placed in
> > >>>>>>>>>>> a
> > >>>>>>>>>>>>>>>>>> separate function, that can be implemented by
> > >>> connectors.
> > >>>>>>> Just
> > >>>>>>>>>>>> moving
> > >>>>>>>>>>>>>>>>>> the retry logic would make connector's LookupFunction
> > >>>> more
> > >>>>>>>>>>> concise
> > >>>>>>>>>>>> +
> > >>>>>>>>>>>>>>>>>> avoid duplicate code. However, it's a minor change.
> > >> The
> > >>>>>>> decision
> > >>>>>>>>>>> is
> > >>>>>>>>>>>>> up
> > >>>>>>>>>>>>>>>>>> to you.
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> We decided not to provide common DDL options and let
> > >>>>>>>>>>>>>>>>>>> developers define their own options, as we do now per connector.
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> What is the reason for that? One of the main goals of
> > >>>> this
> > >>>>>>> FLIP
> > >>>>>>>>>>> was
> > >>>>>>>>>>>>> to
> > >>>>>>>>>>>>>>>>>> unify the configs, wasn't it? I understand that
> > >> current
> > >>>>> cache
> > >>>>>>>>>>>> design
> > >>>>>>>>>>>>>>>>>> doesn't depend on ConfigOptions, like was before. But
> > >>>> still
> > >>>>>>> we
> > >>>>>>>>>>> can
> > >>>>>>>>>>>>> put
> > >>>>>>>>>>>>>>>>>> these options into the framework, so connectors can
> > >>> reuse
> > >>>>>>> them
> > >>>>>>>>>>> and
> > >>>>>>>>>>>>>>>>>> avoid code duplication, and, what is more
> > >> significant,
> > >>>>> avoid
> > >>>>>>>>>>>> possible
> > >>>>>>>>>>>>>>>>>> different options naming. This moment can be pointed
> > >>> out
> > >>>> in
> > >>>>>>>>>>>>>>>>>> documentation for connector developers.
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> Best regards,
> > >>>>>>>>>>>>>>>>>> Alexander
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> вт, 17 мая 2022 г. в 17:11, Qingsheng Ren <
> > >>>>>>> renqschn@gmail.com>:
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> Hi Alexander,
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> Thanks for the review and glad to see we are on the
> > >>> same
> > >>>>>>> page!
> > >>>>>>>>> I
> > >>>>>>>>>>>>> think you forgot to cc the dev mailing list so I’m also
> > >>>> quoting
> > >>>>>>> your
> > >>>>>>>>>>>> reply
> > >>>>>>>>>>>>> under this email.
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>> We can add 'maxRetryTimes' option into this class
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> In my opinion the retry logic should be implemented
> > >> in
> > >>>>>>> lookup()
> > >>>>>>>>>>>>> instead of in LookupFunction#eval(). Retrying is only
> > >>>> meaningful
> > >>>>>>>>> under
> > >>>>>>>>>>>> some
> > >>>>>>>>>>>>> specific retriable failures, and there might be custom
> > >> logic
> > >>>>>>> before
> > >>>>>>>>>>>> making
> > >>>>>>>>>>>>> retry, such as re-establish the connection
> > >>>>>>> (JdbcRowDataLookupFunction
> > >>>>>>>>>>> is
> > >>>>>>>>>>>> an
> > >>>>>>>>>>>>> example), so it's more handy to leave it to the connector.
> > >>>>>>>>>>>>>>>>>>>
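The split described here (eval() as framework glue, lookup() as the connector entry point that owns retries/reconnects) can be sketched roughly as below. The types are simplified stand-ins (String instead of RowData, a plain list instead of a Collector), not the real flink-table-common signatures.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

// Rough sketch of the proposed LookupFunction shape: connectors implement
// lookup() (and may retry inside it), while eval() stays framework-facing.
abstract class LookupFunctionSketch {
    final List<String> collected = new ArrayList<>();

    // Connector implements this; custom retry/reconnect logic belongs here.
    abstract Collection<String> lookup(String keyRow) throws IOException;

    // Framework glue: delegates to lookup() and collects the joined rows
    // (the real class forwards to a Collector instead of a local list).
    final void eval(String keyRow) {
        try {
            collected.addAll(lookup(keyRow));
        } catch (IOException e) {
            throw new RuntimeException("Failed to lookup key " + keyRow, e);
        }
    }
}
```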
> > >>>>>>>>>>>>>>>>>>>> I don't see DDL options, that were in previous
> > >>> version
> > >>>> of
> > >>>>>>>>> FLIP.
> > >>>>>>>>>>>> Do
> > >>>>>>>>>>>>> you have any special plans for them?
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> We decided not to provide common DDL options and let
> > >>>>>>>>>>>>>>>>>>> developers define their own options, as we do now per connector.
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> The rest of comments sound great and I’ll update the
> > >>>> FLIP.
> > >>>>>>> Hope
> > >>>>>>>>>>> we
> > >>>>>>>>>>>>> can finalize our proposal soon!
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> Best,
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> Qingsheng
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>> On May 17, 2022, at 13:46, Александр Смирнов <
> > >>>>>>>>>>>> smiralexan@gmail.com>
> > >>>>>>>>>>>>> wrote:
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>> Hi Qingsheng and devs!
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>> I like the overall design of updated FLIP, however
> > >> I
> > >>>> have
> > >>>>>>>>>>> several
> > >>>>>>>>>>>>>>>>>>>> suggestions and questions.
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>> 1) Introducing LookupFunction as a subclass of
> > >>>>>>> TableFunction
> > >>>>>>>>>>> is a
> > >>>>>>>>>>>>> good
> > >>>>>>>>>>>>>>>>>>>> idea. We can add 'maxRetryTimes' option into this
> > >>>> class.
> > >>>>>>>>> 'eval'
> > >>>>>>>>>>>>> method
> > >>>>>>>>>>>>>>>>>>>> of new LookupFunction is great for this purpose.
> > >> The
> > >>>> same
> > >>>>>>> is
> > >>>>>>>>>>> for
> > >>>>>>>>>>>>>>>>>>>> 'async' case.
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>> 2) There might be other configs in future, such as
> > >>>>>>>>>>>>> 'cacheMissingKey'
> > >>>>>>>>>>>>>>>>>>>> in LookupFunctionProvider or 'rescanInterval' in
> > >>>>>>>>>>>>> ScanRuntimeProvider.
> > >>>>>>>>>>>>>>>>>>>> Maybe use Builder pattern in LookupFunctionProvider
> > >>> and
> > >>>>>>>>>>>>>>>>>>>> RescanRuntimeProvider for more flexibility (use one
> > >>>>> 'build'
> > >>>>>>>>>>>> method
> > >>>>>>>>>>>>>>>>>>>> instead of many 'of' methods in future)?
> > >>>>>>>>>>>>>>>>>>>>
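The builder idea from point 2 could look roughly like the sketch below: new options (e.g. 'cacheMissingKey') can be added later without multiplying static of(...) overloads. The names follow the discussion and are assumptions, not the final FLIP-221 API.

```java
// Hypothetical builder-style provider; Runnable stands in for the real
// LookupFunction type to keep the sketch self-contained.
class LookupFunctionProviderSketch {
    private final Runnable lookupFunction;
    private final boolean cacheMissingKey;

    private LookupFunctionProviderSketch(Runnable lookupFunction, boolean cacheMissingKey) {
        this.lookupFunction = lookupFunction;
        this.cacheMissingKey = cacheMissingKey;
    }

    boolean isCacheMissingKey() { return cacheMissingKey; }

    static Builder newBuilder() { return new Builder(); }

    static class Builder {
        private Runnable lookupFunction;
        private boolean cacheMissingKey = true; // assumed default

        Builder withLookupFunction(Runnable f) { this.lookupFunction = f; return this; }
        Builder cacheMissingKey(boolean cache) { this.cacheMissingKey = cache; return this; }

        // One build() instead of a growing family of of(...) factory methods.
        LookupFunctionProviderSketch build() {
            return new LookupFunctionProviderSketch(lookupFunction, cacheMissingKey);
        }
    }
}
```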
> > >>>>>>>>>>>>>>>>>>>> 3) What are the plans for existing
> > >>>> TableFunctionProvider
> > >>>>>>> and
> > >>>>>>>>>>>>>>>>>>>> AsyncTableFunctionProvider? I think they should be
> > >>>>>>> deprecated.
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>> 4) Am I right that the current design does not
> > >> assume
> > >>>>>>> usage of
> > >>>>>>>>>>>>>>>>>>>> user-provided LookupCache in re-scanning? In this
> > >>> case,
> > >>>>> it
> > >>>>>>> is
> > >>>>>>>>>>> not
> > >>>>>>>>>>>>> very
> > >>>>>>>>>>>>>>>>>>>> clear why do we need methods such as 'invalidate'
> > >> or
> > >>>>>>> 'putAll'
> > >>>>>>>>>>> in
> > >>>>>>>>>>>>>>>>>>>> LookupCache.
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>> 5) I don't see DDL options, that were in previous
> > >>>> version
> > >>>>>>> of
> > >>>>>>>>>>>> FLIP.
> > >>>>>>>>>>>>> Do
> > >>>>>>>>>>>>>>>>>>>> you have any special plans for them?
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>> If you don't mind, I would be glad to be able to
> > >> make
> > >>>>> small
> > >>>>>>>>>>>>>>>>>>>> adjustments to the FLIP document too. I think it's
> > >>>> worth
> > >>>>>>>>>>>> mentioning
> > >>>>>>>>>>>>>>>>>>>> about what exactly optimizations are planning in
> > >> the
> > >>>>>>> future.
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>> Best regards,
> > >>>>>>>>>>>>>>>>>>>> Smirnov Alexander
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>> пт, 13 мая 2022 г. в 20:27, Qingsheng Ren <
> > >>>>>>> renqschn@gmail.com
> > >>>>>>>>>>>> :
> > >>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>> Hi Alexander and devs,
> > >>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>> Thank you very much for the in-depth discussion!
> > >> As
> > >>>> Jark
> > >>>>>>>>>>>>> mentioned we were inspired by Alexander's idea and made a
> > >>>>>>> refactor on
> > >>>>>>>>>>> our
> > >>>>>>>>>>>>> design. FLIP-221 [1] has been updated to reflect our
> > >> design
> > >>>> now
> > >>>>>>> and
> > >>>>>>>>> we
> > >>>>>>>>>>>> are
> > >>>>>>>>>>>>> happy to hear more suggestions from you!
> > >>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>> Compared to the previous design:
> > >>>>>>>>>>>>>>>>>>>>> 1. The lookup cache serves at table runtime level
> > >>> and
> > >>>> is
> > >>>>>>>>>>>>> integrated as a component of LookupJoinRunner as discussed
> > >>>>>>>>> previously.
> > >>>>>>>>>>>>>>>>>>>>> 2. Interfaces are renamed and re-designed to
> > >> reflect
> > >>>> the
> > >>>>>>> new
> > >>>>>>>>>>>>> design.
> > >>>>>>>>>>>>>>>>>>>>> 3. We handle the all-caching case separately and introduce a
> > >>>>>>>>>>>>> new RescanRuntimeProvider to reuse the ability of
> > >> scanning.
> > >>> We
> > >>>>> are
> > >>>>>>>>>>>> planning
> > >>>>>>>>>>>>> to support SourceFunction / InputFormat for now
> > >> considering
> > >>>> the
> > >>>>>>>>>>>> complexity
> > >>>>>>>>>>>>> of FLIP-27 Source API.
> > >>>>>>>>>>>>>>>>>>>>> 4. A new interface LookupFunction is introduced to
> > >>>> make
> > >>>>>>> the
> > >>>>>>>>>>>>> semantic of lookup more straightforward for developers.
> > >>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>> For replying to Alexander:
> > >>>>>>>>>>>>>>>>>>>>>> However I'm a little confused whether InputFormat
> > >>> is
> > >>>>>>>>>>> deprecated
> > >>>>>>>>>>>>> or not. Am I right that it will be so in the future, but
> > >>>>> currently
> > >>>>>>>>> it's
> > >>>>>>>>>>>> not?
> > >>>>>>>>>>>>>>>>>>>>> Yes you are right. InputFormat is not deprecated
> > >> for
> > >>>>> now.
> > >>>>>>> I
> > >>>>>>>>>>>> think
> > >>>>>>>>>>>>> it will be deprecated in the future but we don't have a
> > >>> clear
> > >>>>> plan
> > >>>>>>>>> for
> > >>>>>>>>>>>> that.
> > >>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>> Thanks again for the discussion on this FLIP and
> > >>>> looking
> > >>>>>>>>>>> forward
> > >>>>>>>>>>>>> to cooperating with you after we finalize the design and
> > >>>>>>> interfaces!
> > >>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>> [1]
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-221+Abstraction+for+lookup+source+cache+and+metric
> > >>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>> Best regards,
> > >>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>> Qingsheng
> > >>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>> On Fri, May 13, 2022 at 12:12 AM Александр
> > >> Смирнов <
> > >>>>>>>>>>>>> smiralexan@gmail.com> wrote:
> > >>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>> Hi Jark, Qingsheng and Leonard!
> > >>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>> Glad to see that we came to a consensus on almost
> > >>> all
> > >>>>>>>>> points!
> > >>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>> However I'm a little confused whether InputFormat
> > >>> is
> > >>>>>>>>>>> deprecated
> > >>>>>>>>>>>>> or
> > >>>>>>>>>>>>>>>>>>>>>> not. Am I right that it will be so in the future,
> > >>> but
> > >>>>>>>>>>> currently
> > >>>>>>>>>>>>> it's
> > >>>>>>>>>>>>>>>>>>>>>> not? Actually I also think that for the first
> > >>> version
> > >>>>>>> it's
> > >>>>>>>>> OK
> > >>>>>>>>>>>> to
> > >>>>>>>>>>>>> use
> > >>>>>>>>>>>>>>>>>>>>>> InputFormat in ALL cache realization, because
> > >>>>> supporting
> > >>>>>>>>>>> rescan
> > >>>>>>>>>>>>>>>>>>>>>> ability seems like a very distant prospect. But
> > >> for
> > >>>>> this
> > >>>>>>>>>>>>> decision we
> > >>>>>>>>>>>>>>>>>>>>>> need a consensus among all discussion
> > >> participants.
> > >>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>> In general, I have nothing to argue with in your
> > >>>>>>>>>>>>>>>>>>>>>> statements. All of them correspond to my ideas. Looking ahead, it
> > >>> would
> > >>>> be
> > >>>>>>> nice
> > >>>>>>>>>>> to
> > >>>>>>>>>>>>> work
> > >>>>>>>>>>>>>>>>>>>>>> on this FLIP cooperatively. I've already done a
> > >> lot
> > >>>> of
> > >>>>>>> work
> > >>>>>>>>>>> on
> > >>>>>>>>>>>>> lookup
> > >>>>>>>>>>>>>>>>>>>>>> join caching with realization very close to the
> > >> one
> > >>>> we
> > >>>>>>> are
> > >>>>>>>>>>>>> discussing,
> > >>>>>>>>>>>>>>>>>>>>>> and want to share the results of this work.
> > >> Anyway
> > >>>>>>> looking
> > >>>>>>>>>>>>> forward for
> > >>>>>>>>>>>>>>>>>>>>>> the FLIP update!
> > >>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>> Best regards,
> > >>>>>>>>>>>>>>>>>>>>>> Smirnov Alexander
> > >>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>> чт, 12 мая 2022 г. в 17:38, Jark Wu <
> > >>>> imjark@gmail.com
> > >>>>>> :
> > >>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>> Hi Alex,
> > >>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>> Thanks for summarizing your points.
> > >>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>> In the past week, Qingsheng, Leonard, and I have
> > >>>>>>> discussed
> > >>>>>>>>>>> it
> > >>>>>>>>>>>>> several times
> > >>>>>>>>>>>>>>>>>>>>>>> and we have totally refactored the design.
> > >>>>>>>>>>>>>>>>>>>>>>> I'm glad to say we have reached a consensus on
> > >>> many
> > >>>> of
> > >>>>>>> your
> > >>>>>>>>>>>>> points!
> > >>>>>>>>>>>>>>>>>>>>>>> Qingsheng is still working on updating the
> > >> design
> > >>>> docs
> > >>>>>>> and
> > >>>>>>>>>>>>> maybe can be
> > >>>>>>>>>>>>>>>>>>>>>>> available in the next few days.
> > >>>>>>>>>>>>>>>>>>>>>>> I will share some conclusions from our
> > >>> discussions:
> > >>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>> 1) we have refactored the design towards to
> > >> "cache
> > >>>> in
> > >>>>>>>>>>>>> framework" way.
> > >>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>> 2) a "LookupCache" interface for users to
> > >>> customize
> > >>>>> and
> > >>>>>>> a
> > >>>>>>>>>>>>> default
> > >>>>>>>>>>>>>>>>>>>>>>> implementation with builder for users to
> > >> easy-use.
> > >>>>>>>>>>>>>>>>>>>>>>> This can both make it possible to both have
> > >>>>> flexibility
> > >>>>>>> and
> > >>>>>>>>>>>>> conciseness.
> > >>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>> 3) Filter pushdown is important for ALL and LRU
> > >>>> lookup
> > >>>>>>>>>>> cache,
> > >>>>>>>>>>>>> esp reducing
> > >>>>>>>>>>>>>>>>>>>>>>> IO.
> > >>>>>>>>>>>>>>>>>>>>>>> Filter pushdown should be the final state and
> > >> the
> > >>>>>>> unified
> > >>>>>>>>>>> way
> > >>>>>>>>>>>>> to both
> > >>>>>>>>>>>>>>>>>>>>>>> support pruning ALL cache and LRU cache,
> > >>>>>>>>>>>>>>>>>>>>>>> so I think we should make effort in this
> > >>> direction.
> > >>>> If
> > >>>>>>> we
> > >>>>>>>>>>> need
> > >>>>>>>>>>>>> to support
> > >>>>>>>>>>>>>>>>>>>>>>> filter pushdown for ALL cache anyway, why not
> > >> use
> > >>>>>>>>>>>>>>>>>>>>>>> it for LRU cache as well? Either way, as we
> > >> decide
> > >>>> to
> > >>>>>>>>>>>> implement
> > >>>>>>>>>>>>> the cache
> > >>>>>>>>>>>>>>>>>>>>>>> in the framework, we have the chance to support
> > >>>>>>>>>>>>>>>>>>>>>>> filter on cache anytime. This is an optimization
> > >>> and
> > >>>>> it
> > >>>>>>>>>>>> doesn't
> > >>>>>>>>>>>>> affect the
> > >>>>>>>>>>>>>>>>>>>>>>> public API. I think we can create a JIRA issue
> > >> to
> > >>>>>>>>>>>>>>>>>>>>>>> discuss it when the FLIP is accepted.
> > >>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>> 4) The idea to support ALL cache is similar to
> > >>> your
> > >>>>>>>>>>> proposal.
> > >>>>>>>>>>>>>>>>>>>>>>> In the first version, we will only support
> > >>>>> InputFormat,
> > >>>>>>>>>>>>> SourceFunction for
> > >>>>>>>>>>>>>>>>>>>>>>> cache all (invoke InputFormat in join operator).
> > >>>>>>>>>>>>>>>>>>>>>>> For FLIP-27 source, we need to join a true
> > >> source
> > >>>>>>> operator
> > >>>>>>>>>>>>> instead of
> > >>>>>>>>>>>>>>>>>>>>>>> calling it embedded in the join operator.
> > >>>>>>>>>>>>>>>>>>>>>>> However, this needs another FLIP to support the
> > >>>>> re-scan
> > >>>>>>>>>>>> ability
> > >>>>>>>>>>>>> for FLIP-27
> > >>>>>>>>>>>>>>>>>>>>>>> Source, and this can be a large work.
> > >>>>>>>>>>>>>>>>>>>>>>> In order to not block this issue, we can put the
> > >>>>> effort
> > >>>>>>> of
> > >>>>>>>>>>>>> FLIP-27 source
> > >>>>>>>>>>>>>>>>>>>>>>> integration into future work and integrate
> > >>>>>>>>>>>>>>>>>>>>>>> InputFormat&SourceFunction for now.
> > >>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>> I think it's fine to use
> > >>> InputFormat&SourceFunction,
> > >>>>> as
> > >>>>>>>>> they
> > >>>>>>>>>>>>> are not
> > >>>>>>>>>>>>>>>>>>>>>>> deprecated, otherwise, we have to introduce
> > >>> another
> > >>>>>>>>> function
> > >>>>>>>>>>>>>>>>>>>>>>> similar to them which is meaningless. We need to
> > >>>> plan
> > >>>>>>>>>>> FLIP-27
> > >>>>>>>>>>>>> source
> > >>>>>>>>>>>>>>>>>>>>>>> integration ASAP before InputFormat &
> > >>> SourceFunction
> > >>>>> are
> > >>>>>>>>>>>>> deprecated.
> > >>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>> Best,
> > >>>>>>>>>>>>>>>>>>>>>>> Jark
> > >>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>> On Thu, 12 May 2022 at 15:46, Александр Смирнов
> > >> <
> > >>>>>>>>>>>>> smiralexan@gmail.com>
> > >>>>>>>>>>>>>>>>>>>>>>> wrote:
> > >>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>> Hi Martijn!
> > >>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>> Got it. Therefore, the realization with
> > >>> InputFormat
> > >>>>> is
> > >>>>>>> not
> > >>>>>>>>>>>>> considered.
> > >>>>>>>>>>>>>>>>>>>>>>>> Thanks for clearing that up!
> > >>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>> Best regards,
> > >>>>>>>>>>>>>>>>>>>>>>>> Smirnov Alexander
> > >>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>> чт, 12 мая 2022 г. в 14:23, Martijn Visser <
> > >>>>>>>>>>>>> martijn@ververica.com>:
> > >>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>> Hi,
> > >>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>> With regards to:
> > >>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>> But if there are plans to refactor all
> > >>> connectors
> > >>>>> to
> > >>>>>>>>>>>> FLIP-27
> > >>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>> Yes, FLIP-27 is the target for all connectors.
> > >>> The
> > >>>>> old
> > >>>>>>>>>>>>> interfaces will be
> > >>>>>>>>>>>>>>>>>>>>>>>>> deprecated and connectors will either be
> > >>>> refactored
> > >>>>> to
> > >>>>>>>>> use
> > >>>>>>>>>>>>> the new ones
> > >>>>>>>>>>>>>>>>>>>>>>>> or
> > >>>>>>>>>>>>>>>>>>>>>>>>> dropped.
> > >>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>> The caching should work for connectors that
> > >> are
> > >>>>> using
> > >>>>>>>>>>>> FLIP-27
> > >>>>>>>>>>>>> interfaces,
> > >>>>>>>>>>>>>>>>>>>>>>>>> we should not introduce new features for old
> > >>>>>>> interfaces.
> > >>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>> Best regards,
> > >>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>> Martijn
> > >>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>> On Thu, 12 May 2022 at 06:19, Александр
> > >> Смирнов
> > >>> <
> > >>>>>>>>>>>>> smiralexan@gmail.com>
> > >>>>>>>>>>>>>>>>>>>>>>>>> wrote:
> > >>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>> Hi Jark!
> > >>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>> Sorry for the late response. I would like to
> > >>> make
> > >>>>>>> some
> > >>>>>>>>>>>>> comments and
> > >>>>>>>>>>>>>>>>>>>>>>>>>> clarify my points.
> > >>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>> 1) I agree with your first statement. I think
> > >>> we
> > >>>>> can
> > >>>>>>>>>>>> achieve
> > >>>>>>>>>>>>> both
> > >>>>>>>>>>>>>>>>>>>>>>>>>> advantages this way: put the Cache interface
> > >> in
> > >>>>>>>>>>>>> flink-table-common,
> > >>>>>>>>>>>>>>>>>>>>>>>>>> but have implementations of it in
> > >>>>>>> flink-table-runtime.
> > >>>>>>>>>>>>> Therefore if a
> > >>>>>>>>>>>>>>>>>>>>>>>>>> connector developer wants to use existing
> > >> cache
> > >>>>>>>>>>> strategies
> > >>>>>>>>>>>>> and their
> > >>>>>>>>>>>>>>>>>>>>>>>>>> implementations, he can just pass
> > >> lookupConfig
> > >>> to
> > >>>>> the
> > >>>>>>>>>>>>> planner, but if
> > >>>>>>>>>>>>>>>>>>>>>>>>>> he wants to have its own cache implementation
> > >>> in
> > >>>>> his
> > >>>>>>>>>>>>> TableFunction, it
> > >>>>>>>>>>>>>>>>>>>>>>>>>> will be possible for him to use the existing
> > >>>>>>> interface
> > >>>>>>>>>>> for
> > >>>>>>>>>>>>> this
> > >>>>>>>>>>>>>>>>>>>>>>>>>> purpose (we can explicitly point this out in
> > >>> the
> > >>>>>>>>>>>>> documentation). In
> > >>>>>>>>>>>>>>>>>>>>>>>>>> this way all configs and metrics will be
> > >>> unified.
> > >>>>>>> WDYT?
> > >>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>> If a filter can prune 90% of data in the
> > >>> cache,
> > >>>> we
> > >>>>>>> will
> > >>>>>>>>>>>>> have 90% of
> > >>>>>>>>>>>>>>>>>>>>>>>>>> lookup requests that can never be cached
> > >>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>> 2) Let me clarify the logic of the filters
> > >>>>>>>>>>>>>>>>>>>>>>>>>> optimization in the case of an LRU cache.
> > >>>>>>>>>>>>>>>>>>>>>>>>>> It looks like Cache<RowData,
> > >>>> Collection<RowData>>.
> > >>>>>>> Here
> > >>>>>>>>>>> we
> > >>>>>>>>>>>>> always
> > >>>>>>>>>>>>>>>>>>>>>>>>>> store the response of the dimension table in
> > >>>> cache,
> > >>>>>>> even
> > >>>>>>>>>>>>> after
> > >>>>>>>>>>>>>>>>>>>>>>>>>> applying calc function. I.e. if there are no
> > >>> rows
> > >>>>>>> after
> > >>>>>>>>>>>>> applying
> > >>>>>>>>>>>>>>>>>>>>>>>>>> filters to the result of the 'eval' method of
> > >>>>>>>>>>>> TableFunction,
> > >>>>>>>>>>>>> we store
> > >>>>>>>>>>>>>>>>>>>>>>>>>> an empty list under the lookup keys. Therefore the
> > >>>>>>>>>>>>>>>>>>>>>>>>>> cache entry will be filled, but will require much less memory (in bytes).
> > >>>>>>>>>>> I.e.
> > >>>>>>>>>>>>> we don't
> > >>>>>>>>>>>>>>>>>>>>>>>>>> completely filter keys, by which result was
> > >>>> pruned,
> > >>>>>>> but
> > >>>>>>>>>>>>> significantly
> > >>>>>>>>>>>>>>>>>>>>>>>>>> reduce required memory to store this result.
> > >> If
> > >>>> the
> > >>>>>>> user
> > >>>>>>>>>>>>> knows about
> > >>>>>>>>>>>>>>>>>>>>>>>>>> this behavior, he can increase the 'max-rows'
> > >>>>> option
> > >>>>>>>>>>> before
> > >>>>>>>>>>>>> the start
> > >>>>>>>>>>>>>>>>>>>>>>>>>> of the job. But actually I came up with the
> > >>> idea
> > >>>>>>> that we
> > >>>>>>>>>>>> can
> > >>>>>>>>>>>>> do this
> > >>>>>>>>>>>>>>>>>>>>>>>>>> automatically by using the 'maximumWeight'
> > >> and
> > >>>>>>> 'weigher'
> > >>>>>>>>>>>>> methods of
> > >>>>>>>>>>>>>>>>>>>>>>>>>> GuavaCache [1]. Weight can be the size of the
> > >>>>>>> collection
> > >>>>>>>>>>> of
> > >>>>>>>>>>>>> rows
> > >>>>>>>>>>>>>>>>>>>>>>>>>> (value of cache). Therefore cache can
> > >>>> automatically
> > >>>>>>> fit
> > >>>>>>>>>>>> much
> > >>>>>>>>>>>>> more
> > >>>>>>>>>>>>>>>>>>>>>>>>>> records than before.
> > >>>>>>>>>>>>>>>>>>>>>>>>>>
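The maximumWeight/weigher idea above can be illustrated with a minimal self-contained stand-in (no Guava dependency): each cache entry is weighed by its row count, so empty results for filtered-out keys cost almost nothing, and far more keys fit under the same budget. This is an illustrative LRU sketch, not Flink's or Guava's implementation.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Weight-bounded LRU sketch: analogous to Guava's
// CacheBuilder.maximumWeight(...).weigher(...), with String standing in for RowData.
class WeighedLookupCache {
    private final long maxWeight;
    private long currentWeight = 0;
    private final LinkedHashMap<String, List<String>> cache =
            new LinkedHashMap<>(16, 0.75f, true); // access-order, for LRU eviction

    WeighedLookupCache(long maxWeight) { this.maxWeight = maxWeight; }

    // Weight of an entry = number of rows + 1, so empty results still count a little.
    private static long weigh(List<String> rows) { return rows.size() + 1; }

    void put(String key, List<String> rows) {
        List<String> old = cache.put(key, rows);
        if (old != null) currentWeight -= weigh(old);
        currentWeight += weigh(rows);
        // Evict least-recently-used entries until back under the weight budget.
        Iterator<Map.Entry<String, List<String>>> it = cache.entrySet().iterator();
        while (currentWeight > maxWeight && it.hasNext()) {
            currentWeight -= weigh(it.next().getValue());
            it.remove();
        }
    }

    List<String> get(String key) { return cache.get(key); }
    int size() { return cache.size(); }
}
```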
> > >>>>>>>>>>>>>>>>>>>>>>>>>>> Flink SQL has provided a standard way to do
> > >>>>> filters
> > >>>>>>> and
> > >>>>>>>>>>>>> projects
> > >>>>>>>>>>>>>>>>>>>>>>>>>> pushdown, i.e., SupportsFilterPushDown and
> > >>>>>>>>>>>>> SupportsProjectionPushDown.
> > >>>>>>>>>>>>>>>>>>>>>>>>>>> Jdbc/hive/HBase haven't implemented the
> > >>>>> interfaces,
> > >>>>>>>>>>> don't
> > >>>>>>>>>>>>> mean it's
> > >>>>>>>>>>>>>>>>>>>>>>>> hard
> > >>>>>>>>>>>>>>>>>>>>>>>>>> to implement.
> > >>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>> It's debatable how difficult it will be to
> > >>>>> implement
> > >>>>>>>>>>> filter
> > >>>>>>>>>>>>> pushdown.
> > >>>>>>>>>>>>>>>>>>>>>>>>>> But I think the fact that currently there is
> > >> no
> > >>>>>>> database
> > >>>>>>>>>>>>> connector
> > >>>>>>>>>>>>>>>>>>>>>>>>>> with filter pushdown at least means that this
> > >>>>> feature
> > >>>>>>>>>>> won't
> > >>>>>>>>>>>>> be
> > >>>>>>>>>>>>>>>>>>>>>>>>>> supported soon in connectors. Moreover, if we
> > >>>> talk
> > >>>>>>> about
> > >>>>>>>>>>>>> other
> > >>>>>>>>>>>>>>>>>>>>>>>>>> connectors (not in Flink repo), their
> > >> databases
> > >>>>> might
> > >>>>>>>>> not
> > >>>>>>>>>>>>> support all
> > >>>>>>>>>>>>>>>>>>>>>>>>>> Flink filters (or not support filters at
> > >> all).
> > >>> I
> > >>>>>>> think
> > >>>>>>>>>>>> users
> > >>>>>>>>>>>>> are
> > >>>>>>>>>>>>>>>>>>>>>>>>>> interested in supporting cache filters
> > >>>> optimization
> > >>>>>>>>>>>>> independently of
> > >>>>>>>>>>>>>>>>>>>>>>>>>> supporting other features and solving more complex
> > >>>>>>>>>>>>>>>>>>>>>>>>>> problems (or ones that are unsolvable at all).
> > >>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>> 3) I agree with your third statement.
> > >> Actually
> > >>> in
> > >>>>> our
> > >>>>>>>>>>>>> internal version
> > >>>>>>>>>>>>>>>>>>>>>>>>>> I also tried to unify the logic of scanning
> > >> and
> > >>>>>>>>> reloading
> > >>>>>>>>>>>>> data from
> > >>>>>>>>>>>>>>>>>>>>>>>>>> connectors. But unfortunately, I didn't find
> > >> a
> > >>>> way
> > >>>>> to
> > >>>>>>>>>>> unify
> > >>>>>>>>>>>>> the logic
> > >>>>>>>>>>>>>>>>>>>>>>>>>> of all ScanRuntimeProviders (InputFormat,
> > >>>>>>>>> SourceFunction,
> > >>>>>>>>>>>>> Source,...)
> > >>>>>>>>>>>>>>>>>>>>>>>>>> and reuse it in reloading ALL cache. As a
> > >>> result
> > >>>> I
> > >>>>>>>>>>> settled
> > >>>>>>>>>>>>> on using
> > >>>>>>>>>>>>>>>>>>>>>>>>>> InputFormat, because it was used for scanning
> > >>> in
> > >>>>> all
> > >>>>>>>>>>> lookup
> > >>>>>>>>>>>>>>>>>>>>>>>>>> connectors. (I didn't know that there are
> > >> plans
> > >>>> to
> > >>>>>>>>>>>> deprecate
> > >>>>>>>>>>>>>>>>>>>>>>>>>> InputFormat in favor of FLIP-27 Source). IMO
> > >>>> usage
> > >>>>> of
> > >>>>>>>>>>>>> FLIP-27 source
> > >>>>>>>>>>>>>>>>>>>>>>>>>> in ALL caching is not good idea, because this
> > >>>>> source
> > >>>>>>> was
> > >>>>>>>>>>>>> designed to
> > >>>>>>>>>>>>>>>>>>>>>>>>>> work in distributed environment
> > >>> (SplitEnumerator
> > >>>> on
> > >>>>>>>>>>>>> JobManager and
> > >>>>>>>>>>>>>>>>>>>>>>>>>> SourceReaders on TaskManagers), not in one
> > >>>> operator
> > >>>>>>>>>>> (lookup
> > >>>>>>>>>>>>> join
> > >>>>>>>>>>>>>>>>>>>>>>>>>> operator in our case). There is even no
> > >> direct
> > >>>> way
> > >>>>> to
> > >>>>>>>>>>> pass
> > >>>>>>>>>>>>> splits from
> > >>>>>>>>>>>>>>>>>>>>>>>>>> SplitEnumerator to SourceReader (this logic
> > >>> works
> > >>>>>>>>> through
> > >>>>>>>>>>>>>>>>>>>>>>>>>> SplitEnumeratorContext, which requires
> > >>>>>>>>>>>>>>>>>>>>>>>>>> OperatorCoordinator.SubtaskGateway to send
> > >>>>>>>>>>> AddSplitEvents).
> > >>>>>>>>>>>>> Usage of
> > >>>>>>>>>>>>>>>>>>>>>>>>>> InputFormat for ALL cache seems much more
> > >>> clearer
> > >>>>> and
> > >>>>>>>>>>>>> easier. But if
> > >>>>>>>>>>>>>>>>>>>>>>>>>> there are plans to refactor all connectors to
> > >>>>>>> FLIP-27, I
> > >>>>>>>>>>>>> have the
> > >>>>>>>>>>>>>>>>>>>>>>>>>> following ideas: maybe we can drop the lookup join
> > >>>>>>>>>>>>>>>>>>>>>>>>>> ALL cache in favor of a simple join with repeated
> > >>>>>>>>>>>>>>>>>>>>>>>>>> scanning of the batch source?
> > >>>>>>>>>>>>> The point
> > >>>>>>>>>>>>>>>>>>>>>>>>>> is that the only difference between lookup
> > >> join
> > >>>> ALL
> > >>>>>>>>> cache
> > >>>>>>>>>>>>> and simple
> > >>>>>>>>>>>>>>>>>>>>>>>>>> join with batch source is that in the first
> > >>> case
> > >>>>>>>>> scanning
> > >>>>>>>>>>>> is
> > >>>>>>>>>>>>> performed
> > >>>>>>>>>>>>>>>>>>>>>>>>>> multiple times, in between which state
> > >> (cache)
> > >>> is
> > >>>>>>>>> cleared
> > >>>>>>>>>>>>> (correct me
if I'm wrong). So what if we extend the functionality of simple join to
support state reloading + extend the functionality of scanning a batch
source multiple times (this one should be easy with the new FLIP-27
source, which unifies streaming/batch reading - we would only need to
change the SplitEnumerator, which would pass splits again after some
TTL)? WDYT? I must say that this looks like a long-term goal and would
make the scope of this FLIP even larger than you said. Maybe we can
limit ourselves to a simpler solution now (InputFormats).

So to sum up, my points are as follows:
1) There is a way to make both concise and flexible interfaces for
caching in lookup join.
2) The cache filters optimization is important in both LRU and ALL
caches.
3) It is unclear when filter pushdown will be supported in Flink
connectors; some of the connectors might not have the opportunity to
support filter pushdown + as far as I know, filter pushdown currently
works only for scanning (not lookup). So the cache filters + projections
optimization should be independent of other features.
4) The ALL cache realization is a complex topic that involves multiple
aspects of how Flink is developing. Dropping InputFormat in favor of the
FLIP-27 Source would make the ALL cache realization really complex and
unclear, so maybe instead of that we can extend the functionality of
simple join, or keep InputFormat in the case of the lookup join ALL
cache?

Best regards,
Smirnov Alexander

[1]
https://guava.dev/releases/18.0/api/docs/com/google/common/cache/CacheBuilder.html#weigher(com.google.common.cache.Weigher)
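Reference [1] points at Guava's weight-based eviction. As a hedged
illustration of the weigher idea (stdlib-only, hypothetical names, not
any proposed FLIP-221 API), an LRU cache can evict by total weight
instead of entry count:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.ToIntFunction;

// Stdlib-only sketch of weight-based LRU eviction, analogous to Guava's
// CacheBuilder.weigher(...) from [1]. All names here are illustrative
// and not part of any proposed FLIP-221 API.
class WeightedLruCache<K, V> {
    private final long maxWeight;
    private final ToIntFunction<V> weigher; // cost of one value, e.g. row size
    private long currentWeight = 0;
    // accessOrder = true: iteration goes from least- to most-recently used
    private final LinkedHashMap<K, V> map = new LinkedHashMap<>(16, 0.75f, true);

    WeightedLruCache(long maxWeight, ToIntFunction<V> weigher) {
        this.maxWeight = maxWeight;
        this.weigher = weigher;
    }

    V get(K key) {
        return map.get(key); // get() also refreshes the LRU order
    }

    void put(K key, V value) {
        V old = map.remove(key);
        if (old != null) {
            currentWeight -= weigher.applyAsInt(old);
        }
        map.put(key, value);
        currentWeight += weigher.applyAsInt(value);
        // Evict least-recently-used entries until within the weight budget,
        // always keeping at least the freshly inserted entry.
        while (currentWeight > maxWeight && map.size() > 1) {
            Map.Entry<K, V> eldest = map.entrySet().iterator().next();
            currentWeight -= weigher.applyAsInt(eldest.getValue());
            map.remove(eldest.getKey());
        }
    }
}
```

With a weigher such as String::length, the budget then counts characters
rather than entries, which is the same trade-off a row-size weigher would
give a lookup cache.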
On Thu, 5 May 2022 at 20:34, Jark Wu <imjark@gmail.com> wrote:

It's great to see the active discussion! I want to share my ideas:

1) implement the cache in the framework vs. the connector base
I don't have a strong opinion on this. Both ways should work (e.g.,
cache pruning, compatibility). The framework way can provide more
concise interfaces. The connector base way can define more flexible
cache strategies/implementations. We are still investigating whether we
can have both advantages. We should reach a consensus that the chosen
way should be a final state, and that we are on the path to it.

2) filters and projections pushdown
I agree with Alex that the filter pushdown into the cache can benefit
the ALL cache a lot. However, this is not true for the LRU cache.
Connectors use the cache to reduce IO requests to databases for better
throughput. If a filter can prune 90% of the data in the cache, we will
have 90% of lookup requests that can never be cached and hit the
databases directly. That means the cache is meaningless in this case.

IMO, Flink SQL has provided a standard way to do filter and projection
pushdown, i.e., SupportsFilterPushDown and SupportsProjectionPushDown.
That JDBC/Hive/HBase haven't implemented the interfaces doesn't mean
they are hard to implement. They should implement the pushdown
interfaces to reduce IO and the cache size. The final state should be
that the scan source and the lookup source share the exact same pushdown
implementation. I don't see why we need to duplicate the pushdown logic
in caches, which would complicate the lookup join design.

3) ALL cache abstraction
The ALL cache might be the most challenging part of this FLIP. We have
never provided a reload-lookup public interface. Currently, we put the
reload logic in the "eval" method of TableFunction. That's hard for some
sources (e.g., Hive). Ideally, a connector implementation should share
the logic of reload and scan, i.e. ScanTableSource with
InputFormat/SourceFunction/FLIP-27 Source. However,
InputFormat/SourceFunction are deprecated, and the FLIP-27 source is
deeply coupled with the SourceOperator. If we want to invoke the FLIP-27
source in LookupJoin, this may make the scope of this FLIP much larger.
We are still investigating how to abstract the ALL cache logic and reuse
the existing source interfaces.

Best,
Jark
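Jark's LRU-cache concern can be made concrete with a toy model
(hypothetical names, not real connector code): if the filter is applied
when populating the cache, a key whose row fails the filter can never be
cached, so every lookup for it goes to the database.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;
import java.util.function.Predicate;

// Toy model of a lookup cache that only stores rows passing a filter.
// It illustrates the point above: lookups for filtered-out keys can
// never be served from the cache and always hit the backing database.
// All names are illustrative, not real connector code.
class FilteredLookup<K, V> {
    private final Map<K, V> cache = new HashMap<>();
    private final Function<K, V> database;  // the "expensive" lookup
    private final Predicate<V> cacheFilter; // filter applied before caching
    int databaseHits = 0;                   // how often the database was queried

    FilteredLookup(Function<K, V> database, Predicate<V> cacheFilter) {
        this.database = database;
        this.cacheFilter = cacheFilter;
    }

    V lookup(K key) {
        V cached = cache.get(key);
        if (cached != null) {
            return cached;
        }
        databaseHits++;                     // cache miss -> query the database
        V row = database.apply(key);
        if (row != null && cacheFilter.test(row)) {
            cache.put(key, row);            // only matching rows are cached
        }
        return row;
    }
}
```

If most lookup keys fail the filter (Jark's 90% case), the hit counter
grows linearly with traffic and the cache buys almost nothing.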
On Thu, 5 May 2022 at 20:22, Roman Boyko <ro.v.boyko@gmail.com> wrote:

It's a much more complicated activity and lies outside the scope of this
improvement, because such pushdowns should be done for all
ScanTableSource implementations (not only for lookup ones).

On Thu, 5 May 2022 at 19:02, Martijn Visser <martijnvisser@apache.org>
wrote:
Hi everyone,

One question regarding "And Alexander correctly mentioned that filter
pushdown still is not implemented for jdbc/hive/hbase." -> Would an
alternative solution be to actually implement these filter pushdowns? I
can imagine that there are many more benefits to doing that, outside of
lookup caching and metrics.

Best regards,

Martijn Visser
https://twitter.com/MartijnVisser82
https://github.com/MartijnVisser

On Thu, 5 May 2022 at 13:58, Roman Boyko <ro.v.boyko@gmail.com> wrote:
Hi everyone!

Thanks for driving such a valuable improvement!

I do think that a single cache implementation would be a nice
opportunity for users. And it will break the "FOR SYSTEM_TIME AS OF
proc_time" semantics anyway - no matter how it is implemented.

Putting myself in the user's shoes, I can say that:
1) I would prefer to have the opportunity to cut down the cache size by
simply filtering out unnecessary data. And the handiest way to do that
is to apply it inside the LookupRunners. It would be a bit harder to
pass it through the LookupJoin node to the TableFunction. And Alexander
correctly mentioned that filter pushdown still is not implemented for
jdbc/hive/hbase.
2) The ability to set different caching parameters for different tables
is quite important. So I would prefer to set them through DDL rather
than have the same TTL, strategy and other options for all lookup
tables.
3) Putting the cache into the framework really deprives us of
extensibility (users won't be able to implement their own cache). But
most probably this can be solved by creating more cache strategies and a
wider set of configurations.

All these points are much closer to the schema proposed by Alexander.
Qingsheng Ren, please correct me if I'm wrong - can all these facilities
be simply implemented in your architecture?

Best regards,
Roman Boyko
e.: ro.v.boyko@gmail.com

On Wed, 4 May 2022 at 21:01, Martijn Visser <martijnvisser@apache.org>
wrote:
Hi everyone,

I don't have much to chip in, but I just wanted to express that I really
appreciate the in-depth discussion on this topic, and I hope that others
will join the conversation.

Best regards,

Martijn

On Tue, 3 May 2022 at 10:15, Alexander Smirnov <smiralexan@gmail.com>
wrote:
Hi Qingsheng, Leonard and Jark,

Thanks for your detailed feedback! However, I have questions about some
of your statements (maybe I didn't get something?).

> Caching actually breaks the semantic of "FOR SYSTEM_TIME AS OF
> proc_time"

I agree that the semantics of "FOR SYSTEM_TIME AS OF proc_time" is not
fully preserved with caching, but as you said, users accept that
consciously to achieve better performance (no one proposed to enable
caching by default, etc.). Or by users do you mean other developers of
connectors? In that case developers explicitly specify whether their
connector supports caching or not (in the list of supported options); no
one makes them do that if they don't want to. So what exactly is the
difference between implementing caching in the flink-table-runtime
module and in the flink-table-common module from this point of view? How
does it affect breaking or not breaking the semantics of "FOR
SYSTEM_TIME AS OF proc_time"?
> confront a situation that allows table options in DDL to control the
> behavior of the framework, which has never happened previously and
> should be cautious

If we talk about the main semantic differences between DDL options and
config options ("table.exec.xxx"), isn't it about limiting the scope of
the options plus their importance for the user's business logic, rather
than the specific location of the corresponding logic in the framework?
I mean that in my design, for example, putting an option with the lookup
cache strategy into the configurations would be the wrong decision,
because it directly affects the user's business logic (not just
performance optimization) and touches just several functions of ONE
table (there can be multiple tables with different caches). Does it
really matter to the user (or anyone else) where the logic affected by
the applied option is located? Also I can recall the DDL option
'sink.parallelism', which in some way "controls the behavior of the
framework", and I don't see any problem there.
> introduce a new interface for this all-caching scenario and the design
> would become more complex

This is a subject for a separate discussion, but actually in our
internal version we solved this problem quite easily - we reused the
InputFormat class (so there is no need for a new API). The point is that
currently all lookup connectors use InputFormat for scanning the data in
batch mode: HBase, JDBC and even Hive - it uses the PartitionReader
class, which is actually just a wrapper around InputFormat. The
advantage of this solution is the ability to reload cache data in
parallel (the number of threads depends on the number of InputSplits,
but has an upper limit). As a result the cache reload time reduces
significantly (as well as the time the input stream is blocked). I know
that we usually try to avoid concurrency in Flink code, but maybe this
one can be an exception. BTW I don't say that it's an ideal solution;
maybe there are better ones.
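The parallel reload described above (one task per InputSplit, with an
upper bound on threads) can be sketched with a plain ExecutorService;
the split/reader names below are illustrative, not the internal
implementation:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Sketch of reloading an ALL cache in parallel, one task per "split",
// as described above. The split/reader types and names are illustrative
// only and do not reflect the internal implementation.
class ParallelCacheReloader {
    // Reads one split and returns its rows keyed by the lookup key.
    interface SplitReader<K, V> {
        Map<K, V> readSplit(int splitId);
    }

    static <K, V> Map<K, V> reload(int numSplits, int maxThreads,
                                   SplitReader<K, V> reader) {
        Map<K, V> newCache = new ConcurrentHashMap<>();
        ExecutorService pool =
                Executors.newFixedThreadPool(Math.min(numSplits, maxThreads));
        for (int i = 0; i < numSplits; i++) {
            final int splitId = i;
            pool.submit(() -> newCache.putAll(reader.readSplit(splitId)));
        }
        pool.shutdown();
        try {
            // Wait until all splits are read; in a real connector the
            // fully built map would then atomically replace the old cache.
            pool.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return newCache;
    }
}
```

The reload wall-clock time shrinks roughly with min(numSplits,
maxThreads), which is the benefit claimed above for InputSplit-based
reloading.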
> Providing the cache in the framework might introduce compatibility
> issues

That's possible only in cases where the developer of the connector
doesn't properly refactor his code and uses the new cache options
incorrectly (i.e. explicitly provides the same options in 2 different
code places). For correct behavior all he needs to do is redirect the
existing options to the framework's LookupConfig (+ maybe add an alias
for options, if there was different naming); everything will be
transparent for users. If the developer doesn't do the refactoring at
all, nothing will change for the connector because of backward
compatibility. Also, if a developer wants to use his own cache logic, he
can simply refuse to pass some of the configs into the framework, and
instead make his own implementation with the already existing configs
and metrics (but actually I think that's a rare case).
> filters and projections should be pushed all the way down to the table
> function, like what we do in the scan source

That's a great goal. But the truth is that the ONLY connector that
supports filter pushdown is FileSystemTableSource (no database connector
supports it currently). Also, for some databases it's simply impossible
to push down such complex filters as we have in Flink.
> only applying these optimizations to the cache seems not quite useful

Filters can cut off an arbitrarily large amount of data from the
dimension table. For a simple example, suppose in the dimension table
'users' we have a column 'age' with values from 20 to 40, and an input
stream 'clicks' that is roughly uniformly distributed by the age of
users. If we have the filter 'age > 30', there will be half as much data
in the cache. This means the user can increase 'lookup.cache.max-rows'
by almost 2 times, which will give a huge performance boost. Moreover,
this optimization starts to really shine with the 'ALL' cache, where
tables without filters and projections can't fit in memory, but with
them - can. This opens up additional possibilities for users. And that
doesn't sound like 'not quite useful'.
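The arithmetic in the 'users' example above can be sanity-checked with a
tiny estimate (illustrative numbers only, assuming uniformly distributed
ages):

```java
import java.util.function.IntPredicate;
import java.util.stream.IntStream;

// Back-of-envelope check of the example above: a dimension table whose
// 'age' column is uniformly spread over 20..40, and the cache filter
// 'age > 30'. Numbers and names are illustrative only.
class CacheFilterEstimate {
    static long cacheableRows(int totalRows, IntPredicate ageFilter) {
        return IntStream.range(0, totalRows)
                .map(i -> 20 + i % 21) // cycle through ages 20..40 uniformly
                .filter(ageFilter)
                .count();
    }
}
```

Of 21,000 uniformly distributed rows, only the 10 ages 31..40 pass the
filter, so roughly half the rows ever need to be cached - the factor the
argument above relies on.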
It would be great to hear other voices regarding this topic! We have
quite a lot of controversial points, and I think that with the help of
others it will be easier for us to come to a consensus.

Best regards,
Smirnov Alexander

On Fri, 29 Apr 2022 at 22:33, Qingsheng Ren <renqschn@gmail.com> wrote:
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi Alexander and Arvid,
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for the discussion and sorry
> > >> for
> > >>> my
> > >>>>>>> late
> > >>>>>>>>>>>>> response!
> > >>>>>>>>>>>>>>>>>>>>>>>> We
> > >>>>>>>>>>>>>>>>>>>>>>>>>> had
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> an
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> internal discussion together with Jark
> > >>> and
> > >>>>>>> Leonard
> > >>>>>>>>>>>> and
> > >>>>>>>>>>>>> I’d
> > >>>>>>>>>>>>>>>>>>>>>>>> like
> > >>>>>>>>>>>>>>>>>>>>>>>>>> to
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> summarize our ideas. Instead of
> > >>>> implementing
> > >>>>>>> the
> > >>>>>>>>>>>> cache
> > >>>>>>>>>>>>>>>>>>>>>>>> logic in
> > >>>>>>>>>>>>>>>>>>>>>>>>>> the
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> table
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> runtime layer or wrapping around the
> > >>>>>>> user-provided
> > >>>>>>>>>>>>> table
> > >>>>>>>>>>>>>>>>>>>>>>>>>> function,
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> we
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> prefer to introduce some new APIs
> > >>> extending
> > >>>>>>>>>>>>> TableFunction
> > >>>>>>>>>>>>>>>>>>>>>>>> with
> > >>>>>>>>>>>>>>>>>>>>>>>>>> these
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> concerns:
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 1. Caching actually breaks the
> > >> semantic
> > >>> of
> > >>>>>>> "FOR
> > >>>>>>>>>>>>>>>>>>>>>>>> SYSTEM_TIME
> > >>>>>>>>>>>>>>>>>>>>>>>>>> AS OF
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> proc_time”, because it couldn’t truly
> > >>>> reflect
> > >>>>>>> the
> > >>>>>>>>>>>>> content
> > >>>>>>>>>>>>>>>>>>>>>>>> of the
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> lookup
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> table at the moment of querying. If
> > >> users
> > >>>>>>> choose
> > >>>>>>>>> to
> > >>>>>>>>>>>>> enable
> > >>>>>>>>>>>>>>>>>>>>>>>>>> caching
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> on
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> lookup table, they implicitly indicate
> > >>> that
> > >>>>>>> this
> > >>>>>>>>>>>>> breakage is
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> acceptable
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> in
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> exchange for the performance. So we
> > >>> prefer
> > >>>>> not
> > >>>>>>> to
> > >>>>>>>>>>>>> provide
> > >>>>>>>>>>>>>>>>>>>>>>>>>> caching on
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> table runtime level.
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
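To make the semantics under discussion concrete: a processing-time temporal (lookup) join has the following shape (a sketch with illustrative table and column names, not taken from the FLIP):

```sql
-- Each Orders row is joined against the version of the dimension table
-- visible at its processing time. With a cache in between, the join may
-- instead see a slightly stale version of Customers; that staleness is
-- the trade-off users opt into when they enable caching.
SELECT o.order_id, c.customer_name
FROM Orders AS o
  JOIN Customers FOR SYSTEM_TIME AS OF o.proc_time AS c
    ON o.customer_id = c.id;
```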
> 2. If we make the cache implementation in the framework (whether in a
> runner or in a wrapper around TableFunction), we have to confront a
> situation that allows table options in DDL to control the behavior of
> the framework, which has never happened previously and should be treated
> cautiously. Under the current design the behavior of the framework
> should only be specified by configurations ("table.exec.xxx"), and it's
> hard to apply these general configs to a specific table.
>
> 3. We have use cases where the lookup source loads and refreshes all
> records periodically into memory to achieve high lookup performance
> (like the Hive connector in the community; this is also widely used by
> our internal connectors). Wrapping the cache around the user's
> TableFunction works fine for LRU caches, but I think we would have to
> introduce a new interface for this all-caching scenario and the design
> would become more complex.
>
> 4. Providing the cache in the framework might introduce compatibility
> issues for existing lookup sources: there might exist two caches with
> totally different strategies if the user incorrectly configures the
> table (one in the framework and another implemented by the lookup
> source).
>
> As for the optimization mentioned by Alexander, I think filters and
> projections should be pushed all the way down to the table function,
> like what we do in the scan source, instead of to the runner with the
> cache. The goal of using a cache is to reduce the network I/O and the
> pressure on the external system, and only applying these optimizations
> to the cache seems not quite useful.
>
> I made some updates to the FLIP [1] to reflect our ideas. We prefer to
> keep the cache implementation as a part of TableFunction, and we could
> provide some helper classes (CachingTableFunction,
> AllCachingTableFunction, CachingAsyncTableFunction) to developers and
> regulate the metrics of the cache. Also, I made a POC [2] for your
> reference.
>
> Looking forward to your ideas!
>
> [1]
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-221+Abstraction+for+lookup+source+cache+and+metric
> [2] https://github.com/PatrickRen/flink/tree/FLIP-221
>
> Best regards,
>
> Qingsheng
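For readers following along, the caching-wrapper idea being debated can be sketched without any Flink types. The following is a rough, framework-free illustration of an LRU-caching lookup wrapper of the kind such helper classes would encapsulate; all names here are hypothetical and are not the FLIP's actual interfaces:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/**
 * Hypothetical sketch (not the FLIP's real API): delegate to the actual
 * lookup on a cache miss and keep results in a bounded LRU map.
 */
class CachingLookup<K, V> {
    private final Function<K, List<V>> delegate; // the real lookup, e.g. a DB query
    private final Map<K, List<V>> cache;

    CachingLookup(Function<K, List<V>> delegate, int maxRows) {
        this.delegate = delegate;
        // Access-order LinkedHashMap that evicts the least-recently-used
        // entry once the configured capacity is exceeded.
        this.cache = new LinkedHashMap<>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, List<V>> eldest) {
                return size() > maxRows;
            }
        };
    }

    List<V> lookup(K key) {
        List<V> rows = cache.get(key);
        if (rows == null) {            // cache miss: ask the real source
            rows = delegate.apply(key);
            cache.put(key, rows);      // put() triggers LRU eviction if needed
        }
        return rows;
    }

    int cachedKeys() {
        return cache.size();
    }
}
```

The point of contention in the thread is only *where* such a wrapper lives (framework vs. connector), not how it works internally.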
> On Tue, Apr 26, 2022 at 4:45 PM Александр Смирнов
> <smiralexan@gmail.com> wrote:
>
> > Thanks for the response, Arvid!
> >
> > I have a few comments on your message.
> >
> > > but could also live with an easier solution as the first step:
> >
> > I think that these 2 ways are mutually exclusive (the one originally
> > proposed by Qingsheng and mine), because conceptually they follow the
> > same goal, but the implementation details are different. If we go one
> > way, moving to the other way in the future will mean deleting existing
> > code and once again changing the API for connectors. So I think we
> > should reach a consensus with the community about that and then work
> > together on this FLIP, i.e. divide the work into tasks for different
> > parts of the FLIP (for example, LRU cache unification / introducing
> > the proposed set of metrics / further work…). WDYT, Qingsheng?
> >
> > > as the source will only receive the requests after filter
> >
> > Actually, if filters are applied to fields of the lookup table, we
> > first must do the requests, and only after that can we filter the
> > responses, because lookup connectors don't have filter pushdown. So if
> > filtering is done before caching, there will be far fewer rows in the
> > cache.
> >
> > > @Alexander unfortunately, your architecture is not shared. I don't
> > > know the solution to share images to be honest.
> >
> > Sorry for that, I'm a bit new to such kinds of conversations :)
> > I have no write access to the Confluence, so I made a Jira issue,
> > where I described the proposed changes in more detail -
> > https://issues.apache.org/jira/browse/FLINK-27411.
> >
> > Will be happy to get more feedback!
> >
> > Best,
> > Smirnov Alexander
> > On Mon, Apr 25, 2022 at 19:49, Arvid Heise <arvid@apache.org> wrote:
> >
> > > Hi Qingsheng,
> > >
> > > Thanks for driving this; the inconsistency was not satisfying for me.
> > >
> > > I second Alexander's idea, though I could also live with an easier
> > > solution as the first step: instead of making caching an
> > > implementation detail of TableFunction X, rather devise a caching
> > > layer around X. So the proposal would be a CachingTableFunction that
> > > delegates to X in case of misses and otherwise manages the cache.
> > > Lifting it into the operator model as proposed would be even better,
> > > but is probably unnecessary in the first step for a lookup source
> > > (as the source will only receive the requests after the filter;
> > > applying the projection may be more interesting to save memory).
> > >
> > > Another advantage is that all the changes of this FLIP would be
> > > limited to options, with no need for new public interfaces.
> > > Everything else remains an implementation detail of the table
> > > runtime. That means we can easily incorporate the optimization
> > > potential that Alexander pointed out later.
> > >
> > > @Alexander unfortunately, your architecture is not shared. I don't
> > > know the solution to share images to be honest.
> > >
> > > On Fri, Apr 22, 2022 at 5:04 PM Александр Смирнов
> > > <smiralexan@gmail.com> wrote:
> > > > Hi Qingsheng! My name is Alexander, I'm not a committer yet, but
> > > > I'd really like to become one. And this FLIP really interested me.
> > > > Actually, I have worked on a similar feature in my company's Flink
> > > > fork, and we would like to share our thoughts on this and make the
> > > > code open source.
> > > >
> > > > I think there is a better alternative than introducing an abstract
> > > > class for TableFunction (CachingTableFunction). As you know,
> > > > TableFunction lives in the flink-table-common module, which
> > > > provides only an API for working with tables – it's very
> > > > convenient for importing in connectors. In turn,
> > > > CachingTableFunction contains logic for runtime execution, so this
> > > > class and everything connected with it should be located in
> > > > another module, probably in flink-table-runtime. But this would
> > > > require connectors to depend on another module that contains a lot
> > > > of runtime logic, which doesn't sound good.
> > > >
> > > > I suggest adding a new method 'getLookupConfig' to
> > > > LookupTableSource or LookupRuntimeProvider to allow connectors to
> > > > only pass configurations to the planner, so that they won't depend
> > > > on the runtime implementation. Based on these configs the planner
> > > > will construct a lookup join operator with the corresponding
> > > > runtime logic (ProcessFunctions in the flink-table-runtime
> > > > module). The architecture looks like in the pinned image (the
> > > > LookupConfig class there is actually your CacheConfig).
> > > >
> > > > The classes in flink-table-planner that will be responsible for
> > > > this are CommonPhysicalLookupJoin and its inheritors. The current
> > > > classes for lookup join in flink-table-runtime are
> > > > LookupJoinRunner, AsyncLookupJoinRunner, LookupJoinRunnerWithCalc
> > > > and AsyncLookupJoinRunnerWithCalc.
> > > >
> > > > I suggest adding classes LookupJoinCachingRunner,
> > > > LookupJoinCachingRunnerWithCalc, etc.
> > > >
> > > > And here comes another, more powerful advantage of such a
> > > > solution. If we have the caching logic on a lower level, we can
> > > > apply some optimizations to it. LookupJoinRunnerWithCalc was named
> > > > like this because it uses the 'calc' function, which actually
> > > > mostly consists of filters and projections.
> > > >
> > > > For example, when joining table A with lookup table B on the
> > > > condition 'JOIN … ON A.id = B.id AND A.age = B.age + 10 WHERE
> > > > B.salary > 1000', the 'calc' function will contain the filters
> > > > A.age = B.age + 10 and B.salary > 1000.
> > > >
> > > > If we apply this function before storing records in the cache, the
> > > > size of the cache will be significantly reduced: filters avoid
> > > > storing useless records in the cache, and projections reduce the
> > > > records' size. So the initial max number of records in the cache
> > > > can be increased by the user.
> > > >
> > > > What do you think about it?
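Alexander's "apply the calc before caching" optimization can be sketched without any Flink types. In this hypothetical illustration (names are invented; the real runners live in flink-table-runtime), rows failing the pushed-down filter are never stored, and a projection trims what is stored:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.function.Predicate;

/**
 * Hypothetical sketch of "calc before cache": rows failing the join's
 * filter are never cached, and a projection trims the stored columns.
 */
class FilteringCache<K, R> {
    private final Predicate<R> filter;        // e.g. B.salary > 1000
    private final Function<R, R> projection;  // keep only the needed fields
    private final Map<K, List<R>> cache = new HashMap<>();

    FilteringCache(Predicate<R> filter, Function<R, R> projection) {
        this.filter = filter;
        this.projection = projection;
    }

    List<R> getOrLoad(K key, Function<K, List<R>> lookup) {
        return cache.computeIfAbsent(key, k -> {
            List<R> kept = new ArrayList<>();
            for (R row : lookup.apply(k)) {
                if (filter.test(row)) {          // drop useless rows up front
                    kept.add(projection.apply(row));
                }
            }
            return kept;                         // only qualifying rows are cached
        });
    }
}
```

With the filter applied before insertion, the same configured capacity holds only rows that can actually join, which is exactly why Alexander argues the user-facing max cache size can effectively be raised.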
> > > > On 2022/04/19 02:47:11 Qingsheng Ren wrote:
> > > > > Hi devs,
> > > > >
> > > > > Yuan and I would like to start a discussion about FLIP-221 [1],
> > > > > which introduces an abstraction of lookup table cache and its
> > > > > standard metrics.
> > > > >
> > > > > Currently each lookup table source has to implement its own
> > > > > cache to store lookup results, and there isn't a standard set of
> > > > > metrics for users and developers to tune their jobs with lookup
> > > > > joins, which is a quite common use case in Flink Table / SQL.
> > > > >
> > > > > Therefore we propose some new APIs including a cache, metrics,
> > > > > wrapper classes of TableFunction and new table options. Please
> > > > > take a look at the FLIP page [1] to get more details. Any
> > > > > suggestions and comments would be appreciated!
> > > > >
> > > > > [1]
> > > > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-221+Abstraction+for+lookup+source+cache+and+metric
> > > > >
> > > > > Best regards,
> > > > >
> > > > > Qingsheng
>
> --
> Best Regards,
>
> Qingsheng Ren
>
> Real-time Computing Team
> Alibaba Cloud
>
> Email: renqschn@gmail.com
>
> --
> Best regards,
> Roman Boyko
> e.: ro.v.boyko@gmail.com

Re: [DISCUSS] FLIP-221 Abstraction for lookup source cache and metric

Posted by Jark Wu <im...@gmail.com>.
Thanks Qingsheng for the detailed summary and updates.

The changes look good to me in general. I just have one minor improvement
comment: could we add static util methods to the "FullCachingReloadTrigger"
interface for quick usage?

#periodicReloadAtFixedRate(Duration)
#periodicReloadWithFixedDelay(Duration)

I think we can also do this for LookupCache, because users may not know
where the default implementations are and how to use them.
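For illustration, such static factories might look like the following sketch. The interface body here is a minimal stand-in, not the FLIP's actual FullCachingReloadTrigger definition; the point is only that the factory methods sit on the interface itself, so users can build common reload schedules without locating the default implementations:

```java
import java.time.Duration;

/**
 * Hypothetical stand-in for the trigger interface, reduced to two
 * accessors, just to show the static-factory pattern being suggested.
 */
interface FullCachingReloadTrigger {
    Duration period();
    boolean fixedRate();

    static FullCachingReloadTrigger periodicReloadAtFixedRate(Duration period) {
        return new PeriodicTrigger(period, true);
    }

    static FullCachingReloadTrigger periodicReloadWithFixedDelay(Duration delay) {
        return new PeriodicTrigger(delay, false);
    }
}

/** Trivial value holder returned by the factory methods above. */
record PeriodicTrigger(Duration period, boolean fixedRate)
        implements FullCachingReloadTrigger {}
```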

Best,
Jark






On Wed, 1 Jun 2022 at 18:32, Qingsheng Ren <re...@gmail.com> wrote:

> Hi Jingsong,
>
> Thanks for your comments!
>
> > > The AllCache definition is not flexible: for example, PartialCache can
> > > use any custom storage while the AllCache cannot. The AllCache could
> > > also store to memory or disk, which also needs a flexible strategy.
>
> We had an offline discussion with Jark and Leonard. Basically we think
> exposing the interface of full cache storage to connector developers might
> limit our future optimizations. The storage of full caching shouldn’t have
> too many variations for different lookup tables, so making it pluggable
> might not help a lot. Also, I think it is not quite easy for connector
> developers to implement such an optimized storage. We can keep optimizing
> this storage in the future, and all full-caching lookup tables would
> benefit from this.
>
> > We are more inclined to deprecate the connector `async` option when
> > discussing FLIP-234. Can we remove this option from this FLIP?
>
> Thanks for the reminder! This option has been removed in the latest
> version.
>
> Best regards,
>
> Qingsheng
>
>
> > On Jun 1, 2022, at 15:28, Jingsong Li <ji...@gmail.com> wrote:
> >
> > Thanks Alexander for your reply. We can discuss the new interface when it
> > comes out.
> >
> > We are more inclined to deprecate the connector `async` option when
> > discussing FLIP-234 [1]. We should use a hint to let the planner decide.
> > Although the discussion has not yet produced a conclusion, can we remove
> > this option from this FLIP? It doesn't seem to be related to this FLIP,
> > but more to FLIP-234, and we can form a conclusion over there.
> >
> > [1] https://lists.apache.org/thread/9k1sl2519kh2n3yttwqc00p07xdfns3h
> >
> > Best,
> > Jingsong
> >
> > On Wed, Jun 1, 2022 at 4:59 AM Jing Ge <ji...@ververica.com> wrote:
> >
> >> Hi Jark,
> >>
> >> Thanks for clarifying it. It would be fine as long as we could provide
> >> the no-cache solution. I was just wondering whether the client-side
> >> cache could really help when HBase is used, since the data to look up
> >> should be huge. Depending on how much data is cached on the client
> >> side, the data that should be LRU in e.g. LruBlockCache will not be
> >> LRU anymore. In the worst-case scenario, once the cached data on the
> >> client side expires, the request will hit disk, which will cause extra
> >> latency temporarily, if I am not mistaken.
> >>
> >> Best regards,
> >> Jing
> >>
> >> On Mon, May 30, 2022 at 9:59 AM Jark Wu <im...@gmail.com> wrote:
> >>
> >>> Hi Jing Ge,
> >>>
> >>> What do you mean by the "impact on the block cache used by HBase"?
> >>> In my understanding, the connector cache and the HBase cache are two
> >>> totally different things: the connector cache is a local/client-side
> >>> cache, while the HBase cache is a server-side cache.
> >>>
> >>>> does it make sense to have a no-cache solution as one of the
> >>> default solutions so that customers will have no effort for the
> migration
> >>> if they want to stick with Hbase cache
> >>>
> >>> The implementation migration should be transparent to users. Take the
> >> HBase
> >>> connector as
> >>> an example,  it already supports lookup cache but is disabled by
> default.
> >>> After migration, the
> >>> connector still disables cache by default (i.e. no-cache solution). No
> >>> migration effort for users.
> >>>
> >>> The HBase cache and the connector cache are two different things, and
> >>> the HBase cache can't simply replace the connector cache, because one
> >>> of the most important purposes of the connector cache is reducing I/O
> >>> requests/responses and improving throughput, which cannot be achieved
> >>> by just using a server cache.
> >>>
> >>> Best,
> >>> Jark
> >>>
> >>>
> >>>
> >>>
> >>> On Fri, 27 May 2022 at 22:42, Jing Ge <ji...@ververica.com> wrote:
> >>>
> >>>> Thanks all for the valuable discussion. The new feature looks very
> >>>> interesting.
> >>>>
> >>>> According to the FLIP description: "*Currently we have JDBC, Hive and
> >>> HBase
> >>>> connector implemented lookup table source. All existing
> implementations
> >>>> will be migrated to the current design and the migration will be
> >>>> transparent to end users*." I was only wondering if we should pay
> >>> attention
> >>>> to HBase and similar DBs. Since the lookup data is commonly huge when
> >>>> using HBase, partial caching will be used in this case, if I am not
> >>>> mistaken, which might have an impact on the block cache used by HBase,
> >>>> e.g. LruBlockCache.
> >>>> Another question is that, since HBase provides a sophisticated cache
> >>>> solution, does it make sense to have a no-cache solution as one of the
> >>>> default solutions, so that customers will have no migration effort
> >>>> if they want to stick with the HBase cache?
> >>>>
> >>>> Best regards,
> >>>> Jing
> >>>>
> >>>> On Fri, May 27, 2022 at 11:19 AM Jingsong Li <ji...@gmail.com>
> >>>> wrote:
> >>>>
> >>>>> Hi all,
> >>>>>
> >>>>> I think the current problems are as follows:
> >>>>> 1. The AllCache and PartialCache interfaces are not uniform: one
> >>>>> needs to provide a LookupProvider, the other a CacheBuilder.
> >>>>> 2. The AllCache definition is not flexible: for example, PartialCache
> >>>>> can use any custom storage while AllCache cannot. AllCache could also
> >>>>> store to memory or disk, so it needs a flexible strategy as well.
> >>>>> 3. AllCache cannot customize its ReloadStrategy; currently there is
> >>>>> only ScheduledReloadStrategy.
> >>>>>
> >>>>> In order to solve the above problems, the following are my ideas.
> >>>>>
> >>>>> ## Top level cache interfaces:
> >>>>>
> >>>>> ```
> >>>>>
> >>>>> public interface CacheLookupProvider extends
> >>>>> LookupTableSource.LookupRuntimeProvider {
> >>>>>
> >>>>>    CacheBuilder createCacheBuilder();
> >>>>> }
> >>>>>
> >>>>>
> >>>>> public interface CacheBuilder {
> >>>>>    Cache create();
> >>>>> }
> >>>>>
> >>>>>
> >>>>> public interface Cache {
> >>>>>
> >>>>>    /**
> >>>>>     * Returns the value associated with key in this cache, or null
> >> if
> >>>>> there is no cached value for
> >>>>>     * key.
> >>>>>     */
> >>>>>    @Nullable
> >>>>>    Collection<RowData> getIfPresent(RowData key);
> >>>>>
> >>>>>    /** Returns the number of key-value mappings in the cache. */
> >>>>>    long size();
> >>>>> }
> >>>>>
> >>>>> ```
> >>>>>
> >>>>> ## Partial cache
> >>>>>
> >>>>> ```
> >>>>>
> >>>>> public interface PartialCacheLookupFunction extends
> >>> CacheLookupProvider {
> >>>>>
> >>>>>    @Override
> >>>>>    PartialCacheBuilder createCacheBuilder();
> >>>>>
> >>>>>    /** Creates a {@link LookupFunction} instance. */
> >>>>>    LookupFunction createLookupFunction();
> >>>>> }
> >>>>>
> >>>>>
> >>>>> public interface PartialCacheBuilder extends CacheBuilder {
> >>>>>
> >>>>>    PartialCache create();
> >>>>> }
> >>>>>
> >>>>>
> >>>>> public interface PartialCache extends Cache {
> >>>>>
> >>>>>    /**
> >>>>>     * Associates the specified value rows with the specified key row
> >>>>> in the cache. If the cache
> >>>>>     * previously contained value associated with the key, the old
> >>>>> value is replaced by the
> >>>>>     * specified value.
> >>>>>     *
> >>>>>     * @return the previous value rows associated with key, or null
> >> if
> >>>>> there was no mapping for key.
> >>>>>     * @param key - key row with which the specified value is to be
> >>>>> associated
> >>>>>     * @param value – value rows to be associated with the specified
> >>> key
> >>>>>     */
> >>>>>    Collection<RowData> put(RowData key, Collection<RowData> value);
> >>>>>
> >>>>>    /** Discards any cached value for the specified key. */
> >>>>>    void invalidate(RowData key);
> >>>>> }
> >>>>>
> >>>>> ```
> >>>>>
> >>>>> ## All cache
> >>>>> ```
> >>>>>
> >>>>> public interface AllCacheLookupProvider extends CacheLookupProvider {
> >>>>>
> >>>>>    void registerReloadStrategy(ScheduledExecutorService
> >>>>> executorService, Reloader reloader);
> >>>>>
> >>>>>    ScanTableSource.ScanRuntimeProvider getScanRuntimeProvider();
> >>>>>
> >>>>>    @Override
> >>>>>    AllCacheBuilder createCacheBuilder();
> >>>>> }
> >>>>>
> >>>>>
> >>>>> public interface AllCacheBuilder extends CacheBuilder {
> >>>>>
> >>>>>    AllCache create();
> >>>>> }
> >>>>>
> >>>>>
> >>>>> public interface AllCache extends Cache {
> >>>>>
> >>>>>    void putAll(Iterator<Map<RowData, RowData>> allEntries);
> >>>>>
> >>>>>    void clearAll();
> >>>>> }
> >>>>>
> >>>>>
> >>>>> public interface Reloader {
> >>>>>
> >>>>>    void reload();
> >>>>> }
> >>>>>
> >>>>> ```
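To make the proposed interfaces above more concrete, here is a minimal in-memory sketch following the shape of the PartialCache methods (getIfPresent/size from Cache, put/invalidate from PartialCache). RowData is stubbed with an empty marker interface, since the real org.apache.flink.table.data.RowData is not shown here; the class is only an illustration, not a proposed default implementation.

```java
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;

// Stand-in for org.apache.flink.table.data.RowData, just for this sketch.
interface RowData {}

// Minimal in-memory cache mirroring the proposed PartialCache shape.
class InMemoryPartialCache {
    private final Map<RowData, Collection<RowData>> entries = new HashMap<>();

    // Returns the cached value rows, or null if the key has no cached value.
    Collection<RowData> getIfPresent(RowData key) {
        return entries.get(key);
    }

    // Associates value rows with the key, returning the previous rows if any.
    Collection<RowData> put(RowData key, Collection<RowData> value) {
        return entries.put(key, value);
    }

    // Discards any cached value for the specified key.
    void invalidate(RowData key) {
        entries.remove(key);
    }

    // Number of key-value mappings currently held.
    long size() {
        return entries.size();
    }
}
```

Note that an empty (non-null) collection stored via put can represent a cached "missing key", distinct from getIfPresent returning null for a key that was never looked up.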
> >>>>>
> >>>>> Best,
> >>>>> Jingsong
> >>>>>
> >>>>> On Fri, May 27, 2022 at 11:10 AM Jingsong Li <jingsonglee0@gmail.com
> >>>
> >>>>> wrote:
> >>>>>
> >>>>>> Thanks Qingsheng and all for your discussion.
> >>>>>>
> >>>>>> Very sorry to jump in so late.
> >>>>>>
> >>>>>> Maybe I missed something?
> >>>>>> My first impression when I saw the cache interface was: why don't we
> >>>>>> provide an interface similar to the Guava cache [1]? On top of the
> >>>>>> Guava cache, Caffeine also adds extensions for asynchronous calls
> >>>>>> [2], and Caffeine supports bulk loading as well.
> >>>>>>
> >>>>>> I am also confused about why we first go from
> >>>>>> LookupCacheFactory.Builder to a Factory, and then create the Cache.
> >>>>>>
> >>>>>> [1] https://github.com/google/guava
> >>>>>> [2] https://github.com/ben-manes/caffeine/wiki/Population
> >>>>>>
> >>>>>> Best,
> >>>>>> Jingsong
> >>>>>>
> >>>>>> On Thu, May 26, 2022 at 11:17 PM Jark Wu <im...@gmail.com> wrote:
> >>>>>>
> >>>>>>> After looking at the new introduced ReloadTime and Becket's
> >> comment,
> >>>>>>> I agree with Becket we should have a pluggable reloading strategy.
> >>>>>>> We can provide some common implementations, e.g., periodic
> >>> reloading,
> >>>>> and
> >>>>>>> daily reloading.
> >>>>>>> But there definitely be some connector- or business-specific
> >>> reloading
> >>>>>>> strategies, e.g.
> >>>>>>> notify by a zookeeper watcher, reload once a new Hive partition is
> >>>>>>> complete.
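As an illustration of what "pluggable" could mean here, a reload strategy might be modeled as an interface that receives the reload action and decides when to fire it (periodically, daily, via a ZooKeeper watcher, when a Hive partition completes, and so on). Everything below, names and shape alike, is an assumption for discussion, not a proposed Flink interface.

```java
// Hypothetical pluggable reload strategy: the framework hands the strategy
// a reload action, and the strategy decides when to invoke it.
interface CacheReloadStrategy {

    // Called once by the framework; the strategy schedules reloadAction.
    void open(Runnable reloadAction) throws Exception;

    // Releases timers, watchers, or other resources held by the strategy.
    void close() throws Exception;
}

// Trivial strategy used only for demonstration: triggers the reload a fixed
// number of times immediately, standing in for a timer- or watcher-driven
// implementation.
final class EagerReloadStrategy implements CacheReloadStrategy {
    private final int times;

    EagerReloadStrategy(int times) {
        this.times = times;
    }

    @Override
    public void open(Runnable reloadAction) {
        for (int i = 0; i < times; i++) {
            reloadAction.run();
        }
    }

    @Override
    public void close() {}
}
```

A periodic or daily strategy would fit the same interface, as would one driven by an external notification.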
> >>>>>>>
> >>>>>>> Best,
> >>>>>>> Jark
> >>>>>>>
> >>>>>>> On Thu, 26 May 2022 at 11:52, Becket Qin <be...@gmail.com>
> >>>> wrote:
> >>>>>>>
> >>>>>>>> Hi Qingsheng,
> >>>>>>>>
> >>>>>>>> Thanks for updating the FLIP. A few comments / questions below:
> >>>>>>>>
> >>>>>>>> 1. Is there a reason that we have both "XXXFactory" and
> >>>> "XXXProvider".
> >>>>>>>> What is the difference between them? If they are the same, can
> >> we
> >>>> just
> >>>>>>> use
> >>>>>>>> XXXFactory everywhere?
> >>>>>>>>
> >>>>>>>> 2. Regarding the FullCachingLookupProvider, should the reloading
> >>>>>>>> policy also be pluggable? Periodic reloading can sometimes be
> >>>>>>>> tricky in practice. For example, if a user sets 24 hours as the
> >>>>>>>> cache refresh interval and some nightly batch job is delayed, the
> >>>>>>>> cache update may still see stale data.
> >>>>>>>>
> >>>>>>>> 3. In DefaultLookupCacheFactory, it looks like InitialCapacity
> >>>> should
> >>>>> be
> >>>>>>>> removed.
> >>>>>>>>
> >>>>>>>> 4. The purpose of LookupFunctionProvider#cacheMissingKey() seems a
> >>>>>>>> little confusing to me. If Optional<LookupCacheFactory>
> >>>>>>>> getCacheFactory() returns a non-empty factory, doesn't that
> >>>>>>>> already indicate that the framework should cache the missing keys?
> >>>>>>>> Also, why is this method returning an Optional<Boolean> instead of
> >>>>>>>> a boolean?
> >>>>>>>>
> >>>>>>>> Thanks,
> >>>>>>>>
> >>>>>>>> Jiangjie (Becket) Qin
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On Wed, May 25, 2022 at 5:07 PM Qingsheng Ren <
> >> renqschn@gmail.com
> >>>>
> >>>>>>> wrote:
> >>>>>>>>
> >>>>>>>>> Hi Lincoln and Jark,
> >>>>>>>>>
> >>>>>>>>> Thanks for the comments! If the community reaches a consensus
> >>> that
> >>>> we
> >>>>>>> use
> >>>>>>>>> SQL hint instead of table options to decide whether to use sync
> >>> or
> >>>>>>> async
> >>>>>>>>> mode, it’s indeed not necessary to introduce the “lookup.async”
> >>>>> option.
> >>>>>>>>>
> >>>>>>>>> I think it’s a good idea to let the decision about async be made
> >>>>>>>>> at the query level, which enables better optimization with more
> >>>>>>>>> information gathered by the planner. Is there any FLIP describing
> >>>>>>>>> the issue in FLINK-27625? I thought FLIP-234 was only proposing a
> >>>>>>>>> SQL hint for retry on missing, rather than having the entire
> >>>>>>>>> async mode controlled by a hint.
> >>>>>>>>>
> >>>>>>>>> Best regards,
> >>>>>>>>>
> >>>>>>>>> Qingsheng
> >>>>>>>>>
> >>>>>>>>>> On May 25, 2022, at 15:13, Lincoln Lee <
> >> lincoln.86xy@gmail.com
> >>>>
> >>>>>>> wrote:
> >>>>>>>>>>
> >>>>>>>>>> Hi Jark,
> >>>>>>>>>>
> >>>>>>>>>> Thanks for your reply!
> >>>>>>>>>>
> >>>>>>>>>> Currently 'lookup.async' just lies in HBase connector, I have
> >>> no
> >>>>> idea
> >>>>>>>>>> whether or when to remove it (we can discuss it in another
> >>> issue
> >>>>> for
> >>>>>>> the
> >>>>>>>>>> HBase connector after FLINK-27625 is done), just not add it
> >>> into
> >>>> a
> >>>>>>>>> common
> >>>>>>>>>> option now.
> >>>>>>>>>>
> >>>>>>>>>> Best,
> >>>>>>>>>> Lincoln Lee
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> On Tue, 24 May 2022 at 20:14, Jark Wu <im...@gmail.com> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> Hi Lincoln,
> >>>>>>>>>>>
> >>>>>>>>>>> I have taken a look at FLIP-234, and I agree with you that
> >> the
> >>>>>>>>> connectors
> >>>>>>>>>>> can
> >>>>>>>>>>> provide both async and sync runtime providers simultaneously
> >>>>> instead
> >>>>>>>>> of one
> >>>>>>>>>>> of them.
> >>>>>>>>>>> At that point, "lookup.async" looks redundant. If this
> >> option
> >>> is
> >>>>>>>>> planned to
> >>>>>>>>>>> be removed
> >>>>>>>>>>> in the long term, I think it makes sense not to introduce it
> >>> in
> >>>>> this
> >>>>>>>>> FLIP.
> >>>>>>>>>>>
> >>>>>>>>>>> Best,
> >>>>>>>>>>> Jark
> >>>>>>>>>>>
> >>>>>>>>>>> On Tue, 24 May 2022 at 11:08, Lincoln Lee <
> >>>> lincoln.86xy@gmail.com
> >>>>>>
> >>>>>>>>> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>> Hi Qingsheng,
> >>>>>>>>>>>>
> >>>>>>>>>>>> Sorry for jumping into the discussion so late. It's a good
> >>> idea
> >>>>>>> that
> >>>>>>>>> we
> >>>>>>>>>>> can
> >>>>>>>>>>>> have a common table option. I have a minor comments on
> >>>>>>> 'lookup.async'
> >>>>>>>>>>> that
> >>>>>>>>>>>> not make it a common option:
> >>>>>>>>>>>>
> >>>>>>>>>>>> The table layer abstracts both sync and async lookup
> >>>>>>>>>>>> capabilities; connector implementers can choose one or both.
> >>>>>>>>>>>> In the case of implementing only one capability (the status of
> >>>>>>>>>>>> most existing built-in connectors), 'lookup.async' will not be
> >>>>>>>>>>>> used. And when a connector has both capabilities, I think this
> >>>>>>>>>>>> choice is better made at the query level: for example, the
> >>>>>>>>>>>> table planner can choose the physical implementation of async
> >>>>>>>>>>>> or sync lookup based on its cost model, or users can give a
> >>>>>>>>>>>> query hint based on their own better understanding. If there
> >>>>>>>>>>>> is another common table option 'lookup.async', it may confuse
> >>>>>>>>>>>> users in the long run.
> >>>>>>>>>>>>
> >>>>>>>>>>>> So, I prefer to leave the 'lookup.async' option in private
> >>>> place
> >>>>>>> (for
> >>>>>>>>> the
> >>>>>>>>>>>> current hbase connector) and not turn it into a common
> >>> option.
> >>>>>>>>>>>>
> >>>>>>>>>>>> WDYT?
> >>>>>>>>>>>>
> >>>>>>>>>>>> Best,
> >>>>>>>>>>>> Lincoln Lee
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Mon, 23 May 2022 at 14:54, Qingsheng Ren <re...@gmail.com> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>>> Hi Alexander,
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Thanks for the review! We recently updated the FLIP and
> >> you
> >>>> can
> >>>>>>> find
> >>>>>>>>>>>> those
> >>>>>>>>>>>>> changes from my latest email. Since some terminologies has
> >>>>>>> changed so
> >>>>>>>>>>>> I’ll
> >>>>>>>>>>>>> use the new concept for replying your comments.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> 1. Builder vs ‘of’
> >>>>>>>>>>>>> I’m OK to use builder pattern if we have additional
> >> optional
> >>>>>>>>> parameters
> >>>>>>>>>>>>> for full caching mode (“rescan” previously). The
> >>>>>>> schedule-with-delay
> >>>>>>>>>>> idea
> >>>>>>>>>>>>> looks reasonable to me, but I think we need to redesign
> >> the
> >>>>>>> builder
> >>>>>>>>> API
> >>>>>>>>>>>> of
> >>>>>>>>>>>>> full caching to make it more descriptive for developers.
> >>> Would
> >>>>> you
> >>>>>>>>> mind
> >>>>>>>>>>>>> sharing your ideas about the API? For accessing the FLIP
> >>>>> workspace
> >>>>>>>>> you
> >>>>>>>>>>>> can
> >>>>>>>>>>>>> just provide your account ID and ping any PMC member
> >>> including
> >>>>>>> Jark.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> 2. Common table options
> >>>>>>>>>>>>> We have some discussions these days and propose to
> >>> introduce 8
> >>>>>>> common
> >>>>>>>>>>>>> table options about caching. It has been updated on the
> >>> FLIP.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> 3. Retries
> >>>>>>>>>>>>> I think we are on the same page :-)
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> For your additional concerns:
> >>>>>>>>>>>>> 1) The table option has been updated.
> >>>>>>>>>>>>> 2) We got “lookup.cache” back for configuring whether to
> >> use
> >>>>>>> partial
> >>>>>>>>> or
> >>>>>>>>>>>>> full caching mode.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Best regards,
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Qingsheng
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> On May 19, 2022, at 17:25, Александр Смирнов <
> >>>>>>> smiralexan@gmail.com>
> >>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Also I have a few additions:
> >>>>>>>>>>>>>> 1) maybe rename 'lookup.cache.maximum-size' to
> >>>>>>>>>>>>>> 'lookup.cache.max-rows'? I think it will make it clearer
> >>>>>>>>>>>>>> that we are talking not about bytes but about the number of
> >>>>>>>>>>>>>> rows. Plus it fits better, considering my optimization with
> >>>>>>>>>>>>>> filters.
> >>>>>>>>>>>>>> 2) How will users enable rescanning? Are we going to
> >>> separate
> >>>>>>>>> caching
> >>>>>>>>>>>>>> and rescanning from the options point of view? Like
> >>> initially
> >>>>> we
> >>>>>>> had
> >>>>>>>>>>>>>> one option 'lookup.cache' with values LRU / ALL. I think
> >>> now
> >>>> we
> >>>>>>> can
> >>>>>>>>>>>>>> make a boolean option 'lookup.rescan'. RescanInterval can
> >>> be
> >>>>>>>>>>>>>> 'lookup.rescan.interval', etc.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Best regards,
> >>>>>>>>>>>>>> Alexander
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> On Thu, 19 May 2022 at 14:50, Александр Смирнов <
> >>>>>>> smiralexan@gmail.com
> >>>>>>>>>>>> :
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Hi Qingsheng and Jark,
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> 1. Builders vs 'of'
> >>>>>>>>>>>>>>> I understand that builders are used when we have
> >> multiple
> >>>>>>>>>>> parameters.
> >>>>>>>>>>>>>>> I suggested them because we could add parameters later.
> >> To
> >>>>>>> prevent
> >>>>>>>>>>>>>>> Builder for ScanRuntimeProvider from looking redundant I
> >>> can
> >>>>>>>>> suggest
> >>>>>>>>>>>>>>> one more config now - "rescanStartTime".
> >>>>>>>>>>>>>> It's a time in UTC (LocalTime class) when the first reload
> >>>>>>>>>>>>>> of the cache starts. This parameter can be thought of as the
> >>>>>>>>>>>>>> 'initialDelay' (the difference between the current time and
> >>>>>>>>>>>>>> rescanStartTime) in
> >>>>>>>>>>>>>> ScheduledExecutorService#scheduleWithFixedDelay [1]. It can
> >>>>>>>>>>>>>> be very useful when the dimension table is updated by some
> >>>>>>>>>>>>>> other scheduled job at a certain time, or when the user
> >>>>>>>>>>>>>> simply wants the second scan (the first cache reload) to be
> >>>>>>>>>>>>>> delayed. This option can be used even without
> >>>>>>>>>>>>>> 'rescanInterval'; in that case 'rescanInterval' will default
> >>>>>>>>>>>>>> to one day.
> >>> one
> >>>>>>> day.
> >>>>>>>>>>>>>>> If you are fine with this option, I would be very glad
> >> if
> >>>> you
> >>>>>>> would
> >>>>>>>>>>>>>>> give me access to edit FLIP page, so I could add it
> >> myself
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> 2. Common table options
> >>>>>>>>>>>>>>> I also think that FactoryUtil would be overloaded by all
> >>>> cache
> >>>>>>>>>>>>>>> options. But maybe unify all suggested options, not only
> >>> for
> >>>>>>>>> default
> >>>>>>>>>>>>>>> cache? I.e. class 'LookupOptions', that unifies default
> >>>> cache
> >>>>>>>>>>> options,
> >>>>>>>>>>>>>>> rescan options, 'async', 'maxRetries'. WDYT?
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> 3. Retries
> >>>>>>>>>>>>>>> I'm fine with suggestion close to
> >>> RetryUtils#tryTimes(times,
> >>>>>>> call)
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> [1]
> >>>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>
> >>>>>>>
> >>>>>
> >>>>
> >>>
> >>
> https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/ScheduledExecutorService.html#scheduleWithFixedDelay-java.lang.Runnable-long-long-java.util.concurrent.TimeUnit-
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Best regards,
> >>>>>>>>>>>>>>> Alexander
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> On Wed, 18 May 2022 at 16:04, Qingsheng Ren <
> >>>> renqschn@gmail.com
> >>>>>> :
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Hi Jark and Alexander,
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Thanks for your comments! I’m also OK to introduce
> >> common
> >>>>> table
> >>>>>>>>>>>>> options. I prefer to introduce a new
> >>> DefaultLookupCacheOptions
> >>>>>>> class
> >>>>>>>>>>> for
> >>>>>>>>>>>>> holding these option definitions because putting all
> >> options
> >>>>> into
> >>>>>>>>>>>>> FactoryUtil would make it a bit ”crowded” and not well
> >>>>>>> categorized.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> FLIP has been updated according to suggestions above:
> >>>>>>>>>>>>>>>> 1. Use static “of” method for constructing
> >>>>>>> RescanRuntimeProvider
> >>>>>>>>>>>>> considering both arguments are required.
> >>>>>>>>>>>>>>>> 2. Introduce new table options matching
> >>>>>>> DefaultLookupCacheFactory
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Best,
> >>>>>>>>>>>>>>>> Qingsheng
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> On Wed, May 18, 2022 at 2:57 PM Jark Wu <
> >>> imjark@gmail.com>
> >>>>>>> wrote:
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Hi Alex,
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> 1) retry logic
> >>>>>>>>>>>>>>>>> I think we can extract some common retry logic into
> >>>>> utilities,
> >>>>>>>>>>> e.g.
> >>>>>>>>>>>>> RetryUtils#tryTimes(times, call).
> >>>>>>>>>>>>>>>>> This seems independent of this FLIP and can be reused
> >> by
> >>>>>>>>>>> DataStream
> >>>>>>>>>>>>> users.
> >>>>>>>>>>>>>>>>> Maybe we can open an issue to discuss this and where
> >> to
> >>>> put
> >>>>>>> it.
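One possible shape for such a RetryUtils#tryTimes(times, call) helper: the name and rough signature come from this thread, while the body below is only a sketch of the common retry logic being discussed.

```java
import java.util.concurrent.Callable;

// Sketch of a generic retry utility; connector-specific work (backoff,
// re-establishing the connection) could be hooked between attempts.
final class RetryUtils {
    private RetryUtils() {}

    // Invokes 'call' up to 'times' times, returning the first successful
    // result and rethrowing the last failure if every attempt fails.
    static <T> T tryTimes(int times, Callable<T> call) throws Exception {
        if (times <= 0) {
            throw new IllegalArgumentException("times must be positive");
        }
        Exception lastFailure = null;
        for (int attempt = 1; attempt <= times; attempt++) {
            try {
                return call.call();
            } catch (Exception e) {
                lastFailure = e;
            }
        }
        throw lastFailure;
    }
}
```

A connector's lookup() could then wrap its I/O call in tryTimes instead of duplicating the retry loop.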
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> 2) cache ConfigOptions
> >>>>>>>>>>>>>>>>> I'm fine with defining cache config options in the
> >>>>> framework.
> >>>>>>>>>>>>>>>>> A candidate place to put is FactoryUtil which also
> >>>> includes
> >>>>>>>>>>>>> "sink.parallelism", "format" options.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Best,
> >>>>>>>>>>>>>>>>> Jark
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> On Wed, 18 May 2022 at 13:52, Александр Смирнов <
> >>>>>>>>>>>> smiralexan@gmail.com>
> >>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Hi Qingsheng,
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Thank you for considering my comments.
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> there might be custom logic before making retry,
> >> such
> >>> as
> >>>>>>>>>>>>> re-establish the connection
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Yes, I understand that. I meant that such logic can be
> >>>>>>>>>>>>>>>>>> placed in a separate function that can be implemented by
> >>>>>>>>>>>>>>>>>> connectors. Just moving the retry logic would make a
> >>>>>>>>>>>>>>>>>> connector's LookupFunction more concise and avoid
> >>>>>>>>>>>>>>>>>> duplicate code. However, it's a minor change.
> >> The
> >>>>>>> decision
> >>>>>>>>>>> is
> >>>>>>>>>>>>> up
> >>>>>>>>>>>>>>>>>> to you.
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> We decide not to provide common DDL options and let
> >>>>>>> developers
> >>>>>>>>>>> to
> >>>>>>>>>>>>> define their own options as we do now per connector.
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> What is the reason for that? One of the main goals of
> >>>> this
> >>>>>>> FLIP
> >>>>>>>>>>> was
> >>>>>>>>>>>>> to
> >>>>>>>>>>>>>>>>>> unify the configs, wasn't it? I understand that the
> >>>>>>>>>>>>>>>>>> current cache design doesn't depend on ConfigOptions as
> >>>>>>>>>>>>>>>>>> it did before. But we can still put these options into
> >>>>>>>>>>>>>>>>>> the framework, so connectors can reuse them and avoid
> >>>>>>>>>>>>>>>>>> code duplication and, more significantly, avoid
> >>>>>>>>>>>>>>>>>> inconsistent option naming. This point can be
> >>>>>>>>>>>>>>>>>> highlighted in the documentation for connector
> >>>>>>>>>>>>>>>>>> developers.
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Best regards,
> >>>>>>>>>>>>>>>>>> Alexander
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> On Tue, 17 May 2022 at 17:11, Qingsheng Ren <
> >>>>>>> renqschn@gmail.com>:
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Hi Alexander,
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Thanks for the review and glad to see we are on the
> >>> same
> >>>>>>> page!
> >>>>>>>>> I
> >>>>>>>>>>>>> think you forgot to cc the dev mailing list so I’m also
> >>>> quoting
> >>>>>>> your
> >>>>>>>>>>>> reply
> >>>>>>>>>>>>> under this email.
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> We can add 'maxRetryTimes' option into this class
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> In my opinion the retry logic should be implemented
> >> in
> >>>>>>> lookup()
> >>>>>>>>>>>>> instead of in LookupFunction#eval(). Retrying is only
> >>>> meaningful
> >>>>>>>>> under
> >>>>>>>>>>>> some
> >>>>>>>>>>>>> specific retriable failures, and there might be custom
> >> logic
> >>>>>>> before
> >>>>>>>>>>>> making
> >>>>>>>>>>>>> retry, such as re-establish the connection
> >>>>>>> (JdbcRowDataLookupFunction
> >>>>>>>>>>> is
> >>>>>>>>>>>> an
> >>>>>>>>>>>>> example), so it's more handy to leave it to the connector.
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> I don't see DDL options, that were in previous
> >>> version
> >>>> of
> >>>>>>>>> FLIP.
> >>>>>>>>>>>> Do
> >>>>>>>>>>>>> you have any special plans for them?
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> We decide not to provide common DDL options and let
> >>>>>>> developers
> >>>>>>>>>>> to
> >>>>>>>>>>>>> define their own options as we do now per connector.
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> The rest of comments sound great and I’ll update the
> >>>> FLIP.
> >>>>>>> Hope
> >>>>>>>>>>> we
> >>>>>>>>>>>>> can finalize our proposal soon!
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Best,
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Qingsheng
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> On May 17, 2022, at 13:46, Александр Смирнов <
> >>>>>>>>>>>> smiralexan@gmail.com>
> >>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> Hi Qingsheng and devs!
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> I like the overall design of updated FLIP, however
> >> I
> >>>> have
> >>>>>>>>>>> several
> >>>>>>>>>>>>>>>>>>>> suggestions and questions.
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> 1) Introducing LookupFunction as a subclass of
> >>>>>>> TableFunction
> >>>>>>>>>>> is a
> >>>>>>>>>>>>> good
> >>>>>>>>>>>>>>>>>>>> idea. We can add 'maxRetryTimes' option into this
> >>>> class.
> >>>>>>>>> 'eval'
> >>>>>>>>>>>>> method
> >>>>>>>>>>>>>>>>>>>> of new LookupFunction is great for this purpose.
> >> The
> >>>> same
> >>>>>>> is
> >>>>>>>>>>> for
> >>>>>>>>>>>>>>>>>>>> 'async' case.
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> 2) There might be other configs in future, such as
> >>>>>>>>>>>>> 'cacheMissingKey'
> >>>>>>>>>>>>>>>>>>>> in LookupFunctionProvider or 'rescanInterval' in
> >>>>>>>>>>>>> ScanRuntimeProvider.
> >>>>>>>>>>>>>>>>>>>> Maybe use Builder pattern in LookupFunctionProvider
> >>> and
> >>>>>>>>>>>>>>>>>>>> RescanRuntimeProvider for more flexibility (use one
> >>>>> 'build'
> >>>>>>>>>>>> method
> >>>>>>>>>>>>>>>>>>>> instead of many 'of' methods in future)?
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> 3) What are the plans for existing
> >>>> TableFunctionProvider
> >>>>>>> and
> >>>>>>>>>>>>>>>>>>>> AsyncTableFunctionProvider? I think they should be
> >>>>>>> deprecated.
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> 4) Am I right that the current design does not
> >> assume
> >>>>>>> usage of
> >>>>>>>>>>>>>>>>>>>> user-provided LookupCache in re-scanning? In this
> >>>>>>>>>>>>>>>>>>>> case, it is not very clear why we need methods such as
> >>>>>>>>>>>>>>>>>>>> 'invalidate' or 'putAll' in LookupCache.
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> 5) I don't see DDL options, that were in previous
> >>>> version
> >>>>>>> of
> >>>>>>>>>>>> FLIP.
> >>>>>>>>>>>>> Do
> >>>>>>>>>>>>>>>>>>>> you have any special plans for them?
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> If you don't mind, I would be glad to be able to
> >> make
> >>>>> small
> >>>>>>>>>>>>>>>>>>>> adjustments to the FLIP document too. I think it's
> >>>> worth
> >>>>>>>>>>>> mentioning
> >>>>>>>>>>>>>>>>>>>> about what exactly optimizations are planning in
> >> the
> >>>>>>> future.
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> Best regards,
> >>>>>>>>>>>>>>>>>>>> Smirnov Alexander
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> On Fri, 13 May 2022 at 20:27, Qingsheng Ren <
> >>>>>>> renqschn@gmail.com
> >>>>>>>>>>>> :
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> Hi Alexander and devs,
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> Thank you very much for the in-depth discussion!
> >> As
> >>>> Jark
> >>>>>>>>>>>>> mentioned we were inspired by Alexander's idea and made a
> >>>>>>> refactor on
> >>>>>>>>>>> our
> >>>>>>>>>>>>> design. FLIP-221 [1] has been updated to reflect our
> >> design
> >>>> now
> >>>>>>> and
> >>>>>>>>> we
> >>>>>>>>>>>> are
> >>>>>>>>>>>>> happy to hear more suggestions from you!
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> Compared to the previous design:
> >>>>>>>>>>>>>>>>>>>>> 1. The lookup cache serves at table runtime level
> >>> and
> >>>> is
> >>>>>>>>>>>>> integrated as a component of LookupJoinRunner as discussed
> >>>>>>>>> previously.
> >>>>>>>>>>>>>>>>>>>>> 2. Interfaces are renamed and re-designed to
> >> reflect
> >>>> the
> >>>>>>> new
> >>>>>>>>>>>>> design.
> >>>>>>>>>>>>>>>>>>>>> 3. We separate the all-caching case individually
> >> and
> >>>>>>>>>>> introduce a
> >>>>>>>>>>>>> new RescanRuntimeProvider to reuse the ability of
> >> scanning.
> >>> We
> >>>>> are
> >>>>>>>>>>>> planning
> >>>>>>>>>>>>> to support SourceFunction / InputFormat for now
> >> considering
> >>>> the
> >>>>>>>>>>>> complexity
> >>>>>>>>>>>>> of FLIP-27 Source API.
> >>>>>>>>>>>>>>>>>>>>> 4. A new interface LookupFunction is introduced to
> >>>> make
> >>>>>>> the
> >>>>>>>>>>>>> semantic of lookup more straightforward for developers.
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> For replying to Alexander:
> >>>>>>>>>>>>>>>>>>>>>> However I'm a little confused whether InputFormat
> >>> is
> >>>>>>>>>>> deprecated
> >>>>>>>>>>>>> or not. Am I right that it will be so in the future, but
> >>>>> currently
> >>>>>>>>> it's
> >>>>>>>>>>>> not?
> >>>>>>>>>>>>>>>>>>>>> Yes you are right. InputFormat is not deprecated
> >> for
> >>>>> now.
> >>>>>>> I
> >>>>>>>>>>>> think
> >>>>>>>>>>>>> it will be deprecated in the future but we don't have a
> >>> clear
> >>>>> plan
> >>>>>>>>> for
> >>>>>>>>>>>> that.
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> Thanks again for the discussion on this FLIP and
> >>>> looking
> >>>>>>>>>>> forward
> >>>>>>>>>>>>> to cooperating with you after we finalize the design and
> >>>>>>> interfaces!
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> [1]
> >>>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>
> >>>>>>>
> >>>>>
> >>>>
> >>>
> >>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-221+Abstraction+for+lookup+source+cache+and+metric
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> Best regards,
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> Qingsheng
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> On Fri, May 13, 2022 at 12:12 AM Александр
> >> Смирнов <
> >>>>>>>>>>>>> smiralexan@gmail.com> wrote:
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> Hi Jark, Qingsheng and Leonard!
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> Glad to see that we came to a consensus on almost
> >>> all
> >>>>>>>>> points!
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> However I'm a little confused whether InputFormat
> >>> is
> >>>>>>>>>>> deprecated
> >>>>>>>>>>>>> or
> >>>>>>>>>>>>>>>>>>>>>> not. Am I right that it will be so in the future,
> >>> but
> >>>>>>>>>>> currently
> >>>>>>>>>>>>> it's
> >>>>>>>>>>>>>>>>>>>>>> not? Actually I also think that for the first
> >>> version
> >>>>>>> it's
> >>>>>>>>> OK
> >>>>>>>>>>>> to
> >>>>>>>>>>>>> use
> >>>>>>>>>>>>>>>>>>>>>> InputFormat in ALL cache realization, because
> >>>>> supporting
> >>>>>>>>>>> rescan
> >>>>>>>>>>>>>>>>>>>>>> ability seems like a very distant prospect. But
> >> for
> >>>>> this
> >>>>>>>>>>>>> decision we
> >>>>>>>>>>>>>>>>>>>>>> need a consensus among all discussion
> >> participants.
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> In general, I don't have something to argue with
> >>> your
> >>>>>>>>>>>>> statements. All
> >>>>>>>>>>>>>>>>>>>>>> of them correspond my ideas. Looking ahead, it
> >>> would
> >>>> be
> >>>>>>> nice
> >>>>>>>>>>> to
> >>>>>>>>>>>>> work
> >>>>>>>>>>>>>>>>>>>>>> on this FLIP cooperatively. I've already done a
> >> lot
> >>>> of
> >>>>>>> work
> >>>>>>>>>>> on
> >>>>>>>>>>>>> lookup
> >>>>>>>>>>>>>>>>>>>>>> join caching with an implementation very close to the
> >> one
> >>>> we
> >>>>>>> are
> >>>>>>>>>>>>> discussing,
> >>>>>>>>>>>>>>>>>>>>>> and want to share the results of this work.
> >> Anyway
> >>>>>>> looking
> >>>>>>>>>>>>> forward to
> >>>>>>>>>>>>>>>>>>>>>> the FLIP update!
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> Best regards,
> >>>>>>>>>>>>>>>>>>>>>> Smirnov Alexander
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> On Thu, 12 May 2022 at 17:38, Jark Wu <
> >>>> imjark@gmail.com
> >>>>>> :
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> Hi Alex,
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> Thanks for summarizing your points.
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> In the past week, Qingsheng, Leonard, and I have
> >>>>>>> discussed
> >>>>>>>>>>> it
> >>>>>>>>>>>>> several times
> >>>>>>>>>>>>>>>>>>>>>>> and we have totally refactored the design.
> >>>>>>>>>>>>>>>>>>>>>>> I'm glad to say we have reached a consensus on
> >>> many
> >>>> of
> >>>>>>> your
> >>>>>>>>>>>>> points!
> >>>>>>>>>>>>>>>>>>>>>>> Qingsheng is still working on updating the
> >> design
> >>>> docs
> >>>>>>> and
> >>>>>>>>>>>>> maybe can be
> >>>>>>>>>>>>>>>>>>>>>>> available in the next few days.
> >>>>>>>>>>>>>>>>>>>>>>> I will share some conclusions from our
> >>> discussions:
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> 1) we have refactored the design towards to
> >> "cache
> >>>> in
> >>>>>>>>>>>>> framework" way.
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> 2) a "LookupCache" interface for users to
> >>> customize
> >>>>> and
> >>>>>>> a
> >>>>>>>>>>>>> default
> >>>>>>>>>>>>>>>>>>>>>>> implementation with builder for users to
> >> easy-use.
> >>>>>>>>>>>>>>>>>>>>>>> This can make it possible to have both
> >>>>> flexibility
> >>>>>>> and
> >>>>>>>>>>>>> conciseness.
> >>>>>>>>>>>>>>>>>>>>>>>
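[Editor's illustration] The "LookupCache interface plus a default implementation with a builder" idea mentioned above can be sketched in a few lines of Java. Everything here is hypothetical: `LookupCache`, `DefaultLookupCache`, and `maximumSize` are made-up names for illustration, not the actual FLIP-221 API.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;

// Hypothetical sketch: a user-facing cache interface plus a
// builder-configured default implementation.
interface LookupCache<K, V> {
    V getIfPresent(K key);   // returns null on a cache miss
    void put(K key, V value);
    long size();
}

final class DefaultLookupCache<K, V> implements LookupCache<K, V> {
    // An access-order LinkedHashMap iterates least-recently-used first,
    // which gives us LRU eviction order for free.
    private final LinkedHashMap<K, V> map = new LinkedHashMap<>(16, 0.75f, true);
    private final long maxRows;

    private DefaultLookupCache(long maxRows) { this.maxRows = maxRows; }

    public static <K, V> Builder<K, V> builder() { return new Builder<>(); }

    @Override public V getIfPresent(K key) { return map.get(key); }

    @Override public void put(K key, V value) {
        map.put(key, value);
        Iterator<K> eldest = map.keySet().iterator();
        while (map.size() > maxRows && eldest.hasNext()) {
            eldest.next();
            eldest.remove();   // evict the least-recently-used entry
        }
    }

    @Override public long size() { return map.size(); }

    public static final class Builder<K, V> {
        private long maxRows = 10_000L;
        public Builder<K, V> maximumSize(long maxRows) { this.maxRows = maxRows; return this; }
        public LookupCache<K, V> build() { return new DefaultLookupCache<>(maxRows); }
    }
}
```

A connector wanting the default behavior would call `DefaultLookupCache.builder().maximumSize(...).build()`, while a connector with special needs implements `LookupCache` directly — which is how the interface gives both conciseness and flexibility.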
> >>>>>>>>>>>>>>>>>>>>>>> 3) Filter pushdown is important for ALL and LRU
> >>>> lookup
> >>>>>>>>>>> cache,
> >>>>>>>>>>>>> esp reducing
> >>>>>>>>>>>>>>>>>>>>>>> IO.
> >>>>>>>>>>>>>>>>>>>>>>> Filter pushdown should be the final state and
> >> the
> >>>>>>> unified
> >>>>>>>>>>> way
> >>>>>>>>>>>>> to both
> >>>>>>>>>>>>>>>>>>>>>>> support pruning ALL cache and LRU cache,
> >>>>>>>>>>>>>>>>>>>>>>> so I think we should make effort in this
> >>> direction.
> >>>> If
> >>>>>>> we
> >>>>>>>>>>> need
> >>>>>>>>>>>>> to support
> >>>>>>>>>>>>>>>>>>>>>>> filter pushdown for ALL cache anyway, why not
> >> use
> >>>>>>>>>>>>>>>>>>>>>>> it for LRU cache as well? Either way, as we
> >> decide
> >>>> to
> >>>>>>>>>>>> implement
> >>>>>>>>>>>>> the cache
> >>>>>>>>>>>>>>>>>>>>>>> in the framework, we have the chance to support
> >>>>>>>>>>>>>>>>>>>>>>> filter on cache anytime. This is an optimization
> >>> and
> >>>>> it
> >>>>>>>>>>>> doesn't
> >>>>>>>>>>>>> affect the
> >>>>>>>>>>>>>>>>>>>>>>> public API. I think we can create a JIRA issue
> >> to
> >>>>>>>>>>>>>>>>>>>>>>> discuss it when the FLIP is accepted.
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> 4) The idea to support ALL cache is similar to
> >>> your
> >>>>>>>>>>> proposal.
> >>>>>>>>>>>>>>>>>>>>>>> In the first version, we will only support
> >>>>> InputFormat,
> >>>>>>>>>>>>> SourceFunction for
> >>>>>>>>>>>>>>>>>>>>>>> cache all (invoke InputFormat in join operator).
> >>>>>>>>>>>>>>>>>>>>>>> For FLIP-27 source, we need to join a true
> >> source
> >>>>>>> operator
> >>>>>>>>>>>>> instead of
> >>>>>>>>>>>>>>>>>>>>>>> calling it embedded in the join operator.
> >>>>>>>>>>>>>>>>>>>>>>> However, this needs another FLIP to support the
> >>>>> re-scan
> >>>>>>>>>>>> ability
> >>>>>>>>>>>>> for FLIP-27
> >>>>>>>>>>>>>>>>>>>>>>> Source, and this can be a large work.
> >>>>>>>>>>>>>>>>>>>>>>> In order to not block this issue, we can put the
> >>>>> effort
> >>>>>>> of
> >>>>>>>>>>>>> FLIP-27 source
> >>>>>>>>>>>>>>>>>>>>>>> integration into future work and integrate
> >>>>>>>>>>>>>>>>>>>>>>> InputFormat&SourceFunction for now.
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> I think it's fine to use
> >>> InputFormat&SourceFunction,
> >>>>> as
> >>>>>>>>> they
> >>>>>>>>>>>>> are not
> >>>>>>>>>>>>>>>>>>>>>>> deprecated, otherwise, we have to introduce
> >>> another
> >>>>>>>>> function
> >>>>>>>>>>>>>>>>>>>>>>> similar to them which is meaningless. We need to
> >>>> plan
> >>>>>>>>>>> FLIP-27
> >>>>>>>>>>>>> source
> >>>>>>>>>>>>>>>>>>>>>>> integration ASAP before InputFormat &
> >>> SourceFunction
> >>>>> are
> >>>>>>>>>>>>> deprecated.
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> Best,
> >>>>>>>>>>>>>>>>>>>>>>> Jark
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> On Thu, 12 May 2022 at 15:46, Александр Смирнов
> >> <
> >>>>>>>>>>>>> smiralexan@gmail.com>
> >>>>>>>>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> Hi Martijn!
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> Got it. Therefore, the realization with
> >>> InputFormat
> >>>>> is
> >>>>>>> not
> >>>>>>>>>>>>> considered.
> >>>>>>>>>>>>>>>>>>>>>>>> Thanks for clearing that up!
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> Best regards,
> >>>>>>>>>>>>>>>>>>>>>>>> Smirnov Alexander
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> On Thu, 12 May 2022 at 14:23, Martijn Visser <
> >>>>>>>>>>>>> martijn@ververica.com>:
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>> Hi,
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>> With regards to:
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> But if there are plans to refactor all
> >>> connectors
> >>>>> to
> >>>>>>>>>>>> FLIP-27
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>> Yes, FLIP-27 is the target for all connectors.
> >>> The
> >>>>> old
> >>>>>>>>>>>>> interfaces will be
> >>>>>>>>>>>>>>>>>>>>>>>>> deprecated and connectors will either be
> >>>> refactored
> >>>>> to
> >>>>>>>>> use
> >>>>>>>>>>>>> the new ones
> >>>>>>>>>>>>>>>>>>>>>>>> or
> >>>>>>>>>>>>>>>>>>>>>>>>> dropped.
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>> The caching should work for connectors that
> >> are
> >>>>> using
> >>>>>>>>>>>> FLIP-27
> >>>>>>>>>>>>> interfaces,
> >>>>>>>>>>>>>>>>>>>>>>>>> we should not introduce new features for old
> >>>>>>> interfaces.
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>> Best regards,
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>> Martijn
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>> On Thu, 12 May 2022 at 06:19, Александр
> >> Смирнов
> >>> <
> >>>>>>>>>>>>> smiralexan@gmail.com>
> >>>>>>>>>>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> Hi Jark!
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> Sorry for the late response. I would like to
> >>> make
> >>>>>>> some
> >>>>>>>>>>>>> comments and
> >>>>>>>>>>>>>>>>>>>>>>>>>> clarify my points.
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> 1) I agree with your first statement. I think
> >>> we
> >>>>> can
> >>>>>>>>>>>> achieve
> >>>>>>>>>>>>> both
> >>>>>>>>>>>>>>>>>>>>>>>>>> advantages this way: put the Cache interface
> >> in
> >>>>>>>>>>>>> flink-table-common,
> >>>>>>>>>>>>>>>>>>>>>>>>>> but have implementations of it in
> >>>>>>> flink-table-runtime.
> >>>>>>>>>>>>> Therefore if a
> >>>>>>>>>>>>>>>>>>>>>>>>>> connector developer wants to use existing
> >> cache
> >>>>>>>>>>> strategies
> >>>>>>>>>>>>> and their
> >>>>>>>>>>>>>>>>>>>>>>>>>> implementations, he can just pass
> >> lookupConfig
> >>> to
> >>>>> the
> >>>>>>>>>>>>> planner, but if
> >>>>>>>>>>>>>>>>>>>>>>>>>> he wants to have its own cache implementation
> >>> in
> >>>>> his
> >>>>>>>>>>>>> TableFunction, it
> >>>>>>>>>>>>>>>>>>>>>>>>>> will be possible for him to use the existing
> >>>>>>> interface
> >>>>>>>>>>> for
> >>>>>>>>>>>>> this
> >>>>>>>>>>>>>>>>>>>>>>>>>> purpose (we can explicitly point this out in
> >>> the
> >>>>>>>>>>>>> documentation). In
> >>>>>>>>>>>>>>>>>>>>>>>>>> this way all configs and metrics will be
> >>> unified.
> >>>>>>> WDYT?
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>> If a filter can prune 90% of data in the
> >>> cache,
> >>>> we
> >>>>>>> will
> >>>>>>>>>>>>> have 90% of
> >>>>>>>>>>>>>>>>>>>>>>>>>> lookup requests that can never be cached
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> 2) Let me clarify the logic filters
> >>> optimization
> >>>> in
> >>>>>>> case
> >>>>>>>>>>> of
> >>>>>>>>>>>>> LRU cache.
> >>>>>>>>>>>>>>>>>>>>>>>>>> It looks like Cache<RowData,
> >>>> Collection<RowData>>.
> >>>>>>> Here
> >>>>>>>>>>> we
> >>>>>>>>>>>>> always
> >>>>>>>>>>>>>>>>>>>>>>>>>> store the response of the dimension table in
> >>>> cache,
> >>>>>>> even
> >>>>>>>>>>>>> after
> >>>>>>>>>>>>>>>>>>>>>>>>>> applying calc function. I.e. if there are no
> >>> rows
> >>>>>>> after
> >>>>>>>>>>>>> applying
> >>>>>>>>>>>>>>>>>>>>>>>>>> filters to the result of the 'eval' method of
> >>>>>>>>>>>> TableFunction,
> >>>>>>>>>>>>> we store
> >>>>>>>>>>>>>>>>>>>>>>>>>> the empty list by lookup keys. Therefore the
> >>>> cache
> >>>>>>> line
> >>>>>>>>>>>> will
> >>>>>>>>>>>>> be
> >>>>>>>>>>>>>>>>>>>>>>>>>> filled, but will require much less memory (in
> >>>>> bytes).
> >>>>>>>>>>> I.e.
> >>>>>>>>>>>>> we don't
> >>>>>>>>>>>>>>>>>>>>>>>>>> completely filter keys, by which result was
> >>>> pruned,
> >>>>>>> but
> >>>>>>>>>>>>> significantly
> >>>>>>>>>>>>>>>>>>>>>>>>>> reduce required memory to store this result.
> >> If
> >>>> the
> >>>>>>> user
> >>>>>>>>>>>>> knows about
> >>>>>>>>>>>>>>>>>>>>>>>>>> this behavior, he can increase the 'max-rows'
> >>>>> option
> >>>>>>>>>>> before
> >>>>>>>>>>>>> the start
> >>>>>>>>>>>>>>>>>>>>>>>>>> of the job. But actually I came up with the
> >>> idea
> >>>>>>> that we
> >>>>>>>>>>>> can
> >>>>>>>>>>>>> do this
> >>>>>>>>>>>>>>>>>>>>>>>>>> automatically by using the 'maximumWeight'
> >> and
> >>>>>>> 'weigher'
> >>>>>>>>>>>>> methods of
> >>>>>>>>>>>>>>>>>>>>>>>>>> GuavaCache [1]. Weight can be the size of the
> >>>>>>> collection
> >>>>>>>>>>> of
> >>>>>>>>>>>>> rows
> >>>>>>>>>>>>>>>>>>>>>>>>>> (the cache value). Therefore the cache can
> >>>>>>>>>>>>>>>>>>>>>>>>>> automatically fit many more records than before.
> >>>>>>>>>>>>>>>>>>>>>>>>>>
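[Editor's illustration] The weight-bounded cache idea above — weight = number of rows in the cached collection, so that empty filtered results cost almost nothing — is exactly what Guava's `CacheBuilder.maximumWeight(...).weigher(...)` provides. As a self-contained sketch of the same idea using only the JDK (the class name `WeightedLookupCache` is made up):

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Minimal sketch of a weight-bounded LRU cache where the "weight" of an
// entry is the number of rows in its value collection. An empty result
// (all rows pruned by filters) weighs 0, so it stays cached essentially
// for free. Illustrative only, not Flink or Guava API.
class WeightedLookupCache<K, V> {
    private final long maxWeight;        // total rows allowed across all entries
    private long currentWeight = 0;
    // access-order map: iteration visits least-recently-used entries first
    private final LinkedHashMap<K, List<V>> map = new LinkedHashMap<>(16, 0.75f, true);

    WeightedLookupCache(long maxWeight) { this.maxWeight = maxWeight; }

    List<V> get(K key) { return map.get(key); }

    void put(K key, List<V> rows) {
        List<V> old = map.put(key, rows);
        if (old != null) currentWeight -= old.size();
        currentWeight += rows.size();
        // Evict least-recently-used entries until the weight budget fits.
        Iterator<Map.Entry<K, List<V>>> it = map.entrySet().iterator();
        while (currentWeight > maxWeight && it.hasNext()) {
            Map.Entry<K, List<V>> eldest = it.next();
            if (eldest.getKey().equals(key)) continue; // keep the fresh entry
            currentWeight -= eldest.getValue().size();
            it.remove();
        }
    }

    int size() { return map.size(); }
}
```

Note how an entry holding an empty list occupies a cache slot but adds zero weight, matching the argument that filtered-out lookup keys can still be "cached" very cheaply.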
> >>>>>>>>>>>>>>>>>>>>>>>>>>> Flink SQL has provided a standard way to do
> >>>>> filters
> >>>>>>> and
> >>>>>>>>>>>>> projects
> >>>>>>>>>>>>>>>>>>>>>>>>>> pushdown, i.e., SupportsFilterPushDown and
> >>>>>>>>>>>>> SupportsProjectionPushDown.
> >>>>>>>>>>>>>>>>>>>>>>>>>>> Jdbc/hive/HBase haven't implemented the
> >>>>> interfaces,
> >>>>>>>>>>> don't
> >>>>>>>>>>>>> mean it's
> >>>>>>>>>>>>>>>>>>>>>>>> hard
> >>>>>>>>>>>>>>>>>>>>>>>>>> to implement.
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> It's debatable how difficult it will be to
> >>>>> implement
> >>>>>>>>>>> filter
> >>>>>>>>>>>>> pushdown.
> >>>>>>>>>>>>>>>>>>>>>>>>>> But I think the fact that currently there is
> >> no
> >>>>>>> database
> >>>>>>>>>>>>> connector
> >>>>>>>>>>>>>>>>>>>>>>>>>> with filter pushdown at least means that this
> >>>>> feature
> >>>>>>>>>>> won't
> >>>>>>>>>>>>> be
> >>>>>>>>>>>>>>>>>>>>>>>>>> supported soon in connectors. Moreover, if we
> >>>> talk
> >>>>>>> about
> >>>>>>>>>>>>> other
> >>>>>>>>>>>>>>>>>>>>>>>>>> connectors (not in Flink repo), their
> >> databases
> >>>>> might
> >>>>>>>>> not
> >>>>>>>>>>>>> support all
> >>>>>>>>>>>>>>>>>>>>>>>>>> Flink filters (or not support filters at
> >> all).
> >>> I
> >>>>>>> think
> >>>>>>>>>>>> users
> >>>>>>>>>>>>> are
> >>>>>>>>>>>>>>>>>>>>>>>>>> interested in supporting cache filters
> >>>> optimization
> >>>>>>>>>>>>> independently of
> >>>>>>>>>>>>>>>>>>>>>>>>>> supporting other features and solving more
> >>>> complex
> >>>>>>>>>>> problems
> >>>>>>>>>>>>> (or
> >>>>>>>>>>>>>>>>>>>>>>>>>> unsolvable at all).
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> 3) I agree with your third statement.
> >> Actually
> >>> in
> >>>>> our
> >>>>>>>>>>>>> internal version
> >>>>>>>>>>>>>>>>>>>>>>>>>> I also tried to unify the logic of scanning
> >> and
> >>>>>>>>> reloading
> >>>>>>>>>>>>> data from
> >>>>>>>>>>>>>>>>>>>>>>>>>> connectors. But unfortunately, I didn't find
> >> a
> >>>> way
> >>>>> to
> >>>>>>>>>>> unify
> >>>>>>>>>>>>> the logic
> >>>>>>>>>>>>>>>>>>>>>>>>>> of all ScanRuntimeProviders (InputFormat,
> >>>>>>>>> SourceFunction,
> >>>>>>>>>>>>> Source,...)
> >>>>>>>>>>>>>>>>>>>>>>>>>> and reuse it in reloading ALL cache. As a
> >>> result
> >>>> I
> >>>>>>>>>>> settled
> >>>>>>>>>>>>> on using
> >>>>>>>>>>>>>>>>>>>>>>>>>> InputFormat, because it was used for scanning
> >>> in
> >>>>> all
> >>>>>>>>>>> lookup
> >>>>>>>>>>>>>>>>>>>>>>>>>> connectors. (I didn't know that there are
> >> plans
> >>>> to
> >>>>>>>>>>>> deprecate
> >>>>>>>>>>>>>>>>>>>>>>>>>> InputFormat in favor of FLIP-27 Source). IMO
> >>>> usage
> >>>>> of
> >>>>>>>>>>>>> FLIP-27 source
> >>>>>>>>>>>>>>>>>>>>>>>>>> in ALL caching is not good idea, because this
> >>>>> source
> >>>>>>> was
> >>>>>>>>>>>>> designed to
> >>>>>>>>>>>>>>>>>>>>>>>>>> work in a distributed environment
> >>> (SplitEnumerator
> >>>> on
> >>>>>>>>>>>>> JobManager and
> >>>>>>>>>>>>>>>>>>>>>>>>>> SourceReaders on TaskManagers), not in one
> >>>> operator
> >>>>>>>>>>> (lookup
> >>>>>>>>>>>>> join
> >>>>>>>>>>>>>>>>>>>>>>>>>> operator in our case). There is even no
> >> direct
> >>>> way
> >>>>> to
> >>>>>>>>>>> pass
> >>>>>>>>>>>>> splits from
> >>>>>>>>>>>>>>>>>>>>>>>>>> SplitEnumerator to SourceReader (this logic
> >>> works
> >>>>>>>>> through
> >>>>>>>>>>>>>>>>>>>>>>>>>> SplitEnumeratorContext, which requires
> >>>>>>>>>>>>>>>>>>>>>>>>>> OperatorCoordinator.SubtaskGateway to send
> >>>>>>>>>>> AddSplitEvents).
> >>>>>>>>>>>>> Usage of
> >>>>>>>>>>>>>>>>>>>>>>>>>> InputFormat for the ALL cache seems much
> >>> clearer
> >>>>> and
> >>>>>>>>>>>>> easier. But if
> >>>>>>>>>>>>>>>>>>>>>>>>>> there are plans to refactor all connectors to
> >>>>>>> FLIP-27, I
> >>>>>>>>>>>>> have the
> >>>>>>>>>>>>>>>>>>>>>>>>>> following ideas: maybe we can drop the lookup
> >>>>>>>>>>>>>>>>>>>>>>>>>> join ALL cache in favor of a simple join with
> >>>>>>>>>>>>>>>>>>>>>>>>>> multiple scans of the batch source? The point
> >>>>>>>>>>>>>>>>>>>>>>>>>> is that the only difference between lookup
> >> join
> >>>> ALL
> >>>>>>>>> cache
> >>>>>>>>>>>>> and simple
> >>>>>>>>>>>>>>>>>>>>>>>>>> join with batch source is that in the first
> >>> case
> >>>>>>>>> scanning
> >>>>>>>>>>>> is
> >>>>>>>>>>>>> performed
> >>>>>>>>>>>>>>>>>>>>>>>>>> multiple times, in between which state
> >> (cache)
> >>> is
> >>>>>>>>> cleared
> >>>>>>>>>>>>> (correct me
> >>>>>>>>>>>>>>>>>>>>>>>>>> if I'm wrong). So what if we extend the
> >>>>>>> functionality of
> >>>>>>>>>>>>> simple join
> >>>>>>>>>>>>>>>>>>>>>>>>>> to support state reloading + extend the
> >>>>>>> functionality of
> >>>>>>>>>>>>> scanning
> >>>>>>>>>>>>>>>>>>>>>>>>>> batch source multiple times (this one should
> >> be
> >>>>> easy
> >>>>>>>>> with
> >>>>>>>>>>>>> new FLIP-27
> >>>>>>>>>>>>>>>>>>>>>>>>>> source, that unifies streaming/batch reading
> >> -
> >>> we
> >>>>>>> will
> >>>>>>>>>>> need
> >>>>>>>>>>>>> to change
> >>>>>>>>>>>>>>>>>>>>>>>>>> only SplitEnumerator, which will pass splits
> >>>> again
> >>>>>>> after
> >>>>>>>>>>>>> some TTL).
> >>>>>>>>>>>>>>>>>>>>>>>>>> WDYT? I must say that this looks like a
> >>> long-term
> >>>>>>> goal
> >>>>>>>>>>> and
> >>>>>>>>>>>>> will make
> >>>>>>>>>>>>>>>>>>>>>>>>>> the scope of this FLIP even larger than you
> >>> said.
> >>>>>>> Maybe
> >>>>>>>>>>> we
> >>>>>>>>>>>>> can limit
> >>>>>>>>>>>>>>>>>>>>>>>>>> ourselves to a simpler solution now
> >>>> (InputFormats).
> >>>>>>>>>>>>>>>>>>>>>>>>>>
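[Editor's illustration] The periodic-rescan behavior being discussed — reload the whole dimension table on a TTL, clearing the previous snapshot in between — can be sketched as follows. This is a hypothetical sketch, not Flink API; the `Supplier` stands in for whatever scan mechanism (InputFormat, a FLIP-27 Source, ...) actually produces the rows.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Map;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// Sketch of a periodic "ALL cache": the whole dimension table is re-read
// on a schedule and the in-memory snapshot is swapped atomically, so the
// old state is effectively "cleared" at each reload.
class AllCacheReloader<K, V> {
    private final Supplier<Map<K, V>> scanAll;
    private volatile Map<K, V> snapshot = Collections.emptyMap();

    AllCacheReloader(Supplier<Map<K, V>> scanAll) { this.scanAll = scanAll; }

    // Re-scan the source and replace the snapshot in one volatile write,
    // so concurrent lookups always see one consistent table version.
    void reload() { snapshot = Collections.unmodifiableMap(scanAll.get()); }

    V lookup(K key) { return snapshot.get(key); }

    // Schedule reloads at a fixed TTL, mirroring the "scan again after
    // some TTL" idea for a rescannable source.
    void scheduleReload(ScheduledExecutorService exec, Duration ttl) {
        exec.scheduleWithFixedDelay(this::reload, ttl.toMillis(), ttl.toMillis(), TimeUnit.MILLISECONDS);
    }
}
```

The only real difference from a regular join with a batch source, as the mail argues, is that the scan is repeated and the snapshot replaced between scans.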
> >>>>>>>>>>>>>>>>>>>>>>>>>> So to sum up, my points is like this:
> >>>>>>>>>>>>>>>>>>>>>>>>>> 1) There is a way to make both concise and
> >>>> flexible
> >>>>>>>>>>>>> interfaces for
> >>>>>>>>>>>>>>>>>>>>>>>>>> caching in lookup join.
> >>>>>>>>>>>>>>>>>>>>>>>>>> 2) Cache filters optimization is important
> >> both
> >>>> in
> >>>>>>> LRU
> >>>>>>>>>>> and
> >>>>>>>>>>>>> ALL caches.
> >>>>>>>>>>>>>>>>>>>>>>>>>> 3) It is unclear when filter pushdown will be
> >>>>>>> supported
> >>>>>>>>>>> in
> >>>>>>>>>>>>> Flink
> >>>>>>>>>>>>>>>>>>>>>>>>>> connectors, some of the connectors might not
> >>> have
> >>>>> the
> >>>>>>>>>>>>> opportunity to
> >>>>>>>>>>>>>>>>>>>>>>>>>> support filter pushdown + as I know,
> >> currently
> >>>>> filter
> >>>>>>>>>>>>> pushdown works
> >>>>>>>>>>>>>>>>>>>>>>>>>> only for scanning (not lookup). So cache
> >>> filters
> >>>> +
> >>>>>>>>>>>>> projections
> >>>>>>>>>>>>>>>>>>>>>>>>>> optimization should be independent from other
> >>>>>>> features.
> >>>>>>>>>>>>>>>>>>>>>>>>>> 4) The ALL cache implementation is a complex topic
> >>> that
> >>>>>>>>> involves
> >>>>>>>>>>>>> multiple
> >>>>>>>>>>>>>>>>>>>>>>>>>> aspects of how Flink is developing. Dropping
> >>>>>>>>>>>>>>>>>>>>>>>>>> InputFormat in favor of the FLIP-27 Source will
> >>>>>>>>>>>>>>>>>>>>>>>>>> make the ALL cache implementation really complex and
> >>>>>>>>>>>>>>>>>>>>>>>>>> not clear, so maybe instead we can extend the
> >>>>>>>>>>>>>>>>>>>>>>>>>> functionality of the simple join, or keep
> >>>>>>>>>>>>>>>>>>>>>>>>>> InputFormat for the lookup join ALL cache?
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> Best regards,
> >>>>>>>>>>>>>>>>>>>>>>>>>> Smirnov Alexander
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> [1]
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> https://guava.dev/releases/18.0/api/docs/com/google/common/cache/CacheBuilder.html#weigher(com.google.common.cache.Weigher)
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> On Thu, 5 May 2022 at 20:34, Jark Wu <
> >>>>> imjark@gmail.com
> >>>>>>>> :
> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>> It's great to see the active discussion! I
> >>> want
> >>>> to
> >>>>>>>>> share
> >>>>>>>>>>>> my
> >>>>>>>>>>>>> ideas:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>> 1) implement the cache in framework vs.
> >>>> connectors
> >>>>>>> base
> >>>>>>>>>>>>>>>>>>>>>>>>>>> I don't have a strong opinion on this. Both
> >>> ways
> >>>>>>> should
> >>>>>>>>>>>>> work (e.g.,
> >>>>>>>>>>>>>>>>>>>>>>>> cache
> >>>>>>>>>>>>>>>>>>>>>>>>>>> pruning, compatibility).
> >>>>>>>>>>>>>>>>>>>>>>>>>>> The framework way can provide more concise
> >>>>>>> interfaces.
> >>>>>>>>>>>>>>>>>>>>>>>>>>> The connector base way can define more
> >>> flexible
> >>>>>>> cache
> >>>>>>>>>>>>>>>>>>>>>>>>>>> strategies/implementations.
> >>>>>>>>>>>>>>>>>>>>>>>>>>> We are still investigating a way to see if
> >> we
> >>>> can
> >>>>>>> have
> >>>>>>>>>>>> both
> >>>>>>>>>>>>>>>>>>>>>>>> advantages.
> >>>>>>>>>>>>>>>>>>>>>>>>>>> We should reach a consensus that the way
> >>> should
> >>>>> be a
> >>>>>>>>>>> final
> >>>>>>>>>>>>> state,
> >>>>>>>>>>>>>>>>>>>>>>>> and we
> >>>>>>>>>>>>>>>>>>>>>>>>>>> are on the path to it.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>> 2) filters and projections pushdown:
> >>>>>>>>>>>>>>>>>>>>>>>>>>> I agree with Alex that the filter pushdown
> >>> into
> >>>>>>> cache
> >>>>>>>>>>> can
> >>>>>>>>>>>>> benefit a
> >>>>>>>>>>>>>>>>>>>>>>>> lot
> >>>>>>>>>>>>>>>>>>>>>>>>>> for
> >>>>>>>>>>>>>>>>>>>>>>>>>>> ALL cache.
> >>>>>>>>>>>>>>>>>>>>>>>>>>> However, this is not true for LRU cache.
> >>>>> Connectors
> >>>>>>> use
> >>>>>>>>>>>>> cache to
> >>>>>>>>>>>>>>>>>>>>>>>> reduce
> >>>>>>>>>>>>>>>>>>>>>>>>>> IO
> >>>>>>>>>>>>>>>>>>>>>>>>>>> requests to databases for better throughput.
> >>>>>>>>>>>>>>>>>>>>>>>>>>> If a filter can prune 90% of data in the
> >>> cache,
> >>>> we
> >>>>>>> will
> >>>>>>>>>>>>> have 90% of
> >>>>>>>>>>>>>>>>>>>>>>>>>> lookup
> >>>>>>>>>>>>>>>>>>>>>>>>>>> requests that can never be cached
> >>>>>>>>>>>>>>>>>>>>>>>>>>> and hit directly to the databases. That
> >> means
> >>>> the
> >>>>>>> cache
> >>>>>>>>>>> is
> >>>>>>>>>>>>>>>>>>>>>>>> meaningless in
> >>>>>>>>>>>>>>>>>>>>>>>>>>> this case.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>> IMO, Flink SQL has provided a standard way
> >> to
> >>> do
> >>>>>>>>> filters
> >>>>>>>>>>>>> and projects
> >>>>>>>>>>>>>>>>>>>>>>>>>>> pushdown, i.e., SupportsFilterPushDown and
> >>>>>>>>>>>>>>>>>>>>>>>> SupportsProjectionPushDown.
> >>>>>>>>>>>>>>>>>>>>>>>>>>> Jdbc/hive/HBase haven't implemented the
> >>>>> interfaces,
> >>>>>>>>>>> don't
> >>>>>>>>>>>>> mean it's
> >>>>>>>>>>>>>>>>>>>>>>>> hard
> >>>>>>>>>>>>>>>>>>>>>>>>>> to
> >>>>>>>>>>>>>>>>>>>>>>>>>>> implement.
> >>>>>>>>>>>>>>>>>>>>>>>>>>> They should implement the pushdown
> >> interfaces
> >>> to
> >>>>>>> reduce
> >>>>>>>>>>> IO
> >>>>>>>>>>>>> and the
> >>>>>>>>>>>>>>>>>>>>>>>> cache
> >>>>>>>>>>>>>>>>>>>>>>>>>>> size.
> >>>>>>>>>>>>>>>>>>>>>>>>>>> That should be a final state that the scan
> >>>> source
> >>>>>>> and
> >>>>>>>>>>>>> lookup source
> >>>>>>>>>>>>>>>>>>>>>>>> share
> >>>>>>>>>>>>>>>>>>>>>>>>>>> the exact pushdown implementation.
> >>>>>>>>>>>>>>>>>>>>>>>>>>> I don't see why we need to duplicate the
> >>>> pushdown
> >>>>>>> logic
> >>>>>>>>>>> in
> >>>>>>>>>>>>> caches,
> >>>>>>>>>>>>>>>>>>>>>>>> which
> >>>>>>>>>>>>>>>>>>>>>>>>>>> will complicate the lookup join design.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>
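[Editor's illustration] The point that "the scan source and lookup source share the exact pushdown implementation" can be illustrated with a toy in-memory source. The real Flink interfaces are `SupportsFilterPushDown` / `SupportsProjectionPushDown`; everything else in this sketch (class and method names) is made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Toy dimension-table source holding rows in memory. A predicate "pushed
// down" once is applied identically by the scan path (which would fill an
// ALL cache) and by the per-key lookup path, so cache contents and lookup
// results are pruned the same way without duplicating filter logic.
class FilterableDimSource {
    private final List<int[]> rows;                  // {key, value} pairs
    private Predicate<int[]> pushedFilter = r -> true;

    FilterableDimSource(List<int[]> rows) { this.rows = rows; }

    // The planner would call this once with the pushed-down predicate.
    void applyFilter(Predicate<int[]> filter) { this.pushedFilter = filter; }

    // Scan path: used to populate an ALL cache; pruned rows never enter it.
    List<int[]> scanAll() {
        List<int[]> out = new ArrayList<>();
        for (int[] r : rows) if (pushedFilter.test(r)) out.add(r);
        return out;
    }

    // Lookup path: the same predicate prunes the per-key result.
    List<int[]> lookup(int key) {
        List<int[]> out = new ArrayList<>();
        for (int[] r : rows) if (r[0] == key && pushedFilter.test(r)) out.add(r);
        return out;
    }
}
```

With one shared predicate, there is no separate "cache filter" to maintain, which is the argument against duplicating pushdown logic inside the cache layer.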
> >>>>>>>>>>>>>>>>>>>>>>>>>>> 3) ALL cache abstraction
> >>>>>>>>>>>>>>>>>>>>>>>>>>> All cache might be the most challenging part
> >>> of
> >>>>> this
> >>>>>>>>>>> FLIP.
> >>>>>>>>>>>>> We have
> >>>>>>>>>>>>>>>>>>>>>>>> never
> >>>>>>>>>>>>>>>>>>>>>>>>>>> provided a reload-lookup public interface.
> >>>>>>>>>>>>>>>>>>>>>>>>>>> Currently, we put the reload logic in the
> >>> "eval"
> >>>>>>> method
> >>>>>>>>>>> of
> >>>>>>>>>>>>>>>>>>>>>>>> TableFunction.
> >>>>>>>>>>>>>>>>>>>>>>>>>>> That's hard for some sources (e.g., Hive).
> >>>>>>>>>>>>>>>>>>>>>>>>>>> Ideally, connector implementation should
> >> share
> >>>> the
> >>>>>>>>> logic
> >>>>>>>>>>>> of
> >>>>>>>>>>>>> reload
> >>>>>>>>>>>>>>>>>>>>>>>> and
> >>>>>>>>>>>>>>>>>>>>>>>>>>> scan, i.e. ScanTableSource with
> >>>>>>>>>>>>> InputFormat/SourceFunction/FLIP-27
> >>>>>>>>>>>>>>>>>>>>>>>>>> Source.
> >>>>>>>>>>>>>>>>>>>>>>>>>>> However, InputFormat/SourceFunction are
> >>>>> deprecated,
> >>>>>>> and
> >>>>>>>>>>>> the
> >>>>>>>>>>>>> FLIP-27
> >>>>>>>>>>>>>>>>>>>>>>>>>> source
> >>>>>>>>>>>>>>>>>>>>>>>>>>> is deeply coupled with SourceOperator.
> >>>>>>>>>>>>>>>>>>>>>>>>>>> If we want to invoke the FLIP-27 source in
> >>>>>>> LookupJoin,
> >>>>>>>>>>>> this
> >>>>>>>>>>>>> may make
> >>>>>>>>>>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>>>>>>>>>> scope of this FLIP much larger.
> >>>>>>>>>>>>>>>>>>>>>>>>>>> We are still investigating how to abstract
> >> the
> >>>> ALL
> >>>>>>>>> cache
> >>>>>>>>>>>>> logic and
> >>>>>>>>>>>>>>>>>>>>>>>> reuse
> >>>>>>>>>>>>>>>>>>>>>>>>>>> the existing source interfaces.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>> Best,
> >>>>>>>>>>>>>>>>>>>>>>>>>>> Jark
> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>> On Thu, 5 May 2022 at 20:22, Roman Boyko <
> >>>>>>>>>>>>> ro.v.boyko@gmail.com>
> >>>>>>>>>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> It's a much more complicated activity and
> >>> lies
> >>>>> out
> >>>>>>> of
> >>>>>>>>>>> the
> >>>>>>>>>>>>> scope of
> >>>>>>>>>>>>>>>>>>>>>>>> this
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> improvement. Because such pushdowns should
> >> be
> >>>>> done
> >>>>>>> for
> >>>>>>>>>>>> all
> >>>>>>>>>>>>>>>>>>>>>>>>>> ScanTableSource
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> implementations (not only for Lookup ones).
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> On Thu, 5 May 2022 at 19:02, Martijn
> >> Visser <
> >>>>>>>>>>>>>>>>>>>>>>>> martijnvisser@apache.org>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi everyone,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> One question regarding "And Alexander
> >>>> correctly
> >>>>>>>>>>>> mentioned
> >>>>>>>>>>>>> that
> >>>>>>>>>>>>>>>>>>>>>>>> filter
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> pushdown still is not implemented for
> >>>>>>>>>>> jdbc/hive/hbase."
> >>>>>>>>>>>>> -> Would
> >>>>>>>>>>>>>>>>>>>>>>>> an
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> alternative solution be to actually
> >>> implement
> >>>>>>> these
> >>>>>>>>>>>> filter
> >>>>>>>>>>>>>>>>>>>>>>>> pushdowns?
> >>>>>>>>>>>>>>>>>>>>>>>>>> I
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> can
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> imagine that there are many more benefits
> >> to
> >>>>> doing
> >>>>>>>>>>> that,
> >>>>>>>>>>>>> outside
> >>>>>>>>>>>>>>>>>>>>>>>> of
> >>>>>>>>>>>>>>>>>>>>>>>>>> lookup
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> caching and metrics.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> Best regards,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> Martijn Visser
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> https://twitter.com/MartijnVisser82
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> https://github.com/MartijnVisser
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Thu, 5 May 2022 at 13:58, Roman Boyko <
> >>>>>>>>>>>>> ro.v.boyko@gmail.com>
> >>>>>>>>>>>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi everyone!
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for driving such a valuable
> >>>> improvement!
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I do think that single cache
> >> implementation
> >>>>>>> would be
> >>>>>>>>>>> a
> >>>>>>>>>>>>> nice
> >>>>>>>>>>>>>>>>>>>>>>>>>> opportunity
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> for
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> users. And it will break the "FOR
> >>> SYSTEM_TIME
> >>>>> AS
> >>>>>>> OF
> >>>>>>>>>>>>> proc_time"
> >>>>>>>>>>>>>>>>>>>>>>>>>> semantics
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> anyway - doesn't matter how it will be
> >>>>>>> implemented.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Putting myself in the user's shoes, I can
> >>> say
> >>>>>>> that:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 1) I would prefer to have the opportunity
> >>> to
> >>>>> cut
> >>>>>>> off
> >>>>>>>>>>>> the
> >>>>>>>>>>>>> cache
> >>>>>>>>>>>>>>>>>>>>>>>> size
> >>>>>>>>>>>>>>>>>>>>>>>>>> by
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> simply filtering unnecessary data. And
> >> the
> >>>> most
> >>>>>>>>> handy
> >>>>>>>>>>>>> way to do
> >>>>>>>>>>>>>>>>>>>>>>>> it
> >>>>>>>>>>>>>>>>>>>>>>>>>> is
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> apply
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> it inside LookupRunners. It would be a
> >> bit
> >>>>>>> harder to
> >>>>>>>>>>>>> pass it
> >>>>>>>>>>>>>>>>>>>>>>>>>> through the
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> LookupJoin node to TableFunction. And
> >>>> Alexander
> >>>>>>>>>>>> correctly
> >>>>>>>>>>>>>>>>>>>>>>>> mentioned
> >>>>>>>>>>>>>>>>>>>>>>>>>> that
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> filter pushdown still is not implemented
> >>> for
> >>>>>>>>>>>>> jdbc/hive/hbase.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 2) The ability to set the different
> >> caching
> >>>>>>>>>>> parameters
> >>>>>>>>>>>>> for
> >>>>>>>>>>>>>>>>>>>>>>>> different
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> tables
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> is quite important. So I would prefer to
> >>> set
> >>>> it
> >>>>>>>>>>> through
> >>>>>>>>>>>>> DDL
> >>>>>>>>>>>>>>>>>>>>>>>> rather
> >>>>>>>>>>>>>>>>>>>>>>>>>> than
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> have similar ttla, strategy and other
> >>> options
> >>>>> for
> >>>>>>>>> all
> >>>>>>>>>>>>> lookup
> >>>>>>>>>>>>>>>>>>>>>>>> tables.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 3) Providing the cache into the framework
> >>>>> really
> >>>>>>>>>>>>> deprives us of
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> extensibility (users won't be able to
> >>>> implement
> >>>>>>>>> their
> >>>>>>>>>>>> own
> >>>>>>>>>>>>>>>>>>>>>>>> cache).
> >>>>>>>>>>>>>>>>>>>>>>>>>> But
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> most
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> probably it might be solved by creating
> >>> more
> >>>>>>>>>>> different
> >>>>>>>>>>>>> cache
> >>>>>>>>>>>>>>>>>>>>>>>>>> strategies
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> and
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> a wider set of configurations.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> All these points are much closer to the
> >>>> schema
> >>>>>>>>>>> proposed
> >>>>>>>>>>>>> by
> >>>>>>>>>>>>>>>>>>>>>>>>>> Alexander.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Qingshen Ren, please correct me if I'm
> >> not
> >>>>> right
> >>>>>>> and
> >>>>>>>>>>>> all
> >>>>>>>>>>>>> these
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> facilities
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> might be simply implemented in your
> >>>>> architecture?
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Best regards,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Roman Boyko
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> e.: ro.v.boyko@gmail.com
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Wed, 4 May 2022 at 21:01, Martijn
> >>> Visser <
> >>>>>>>>>>>>>>>>>>>>>>>>>> martijnvisser@apache.org>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi everyone,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I don't have much to chip in, but just
> >>>> wanted
> >>>>> to
> >>>>>>>>>>>>> express that
> >>>>>>>>>>>>>>>>>>>>>>>> I
> >>>>>>>>>>>>>>>>>>>>>>>>>> really
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> appreciate the in-depth discussion on
> >> this
> >>>>> topic
> >>>>>>>>>>> and I
> >>>>>>>>>>>>> hope
> >>>>>>>>>>>>>>>>>>>>>>>> that
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> others
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> will join the conversation.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Best regards,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Martijn
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Tue, 3 May 2022 at 10:15, Александр
> >>>>> Смирнов <
> >>>>>>>>>>>>>>>>>>>>>>>>>> smiralexan@gmail.com>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
Hi Qingsheng, Leonard and Jark,

Thanks for your detailed feedback! However, I have questions about some of your statements (maybe I didn't get something?).

> Caching actually breaks the semantic of "FOR SYSTEM_TIME AS OF proc_time"

I agree that the semantics of "FOR SYSTEM_TIME AS OF proc_time" is not fully implemented with caching, but as you said, users accept this consciously to achieve better performance (no one proposed to enable caching by default, etc.). Or by users do you mean other developers of connectors? In that case developers explicitly specify whether their connector supports caching or not (in the list of supported options); no one makes them do that if they don't want to. So what exactly is the difference between implementing caching in the flink-table-runtime module and in flink-table-common from this point of view? How does it affect whether the semantics of "FOR SYSTEM_TIME AS OF proc_time" are broken or not?

> confront a situation that allows table options in DDL to control the behavior of the framework, which has never happened previously and should be cautious

If we talk about the main semantic difference between DDL options and config options ("table.exec.xxx"), isn't it about limiting the scope of the options, plus their importance for the user's business logic, rather than the specific location of the corresponding logic in the framework? I mean that in my design, for example, putting the lookup cache strategy into configuration would be the wrong decision, because it directly affects the user's business logic (not just performance optimization) and touches just several functions of ONE table (there can be multiple tables with different caches). Does it really matter for the user (or anyone else) where the logic affected by the applied option is located? Also I can recall the DDL option 'sink.parallelism', which in some way "controls the behavior of the framework", and I don't see any problem there.

> introduce a new interface for this all-caching scenario and the design would become more complex

This is a subject for a separate discussion, but actually in our internal version we solved this problem quite easily - we reused the InputFormat class (so there is no need for a new API). The point is that currently all lookup connectors use InputFormat for scanning the data in batch mode: HBase, JDBC and even Hive - it uses the class PartitionReader, which is actually just a wrapper around InputFormat. The advantage of this solution is the ability to reload cache data in parallel (the number of threads depends on the number of InputSplits, but has an upper limit). As a result, cache reload time is significantly reduced (as well as the time the input stream is blocked). I know that we usually try to avoid concurrency in Flink code, but maybe this one can be an exception. BTW I don't claim that it's an ideal solution; maybe there are better ones.
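The parallel reload idea above can be sketched in plain Java. This is only an illustration: `SplitLoader`, `reload`, and the integer split IDs are hypothetical stand-ins for Flink's InputFormat/InputSplit machinery, not the FLIP-221 API.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Sketch only: SplitLoader models "read one InputSplit of the dimension table".
interface SplitLoader {
    Map<String, String> loadSplit(int splitId);
}

class ParallelCacheReload {

    // Reload the whole cache by reading all splits concurrently.
    // Thread count follows the number of splits, capped by maxThreads.
    static Map<String, String> reload(List<Integer> splits, SplitLoader loader, int maxThreads) {
        Map<String, String> cache = new ConcurrentHashMap<>();
        ExecutorService pool =
                Executors.newFixedThreadPool(Math.min(splits.size(), maxThreads));
        for (int splitId : splits) {
            pool.submit(() -> cache.putAll(loader.loadSplit(splitId)));
        }
        pool.shutdown();
        try {
            pool.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return cache;
    }
}
```

With N splits the wall-clock reload time drops toward the time of the slowest split instead of the sum of all splits, which is the effect described above for the 'ALL' cache.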
> Providing the cache in the framework might introduce compatibility issues

That's possible only when the developer of the connector doesn't properly refactor his code and uses the new cache options incorrectly (i.e. explicitly provides the same options in two different code places). For correct behavior all he needs to do is redirect the existing options to the framework's LookupConfig (plus maybe add an alias for options, if the naming differed), and everything will be transparent for users. If the developer doesn't do any refactoring, nothing will change for the connector because of backward compatibility. Also, if a developer wants to use his own cache logic, he can simply refuse to pass some of the configs into the framework and instead provide his own implementation with the already existing configs and metrics (but actually I think that's a rare case).

> filters and projections should be pushed all the way down to the table function, like what we do in the scan source

That's a great goal. But the truth is that the ONLY connector that supports filter pushdown is FileSystemTableSource (no database connector supports it currently). Also, for some databases it's simply impossible to push down filters as complex as the ones we have in Flink.

> only applying these optimizations to the cache seems not quite useful

Filters can cut off an arbitrarily large amount of data from the dimension table. For a simple example, suppose in the dimension table 'users' we have a column 'age' with values from 20 to 40, and an input stream 'clicks' that is roughly uniformly distributed by age of users. If we have the filter 'age > 30', there will be half as much data in the cache. This means the user can increase 'lookup.cache.max-rows' by almost 2 times, which will give a huge performance boost. Moreover, this optimization starts to really shine with the 'ALL' cache, where tables without filters and projections can't fit in memory, but with them they can. This opens up additional possibilities for users. And that doesn't sound like 'not quite useful'.
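The arithmetic above can be illustrated by applying the (non-pushed-down) filter before rows enter the cache. The `User` record and `populate` helper below are made up for the example and are not Flink API.

```java
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;
import java.util.stream.Collectors;

class FilteredCachePopulation {

    // Toy stand-in for a row of the 'users' dimension table.
    record User(int id, int age) {}

    // Apply the lookup join's residual filter before caching, so rows that
    // can never match the join are not stored at all.
    static Map<Integer, User> populate(List<User> fetched, Predicate<User> filter) {
        return fetched.stream()
                .filter(filter)
                .collect(Collectors.toMap(User::id, u -> u));
    }
}
```

For 20 fetched users with ages uniformly spread over 21..40, the filter `age > 30` leaves 10 rows in the cache, i.e. the same memory budget now covers twice as many distinct keys.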
It would be great to hear other voices on this topic! We have quite a lot of controversial points, and I think with the help of others it will be easier for us to come to a consensus.

Best regards,
Smirnov Alexander
On Fri, 29 Apr 2022 at 22:33, Qingsheng Ren <renqschn@gmail.com> wrote:
Hi Alexander and Arvid,

Thanks for the discussion, and sorry for my late response! We had an internal discussion together with Jark and Leonard, and I'd like to summarize our ideas. Instead of implementing the cache logic in the table runtime layer or wrapping around the user-provided table function, we prefer to introduce some new APIs extending TableFunction, with these concerns:

1. Caching actually breaks the semantic of "FOR SYSTEM_TIME AS OF proc_time", because it can't truly reflect the content of the lookup table at the moment of querying. If users choose to enable caching on the lookup table, they implicitly indicate that this breakage is acceptable in exchange for the performance. So we prefer not to provide caching at the table runtime level.

2. If we put the cache implementation in the framework (whether in a runner or a wrapper around TableFunction), we have to confront a situation that allows table options in DDL to control the behavior of the framework, which has never happened previously and should be treated cautiously. Under the current design the behavior of the framework should only be specified by configurations ("table.exec.xxx"), and it's hard to apply these general configs to a specific table.

3. We have use cases where the lookup source loads and refreshes all records periodically into memory to achieve high lookup performance (like the Hive connector in the community; this is also widely used by our internal connectors). Wrapping the cache around the user's TableFunction works fine for LRU caches, but I think we would have to introduce a new interface for this all-caching scenario, and the design would become more complex.

4. Providing the cache in the framework might introduce compatibility issues to existing lookup sources: there might exist two caches with totally different strategies if the user incorrectly configures the table (one in the framework and another implemented by the lookup source).

As for the optimization mentioned by Alexander, I think filters and projections should be pushed all the way down to the table function, like what we do in the scan source, instead of into the runner with the cache. The goal of using a cache is to reduce the network I/O and the pressure on the external system, and only applying these optimizations to the cache seems not quite useful.

I made some updates to the FLIP[1] to reflect our ideas. We prefer to keep the cache implementation as a part of TableFunction, and we could provide some helper classes (CachingTableFunction, AllCachingTableFunction, CachingAsyncTableFunction) to developers and regulate the metrics of the cache. Also, I made a POC[2] for your reference.

Looking forward to your ideas!

[1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-221+Abstraction+for+lookup+source+cache+and+metric
[2] https://github.com/PatrickRen/flink/tree/FLIP-221

Best regards,

Qingsheng
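The "caching as part of the TableFunction" helper-class idea could look roughly like the sketch below: subclasses implement the miss path and the base class owns an LRU cache. All names here are invented for illustration; the actual API is in the FLIP page and the POC branch linked above.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch of a base class that keeps caching inside the table function.
abstract class CachingLookupSketch<K, V> {

    private final Map<K, List<V>> cache;

    CachingLookupSketch(int maxRows) {
        // Access-ordered LinkedHashMap gives a simple LRU policy capped
        // at maxRows, mimicking a 'lookup.cache.max-rows' option.
        this.cache = new LinkedHashMap<>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, List<V>> eldest) {
                return size() > maxRows;
            }
        };
    }

    // Miss path: fetch the matching rows from the external system.
    protected abstract List<V> lookup(K key);

    // Hit path first; delegate to lookup() on a miss and remember the result.
    public List<V> eval(K key) {
        List<V> rows = cache.get(key);
        if (rows == null) {
            rows = lookup(key);
            cache.put(key, rows);
        }
        return rows;
    }
}
```

Because the cache lives behind the abstract class, the framework can also regulate cache metrics (hits, misses, size) in one place, which is one of the stated goals of the FLIP.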
On Tue, Apr 26, 2022 at 4:45 PM Александр Смирнов <smiralexan@gmail.com> wrote:
Thanks for the response, Arvid!

I have a few comments on your message.

> but could also live with an easier solution as the first step

I think that these 2 ways (the one originally proposed by Qingsheng and mine) are mutually exclusive, because conceptually they follow the same goal, but the implementation details are different. If we go one way, moving to the other way in the future will mean deleting existing code and once again changing the API for connectors. So I think we should reach a consensus with the community about that and then work together on this FLIP, i.e. divide the work into tasks for different parts of the FLIP (for example, LRU cache unification / introducing the proposed set of metrics / further work…). WDYT, Qingsheng?

> as the source will only receive the requests after filter

Actually, if filters are applied to fields of the lookup table, we first must do the requests, and only after that can we filter the responses, because lookup connectors don't have filter pushdown. So if filtering is done before caching, there will be far fewer rows in the cache.

> @Alexander unfortunately, your architecture is not shared. I don't know the solution to share images to be honest.

Sorry for that, I'm a bit new to such kinds of conversations :) I have no write access to Confluence, so I made a Jira issue where I described the proposed changes in more detail - https://issues.apache.org/jira/browse/FLINK-27411.

I'll be happy to get more feedback!

Best,
Smirnov Alexander
On Mon, 25 Apr 2022 at 19:49, Arvid Heise <arvid@apache.org> wrote:
Hi Qingsheng,

Thanks for driving this; the inconsistency was not satisfying for me.

I second Alexander's idea, but could also live with an easier solution as the first step: instead of making caching an implementation detail of TableFunction X, rather devise a caching layer around X. So the proposal would be a CachingTableFunction that delegates to X in case of misses and otherwise manages the cache. Lifting it into the operator model as proposed would be even better, but is probably unnecessary in the first step for a lookup source (as the source will only receive the requests after filter; applying projection may be more interesting to save memory).

Another advantage is that all the changes of this FLIP would be limited to options; there would be no need for new public interfaces. Everything else remains an implementation detail of the Table runtime. That means we can easily incorporate the optimization potential that Alexander pointed out later.

@Alexander unfortunately, your architecture is not shared. I don't know the solution to share images to be honest.
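The caching layer Arvid describes is composition rather than inheritance: a wrapper around an existing lookup function X that serves hits locally and delegates misses to X. A minimal sketch, with illustrative names only (this is not the actual Flink API):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Sketch of a cache wrapped *around* an existing lookup function X.
class CachingWrapperSketch<K, V> {

    private final Function<K, List<V>> inner; // the wrapped function X
    private final Map<K, List<V>> cache = new HashMap<>();

    CachingWrapperSketch(Function<K, List<V>> inner) {
        this.inner = inner;
    }

    // Serve hits from the cache; delegate misses to X.
    public List<V> lookup(K key) {
        return cache.computeIfAbsent(key, inner);
    }
}
```

Since X is untouched, this variant needs no new public interface on the connector side, which is the advantage Arvid points out; the trade-off is that an all-caching (periodic full reload) scenario does not fit this shape as naturally.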
On Fri, Apr 22, 2022 at 5:04 PM Александр Смирнов <smiralexan@gmail.com> wrote:
Hi Qingsheng! My name is Alexander; I'm not a committer yet, but I'd really like to become one, and this FLIP really interested me. Actually, I have worked on a similar feature in my company's Flink fork, and we would like to share our thoughts on this and make the code open source.

I think there is a better alternative than introducing an abstract class for TableFunction (CachingTableFunction). As you know, TableFunction lives in the flink-table-common module, which provides only an API for working with tables – that makes it very convenient to import in connectors. In turn, CachingTableFunction contains logic for runtime execution, so this class and everything connected with it should be located in another module, probably flink-table-runtime. But that would require connectors to depend on another module that contains a lot of runtime logic, which doesn't sound good.

I suggest adding a new method 'getLookupConfig'
> >>>>>>>>>>>> to
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> LookupTableSource
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> or LookupRuntimeProvider to allow
> >>>>>>> connectors
> >>>>>>>>> to
> >>>>>>>>>>>>> only
> >>>>>>>>>>>>>>>>>>>>>>>> pass
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> configurations to the planner,
> >>>> therefore
> >>>>>>> they
> >>>>>>>>>>>> won’t
> >>>>>>>>>>>>>>>>>>>>>>>>>> depend on
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> runtime
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> realization. Based on these configs
> >>>>> planner
> >>>>>>>>>>> will
> >>>>>>>>>>>>>>>>>>>>>>>>>> construct a
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> lookup
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> join operator with corresponding
> >>>> runtime
> >>>>>>> logic
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> (ProcessFunctions
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> in
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> module flink-table-runtime).
> >>>> Architecture
> >>>>>>>>> looks
> >>>>>>>>>>>>> like
> >>>>>>>>>>>>>>>>>>>>>>>> in
> >>>>>>>>>>>>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> pinned
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> image (LookupConfig class there is
> >>>>> actually
> >>>>>>>>>>> yours
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> CacheConfig).
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Classes in flink-table-planner,
> >> that
> >>>> will
> >>>>>>> be
> >>>>>>>>>>>>>>>>>>>>>>>> responsible
> >>>>>>>>>>>>>>>>>>>>>>>>>> for
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> this
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> –
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> CommonPhysicalLookupJoin and his
> >>>>>>> inheritors.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Current classes for lookup join in
> >>>>>>>>>>>>>>>>>>>>>>>> flink-table-runtime
> >>>>>>>>>>>>>>>>>>>>>>>>>> -
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> LookupJoinRunner,
> >>>> AsyncLookupJoinRunner,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> LookupJoinRunnerWithCalc,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> AsyncLookupJoinRunnerWithCalc.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I suggest adding classes
> >>>>>>>>>>> LookupJoinCachingRunner,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> LookupJoinCachingRunnerWithCalc,
> >> etc.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> And here comes another more
> >> powerful
> >>>>>>> advantage
> >>>>>>>>>>> of
> >>>>>>>>>>>>>>>>>>>>>>>> such a
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> solution.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> If
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> we have caching logic on a lower
> >>> level,
> >>>>> we
> >>>>>>> can
> >>>>>>>>>>>>> apply
> >>>>>>>>>>>>>>>>>>>>>>>> some
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> optimizations to it.
> >>>>>>> LookupJoinRunnerWithCalc
> >>>>>>>>>>> was
> >>>>>>>>>>>>>>>>>>>>>>>> named
> >>>>>>>>>>>>>>>>>>>>>>>>>> like
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> this
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> because it uses the ‘calc’
> >> function,
> >>>>> which
> >>>>>>>>>>>> actually
> >>>>>>>>>>>>>>>>>>>>>>>>>> mostly
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> consists
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> of
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> filters and projections.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> For example, in join table A with
> >>>> lookup
> >>>>>>> table
> >>>>>>>>>>> B
> >>>>>>>>>>>>>>>>>>>>>>>>>> condition
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> ‘JOIN …
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> ON
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> A.id = B.id AND A.age = B.age + 10
> >>>> WHERE
> >>>>>>>>>>>> B.salary >
> >>>>>>>>>>>>>>>>>>>>>>>> 1000’
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> ‘calc’
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> function will contain filters
> >> A.age =
> >>>>>>> B.age +
> >>>>>>>>>>> 10
> >>>>>>>>>>>>> and
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> B.salary >
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 1000.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> If we apply this function before
> >>>> storing
> >>>>>>>>>>> records
> >>>>>>>>>>>> in
> >>>>>>>>>>>>>>>>>>>>>>>>>> cache,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> size
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> of
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> cache will be significantly
> >> reduced:
> >>>>>>> filters =
> >>>>>>>>>>>>> avoid
> >>>>>>>>>>>>>>>>>>>>>>>>>> storing
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> useless
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> records in cache, projections =
> >>> reduce
> >>>>>>>>> records’
> >>>>>>>>>>>>>>>>>>>>>>>> size. So
> >>>>>>>>>>>>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> initial
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> max number of records in cache can
> >> be
> >>>>>>>>> increased
> >>>>>>>>>>>> by
> >>>>>>>>>>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>>>>>>>>> user.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> What do you think about it?
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On 2022/04/19 02:47:11 Qingsheng
> >> Ren
> >>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi devs,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Yuan and I would like to start a
> >>>>>>> discussion
> >>>>>>>>>>>> about
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> FLIP-221[1],
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> which
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> introduces an abstraction of lookup
> >>>> table
> >>>>>>>>> cache
> >>>>>>>>>>>> and
> >>>>>>>>>>>>>>>>>>>>>>>> its
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> standard
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> metrics.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Currently each lookup table source
> >>>>> should
> >>>>>>>>>>>>> implement
> >>>>>>>>>>>>>>>>>>>>>>>>>> their
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> own
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> cache to
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> store lookup results, and there
> >>> isn’t a
> >>>>>>>>>>> standard
> >>>>>>>>>>>> of
> >>>>>>>>>>>>>>>>>>>>>>>>>> metrics
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> for
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> users and
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> developers to tuning their jobs
> >> with
> >>>>> lookup
> >>>>>>>>>>>> joins,
> >>>>>>>>>>>>>>>>>>>>>>>> which
> >>>>>>>>>>>>>>>>>>>>>>>>>> is a
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> quite
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> common
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> use case in Flink table / SQL.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Therefore we propose some new APIs
> >>>>>>> including
> >>>>>>>>>>>>> cache,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> metrics,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> wrapper
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> classes of TableFunction and new
> >>> table
> >>>>>>>>> options.
> >>>>>>>>>>>>>>>>>>>>>>>> Please
> >>>>>>>>>>>>>>>>>>>>>>>>>> take a
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> look
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> at the
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> FLIP page [1] to get more details.
> >>> Any
> >>>>>>>>>>>> suggestions
> >>>>>>>>>>>>>>>>>>>>>>>> and
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> comments
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> would be
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> appreciated!
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> [1]
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>
> >>>>>>>
> >>>>>
> >>>>
> >>>
> >>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-221+Abstraction+for+lookup+source+cache+and+metric
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Best regards,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Qingsheng
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> --
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Best Regards,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Qingsheng Ren
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Real-time Computing Team
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Alibaba Cloud
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Email: renqschn@gmail.com
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> --
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> Best regards,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> Roman Boyko
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> e.: ro.v.boyko@gmail.com
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> --
> >>>>>>>>>>>>>>>>>>>>> Best Regards,
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> Qingsheng Ren
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> Real-time Computing Team
> >>>>>>>>>>>>>>>>>>>>> Alibaba Cloud
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> Email: renqschn@gmail.com
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
>
>
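The calc-before-cache idea described in the quoted message above can be sketched as follows. This is a minimal illustration with hypothetical names — `Row` and `CachingLookup` are not part of FLIP-221 or the Flink API; the point is only that applying the join's filters before insertion keeps never-matching rows out of the cache.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical sketch: the 'calc' filter (e.g. B.salary > 1000 from the join
// condition) is applied before rows enter the cache, so rows that can never
// match the join are never stored.
public class CachingLookup {
    static class Row {
        final int id;
        final int salary;
        Row(int id, int salary) { this.id = id; this.salary = salary; }
    }

    private final Map<Integer, List<Row>> cache = new HashMap<>();
    private final Predicate<Row> calcFilter;

    CachingLookup(Predicate<Row> calcFilter) {
        this.calcFilter = calcFilter;
    }

    // Store only the rows that pass the pushed-down filter.
    void put(int key, List<Row> lookedUpRows) {
        List<Row> filtered = new ArrayList<>();
        for (Row r : lookedUpRows) {
            if (calcFilter.test(r)) {
                filtered.add(r);
            }
        }
        cache.put(key, filtered);
    }

    int cachedRowCount() {
        return cache.values().stream().mapToInt(List::size).sum();
    }

    public static void main(String[] args) {
        CachingLookup lookup = new CachingLookup(r -> r.salary > 1000);
        lookup.put(1, List.of(new Row(1, 500), new Row(1, 2000)));
        // Only the row with salary 2000 passes the filter and is cached.
        System.out.println(lookup.cachedRowCount()); // prints 1
    }
}
```

With the same memory budget, fewer (and smaller, once projections are also applied) cached rows mean the user can raise the configured maximum row count.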

Re: [DISCUSS] FLIP-221 Abstraction for lookup source cache and metric

Posted by Qingsheng Ren <re...@gmail.com>.
Hi Jingsong,

Thanks for your comments! 

> AllCache definition is not flexible, for example, PartialCache can use any custom storage, while the AllCache can not, AllCache can also be considered to store memory or disk, also need a flexible strategy.

We had an offline discussion with Jark and Leonard. Basically, we think that exposing the interface of the full cache storage to connector developers might limit our future optimizations. The storage for full caching shouldn't have many variations across different lookup tables, so making it pluggable might not help much. It is also not easy for connector developers to implement such an optimized storage themselves. We can keep optimizing this storage in the future, and all full-caching lookup tables will benefit from it.
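
As a rough sketch of what a framework-owned full cache could look like (all names here are hypothetical, not the FLIP-221 API): the reload builds a fresh snapshot and swaps it in atomically, so lookups never observe a half-loaded table.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

// Hypothetical sketch of a framework-owned full cache. A reload strategy
// (e.g. a ScheduledExecutorService) calls reload(), which scans the whole
// table and atomically replaces the snapshot used by lookups.
public class FullCache {
    private final AtomicReference<Map<Integer, String>> snapshot =
            new AtomicReference<>(new HashMap<>());
    private final Supplier<Map<Integer, String>> scanSource;

    FullCache(Supplier<Map<Integer, String>> scanSource) {
        this.scanSource = scanSource;
    }

    // Build a fresh copy, then swap it in; readers see either the old or the
    // new snapshot, never a partially loaded one.
    void reload() {
        snapshot.set(new HashMap<>(scanSource.get()));
    }

    String lookup(int key) {
        return snapshot.get().get(key);
    }

    public static void main(String[] args) {
        Map<Integer, String> table = new HashMap<>();
        FullCache cache = new FullCache(() -> table);
        table.put(1, "a");
        cache.reload();
        System.out.println(cache.lookup(1)); // prints a
    }
}
```

Keeping this storage inside the framework (rather than behind a pluggable interface) is what allows later swapping the HashMap for, say, a spillable or compressed structure without touching any connector.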

> We are more inclined to deprecate the connector `async` option when discussing FLIP-234. Can we remove this option from this FLIP?

Thanks for the reminder! This option has been removed in the latest version. 

Best regards,

Qingsheng 


> On Jun 1, 2022, at 15:28, Jingsong Li <ji...@gmail.com> wrote:
> 
> Thanks Alexander for your reply. We can discuss the new interface when it
> comes out.
> 
> We are more inclined to deprecate the connector `async` option when
> discussing FLIP-234 [1]. We should use hint to let planner decide.
> Although the discussion has not yet produced a conclusion, can we remove
> this option from this FLIP? It doesn't seem to be related to this FLIP, but
> more to FLIP-234, and we can form a conclusion over there.
> 
> [1] https://lists.apache.org/thread/9k1sl2519kh2n3yttwqc00p07xdfns3h
> 
> Best,
> Jingsong
> 
> On Wed, Jun 1, 2022 at 4:59 AM Jing Ge <ji...@ververica.com> wrote:
> 
>> Hi Jark,
>> 
>> Thanks for clarifying it. It would be fine as long as we could provide the
>> no-cache solution. I was just wondering if the client-side cache could
>> really help when HBase is used, since the data to look up should be huge.
>> Depending on how much data will be cached on the client side, the data that
>> should be LRU in e.g. LruBlockCache will not be LRU anymore. In the worst
>> case scenario, once the cached data at the client side is expired, the
>> request will hit disk, which will cause extra latency temporarily, if I am
>> not mistaken.
>> 
>> Best regards,
>> Jing
>> 
>> On Mon, May 30, 2022 at 9:59 AM Jark Wu <im...@gmail.com> wrote:
>> 
>>> Hi Jing Ge,
>>> 
>>> What do you mean by the "impact on the block cache used by HBase"?
>>> In my understanding, the connector cache and HBase cache are totally two
>>> things.
>>> The connector cache is a local/client cache, and the HBase cache is a
>>> server cache.
>>> 
>>>> does it make sense to have a no-cache solution as one of the
>>> default solutions so that customers will have no effort for the migration
>>> if they want to stick with Hbase cache
>>> 
>>> The implementation migration should be transparent to users. Take the
>> HBase
>>> connector as
>>> an example,  it already supports lookup cache but is disabled by default.
>>> After migration, the
>>> connector still disables cache by default (i.e. no-cache solution). No
>>> migration effort for users.
>>> 
>>> HBase cache and connector cache are two different things. The HBase cache
>>> can't simply replace the connector cache, because one of the most
>>> important purposes of the connector cache is reducing I/O
>>> requests/responses and improving throughput, which cannot be achieved by
>>> just using a server cache.
>>> 
>>> Best,
>>> Jark
>>> 
>>> 
>>> 
>>> 
>>> On Fri, 27 May 2022 at 22:42, Jing Ge <ji...@ververica.com> wrote:
>>> 
>>>> Thanks all for the valuable discussion. The new feature looks very
>>>> interesting.
>>>> 
>>>> According to the FLIP description: "*Currently we have JDBC, Hive and
>>>> HBase connector implemented lookup table source. All existing
>>>> implementations will be migrated to the current design and the migration
>>>> will be transparent to end users*." I was only wondering if we should
>>>> pay attention to HBase and similar DBs. Since the lookup data will
>>>> commonly be huge when using HBase, partial caching will be used in this
>>>> case, if I am not mistaken, which might have an impact on the block
>>>> cache used by HBase, e.g. LruBlockCache.
>>>> Another question: since HBase provides a sophisticated cache solution,
>>>> does it make sense to have a no-cache solution as one of the default
>>>> solutions, so that customers will have no migration effort if they want
>>>> to stick with the HBase cache?
>>>> 
>>>> Best regards,
>>>> Jing
>>>> 
>>>> On Fri, May 27, 2022 at 11:19 AM Jingsong Li <ji...@gmail.com>
>>>> wrote:
>>>> 
>>>>> Hi all,
>>>>> 
>>>>> I think the problems now are as follows:
>>>>> 1. The AllCache and PartialCache interfaces are not uniform: one needs
>>>>> to provide a LookupProvider, while the other needs to provide a
>>>>> CacheBuilder.
>>>>> 2. The AllCache definition is not flexible. For example, PartialCache
>>>>> can use any custom storage while AllCache cannot; AllCache could also
>>>>> store to memory or disk, so it also needs a flexible strategy.
>>>>> 3. AllCache cannot customize its ReloadStrategy; currently there is
>>>>> only ScheduledReloadStrategy.
>>>>> 
>>>>> In order to solve the above problems, the following are my ideas.
>>>>> 
>>>>> ## Top level cache interfaces:
>>>>> 
>>>>> ```
>>>>> 
>>>>> public interface CacheLookupProvider extends
>>>>> LookupTableSource.LookupRuntimeProvider {
>>>>> 
>>>>>    CacheBuilder createCacheBuilder();
>>>>> }
>>>>> 
>>>>> 
>>>>> public interface CacheBuilder {
>>>>>    Cache create();
>>>>> }
>>>>> 
>>>>> 
>>>>> public interface Cache {
>>>>> 
>>>>>    /**
>>>>>     * Returns the value associated with key in this cache, or null
>> if
>>>>> there is no cached value for
>>>>>     * key.
>>>>>     */
>>>>>    @Nullable
>>>>>    Collection<RowData> getIfPresent(RowData key);
>>>>> 
>>>>>    /** Returns the number of key-value mappings in the cache. */
>>>>>    long size();
>>>>> }
>>>>> 
>>>>> ```
>>>>> 
>>>>> ## Partial cache
>>>>> 
>>>>> ```
>>>>> 
>>>>> public interface PartialCacheLookupFunction extends
>>> CacheLookupProvider {
>>>>> 
>>>>>    @Override
>>>>>    PartialCacheBuilder createCacheBuilder();
>>>>> 
>>>>>     /** Creates an {@link LookupFunction} instance. */
>>>>>     LookupFunction createLookupFunction();
>>>>> }
>>>>> 
>>>>> 
>>>>> public interface PartialCacheBuilder extends CacheBuilder {
>>>>> 
>>>>>    PartialCache create();
>>>>> }
>>>>> 
>>>>> 
>>>>> public interface PartialCache extends Cache {
>>>>> 
>>>>>    /**
>>>>>     * Associates the specified value rows with the specified key row
>>>>> in the cache. If the cache
>>>>>     * previously contained value associated with the key, the old
>>>>> value is replaced by the
>>>>>     * specified value.
>>>>>     *
>>>>>     * @return the previous value rows associated with key, or null
>> if
>>>>> there was no mapping for key.
>>>>>     * @param key - key row with which the specified value is to be
>>>>> associated
>>>>>     * @param value – value rows to be associated with the specified
>>> key
>>>>>     */
>>>>>    Collection<RowData> put(RowData key, Collection<RowData> value);
>>>>> 
>>>>>    /** Discards any cached value for the specified key. */
>>>>>    void invalidate(RowData key);
>>>>> }
>>>>> 
>>>>> ```
>>>>> 
>>>>> ## All cache
>>>>> ```
>>>>> 
>>>>> public interface AllCacheLookupProvider extends CacheLookupProvider {
>>>>> 
>>>>>    void registerReloadStrategy(ScheduledExecutorService
>>>>> executorService, Reloader reloader);
>>>>> 
>>>>>    ScanTableSource.ScanRuntimeProvider getScanRuntimeProvider();
>>>>> 
>>>>>    @Override
>>>>>    AllCacheBuilder createCacheBuilder();
>>>>> }
>>>>> 
>>>>> 
>>>>> public interface AllCacheBuilder extends CacheBuilder {
>>>>> 
>>>>>    AllCache create();
>>>>> }
>>>>> 
>>>>> 
>>>>> public interface AllCache extends Cache {
>>>>> 
>>>>>    void putAll(Iterator<Map<RowData, RowData>> allEntries);
>>>>> 
>>>>>    void clearAll();
>>>>> }
>>>>> 
>>>>> 
>>>>> public interface Reloader {
>>>>> 
>>>>>    void reload();
>>>>> }
>>>>> 
>>>>> ```
>>>>> 
>>>>> Best,
>>>>> Jingsong
>>>>> 
>>>>> On Fri, May 27, 2022 at 11:10 AM Jingsong Li <jingsonglee0@gmail.com
>>> 
>>>>> wrote:
>>>>> 
>>>>>> Thanks Qingsheng and all for your discussion.
>>>>>> 
>>>>>> Very sorry to jump in so late.
>>>>>> 
>>>>>> Maybe I missed something? My first impression when I saw the cache
>>>>>> interface was: why don't we provide an interface similar to Guava
>>>>>> cache [1]? On top of Guava cache, Caffeine also adds extensions for
>>>>>> asynchronous calls [2]. There is also bulk loading in Caffeine.
>>>>>> 
>>>>>> I am also confused about why we go first from
>>>>>> LookupCacheFactory.Builder and then to a Factory in order to create
>>>>>> the Cache.
>>>>>> 
>>>>>> [1] https://github.com/google/guava
>>>>>> [2] https://github.com/ben-manes/caffeine/wiki/Population
>>>>>> 
>>>>>> Best,
>>>>>> Jingsong
>>>>>> 
>>>>>> On Thu, May 26, 2022 at 11:17 PM Jark Wu <im...@gmail.com> wrote:
>>>>>> 
>>>>>>> After looking at the newly introduced ReloadTime and Becket's comment,
>>>>>>> I agree with Becket that we should have a pluggable reloading strategy.
>>>>>>> I agree with Becket we should have a pluggable reloading strategy.
>>>>>>> We can provide some common implementations, e.g., periodic reloading
>>>>>>> and daily reloading. But there will definitely be some connector- or
>>>>>>> business-specific reloading strategies, e.g. being notified by a
>>>>>>> ZooKeeper watcher, or reloading once a new Hive partition is complete.
>>>>>>> 
>>>>>>> Best,
>>>>>>> Jark
>>>>>>> 
>>>>>>> On Thu, 26 May 2022 at 11:52, Becket Qin <be...@gmail.com>
>>>> wrote:
>>>>>>> 
>>>>>>>> Hi Qingsheng,
>>>>>>>> 
>>>>>>>> Thanks for updating the FLIP. A few comments / questions below:
>>>>>>>> 
>>>>>>>> 1. Is there a reason that we have both "XXXFactory" and
>>>>>>>> "XXXProvider"? What is the difference between them? If they are the
>>>>>>>> same, can we just use XXXFactory everywhere?
>>>>>>>> 
>>>>>>>> 2. Regarding the FullCachingLookupProvider, should the reloading
>>>>>>>> policy also be pluggable? Periodic reloading can sometimes be tricky
>>>>>>>> in practice. For example, if a user uses 24 hours as the cache
>>>>>>>> refresh interval and some nightly batch job is delayed, the cache
>>>>>>>> update may still see stale data.
>>>>>>>> 
>>>>>>>> 3. In DefaultLookupCacheFactory, it looks like InitialCapacity
>>>> should
>>>>> be
>>>>>>>> removed.
>>>>>>>> 
>>>>>>>> 4. The purpose of LookupFunctionProvider#cacheMissingKey() seems a
>>>>>>>> little confusing to me. If Optional<LookupCacheFactory>
>>>>>>>> getCacheFactory() returns a non-empty factory, doesn't that already
>>>>>>>> tell the framework to cache the missing keys? Also, why is this
>>>>>>>> method returning an Optional<Boolean> instead of a boolean?
>>>>>>>> 
>>>>>>>> Thanks,
>>>>>>>> 
>>>>>>>> Jiangjie (Becket) Qin
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Wed, May 25, 2022 at 5:07 PM Qingsheng Ren <
>> renqschn@gmail.com
>>>> 
>>>>>>> wrote:
>>>>>>>> 
>>>>>>>>> Hi Lincoln and Jark,
>>>>>>>>> 
>>>>>>>>> Thanks for the comments! If the community reaches a consensus that
>>>>>>>>> we use a SQL hint instead of table options to decide whether to use
>>>>>>>>> sync or async mode, it's indeed not necessary to introduce the
>>>>>>>>> "lookup.async" option.
>>>>>>>>> 
>>>>>>>>> I think it's a good idea to let the decision about async be made at
>>>>>>>>> the query level, which could enable better optimization with more
>>>>>>>>> information gathered by the planner. Is there any FLIP describing
>>>>>>>>> the issue in FLINK-27625? I thought FLIP-234 only proposes adding a
>>>>>>>>> SQL hint for retry on missing, instead of having the entire async
>>>>>>>>> mode controlled by a hint.
>>>>>>>>> 
>>>>>>>>> Best regards,
>>>>>>>>> 
>>>>>>>>> Qingsheng
>>>>>>>>> 
>>>>>>>>>> On May 25, 2022, at 15:13, Lincoln Lee <
>> lincoln.86xy@gmail.com
>>>> 
>>>>>>> wrote:
>>>>>>>>>> 
>>>>>>>>>> Hi Jark,
>>>>>>>>>> 
>>>>>>>>>> Thanks for your reply!
>>>>>>>>>> 
>>>>>>>>>> Currently 'lookup.async' only exists in the HBase connector, and I
>>>>>>>>>> have no idea whether or when to remove it (we can discuss that in
>>>>>>>>>> another issue for the HBase connector after FLINK-27625 is done);
>>>>>>>>>> I just propose not to add it as a common option now.
>>>>>>>>>> 
>>>>>>>>>> Best,
>>>>>>>>>> Lincoln Lee
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> Jark Wu <im...@gmail.com> 于2022年5月24日周二 20:14写道:
>>>>>>>>>> 
>>>>>>>>>>> Hi Lincoln,
>>>>>>>>>>> 
>>>>>>>>>>> I have taken a look at FLIP-234, and I agree with you that
>>>>>>>>>>> connectors can provide both async and sync runtime providers
>>>>>>>>>>> simultaneously instead of one of them. At that point,
>>>>>>>>>>> "lookup.async" looks redundant. If this option is planned to be
>>>>>>>>>>> removed in the long term, I think it makes sense not to introduce
>>>>>>>>>>> it in this FLIP.
>>>>>>>>>>> 
>>>>>>>>>>> Best,
>>>>>>>>>>> Jark
>>>>>>>>>>> 
>>>>>>>>>>> On Tue, 24 May 2022 at 11:08, Lincoln Lee <
>>>> lincoln.86xy@gmail.com
>>>>>> 
>>>>>>>>> wrote:
>>>>>>>>>>> 
>>>>>>>>>>>> Hi Qingsheng,
>>>>>>>>>>>> 
>>>>>>>>>>>> Sorry for jumping into the discussion so late. It's a good idea
>>>>>>>>>>>> that we can have a common table option. I have a minor comment on
>>>>>>>>>>>> 'lookup.async': I suggest not making it a common option.
>>>>>>>>>>>> 
>>>>>>>>>>>> The table layer abstracts both sync and async lookup
>>>>>>>>>>>> capabilities, and connector implementers can choose one or both.
>>>>>>>>>>>> In the case of implementing only one capability (the status of
>>>>>>>>>>>> most existing built-in connectors), 'lookup.async' will not be
>>>>>>>>>>>> used. And when a connector has both capabilities, I think this
>>>>>>>>>>>> choice is better made at the query level: for example, the table
>>>>>>>>>>>> planner can choose the physical implementation of async or sync
>>>>>>>>>>>> lookup based on its cost model, or users can give a query hint
>>>>>>>>>>>> based on their own better understanding. If there is another
>>>>>>>>>>>> common table option 'lookup.async', it may confuse users in the
>>>>>>>>>>>> long run.
>>>>>>>>>>>> 
>>>>>>>>>>>> So, I prefer to leave the 'lookup.async' option in a private
>>>>>>>>>>>> place (for the current HBase connector) and not turn it into a
>>>>>>>>>>>> common option.
>>>>>>>>>>>> 
>>>>>>>>>>>> WDYT?
>>>>>>>>>>>> 
>>>>>>>>>>>> Best,
>>>>>>>>>>>> Lincoln Lee
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> Qingsheng Ren <re...@gmail.com> 于2022年5月23日周一 14:54写道:
>>>>>>>>>>>> 
>>>>>>>>>>>>> Hi Alexander,
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Thanks for the review! We recently updated the FLIP and you can
>>>>>>>>>>>>> find those changes in my latest email. Since some terminology
>>>>>>>>>>>>> has changed, I'll use the new concepts when replying to your
>>>>>>>>>>>>> comments.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 1. Builder vs ‘of’
>>>>>>>>>>>>> I'm OK with using the builder pattern if we have additional
>>>>>>>>>>>>> optional parameters for full caching mode ("rescan" previously).
>>>>>>>>>>>>> The schedule-with-delay idea looks reasonable to me, but I think
>>>>>>>>>>>>> we need to redesign the builder API of full caching to make it
>>>>>>>>>>>>> more descriptive for developers. Would you mind sharing your
>>>>>>>>>>>>> ideas about the API? For accessing the FLIP workspace you can
>>>>>>>>>>>>> just provide your account ID and ping any PMC member, including
>>>>>>>>>>>>> Jark.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 2. Common table options
>>>>>>>>>>>>> We have had some discussions these days and propose to
>>>>>>>>>>>>> introduce 8 common table options about caching. This has been
>>>>>>>>>>>>> updated in the FLIP.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 3. Retries
>>>>>>>>>>>>> I think we are on the same page :-)
>>>>>>>>>>>>> 
>>>>>>>>>>>>> For your additional concerns:
>>>>>>>>>>>>> 1) The table option has been updated.
>>>>>>>>>>>>> 2) We brought "lookup.cache" back for configuring whether to
>>>>>>>>>>>>> use partial or full caching mode.
>>>>>>>>>>>>> full caching mode.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Qingsheng
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On May 19, 2022, at 17:25, Александр Смирнов <
>>>>>>> smiralexan@gmail.com>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Also I have a few additions:
>>>>>>>>>>>>>> 1) maybe rename 'lookup.cache.maximum-size' to
>>>>>>>>>>>>>> 'lookup.cache.max-rows'? I think it will be more clear
>> that
>>>> we
>>>>>>> talk
>>>>>>>>>>>>>> not about bytes, but about the number of rows. Plus it
>> fits
>>>>> more,
>>>>>>>>>>>>>> considering my optimization with filters.
>>>>>>>>>>>>>> 2) How will users enable rescanning? Are we going to
>>> separate
>>>>>>>>> caching
>>>>>>>>>>>>>> and rescanning from the options point of view? Like
>>> initially
>>>>> we
>>>>>>> had
>>>>>>>>>>>>>> one option 'lookup.cache' with values LRU / ALL. I think
>>> now
>>>> we
>>>>>>> can
>>>>>>>>>>>>>> make a boolean option 'lookup.rescan'. RescanInterval can
>>> be
>>>>>>>>>>>>>> 'lookup.rescan.interval', etc.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>> Alexander
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> чт, 19 мая 2022 г. в 14:50, Александр Смирнов <
>>>>>>> smiralexan@gmail.com
>>>>>>>>>>>> :
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Hi Qingsheng and Jark,
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 1. Builders vs 'of'
>>>>>>>>>>>>>>> I understand that builders are used when we have
>> multiple
>>>>>>>>>>> parameters.
>>>>>>>>>>>>>>> I suggested them because we could add parameters later.
>> To
>>>>>>> prevent
>>>>>>>>>>>>>>> Builder for ScanRuntimeProvider from looking redundant I
>>> can
>>>>>>>>> suggest
>>>>>>>>>>>>>>> one more config now - "rescanStartTime".
>>>>>>>>>>>>>>> It's a time in UTC (LocalTime class) when the first
>> reload
>>>> of
>>>>>>> cache
>>>>>>>>>>>>>>> starts. This parameter can be thought of as
>> 'initialDelay'
>>>>> (diff
>>>>>>>>>>>>>>> between current time and rescanStartTime) in method
>>>>>>>>>>>>>>> ScheduleExecutorService#scheduleWithFixedDelay [1] . It
>>> can
>>>> be
>>>>>>> very
>>>>>>>>>>>>>>> useful when the dimension table is updated by some other
>>>>>>> scheduled
>>>>>>>>>>> job
>>>>>>>>>>>>>>> at a certain time. Or when the user simply wants a
>> second
>>>> scan
>>>>>>>>>>> (first
>>>>>>>>>>>>>>> cache reload) be delayed. This option can be used even
>>>> without
>>>>>>>>>>>>>>> 'rescanInterval' - in this case 'rescanInterval' will be
>>> one
>>>>>>> day.
>>>>>>>>>>>>>>> If you are fine with this option, I would be very glad
>> if
>>>> you
>>>>>>> would
>>>>>>>>>>>>>>> give me access to edit FLIP page, so I could add it
>> myself
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 2. Common table options
>>>>>>>>>>>>>>> I also think that FactoryUtil would be overloaded by all
>>>> cache
>>>>>>>>>>>>>>> options. But maybe unify all suggested options, not only
>>> for
>>>>>>>>> default
>>>>>>>>>>>>>>> cache? I.e. class 'LookupOptions', that unifies default
>>>> cache
>>>>>>>>>>> options,
>>>>>>>>>>>>>>> rescan options, 'async', 'maxRetries'. WDYT?
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 3. Retries
>>>>>>>>>>>>>>> I'm fine with suggestion close to
>>> RetryUtils#tryTimes(times,
>>>>>>> call)
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> [1]
>>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>> 
>>>>>>> 
>>>>> 
>>>> 
>>> 
>> https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/ScheduledExecutorService.html#scheduleWithFixedDelay-java.lang.Runnable-long-long-java.util.concurrent.TimeUnit-
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>>> Alexander
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> ср, 18 мая 2022 г. в 16:04, Qingsheng Ren <
>>>> renqschn@gmail.com
>>>>>> :
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Hi Jark and Alexander,
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Thanks for your comments! I’m also OK to introduce
>> common
>>>>> table
>>>>>>>>>>>>> options. I prefer to introduce a new
>>> DefaultLookupCacheOptions
>>>>>>> class
>>>>>>>>>>> for
>>>>>>>>>>>>> holding these option definitions because putting all
>> options
>>>>> into
>>>>>>>>>>>>> FactoryUtil would make it a bit ”crowded” and not well
>>>>>>> categorized.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> FLIP has been updated according to suggestions above:
>>>>>>>>>>>>>>>> 1. Use static “of” method for constructing
>>>>>>> RescanRuntimeProvider
>>>>>>>>>>>>> considering both arguments are required.
>>>>>>>>>>>>>>>> 2. Introduce new table options matching
>>>>>>> DefaultLookupCacheFactory
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>> Qingsheng
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> On Wed, May 18, 2022 at 2:57 PM Jark Wu <
>>> imjark@gmail.com>
>>>>>>> wrote:
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Hi Alex,
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 1) retry logic
>>>>>>>>>>>>>>>>> I think we can extract some common retry logic into
>>>>> utilities,
>>>>>>>>>>> e.g.
>>>>>>>>>>>>> RetryUtils#tryTimes(times, call).
>>>>>>>>>>>>>>>>> This seems independent of this FLIP and can be reused
>> by
>>>>>>>>>>> DataStream
>>>>>>>>>>>>> users.
>>>>>>>>>>>>>>>>> Maybe we can open an issue to discuss this and where
>> to
>>>> put
>>>>>>> it.
>>>>>>>>>>>>>>>>> 
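For reference, a minimal sketch of what the proposed RetryUtils#tryTimes(times, call) utility could look like. This class does not exist in Flink; the name comes from the discussion and the semantics here are only illustrative:

```java
import java.util.concurrent.Callable;

// Hypothetical retry utility matching the RetryUtils#tryTimes(times, call)
// shape mentioned in the discussion; not an actual Flink API.
public final class RetryUtils {

    /** Invokes 'call' up to 'times' times, rethrowing the last failure. */
    public static <T> T tryTimes(int times, Callable<T> call) throws Exception {
        Exception last = null;
        for (int attempt = 1; attempt <= times; attempt++) {
            try {
                return call.call();
            } catch (Exception e) {
                // A connector could plug custom recovery here, e.g. re-establishing
                // the connection before the next attempt.
                last = e;
            }
        }
        throw last;
    }
}
```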
>>>>>>>>>>>>>>>>> 2) cache ConfigOptions
>>>>>>>>>>>>>>>>> I'm fine with defining cache config options in the
>>>>> framework.
>>>>>>>>>>>>>>>>> A candidate place to put is FactoryUtil which also
>>>> includes
>>>>>>>>>>>>> "sink.parallelism", "format" options.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>>> Jark
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> On Wed, 18 May 2022 at 13:52, Александр Смирнов <
>>>>>>>>>>>> smiralexan@gmail.com>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Hi Qingsheng,
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Thank you for considering my comments.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> there might be custom logic before making retry,
>> such
>>> as
>>>>>>>>>>>>> re-establish the connection
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Yes, I understand that. I meant that such logic can
>> be
>>>>>>> placed in
>>>>>>>>>>> a
>>>>>>>>>>>>>>>>>> separate function that can be implemented by
>>> connectors.
>>>>>>> Just
>>>>>>>>>>>> moving
>>>>>>>>>>>>>>>>>> the retry logic would make connector's LookupFunction
>>>> more
>>>>>>>>>>> concise
>>>>>>>>>>>> +
>>>>>>>>>>>>>>>>>> avoid duplicate code. However, it's a minor change.
>> The
>>>>>>> decision
>>>>>>>>>>> is
>>>>>>>>>>>>> up
>>>>>>>>>>>>>>>>>> to you.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> We decide not to provide common DDL options and let
>>>>>>>>>>>>>>>>>>> developers define their own options as we do now per connector.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> What is the reason for that? One of the main goals of
>>>> this
>>>>>>> FLIP
>>>>>>>>>>> was
>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>> unify the configs, wasn't it? I understand that
>> current
>>>>> cache
>>>>>>>>>>>> design
>>>>>>>>>>>>>>>>>> doesn't depend on ConfigOptions, like was before. But
>>>> still
>>>>>>> we
>>>>>>>>>>> can
>>>>>>>>>>>>> put
>>>>>>>>>>>>>>>>>> these options into the framework, so connectors can
>>> reuse
>>>>>>> them
>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>> avoid code duplication and, more importantly, avoid
>>>>>>>>>>>>>>>>>> inconsistent option naming. This can be pointed out in
>>>>>>>>>>>>>>>>>> the documentation for connector developers.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>>>>>> Alexander
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> вт, 17 мая 2022 г. в 17:11, Qingsheng Ren <
>>>>>>> renqschn@gmail.com>:
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Hi Alexander,
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Thanks for the review and glad to see we are on the
>>> same
>>>>>>> page!
>>>>>>>>> I
>>>>>>>>>>>>> think you forgot to cc the dev mailing list so I’m also
>>>> quoting
>>>>>>> your
>>>>>>>>>>>> reply
>>>>>>>>>>>>> under this email.
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> We can add 'maxRetryTimes' option into this class
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> In my opinion the retry logic should be implemented
>> in
>>>>>>> lookup()
>>>>>>>>>>>>> instead of in LookupFunction#eval(). Retrying is only
>>>> meaningful
>>>>>>>>> under
>>>>>>>>>>>> some
>>>>>>>>>>>>> specific retriable failures, and there might be custom
>> logic
>>>>>>> before
>>>>>>>>>>>> making
>>>>>>>>>>>>> retry, such as re-establish the connection
>>>>>>> (JdbcRowDataLookupFunction
>>>>>>>>>>> is
>>>>>>>>>>>> an
>>>>>>>>>>>>> example), so it's more handy to leave it to the connector.
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> I don't see DDL options, that were in previous
>>> version
>>>> of
>>>>>>>>> FLIP.
>>>>>>>>>>>> Do
>>>>>>>>>>>>> you have any special plans for them?
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> We decide not to provide common DDL options and let
>>>>>>>>>>>>>>>>>>> developers define their own options as we do now per connector.
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> The rest of comments sound great and I’ll update the
>>>> FLIP.
>>>>>>> Hope
>>>>>>>>>>> we
>>>>>>>>>>>>> can finalize our proposal soon!
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Qingsheng
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> On May 17, 2022, at 13:46, Александр Смирнов <
>>>>>>>>>>>> smiralexan@gmail.com>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> Hi Qingsheng and devs!
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> I like the overall design of updated FLIP, however
>> I
>>>> have
>>>>>>>>>>> several
>>>>>>>>>>>>>>>>>>>> suggestions and questions.
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> 1) Introducing LookupFunction as a subclass of
>>>>>>> TableFunction
>>>>>>>>>>> is a
>>>>>>>>>>>>> good
>>>>>>>>>>>>>>>>>>>> idea. We can add 'maxRetryTimes' option into this
>>>> class.
>>>>>>>>> 'eval'
>>>>>>>>>>>>> method
>>>>>>>>>>>>>>>>>>>> of new LookupFunction is great for this purpose.
>> The
>>>> same
>>>>>>> is
>>>>>>>>>>> for
>>>>>>>>>>>>>>>>>>>> 'async' case.
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> 2) There might be other configs in future, such as
>>>>>>>>>>>>> 'cacheMissingKey'
>>>>>>>>>>>>>>>>>>>> in LookupFunctionProvider or 'rescanInterval' in
>>>>>>>>>>>>> ScanRuntimeProvider.
>>>>>>>>>>>>>>>>>>>> Maybe use Builder pattern in LookupFunctionProvider
>>> and
>>>>>>>>>>>>>>>>>>>> RescanRuntimeProvider for more flexibility (use one
>>>>> 'build'
>>>>>>>>>>>> method
>>>>>>>>>>>>>>>>>>>> instead of many 'of' methods in future)?
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> 3) What are the plans for existing
>>>> TableFunctionProvider
>>>>>>> and
>>>>>>>>>>>>>>>>>>>> AsyncTableFunctionProvider? I think they should be
>>>>>>> deprecated.
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> 4) Am I right that the current design does not
>> assume
>>>>>>> usage of
>>>>>>>>>>>>>>>>>>>> user-provided LookupCache in re-scanning? In this
>>> case,
>>>>> it
>>>>>>> is
>>>>>>>>>>> not
>>>>>>>>>>>>> very
>>>>>>>>>>>>>>>>>>>> clear why do we need methods such as 'invalidate'
>> or
>>>>>>> 'putAll'
>>>>>>>>>>> in
>>>>>>>>>>>>>>>>>>>> LookupCache.
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> 5) I don't see DDL options, that were in previous
>>>> version
>>>>>>> of
>>>>>>>>>>>> FLIP.
>>>>>>>>>>>>> Do
>>>>>>>>>>>>>>>>>>>> you have any special plans for them?
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> If you don't mind, I would be glad to be able to
>> make
>>>>> small
>>>>>>>>>>>>>>>>>>>> adjustments to the FLIP document too. I think it's
>>>> worth
>>>>>>>>>>>> mentioning
>>>>>>>>>>>>>>>>>>>> what optimizations are planned for
>>>>>>>>>>>>>>>>>>>> the future.
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>>>>>>>> Smirnov Alexander
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> пт, 13 мая 2022 г. в 20:27, Qingsheng Ren <
>>>>>>> renqschn@gmail.com
>>>>>>>>>>>> :
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> Hi Alexander and devs,
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> Thank you very much for the in-depth discussion!
>> As
>>>> Jark
>>>>>>>>>>>>> mentioned we were inspired by Alexander's idea and made a
>>>>>>> refactor on
>>>>>>>>>>> our
>>>>>>>>>>>>> design. FLIP-221 [1] has been updated to reflect our
>> design
>>>> now
>>>>>>> and
>>>>>>>>> we
>>>>>>>>>>>> are
>>>>>>>>>>>>> happy to hear more suggestions from you!
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> Compared to the previous design:
>>>>>>>>>>>>>>>>>>>>> 1. The lookup cache serves at table runtime level
>>> and
>>>> is
>>>>>>>>>>>>> integrated as a component of LookupJoinRunner as discussed
>>>>>>>>> previously.
>>>>>>>>>>>>>>>>>>>>> 2. Interfaces are renamed and re-designed to
>> reflect
>>>> the
>>>>>>> new
>>>>>>>>>>>>> design.
>>>>>>>>>>>>>>>>>>>>> 3. We separate the all-caching case individually
>> and
>>>>>>>>>>> introduce a
>>>>>>>>>>>>> new RescanRuntimeProvider to reuse the ability of
>> scanning.
>>> We
>>>>> are
>>>>>>>>>>>> planning
>>>>>>>>>>>>> to support SourceFunction / InputFormat for now
>> considering
>>>> the
>>>>>>>>>>>> complexity
>>>>>>>>>>>>> of FLIP-27 Source API.
>>>>>>>>>>>>>>>>>>>>> 4. A new interface LookupFunction is introduced to
>>>> make
>>>>>>> the
>>>>>>>>>>>>> semantic of lookup more straightforward for developers.
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> For replying to Alexander:
>>>>>>>>>>>>>>>>>>>>>> However I'm a little confused whether InputFormat
>>> is
>>>>>>>>>>> deprecated
>>>>>>>>>>>>> or not. Am I right that it will be so in the future, but
>>>>> currently
>>>>>>>>> it's
>>>>>>>>>>>> not?
>>>>>>>>>>>>>>>>>>>>> Yes you are right. InputFormat is not deprecated
>> for
>>>>> now.
>>>>>>> I
>>>>>>>>>>>> think
>>>>>>>>>>>>> it will be deprecated in the future but we don't have a
>>> clear
>>>>> plan
>>>>>>>>> for
>>>>>>>>>>>> that.
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> Thanks again for the discussion on this FLIP and
>>>> looking
>>>>>>>>>>> forward
>>>>>>>>>>>>> to cooperating with you after we finalize the design and
>>>>>>> interfaces!
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> [1]
>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-221+Abstraction+for+lookup+source+cache+and+metric
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> Qingsheng
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> On Fri, May 13, 2022 at 12:12 AM Александр
>> Смирнов <
>>>>>>>>>>>>> smiralexan@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> Hi Jark, Qingsheng and Leonard!
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> Glad to see that we came to a consensus on almost
>>> all
>>>>>>>>> points!
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> However I'm a little confused whether InputFormat
>>> is
>>>>>>>>>>> deprecated
>>>>>>>>>>>>> or
>>>>>>>>>>>>>>>>>>>>>> not. Am I right that it will be so in the future,
>>> but
>>>>>>>>>>> currently
>>>>>>>>>>>>> it's
>>>>>>>>>>>>>>>>>>>>>> not? Actually I also think that for the first
>>> version
>>>>>>> it's
>>>>>>>>> OK
>>>>>>>>>>>> to
>>>>>>>>>>>>> use
>>>>>>>>>>>>>>>>>>>>>> InputFormat in the ALL cache implementation, because
>>>>> supporting
>>>>>>>>>>> rescan
>>>>>>>>>>>>>>>>>>>>>> ability seems like a very distant prospect. But
>> for
>>>>> this
>>>>>>>>>>>>> decision we
>>>>>>>>>>>>>>>>>>>>>> need a consensus among all discussion
>> participants.
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> In general, I don't have something to argue with
>>> your
>>>>>>>>>>>>> statements. All
>>>>>>>>>>>>>>>>>>>>>> of them correspond my ideas. Looking ahead, it
>>> would
>>>> be
>>>>>>> nice
>>>>>>>>>>> to
>>>>>>>>>>>>> work
>>>>>>>>>>>>>>>>>>>>>> on this FLIP cooperatively. I've already done a
>> lot
>>>> of
>>>>>>> work
>>>>>>>>>>> on
>>>>>>>>>>>>> lookup
>>>>>>>>>>>>>>>>>>>>>> join caching with an implementation very close to the one
>> one
>>>> we
>>>>>>> are
>>>>>>>>>>>>> discussing,
>>>>>>>>>>>>>>>>>>>>>> and want to share the results of this work.
>> Anyway
>>>>>>> looking
>>>>>>>>>>>>> forward for
>>>>>>>>>>>>>>>>>>>>>> the FLIP update!
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>>>>>>>>>> Smirnov Alexander
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> чт, 12 мая 2022 г. в 17:38, Jark Wu <
>>>> imjark@gmail.com
>>>>>> :
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> Hi Alex,
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> Thanks for summarizing your points.
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> In the past week, Qingsheng, Leonard, and I have
>>>>>>> discussed
>>>>>>>>>>> it
>>>>>>>>>>>>> several times
>>>>>>>>>>>>>>>>>>>>>>> and we have totally refactored the design.
>>>>>>>>>>>>>>>>>>>>>>> I'm glad to say we have reached a consensus on
>>> many
>>>> of
>>>>>>> your
>>>>>>>>>>>>> points!
>>>>>>>>>>>>>>>>>>>>>>> Qingsheng is still working on updating the
>> design
>>>> docs
>>>>>>> and
>>>>>>>>>>>>> maybe can be
>>>>>>>>>>>>>>>>>>>>>>> available in the next few days.
>>>>>>>>>>>>>>>>>>>>>>> I will share some conclusions from our
>>> discussions:
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> 1) we have refactored the design towards to
>> "cache
>>>> in
>>>>>>>>>>>>> framework" way.
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> 2) a "LookupCache" interface for users to
>>> customize
>>>>> and
>>>>>>> a
>>>>>>>>>>>>> default
>>>>>>>>>>>>>>>>>>>>>>> implementation with builder for users to
>> easy-use.
>>>>>>>>>>>>>>>>>>>>>>> This can make it possible to both have
>>>>> flexibility
>>>>>>> and
>>>>>>>>>>>>> conciseness.
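To illustrate point 2, here is a much-reduced sketch of what a customizable lookup cache interface could look like. The real interface is defined in the FLIP document; this only mirrors the methods mentioned elsewhere in the thread ('invalidate', 'putAll', etc.), with generic key/value types instead of Flink's RowData:

```java
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;

// Illustrative only -- the FLIP defines the actual LookupCache interface.
interface LookupCache<K, V> {
    Collection<V> getIfPresent(K key);
    void put(K key, Collection<V> rows);
    void invalidate(K key);
    void putAll(Map<K, Collection<V>> entries);
}

// Trivial unbounded implementation, for demonstration only; a default
// implementation would add eviction (size/TTL) and expose a builder.
class MapLookupCache<K, V> implements LookupCache<K, V> {
    private final Map<K, Collection<V>> store = new HashMap<>();
    public Collection<V> getIfPresent(K key) { return store.get(key); }
    public void put(K key, Collection<V> rows) { store.put(key, rows); }
    public void invalidate(K key) { store.remove(key); }
    public void putAll(Map<K, Collection<V>> entries) { store.putAll(entries); }
}
```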
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> 3) Filter pushdown is important for ALL and LRU
>>>> lookup
>>>>>>>>>>> cache,
>>>>>>>>>>>>> esp reducing
>>>>>>>>>>>>>>>>>>>>>>> IO.
>>>>>>>>>>>>>>>>>>>>>>> Filter pushdown should be the final state and
>> the
>>>>>>> unified
>>>>>>>>>>> way
>>>>>>>>>>>>> to both
>>>>>>>>>>>>>>>>>>>>>>> support pruning ALL cache and LRU cache,
>>>>>>>>>>>>>>>>>>>>>>> so I think we should make effort in this
>>> direction.
>>>> If
>>>>>>> we
>>>>>>>>>>> need
>>>>>>>>>>>>> to support
>>>>>>>>>>>>>>>>>>>>>>> filter pushdown for ALL cache anyway, why not
>> use
>>>>>>>>>>>>>>>>>>>>>>> it for LRU cache as well? Either way, as we
>> decide
>>>> to
>>>>>>>>>>>> implement
>>>>>>>>>>>>> the cache
>>>>>>>>>>>>>>>>>>>>>>> in the framework, we have the chance to support
>>>>>>>>>>>>>>>>>>>>>>> filter on cache anytime. This is an optimization
>>> and
>>>>> it
>>>>>>>>>>>> doesn't
>>>>>>>>>>>>> affect the
>>>>>>>>>>>>>>>>>>>>>>> public API. I think we can create a JIRA issue
>> to
>>>>>>>>>>>>>>>>>>>>>>> discuss it when the FLIP is accepted.
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> 4) The idea to support ALL cache is similar to
>>> your
>>>>>>>>>>> proposal.
>>>>>>>>>>>>>>>>>>>>>>> In the first version, we will only support
>>>>> InputFormat,
>>>>>>>>>>>>> SourceFunction for
>>>>>>>>>>>>>>>>>>>>>>> cache all (invoke InputFormat in join operator).
>>>>>>>>>>>>>>>>>>>>>>> For FLIP-27 source, we need to join a true
>> source
>>>>>>> operator
>>>>>>>>>>>>> instead of
>>>>>>>>>>>>>>>>>>>>>>> calling it embedded in the join operator.
>>>>>>>>>>>>>>>>>>>>>>> However, this needs another FLIP to support the
>>>>> re-scan
>>>>>>>>>>>> ability
>>>>>>>>>>>>> for FLIP-27
>>>>>>>>>>>>>>>>>>>>>>> Source, and this can be a large work.
>>>>>>>>>>>>>>>>>>>>>>> In order to not block this issue, we can put the
>>>>> effort
>>>>>>> of
>>>>>>>>>>>>> FLIP-27 source
>>>>>>>>>>>>>>>>>>>>>>> integration into future work and integrate
>>>>>>>>>>>>>>>>>>>>>>> InputFormat&SourceFunction for now.
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> I think it's fine to use
>>> InputFormat&SourceFunction,
>>>>> as
>>>>>>>>> they
>>>>>>>>>>>>> are not
>>>>>>>>>>>>>>>>>>>>>>> deprecated, otherwise, we have to introduce
>>> another
>>>>>>>>> function
>>>>>>>>>>>>>>>>>>>>>>> similar to them which is meaningless. We need to
>>>> plan
>>>>>>>>>>> FLIP-27
>>>>>>>>>>>>> source
>>>>>>>>>>>>>>>>>>>>>>> integration ASAP before InputFormat &
>>> SourceFunction
>>>>> are
>>>>>>>>>>>>> deprecated.
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>>>>>>>>> Jark
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> On Thu, 12 May 2022 at 15:46, Александр Смирнов
>> <
>>>>>>>>>>>>> smiralexan@gmail.com>
>>>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> Hi Martijn!
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> Got it. Therefore, the implementation with
>>>>>>>>>>>>>>>>>>>>>>>> InputFormat is not considered.
>>>>>>>>>>>>>>>>>>>>>>>> Thanks for clearing that up!
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>>>>>>>>>>>> Smirnov Alexander
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> чт, 12 мая 2022 г. в 14:23, Martijn Visser <
>>>>>>>>>>>>> martijn@ververica.com>:
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> With regards to:
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> But if there are plans to refactor all
>>> connectors
>>>>> to
>>>>>>>>>>>> FLIP-27
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> Yes, FLIP-27 is the target for all connectors.
>>> The
>>>>> old
>>>>>>>>>>>>> interfaces will be
>>>>>>>>>>>>>>>>>>>>>>>>> deprecated and connectors will either be
>>>> refactored
>>>>> to
>>>>>>>>> use
>>>>>>>>>>>>> the new ones
>>>>>>>>>>>>>>>>>>>>>>>> or
>>>>>>>>>>>>>>>>>>>>>>>>> dropped.
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> The caching should work for connectors that
>> are
>>>>> using
>>>>>>>>>>>> FLIP-27
>>>>>>>>>>>>> interfaces,
>>>>>>>>>>>>>>>>>>>>>>>>> we should not introduce new features for old
>>>>>>> interfaces.
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> Martijn
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> On Thu, 12 May 2022 at 06:19, Александр
>> Смирнов
>>> <
>>>>>>>>>>>>> smiralexan@gmail.com>
>>>>>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> Hi Jark!
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> Sorry for the late response. I would like to
>>> make
>>>>>>> some
>>>>>>>>>>>>> comments and
>>>>>>>>>>>>>>>>>>>>>>>>>> clarify my points.
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> 1) I agree with your first statement. I think
>>> we
>>>>> can
>>>>>>>>>>>> achieve
>>>>>>>>>>>>> both
>>>>>>>>>>>>>>>>>>>>>>>>>> advantages this way: put the Cache interface
>> in
>>>>>>>>>>>>> flink-table-common,
>>>>>>>>>>>>>>>>>>>>>>>>>> but have implementations of it in
>>>>>>> flink-table-runtime.
>>>>>>>>>>>>> Therefore if a
>>>>>>>>>>>>>>>>>>>>>>>>>> connector developer wants to use existing
>> cache
>>>>>>>>>>> strategies
>>>>>>>>>>>>> and their
>>>>>>>>>>>>>>>>>>>>>>>>>> implementations, he can just pass
>> lookupConfig
>>> to
>>>>> the
>>>>>>>>>>>>> planner, but if
>>>>>>>>>>>>>>>>>>>>>>>>>> he wants to have its own cache implementation
>>> in
>>>>> his
>>>>>>>>>>>>> TableFunction, it
>>>>>>>>>>>>>>>>>>>>>>>>>> will be possible for him to use the existing
>>>>>>> interface
>>>>>>>>>>> for
>>>>>>>>>>>>> this
>>>>>>>>>>>>>>>>>>>>>>>>>> purpose (we can explicitly point this out in
>>> the
>>>>>>>>>>>>> documentation). In
>>>>>>>>>>>>>>>>>>>>>>>>>> this way all configs and metrics will be
>>> unified.
>>>>>>> WDYT?
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>> If a filter can prune 90% of data in the
>>> cache,
>>>> we
>>>>>>> will
>>>>>>>>>>>>> have 90% of
>>>>>>>>>>>>>>>>>>>>>>>>>> lookup requests that can never be cached
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> 2) Let me clarify the logic filters
>>> optimization
>>>> in
>>>>>>> case
>>>>>>>>>>> of
>>>>>>>>>>>>> LRU cache.
>>>>>>>>>>>>>>>>>>>>>>>>>> It looks like Cache<RowData,
>>>> Collection<RowData>>.
>>>>>>> Here
>>>>>>>>>>> we
>>>>>>>>>>>>> always
>>>>>>>>>>>>>>>>>>>>>>>>>> store the response of the dimension table in
>>>> cache,
>>>>>>> even
>>>>>>>>>>>>> after
>>>>>>>>>>>>>>>>>>>>>>>>>> applying calc function. I.e. if there are no
>>> rows
>>>>>>> after
>>>>>>>>>>>>> applying
>>>>>>>>>>>>>>>>>>>>>>>>>> filters to the result of the 'eval' method of
>>>>>>>>>>>> TableFunction,
>>>>>>>>>>>>> we store
>>>>>>>>>>>>>>>>>>>>>>>>>> the empty list by lookup keys. Therefore the
>>>> cache
>>>>>>> line
>>>>>>>>>>>> will
>>>>>>>>>>>>> be
>>>>>>>>>>>>>>>>>>>>>>>>>> filled, but will require much less memory (in
>>>>> bytes).
>>>>>>>>>>> I.e.
>>>>>>>>>>>>>>>>>>>>>>>>>> That is, we don't drop the keys whose results
>>>>>>>>>>>>>>>>>>>>>>>>>> were pruned, but we significantly reduce the
>>>>>>>>>>>>>>>>>>>>>>>>>> memory needed to store those results. If
>>>> the
>>>>>>> user
>>>>>>>>>>>>> knows about
>>>>>>>>>>>>>>>>>>>>>>>>>> this behavior, he can increase the 'max-rows'
>>>>> option
>>>>>>>>>>> before
>>>>>>>>>>>>> the start
>>>>>>>>>>>>>>>>>>>>>>>>>> of the job. But actually I came up with the
>>> idea
>>>>>>> that we
>>>>>>>>>>>> can
>>>>>>>>>>>>> do this
>>>>>>>>>>>>>>>>>>>>>>>>>> automatically by using the 'maximumWeight'
>> and
>>>>>>> 'weigher'
>>>>>>>>>>>>> methods of
>>>>>>>>>>>>>>>>>>>>>>>>>> GuavaCache [1]. Weight can be the size of the
>>>>>>> collection
>>>>>>>>>>> of
>>>>>>>>>>>>> rows
>>>>>>>>>>>>>>>>>>>>>>>>>> (value of cache). Therefore cache can
>>>> automatically
>>>>>>> fit
>>>>>>>>>>>> much
>>>>>>>>>>>>> more
>>>>>>>>>>>>>>>>>>>>>>>>>> records than before.
>>>>>>>>>>>>>>>>>>>>>>>>>> 
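To illustrate the 'maximumWeight'/'weigher' idea without pulling in the Guava dependency, here is a self-contained sketch. The class name and the simple FIFO eviction are my simplifications (Guava's CacheBuilder evicts more cleverly); the point is that an entry's weight is the number of rows in its value, so cached empty results are almost free:

```java
import java.util.ArrayDeque;
import java.util.Collection;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Sketch of weight-based cache sizing: total "weight" is the total number of
// cached rows, not the number of keys, mimicking Guava's
// CacheBuilder.maximumWeight(...).weigher((k, rows) -> rows.size()).
class WeightBoundedCache<K, V> {
    private final long maximumWeight;
    private long currentWeight = 0;
    private final Map<K, Collection<V>> store = new HashMap<>();
    private final Deque<K> insertionOrder = new ArrayDeque<>();

    WeightBoundedCache(long maximumWeight) { this.maximumWeight = maximumWeight; }

    void put(K key, Collection<V> rows) {
        Collection<V> old = store.put(key, rows);
        if (old != null) {
            currentWeight -= old.size();
        } else {
            insertionOrder.addLast(key);
        }
        currentWeight += rows.size();
        // Evict oldest entries until the total row count fits again.
        while (currentWeight > maximumWeight && !insertionOrder.isEmpty()) {
            K evicted = insertionOrder.pollFirst();
            currentWeight -= store.remove(evicted).size();
        }
    }

    Collection<V> getIfPresent(K key) { return store.get(key); }
    int size() { return store.size(); }
}
```

With this kind of sizing, an empty list cached for a pruned key contributes weight 0, so keeping misses in the cache does not crowd out real entries.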
>>>>>>>>>>>>>>>>>>>>>>>>>>> Flink SQL has provided a standard way to do
>>>>> filters
>>>>>>> and
>>>>>>>>>>>>> projects
>>>>>>>>>>>>>>>>>>>>>>>>>> pushdown, i.e., SupportsFilterPushDown and
>>>>>>>>>>>>> SupportsProjectionPushDown.
>>>>>>>>>>>>>>>>>>>>>>>>>>> Jdbc/hive/HBase haven't implemented the
>>>>> interfaces,
>>>>>>>>>>> don't
>>>>>>>>>>>>> mean it's
>>>>>>>>>>>>>>>>>>>>>>>> hard
>>>>>>>>>>>>>>>>>>>>>>>>>> to implement.
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> It's debatable how difficult it will be to
>>>>> implement
>>>>>>>>>>> filter
>>>>>>>>>>>>> pushdown.
>>>>>>>>>>>>>>>>>>>>>>>>>> But I think the fact that currently there is
>> no
>>>>>>> database
>>>>>>>>>>>>> connector
>>>>>>>>>>>>>>>>>>>>>>>>>> with filter pushdown at least means that this
>>>>> feature
>>>>>>>>>>> won't
>>>>>>>>>>>>> be
>>>>>>>>>>>>>>>>>>>>>>>>>> supported soon in connectors. Moreover, if we
>>>> talk
>>>>>>> about
>>>>>>>>>>>>> other
>>>>>>>>>>>>>>>>>>>>>>>>>> connectors (not in Flink repo), their
>> databases
>>>>> might
>>>>>>>>> not
>>>>>>>>>>>>> support all
>>>>>>>>>>>>>>>>>>>>>>>>>> Flink filters (or not support filters at
>> all).
>>> I
>>>>>>> think
>>>>>>>>>>>> users
>>>>>>>>>>>>> are
>>>>>>>>>>>>>>>>>>>>>>>>>> interested in supporting cache filters
>>>> optimization
>>>>>>>>>>>>> independently of
>>>>>>>>>>>>>>>>>>>>>>>>>> supporting other features and solving more
>>>> complex
>>>>>>>>>>> problems
>>>>>>>>>>>>> (or
>>>>>>>>>>>>>>>>>>>>>>>>>> unsolvable at all).
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> 3) I agree with your third statement.
>> Actually
>>> in
>>>>> our
>>>>>>>>>>>>> internal version
>>>>>>>>>>>>>>>>>>>>>>>>>> I also tried to unify the logic of scanning
>> and
>>>>>>>>> reloading
>>>>>>>>>>>>> data from
>>>>>>>>>>>>>>>>>>>>>>>>>> connectors. But unfortunately, I didn't find
>> a
>>>> way
>>>>> to
>>>>>>>>>>> unify
>>>>>>>>>>>>> the logic
>>>>>>>>>>>>>>>>>>>>>>>>>> of all ScanRuntimeProviders (InputFormat,
>>>>>>>>> SourceFunction,
>>>>>>>>>>>>> Source,...)
>>>>>>>>>>>>>>>>>>>>>>>>>> and reuse it in reloading ALL cache. As a
>>> result
>>>> I
>>>>>>>>>>> settled
>>>>>>>>>>>>> on using
>>>>>>>>>>>>>>>>>>>>>>>>>> InputFormat, because it was used for scanning
>>> in
>>>>> all
>>>>>>>>>>> lookup
>>>>>>>>>>>>>>>>>>>>>>>>>> connectors. (I didn't know that there are
>> plans
>>>> to
>>>>>>>>>>>> deprecate
>>>>>>>>>>>>>>>>>>>>>>>>>> InputFormat in favor of FLIP-27 Source). IMO
>>>> usage
>>>>> of
>>>>>>>>>>>>> FLIP-27 source
>>>>>>>>>>>>>>>>>>>>>>>>>> in ALL caching is not good idea, because this
>>>>> source
>>>>>>> was
>>>>>>>>>>>>> designed to
>>>>>>>>>>>>>>>>>>>>>>>>>> work in distributed environment
>>> (SplitEnumerator
>>>> on
>>>>>>>>>>>>> JobManager and
>>>>>>>>>>>>>>>>>>>>>>>>>> SourceReaders on TaskManagers), not in one
>>>> operator
>>>>>>>>>>> (lookup
>>>>>>>>>>>>> join
>>>>>>>>>>>>>>>>>>>>>>>>>> operator in our case). There is even no
>> direct
>>>> way
>>>>> to
>>>>>>>>>>> pass
>>>>>>>>>>>>> splits from
>>>>>>>>>>>>>>>>>>>>>>>>>> SplitEnumerator to SourceReader (this logic
>>> works
>>>>>>>>> through
>>>>>>>>>>>>>>>>>>>>>>>>>> SplitEnumeratorContext, which requires
>>>>>>>>>>>>>>>>>>>>>>>>>> OperatorCoordinator.SubtaskGateway to send
>>>>>>>>>>> AddSplitEvents).
>>>>>>>>>>>>> Usage of
>>>>>>>>>>>>>>>>>>>>>>>>>> InputFormat for ALL cache seems much more
>>> clearer
>>>>> and
>>>>>>>>>>>>> easier. But if
>>>>>>>>>>>>>>>>>>>>>>>>>> there are plans to refactor all connectors to
>>>>>>> FLIP-27, I
>>>>>>>>>>>>> have the
>>>>>>>>>>>>>>>>>>>>>>>>>> following ideas: maybe we can drop lookup join
>>>>>>>>>>>>>>>>>>>>>>>>>> ALL cache in
>>>>>>>>>>>>>>>>>>>>>>>>>> favor of simple join with multiple scanning
>> of
>>>>> batch
>>>>>>>>>>>> source?
>>>>>>>>>>>>> The point
>>>>>>>>>>>>>>>>>>>>>>>>>> is that the only difference between lookup
>> join
>>>> ALL
>>>>>>>>> cache
>>>>>>>>>>>>> and simple
>>>>>>>>>>>>>>>>>>>>>>>>>> join with batch source is that in the first
>>> case
>>>>>>>>> scanning
>>>>>>>>>>>> is
>>>>>>>>>>>>> performed
>>>>>>>>>>>>>>>>>>>>>>>>>> multiple times, in between which state
>> (cache)
>>> is
>>>>>>>>> cleared
>>>>>>>>>>>>> (correct me
>>>>>>>>>>>>>>>>>>>>>>>>>> if I'm wrong). So what if we extend the
>>>>>>> functionality of
>>>>>>>>>>>>> simple join
>>>>>>>>>>>>>>>>>>>>>>>>>> to support state reloading + extend the
>>>>>>> functionality of
>>>>>>>>>>>>> scanning
>>>>>>>>>>>>>>>>>>>>>>>>>> batch source multiple times (this one should
>> be
>>>>> easy
>>>>>>>>> with
>>>>>>>>>>>>> new FLIP-27
>>>>>>>>>>>>>>>>>>>>>>>>>> source, that unifies streaming/batch reading
>> -
>>> we
>>>>>>> will
>>>>>>>>>>> need
>>>>>>>>>>>>> to change
>>>>>>>>>>>>>>>>>>>>>>>>>> only SplitEnumerator, which will pass splits
>>>> again
>>>>>>> after
>>>>>>>>>>>>> some TTL).
>>>>>>>>>>>>>>>>>>>>>>>>>> WDYT? I must say that this looks like a
>>> long-term
>>>>>>> goal
>>>>>>>>>>> and
>>>>>>>>>>>>> will make
>>>>>>>>>>>>>>>>>>>>>>>>>> the scope of this FLIP even larger than you
>>> said.
>>>>>>> Maybe
>>>>>>>>>>> we
>>>>>>>>>>>>> can limit
>>>>>>>>>>>>>>>>>>>>>>>>>> ourselves to a simpler solution now
>>>> (InputFormats).
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> So to sum up, my points are these:
>>>>>>>>>>>>>>>>>>>>>>>>>> 1) There is a way to make both concise and
>>>> flexible
>>>>>>>>>>>>> interfaces for
>>>>>>>>>>>>>>>>>>>>>>>>>> caching in lookup join.
>>>>>>>>>>>>>>>>>>>>>>>>>> 2) Cache filters optimization is important
>> both
>>>> in
>>>>>>> LRU
>>>>>>>>>>> and
>>>>>>>>>>>>> ALL caches.
>>>>>>>>>>>>>>>>>>>>>>>>>> 3) It is unclear when filter pushdown will be
>>>>>>> supported
>>>>>>>>>>> in
>>>>>>>>>>>>> Flink
>>>>>>>>>>>>>>>>>>>>>>>>>> connectors, some of the connectors might not
>>> have
>>>>> the
>>>>>>>>>>>>> opportunity to
>>>>>>>>>>>>>>>>>>>>>>>>>> support filter pushdown + as far as I know,
>> currently
>>>>> filter
>>>>>>>>>>>>> pushdown works
>>>>>>>>>>>>>>>>>>>>>>>>>> only for scanning (not lookup). So cache
>>> filters
>>>> +
>>>>>>>>>>>>> projections
>>>>>>>>>>>>>>>>>>>>>>>>>> optimization should be independent from other
>>>>>>> features.
>>>>>>>>>>>>>>>>>>>>>>>>>> 4) ALL cache implementation is a complex topic
>>> that
>>>>>>>>> involves
>>>>>>>>>>>>> multiple
>>>>>>>>>>>>>>>>>>>>>>>>>> aspects of how Flink is developing. Dropping
>>>>>>>>>>>>>>>>>>>>>>>>>> InputFormat in favor
>>>>>>>>>>>>>>>>>>>>>>>>>> of FLIP-27 Source will make ALL cache
>>> implementation
>>>>>>> really
>>>>>>>>>>>>> complex and
>>>>>>>>>>>>>>>>>>>>>>>>>> not clear, so maybe instead of that we can
>>> extend
>>>>> the
>>>>>>>>>>>>> functionality of
>>>>>>>>>>>>>>>>>>>>>>>>>> simple join, or keep InputFormat in
>>>> case
>>>>> of
>>>>>>>>>>>> lookup
>>>>>>>>>>>>> join ALL
>>>>>>>>>>>>>>>>>>>>>>>>>> cache?
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>>>>>>>>>>>>>> Smirnov Alexander
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> [1]
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>> https://guava.dev/releases/18.0/api/docs/com/google/common/cache/CacheBuilder.html#weigher(com.google.common.cache.Weigher)
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> чт, 5 мая 2022 г. в 20:34, Jark Wu <
>>>>> imjark@gmail.com
>>>>>>>> :
>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>> It's great to see the active discussion! I
>>> want
>>>> to
>>>>>>>>> share
>>>>>>>>>>>> my
>>>>>>>>>>>>> ideas:
>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>> 1) implement the cache in framework vs.
>>>> connectors
>>>>>>> base
>>>>>>>>>>>>>>>>>>>>>>>>>>> I don't have a strong opinion on this. Both
>>> ways
>>>>>>> should
>>>>>>>>>>>>> work (e.g.,
>>>>>>>>>>>>>>>>>>>>>>>> cache
>>>>>>>>>>>>>>>>>>>>>>>>>>> pruning, compatibility).
>>>>>>>>>>>>>>>>>>>>>>>>>>> The framework way can provide more concise
>>>>>>> interfaces.
>>>>>>>>>>>>>>>>>>>>>>>>>>> The connector base way can define more
>>> flexible
>>>>>>> cache
>>>>>>>>>>>>>>>>>>>>>>>>>>> strategies/implementations.
>>>>>>>>>>>>>>>>>>>>>>>>>>> We are still investigating a way to see if
>> we
>>>> can
>>>>>>> have
>>>>>>>>>>>> both
>>>>>>>>>>>>>>>>>>>>>>>> advantages.
>>>>>>>>>>>>>>>>>>>>>>>>>>> We should reach a consensus that the way
>>> should
>>>>> be a
>>>>>>>>>>> final
>>>>>>>>>>>>> state,
>>>>>>>>>>>>>>>>>>>>>>>> and we
>>>>>>>>>>>>>>>>>>>>>>>>>>> are on the path to it.
>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>> 2) filters and projections pushdown:
>>>>>>>>>>>>>>>>>>>>>>>>>>> I agree with Alex that the filter pushdown
>>> into
>>>>>>> cache
>>>>>>>>>>> can
>>>>>>>>>>>>> benefit a
>>>>>>>>>>>>>>>>>>>>>>>> lot
>>>>>>>>>>>>>>>>>>>>>>>>>> for
>>>>>>>>>>>>>>>>>>>>>>>>>>> ALL cache.
>>>>>>>>>>>>>>>>>>>>>>>>>>> However, this is not true for LRU cache.
>>>>> Connectors
>>>>>>> use
>>>>>>>>>>>>> cache to
>>>>>>>>>>>>>>>>>>>>>>>> reduce
>>>>>>>>>>>>>>>>>>>>>>>>>> IO
>>>>>>>>>>>>>>>>>>>>>>>>>>> requests to databases for better throughput.
>>>>>>>>>>>>>>>>>>>>>>>>>>> If a filter can prune 90% of data in the
>>> cache,
>>>> we
>>>>>>> will
>>>>>>>>>>>>> have 90% of
>>>>>>>>>>>>>>>>>>>>>>>>>> lookup
>>>>>>>>>>>>>>>>>>>>>>>>>>> requests that can never be cached
>>>>>>>>>>>>>>>>>>>>>>>>>>> and hit directly to the databases. That
>> means
>>>> the
>>>>>>> cache
>>>>>>>>>>> is
>>>>>>>>>>>>>>>>>>>>>>>> meaningless in
>>>>>>>>>>>>>>>>>>>>>>>>>>> this case.
>>>>>>>>>>>>>>>>>>>>>>>>>>> 
> IMO, Flink SQL has provided a standard way to do filter and projection
> pushdown, i.e., SupportsFilterPushDown and SupportsProjectionPushDown.
> The fact that JDBC/Hive/HBase haven't implemented these interfaces
> doesn't mean they are hard to implement. They should implement the
> pushdown interfaces to reduce IO and the cache size. The final state
> should be that the scan source and the lookup source share the exact same
> pushdown implementation. I don't see why we need to duplicate the
> pushdown logic in caches, which would complicate the lookup join design.
>
> 3) ALL cache abstraction
> The ALL cache might be the most challenging part of this FLIP. We have
> never provided a public reload-lookup interface. Currently, we put the
> reload logic in the "eval" method of TableFunction. That's hard for some
> sources (e.g., Hive). Ideally, connector implementations should share the
> logic of reload and scan, i.e. ScanTableSource with an
> InputFormat/SourceFunction/FLIP-27 Source. However,
> InputFormat/SourceFunction are deprecated, and the FLIP-27 source is
> deeply coupled with SourceOperator. If we want to invoke the FLIP-27
> source in LookupJoin, this may make the scope of this FLIP much larger.
> We are still investigating how to abstract the ALL cache logic and reuse
> the existing source interfaces.
>
> Best,
> Jark
>
> On Thu, 5 May 2022 at 20:22, Roman Boyko <ro.v.boyko@gmail.com> wrote:
>
>> It's a much more complicated activity and lies out of the scope of this
>> improvement, because such pushdowns should be done for all
>> ScanTableSource implementations (not only for the lookup ones).
>>
>> On Thu, 5 May 2022 at 19:02, Martijn Visser <martijnvisser@apache.org> wrote:
>>
>>> Hi everyone,
>>>
>>> One question regarding "And Alexander correctly mentioned that filter
>>> pushdown still is not implemented for jdbc/hive/hbase." -> Would an
>>> alternative solution be to actually implement these filter pushdowns? I
>>> can imagine that there are many more benefits to doing that, outside of
>>> lookup caching and metrics.
>>>
>>> Best regards,
>>>
>>> Martijn Visser
>>> https://twitter.com/MartijnVisser82
>>> https://github.com/MartijnVisser
>>>
>>> On Thu, 5 May 2022 at 13:58, Roman Boyko <ro.v.boyko@gmail.com> wrote:
>>>
>>>> Hi everyone!
>>>>
>>>> Thanks for driving such a valuable improvement!
>>>>
>>>> I do think that a single cache implementation would be a nice
>>>> opportunity for users. And it will break the "FOR SYSTEM_TIME AS OF
>>>> proc_time" semantics anyway, no matter how it is implemented.
>>>>
>>>> Putting myself in the user's shoes, I can say that:
>>>> 1) I would prefer to have the opportunity to cut off the cache size by
>>>> simply filtering unnecessary data. And the handiest way to do that is
>>>> to apply it inside the LookupRunners. It would be a bit harder to pass
>>>> it through the LookupJoin node to the TableFunction. And Alexander
>>>> correctly mentioned that filter pushdown still is not implemented for
>>>> jdbc/hive/hbase.
>>>> 2) The ability to set different caching parameters for different
>>>> tables is quite important. So I would prefer to set it through DDL
>>>> rather than have the same ttl, strategy and other options for all
>>>> lookup tables.
>>>> 3) Providing the cache in the framework really deprives us of
>>>> extensibility (users won't be able to implement their own cache). But
>>>> most probably this can be solved by creating more cache strategies and
>>>> a wider set of configurations.
>>>>
>>>> All these points are much closer to the schema proposed by Alexander.
>>>> Qingsheng Ren, please correct me if I'm not right and all these
>>>> facilities might be simply implemented in your architecture?
>>>>
>>>> Best regards,
>>>> Roman Boyko
>>>> e.: ro.v.boyko@gmail.com
>>>>
>>>> On Wed, 4 May 2022 at 21:01, Martijn Visser <martijnvisser@apache.org> wrote:
>>>>
>>>>> Hi everyone,
>>>>>
>>>>> I don't have much to chip in, but I just wanted to express that I
>>>>> really appreciate the in-depth discussion on this topic and I hope
>>>>> that others will join the conversation.
>>>>>
>>>>> Best regards,
>>>>>
>>>>> Martijn
>>>>>
>>>>> On Tue, 3 May 2022 at 10:15, Alexander Smirnov <smiralexan@gmail.com> wrote:
>>>>>
>>>>>> Hi Qingsheng, Leonard and Jark,
>>>>>>
>>>>>> Thanks for your detailed feedback! However, I have questions about
>>>>>> some of your statements (maybe I didn't get something?).
>>>>>>
>>>>>>> Caching actually breaks the semantic of "FOR SYSTEM_TIME AS OF proc_time"
>>>>>>
>>>>>> I agree that the semantics of "FOR SYSTEM_TIME AS OF proc_time" is
>>>>>> not fully implemented with caching, but as you said, users opt into
>>>>>> it consciously to achieve better performance (no one proposed to
>>>>>> enable caching by default, etc.). Or by users do you mean other
>>>>>> developers of connectors? In that case developers explicitly specify
>>>>>> whether their connector supports caching or not (in the list of
>>>>>> supported options); no one makes them do that if they don't want to.
>>>>>> So what exactly is the difference between implementing caching in
>>>>>> the module flink-table-runtime versus flink-table-common from this
>>>>>> point of view? How does it affect breaking or not breaking the
>>>>>> semantics of "FOR SYSTEM_TIME AS OF proc_time"?
>>>>>>
>>>>>>> confront a situation that allows table options in DDL to control
>>>>>>> the behavior of the framework, which has never happened previously
>>>>>>> and should be cautious
>>>>>>
>>>>>> If we talk about the main semantic differences between DDL options
>>>>>> and config options ("table.exec.xxx"), isn't it about limiting the
>>>>>> scope of the options plus their importance for the user's business
>>>>>> logic, rather than the specific location of the corresponding logic
>>>>>> in the framework? I mean that in my design, for example, putting an
>>>>>> option with the lookup cache strategy into configurations would be
>>>>>> the wrong decision, because it directly affects the user's business
>>>>>> logic (not just performance optimization) and touches just several
>>>>>> functions of ONE table (there can be multiple tables with different
>>>>>> caches). Does it really matter for the user (or anyone else) where
>>>>>> the logic affected by the applied option is located?
>>>>>> Also I can remember the DDL option 'sink.parallelism', which in some
>>>>>> way "controls the behavior of the framework", and I don't see any
>>>>>> problem there.
>>>>>>
>>>>>>> introduce a new interface for this all-caching scenario and the
>>>>>>> design would become more complex
>>>>>>
>>>>>> This is a subject for a separate discussion, but actually in our
>>>>>> internal version we solved this problem quite easily: we reused the
>>>>>> InputFormat class (so there is no need for a new API). The point is
>>>>>> that currently all lookup connectors use InputFormat for scanning
>>>>>> the data in batch mode: HBase, JDBC and even Hive (it uses the class
>>>>>> PartitionReader, which is actually just a wrapper around
>>>>>> InputFormat). The advantage of this solution is the ability to
>>>>>> reload cache data in parallel (the number of threads depends on the
>>>>>> number of InputSplits, but has an upper limit). As a result the
>>>>>> cache reload time reduces significantly (as does the time the input
>>>>>> stream is blocked). I know that we usually try to avoid concurrency
>>>>>> in Flink code, but maybe this one can be an exception. BTW I don't
>>>>>> claim it's an ideal solution; maybe there are better ones.
>>>>>>
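The parallel-reload idea Alexander describes can be sketched with plain java.util.concurrent primitives. This is illustrative only: a real implementation would read rows from InputSplits via InputFormat, and the names here (ParallelReloadDemo, reload) are invented for the sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelReloadDemo {
    // Rebuild the ALL cache with one task per "split" (a stand-in for InputSplit).
    static Map<Integer, String> reload(List<List<Integer>> splits, int threads)
            throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        Map<Integer, String> cache = new ConcurrentHashMap<>();
        List<Future<?>> futures = new ArrayList<>();
        for (List<Integer> split : splits) {
            futures.add(pool.submit(() -> {
                for (Integer key : split) {        // each task reads one split
                    cache.put(key, "row-" + key);  // stands in for deserialized rows
                }
            }));
        }
        for (Future<?> f : futures) {
            f.get();            // block until the whole reload is finished
        }
        pool.shutdown();
        return cache;
    }

    public static void main(String[] args) throws Exception {
        List<List<Integer>> splits = Arrays.asList(
                Arrays.asList(1, 2, 3), Arrays.asList(4, 5), Arrays.asList(6));
        System.out.println(reload(splits, 2).size()); // 6
    }
}
```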
>>>>>>> Providing the cache in the framework might introduce compatibility issues
>>>>>>
>>>>>> That is possible only if the developer of the connector doesn't
>>>>>> properly refactor his code and uses the new cache options
>>>>>> incorrectly (i.e. explicitly provides the same options in 2
>>>>>> different places in the code). For correct behavior all he needs to
>>>>>> do is redirect the existing options to the framework's LookupConfig
>>>>>> (plus maybe add an alias for options, if there was different
>>>>>> naming); everything will be transparent for users. If the developer
>>>>>> doesn't do any refactoring, nothing changes for the connector
>>>>>> because of backward compatibility. Also, if a developer wants to use
>>>>>> his own cache logic, he can simply refuse to pass some of the
>>>>>> configs into the framework and instead make his own implementation
>>>>>> with the already existing configs and metrics (but I think that's a
>>>>>> rare case).
>>>>>>
>>>>>>> filters and projections should be pushed all the way down to the
>>>>>>> table function, like what we do in the scan source
>>>>>>
>>>>>> That's a great goal. But the truth is that the ONLY connector that
>>>>>> supports filter pushdown is FileSystemTableSource (no database
>>>>>> connector supports it currently). Also, for some databases it's
>>>>>> simply impossible to push down filters as complex as the ones we
>>>>>> have in Flink.
>>>>>>
>>>>>>> only applying these optimizations to the cache seems not quite useful
>>>>>>
>>>>>> Filters can cut off an arbitrarily large amount of data from the
>>>>>> dimension table. For a simple example, suppose the dimension table
>>>>>> 'users' has a column 'age' with values from 20 to 40, and the input
>>>>>> stream 'clicks' is roughly uniformly distributed by age of users. If
>>>>>> we have the filter 'age > 30', there will be about half as much data
>>>>>> in the cache. This means the user can increase
>>>>>> 'lookup.cache.max-rows' by almost 2 times, which gives a huge
>>>>>> performance boost. Moreover, this optimization starts to really
>>>>>> shine with the 'ALL' cache, where tables that can't fit in memory
>>>>>> without filters and projections can fit with them. This opens up
>>>>>> additional possibilities for users. And that doesn't sound like 'not
>>>>>> quite useful'.
>>>>>>
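Alexander's arithmetic can be checked with a small sketch (plain Java collections; the 'users' table with ages 20 to 40 and the 'age > 30' filter are the hypothetical example from the message, and loadAll is an invented name):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

public class FilteredAllCache {
    // Build an ALL cache, admitting only the rows that pass the pushed-down filter.
    static Map<Integer, Integer> loadAll(List<int[]> rows, Predicate<int[]> filter) {
        Map<Integer, Integer> cache = new HashMap<>();
        for (int[] row : rows) {       // row = {userId, age}
            if (filter.test(row)) {    // e.g. 'age > 30' prunes rows before caching
                cache.put(row[0], row[1]);
            }
        }
        return cache;
    }

    public static void main(String[] args) {
        List<int[]> users = new ArrayList<>();
        for (int age = 20; age <= 40; age++) {
            users.add(new int[] {age - 19, age}); // ids 1..21, ages 20..40
        }
        Map<Integer, Integer> cache = loadAll(users, r -> r[1] > 30);
        System.out.println(cache.size()); // 10 rows instead of 21
    }
}
```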
>>>>>> It would be great to hear other voices regarding this topic!
>>>>>> Because we have quite a lot of controversial points, and I think
>>>>>> with the help of others it will be easier for us to come to a
>>>>>> consensus.
>>>>>>
>>>>>> Best regards,
>>>>>> Smirnov Alexander
>>>>>>
>>>>>> On Fri, 29 Apr 2022 at 22:33, Qingsheng Ren <renqschn@gmail.com> wrote:
>>>>>>
>>>>>>> Hi Alexander and Arvid,
>>>>>>>
>>>>>>> Thanks for the discussion and sorry for my late response! We had an
>>>>>>> internal discussion together with Jark and Leonard, and I'd like to
>>>>>>> summarize our ideas. Instead of implementing the cache logic in the
>>>>>>> table runtime layer or wrapping it around the user-provided table
>>>>>>> function, we prefer to introduce some new APIs extending
>>>>>>> TableFunction, with these concerns:
>>>>>>>
>>>>>>> 1. Caching actually breaks the semantic of "FOR SYSTEM_TIME AS OF
>>>>>>> proc_time", because it couldn't truly reflect the content of the
>>>>>>> lookup table at the moment of querying. If users choose to enable
>>>>>>> caching on the lookup table, they implicitly indicate that this
>>>>>>> breakage is acceptable in exchange for the performance. So we
>>>>>>> prefer not to provide caching at the table runtime level.
>>>>>>>
>>>>>>> 2. If we put the cache implementation in the framework (whether in
>>>>>>> a runner or a wrapper around TableFunction), we have to confront a
>>>>>>> situation that allows table options in DDL to control the behavior
>>>>>>> of the framework, which has never happened previously and should be
>>>>>>> treated cautiously. Under the current design the behavior of the
>>>>>>> framework should only be specified by configurations
>>>>>>> ("table.exec.xxx"), and it's hard to apply these general configs to
>>>>>>> a specific table.
>>>>>>>
>>>>>>> 3. We have use cases where the lookup source loads and periodically
>>>>>>> refreshes all records into memory to achieve high lookup
>>>>>>> performance (like the Hive connector in the community, also widely
>>>>>>> used by our internal connectors). Wrapping the cache around the
>>>>>>> user's TableFunction works fine for LRU caches, but I think we have
>>>>>>> to introduce a new interface for this all-caching scenario, and the
>>>>>>> design would become more complex.
>>>>>>>
>>>>>>> 4. Providing the cache in the framework might introduce
>>>>>>> compatibility issues to existing lookup sources: there might exist
>>>>>>> two caches with totally different strategies if the user
>>>>>>> incorrectly configures the table (one in the framework and another
>>>>>>> implemented by the lookup source).
>>>>>>>
>>>>>>> As for the optimization mentioned by Alexander, I think filters and
>>>>>>> projections should be pushed all the way down to the table
>>>>>>> function, like what we do in the scan source, instead of into the
>>>>>>> runner with the cache. The goal of using a cache is to reduce the
>>>>>>> network I/O and the pressure on the external system, and only
>>>>>>> applying these optimizations to the cache seems not quite useful.
>>>>>>>
>>>>>>> I made some updates to the FLIP[1] to reflect our ideas. We prefer
>>>>>>> to keep the cache implementation as a part of TableFunction, and we
>>>>>>> could provide some helper classes (CachingTableFunction,
>>>>>>> AllCachingTableFunction, CachingAsyncTableFunction) to developers
>>>>>>> and regulate the metrics of the cache. Also, I made a POC[2] for
>>>>>>> your reference.
>>>>>>>
>>>>>>> Looking forward to your ideas!
>>>>>>>
>>>>>>> [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-221+Abstraction+for+lookup+source+cache+and+metric
>>>>>>> [2] https://github.com/PatrickRen/flink/tree/FLIP-221
>>>>>>>
>>>>>>> Best regards,
>>>>>>>
>>>>>>> Qingsheng
>>>>>>>
>>>>>>> On Tue, Apr 26, 2022 at 4:45 PM Alexander Smirnov <smiralexan@gmail.com> wrote:
>> Thanks for the response, Arvid!
>>
>> I have a few comments on your message.
>>
>>> but could also live with an easier solution as the first step:
>>
>> I think that these 2 ways are mutually exclusive (the one originally
>> proposed by Qingsheng and mine), because conceptually they follow the same
>> goal, but the implementation details are different. If we go one way,
>> moving to the other way in the future will mean deleting existing code
>> and once again changing the API for connectors. So I think we should
>> reach a consensus with the community about that and then work together
>> on this FLIP, i.e. divide the work into tasks for the different parts of
>> the FLIP (for example, LRU cache unification / introducing the proposed
>> set of metrics / further work…). WDYT, Qingsheng?
>>
>>> as the source will only receive the requests after filter
>>
>> Actually, if filters are applied to fields of the lookup table, we first
>> must do the requests, and only after that can we filter the responses,
>> because lookup connectors don't have filter pushdown. So if filtering
>> is done before caching, there will be far fewer rows in the cache.
>>
>>> @Alexander unfortunately, your architecture is not shared. I don't know
>>> the solution to share images to be honest.
>>
>> Sorry for that, I'm a bit new to such kinds of conversations :)
>> I have no write access to the confluence, so I made a Jira issue,
>> where I described the proposed changes in more detail -
>> https://issues.apache.org/jira/browse/FLINK-27411.
>>
>> Will be happy to get more feedback!
>>
>> Best,
>> Smirnov Alexander
>>
>> On Mon, Apr 25, 2022 at 19:49, Arvid Heise <arvid@apache.org> wrote:
>>>
>>> Hi Qingsheng,
>>>
>>> Thanks for driving this; the inconsistency was not satisfying for me.
>>>
>>> I second Alexander's idea though but could also live with an easier
>>> solution as the first step: Instead of making caching an implementation
>>> detail of TableFunction X, rather devise a caching layer around X. So the
>>> proposal would be a CachingTableFunction that delegates to X in case of
>>> misses and else manages the cache. Lifting it into the operator model as
>>> proposed would be even better but is probably unnecessary in the first
>>> step for a lookup source (as the source will only receive the requests
>>> after filter; applying projection may be more interesting to save memory).
>>>
>>> Another advantage is that all the changes of this FLIP would be limited
>>> to options, no need for new public interfaces. Everything else remains an
>>> implementation detail of the Table runtime. That means we can easily
>>> incorporate the optimization potential that Alexander pointed out later.
>>>
>>> @Alexander unfortunately, your architecture is not shared. I don't know
>>> the solution to share images to be honest.
>>>
>>> On Fri, Apr 22, 2022 at 5:04 PM Александр Смирнов <smiralexan@gmail.com>
>>> wrote:
>>>>
>>>> Hi Qingsheng! My name is Alexander, I'm not a committer yet, but I'd
>>>> really like to become one. And this FLIP really interested me. Actually
>>>> I have worked on a similar feature in my company's Flink fork, and we
>>>> would like to share our thoughts on this and make the code open source.
>>>>
>>>> I think there is a better alternative than introducing an abstract class
>>>> for TableFunction (CachingTableFunction). As you know, TableFunction
>>>> lives in the flink-table-common module, which provides only an API for
>>>> working with tables – that makes it very convenient to import in
>>>> connectors. In turn, CachingTableFunction contains logic for runtime
>>>> execution, so this class and everything connected with it should be
>>>> located in another module, probably flink-table-runtime. But this would
>>>> require connectors to depend on another module that contains a lot of
>>>> runtime logic, which doesn't sound good.
>>>>
>>>> I suggest adding a new method 'getLookupConfig' to LookupTableSource or
>>>> LookupRuntimeProvider to allow connectors to only pass configurations to
>>>> the planner, so they won't depend on the runtime realization. Based on
>>>> these configs the planner will construct a lookup join operator with the
>>>> corresponding runtime logic (ProcessFunctions in flink-table-runtime).
>>>> The architecture looks like in the pinned image (the LookupConfig class
>>>> there is actually your CacheConfig).
>>>>
>>>> The classes in flink-table-planner that will be responsible for this are
>>>> CommonPhysicalLookupJoin and its inheritors. The current classes for
>>>> lookup join in flink-table-runtime are LookupJoinRunner,
>>>> AsyncLookupJoinRunner, LookupJoinRunnerWithCalc and
>>>> AsyncLookupJoinRunnerWithCalc. I suggest adding classes
>>>> LookupJoinCachingRunner, LookupJoinCachingRunnerWithCalc, etc.
>>>>
>>>> And here comes another, more powerful advantage of such a solution. If
>>>> we have the caching logic on a lower level, we can apply some
>>>> optimizations to it. LookupJoinRunnerWithCalc was named like this
>>>> because it uses the 'calc' function, which mostly consists of filters
>>>> and projections.
>>>>
>>>> For example, in a join of table A with lookup table B with condition
>>>> 'JOIN … ON A.id = B.id AND A.age = B.age + 10 WHERE B.salary > 1000',
>>>> the 'calc' function will contain the filters A.age = B.age + 10 and
>>>> B.salary > 1000.
>>>>
>>>> If we apply this function before storing records in the cache, the size
>>>> of the cache will be significantly reduced: filters avoid storing
>>>> useless records in the cache, and projections reduce the records' size.
>>>> So the initial max number of records in the cache can be increased by
>>>> the user.
>>>>
>>>> What do you think about it?
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
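To make the cache-size argument above concrete, here is a minimal, self-contained Java sketch of applying the calc function's filter and projection before the cache put. All names and types here are hypothetical stand-ins for illustration, not the actual Flink classes.

```java
import java.util.List;
import java.util.function.Function;
import java.util.function.Predicate;

// Simplified stand-in for a lookup response row from table B.
public class CalcBeforeCache {
    record Row(int id, int age, int salary, String widePayload) {}

    // WHERE B.salary > 1000: the part of the calc that can be applied
    // to lookup responses before they are stored in the cache.
    static final Predicate<Row> FILTER = r -> r.salary() > 1000;

    // Projection: keep only the columns the query actually reads,
    // dropping the wide payload to shrink each cached record.
    static final Function<Row, int[]> PROJECTION =
            r -> new int[] {r.id(), r.age()};

    static final List<Row> LOOKUP_RESPONSE = List.of(
            new Row(1, 30, 500, "wide-payload-1"),
            new Row(1, 31, 2000, "wide-payload-2"),
            new Row(1, 32, 3000, "wide-payload-3"));

    /** Rows cached if we store the raw lookup response. */
    public static int rawCachedRows() {
        return LOOKUP_RESPONSE.size();
    }

    /** Rows cached if the calc is applied before the cache put. */
    public static int calcCachedRows() {
        return (int) LOOKUP_RESPONSE.stream()
                .filter(FILTER)
                .map(PROJECTION) // narrower records as well as fewer of them
                .count();
    }

    public static void main(String[] args) {
        System.out.println("raw cache rows:  " + rawCachedRows());
        System.out.println("calc cache rows: " + calcCachedRows());
    }
}
```

With the filter applied first, only two of the three returned rows are cached, and each cached record is a narrow tuple instead of the full row.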
>>>> On 2022/04/19 02:47:11 Qingsheng Ren wrote:
>>>>> Hi devs,
>>>>>
>>>>> Yuan and I would like to start a discussion about FLIP-221 [1], which
>>>>> introduces an abstraction of lookup table cache and its standard
>>>>> metrics.
>>>>>
>>>>> Currently each lookup table source has to implement its own cache to
>>>>> store lookup results, and there isn't a standard set of metrics for
>>>>> users and developers to tune their jobs with lookup joins, which is a
>>>>> quite common use case in Flink table / SQL.
>>>>>
>>>>> Therefore we propose some new APIs including cache, metrics, wrapper
>>>>> classes of TableFunction and new table options. Please take a look at
>>>>> the FLIP page [1] to get more details. Any suggestions and comments
>>>>> would be appreciated!
>>>>>
>>>>> [1]
>>>>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-221+Abstraction+for+lookup+source+cache+and+metric
>>>>>
>>>>> Best regards,
>>>>>
>>>>> Qingsheng
>>>>>
>>>>> --
>>>>> Best Regards,
>>>>>
>>>>> Qingsheng Ren
>>>>>
>>>>> Real-time Computing Team
>>>>> Alibaba Cloud
>>>>>
>>>>> Email: renqschn@gmail.com
>>>>
>>>> --
>>>> Best regards,
>>>> Roman Boyko
>>>> e.: ro.v.boyko@gmail.com

Re: [DISCUSS] FLIP-221 Abstraction for lookup source cache and metric

Posted by Jingsong Li <ji...@gmail.com>.
Thanks Alexander for your reply. We can discuss the new interface when it
comes out.

We are more inclined to deprecate the connector `async` option, as discussed
in FLIP-234 [1], and use a query hint to let the planner decide. Although that
discussion has not yet reached a conclusion, can we remove this option from
this FLIP? It seems to belong to FLIP-234 rather than this FLIP, and we can
form a conclusion over there.

[1] https://lists.apache.org/thread/9k1sl2519kh2n3yttwqc00p07xdfns3h

Best,
Jingsong

On Wed, Jun 1, 2022 at 4:59 AM Jing Ge <ji...@ververica.com> wrote:

> Hi Jark,
>
> Thanks for clarifying it. It would be fine as long as we could provide the
> no-cache solution. I was just wondering whether the client-side cache could
> really help when HBase is used, since the data to look up should be huge.
> Depending on how much data is cached on the client side, the data that
> should be LRU in e.g. LruBlockCache will not be LRU anymore. In the worst
> case scenario, once the cached data on the client side expires, the request
> will hit disk, which will cause extra latency temporarily, if I am not
> mistaken.
>
> Best regards,
> Jing
>
> On Mon, May 30, 2022 at 9:59 AM Jark Wu <im...@gmail.com> wrote:
>
> > Hi Jing Ge,
> >
> > What do you mean by the "impact on the block cache used by HBase"? In my
> > understanding, the connector cache and the HBase cache are two totally
> > different things: the connector cache is a local/client-side cache, while
> > the HBase cache is a server-side cache.
> >
> > > does it make sense to have a no-cache solution as one of the
> > default solutions so that customers will have no effort for the migration
> > if they want to stick with Hbase cache
> >
> > The implementation migration should be transparent to users. Take the
> > HBase connector as an example: it already supports a lookup cache, but the
> > cache is disabled by default. After the migration, the connector still
> > disables the cache by default (i.e. the no-cache solution), so there is no
> > migration effort for users.
> >
> > The HBase cache and the connector cache are two different things. The
> > HBase cache can't simply replace the connector cache, because one of the
> > most important usages of the connector cache is reducing I/O
> > requests/responses and improving throughput, which cannot be achieved by
> > using only a server-side cache.
> >
> > Best,
> > Jark
> >
> >
> >
> >
> > On Fri, 27 May 2022 at 22:42, Jing Ge <ji...@ververica.com> wrote:
> >
> > > Thanks all for the valuable discussion. The new feature looks very
> > > interesting.
> > >
> > > According to the FLIP description: "*Currently we have JDBC, Hive and
> > HBase
> > > connector implemented lookup table source. All existing implementations
> > > will be migrated to the current design and the migration will be
> > > transparent to end users*." I was only wondering if we should pay
> > attention
> > > to HBase and similar DBs. Since, commonly, the lookup data will be huge
> > > while using HBase, partial caching will be used in this case, if I am
> not
> > > mistaken, which might have an impact on the block cache used by HBase,
> > e.g.
> > > LruBlockCache.
> > > Another question is that, since HBase provides a sophisticated cache
> > > solution, does it make sense to have a no-cache solution as one of the
> > > default solutions so that customers will have no effort for the
> migration
> > > if they want to stick with Hbase cache?
> > >
> > > Best regards,
> > > Jing
> > >
> > > On Fri, May 27, 2022 at 11:19 AM Jingsong Li <ji...@gmail.com>
> > > wrote:
> > >
> > > > Hi all,
> > > >
> > > > I think the problems now are as follows:
> > > > 1. The AllCache and PartialCache interfaces are not uniform: one needs
> > > > to provide a LookupProvider, the other a CacheBuilder.
> > > > 2. The AllCache definition is not flexible. For example, PartialCache
> > > > can use any custom storage while AllCache cannot; AllCache may also
> > > > want to store to memory or disk, so it needs a flexible strategy too.
> > > > 3. AllCache cannot customize its ReloadStrategy; currently there is
> > > > only ScheduledReloadStrategy.
> > > >
> > > > In order to solve the above problems, the following are my ideas.
> > > >
> > > > ## Top level cache interfaces:
> > > >
> > > > ```
> > > >
> > > > public interface CacheLookupProvider extends
> > > > LookupTableSource.LookupRuntimeProvider {
> > > >
> > > >     CacheBuilder createCacheBuilder();
> > > > }
> > > >
> > > >
> > > > public interface CacheBuilder {
> > > >     Cache create();
> > > > }
> > > >
> > > >
> > > > public interface Cache {
> > > >
> > > >     /**
> > > >      * Returns the value associated with key in this cache, or null
> if
> > > > there is no cached value for
> > > >      * key.
> > > >      */
> > > >     @Nullable
> > > >     Collection<RowData> getIfPresent(RowData key);
> > > >
> > > >     /** Returns the number of key-value mappings in the cache. */
> > > >     long size();
> > > > }
> > > >
> > > > ```
> > > >
> > > > ## Partial cache
> > > >
> > > > ```
> > > >
> > > > public interface PartialCacheLookupFunction extends
> > CacheLookupProvider {
> > > >
> > > >     @Override
> > > >     PartialCacheBuilder createCacheBuilder();
> > > >
> > > > /** Creates an {@link LookupFunction} instance. */
> > > > LookupFunction createLookupFunction();
> > > > }
> > > >
> > > >
> > > > public interface PartialCacheBuilder extends CacheBuilder {
> > > >
> > > >     PartialCache create();
> > > > }
> > > >
> > > >
> > > > public interface PartialCache extends Cache {
> > > >
> > > >     /**
> > > >      * Associates the specified value rows with the specified key row
> > > > in the cache. If the cache
> > > >      * previously contained value associated with the key, the old
> > > > value is replaced by the
> > > >      * specified value.
> > > >      *
> > > >      * @return the previous value rows associated with key, or null
> if
> > > > there was no mapping for key.
> > > >      * @param key - key row with which the specified value is to be
> > > > associated
> > > >      * @param value – value rows to be associated with the specified
> > key
> > > >      */
> > > >     Collection<RowData> put(RowData key, Collection<RowData> value);
> > > >
> > > >     /** Discards any cached value for the specified key. */
> > > >     void invalidate(RowData key);
> > > > }
> > > >
> > > > ```
> > > >
> > > > ## All cache
> > > > ```
> > > >
> > > > public interface AllCacheLookupProvider extends CacheLookupProvider {
> > > >
> > > >     void registerReloadStrategy(ScheduledExecutorService
> > > > executorService, Reloader reloader);
> > > >
> > > >     ScanTableSource.ScanRuntimeProvider getScanRuntimeProvider();
> > > >
> > > >     @Override
> > > >     AllCacheBuilder createCacheBuilder();
> > > > }
> > > >
> > > >
> > > > public interface AllCacheBuilder extends CacheBuilder {
> > > >
> > > >     AllCache create();
> > > > }
> > > >
> > > >
> > > > public interface AllCache extends Cache {
> > > >
> > > >     void putAll(Iterator<Map<RowData, RowData>> allEntries);
> > > >
> > > >     void clearAll();
> > > > }
> > > >
> > > >
> > > > public interface Reloader {
> > > >
> > > >     void reload();
> > > > }
> > > >
> > > > ```
> > > >
> > > > Best,
> > > > Jingsong
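As a rough illustration of the PartialCache semantics sketched in the interfaces above, here is a minimal size-bounded LRU realization. Generic keys and values are used as stand-ins for RowData, and the class name is hypothetical; this is a sketch of the semantics, not an implementation of the actual proposal.

```java
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.Map;

/** A size-bounded LRU cache mirroring the proposed PartialCache semantics. */
public class LruPartialCache<K, V> {
    private final int maxRows;
    private final LinkedHashMap<K, Collection<V>> map;

    public LruPartialCache(int maxRows) {
        this.maxRows = maxRows;
        // accessOrder=true makes iteration order least-recently-accessed
        // first, so removeEldestEntry implements LRU eviction.
        this.map = new LinkedHashMap<K, Collection<V>>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, Collection<V>> e) {
                return size() > LruPartialCache.this.maxRows;
            }
        };
    }

    /**
     * Mirrors Cache#getIfPresent: null means "not cached", while an empty
     * collection means a cached miss (the key is known to have no rows).
     */
    public Collection<V> getIfPresent(K key) {
        return map.get(key);
    }

    /** Mirrors PartialCache#put: returns the previously cached rows, if any. */
    public Collection<V> put(K key, Collection<V> value) {
        return map.put(key, value);
    }

    /** Mirrors PartialCache#invalidate. */
    public void invalidate(K key) {
        map.remove(key);
    }

    /** Mirrors Cache#size. */
    public long size() {
        return map.size();
    }

    public static void main(String[] args) {
        LruPartialCache<String, String> cache = new LruPartialCache<>(2);
        cache.put("a", java.util.List.of("row1"));
        cache.put("b", java.util.List.of("row2"));
        cache.getIfPresent("a");               // touch "a"; "b" is now eldest
        cache.put("c", java.util.List.of("row3")); // evicts "b"
        System.out.println("size=" + cache.size());
    }
}
```

A real implementation would of course also need eviction by weight/TTL and the metrics discussed in the FLIP; the point here is only the getIfPresent/put/invalidate contract.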
> > > >
> > > > On Fri, May 27, 2022 at 11:10 AM Jingsong Li <jingsonglee0@gmail.com
> >
> > > > wrote:
> > > >
> > > > > Thanks Qingsheng and all for your discussion.
> > > > >
> > > > > Very sorry to jump in so late.
> > > > >
> > > > > Maybe I missed something?
> > > > > My first impression when I saw the cache interface was: why don't we
> > > > > provide an interface similar to the Guava cache [1]? On top of the
> > > > > Guava cache, Caffeine also adds extensions for asynchronous calls
> > > > > [2], and Caffeine supports bulk loading as well.
> > > > >
> > > > > I am also confused about why we first go through
> > > > > LookupCacheFactory.Builder and then through a Factory to create the
> > > > > Cache.
> > > > >
> > > > > [1] https://github.com/google/guava
> > > > > [2] https://github.com/ben-manes/caffeine/wiki/Population
> > > > >
> > > > > Best,
> > > > > Jingsong
> > > > >
> > > > > On Thu, May 26, 2022 at 11:17 PM Jark Wu <im...@gmail.com> wrote:
> > > > >
> > > > >> After looking at the newly introduced ReloadTime and Becket's
> > > > >> comment, I agree with Becket that we should have a pluggable
> > > > >> reloading strategy. We can provide some common implementations,
> > > > >> e.g., periodic reloading and daily reloading. But there will
> > > > >> definitely be some connector- or business-specific reloading
> > > > >> strategies, e.g., notifying via a ZooKeeper watcher, or reloading
> > > > >> once a new Hive partition is complete.
> > > > >>
> > > > >> Best,
> > > > >> Jark
> > > > >>
> > > > >> On Thu, 26 May 2022 at 11:52, Becket Qin <be...@gmail.com>
> > > wrote:
> > > > >>
> > > > >> > Hi Qingsheng,
> > > > >> >
> > > > >> > Thanks for updating the FLIP. A few comments / questions below:
> > > > >> >
> > > > >> > 1. Is there a reason that we have both "XXXFactory" and
> > > > >> > "XXXProvider"? What is the difference between them? If they are
> > > > >> > the same, can we just use XXXFactory everywhere?
> > > > >> >
> > > > >> > 2. Regarding the FullCachingLookupProvider, should the reloading
> > > > >> > policy also be pluggable? Periodic reloading can sometimes be
> > > > >> > tricky in practice. For example, if the user sets 24 hours as the
> > > > >> > cache refresh interval and some nightly batch job is delayed, the
> > > > >> > cache update may still see the stale data.
> > > > >> >
> > > > >> > 3. In DefaultLookupCacheFactory, it looks like InitialCapacity
> > > should
> > > > be
> > > > >> > removed.
> > > > >> >
> > > > >> > 4. The purpose of LookupFunctionProvider#cacheMissingKey() seems
> > > > >> > a little confusing to me. If Optional<LookupCacheFactory>
> > > > >> > getCacheFactory() returns a non-empty factory, doesn't that
> > > > >> > already indicate that the framework should cache the missing
> > > > >> > keys? Also, why does this method return an Optional<Boolean>
> > > > >> > instead of a boolean?
> > > > >> >
> > > > >> > Thanks,
> > > > >> >
> > > > >> > Jiangjie (Becket) Qin
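The pluggable reloading idea raised above could look roughly like this. The interface and class names (Reloader, ReloadStrategy, PeriodicReloadStrategy) are hypothetical sketches for this discussion, not the FLIP's actual API; periodic reloading is just one strategy implementation behind the common interface.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// The strategy decides *when* to trigger a reload; the reloader knows *how*.
interface Reloader {
    void reload();
}

interface ReloadStrategy {
    void register(ScheduledExecutorService executor, Reloader reloader);
}

/** One possible strategy: fixed-rate periodic reloading. */
class PeriodicReloadStrategy implements ReloadStrategy {
    private final long periodMillis;

    PeriodicReloadStrategy(long periodMillis) {
        this.periodMillis = periodMillis;
    }

    @Override
    public void register(ScheduledExecutorService executor, Reloader reloader) {
        executor.scheduleAtFixedRate(
                reloader::reload, 0, periodMillis, TimeUnit.MILLISECONDS);
    }
}

public class ReloadDemo {
    /** Runs a periodic strategy for a while and counts the reloads. */
    public static int runFor(long millis) throws InterruptedException {
        ScheduledExecutorService executor =
                Executors.newSingleThreadScheduledExecutor();
        AtomicInteger reloads = new AtomicInteger();
        new PeriodicReloadStrategy(20)
                .register(executor, reloads::incrementAndGet);
        Thread.sleep(millis);
        executor.shutdownNow();
        return reloads.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("reloads: " + runFor(100));
    }
}
```

A daily strategy, or one driven by a ZooKeeper watch or Hive partition event, would be another ReloadStrategy implementation registering its own trigger instead of a fixed-rate schedule.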
> > > > >> >
> > > > >> >
> > > > >> >
> > > > >> > On Wed, May 25, 2022 at 5:07 PM Qingsheng Ren <
> renqschn@gmail.com
> > >
> > > > >> wrote:
> > > > >> >
> > > > >> >> Hi Lincoln and Jark,
> > > > >> >>
> > > > >> >> Thanks for the comments! If the community reaches a consensus
> > that
> > > we
> > > > >> use
> > > > >> >> SQL hint instead of table options to decide whether to use sync
> > or
> > > > >> async
> > > > >> >> mode, it’s indeed not necessary to introduce the “lookup.async”
> > > > option.
> > > > >> >>
> > > > >> I think it's a good idea to let the decision on async be made at
> > > > >> the query level, which could enable better optimization with more
> > > > >> information gathered by the planner. Is there any FLIP describing
> > > > >> the issue in FLINK-27625? I thought FLIP-234 only proposes adding a
> > > > >> SQL hint for retry on missing, rather than having the entire async
> > > > >> mode controlled by a hint.
> > > > >> >>
> > > > >> >> Best regards,
> > > > >> >>
> > > > >> >> Qingsheng
> > > > >> >>
> > > > >> >> > On May 25, 2022, at 15:13, Lincoln Lee <
> lincoln.86xy@gmail.com
> > >
> > > > >> wrote:
> > > > >> >> >
> > > > >> >> > Hi Jark,
> > > > >> >> >
> > > > >> >> > Thanks for your reply!
> > > > >> >> >
> > > > >> > Currently 'lookup.async' only exists in the HBase connector. I
> > > > >> > have no idea whether or when to remove it (we can discuss that in
> > > > >> > a separate issue for the HBase connector after FLINK-27625 is
> > > > >> > done); I just suggest not adding it as a common option now.
> > > > >> >> >
> > > > >> >> > Best,
> > > > >> >> > Lincoln Lee
> > > > >> >> >
> > > > >> >> >
> > > > >> >> > On Tue, May 24, 2022 at 20:14, Jark Wu <im...@gmail.com> wrote:
> > > > >> >> >
> > > > >> >> >> Hi Lincoln,
> > > > >> >> >>
> > > > >> >> >> I have taken a look at FLIP-234, and I agree with you that
> the
> > > > >> >> connectors
> > > > >> >> >> can
> > > > >> >> >> provide both async and sync runtime providers simultaneously
> > > > instead
> > > > >> >> of one
> > > > >> >> >> of them.
> > > > >> >> >> At that point, "lookup.async" looks redundant. If this
> option
> > is
> > > > >> >> planned to
> > > > >> >> >> be removed
> > > > >> >> >> in the long term, I think it makes sense not to introduce it
> > in
> > > > this
> > > > >> >> FLIP.
> > > > >> >> >>
> > > > >> >> >> Best,
> > > > >> >> >> Jark
> > > > >> >> >>
> > > > >> >> >> On Tue, 24 May 2022 at 11:08, Lincoln Lee <
> > > lincoln.86xy@gmail.com
> > > > >
> > > > >> >> wrote:
> > > > >> >> >>
> > > > >> >> >>> Hi Qingsheng,
> > > > >> >> >>>
> > > > >> >> >>> Sorry for jumping into the discussion so late. It's a good
> > idea
> > > > >> that
> > > > >> >> we
> > > > >> >> >> can
> > > > >> >> >>> have a common table option. I have a minor comments on
> > > > >> 'lookup.async'
> > > > >> >> >> that
> > > > >> >> >>> not make it a common option:
> > > > >> >> >>>
> > > > >> >> >>> The table layer abstracts both sync and async lookup
> > > > capabilities,
> > > > >> >> >>> connector implementers can choose one or both; in the case of
> > > > >> >> >>> implementing only one capability (the status of most existing
> > > > >> >> >>> built-in connectors),
> > > > >> >> >>> 'lookup.async' will not be used.  And when a connector has
> > both
> > > > >> >> >>> capabilities, I think this choice is more suitable for
> making
> > > > >> >> decisions
> > > > >> >> >> at
> > > > >> >> >>> the query level, for example, table planner can choose the
> > > > physical
> > > > >> >> >>> implementation of async lookup or sync lookup based on its
> > cost
> > > > >> >> model, or
> > > > >> >> >>> users can give query hint based on their own better
> > > > >> understanding.  If
> > > > >> >> >>> there is another common table option 'lookup.async', it may
> > > > confuse
> > > > >> >> the
> > > > >> >> >>> users in the long run.
> > > > >> >> >>>
> > > > >> >> >>> So, I prefer to leave the 'lookup.async' option in private
> > > place
> > > > >> (for
> > > > >> >> the
> > > > >> >> >>> current hbase connector) and not turn it into a common
> > option.
> > > > >> >> >>>
> > > > >> >> >>> WDYT?
> > > > >> >> >>>
> > > > >> >> >>> Best,
> > > > >> >> >>> Lincoln Lee
> > > > >> >> >>>
> > > > >> >> >>>
> > > > >> >> >>> Qingsheng Ren <re...@gmail.com> wrote on Mon, May 23, 2022 at 14:54:
> > > > >> >> >>>
> > > > >> >> >>>> Hi Alexander,
> > > > >> >> >>>>
> > > > >> >> >>>> Thanks for the review! We recently updated the FLIP and
> you
> > > can
> > > > >> find
> > > > >> >> >>> those
> > > > >> >> >>>> changes from my latest email. Since some of the terminology
> > > > >> >> >>>> has changed, I'll use the new terms when replying to your
> > > > >> >> >>>> comments.
> > > > >> >> >>>>
> > > > >> >> >>>> 1. Builder vs ‘of’
> > > > >> >> >>>> I’m OK to use builder pattern if we have additional
> optional
> > > > >> >> parameters
> > > > >> >> >>>> for full caching mode (“rescan” previously). The
> > > > >> schedule-with-delay
> > > > >> >> >> idea
> > > > >> >> >>>> looks reasonable to me, but I think we need to redesign
> the
> > > > >> builder
> > > > >> >> API
> > > > >> >> >>> of
> > > > >> >> >>>> full caching to make it more descriptive for developers.
> > Would
> > > > you
> > > > >> >> mind
> > > > >> >> >>>> sharing your ideas about the API? For accessing the FLIP
> > > > workspace
> > > > >> >> you
> > > > >> >> >>> can
> > > > >> >> >>>> just provide your account ID and ping any PMC member
> > including
> > > > >> Jark.
> > > > >> >> >>>>
> > > > >> >> >>>> 2. Common table options
> > > > >> >> >>>> We have some discussions these days and propose to
> > introduce 8
> > > > >> common
> > > > >> >> >>>> table options about caching. It has been updated on the
> > FLIP.
> > > > >> >> >>>>
> > > > >> >> >>>> 3. Retries
> > > > >> >> >>>> I think we are on the same page :-)
> > > > >> >> >>>>
> > > > >> >> >>>> For your additional concerns:
> > > > >> >> >>>> 1) The table option has been updated.
> > > > >> >> >>>> 2) We got “lookup.cache” back for configuring whether to
> use
> > > > >> partial
> > > > >> >> or
> > > > >> >> >>>> full caching mode.
> > > > >> >> >>>>
> > > > >> >> >>>> Best regards,
> > > > >> >> >>>>
> > > > >> >> >>>> Qingsheng
> > > > >> >> >>>>
> > > > >> >> >>>>
> > > > >> >> >>>>
> > > > >> >> >>>>> On May 19, 2022, at 17:25, Александр Смирнов <
> > > > >> smiralexan@gmail.com>
> > > > >> >> >>>> wrote:
> > > > >> >> >>>>>
> > > > >> >> >>>>> Also I have a few additions:
> > > > >> >> >>>>> 1) maybe rename 'lookup.cache.maximum-size' to
> > > > >> >> >>>>> 'lookup.cache.max-rows'? I think it will be clearer that we
> > > > >> >> >>>>> are talking not about bytes but about the number of rows.
> > > > >> >> >>>>> Plus it fits better, considering my optimization with
> > > > >> >> >>>>> filters.
> > > > >> >> >>>>> 2) How will users enable rescanning? Are we going to
> > separate
> > > > >> >> caching
> > > > >> >> >>>>> and rescanning from the options point of view? Like
> > initially
> > > > we
> > > > >> had
> > > > >> >> >>>>> one option 'lookup.cache' with values LRU / ALL. I think
> > now
> > > we
> > > > >> can
> > > > >> >> >>>>> make a boolean option 'lookup.rescan'. RescanInterval can
> > be
> > > > >> >> >>>>> 'lookup.rescan.interval', etc.
> > > > >> >> >>>>>
> > > > >> >> >>>>> Best regards,
> > > > >> >> >>>>> Alexander
> > > > >> >> >>>>>
> > > > >> >> >>>>> On Thu, May 19, 2022 at 14:50, Александр Смирнов <
> > > > >> smiralexan@gmail.com
> > > > >> >> >>> :
> > > > >> >> >>>>>>
> > > > >> >> >>>>>> Hi Qingsheng and Jark,
> > > > >> >> >>>>>>
> > > > >> >> >>>>>> 1. Builders vs 'of'
> > > > >> >> >>>>>> I understand that builders are used when we have
> multiple
> > > > >> >> >> parameters.
> > > > >> >> >>>>>> I suggested them because we could add parameters later.
> To
> > > > >> prevent
> > > > >> >> >>>>>> Builder for ScanRuntimeProvider from looking redundant I
> > can
> > > > >> >> suggest
> > > > >> >> >>>>>> one more config now - "rescanStartTime".
> > > > >> >> >>>>>> It's a time in UTC (LocalTime class) when the first reload
> > > > >> >> >>>>>> of the cache starts. This parameter can be thought of as
> > > > >> >> >>>>>> the 'initialDelay' (the diff between the current time and
> > > > >> >> >>>>>> rescanStartTime) in the method
> > > > >> >> >>>>>> ScheduledExecutorService#scheduleWithFixedDelay [1]. It can
> > > > >> >> >>>>>> be very useful when the dimension table is updated by some
> > > > >> >> >>>>>> other scheduled job at a certain time, or when the user
> > > > >> >> >>>>>> simply wants the second scan (the first cache reload) to be
> > > > >> >> >>>>>> delayed. This option can be used even without
> > > > >> >> >>>>>> 'rescanInterval' - in that case 'rescanInterval' will
> > > > >> >> >>>>>> default to one day.
> > > > >> >> >>>>>> If you are fine with this option, I would be very glad if
> > > > >> >> >>>>>> you would give me access to edit the FLIP page, so I could
> > > > >> >> >>>>>> add it myself.
> > > > >> >> >>>>>>
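[Editorial sketch: the 'initialDelay' computation described above, i.e. the diff between the current time and a UTC rescanStartTime, can be written as follows. The class and method names (`RescanDelay`, `initialDelay`) are illustrative only and are not part of the FLIP; the resulting `Duration` is what would be passed as the initial delay to `ScheduledExecutorService#scheduleWithFixedDelay`.]

```java
import java.time.Duration;
import java.time.LocalTime;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;

/** Sketch: compute the initial delay between "now" and a configured
 *  rescanStartTime, a wall-clock time in UTC (LocalTime). */
public class RescanDelay {
    static Duration initialDelay(ZonedDateTime now, LocalTime rescanStartTime) {
        ZonedDateTime nowUtc = now.withZoneSameInstant(ZoneOffset.UTC);
        // set the time-of-day fields of "now" to rescanStartTime
        ZonedDateTime next = nowUtc.with(rescanStartTime);
        if (!next.isAfter(nowUtc)) {
            next = next.plusDays(1); // already passed today -> first reload tomorrow
        }
        return Duration.between(nowUtc, next);
    }

    public static void main(String[] args) {
        ZonedDateTime now = ZonedDateTime.of(2022, 5, 19, 10, 0, 0, 0, ZoneOffset.UTC);
        Duration delay = initialDelay(now, LocalTime.of(12, 30));
        System.out.println(delay.toMinutes()); // 150
        // the delay would then feed:
        // executor.scheduleWithFixedDelay(reloadTask, delay.toMillis(),
        //         rescanIntervalMillis, TimeUnit.MILLISECONDS);
    }
}
```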
> > > > >> >> >>>>>> 2. Common table options
> > > > >> >> >>>>>> I also think that FactoryUtil would be overloaded by all
> > > cache
> > > > >> >> >>>>>> options. But maybe unify all suggested options, not only
> > for
> > > > >> >> default
> > > > >> >> >>>>>> cache? I.e. class 'LookupOptions', that unifies default
> > > cache
> > > > >> >> >> options,
> > > > >> >> >>>>>> rescan options, 'async', 'maxRetries'. WDYT?
> > > > >> >> >>>>>>
> > > > >> >> >>>>>> 3. Retries
> > > > >> >> >>>>>> I'm fine with suggestion close to
> > RetryUtils#tryTimes(times,
> > > > >> call)
> > > > >> >> >>>>>>
> > > > >> >> >>>>>> [1]
> > > > >> >> >>>>
> > > > >> >> >>>
> > > > >> >> >>
> > > > >> >>
> > > > >>
> > > >
> > >
> >
> https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/ScheduledExecutorService.html#scheduleWithFixedDelay-java.lang.Runnable-long-long-java.util.concurrent.TimeUnit-
> > > > >> >> >>>>>>
> > > > >> >> >>>>>> Best regards,
> > > > >> >> >>>>>> Alexander
> > > > >> >> >>>>>>
> > > > >> >> >>>>>> On Wed, May 18, 2022 at 16:04, Qingsheng Ren <
> > > renqschn@gmail.com
> > > > >:
> > > > >> >> >>>>>>>
> > > > >> >> >>>>>>> Hi Jark and Alexander,
> > > > >> >> >>>>>>>
> > > > >> >> >>>>>>> Thanks for your comments! I’m also OK to introduce
> common
> > > > table
> > > > >> >> >>>> options. I prefer to introduce a new
> > DefaultLookupCacheOptions
> > > > >> class
> > > > >> >> >> for
> > > > >> >> >>>> holding these option definitions because putting all
> options
> > > > into
> > > > >> >> >>>> FactoryUtil would make it a bit ”crowded” and not well
> > > > >> categorized.
> > > > >> >> >>>>>>>
> > > > >> >> >>>>>>> FLIP has been updated according to suggestions above:
> > > > >> >> >>>>>>> 1. Use static “of” method for constructing
> > > > >> RescanRuntimeProvider
> > > > >> >> >>>> considering both arguments are required.
> > > > >> >> >>>>>>> 2. Introduce new table options matching
> > > > >> DefaultLookupCacheFactory
> > > > >> >> >>>>>>>
> > > > >> >> >>>>>>> Best,
> > > > >> >> >>>>>>> Qingsheng
> > > > >> >> >>>>>>>
> > > > >> >> >>>>>>> On Wed, May 18, 2022 at 2:57 PM Jark Wu <
> > imjark@gmail.com>
> > > > >> wrote:
> > > > >> >> >>>>>>>>
> > > > >> >> >>>>>>>> Hi Alex,
> > > > >> >> >>>>>>>>
> > > > >> >> >>>>>>>> 1) retry logic
> > > > >> >> >>>>>>>> I think we can extract some common retry logic into
> > > > utilities,
> > > > >> >> >> e.g.
> > > > >> >> >>>> RetryUtils#tryTimes(times, call).
> > > > >> >> >>>>>>>> This seems independent of this FLIP and can be reused
> by
> > > > >> >> >> DataStream
> > > > >> >> >>>> users.
> > > > >> >> >>>>>>>> Maybe we can open an issue to discuss this and where
> to
> > > put
> > > > >> it.
> > > > >> >> >>>>>>>>
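[Editorial sketch: the `RetryUtils#tryTimes(times, call)` helper mentioned above could look roughly like this. The class and method names come from the thread, but the body is an assumption about one possible shape, not an agreed implementation.]

```java
import java.util.concurrent.Callable;

/** Sketch of a generic retry utility: try a call up to 'times' attempts,
 *  rethrowing the last failure if all attempts fail. */
public final class RetryUtils {
    private RetryUtils() {}

    public static <T> T tryTimes(int times, Callable<T> call) throws Exception {
        Exception last = null;
        for (int attempt = 1; attempt <= times; attempt++) {
            try {
                return call.call();
            } catch (Exception e) {
                // a real implementation would only retry retriable failures and
                // could run custom logic here (e.g. re-establish a connection)
                last = e;
            }
        }
        throw last;
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        String result = tryTimes(3, () -> {
            if (++calls[0] < 3) throw new RuntimeException("transient failure");
            return "ok";
        });
        System.out.println(result + " after " + calls[0] + " attempts"); // ok after 3 attempts
    }
}
```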
> > > > >> >> >>>>>>>> 2) cache ConfigOptions
> > > > >> >> >>>>>>>> I'm fine with defining cache config options in the
> > > > framework.
> > > > >> >> >>>>>>>> A candidate place to put is FactoryUtil which also
> > > includes
> > > > >> >> >>>> "sink.parallelism", "format" options.
> > > > >> >> >>>>>>>>
> > > > >> >> >>>>>>>> Best,
> > > > >> >> >>>>>>>> Jark
> > > > >> >> >>>>>>>>
> > > > >> >> >>>>>>>>
> > > > >> >> >>>>>>>> On Wed, 18 May 2022 at 13:52, Александр Смирнов <
> > > > >> >> >>> smiralexan@gmail.com>
> > > > >> >> >>>> wrote:
> > > > >> >> >>>>>>>>>
> > > > >> >> >>>>>>>>> Hi Qingsheng,
> > > > >> >> >>>>>>>>>
> > > > >> >> >>>>>>>>> Thank you for considering my comments.
> > > > >> >> >>>>>>>>>
> > > > >> >> >>>>>>>>>> there might be custom logic before making retry,
> such
> > as
> > > > >> >> >>>> re-establish the connection
> > > > >> >> >>>>>>>>>
> > > > >> >> >>>>>>>>> Yes, I understand that. I meant that such logic can
> be
> > > > >> placed in
> > > > >> >> >> a
> > > > >> >> >>>>>>>>> separate function, that can be implemented by
> > connectors.
> > > > >> Just
> > > > >> >> >>> moving
> > > > >> >> >>>>>>>>> the retry logic would make connector's LookupFunction
> > > more
> > > > >> >> >> concise
> > > > >> >> >>> +
> > > > >> >> >>>>>>>>> avoid duplicate code. However, it's a minor change.
> The
> > > > >> decision
> > > > >> >> >> is
> > > > >> >> >>>> up
> > > > >> >> >>>>>>>>> to you.
> > > > >> >> >>>>>>>>>
> > > > >> >> >>>>>>>>>> We decide not to provide common DDL options and let
> > > > >> developers
> > > > >> >> >> to
> > > > >> >> >>>> define their own options as we do now per connector.
> > > > >> >> >>>>>>>>>
> > > > >> >> >>>>>>>>> What is the reason for that? One of the main goals of
> > > this
> > > > >> FLIP
> > > > >> >> >> was
> > > > >> >> >>>> to
> > > > >> >> >>>>>>>>> unify the configs, wasn't it? I understand that
> current
> > > > cache
> > > > >> >> >>> design
> > > > >> >> >>>>>>>>> doesn't depend on ConfigOptions, like it did before. But
> > > still
> > > > >> we
> > > > >> >> >> can
> > > > >> >> >>>> put
> > > > >> >> >>>>>>>>> these options into the framework, so connectors can
> > reuse
> > > > >> them
> > > > >> >> >> and
> > > > >> >> >>>>>>>>> avoid code duplication, and, what is more
> significant,
> > > > avoid
> > > > >> >> >>> possible
> > > > >> >> >>>>>>>>> different options naming. This moment can be pointed
> > out
> > > in
> > > > >> >> >>>>>>>>> documentation for connector developers.
> > > > >> >> >>>>>>>>>
> > > > >> >> >>>>>>>>> Best regards,
> > > > >> >> >>>>>>>>> Alexander
> > > > >> >> >>>>>>>>>
> > > > >> >> >>>>>>>>> On Tue, May 17, 2022 at 17:11, Qingsheng Ren <
> > > > >> renqschn@gmail.com>:
> > > > >> >> >>>>>>>>>>
> > > > >> >> >>>>>>>>>> Hi Alexander,
> > > > >> >> >>>>>>>>>>
> > > > >> >> >>>>>>>>>> Thanks for the review and glad to see we are on the
> > same
> > > > >> page!
> > > > >> >> I
> > > > >> >> >>>> think you forgot to cc the dev mailing list so I’m also
> > > quoting
> > > > >> your
> > > > >> >> >>> reply
> > > > >> >> >>>> under this email.
> > > > >> >> >>>>>>>>>>
> > > > >> >> >>>>>>>>>>> We can add 'maxRetryTimes' option into this class
> > > > >> >> >>>>>>>>>>
> > > > >> >> >>>>>>>>>> In my opinion the retry logic should be implemented
> in
> > > > >> lookup()
> > > > >> >> >>>> instead of in LookupFunction#eval(). Retrying is only
> > > meaningful
> > > > >> >> under
> > > > >> >> >>> some
> > > > >> >> >>>> specific retriable failures, and there might be custom
> logic
> > > > >> before
> > > > >> >> >>> making
> > > > >> >> >>>> retry, such as re-establish the connection
> > > > >> (JdbcRowDataLookupFunction
> > > > >> >> >> is
> > > > >> >> >>> an
> > > > >> >> >>>> example), so it's more handy to leave it to the connector.
> > > > >> >> >>>>>>>>>>
> > > > >> >> >>>>>>>>>>> I don't see DDL options, that were in previous
> > version
> > > of
> > > > >> >> FLIP.
> > > > >> >> >>> Do
> > > > >> >> >>>> you have any special plans for them?
> > > > >> >> >>>>>>>>>>
> > > > >> >> >>>>>>>>>> We decide not to provide common DDL options and let
> > > > >> developers
> > > > >> >> >> to
> > > > >> >> >>>> define their own options as we do now per connector.
> > > > >> >> >>>>>>>>>>
> > > > >> >> >>>>>>>>>> The rest of comments sound great and I’ll update the
> > > FLIP.
> > > > >> Hope
> > > > >> >> >> we
> > > > >> >> >>>> can finalize our proposal soon!
> > > > >> >> >>>>>>>>>>
> > > > >> >> >>>>>>>>>> Best,
> > > > >> >> >>>>>>>>>>
> > > > >> >> >>>>>>>>>> Qingsheng
> > > > >> >> >>>>>>>>>>
> > > > >> >> >>>>>>>>>>
> > > > >> >> >>>>>>>>>>> On May 17, 2022, at 13:46, Александр Смирнов <
> > > > >> >> >>> smiralexan@gmail.com>
> > > > >> >> >>>> wrote:
> > > > >> >> >>>>>>>>>>>
> > > > >> >> >>>>>>>>>>> Hi Qingsheng and devs!
> > > > >> >> >>>>>>>>>>>
> > > > >> >> >>>>>>>>>>> I like the overall design of updated FLIP, however
> I
> > > have
> > > > >> >> >> several
> > > > >> >> >>>>>>>>>>> suggestions and questions.
> > > > >> >> >>>>>>>>>>>
> > > > >> >> >>>>>>>>>>> 1) Introducing LookupFunction as a subclass of
> > > > >> TableFunction
> > > > >> >> >> is a
> > > > >> >> >>>> good
> > > > >> >> >>>>>>>>>>> idea. We can add 'maxRetryTimes' option into this
> > > class.
> > > > >> >> 'eval'
> > > > >> >> >>>> method
> > > > >> >> >>>>>>>>>>> of new LookupFunction is great for this purpose.
> The
> > > same
> > > > >> is
> > > > >> >> >> for
> > > > >> >> >>>>>>>>>>> 'async' case.
> > > > >> >> >>>>>>>>>>>
> > > > >> >> >>>>>>>>>>> 2) There might be other configs in future, such as
> > > > >> >> >>>> 'cacheMissingKey'
> > > > >> >> >>>>>>>>>>> in LookupFunctionProvider or 'rescanInterval' in
> > > > >> >> >>>> ScanRuntimeProvider.
> > > > >> >> >>>>>>>>>>> Maybe use Builder pattern in LookupFunctionProvider
> > and
> > > > >> >> >>>>>>>>>>> RescanRuntimeProvider for more flexibility (use one
> > > > 'build'
> > > > >> >> >>> method
> > > > >> >> >>>>>>>>>>> instead of many 'of' methods in future)?
> > > > >> >> >>>>>>>>>>>
> > > > >> >> >>>>>>>>>>> 3) What are the plans for existing
> > > TableFunctionProvider
> > > > >> and
> > > > >> >> >>>>>>>>>>> AsyncTableFunctionProvider? I think they should be
> > > > >> deprecated.
> > > > >> >> >>>>>>>>>>>
> > > > >> >> >>>>>>>>>>> 4) Am I right that the current design does not
> assume
> > > > >> usage of
> > > > >> >> >>>>>>>>>>> user-provided LookupCache in re-scanning? In this
> > case,
> > > > it
> > > > >> is
> > > > >> >> >> not
> > > > >> >> >>>> very
> > > > >> >> >>>>>>>>>>> clear why do we need methods such as 'invalidate'
> or
> > > > >> 'putAll'
> > > > >> >> >> in
> > > > >> >> >>>>>>>>>>> LookupCache.
> > > > >> >> >>>>>>>>>>>
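[Editorial sketch: the LookupCache methods being questioned above. The method names ('getIfPresent', 'put', 'invalidate', 'putAll') come from the thread; the generic signatures and the trivial map-backed implementation are assumptions made to keep the example self-contained (the real interface works on Flink's RowData, not arbitrary types).]

```java
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Assumed shape of the cache interface under discussion. */
interface LookupCache<K, V> {
    Collection<V> getIfPresent(K key);   // partial (LRU) mode: probe before a remote lookup
    void put(K key, Collection<V> rows); // partial mode: store one lookup result
    void invalidate(K key);              // partial mode: drop an expired/stale entry
    void putAll(Map<K, ? extends Collection<V>> all); // full mode: bulk-load on (re)scan
}

/** Trivial illustrative implementation (no eviction, no TTL). */
class MapLookupCache<K, V> implements LookupCache<K, V> {
    private final Map<K, Collection<V>> store = new HashMap<>();
    public Collection<V> getIfPresent(K key) { return store.get(key); }
    public void put(K key, Collection<V> rows) { store.put(key, rows); }
    public void invalidate(K key) { store.remove(key); }
    public void putAll(Map<K, ? extends Collection<V>> all) { store.putAll(all); }

    public static void main(String[] args) {
        MapLookupCache<Integer, String> cache = new MapLookupCache<>();
        cache.put(1, List.of("row-a"));                      // single lookup result
        cache.putAll(Map.of(2, List.of("row-b", "row-c"))); // bulk reload
        cache.invalidate(1);                                 // expiry
        System.out.println(cache.getIfPresent(1)); // null
        System.out.println(cache.getIfPresent(2).size()); // 2
    }
}
```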
> > > > >> >> >>>>>>>>>>> 5) I don't see DDL options, that were in previous
> > > version
> > > > >> of
> > > > >> >> >>> FLIP.
> > > > >> >> >>>> Do
> > > > >> >> >>>>>>>>>>> you have any special plans for them?
> > > > >> >> >>>>>>>>>>>
> > > > >> >> >>>>>>>>>>> If you don't mind, I would be glad to be able to
> make
> > > > small
> > > > >> >> >>>>>>>>>>> adjustments to the FLIP document too. I think it's
> > > worth
> > > > >> >> >>> mentioning
> > > > >> >> >>>>>>>>>>> about what exactly optimizations are planning in
> the
> > > > >> future.
> > > > >> >> >>>>>>>>>>>
> > > > >> >> >>>>>>>>>>> Best regards,
> > > > >> >> >>>>>>>>>>> Smirnov Alexander
> > > > >> >> >>>>>>>>>>>
> > > > >> >> >>>>>>>>>>> On Fri, May 13, 2022 at 20:27, Qingsheng Ren <
> > > > >> renqschn@gmail.com
> > > > >> >> >>> :
> > > > >> >> >>>>>>>>>>>>
> > > > >> >> >>>>>>>>>>>> Hi Alexander and devs,
> > > > >> >> >>>>>>>>>>>>
> > > > >> >> >>>>>>>>>>>> Thank you very much for the in-depth discussion!
> As
> > > Jark
> > > > >> >> >>>> mentioned we were inspired by Alexander's idea and made a
> > > > >> refactor on
> > > > >> >> >> our
> > > > >> >> >>>> design. FLIP-221 [1] has been updated to reflect our
> design
> > > now
> > > > >> and
> > > > >> >> we
> > > > >> >> >>> are
> > > > >> >> >>>> happy to hear more suggestions from you!
> > > > >> >> >>>>>>>>>>>>
> > > > >> >> >>>>>>>>>>>> Compared to the previous design:
> > > > >> >> >>>>>>>>>>>> 1. The lookup cache serves at table runtime level
> > and
> > > is
> > > > >> >> >>>> integrated as a component of LookupJoinRunner as discussed
> > > > >> >> previously.
> > > > >> >> >>>>>>>>>>>> 2. Interfaces are renamed and re-designed to
> reflect
> > > the
> > > > >> new
> > > > >> >> >>>> design.
> > > > >> >> >>>>>>>>>>>> 3. We separate the all-caching case individually
> and
> > > > >> >> >> introduce a
> > > > >> >> >>>> new RescanRuntimeProvider to reuse the ability of
> scanning.
> > We
> > > > are
> > > > >> >> >>> planning
> > > > >> >> >>>> to support SourceFunction / InputFormat for now
> considering
> > > the
> > > > >> >> >>> complexity
> > > > >> >> >>>> of FLIP-27 Source API.
> > > > >> >> >>>>>>>>>>>> 4. A new interface LookupFunction is introduced to
> > > make
> > > > >> the
> > > > >> >> >>>> semantic of lookup more straightforward for developers.
> > > > >> >> >>>>>>>>>>>>
> > > > >> >> >>>>>>>>>>>> For replying to Alexander:
> > > > >> >> >>>>>>>>>>>>> However I'm a little confused whether InputFormat
> > is
> > > > >> >> >> deprecated
> > > > >> >> >>>> or not. Am I right that it will be so in the future, but
> > > > currently
> > > > >> >> it's
> > > > >> >> >>> not?
> > > > >> >> >>>>>>>>>>>> Yes you are right. InputFormat is not deprecated
> for
> > > > now.
> > > > >> I
> > > > >> >> >>> think
> > > > >> >> >>>> it will be deprecated in the future but we don't have a
> > clear
> > > > plan
> > > > >> >> for
> > > > >> >> >>> that.
> > > > >> >> >>>>>>>>>>>>
> > > > >> >> >>>>>>>>>>>> Thanks again for the discussion on this FLIP and
> > > looking
> > > > >> >> >> forward
> > > > >> >> >>>> to cooperating with you after we finalize the design and
> > > > >> interfaces!
> > > > >> >> >>>>>>>>>>>>
> > > > >> >> >>>>>>>>>>>> [1]
> > > > >> >> >>>>
> > > > >> >> >>>
> > > > >> >> >>
> > > > >> >>
> > > > >>
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-221+Abstraction+for+lookup+source+cache+and+metric
> > > > >> >> >>>>>>>>>>>>
> > > > >> >> >>>>>>>>>>>> Best regards,
> > > > >> >> >>>>>>>>>>>>
> > > > >> >> >>>>>>>>>>>> Qingsheng
> > > > >> >> >>>>>>>>>>>>
> > > > >> >> >>>>>>>>>>>>
> > > > >> >> >>>>>>>>>>>> On Fri, May 13, 2022 at 12:12 AM Александр
> Смирнов <
> > > > >> >> >>>> smiralexan@gmail.com> wrote:
> > > > >> >> >>>>>>>>>>>>>
> > > > >> >> >>>>>>>>>>>>> Hi Jark, Qingsheng and Leonard!
> > > > >> >> >>>>>>>>>>>>>
> > > > >> >> >>>>>>>>>>>>> Glad to see that we came to a consensus on almost
> > all
> > > > >> >> points!
> > > > >> >> >>>>>>>>>>>>>
> > > > >> >> >>>>>>>>>>>>> However I'm a little confused whether InputFormat
> > is
> > > > >> >> >> deprecated
> > > > >> >> >>>> or
> > > > >> >> >>>>>>>>>>>>> not. Am I right that it will be so in the future,
> > but
> > > > >> >> >> currently
> > > > >> >> >>>> it's
> > > > >> >> >>>>>>>>>>>>> not? Actually I also think that for the first
> > version
> > > > >> it's
> > > > >> >> OK
> > > > >> >> >>> to
> > > > >> >> >>>> use
> > > > >> >> >>>>>>>>>>>>> InputFormat in ALL cache realization, because
> > > > supporting
> > > > >> >> >> rescan
> > > > >> >> >>>>>>>>>>>>> ability seems like a very distant prospect. But
> for
> > > > this
> > > > >> >> >>>> decision we
> > > > >> >> >>>>>>>>>>>>> need a consensus among all discussion
> participants.
> > > > >> >> >>>>>>>>>>>>>
> > > > >> >> >>>>>>>>>>>>> In general, I don't have something to argue with
> > your
> > > > >> >> >>>> statements. All
> > > > >> >> >>>>>>>>>>>>> of them correspond my ideas. Looking ahead, it
> > would
> > > be
> > > > >> nice
> > > > >> >> >> to
> > > > >> >> >>>> work
> > > > >> >> >>>>>>>>>>>>> on this FLIP cooperatively. I've already done a
> lot
> > > of
> > > > >> work
> > > > >> >> >> on
> > > > >> >> >>>> lookup
> > > > >> >> >>>>>>>>>>>>> join caching with realization very close to the
> one
> > > we
> > > > >> are
> > > > >> >> >>>> discussing,
> > > > >> >> >>>>>>>>>>>>> and want to share the results of this work.
> Anyway
> > > > >> looking
> > > > >> >> >>>> forward for
> > > > >> >> >>>>>>>>>>>>> the FLIP update!
> > > > >> >> >>>>>>>>>>>>>
> > > > >> >> >>>>>>>>>>>>> Best regards,
> > > > >> >> >>>>>>>>>>>>> Smirnov Alexander
> > > > >> >> >>>>>>>>>>>>>
> > > > >> >> >>>>>>>>>>>>> On Thu, May 12, 2022 at 17:38, Jark Wu <
> > > imjark@gmail.com
> > > > >:
> > > > >> >> >>>>>>>>>>>>>>
> > > > >> >> >>>>>>>>>>>>>> Hi Alex,
> > > > >> >> >>>>>>>>>>>>>>
> > > > >> >> >>>>>>>>>>>>>> Thanks for summarizing your points.
> > > > >> >> >>>>>>>>>>>>>>
> > > > >> >> >>>>>>>>>>>>>> In the past week, Qingsheng, Leonard, and I have
> > > > >> discussed
> > > > >> >> >> it
> > > > >> >> >>>> several times
> > > > >> >> >>>>>>>>>>>>>> and we have totally refactored the design.
> > > > >> >> >>>>>>>>>>>>>> I'm glad to say we have reached a consensus on
> > many
> > > of
> > > > >> your
> > > > >> >> >>>> points!
> > > > >> >> >>>>>>>>>>>>>> Qingsheng is still working on updating the
> design
> > > docs
> > > > >> and
> > > > >> >> >>>> maybe can be
> > > > >> >> >>>>>>>>>>>>>> available in the next few days.
> > > > >> >> >>>>>>>>>>>>>> I will share some conclusions from our
> > discussions:
> > > > >> >> >>>>>>>>>>>>>>
> > > > >> >> >>>>>>>>>>>>>> 1) we have refactored the design towards to
> "cache
> > > in
> > > > >> >> >>>> framework" way.
> > > > >> >> >>>>>>>>>>>>>>
> > > > >> >> >>>>>>>>>>>>>> 2) a "LookupCache" interface for users to
> > customize
> > > > and
> > > > >> a
> > > > >> >> >>>> default
> > > > >> >> >>>>>>>>>>>>>> implementation with builder for users to
> easy-use.
> > > > >> >> >>>>>>>>>>>>>> This can both make it possible to both have
> > > > flexibility
> > > > >> and
> > > > >> >> >>>> conciseness.
> > > > >> >> >>>>>>>>>>>>>>
> > > > >> >> >>>>>>>>>>>>>> 3) Filter pushdown is important for ALL and LRU
> > > lookup
> > > > >> >> >> cache,
> > > > >> >> >>>> esp reducing
> > > > >> >> >>>>>>>>>>>>>> IO.
> > > > >> >> >>>>>>>>>>>>>> Filter pushdown should be the final state and
> the
> > > > >> unified
> > > > >> >> >> way
> > > > >> >> >>>> to both
> > > > >> >> >>>>>>>>>>>>>> support pruning ALL cache and LRU cache,
> > > > >> >> >>>>>>>>>>>>>> so I think we should make effort in this
> > direction.
> > > If
> > > > >> we
> > > > >> >> >> need
> > > > >> >> >>>> to support
> > > > >> >> >>>>>>>>>>>>>> filter pushdown for ALL cache anyway, why not
> use
> > > > >> >> >>>>>>>>>>>>>> it for LRU cache as well? Either way, as we
> decide
> > > to
> > > > >> >> >>> implement
> > > > >> >> >>>> the cache
> > > > >> >> >>>>>>>>>>>>>> in the framework, we have the chance to support
> > > > >> >> >>>>>>>>>>>>>> filter on cache anytime. This is an optimization
> > and
> > > > it
> > > > >> >> >>> doesn't
> > > > >> >> >>>> affect the
> > > > >> >> >>>>>>>>>>>>>> public API. I think we can create a JIRA issue
> to
> > > > >> >> >>>>>>>>>>>>>> discuss it when the FLIP is accepted.
> > > > >> >> >>>>>>>>>>>>>>
> > > > >> >> >>>>>>>>>>>>>> 4) The idea to support ALL cache is similar to
> > your
> > > > >> >> >> proposal.
> > > > >> >> >>>>>>>>>>>>>> In the first version, we will only support
> > > > InputFormat,
> > > > >> >> >>>> SourceFunction for
> > > > >> >> >>>>>>>>>>>>>> cache all (invoke InputFormat in join operator).
> > > > >> >> >>>>>>>>>>>>>> For FLIP-27 source, we need to join a true
> source
> > > > >> operator
> > > > >> >> >>>> instead of
> > > > >> >> >>>>>>>>>>>>>> calling it embedded in the join operator.
> > > > >> >> >>>>>>>>>>>>>> However, this needs another FLIP to support the
> > > > re-scan
> > > > >> >> >>> ability
> > > > >> >> >>>> for FLIP-27
> > > > >> >> >>>>>>>>>>>>>> Source, and this can be a large work.
> > > > >> >> >>>>>>>>>>>>>> In order to not block this issue, we can put the
> > > > effort
> > > > >> of
> > > > >> >> >>>> FLIP-27 source
> > > > >> >> >>>>>>>>>>>>>> integration into future work and integrate
> > > > >> >> >>>>>>>>>>>>>> InputFormat&SourceFunction for now.
> > > > >> >> >>>>>>>>>>>>>>
> > > > >> >> >>>>>>>>>>>>>> I think it's fine to use
> > InputFormat&SourceFunction,
> > > > as
> > > > >> >> they
> > > > >> >> >>>> are not
> > > > >> >> >>>>>>>>>>>>>> deprecated, otherwise, we have to introduce
> > another
> > > > >> >> function
> > > > >> >> >>>>>>>>>>>>>> similar to them which is meaningless. We need to
> > > plan
> > > > >> >> >> FLIP-27
> > > > >> >> >>>> source
> > > > >> >> >>>>>>>>>>>>>> integration ASAP before InputFormat &
> > SourceFunction
> > > > are
> > > > >> >> >>>> deprecated.
> > > > >> >> >>>>>>>>>>>>>>
> > > > >> >> >>>>>>>>>>>>>> Best,
> > > > >> >> >>>>>>>>>>>>>> Jark
> > > > >> >> >>>>>>>>>>>>>>
> > > > >> >> >>>>>>>>>>>>>>
> > > > >> >> >>>>>>>>>>>>>>
> > > > >> >> >>>>>>>>>>>>>>
> > > > >> >> >>>>>>>>>>>>>>
> > > > >> >> >>>>>>>>>>>>>> On Thu, 12 May 2022 at 15:46, Александр Смирнов
> <
> > > > >> >> >>>> smiralexan@gmail.com>
> > > > >> >> >>>>>>>>>>>>>> wrote:
> > > > >> >> >>>>>>>>>>>>>>
> > > > >> >> >>>>>>>>>>>>>>> Hi Martijn!
> > > > >> >> >>>>>>>>>>>>>>>
> > > > >> >> >>>>>>>>>>>>>>> Got it. Therefore, the realization with
> > InputFormat
> > > > is
> > > > >> not
> > > > >> >> >>>> considered.
> > > > >> >> >>>>>>>>>>>>>>> Thanks for clearing that up!
> > > > >> >> >>>>>>>>>>>>>>>
> > > > >> >> >>>>>>>>>>>>>>> Best regards,
> > > > >> >> >>>>>>>>>>>>>>> Smirnov Alexander
> > > > >> >> >>>>>>>>>>>>>>>
> > > > >> >> >>>>>>>>>>>>>> On Thu, May 12, 2022 at 14:23, Martijn Visser <
> > > > >> >> >>>> martijn@ververica.com>:
> > > > >> >> >>>>>>>>>>>>>>>>
> > > > >> >> >>>>>>>>>>>>>>>> Hi,
> > > > >> >> >>>>>>>>>>>>>>>>
> > > > >> >> >>>>>>>>>>>>>>>> With regards to:
> > > > >> >> >>>>>>>>>>>>>>>>
> > > > >> >> >>>>>>>>>>>>>>>>> But if there are plans to refactor all
> > connectors
> > > > to
> > > > >> >> >>> FLIP-27
> > > > >> >> >>>>>>>>>>>>>>>>
> > > > >> >> >>>>>>>>>>>>>>>> Yes, FLIP-27 is the target for all connectors.
> > The
> > > > old
> > > > >> >> >>>> interfaces will be
> > > > >> >> >>>>>>>>>>>>>>>> deprecated and connectors will either be
> > > refactored
> > > > to
> > > > >> >> use
> > > > >> >> >>>> the new ones
> > > > >> >> >>>>>>>>>>>>>>> or
> > > > >> >> >>>>>>>>>>>>>>>> dropped.
> > > > >> >> >>>>>>>>>>>>>>>>
> > > > >> >> >>>>>>>>>>>>>>>> The caching should work for connectors that
> are
> > > > using
> > > > >> >> >>> FLIP-27
> > > > >> >> >>>> interfaces,
> > > > >> >> >>>>>>>>>>>>>>>> we should not introduce new features for old
> > > > >> interfaces.
> > > > >> >> >>>>>>>>>>>>>>>>
> > > > >> >> >>>>>>>>>>>>>>>> Best regards,
> > > > >> >> >>>>>>>>>>>>>>>>
> > > > >> >> >>>>>>>>>>>>>>>> Martijn
> > > > >> >> >>>>>>>>>>>>>>>>
> > > > >> >> >>>>>>>>>>>>>>>> On Thu, 12 May 2022 at 06:19, Александр
> Смирнов
> > <
> > > > >> >> >>>> smiralexan@gmail.com>
> > > > >> >> >>>>>>>>>>>>>>>> wrote:
> > > > >> >> >>>>>>>>>>>>>>>>
> > > > >> >> >>>>>>>>>>>>>>>>> Hi Jark!
> > > > >> >> >>>>>>>>>>>>>>>>>
> > > > >> >> >>>>>>>>>>>>>>>>> Sorry for the late response. I would like to
> > make
> > > > >> some
> > > > >> >> >>>> comments and
> > > > >> >> >>>>>>>>>>>>>>>>> clarify my points.
> > > > >> >> >>>>>>>>>>>>>>>>>
> > > > >> >> >>>>>>>>>>>>>>>>> 1) I agree with your first statement. I think
> > we
> > > > can
> > > > >> >> >>> achieve
> > > > >> >> >>>> both
> > > > >> >> >>>>>>>>>>>>>>>>> advantages this way: put the Cache interface
> in
> > > > >> >> >>>> flink-table-common,
> > > > >> >> >>>>>>>>>>>>>>>>> but have implementations of it in
> > > > >> flink-table-runtime.
> > > > >> >> >>>> Therefore if a
> > > > >> >> >>>>>>>>>>>>>>>>> connector developer wants to use existing
> cache
> > > > >> >> >> strategies
> > > > >> >> >>>> and their
> > > > >> >> >>>>>>>>>>>>>>>>> implementations, he can just pass
> lookupConfig
> > to
> > > > the
> > > > >> >> >>>> planner, but if
> > > > >> >> >>>>>>>>>>>>>>>>> he wants to have its own cache implementation
> > in
> > > > his
> > > > >> >> >>>> TableFunction, it
> > > > >> >> >>>>>>>>>>>>>>>>> will be possible for him to use the existing
> > > > >> interface
> > > > >> >> >> for
> > > > >> >> >>>> this
> > > > >> >> >>>>>>>>>>>>>>>>> purpose (we can explicitly point this out in
> > the
> > > > >> >> >>>> documentation). In
> > > > >> >> >>>>>>>>>>>>>>>>> this way all configs and metrics will be
> > unified.
> > > > >> WDYT?
> > > > >> >> >>>>>>>>>>>>>>>>>
> > > > >> >> >>>>>>>>>>>>>>>>>> If a filter can prune 90% of data in the
> > cache,
> > > we
> > > > >> will
> > > > >> >> >>>> have 90% of
> > > > >> >> >>>>>>>>>>>>>>>>> lookup requests that can never be cached
> > > > >> >> >>>>>>>>>>>>>>>>>
> > > > >> >> >>>>>>>>>>>>>>>>> 2) Let me clarify the logic filters
> > optimization
> > > in
> > > > >> case
> > > > >> >> >> of
> > > > >> >> >>>> LRU cache.
> > > > >> >> >>>>>>>>>>>>>>>>> It looks like Cache<RowData,
> > > Collection<RowData>>.
> > > > >> Here
> > > > >> >> >> we
> > > > >> >> >>>> always
> > > > >> >> >>>>>>>>>>>>>>>>> store the response of the dimension table in
> > > cache,
> > > > >> even
> > > > >> >> >>>> after
> > > > >> >> >>>>>>>>>>>>>>>>> applying calc function. I.e. if there are no
> > rows
> > > > >> after
> > > > >> >> >>>> applying
> > > > >> >> >>>>>>>>>>>>>>>>> filters to the result of the 'eval' method of
> > > > >> >> >>> TableFunction,
> > > > >> >> >>>> we store
> > > > >> >> >>>>>>>>>>>>>>>>> the empty list by lookup keys. Therefore the
> > > cache
> > > > >> line
> > > > >> >> >>> will
> > > > >> >> >>>> be
> > > > >> >> >>>>>>>>>>>>>>>>> filled, but will require much less memory (in
> > > > bytes).
> > > > >> >> >> I.e.
> > > > >> >> >>>> we don't
> > > > >> >> >>>>>>>>>>>>>>>>> completely filter keys, by which result was
> > > pruned,
> > > > >> but
> > > > >> >> >>>> significantly
> > > > >> >> >>>>>>>>>>>>>>>>> reduce required memory to store this result.
> If
> > > the
> > > > >> user
> > > > >> >> >>>> knows about
> > > > >> >> >>>>>>>>>>>>>>>>> this behavior, he can increase the 'max-rows'
> > > > option
> > > > >> >> >> before
> > > > >> >> >>>> the start
> > > > >> >> >>>>>>>>>>>>>>>>> of the job. But actually I came up with the
> > idea
> > > > >> that we
> > > > >> >> >>> can
> > > > >> >> >>>> do this
> > > > >> >> >>>>>>>>>>>>>>>>> automatically by using the 'maximumWeight'
> and
> > > > >> 'weigher'
> > > > >> >> >>>> methods of
> > > > >> >> >>>>>>>>>>>>>>>>> GuavaCache [1]. Weight can be the size of the
> > > > >> collection
> > > > >> >> >> of
> > > > >> >> >>>> rows
> > > > >> >> >>>>>>>>>>>>>>>>> (value of cache). Therefore cache can
> > > automatically
> > > > >> fit
> > > > >> >> >>> much
> > > > >> >> >>>> more
> > > > >> >> >>>>>>>>>>>>>>>>> records than before.
> > > > >> >> >>>>>>>>>>>>>>>>>
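[Editorial sketch: the weight-based sizing idea described above, i.e. bounding the cache by the total number of cached rows (sum of value-list sizes) instead of by entry count, so that empty results for pruned keys barely consume the budget. Guava's `CacheBuilder.maximumWeight(...)` / `weigher(...)` provides this natively; the stand-in below uses only the JDK (an access-order LinkedHashMap as the LRU structure) to stay self-contained, and is not the proposed Flink implementation.]

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** LRU cache bounded by total row count rather than number of entries. */
public class WeightedLookupCache<K, V> {
    private final long maxWeight; // max total cached rows, not max entries
    private long currentWeight = 0;
    private final LinkedHashMap<K, List<V>> map =
            new LinkedHashMap<>(16, 0.75f, true); // access-order = LRU

    public WeightedLookupCache(long maxWeight) { this.maxWeight = maxWeight; }

    public void put(K key, List<V> rows) {
        List<V> old = map.put(key, rows);
        currentWeight += rows.size() - (old == null ? 0 : old.size());
        // evict least-recently-used entries until back under the weight bound;
        // an empty list (a cached "miss") weighs 0 and is almost free to keep
        Iterator<Map.Entry<K, List<V>>> it = map.entrySet().iterator();
        while (currentWeight > maxWeight && it.hasNext()) {
            Map.Entry<K, List<V>> eldest = it.next();
            currentWeight -= eldest.getValue().size();
            it.remove();
        }
    }

    public List<V> get(K key) { return map.get(key); }
    public int entries() { return map.size(); }
}
```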
> > > > >> >> >>>>>>>>>>>>>>>>>> Flink SQL has provided a standard way to do
> > > > filters
> > > > >> and
> > > > >> >> >>>> projects
> > > > >> >> >>>>>>>>>>>>>>>>> pushdown, i.e., SupportsFilterPushDown and
> > > > >> >> >>>> SupportsProjectionPushDown.
> > > > >> >> >>>>>>>>>>>>>>>>>> Jdbc/hive/HBase haven't implemented the
> > > > interfaces,
> > > > >> >> >> don't
> > > > >> >> >>>> mean it's
> > > > >> >> >>>>>>>>>>>>>>> hard
> > > > >> >> >>>>>>>>>>>>>>>>> to implement.
> > > > >> >> >>>>>>>>>>>>>>>>>
> > > > >> >> >>>>>>>>>>>>>>>>> It's debatable how difficult it will be to
> > > > implement
> > > > >> >> >> filter
> > > > >> >> >>>> pushdown.
> > > > >> >> >>>>>>>>>>>>>>>>> But I think the fact that currently there is
> no
> > > > >> database
> > > > >> >> >>>> connector
> > > > >> >> >>>>>>>>>>>>>>>>> with filter pushdown at least means that this
> > > > feature
> > > > >> >> >> won't
> > > > >> >> >>>> be
> > > > >> >> >>>>>>>>>>>>>>>>> supported soon in connectors. Moreover, if we
> > > talk
> > > > >> about
> > > > >> >> >>>> other
> > > > >> >> >>>>>>>>>>>>>>>>> connectors (not in Flink repo), their
> databases
> > > > might
> > > > >> >> not
> > > > >> >> >>>> support all
> > > > >> >> >>>>>>>>>>>>>>>>> Flink filters (or not support filters at
> all).
> > I
> > > > >> think
> > > > >> >> >>> users
> > > > >> >> >>>> are
> > > > >> >> >>>>>>>>>>>>>>>>> interested in supporting cache filters
> > > optimization
> > > > >> >> >>>> independently of
> > > > >> >> >>>>>>>>>>>>>>>>> supporting other features and solving more
> > > complex
> > > > >> >> >> problems
> > > > >> >> >>>> (or
> > > > >> >> >>>>>>>>>>>>>>>>> unsolvable at all).
> > > > >> >> >>>>>>>>>>>>>>>>>
> > > > >> >> >>>>>>>>>>>>>>>>> 3) I agree with your third statement.
> Actually
> > in
> > > > our
> > > > >> >> >>>> internal version
> > > > >> >> >>>>>>>>>>>>>>>>> I also tried to unify the logic of scanning
> and
> > > > >> >> reloading
> > > > >> >> >>>> data from
> > > > >> >> >>>>>>>>>>>>>>>>> connectors. But unfortunately, I didn't find
> a
> > > way
> > > > to
> > > > >> >> >> unify
> > > > >> >> >>>> the logic
> > > > >> >> >>>>>>>>>>>>>>>>> of all ScanRuntimeProviders (InputFormat,
> > > > >> >> SourceFunction,
> > > > >> >> >>>> Source,...)
> > > > >> >> >>>>>>>>>>>>>>>>> and reuse it in reloading ALL cache. As a
> > result
> > > I
> > > > >> >> >> settled
> > > > >> >> >>>> on using
> > > > >> >> >>>>>>>>>>>>>>>>> InputFormat, because it was used for scanning
> > in
> > > > all
> > > > >> >> >> lookup
> > > > >> >> >>>>>>>>>>>>>>>>> connectors. (I didn't know that there are
> plans
> > > to
> > > > >> >> >>> deprecate
> > > > >> >> >>>>>>>>>>>>>>>>> InputFormat in favor of FLIP-27 Source). IMO
> > > usage
> > > > of
> > > > >> >> >>>> FLIP-27 source
> > > > >> >> >>>>>>>>>>>>>>>>> in ALL caching is not good idea, because this
> > > > source
> > > > >> was
> > > > >> >> >>>> designed to
> > > > >> >> >>>>>>>>>>>>>>>>> work in distributed environment
> > (SplitEnumerator
> > > on
> > > > >> >> >>>> JobManager and
> > > > >> >> >>>>>>>>>>>>>>>>> SourceReaders on TaskManagers), not in one
> > > operator
> > > > >> >> >> (lookup
> > > > >> >> >>>> join
> > > > >> >> >>>>>>>>>>>>>>>>> operator in our case). There is even no
> direct
> > > way
> > > > to
> > > > >> >> >> pass
> > > > >> >> >>>> splits from
> > > > >> >> >>>>>>>>>>>>>>>>> SplitEnumerator to SourceReader (this logic
> > works
> > > > >> >> through
> > > > >> >> >>>>>>>>>>>>>>>>> SplitEnumeratorContext, which requires
> > > > >> >> >>>>>>>>>>>>>>>>> OperatorCoordinator.SubtaskGateway to send
> > > > >> >> >> AddSplitEvents).
> > > > >> >> >>>> Usage of
> > > > >> >> >>>>>>>>>>>>>>>>> InputFormat for ALL cache seems much more
> > clearer
> > > > and
> > > > >> >> >>>> easier. But if
> > > > >> >> >>>>>>>>>>>>>>>>> there are plans to refactor all connectors to
> > > > >> FLIP-27, I
> > > > >> >> >>>> have the
> > > > >> >> >>>>>>>>>>>>>>>>> following ideas: maybe we can abandon the lookup
> > > > >> >> >>>>>>>>>>>>>>>>> join ALL cache in favor of a simple join with
> > > > >> >> >>>>>>>>>>>>>>>>> multiple scans of the batch source?
> > > > >> >> >>>> The point
> > > > >> >> >>>>>>>>>>>>>>>>> is that the only difference between lookup
> join
> > > ALL
> > > > >> >> cache
> > > > >> >> >>>> and simple
> > > > >> >> >>>>>>>>>>>>>>>>> join with batch source is that in the first
> > case
> > > > >> >> scanning
> > > > >> >> >>> is
> > > > >> >> >>>> performed
> > > > >> >> >>>>>>>>>>>>>>>>> multiple times, in between which state
> (cache)
> > is
> > > > >> >> cleared
> > > > >> >> >>>> (correct me
> > > > >> >> >>>>>>>>>>>>>>>>> if I'm wrong). So what if we extend the
> > > > >> functionality of
> > > > >> >> >>>> simple join
> > > > >> >> >>>>>>>>>>>>>>>>> to support state reloading + extend the
> > > > >> functionality of
> > > > >> >> >>>> scanning
> > > > >> >> >>>>>>>>>>>>>>>>> batch source multiple times (this one should
> be
> > > > easy
> > > > >> >> with
> > > > >> >> >>>> new FLIP-27
> > > > >> >> >>>>>>>>>>>>>>>>> source, that unifies streaming/batch reading
> -
> > we
> > > > >> will
> > > > >> >> >> need
> > > > >> >> >>>> to change
> > > > >> >> >>>>>>>>>>>>>>>>> only SplitEnumerator, which will pass splits
> > > again
> > > > >> after
> > > > >> >> >>>> some TTL).
> > > > >> >> >>>>>>>>>>>>>>>>> WDYT? I must say that this looks like a
> > long-term
> > > > >> goal
> > > > >> >> >> and
> > > > >> >> >>>> will make
> > > > >> >> >>>>>>>>>>>>>>>>> the scope of this FLIP even larger than you
> > said.
> > > > >> Maybe
> > > > >> >> >> we
> > > > >> >> >>>> can limit
> > > > >> >> >>>>>>>>>>>>>>>>> ourselves to a simpler solution now
> > > (InputFormats).
> > > > >> >> >>>>>>>>>>>>>>>>>
> > > > >> >> >>>>>>>>>>>>>>>>> So to sum up, my points are as follows:
> > > > >> >> >>>>>>>>>>>>>>>>> 1) There is a way to make both concise and
> > > flexible
> > > > >> >> >>>> interfaces for
> > > > >> >> >>>>>>>>>>>>>>>>> caching in lookup join.
> > > > >> >> >>>>>>>>>>>>>>>>> 2) Cache filters optimization is important
> both
> > > in
> > > > >> LRU
> > > > >> >> >> and
> > > > >> >> >>>> ALL caches.
> > > > >> >> >>>>>>>>>>>>>>>>> 3) It is unclear when filter pushdown will be
> > > > >> supported
> > > > >> >> >> in
> > > > >> >> >>>> Flink
> > > > >> >> >>>>>>>>>>>>>>>>> connectors, some of the connectors might not
> > have
> > > > the
> > > > >> >> >>>> opportunity to
> > > > >> >> >>>>>>>>>>>>>>>>> support filter pushdown + as I know,
> currently
> > > > filter
> > > > >> >> >>>> pushdown works
> > > > >> >> >>>>>>>>>>>>>>>>> only for scanning (not lookup). So cache
> > filters
> > > +
> > > > >> >> >>>> projections
> > > > >> >> >>>>>>>>>>>>>>>>> optimization should be independent from other
> > > > >> features.
> > > > >> >> >>>>>>>>>>>>>>>>> 4) The ALL cache implementation is a complex topic
> > > > >> >> >>>>>>>>>>>>>>>>> that touches multiple aspects of how Flink is
> > > > >> >> >>>>>>>>>>>>>>>>> evolving. Abandoning InputFormat in favor of the
> > > > >> >> >>>>>>>>>>>>>>>>> FLIP-27 Source will make the ALL cache
> > > > >> >> >>>>>>>>>>>>>>>>> implementation really complex and unclear, so maybe
> > > > >> >> >>>>>>>>>>>>>>>>> instead of that we can extend the functionality of
> > > > >> >> >>>>>>>>>>>>>>>>> the simple join, or keep InputFormat for the lookup
> > > > >> >> >>>>>>>>>>>>>>>>> join ALL cache?
> > > > >> >> >>>>>>>>>>>>>>>>>
> > > > >> >> >>>>>>>>>>>>>>>>> Best regards,
> > > > >> >> >>>>>>>>>>>>>>>>> Smirnov Alexander
> > > > >> >> >>>>>>>>>>>>>>>>>
> > > > >> >> >>>>>>>>>>>>>>>>>
> > > > >> >> >>>>>>>>>>>>>>>>> [1]
> > > > >> >> >>>>>>>>>>>>>>>>>
> https://guava.dev/releases/18.0/api/docs/com/google/common/cache/CacheBuilder.html#weigher(com.google.common.cache.Weigher)
> > > > >> >> >>>>>>>>>>>>>>>>>
> > > > >> >> >>>>>>>>>>>>>>>>> чт, 5 мая 2022 г. в 20:34, Jark Wu <
> > > > imjark@gmail.com
> > > > >> >:
> > > > >> >> >>>>>>>>>>>>>>>>>>
> > > > >> >> >>>>>>>>>>>>>>>>>> It's great to see the active discussion! I
> > want
> > > to
> > > > >> >> share
> > > > >> >> >>> my
> > > > >> >> >>>> ideas:
> > > > >> >> >>>>>>>>>>>>>>>>>>
> > > > >> >> >>>>>>>>>>>>>>>>>> 1) implement the cache in framework vs.
> > > connectors
> > > > >> base
> > > > >> >> >>>>>>>>>>>>>>>>>> I don't have a strong opinion on this. Both
> > ways
> > > > >> should
> > > > >> >> >>>> work (e.g.,
> > > > >> >> >>>>>>>>>>>>>>> cache
> > > > >> >> >>>>>>>>>>>>>>>>>> pruning, compatibility).
> > > > >> >> >>>>>>>>>>>>>>>>>> The framework way can provide more concise
> > > > >> interfaces.
> > > > >> >> >>>>>>>>>>>>>>>>>> The connector base way can define more
> > flexible
> > > > >> cache
> > > > >> >> >>>>>>>>>>>>>>>>>> strategies/implementations.
> > > > >> >> >>>>>>>>>>>>>>>>>> We are still investigating a way to see if
> we
> > > can
> > > > >> have
> > > > >> >> >>> both
> > > > >> >> >>>>>>>>>>>>>>> advantages.
> > > > >> >> >>>>>>>>>>>>>>>>>> We should reach a consensus that the way
> > should
> > > > be a
> > > > >> >> >> final
> > > > >> >> >>>> state,
> > > > >> >> >>>>>>>>>>>>>>> and we
> > > > >> >> >>>>>>>>>>>>>>>>>> are on the path to it.
> > > > >> >> >>>>>>>>>>>>>>>>>>
> > > > >> >> >>>>>>>>>>>>>>>>>> 2) filters and projections pushdown:
> > > > >> >> >>>>>>>>>>>>>>>>>> I agree with Alex that the filter pushdown
> > into
> > > > >> cache
> > > > >> >> >> can
> > > > >> >> >>>> benefit a
> > > > >> >> >>>>>>>>>>>>>>> lot
> > > > >> >> >>>>>>>>>>>>>>>>> for
> > > > >> >> >>>>>>>>>>>>>>>>>> ALL cache.
> > > > >> >> >>>>>>>>>>>>>>>>>> However, this is not true for LRU cache.
> > > > Connectors
> > > > >> use
> > > > >> >> >>>> cache to
> > > > >> >> >>>>>>>>>>>>>>> reduce
> > > > >> >> >>>>>>>>>>>>>>>>> IO
> > > > >> >> >>>>>>>>>>>>>>>>>> requests to databases for better throughput.
> > > > >> >> >>>>>>>>>>>>>>>>>> If a filter can prune 90% of data in the
> > cache,
> > > we
> > > > >> will
> > > > >> >> >>>> have 90% of
> > > > >> >> >>>>>>>>>>>>>>>>> lookup
> > > > >> >> >>>>>>>>>>>>>>>>>> requests that can never be cached
> > > > >> >> >>>>>>>>>>>>>>>>>> and hit directly to the databases. That
> means
> > > the
> > > > >> cache
> > > > >> >> >> is
> > > > >> >> >>>>>>>>>>>>>>> meaningless in
> > > > >> >> >>>>>>>>>>>>>>>>>> this case.
> > > > >> >> >>>>>>>>>>>>>>>>>>
> > > > >> >> >>>>>>>>>>>>>>>>>> IMO, Flink SQL has provided a standard way
> to
> > do
> > > > >> >> filters
> > > > >> >> >>>> and projects
> > > > >> >> >>>>>>>>>>>>>>>>>> pushdown, i.e., SupportsFilterPushDown and
> > > > >> >> >>>>>>>>>>>>>>> SupportsProjectionPushDown.
> > > > >> >> >>>>>>>>>>>>>>>>>> Jdbc/hive/HBase haven't implemented the
> > > > interfaces,
> > > > >> >> >> don't
> > > > >> >> >>>> mean it's
> > > > >> >> >>>>>>>>>>>>>>> hard
> > > > >> >> >>>>>>>>>>>>>>>>> to
> > > > >> >> >>>>>>>>>>>>>>>>>> implement.
> > > > >> >> >>>>>>>>>>>>>>>>>> They should implement the pushdown
> interfaces
> > to
> > > > >> reduce
> > > > >> >> >> IO
> > > > >> >> >>>> and the
> > > > >> >> >>>>>>>>>>>>>>> cache
> > > > >> >> >>>>>>>>>>>>>>>>>> size.
> > > > >> >> >>>>>>>>>>>>>>>>>> That should be a final state that the scan
> > > source
> > > > >> and
> > > > >> >> >>>> lookup source
> > > > >> >> >>>>>>>>>>>>>>> share
> > > > >> >> >>>>>>>>>>>>>>>>>> the exact pushdown implementation.
> > > > >> >> >>>>>>>>>>>>>>>>>> I don't see why we need to duplicate the pushdown
> > > > >> >> >>>>>>>>>>>>>>>>>> logic in caches, which will complicate the lookup
> > > > >> >> >>>>>>>>>>>>>>>>>> join design.
> > > > >> >> >>>>>>>>>>>>>>>>>>
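[The SupportsFilterPushDown contract mentioned above splits the planner's filters into an accepted and a remaining part. A toy model of that contract, using simplified stand-in types rather than the real org.apache.flink.table.connector interfaces (the hypothetical source here accepts only equality predicates on "indexed" columns and hands everything else back to the runtime):]

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Toy model of the SupportsFilterPushDown contract: the planner offers all
 * filters, the source accepts the ones it can evaluate server-side and
 * returns the rest as "remaining" so the planner keeps applying them after
 * the scan. Types here are simplified stand-ins, not Flink's
 * ResolvedExpression / Result classes.
 */
class FilterPushDownSketch {
    record Filter(String column, String op, String literal) {}

    record Result(List<Filter> accepted, List<Filter> remaining) {}

    // Columns the hypothetical external system can filter on server-side.
    private final List<String> indexedColumns;

    FilterPushDownSketch(List<String> indexedColumns) {
        this.indexedColumns = indexedColumns;
    }

    Result applyFilters(List<Filter> filters) {
        List<Filter> accepted = new ArrayList<>();
        List<Filter> remaining = new ArrayList<>();
        for (Filter f : filters) {
            // Push down only equality predicates on indexed columns;
            // everything else stays in the Flink runtime.
            if ("=".equals(f.op()) && indexedColumns.contains(f.column())) {
                accepted.add(f);
            } else {
                remaining.add(f);
            }
        }
        return new Result(accepted, remaining);
    }
}
```

[The same split is what would let a scan source and a lookup source share one pushdown implementation: whatever ends up in "remaining" is evaluated by the framework regardless of which source produced the rows.]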
> > > > >> >> >>>>>>>>>>>>>>>>>> 3) ALL cache abstraction
> > > > >> >> >>>>>>>>>>>>>>>>>> All cache might be the most challenging part
> > of
> > > > this
> > > > >> >> >> FLIP.
> > > > >> >> >>>> We have
> > > > >> >> >>>>>>>>>>>>>>> never
> > > > >> >> >>>>>>>>>>>>>>>>>> provided a reload-lookup public interface.
> > > > >> >> >>>>>>>>>>>>>>>>>> Currently, we put the reload logic in the
> > "eval"
> > > > >> method
> > > > >> >> >> of
> > > > >> >> >>>>>>>>>>>>>>> TableFunction.
> > > > >> >> >>>>>>>>>>>>>>>>>> That's hard for some sources (e.g., Hive).
> > > > >> >> >>>>>>>>>>>>>>>>>> Ideally, connector implementation should
> share
> > > the
> > > > >> >> logic
> > > > >> >> >>> of
> > > > >> >> >>>> reload
> > > > >> >> >>>>>>>>>>>>>>> and
> > > > >> >> >>>>>>>>>>>>>>>>>> scan, i.e. ScanTableSource with
> > > > >> >> >>>> InputFormat/SourceFunction/FLIP-27
> > > > >> >> >>>>>>>>>>>>>>>>> Source.
> > > > >> >> >>>>>>>>>>>>>>>>>> However, InputFormat/SourceFunction are
> > > > deprecated,
> > > > >> and
> > > > >> >> >>> the
> > > > >> >> >>>> FLIP-27
> > > > >> >> >>>>>>>>>>>>>>>>> source
> > > > >> >> >>>>>>>>>>>>>>>>>> is deeply coupled with SourceOperator.
> > > > >> >> >>>>>>>>>>>>>>>>>> If we want to invoke the FLIP-27 source in
> > > > >> LookupJoin,
> > > > >> >> >>> this
> > > > >> >> >>>> may make
> > > > >> >> >>>>>>>>>>>>>>> the
> > > > >> >> >>>>>>>>>>>>>>>>>> scope of this FLIP much larger.
> > > > >> >> >>>>>>>>>>>>>>>>>> We are still investigating how to abstract
> the
> > > ALL
> > > > >> >> cache
> > > > >> >> >>>> logic and
> > > > >> >> >>>>>>>>>>>>>>> reuse
> > > > >> >> >>>>>>>>>>>>>>>>>> the existing source interfaces.
> > > > >> >> >>>>>>>>>>>>>>>>>>
> > > > >> >> >>>>>>>>>>>>>>>>>>
> > > > >> >> >>>>>>>>>>>>>>>>>> Best,
> > > > >> >> >>>>>>>>>>>>>>>>>> Jark
> > > > >> >> >>>>>>>>>>>>>>>>>>
> > > > >> >> >>>>>>>>>>>>>>>>>>
> > > > >> >> >>>>>>>>>>>>>>>>>>
> > > > >> >> >>>>>>>>>>>>>>>>>> On Thu, 5 May 2022 at 20:22, Roman Boyko <
> > > > >> >> >>>> ro.v.boyko@gmail.com>
> > > > >> >> >>>>>>>>>>>>>>> wrote:
> > > > >> >> >>>>>>>>>>>>>>>>>>
> > > > >> >> >>>>>>>>>>>>>>>>>>> It's a much more complicated activity and
> > lies
> > > > out
> > > > >> of
> > > > >> >> >> the
> > > > >> >> >>>> scope of
> > > > >> >> >>>>>>>>>>>>>>> this
> > > > >> >> >>>>>>>>>>>>>>>>>>> improvement. Because such pushdowns should
> be
> > > > done
> > > > >> for
> > > > >> >> >>> all
> > > > >> >> >>>>>>>>>>>>>>>>> ScanTableSource
> > > > >> >> >>>>>>>>>>>>>>>>>>> implementations (not only for Lookup ones).
> > > > >> >> >>>>>>>>>>>>>>>>>>>
> > > > >> >> >>>>>>>>>>>>>>>>>>> On Thu, 5 May 2022 at 19:02, Martijn
> Visser <
> > > > >> >> >>>>>>>>>>>>>>> martijnvisser@apache.org>
> > > > >> >> >>>>>>>>>>>>>>>>>>> wrote:
> > > > >> >> >>>>>>>>>>>>>>>>>>>
> > > > >> >> >>>>>>>>>>>>>>>>>>>> Hi everyone,
> > > > >> >> >>>>>>>>>>>>>>>>>>>>
> > > > >> >> >>>>>>>>>>>>>>>>>>>> One question regarding "And Alexander
> > > correctly
> > > > >> >> >>> mentioned
> > > > >> >> >>>> that
> > > > >> >> >>>>>>>>>>>>>>> filter
> > > > >> >> >>>>>>>>>>>>>>>>>>>> pushdown still is not implemented for
> > > > >> >> >> jdbc/hive/hbase."
> > > > >> >> >>>> -> Would
> > > > >> >> >>>>>>>>>>>>>>> an
> > > > >> >> >>>>>>>>>>>>>>>>>>>> alternative solution be to actually
> > implement
> > > > >> these
> > > > >> >> >>> filter
> > > > >> >> >>>>>>>>>>>>>>> pushdowns?
> > > > >> >> >>>>>>>>>>>>>>>>> I
> > > > >> >> >>>>>>>>>>>>>>>>>>>> can
> > > > >> >> >>>>>>>>>>>>>>>>>>>> imagine that there are many more benefits
> to
> > > > doing
> > > > >> >> >> that,
> > > > >> >> >>>> outside
> > > > >> >> >>>>>>>>>>>>>>> of
> > > > >> >> >>>>>>>>>>>>>>>>> lookup
> > > > >> >> >>>>>>>>>>>>>>>>>>>> caching and metrics.
> > > > >> >> >>>>>>>>>>>>>>>>>>>>
> > > > >> >> >>>>>>>>>>>>>>>>>>>> Best regards,
> > > > >> >> >>>>>>>>>>>>>>>>>>>>
> > > > >> >> >>>>>>>>>>>>>>>>>>>> Martijn Visser
> > > > >> >> >>>>>>>>>>>>>>>>>>>> https://twitter.com/MartijnVisser82
> > > > >> >> >>>>>>>>>>>>>>>>>>>> https://github.com/MartijnVisser
> > > > >> >> >>>>>>>>>>>>>>>>>>>>
> > > > >> >> >>>>>>>>>>>>>>>>>>>>
> > > > >> >> >>>>>>>>>>>>>>>>>>>> On Thu, 5 May 2022 at 13:58, Roman Boyko <
> > > > >> >> >>>> ro.v.boyko@gmail.com>
> > > > >> >> >>>>>>>>>>>>>>>>> wrote:
> > > > >> >> >>>>>>>>>>>>>>>>>>>>
> > > > >> >> >>>>>>>>>>>>>>>>>>>>> Hi everyone!
> > > > >> >> >>>>>>>>>>>>>>>>>>>>>
> > > > >> >> >>>>>>>>>>>>>>>>>>>>> Thanks for driving such a valuable
> > > improvement!
> > > > >> >> >>>>>>>>>>>>>>>>>>>>>
> > > > >> >> >>>>>>>>>>>>>>>>>>>>> I do think that a single cache implementation
> > > > >> >> >>>>>>>>>>>>>>>>>>>>> would be a nice opportunity
> > > > >> >> >>>>>>>>>>>>>>>>>>>> for
> > > > >> >> >>>>>>>>>>>>>>>>>>>>> users. And it will break the "FOR
> > SYSTEM_TIME
> > > > AS
> > > > >> OF
> > > > >> >> >>>> proc_time"
> > > > >> >> >>>>>>>>>>>>>>>>> semantics
> > > > >> >> >>>>>>>>>>>>>>>>>>>>> anyway - no matter how it is implemented.
> > > > >> >> >>>>>>>>>>>>>>>>>>>>>
> > > > >> >> >>>>>>>>>>>>>>>>>>>>> Putting myself in the user's shoes, I can
> > say
> > > > >> that:
> > > > >> >> >>>>>>>>>>>>>>>>>>>>> 1) I would prefer to have the opportunity
> > to
> > > > cut
> > > > >> off
> > > > >> >> >>> the
> > > > >> >> >>>> cache
> > > > >> >> >>>>>>>>>>>>>>> size
> > > > >> >> >>>>>>>>>>>>>>>>> by
> > > > >> >> >>>>>>>>>>>>>>>>>>>>> simply filtering unnecessary data. And
> the
> > > most
> > > > >> >> handy
> > > > >> >> >>>> way to do
> > > > >> >> >>>>>>>>>>>>>>> it
> > > > >> >> >>>>>>>>>>>>>>>>> is
> > > > >> >> >>>>>>>>>>>>>>>>>>>> apply
> > > > >> >> >>>>>>>>>>>>>>>>>>>>> it inside LookupRunners. It would be a
> bit
> > > > >> harder to
> > > > >> >> >>>> pass it
> > > > >> >> >>>>>>>>>>>>>>>>> through the
> > > > >> >> >>>>>>>>>>>>>>>>>>>>> LookupJoin node to TableFunction. And
> > > Alexander
> > > > >> >> >>> correctly
> > > > >> >> >>>>>>>>>>>>>>> mentioned
> > > > >> >> >>>>>>>>>>>>>>>>> that
> > > > >> >> >>>>>>>>>>>>>>>>>>>>> filter pushdown still is not implemented
> > for
> > > > >> >> >>>> jdbc/hive/hbase.
> > > > >> >> >>>>>>>>>>>>>>>>>>>>> 2) The ability to set the different
> caching
> > > > >> >> >> parameters
> > > > >> >> >>>> for
> > > > >> >> >>>>>>>>>>>>>>> different
> > > > >> >> >>>>>>>>>>>>>>>>>>>> tables
> > > > >> >> >>>>>>>>>>>>>>>>>>>>> is quite important. So I would prefer to
> > set
> > > it
> > > > >> >> >> through
> > > > >> >> >>>> DDL
> > > > >> >> >>>>>>>>>>>>>>> rather
> > > > >> >> >>>>>>>>>>>>>>>>> than
> > > > >> >> >>>>>>>>>>>>>>>>>>>>> have similar ttl, strategy and other
> > options
> > > > for
> > > > >> >> all
> > > > >> >> >>>> lookup
> > > > >> >> >>>>>>>>>>>>>>> tables.
> > > > >> >> >>>>>>>>>>>>>>>>>>>>> 3) Providing the cache into the framework
> > > > really
> > > > >> >> >>>> deprives us of
> > > > >> >> >>>>>>>>>>>>>>>>>>>>> extensibility (users won't be able to
> > > implement
> > > > >> >> their
> > > > >> >> >>> own
> > > > >> >> >>>>>>>>>>>>>>> cache).
> > > > >> >> >>>>>>>>>>>>>>>>> But
> > > > >> >> >>>>>>>>>>>>>>>>>>>> most
> > > > >> >> >>>>>>>>>>>>>>>>>>>>> probably it might be solved by creating
> > more
> > > > >> >> >> different
> > > > >> >> >>>> cache
> > > > >> >> >>>>>>>>>>>>>>>>> strategies
> > > > >> >> >>>>>>>>>>>>>>>>>>>> and
> > > > >> >> >>>>>>>>>>>>>>>>>>>>> a wider set of configurations.
> > > > >> >> >>>>>>>>>>>>>>>>>>>>>
> > > > >> >> >>>>>>>>>>>>>>>>>>>>> All these points are much closer to the
> > > schema
> > > > >> >> >> proposed
> > > > >> >> >>>> by
> > > > >> >> >>>>>>>>>>>>>>>>> Alexander.
> > > > >> >> >>>>>>>>>>>>>>>>>>>>> Qingsheng Ren, please correct me if I'm
> not
> > > > right
> > > > >> and
> > > > >> >> >>> all
> > > > >> >> >>>> these
> > > > >> >> >>>>>>>>>>>>>>>>>>>> facilities
> > > > >> >> >>>>>>>>>>>>>>>>>>>>> might be simply implemented in your
> > > > architecture?
> > > > >> >> >>>>>>>>>>>>>>>>>>>>>
> > > > >> >> >>>>>>>>>>>>>>>>>>>>> Best regards,
> > > > >> >> >>>>>>>>>>>>>>>>>>>>> Roman Boyko
> > > > >> >> >>>>>>>>>>>>>>>>>>>>> e.: ro.v.boyko@gmail.com
> > > > >> >> >>>>>>>>>>>>>>>>>>>>>
> > > > >> >> >>>>>>>>>>>>>>>>>>>>> On Wed, 4 May 2022 at 21:01, Martijn
> > Visser <
> > > > >> >> >>>>>>>>>>>>>>>>> martijnvisser@apache.org>
> > > > >> >> >>>>>>>>>>>>>>>>>>>>> wrote:
> > > > >> >> >>>>>>>>>>>>>>>>>>>>>
> > > > >> >> >>>>>>>>>>>>>>>>>>>>>> Hi everyone,
> > > > >> >> >>>>>>>>>>>>>>>>>>>>>>
> > > > >> >> >>>>>>>>>>>>>>>>>>>>>> I don't have much to chip in, but just
> > > wanted
> > > > to
> > > > >> >> >>>> express that
> > > > >> >> >>>>>>>>>>>>>>> I
> > > > >> >> >>>>>>>>>>>>>>>>> really
> > > > >> >> >>>>>>>>>>>>>>>>>>>>>> appreciate the in-depth discussion on
> this
> > > > topic
> > > > >> >> >> and I
> > > > >> >> >>>> hope
> > > > >> >> >>>>>>>>>>>>>>> that
> > > > >> >> >>>>>>>>>>>>>>>>>>>> others
> > > > >> >> >>>>>>>>>>>>>>>>>>>>>> will join the conversation.
> > > > >> >> >>>>>>>>>>>>>>>>>>>>>>
> > > > >> >> >>>>>>>>>>>>>>>>>>>>>> Best regards,
> > > > >> >> >>>>>>>>>>>>>>>>>>>>>>
> > > > >> >> >>>>>>>>>>>>>>>>>>>>>> Martijn
> > > > >> >> >>>>>>>>>>>>>>>>>>>>>>
> > > > >> >> >>>>>>>>>>>>>>>>>>>>>> On Tue, 3 May 2022 at 10:15, Александр
> > > > Смирнов <
> > > > >> >> >>>>>>>>>>>>>>>>> smiralexan@gmail.com>
> > > > >> >> >>>>>>>>>>>>>>>>>>>>>> wrote:
> > > > >> >> >>>>>>>>>>>>>>>>>>>>>>
> > > > >> >> >>>>>>>>>>>>>>>>>>>>>>> Hi Qingsheng, Leonard and Jark,
> > > > >> >> >>>>>>>>>>>>>>>>>>>>>>>
> > > > >> >> >>>>>>>>>>>>>>>>>>>>>>> Thanks for your detailed feedback!
> > > However, I
> > > > >> have
> > > > >> >> >>>> questions
> > > > >> >> >>>>>>>>>>>>>>>>> about
> > > > >> >> >>>>>>>>>>>>>>>>>>>>>>> some of your statements (maybe I didn't
> > get
> > > > >> >> >>>> something?).
> > > > >> >> >>>>>>>>>>>>>>>>>>>>>>>
> > > > >> >> >>>>>>>>>>>>>>>>>>>>>>>> Caching actually breaks the semantic
> of
> > > "FOR
> > > > >> >> >>>> SYSTEM_TIME
> > > > >> >> >>>>>>>>>>>>>>> AS OF
> > > > >> >> >>>>>>>>>>>>>>>>>>>>>> proc_time”
> > > > >> >> >>>>>>>>>>>>>>>>>>>>>>>
> > > > >> >> >>>>>>>>>>>>>>>>>>>>>>> I agree that the semantics of "FOR
> > > > SYSTEM_TIME
> > > > >> AS
> > > > >> >> >> OF
> > > > >> >> >>>>>>>>>>>>>>> proc_time"
> > > > >> >> >>>>>>>>>>>>>>>>> is
> > > > >> >> >>>>>>>>>>>>>>>>>>>> not
> > > > >> >> >>>>>>>>>>>>>>>>>>>>>>> fully implemented with caching, but as
> > you
> > > > >> said,
> > > > >> >> >>> users
> > > > >> >> >>>> go
> > > > >> >> >>>>>>>>>>>>>>> on it
> > > > >> >> >>>>>>>>>>>>>>>>>>>>>>> consciously to achieve better
> performance
> > > (no
> > > > >> one
> > > > >> >> >>>> proposed
> > > > >> >> >>>>>>>>>>>>>>> to
> > > > >> >> >>>>>>>>>>>>>>>>> enable
> > > > >> >> >>>>>>>>>>>>>>>>>>>>>>> caching by default, etc.). Or by users
> do
> > > you
> > > > >> mean
> > > > >> >> >>>> other
> > > > >> >> >>>>>>>>>>>>>>>>> developers
> > > > >> >> >>>>>>>>>>>>>>>>>>>> of
> > > > >> >> >>>>>>>>>>>>>>>>>>>>>>> connectors? In this case developers
> > > > explicitly
> > > > >> >> >>> specify
> > > > >> >> >>>>>>>>>>>>>>> whether
> > > > >> >> >>>>>>>>>>>>>>>>> their
> > > > >> >> >>>>>>>>>>>>>>>>>>>>>>> connector supports caching or not (in
> the
> > > > list
> > > > >> of
> > > > >> >> >>>> supported
> > > > >> >> >>>>>>>>>>>>>>>>>>>> options),
> > > > >> >> >>>>>>>>>>>>>>>>>>>>>>> no one makes them do that if they don't
> > > want
> > > > >> to.
> > > > >> >> So
> > > > >> >> >>>> what
> > > > >> >> >>>>>>>>>>>>>>>>> exactly is
> > > > >> >> >>>>>>>>>>>>>>>>>>>>>>> the difference between implementing
> > caching
> > > > in
> > > > >> >> >>> modules
> > > > >> >> >>>>>>>>>>>>>>>>>>>>>>> flink-table-runtime and in
> > > flink-table-common
> > > > >> from
> > > > >> >> >>> the
> > > > >> >> >>>>>>>>>>>>>>>>> considered
> > > > >> >> >>>>>>>>>>>>>>>>>>>>>>> point of view? How does it affect whether the
> > > > >> >> >>>>>>>>>>>>>>>>>>>>>>> semantics of "FOR SYSTEM_TIME AS OF proc_time"
> > > > >> >> >>>>>>>>>>>>>>>>>>>>>>> are broken or not?
> > > > >> >> >>>>>>>>>>>>>>>>>>>>>>>
> > > > >> >> >>>>>>>>>>>>>>>>>>>>>>>> confront a situation that allows table
> > > > >> options in
> > > > >> >> >>> DDL
> > > > >> >> >>>> to
> > > > >> >> >>>>>>>>>>>>>>>>> control
> > > > >> >> >>>>>>>>>>>>>>>>>>>> the
> > > > >> >> >>>>>>>>>>>>>>>>>>>>>>> behavior of the framework, which has
> > never
> > > > >> >> happened
> > > > >> >> >>>>>>>>>>>>>>> previously
> > > > >> >> >>>>>>>>>>>>>>>>> and
> > > > >> >> >>>>>>>>>>>>>>>>>>>>> should
> > > > >> >> >>>>>>>>>>>>>>>>>>>>>>> be cautious
> > > > >> >> >>>>>>>>>>>>>>>>>>>>>>>
> > > > >> >> >>>>>>>>>>>>>>>>>>>>>>> If we talk about main differences of
> > > > semantics
> > > > >> of
> > > > >> >> >> DDL
> > > > >> >> >>>>>>>>>>>>>>> options
> > > > >> >> >>>>>>>>>>>>>>>>> and
> > > > >> >> >>>>>>>>>>>>>>>>>>>>>>> config options("table.exec.xxx"), isn't
> > it
> > > > >> about
> > > > >> >> >>>> limiting
> > > > >> >> >>>>>>>>>>>>>>> the
> > > > >> >> >>>>>>>>>>>>>>>>> scope
> > > > >> >> >>>>>>>>>>>>>>>>>>>> of
> > > > >> >> >>>>>>>>>>>>>>>>>>>>>>> the options + importance for the user
> > > > business
> > > > >> >> >> logic
> > > > >> >> >>>> rather
> > > > >> >> >>>>>>>>>>>>>>> than
> > > > >> >> >>>>>>>>>>>>>>>>>>>>>>> specific location of corresponding
> logic
> > in
> > > > the
> > > > >> >> >>>> framework? I
> > > > >> >> >>>>>>>>>>>>>>>>> mean
> > > > >> >> >>>>>>>>>>>>>>>>>>>> that
> > > > >> >> >>>>>>>>>>>>>>>>>>>>>>> in my design, for example, putting an
> > > option
> > > > >> with
> > > > >> >> >>>> lookup
> > > > >> >> >>>>>>>>>>>>>>> cache
> > > > >> >> >>>>>>>>>>>>>>>>>>>>>>> strategy in configurations would be the wrong
> > > > >> >> >>>>>>>>>>>>>>>>>>>>>>> decision,
> > > > >> >> >>>>>>>>>>>>>>>>> because it
> > > > >> >> >>>>>>>>>>>>>>>>>>>>>>> directly affects the user's business
> > logic
> > > > (not
> > > > >> >> >> just
> > > > >> >> >>>>>>>>>>>>>>> performance
> > > > >> >> >>>>>>>>>>>>>>>>>>>>>>> optimization) + touches just several
> > > > functions
> > > > >> of
> > > > >> >> >> ONE
> > > > >> >> >>>> table
> > > > >> >> >>>>>>>>>>>>>>>>> (there
> > > > >> >> >>>>>>>>>>>>>>>>>>>> can
> > > > >> >> >>>>>>>>>>>>>>>>>>>>>>> be multiple tables with different
> > caches).
> > > > >> Does it
> > > > >> >> >>>> really
> > > > >> >> >>>>>>>>>>>>>>>>> matter for
> > > > >> >> >>>>>>>>>>>>>>>>>>>>>>> the user (or someone else) where the
> > logic
> > > is
> > > > >> >> >>> located,
> > > > >> >> >>>>>>>>>>>>>>> which is
> > > > >> >> >>>>>>>>>>>>>>>>>>>>>>> affected by the applied option?
> > > > >> >> >>>>>>>>>>>>>>>>>>>>>>> Also I can remember DDL option
> > > > >> 'sink.parallelism',
> > > > >> >> >>>> which in
> > > > >> >> >>>>>>>>>>>>>>>>> some way
> > > > >> >> >>>>>>>>>>>>>>>>>>>>>>> "controls the behavior of the
> framework"
> > > and
> > > > I
> > > > >> >> >> don't
> > > > >> >> >>>> see any
> > > > >> >> >>>>>>>>>>>>>>>>> problem
> > > > >> >> >>>>>>>>>>>>>>>>>>>>>>> here.
> > > > >> >> >>>>>>>>>>>>>>>>>>>>>>>
> > > > >> >> >>>>>>>>>>>>>>>>>>>>>>>> introduce a new interface for this
> > > > all-caching
> > > > >> >> >>>> scenario
> > > > >> >> >>>>>>>>>>>>>>> and
> > > > >> >> >>>>>>>>>>>>>>>>> the
> > > > >> >> >>>>>>>>>>>>>>>>>>>>> design
> > > > >> >> >>>>>>>>>>>>>>>>>>>>>>> would become more complex
> > > > >> >> >>>>>>>>>>>>>>>>>>>>>>>
> > > > >> >> >>>>>>>>>>>>>>>>>>>>>>> This is a subject for a separate
> > > discussion,
> > > > >> but
> > > > >> >> >>>> actually
> > > > >> >> >>>>>>>>>>>>>>> in our
> > > > >> >> >>>>>>>>>>>>>>>>>>>>>>> internal version we solved this problem
> > > quite
> > > > >> >> >> easily
> > > > >> >> >>> -
> > > > >> >> >>>> we
> > > > >> >> >>>>>>>>>>>>>>> reused
> > > > >> >> >>>>>>>>>>>>>>>>>>>>>>> InputFormat class (so there is no need
> > for
> > > a
> > > > >> new
> > > > >> >> >>> API).
> > > > >> >> >>>> The
> > > > >> >> >>>>>>>>>>>>>>>>> point is
> > > > >> >> >>>>>>>>>>>>>>>>>>>>>>> that currently all lookup connectors
> use
> > > > >> >> >> InputFormat
> > > > >> >> >>>> for
> > > > >> >> >>>>>>>>>>>>>>>>> scanning
> > > > >> >> >>>>>>>>>>>>>>>>>>>> the
> > > > >> >> >>>>>>>>>>>>>>>>>>>>>>> data in batch mode: HBase, JDBC and
> even
> > > Hive
> > > > >> - it
> > > > >> >> >>> uses
> > > > >> >> >>>>>>>>>>>>>>> class
> > > > >> >> >>>>>>>>>>>>>>>>>>>>>>> PartitionReader, that is actually just
> a
> > > > >> wrapper
> > > > >> >> >>> around
> > > > >> >> >>>>>>>>>>>>>>>>> InputFormat.
> > > > >> >> >>>>>>>>>>>>>>>>>>>>>>> The advantage of this solution is the
> > > ability
> > > > >> to
> > > > >> >> >>> reload
> > > > >> >> >>>>>>>>>>>>>>> cache
> > > > >> >> >>>>>>>>>>>>>>>>> data
> > > > >> >> >>>>>>>>>>>>>>>>>>>> in
> > > > >> >> >>>>>>>>>>>>>>>>>>>>>>> parallel (number of threads depends on
> > > number
> > > > >> of
> > > > >> >> >>>>>>>>>>>>>>> InputSplits,
> > > > >> >> >>>>>>>>>>>>>>>>> but
> > > > >> >> >>>>>>>>>>>>>>>>>>>> has
> > > > >> >> >>>>>>>>>>>>>>>>>>>>>>> an upper limit). As a result cache
> reload
> > > > time
> > > > >> >> >>>> significantly
> > > > >> >> >>>>>>>>>>>>>>>>> reduces
> > > > >> >> >>>>>>>>>>>>>>>>>>>>>>> (as well as time of input stream
> > > blocking). I
> > > > >> know
> > > > >> >> >>> that
> > > > >> >> >>>>>>>>>>>>>>> usually
> > > > >> >> >>>>>>>>>>>>>>>>> we
> > > > >> >> >>>>>>>>>>>>>>>>>>>> try
> > > > >> >> >>>>>>>>>>>>>>>>>>>>>>> to avoid usage of concurrency in Flink
> > > code,
> > > > >> but
> > > > >> >> >>> maybe
> > > > >> >> >>>> this
> > > > >> >> >>>>>>>>>>>>>>> one
> > > > >> >> >>>>>>>>>>>>>>>>> can
> > > > >> >> >>>>>>>>>>>>>>>>>>>> be
> > > > >> >> >>>>>>>>>>>>>>>>>>>>>>> an exception. BTW I don't say that it's
> > an
> > > > >> ideal
> > > > >> >> >>>> solution,
> > > > >> >> >>>>>>>>>>>>>>> maybe
> > > > >> >> >>>>>>>>>>>>>>>>>>>> there
> > > > >> >> >>>>>>>>>>>>>>>>>>>>>>> are better ones.
> > > > >> >> >>>>>>>>>>>>>>>>>>>>>>>
> > > > >> >> >>>>>>>>>>>>>>>>>>>>>>>> Providing the cache in the framework
> > might
> > > > >> >> >> introduce
> > > > >> >> >>>>>>>>>>>>>>>>> compatibility
> > > > >> >> >>>>>>>>>>>>>>>>>>>>>> issues
> > > > >> >> >>>>>>>>>>>>>>>>>>>>>>>
> > > > >> >> >>>>>>>>>>>>>>>>>>>>>>> It's possible only in cases when the
> > > > developer
> > > > >> of
> > > > >> >> >> the
> > > > >> >> >>>>>>>>>>>>>>> connector
> > > > >> >> >>>>>>>>>>>>>>>>>>>> won't
> > > > >> >> >>>>>>>>>>>>>>>>>>>>>>> properly refactor his code and will use
> > new
> > > > >> cache
> > > > >> >> >>>> options
> > > > >> >> >>>>>>>>>>>>>>>>>>>> incorrectly
> > > > >> >> >>>>>>>>>>>>>>>>>>>>>>> (i.e. explicitly provide the same
> options
> > > > into
> > > > >> 2
> > > > >> >> >>>> different
> > > > >> >> >>>>>>>>>>>>>>> code
> > > > >> >> >>>>>>>>>>>>>>>>>>>>>>> places). For correct behavior all he
> will
> > > > need
> > > > >> to
> > > > >> >> >> do
> > > > >> >> >>>> is to
> > > > >> >> >>>>>>>>>>>>>>>>> redirect
> > > > >> >> >>>>>>>>>>>>>>>>>>>>>>> existing options to the framework's
> > > > >> LookupConfig
> > > > >> >> (+
> > > > >> >> >>>> maybe
> > > > >> >> >>>>>>>>>>>>>>> add an
> > > > >> >> >>>>>>>>>>>>>>>>>>>> alias
> > > > >> >> >>>>>>>>>>>>>>>>>>>>>>> for options, if there was different
> > > naming),
> > > > >> >> >>> everything
> > > > >> >> >>>>>>>>>>>>>>> will be
> > > > >> >> >>>>>>>>>>>>>>>>>>>>>>> transparent for users. If the developer
> > > won't
> > > > >> do
> > > > >> >> >>>>>>>>>>>>>>> refactoring at
> > > > >> >> >>>>>>>>>>>>>>>>> all,
> > > > >> >> >>>>>>>>>>>>>>>>>>>>>>> nothing will be changed for the
> connector
> > > > >> because
> > > > >> >> >> of
> > > > >> >> >>>>>>>>>>>>>>> backward
> > > > >> >> >>>>>>>>>>>>>>>>>>>>>>> compatibility. Also if a developer
> wants
> > to
> > > > use
> > > > >> >> his
> > > > >> >> >>> own
> > > > >> >> >>>>>>>>>>>>>>> cache
> > > > >> >> >>>>>>>>>>>>>>>>> logic,
> > > > >> >> >>>>>>>>>>>>>>>>>>>>>>> he can simply choose not to pass some of the
> > > > configs
> > > > >> >> into
> > > > >> >> >>> the
> > > > >> >> >>>>>>>>>>>>>>>>> framework,
> > > > >> >> >>>>>>>>>>>>>>>>>>>> and
> > > > >> >> >>>>>>>>>>>>>>>>>>>>>>> instead make his own implementation
> with
> > > > >> already
> > > > >> >> >>>> existing
> > > > >> >> >>>>>>>>>>>>>>>>> configs
> > > > >> >> >>>>>>>>>>>>>>>>>>>> and
> > > > >> >> >>>>>>>>>>>>>>>>>>>>>>> metrics (but actually I think that
> it's a
> > > > rare
> > > > >> >> >> case).
> > > > >> >> >>>>>>>>>>>>>>>>>>>>>>>
> > filters and projections should be pushed all the way down to the table
> > function, like what we do in the scan source
>
> That is a great goal. But the truth is that currently the ONLY connector
> that supports filter pushdown is FileSystemTableSource (no database
> connector supports it). Also, for some databases it's simply impossible to
> push down filters as complex as the ones we have in Flink.
>
> > only applying these optimizations to the cache seems not quite useful
>
> Filters can cut off an arbitrarily large amount of data from the dimension
> table. For a simple example, suppose the dimension table 'users' has a
> column 'age' with values from 20 to 40, and the input stream 'clicks' is
> roughly uniformly distributed by user age. If we have the filter
> 'age > 30', there will be half as much data in the cache. This means the
> user can increase 'lookup.cache.max-rows' by almost 2 times, which gives a
> huge performance boost. Moreover, this optimization really starts to shine
> with the 'ALL' cache, where tables that can't fit in memory without
> filters and projections can fit with them. This opens up additional
> possibilities for users, and it doesn't sound like 'not quite useful'.
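For illustration, the effect of pre-filtering on cache size can be sketched in plain Java (a hypothetical sketch, not FLIP-221 API: the LRU bound plays the role of 'lookup.cache.max-rows', and the predicate stands in for the pushed-down filter):

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Predicate;

public class FilteredLookupCache {

    // Simple LRU cache bounded by maxRows, mimicking 'lookup.cache.max-rows'.
    static final class LruCache<K, V> extends LinkedHashMap<K, V> {
        private final int maxRows;
        LruCache(int maxRows) {
            super(16, 0.75f, true); // access-order for LRU eviction
            this.maxRows = maxRows;
        }
        @Override
        protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
            return size() > maxRows;
        }
    }

    // Fills one cache without filtering and one with the filter applied
    // before insertion; returns both sizes.
    static int[] cacheSizes() {
        Map<Integer, Integer> users = new HashMap<>(); // id -> age, ages 20..40
        for (int id = 0; id <= 20; id++) {
            users.put(id, 20 + id);
        }
        Predicate<Integer> ageFilter = age -> age > 30; // the join filter

        LruCache<Integer, Integer> unfiltered = new LruCache<>(100);
        LruCache<Integer, Integer> filtered = new LruCache<>(100);
        for (Map.Entry<Integer, Integer> row : users.entrySet()) {
            unfiltered.put(row.getKey(), row.getValue());
            if (ageFilter.test(row.getValue())) {
                filtered.put(row.getKey(), row.getValue());
            }
        }
        return new int[] {unfiltered.size(), filtered.size()};
    }

    public static void main(String[] args) {
        int[] sizes = cacheSizes();
        System.out.println(sizes[0] + " vs " + sizes[1]); // prints "21 vs 10"
    }
}
```

With the same memory budget, the filtered cache holds roughly half as many rows, which is exactly the headroom argument above.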
>
> It would be great to hear other voices on this topic! We have quite a lot
> of controversial points, and I think that with the help of others it will
> be easier for us to come to a consensus.
>
> Best regards,
> Smirnov Alexander
>
> On Fri, Apr 29, 2022 at 22:33, Qingsheng Ren <renqschn@gmail.com> wrote:
> >
> > Hi Alexander and Arvid,
> >
> > Thanks for the discussion, and sorry for my late response! We had an
> > internal discussion together with Jark and Leonard, and I'd like to
> > summarize our ideas. Instead of implementing the cache logic in the
> > table runtime layer or wrapping it around the user-provided table
> > function, we prefer to introduce some new APIs extending TableFunction,
> > for these reasons:
> >
> > 1. Caching actually breaks the semantics of "FOR SYSTEM_TIME AS OF
> > proc_time", because it can no longer truly reflect the content of the
> > lookup table at the moment of querying. If users choose to enable
> > caching on the lookup table, they implicitly indicate that this
> > breakage is acceptable in exchange for performance. So we prefer not to
> > provide caching at the table runtime level.
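As a toy illustration of that semantic point (plain Java with hypothetical names; the HashMap stands in for the external dimension table), once a key is cached, a later lookup no longer observes an external update:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

public class StaleLookupDemo {

    // Wraps a lookup function with a naive unbounded cache:
    // hits never go back to the external system.
    static <K, V> Function<K, V> cached(Function<K, V> lookup) {
        Map<K, V> cache = new HashMap<>();
        return key -> cache.computeIfAbsent(key, lookup);
    }

    // Returns the values seen before and after the external table changes.
    static String[] run() {
        Map<Integer, String> dimTable = new HashMap<>();
        dimTable.put(1, "v1");
        Function<Integer, String> lookup = cached(dimTable::get);

        String before = lookup.apply(1); // fetched from the table and cached
        dimTable.put(1, "v2");           // the external table changes
        String after = lookup.apply(1);  // served from cache: still "v1"
        return new String[] {before, after};
    }

    public static void main(String[] args) {
        String[] seen = run();
        System.out.println(seen[0] + " " + seen[1]); // prints "v1 v1"
    }
}
```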
> >
> > 2. If we put the cache implementation in the framework (whether in a
> > runner or in a wrapper around TableFunction), we have to confront a
> > situation in which table options in DDL control the behavior of the
> > framework, which has never happened previously and should be treated
> > cautiously. Under the current design, the behavior of the framework
> > should only be specified by configurations ("table.exec.xxx"), and it's
> > hard to apply these general configs to one specific table.
> >
> > 3. We have use cases in which the lookup source loads and periodically
> > refreshes all records into memory to achieve high lookup performance
> > (like the Hive connector in the community; this is also widely used by
> > our internal connectors). Wrapping the cache around the user's
> > TableFunction works fine for LRU caches, but I think we would have to
> > introduce a new interface for this all-caching scenario, and the design
> > would become more complex.
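The all-caching pattern from point 3 can be sketched roughly as follows (plain Java with hypothetical names; a real implementation would run the reload on a timer and handle load failures):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

// An "ALL" cache: the whole dimension table is loaded into memory and
// swapped atomically on each periodic refresh; lookups never hit the
// external system between reloads.
public class AllCache<K, V> {
    private final Supplier<Map<K, V>> loader; // scans the full table
    private volatile Map<K, V> snapshot = new HashMap<>();

    public AllCache(Supplier<Map<K, V>> loader) {
        this.loader = loader;
    }

    // Called periodically, e.g. by a scheduled task.
    public void reload() {
        snapshot = new HashMap<>(loader.get());
    }

    public V lookup(K key) {
        return snapshot.get(key);
    }

    public static void main(String[] args) {
        AllCache<Integer, String> cache =
                new AllCache<>(() -> Map.of(1, "Alice", 2, "Bob"));
        cache.reload(); // initial load
        System.out.println(cache.lookup(1)); // prints "Alice"
    }
}
```

Note that the lookup path here differs enough from a miss-driven LRU wrapper that a shared interface needs to cover both, which is the complexity the point above alludes to.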
> >
> > 4. Providing the cache in the framework might introduce compatibility
> > issues for existing lookup sources: if the user configures the table
> > incorrectly, two caches with totally different strategies might exist
> > at the same time (one in the framework and another implemented by the
> > lookup source).
> >
> > As for the optimization mentioned by Alexander, I think filters and
> > projections should be pushed all the way down to the table function,
> > like what we do in the scan source, instead of into the runner with the
> > cache. The goal of using a cache is to reduce network I/O and the
> > pressure on the external system, and applying these optimizations only
> > to the cache seems not quite useful.
> >
> > I made some updates to the FLIP [1] to reflect our ideas. We prefer to
> > keep the cache implementation as a part of TableFunction, and we could
> > provide some helper classes (CachingTableFunction,
> > AllCachingTableFunction, CachingAsyncTableFunction) to developers and
> > standardize the metrics of the cache. Also, I made a POC [2] for your
> > reference.
> >
> > Looking forward to your ideas!
> >
> > [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-221+Abstraction+for+lookup+source+cache+and+metric
> > [2] https://github.com/PatrickRen/flink/tree/FLIP-221
> >
> > Best regards,
> >
> > Qingsheng
> >
> > On Tue, Apr 26, 2022 at 4:45 PM Александр Смирнов
> > <smiralexan@gmail.com> wrote:
> > >
> > > Thanks for the response, Arvid!
> > >
> > > I have a few comments on your message.
> > >
> > > > but could also live with an easier solution as the first step:
> > >
> > > I think these two ways (the one originally proposed by Qingsheng and
> > > mine) are mutually exclusive, because conceptually they pursue the
> > > same goal but differ in implementation details. If we go one way,
> > > moving to the other in the future will mean deleting existing code
> > > and once again changing the API for connectors. So I think we should
> > > reach a consensus with the community on that first and then work
> > > together on this FLIP, i.e. divide the work into tasks for the
> > > different parts of the FLIP (for example, LRU cache unification /
> > > introducing the proposed set of metrics / further work…). WDYT,
> > > Qingsheng?
> > >
> > > > as the source will only receive the requests after filter
> > >
> > > Actually, if filters are applied to fields of the lookup table, we
> > > first have to make the requests, and only after that can we filter
> > > the responses, because lookup connectors don't have filter pushdown.
> > > So if filtering is done before caching, there will be far fewer rows
> > > in the cache.
> > >
> > > > @Alexander unfortunately, your architecture is not shared. I don't
> > > > know the solution to share images to be honest.
> > >
> > > Sorry for that, I'm a bit new to these kinds of conversations :)
> > > I have no write access to the Confluence wiki, so I made a Jira issue
> > > where I described the proposed changes in more detail:
> > > https://issues.apache.org/jira/browse/FLINK-27411.
> > >
> > > Will be happy to get more feedback!
> > >
> > > Best,
> > > Smirnov Alexander
> > >
> > > On Mon, Apr 25, 2022 at 19:49, Arvid Heise <arvid@apache.org> wrote:
> > > >
> > > > Hi Qingsheng,
> > > >
> > > > Thanks for driving this; the inconsistency was not satisfying for
> > > > me.
> > > >
> > > > I second Alexander's idea but could also live with an easier
> > > > solution as the first step: instead of making caching an
> > > > implementation detail of TableFunction X, devise a caching layer
> > > > around X. So the proposal would be a CachingTableFunction that
> > > > delegates to X in case of misses and otherwise manages the cache.
> > > > Lifting it into the operator model as proposed would be even better
> > > > but is probably unnecessary in the first step for a lookup source
> > > > (as the source will only receive the requests after the filter;
> > > > applying projection may be more interesting, to save memory).
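That delegation idea can be sketched roughly like this (plain Java with hypothetical names, not the actual FLIP-221 interfaces): the wrapper owns an LRU map and only calls the inner function "X" on a miss.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// A caching layer around an inner lookup function "X": hits are served
// from the LRU map, misses are delegated to X and the result is cached.
public class CachingLookup<K, V> {
    private final Function<K, V> delegate;
    private final Map<K, V> lru;
    private long hits;
    private long misses;

    public CachingLookup(Function<K, V> delegate, int maxRows) {
        this.delegate = delegate;
        // Access-order LinkedHashMap evicting the eldest entry past maxRows.
        this.lru = new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                return size() > maxRows;
            }
        };
    }

    public V lookup(K key) {
        V value = lru.get(key);
        if (value != null) {
            hits++;
            return value;
        }
        misses++;
        value = delegate.apply(key); // delegate to X only on a miss
        if (value != null) {
            lru.put(key, value);
        }
        return value;
    }

    public long hits() { return hits; }
    public long misses() { return misses; }

    public static void main(String[] args) {
        CachingLookup<Integer, String> cached =
                new CachingLookup<>(id -> "row-" + id, 1000);
        cached.lookup(7); // miss: delegates to the inner function
        cached.lookup(7); // hit: served from the cache
        System.out.println(cached.hits() + "/" + cached.misses()); // "1/1"
    }
}
```

Because the layer sits outside X, hit/miss counters come for free, which is also what would make a uniform metric set possible.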
> > > >
> > > > Another advantage is that all the changes of this FLIP would be
> > > > limited to options; there would be no need for new public
> > > > interfaces. Everything else remains an implementation detail of
> > > > the table runtime. That means we can easily incorporate the
> > > > optimization potential that Alexander pointed out later.
> > > >
> > > > @Alexander unfortunately, your architecture is not shared. I don't
> > > > know the solution to share images to be honest.
> > > >
> > > > On Fri, Apr 22, 2022 at 5:04 PM Александр Смирнов
> > > > <smiralexan@gmail.com> wrote:
> > > >
> > > > > Hi Qingsheng! My name is Alexander; I'm not a committer yet, but
> > > > > I'd really like to become one, and this FLIP really interested
> > > > > me. I have actually worked on a similar feature in my company's
> > > > > Flink fork, and we would like to share our thoughts on this and
> > > > > make the code open source.
> > > > >
> > > > > I think there is a better alternative than introducing an
> > > > > abstract class for TableFunction (CachingTableFunction). As you
> > > > > know, TableFunction lives in the flink-table-common module, which
> > > > > provides only an API for working with tables, so it's very
> > > > > convenient to import in connectors. In turn, CachingTableFunction
> > > > > contains logic for runtime execution, so this class and
> > > > > everything connected with it should be located in another module,
> > > > > probably flink-table-runtime. But that would require connectors
> > > > > to depend on a module that contains a lot of runtime logic, which
> > > > > doesn't sound good.
> > > > >
> > > > > I suggest adding a new method 'getLookupConfig' to
> > > > > LookupTableSource or LookupRuntimeProvider, allowing connectors
> > > > > to pass only their configuration to the planner, so that they
> > > > > won't depend on the runtime implementation. Based on these
> > > > > configs the planner will construct a lookup join operator with
> > > > > the corresponding runtime logic (ProcessFunctions in the
> > > > > flink-table-runtime module). The architecture looks like the
> > > > > pinned image (the LookupConfig class there is actually your
> > > > > CacheConfig).
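A rough sketch of what that could look like (everything here is hypothetical and simplified; getLookupConfig is the method being proposed, not an existing Flink API, and the runner names echo the classes mentioned in this message):

```java
// Hypothetical, simplified stand-ins for the proposal: the connector only
// describes its caching needs, and the planner picks the runtime class.
public class LookupConfigSketch {

    enum CacheStrategy { NONE, LRU, ALL }

    // What a connector would hand over to the planner.
    record LookupConfig(CacheStrategy strategy, long maxRows, long ttlMillis) {}

    // The proposed addition on the source side (not an existing API).
    interface LookupTableSource {
        LookupConfig getLookupConfig();
    }

    // Planner side: chooses a runner purely from the config.
    static String chooseRunner(LookupConfig config) {
        switch (config.strategy()) {
            case LRU:
                return "LookupJoinCachingRunner";
            case ALL:
                return "LookupJoinAllCachingRunner";
            default:
                return "LookupJoinRunner";
        }
    }

    public static void main(String[] args) {
        LookupTableSource jdbcLikeSource =
                () -> new LookupConfig(CacheStrategy.LRU, 10_000, 600_000);
        System.out.println(chooseRunner(jdbcLikeSource.getLookupConfig()));
        // prints "LookupJoinCachingRunner"
    }
}
```

The point of the shape is that the connector compiles against a config-only API, while the cache itself stays in the runtime module.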
> > > > >
> > > > > The classes in flink-table-planner that will be responsible for
> > > > > this are CommonPhysicalLookupJoin and its inheritors. The current
> > > > > classes for lookup join in flink-table-runtime are
> > > > > LookupJoinRunner, AsyncLookupJoinRunner, LookupJoinRunnerWithCalc
> > > > > and AsyncLookupJoinRunnerWithCalc. I suggest adding classes
> > > > > LookupJoinCachingRunner, LookupJoinCachingRunnerWithCalc, etc.
> > > > >
> > > > > And here comes another, more powerful advantage of such a
> > > > > solution. If the caching logic lives at a lower level, we can
> > > > > apply some optimizations to it. LookupJoinRunnerWithCalc was
> > > > > named this way because it uses the 'calc' function, which mostly
> > > > > consists of filters and projections.
> > > > >
> > > > > For example, when joining table A with lookup table B under the
> > > > > condition 'JOIN … ON A.id = B.id AND A.age = B.age + 10 WHERE
> > > > > B.salary > 1000', the 'calc' function will contain the filters
> > > > > A.age = B.age + 10 and B.salary > 1000.
> > > > >
> > > > > If we apply this function before storing records in the cache,
> > > > > the size of the cache will be significantly reduced: filters let
> > > > > us avoid storing useless records, and projections reduce each
> > > > > record's size. So the initial maximum number of records in the
> > > > > cache can be increased by the user.
> > > > >
> > > > > What do you think about it?
> > > > >
> > > > > On 2022/04/19 02:47:11, Qingsheng Ren wrote:
> > > > > > Hi devs,
> > > > > >
> > > > > > Yuan and I would like to start a discussion about FLIP-221 [1],
> > > > > > which introduces an abstraction for the lookup table cache and
> > > > > > its standard metrics.
> > > > > >
> > > > > > Currently each lookup table source has to implement its own
> > > > > > cache to store lookup results, and there isn't a standard set
> > > > > > of metrics for users and developers to use when tuning their
> > > > > > jobs with lookup joins, which is a quite common use case in
> > > > > > Flink Table / SQL.
> > > > > >
> > > > > > Therefore we propose some new APIs, including the cache,
> > > > > > metrics, wrapper classes of TableFunction, and new table
> > > > > > options. Please take a look at the FLIP page [1] for more
> > > > > > details. Any suggestions and comments would be appreciated!
> > > > > >
> > > > > > [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-221+Abstraction+for+lookup+source+cache+and+metric
> > > > >> >> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > > >> >> >>>>>>>>>>>>>>>>>>>>>>>>>>>> Best regards,
> > > > >> >> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > > >> >> >>>>>>>>>>>>>>>>>>>>>>>>>>>> Qingsheng
> > > > >> >> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > > >> >> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > > >> >> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > > >> >> >>>>>>>>>>>>>>>>>>>>>>>>
> > > > >> >> >>>>>>>>>>>>>>>>>>>>>>>>
> > > > >> >> >>>>>>>>>>>>>>>>>>>>>>>>
> > > > >> >> >>>>>>>>>>>>>>>>>>>>>>>> --
> > > > >> >> >>>>>>>>>>>>>>>>>>>>>>>> Best Regards,
> > > > >> >> >>>>>>>>>>>>>>>>>>>>>>>>
> > > > >> >> >>>>>>>>>>>>>>>>>>>>>>>> Qingsheng Ren
> > > > >> >> >>>>>>>>>>>>>>>>>>>>>>>>
> > > > >> >> >>>>>>>>>>>>>>>>>>>>>>>> Real-time Computing Team
> > > > >> >> >>>>>>>>>>>>>>>>>>>>>>>> Alibaba Cloud
> > > > >> >> >>>>>>>>>>>>>>>>>>>>>>>>
> > > > >> >> >>>>>>>>>>>>>>>>>>>>>>>> Email: renqschn@gmail.com
> > > > >> >> >>>>>>>>>>>>>>>>>>>>>>>
> > > > >> >> >>>>>>>>>>>>>>>>>>>>>>
> > > > >> >> >>>>>>>>>>>>>>>>>>>>>
> > > > >> >> >>>>>>>>>>>>>>>>>>>>
> > > > >> >> >>>>>>>>>>>>>>>>>>>
> > > > >> >> >>>>>>>>>>>>>>>>>>>
> > > > >> >> >>>>>>>>>>>>>>>>>>> --
> > > > >> >> >>>>>>>>>>>>>>>>>>> Best regards,
> > > > >> >> >>>>>>>>>>>>>>>>>>> Roman Boyko
> > > > >> >> >>>>>>>>>>>>>>>>>>> e.: ro.v.boyko@gmail.com
> > > > >> >> >>>>>>>>>>>>>>>>>>>
> > > > >> >> >>>>>>>>>>>>>>>>>
> > > > >> >> >>>>>>>>>>>>>>>
> > > > >> >> >>>>>>>>>>>>>>>
> > > > >> >> >>>>>>>>>>>>>
> > > > >> >> >>>>>>>>>>>>
> > > > >> >> >>>>>>>>>>>>
> > > > >> >> >>>>>>>>>>>> --
> > > > >> >> >>>>>>>>>>>> Best Regards,
> > > > >> >> >>>>>>>>>>>>
> > > > >> >> >>>>>>>>>>>> Qingsheng Ren
> > > > >> >> >>>>>>>>>>>>
> > > > >> >> >>>>>>>>>>>> Real-time Computing Team
> > > > >> >> >>>>>>>>>>>> Alibaba Cloud
> > > > >> >> >>>>>>>>>>>>
> > > > >> >> >>>>>>>>>>>> Email: renqschn@gmail.com
> > > > >> >> >>>>>>>>>>
> > > > >> >> >>>>
> > > > >> >> >>>>
> > > > >> >> >>>
> > > > >> >> >>
> > > > >> >>
> > > > >> >>
> > > > >>
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] FLIP-221 Abstraction for lookup source cache and metric

Posted by Jing Ge <ji...@ververica.com>.
Hi Jark,

Thanks for clarifying it. It would be fine, as long as we could provide the
no-cache solution. I was just wondering whether the client-side cache could
really help when HBase is used, since the data to look up is usually huge.
Depending on how much data is cached on the client side, the data that
should be LRU in e.g. LruBlockCache may not be LRU anymore. In the worst
case scenario, once the cached data on the client side has expired, the
request will hit disk, which will cause extra latency temporarily, if I am
not mistaken.

Best regards,
Jing

On Mon, May 30, 2022 at 9:59 AM Jark Wu <im...@gmail.com> wrote:

> Hi Jing Ge,
>
> What do you mean about the "impact on the block cache used by HBase"?
> In my understanding, the connector cache and HBase cache are totally two
> things.
> The connector cache is a local/client cache, and the HBase cache is a
> server cache.
>
> > does it make sense to have a no-cache solution as one of the
> default solutions so that customers will have no effort for the migration
> if they want to stick with Hbase cache
>
> The implementation migration should be transparent to users. Take the
> HBase connector as an example: it already supports a lookup cache, which
> is disabled by default. After the migration, the connector will still
> disable the cache by default (i.e. the no-cache solution), so there is no
> migration effort for users.
>
> The HBase cache and the connector cache are two different things, and the
> HBase cache can't simply replace the connector cache: one of the most
> important purposes of the connector cache is reducing the number of I/O
> requests/responses and improving throughput, which cannot be achieved by
> a server-side cache alone.
>
> Best,
> Jark
>
>
>
>
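To illustrate Jark's point that a client-side connector cache cuts I/O round-trips independently of any server-side cache, here is a minimal, hypothetical Java sketch. The names (`ClientSideCacheDemo`, `remoteLookup`) are illustrative only and are not part of the FLIP's actual API:

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical illustration: a client-side cache in front of a remote lookup. */
public class ClientSideCacheDemo {
    /** Counts simulated round-trips to the server. */
    static int backendCalls = 0;

    /** Simulated remote lookup, e.g. an HBase GET; every call is one I/O round-trip. */
    static String remoteLookup(String key) {
        backendCalls++;
        return "value-of-" + key;
    }

    /** A trivial client-side cache: hits never reach the server at all. */
    static Map<String, String> cache = new HashMap<>();

    static String cachedLookup(String key) {
        return cache.computeIfAbsent(key, ClientSideCacheDemo::remoteLookup);
    }

    public static void main(String[] args) {
        // Three lookups of the same key: only the first is a round-trip.
        cachedLookup("k1");
        cachedLookup("k1");
        cachedLookup("k1");
        System.out.println(backendCalls); // prints 1, not 3
    }
}
```

A server-side cache (like LruBlockCache) can make each request cheaper, but only a client-side cache removes the request entirely, which is the throughput argument above.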
> On Fri, 27 May 2022 at 22:42, Jing Ge <ji...@ververica.com> wrote:
>
> > Thanks all for the valuable discussion. The new feature looks very
> > interesting.
> >
> > According to the FLIP description: "*Currently we have JDBC, Hive and
> > HBase connector implemented lookup table source. All existing
> > implementations will be migrated to the current design and the
> > migration will be transparent to end users*." I was only wondering
> > whether we should pay attention to HBase and similar DBs. Since the
> > lookup data is commonly huge when HBase is used, partial caching will
> > be used in this case, if I am not mistaken, which might have an impact
> > on the block cache used by HBase, e.g. LruBlockCache.
> > Another question: since HBase provides a sophisticated cache solution,
> > does it make sense to have a no-cache solution as one of the default
> > solutions, so that customers will have no migration effort if they want
> > to stick with the HBase cache?
> >
> > Best regards,
> > Jing
> >
> > On Fri, May 27, 2022 at 11:19 AM Jingsong Li <ji...@gmail.com>
> > wrote:
> >
> > > Hi all,
> > >
> > > I think the problems now are as follows:
> > > 1. The AllCache and PartialCache interfaces are not uniform: one
> > > needs to provide a LookupProvider, the other a CacheBuilder.
> > > 2. The AllCache definition is not flexible. For example, PartialCache
> > > can use any custom storage while AllCache cannot; AllCache could also
> > > store to memory or disk, which likewise needs a flexible strategy.
> > > 3. AllCache cannot customize the ReloadStrategy; currently there is
> > > only ScheduledReloadStrategy.
> > >
> > > In order to solve the above problems, the following are my ideas.
> > >
> > > ## Top level cache interfaces:
> > >
> > > ```
> > >
> > > public interface CacheLookupProvider extends
> > > LookupTableSource.LookupRuntimeProvider {
> > >
> > >     CacheBuilder createCacheBuilder();
> > > }
> > >
> > >
> > > public interface CacheBuilder {
> > >     Cache create();
> > > }
> > >
> > >
> > > public interface Cache {
> > >
> > >     /**
> > >      * Returns the value associated with key in this cache, or null if
> > > there is no cached value for
> > >      * key.
> > >      */
> > >     @Nullable
> > >     Collection<RowData> getIfPresent(RowData key);
> > >
> > >     /** Returns the number of key-value mappings in the cache. */
> > >     long size();
> > > }
> > >
> > > ```
> > >
> > > ## Partial cache
> > >
> > > ```
> > >
> > > public interface PartialCacheLookupFunction extends CacheLookupProvider {
> > >
> > >     @Override
> > >     PartialCacheBuilder createCacheBuilder();
> > >
> > >     /** Creates a {@link LookupFunction} instance. */
> > >     LookupFunction createLookupFunction();
> > > }
> > >
> > >
> > > public interface PartialCacheBuilder extends CacheBuilder {
> > >
> > >     PartialCache create();
> > > }
> > >
> > >
> > > public interface PartialCache extends Cache {
> > >
> > >     /**
> > >      * Associates the specified value rows with the specified key row
> > >      * in the cache. If the cache previously contained a value
> > >      * associated with the key, the old value is replaced by the
> > >      * specified value.
> > >      *
> > >      * @return the previous value rows associated with key, or null
> > >      * if there was no mapping for key.
> > >      * @param key - key row with which the specified value is to be
> > >      * associated
> > >      * @param value - value rows to be associated with the specified key
> > >      */
> > >     Collection<RowData> put(RowData key, Collection<RowData> value);
> > >
> > >     /** Discards any cached value for the specified key. */
> > >     void invalidate(RowData key);
> > > }
> > >
> > > ```
> > >
> > > ## All cache
> > > ```
> > >
> > > public interface AllCacheLookupProvider extends CacheLookupProvider {
> > >
> > >     void registerReloadStrategy(ScheduledExecutorService
> > > executorService, Reloader reloader);
> > >
> > >     ScanTableSource.ScanRuntimeProvider getScanRuntimeProvider();
> > >
> > >     @Override
> > >     AllCacheBuilder createCacheBuilder();
> > > }
> > >
> > >
> > > public interface AllCacheBuilder extends CacheBuilder {
> > >
> > >     AllCache create();
> > > }
> > >
> > >
> > > public interface AllCache extends Cache {
> > >
> > >     void putAll(Iterator<Map<RowData, RowData>> allEntries);
> > >
> > >     void clearAll();
> > > }
> > >
> > >
> > > public interface Reloader {
> > >
> > >     void reload();
> > > }
> > >
> > > ```
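As a rough illustration of the PartialCache contract sketched in the proposal above, here is a self-contained, simplified stand-in. It uses String keys/values instead of RowData and local interface copies; the names mirror the email's proposal, not any released Flink API:

```java
import java.util.Collection;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Local, simplified stand-ins for the interfaces proposed in this thread. */
interface Cache<K, V> {
    Collection<V> getIfPresent(K key);
    long size();
}

interface PartialCache<K, V> extends Cache<K, V> {
    Collection<V> put(K key, Collection<V> value);
    void invalidate(K key);
}

/** A trivial unbounded in-memory PartialCache, just to show the contract. */
class InMemoryPartialCache implements PartialCache<String, String> {
    private final Map<String, Collection<String>> store = new ConcurrentHashMap<>();

    @Override
    public Collection<String> getIfPresent(String key) {
        return store.get(key); // null when the key was never cached
    }

    @Override
    public Collection<String> put(String key, Collection<String> value) {
        return store.put(key, value); // returns the previous rows, or null
    }

    @Override
    public void invalidate(String key) {
        store.remove(key);
    }

    @Override
    public long size() {
        return store.size();
    }
}
```

A real implementation would add an eviction policy (max rows, TTL); this sketch only demonstrates the get/put/invalidate contract.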
> > >
> > > Best,
> > > Jingsong
> > >
> > > On Fri, May 27, 2022 at 11:10 AM Jingsong Li <ji...@gmail.com>
> > > wrote:
> > >
> > > > Thanks Qingsheng and all for your discussion.
> > > >
> > > > Very sorry to jump in so late.
> > > >
> > > > Maybe I missed something?
> > > > My first impression when I saw the cache interface was, why don't we
> > > > provide an interface similar to guava cache [1], on top of guava
> cache,
> > > > caffeine also makes extensions for asynchronous calls.[2]
> > > > There is also the bulk load in caffeine too.
> > > >
> > > > I am also more confused why first from LookupCacheFactory.Builder and
> > > then
> > > > to Factory to create Cache.
> > > >
> > > > [1] https://github.com/google/guava
> > > > [2] https://github.com/ben-manes/caffeine/wiki/Population
> > > >
> > > > Best,
> > > > Jingsong
> > > >
> > > > On Thu, May 26, 2022 at 11:17 PM Jark Wu <im...@gmail.com> wrote:
> > > >
> > > >> After looking at the newly introduced ReloadTime and Becket's
> > > >> comment, I agree with Becket that we should have a pluggable
> > > >> reloading strategy. We can provide some common implementations,
> > > >> e.g., periodic reloading and daily reloading. But there will
> > > >> definitely be some connector- or business-specific reloading
> > > >> strategies, e.g. notify by a ZooKeeper watcher, or reload once a
> > > >> new Hive partition is complete.
> > > >>
> > > >> Best,
> > > >> Jark
> > > >>
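A hedged sketch of what such a pluggable reloading strategy could look like. All names here are hypothetical (the FLIP's final interfaces may differ): the strategy decides *when* to reload, while the reloader encapsulates *how*.

```java
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Hypothetical plug point: a Reloader knows how to refresh the cache. */
interface Reloader {
    void reload();
}

/** Hypothetical plug point: a strategy decides when reload() is triggered. */
interface ReloadStrategy {
    void register(ScheduledExecutorService executor, Reloader reloader);
}

/** Periodic reloading, the common built-in case. */
class PeriodicReloadStrategy implements ReloadStrategy {
    private final long intervalMillis;

    PeriodicReloadStrategy(long intervalMillis) {
        this.intervalMillis = intervalMillis;
    }

    @Override
    public void register(ScheduledExecutorService executor, Reloader reloader) {
        // Reload repeatedly with a fixed delay between the end of one
        // reload and the start of the next.
        executor.scheduleWithFixedDelay(
                reloader::reload, intervalMillis, intervalMillis, TimeUnit.MILLISECONDS);
    }
}
```

A ZooKeeper-watcher or Hive-partition-based strategy would implement the same `ReloadStrategy` interface and call `reloader.reload()` from its own callback instead of from a timer.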
> > > >> On Thu, 26 May 2022 at 11:52, Becket Qin <be...@gmail.com>
> > wrote:
> > > >>
> > > >> > Hi Qingsheng,
> > > >> >
> > > >> > Thanks for updating the FLIP. A few comments / questions below:
> > > >> >
> > > >> > 1. Is there a reason that we have both "XXXFactory" and
> > > >> > "XXXProvider"? What is the difference between them? If they are
> > > >> > the same, can we just use XXXFactory everywhere?
> > > >> >
> > > >> > 2. Regarding the FullCachingLookupProvider, should the reloading
> > > >> > policy also be pluggable? Periodic reloading can sometimes be
> > > >> > tricky in practice. For example, if a user uses 24 hours as the
> > > >> > cache refresh interval and some nightly batch job is delayed,
> > > >> > the cache update may still see the stale data.
> > > >> >
> > > >> > 3. In DefaultLookupCacheFactory, it looks like InitialCapacity
> > > >> > should be removed.
> > > >> >
> > > >> > 4. The purpose of LookupFunctionProvider#cacheMissingKey() seems
> > > >> > a little confusing to me. If Optional<LookupCacheFactory>
> > > >> > getCacheFactory() returns a non-empty factory, doesn't that
> > > >> > already indicate to the framework to cache the missing keys?
> > > >> > Also, why is this method returning an Optional<Boolean> instead
> > > >> > of a boolean?
> > > >> >
> > > >> > Thanks,
> > > >> >
> > > >> > Jiangjie (Becket) Qin
> > > >> >
> > > >> >
> > > >> >
> > > >> > On Wed, May 25, 2022 at 5:07 PM Qingsheng Ren <renqschn@gmail.com
> >
> > > >> wrote:
> > > >> >
> > > >> >> Hi Lincoln and Jark,
> > > >> >>
> > > >> >> Thanks for the comments! If the community reaches a consensus
> > > >> >> that we use a SQL hint instead of table options to decide
> > > >> >> whether to use sync or async mode, it's indeed not necessary to
> > > >> >> introduce the "lookup.async" option.
> > > >> >>
> > > >> >> I think it's a good idea to let the async decision be made at
> > > >> >> the query level, which allows better optimization with more
> > > >> >> information gathered by the planner. Is there any FLIP
> > > >> >> describing the issue in FLINK-27625? I thought FLIP-234 was
> > > >> >> only proposing to add a SQL hint for retry on missing, rather
> > > >> >> than having the entire async mode controlled by hints.
> > > >> >>
> > > >> >> Best regards,
> > > >> >>
> > > >> >> Qingsheng
> > > >> >>
> > > >> >> > On May 25, 2022, at 15:13, Lincoln Lee <lincoln.86xy@gmail.com
> >
> > > >> wrote:
> > > >> >> >
> > > >> >> > Hi Jark,
> > > >> >> >
> > > >> >> > Thanks for your reply!
> > > >> >> >
> > > >> >> > Currently 'lookup.async' only exists in the HBase connector.
> > > >> >> > I have no idea whether or when to remove it (we can discuss
> > > >> >> > that in another issue for the HBase connector after
> > > >> >> > FLINK-27625 is done); let's just not add it as a common
> > > >> >> > option now.
> > > >> >> >
> > > >> >> > Best,
> > > >> >> > Lincoln Lee
> > > >> >> >
> > > >> >> >
> > > >> >> > Jark Wu <im...@gmail.com> 于2022年5月24日周二 20:14写道:
> > > >> >> >
> > > >> >> >> Hi Lincoln,
> > > >> >> >>
> > > >> >> >> I have taken a look at FLIP-234, and I agree with you that
> > > >> >> >> connectors can provide both async and sync runtime providers
> > > >> >> >> simultaneously instead of only one of them. At that point,
> > > >> >> >> "lookup.async" looks redundant. If this option is planned to
> > > >> >> >> be removed in the long term, I think it makes sense not to
> > > >> >> >> introduce it in this FLIP.
> > > >> >> >>
> > > >> >> >> Best,
> > > >> >> >> Jark
> > > >> >> >>
> > > >> >> >> On Tue, 24 May 2022 at 11:08, Lincoln Lee <
> > lincoln.86xy@gmail.com
> > > >
> > > >> >> wrote:
> > > >> >> >>
> > > >> >> >>> Hi Qingsheng,
> > > >> >> >>>
> > > >> >> >>> Sorry for jumping into the discussion so late. It's a good
> > > >> >> >>> idea that we can have a common table option. I have a minor
> > > >> >> >>> comment on 'lookup.async': let's not make it a common
> > > >> >> >>> option.
> > > >> >> >>>
> > > >> >> >>> The table layer abstracts both sync and async lookup
> > > >> >> >>> capabilities, and connector implementers can choose one or
> > > >> >> >>> both. In the case of implementing only one capability (the
> > > >> >> >>> status of most of the existing built-in connectors),
> > > >> >> >>> 'lookup.async' will not be used. And when a connector has
> > > >> >> >>> both capabilities, I think this choice is better made at
> > > >> >> >>> the query level: for example, the table planner can choose
> > > >> >> >>> the physical implementation of async or sync lookup based
> > > >> >> >>> on its cost model, or users can give a query hint based on
> > > >> >> >>> their own better understanding. If there is another common
> > > >> >> >>> table option 'lookup.async', it may confuse users in the
> > > >> >> >>> long run.
> > > >> >> >>>
> > > >> >> >>> So, I prefer to leave the 'lookup.async' option in a
> > > >> >> >>> private place (for the current HBase connector) and not
> > > >> >> >>> turn it into a common option.
> > > >> >> >>>
> > > >> >> >>> WDYT?
> > > >> >> >>>
> > > >> >> >>> Best,
> > > >> >> >>> Lincoln Lee
> > > >> >> >>>
> > > >> >> >>>
> > > >> >> >>> Qingsheng Ren <re...@gmail.com> 于2022年5月23日周一 14:54写道:
> > > >> >> >>>
> > > >> >> >>>> Hi Alexander,
> > > >> >> >>>>
> > > >> >> >>>> Thanks for the review! We recently updated the FLIP and
> > > >> >> >>>> you can find those changes in my latest email. Since some
> > > >> >> >>>> terminology has changed, I'll use the new concepts when
> > > >> >> >>>> replying to your comments.
> > > >> >> >>>>
> > > >> >> >>>> 1. Builder vs 'of'
> > > >> >> >>>> I'm OK with using the builder pattern if we have
> > > >> >> >>>> additional optional parameters for full caching mode
> > > >> >> >>>> ("rescan" previously). The schedule-with-delay idea looks
> > > >> >> >>>> reasonable to me, but I think we need to redesign the
> > > >> >> >>>> builder API of full caching to make it more descriptive
> > > >> >> >>>> for developers. Would you mind sharing your ideas about
> > > >> >> >>>> the API? For accessing the FLIP workspace you can just
> > > >> >> >>>> provide your account ID and ping any PMC member, including
> > > >> >> >>>> Jark.
> > > >> >> >>>>
> > > >> >> >>>> 2. Common table options
> > > >> >> >>>> We have had some discussions these days and propose to
> > > >> >> >>>> introduce 8 common table options about caching. This has
> > > >> >> >>>> been updated on the FLIP.
> > > >> >> >>>>
> > > >> >> >>>> 3. Retries
> > > >> >> >>>> I think we are on the same page :-)
> > > >> >> >>>>
> > > >> >> >>>> For your additional concerns:
> > > >> >> >>>> 1) The table option has been updated.
> > > >> >> >>>> 2) We got "lookup.cache" back for configuring whether to
> > > >> >> >>>> use partial or full caching mode.
> > > >> >> >>>>
> > > >> >> >>>> Best regards,
> > > >> >> >>>>
> > > >> >> >>>> Qingsheng
> > > >> >> >>>>
> > > >> >> >>>>
> > > >> >> >>>>
> > > >> >> >>>>> On May 19, 2022, at 17:25, Александр Смирнов <
> > > >> smiralexan@gmail.com>
> > > >> >> >>>> wrote:
> > > >> >> >>>>>
> > > >> >> >>>>> Also I have a few additions:
> > > >> >> >>>>> 1) Maybe rename 'lookup.cache.maximum-size' to
> > > >> >> >>>>> 'lookup.cache.max-rows'? I think it will be clearer that
> > > >> >> >>>>> we are talking not about bytes but about the number of
> > > >> >> >>>>> rows. Plus it fits better, considering my optimization
> > > >> >> >>>>> with filters.
> > > >> >> >>>>> 2) How will users enable rescanning? Are we going to
> > > >> >> >>>>> separate caching and rescanning from the options point of
> > > >> >> >>>>> view? Initially we had one option 'lookup.cache' with
> > > >> >> >>>>> values LRU / ALL. I think now we can make a boolean
> > > >> >> >>>>> option 'lookup.rescan'. The rescan interval can be
> > > >> >> >>>>> 'lookup.rescan.interval', etc.
> > > >> >> >>>>>
> > > >> >> >>>>> Best regards,
> > > >> >> >>>>> Alexander
> > > >> >> >>>>>
> > > >> >> >>>>> чт, 19 мая 2022 г. в 14:50, Александр Смирнов <
> > > >> smiralexan@gmail.com
> > > >> >> >>> :
> > > >> >> >>>>>>
> > > >> >> >>>>>> Hi Qingsheng and Jark,
> > > >> >> >>>>>>
> > > >> >> >>>>>> 1. Builders vs 'of'
> > > >> >> >>>>>> I understand that builders are used when we have
> > > >> >> >>>>>> multiple parameters. I suggested them because we could
> > > >> >> >>>>>> add parameters later. To prevent the Builder for
> > > >> >> >>>>>> ScanRuntimeProvider from looking redundant, I can
> > > >> >> >>>>>> suggest one more config now: "rescanStartTime".
> > > >> >> >>>>>> It's a time in UTC (LocalTime class) when the first
> > > >> >> >>>>>> reload of the cache starts. This parameter can be
> > > >> >> >>>>>> thought of as the 'initialDelay' (the diff between the
> > > >> >> >>>>>> current time and rescanStartTime) in the method
> > > >> >> >>>>>> ScheduledExecutorService#scheduleWithFixedDelay [1]. It
> > > >> >> >>>>>> can be very useful when the dimension table is updated
> > > >> >> >>>>>> by some other scheduled job at a certain time, or when
> > > >> >> >>>>>> the user simply wants the second scan (first cache
> > > >> >> >>>>>> reload) to be delayed. This option can be used even
> > > >> >> >>>>>> without 'rescanInterval' - in this case 'rescanInterval'
> > > >> >> >>>>>> will be one day.
> > > >> >> >>>>>> If you are fine with this option, I would be very glad
> > > >> >> >>>>>> if you would give me access to edit the FLIP page, so I
> > > >> >> >>>>>> could add it myself.
> > > >> >> >>>>>>
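For illustration, the "rescanStartTime" idea above boils down to computing an initial delay for scheduleWithFixedDelay. A minimal sketch, assuming a UTC clock (the option name comes from the email; the helper class and method are hypothetical):

```java
import java.time.Duration;
import java.time.LocalTime;

/**
 * Sketch: turning a UTC "rescanStartTime" into the initialDelay argument of
 * ScheduledExecutorService#scheduleWithFixedDelay.
 */
public class RescanDelay {

    /** Millis from 'now' until the next occurrence of 'startTime', both in UTC. */
    public static long initialDelayMillis(LocalTime now, LocalTime startTime) {
        Duration delay = Duration.between(now, startTime);
        if (delay.isNegative()) {
            // startTime already passed today: wait until the same time tomorrow.
            delay = delay.plusDays(1);
        }
        return delay.toMillis();
    }
}
```

Usage would then look roughly like `executor.scheduleWithFixedDelay(reloadTask, RescanDelay.initialDelayMillis(LocalTime.now(ZoneOffset.UTC), rescanStartTime), rescanIntervalMillis, TimeUnit.MILLISECONDS)`, matching the "diff between current time and rescanStartTime" description.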
> > > >> >> >>>>>> 2. Common table options
> > > >> >> >>>>>> I also think that FactoryUtil would be overloaded by
> > > >> >> >>>>>> all the cache options. But maybe we should unify all
> > > >> >> >>>>>> the suggested options, not only those for the default
> > > >> >> >>>>>> cache? I.e. a class 'LookupOptions' that unifies the
> > > >> >> >>>>>> default cache options, rescan options, 'async', and
> > > >> >> >>>>>> 'maxRetries'. WDYT?
> > > >> >> >>>>>>
> > > >> >> >>>>>> 3. Retries
> > > >> >> >>>>>> I'm fine with a suggestion close to
> > > >> >> >>>>>> RetryUtils#tryTimes(times, call).
> > > >> >> >>>>>>
> > > >> >> >>>>>> [1]
> > > >> >> >>>>>> https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/ScheduledExecutorService.html#scheduleWithFixedDelay-java.lang.Runnable-long-long-java.util.concurrent.TimeUnit-
> > > >> >> >>>>>>
> > > >> >> >>>>>> Best regards,
> > > >> >> >>>>>> Alexander
> > > >> >> >>>>>>
> > > >> >> >>>>>> ср, 18 мая 2022 г. в 16:04, Qingsheng Ren <
> > renqschn@gmail.com
> > > >:
> > > >> >> >>>>>>>
> > > >> >> >>>>>>> Hi Jark and Alexander,
> > > >> >> >>>>>>>
> > > >> >> >>>>>>> Thanks for your comments! I'm also OK with introducing
> > > >> >> >>>>>>> common table options. I prefer to introduce a new
> > > >> >> >>>>>>> DefaultLookupCacheOptions class for holding these
> > > >> >> >>>>>>> option definitions, because putting all options into
> > > >> >> >>>>>>> FactoryUtil would make it a bit "crowded" and not well
> > > >> >> >>>>>>> categorized.
> > > >> >> >>>>>>>
> > > >> >> >>>>>>> The FLIP has been updated according to the suggestions
> > > >> >> >>>>>>> above:
> > > >> >> >>>>>>> 1. Use a static "of" method for constructing
> > > >> >> >>>>>>> RescanRuntimeProvider, considering both arguments are
> > > >> >> >>>>>>> required.
> > > >> >> >>>>>>> 2. Introduce new table options matching
> > > >> >> >>>>>>> DefaultLookupCacheFactory.
> > > >> >> >>>>>>>
> > > >> >> >>>>>>> Best,
> > > >> >> >>>>>>> Qingsheng
> > > >> >> >>>>>>>
> > > >> >> >>>>>>> On Wed, May 18, 2022 at 2:57 PM Jark Wu <
> imjark@gmail.com>
> > > >> wrote:
> > > >> >> >>>>>>>>
> > > >> >> >>>>>>>> Hi Alex,
> > > >> >> >>>>>>>>
> > > >> >> >>>>>>>> 1) retry logic
> > > >> >> >>>>>>>> I think we can extract some common retry logic into
> > > >> >> >>>>>>>> utilities, e.g. RetryUtils#tryTimes(times, call). This
> > > >> >> >>>>>>>> seems independent of this FLIP and can be reused by
> > > >> >> >>>>>>>> DataStream users. Maybe we can open an issue to
> > > >> >> >>>>>>>> discuss this and where to put it.
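As an illustration, a tryTimes utility along those lines might look like the following. This is a hypothetical sketch of the idea, not an actual Flink class:

```java
import java.util.concurrent.Callable;

/** Hypothetical sketch of the RetryUtils#tryTimes(times, call) idea. */
public class RetryUtils {

    /**
     * Invokes the callable up to 'times' attempts, returning the first
     * successful result; rethrows the last failure if all attempts fail.
     */
    public static <T> T tryTimes(int times, Callable<T> call) throws Exception {
        Exception last = null;
        for (int attempt = 1; attempt <= times; attempt++) {
            try {
                return call.call();
            } catch (Exception e) {
                // A real implementation would only retry retriable failures
                // and could run connector-specific recovery here, e.g.
                // re-establishing a connection before the next attempt.
                last = e;
            }
        }
        throw last;
    }
}
```

This keeps the loop in one place while still letting each connector decide what counts as retriable and what recovery to run between attempts.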
> > > >> >> >>>>>>>>
> > > >> >> >>>>>>>> 2) cache ConfigOptions
> > > >> >> >>>>>>>> I'm fine with defining cache config options in the
> > > >> >> >>>>>>>> framework. A candidate place to put them is
> > > >> >> >>>>>>>> FactoryUtil, which also includes the
> > > >> >> >>>>>>>> "sink.parallelism" and "format" options.
> > > >> >> >>>>>>>>
> > > >> >> >>>>>>>> Best,
> > > >> >> >>>>>>>> Jark
> > > >> >> >>>>>>>>
> > > >> >> >>>>>>>>
> > > >> >> >>>>>>>> On Wed, 18 May 2022 at 13:52, Александр Смирнов <
> > > >> >> >>> smiralexan@gmail.com>
> > > >> >> >>>> wrote:
> > > >> >> >>>>>>>>>
> > > >> >> >>>>>>>>> Hi Qingsheng,
> > > >> >> >>>>>>>>>
> > > >> >> >>>>>>>>> Thank you for considering my comments.
> > > >> >> >>>>>>>>>
> > > >> >> >>>>>>>>>> there might be custom logic before making retry,
> > > >> >> >>>>>>>>>> such as re-establishing the connection
> > > >> >> >>>>>>>>>
> > > >> >> >>>>>>>>> Yes, I understand that. I meant that such logic can
> > > >> >> >>>>>>>>> be placed in a separate function that can be
> > > >> >> >>>>>>>>> implemented by connectors. Just moving the retry
> > > >> >> >>>>>>>>> logic would make the connector's LookupFunction more
> > > >> >> >>>>>>>>> concise and avoid duplicate code. However, it's a
> > > >> >> >>>>>>>>> minor change. The decision is up to you.
> > > >> >> >>>>>>>>>
> > > >> >> >>>>>>>>>> We decided not to provide common DDL options, and
> > > >> >> >>>>>>>>>> let developers define their own options as we do
> > > >> >> >>>>>>>>>> now, per connector.
> > > >> >> >>>>>>>>>
> > > >> >> >>>>>>>>> What is the reason for that? One of the main goals of
> > > >> >> >>>>>>>>> this FLIP was to unify the configs, wasn't it? I
> > > >> >> >>>>>>>>> understand that the current cache design doesn't
> > > >> >> >>>>>>>>> depend on ConfigOptions as it did before. But we can
> > > >> >> >>>>>>>>> still put these options into the framework, so
> > > >> >> >>>>>>>>> connectors can reuse them and avoid code duplication
> > > >> >> >>>>>>>>> and, what is more significant, possible inconsistent
> > > >> >> >>>>>>>>> option naming. This can be pointed out in the
> > > >> >> >>>>>>>>> documentation for connector developers.
> > > >> >> >>>>>>>>>
> > > >> >> >>>>>>>>> Best regards,
> > > >> >> >>>>>>>>> Alexander
> > > >> >> >>>>>>>>>
> > > >> >> >>>>>>>>> вт, 17 мая 2022 г. в 17:11, Qingsheng Ren <
> > > >> renqschn@gmail.com>:
> > > >> >> >>>>>>>>>>
> > > >> >> >>>>>>>>>> Hi Alexander,
> > > >> >> >>>>>>>>>>
> > > >> >> >>>>>>>>>> Thanks for the review, and glad to see we are on
> > > >> >> >>>>>>>>>> the same page! I think you forgot to cc the dev
> > > >> >> >>>>>>>>>> mailing list, so I'm also quoting your reply under
> > > >> >> >>>>>>>>>> this email.
> > > >> >> >>>>>>>>>>
> > > >> >> >>>>>>>>>>> We can add 'maxRetryTimes' option into this class
> > > >> >> >>>>>>>>>>
> > > >> >> >>>>>>>>>> In my opinion the retry logic should be implemented
> > > >> >> >>>>>>>>>> in lookup() instead of in LookupFunction#eval().
> > > >> >> >>>>>>>>>> Retrying is only meaningful under some specific
> > > >> >> >>>>>>>>>> retriable failures, and there might be custom logic
> > > >> >> >>>>>>>>>> before making a retry, such as re-establishing the
> > > >> >> >>>>>>>>>> connection (JdbcRowDataLookupFunction is an
> > > >> >> >>>>>>>>>> example), so it's more handy to leave it to the
> > > >> >> >>>>>>>>>> connector.
> > > >> >> >>>>>>>>>>
> > > >> >> >>>>>>>>>>> I don't see the DDL options that were in the
> > > >> >> >>>>>>>>>>> previous version of the FLIP. Do you have any
> > > >> >> >>>>>>>>>>> special plans for them?
> > > >> >> >>>>>>>>>>
> > > >> >> >>>>>>>>>> We decided not to provide common DDL options, and
> > > >> >> >>>>>>>>>> let developers define their own options as we do
> > > >> >> >>>>>>>>>> now, per connector.
> > > >> >> >>>>>>>>>>
> > > >> >> >>>>>>>>>> The rest of the comments sound great and I'll update
> > > >> >> >>>>>>>>>> the FLIP. Hope we can finalize our proposal soon!
> > > >> >> >>>>>>>>>>
> > > >> >> >>>>>>>>>> Best,
> > > >> >> >>>>>>>>>>
> > > >> >> >>>>>>>>>> Qingsheng
> > > >> >> >>>>>>>>>>
> > > >> >> >>>>>>>>>>
> > > >> >> >>>>>>>>>>> On May 17, 2022, at 13:46, Александр Смирнов <
> > > >> >> >>> smiralexan@gmail.com>
> > > >> >> >>>> wrote:
> > > >> >> >>>>>>>>>>>
> > > >> >> >>>>>>>>>>> Hi Qingsheng and devs!
> > > >> >> >>>>>>>>>>>
> > > >> >> >>>>>>>>>>> I like the overall design of the updated FLIP;
> > > >> >> >>>>>>>>>>> however, I have several suggestions and questions.
> > > >> >> >>>>>>>>>>>
> > > >> >> >>>>>>>>>>> 1) Introducing LookupFunction as a subclass of
> > > >> >> >>>>>>>>>>> TableFunction is a good idea. We can add a
> > > >> >> >>>>>>>>>>> 'maxRetryTimes' option to this class. The 'eval'
> > > >> >> >>>>>>>>>>> method of the new LookupFunction is great for this
> > > >> >> >>>>>>>>>>> purpose. The same goes for the 'async' case.
> > > >> >> >>>>>>>>>>>
> > > >> >> >>>>>>>>>>> 2) There might be other configs in the future, such
> > > >> >> >>>>>>>>>>> as 'cacheMissingKey' in LookupFunctionProvider or
> > > >> >> >>>>>>>>>>> 'rescanInterval' in ScanRuntimeProvider. Maybe use
> > > >> >> >>>>>>>>>>> the Builder pattern in LookupFunctionProvider and
> > > >> >> >>>>>>>>>>> RescanRuntimeProvider for more flexibility (use one
> > > >> >> >>>>>>>>>>> 'build' method instead of many 'of' methods in the
> > > >> >> >>>>>>>>>>> future)?
> > > >> >> >>>>>>>>>>>
> > > >> >> >>>>>>>>>>> 3) What are the plans for the existing
> > > >> >> >>>>>>>>>>> TableFunctionProvider and
> > > >> >> >>>>>>>>>>> AsyncTableFunctionProvider? I think they should be
> > > >> >> >>>>>>>>>>> deprecated.
> > > >> >> >>>>>>>>>>>
> > > >> >> >>>>>>>>>>> 4) Am I right that the current design does not
> > > >> >> >>>>>>>>>>> assume usage of a user-provided LookupCache in
> > > >> >> >>>>>>>>>>> re-scanning? In this case, it is not very clear why
> > > >> >> >>>>>>>>>>> we need methods such as 'invalidate' or 'putAll' in
> > > >> >> >>>>>>>>>>> LookupCache.
> > > >> >> >>>>>>>>>>>
> > > >> >> >>>>>>>>>>> 5) I don't see the DDL options that were in the
> > > >> >> >>>>>>>>>>> previous version of the FLIP. Do you have any
> > > >> >> >>>>>>>>>>> special plans for them?
> > > >> >> >>>>>>>>>>>
> > > >> >> >>>>>>>>>>> If you don't mind, I would be glad to be able to
> > > >> >> >>>>>>>>>>> make small adjustments to the FLIP document too. I
> > > >> >> >>>>>>>>>>> think it's worth mentioning what optimizations
> > > >> >> >>>>>>>>>>> exactly are planned for the future.
> > > >> >> >>>>>>>>>>>
> > > >> >> >>>>>>>>>>> Best regards,
> > > >> >> >>>>>>>>>>> Smirnov Alexander
> > > >> >> >>>>>>>>>>>
> > > >> >> >>>>>>>>>>> пт, 13 мая 2022 г. в 20:27, Qingsheng Ren <
> > > >> renqschn@gmail.com
> > > >> >> >>> :
> > > >> >> >>>>>>>>>>>>
> > > >> >> >>>>>>>>>>>> Hi Alexander and devs,
> > > >> >> >>>>>>>>>>>>
> > > >> >> >>>>>>>>>>>> Thank you very much for the in-depth discussion!
> > > >> >> >>>>>>>>>>>> As Jark mentioned, we were inspired by Alexander's
> > > >> >> >>>>>>>>>>>> idea and refactored our design. FLIP-221 [1] has
> > > >> >> >>>>>>>>>>>> been updated to reflect our design now, and we are
> > > >> >> >>>>>>>>>>>> happy to hear more suggestions from you!
> > > >> >> >>>>>>>>>>>>
> > > >> >> >>>>>>>>>>>> Compared to the previous design:
> > > >> >> >>>>>>>>>>>> 1. The lookup cache serves at the table runtime
> > > >> >> >>>>>>>>>>>> level and is integrated as a component of
> > > >> >> >>>>>>>>>>>> LookupJoinRunner, as discussed previously.
> > > >> >> >>>>>>>>>>>> 2. Interfaces are renamed and re-designed to
> > > >> >> >>>>>>>>>>>> reflect the new design.
> > > >> >> >>>>>>>>>>>> 3. We separate out the all-caching case and
> > > >> >> >>>>>>>>>>>> introduce a new RescanRuntimeProvider to reuse the
> > > >> >> >>>>>>>>>>>> ability of scanning. We are planning to support
> > > >> >> >>>>>>>>>>>> SourceFunction / InputFormat for now, considering
> > > >> >> >>>>>>>>>>>> the complexity of the FLIP-27 Source API.
> > > >> >> >>>>>>>>>>>> 4. A new interface LookupFunction is introduced to
> > > >> >> >>>>>>>>>>>> make the semantics of lookup more straightforward
> > > >> >> >>>>>>>>>>>> for developers.
> > > >> >> >>>>>>>>>>>>
> > > >> >> >>>>>>>>>>>> Replying to Alexander:
> > > >> >> >>>>>>>>>>>>> However I'm a little confused whether InputFormat
> > > >> >> >>>>>>>>>>>>> is deprecated or not. Am I right that it will be
> > > >> >> >>>>>>>>>>>>> so in the future, but currently it's not?
> > > >> >> >>>>>>>>>>>> Yes, you are right. InputFormat is not deprecated
> > > >> >> >>>>>>>>>>>> for now. I think it will be deprecated in the
> > > >> >> >>>>>>>>>>>> future but we don't have a clear plan for that.
> > > >> >> >>>>>>>>>>>>
> > > >> >> >>>>>>>>>>>> Thanks again for the discussion on this FLIP, and
> > > >> >> >>>>>>>>>>>> looking forward to cooperating with you after we
> > > >> >> >>>>>>>>>>>> finalize the design and interfaces!
> > > >> >> >>>>>>>>>>>>
> > > >> >> >>>>>>>>>>>> [1]
> > > >> >> >>>>>>>>>>>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-221+Abstraction+for+lookup+source+cache+and+metric
> > > >> >> >>>>>>>>>>>>
> > > >> >> >>>>>>>>>>>> Best regards,
> > > >> >> >>>>>>>>>>>>
> > > >> >> >>>>>>>>>>>> Qingsheng
> > > >> >> >>>>>>>>>>>>
> > > >> >> >>>>>>>>>>>>
> > > >> >> >>>>>>>>>>>> On Fri, May 13, 2022 at 12:12 AM Александр Смирнов <
> > > >> >> >>>> smiralexan@gmail.com> wrote:
> > > >> >> >>>>>>>>>>>>>
> > > >> >> >>>>>>>>>>>>> Hi Jark, Qingsheng and Leonard!
> > > >> >> >>>>>>>>>>>>>
> > > >> >> >>>>>>>>>>>>> Glad to see that we came to a consensus on almost
> all
> > > >> >> points!
> > > >> >> >>>>>>>>>>>>>
> > > >> >> >>>>>>>>>>>>> However I'm a little confused whether InputFormat
> is
> > > >> >> >> deprecated
> > > >> >> >>>> or
> > > >> >> >>>>>>>>>>>>> not. Am I right that it will be so in the future,
> but
> > > >> >> >> currently
> > > >> >> >>>> it's
> > > >> >> >>>>>>>>>>>>> not? Actually I also think that for the first
> version
> > > >> it's
> > > >> >> OK
> > > >> >> >>> to
> > > >> >> >>>> use
> > > >> >> >>>>>>>>>>>>> InputFormat in ALL cache realization, because
> > > supporting
> > > >> >> >> rescan
> > > >> >> >>>>>>>>>>>>> ability seems like a very distant prospect. But for
> > > this
> > > >> >> >>>> decision we
> > > >> >> >>>>>>>>>>>>> need a consensus among all discussion participants.
> > > >> >> >>>>>>>>>>>>>
> > > >> >> >>>>>>>>>>>>> In general, I don't have anything to argue with in
> your
> > > >> >> >>>> statements. All
> > > >> >> >>>>>>>>>>>>> of them correspond to my ideas. Looking ahead, it
> would
> > be
> > > >> nice
> > > >> >> >> to
> > > >> >> >>>> work
> > > >> >> >>>>>>>>>>>>> on this FLIP cooperatively. I've already done a lot
> > of
> > > >> work
> > > >> >> >> on
> > > >> >> >>>> lookup
> > > >> >> >>>>>>>>>>>>> join caching with realization very close to the one
> > we
> > > >> are
> > > >> >> >>>> discussing,
> > > >> >> >>>>>>>>>>>>> and want to share the results of this work. Anyway
> > > >> looking
> > > >> >> >>>> forward to
> > > >> >> >>>>>>>>>>>>> the FLIP update!
> > > >> >> >>>>>>>>>>>>>
> > > >> >> >>>>>>>>>>>>> Best regards,
> > > >> >> >>>>>>>>>>>>> Smirnov Alexander
> > > >> >> >>>>>>>>>>>>>
> > > >> >> >>>>>>>>>>>>> Thu, 12 May 2022 at 17:38, Jark Wu <
> > imjark@gmail.com
> > > >:
> > > >> >> >>>>>>>>>>>>>>
> > > >> >> >>>>>>>>>>>>>> Hi Alex,
> > > >> >> >>>>>>>>>>>>>>
> > > >> >> >>>>>>>>>>>>>> Thanks for summarizing your points.
> > > >> >> >>>>>>>>>>>>>>
> > > >> >> >>>>>>>>>>>>>> In the past week, Qingsheng, Leonard, and I have
> > > >> discussed
> > > >> >> >> it
> > > >> >> >>>> several times
> > > >> >> >>>>>>>>>>>>>> and we have totally refactored the design.
> > > >> >> >>>>>>>>>>>>>> I'm glad to say we have reached a consensus on
> many
> > of
> > > >> your
> > > >> >> >>>> points!
> > > >> >> >>>>>>>>>>>>>> Qingsheng is still working on updating the design
> > docs
> > > >> and
> > > >> >> >>>> maybe can be
> > > >> >> >>>>>>>>>>>>>> available in the next few days.
> > > >> >> >>>>>>>>>>>>>> I will share some conclusions from our
> discussions:
> > > >> >> >>>>>>>>>>>>>>
> > > >> >> >>>>>>>>>>>>>> 1) we have refactored the design towards to "cache
> > in
> > > >> >> >>>> framework" way.
> > > >> >> >>>>>>>>>>>>>>
> > > >> >> >>>>>>>>>>>>>> 2) a "LookupCache" interface for users to
> customize
> > > and
> > > >> a
> > > >> >> >>>> default
> > > >> >> >>>>>>>>>>>>>> implementation with builder for users to easy-use.
> > > >> >> >>>>>>>>>>>>>> This can both make it possible to both have
> > > flexibility
> > > >> and
> > > >> >> >>>> conciseness.
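For illustration, here is a minimal sketch of what a customizable "LookupCache" plus a default LRU implementation behind a factory method could look like. All names and signatures below are illustrative assumptions for this discussion, not the interfaces finally adopted in FLIP-221:

```java
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative cache interface; method names follow this thread, not a final API. */
interface LookupCache<K, V> {
    Collection<V> getIfPresent(K key);
    void put(K key, Collection<V> rows);
    void invalidate(K key);
}

/** A default LRU implementation, capped by a "max-rows"-style entry count. */
class DefaultLookupCache<K, V> implements LookupCache<K, V> {
    private final Map<K, Collection<V>> entries;

    private DefaultLookupCache(int maxEntries) {
        // accessOrder=true turns LinkedHashMap into an LRU map
        this.entries = new LinkedHashMap<K, Collection<V>>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, Collection<V>> eldest) {
                return size() > maxEntries; // evict least-recently-used entry
            }
        };
    }

    static <K, V> DefaultLookupCache<K, V> withMaxEntries(int maxEntries) {
        return new DefaultLookupCache<>(maxEntries);
    }

    public Collection<V> getIfPresent(K key) { return entries.get(key); }
    public void put(K key, Collection<V> rows) { entries.put(key, rows); }
    public void invalidate(K key) { entries.remove(key); }
}
```

A connector developer could then either take the default via the factory or supply their own implementation of the interface, which is the flexibility/conciseness trade-off discussed above.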
> > > >> >> >>>>>>>>>>>>>>
> > > >> >> >>>>>>>>>>>>>> 3) Filter pushdown is important for ALL and LRU
> > lookup
> > > >> >> >> cache,
> > > >> >> >>>> esp reducing
> > > >> >> >>>>>>>>>>>>>> IO.
> > > >> >> >>>>>>>>>>>>>> Filter pushdown should be the final state and the
> > > >> unified
> > > >> >> >> way
> > > >> >> >>>> to both
> > > >> >> >>>>>>>>>>>>>> support pruning ALL cache and LRU cache,
> > > >> >> >>>>>>>>>>>>>> so I think we should make an effort in this
> direction.
> > If
> > > >> we
> > > >> >> >> need
> > > >> >> >>>> to support
> > > >> >> >>>>>>>>>>>>>> filter pushdown for ALL cache anyway, why not use
> > > >> >> >>>>>>>>>>>>>> it for LRU cache as well? Either way, since we decided
> > to
> > > >> >> >>> implement
> > > >> >> >>>> the cache
> > > >> >> >>>>>>>>>>>>>> in the framework, we have the chance to support
> > > >> >> >>>>>>>>>>>>>> filter on cache anytime. This is an optimization
> and
> > > it
> > > >> >> >>> doesn't
> > > >> >> >>>> affect the
> > > >> >> >>>>>>>>>>>>>> public API. I think we can create a JIRA issue to
> > > >> >> >>>>>>>>>>>>>> discuss it when the FLIP is accepted.
> > > >> >> >>>>>>>>>>>>>>
> > > >> >> >>>>>>>>>>>>>> 4) The idea to support ALL cache is similar to
> your
> > > >> >> >> proposal.
> > > >> >> >>>>>>>>>>>>>> In the first version, we will only support
> > > InputFormat,
> > > >> >> >>>> SourceFunction for
> > > >> >> >>>>>>>>>>>>>> cache all (invoke InputFormat in join operator).
> > > >> >> >>>>>>>>>>>>>> For FLIP-27 source, we need to join a true source
> > > >> operator
> > > >> >> >>>> instead of
> > > >> >> >>>>>>>>>>>>>> calling it embedded in the join operator.
> > > >> >> >>>>>>>>>>>>>> However, this needs another FLIP to support the
> > > re-scan
> > > >> >> >>> ability
> > > >> >> >>>> for FLIP-27
> > > >> >> >>>>>>>>>>>>>> Source, and this can be a large work.
> > > >> >> >>>>>>>>>>>>>> In order to not block this issue, we can put the
> > > effort
> > > >> of
> > > >> >> >>>> FLIP-27 source
> > > >> >> >>>>>>>>>>>>>> integration into future work and integrate
> > > >> >> >>>>>>>>>>>>>> InputFormat&SourceFunction for now.
> > > >> >> >>>>>>>>>>>>>>
> > > >> >> >>>>>>>>>>>>>> I think it's fine to use
> InputFormat&SourceFunction,
> > > as
> > > >> >> they
> > > >> >> >>>> are not
> > > >> >> >>>>>>>>>>>>>> deprecated, otherwise, we have to introduce
> another
> > > >> >> function
> > > >> >> >>>>>>>>>>>>>> similar to them which is meaningless. We need to
> > plan
> > > >> >> >> FLIP-27
> > > >> >> >>>> source
> > > >> >> >>>>>>>>>>>>>> integration ASAP before InputFormat &
> SourceFunction
> > > are
> > > >> >> >>>> deprecated.
> > > >> >> >>>>>>>>>>>>>>
> > > >> >> >>>>>>>>>>>>>> Best,
> > > >> >> >>>>>>>>>>>>>> Jark
> > > >> >> >>>>>>>>>>>>>>
> > > >> >> >>>>>>>>>>>>>>
> > > >> >> >>>>>>>>>>>>>>
> > > >> >> >>>>>>>>>>>>>>
> > > >> >> >>>>>>>>>>>>>>
> > > >> >> >>>>>>>>>>>>>> On Thu, 12 May 2022 at 15:46, Александр Смирнов <
> > > >> >> >>>> smiralexan@gmail.com>
> > > >> >> >>>>>>>>>>>>>> wrote:
> > > >> >> >>>>>>>>>>>>>>
> > > >> >> >>>>>>>>>>>>>>> Hi Martijn!
> > > >> >> >>>>>>>>>>>>>>>
> > > >> >> >>>>>>>>>>>>>>> Got it. Therefore, the implementation with
> InputFormat
> > > is
> > > >> not
> > > >> >> >>>> considered.
> > > >> >> >>>>>>>>>>>>>>> Thanks for clearing that up!
> > > >> >> >>>>>>>>>>>>>>>
> > > >> >> >>>>>>>>>>>>>>> Best regards,
> > > >> >> >>>>>>>>>>>>>>> Smirnov Alexander
> > > >> >> >>>>>>>>>>>>>>>
> > > >> >> >>>>>>>>>>>>>>> Thu, 12 May 2022 at 14:23, Martijn Visser <
> > > >> >> >>>> martijn@ververica.com>:
> > > >> >> >>>>>>>>>>>>>>>>
> > > >> >> >>>>>>>>>>>>>>>> Hi,
> > > >> >> >>>>>>>>>>>>>>>>
> > > >> >> >>>>>>>>>>>>>>>> With regards to:
> > > >> >> >>>>>>>>>>>>>>>>
> > > >> >> >>>>>>>>>>>>>>>>> But if there are plans to refactor all
> connectors
> > > to
> > > >> >> >>> FLIP-27
> > > >> >> >>>>>>>>>>>>>>>>
> > > >> >> >>>>>>>>>>>>>>>> Yes, FLIP-27 is the target for all connectors.
> The
> > > old
> > > >> >> >>>> interfaces will be
> > > >> >> >>>>>>>>>>>>>>>> deprecated and connectors will either be
> > refactored
> > > to
> > > >> >> use
> > > >> >> >>>> the new ones
> > > >> >> >>>>>>>>>>>>>>> or
> > > >> >> >>>>>>>>>>>>>>>> dropped.
> > > >> >> >>>>>>>>>>>>>>>>
> > > >> >> >>>>>>>>>>>>>>>> The caching should work for connectors that are
> > > using
> > > >> >> >>> FLIP-27
> > > >> >> >>>> interfaces,
> > > >> >> >>>>>>>>>>>>>>>> we should not introduce new features for old
> > > >> interfaces.
> > > >> >> >>>>>>>>>>>>>>>>
> > > >> >> >>>>>>>>>>>>>>>> Best regards,
> > > >> >> >>>>>>>>>>>>>>>>
> > > >> >> >>>>>>>>>>>>>>>> Martijn
> > > >> >> >>>>>>>>>>>>>>>>
> > > >> >> >>>>>>>>>>>>>>>> On Thu, 12 May 2022 at 06:19, Александр Смирнов
> <
> > > >> >> >>>> smiralexan@gmail.com>
> > > >> >> >>>>>>>>>>>>>>>> wrote:
> > > >> >> >>>>>>>>>>>>>>>>
> > > >> >> >>>>>>>>>>>>>>>>> Hi Jark!
> > > >> >> >>>>>>>>>>>>>>>>>
> > > >> >> >>>>>>>>>>>>>>>>> Sorry for the late response. I would like to
> make
> > > >> some
> > > >> >> >>>> comments and
> > > >> >> >>>>>>>>>>>>>>>>> clarify my points.
> > > >> >> >>>>>>>>>>>>>>>>>
> > > >> >> >>>>>>>>>>>>>>>>> 1) I agree with your first statement. I think
> we
> > > can
> > > >> >> >>> achieve
> > > >> >> >>>> both
> > > >> >> >>>>>>>>>>>>>>>>> advantages this way: put the Cache interface in
> > > >> >> >>>> flink-table-common,
> > > >> >> >>>>>>>>>>>>>>>>> but have implementations of it in
> > > >> flink-table-runtime.
> > > >> >> >>>> Therefore if a
> > > >> >> >>>>>>>>>>>>>>>>> connector developer wants to use existing cache
> > > >> >> >> strategies
> > > >> >> >>>> and their
> > > >> >> >>>>>>>>>>>>>>>>> implementations, he can just pass lookupConfig
> to
> > > the
> > > >> >> >>>> planner, but if
> > > >> >> >>>>>>>>>>> he wants to have his own cache implementation
> in
> > > his
> > > >> >> >>>> TableFunction, it
> > > >> >> >>>>>>>>>>>>>>>>> will be possible for him to use the existing
> > > >> interface
> > > >> >> >> for
> > > >> >> >>>> this
> > > >> >> >>>>>>>>>>>>>>>>> purpose (we can explicitly point this out in
> the
> > > >> >> >>>> documentation). In
> > > >> >> >>>>>>>>>>>>>>>>> this way all configs and metrics will be
> unified.
> > > >> WDYT?
> > > >> >> >>>>>>>>>>>>>>>>>
> > > >> >> >>>>>>>>>>>>>>>>>> If a filter can prune 90% of data in the
> cache,
> > we
> > > >> will
> > > >> >> >>>> have 90% of
> > > >> >> >>>>>>>>>>>>>>>>> lookup requests that can never be cached
> > > >> >> >>>>>>>>>>>>>>>>>
> > > >> >> >>>>>>>>>>>>>>>>> 2) Let me clarify the logic of the filters
> optimization
> > in
> > > >> case
> > > >> >> >> of
> > > >> >> >>>> LRU cache.
> > > >> >> >>>>>>>>>>>>>>>>> It looks like Cache<RowData,
> > Collection<RowData>>.
> > > >> Here
> > > >> >> >> we
> > > >> >> >>>> always
> > > >> >> >>>>>>>>>>>>>>>>> store the response of the dimension table in
> > cache,
> > > >> even
> > > >> >> >>>> after
> > > >> >> >>>>>>>>>>>>>>>>> applying calc function. I.e. if there are no
> rows
> > > >> after
> > > >> >> >>>> applying
> > > >> >> >>>>>>>>>>>>>>>>> filters to the result of the 'eval' method of
> > > >> >> >>> TableFunction,
> > > >> >> >>>> we store
> > > >> >> >>>>>>>>>>>>>>>>> the empty list by lookup keys. Therefore the
> > cache
> > > >> line
> > > >> >> >>> will
> > > >> >> >>>> be
> > > >> >> >>>>>>>>>>>>>>>>> filled, but will require much less memory (in
> > > bytes).
> > > >> >> >> I.e.
> > > >> >> >>>> we don't
> > > >> >> >>>>>>>>>>>>>>>>> completely filter out keys whose result was
> > pruned,
> > > >> but
> > > >> >> >>>> significantly
> > > >> >> >>>>>>>>>>>>>>>>> reduce the memory required to store this result. If
> > the
> > > >> user
> > > >> >> >>>> knows about
> > > >> >> >>>>>>>>>>>>>>>>> this behavior, he can increase the 'max-rows'
> > > option
> > > >> >> >> before
> > > >> >> >>>> the start
> > > >> >> >>>>>>>>>>>>>>>>> of the job. But actually I came up with the
> idea
> > > >> that we
> > > >> >> >>> can
> > > >> >> >>>> do this
> > > >> >> >>>>>>>>>>>>>>>>> automatically by using the 'maximumWeight' and
> > > >> 'weigher'
> > > >> >> >>>> methods of
> > > >> >> >>>>>>>>>>>>>>>>> GuavaCache [1]. Weight can be the size of the
> > > >> collection
> > > >> >> >> of
> > > >> >> >>>> rows
> > > >> >> >>>>>>>>>>>>>>>>> (value of cache). Therefore cache can
> > automatically
> > > >> fit
> > > >> >> >>> many
> > > >> >> >>>> more
> > > >> >> >>>>>>>>>>>>>>>>> records than before.
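The weigher idea above can be sketched in plain Java. This is a simplified stand-in for Guava's `maximumWeight`/`weigher` pair, written self-contained for illustration (FIFO eviction instead of Guava's policy; class and method names are assumptions, not any Flink or Guava API). The key point is that an empty (fully filtered) result weighs almost nothing, so the cache can hold many more such entries:

```java
import java.util.ArrayDeque;
import java.util.Collection;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

/**
 * Weight-based eviction sketch: each entry is weighed by the number of rows in
 * its value, so empty results (keys whose rows were pruned by filters) cost
 * a weight of 1 instead of a full-size slot.
 */
class WeightedRowCache<K, V> {
    private final long maxWeight;
    private long currentWeight = 0;
    private final Map<K, Collection<V>> entries = new HashMap<>();
    private final Deque<K> insertionOrder = new ArrayDeque<>();

    WeightedRowCache(long maxWeight) { this.maxWeight = maxWeight; }

    private long weigh(Collection<V> rows) {
        return Math.max(1, rows.size()); // an empty result still weighs 1
    }

    void put(K key, Collection<V> rows) {
        invalidate(key);
        entries.put(key, rows);
        insertionOrder.addLast(key);
        currentWeight += weigh(rows);
        while (currentWeight > maxWeight && !insertionOrder.isEmpty()) {
            invalidate(insertionOrder.peekFirst()); // evict oldest until under budget
        }
    }

    Collection<V> getIfPresent(K key) { return entries.get(key); }

    void invalidate(K key) {
        Collection<V> removed = entries.remove(key);
        if (removed != null) {
            insertionOrder.remove(key);
            currentWeight -= weigh(removed);
        }
    }

    int size() { return entries.size(); }
}
```

With Guava one would instead pass a `Weigher` that returns the row count of the cached collection and set `maximumWeight` on the builder; the eviction bookkeeping then comes for free.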
> > > >> >> >>>>>>>>>>>>>>>>>
> > > >> >> >>>>>>>>>>>>>>>>>> Flink SQL has provided a standard way to do
> > > filters
> > > >> and
> > > >> >> >>>> projects
> > > >> >> >>>>>>>>>>>>>>>>> pushdown, i.e., SupportsFilterPushDown and
> > > >> >> >>>> SupportsProjectionPushDown.
> > > >> >> >>>>>>>>>>>>>>>>>> Jdbc/hive/HBase haven't implemented the
> > > interfaces,
> > > >> >> >> but that doesn't
> > > >> >> >>>> mean it's
> > > >> >> >>>>>>>>>>>>>>> hard
> > > >> >> >>>>>>>>>>>>>>>>> to implement.
> > > >> >> >>>>>>>>>>>>>>>>>
> > > >> >> >>>>>>>>>>>>>>>>> It's debatable how difficult it will be to
> > > implement
> > > >> >> >> filter
> > > >> >> >>>> pushdown.
> > > >> >> >>>>>>>>>>>>>>>>> But I think the fact that currently there is no
> > > >> database
> > > >> >> >>>> connector
> > > >> >> >>>>>>>>>>>>>>>>> with filter pushdown at least means that this
> > > feature
> > > >> >> >> won't
> > > >> >> >>>> be
> > > >> >> >>>>>>>>>>>>>>>>> supported soon in connectors. Moreover, if we
> > talk
> > > >> about
> > > >> >> >>>> other
> > > >> >> >>>>>>>>>>>>>>>>> connectors (not in Flink repo), their databases
> > > might
> > > >> >> not
> > > >> >> >>>> support all
> > > >> >> >>>>>>>>>>>>>>>>> Flink filters (or not support filters at all).
> I
> > > >> think
> > > >> >> >>> users
> > > >> >> >>>> are
> > > >> >> >>>>>>>>>>>>>>>>> interested in supporting cache filters
> > optimization
> > > >> >> >>>> independently of
> > > >> >> >>>>>>>>>>>>>>>>> supporting other features and solving more
> > complex
> > > >> >> >> problems
> > > >> >> >>>> (or
> > > >> >> >>>>>>>>>>>>>>>>> unsolvable at all).
> > > >> >> >>>>>>>>>>>>>>>>>
> > > >> >> >>>>>>>>>>>>>>>>> 3) I agree with your third statement. Actually
> in
> > > our
> > > >> >> >>>> internal version
> > > >> >> >>>>>>>>>>>>>>>>> I also tried to unify the logic of scanning and
> > > >> >> reloading
> > > >> >> >>>> data from
> > > >> >> >>>>>>>>>>>>>>>>> connectors. But unfortunately, I didn't find a
> > way
> > > to
> > > >> >> >> unify
> > > >> >> >>>> the logic
> > > >> >> >>>>>>>>>>>>>>>>> of all ScanRuntimeProviders (InputFormat,
> > > >> >> SourceFunction,
> > > >> >> >>>> Source,...)
> > > >> >> >>>>>>>>>>>>>>>>> and reuse it in reloading ALL cache. As a
> result
> > I
> > > >> >> >> settled
> > > >> >> >>>> on using
> > > >> >> >>>>>>>>>>>>>>>>> InputFormat, because it was used for scanning
> in
> > > all
> > > >> >> >> lookup
> > > >> >> >>>>>>>>>>>>>>>>> connectors. (I didn't know that there are plans
> > to
> > > >> >> >>> deprecate
> > > >> >> >>>>>>>>>>>>>>>>> InputFormat in favor of FLIP-27 Source). IMO
> > usage
> > > of
> > > >> >> >>>> FLIP-27 source
> > > >> >> >>>>>>>>>>>>>>>>> in ALL caching is not a good idea, because this
> > > source
> > > >> was
> > > >> >> >>>> designed to
> > > >> >> >>>>>>>>>>>>>>>>> work in a distributed environment
> (SplitEnumerator
> > on
> > > >> >> >>>> JobManager and
> > > >> >> >>>>>>>>>>>>>>>>> SourceReaders on TaskManagers), not in one
> > operator
> > > >> >> >> (lookup
> > > >> >> >>>> join
> > > >> >> >>>>>>>>>>>>>>>>> operator in our case). There is even no direct
> > way
> > > to
> > > >> >> >> pass
> > > >> >> >>>> splits from
> > > >> >> >>>>>>>>>>>>>>>>> SplitEnumerator to SourceReader (this logic
> works
> > > >> >> through
> > > >> >> >>>>>>>>>>>>>>>>> SplitEnumeratorContext, which requires
> > > >> >> >>>>>>>>>>>>>>>>> OperatorCoordinator.SubtaskGateway to send
> > > >> >> >> AddSplitEvents).
> > > >> >> >>>> Usage of
> > > >> >> >>>>>>>>>>>>>>>>> InputFormat for ALL cache seems much
> clearer
> > > and
> > > >> >> >>>> easier. But if
> > > >> >> >>>>>>>>>>>>>>>>> there are plans to refactor all connectors to
> > > >> FLIP-27, I
> > > >> >> >>>> have the
> > > >> >> >>>>>>>>>>>>>>>>> following ideas: maybe we can abandon the
> lookup
> > > join
> > > >> >> ALL
> > > >> >> >>>> cache in
> > > >> >> >>>>>>>>>>>>>>>>> favor of simple join with multiple scanning of
> > > batch
> > > >> >> >>> source?
> > > >> >> >>>> The point
> > > >> >> >>>>>>>>>>>>>>>>> is that the only difference between lookup join
> > ALL
> > > >> >> cache
> > > >> >> >>>> and simple
> > > >> >> >>>>>>>>>>>>>>>>> join with batch source is that in the first
> case
> > > >> >> scanning
> > > >> >> >>> is
> > > >> >> >>>> performed
> > > >> >> >>>>>>>>>>>>>>>>> multiple times, in between which state (cache)
> is
> > > >> >> cleared
> > > >> >> >>>> (correct me
> > > >> >> >>>>>>>>>>>>>>>>> if I'm wrong). So what if we extend the
> > > >> functionality of
> > > >> >> >>>> simple join
> > > >> >> >>>>>>>>>>>>>>>>> to support state reloading + extend the
> > > >> functionality of
> > > >> >> >>>> scanning
> > > >> >> >>>>>>>>>>>>>>>>> batch source multiple times (this one should be
> > > easy
> > > >> >> with
> > > >> >> >>>> new FLIP-27
> > > >> >> >>>>>>>>>>>>>>>>> source, that unifies streaming/batch reading -
> we
> > > >> will
> > > >> >> >> need
> > > >> >> >>>> to change
> > > >> >> >>>>>>>>>>>>>>>>> only SplitEnumerator, which will pass splits
> > again
> > > >> after
> > > >> >> >>>> some TTL).
> > > >> >> >>>>>>>>>>>>>>>>> WDYT? I must say that this looks like a
> long-term
> > > >> goal
> > > >> >> >> and
> > > >> >> >>>> will make
> > > >> >> >>>>>>>>>>>>>>>>> the scope of this FLIP even larger than you
> said.
> > > >> Maybe
> > > >> >> >> we
> > > >> >> >>>> can limit
> > > >> >> >>>>>>>>>>>>>>>>> ourselves to a simpler solution now
> > (InputFormats).
> > > >> >> >>>>>>>>>>>>>>>>>
> > > >> >> >>>>>>>>>>>>>>>>> So to sum up, my points is like this:
> > > >> >> >>>>>>>>>>>>>>>>> 1) There is a way to make both concise and
> > flexible
> > > >> >> >>>> interfaces for
> > > >> >> >>>>>>>>>>>>>>>>> caching in lookup join.
> > > >> >> >>>>>>>>>>>>>>>>> 2) Cache filters optimization is important both
> > in
> > > >> LRU
> > > >> >> >> and
> > > >> >> >>>> ALL caches.
> > > >> >> >>>>>>>>>>>>>>>>> 3) It is unclear when filter pushdown will be
> > > >> supported
> > > >> >> >> in
> > > >> >> >>>> Flink
> > > >> >> >>>>>>>>>>>>>>>>> connectors, some of the connectors might not
> have
> > > the
> > > >> >> >>>> opportunity to
> > > >> >> >>>>>>>>>>>>>>>>> support filter pushdown + as I know, currently
> > > filter
> > > >> >> >>>> pushdown works
> > > >> >> >>>>>>>>>>>>>>>>> only for scanning (not lookup). So cache
> filters
> > +
> > > >> >> >>>> projections
> > > >> >> >>>>>>>>>>>>>>>>> optimization should be independent from other
> > > >> features.
> > > >> >> >>>>>>>>>>>>>>>>> 4) ALL cache realization is a complex topic
> that
> > > >> >> involves
> > > >> >> >>>> multiple
> > > >> >> >>>>>>>>>>>>>>>>> aspects of how Flink is developing. Moving away
> from
> > > >> >> >>>> InputFormat in favor
> > > >> >> >>>>>>>>>>>>>>>>> of FLIP-27 Source will make ALL cache
> realization
> > > >> really
> > > >> >> >>>> complex and
> > > >> >> >>>>>>>>>>>>>>>>> not clear, so maybe instead of that we can
> extend
> > > the
> > > >> >> >>>> functionality of
> > > >> >> >>>>>>>>>>>>>>>>> simple join or keep InputFormat in the
> > case
> > > of
> > > >> >> >>> lookup
> > > >> >> >>>> join ALL
> > > >> >> >>>>>>>>>>>>>>>>> cache?
> > > >> >> >>>>>>>>>>>>>>>>>
> > > >> >> >>>>>>>>>>>>>>>>> Best regards,
> > > >> >> >>>>>>>>>>>>>>>>> Smirnov Alexander
> > > >> >> >>>>>>>>>>>>>>>>>
> > > >> >> >>>>>>>>>>>>>>>>>
> > > >> >> >>>>>>>>>>>>>>>>>
> > > >> >> >>>>>>>>>>>>>>>>>
> > > >> >> >>>>>>>>>>>>>>>>>
> > > >> >> >>>>>>>>>>>>>>>>>
> > > >> >> >>>>>>>>>>>>>>>>>
> > > >> >> >>>>>>>>>>>>>>>>> [1]
> > > >> >> >>>>>>>>>>>>>>>>>
> > > >> >> >>>>>>>>>>>>>>>
> > > >> >> >>>>
> > > >> >> >>>
> > > >> >> >>
> > > >> >>
> > > >>
> > >
> >
> https://guava.dev/releases/18.0/api/docs/com/google/common/cache/CacheBuilder.html#weigher(com.google.common.cache.Weigher)
> > > >> >> >>>>>>>>>>>>>>>>>
> > > >> >> >>>>>>>>>>>>>>>>> Thu, 5 May 2022 at 20:34, Jark Wu <
> > > imjark@gmail.com
> > > >> >:
> > > >> >> >>>>>>>>>>>>>>>>>>
> > > >> >> >>>>>>>>>>>>>>>>>> It's great to see the active discussion! I
> want
> > to
> > > >> >> share
> > > >> >> >>> my
> > > >> >> >>>> ideas:
> > > >> >> >>>>>>>>>>>>>>>>>>
> > > >> >> >>>>>>>>>>>>>>>>>> 1) implement the cache in framework vs.
> > connectors
> > > >> base
> > > >> >> >>>>>>>>>>>>>>>>>> I don't have a strong opinion on this. Both
> ways
> > > >> should
> > > >> >> >>>> work (e.g.,
> > > >> >> >>>>>>>>>>>>>>> cache
> > > >> >> >>>>>>>>>>>>>>>>>> pruning, compatibility).
> > > >> >> >>>>>>>>>>>>>>>>>> The framework way can provide more concise
> > > >> interfaces.
> > > >> >> >>>>>>>>>>>>>>>>>> The connector base way can define more
> flexible
> > > >> cache
> > > >> >> >>>>>>>>>>>>>>>>>> strategies/implementations.
> > > >> >> >>>>>>>>>>>>>>>>>> We are still investigating a way to see if we
> > can
> > > >> have
> > > >> >> >>> both
> > > >> >> >>>>>>>>>>>>>>> advantages.
> > > >> >> >>>>>>>>>>>>>>>>>> We should reach a consensus that the way
> should
> > > be a
> > > >> >> >> final
> > > >> >> >>>> state,
> > > >> >> >>>>>>>>>>>>>>> and we
> > > >> >> >>>>>>>>>>>>>>>>>> are on the path to it.
> > > >> >> >>>>>>>>>>>>>>>>>>
> > > >> >> >>>>>>>>>>>>>>>>>> 2) filters and projections pushdown:
> > > >> >> >>>>>>>>>>>>>>>>>> I agree with Alex that the filter pushdown
> into
> > > >> cache
> > > >> >> >> can
> > > >> >> >>>> benefit a
> > > >> >> >>>>>>>>>>>>>>> lot
> > > >> >> >>>>>>>>>>>>>>>>> for
> > > >> >> >>>>>>>>>>>>>>>>>> ALL cache.
> > > >> >> >>>>>>>>>>>>>>>>>> However, this is not true for LRU cache.
> > > Connectors
> > > >> use
> > > >> >> >>>> cache to
> > > >> >> >>>>>>>>>>>>>>> reduce
> > > >> >> >>>>>>>>>>>>>>>>> IO
> > > >> >> >>>>>>>>>>>>>>>>>> requests to databases for better throughput.
> > > >> >> >>>>>>>>>>>>>>>>>> If a filter can prune 90% of data in the
> cache,
> > we
> > > >> will
> > > >> >> >>>> have 90% of
> > > >> >> >>>>>>>>>>>>>>>>> lookup
> > > >> >> >>>>>>>>>>>>>>>>>> requests that can never be cached
> > > >> >> >>>>>>>>>>>>>>>>>> and go directly to the databases. That means
> > the
> > > >> cache
> > > >> >> >> is
> > > >> >> >>>>>>>>>>>>>>> meaningless in
> > > >> >> >>>>>>>>>>>>>>>>>> this case.
> > > >> >> >>>>>>>>>>>>>>>>>>
> > > >> >> >>>>>>>>>>>>>>>>>> IMO, Flink SQL has provided a standard way to
> do
> > > >> >> filters
> > > >> >> >>>> and projects
> > > >> >> >>>>>>>>>>>>>>>>>> pushdown, i.e., SupportsFilterPushDown and
> > > >> >> >>>>>>>>>>>>>>> SupportsProjectionPushDown.
> > > >> >> >>>>>>>>>>>>>>>>>> Jdbc/hive/HBase haven't implemented the
> > > interfaces,
> > > >> >> >> but that doesn't
> > > >> >> >>>> mean it's
> > > >> >> >>>>>>>>>>>>>>> hard
> > > >> >> >>>>>>>>>>>>>>>>> to
> > > >> >> >>>>>>>>>>>>>>>>>> implement.
> > > >> >> >>>>>>>>>>>>>>>>>> They should implement the pushdown interfaces
> to
> > > >> reduce
> > > >> >> >> IO
> > > >> >> >>>> and the
> > > >> >> >>>>>>>>>>>>>>> cache
> > > >> >> >>>>>>>>>>>>>>>>>> size.
> > > >> >> >>>>>>>>>>>>>>>>>> That should be a final state that the scan
> > source
> > > >> and
> > > >> >> >>>> lookup source
> > > >> >> >>>>>>>>>>>>>>> share
> > > >> >> >>>>>>>>>>>>>>>>>> the exact pushdown implementation.
> > > >> >> >>>>>>>>>>>>>>>>>> I don't see why we need to duplicate the
> > pushdown
> > > >> logic
> > > >> >> >> in
> > > >> >> >>>> caches,
> > > >> >> >>>>>>>>>>>>>>> which
> > > >> >> >>>>>>>>>>>>>>>>>> would complicate the lookup join design.
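The accepted/remaining contract behind SupportsFilterPushDown can be illustrated with a toy model. Real Flink sources receive `ResolvedExpression` objects from the planner; the string-based filters and class names below are illustrative assumptions, showing only the split that `applyFilters` performs:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

/**
 * Toy model of the pushdown contract: the source keeps the filters it can
 * evaluate itself and hands the rest back to the planner to apply after the
 * lookup. Filters evaluated in the database prune both IO and cache size.
 */
class PushdownResult {
    final List<String> acceptedFilters = new ArrayList<>();
    final List<String> remainingFilters = new ArrayList<>();
}

class ToyLookupSource {
    // which filter expressions this particular source/database understands
    private final Predicate<String> supported;

    ToyLookupSource(Predicate<String> supported) { this.supported = supported; }

    PushdownResult applyFilters(List<String> filters) {
        PushdownResult result = new PushdownResult();
        for (String f : filters) {
            if (supported.test(f)) {
                result.acceptedFilters.add(f);   // pushed into the database query
            } else {
                result.remainingFilters.add(f);  // applied by the planner afterwards
            }
        }
        return result;
    }
}
```

This is why a single pushdown implementation can serve both the scan and lookup paths: whatever the source cannot accept simply stays in the remaining list and is evaluated by the framework.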
> > > >> >> >>>>>>>>>>>>>>>>>>
> > > >> >> >>>>>>>>>>>>>>>>>> 3) ALL cache abstraction
> > > >> >> >>>>>>>>>>>>>>>>>> All cache might be the most challenging part
> of
> > > this
> > > >> >> >> FLIP.
> > > >> >> >>>> We have
> > > >> >> >>>>>>>>>>>>>>> never
> > > >> >> >>>>>>>>>>>>>>>>>> provided a reload-lookup public interface.
> > > >> >> >>>>>>>>>>>>>>>>>> Currently, we put the reload logic in the
> "eval"
> > > >> method
> > > >> >> >> of
> > > >> >> >>>>>>>>>>>>>>> TableFunction.
> > > >> >> >>>>>>>>>>>>>>>>>> That's hard for some sources (e.g., Hive).
> > > >> >> >>>>>>>>>>>>>>>>>> Ideally, connector implementation should share
> > the
> > > >> >> logic
> > > >> >> >>> of
> > > >> >> >>>> reload
> > > >> >> >>>>>>>>>>>>>>> and
> > > >> >> >>>>>>>>>>>>>>>>>> scan, i.e. ScanTableSource with
> > > >> >> >>>> InputFormat/SourceFunction/FLIP-27
> > > >> >> >>>>>>>>>>>>>>>>> Source.
> > > >> >> >>>>>>>>>>>>>>>>>> However, InputFormat/SourceFunction are
> > > deprecated,
> > > >> and
> > > >> >> >>> the
> > > >> >> >>>> FLIP-27
> > > >> >> >>>>>>>>>>>>>>>>> source
> > > >> >> >>>>>>>>>>>>>>>>>> is deeply coupled with SourceOperator.
> > > >> >> >>>>>>>>>>>>>>>>>> If we want to invoke the FLIP-27 source in
> > > >> LookupJoin,
> > > >> >> >>> this
> > > >> >> >>>> may make
> > > >> >> >>>>>>>>>>>>>>> the
> > > >> >> >>>>>>>>>>>>>>>>>> scope of this FLIP much larger.
> > > >> >> >>>>>>>>>>>>>>>>>> We are still investigating how to abstract the
> > ALL
> > > >> >> cache
> > > >> >> >>>> logic and
> > > >> >> >>>>>>>>>>>>>>> reuse
> > > >> >> >>>>>>>>>>>>>>>>>> the existing source interfaces.
> > > >> >> >>>>>>>>>>>>>>>>>>
> > > >> >> >>>>>>>>>>>>>>>>>>
> > > >> >> >>>>>>>>>>>>>>>>>> Best,
> > > >> >> >>>>>>>>>>>>>>>>>> Jark
> > > >> >> >>>>>>>>>>>>>>>>>>
> > > >> >> >>>>>>>>>>>>>>>>>>
> > > >> >> >>>>>>>>>>>>>>>>>>
> > > >> >> >>>>>>>>>>>>>>>>>> On Thu, 5 May 2022 at 20:22, Roman Boyko <
> > > >> >> >>>> ro.v.boyko@gmail.com>
> > > >> >> >>>>>>>>>>>>>>> wrote:
> > > >> >> >>>>>>>>>>>>>>>>>>
> > > >> >> >>>>>>>>>>>>>>>>>>> It's a much more complicated activity and
> lies
> > > out
> > > >> of
> > > >> >> >> the
> > > >> >> >>>> scope of
> > > >> >> >>>>>>>>>>>>>>> this
> > > >> >> >>>>>>>>>>>>>>>>>>> improvement. Because such pushdowns should be
> > > done
> > > >> for
> > > >> >> >>> all
> > > >> >> >>>>>>>>>>>>>>>>> ScanTableSource
> > > >> >> >>>>>>>>>>>>>>>>>>> implementations (not only for Lookup ones).
> > > >> >> >>>>>>>>>>>>>>>>>>>
> > > >> >> >>>>>>>>>>>>>>>>>>> On Thu, 5 May 2022 at 19:02, Martijn Visser <
> > > >> >> >>>>>>>>>>>>>>> martijnvisser@apache.org>
> > > >> >> >>>>>>>>>>>>>>>>>>> wrote:
> > > >> >> >>>>>>>>>>>>>>>>>>>
> > > >> >> >>>>>>>>>>>>>>>>>>>> Hi everyone,
> > > >> >> >>>>>>>>>>>>>>>>>>>>
> > > >> >> >>>>>>>>>>>>>>>>>>>> One question regarding "And Alexander
> > correctly
> > > >> >> >>> mentioned
> > > >> >> >>>> that
> > > >> >> >>>>>>>>>>>>>>> filter
> > > >> >> >>>>>>>>>>>>>>>>>>>> pushdown still is not implemented for
> > > >> >> >> jdbc/hive/hbase."
> > > >> >> >>>> -> Would
> > > >> >> >>>>>>>>>>>>>>> an
> > > >> >> >>>>>>>>>>>>>>>>>>>> alternative solution be to actually
> implement
> > > >> these
> > > >> >> >>> filter
> > > >> >> >>>>>>>>>>>>>>> pushdowns?
> > > >> >> >>>>>>>>>>>>>>>>> I
> > > >> >> >>>>>>>>>>>>>>>>>>>> can
> > > >> >> >>>>>>>>>>>>>>>>>>>> imagine that there are many more benefits to
> > > doing
> > > >> >> >> that,
> > > >> >> >>>> outside
> > > >> >> >>>>>>>>>>>>>>> of
> > > >> >> >>>>>>>>>>>>>>>>> lookup
> > > >> >> >>>>>>>>>>>>>>>>>>>> caching and metrics.
> > > >> >> >>>>>>>>>>>>>>>>>>>>
> > > >> >> >>>>>>>>>>>>>>>>>>>> Best regards,
> > > >> >> >>>>>>>>>>>>>>>>>>>>
> > > >> >> >>>>>>>>>>>>>>>>>>>> Martijn Visser
> > > >> >> >>>>>>>>>>>>>>>>>>>> https://twitter.com/MartijnVisser82
> > > >> >> >>>>>>>>>>>>>>>>>>>> https://github.com/MartijnVisser
> > > >> >> >>>>>>>>>>>>>>>>>>>>
> > > >> >> >>>>>>>>>>>>>>>>>>>>
> > > >> >> >>>>>>>>>>>>>>>>>>>> On Thu, 5 May 2022 at 13:58, Roman Boyko <
> > > >> >> >>>> ro.v.boyko@gmail.com>
> > > >> >> >>>>>>>>>>>>>>>>> wrote:
> > > >> >> >>>>>>>>>>>>>>>>>>>>
> > > >> >> >>>>>>>>>>>>>>>>>>>>> Hi everyone!
> > > >> >> >>>>>>>>>>>>>>>>>>>>>
> > > >> >> >>>>>>>>>>>>>>>>>>>>> Thanks for driving such a valuable
> > improvement!
> > > >> >> >>>>>>>>>>>>>>>>>>>>>
> > > >> >> >>>>>>>>>>>>>>>>>>>>> I do think that a single cache implementation
> > > >> would be
> > > >> >> >> a
> > > >> >> >>>> nice
> > > >> >> >>>>>>>>>>>>>>>>> opportunity
> > > >> >> >>>>>>>>>>>>>>>>>>>> for
> > > >> >> >>>>>>>>>>>>>>>>>>>>> users. And it will break the "FOR
> SYSTEM_TIME
> > > AS
> > > >> OF
> > > >> >> >>>> proc_time"
> > > >> >> >>>>>>>>>>>>>>>>> semantics
> > > >> >> >>>>>>>>>>>>>>>>>>>>> anyway - doesn't matter how it will be
> > > >> implemented.
> > > >> >> >>>>>>>>>>>>>>>>>>>>>
> > > >> >> >>>>>>>>>>>>>>>>>>>>> Putting myself in the user's shoes, I can
> say
> > > >> that:
> > > >> >> >>>>>>>>>>>>>>>>>>>>> 1) I would prefer to have the opportunity
> to
> > > cut
> > > >> off
> > > >> >> >>> the
> > > >> >> >>>> cache
> > > >> >> >>>>>>>>>>>>>>> size
> > > >> >> >>>>>>>>>>>>>>>>> by
> > > >> >> >>>>>>>>>>>>>>>>>>>>> simply filtering unnecessary data. And the
> > most
> > > >> >> handy
> > > >> >> >>>> way to do
> > > >> >> >>>>>>>>>>>>>>> it
> > > >> >> >>>>>>>>>>>>>>>>> is
> > > >> >> >>>>>>>>>>>>>>>>>>>> to apply
> > > >> >> >>>>>>>>>>>>>>>>>>>>> it inside LookupRunners. It would be a bit
> > > >> harder to
> > > >> >> >>>> pass it
> > > >> >> >>>>>>>>>>>>>>>>> through the
> > > >> >> >>>>>>>>>>>>>>>>>>>>> LookupJoin node to TableFunction. And
> > Alexander
> > > >> >> >>> correctly
> > > >> >> >>>>>>>>>>>>>>> mentioned
> > > >> >> >>>>>>>>>>>>>>>>> that
> > > >> >> >>>>>>>>>>>>>>>>>>>>> filter pushdown still is not implemented
> for
> > > >> >> >>>> jdbc/hive/hbase.
> > > >> >> >>>>>>>>>>>>>>>>>>>>> 2) The ability to set the different caching
> > > >> >> >> parameters
> > > >> >> >>>> for
> > > >> >> >>>>>>>>>>>>>>> different
> > > >> >> >>>>>>>>>>>>>>>>>>>> tables
> > > >> >> >>>>>>>>>>>>>>>>>>>>> is quite important. So I would prefer to
> set
> > it
> > > >> >> >> through
> > > >> >> >>>> DDL
> > > >> >> >>>>>>>>>>>>>>> rather
> > > >> >> >>>>>>>>>>>>>>>>> than
> > > >> >> >>>>>>>>>>>>>>>>>>>>> have similar ttla, strategy and other
> options
> > > for
> > > >> >> all
> > > >> >> >>>> lookup
> > > >> >> >>>>>>>>>>>>>>> tables.
> > > >> >> >>>>>>>>>>>>>>>>>>>>> 3) Putting the cache into the framework
> > > really
> > > >> >> >>>> deprives us of
> > > >> >> >>>>>>>>>>>>>>>>>>>>> extensibility (users won't be able to
> > implement
> > > >> >> their
> > > >> >> >>> own
> > > >> >> >>>>>>>>>>>>>>> cache).
> > > >> >> >>>>>>>>>>>>>>>>> But
> > > >> >> >>>>>>>>>>>>>>>>>>>> most
> > > >> >> >>>>>>>>>>>>>>>>>>>>> probably it might be solved by creating
> more
> > > >> >> >> different
> > > >> >> >>>> cache
> > > >> >> >>>>>>>>>>>>>>>>> strategies
> > > >> >> >>>>>>>>>>>>>>>>>>>> and
> > > >> >> >>>>>>>>>>>>>>>>>>>>> a wider set of configurations.
> > > >> >> >>>>>>>>>>>>>>>>>>>>>
> > > >> >> >>>>>>>>>>>>>>>>>>>>> All these points are much closer to the
> > schema
> > > >> >> >> proposed
> > > >> >> >>>> by
> > > >> >> >>>>>>>>>>>>>>>>> Alexander.
> > > >> >> >>>>>>>>>>>>>>>>>>>>> Qingshen Ren, please correct me if I'm not
> > > right
> > > >> and
> > > >> >> >>> all
> > > >> >> >>>> these
> > > >> >> >>>>>>>>>>>>>>>>>>>> facilities
> > > >> >> >>>>>>>>>>>>>>>>>>>>> might be simply implemented in your
> > > architecture?
> > > >> >> >>>>>>>>>>>>>>>>>>>>>
> > > >> >> >>>>>>>>>>>>>>>>>>>>> Best regards,
> > > >> >> >>>>>>>>>>>>>>>>>>>>> Roman Boyko
> > > >> >> >>>>>>>>>>>>>>>>>>>>> e.: ro.v.boyko@gmail.com
> > > >> >> >>>>>>>>>>>>>>>>>>>>>
On Wed, 4 May 2022 at 21:01, Martijn Visser <martijnvisser@apache.org> wrote:

Hi everyone,

I don't have much to chip in, but just wanted to express that I really appreciate the in-depth discussion on this topic and I hope that others will join the conversation.

Best regards,

Martijn

On Tue, 3 May 2022 at 10:15, Александр Смирнов <smiralexan@gmail.com> wrote:

Hi Qingsheng, Leonard and Jark,

Thanks for your detailed feedback! However, I have questions about some of your statements (maybe I didn't get something?).

> Caching actually breaks the semantic of "FOR SYSTEM_TIME AS OF proc_time"

I agree that the semantics of "FOR SYSTEM_TIME AS OF proc_time" is not fully implemented with caching, but as you said, users accept this consciously to achieve better performance (no one has proposed enabling caching by default, etc.). Or by users do you mean other developers of connectors? In that case developers explicitly specify whether their connector supports caching or not (in the list of supported options); no one makes them do so if they don't want to. So what exactly, from this point of view, is the difference between implementing caching in the flink-table-runtime module and in flink-table-common? How does it affect breaking or preserving the semantics of "FOR SYSTEM_TIME AS OF proc_time"?

> confront a situation that allows table options in DDL to control the behavior of the framework, which has never happened previously and should be cautious

If we talk about the main semantic difference between DDL options and config options ("table.exec.xxx"), isn't it about the scope of the options and their importance to the user's business logic, rather than the specific location of the corresponding logic in the framework? I mean that in my design, for example, putting an option with the lookup cache strategy into configuration would be the wrong decision, because it directly affects the user's business logic (not just performance optimization) and touches just several functions of ONE table (there can be multiple tables with different caches). Does it really matter for the user (or anyone else) where the logic affected by the applied option is located?
Also I can recall the DDL option 'sink.parallelism', which in some way "controls the behavior of the framework", and I don't see any problem here.
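
As an illustration of the per-table DDL options being argued for, a sketch of a dimension table with cache options (the option names follow the JDBC connector's existing 'lookup.cache.*' style; the final names were still under discussion in this thread):

```sql
-- Sketch: per-table cache options in DDL, so two lookup tables can
-- use different cache sizes/TTLs instead of one global config.
CREATE TABLE users (
  id   INT,
  age  INT,
  name STRING
) WITH (
  'connector'             = 'jdbc',
  'url'                   = 'jdbc:mysql://localhost:3306/db',
  'table-name'            = 'users',
  'lookup.cache.max-rows' = '10000',
  'lookup.cache.ttl'      = '10min'
);
```
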

> introduce a new interface for this all-caching scenario and the design would become more complex

This is a subject for a separate discussion, but in our internal version we actually solved this problem quite easily: we reused the InputFormat class (so there is no need for a new API). The point is that currently all lookup connectors use InputFormat for scanning the data in batch mode: HBase, JDBC and even Hive, which uses the class PartitionReader, actually just a wrapper around InputFormat. The advantage of this solution is the ability to reload cache data in parallel (the number of threads depends on the number of InputSplits, but has an upper limit). As a result, cache reload time drops significantly (as does the time the input stream is blocked). I know that we usually try to avoid concurrency in Flink code, but maybe this one can be an exception. BTW I don't claim it's an ideal solution; maybe there are better ones.
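
The parallel reload idea can be sketched in plain Java. This stands in for reading InputSplits with a bounded thread pool; it does not use Flink's InputFormat API, and all names are illustrative:

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Sketch of a parallel ALL-cache reload: each "split" (standing in
// for an InputSplit) is read by a pool thread and merged into a
// shared map, so total reload time shrinks with parallelism.
public class ParallelReload {
    public static Map<Integer, String> reload(List<List<Integer>> splits, int parallelism) {
        Map<Integer, String> cache = new ConcurrentHashMap<>();
        ExecutorService pool = Executors.newFixedThreadPool(parallelism);
        for (List<Integer> split : splits) {
            // Pretend each key lookup reads one row from the split.
            pool.submit(() -> split.forEach(k -> cache.put(k, "row-" + k)));
        }
        pool.shutdown();
        try {
            pool.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return cache;
    }

    public static void main(String[] args) {
        Map<Integer, String> cache =
            reload(List.of(List.of(1, 2), List.of(3, 4, 5)), 2);
        System.out.println(cache.size()); // 5
    }
}
```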

> Providing the cache in the framework might introduce compatibility issues

That is possible only if the developer of the connector doesn't properly refactor the code and uses the new cache options incorrectly (i.e. explicitly provides the same options in two different places). For correct behavior, all they need to do is redirect existing options to the framework's LookupConfig (plus maybe add aliases for options if the naming differed); everything will be transparent to users. If the developer doesn't refactor at all, nothing changes for the connector because of backward compatibility. Also, if a developer wants to use their own cache logic, they can simply refuse to pass some of the configs into the framework and instead provide their own implementation with the already existing configs and metrics (but I think that's a rare case).

> filters and projections should be pushed all the way down to the table function, like what we do in the scan source

That's a great goal. But the truth is that the ONLY connector that supports filter pushdown is FileSystemTableSource (no database connector supports it currently). Also, for some databases it's simply impossible to push down filters as complex as those we have in Flink.

> only applying these optimizations to the cache seems not quite useful

Filters can cut off an arbitrarily large amount of data from the dimension table. As a simple example, suppose the dimension table 'users' has a column 'age' with values from 20 to 40, and the input stream 'clicks' is roughly uniformly distributed by user age. With the filter 'age > 30', there will be half as much data in the cache. This means the user can increase 'lookup.cache.max-rows' by almost 2 times, which will gain a huge performance boost. Moreover, this optimization really starts to shine with the 'ALL' cache, where tables that can't fit in memory without filters and projections can fit with them. This opens up additional possibilities for users, and that doesn't sound like 'not quite useful'.
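
The effect described above can be sketched in plain Java. The simulated lookup and all names are illustrative, not part of the FLIP:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.IntPredicate;

// Sketch: applying the join filter BEFORE the cache put means rows
// that can never match are never stored, so the same
// 'lookup.cache.max-rows' budget holds more useful rows.
public class FilterBeforeCache {

    // Build a cache of user ages keyed by id, filtering before put.
    // The loop simulates 20 external lookup results, ages 20..39.
    static Map<Integer, Integer> buildCache(IntPredicate ageFilter) {
        Map<Integer, Integer> cache = new HashMap<>();
        for (int id = 0; id < 20; id++) {
            int age = 20 + id;         // pretend external lookup result
            if (ageFilter.test(age)) { // filter BEFORE caching
                cache.put(id, age);
            }
        }
        return cache;
    }

    public static void main(String[] args) {
        // With 'age > 30' only 9 of the 20 simulated rows are cached.
        System.out.println(buildCache(age -> age > 30).size()); // 9
        System.out.println(buildCache(age -> true).size());     // 20
    }
}
```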

It would be great to hear other voices regarding this topic! We have quite a lot of controversial points, and I think with the help of others it will be easier for us to come to a consensus.

Best regards,
Smirnov Alexander

On Fri, 29 Apr 2022 at 22:33, Qingsheng Ren <renqschn@gmail.com> wrote:

Hi Alexander and Arvid,

Thanks for the discussion and sorry for my late response! We had an internal discussion together with Jark and Leonard, and I'd like to summarize our ideas. Instead of implementing the cache logic in the table runtime layer or wrapping it around the user-provided table function, we prefer to introduce some new APIs extending TableFunction, with these concerns:

1. Caching actually breaks the semantics of "FOR SYSTEM_TIME AS OF proc_time", because the result can no longer truly reflect the content of the lookup table at the moment of querying. If users choose to enable caching on the lookup table, they implicitly indicate that this breakage is acceptable in exchange for performance. So we prefer not to provide caching at the table runtime level.
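
For reference, this is the construct whose semantics are being discussed; a processing-time lookup join sketch with illustrative table and column names:

```sql
-- Each Orders row is enriched against the dimension table as of the
-- row's processing time. With a cache enabled, the joined row may
-- come from the cache rather than the table's current content,
-- which is the semantic relaxation described above.
SELECT o.order_id, u.name
FROM Orders AS o
  JOIN Users FOR SYSTEM_TIME AS OF o.proc_time AS u
  ON o.user_id = u.id;
```
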

2. If we put the cache implementation in the framework (whether in a runner or a wrapper around TableFunction), we have to confront a situation where table options in DDL control the behavior of the framework, which has never happened previously and should be treated cautiously. Under the current design, the behavior of the framework should only be specified by configurations ("table.exec.xxx"), and it's hard to apply such general configs to a specific table.

3. We have use cases where the lookup source loads all records into memory and refreshes them periodically to achieve high lookup performance (like the Hive connector in the community; this is also widely used by our internal connectors). Wrapping the cache around the user's TableFunction works fine for LRU caches, but I think we would have to introduce a new interface for this all-caching scenario, and the design would become more complex.

4. Providing the cache in the framework might introduce compatibility issues to existing lookup sources: there might be two caches with totally different strategies if the user configures the table incorrectly (one in the framework and another implemented by the lookup source).

As for the optimization mentioned by Alexander, I think filters and projections should be pushed all the way down to the table function, like what we do in the scan source, instead of to the runner with the cache. The goal of using a cache is to reduce network I/O and pressure on the external system, and applying these optimizations only to the cache seems not very useful.

I made some updates to the FLIP [1] to reflect our ideas. We prefer to keep the cache implementation as part of the TableFunction, and we could provide some helper classes (CachingTableFunction, AllCachingTableFunction, CachingAsyncTableFunction) to developers and regulate the metrics of the cache. Also, I made a POC [2] for your reference.

Looking forward to your ideas!

[1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-221+Abstraction+for+lookup+source+cache+and+metric
[2] https://github.com/PatrickRen/flink/tree/FLIP-221

Best regards,

Qingsheng

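
For illustration, the bounded LRU eviction that a lookup cache limited by an option like 'lookup.cache.max-rows' needs can be sketched with java.util.LinkedHashMap. This is only a sketch of the eviction policy, not the FLIP's actual helper classes:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Minimal LRU cache bounded by a max-rows limit. A helper class in
// the style of CachingTableFunction would wrap such a cache around
// lookups and expose hit/miss metrics on top of it.
public class LruRowCache<K, V> extends LinkedHashMap<K, V> {
    private final int maxRows;

    public LruRowCache(int maxRows) {
        super(16, 0.75f, true); // access-order iteration = LRU
        this.maxRows = maxRows;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxRows; // evict the least-recently-used row
    }

    public static void main(String[] args) {
        LruRowCache<Integer, String> cache = new LruRowCache<>(2);
        cache.put(1, "a");
        cache.put(2, "b");
        cache.get(1);      // touch key 1, so key 2 becomes eldest
        cache.put(3, "c"); // exceeds max-rows: evicts key 2
        System.out.println(cache.containsKey(2)); // false
        System.out.println(cache.containsKey(1)); // true
    }
}
```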
On Tue, Apr 26, 2022 at 4:45 PM Александр Смирнов <smiralexan@gmail.com> wrote:

Thanks for the response, Arvid!

I have a few comments on your message.

> but could also live with an easier solution as the first step:

I think these two approaches (the one originally proposed by Qingsheng and mine) are mutually exclusive, because conceptually they pursue the same goal but differ in implementation details. If we go one way, moving to the other in the future will mean deleting existing code and once again changing the API for connectors. So I think we should reach a consensus with the community about that and then work together on this FLIP, i.e. divide the work into tasks for its different parts (for example, LRU cache unification / introducing the proposed set of metrics / further work…). WDYT, Qingsheng?

> as the source will only receive the requests after filter

Actually, if filters are applied to fields of the lookup table, we first have to make the requests and only then can we filter the responses, because lookup connectors don't have filter pushdown. So if filtering is done before caching, there will be far fewer rows in the cache.

> @Alexander unfortunately, your architecture is not shared. I don't know the solution to share images to be honest.

Sorry for that, I'm a bit new to these kinds of conversations :)
I have no write access to the Confluence, so I made a Jira issue where I described the proposed changes in more detail: https://issues.apache.org/jira/browse/FLINK-27411.

I'll be happy to get more feedback!

Best,
Smirnov Alexander

On Mon, 25 Apr 2022 at 19:49, Arvid Heise <arvid@apache.org> wrote:

Hi Qingsheng,

Thanks for driving this; the inconsistency was not satisfying for me.

I second Alexander's idea, but could also live with an easier solution as the first step: instead of making caching an implementation detail of a TableFunction X, devise a caching layer around X. The proposal would be a CachingTableFunction that delegates to X on misses and otherwise manages the cache. Lifting it into the operator model as proposed would be even better, but is probably unnecessary in the first step for a lookup source (as the source will only receive the requests after the filter; applying the projection may be more interesting to save memory).

> > > >> >> >>>>>>>>>>>>>>>>>>>>>>>>>> Another advantage is that all the
> > changes
> > > of
> > > >> >> >> this
> > > >> >> >>>> FLIP
> > > >> >> >>>>>>>>>>>>>>>>> would be
> > > >> >> >>>>>>>>>>>>>>>>>>>>>>> limited to
> > > >> >> >>>>>>>>>>>>>>>>>>>>>>>>>> options, no need for new public
> > > interfaces.
> > > >> >> >>>> Everything
> > > >> >> >>>>>>>>>>>>>>> else
> > > >> >> >>>>>>>>>>>>>>>>>>>>> remains
> > > >> >> >>>>>>>>>>>>>>>>>>>>>> an
> > > >> >> >>>>>>>>>>>>>>>>>>>>>>>>>> implementation of Table runtime. That
> > > means
> > > >> we
> > > >> >> >> can
> > > >> >> >>>>>>>>>>>>>>> easily
> > > >> >> >>>>>>>>>>>>>>>>>>>>>> incorporate
> > > >> >> >>>>>>>>>>>>>>>>>>>>>>> the
> > > >> >> >>>>>>>>>>>>>>>>>>>>>>>>>> optimization potential that Alexander
> > > >> pointed
> > > >> >> >> out
> > > >> >> >>>>>>>>>>>>>>> later.
> > > >> >> >>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >> >> >>>>>>>>>>>>>>>>>>>>>>>>>> @Alexander unfortunately, your
> > > architecture
> > > >> is
> > > >> >> >> not
> > > >> >> >>>>>>>>>>>>>>> shared.
> > > >> >> >>>>>>>>>>>>>>>>> I
> > > >> >> >>>>>>>>>>>>>>>>>>>> don't
> > > >> >> >>>>>>>>>>>>>>>>>>>>>>> know the
> > > >> >> >>>>>>>>>>>>>>>>>>>>>>>>>> solution to share images to be honest.
> > > >> >> >>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >> >> >>>>>>>>>>>>>>>>>>>>>>>>>> On Fri, Apr 22, 2022 at 5:04 PM
> > Александр
> > > >> >> >> Смирнов
> > > >> >> >>> <
> > > >> >> >>>>>>>>>>>>>>>>>>>>>>> smiralexan@gmail.com>
> > > >> >> >>>>>>>>>>>>>>>>>>>>>>>>>> wrote:
> > > >> >> >>>>>>>>>>>>>>>>>>>>>>>>>>
>> Hi Qingsheng! My name is Alexander, I'm not a committer yet, but I'd
>> really like to become one, and this FLIP really interested me.
>> Actually I have worked on a similar feature in my company's Flink
>> fork, and we would like to share our thoughts on this and make the
>> code open source.
>>
>> I think there is a better alternative than introducing an abstract
>> class for TableFunction (CachingTableFunction). As you know,
>> TableFunction lives in the flink-table-common module, which provides
>> only an API for working with tables – very convenient for importing
>> in connectors. In turn, CachingTableFunction contains logic for
>> runtime execution, so this class and everything connected with it
>> should be located in another module, probably flink-table-runtime.
>> But this would require connectors to depend on a module that contains
>> a lot of runtime logic, which doesn't sound good.
>>
>> I suggest adding a new method 'getLookupConfig' to LookupTableSource
>> or LookupRuntimeProvider so that connectors only pass configuration
>> to the planner and therefore don't depend on the runtime
>> implementation. Based on these configs the planner will construct a
>> lookup join operator with the corresponding runtime logic
>> (ProcessFunctions in module flink-table-runtime). The architecture
>> looks like the pinned image (the LookupConfig class there is actually
>> your CacheConfig).
>>
>> The classes in flink-table-planner that will be responsible for this
>> are CommonPhysicalLookupJoin and its inheritors. The current classes
>> for lookup join in flink-table-runtime are LookupJoinRunner,
>> AsyncLookupJoinRunner, LookupJoinRunnerWithCalc and
>> AsyncLookupJoinRunnerWithCalc.
>>
>> I suggest adding classes LookupJoinCachingRunner,
>> LookupJoinCachingRunnerWithCalc, etc.
>>
>> And here comes another, more powerful advantage of such a solution.
>> If we have caching logic on a lower level, we can apply some
>> optimizations to it. LookupJoinRunnerWithCalc was named like this
>> because it uses the 'calc' function, which mostly consists of filters
>> and projections.
>>
>> For example, when joining table A with lookup table B under the
>> condition 'JOIN … ON A.id = B.id AND A.age = B.age + 10 WHERE
>> B.salary > 1000', the 'calc' function will contain the filters
>> A.age = B.age + 10 and B.salary > 1000.
>>
>> If we apply this function before storing records in the cache, the
>> size of the cache will be significantly reduced: filters avoid
>> storing useless records in the cache, and projections reduce record
>> size. So the initial maximum number of records in the cache can be
>> increased by the user.
>>
>> What do you think about it?
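The calc-before-cache idea described above can be sketched as follows. This is a minimal illustration, not Flink API: the class name is hypothetical and rows are simplified to int arrays.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.function.Predicate;

// Hypothetical sketch: apply the 'calc' (filter + projection) before caching,
// so the cache holds fewer and smaller rows.
public class CalcFilteringCache {
    private final Map<String, List<int[]>> cache = new HashMap<>();
    private final Predicate<int[]> filter;        // e.g. row -> salary > 1000
    private final Function<int[], int[]> project; // e.g. keep only needed columns

    public CalcFilteringCache(Predicate<int[]> filter, Function<int[], int[]> project) {
        this.filter = filter;
        this.project = project;
    }

    // Only rows surviving the filter are stored, and only after projection.
    public void put(String key, List<int[]> lookedUpRows) {
        List<int[]> reduced = new ArrayList<>();
        for (int[] row : lookedUpRows) {
            if (filter.test(row)) {
                reduced.add(project.apply(row));
            }
        }
        cache.put(key, reduced);
    }

    public List<int[]> getIfPresent(String key) {
        return cache.get(key);
    }
}
```

Since filtered-out rows never occupy cache slots, a user could raise the configured maximum row count for the same memory budget.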
>>
>> On 2022/04/19 02:47:11 Qingsheng Ren wrote:
>>> Hi devs,
>>>
>>> Yuan and I would like to start a discussion about FLIP-221 [1], which
>>> introduces an abstraction of lookup table cache and its standard
>>> metrics.
>>>
>>> Currently each lookup table source has to implement its own cache to
>>> store lookup results, and there isn't a standard set of metrics for
>>> users and developers to tune their jobs with lookup joins, which is a
>>> quite common use case in Flink Table / SQL.
>>>
>>> Therefore we propose some new APIs including cache, metrics, wrapper
>>> classes of TableFunction and new table options. Please take a look at
>>> the FLIP page [1] to get more details. Any suggestions and comments
>>> would be appreciated!
>>>
>>> [1]
>>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-221+Abstraction+for+lookup+source+cache+and+metric
>>>
>>> Best regards,
>>>
>>> Qingsheng
>>
>> --
>> Best Regards,
>> Qingsheng Ren
>> Real-time Computing Team
>> Alibaba Cloud
>> Email: renqschn@gmail.com
>>
>> --
>> Best regards,
>> Roman Boyko
>> e.: ro.v.boyko@gmail.com

Re: [DISCUSS] FLIP-221 Abstraction for lookup source cache and metric

Posted by Jark Wu <im...@gmail.com>.
Hi Jing Ge,

What do you mean by the "impact on the block cache used by HBase"?
In my understanding, the connector cache and the HBase cache are two
entirely different things: the connector cache is a local/client-side
cache, while the HBase cache is a server-side cache.

> does it make sense to have a no-cache solution as one of the
default solutions so that customers will have no effort for the migration
if they want to stick with Hbase cache

The implementation migration should be transparent to users. Take the HBase
connector as an example: it already supports a lookup cache, but it is
disabled by default. After the migration, the connector still disables the
cache by default (i.e. the no-cache solution), so there is no migration
effort for users.

HBase cache and connector cache are two different things. The HBase cache
can't simply replace the connector cache, because one of the most important
usages of the connector cache is reducing I/O requests/responses and
improving throughput, which cannot be achieved by a server-side cache alone.

Best,
Jark




On Fri, 27 May 2022 at 22:42, Jing Ge <ji...@ververica.com> wrote:

> Thanks all for the valuable discussion. The new feature looks very
> interesting.
>
> According to the FLIP description: "*Currently we have JDBC, Hive and HBase
> connector implemented lookup table source. All existing implementations
> will be migrated to the current design and the migration will be
> transparent to end users*." I was only wondering if we should pay attention
> to HBase and similar DBs: commonly the lookup data will be huge when using
> HBase, so partial caching will be used in this case, if I am not mistaken,
> which might have an impact on the block cache used by HBase, e.g.
> LruBlockCache.
> Another question is that, since HBase provides a sophisticated cache
> solution, does it make sense to have a no-cache solution as one of the
> default solutions so that customers will have no effort for the migration
> if they want to stick with the HBase cache?
>
> Best regards,
> Jing
>
> On Fri, May 27, 2022 at 11:19 AM Jingsong Li <ji...@gmail.com>
> wrote:
>
> > Hi all,
> >
> > I think the problems now are the following:
> > 1. The AllCache and PartialCache interfaces are not uniform: one needs to
> > provide a LookupProvider, the other needs to provide a CacheBuilder.
> > 2. The AllCache definition is not flexible. For example, PartialCache can
> > use any custom storage while AllCache cannot; AllCache may also want to
> > store to memory or disk, so it needs a flexible strategy as well.
> > 3. AllCache cannot customize its ReloadStrategy; currently there is only
> > ScheduledReloadStrategy.
> >
> > In order to solve the above problems, the following are my ideas.
> >
> > ## Top level cache interfaces:
> >
> > ```
> >
> > public interface CacheLookupProvider extends
> > LookupTableSource.LookupRuntimeProvider {
> >
> >     CacheBuilder createCacheBuilder();
> > }
> >
> >
> > public interface CacheBuilder {
> >     Cache create();
> > }
> >
> >
> > public interface Cache {
> >
> >     /**
> >      * Returns the value associated with key in this cache, or null if
> > there is no cached value for
> >      * key.
> >      */
> >     @Nullable
> >     Collection<RowData> getIfPresent(RowData key);
> >
> >     /** Returns the number of key-value mappings in the cache. */
> >     long size();
> > }
> >
> > ```
> >
> > ## Partial cache
> >
> > ```
> >
> > public interface PartialCacheLookupFunction extends CacheLookupProvider {
> >
> >     @Override
> >     PartialCacheBuilder createCacheBuilder();
> >
> >     /** Creates an {@link LookupFunction} instance. */
> >     LookupFunction createLookupFunction();
> > }
> >
> >
> > public interface PartialCacheBuilder extends CacheBuilder {
> >
> >     PartialCache create();
> > }
> >
> >
> > public interface PartialCache extends Cache {
> >
> >     /**
> >      * Associates the specified value rows with the specified key row
> > in the cache. If the cache
> >      * previously contained value associated with the key, the old
> > value is replaced by the
> >      * specified value.
> >      *
> >      * @return the previous value rows associated with key, or null if
> >      * @param key key row with which the specified value is to be
> >      *     associated
> >      * @param value value rows to be associated with the specified key
> >      * @return the previous value rows associated with the key, or null
> >      *     if there was no mapping for the key
> >     Collection<RowData> put(RowData key, Collection<RowData> value);
> >
> >     /** Discards any cached value for the specified key. */
> >     void invalidate(RowData key);
> > }
> >
> > ```
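The partial-cache contract above boils down to a get-or-load flow driven by the runtime. A minimal sketch with RowData simplified to String (class and method names here are illustrative, not the proposed API):

```java
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Illustrative only: how a runtime wrapper could consult a partial cache
// first and fall back to the connector's lookup on a miss.
public class PartialCacheFlow {
    private final Map<String, Collection<String>> cache = new HashMap<>();
    public int externalLookups = 0; // counts round trips to the external system

    public Collection<String> lookup(String key, Function<String, Collection<String>> lookupFn) {
        Collection<String> cached = cache.get(key); // getIfPresent
        if (cached != null) {
            return cached; // cache hit: no I/O to the external system
        }
        externalLookups++;
        Collection<String> loaded = lookupFn.apply(key); // delegate to connector
        cache.put(key, loaded); // put (an empty collection may also be cached for missing keys)
        return loaded;
    }
}
```

This is where the client-side cache saves I/O: repeated lookups on the same key hit the local map instead of the external store.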
> >
> > ## All cache
> > ```
> >
> > public interface AllCacheLookupProvider extends CacheLookupProvider {
> >
> >     void registerReloadStrategy(ScheduledExecutorService
> > executorService, Reloader reloader);
> >
> >     ScanTableSource.ScanRuntimeProvider getScanRuntimeProvider();
> >
> >     @Override
> >     AllCacheBuilder createCacheBuilder();
> > }
> >
> >
> > public interface AllCacheBuilder extends CacheBuilder {
> >
> >     AllCache create();
> > }
> >
> >
> > public interface AllCache extends Cache {
> >
> >     void putAll(Iterator<Map<RowData, RowData>> allEntries);
> >
> >     void clearAll();
> > }
> >
> >
> > public interface Reloader {
> >
> >     void reload();
> > }
> >
> > ```
> >
> > Best,
> > Jingsong
> >
> > On Fri, May 27, 2022 at 11:10 AM Jingsong Li <ji...@gmail.com>
> > wrote:
> >
> > > Thanks Qingsheng and all for your discussion.
> > >
> > > Very sorry to jump in so late.
> > >
> > > Maybe I missed something?
> > > My first impression when I saw the cache interface was: why don't we
> > > provide an interface similar to the Guava cache [1]? On top of the
> > > Guava cache, Caffeine also adds extensions for asynchronous calls [2],
> > > and Caffeine supports bulk loading as well.
> > >
> > > I am also confused why we first go through LookupCacheFactory.Builder
> > > and then through the Factory to create the Cache.
> > >
> > > [1] https://github.com/google/guava
> > > [2] https://github.com/ben-manes/caffeine/wiki/Population
> > >
> > > Best,
> > > Jingsong
> > >
> > > On Thu, May 26, 2022 at 11:17 PM Jark Wu <im...@gmail.com> wrote:
> > >
> > >> After looking at the newly introduced ReloadTime and Becket's comment,
> > >> I agree with Becket that we should have a pluggable reloading strategy.
> > >> We can provide some common implementations, e.g., periodic reloading
> > >> and daily reloading. But there will definitely be some connector- or
> > >> business-specific reloading strategies, e.g., notification by a
> > >> ZooKeeper watcher, or reloading once a new Hive partition is complete.
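A pluggable reloading trigger of the kind discussed here might look like the following sketch. The interface shape and names are illustrative only, not a proposed API:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Illustrative only: a pluggable trigger that decides *when* a full cache
// reload runs; the framework supplies the reload action.
public interface ReloadTrigger {
    void register(Runnable reloadTask);

    // One common implementation: fixed-interval periodic reloading.
    // Other implementations could watch a ZooKeeper node or a Hive partition.
    static ReloadTrigger periodic(long interval, TimeUnit unit) {
        return reloadTask -> {
            ScheduledExecutorService es = Executors.newSingleThreadScheduledExecutor(r -> {
                Thread t = new Thread(r, "cache-reload");
                t.setDaemon(true); // don't keep the JVM alive for the reloader
                return t;
            });
            es.scheduleWithFixedDelay(reloadTask, 0, interval, unit);
        };
    }
}
```

Connector- or business-specific strategies would then be just another `ReloadTrigger` implementation plugged in by the connector.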
> > >>
> > >> Best,
> > >> Jark
> > >>
> > >> On Thu, 26 May 2022 at 11:52, Becket Qin <be...@gmail.com>
> wrote:
> > >>
> > >> > Hi Qingsheng,
> > >> >
> > >> > Thanks for updating the FLIP. A few comments / questions below:
> > >> >
> > >> > 1. Is there a reason that we have both "XXXFactory" and
> > >> > "XXXProvider"?
> > >> > What is the difference between them? If they are the same, can we
> just
> > >> use
> > >> > XXXFactory everywhere?
> > >> >
> > >> > 2. Regarding the FullCachingLookupProvider, should the reloading
> > >> > policy also be pluggable? Periodic reloading can sometimes be tricky
> > >> > in practice. For example, if a user sets 24 hours as the cache
> > >> > refresh interval and some nightly batch job is delayed, the cache
> > >> > update may still see stale data.
> > >> >
> > >> > 3. In DefaultLookupCacheFactory, it looks like InitialCapacity
> should
> > be
> > >> > removed.
> > >> >
> > >> > 4. The purpose of LookupFunctionProvider#cacheMissingKey() seems a
> > >> little
> > >> > confusing to me. If Optional<LookupCacheFactory> getCacheFactory()
> > >> returns
> > >> > a non-empty factory, doesn't that already indicate that the framework
> > >> > should cache the missing keys? Also, why is this method returning an
> > >> Optional<Boolean>
> > >> > instead of boolean?
> > >> >
> > >> > Thanks,
> > >> >
> > >> > Jiangjie (Becket) Qin
> > >> >
> > >> >
> > >> >
> > >> > On Wed, May 25, 2022 at 5:07 PM Qingsheng Ren <re...@gmail.com>
> > >> wrote:
> > >> >
> > >> >> Hi Lincoln and Jark,
> > >> >>
> > >> >> Thanks for the comments! If the community reaches a consensus that
> we
> > >> use
> > >> >> SQL hint instead of table options to decide whether to use sync or
> > >> async
> > >> >> mode, it’s indeed not necessary to introduce the “lookup.async”
> > option.
> > >> >>
> > >> >> I think it’s a good idea to let the decision about async be made at
> > >> >> the query level, which could enable better optimization with more
> > >> >> information gathered
> > >> >> by the planner. Is there any FLIP describing the issue in FLINK-27625?
> I
> > >> >> thought FLIP-234 is only proposing adding SQL hint for retry on
> > missing
> > >> >> instead of the entire async mode to be controlled by hint.
> > >> >>
> > >> >> Best regards,
> > >> >>
> > >> >> Qingsheng
> > >> >>
> > >> >> > On May 25, 2022, at 15:13, Lincoln Lee <li...@gmail.com>
> > >> wrote:
> > >> >> >
> > >> >> > Hi Jark,
> > >> >> >
> > >> >> > Thanks for your reply!
> > >> >> >
> > >> >> > Currently 'lookup.async' exists only in the HBase connector; I have no
> > idea
> > >> >> > whether or when to remove it (we can discuss it in another issue
> > for
> > >> the
> > >> >> > HBase connector after FLINK-27625 is done), just not add it into
> a
> > >> >> common
> > >> >> > option now.
> > >> >> >
> > >> >> > Best,
> > >> >> > Lincoln Lee
> > >> >> >
> > >> >> >
> > >> >> > Jark Wu <im...@gmail.com> 于2022年5月24日周二 20:14写道:
> > >> >> >
> > >> >> >> Hi Lincoln,
> > >> >> >>
> > >> >> >> I have taken a look at FLIP-234, and I agree with you that the
> > >> >> connectors
> > >> >> >> can
> > >> >> >> provide both async and sync runtime providers simultaneously
> > instead
> > >> >> of one
> > >> >> >> of them.
> > >> >> >> At that point, "lookup.async" looks redundant. If this option is
> > >> >> planned to
> > >> >> >> be removed
> > >> >> >> in the long term, I think it makes sense not to introduce it in
> > this
> > >> >> FLIP.
> > >> >> >>
> > >> >> >> Best,
> > >> >> >> Jark
> > >> >> >>
> > >> >> >> On Tue, 24 May 2022 at 11:08, Lincoln Lee <
> lincoln.86xy@gmail.com
> > >
> > >> >> wrote:
> > >> >> >>
> > >> >> >>> Hi Qingsheng,
> > >> >> >>>
> > >> >> >>> Sorry for jumping into the discussion so late. It's a good idea
> > >> that
> > >> >> we
> > >> >> >> can
> > >> >> >>> have a common table option. I have a minor comment on
> > >> >> >>> 'lookup.async': I suggest not making it a common option:
> > >> >> >>>
> > >> >> >>> The table layer abstracts both sync and async lookup
> > capabilities,
> > >> >> >>> connectors implementers can choose one or both, in the case of
> > >> >> >> implementing
> > >> >> >>> only one capability (the status of most existing builtin
> > >> >> >>> connectors),
> > >> >> >>> 'lookup.async' will not be used. And when a connector has both
> > >> >> >>> capabilities, I think this choice is more suitable for making
> > >> >> decisions
> > >> >> >> at
> > >> >> >>> the query level, for example, table planner can choose the
> > physical
> > >> >> >>> implementation of async lookup or sync lookup based on its cost
> > >> >> model, or
> > >> >> >>> users can give query hint based on their own better
> > >> understanding.  If
> > >> >> >>> there is another common table option 'lookup.async', it may
> > confuse
> > >> >> the
> > >> >> >>> users in the long run.
> > >> >> >>>
> > >> >> >>> So, I prefer to leave the 'lookup.async' option in private
> place
> > >> (for
> > >> >> the
> > >> >> >>> current hbase connector) and not turn it into a common option.
> > >> >> >>>
> > >> >> >>> WDYT?
> > >> >> >>>
> > >> >> >>> Best,
> > >> >> >>> Lincoln Lee
> > >> >> >>>
> > >> >> >>>
> > >> >> >>> Qingsheng Ren <re...@gmail.com> 于2022年5月23日周一 14:54写道:
> > >> >> >>>
> > >> >> >>>> Hi Alexander,
> > >> >> >>>>
> > >> >> >>>> Thanks for the review! We recently updated the FLIP and you
> can
> > >> find
> > >> >> >>> those
> > >> >> >>>> changes in my latest email. Since some terminology has changed,
> > >> >> >>>> I’ll use the new concepts when replying to your comments.
> > >> >> >>>>
> > >> >> >>>> 1. Builder vs ‘of’
> > >> >> >>>> I’m OK to use builder pattern if we have additional optional
> > >> >> parameters
> > >> >> >>>> for full caching mode (“rescan” previously). The
> > >> schedule-with-delay
> > >> >> >> idea
> > >> >> >>>> looks reasonable to me, but I think we need to redesign the
> > >> builder
> > >> >> API
> > >> >> >>> of
> > >> >> >>>> full caching to make it more descriptive for developers. Would
> > you
> > >> >> mind
> > >> >> >>>> sharing your ideas about the API? For accessing the FLIP
> > workspace
> > >> >> you
> > >> >> >>> can
> > >> >> >>>> just provide your account ID and ping any PMC member including
> > >> Jark.
> > >> >> >>>>
> > >> >> >>>> 2. Common table options
> > >> >> >>>> We have some discussions these days and propose to introduce 8
> > >> common
> > >> >> >>>> table options about caching. It has been updated on the FLIP.
> > >> >> >>>>
> > >> >> >>>> 3. Retries
> > >> >> >>>> I think we are on the same page :-)
> > >> >> >>>>
> > >> >> >>>> For your additional concerns:
> > >> >> >>>> 1) The table option has been updated.
> > >> >> >>>> 2) We got “lookup.cache” back for configuring whether to use
> > >> partial
> > >> >> or
> > >> >> >>>> full caching mode.
> > >> >> >>>>
> > >> >> >>>> Best regards,
> > >> >> >>>>
> > >> >> >>>> Qingsheng
> > >> >> >>>>
> > >> >> >>>>
> > >> >> >>>>
> > >> >> >>>>> On May 19, 2022, at 17:25, Александр Смирнов <
> > >> smiralexan@gmail.com>
> > >> >> >>>> wrote:
> > >> >> >>>>>
> > >> >> >>>>> Also I have a few additions:
> > >> >> >>>>> 1) maybe rename 'lookup.cache.maximum-size' to
> > >> >> >>>>> 'lookup.cache.max-rows'? I think it will be more clear that
> we
> > >> talk
> > >> >> >>>>> not about bytes, but about the number of rows. Plus it fits
> > more,
> > >> >> >>>>> considering my optimization with filters.
> > >> >> >>>>> 2) How will users enable rescanning? Are we going to separate
> > >> >> caching
> > >> >> >>>>> and rescanning from the options point of view? Like initially
> > we
> > >> had
> > >> >> >>>>> one option 'lookup.cache' with values LRU / ALL. I think now
> we
> > >> can
> > >> >> >>>>> make a boolean option 'lookup.rescan'. RescanInterval can be
> > >> >> >>>>> 'lookup.rescan.interval', etc.
> > >> >> >>>>>
> > >> >> >>>>> Best regards,
> > >> >> >>>>> Alexander
> > >> >> >>>>>
> > >> >> >>>>> чт, 19 мая 2022 г. в 14:50, Александр Смирнов <
> > >> smiralexan@gmail.com
> > >> >> >>> :
> > >> >> >>>>>>
> > >> >> >>>>>> Hi Qingsheng and Jark,
> > >> >> >>>>>>
> > >> >> >>>>>> 1. Builders vs 'of'
> > >> >> >>>>>> I understand that builders are used when we have multiple
> > >> >> >> parameters.
> > >> >> >>>>>> I suggested them because we could add parameters later. To
> > >> prevent
> > >> >> >>>>>> Builder for ScanRuntimeProvider from looking redundant I can
> > >> >> suggest
> > >> >> >>>>>> one more config now - "rescanStartTime".
> > >> >> >>>>>> It's a time in UTC (LocalTime class) when the first reload
> of
> > >> cache
> > >> >> >>>>>> starts. This parameter can be thought of as 'initialDelay'
> > (diff
> > >> >> >>>>>> between current time and rescanStartTime) in method
> > >> >> >>>>>> ScheduledExecutorService#scheduleWithFixedDelay [1]. It can
> be
> > >> very
> > >> >> >>>>>> useful when the dimension table is updated by some other
> > >> scheduled
> > >> >> >> job
> > >> >> >>>>>> at a certain time. Or when the user simply wants a second
> scan
> > >> >> >> (first
> > >> >> >>>>>> cache reload) be delayed. This option can be used even
> without
> > >> >> >>>>>> 'rescanInterval' - in this case 'rescanInterval' will be one
> > >> day.
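The 'initialDelay' arithmetic described here (the difference between the current time and rescanStartTime, wrapping to the next day once the start time has passed) can be sketched as follows; the class name is hypothetical:

```java
import java.time.Duration;
import java.time.LocalTime;

// Sketch: compute the delay before the first cache reload, per the
// rescanStartTime idea above.
public class RescanDelay {
    public static Duration initialDelay(LocalTime now, LocalTime rescanStartTime) {
        Duration delay = Duration.between(now, rescanStartTime);
        if (delay.isNegative()) {
            delay = delay.plus(Duration.ofDays(1)); // start time already passed; first reload runs tomorrow
        }
        return delay;
    }
}
```

The result would feed straight into `scheduleWithFixedDelay(task, initialDelay, rescanInterval, unit)`, with rescanInterval defaulting to one day when unset.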
> > >> >> >>>>>> If you are fine with this option, I would be very glad if
> you
> > >> would
> > >> >> >>>>>> give me access to edit FLIP page, so I could add it myself
> > >> >> >>>>>>
> > >> >> >>>>>> 2. Common table options
> > >> >> >>>>>> I also think that FactoryUtil would be overloaded by all
> cache
> > >> >> >>>>>> options. But maybe unify all suggested options, not only for
> > >> >> default
> > >> >> >>>>>> cache? I.e. class 'LookupOptions', that unifies default
> cache
> > >> >> >> options,
> > >> >> >>>>>> rescan options, 'async', 'maxRetries'. WDYT?
> > >> >> >>>>>>
> > >> >> >>>>>> 3. Retries
> > >> >> >>>>>> I'm fine with suggestion close to RetryUtils#tryTimes(times,
> > >> call)
> > >> >> >>>>>>
> > >> >> >>>>>> [1]
> > >> >> >>>>
> > >> >> >>>
> > >> >> >>
> > >> >>
> > >>
> >
> https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/ScheduledExecutorService.html#scheduleWithFixedDelay-java.lang.Runnable-long-long-java.util.concurrent.TimeUnit-
> > >> >> >>>>>>
> > >> >> >>>>>> Best regards,
> > >> >> >>>>>> Alexander
> > >> >> >>>>>>
> > >> >> >>>>>> ср, 18 мая 2022 г. в 16:04, Qingsheng Ren <
> renqschn@gmail.com
> > >:
> > >> >> >>>>>>>
> > >> >> >>>>>>> Hi Jark and Alexander,
> > >> >> >>>>>>>
> > >> >> >>>>>>> Thanks for your comments! I’m also OK to introduce common
> > table
> > >> >> >>>> options. I prefer to introduce a new DefaultLookupCacheOptions
> > >> class
> > >> >> >> for
> > >> >> >>>> holding these option definitions because putting all options
> > into
> > >> >> >>>> FactoryUtil would make it a bit ”crowded” and not well
> > >> categorized.
> > >> >> >>>>>>>
> > >> >> >>>>>>> FLIP has been updated according to suggestions above:
> > >> >> >>>>>>> 1. Use static “of” method for constructing
> > >> RescanRuntimeProvider
> > >> >> >>>> considering both arguments are required.
> > >> >> >>>>>>> 2. Introduce new table options matching
> > >> DefaultLookupCacheFactory
> > >> >> >>>>>>>
> > >> >> >>>>>>> Best,
> > >> >> >>>>>>> Qingsheng
> > >> >> >>>>>>>
> > >> >> >>>>>>> On Wed, May 18, 2022 at 2:57 PM Jark Wu <im...@gmail.com>
> > >> wrote:
> > >> >> >>>>>>>>
> > >> >> >>>>>>>> Hi Alex,
> > >> >> >>>>>>>>
> > >> >> >>>>>>>> 1) retry logic
> > >> >> >>>>>>>> I think we can extract some common retry logic into
> > utilities,
> > >> >> >> e.g.
> > >> >> >>>> RetryUtils#tryTimes(times, call).
> > >> >> >>>>>>>> This seems independent of this FLIP and can be reused by
> > >> >> >> DataStream
> > >> >> >>>> users.
> > >> >> >>>>>>>> Maybe we can open an issue to discuss this and where to
> put
> > >> it.
> > >> >> >>>>>>>>
> > >> >> >>>>>>>> 2) cache ConfigOptions
> > >> >> >>>>>>>> I'm fine with defining cache config options in the
> > framework.
> > >> >> >>>>>>>> A candidate place to put is FactoryUtil which also
> includes
> > >> >> >>>> "sink.parallelism", "format" options.
> > >> >> >>>>>>>>
> > >> >> >>>>>>>> Best,
> > >> >> >>>>>>>> Jark
> > >> >> >>>>>>>>
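To make the RetryUtils#tryTimes(times, call) idea above concrete, a minimal sketch could look like the following. The class and method names are only the shorthand used in this thread; no such utility exists in Flink at this point:

```java
import java.util.concurrent.Callable;

/** Sketch of the retry utility discussed above; names are hypothetical. */
final class RetryUtils {

    private RetryUtils() {
    }

    /**
     * Invokes 'call' up to 'times' times, returning the first successful
     * result and rethrowing the last failure if every attempt fails.
     * A real implementation would likely also filter retriable exceptions
     * and support backoff between attempts.
     */
    static <T> T tryTimes(int times, Callable<T> call) throws Exception {
        if (times <= 0) {
            throw new IllegalArgumentException("times must be positive");
        }
        Exception lastFailure = null;
        for (int attempt = 0; attempt < times; attempt++) {
            try {
                return call.call();
            } catch (Exception e) {
                lastFailure = e;
            }
        }
        throw lastFailure;
    }
}
```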
> > >> >> >>>>>>>>
> > >> >> >>>>>>>> On Wed, 18 May 2022 at 13:52, Александр Смирнов <
> > >> >> >>> smiralexan@gmail.com>
> > >> >> >>>> wrote:
> > >> >> >>>>>>>>>
> > >> >> >>>>>>>>> Hi Qingsheng,
> > >> >> >>>>>>>>>
> > >> >> >>>>>>>>> Thank you for considering my comments.
> > >> >> >>>>>>>>>
> > >> >> >>>>>>>>>> there might be custom logic before making retry, such as
> > >> >> >>>> re-establish the connection
> > >> >> >>>>>>>>>
> > >> >> >>>>>>>>> Yes, I understand that. I meant that such logic can be
> > >> placed in
> > >> >> >> a
> > >> >> >>>>>>>>> separate function, that can be implemented by connectors.
> > >> Just
> > >> >> >>> moving
> > >> >> >>>>>>>>> the retry logic would make connector's LookupFunction
> more
> > >> >> >> concise
> > >> >> >>> +
> > >> >> >>>>>>>>> avoid duplicate code. However, it's a minor change. The
> > >> decision
> > >> >> >> is
> > >> >> >>>> up
> > >> >> >>>>>>>>> to you.
> > >> >> >>>>>>>>>
> > >> >> >>>>>>>>>> We decided not to provide common DDL options and to let
> > >> >> >>>>>>>>>> developers define their own options per connector, as we do now.
> > >> >> >>>>>>>>>
> > >> >> >>>>>>>>> What is the reason for that? One of the main goals of
> this
> > >> FLIP
> > >> >> >> was
> > >> >> >>>> to
> > >> >> >>>>>>>>> unify the configs, wasn't it? I understand that current
> > cache
> > >> >> >>> design
> > >> >> >>>>>>>>> doesn't depend on ConfigOptions, like was before. But
> still
> > >> we
> > >> >> >> can
> > >> >> >>>> put
> > >> >> >>>>>>>>> these options into the framework, so connectors can reuse
> > >> them
> > >> >> >> and
> > >> >> >>>>>>>>> avoid code duplication, and, what is more significant,
> > avoid
> > >> >> >>> possible
> > >> >> >>>>>>>>> different options naming. This moment can be pointed out
> in
> > >> >> >>>>>>>>> documentation for connector developers.
> > >> >> >>>>>>>>>
> > >> >> >>>>>>>>> Best regards,
> > >> >> >>>>>>>>> Alexander
> > >> >> >>>>>>>>>
> > >> >> >>>>>>>>> вт, 17 мая 2022 г. в 17:11, Qingsheng Ren <
> > >> renqschn@gmail.com>:
> > >> >> >>>>>>>>>>
> > >> >> >>>>>>>>>> Hi Alexander,
> > >> >> >>>>>>>>>>
> > >> >> >>>>>>>>>> Thanks for the review and glad to see we are on the same
> > >> page!
> > >> >> I
> > >> >> >>>> think you forgot to cc the dev mailing list so I’m also
> quoting
> > >> your
> > >> >> >>> reply
> > >> >> >>>> under this email.
> > >> >> >>>>>>>>>>
> > >> >> >>>>>>>>>>> We can add 'maxRetryTimes' option into this class
> > >> >> >>>>>>>>>>
> > >> >> >>>>>>>>>> In my opinion the retry logic should be implemented in
> > >> lookup()
> > >> >> >>>> instead of in LookupFunction#eval(). Retrying is only
> meaningful
> > >> >> under
> > >> >> >>> some
> > >> >> >>>> specific retriable failures, and there might be custom logic
> > >> before
> > >> >> >>> making
> > >> >> >>>> retry, such as re-establish the connection
> > >> (JdbcRowDataLookupFunction
> > >> >> >> is
> > >> >> >>> an
> > >> >> >>>> example), so it's more handy to leave it to the connector.
> > >> >> >>>>>>>>>>
> > >> >> >>>>>>>>>>> I don't see DDL options, that were in previous version
> of
> > >> >> FLIP.
> > >> >> >>> Do
> > >> >> >>>> you have any special plans for them?
> > >> >> >>>>>>>>>>
> > >> >> >>>>>>>>>> We decided not to provide common DDL options and to let
> > >> >> >>>>>>>>>> developers define their own options per connector, as we do now.
> > >> >> >>>>>>>>>>
> > >> >> >>>>>>>>>> The rest of comments sound great and I’ll update the
> FLIP.
> > >> Hope
> > >> >> >> we
> > >> >> >>>> can finalize our proposal soon!
> > >> >> >>>>>>>>>>
> > >> >> >>>>>>>>>> Best,
> > >> >> >>>>>>>>>>
> > >> >> >>>>>>>>>> Qingsheng
> > >> >> >>>>>>>>>>
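A rough, framework-independent sketch of the point above: the retry loop lives inside lookup(), with a connector-specific recovery hook (such as re-establishing a connection, as JdbcRowDataLookupFunction does) invoked before each retry. The class and method names here are illustrative and are not the actual FLIP-221 interfaces:

```java
import java.io.IOException;
import java.util.Collection;

/** Illustrative stand-in for the proposed LookupFunction contract. */
abstract class SketchLookupFunction<K, R> {

    private final int maxRetryTimes;

    SketchLookupFunction(int maxRetryTimes) {
        this.maxRetryTimes = maxRetryTimes;
    }

    /** A single lookup attempt against the external system. */
    protected abstract Collection<R> lookupOnce(K key) throws IOException;

    /** Connector-specific recovery before a retry, e.g. reconnecting. */
    protected void recover() throws IOException {
    }

    /** Retry loop kept inside the lookup path, as discussed above. */
    Collection<R> lookup(K key) throws IOException {
        IOException lastFailure = null;
        for (int attempt = 0; attempt <= maxRetryTimes; attempt++) {
            try {
                return lookupOnce(key);
            } catch (IOException e) {
                lastFailure = e;
                if (attempt < maxRetryTimes) {
                    recover(); // custom logic before the next attempt
                }
            }
        }
        throw lastFailure;
    }
}
```

Keeping retries here (rather than in a generic eval wrapper) lets each connector decide which failures are retriable and what recovery work is needed between attempts.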
> > >> >> >>>>>>>>>>
> > >> >> >>>>>>>>>>> On May 17, 2022, at 13:46, Александр Смирнов <
> > >> >> >>> smiralexan@gmail.com>
> > >> >> >>>> wrote:
> > >> >> >>>>>>>>>>>
> > >> >> >>>>>>>>>>> Hi Qingsheng and devs!
> > >> >> >>>>>>>>>>>
> > >> >> >>>>>>>>>>> I like the overall design of updated FLIP, however I
> have
> > >> >> >> several
> > >> >> >>>>>>>>>>> suggestions and questions.
> > >> >> >>>>>>>>>>>
> > >> >> >>>>>>>>>>> 1) Introducing LookupFunction as a subclass of
> > >> TableFunction
> > >> >> >> is a
> > >> >> >>>> good
> > >> >> >>>>>>>>>>> idea. We can add 'maxRetryTimes' option into this
> class.
> > >> >> 'eval'
> > >> >> >>>> method
> > >> >> >>>>>>>>>>> of new LookupFunction is great for this purpose. The
> same
> > >> is
> > >> >> >> for
> > >> >> >>>>>>>>>>> 'async' case.
> > >> >> >>>>>>>>>>>
> > >> >> >>>>>>>>>>> 2) There might be other configs in future, such as
> > >> >> >>>> 'cacheMissingKey'
> > >> >> >>>>>>>>>>> in LookupFunctionProvider or 'rescanInterval' in
> > >> >> >>>> ScanRuntimeProvider.
> > >> >> >>>>>>>>>>> Maybe use Builder pattern in LookupFunctionProvider and
> > >> >> >>>>>>>>>>> RescanRuntimeProvider for more flexibility (use one
> > 'build'
> > >> >> >>> method
> > >> >> >>>>>>>>>>> instead of many 'of' methods in future)?
> > >> >> >>>>>>>>>>>
> > >> >> >>>>>>>>>>> 3) What are the plans for existing
> TableFunctionProvider
> > >> and
> > >> >> >>>>>>>>>>> AsyncTableFunctionProvider? I think they should be
> > >> deprecated.
> > >> >> >>>>>>>>>>>
> > >> >> >>>>>>>>>>> 4) Am I right that the current design does not assume
> > >> usage of
> > >> >> >>>>>>>>>>> user-provided LookupCache in re-scanning? In this case,
> > it
> > >> is
> > >> >> >> not
> > >> >> >>>> very
> > >> >> >>>>>>>>>>> clear why do we need methods such as 'invalidate' or
> > >> 'putAll'
> > >> >> >> in
> > >> >> >>>>>>>>>>> LookupCache.
> > >> >> >>>>>>>>>>>
> > >> >> >>>>>>>>>>> 5) I don't see DDL options, that were in previous
> version
> > >> of
> > >> >> >>> FLIP.
> > >> >> >>>> Do
> > >> >> >>>>>>>>>>> you have any special plans for them?
> > >> >> >>>>>>>>>>>
> > >> >> >>>>>>>>>>> If you don't mind, I would be glad to be able to make
> > small
> > >> >> >>>>>>>>>>> adjustments to the FLIP document too. I think it's
> worth
> > >> >> >>> mentioning
> > >> >> >>>>>>>>>>> about what exactly optimizations are planning in the
> > >> future.
> > >> >> >>>>>>>>>>>
> > >> >> >>>>>>>>>>> Best regards,
> > >> >> >>>>>>>>>>> Smirnov Alexander
> > >> >> >>>>>>>>>>>
> > >> >> >>>>>>>>>>> пт, 13 мая 2022 г. в 20:27, Qingsheng Ren <
> > >> renqschn@gmail.com
> > >> >> >>> :
> > >> >> >>>>>>>>>>>>
> > >> >> >>>>>>>>>>>> Hi Alexander and devs,
> > >> >> >>>>>>>>>>>>
> > >> >> >>>>>>>>>>>> Thank you very much for the in-depth discussion! As Jark
> > >> >> >>>>>>>>>>>> mentioned, we were inspired by Alexander's idea and have
> > >> >> >>>>>>>>>>>> refactored our design. FLIP-221 [1] has been updated to
> > >> >> >>>>>>>>>>>> reflect the new design, and we are happy to hear more
> > >> >> >>>>>>>>>>>> suggestions from you!
> > >> >> >>>>>>>>>>>>
> > >> >> >>>>>>>>>>>> Compared to the previous design:
> > >> >> >>>>>>>>>>>> 1. The lookup cache serves at table runtime level and
> is
> > >> >> >>>> integrated as a component of LookupJoinRunner as discussed
> > >> >> previously.
> > >> >> >>>>>>>>>>>> 2. Interfaces are renamed and re-designed to reflect
> the
> > >> new
> > >> >> >>>> design.
> > >> >> >>>>>>>>>>>> 3. We handle the all-caching case separately and introduce a
> > >> >> >>>>>>>>>>>> new RescanRuntimeProvider to reuse the scanning ability. We
> > >> >> >>>>>>>>>>>> plan to support SourceFunction / InputFormat for now,
> > >> >> >>>>>>>>>>>> considering the complexity of the FLIP-27 Source API.
> > >> >> >>>>>>>>>>>> 4. A new interface LookupFunction is introduced to
> make
> > >> the
> > >> >> >>>> semantic of lookup more straightforward for developers.
> > >> >> >>>>>>>>>>>>
> > >> >> >>>>>>>>>>>> For replying to Alexander:
> > >> >> >>>>>>>>>>>>> However I'm a little confused whether InputFormat is
> > >> >> >> deprecated
> > >> >> >>>> or not. Am I right that it will be so in the future, but
> > currently
> > >> >> it's
> > >> >> >>> not?
> > >> >> >>>>>>>>>>>> Yes you are right. InputFormat is not deprecated for
> > now.
> > >> I
> > >> >> >>> think
> > >> >> >>>> it will be deprecated in the future but we don't have a clear
> > plan
> > >> >> for
> > >> >> >>> that.
> > >> >> >>>>>>>>>>>>
> > >> >> >>>>>>>>>>>> Thanks again for the discussion on this FLIP and
> looking
> > >> >> >> forward
> > >> >> >>>> to cooperating with you after we finalize the design and
> > >> interfaces!
> > >> >> >>>>>>>>>>>>
> > >> >> >>>>>>>>>>>> [1]
> > >> >> >>>>
> > >> >> >>>
> > >> >> >>
> > >> >>
> > >>
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-221+Abstraction+for+lookup+source+cache+and+metric
> > >> >> >>>>>>>>>>>>
> > >> >> >>>>>>>>>>>> Best regards,
> > >> >> >>>>>>>>>>>>
> > >> >> >>>>>>>>>>>> Qingsheng
> > >> >> >>>>>>>>>>>>
> > >> >> >>>>>>>>>>>>
> > >> >> >>>>>>>>>>>> On Fri, May 13, 2022 at 12:12 AM Александр Смирнов <
> > >> >> >>>> smiralexan@gmail.com> wrote:
> > >> >> >>>>>>>>>>>>>
> > >> >> >>>>>>>>>>>>> Hi Jark, Qingsheng and Leonard!
> > >> >> >>>>>>>>>>>>>
> > >> >> >>>>>>>>>>>>> Glad to see that we came to a consensus on almost all
> > >> >> points!
> > >> >> >>>>>>>>>>>>>
> > >> >> >>>>>>>>>>>>> However I'm a little confused whether InputFormat is
> > >> >> >> deprecated
> > >> >> >>>> or
> > >> >> >>>>>>>>>>>>> not. Am I right that it will be so in the future, but
> > >> >> >> currently
> > >> >> >>>> it's
> > >> >> >>>>>>>>>>>>> not? Actually I also think that for the first version it's
> > >> >> >>>>>>>>>>>>> OK to use InputFormat in the ALL cache implementation,
> > >> >> >>>>>>>>>>>>> because supporting a rescan ability seems like a very
> > >> >> >>>>>>>>>>>>> distant prospect. But for this decision we need a consensus
> > >> >> >>>>>>>>>>>>> among all discussion participants.
> > >> >> >>>>>>>>>>>>>
> > >> >> >>>>>>>>>>>>> In general, I don't have something to argue with your
> > >> >> >>>> statements. All
> > >> >> >>>>>>>>>>>>> of them correspond my ideas. Looking ahead, it would
> be
> > >> nice
> > >> >> >> to
> > >> >> >>>> work
> > >> >> >>>>>>>>>>>>> on this FLIP cooperatively. I've already done a lot
> of
> > >> work
> > >> >> >> on
> > >> >> >>>> lookup
> > >> >> >>>>>>>>>>>>> join caching with realization very close to the one
> we
> > >> are
> > >> >> >>>> discussing,
> > >> >> >>>>>>>>>>>>> and want to share the results of this work. Anyway
> > >> looking
> > >> >> >>>> forward for
> > >> >> >>>>>>>>>>>>> the FLIP update!
> > >> >> >>>>>>>>>>>>>
> > >> >> >>>>>>>>>>>>> Best regards,
> > >> >> >>>>>>>>>>>>> Smirnov Alexander
> > >> >> >>>>>>>>>>>>>
> > >> >> >>>>>>>>>>>>> чт, 12 мая 2022 г. в 17:38, Jark Wu <
> imjark@gmail.com
> > >:
> > >> >> >>>>>>>>>>>>>>
> > >> >> >>>>>>>>>>>>>> Hi Alex,
> > >> >> >>>>>>>>>>>>>>
> > >> >> >>>>>>>>>>>>>> Thanks for summarizing your points.
> > >> >> >>>>>>>>>>>>>>
> > >> >> >>>>>>>>>>>>>> In the past week, Qingsheng, Leonard, and I have
> > >> discussed
> > >> >> >> it
> > >> >> >>>> several times
> > >> >> >>>>>>>>>>>>>> and we have totally refactored the design.
> > >> >> >>>>>>>>>>>>>> I'm glad to say we have reached a consensus on many
> of
> > >> your
> > >> >> >>>> points!
> > >> >> >>>>>>>>>>>>>> Qingsheng is still working on updating the design
> docs
> > >> and
> > >> >> >>>> maybe can be
> > >> >> >>>>>>>>>>>>>> available in the next few days.
> > >> >> >>>>>>>>>>>>>> I will share some conclusions from our discussions:
> > >> >> >>>>>>>>>>>>>>
> > >> >> >>>>>>>>>>>>>> 1) we have refactored the design towards the "cache in
> > >> >> >>>>>>>>>>>>>> framework" way.
> > >> >> >>>>>>>>>>>>>>
> > >> >> >>>>>>>>>>>>>> 2) a "LookupCache" interface for users to customize, and a
> > >> >> >>>>>>>>>>>>>> default implementation with a builder for ease of use.
> > >> >> >>>>>>>>>>>>>> This makes it possible to have both flexibility and
> > >> >> >>>>>>>>>>>>>> conciseness.
> > >> >> >>>>>>>>>>>>>>
> > >> >> >>>>>>>>>>>>>> 3) Filter pushdown is important for ALL and LRU
> lookup
> > >> >> >> cache,
> > >> >> >>>> esp reducing
> > >> >> >>>>>>>>>>>>>> IO.
> > >> >> >>>>>>>>>>>>>> Filter pushdown should be the final state and the
> > >> unified
> > >> >> >> way
> > >> >> >>>> to both
> > >> >> >>>>>>>>>>>>>> support pruning ALL cache and LRU cache,
> > >> >> >>>>>>>>>>>>>> so I think we should make effort in this direction.
> If
> > >> we
> > >> >> >> need
> > >> >> >>>> to support
> > >> >> >>>>>>>>>>>>>> filter pushdown for ALL cache anyway, why not use
> > >> >> >>>>>>>>>>>>>> it for LRU cache as well? Either way, as we decide
> to
> > >> >> >>> implement
> > >> >> >>>> the cache
> > >> >> >>>>>>>>>>>>>> in the framework, we have the chance to support
> > >> >> >>>>>>>>>>>>>> filter on cache anytime. This is an optimization and
> > it
> > >> >> >>> doesn't
> > >> >> >>>> affect the
> > >> >> >>>>>>>>>>>>>> public API. I think we can create a JIRA issue to
> > >> >> >>>>>>>>>>>>>> discuss it when the FLIP is accepted.
> > >> >> >>>>>>>>>>>>>>
> > >> >> >>>>>>>>>>>>>> 4) The idea to support ALL cache is similar to your
> > >> >> >> proposal.
> > >> >> >>>>>>>>>>>>>> In the first version, we will only support
> > InputFormat,
> > >> >> >>>> SourceFunction for
> > >> >> >>>>>>>>>>>>>> cache all (invoke InputFormat in join operator).
> > >> >> >>>>>>>>>>>>>> For FLIP-27 source, we need to join a true source
> > >> operator
> > >> >> >>>> instead of
> > >> >> >>>>>>>>>>>>>> calling it embedded in the join operator.
> > >> >> >>>>>>>>>>>>>> However, this needs another FLIP to support the
> > re-scan
> > >> >> >>> ability
> > >> >> >>>> for FLIP-27
> > >> >> >>>>>>>>>>>>>> Source, and this can be a large work.
> > >> >> >>>>>>>>>>>>>> In order to not block this issue, we can put the
> > effort
> > >> of
> > >> >> >>>> FLIP-27 source
> > >> >> >>>>>>>>>>>>>> integration into future work and integrate
> > >> >> >>>>>>>>>>>>>> InputFormat&SourceFunction for now.
> > >> >> >>>>>>>>>>>>>>
> > >> >> >>>>>>>>>>>>>> I think it's fine to use InputFormat&SourceFunction,
> > as
> > >> >> they
> > >> >> >>>> are not
> > >> >> >>>>>>>>>>>>>> deprecated, otherwise, we have to introduce another
> > >> >> function
> > >> >> >>>>>>>>>>>>>> similar to them which is meaningless. We need to
> plan
> > >> >> >> FLIP-27
> > >> >> >>>> source
> > >> >> >>>>>>>>>>>>>> integration ASAP before InputFormat & SourceFunction
> > are
> > >> >> >>>> deprecated.
> > >> >> >>>>>>>>>>>>>>
> > >> >> >>>>>>>>>>>>>> Best,
> > >> >> >>>>>>>>>>>>>> Jark
> > >> >> >>>>>>>>>>>>>>
> > >> >> >>>>>>>>>>>>>>
> > >> >> >>>>>>>>>>>>>>
> > >> >> >>>>>>>>>>>>>>
> > >> >> >>>>>>>>>>>>>>
> > >> >> >>>>>>>>>>>>>> On Thu, 12 May 2022 at 15:46, Александр Смирнов <
> > >> >> >>>> smiralexan@gmail.com>
> > >> >> >>>>>>>>>>>>>> wrote:
> > >> >> >>>>>>>>>>>>>>
> > >> >> >>>>>>>>>>>>>>> Hi Martijn!
> > >> >> >>>>>>>>>>>>>>>
> > >> >> >>>>>>>>>>>>>>> Got it. Therefore, the implementation with InputFormat is
> > >> >> >>>>>>>>>>>>>>> not considered.
> > >> >> >>>>>>>>>>>>>>> Thanks for clearing that up!
> > >> >> >>>>>>>>>>>>>>>
> > >> >> >>>>>>>>>>>>>>> Best regards,
> > >> >> >>>>>>>>>>>>>>> Smirnov Alexander
> > >> >> >>>>>>>>>>>>>>>
> > >> >> >>>>>>>>>>>>>>> чт, 12 мая 2022 г. в 14:23, Martijn Visser <
> > >> >> >>>> martijn@ververica.com>:
> > >> >> >>>>>>>>>>>>>>>>
> > >> >> >>>>>>>>>>>>>>>> Hi,
> > >> >> >>>>>>>>>>>>>>>>
> > >> >> >>>>>>>>>>>>>>>> With regards to:
> > >> >> >>>>>>>>>>>>>>>>
> > >> >> >>>>>>>>>>>>>>>>> But if there are plans to refactor all connectors
> > to
> > >> >> >>> FLIP-27
> > >> >> >>>>>>>>>>>>>>>>
> > >> >> >>>>>>>>>>>>>>>> Yes, FLIP-27 is the target for all connectors. The
> > old
> > >> >> >>>> interfaces will be
> > >> >> >>>>>>>>>>>>>>>> deprecated and connectors will either be
> refactored
> > to
> > >> >> use
> > >> >> >>>> the new ones
> > >> >> >>>>>>>>>>>>>>> or
> > >> >> >>>>>>>>>>>>>>>> dropped.
> > >> >> >>>>>>>>>>>>>>>>
> > >> >> >>>>>>>>>>>>>>>> The caching should work for connectors that are
> > using
> > >> >> >>> FLIP-27
> > >> >> >>>> interfaces,
> > >> >> >>>>>>>>>>>>>>>> we should not introduce new features for old
> > >> interfaces.
> > >> >> >>>>>>>>>>>>>>>>
> > >> >> >>>>>>>>>>>>>>>> Best regards,
> > >> >> >>>>>>>>>>>>>>>>
> > >> >> >>>>>>>>>>>>>>>> Martijn
> > >> >> >>>>>>>>>>>>>>>>
> > >> >> >>>>>>>>>>>>>>>> On Thu, 12 May 2022 at 06:19, Александр Смирнов <
> > >> >> >>>> smiralexan@gmail.com>
> > >> >> >>>>>>>>>>>>>>>> wrote:
> > >> >> >>>>>>>>>>>>>>>>
> > >> >> >>>>>>>>>>>>>>>>> Hi Jark!
> > >> >> >>>>>>>>>>>>>>>>>
> > >> >> >>>>>>>>>>>>>>>>> Sorry for the late response. I would like to make
> > >> some
> > >> >> >>>> comments and
> > >> >> >>>>>>>>>>>>>>>>> clarify my points.
> > >> >> >>>>>>>>>>>>>>>>>
> > >> >> >>>>>>>>>>>>>>>>> 1) I agree with your first statement. I think we
> > can
> > >> >> >>> achieve
> > >> >> >>>> both
> > >> >> >>>>>>>>>>>>>>>>> advantages this way: put the Cache interface in
> > >> >> >>>> flink-table-common,
> > >> >> >>>>>>>>>>>>>>>>> but have implementations of it in
> > >> flink-table-runtime.
> > >> >> >>>> Therefore if a
> > >> >> >>>>>>>>>>>>>>>>> connector developer wants to use existing cache
> > >> >> >> strategies
> > >> >> >>>> and their
> > >> >> >>>>>>>>>>>>>>>>> implementations, he can just pass lookupConfig to
> > the
> > >> >> >>>> planner, but if
> > >> >> >>>>>>>>>>>>>>>>> he wants to have its own cache implementation in
> > his
> > >> >> >>>> TableFunction, it
> > >> >> >>>>>>>>>>>>>>>>> will be possible for him to use the existing
> > >> interface
> > >> >> >> for
> > >> >> >>>> this
> > >> >> >>>>>>>>>>>>>>>>> purpose (we can explicitly point this out in the
> > >> >> >>>> documentation). In
> > >> >> >>>>>>>>>>>>>>>>> this way all configs and metrics will be unified.
> > >> WDYT?
> > >> >> >>>>>>>>>>>>>>>>>
> > >> >> >>>>>>>>>>>>>>>>>> If a filter can prune 90% of data in the cache,
> we
> > >> will
> > >> >> >>>> have 90% of
> > >> >> >>>>>>>>>>>>>>>>> lookup requests that can never be cached
> > >> >> >>>>>>>>>>>>>>>>>
> > >> >> >>>>>>>>>>>>>>>>> 2) Let me clarify the logic filters optimization
> in
> > >> case
> > >> >> >> of
> > >> >> >>>> LRU cache.
> > >> >> >>>>>>>>>>>>>>>>> It looks like Cache<RowData,
> Collection<RowData>>.
> > >> Here
> > >> >> >> we
> > >> >> >>>> always
> > >> >> >>>>>>>>>>>>>>>>> store the response of the dimension table in
> cache,
> > >> even
> > >> >> >>>> after
> > >> >> >>>>>>>>>>>>>>>>> applying calc function. I.e. if there are no rows
> > >> after
> > >> >> >>>> applying
> > >> >> >>>>>>>>>>>>>>>>> filters to the result of the 'eval' method of
> > >> >> >>> TableFunction,
> > >> >> >>>> we store
> > >> >> >>>>>>>>>>>>>>>>> the empty list by lookup keys. Therefore the
> cache
> > >> line
> > >> >> >>> will
> > >> >> >>>> be
> > >> >> >>>>>>>>>>>>>>>>> filled, but will require much less memory (in
> > bytes).
> > >> >> >> I.e.
> > >> >> >>>> we don't
> > >> >> >>>>>>>>>>>>>>>>> completely filter keys, by which result was
> pruned,
> > >> but
> > >> >> >>>> significantly
> > >> >> >>>>>>>>>>>>>>>>> reduce required memory to store this result. If
> the
> > >> user
> > >> >> >>>> knows about
> > >> >> >>>>>>>>>>>>>>>>> this behavior, he can increase the 'max-rows'
> > option
> > >> >> >> before
> > >> >> >>>> the start
> > >> >> >>>>>>>>>>>>>>>>> of the job. But actually I came up with the idea
> > >> that we
> > >> >> >>> can
> > >> >> >>>> do this
> > >> >> >>>>>>>>>>>>>>>>> automatically by using the 'maximumWeight' and
> > >> 'weigher'
> > >> >> >>>> methods of
> > >> >> >>>>>>>>>>>>>>>>> GuavaCache [1]. Weight can be the size of the
> > >> collection
> > >> >> >> of
> > >> >> >>>> rows
> > >> >> >>>>>>>>>>>>>>>>> (value of cache). Therefore cache can
> automatically
> > >> fit
> > >> >> >>> much
> > >> >> >>>> more
> > >> >> >>>>>>>>>>>>>>>>> records than before.
> > >> >> >>>>>>>>>>>>>>>>>
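To illustrate the 'maximumWeight'/'weigher' idea from [1] above without pulling in Guava, here is a toy cache bounded by the total number of cached rows rather than the number of keys, so that empty results for pruned keys cost almost nothing. This is only a sketch of the eviction idea (with simple FIFO eviction instead of Guava's policy), not a proposed interface:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy cache bounded by total row count (a "weight"), not entry count. */
class WeightBoundedCache<K, V> {

    private final long maxWeight;
    private long currentWeight;
    private final Map<K, List<V>> entries = new HashMap<>();
    private final Deque<K> insertionOrder = new ArrayDeque<>(); // FIFO eviction, for simplicity

    WeightBoundedCache(long maxWeight) {
        this.maxWeight = maxWeight;
    }

    /** Weigher: an entry costs its number of rows; at least 1, so even
     *  empty results (fully pruned keys) occupy some bounded space. */
    private static long weigh(List<?> rows) {
        return Math.max(1, rows.size());
    }

    void put(K key, List<V> rows) {
        List<V> previous = entries.remove(key);
        if (previous != null) {
            currentWeight -= weigh(previous);
            insertionOrder.remove(key);
        }
        entries.put(key, rows);
        insertionOrder.addLast(key);
        currentWeight += weigh(rows);
        // Evict the oldest entries until the total weight fits again.
        while (currentWeight > maxWeight && !insertionOrder.isEmpty()) {
            K eldest = insertionOrder.pollFirst();
            List<V> evicted = entries.remove(eldest);
            if (evicted != null) {
                currentWeight -= weigh(evicted);
            }
        }
    }

    List<V> get(K key) {
        return entries.get(key);
    }

    int size() {
        return entries.size();
    }
}
```

The point of the sketch: because weight tracks row count, a cache line holding an empty (pruned) result is nearly free, so the cache can hold far more keys than a plain max-entries bound would allow.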
> > >> >> >>>>>>>>>>>>>>>>>> Flink SQL has provided a standard way to do
> > filters
> > >> and
> > >> >> >>>> projects
> > >> >> >>>>>>>>>>>>>>>>> pushdown, i.e., SupportsFilterPushDown and
> > >> >> >>>> SupportsProjectionPushDown.
> > >> >> >>>>>>>>>>>>>>>>>> Jdbc/hive/HBase haven't implemented the interfaces,
> > >> >> >>>>>>>>>>>>>>>>>> which doesn't mean they are hard to implement.
> > >> >> >>>>>>>>>>>>>>>>>
> > >> >> >>>>>>>>>>>>>>>>> It's debatable how difficult it will be to
> > implement
> > >> >> >> filter
> > >> >> >>>> pushdown.
> > >> >> >>>>>>>>>>>>>>>>> But I think the fact that currently there is no
> > >> database
> > >> >> >>>> connector
> > >> >> >>>>>>>>>>>>>>>>> with filter pushdown at least means that this
> > feature
> > >> >> >> won't
> > >> >> >>>> be
> > >> >> >>>>>>>>>>>>>>>>> supported soon in connectors. Moreover, if we
> talk
> > >> about
> > >> >> >>>> other
> > >> >> >>>>>>>>>>>>>>>>> connectors (not in Flink repo), their databases
> > might
> > >> >> not
> > >> >> >>>> support all
> > >> >> >>>>>>>>>>>>>>>>> Flink filters (or not support filters at all). I
> > >> think
> > >> >> >>> users
> > >> >> >>>> are
> > >> >> >>>>>>>>>>>>>>>>> interested in supporting cache filters
> optimization
> > >> >> >>>> independently of
> > >> >> >>>>>>>>>>>>>>>>> supporting other features and solving more
> complex
> > >> >> >> problems
> > >> >> >>>> (or
> > >> >> >>>>>>>>>>>>>>>>> unsolvable at all).
> > >> >> >>>>>>>>>>>>>>>>>
> > >> >> >>>>>>>>>>>>>>>>> 3) I agree with your third statement. Actually in
> > our
> > >> >> >>>> internal version
> > >> >> >>>>>>>>>>>>>>>>> I also tried to unify the logic of scanning and
> > >> >> reloading
> > >> >> >>>> data from
> > >> >> >>>>>>>>>>>>>>>>> connectors. But unfortunately, I didn't find a
> way
> > to
> > >> >> >> unify
> > >> >> >>>> the logic
> > >> >> >>>>>>>>>>>>>>>>> of all ScanRuntimeProviders (InputFormat,
> > >> >> SourceFunction,
> > >> >> >>>> Source,...)
> > >> >> >>>>>>>>>>>>>>>>> and reuse it in reloading ALL cache. As a result
> I
> > >> >> >> settled
> > >> >> >>>> on using
> > >> >> >>>>>>>>>>>>>>>>> InputFormat, because it was used for scanning in
> > all
> > >> >> >> lookup
> > >> >> >>>>>>>>>>>>>>>>> connectors. (I didn't know that there are plans
> to
> > >> >> >>> deprecate
> > >> >> >>>>>>>>>>>>>>>>> InputFormat in favor of FLIP-27 Source). IMO
> usage
> > of
> > >> >> >>>> FLIP-27 source
> > >> >> >>>>>>>>>>>>>>>>> in ALL caching is not good idea, because this
> > source
> > >> was
> > >> >> >>>> designed to
> > >> >> >>>>>>>>>>>>>>>>> work in distributed environment (SplitEnumerator
> on
> > >> >> >>>> JobManager and
> > >> >> >>>>>>>>>>>>>>>>> SourceReaders on TaskManagers), not in one
> operator
> > >> >> >> (lookup
> > >> >> >>>> join
> > >> >> >>>>>>>>>>>>>>>>> operator in our case). There is even no direct
> way
> > to
> > >> >> >> pass
> > >> >> >>>> splits from
> > >> >> >>>>>>>>>>>>>>>>> SplitEnumerator to SourceReader (this logic works
> > >> >> through
> > >> >> >>>>>>>>>>>>>>>>> SplitEnumeratorContext, which requires
> > >> >> >>>>>>>>>>>>>>>>> OperatorCoordinator.SubtaskGateway to send
> > >> >> >> AddSplitEvents).
> > >> >> >>>> Usage of
> > >> >> >>>>>>>>>>>>>>>>> InputFormat for ALL cache seems much more clearer
> > and
> > >> >> >>>> easier. But if
> > >> >> >>>>>>>>>>>>>>>>> there are plans to refactor all connectors to
> > >> FLIP-27, I
> > >> >> >>>> have the
> > >> >> >>>>>>>>>>>>>>>>> following ideas: maybe we can refuse from lookup
> > join
> > >> >> ALL
> > >> >> >>>> cache in
> > >> >> >>>>>>>>>>>>>>>>> favor of simple join with multiple scanning of
> > batch
> > >> >> >>> source?
> > >> >> >>>> The point
> > >> >> >>>>>>>>>>>>>>>>> is that the only difference between lookup join
> ALL
> > >> >> cache
> > >> >> >>>> and simple
> > >> >> >>>>>>>>>>>>>>>>> join with batch source is that in the first case
> > >> >> scanning
> > >> >> >>> is
> > >> >> >>>> performed
> > >> >> >>>>>>>>>>>>>>>>> multiple times, in between which state (cache) is
> > >> >> cleared
> > >> >> >>>> (correct me
> > >> >> >>>>>>>>>>>>>>>>> if I'm wrong). So what if we extend the
> > >> functionality of
> > >> >> >>>> simple join
> > >> >> >>>>>>>>>>>>>>>>> to support state reloading + extend the
> > >> functionality of
> > >> >> >>>> scanning
> > >> >> >>>>>>>>>>>>>>>>> batch source multiple times (this one should be
> > easy
> > >> >> with
> > >> >> >>>> new FLIP-27
> > >> >> >>>>>>>>>>>>>>>>> source, that unifies streaming/batch reading - we
> > >> will
> > >> >> >> need
> > >> >> >>>> to change
> > >> >> >>>>>>>>>>>>>>>>> only SplitEnumerator, which will pass splits
> again
> > >> after
> > >> >> >>>> some TTL).
> > >> >> >>>>>>>>>>>>>>>>> WDYT? I must say that this looks like a long-term
> > >> goal
> > >> >> >> and
> > >> >> >>>> will make
> > >> >> >>>>>>>>>>>>>>>>> the scope of this FLIP even larger than you said.
> > >> Maybe
> > >> >> >> we
> > >> >> >>>> can limit
> > >> >> >>>>>>>>>>>>>>>>> ourselves to a simpler solution now
> (InputFormats).
> > >> >> >>>>>>>>>>>>>>>>>
> > >> >> >>>>>>>>>>>>>>>>> So to sum up, my points are as follows:
> > >> >> >>>>>>>>>>>>>>>>> 1) There is a way to make both concise and
> flexible
> > >> >> >>>> interfaces for
> > >> >> >>>>>>>>>>>>>>>>> caching in lookup join.
> > >> >> >>>>>>>>>>>>>>>>> 2) Cache filters optimization is important both
> in
> > >> LRU
> > >> >> >> and
> > >> >> >>>> ALL caches.
> > >> >> >>>>>>>>>>>>>>>>> 3) It is unclear when filter pushdown will be
> > >> supported
> > >> >> >> in
> > >> >> >>>> Flink
> > >> >> >>>>>>>>>>>>>>>>> connectors, some of the connectors might not have
> > the
> > >> >> >>>> opportunity to
> > >> >> >>>>>>>>>>>>>>>>> support filter pushdown + as I know, currently
> > filter
> > >> >> >>>> pushdown works
> > >> >> >>>>>>>>>>>>>>>>> only for scanning (not lookup). So cache filters
> +
> > >> >> >>>> projections
> > >> >> >>>>>>>>>>>>>>>>> optimization should be independent from other
> > >> features.
> > >> >> >>>>>>>>>>>>>>>>> 4) The ALL cache implementation is a complex topic that
> > >> >> >>>>>>>>>>>>>>>>> touches multiple aspects of how Flink is developing.
> > >> >> >>>>>>>>>>>>>>>>> Abandoning InputFormat in favor of the FLIP-27 Source
> > >> >> >>>>>>>>>>>>>>>>> will make the ALL cache implementation really complex
> > >> >> >>>>>>>>>>>>>>>>> and unclear, so maybe instead we can extend the
> > >> >> >>>>>>>>>>>>>>>>> functionality of a regular join, or keep InputFormat for
> > >> >> >>>>>>>>>>>>>>>>> the lookup join ALL cache?
> > >> >> >>>>>>>>>>>>>>>>>
> > >> >> >>>>>>>>>>>>>>>>> Best regards,
> > >> >> >>>>>>>>>>>>>>>>> Smirnov Alexander
> > >> >> >>>>>>>>>>>>>>>>>
> > >> >> >>>>>>>>>>>>>>>>>
> > >> >> >>>>>>>>>>>>>>>>>
> > >> >> >>>>>>>>>>>>>>>>>
> > >> >> >>>>>>>>>>>>>>>>>
> > >> >> >>>>>>>>>>>>>>>>>
> > >> >> >>>>>>>>>>>>>>>>>
> > >> >> >>>>>>>>>>>>>>>>> [1]
> > >> >> >>>>>>>>>>>>>>>>>
> > >> >> >>>>>>>>>>>>>>>
> > >> >> >>>>
> > >> >> >>>
> > >> >> >>
> > >> >>
> > >>
> >
> https://guava.dev/releases/18.0/api/docs/com/google/common/cache/CacheBuilder.html#weigher(com.google.common.cache.Weigher)
> > >> >> >>>>>>>>>>>>>>>>>
> > >> >> >>>>>>>>>>>>>>>>> чт, 5 мая 2022 г. в 20:34, Jark Wu <
> > imjark@gmail.com
> > >> >:
> > >> >> >>>>>>>>>>>>>>>>>>
> > >> >> >>>>>>>>>>>>>>>>>> It's great to see the active discussion! I want
> to
> > >> >> share
> > >> >> >>> my
> > >> >> >>>> ideas:
> > >> >> >>>>>>>>>>>>>>>>>>
> > >> >> >>>>>>>>>>>>>>>>>> 1) implement the cache in framework vs.
> connectors
> > >> base
> > >> >> >>>>>>>>>>>>>>>>>> I don't have a strong opinion on this. Both ways
> > >> should
> > >> >> >>>> work (e.g.,
> > >> >> >>>>>>>>>>>>>>> cache
> > >> >> >>>>>>>>>>>>>>>>>> pruning, compatibility).
> > >> >> >>>>>>>>>>>>>>>>>> The framework way can provide more concise
> > >> interfaces.
> > >> >> >>>>>>>>>>>>>>>>>> The connector base way can define more flexible
> > >> cache
> > >> >> >>>>>>>>>>>>>>>>>> strategies/implementations.
> > >> >> >>>>>>>>>>>>>>>>>> We are still investigating a way to see if we
> can
> > >> have
> > >> >> >>> both
> > >> >> >>>>>>>>>>>>>>> advantages.
> > >> >> >>>>>>>>>>>>>>>>>> We should reach a consensus that the way should
> > be a
> > >> >> >> final
> > >> >> >>>> state,
> > >> >> >>>>>>>>>>>>>>> and we
> > >> >> >>>>>>>>>>>>>>>>>> are on the path to it.
> > >> >> >>>>>>>>>>>>>>>>>>
> > >> >> >>>>>>>>>>>>>>>>>> 2) filters and projections pushdown:
> > >> >> >>>>>>>>>>>>>>>>>> I agree with Alex that the filter pushdown into
> > >> cache
> > >> >> >> can
> > >> >> >>>> benefit a
> > >> >> >>>>>>>>>>>>>>> lot
> > >> >> >>>>>>>>>>>>>>>>> for
> > >> >> >>>>>>>>>>>>>>>>>> ALL cache.
> > >> >> >>>>>>>>>>>>>>>>>> However, this is not true for LRU cache.
> > Connectors
> > >> use
> > >> >> >>>> cache to
> > >> >> >>>>>>>>>>>>>>> reduce
> > >> >> >>>>>>>>>>>>>>>>> IO
> > >> >> >>>>>>>>>>>>>>>>>> requests to databases for better throughput.
> > >> >> >>>>>>>>>>>>>>>>>> If a filter can prune 90% of data in the cache,
> we
> > >> will
> > >> >> >>>> have 90% of
> > >> >> >>>>>>>>>>>>>>>>> lookup
> > >> >> >>>>>>>>>>>>>>>>>> requests that can never be cached
> > >> >> >>>>>>>>>>>>>>>>>> and will hit the databases directly. That means the
> > >> >> >>>>>>>>>>>>>>>>>> cache is meaningless in this case.
> > >> >> >>>>>>>>>>>>>>>>>>
> > >> >> >>>>>>>>>>>>>>>>>> IMO, Flink SQL has provided a standard way to do
> > >> >> filters
> > >> >> >>>> and projects
> > >> >> >>>>>>>>>>>>>>>>>> pushdown, i.e., SupportsFilterPushDown and
> > >> >> >>>>>>>>>>>>>>> SupportsProjectionPushDown.
> > >> >> >>>>>>>>>>>>>>>>>> Jdbc/hive/HBase haven't implemented the interfaces,
> > >> >> >>>>>>>>>>>>>>>>>> which doesn't mean they are hard to implement.
> > >> >> >>>>>>>>>>>>>>>>>> They should implement the pushdown interfaces to
> > >> reduce
> > >> >> >> IO
> > >> >> >>>> and the
> > >> >> >>>>>>>>>>>>>>> cache
> > >> >> >>>>>>>>>>>>>>>>>> size.
> > >> >> >>>>>>>>>>>>>>>>>> That should be a final state that the scan
> source
> > >> and
> > >> >> >>>> lookup source
> > >> >> >>>>>>>>>>>>>>> share
> > >> >> >>>>>>>>>>>>>>>>>> the exact pushdown implementation.
> > >> >> >>>>>>>>>>>>>>>>>> I don't see why we need to duplicate the
> pushdown
> > >> logic
> > >> >> >> in
> > >> >> >>>> caches,
> > >> >> >>>>>>>>>>>>>>> which
> > >> >> >>>>>>>>>>>>>>>>>> will complex the lookup join design.
> > >> >> >>>>>>>>>>>>>>>>>>
> > >> >> >>>>>>>>>>>>>>>>>> 3) ALL cache abstraction
> > >> >> >>>>>>>>>>>>>>>>>> All cache might be the most challenging part of
> > this
> > >> >> >> FLIP.
> > >> >> >>>> We have
> > >> >> >>>>>>>>>>>>>>> never
> > >> >> >>>>>>>>>>>>>>>>>> provided a reload-lookup public interface.
> > >> >> >>>>>>>>>>>>>>>>>> Currently, we put the reload logic in the "eval"
> > >> method
> > >> >> >> of
> > >> >> >>>>>>>>>>>>>>> TableFunction.
> > >> >> >>>>>>>>>>>>>>>>>> That's hard for some sources (e.g., Hive).
> > >> >> >>>>>>>>>>>>>>>>>> Ideally, connector implementation should share
> the
> > >> >> logic
> > >> >> >>> of
> > >> >> >>>> reload
> > >> >> >>>>>>>>>>>>>>> and
> > >> >> >>>>>>>>>>>>>>>>>> scan, i.e. ScanTableSource with
> > >> >> >>>> InputFormat/SourceFunction/FLIP-27
> > >> >> >>>>>>>>>>>>>>>>> Source.
> > >> >> >>>>>>>>>>>>>>>>>> However, InputFormat/SourceFunction are
> > deprecated,
> > >> and
> > >> >> >>> the
> > >> >> >>>> FLIP-27
> > >> >> >>>>>>>>>>>>>>>>> source
> > >> >> >>>>>>>>>>>>>>>>>> is deeply coupled with SourceOperator.
> > >> >> >>>>>>>>>>>>>>>>>> If we want to invoke the FLIP-27 source in
> > >> LookupJoin,
> > >> >> >>> this
> > >> >> >>>> may make
> > >> >> >>>>>>>>>>>>>>> the
> > >> >> >>>>>>>>>>>>>>>>>> scope of this FLIP much larger.
> > >> >> >>>>>>>>>>>>>>>>>> We are still investigating how to abstract the
> ALL
> > >> >> cache
> > >> >> >>>> logic and
> > >> >> >>>>>>>>>>>>>>> reuse
> > >> >> >>>>>>>>>>>>>>>>>> the existing source interfaces.
> > >> >> >>>>>>>>>>>>>>>>>>
> > >> >> >>>>>>>>>>>>>>>>>>
> > >> >> >>>>>>>>>>>>>>>>>> Best,
> > >> >> >>>>>>>>>>>>>>>>>> Jark
> > >> >> >>>>>>>>>>>>>>>>>>
> > >> >> >>>>>>>>>>>>>>>>>>
> > >> >> >>>>>>>>>>>>>>>>>>
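[Editor's note: the pushdown contract Jark refers to, where a source accepts the predicates it can evaluate and returns the rest to the planner, can be sketched without Flink classes. This mirrors the shape of SupportsFilterPushDown#applyFilters (accepted vs. remaining filters); the names FilterSplit, PushdownResult and canPushDown are illustrative, not Flink API.]

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

// Generic sketch of filter pushdown negotiation: the source keeps the
// filters it can evaluate remotely, the planner evaluates the remainder.
class FilterSplit {
    static final class PushdownResult {
        final List<String> accepted = new ArrayList<>();  // evaluated by the source
        final List<String> remaining = new ArrayList<>(); // evaluated by Flink
    }

    // canPushDown stands in for "the external system understands this filter"
    static PushdownResult applyFilters(List<String> filters, Predicate<String> canPushDown) {
        PushdownResult result = new PushdownResult();
        for (String filter : filters) {
            (canPushDown.test(filter) ? result.accepted : result.remaining).add(filter);
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> filters = Arrays.asList("age > 30", "UPPER(name) = 'BOB'");
        // assume the database can evaluate simple comparisons but no functions
        PushdownResult r = applyFilters(filters, f -> !f.contains("("));
        System.out.println(r.accepted);  // [age > 30]
        System.out.println(r.remaining); // [UPPER(name) = 'BOB']
    }
}
```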
> > >> >> >>>>>>>>>>>>>>>>>> On Thu, 5 May 2022 at 20:22, Roman Boyko <ro.v.boyko@gmail.com> wrote:
> > >> >> >>>>>>>>>>>>>>>>>>
> > >> >> >>>>>>>>>>>>>>>>>>> It's a much more complicated activity and lies outside the scope of this
> > >> >> >>>>>>>>>>>>>>>>>>> improvement, because such pushdowns would have to be done for all
> > >> >> >>>>>>>>>>>>>>>>>>> ScanTableSource implementations (not only for the lookup ones).
> > >> >> >>>>>>>>>>>>>>>>>>>
> > >> >> >>>>>>>>>>>>>>>>>>> On Thu, 5 May 2022 at 19:02, Martijn Visser <martijnvisser@apache.org> wrote:
> > >> >> >>>>>>>>>>>>>>>>>>>
> > >> >> >>>>>>>>>>>>>>>>>>>> Hi everyone,
> > >> >> >>>>>>>>>>>>>>>>>>>>
> > >> >> >>>>>>>>>>>>>>>>>>>> One question regarding "And Alexander correctly mentioned that filter
> > >> >> >>>>>>>>>>>>>>>>>>>> pushdown still is not implemented for jdbc/hive/hbase." -> Would an
> > >> >> >>>>>>>>>>>>>>>>>>>> alternative solution be to actually implement these filter pushdowns? I
> > >> >> >>>>>>>>>>>>>>>>>>>> can imagine that there are many more benefits to doing that, outside of
> > >> >> >>>>>>>>>>>>>>>>>>>> lookup caching and metrics.
> > >> >> >>>>>>>>>>>>>>>>>>>>
> > >> >> >>>>>>>>>>>>>>>>>>>> Best regards,
> > >> >> >>>>>>>>>>>>>>>>>>>>
> > >> >> >>>>>>>>>>>>>>>>>>>> Martijn Visser
> > >> >> >>>>>>>>>>>>>>>>>>>> https://twitter.com/MartijnVisser82
> > >> >> >>>>>>>>>>>>>>>>>>>> https://github.com/MartijnVisser
> > >> >> >>>>>>>>>>>>>>>>>>>>
> > >> >> >>>>>>>>>>>>>>>>>>>>
> > >> >> >>>>>>>>>>>>>>>>>>>> On Thu, 5 May 2022 at 13:58, Roman Boyko <ro.v.boyko@gmail.com> wrote:
> > >> >> >>>>>>>>>>>>>>>>>>>>
> > >> >> >>>>>>>>>>>>>>>>>>>>> Hi everyone!
> > >> >> >>>>>>>>>>>>>>>>>>>>>
> > >> >> >>>>>>>>>>>>>>>>>>>>> Thanks for driving such a valuable improvement!
> > >> >> >>>>>>>>>>>>>>>>>>>>>
> > >> >> >>>>>>>>>>>>>>>>>>>>> I do think that a single cache implementation would be a nice
> > >> >> >>>>>>>>>>>>>>>>>>>>> opportunity for users. And it will break the "FOR SYSTEM_TIME AS OF
> > >> >> >>>>>>>>>>>>>>>>>>>>> proc_time" semantics anyway - no matter how it is implemented.
> > >> >> >>>>>>>>>>>>>>>>>>>>>
> > >> >> >>>>>>>>>>>>>>>>>>>>> Putting myself in the user's shoes, I can say that:
> > >> >> >>>>>>>>>>>>>>>>>>>>> 1) I would prefer to have the opportunity to cut down the cache size by
> > >> >> >>>>>>>>>>>>>>>>>>>>> simply filtering out unnecessary data. And the handiest way to do that
> > >> >> >>>>>>>>>>>>>>>>>>>>> is to apply it inside the LookupRunners. It would be a bit harder to
> > >> >> >>>>>>>>>>>>>>>>>>>>> pass it through the LookupJoin node to the TableFunction. And Alexander
> > >> >> >>>>>>>>>>>>>>>>>>>>> correctly mentioned that filter pushdown still is not implemented for
> > >> >> >>>>>>>>>>>>>>>>>>>>> jdbc/hive/hbase.
> > >> >> >>>>>>>>>>>>>>>>>>>>> 2) The ability to set different caching parameters for different tables
> > >> >> >>>>>>>>>>>>>>>>>>>>> is quite important. So I would prefer to set it through DDL rather than
> > >> >> >>>>>>>>>>>>>>>>>>>>> have the same TTL, strategy and other options for all lookup tables.
> > >> >> >>>>>>>>>>>>>>>>>>>>> 3) Providing the cache inside the framework really deprives us of
> > >> >> >>>>>>>>>>>>>>>>>>>>> extensibility (users won't be able to implement their own cache). But
> > >> >> >>>>>>>>>>>>>>>>>>>>> most probably this can be solved by adding more cache strategies and a
> > >> >> >>>>>>>>>>>>>>>>>>>>> wider set of configurations.
> > >> >> >>>>>>>>>>>>>>>>>>>>>
> > >> >> >>>>>>>>>>>>>>>>>>>>> All these points are much closer to the schema proposed by Alexander.
> > >> >> >>>>>>>>>>>>>>>>>>>>> Qingsheng Ren, please correct me if I'm wrong - could all these
> > >> >> >>>>>>>>>>>>>>>>>>>>> facilities be simply implemented in your architecture?
> > >> >> >>>>>>>>>>>>>>>>>>>>>
> > >> >> >>>>>>>>>>>>>>>>>>>>> Best regards,
> > >> >> >>>>>>>>>>>>>>>>>>>>> Roman Boyko
> > >> >> >>>>>>>>>>>>>>>>>>>>> e.: ro.v.boyko@gmail.com
> > >> >> >>>>>>>>>>>>>>>>>>>>>
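[Editor's note: the per-table options Roman mentions (row capacity, TTL, eviction strategy) boil down to a small LRU + TTL cache. The sketch below is illustrative only, not the FLIP-221 API; option names like 'lookup.cache.max-rows' and 'lookup.cache.ttl' from the thread map onto the maxRows and ttlMillis parameters.]

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Minimal LRU + TTL cache sketching the semantics behind per-table
// options such as 'lookup.cache.max-rows' (capacity, LRU eviction)
// and 'lookup.cache.ttl' (expire after write).
class LruTtlCache<K, V> {
    private static final class CacheValue<V> {
        final V value;
        final long writeTime;
        CacheValue(V value, long writeTime) { this.value = value; this.writeTime = writeTime; }
    }

    private final int maxRows;
    private final long ttlMillis;
    private final LinkedHashMap<K, CacheValue<V>> map;

    LruTtlCache(int maxRows, long ttlMillis) {
        this.maxRows = maxRows;
        this.ttlMillis = ttlMillis;
        // accessOrder=true turns the map into an LRU list;
        // removeEldestEntry enforces the max-rows bound on insert.
        this.map = new LinkedHashMap<K, CacheValue<V>>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, CacheValue<V>> eldest) {
                return size() > LruTtlCache.this.maxRows;
            }
        };
    }

    void put(K key, V value) {
        map.put(key, new CacheValue<>(value, System.currentTimeMillis()));
    }

    V get(K key) {
        CacheValue<V> entry = map.get(key);
        if (entry == null) {
            return null;
        }
        if (System.currentTimeMillis() - entry.writeTime > ttlMillis) {
            map.remove(key); // expired after write
            return null;
        }
        return entry.value;
    }

    public static void main(String[] args) {
        LruTtlCache<String, String> cache = new LruTtlCache<>(2, 60_000);
        cache.put("a", "1");
        cache.put("b", "2");
        cache.put("c", "3"); // capacity 2: "a" (least recently used) is evicted
        System.out.println(cache.get("a")); // null
        System.out.println(cache.get("c")); // 3
    }
}
```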
> > >> >> >>>>>>>>>>>>>>>>>>>>> On Wed, 4 May 2022 at 21:01, Martijn Visser <martijnvisser@apache.org> wrote:
> > >> >> >>>>>>>>>>>>>>>>>>>>>
> > >> >> >>>>>>>>>>>>>>>>>>>>>> Hi everyone,
> > >> >> >>>>>>>>>>>>>>>>>>>>>>
> > >> >> >>>>>>>>>>>>>>>>>>>>>> I don't have much to chip in, but I just wanted to express that I really
> > >> >> >>>>>>>>>>>>>>>>>>>>>> appreciate the in-depth discussion on this topic, and I hope that others
> > >> >> >>>>>>>>>>>>>>>>>>>>>> will join the conversation.
> > >> >> >>>>>>>>>>>>>>>>>>>>>>
> > >> >> >>>>>>>>>>>>>>>>>>>>>> Best regards,
> > >> >> >>>>>>>>>>>>>>>>>>>>>>
> > >> >> >>>>>>>>>>>>>>>>>>>>>> Martijn
> > >> >> >>>>>>>>>>>>>>>>>>>>>>
> > >> >> >>>>>>>>>>>>>>>>>>>>>> On Tue, 3 May 2022 at 10:15, Александр Смирнов <smiralexan@gmail.com> wrote:
> > >> >> >>>>>>>>>>>>>>>>>>>>>>
> > >> >> >>>>>>>>>>>>>>>>>>>>>>> Hi Qingsheng, Leonard and Jark,
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>
> > >> >> >>>>>>>>>>>>>>>>>>>>>>> Thanks for your detailed feedback! However, I have questions about some
> > >> >> >>>>>>>>>>>>>>>>>>>>>>> of your statements (maybe I didn't get something?).
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>> Caching actually breaks the semantic of "FOR SYSTEM_TIME AS OF proc_time”
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>
> > >> >> >>>>>>>>>>>>>>>>>>>>>>> I agree that the semantics of "FOR SYSTEM_TIME AS OF proc_time" is not
> > >> >> >>>>>>>>>>>>>>>>>>>>>>> fully implemented with caching, but as you said, users accept this
> > >> >> >>>>>>>>>>>>>>>>>>>>>>> consciously to achieve better performance (no one proposed to enable
> > >> >> >>>>>>>>>>>>>>>>>>>>>>> caching by default, etc.). Or by users do you mean other developers of
> > >> >> >>>>>>>>>>>>>>>>>>>>>>> connectors? In that case developers explicitly specify whether their
> > >> >> >>>>>>>>>>>>>>>>>>>>>>> connector supports caching or not (in the list of supported options);
> > >> >> >>>>>>>>>>>>>>>>>>>>>>> no one makes them do that if they don't want to. So what exactly is the
> > >> >> >>>>>>>>>>>>>>>>>>>>>>> difference, from this point of view, between implementing caching in
> > >> >> >>>>>>>>>>>>>>>>>>>>>>> the flink-table-runtime module and in flink-table-common? How does it
> > >> >> >>>>>>>>>>>>>>>>>>>>>>> affect breaking or not breaking the semantics of "FOR SYSTEM_TIME AS OF
> > >> >> >>>>>>>>>>>>>>>>>>>>>>> proc_time"?
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>> confront a situation that allows table options in DDL to control the
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>> behavior of the framework, which has never happened previously and
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>> should be cautious
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>
> > >> >> >>>>>>>>>>>>>>>>>>>>>>> If we talk about the main semantic differences between DDL options and
> > >> >> >>>>>>>>>>>>>>>>>>>>>>> config options ("table.exec.xxx"), isn't it about limiting the scope of
> > >> >> >>>>>>>>>>>>>>>>>>>>>>> the options plus their importance for the user's business logic, rather
> > >> >> >>>>>>>>>>>>>>>>>>>>>>> than the specific location of the corresponding logic in the framework?
> > >> >> >>>>>>>>>>>>>>>>>>>>>>> I mean that in my design, for example, putting an option with the
> > >> >> >>>>>>>>>>>>>>>>>>>>>>> lookup cache strategy into configurations would be the wrong decision,
> > >> >> >>>>>>>>>>>>>>>>>>>>>>> because it directly affects the user's business logic (not just
> > >> >> >>>>>>>>>>>>>>>>>>>>>>> performance optimization) and touches just several functions of ONE
> > >> >> >>>>>>>>>>>>>>>>>>>>>>> table (there can be multiple tables with different caches). Does it
> > >> >> >>>>>>>>>>>>>>>>>>>>>>> really matter for the user (or someone else) where the logic affected
> > >> >> >>>>>>>>>>>>>>>>>>>>>>> by the applied option is located?
> > >> >> >>>>>>>>>>>>>>>>>>>>>>> Also I can recall the DDL option 'sink.parallelism', which in some way
> > >> >> >>>>>>>>>>>>>>>>>>>>>>> "controls the behavior of the framework", and I don't see any problem
> > >> >> >>>>>>>>>>>>>>>>>>>>>>> here.
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>> introduce a new interface for this all-caching scenario and the design
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>> would become more complex
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>
> > >> >> >>>>>>>>>>>>>>>>>>>>>>> This is a subject for a separate discussion, but actually in our
> > >> >> >>>>>>>>>>>>>>>>>>>>>>> internal version we solved this problem quite easily - we reused the
> > >> >> >>>>>>>>>>>>>>>>>>>>>>> InputFormat class (so there is no need for a new API). The point is
> > >> >> >>>>>>>>>>>>>>>>>>>>>>> that currently all lookup connectors use InputFormat for scanning the
> > >> >> >>>>>>>>>>>>>>>>>>>>>>> data in batch mode: HBase, JDBC and even Hive - it uses the class
> > >> >> >>>>>>>>>>>>>>>>>>>>>>> PartitionReader, which is actually just a wrapper around InputFormat.
> > >> >> >>>>>>>>>>>>>>>>>>>>>>> The advantage of this solution is the ability to reload cache data in
> > >> >> >>>>>>>>>>>>>>>>>>>>>>> parallel (the number of threads depends on the number of InputSplits,
> > >> >> >>>>>>>>>>>>>>>>>>>>>>> but has an upper limit). As a result the cache reload time is
> > >> >> >>>>>>>>>>>>>>>>>>>>>>> significantly reduced (as well as the time the input stream is
> > >> >> >>>>>>>>>>>>>>>>>>>>>>> blocked). I know that usually we try to avoid concurrency in Flink
> > >> >> >>>>>>>>>>>>>>>>>>>>>>> code, but maybe this one can be an exception. BTW I don't say that it's
> > >> >> >>>>>>>>>>>>>>>>>>>>>>> an ideal solution; maybe there are better ones.
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>
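[Editor's note: the parallel ALL-cache reload Alexander describes, one thread per InputSplit-like chunk with an upper limit, then swapping in the merged result, can be sketched without Flink dependencies. The names reload and loadSplit are illustrative, not the internal-version API.]

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Sketch: read each split in its own thread, merge into one map that the
// caller then swaps in as the new ALL cache.
class ParallelAllCacheReload {
    // Stand-in for reading one InputSplit from the dimension table.
    static Map<Integer, String> loadSplit(List<Integer> split) {
        Map<Integer, String> rows = new HashMap<>();
        for (int key : split) {
            rows.put(key, "row-" + key);
        }
        return rows;
    }

    static Map<Integer, String> reload(List<List<Integer>> splits, int maxThreads) {
        // thread count follows the number of splits, bounded by an upper limit
        ExecutorService pool =
                Executors.newFixedThreadPool(Math.max(1, Math.min(maxThreads, splits.size())));
        try {
            List<Future<Map<Integer, String>>> futures = new ArrayList<>();
            for (List<Integer> split : splits) {
                futures.add(pool.submit(() -> loadSplit(split)));
            }
            Map<Integer, String> cache = new ConcurrentHashMap<>();
            for (Future<Map<Integer, String>> future : futures) {
                cache.putAll(future.get());
            }
            return cache; // the caller swaps this in, then unblocks the input stream
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<List<Integer>> splits =
                Arrays.asList(Arrays.asList(1, 2), Arrays.asList(3, 4), Arrays.asList(5));
        System.out.println(reload(splits, 4).size()); // 5
    }
}
```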
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>> Providing the cache in the framework might introduce compatibility
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>> issues
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>
> > >> >> >>>>>>>>>>>>>>>>>>>>>>> It's possible only in cases when the developer of the connector doesn't
> > >> >> >>>>>>>>>>>>>>>>>>>>>>> properly refactor his code and uses the new cache options incorrectly
> > >> >> >>>>>>>>>>>>>>>>>>>>>>> (i.e. explicitly provides the same options in 2 different code places).
> > >> >> >>>>>>>>>>>>>>>>>>>>>>> For correct behavior all he needs to do is redirect the existing
> > >> >> >>>>>>>>>>>>>>>>>>>>>>> options to the framework's LookupConfig (plus maybe add aliases for
> > >> >> >>>>>>>>>>>>>>>>>>>>>>> options, if there was different naming); everything will be transparent
> > >> >> >>>>>>>>>>>>>>>>>>>>>>> for users. If the developer doesn't do the refactoring at all, nothing
> > >> >> >>>>>>>>>>>>>>>>>>>>>>> changes for the connector because of backward compatibility. Also, if a
> > >> >> >>>>>>>>>>>>>>>>>>>>>>> developer wants to use his own cache logic, he can simply refuse to
> > >> >> >>>>>>>>>>>>>>>>>>>>>>> pass some of the configs into the framework and instead make his own
> > >> >> >>>>>>>>>>>>>>>>>>>>>>> implementation with the already existing configs and metrics (but
> > >> >> >>>>>>>>>>>>>>>>>>>>>>> actually I think that's a rare case).
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>> filters and projections should be pushed all the way down to the table
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>> function, like what we do in the scan source
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>
> > >> >> >>>>>>>>>>>>>>>>>>>>>>> That is a great goal. But the truth is that the ONLY connector that
> > >> >> >>>>>>>>>>>>>>>>>>>>>>> supports filter pushdown is FileSystemTableSource (no database
> > >> >> >>>>>>>>>>>>>>>>>>>>>>> connector currently supports it). Also, for some databases it's simply
> > >> >> >>>>>>>>>>>>>>>>>>>>>>> impossible to push down filters as complex as the ones we have in
> > >> >> >>>>>>>>>>>>>>>>>>>>>>> Flink.
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>> only applying these optimizations to the cache seems not quite useful
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>
> > >> >> >>>>>>>>>>>>>>>>>>>>>>> Filters can cut off an arbitrarily large amount of data from the
> > >> >> >>>>>>>>>>>>>>>>>>>>>>> dimension table. For a simple example, suppose the dimension table
> > >> >> >>>>>>>>>>>>>>>>>>>>>>> 'users' has a column 'age' with values from 20 to 40, and an input
> > >> >> >>>>>>>>>>>>>>>>>>>>>>> stream 'clicks' that is roughly uniformly distributed by age of users.
> > >> >> >>>>>>>>>>>>>>>>>>>>>>> If we have the filter 'age > 30', there will be half as much data in
> > >> >> >>>>>>>>>>>>>>>>>>>>>>> the cache. This means the user can increase 'lookup.cache.max-rows' by
> > >> >> >>>>>>>>>>>>>>>>>>>>>>> almost 2 times, which gives a huge performance boost. Moreover, this
> > >> >> >>>>>>>>>>>>>>>>>>>>>>> optimization really starts to shine with the 'ALL' cache, where tables
> > >> >> >>>>>>>>>>>>>>>>>>>>>>> that can't fit in memory without filters and projections can fit with
> > >> >> >>>>>>>>>>>>>>>>>>>>>>> them. This opens up additional possibilities for users. And that
> > >> >> >>>>>>>>>>>>>>>>>>>>>>> doesn't sound like 'not quite useful'.
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>
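[Editor's note: Alexander's 'age > 30' argument, that filtering rows before they enter the cache roughly halves its footprint for a uniform age distribution, can be sketched directly. The class and method names below are illustrative, not part of any Flink API.]

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Predicate;

// Sketch: rows failing the lookup table's filter are never cached,
// so the same max-rows budget covers far more useful entries.
class FilteredLookupCache {
    final Map<Integer, Integer> cache = new HashMap<>(); // userId -> age
    private final Predicate<Integer> ageFilter;

    FilteredLookupCache(Predicate<Integer> ageFilter) {
        this.ageFilter = ageFilter;
    }

    // Simulates one lookup result arriving from the database.
    void lookup(int userId, int ageFromDb) {
        if (ageFilter.test(ageFromDb)) {
            cache.put(userId, ageFromDb);
        }
    }

    public static void main(String[] args) {
        FilteredLookupCache unfiltered = new FilteredLookupCache(age -> true);
        FilteredLookupCache filtered = new FilteredLookupCache(age -> age > 30);
        for (int id = 0; id < 21; id++) { // users with ages 20..40, uniform
            unfiltered.lookup(id, 20 + id);
            filtered.lookup(id, 20 + id);
        }
        System.out.println(unfiltered.cache.size()); // 21
        System.out.println(filtered.cache.size());   // 10 (only ages 31..40)
    }
}
```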
> > >> >> >>>>>>>>>>>>>>>>>>>>>>> It would be great to hear other voices regarding this topic! Because we
> > >> >> >>>>>>>>>>>>>>>>>>>>>>> have quite a lot of controversial points, and I think with the help of
> > >> >> >>>>>>>>>>>>>>>>>>>>>>> others it will be easier for us to come to a consensus.
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>
> > >> >> >>>>>>>>>>>>>>>>>>>>>>> Best regards,
> > >> >> >>>>>>>>>>>>>>>>>>>>>>> Smirnov Alexander
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>
> > >> >> >>>>>>>>>>>>>>>>>>>>>>> On Fri, 29 Apr 2022 at 22:33, Qingsheng Ren <renqschn@gmail.com> wrote:
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>>
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>> Hi Alexander and Arvid,
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>>
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>> Thanks for the discussion and sorry for my late response! We had an
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>> internal discussion together with Jark and Leonard and I'd like to
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>> summarize our ideas. Instead of implementing the cache logic in the
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>> table runtime layer or wrapping it around the user-provided table
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>> function, we prefer to introduce some new APIs extending
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>> TableFunction, with these concerns:
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>>
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>> 1. Caching actually breaks the semantic of "FOR SYSTEM_TIME AS OF
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>> proc_time", because it couldn't truly reflect the content of the
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>> lookup table at the moment of querying. If users choose to enable
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>> caching on the lookup table, they implicitly indicate that this
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>> breakage is acceptable in exchange for the performance. So we prefer
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>> not to provide caching at the table runtime level.
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>>
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>> 2. If we make the cache implementation in the framework (whether in a
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>> runner or a wrapper around TableFunction), we have to confront a
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>> situation that allows table options in DDL to control the behavior of
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>> the framework, which has never happened previously and should be
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>> treated cautiously. Under the current design the behavior of the
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>> framework should only be specified by configurations
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>> ("table.exec.xxx"), and it's hard to apply these general configs to a
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>> specific table.
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>>
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>> 3. We have use cases where the lookup source loads and periodically
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>> refreshes all records into memory to achieve high lookup performance
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>> (like the Hive connector in the community, a pattern also widely used
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>> by our internal connectors). Wrapping the cache around the user's
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>> TableFunction works fine for LRU caches, but I think we would have to
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>> introduce a new interface for this all-caching scenario, and the
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>> design would become more complex.
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>>
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>> 4. Providing the cache in the framework might introduce compatibility
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>> issues to existing lookup sources: there might exist two caches with
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>> totally different strategies if the user configures the table
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>> incorrectly (one in the framework and another implemented by the
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>> lookup source).
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>>
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>> As for the optimization mentioned by Alexander, I think filters and
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>> projections should be pushed all the way down to the table function,
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>> like what we do in the scan source, instead of into the runner with
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>> the cache. The goal of using a cache is to reduce the network I/O and
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>> the pressure on the external system, and only applying these
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>> optimizations to the cache seems not quite useful.
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>>
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>> I made some updates to the FLIP[1] to reflect our ideas. We prefer to
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>> keep the cache implementation as a part of TableFunction, and we
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>> could provide some helper classes (CachingTableFunction,
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>> AllCachingTableFunction, CachingAsyncTableFunction) to developers and
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>> regulate the metrics of the cache. Also, I made a POC[2] for your
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>> reference.
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>>
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>> Looking forward to your ideas!
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>> [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-221+Abstraction+for+lookup+source+cache+and+metric
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>> [2] https://github.com/PatrickRen/flink/tree/FLIP-221
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>>
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>> Best regards,
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>>
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>> Qingsheng
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>>
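[Editor's note: the lookup-through-cache pattern that a helper like the proposed CachingTableFunction would encapsulate, consult the cache first, delegate to the connector only on a miss, and count misses for a cache metric, can be sketched generically. CachingLookup and lookupCount are illustrative names, not the FLIP-221 API.]

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Sketch: a caching wrapper around the connector's real lookup function.
class CachingLookup<K, V> {
    private final Map<K, V> cache = new HashMap<>();
    private final Function<K, V> connectorLookup; // the connector's real lookup
    int lookupCount = 0; // stand-in for a "requests to external system" metric

    CachingLookup(Function<K, V> connectorLookup) {
        this.connectorLookup = connectorLookup;
    }

    V eval(K key) {
        V cached = cache.get(key);
        if (cached != null) {
            return cached; // cache hit: no external I/O
        }
        lookupCount++; // cache miss: query the external system
        V loaded = connectorLookup.apply(key);
        if (loaded != null) {
            cache.put(key, loaded);
        }
        return loaded;
    }

    public static void main(String[] args) {
        CachingLookup<String, String> lookup = new CachingLookup<>(k -> "row-" + k);
        lookup.eval("a");
        lookup.eval("a"); // second call is served from the cache
        System.out.println(lookup.lookupCount); // 1
    }
}
```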
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>> On Tue, Apr 26, 2022 at 4:45 PM Александр Смирнов <smiralexan@gmail.com> wrote:
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>>>
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>>> Thanks for the response, Arvid!
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>>>
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>>> I have a few comments on your message.
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>>>
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>>>> but could also live with an easier solution as the first step:
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>>>
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>>> I think that these 2 ways are mutually exclusive (the one originally
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>>> proposed by Qingsheng and mine), because conceptually they follow the
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>>> same goal, but the implementation details are different. If we go one
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>>> way, moving to the other way in the future will mean deleting existing
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>>> code and once again changing the API for connectors. So I think we
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>>> should reach a consensus with the community about that and then work
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>>> together on this FLIP, i.e. divide the work into tasks for different
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>>> parts of the FLIP (for example, LRU cache unification / introducing
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>>> the proposed set of metrics / further work…). WDYT, Qingsheng?
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>>>
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>>>> as the source will only receive the
> > requests
> > >> >> >> after
> > >> >> >>>>>>>>>>>>>>> filter
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>>>
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>>> Actually if filters are applied to fields
> > of
> > >> the
> > >> >> >>>> lookup
> > >> >> >>>>>>>>>>>>>>>>> table, we
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>>> firstly must do requests, and only after
> > >> that we
> > >> >> >>> can
> > >> >> >>>>>>>>>>>>>>> filter
> > >> >> >>>>>>>>>>>>>>>>>>>>> responses,
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>>> because lookup connectors don't have
> filter
> > >> >> >>>> pushdown. So
> > >> >> >>>>>>>>>>>>>>> if
> > >> >> >>>>>>>>>>>>>>>>>>>>> filtering
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>>> is done before caching, there will be
> much
> > >> less
> > >> >> >>> rows
> > >> >> >>>> in
> > >> >> >>>>>>>>>>>>>>>>> cache.
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>>>
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>>>> @Alexander unfortunately, your
> > architecture
> > >> is
> > >> >> >> not
> > >> >> >>>>>>>>>>>>>>> shared.
> > >> >> >>>>>>>>>>>>>>>>> I
> > >> >> >>>>>>>>>>>>>>>>>>>> don't
> > >> >> >>>>>>>>>>>>>>>>>>>>>>> know the
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>>>
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>>>> solution to share images to be honest.
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>>>
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>>> Sorry for that, I’m a bit new to such
> kinds
> > >> of
> > >> >> >>>>>>>>>>>>>>> conversations
> > >> >> >>>>>>>>>>>>>>>>> :)
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>>> I have no write access to the confluence,
> > so
> > >> I
> > >> >> >>> made a
> > >> >> >>>>>>>>>>>>>>> Jira
> > >> >> >>>>>>>>>>>>>>>>> issue,
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>>> where described the proposed changes in
> > more
> > >> >> >>> details
> > >> >> >>>> -
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>>>
> > >> >> >> https://issues.apache.org/jira/browse/FLINK-27411.
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>>>
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>>> Will happy to get more feedback!
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>>>
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>>> Best,
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>>> Smirnov Alexander
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>>>
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>>> пн, 25 апр. 2022 г. в 19:49, Arvid Heise
> <
> > >> >> >>>>>>>>>>>>>>> arvid@apache.org>:
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>>>>
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>>>> Hi Qingsheng,
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>>>>
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for driving this; the
> inconsistency
> > >> was
> > >> >> >> not
> > >> >> >>>>>>>>>>>>>>>>> satisfying
> > >> >> >>>>>>>>>>>>>>>>>>>> for
> > >> >> >>>>>>>>>>>>>>>>>>>>>> me.
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>>>>
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>>>> I second Alexander's idea though but
> could
> > >> also
> > >> >> >>> live
> > >> >> >>>>>>>>>>>>>>> with
> > >> >> >>>>>>>>>>>>>>>>> an
> > >> >> >>>>>>>>>>>>>>>>>>>>> easier
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>>>> solution as the first step: Instead of
> > >> making
> > >> >> >>>> caching
> > >> >> >>>>>>>>>>>>>>> an
> > >> >> >>>>>>>>>>>>>>>>>>>>>>> implementation
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>>>> detail of TableFunction X, rather
> devise a
> > >> >> >> caching
> > >> >> >>>>>>>>>>>>>>> layer
> > >> >> >>>>>>>>>>>>>>>>>>>> around X.
> > >> >> >>>>>>>>>>>>>>>>>>>>>> So
> > >> >> >>>>>>>>>>>>>>>>>>>>>>> the
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>>>> proposal would be a CachingTableFunction
> > >> that
> > >> >> >>>>>>>>>>>>>>> delegates to
> > >> >> >>>>>>>>>>>>>>>>> X in
> > >> >> >>>>>>>>>>>>>>>>>>>>> case
> > >> >> >>>>>>>>>>>>>>>>>>>>>>> of
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>>>> misses and else manages the cache.
> Lifting
> > >> it
> > >> >> >> into
> > >> >> >>>> the
> > >> >> >>>>>>>>>>>>>>>>> operator
> > >> >> >>>>>>>>>>>>>>>>>>>>>> model
> > >> >> >>>>>>>>>>>>>>>>>>>>>>> as
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>>>> proposed would be even better but is
> > >> probably
> > >> >> >>>>>>>>>>>>>>> unnecessary
> > >> >> >>>>>>>>>>>>>>>>> in
> > >> >> >>>>>>>>>>>>>>>>>>>> the
> > >> >> >>>>>>>>>>>>>>>>>>>>>>> first step
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>>>> for a lookup source (as the source will
> > only
> > >> >> >>> receive
> > >> >> >>>>>>>>>>>>>>> the
> > >> >> >>>>>>>>>>>>>>>>>>>> requests
> > >> >> >>>>>>>>>>>>>>>>>>>>>>> after
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>>>> filter; applying projection may be more
> > >> >> >>> interesting
> > >> >> >>>> to
> > >> >> >>>>>>>>>>>>>>> save
> > >> >> >>>>>>>>>>>>>>>>>>>>> memory).
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>>>>
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>>>> Another advantage is that all the
> changes
> > of
> > >> >> >> this
> > >> >> >>>> FLIP
> > >> >> >>>>>>>>>>>>>>>>> would be
> > >> >> >>>>>>>>>>>>>>>>>>>>>>> limited to
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>>>> options, no need for new public
> > interfaces.
> > >> >> >>>> Everything
> > >> >> >>>>>>>>>>>>>>> else
> > >> >> >>>>>>>>>>>>>>>>>>>>> remains
> > >> >> >>>>>>>>>>>>>>>>>>>>>> an
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>>>> implementation of Table runtime. That
> > means
> > >> we
> > >> >> >> can
> > >> >> >>>>>>>>>>>>>>> easily
> > >> >> >>>>>>>>>>>>>>>>>>>>>> incorporate
> > >> >> >>>>>>>>>>>>>>>>>>>>>>> the
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>>>> optimization potential that Alexander
> > >> pointed
> > >> >> >> out
> > >> >> >>>>>>>>>>>>>>> later.
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>>>>
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>>>> @Alexander unfortunately, your
> > architecture
> > >> is
> > >> >> >> not
> > >> >> >>>>>>>>>>>>>>> shared.
> > >> >> >>>>>>>>>>>>>>>>> I
> > >> >> >>>>>>>>>>>>>>>>>>>> don't
> > >> >> >>>>>>>>>>>>>>>>>>>>>>> know the
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>>>> solution to share images to be honest.
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>>>>
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>>>> On Fri, Apr 22, 2022 at 5:04 PM
> Александр
> > >> >> >> Смирнов
> > >> >> >>> <
> > >> >> >>>>>>>>>>>>>>>>>>>>>>> smiralexan@gmail.com>
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>>>> wrote:
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>>>>
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>>>>> Hi Qingsheng! My name is Alexander, I'm
> > >> not a
> > >> >> >>>>>>>>>>>>>>> committer
> > >> >> >>>>>>>>>>>>>>>>> yet,
> > >> >> >>>>>>>>>>>>>>>>>>>> but
> > >> >> >>>>>>>>>>>>>>>>>>>>>> I'd
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>>>>> really like to become one. And this
> FLIP
> > >> >> really
> > >> >> >>>>>>>>>>>>>>>>> interested
> > >> >> >>>>>>>>>>>>>>>>>>>> me.
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>>>>> Actually I have worked on a similar
> > >> feature in
> > >> >> >> my
> > >> >> >>>>>>>>>>>>>>>>> company’s
> > >> >> >>>>>>>>>>>>>>>>>>>>> Flink
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>>>>> fork, and we would like to share our
> > >> thoughts
> > >> >> >> on
> > >> >> >>>>>>>>>>>>>>> this and
> > >> >> >>>>>>>>>>>>>>>>>>>> make
> > >> >> >>>>>>>>>>>>>>>>>>>>>> code
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>>>>> open source.
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>>>>> I think there is a better alternative
> > than
> > >> >> >>>>>>>>>>>>>>> introducing an
> > >> >> >>>>>>>>>>>>>>>>>>>>> abstract
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>>>>> class for TableFunction
> > >> >> (CachingTableFunction).
> > >> >> >>> As
> > >> >> >>>>>>>>>>>>>>> you
> > >> >> >>>>>>>>>>>>>>>>> know,
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>>>>> TableFunction exists in the
> > >> flink-table-common
> > >> >> >>>>>>>>>>>>>>> module,
> > >> >> >>>>>>>>>>>>>>>>> which
> > >> >> >>>>>>>>>>>>>>>>>>>>>>> provides
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>>>>> only an API for working with tables –
> > it’s
> > >> >> very
> > >> >> >>>>>>>>>>>>>>>>> convenient
> > >> >> >>>>>>>>>>>>>>>>>>>> for
> > >> >> >>>>>>>>>>>>>>>>>>>>>>> importing
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>>>>> in connectors. In turn,
> > >> CachingTableFunction
> > >> >> >>>> contains
> > >> >> >>>>>>>>>>>>>>>>> logic
> > >> >> >>>>>>>>>>>>>>>>>>>> for
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>>>>> runtime execution,  so this class and
> > >> >> >> everything
> > >> >> >>>>>>>>>>>>>>>>> connected
> > >> >> >>>>>>>>>>>>>>>>>>>> with
> > >> >> >>>>>>>>>>>>>>>>>>>>> it
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>>>>> should be located in another module,
> > >> probably
> > >> >> >> in
> > >> >> >>>>>>>>>>>>>>>>>>>>>>> flink-table-runtime.
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>>>>> But this will require connectors to
> > depend
> > >> on
> > >> >> >>>> another
> > >> >> >>>>>>>>>>>>>>>>> module,
> > >> >> >>>>>>>>>>>>>>>>>>>>>> which
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>>>>> contains a lot of runtime logic, which
> > >> doesn’t
> > >> >> >>>> sound
> > >> >> >>>>>>>>>>>>>>>>> good.
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>>>>> I suggest adding a new method
> > >> >> ‘getLookupConfig’
> > >> >> >>> to
> > >> >> >>>>>>>>>>>>>>>>>>>>>> LookupTableSource
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>>>>> or LookupRuntimeProvider to allow
> > >> connectors
> > >> >> to
> > >> >> >>>> only
> > >> >> >>>>>>>>>>>>>>> pass
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>>>>> configurations to the planner,
> therefore
> > >> they
> > >> >> >>> won’t
> > >> >> >>>>>>>>>>>>>>>>> depend on
> > >> >> >>>>>>>>>>>>>>>>>>>>>>> runtime
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>>>>> realization. Based on these configs
> > planner
> > >> >> >> will
> > >> >> >>>>>>>>>>>>>>>>> construct a
> > >> >> >>>>>>>>>>>>>>>>>>>>>> lookup
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>>>>> join operator with corresponding
> runtime
> > >> logic
> > >> >> >>>>>>>>>>>>>>>>>>>> (ProcessFunctions
> > >> >> >>>>>>>>>>>>>>>>>>>>>> in
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>>>>> module flink-table-runtime).
> Architecture
> > >> >> looks
> > >> >> >>>> like
> > >> >> >>>>>>>>>>>>>>> in
> > >> >> >>>>>>>>>>>>>>>>> the
> > >> >> >>>>>>>>>>>>>>>>>>>>> pinned
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>>>>> image (LookupConfig class there is
> > actually
> > >> >> >> yours
> > >> >> >>>>>>>>>>>>>>>>>>>> CacheConfig).
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>>>>> Classes in flink-table-planner, that
> will
> > >> be
> > >> >> >>>>>>>>>>>>>>> responsible
> > >> >> >>>>>>>>>>>>>>>>> for
> > >> >> >>>>>>>>>>>>>>>>>>>>> this
> > >> >> >>>>>>>>>>>>>>>>>>>>>> –
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>>>>> CommonPhysicalLookupJoin and his
> > >> inheritors.
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>>>>> Current classes for lookup join in
> > >> >> >>>>>>>>>>>>>>> flink-table-runtime
> > >> >> >>>>>>>>>>>>>>>>> -
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>>>>> LookupJoinRunner,
> AsyncLookupJoinRunner,
> > >> >> >>>>>>>>>>>>>>>>>>>>> LookupJoinRunnerWithCalc,
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>>>>> AsyncLookupJoinRunnerWithCalc.
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>>>>> I suggest adding classes
> > >> >> >> LookupJoinCachingRunner,
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>>>>> LookupJoinCachingRunnerWithCalc, etc.
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>>>>> And here comes another more powerful
> > >> advantage
> > >> >> >> of
> > >> >> >>>>>>>>>>>>>>> such a
> > >> >> >>>>>>>>>>>>>>>>>>>>> solution.
> > >> >> >>>>>>>>>>>>>>>>>>>>>>> If
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>>>>> we have caching logic on a lower level,
> > we
> > >> can
> > >> >> >>>> apply
> > >> >> >>>>>>>>>>>>>>> some
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>>>>> optimizations to it.
> > >> LookupJoinRunnerWithCalc
> > >> >> >> was
> > >> >> >>>>>>>>>>>>>>> named
> > >> >> >>>>>>>>>>>>>>>>> like
> > >> >> >>>>>>>>>>>>>>>>>>>>> this
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>>>>> because it uses the ‘calc’ function,
> > which
> > >> >> >>> actually
> > >> >> >>>>>>>>>>>>>>>>> mostly
> > >> >> >>>>>>>>>>>>>>>>>>>>>> consists
> > >> >> >>>>>>>>>>>>>>>>>>>>>>> of
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>>>>> filters and projections.
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>>>>> For example, in join table A with
> lookup
> > >> table
> > >> >> >> B
> > >> >> >>>>>>>>>>>>>>>>> condition
> > >> >> >>>>>>>>>>>>>>>>>>>>> ‘JOIN …
> > >> >> >>>>>>>>>>>>>>>>>>>>>>> ON
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>>>>> A.id = B.id AND A.age = B.age + 10
> WHERE
> > >> >> >>> B.salary >
> > >> >> >>>>>>>>>>>>>>> 1000’
> > >> >> >>>>>>>>>>>>>>>>>>>>> ‘calc’
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>>>>> function will contain filters A.age =
> > >> B.age +
> > >> >> >> 10
> > >> >> >>>> and
> > >> >> >>>>>>>>>>>>>>>>>>>> B.salary >
> > >> >> >>>>>>>>>>>>>>>>>>>>>>> 1000.
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>>>>> If we apply this function before
> storing
> > >> >> >> records
> > >> >> >>> in
> > >> >> >>>>>>>>>>>>>>>>> cache,
> > >> >> >>>>>>>>>>>>>>>>>>>> size
> > >> >> >>>>>>>>>>>>>>>>>>>>> of
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>>>>> cache will be significantly reduced:
> > >> filters =
> > >> >> >>>> avoid
> > >> >> >>>>>>>>>>>>>>>>> storing
> > >> >> >>>>>>>>>>>>>>>>>>>>>> useless
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>>>>> records in cache, projections = reduce
> > >> >> records’
> > >> >> >>>>>>>>>>>>>>> size. So
> > >> >> >>>>>>>>>>>>>>>>> the
> > >> >> >>>>>>>>>>>>>>>>>>>>>> initial
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>>>>> max number of records in cache can be
> > >> >> increased
> > >> >> >>> by
> > >> >> >>>>>>>>>>>>>>> the
> > >> >> >>>>>>>>>>>>>>>>> user.
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>>>>> What do you think about it?
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>>>>> On 2022/04/19 02:47:11 Qingsheng Ren
> > wrote:
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi devs,
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>>>>>> Yuan and I would like to start a
> > >> discussion
> > >> >> >>> about
> > >> >> >>>>>>>>>>>>>>>>>>>> FLIP-221[1],
> > >> >> >>>>>>>>>>>>>>>>>>>>>>> which
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>>>>> introduces an abstraction of lookup
> table
> > >> >> cache
> > >> >> >>> and
> > >> >> >>>>>>>>>>>>>>> its
> > >> >> >>>>>>>>>>>>>>>>>>>> standard
> > >> >> >>>>>>>>>>>>>>>>>>>>>>> metrics.
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>>>>>> Currently each lookup table source
> > should
> > >> >> >>>> implement
> > >> >> >>>>>>>>>>>>>>>>> their
> > >> >> >>>>>>>>>>>>>>>>>>>> own
> > >> >> >>>>>>>>>>>>>>>>>>>>>>> cache to
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>>>>> store lookup results, and there isn’t a
> > >> >> >> standard
> > >> >> >>> of
> > >> >> >>>>>>>>>>>>>>>>> metrics
> > >> >> >>>>>>>>>>>>>>>>>>>> for
> > >> >> >>>>>>>>>>>>>>>>>>>>>>> users and
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>>>>> developers to tuning their jobs with
> > lookup
> > >> >> >>> joins,
> > >> >> >>>>>>>>>>>>>>> which
> > >> >> >>>>>>>>>>>>>>>>> is a
> > >> >> >>>>>>>>>>>>>>>>>>>>>> quite
> > >> >> >>>>>>>>>>>>>>>>>>>>>>> common
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>>>>> use case in Flink table / SQL.
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>>>>>> Therefore we propose some new APIs
> > >> including
> > >> >> >>>> cache,
> > >> >> >>>>>>>>>>>>>>>>>>>> metrics,
> > >> >> >>>>>>>>>>>>>>>>>>>>>>> wrapper
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>>>>> classes of TableFunction and new table
> > >> >> options.
> > >> >> >>>>>>>>>>>>>>> Please
> > >> >> >>>>>>>>>>>>>>>>> take a
> > >> >> >>>>>>>>>>>>>>>>>>>>> look
> > >> >> >>>>>>>>>>>>>>>>>>>>>>> at the
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>>>>> FLIP page [1] to get more details. Any
> > >> >> >>> suggestions
> > >> >> >>>>>>>>>>>>>>> and
> > >> >> >>>>>>>>>>>>>>>>>>>> comments
> > >> >> >>>>>>>>>>>>>>>>>>>>>>> would be
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>>>>> appreciated!
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>>>>>> [1]
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>
> > >> >> >>>>>>>>>>>>>>>>>>>>>>
> > >> >> >>>>>>>>>>>>>>>>>>>>>
> > >> >> >>>>>>>>>>>>>>>>>>>>
> > >> >> >>>>>>>>>>>>>>>>>
> > >> >> >>>>>>>>>>>>>>>
> > >> >> >>>>
> > >> >> >>>
> > >> >> >>
> > >> >>
> > >>
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-221+Abstraction+for+lookup+source+cache+and+metric
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>>>>>> Best regards,
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>>>>>> Qingsheng
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>>
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>>
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>>
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>> --
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>> Best Regards,
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>>
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>> Qingsheng Ren
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>>
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>> Real-time Computing Team
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>> Alibaba Cloud
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>>
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>> Email: renqschn@gmail.com
> > >> >> >>>>>>>>>>>>>>>>>>>>>>>
> > >> >> >>>>>>>>>>>>>>>>>>>>>>
> > >> >> >>>>>>>>>>>>>>>>>>>>>
> > >> >> >>>>>>>>>>>>>>>>>>>>
> > >> >> >>>>>>>>>>>>>>>>>>>
> > >> >> >>>>>>>>>>>>>>>>>>>
> > >> >> >>>>>>>>>>>>>>>>>>> --
> > >> >> >>>>>>>>>>>>>>>>>>> Best regards,
> > >> >> >>>>>>>>>>>>>>>>>>> Roman Boyko
> > >> >> >>>>>>>>>>>>>>>>>>> e.: ro.v.boyko@gmail.com
> > >> >> >>>>>>>>>>>>>>>>>>>
> > >> >> >>>>>>>>>>>>>>>>>
> > >> >> >>>>>>>>>>>>>>>
> > >> >> >>>>>>>>>>>>>>>
> > >> >> >>>>>>>>>>>>>
> > >> >> >>>>>>>>>>>>
> > >> >> >>>>>>>>>>>>
> > >> >> >>>>>>>>>>>> --
> > >> >> >>>>>>>>>>>> Best Regards,
> > >> >> >>>>>>>>>>>>
> > >> >> >>>>>>>>>>>> Qingsheng Ren
> > >> >> >>>>>>>>>>>>
> > >> >> >>>>>>>>>>>> Real-time Computing Team
> > >> >> >>>>>>>>>>>> Alibaba Cloud
> > >> >> >>>>>>>>>>>>
> > >> >> >>>>>>>>>>>> Email: renqschn@gmail.com
> > >> >> >>>>>>>>>>
> > >> >> >>>>
> > >> >> >>>>
> > >> >> >>>
> > >> >> >>
> > >> >>
> > >> >>
> > >>
> > >
> >
>
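Alexander's point above about applying the 'calc' step (filters and projections) before populating the cache can be sketched roughly as follows. All class and method names here are illustrative, not actual Flink APIs, and `List<Object>` stands in for Flink's `RowData` so the sketch stays self-contained:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.function.Predicate;

/** Illustrative sketch: apply filters/projections before caching lookup results. */
public class FilterBeforeCacheSketch {
    // Hypothetical cache keyed by join key; values are already filtered/projected rows.
    static final Map<String, List<List<Object>>> CACHE = new HashMap<>();

    /** Looks up rows, applying the "calc" step (filter + projection) before caching. */
    static List<List<Object>> lookup(
            String key,
            Function<String, List<List<Object>>> backend, // simulated connector lookup
            Predicate<List<Object>> filter,               // e.g. B.salary > 1000
            Function<List<Object>, List<Object>> projection) {
        List<List<Object>> cached = CACHE.get(key);
        if (cached != null) {
            return cached; // cache hit: rows were already filtered and projected
        }
        List<List<Object>> result = new ArrayList<>();
        for (List<Object> row : backend.apply(key)) {
            if (filter.test(row)) {
                result.add(projection.apply(row)); // keep only what survives the calc
            }
        }
        CACHE.put(key, result); // fewer and smaller rows than caching raw responses
        return result;
    }
}
```

Because only filtered, projected rows are stored, the same memory budget holds more distinct keys, which is the optimization Alexander describes.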

Re: [DISCUSS] FLIP-221 Abstraction for lookup source cache and metric

Posted by Jing Ge <ji...@ververica.com>.
Thanks all for the valuable discussion. The new feature looks very
interesting.

According to the FLIP description: "*Currently we have JDBC, Hive and HBase
connector implemented lookup table source. All existing implementations
will be migrated to the current design and the migration will be
transparent to end users*." I was only wondering if we should pay attention
to HBase and similar DBs. Since, commonly, the lookup data will be huge
while using HBase, partial caching will be used in this case, if I am not
mistaken, which might have an impact on the block cache used by HBase, e.g.
LruBlockCache.
Another question: since HBase provides a sophisticated cache solution of its
own, does it make sense to offer a no-cache solution as one of the defaults,
so that customers who want to stick with the HBase cache can migrate with no
effort?

Best regards,
Jing

On Fri, May 27, 2022 at 11:19 AM Jingsong Li <ji...@gmail.com> wrote:

> Hi all,
>
> I think the problem now is below:
> 1. The AllCache and PartialCache interfaces are not uniform: one needs to
> provide a LookupProvider, while the other needs to provide a CacheBuilder.
> 2. The AllCache definition is not flexible. For example, PartialCache can use
> any custom storage, while AllCache cannot; AllCache may also want to store to
> memory or disk, so it needs a flexible strategy as well.
> 3. AllCache cannot use a customized ReloadStrategy; currently only
> ScheduledReloadStrategy is available.
>
> In order to solve the above problems, the following are my ideas.
>
> ## Top level cache interfaces:
>
> ```
>
> public interface CacheLookupProvider extends
> LookupTableSource.LookupRuntimeProvider {
>
>     CacheBuilder createCacheBuilder();
> }
>
>
> public interface CacheBuilder {
>     Cache create();
> }
>
>
> public interface Cache {
>
>     /**
>      * Returns the value associated with key in this cache, or null if
> there is no cached value for
>      * key.
>      */
>     @Nullable
>     Collection<RowData> getIfPresent(RowData key);
>
>     /** Returns the number of key-value mappings in the cache. */
>     long size();
> }
>
> ```
>
> ## Partial cache
>
> ```
>
> public interface PartialCacheLookupFunction extends CacheLookupProvider {
>
>     @Override
>     PartialCacheBuilder createCacheBuilder();
>
>     /** Creates an {@link LookupFunction} instance. */
>     LookupFunction createLookupFunction();
> }
>
>
> public interface PartialCacheBuilder extends CacheBuilder {
>
>     PartialCache create();
> }
>
>
> public interface PartialCache extends Cache {
>
>     /**
>      * Associates the specified value rows with the specified key row
> in the cache. If the cache
>      * previously contained value associated with the key, the old
> value is replaced by the
>      * specified value.
>      *
>      * @param key key row with which the specified value is to be associated
>      * @param value value rows to be associated with the specified key
>      * @return the previous value rows associated with key, or null if there
>      *     was no mapping for key
>      */
>     Collection<RowData> put(RowData key, Collection<RowData> value);
>
>     /** Discards any cached value for the specified key. */
>     void invalidate(RowData key);
> }
>
> ```
>
> ## All cache
> ```
>
> public interface AllCacheLookupProvider extends CacheLookupProvider {
>
>     void registerReloadStrategy(ScheduledExecutorService
> executorService, Reloader reloader);
>
>     ScanTableSource.ScanRuntimeProvider getScanRuntimeProvider();
>
>     @Override
>     AllCacheBuilder createCacheBuilder();
> }
>
>
> public interface AllCacheBuilder extends CacheBuilder {
>
>     AllCache create();
> }
>
>
> public interface AllCache extends Cache {
>
>     void putAll(Iterator<Map<RowData, RowData>> allEntries);
>
>     void clearAll();
> }
>
>
> public interface Reloader {
>
>     void reload();
> }
>
> ```
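For illustration, a PartialCache-style implementation along the lines of the interfaces proposed above could be backed by a simple LRU map. The sketch below substitutes `List<Object>` for `RowData` to stay self-contained; the names and shapes are illustrative only, since the interfaces are still under discussion in this thread:

```java
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative sketch of a partial (LRU) cache for lookup join results. */
public class LruPartialCacheSketch {

    static final class LruCache {
        private final LinkedHashMap<List<Object>, Collection<List<Object>>> map;

        LruCache(int maxRows) {
            // An access-ordered LinkedHashMap evicts the least recently used entry
            // once the configured maximum number of cached keys is exceeded.
            this.map = new LinkedHashMap<>(16, 0.75f, true) {
                @Override
                protected boolean removeEldestEntry(
                        Map.Entry<List<Object>, Collection<List<Object>>> eldest) {
                    return size() > maxRows;
                }
            };
        }

        /** Returns the cached rows for the key, or null on a cache miss. */
        Collection<List<Object>> getIfPresent(List<Object> key) {
            return map.get(key);
        }

        /** Caches the rows, returning the previously cached value if any. */
        Collection<List<Object>> put(List<Object> key, Collection<List<Object>> value) {
            return map.put(key, value);
        }

        /** Discards any cached value for the key. */
        void invalidate(List<Object> key) {
            map.remove(key);
        }

        long size() {
            return map.size();
        }
    }
}
```

A production cache would additionally need TTL handling, metrics, and thread safety; this sketch only shows the core get/put/evict contract the proposed interface describes.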
>
> Best,
> Jingsong
>
> On Fri, May 27, 2022 at 11:10 AM Jingsong Li <ji...@gmail.com>
> wrote:
>
> > Thanks Qingsheng and all for your discussion.
> >
> > Very sorry to jump in so late.
> >
> > Maybe I missed something?
> > My first impression when I saw the cache interface was: why don't we
> > provide an interface similar to Guava's cache [1]? On top of Guava's
> > cache, Caffeine also adds extensions for asynchronous calls [2].
> > There is also bulk loading in Caffeine.
> >
> > I am also confused about why we go first through LookupCacheFactory.Builder
> > and then through a Factory to create the Cache.
> >
> > [1] https://github.com/google/guava
> > [2] https://github.com/ben-manes/caffeine/wiki/Population
> >
> > Best,
> > Jingsong
> >
> > On Thu, May 26, 2022 at 11:17 PM Jark Wu <im...@gmail.com> wrote:
> >
> >> After looking at the newly introduced ReloadTime and Becket's comment,
> >> I agree with Becket that we should have a pluggable reloading strategy.
> >> We can provide some common implementations, e.g., periodic reloading and
> >> daily reloading.
> >> But there will definitely be some connector- or business-specific
> >> reloading strategies, e.g.,
> >> notifying via a ZooKeeper watcher, or reloading once a new Hive partition
> >> is complete.
> >>
> >> Best,
> >> Jark
> >>
> >> On Thu, 26 May 2022 at 11:52, Becket Qin <be...@gmail.com> wrote:
> >>
> >> > Hi Qingsheng,
> >> >
> >> > Thanks for updating the FLIP. A few comments / questions below:
> >> >
> >> > 1. Is there a reason that we have both "XXXFactory" and "XXXProvider"?
> >> > What is the difference between them? If they are the same, can we just
> >> use
> >> > XXXFactory everywhere?
> >> >
> >> > 2. Regarding the FullCachingLookupProvider, should the reloading
> >> > policy also be pluggable? Periodic reloading can sometimes be tricky
> >> > in practice. For example, if a user sets 24 hours as the cache refresh
> >> > interval and some nightly batch job is delayed, the cache update may
> >> > still see stale data.
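A pluggable reloading strategy along the lines Becket and Jark suggest could look roughly like the sketch below. All names are illustrative assumptions, not part of the FLIP; the strategy only decides *when* the next reload happens, so connector-specific triggers (ZooKeeper watchers, Hive partition completion) could be added as further implementations:

```java
import java.time.Duration;
import java.time.LocalTime;

/** Illustrative sketch of a pluggable cache-reload policy. */
public class ReloadStrategySketch {

    /** Decides when the next full-cache reload should happen. */
    interface ReloadStrategy {
        /** Delay from 'now' until the next reload. */
        Duration nextReloadDelay(LocalTime now);
    }

    /** Reloads at a fixed interval, e.g. every 24 hours. */
    static final class PeriodicReload implements ReloadStrategy {
        private final Duration interval;

        PeriodicReload(Duration interval) {
            this.interval = interval;
        }

        @Override
        public Duration nextReloadDelay(LocalTime now) {
            return interval;
        }
    }

    /** Reloads once a day at a fixed UTC time, e.g. right after a nightly batch job. */
    static final class DailyReloadAt implements ReloadStrategy {
        private final LocalTime reloadTime;

        DailyReloadAt(LocalTime reloadTime) {
            this.reloadTime = reloadTime;
        }

        @Override
        public Duration nextReloadDelay(LocalTime now) {
            Duration delay = Duration.between(now, reloadTime);
            // If today's reload time already passed, schedule for tomorrow.
            return delay.isNegative() ? delay.plusDays(1) : delay;
        }
    }
}
```

The returned delay could then be handed to a `ScheduledExecutorService`; a daily strategy anchored to a fixed time avoids the stale-data window that a plain fixed interval can hit when the upstream batch job slips.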
> >> >
> >> > 3. In DefaultLookupCacheFactory, it looks like InitialCapacity should
> be
> >> > removed.
> >> >
> >> > 4. The purpose of LookupFunctionProvider#cacheMissingKey() seems a
> >> little
> >> > confusing to me. If Optional<LookupCacheFactory> getCacheFactory()
> >> returns
> >> > a non-empty factory, doesn't that already instruct the framework to
> >> cache
> >> > the missing keys? Also, why is this method returning an
> >> Optional<Boolean>
> >> > instead of boolean?
> >> >
> >> > Thanks,
> >> >
> >> > Jiangjie (Becket) Qin
> >> >
> >> >
> >> >
> >> > On Wed, May 25, 2022 at 5:07 PM Qingsheng Ren <re...@gmail.com>
> >> wrote:
> >> >
> >> >> Hi Lincoln and Jark,
> >> >>
> >> >> Thanks for the comments! If the community reaches a consensus that we
> >> use
> >> >> SQL hint instead of table options to decide whether to use sync or
> >> async
> >> >> mode, it’s indeed not necessary to introduce the “lookup.async”
> option.
> >> >>
> >> >> I think it’s a good idea to let the decision about async mode be made
> >> >> at the query level, which enables better optimization with more
> >> >> information gathered by the planner. Is there any FLIP describing the
> >> >> issue in FLINK-27625? I thought FLIP-234 was only proposing a SQL hint
> >> >> for retry on missing data, rather than having the entire async mode
> >> >> controlled by hints.
> >> >>
> >> >> Best regards,
> >> >>
> >> >> Qingsheng
> >> >>
> >> >> > On May 25, 2022, at 15:13, Lincoln Lee <li...@gmail.com>
> >> wrote:
> >> >> >
> >> >> > Hi Jark,
> >> >> >
> >> >> > Thanks for your reply!
> >> >> >
> >> >> > Currently 'lookup.async' exists only in the HBase connector. I have
> >> >> > no idea whether or when to remove it (we can discuss that in another
> >> >> > issue for the HBase connector after FLINK-27625 is done); for now I
> >> >> > just prefer not to add it as a common option.
> >> >> >
> >> >> > Best,
> >> >> > Lincoln Lee
> >> >> >
> >> >> >
> >> >> > Jark Wu <im...@gmail.com> 于2022年5月24日周二 20:14写道:
> >> >> >
> >> >> >> Hi Lincoln,
> >> >> >>
> >> >> >> I have taken a look at FLIP-234, and I agree with you that the
> >> >> connectors
> >> >> >> can
> >> >> >> provide both async and sync runtime providers simultaneously
> instead
> >> >> of one
> >> >> >> of them.
> >> >> >> At that point, "lookup.async" looks redundant. If this option is
> >> >> planned to
> >> >> >> be removed
> >> >> >> in the long term, I think it makes sense not to introduce it in
> this
> >> >> FLIP.
> >> >> >>
> >> >> >> Best,
> >> >> >> Jark
> >> >> >>
> >> >> >> On Tue, 24 May 2022 at 11:08, Lincoln Lee <lincoln.86xy@gmail.com
> >
> >> >> wrote:
> >> >> >>
> >> >> >>> Hi Qingsheng,
> >> >> >>>
> >> >> >>> Sorry for jumping into the discussion so late. It's a good idea
> >> that
> >> >> we
> >> >> >> can
> >> >> >>> have a common table option. I have a minor comment on
> >> >> >>> 'lookup.async': I suggest not making it a common option.
> >> >> >>>
> >> >> >>> The table layer abstracts both sync and async lookup
> capabilities,
> >> >> >>> connectors implementers can choose one or both, in the case of
> >> >> >> implementing
> >> >> >>> only one capability(status of the most of existing builtin
> >> connectors)
> >> >> >>> 'lookup.async' will not be used.  And when a connector has both
> >> >> >>> capabilities, I think this choice is more suitable for making
> >> >> decisions
> >> >> >> at
> >> >> >>> the query level, for example, table planner can choose the
> physical
> >> >> >>> implementation of async lookup or sync lookup based on its cost
> >> >> model, or
> >> >> >>> users can give query hint based on their own better
> >> understanding.  If
> >> >> >>> there is another common table option 'lookup.async', it may
> confuse
> >> >> the
> >> >> >>> users in the long run.
> >> >> >>>
> >> >> >>> So, I prefer to leave the 'lookup.async' option in private place
> >> (for
> >> >> the
> >> >> >>> current HBase connector) and not turn it into a common option.
> >> >> >>>
> >> >> >>> WDYT?
> >> >> >>>
> >> >> >>> Best,
> >> >> >>> Lincoln Lee
> >> >> >>>
> >> >> >>>
> >> >> >>> Qingsheng Ren <re...@gmail.com> 于2022年5月23日周一 14:54写道:
> >> >> >>>
> >> >> >>>> Hi Alexander,
> >> >> >>>>
> >> >> >>>> Thanks for the review! We recently updated the FLIP and you can
> >> >> >>>> find those changes in my latest email. Since some terminology has
> >> >> >>>> changed, I’ll use the new concepts when replying to your comments.
> >> >> >>>>
> >> >> >>>> 1. Builder vs ‘of’
> >> >> >>>> I’m OK to use builder pattern if we have additional optional
> >> >> parameters
> >> >> >>>> for full caching mode (“rescan” previously). The
> >> schedule-with-delay
> >> >> >> idea
> >> >> >>>> looks reasonable to me, but I think we need to redesign the
> >> builder
> >> >> API
> >> >> >>> of
> >> >> >>>> full caching to make it more descriptive for developers. Would
> you
> >> >> mind
> >> >> >>>> sharing your ideas about the API? For accessing the FLIP
> workspace
> >> >> you
> >> >> >>> can
> >> >> >>>> just provide your account ID and ping any PMC member including
> >> Jark.
> >> >> >>>>
> >> >> >>>> 2. Common table options
> >> >> >>>> We have some discussions these days and propose to introduce 8
> >> common
> >> >> >>>> table options about caching. It has been updated on the FLIP.
> >> >> >>>>
> >> >> >>>> 3. Retries
> >> >> >>>> I think we are on the same page :-)
> >> >> >>>>
> >> >> >>>> For your additional concerns:
> >> >> >>>> 1) The table option has been updated.
> >> >> >>>> 2) We got “lookup.cache” back for configuring whether to use
> >> partial
> >> >> or
> >> >> >>>> full caching mode.
> >> >> >>>>
> >> >> >>>> Best regards,
> >> >> >>>>
> >> >> >>>> Qingsheng
> >> >> >>>>
> >> >> >>>>
> >> >> >>>>
> >> >> >>>>> On May 19, 2022, at 17:25, Александр Смирнов <
> >> smiralexan@gmail.com>
> >> >> >>>> wrote:
> >> >> >>>>>
> >> >> >>>>> Also I have a few additions:
> >> >> >>>>> 1) maybe rename 'lookup.cache.maximum-size' to
> >> >> >>>>> 'lookup.cache.max-rows'? I think it will be clearer that we are
> >> >> >>>>> talking about the number of rows, not bytes. Plus it fits better,
> >> >> >>>>> considering my optimization with filters.
> >> >> >>>>> 2) How will users enable rescanning? Are we going to separate
> >> >> caching
> >> >> >>>>> and rescanning from the options point of view? Like initially
> we
> >> had
> >> >> >>>>> one option 'lookup.cache' with values LRU / ALL. I think now we
> >> can
> >> >> >>>>> make a boolean option 'lookup.rescan'. RescanInterval can be
> >> >> >>>>> 'lookup.rescan.interval', etc.
> >> >> >>>>>
> >> >> >>>>> Best regards,
> >> >> >>>>> Alexander
> >> >> >>>>>
> >> >> >>>>> чт, 19 мая 2022 г. в 14:50, Александр Смирнов <
> >> smiralexan@gmail.com
> >> >> >>> :
> >> >> >>>>>>
> >> >> >>>>>> Hi Qingsheng and Jark,
> >> >> >>>>>>
> >> >> >>>>>> 1. Builders vs 'of'
> >> >> >>>>>> I understand that builders are used when we have multiple
> >> >> >> parameters.
> >> >> >>>>>> I suggested them because we could add parameters later. To
> >> prevent
> >> >> >>>>>> Builder for ScanRuntimeProvider from looking redundant I can
> >> >> suggest
> >> >> >>>>>> one more config now - "rescanStartTime".
> >> >> >>>>>> It's a time in UTC (LocalTime class) when the first reload of
> >> cache
> >> >> >>>>>> starts. This parameter can be thought of as 'initialDelay'
> (diff
> >> >> >>>>>> between current time and rescanStartTime) in method
> >> >> >>>>>> ScheduleExecutorService#scheduleWithFixedDelay [1] . It can be
> >> very
> >> >> >>>>>> useful when the dimension table is updated by some other
> >> scheduled
> >> >> >> job
> >> >> >>>>>> at a certain time. Or when the user simply wants a second scan
> >> >> >> (first
> >> >> >>>>>> cache reload) be delayed. This option can be used even without
> >> >> >>>>>> 'rescanInterval' - in this case 'rescanInterval' will be one
> >> day.
> >> >> >>>>>> If you are fine with this option, I would be very glad if you
> >> would
> >> >> >>>>>> give me access to edit FLIP page, so I could add it myself
> >> >> >>>>>>
> >> >> >>>>>> 2. Common table options
> >> >> >>>>>> I also think that FactoryUtil would be overloaded by all cache
> >> >> >>>>>> options. But maybe unify all suggested options, not only for
> >> >> default
> >> >> >>>>>> cache? I.e. a class 'LookupOptions' that unifies the default cache
> >> >> >>>>>> options, rescan options, 'async', and 'maxRetries'. WDYT?
> >> >> >>>>>>
> >> >> >>>>>> 3. Retries
> >> >> >>>>>> I'm fine with suggestion close to RetryUtils#tryTimes(times,
> >> call)
> >> >> >>>>>>
> >> >> >>>>>> [1]
> >> >> >>>>
> >> >> >>>
> >> >> >>
> >> >>
> >>
> https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/ScheduledExecutorService.html#scheduleWithFixedDelay-java.lang.Runnable-long-long-java.util.concurrent.TimeUnit-
> >> >> >>>>>>
> >> >> >>>>>> Best regards,
> >> >> >>>>>> Alexander
> >> >> >>>>>>
> >> >> >>>>>> ср, 18 мая 2022 г. в 16:04, Qingsheng Ren <renqschn@gmail.com
> >:
> >> >> >>>>>>>
> >> >> >>>>>>> Hi Jark and Alexander,
> >> >> >>>>>>>
> >> >> >>>>>>> Thanks for your comments! I’m also OK to introduce common
> table
> >> >> >>>> options. I prefer to introduce a new DefaultLookupCacheOptions
> >> class
> >> >> >> for
> >> >> >>>> holding these option definitions because putting all options
> into
> >> >> >>>> FactoryUtil would make it a bit ”crowded” and not well
> >> categorized.
> >> >> >>>>>>>
> >> >> >>>>>>> FLIP has been updated according to suggestions above:
> >> >> >>>>>>> 1. Use static “of” method for constructing
> >> RescanRuntimeProvider
> >> >> >>>> considering both arguments are required.
> >> >> >>>>>>> 2. Introduce new table options matching
> >> DefaultLookupCacheFactory
> >> >> >>>>>>>
> >> >> >>>>>>> Best,
> >> >> >>>>>>> Qingsheng
> >> >> >>>>>>>
> >> >> >>>>>>> On Wed, May 18, 2022 at 2:57 PM Jark Wu <im...@gmail.com>
> >> wrote:
> >> >> >>>>>>>>
> >> >> >>>>>>>> Hi Alex,
> >> >> >>>>>>>>
> >> >> >>>>>>>> 1) retry logic
> >> >> >>>>>>>> I think we can extract some common retry logic into
> utilities,
> >> >> >> e.g.
> >> >> >>>> RetryUtils#tryTimes(times, call).
> >> >> >>>>>>>> This seems independent of this FLIP and can be reused by
> >> >> >> DataStream
> >> >> >>>> users.
> >> >> >>>>>>>> Maybe we can open an issue to discuss this and where to put
> >> it.
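A minimal sketch of what such a RetryUtils#tryTimes(times, call) utility might look like; the exact signature and placement are still to be discussed, so everything here is only an assumption:

```java
import java.util.concurrent.Callable;

public final class RetryUtilsSketch {

    private RetryUtilsSketch() {}

    // Invokes `call` up to `times` attempts, returning the first successful
    // result and rethrowing the last failure if every attempt fails.
    public static <T> T tryTimes(int times, Callable<T> call) throws Exception {
        if (times <= 0) {
            throw new IllegalArgumentException("times must be positive");
        }
        Exception last = null;
        for (int attempt = 1; attempt <= times; attempt++) {
            try {
                return call.call();
            } catch (Exception e) {
                last = e;
            }
        }
        throw last;
    }
}
```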
> >> >> >>>>>>>>
> >> >> >>>>>>>> 2) cache ConfigOptions
> >> >> >>>>>>>> I'm fine with defining cache config options in the
> framework.
> >> >> >>>>>>>> A candidate place to put is FactoryUtil which also includes
> >> >> >>>> "sink.parallelism", "format" options.
> >> >> >>>>>>>>
> >> >> >>>>>>>> Best,
> >> >> >>>>>>>> Jark
> >> >> >>>>>>>>
> >> >> >>>>>>>>
> >> >> >>>>>>>> On Wed, 18 May 2022 at 13:52, Александр Смирнов <
> >> >> >>> smiralexan@gmail.com>
> >> >> >>>> wrote:
> >> >> >>>>>>>>>
> >> >> >>>>>>>>> Hi Qingsheng,
> >> >> >>>>>>>>>
> >> >> >>>>>>>>> Thank you for considering my comments.
> >> >> >>>>>>>>>
> >> >> >>>>>>>>>> there might be custom logic before making retry, such as
> >> >> >>>> re-establish the connection
> >> >> >>>>>>>>>
> >> >> >>>>>>>>> Yes, I understand that. I meant that such logic can be
> >> placed in
> >> >> >> a
> >> >> >>>>>>>>> separate function that can be implemented by connectors.
> >> Just
> >> >> >>> moving
> >> >> >>>>>>>>> the retry logic would make connector's LookupFunction more
> >> >> >> concise
> >> >> >>> +
> >> >> >>>>>>>>> avoid duplicate code. However, it's a minor change. The
> >> decision
> >> >> >> is
> >> >> >>>> up
> >> >> >>>>>>>>> to you.
> >> >> >>>>>>>>>
> >> >> >>>>>>>>>> We decided not to provide common DDL options and to let
> >> >> >>>>>>>>>> developers define their own options per connector, as we do now.
> >> >> >>>>>>>>>
> >> >> >>>>>>>>> What is the reason for that? One of the main goals of this
> >> FLIP
> >> >> >> was
> >> >> >>>> to
> >> >> >>>>>>>>> unify the configs, wasn't it? I understand that current
> cache
> >> >> >>> design
> >> >> >>>>>>>>> doesn't depend on ConfigOptions, like was before. But still
> >> we
> >> >> >> can
> >> >> >>>> put
> >> >> >>>>>>>>> these options into the framework, so connectors can reuse
> >> them
> >> >> >> and
> >> >> >>>>>>>>> avoid code duplication and, more significantly, avoid
> >> >> >>>>>>>>> inconsistent option naming. This can be pointed out in the
> >> >> >>>>>>>>> documentation for connector developers.
> >> >> >>>>>>>>>
> >> >> >>>>>>>>> Best regards,
> >> >> >>>>>>>>> Alexander
> >> >> >>>>>>>>>
> >> >> >>>>>>>>> вт, 17 мая 2022 г. в 17:11, Qingsheng Ren <
> >> renqschn@gmail.com>:
> >> >> >>>>>>>>>>
> >> >> >>>>>>>>>> Hi Alexander,
> >> >> >>>>>>>>>>
> >> >> >>>>>>>>>> Thanks for the review and glad to see we are on the same
> >> page!
> >> >> I
> >> >> >>>> think you forgot to cc the dev mailing list so I’m also quoting
> >> your
> >> >> >>> reply
> >> >> >>>> under this email.
> >> >> >>>>>>>>>>
> >> >> >>>>>>>>>>> We can add 'maxRetryTimes' option into this class
> >> >> >>>>>>>>>>
> >> >> >>>>>>>>>> In my opinion the retry logic should be implemented in
> >> lookup()
> >> >> >>>> instead of in LookupFunction#eval(). Retrying is only meaningful
> >> >> under
> >> >> >>> some
> >> >> >>>> specific retriable failures, and there might be custom logic
> >> before
> >> >> >>> making
> >> >> >>>> retry, such as re-establish the connection
> >> (JdbcRowDataLookupFunction
> >> >> >> is
> >> >> >>> an
> >> >> >>>> example), so it's more handy to leave it to the connector.
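To illustrate the point that retry belongs in the connector's lookup() rather than in a generic eval(): a connector may need to run its own recovery, such as re-establishing a connection, between attempts. The sketch below uses invented names and only stands in for what JdbcRowDataLookupFunction does; it is not Flink's actual API:

```java
import java.util.Collections;
import java.util.List;

// Stand-in for the framework-facing function: eval() simply delegates to
// lookup(), where connectors are free to implement their own retry policy.
abstract class LookupFunctionSketch {
    public final List<String> eval(int key) throws Exception {
        return lookup(key);
    }

    protected abstract List<String> lookup(int key) throws Exception;
}

// A JDBC-like connector that re-establishes its connection before retrying,
// something a generic retry loop outside the connector could not do.
class JdbcLikeLookupSketch extends LookupFunctionSketch {
    private final int maxRetryTimes;
    private boolean connected = false;
    int attempts = 0; // exposed only so the example below can inspect it

    JdbcLikeLookupSketch(int maxRetryTimes) {
        this.maxRetryTimes = maxRetryTimes;
    }

    @Override
    protected List<String> lookup(int key) throws Exception {
        for (int retry = 0; retry <= maxRetryTimes; retry++) {
            attempts++;
            try {
                if (!connected) {
                    throw new IllegalStateException("connection lost");
                }
                return Collections.singletonList("row-" + key);
            } catch (IllegalStateException e) {
                connected = true; // connector-specific recovery step
            }
        }
        throw new RuntimeException("lookup failed after " + maxRetryTimes + " retries");
    }
}
```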
> >> >> >>>>>>>>>>
> >> >> >>>>>>>>>>> I don't see DDL options, that were in previous version of
> >> >> FLIP.
> >> >> >>> Do
> >> >> >>>> you have any special plans for them?
> >> >> >>>>>>>>>>
> >> >> >>>>>>>>>> We decided not to provide common DDL options and to let
> >> >> >>>>>>>>>> developers define their own options per connector, as we do now.
> >> >> >>>>>>>>>>
> >> >> >>>>>>>>>> The rest of comments sound great and I’ll update the FLIP.
> >> Hope
> >> >> >> we
> >> >> >>>> can finalize our proposal soon!
> >> >> >>>>>>>>>>
> >> >> >>>>>>>>>> Best,
> >> >> >>>>>>>>>>
> >> >> >>>>>>>>>> Qingsheng
> >> >> >>>>>>>>>>
> >> >> >>>>>>>>>>
> >> >> >>>>>>>>>>> On May 17, 2022, at 13:46, Александр Смирнов <
> >> >> >>> smiralexan@gmail.com>
> >> >> >>>> wrote:
> >> >> >>>>>>>>>>>
> >> >> >>>>>>>>>>> Hi Qingsheng and devs!
> >> >> >>>>>>>>>>>
> >> >> >>>>>>>>>>> I like the overall design of updated FLIP, however I have
> >> >> >> several
> >> >> >>>>>>>>>>> suggestions and questions.
> >> >> >>>>>>>>>>>
> >> >> >>>>>>>>>>> 1) Introducing LookupFunction as a subclass of
> >> TableFunction
> >> >> >> is a
> >> >> >>>> good
> >> >> >>>>>>>>>>> idea. We can add 'maxRetryTimes' option into this class.
> >> >> 'eval'
> >> >> >>>> method
> >> >> >>>>>>>>>>> of new LookupFunction is great for this purpose. The same
> >> is
> >> >> >> for
> >> >> >>>>>>>>>>> 'async' case.
> >> >> >>>>>>>>>>>
> >> >> >>>>>>>>>>> 2) There might be other configs in future, such as
> >> >> >>>> 'cacheMissingKey'
> >> >> >>>>>>>>>>> in LookupFunctionProvider or 'rescanInterval' in
> >> >> >>>> ScanRuntimeProvider.
> >> >> >>>>>>>>>>> Maybe use Builder pattern in LookupFunctionProvider and
> >> >> >>>>>>>>>>> RescanRuntimeProvider for more flexibility (use one
> 'build'
> >> >> >>> method
> >> >> >>>>>>>>>>> instead of many 'of' methods in future)?
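As an illustration of the suggested builder pattern (all names here are hypothetical placeholders, not LookupFunctionProvider's eventual API): a single build() call stays source-compatible when optional settings such as 'cacheMissingKey' are added later, unlike a growing family of 'of' overloads:

```java
// Hypothetical provider with one required and one optional setting; adding
// further optional setters later would not break existing call sites.
public class ProviderSketch {
    private final String lookupFunctionName; // required
    private final boolean cacheMissingKey;   // optional, defaulted

    private ProviderSketch(Builder b) {
        this.lookupFunctionName = b.lookupFunctionName;
        this.cacheMissingKey = b.cacheMissingKey;
    }

    public static Builder newBuilder() {
        return new Builder();
    }

    public String getLookupFunctionName() {
        return lookupFunctionName;
    }

    public boolean isCacheMissingKey() {
        return cacheMissingKey;
    }

    public static class Builder {
        private String lookupFunctionName;
        private boolean cacheMissingKey = true;

        public Builder withLookupFunction(String name) {
            this.lookupFunctionName = name;
            return this;
        }

        public Builder cacheMissingKey(boolean cache) {
            this.cacheMissingKey = cache;
            return this;
        }

        public ProviderSketch build() {
            if (lookupFunctionName == null) {
                throw new IllegalStateException("lookup function is required");
            }
            return new ProviderSketch(this);
        }
    }
}
```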
> >> >> >>>>>>>>>>>
> >> >> >>>>>>>>>>> 3) What are the plans for existing TableFunctionProvider
> >> and
> >> >> >>>>>>>>>>> AsyncTableFunctionProvider? I think they should be
> >> deprecated.
> >> >> >>>>>>>>>>>
> >> >> >>>>>>>>>>> 4) Am I right that the current design does not assume
> >> usage of
> >> >> >>>>>>>>>>> user-provided LookupCache in re-scanning? In this case,
> it
> >> is
> >> >> >> not
> >> >> >>>> very
> >> >> >>>>>>>>>>> clear why we need methods such as 'invalidate' or
> >> 'putAll'
> >> >> >> in
> >> >> >>>>>>>>>>> LookupCache.
> >> >> >>>>>>>>>>>
> >> >> >>>>>>>>>>> 5) I don't see DDL options, that were in previous version
> >> of
> >> >> >>> FLIP.
> >> >> >>>> Do
> >> >> >>>>>>>>>>> you have any special plans for them?
> >> >> >>>>>>>>>>>
> >> >> >>>>>>>>>>> If you don't mind, I would be glad to be able to make
> small
> >> >> >>>>>>>>>>> adjustments to the FLIP document too. I think it's worth
> >> >> >>> mentioning
> >> >> >>>>>>>>>>> exactly which optimizations are planned for the future.
> >> >> >>>>>>>>>>>
> >> >> >>>>>>>>>>> Best regards,
> >> >> >>>>>>>>>>> Smirnov Alexander
> >> >> >>>>>>>>>>>
> >> >> >>>>>>>>>>> пт, 13 мая 2022 г. в 20:27, Qingsheng Ren <
> >> renqschn@gmail.com
> >> >> >>> :
> >> >> >>>>>>>>>>>>
> >> >> >>>>>>>>>>>> Hi Alexander and devs,
> >> >> >>>>>>>>>>>>
> >> >> >>>>>>>>>>>> Thank you very much for the in-depth discussion! As Jark
> >> >> >>>> mentioned we were inspired by Alexander's idea and made a
> >> refactor on
> >> >> >> our
> >> >> >>>> design. FLIP-221 [1] has been updated to reflect our design now
> >> and
> >> >> we
> >> >> >>> are
> >> >> >>>> happy to hear more suggestions from you!
> >> >> >>>>>>>>>>>>
> >> >> >>>>>>>>>>>> Compared to the previous design:
> >> >> >>>>>>>>>>>> 1. The lookup cache serves at table runtime level and is
> >> >> >>>> integrated as a component of LookupJoinRunner as discussed
> >> >> previously.
> >> >> >>>>>>>>>>>> 2. Interfaces are renamed and re-designed to reflect the
> >> new
> >> >> >>>> design.
> >> >> >>>>>>>>>>>> 3. We separate the all-caching case individually and
> >> >> >> introduce a
> >> >> >>>> new RescanRuntimeProvider to reuse the ability of scanning. We
> are
> >> >> >>> planning
> >> >> >>>> to support SourceFunction / InputFormat for now considering the
> >> >> >>> complexity
> >> >> >>>> of FLIP-27 Source API.
> >> >> >>>>>>>>>>>> 4. A new interface LookupFunction is introduced to make
> >> the
> >> >> >>>> semantic of lookup more straightforward for developers.
> >> >> >>>>>>>>>>>>
> >> >> >>>>>>>>>>>> For replying to Alexander:
> >> >> >>>>>>>>>>>>> However I'm a little confused whether InputFormat is
> >> >> >> deprecated
> >> >> >>>> or not. Am I right that it will be so in the future, but
> currently
> >> >> it's
> >> >> >>> not?
> >> >> >>>>>>>>>>>> Yes you are right. InputFormat is not deprecated for
> now.
> >> I
> >> >> >>> think
> >> >> >>>> it will be deprecated in the future but we don't have a clear
> plan
> >> >> for
> >> >> >>> that.
> >> >> >>>>>>>>>>>>
> >> >> >>>>>>>>>>>> Thanks again for the discussion on this FLIP and looking
> >> >> >> forward
> >> >> >>>> to cooperating with you after we finalize the design and
> >> interfaces!
> >> >> >>>>>>>>>>>>
> >> >> >>>>>>>>>>>> [1]
> >> >> >>>>
> >> >> >>>
> >> >> >>
> >> >>
> >>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-221+Abstraction+for+lookup+source+cache+and+metric
> >> >> >>>>>>>>>>>>
> >> >> >>>>>>>>>>>> Best regards,
> >> >> >>>>>>>>>>>>
> >> >> >>>>>>>>>>>> Qingsheng
> >> >> >>>>>>>>>>>>
> >> >> >>>>>>>>>>>>
> >> >> >>>>>>>>>>>> On Fri, May 13, 2022 at 12:12 AM Александр Смирнов <
> >> >> >>>> smiralexan@gmail.com> wrote:
> >> >> >>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>> Hi Jark, Qingsheng and Leonard!
> >> >> >>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>> Glad to see that we came to a consensus on almost all
> >> >> points!
> >> >> >>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>> However I'm a little confused whether InputFormat is
> >> >> >> deprecated
> >> >> >>>> or
> >> >> >>>>>>>>>>>>> not. Am I right that it will be so in the future, but
> >> >> >> currently
> >> >> >>>> it's
> >> >> >>>>>>>>>>>>> not? Actually I also think that for the first version
> >> it's
> >> >> OK
> >> >> >>> to
> >> >> >>>> use
> >> >> >>>>>>>>>>>>> InputFormat in ALL cache realization, because
> supporting
> >> >> >> rescan
> >> >> >>>>>>>>>>>>> ability seems like a very distant prospect. But for
> this
> >> >> >>>> decision we
> >> >> >>>>>>>>>>>>> need a consensus among all discussion participants.
> >> >> >>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>> In general, I don't have something to argue with your
> >> >> >>>> statements. All
> >> >> >>>>>>>>>>>>> of them correspond my ideas. Looking ahead, it would be
> >> nice
> >> >> >> to
> >> >> >>>> work
> >> >> >>>>>>>>>>>>> on this FLIP cooperatively. I've already done a lot of
> >> work
> >> >> >> on
> >> >> >>>> lookup
> >> >> >>>>>>>>>>>>> join caching with realization very close to the one we
> >> are
> >> >> >>>> discussing,
> >> >> >>>>>>>>>>>>> and want to share the results of this work. Anyway
> >> looking
> >> >> >>>> forward for
> >> >> >>>>>>>>>>>>> the FLIP update!
> >> >> >>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>> Best regards,
> >> >> >>>>>>>>>>>>> Smirnov Alexander
> >> >> >>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>> чт, 12 мая 2022 г. в 17:38, Jark Wu <imjark@gmail.com
> >:
> >> >> >>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>> Hi Alex,
> >> >> >>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>> Thanks for summarizing your points.
> >> >> >>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>> In the past week, Qingsheng, Leonard, and I have
> >> discussed
> >> >> >> it
> >> >> >>>> several times
> >> >> >>>>>>>>>>>>>> and we have totally refactored the design.
> >> >> >>>>>>>>>>>>>> I'm glad to say we have reached a consensus on many of
> >> your
> >> >> >>>> points!
> >> >> >>>>>>>>>>>>>> Qingsheng is still working on updating the design docs
> >> and
> >> >> >>>> maybe can be
> >> >> >>>>>>>>>>>>>> available in the next few days.
> >> >> >>>>>>>>>>>>>> I will share some conclusions from our discussions:
> >> >> >>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>> 1) we have refactored the design towards to "cache in
> >> >> >>>> framework" way.
> >> >> >>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>> 2) a "LookupCache" interface for users to customize
> and
> >> a
> >> >> >>>> default
> >> >> >>>>>>>>>>>>>> implementation with builder for users to easy-use.
> >> >> >>>>>>>>>>>>>> This can both make it possible to both have
> flexibility
> >> and
> >> >> >>>> conciseness.
> >> >> >>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>> 3) Filter pushdown is important for ALL and LRU lookup
> >> >> >> cache,
> >> >> >>>> esp reducing
> >> >> >>>>>>>>>>>>>> IO.
> >> >> >>>>>>>>>>>>>> Filter pushdown should be the final state and the
> >> unified
> >> >> >> way
> >> >> >>>> to both
> >> >> >>>>>>>>>>>>>> support pruning ALL cache and LRU cache,
> >> >> >>>>>>>>>>>>>> so I think we should make effort in this direction. If
> >> we
> >> >> >> need
> >> >> >>>> to support
> >> >> >>>>>>>>>>>>>> filter pushdown for ALL cache anyway, why not use
> >> >> >>>>>>>>>>>>>> it for LRU cache as well? Either way, as we decide to
> >> >> >>> implement
> >> >> >>>> the cache
> >> >> >>>>>>>>>>>>>> in the framework, we have the chance to support
> >> >> >>>>>>>>>>>>>> filter on cache anytime. This is an optimization and
> it
> >> >> >>> doesn't
> >> >> >>>> affect the
> >> >> >>>>>>>>>>>>>> public API. I think we can create a JIRA issue to
> >> >> >>>>>>>>>>>>>> discuss it when the FLIP is accepted.
> >> >> >>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>> 4) The idea to support ALL cache is similar to your
> >> >> >> proposal.
> >> >> >>>>>>>>>>>>>> In the first version, we will only support
> InputFormat,
> >> >> >>>> SourceFunction for
> >> >> >>>>>>>>>>>>>> cache all (invoke InputFormat in join operator).
> >> >> >>>>>>>>>>>>>> For FLIP-27 source, we need to join a true source
> >> operator
> >> >> >>>> instead of
> >> >> >>>>>>>>>>>>>> calling it embedded in the join operator.
> >> >> >>>>>>>>>>>>>> However, this needs another FLIP to support the
> re-scan
> >> >> >>> ability
> >> >> >>>> for FLIP-27
> >> >> >>>>>>>>>>>>>> Source, and this can be a large work.
> >> >> >>>>>>>>>>>>>> In order to not block this issue, we can put the
> effort
> >> of
> >> >> >>>> FLIP-27 source
> >> >> >>>>>>>>>>>>>> integration into future work and integrate
> >> >> >>>>>>>>>>>>>> InputFormat&SourceFunction for now.
> >> >> >>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>> I think it's fine to use InputFormat&SourceFunction,
> as
> >> >> they
> >> >> >>>> are not
> >> >> >>>>>>>>>>>>>> deprecated, otherwise, we have to introduce another
> >> >> function
> >> >> >>>>>>>>>>>>>> similar to them which is meaningless. We need to plan
> >> >> >> FLIP-27
> >> >> >>>> source
> >> >> >>>>>>>>>>>>>> integration ASAP before InputFormat & SourceFunction
> are
> >> >> >>>> deprecated.
> >> >> >>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>> Best,
> >> >> >>>>>>>>>>>>>> Jark
> >> >> >>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>> On Thu, 12 May 2022 at 15:46, Александр Смирнов <
> >> >> >>>> smiralexan@gmail.com>
> >> >> >>>>>>>>>>>>>> wrote:
> >> >> >>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>> Hi Martijn!
> >> >> >>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>> Got it. Therefore, the realization with InputFormat
> is
> >> not
> >> >> >>>> considered.
> >> >> >>>>>>>>>>>>>>> Thanks for clearing that up!
> >> >> >>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>> Best regards,
> >> >> >>>>>>>>>>>>>>> Smirnov Alexander
> >> >> >>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>> чт, 12 мая 2022 г. в 14:23, Martijn Visser <
> >> >> >>>> martijn@ververica.com>:
> >> >> >>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>> Hi,
> >> >> >>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>> With regards to:
> >> >> >>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>>> But if there are plans to refactor all connectors
> to
> >> >> >>> FLIP-27
> >> >> >>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>> Yes, FLIP-27 is the target for all connectors. The
> old
> >> >> >>>> interfaces will be
> >> >> >>>>>>>>>>>>>>>> deprecated and connectors will either be refactored
> to
> >> >> use
> >> >> >>>> the new ones
> >> >> >>>>>>>>>>>>>>> or
> >> >> >>>>>>>>>>>>>>>> dropped.
> >> >> >>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>> The caching should work for connectors that are
> using
> >> >> >>> FLIP-27
> >> >> >>>> interfaces,
> >> >> >>>>>>>>>>>>>>>> we should not introduce new features for old
> >> interfaces.
> >> >> >>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>> Best regards,
> >> >> >>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>> Martijn
> >> >> >>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>> On Thu, 12 May 2022 at 06:19, Александр Смирнов <
> >> >> >>>> smiralexan@gmail.com>
> >> >> >>>>>>>>>>>>>>>> wrote:
> >> >> >>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>>> Hi Jark!
> >> >> >>>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>>> Sorry for the late response. I would like to make
> >> some
> >> >> >>>> comments and
> >> >> >>>>>>>>>>>>>>>>> clarify my points.
> >> >> >>>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>>> 1) I agree with your first statement. I think we
> can
> >> >> >>> achieve
> >> >> >>>> both
> >> >> >>>>>>>>>>>>>>>>> advantages this way: put the Cache interface in
> >> >> >>>> flink-table-common,
> >> >> >>>>>>>>>>>>>>>>> but have implementations of it in
> >> flink-table-runtime.
> >> >> >>>> Therefore if a
> >> >> >>>>>>>>>>>>>>>>> connector developer wants to use existing cache
> >> >> >> strategies
> >> >> >>>> and their
> >> >> >>>>>>>>>>>>>>>>> implementations, he can just pass lookupConfig to
> the
> >> >> >>>> planner, but if
> >> >> >>>>>>>>>>>>>>>>> he wants to have its own cache implementation in
> his
> >> >> >>>> TableFunction, it
> >> >> >>>>>>>>>>>>>>>>> will be possible for him to use the existing
> >> interface
> >> >> >> for
> >> >> >>>> this
> >> >> >>>>>>>>>>>>>>>>> purpose (we can explicitly point this out in the
> >> >> >>>> documentation). In
> >> >> >>>>>>>>>>>>>>>>> this way all configs and metrics will be unified.
> >> WDYT?
> >> >> >>>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>>>> If a filter can prune 90% of data in the cache, we
> >> will
> >> >> >>>> have 90% of
> >> >> >>>>>>>>>>>>>>>>> lookup requests that can never be cached
> >> >> >>>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>>> 2) Let me clarify the logic filters optimization in
> >> case
> >> >> >> of
> >> >> >>>> LRU cache.
> >> >> >>>>>>>>>>>>>>>>> It looks like Cache<RowData, Collection<RowData>>.
> >> Here
> >> >> >> we
> >> >> >>>> always
> >> >> >>>>>>>>>>>>>>>>> store the response of the dimension table in cache,
> >> even
> >> >> >>>> after
> >> >> >>>>>>>>>>>>>>>>> applying calc function. I.e. if there are no rows
> >> after
> >> >> >>>> applying
> >> >> >>>>>>>>>>>>>>>>> filters to the result of the 'eval' method of
> >> >> >>> TableFunction,
> >> >> >>>> we store
> >> >> >>>>>>>>>>>>>>>>> the empty list by lookup keys. Therefore the cache
> >> line
> >> >> >>> will
> >> >> >>>> be
> >> >> >>>>>>>>>>>>>>>>> filled, but will require much less memory (in
> bytes).
> >> >> >> I.e.
> >> >> >>>> we don't
> >> >> >>>>>>>>>>>>>>>>> completely filter keys, by which result was pruned,
> >> but
> >> >> >>>> significantly
> >> >> >>>>>>>>>>>>>>>>> reduce required memory to store this result. If the
> >> user
> >> >> >>>> knows about
> >> >> >>>>>>>>>>>>>>>>> this behavior, he can increase the 'max-rows'
> option
> >> >> >> before
> >> >> >>>> the start
> >> >> >>>>>>>>>>>>>>>>> of the job. But actually I came up with the idea
> >> that we
> >> >> >>> can
> >> >> >>>> do this
> >> >> >>>>>>>>>>>>>>>>> automatically by using the 'maximumWeight' and
> >> 'weigher'
> >> >> >>>> methods of
> >> >> >>>>>>>>>>>>>>>>> GuavaCache [1]. Weight can be the size of the
> >> collection
> >> >> >> of
> >> >> >>>> rows
> >> >> >>>>>>>>>>>>>>>>> (value of cache). Therefore cache can automatically
> >> fit
> >> >> >>> much
> >> >> >>>> more
> >> >> >>>>>>>>>>>>>>>>> records than before.
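The weigher idea above can be sketched without Guava: if each entry's weight is the number of rows in its value, cached empty results (keys whose rows were pruned by filters) cost almost nothing, so far more keys fit under the same budget. A minimal hand-rolled equivalent of Guava's maximumWeight/weigher, written only to illustrate the idea:

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// LRU cache bounded by total weight instead of entry count, where the weight
// of an entry is the size of its row collection (the "weigher"). Empty
// results weigh nothing, so negative lookups are almost free to cache.
public class WeightBoundedCacheSketch<K, V> {
    private final long maxWeight;
    private long currentWeight = 0;
    private final LinkedHashMap<K, List<V>> map =
            new LinkedHashMap<>(16, 0.75f, true); // access-order for LRU

    public WeightBoundedCacheSketch(long maxWeight) {
        this.maxWeight = maxWeight;
    }

    public void put(K key, List<V> rows) {
        List<V> old = map.remove(key);
        if (old != null) {
            currentWeight -= old.size();
        }
        map.put(key, rows);
        currentWeight += rows.size();
        // Evict least-recently-used entries until back under the budget,
        // never evicting the entry that was just inserted.
        Iterator<Map.Entry<K, List<V>>> it = map.entrySet().iterator();
        while (currentWeight > maxWeight && it.hasNext()) {
            Map.Entry<K, List<V>> eldest = it.next();
            if (eldest.getKey().equals(key)) {
                continue;
            }
            currentWeight -= eldest.getValue().size();
            it.remove();
        }
    }

    public List<V> get(K key) {
        return map.get(key);
    }
}
```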
> >> >> >>>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>>>> Flink SQL has provided a standard way to do
> filters
> >> and
> >> >> >>>> projects
> >> >> >>>>>>>>>>>>>>>>> pushdown, i.e., SupportsFilterPushDown and
> >> >> >>>> SupportsProjectionPushDown.
> >> >> >>>>>>>>>>>>>>>>>> Jdbc/hive/HBase haven't implemented the
> interfaces,
> >> >> >> don't
> >> >> >>>> mean it's
> >> >> >>>>>>>>>>>>>>> hard
> >> >> >>>>>>>>>>>>>>>>> to implement.
> >> >> >>>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>>> It's debatable how difficult it will be to
> implement
> >> >> >> filter
> >> >> >>>> pushdown.
> >> >> >>>>>>>>>>>>>>>>> But I think the fact that currently there is no
> >> database
> >> >> >>>> connector
> >> >> >>>>>>>>>>>>>>>>> with filter pushdown at least means that this
> feature
> >> >> >> won't
> >> >> >>>> be
> >> >> >>>>>>>>>>>>>>>>> supported soon in connectors. Moreover, if we talk
> >> about
> >> >> >>>> other
> >> >> >>>>>>>>>>>>>>>>> connectors (not in Flink repo), their databases
> might
> >> >> not
> >> >> >>>> support all
> >> >> >>>>>>>>>>>>>>>>> Flink filters (or not support filters at all). I
> >> think
> >> >> >>> users
> >> >> >>>> are
> >> >> >>>>>>>>>>>>>>>>> interested in supporting cache filters optimization
> >> >> >>>> independently of
> >> >> >>>>>>>>>>>>>>>>> supporting other features and solving more complex
> >> >> >> problems
> >> >> >>>> (or
> >> >> >>>>>>>>>>>>>>>>> unsolvable at all).
> >> >> >>>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>>> 3) I agree with your third statement. Actually in
> our
> >> >> >>>> internal version
> >> >> >>>>>>>>>>>>>>>>> I also tried to unify the logic of scanning and
> >> >> reloading
> >> >> >>>> data from
> >> >> >>>>>>>>>>>>>>>>> connectors. But unfortunately, I didn't find a way
> to
> >> >> >> unify
> >> >> >>>> the logic
> >> >> >>>>>>>>>>>>>>>>> of all ScanRuntimeProviders (InputFormat,
> >> >> SourceFunction,
> >> >> >>>> Source,...)
> >> >> >>>>>>>>>>>>>>>>> and reuse it in reloading ALL cache. As a result I
> >> >> >> settled
> >> >> >>>> on using
> >> >> >>>>>>>>>>>>>>>>> InputFormat, because it was used for scanning in
> all
> >> >> >> lookup
> >> >> >>>>>>>>>>>>>>>>> connectors. (I didn't know that there are plans to
> >> >> >>> deprecate
> >> >> >>>>>>>>>>>>>>>>> InputFormat in favor of FLIP-27 Source). IMO usage
> of
> >> >> >>>> FLIP-27 source
> >> >> >>>>>>>>>>>>>>>>> in ALL caching is not good idea, because this
> source
> >> was
> >> >> >>>> designed to
> >> >> >>>>>>>>>>>>>>>>> work in distributed environment (SplitEnumerator on
> >> >> >>>> JobManager and
> >> >> >>>>>>>>>>>>>>>>> SourceReaders on TaskManagers), not in one operator
> >> >> >> (lookup
> >> >> >>>> join
> >> >> >>>>>>>>>>>>>>>>> operator in our case). There is even no direct way
> to
> >> >> >> pass
> >> >> >>>> splits from
> >> >> >>>>>>>>>>>>>>>>> SplitEnumerator to SourceReader (this logic works
> >> >> through
> >> >> >>>>>>>>>>>>>>>>> SplitEnumeratorContext, which requires
> >> >> >>>>>>>>>>>>>>>>> OperatorCoordinator.SubtaskGateway to send
> >> >> >> AddSplitEvents).
> >> >> >>>> Usage of
> >> >> >>>>>>>>>>>>>>>>> InputFormat for ALL cache seems much more clearer
> and
> >> >> >>>> easier. But if
> >> >> >>>>>>>>>>>>>>>>> there are plans to refactor all connectors to
> >> FLIP-27, I
> >> >> >>>> have the
> >> >> >>>>>>>>>>>>>>>>> following ideas: maybe we can refuse from lookup
> join
> >> >> ALL
> >> >> >>>> cache in
> >> >> >>>>>>>>>>>>>>>>> favor of simple join with multiple scanning of
> batch
> >> >> >>> source?
> >> >> >>>> The point
> >> >> >>>>>>>>>>>>>>>>> is that the only difference between lookup join ALL
> >> >> cache
> >> >> >>>> and simple
> >> >> >>>>>>>>>>>>>>>>> join with batch source is that in the first case
> >> >> scanning
> >> >> >>> is
> >> >> >>>> performed
> >> >> >>>>>>>>>>>>>>>>> multiple times, in between which state (cache) is
> >> >> cleared
> >> >> >>>> (correct me
> >> >> >>>>>>>>>>>>>>>>> if I'm wrong). So what if we extend the
> >> functionality of
> >> >> >>>> simple join
> >> >> >>>>>>>>>>>>>>>>> to support state reloading + extend the
> >> functionality of
> >> >> >>>> scanning
> >> >> >>>>>>>>>>>>>>>>> batch source multiple times (this one should be
> easy
> >> >> with
> >> >> >>>> new FLIP-27
> >> >> >>>>>>>>>>>>>>>>> source, that unifies streaming/batch reading - we
> >> will
> >> >> >> need
> >> >> >>>> to change
> >> >> >>>>>>>>>>>>>>>>> only SplitEnumerator, which will pass splits again
> >> after
> >> >> >>>> some TTL).
> >> >> >>>>>>>>>>>>>>>>> WDYT? I must say that this looks like a long-term
> >> goal
> >> >> >> and
> >> >> >>>> will make
> >> >> >>>>>>>>>>>>>>>>> the scope of this FLIP even larger than you said.
> >> Maybe
> >> >> >> we
> >> >> >>>> can limit
> >> >> >>>>>>>>>>>>>>>>> ourselves to a simpler solution now (InputFormats).
> >> >> >>>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>>> So to sum up, my points is like this:
> >> >> >>>>>>>>>>>>>>>>> 1) There is a way to make both concise and flexible
> >> >> >>>> interfaces for
> >> >> >>>>>>>>>>>>>>>>> caching in lookup join.
> >> >> >>>>>>>>>>>>>>>>> 2) Cache filters optimization is important both in
> >> LRU
> >> >> >> and
> >> >> >>>> ALL caches.
> >> >> >>>>>>>>>>>>>>>>> 3) It is unclear when filter pushdown will be
> >> supported
> >> >> >> in
> >> >> >>>> Flink
> >> >> >>>>>>>>>>>>>>>>> connectors, some of the connectors might not have
> the
> >> >> >>>> opportunity to
> >> >> >>>>>>>>>>>>>>>>> support filter pushdown + as I know, currently
> filter
> >> >> >>>> pushdown works
> >> >> >>>>>>>>>>>>>>>>> only for scanning (not lookup). So cache filters +
> >> >> >>>> projections
> >> >> >>>>>>>>>>>>>>>>> optimization should be independent from other
> >> features.
> >> >> >>>>>>>>>>>>>>>>> 4) ALL cache realization is a complex topic that
> >> >> involves
> >> >> >>>> multiple
> >> >> >>>>>>>>>>>>>>>>> aspects of how Flink is developing. Moving away from
> >> >> >>>>>>>>>>>>>>>>> InputFormat in favor of the FLIP-27 Source will make the
> >> >> >>>>>>>>>>>>>>>>> ALL cache realization really complex and unclear, so maybe
> >> >> >>>>>>>>>>>>>>>>> instead we can extend the functionality of simple join, or
> >> >> >>>>>>>>>>>>>>>>> keep InputFormat in the case of the lookup join ALL cache?
> >> >> >>>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>>> Best regards,
> >> >> >>>>>>>>>>>>>>>>> Smirnov Alexander
> >> >> >>>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>>> [1]
> >> >> >>>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>
> >> >> >>>>
> >> >> >>>
> >> >> >>
> >> >>
> >>
> https://guava.dev/releases/18.0/api/docs/com/google/common/cache/CacheBuilder.html#weigher(com.google.common.cache.Weigher)
> >> >> >>>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>>> чт, 5 мая 2022 г. в 20:34, Jark Wu <
> imjark@gmail.com
> >> >:
> >> >> >>>>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>>>> It's great to see the active discussion! I want to
> >> >> share
> >> >> >>> my
> >> >> >>>> ideas:
> >> >> >>>>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>>>> 1) implement the cache in framework vs. connectors
> >> base
> >> >> >>>>>>>>>>>>>>>>>> I don't have a strong opinion on this. Both ways
> >> should
> >> >> >>>> work (e.g.,
> >> >> >>>>>>>>>>>>>>> cache
> >> >> >>>>>>>>>>>>>>>>>> pruning, compatibility).
> >> >> >>>>>>>>>>>>>>>>>> The framework way can provide more concise
> >> interfaces.
> >> >> >>>>>>>>>>>>>>>>>> The connector base way can define more flexible
> >> cache
> >> >> >>>>>>>>>>>>>>>>>> strategies/implementations.
> >> >> >>>>>>>>>>>>>>>>>> We are still investigating a way to see if we can
> >> have
> >> >> >>> both
> >> >> >>>>>>>>>>>>>>> advantages.
> >> >> >>>>>>>>>>>>>>>>>> We should reach a consensus that the way should
> be a
> >> >> >> final
> >> >> >>>> state,
> >> >> >>>>>>>>>>>>>>> and we
> >> >> >>>>>>>>>>>>>>>>>> are on the path to it.
> >> >> >>>>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>>>> 2) filters and projections pushdown:
> >> >> >>>>>>>>>>>>>>>>>> I agree with Alex that the filter pushdown into
> >> cache
> >> >> >> can
> >> >> >>>> benefit a
> >> >> >>>>>>>>>>>>>>> lot
> >> >> >>>>>>>>>>>>>>>>> for
> >> >> >>>>>>>>>>>>>>>>>> ALL cache.
> >> >> >>>>>>>>>>>>>>>>>> However, this is not true for LRU cache.
> Connectors
> >> use
> >> >> >>>> cache to
> >> >> >>>>>>>>>>>>>>> reduce
> >> >> >>>>>>>>>>>>>>>>> IO
> >> >> >>>>>>>>>>>>>>>>>> requests to databases for better throughput.
> >> >> >>>>>>>>>>>>>>>>>> If a filter can prune 90% of data in the cache, we
> >> will
> >> >> >>>> have 90% of
> >> >> >>>>>>>>>>>>>>>>> lookup
> >> >> >>>>>>>>>>>>>>>>>> requests that can never be cached
> >> >> >>>>>>>>>>>>>>>>>> and hit directly to the databases. That means the
> >> cache
> >> >> >> is
> >> >> >>>>>>>>>>>>>>> meaningless in
> >> >> >>>>>>>>>>>>>>>>>> this case.
> >> >> >>>>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>>>> IMO, Flink SQL has provided a standard way to do
> >> >> filters
> >> >> >>>> and projects
> >> >> >>>>>>>>>>>>>>>>>> pushdown, i.e., SupportsFilterPushDown and
> >> >> >>>>>>>>>>>>>>> SupportsProjectionPushDown.
> >> >> >>>>>>>>>>>>>>>>>> Jdbc/hive/HBase haven't implemented the
> interfaces,
> >> >> >> don't
> >> >> >>>> mean it's
> >> >> >>>>>>>>>>>>>>> hard
> >> >> >>>>>>>>>>>>>>>>> to
> >> >> >>>>>>>>>>>>>>>>>> implement.
> >> >> >>>>>>>>>>>>>>>>>> They should implement the pushdown interfaces to
> >> reduce
> >> >> >> IO
> >> >> >>>> and the
> >> >> >>>>>>>>>>>>>>> cache
> >> >> >>>>>>>>>>>>>>>>>> size.
> >> >> >>>>>>>>>>>>>>>>>> That should be a final state that the scan source
> >> and
> >> >> >>>> lookup source
> >> >> >>>>>>>>>>>>>>> share
> >> >> >>>>>>>>>>>>>>>>>> the exact pushdown implementation.
> >> >> >>>>>>>>>>>>>>>>>> I don't see why we need to duplicate the pushdown
> >> logic
> >> >> >> in
> >> >> >>>> caches,
> >> >> >>>>>>>>>>>>>>> which
> >> >> >>>>>>>>>>>>>>>>>> will complex the lookup join design.
> >> >> >>>>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>>>> 3) ALL cache abstraction
> >> >> >>>>>>>>>>>>>>>>>> All cache might be the most challenging part of
> this
> >> >> >> FLIP.
> >> >> >>>> We have
> >> >> >>>>>>>>>>>>>>> never
> >> >> >>>>>>>>>>>>>>>>>> provided a reload-lookup public interface.
> >> >> >>>>>>>>>>>>>>>>>> Currently, we put the reload logic in the "eval"
> >> method
> >> >> >> of
> >> >> >>>>>>>>>>>>>>> TableFunction.
> >> >> >>>>>>>>>>>>>>>>>> That's hard for some sources (e.g., Hive).
> >> >> >>>>>>>>>>>>>>>>>> Ideally, connector implementation should share the
> >> >> logic
> >> >> >>> of
> >> >> >>>> reload
> >> >> >>>>>>>>>>>>>>> and
> >> >> >>>>>>>>>>>>>>>>>> scan, i.e. ScanTableSource with
> >> >> >>>> InputFormat/SourceFunction/FLIP-27
> >> >> >>>>>>>>>>>>>>>>> Source.
> >> >> >>>>>>>>>>>>>>>>>> However, InputFormat/SourceFunction are
> deprecated,
> >> and
> >> >> >>> the
> >> >> >>>> FLIP-27
> >> >> >>>>>>>>>>>>>>>>> source
> >> >> >>>>>>>>>>>>>>>>>> is deeply coupled with SourceOperator.
> >> >> >>>>>>>>>>>>>>>>>> If we want to invoke the FLIP-27 source in
> >> LookupJoin,
> >> >> >>> this
> >> >> >>>> may make
> >> >> >>>>>>>>>>>>>>> the
> >> >> >>>>>>>>>>>>>>>>>> scope of this FLIP much larger.
> >> >> >>>>>>>>>>>>>>>>>> We are still investigating how to abstract the ALL
> >> >> cache
> >> >> >>>> logic and
> >> >> >>>>>>>>>>>>>>> reuse
> >> >> >>>>>>>>>>>>>>>>>> the existing source interfaces.
> >> >> >>>>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>>>> Best,
> >> >> >>>>>>>>>>>>>>>>>> Jark
> >> >> >>>>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>>>> On Thu, 5 May 2022 at 20:22, Roman Boyko <
> >> >> >>>> ro.v.boyko@gmail.com>
> >> >> >>>>>>>>>>>>>>> wrote:
> >> >> >>>>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>>>>> It's a much more complicated activity and lies
> out
> >> of
> >> >> >> the
> >> >> >>>> scope of
> >> >> >>>>>>>>>>>>>>> this
> >> >> >>>>>>>>>>>>>>>>>>> improvement. Because such pushdowns should be
> done
> >> for
> >> >> >>> all
> >> >> >>>>>>>>>>>>>>>>> ScanTableSource
> >> >> >>>>>>>>>>>>>>>>>>> implementations (not only for Lookup ones).
> >> >> >>>>>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>>>>> On Thu, 5 May 2022 at 19:02, Martijn Visser <
> >> >> >>>>>>>>>>>>>>> martijnvisser@apache.org>
> >> >> >>>>>>>>>>>>>>>>>>> wrote:
> >> >> >>>>>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>>>>>> Hi everyone,
> >> >> >>>>>>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>>>>>> One question regarding "And Alexander correctly
> >> >> >>> mentioned
> >> >> >>>> that
> >> >> >>>>>>>>>>>>>>> filter
> >> >> >>>>>>>>>>>>>>>>>>>> pushdown still is not implemented for
> >> >> >> jdbc/hive/hbase."
> >> >> >>>> -> Would
> >> >> >>>>>>>>>>>>>>> an
> >> >> >>>>>>>>>>>>>>>>>>>> alternative solution be to actually implement
> >> these
> >> >> >>> filter
> >> >> >>>>>>>>>>>>>>> pushdowns?
> >> >> >>>>>>>>>>>>>>>>> I
> >> >> >>>>>>>>>>>>>>>>>>>> can
> >> >> >>>>>>>>>>>>>>>>>>>> imagine that there are many more benefits to
> doing
> >> >> >> that,
> >> >> >>>> outside
> >> >> >>>>>>>>>>>>>>> of
> >> >> >>>>>>>>>>>>>>>>> lookup
> >> >> >>>>>>>>>>>>>>>>>>>> caching and metrics.
> >> >> >>>>>>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>>>>>> Best regards,
> >> >> >>>>>>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>>>>>> Martijn Visser
> >> >> >>>>>>>>>>>>>>>>>>>> https://twitter.com/MartijnVisser82
> >> >> >>>>>>>>>>>>>>>>>>>> https://github.com/MartijnVisser
> >> >> >>>>>>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>>>>>> On Thu, 5 May 2022 at 13:58, Roman Boyko <
> >> >> >>>> ro.v.boyko@gmail.com>
> >> >> >>>>>>>>>>>>>>>>> wrote:
> >> >> >>>>>>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>>>>>>> Hi everyone!
> >> >> >>>>>>>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>>>>>>> Thanks for driving such a valuable improvement!
> >> >> >>>>>>>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>>>>>>> I do think that single cache implementation
> >> would be
> >> >> >> a
> >> >> >>>> nice
> >> >> >>>>>>>>>>>>>>>>> opportunity
> >> >> >>>>>>>>>>>>>>>>>>>> for
> >> >> >>>>>>>>>>>>>>>>>>>>> users. And it will break the "FOR SYSTEM_TIME
> AS
> >> OF
> >> >> >>>> proc_time"
> >> >> >>>>>>>>>>>>>>>>> semantics
> >> >> >>>>>>>>>>>>>>>>>>>>> anyway - no matter how it will be
> >> implemented.
> >> >> >>>>>>>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>>>>>>> Putting myself in the user's shoes, I can say
> >> that:
> >> >> >>>>>>>>>>>>>>>>>>>>> 1) I would prefer to have the opportunity to
> cut
> >> off
> >> >> >>> the
> >> >> >>>> cache
> >> >> >>>>>>>>>>>>>>> size
> >> >> >>>>>>>>>>>>>>>>> by
> >> >> >>>>>>>>>>>>>>>>>>>>> simply filtering unnecessary data. And the most
> >> >> handy
> >> >> >>>> way to do
> >> >> >>>>>>>>>>>>>>> it
> >> >> >>>>>>>>>>>>>>>>> is
> >> >> >>>>>>>>>>>>>>>>>>>> apply
> >> >> >>>>>>>>>>>>>>>>>>>>> it inside LookupRunners. It would be a bit
> >> harder to
> >> >> >>>> pass it
> >> >> >>>>>>>>>>>>>>>>> through the
> >> >> >>>>>>>>>>>>>>>>>>>>> LookupJoin node to TableFunction. And Alexander
> >> >> >>> correctly
> >> >> >>>>>>>>>>>>>>> mentioned
> >> >> >>>>>>>>>>>>>>>>> that
> >> >> >>>>>>>>>>>>>>>>>>>>> filter pushdown still is not implemented for
> >> >> >>>> jdbc/hive/hbase.
> >> >> >>>>>>>>>>>>>>>>>>>>> 2) The ability to set the different caching
> >> >> >> parameters
> >> >> >>>> for
> >> >> >>>>>>>>>>>>>>> different
> >> >> >>>>>>>>>>>>>>>>>>>> tables
> >> >> >>>>>>>>>>>>>>>>>>>>> is quite important. So I would prefer to set it
> >> >> >> through
> >> >> >>>> DDL
> >> >> >>>>>>>>>>>>>>> rather
> >> >> >>>>>>>>>>>>>>>>> than
> >> >> >>>>>>>>>>>>>>>>>>>>> have similar TTL, strategy and other options
> for
> >> >> all
> >> >> >>>> lookup
> >> >> >>>>>>>>>>>>>>> tables.
> >> >> >>>>>>>>>>>>>>>>>>>>> 3) Providing the cache in the framework
> really
> >> >> >>>> deprives us of
> >> >> >>>>>>>>>>>>>>>>>>>>> extensibility (users won't be able to implement
> >> >> their
> >> >> >>> own
> >> >> >>>>>>>>>>>>>>> cache).
> >> >> >>>>>>>>>>>>>>>>> But
> >> >> >>>>>>>>>>>>>>>>>>>> most
> >> >> >>>>>>>>>>>>>>>>>>>>> probably it might be solved by creating more
> >> >> >> different
> >> >> >>>> cache
> >> >> >>>>>>>>>>>>>>>>> strategies
> >> >> >>>>>>>>>>>>>>>>>>>> and
> >> >> >>>>>>>>>>>>>>>>>>>>> a wider set of configurations.
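[Roman's second point can be illustrated with a small sketch: every table's WITH clause yields its own option map, so per-table cache settings fall out naturally. The option keys and defaults below are illustrative; only 'lookup.cache.max-rows' appears in this thread.]

```java
import java.time.Duration;
import java.util.Map;

/** Sketch of per-table cache configuration parsed from DDL options.
 *  Two lookup tables in the same job can carry different settings,
 *  which a single job-wide "table.exec.xxx" config could not express. */
public class LookupCacheOptions {
    public final long maxRows;
    public final Duration ttl;
    public final String strategy;

    public LookupCacheOptions(long maxRows, Duration ttl, String strategy) {
        this.maxRows = maxRows;
        this.ttl = ttl;
        this.strategy = strategy;
    }

    /** Parses one table's option map; keys/defaults are hypothetical. */
    public static LookupCacheOptions fromOptions(Map<String, String> options) {
        long maxRows = Long.parseLong(
                options.getOrDefault("lookup.cache.max-rows", "10000"));
        Duration ttl = Duration.ofSeconds(Long.parseLong(
                options.getOrDefault("lookup.cache.ttl-seconds", "600")));
        String strategy = options.getOrDefault("lookup.cache.strategy", "LRU");
        return new LookupCacheOptions(maxRows, ttl, strategy);
    }
}
```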
> >> >> >>>>>>>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>>>>>>> All these points are much closer to the schema
> >> >> >> proposed
> >> >> >>>> by
> >> >> >>>>>>>>>>>>>>>>> Alexander.
> >> >> >>>>>>>>>>>>>>>>>>>>> Qingshen Ren, please correct me if I'm not
> right
> >> and
> >> >> >>> all
> >> >> >>>> these
> >> >> >>>>>>>>>>>>>>>>>>>> facilities
> >> >> >>>>>>>>>>>>>>>>>>>>> might be simply implemented in your
> architecture?
> >> >> >>>>>>>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>>>>>>> Best regards,
> >> >> >>>>>>>>>>>>>>>>>>>>> Roman Boyko
> >> >> >>>>>>>>>>>>>>>>>>>>> e.: ro.v.boyko@gmail.com
> >> >> >>>>>>>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>>>>>>> On Wed, 4 May 2022 at 21:01, Martijn Visser <
> >> >> >>>>>>>>>>>>>>>>> martijnvisser@apache.org>
> >> >> >>>>>>>>>>>>>>>>>>>>> wrote:
> >> >> >>>>>>>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>>>>>>>> Hi everyone,
> >> >> >>>>>>>>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>>>>>>>> I don't have much to chip in, but just wanted
> to
> >> >> >>>> express that
> >> >> >>>>>>>>>>>>>>> I
> >> >> >>>>>>>>>>>>>>>>> really
> >> >> >>>>>>>>>>>>>>>>>>>>>> appreciate the in-depth discussion on this
> topic
> >> >> >> and I
> >> >> >>>> hope
> >> >> >>>>>>>>>>>>>>> that
> >> >> >>>>>>>>>>>>>>>>>>>> others
> >> >> >>>>>>>>>>>>>>>>>>>>>> will join the conversation.
> >> >> >>>>>>>>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>>>>>>>> Best regards,
> >> >> >>>>>>>>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>>>>>>>> Martijn
> >> >> >>>>>>>>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>>>>>>>> On Tue, 3 May 2022 at 10:15, Александр
> Смирнов <
> >> >> >>>>>>>>>>>>>>>>> smiralexan@gmail.com>
> >> >> >>>>>>>>>>>>>>>>>>>>>> wrote:
> >> >> >>>>>>>>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>>>>>>>>> Hi Qingsheng, Leonard and Jark,
> >> >> >>>>>>>>>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>>>>>>>>> Thanks for your detailed feedback! However, I
> >> have
> >> >> >>>> questions
> >> >> >>>>>>>>>>>>>>>>> about
> >> >> >>>>>>>>>>>>>>>>>>>>>>> some of your statements (maybe I didn't get
> >> >> >>>> something?).
> >> >> >>>>>>>>>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>>>>>>>>>> Caching actually breaks the semantic of "FOR
> >> >> >>>> SYSTEM_TIME
> >> >> >>>>>>>>>>>>>>> AS OF
> >> >> >>>>>>>>>>>>>>>>>>>>>> proc_time”
> >> >> >>>>>>>>>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>>>>>>>>> I agree that the semantics of "FOR
> SYSTEM_TIME
> >> AS
> >> >> >> OF
> >> >> >>>>>>>>>>>>>>> proc_time"
> >> >> >>>>>>>>>>>>>>>>> is
> >> >> >>>>>>>>>>>>>>>>>>>> not
> >> >> >>>>>>>>>>>>>>>>>>>>>>> fully implemented with caching, but as you
> >> said,
> >> >> >>> users
> >> >> >>>> go
> >> >> >>>>>>>>>>>>>>> on it
> >> >> >>>>>>>>>>>>>>>>>>>>>>> consciously to achieve better performance (no
> >> one
> >> >> >>>> proposed
> >> >> >>>>>>>>>>>>>>> to
> >> >> >>>>>>>>>>>>>>>>> enable
> >> >> >>>>>>>>>>>>>>>>>>>>>>> caching by default, etc.). Or by users do you
> >> mean
> >> >> >>>> other
> >> >> >>>>>>>>>>>>>>>>> developers
> >> >> >>>>>>>>>>>>>>>>>>>> of
> >> >> >>>>>>>>>>>>>>>>>>>>>>> connectors? In this case developers
> explicitly
> >> >> >>> specify
> >> >> >>>>>>>>>>>>>>> whether
> >> >> >>>>>>>>>>>>>>>>> their
> >> >> >>>>>>>>>>>>>>>>>>>>>>> connector supports caching or not (in the
> list
> >> of
> >> >> >>>> supported
> >> >> >>>>>>>>>>>>>>>>>>>> options),
> >> >> >>>>>>>>>>>>>>>>>>>>>>> no one makes them do that if they don't want
> >> to.
> >> >> So
> >> >> >>>> what
> >> >> >>>>>>>>>>>>>>>>> exactly is
> >> >> >>>>>>>>>>>>>>>>>>>>>>> the difference between implementing caching
> in
> >> >> >>> modules
> >> >> >>>>>>>>>>>>>>>>>>>>>>> flink-table-runtime and in flink-table-common
> >> from
> >> >> >>> the
> >> >> >>>>>>>>>>>>>>>>> considered
> >> >> >>>>>>>>>>>>>>>>>>>>>>> point of view? How does it affect
> >> >> >>>> breaking/non-breaking
> >> >> >>>>>>>>>>>>>>> the
> >> >> >>>>>>>>>>>>>>>>>>>>>>> semantics of "FOR SYSTEM_TIME AS OF
> proc_time"?
> >> >> >>>>>>>>>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>>>>>>>>>> confront a situation that allows table
> >> options in
> >> >> >>> DDL
> >> >> >>>> to
> >> >> >>>>>>>>>>>>>>>>> control
> >> >> >>>>>>>>>>>>>>>>>>>> the
> >> >> >>>>>>>>>>>>>>>>>>>>>>> behavior of the framework, which has never
> >> >> happened
> >> >> >>>>>>>>>>>>>>> previously
> >> >> >>>>>>>>>>>>>>>>> and
> >> >> >>>>>>>>>>>>>>>>>>>>> should
> >> >> >>>>>>>>>>>>>>>>>>>>>>> be cautious
> >> >> >>>>>>>>>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>>>>>>>>> If we talk about main differences of
> semantics
> >> of
> >> >> >> DDL
> >> >> >>>>>>>>>>>>>>> options
> >> >> >>>>>>>>>>>>>>>>> and
> >> >> >>>>>>>>>>>>>>>>>>>>>>> config options("table.exec.xxx"), isn't it
> >> about
> >> >> >>>> limiting
> >> >> >>>>>>>>>>>>>>> the
> >> >> >>>>>>>>>>>>>>>>> scope
> >> >> >>>>>>>>>>>>>>>>>>>> of
> >> >> >>>>>>>>>>>>>>>>>>>>>>> the options + importance for the user
> business
> >> >> >> logic
> >> >> >>>> rather
> >> >> >>>>>>>>>>>>>>> than
> >> >> >>>>>>>>>>>>>>>>>>>>>>> specific location of corresponding logic in
> the
> >> >> >>>> framework? I
> >> >> >>>>>>>>>>>>>>>>> mean
> >> >> >>>>>>>>>>>>>>>>>>>> that
> >> >> >>>>>>>>>>>>>>>>>>>>>>> in my design, for example, putting an option
> >> with
> >> >> >>>> lookup
> >> >> >>>>>>>>>>>>>>> cache
> >> >> >>>>>>>>>>>>>>>>>>>>>>> strategy in configurations would be the
> wrong
> >> >> >>>> decision,
> >> >> >>>>>>>>>>>>>>>>> because it
> >> >> >>>>>>>>>>>>>>>>>>>>>>> directly affects the user's business logic
> (not
> >> >> >> just
> >> >> >>>>>>>>>>>>>>> performance
> >> >> >>>>>>>>>>>>>>>>>>>>>>> optimization) + touches just several
> functions
> >> of
> >> >> >> ONE
> >> >> >>>> table
> >> >> >>>>>>>>>>>>>>>>> (there
> >> >> >>>>>>>>>>>>>>>>>>>> can
> >> >> >>>>>>>>>>>>>>>>>>>>>>> be multiple tables with different caches).
> >> Does it
> >> >> >>>> really
> >> >> >>>>>>>>>>>>>>>>> matter for
> >> >> >>>>>>>>>>>>>>>>>>>>>>> the user (or someone else) where the logic is
> >> >> >>> located,
> >> >> >>>>>>>>>>>>>>> which is
> >> >> >>>>>>>>>>>>>>>>>>>>>>> affected by the applied option?
> >> >> >>>>>>>>>>>>>>>>>>>>>>> Also I can remember DDL option
> >> 'sink.parallelism',
> >> >> >>>> which in
> >> >> >>>>>>>>>>>>>>>>> some way
> >> >> >>>>>>>>>>>>>>>>>>>>>>> "controls the behavior of the framework" and
> I
> >> >> >> don't
> >> >> >>>> see any
> >> >> >>>>>>>>>>>>>>>>> problem
> >> >> >>>>>>>>>>>>>>>>>>>>>>> here.
> >> >> >>>>>>>>>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>>>>>>>>>> introduce a new interface for this
> all-caching
> >> >> >>>> scenario
> >> >> >>>>>>>>>>>>>>> and
> >> >> >>>>>>>>>>>>>>>>> the
> >> >> >>>>>>>>>>>>>>>>>>>>> design
> >> >> >>>>>>>>>>>>>>>>>>>>>>> would become more complex
> >> >> >>>>>>>>>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>>>>>>>>> This is a subject for a separate discussion,
> >> but
> >> >> >>>> actually
> >> >> >>>>>>>>>>>>>>> in our
> >> >> >>>>>>>>>>>>>>>>>>>>>>> internal version we solved this problem quite
> >> >> >> easily
> >> >> >>> -
> >> >> >>>> we
> >> >> >>>>>>>>>>>>>>> reused
> >> >> >>>>>>>>>>>>>>>>>>>>>>> InputFormat class (so there is no need for a
> >> new
> >> >> >>> API).
> >> >> >>>> The
> >> >> >>>>>>>>>>>>>>>>> point is
> >> >> >>>>>>>>>>>>>>>>>>>>>>> that currently all lookup connectors use
> >> >> >> InputFormat
> >> >> >>>> for
> >> >> >>>>>>>>>>>>>>>>> scanning
> >> >> >>>>>>>>>>>>>>>>>>>> the
> >> >> >>>>>>>>>>>>>>>>>>>>>>> data in batch mode: HBase, JDBC and even Hive
> >> - it
> >> >> >>> uses
> >> >> >>>>>>>>>>>>>>> class
> >> >> >>>>>>>>>>>>>>>>>>>>>>> PartitionReader, that is actually just a
> >> wrapper
> >> >> >>> around
> >> >> >>>>>>>>>>>>>>>>> InputFormat.
> >> >> >>>>>>>>>>>>>>>>>>>>>>> The advantage of this solution is the ability
> >> to
> >> >> >>> reload
> >> >> >>>>>>>>>>>>>>> cache
> >> >> >>>>>>>>>>>>>>>>> data
> >> >> >>>>>>>>>>>>>>>>>>>> in
> >> >> >>>>>>>>>>>>>>>>>>>>>>> parallel (number of threads depends on number
> >> of
> >> >> >>>>>>>>>>>>>>> InputSplits,
> >> >> >>>>>>>>>>>>>>>>> but
> >> >> >>>>>>>>>>>>>>>>>>>> has
> >> >> >>>>>>>>>>>>>>>>>>>>>>> an upper limit). As a result cache reload
> time
> >> >> >>>> significantly
> >> >> >>>>>>>>>>>>>>>>> reduces
> >> >> >>>>>>>>>>>>>>>>>>>>>>> (as well as time of input stream blocking). I
> >> know
> >> >> >>> that
> >> >> >>>>>>>>>>>>>>> usually
> >> >> >>>>>>>>>>>>>>>>> we
> >> >> >>>>>>>>>>>>>>>>>>>> try
> >> >> >>>>>>>>>>>>>>>>>>>>>>> to avoid usage of concurrency in Flink code,
> >> but
> >> >> >>> maybe
> >> >> >>>> this
> >> >> >>>>>>>>>>>>>>> one
> >> >> >>>>>>>>>>>>>>>>> can
> >> >> >>>>>>>>>>>>>>>>>>>> be
> >> >> >>>>>>>>>>>>>>>>>>>>>>> an exception. BTW I don't say that it's an
> >> ideal
> >> >> >>>> solution,
> >> >> >>>>>>>>>>>>>>> maybe
> >> >> >>>>>>>>>>>>>>>>>>>> there
> >> >> >>>>>>>>>>>>>>>>>>>>>>> are better ones.
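[A minimal sketch of the parallel reload idea described above: one task per input split, all writing into a shared map. Split reading is simulated here; the real version would iterate InputSplits via InputFormat, and all names are mine.]

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Sketch of a parallel ALL-cache reload over input splits. */
public class ParallelReloadSketch {

    /** Pretend each split yields a disjoint key range, as InputSplits would. */
    static List<int[]> readSplit(int splitIndex, int rowsPerSplit) {
        List<int[]> rows = new ArrayList<>();
        for (int i = 0; i < rowsPerSplit; i++) {
            int key = splitIndex * rowsPerSplit + i;
            rows.add(new int[] {key, key * 10});
        }
        return rows;
    }

    /** Reloads all splits concurrently into one shared cache map. */
    public static Map<Integer, Integer> reload(int numSplits, int rowsPerSplit, int threads) {
        Map<Integer, Integer> cache = new ConcurrentHashMap<>();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int s = 0; s < numSplits; s++) {
            final int split = s;
            pool.execute(() -> {
                for (int[] row : readSplit(split, rowsPerSplit)) {
                    cache.put(row[0], row[1]);
                }
            });
        }
        pool.shutdown();
        try {
            pool.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return cache;
    }
}
```

[The thread pool is bounded, matching the "upper limit" on the number of reload threads mentioned above.]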
> >> >> >>>>>>>>>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>>>>>>>>>> Providing the cache in the framework might
> >> >> >> introduce
> >> >> >>>>>>>>>>>>>>>>> compatibility
> >> >> >>>>>>>>>>>>>>>>>>>>>> issues
> >> >> >>>>>>>>>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>>>>>>>>> It's possible only in cases when the
> developer
> >> of
> >> >> >> the
> >> >> >>>>>>>>>>>>>>> connector
> >> >> >>>>>>>>>>>>>>>>>>>> won't
> >> >> >>>>>>>>>>>>>>>>>>>>>>> properly refactor his code and will use new
> >> cache
> >> >> >>>> options
> >> >> >>>>>>>>>>>>>>>>>>>> incorrectly
> >> >> >>>>>>>>>>>>>>>>>>>>>>> (i.e. explicitly provide the same options
> into
> >> 2
> >> >> >>>> different
> >> >> >>>>>>>>>>>>>>> code
> >> >> >>>>>>>>>>>>>>>>>>>>>>> places). For correct behavior all he will
> need
> >> to
> >> >> >> do
> >> >> >>>> is to
> >> >> >>>>>>>>>>>>>>>>> redirect
> >> >> >>>>>>>>>>>>>>>>>>>>>>> existing options to the framework's
> >> LookupConfig
> >> >> (+
> >> >> >>>> maybe
> >> >> >>>>>>>>>>>>>>> add an
> >> >> >>>>>>>>>>>>>>>>>>>> alias
> >> >> >>>>>>>>>>>>>>>>>>>>>>> for options, if there was different naming),
> >> >> >>> everything
> >> >> >>>>>>>>>>>>>>> will be
> >> >> >>>>>>>>>>>>>>>>>>>>>>> transparent for users. If the developer won't
> >> do
> >> >> >>>>>>>>>>>>>>> refactoring at
> >> >> >>>>>>>>>>>>>>>>> all,
> >> >> >>>>>>>>>>>>>>>>>>>>>>> nothing will be changed for the connector
> >> because
> >> >> >> of
> >> >> >>>>>>>>>>>>>>> backward
> >> >> >>>>>>>>>>>>>>>>>>>>>>> compatibility. Also if a developer wants to
> use
> >> >> his
> >> >> >>> own
> >> >> >>>>>>>>>>>>>>> cache
> >> >> >>>>>>>>>>>>>>>>> logic,
> >> >> >>>>>>>>>>>>>>>>>>>>>>> he just can refuse to pass some of the
> configs
> >> >> into
> >> >> >>> the
> >> >> >>>>>>>>>>>>>>>>> framework,
> >> >> >>>>>>>>>>>>>>>>>>>> and
> >> >> >>>>>>>>>>>>>>>>>>>>>>> instead make his own implementation with
> >> already
> >> >> >>>> existing
> >> >> >>>>>>>>>>>>>>>>> configs
> >> >> >>>>>>>>>>>>>>>>>>>> and
> >> >> >>>>>>>>>>>>>>>>>>>>>>> metrics (but actually I think that it's a
> rare
> >> >> >> case).
> >> >> >>>>>>>>>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>>>>>>>>>> filters and projections should be pushed all
> >> the
> >> >> >> way
> >> >> >>>> down
> >> >> >>>>>>>>>>>>>>> to
> >> >> >>>>>>>>>>>>>>>>> the
> >> >> >>>>>>>>>>>>>>>>>>>>> table
> >> >> >>>>>>>>>>>>>>>>>>>>>>> function, like what we do in the scan source
> >> >> >>>>>>>>>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>>>>>>>>> That's a great goal. But the truth is that
> >> the
> >> >> >>> ONLY
> >> >> >>>>>>>>>>>>>>> connector
> >> >> >>>>>>>>>>>>>>>>>>>> that
> >> >> >>>>>>>>>>>>>>>>>>>>>>> supports filter pushdown is
> >> FileSystemTableSource
> >> >> >>>>>>>>>>>>>>>>>>>>>>> (no database connector supports it
> currently).
> >> >> Also
> >> >> >>>> for some
> >> >> >>>>>>>>>>>>>>>>>>>> databases
> >> >> >>>>>>>>>>>>>>>>>>>>>>> it's simply impossible to push down such
> complex
> >> >> >>> filters
> >> >> >>>>>>>>>>>>>>> that we
> >> >> >>>>>>>>>>>>>>>>> have
> >> >> >>>>>>>>>>>>>>>>>>>>>>> in Flink.
> >> >> >>>>>>>>>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>>>>>>>>>> only applying these optimizations to the
> cache
> >> >> >> seems
> >> >> >>>> not
> >> >> >>>>>>>>>>>>>>>>> quite
> >> >> >>>>>>>>>>>>>>>>>>>>> useful
> >> >> >>>>>>>>>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>>>>>>>>> Filters can cut off an arbitrarily large
> >> amount of
> >> >> >>> data
> >> >> >>>>>>>>>>>>>>> from the
> >> >> >>>>>>>>>>>>>>>>>>>>>>> dimension table. For a simple example,
> suppose
> >> in
> >> >> >>>> dimension
> >> >> >>>>>>>>>>>>>>>>> table
> >> >> >>>>>>>>>>>>>>>>>>>>>>> 'users'
> >> >> >>>>>>>>>>>>>>>>>>>>>>> we have column 'age' with values from 20 to
> 40,
> >> >> and
> >> >> >>>> input
> >> >> >>>>>>>>>>>>>>> stream
> >> >> >>>>>>>>>>>>>>>>>>>>>>> 'clicks' that is ~uniformly distributed by
> age
> >> of
> >> >> >>>> users. If
> >> >> >>>>>>>>>>>>>>> we
> >> >> >>>>>>>>>>>>>>>>> have
> >> >> >>>>>>>>>>>>>>>>>>>>>>> filter 'age > 30',
> >> >> >>>>>>>>>>>>>>>>>>>>>>> there will be twice less data in cache. This
> >> means
> >> >> >>> the
> >> >> >>>> user
> >> >> >>>>>>>>>>>>>>> can
> >> >> >>>>>>>>>>>>>>>>>>>>>>> increase 'lookup.cache.max-rows' by almost 2
> >> >> times.
> >> >> >>> It
> >> >> >>>> will
> >> >> >>>>>>>>>>>>>>>>> gain a
> >> >> >>>>>>>>>>>>>>>>>>>>>>> huge
> >> >> >>>>>>>>>>>>>>>>>>>>>>> performance boost. Moreover, this
> optimization
> >> >> >> starts
> >> >> >>>> to
> >> >> >>>>>>>>>>>>>>> really
> >> >> >>>>>>>>>>>>>>>>>>>> shine
> >> >> >>>>>>>>>>>>>>>>>>>>>>> in 'ALL' cache, where tables without filters
> >> and
> >> >> >>>> projections
> >> >> >>>>>>>>>>>>>>>>> can't
> >> >> >>>>>>>>>>>>>>>>>>>> fit
> >> >> >>>>>>>>>>>>>>>>>>>>>>> in memory, but with them - can. This opens up
> >> >> >>>> additional
> >> >> >>>>>>>>>>>>>>>>>>>> possibilities
> >> >> >>>>>>>>>>>>>>>>>>>>>>> for users. And this doesn't sound as 'not
> quite
> >> >> >>>> useful'.
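[A quick back-of-envelope check of the example above: with ages cycling uniformly through 20..40 and the pushed-down filter age > 30 applied before caching, roughly half the dimension rows survive. A small sketch (all names are illustrative):]

```java
import java.util.function.IntPredicate;

/** Counts dimension rows that would reach the cache after a pushed-down filter. */
public class CacheSizeExample {

    /** Ages cycle uniformly through 20..40; only filtered rows are cached. */
    public static int survivingRows(int totalRows, IntPredicate ageFilter) {
        int kept = 0;
        for (int i = 0; i < totalRows; i++) {
            int age = 20 + i % 21;   // uniform over the 21 values 20..40
            if (ageFilter.test(age)) {
                kept++;
            }
        }
        return kept;
    }
}
```

[The filter age > 30 keeps 10 of every 21 rows, so the cache holds roughly half the table, which is what lets the user nearly double 'lookup.cache.max-rows'.]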
> >> >> >>>>>>>>>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>>>>>>>>> It would be great to hear other voices
> >> regarding
> >> >> >> this
> >> >> >>>> topic!
> >> >> >>>>>>>>>>>>>>>>> Because
> >> >> >>>>>>>>>>>>>>>>>>>>>>> we have quite a lot of controversial points,
> >> and I
> >> >> >>>> think
> >> >> >>>>>>>>>>>>>>> with
> >> >> >>>>>>>>>>>>>>>>> the
> >> >> >>>>>>>>>>>>>>>>>>>> help
> >> >> >>>>>>>>>>>>>>>>>>>>>>> of others it will be easier for us to come
> to a
> >> >> >>>> consensus.
> >> >> >>>>>>>>>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>>>>>>>>> Best regards,
> >> >> >>>>>>>>>>>>>>>>>>>>>>> Smirnov Alexander
> >> >> >>>>>>>>>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>>>>>>>>> пт, 29 апр. 2022 г. в 22:33, Qingsheng Ren <
> >> >> >>>>>>>>>>>>>>> renqschn@gmail.com
> >> >> >>>>>>>>>>>>>>>>>> :
> >> >> >>>>>>>>>>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>>>>>>>>>> Hi Alexander and Arvid,
> >> >> >>>>>>>>>>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>>>>>>>>>> Thanks for the discussion and sorry for my
> >> late
> >> >> >>>> response!
> >> >> >>>>>>>>>>>>>>> We
> >> >> >>>>>>>>>>>>>>>>> had
> >> >> >>>>>>>>>>>>>>>>>>>> an
> >> >> >>>>>>>>>>>>>>>>>>>>>>> internal discussion together with Jark and
> >> Leonard
> >> >> >>> and
> >> >> >>>> I’d
> >> >> >>>>>>>>>>>>>>> like
> >> >> >>>>>>>>>>>>>>>>> to
> >> >> >>>>>>>>>>>>>>>>>>>>>>> summarize our ideas. Instead of implementing
> >> the
> >> >> >>> cache
> >> >> >>>>>>>>>>>>>>> logic in
> >> >> >>>>>>>>>>>>>>>>> the
> >> >> >>>>>>>>>>>>>>>>>>>>> table
> >> >> >>>>>>>>>>>>>>>>>>>>>>> runtime layer or wrapping around the
> >> user-provided
> >> >> >>>> table
> >> >> >>>>>>>>>>>>>>>>> function,
> >> >> >>>>>>>>>>>>>>>>>>>> we
> >> >> >>>>>>>>>>>>>>>>>>>>>>> prefer to introduce some new APIs extending
> >> >> >>>> TableFunction
> >> >> >>>>>>>>>>>>>>> with
> >> >> >>>>>>>>>>>>>>>>> these
> >> >> >>>>>>>>>>>>>>>>>>>>>>> concerns:
> >> >> >>>>>>>>>>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>>>>>>>>>> 1. Caching actually breaks the semantic of
> >> "FOR
> >> >> >>>>>>>>>>>>>>> SYSTEM_TIME
> >> >> >>>>>>>>>>>>>>>>> AS OF
> >> >> >>>>>>>>>>>>>>>>>>>>>>> proc_time”, because it couldn’t truly reflect
> >> the
> >> >> >>>> content
> >> >> >>>>>>>>>>>>>>> of the
> >> >> >>>>>>>>>>>>>>>>>>>> lookup
> >> >> >>>>>>>>>>>>>>>>>>>>>>> table at the moment of querying. If users
> >> choose
> >> >> to
> >> >> >>>> enable
> >> >> >>>>>>>>>>>>>>>>> caching
> >> >> >>>>>>>>>>>>>>>>>>>> on
> >> >> >>>>>>>>>>>>>>>>>>>>> the
> >> >> >>>>>>>>>>>>>>>>>>>>>>> lookup table, they implicitly indicate that
> >> this
> >> >> >>>> breakage is
> >> >> >>>>>>>>>>>>>>>>>>>> acceptable
> >> >> >>>>>>>>>>>>>>>>>>>>>> in
> >> >> >>>>>>>>>>>>>>>>>>>>>>> exchange for the performance. So we prefer
> not
> >> to
> >> >> >>>> provide
> >> >> >>>>>>>>>>>>>>>>> caching on
> >> >> >>>>>>>>>>>>>>>>>>>>> the
> >> >> >>>>>>>>>>>>>>>>>>>>>>> table runtime level.
> >> >> >>>>>>>>>>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>>>>>>>>>> 2. If we make the cache implementation in
> the
> >> >> >>>> framework
> >> >> >>>>>>>>>>>>>>>>> (whether
> >> >> >>>>>>>>>>>>>>>>>>>> in a
> >> >> >>>>>>>>>>>>>>>>>>>>>>> runner or a wrapper around TableFunction), we
> >> have
> >> >> >> to
> >> >> >>>>>>>>>>>>>>> confront a
> >> >> >>>>>>>>>>>>>>>>>>>>>> situation
> >> >> >>>>>>>>>>>>>>>>>>>>>>> that allows table options in DDL to control
> the
> >> >> >>>> behavior of
> >> >> >>>>>>>>>>>>>>> the
> >> >> >>>>>>>>>>>>>>>>>>>>>> framework,
> >> >> >>>>>>>>>>>>>>>>>>>>>>> which has never happened previously and
> should
> >> be
> >> >> >>>> cautious.
> >> >> >>>>>>>>>>>>>>>>> Under
> >> >> >>>>>>>>>>>>>>>>>>>> the
> >> >> >>>>>>>>>>>>>>>>>>>>>>> current design the behavior of the framework
> >> >> should
> >> >> >>>> only be
> >> >> >>>>>>>>>>>>>>>>>>>> specified
> >> >> >>>>>>>>>>>>>>>>>>>>> by
> >> >> >>>>>>>>>>>>>>>>>>>>>>> configurations (“table.exec.xxx”), and it’s
> >> hard
> >> >> to
> >> >> >>>> apply
> >> >> >>>>>>>>>>>>>>> these
> >> >> >>>>>>>>>>>>>>>>>>>> general
> >> >> >>>>>>>>>>>>>>>>>>>>>>> configs to a specific table.
> >> >> >>>>>>>>>>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>>>>>>>>>> 3. We have use cases that lookup source
> loads
> >> and
> >> >> >>>> refresh
> >> >> >>>>>>>>>>>>>>> all
> >> >> >>>>>>>>>>>>>>>>>>>> records
> >> >> >>>>>>>>>>>>>>>>>>>>>>> periodically into the memory to achieve high
> >> >> lookup
> >> >> >>>>>>>>>>>>>>> performance
> >> >> >>>>>>>>>>>>>>>>>>>> (like
> >> >> >>>>>>>>>>>>>>>>>>>>>> Hive
> >> >> >>>>>>>>>>>>>>>>>>>>>>> connector in the community, and also widely
> >> used
> >> >> by
> >> >> >>> our
> >> >> >>>>>>>>>>>>>>> internal
> >> >> >>>>>>>>>>>>>>>>>>>>>>> connectors). Wrapping the cache around the
> >> user’s
> >> >> >>>>>>>>>>>>>>> TableFunction
> >> >> >>>>>>>>>>>>>>>>>>>> works
> >> >> >>>>>>>>>>>>>>>>>>>>>> fine
> >> >> >>>>>>>>>>>>>>>>>>>>>>> for LRU caches, but I think we have to
> >> introduce a
> >> >> >>> new
> >> >> >>>>>>>>>>>>>>>>> interface for
> >> >> >>>>>>>>>>>>>>>>>>>>> this
> >> >> >>>>>>>>>>>>>>>>>>>>>>> all-caching scenario and the design would
> >> become
> >> >> >> more
> >> >> >>>>>>>>>>>>>>> complex.
> >> >> >>>>>>>>>>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>>>>>>>>>> 4. Providing the cache in the framework
> might
> >> >> >>>> introduce
> >> >> >>>>>>>>>>>>>>>>>>>> compatibility
> >> >> >>>>>>>>>>>>>>>>>>>>>>> issues to existing lookup sources like there
> >> might
> >> >> >>>> exist two
> >> >> >>>>>>>>>>>>>>>>> caches
> >> >> >>>>>>>>>>>>>>>>>>>>> with
> >> >> >>>>>>>>>>>>>>>>>>>>>>> totally different strategies if the user
> >> >> >> incorrectly
> >> >> >>>>>>>>>>>>>>> configures
> >> >> >>>>>>>>>>>>>>>>> the
> >> >> >>>>>>>>>>>>>>>>>>>>> table
> >> >> >>>>>>>>>>>>>>>>>>>>>>> (one in the framework and another implemented
> >> by
> >> >> >> the
> >> >> >>>> lookup
> >> >> >>>>>>>>>>>>>>>>> source).
> >> >> >>>>>>>>>>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>>>>>>>>>> As for the optimization mentioned by
> >> Alexander, I
> >> >> >>>> think
> >> >> >>>>>>>>>>>>>>>>> filters
> >> >> >>>>>>>>>>>>>>>>>>>> and
> >> >> >>>>>>>>>>>>>>>>>>>>>>> projections should be pushed all the way down
> >> to
> >> >> >> the
> >> >> >>>> table
> >> >> >>>>>>>>>>>>>>>>> function,
> >> >> >>>>>>>>>>>>>>>>>>>>> like
> >> >> >>>>>>>>>>>>>>>>>>>>>>> what we do in the scan source, instead of the
> >> >> >> runner
> >> >> >>>> with
> >> >> >>>>>>>>>>>>>>> the
> >> >> >>>>>>>>>>>>>>>>> cache.
> >> >> >>>>>>>>>>>>>>>>>>>>> The
> >> >> >>>>>>>>>>>>>>>>>>>>>>> goal of using cache is to reduce the network
> >> I/O
> >> >> >> and
> >> >> >>>>>>>>>>>>>>> pressure
> >> >> >>>>>>>>>>>>>>>>> on the
> >> >> >>>>>>>>>>>>>>>>>>>>>>> external system, and only applying these
> >> >> >>> optimizations
> >> >> >>>> to
> >> >> >>>>>>>>>>>>>>> the
> >> >> >>>>>>>>>>>>>>>>> cache
> >> >> >>>>>>>>>>>>>>>>>>>>> seems
> >> >> >>>>>>>>>>>>>>>>>>>>>>> not quite useful.
> >> >> >>>>>>>>>>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>>>>>>>>>> I made some updates to the FLIP[1] to
> reflect
> >> our
> >> >> >>>> ideas.
> >> >> >>>>>>>>>>>>>>> We
> >> >> >>>>>>>>>>>>>>>>>>>> prefer to
> >> >> >>>>>>>>>>>>>>>>>>>>>>> keep the cache implementation as a part of
> >> >> >>>> TableFunction,
> >> >> >>>>>>>>>>>>>>> and we
> >> >> >>>>>>>>>>>>>>>>>>>> could
> >> >> >>>>>>>>>>>>>>>>>>>>>>> provide some helper classes
> >> (CachingTableFunction,
> >> >> >>>>>>>>>>>>>>>>>>>>>> AllCachingTableFunction,
> >> >> >>>>>>>>>>>>>>>>>>>>>>> CachingAsyncTableFunction) to developers and
> >> >> >> regulate
> >> >> >>>>>>>>>>>>>>> metrics
> >> >> >>>>>>>>>>>>>>>>> of the
> >> >> >>>>>>>>>>>>>>>>>>>>>> cache.
> >> >> >>>>>>>>>>>>>>>>>>>>>>> Also, I made a POC[2] for your reference.
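[The "regulate metrics of the cache" idea could be as simple as a pair of counters exposed uniformly by every cache implementation. A sketch; the metric names and shapes in the FLIP/POC may differ.]

```java
import java.util.concurrent.atomic.AtomicLong;

/** Sketch of standardized cache metrics: hit/miss counters and a hit ratio. */
public class CacheMetrics {
    private final AtomicLong hits = new AtomicLong();
    private final AtomicLong misses = new AtomicLong();

    public void recordHit() { hits.incrementAndGet(); }
    public void recordMiss() { misses.incrementAndGet(); }

    public long hitCount() { return hits.get(); }
    public long missCount() { return misses.get(); }

    /** Fraction of lookups served from the cache; 0 when nothing was recorded. */
    public double hitRatio() {
        long total = hits.get() + misses.get();
        return total == 0 ? 0.0 : (double) hits.get() / total;
    }
}
```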
> >> >> >>>>>>>>>>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>>>>>>>>>> Looking forward to your ideas!
> >> >> >>>>>>>>>>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>>>>>>>>>> [1]
> >> >> >>>>>>>>>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>
> >> >> >>>>
> >> >> >>>
> >> >> >>
> >> >>
> >>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-221+Abstraction+for+lookup+source+cache+and+metric
> >> >> >>>>>>>>>>>>>>>>>>>>>>>> [2]
> >> >> >>> https://github.com/PatrickRen/flink/tree/FLIP-221
> >> >> >>>>>>>>>>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>>>>>>>>>> Best regards,
> >> >> >>>>>>>>>>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>>>>>>>>>> Qingsheng
> >> >> >>>>>>>>>>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>>>>>>>>>> On Tue, Apr 26, 2022 at 4:45 PM Александр
> >> Смирнов
> >> >> >> <
> >> >> >>>>>>>>>>>>>>>>>>>>>> smiralexan@gmail.com>
> >> >> >>>>>>>>>>>>>>>>>>>>>>> wrote:
> >> >> >>>>>>>>>>>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>>>>>>>>>>> Thanks for the response, Arvid!
> >> >> >>>>>>>>>>>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>>>>>>>>>>> I have few comments on your message.
> >> >> >>>>>>>>>>>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>>>>>>>>>>>> but could also live with an easier
> solution
> >> as
> >> >> >> the
> >> >> >>>>>>>>>>>>>>> first
> >> >> >>>>>>>>>>>>>>>>> step:
> >> >> >>>>>>>>>>>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>>>>>>>>>>> I think that these 2 ways are mutually
> >> exclusive
> >> >> >>>>>>>>>>>>>>> (originally
> >> >> >>>>>>>>>>>>>>>>>>>>> proposed
> >> >> >>>>>>>>>>>>>>>>>>>>>>>>> by Qingsheng and mine), because
> conceptually
> >> >> they
> >> >> >>>> follow
> >> >> >>>>>>>>>>>>>>> the
> >> >> >>>>>>>>>>>>>>>>> same
> >> >> >>>>>>>>>>>>>>>>>>>>>>>>> goal, but implementation details are
> >> different.
> >> >> >> If
> >> >> >>> we
> >> >> >>>>>>>>>>>>>>> will
> >> >> >>>>>>>>>>>>>>>>> go one
> >> >> >>>>>>>>>>>>>>>>>>>>> way,
> >> >> >>>>>>>>>>>>>>>>>>>>>>>>> moving to another way in the future will
> mean
> >> >> >>>> deleting
> >> >> >>>>>>>>>>>>>>>>> existing
> >> >> >>>>>>>>>>>>>>>>>>>> code
> >> >> >>>>>>>>>>>>>>>>>>>>>>>>> and once again changing the API for
> >> connectors.
> >> >> >> So
> >> >> >>> I
> >> >> >>>>>>>>>>>>>>> think we
> >> >> >>>>>>>>>>>>>>>>>>>> should
> >> >> >>>>>>>>>>>>>>>>>>>>>>>>> reach a consensus with the community about
> >> that
> >> >> >> and
> >> >> >>>> then
> >> >> >>>>>>>>>>>>>>> work
> >> >> >>>>>>>>>>>>>>>>>>>>> together
> >> >> >>>>>>>>>>>>>>>>>>>>>>>>> on this FLIP, i.e. divide the work on tasks
> >> for
> >> >> >>>> different
> >> >> >>>>>>>>>>>>>>>>> parts
> >> >> >>>>>>>>>>>>>>>>>>>> of
> >> >> >>>>>>>>>>>>>>>>>>>>> the
> >> >> >>>>>>>>>>>>>>>>>>>>>>>>> flip (for example, LRU cache unification /
> >> >> >>>> introducing
> >> >> >>>>>>>>>>>>>>>>> proposed
> >> >> >>>>>>>>>>>>>>>>>>>> set
> >> >> >>>>>>>>>>>>>>>>>>>>> of
> >> >> >>>>>>>>>>>>>>>>>>>>>>>>> metrics / further work…). WDYT, Qingsheng?
> >> >> >>>>>>>>>>>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>>>>>>>>>>>> as the source will only receive the
> requests
> >> >> >> after
> >> >> >>>>>>>>>>>>>>> filter
> >> >> >>>>>>>>>>>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>>>>>>>>>>> Actually if filters are applied to fields
> of
> >> the
> >> >> >>>> lookup
> >> >> >>>>>>>>>>>>>>>>> table, we
> >> >> >>>>>>>>>>>>>>>>>>>>>>>>> firstly must do requests, and only after
> >> that we
> >> >> >>> can
> >> >> >>>>>>>>>>>>>>> filter
> >> >> >>>>>>>>>>>>>>>>>>>>> responses,
> >> >> >>>>>>>>>>>>>>>>>>>>>>>>> because lookup connectors don't have filter
> >> >> >>>> pushdown. So
> >> >> >>>>>>>>>>>>>>> if
> >> >> >>>>>>>>>>>>>>>>>>>>> filtering
> >> >> >>>>>>>>>>>>>>>>>>>>>>>>> is done before caching, there will be far fewer rows in the cache.
> >> >> >>>>>>>>>>>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>>>>>>>>>>>> @Alexander unfortunately, your
> architecture
> >> is
> >> >> >> not
> >> >> >>>>>>>>>>>>>>> shared.
> >> >> >>>>>>>>>>>>>>>>> I
> >> >> >>>>>>>>>>>>>>>>>>>> don't
> >> >> >>>>>>>>>>>>>>>>>>>>>>> know the
> >> >> >>>>>>>>>>>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>>>>>>>>>>>> solution to share images to be honest.
> >> >> >>>>>>>>>>>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>>>>>>>>>>> Sorry for that, I’m a bit new to such kinds
> >> of
> >> >> >>>>>>>>>>>>>>> conversations
> >> >> >>>>>>>>>>>>>>>>> :)
> >> >> >>>>>>>>>>>>>>>>>>>>>>>>> I have no write access to the confluence,
> so
> >> I
> >> >> >>> made a
> >> >> >>>>>>>>>>>>>>> Jira
> >> >> >>>>>>>>>>>>>>>>> issue,
> >> >> >>>>>>>>>>>>>>>>>>>>>>>>> where described the proposed changes in
> more
> >> >> >>> details
> >> >> >>>> -
> >> >> >>>>>>>>>>>>>>>>>>>>>>>>>
> >> >> >> https://issues.apache.org/jira/browse/FLINK-27411.
> >> >> >>>>>>>>>>>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>>>>>>>>>>> Will happy to get more feedback!
> >> >> >>>>>>>>>>>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>>>>>>>>>>> Best,
> >> >> >>>>>>>>>>>>>>>>>>>>>>>>> Smirnov Alexander
> >> >> >>>>>>>>>>>>>>>>>>>>>>>>>
> On Mon, 25 Apr 2022 at 19:49, Arvid Heise <arvid@apache.org> wrote:
>>
>> Hi Qingsheng,
>>
>> Thanks for driving this; the inconsistency was not satisfying for me.
>>
>> I second Alexander's idea, though I could also live with an easier
>> solution as a first step: instead of making caching an implementation
>> detail of TableFunction X, devise a caching layer around X. So the
>> proposal would be a CachingTableFunction that delegates to X in case of
>> misses and otherwise manages the cache. Lifting it into the operator
>> model as proposed would be even better, but that is probably unnecessary
>> in the first step for a lookup source (as the source will only receive
>> requests after the filter; applying the projection may be more
>> interesting to save memory).
>>
>> Another advantage is that all the changes of this FLIP would be limited
>> to options - no need for new public interfaces. Everything else remains
>> an implementation detail of the Table runtime. That means we can easily
>> incorporate the optimization potential that Alexander pointed out later.
>>
>> @Alexander unfortunately, your architecture is not shared. I don't know
>> the solution to share images to be honest.
>> On Fri, Apr 22, 2022 at 5:04 PM Александр Смирнов <smiralexan@gmail.com> wrote:
>>>
>>> Hi Qingsheng! My name is Alexander, I'm not a committer yet, but I'd
>>> really like to become one, and this FLIP really interested me.
>>> Actually, I have worked on a similar feature in my company's Flink
>>> fork, and we would like to share our thoughts on this and make the
>>> code open source.
>>>
>>> I think there is a better alternative than introducing an abstract
>>> class for TableFunction (CachingTableFunction). As you know,
>>> TableFunction lives in the flink-table-common module, which provides
>>> only an API for working with tables - that makes it very convenient
>>> to import in connectors. In turn, CachingTableFunction contains logic
>>> for runtime execution, so this class and everything connected with it
>>> should be located in another module, probably flink-table-runtime.
>>> But that would require connectors to depend on another module that
>>> contains a lot of runtime logic, which doesn't sound good.
>>>
>>> I suggest adding a new method 'getLookupConfig' to LookupTableSource
>>> or LookupRuntimeProvider so that connectors only pass configuration
>>> to the planner and therefore don't depend on the runtime realization.
>>> Based on these configs the planner will construct a lookup join
>>> operator with the corresponding runtime logic (ProcessFunctions in
>>> the module flink-table-runtime). The architecture looks like the
>>> pinned image (the LookupConfig class there is actually your
>>> CacheConfig).
>>>
>>> The classes in flink-table-planner that will be responsible for this
>>> are CommonPhysicalLookupJoin and its inheritors. The current classes
>>> for lookup join in flink-table-runtime are LookupJoinRunner,
>>> AsyncLookupJoinRunner, LookupJoinRunnerWithCalc and
>>> AsyncLookupJoinRunnerWithCalc. I suggest adding classes
>>> LookupJoinCachingRunner, LookupJoinCachingRunnerWithCalc, etc.
>>>
>>> And here comes another, more powerful advantage of such a solution:
>>> if we have the caching logic on a lower level, we can apply some
>>> optimizations to it. LookupJoinRunnerWithCalc was named like this
>>> because it uses the 'calc' function, which mostly consists of filters
>>> and projections.
>>>
>>> For example, when joining table A with lookup table B with the
>>> condition 'JOIN … ON A.id = B.id AND A.age = B.age + 10 WHERE
>>> B.salary > 1000', the 'calc' function will contain the filters
>>> A.age = B.age + 10 and B.salary > 1000.
>>>
>>> If we apply this function before storing records in the cache, the
>>> size of the cache will be significantly reduced: the filters avoid
>>> storing useless records in the cache, and the projections reduce each
>>> record's size. So the initial maximum number of records in the cache
>>> can be increased by the user.
>>>
>>> What do you think about it?
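The filtering-before-caching optimization described in the message above can be sketched in a few lines. Note that the class and method names below are illustrative stand-ins, not the real flink-table-runtime API, and generic types stand in for Flink's RowData:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.function.Predicate;
import java.util.stream.Collectors;

/**
 * Illustrative only: a lookup cache that runs the join's "calc" part
 * (filter + projection) before storing fetched rows, so that useless
 * records never enter the cache and stored records are smaller.
 */
class FilteringLookupCache<K, R> {
    private final Predicate<R> filter;       // e.g. B.salary > 1000
    private final Function<R, R> projection; // e.g. drop columns unused downstream
    private final Map<K, List<R>> cache = new HashMap<>();

    FilteringLookupCache(Predicate<R> filter, Function<R, R> projection) {
        this.filter = filter;
        this.projection = projection;
    }

    /** Filters and projects rows fetched from the external system, then caches the result. */
    List<R> put(K key, List<R> fetchedRows) {
        List<R> reduced = fetchedRows.stream()
                .filter(filter)
                .map(projection)
                .collect(Collectors.toList());
        cache.put(key, reduced);
        return reduced;
    }

    /** Returns the cached rows for the key, or null on a cache miss. */
    List<R> getIfPresent(K key) {
        return cache.get(key);
    }
}
```

With the salary filter from the example, a lookup that fetches three rows but keeps only those passing the filter stores just the surviving rows, which is why the maximum row count configured by the user can be raised.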
>>> On 2022/04/19 02:47:11 Qingsheng Ren wrote:
>>>> Hi devs,
>>>>
>>>> Yuan and I would like to start a discussion about FLIP-221 [1], which
>>>> introduces an abstraction of lookup table cache and its standard
>>>> metrics.
>>>>
>>>> Currently each lookup table source has to implement its own cache to
>>>> store lookup results, and there isn't a standard set of metrics for
>>>> users and developers to tune their jobs with lookup joins, which is a
>>>> quite common use case in Flink Table / SQL.
>>>>
>>>> Therefore we propose some new APIs including cache, metrics, wrapper
>>>> classes of TableFunction and new table options. Please take a look at
>>>> the FLIP page [1] to get more details. Any suggestions and comments
>>>> would be appreciated!
>>>>
>>>> [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-221+Abstraction+for+lookup+source+cache+and+metric
>>>>
>>>> Best regards,
>>>>
>>>> Qingsheng
>>>> --
>>>> Best Regards,
>>>> Qingsheng Ren
>>>> Real-time Computing Team
>>>> Alibaba Cloud
>>>> Email: renqschn@gmail.com
>>>>
>>>> --
>>>> Best regards,
>>>> Roman Boyko
>>>> e.: ro.v.boyko@gmail.com

Re: [DISCUSS] FLIP-221 Abstraction for lookup source cache and metric

Posted by Qingsheng Ren <re...@gmail.com>.
Hi Alexander, 

Thanks for the update! I made some changes to the FLIP; here are the diffs compared to your version:

1. ReloadStrategy has been covered by the new interface FullCachingReloadTrigger. This trigger is owned and run by the LookupJoinRunner and provides more flexibility than just time-based scheduling.

2. We provide an implementation “PeriodicFullCachingReloadTrigger” as a public util and decided not to accept a start time as an argument. It looks weird that full caching doesn’t load anything at the beginning of the job but only loads after some time. I think it’s quite easy to implement a trigger that reloads at specific times (like 8:00 daily), and I wrote a DailyFullCachingReloadTrigger in my POC [1] (it might be naive, so please forgive any flaws), but I prefer not to expose it as a public API because we can’t cover all cases, and the periodic one should be enough.

3. About “PartialCache”, I prefer to keep the original name “LookupCache”. This interface just defines a storage abstraction and has nothing inherently “partial” about it. It is the caching operation that “partially” puts entries into the cache, which is why I renamed LookupFunctionProvider to PartialCachingLookupProvider.

Looking forward to your reply!

[1] https://github.com/PatrickRen/flink/blob/FLIP-221-framework/flink-table/flink-table-common/src/main/java/org/apache/flink/table/connector/source/lookup/caching/full/DailyFullCachingReloadTrigger.java

Best,

Qingsheng
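The trigger abstraction mentioned in point 1 can be illustrated with a minimal sketch. The interfaces below are simplified stand-ins for the FLIP's FullCachingReloadTrigger / PeriodicFullCachingReloadTrigger, not the actual proposed API:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Simplified stand-in for the reload task the framework would hand to a trigger. */
interface ReloadTask {
    void reload();
}

/** Simplified stand-in for the trigger abstraction: it decides *when* to reload. */
interface FullCacheReloadTrigger extends AutoCloseable {
    void open(ReloadTask task) throws Exception;
}

/** Reloads the full cache at a fixed delay, with the first load happening immediately. */
class PeriodicReloadTrigger implements FullCacheReloadTrigger {
    private final long intervalMillis;
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    PeriodicReloadTrigger(long intervalMillis) {
        this.intervalMillis = intervalMillis;
    }

    @Override
    public void open(ReloadTask task) {
        // Initial delay 0: the full cache is populated right at job start,
        // matching the argument above that an empty cache at startup is surprising.
        scheduler.scheduleWithFixedDelay(task::reload, 0, intervalMillis, TimeUnit.MILLISECONDS);
    }

    @Override
    public void close() {
        scheduler.shutdownNow();
    }
}
```

A time-of-day trigger (like the DailyFullCachingReloadTrigger in the POC) would implement the same interface but compute its own schedule, which is the flexibility the pluggable trigger buys over a fixed period.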


> On May 31, 2022, at 00:50, Alexander Smirnov <sm...@gmail.com> wrote:
> 
> Hi Jingsong and devs!
> 
> I agree that custom reloading would be very useful, so I changed the
> recently proposed ReloadTime to a customizable ReloadStrategy with a
> default realization, FixedDelayReloadStrategy. I updated the FLIP; you
> can look at the new design [1]. From my point of view, the
> disadvantage of my solution is that the framework pushes its runtime
> reloading logic (packed into a Runnable) into the connectors, but on
> the other hand it looks pretty concise. I decided not to pass an
> ExecutorService into 'scheduleReload' because it would be redundant in
> cases where the connector already has an active connection to which a
> listener can be added. As an alternative I had the idea of making
> connectors provide a CompletableFuture to which the framework can
> apply callbacks (so they would be triggered once the Future
> completes), but it didn't look very clear. I would be glad to have
> your feedback!
> 
> Jingsong, I'm fine with your other suggestions (about the unification of
> the full / partial caches and the customizable full cache). If Qingsheng
> and others agree, we can also change that in the FLIP.
> 
> Best regards,
> Alexander
> 
> [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-221%3A+Abstraction+for+lookup+source+cache+and+metric
> 
> On Fri, 27 May 2022 at 16:12, Jingsong Li <ji...@gmail.com> wrote:
>> 
>> Hi all,
>> 
>> I think the current problems are the following:
>> 1. The AllCache and PartialCache interfaces are not uniform: one needs
>> to provide a LookupProvider, the other a CacheBuilder.
>> 2. The AllCache definition is not flexible. For example, PartialCache
>> can use any custom storage while AllCache cannot; AllCache may also
>> want to store entries in memory or on disk, which likewise needs a
>> flexible strategy.
>> 3. AllCache cannot customize its ReloadStrategy; currently there is
>> only ScheduledReloadStrategy.
>>
>> To solve the above problems, here are my ideas.
>> 
>> ## Top level cache interfaces:
>> 
>> ```
>> 
>> public interface CacheLookupProvider extends
>> LookupTableSource.LookupRuntimeProvider {
>> 
>>    CacheBuilder createCacheBuilder();
>> }
>> 
>> 
>> public interface CacheBuilder {
>>    Cache create();
>> }
>> 
>> 
>> public interface Cache {
>>
>>    /**
>>     * Returns the value associated with the key in this cache, or null
>>     * if there is no cached value for the key.
>>     */
>>    @Nullable
>>    Collection<RowData> getIfPresent(RowData key);
>>
>>    /** Returns the number of key-value mappings in the cache. */
>>    long size();
>> }
>> 
>> ```
>> 
>> ## Partial cache
>> 
>> ```
>> 
>> public interface PartialCacheLookupFunction extends CacheLookupProvider {
>>
>>    @Override
>>    PartialCacheBuilder createCacheBuilder();
>>
>>    /** Creates a {@link LookupFunction} instance. */
>>    LookupFunction createLookupFunction();
>> }
>> 
>> 
>> public interface PartialCacheBuilder extends CacheBuilder {
>> 
>>    PartialCache create();
>> }
>> 
>> 
>> public interface PartialCache extends Cache {
>>
>>    /**
>>     * Associates the specified value rows with the specified key row in
>>     * the cache. If the cache previously contained a value associated
>>     * with the key, the old value is replaced by the specified value.
>>     *
>>     * @param key key row with which the specified value is to be associated
>>     * @param value value rows to be associated with the specified key
>>     * @return the previous value rows associated with the key, or null
>>     *     if there was no mapping for the key
>>     */
>>    Collection<RowData> put(RowData key, Collection<RowData> value);
>>
>>    /** Discards any cached value for the specified key. */
>>    void invalidate(RowData key);
>> }
>> 
>> ```
>> 
>> ## All cache
>> ```
>> 
>> public interface AllCacheLookupProvider extends CacheLookupProvider {
>> 
>>    void registerReloadStrategy(ScheduledExecutorService
>> executorService, Reloader reloader);
>> 
>>    ScanTableSource.ScanRuntimeProvider getScanRuntimeProvider();
>> 
>>    @Override
>>    AllCacheBuilder createCacheBuilder();
>> }
>> 
>> 
>> public interface AllCacheBuilder extends CacheBuilder {
>> 
>>    AllCache create();
>> }
>> 
>> 
>> public interface AllCache extends Cache {
>> 
>>    void putAll(Iterator<Map<RowData, RowData>> allEntries);
>> 
>>    void clearAll();
>> }
>> 
>> 
>> public interface Reloader {
>> 
>>    void reload();
>> }
>> 
>> ```
>> 
>> Best,
>> Jingsong
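As a usage illustration of the PartialCache shape sketched in the mail above, here is a minimal in-memory LRU realization. String stands in for Flink's RowData so the sketch stays self-contained, and the class is an assumption for illustration, not part of the proposal:

```java
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.Map;

/**
 * Illustrative only: an LRU realization of the PartialCache shape above,
 * with String standing in for Flink's RowData.
 */
class LruPartialCache {
    private final LinkedHashMap<String, Collection<String>> entries;

    LruPartialCache(int maxKeys) {
        // accessOrder = true gives LRU iteration order; the eldest entry is
        // evicted once the configured key count is exceeded.
        this.entries = new LinkedHashMap<String, Collection<String>>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Collection<String>> eldest) {
                return size() > maxKeys;
            }
        };
    }

    /** Mirrors PartialCache#getIfPresent: null means a cache miss. */
    Collection<String> getIfPresent(String key) {
        return entries.get(key);
    }

    /** Mirrors PartialCache#put: returns the previously cached rows, or null. */
    Collection<String> put(String key, Collection<String> value) {
        return entries.put(key, value);
    }

    /** Mirrors PartialCache#invalidate. */
    void invalidate(String key) {
        entries.remove(key);
    }

    long size() {
        return entries.size();
    }
}
```

This is essentially what a Guava/Caffeine-backed implementation would provide out of the box (plus weigher-based sizing and expiry), which is why the comparison to those libraries comes up in the thread.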
>> 
>> On Fri, May 27, 2022 at 11:10 AM Jingsong Li <ji...@gmail.com> wrote:
>> 
>>> Thanks Qingsheng and all for your discussion.
>>> 
>>> Very sorry to jump in so late.
>>> 
>>> Maybe I missed something?
>>> My first impression when I saw the cache interface was: why don't we
>>> provide an interface similar to Guava cache [1]? On top of Guava
>>> cache, Caffeine also adds extensions for asynchronous calls [2], and
>>> Caffeine supports bulk loading too.
>>>
>>> I am also confused about why we first go from
>>> LookupCacheFactory.Builder to a Factory and only then create the Cache.
>>> 
>>> [1] https://github.com/google/guava
>>> [2] https://github.com/ben-manes/caffeine/wiki/Population
>>> 
>>> Best,
>>> Jingsong
>>> 
>>> On Thu, May 26, 2022 at 11:17 PM Jark Wu <im...@gmail.com> wrote:
>>> 
>>>> After looking at the newly introduced ReloadTime and Becket's comment,
>>>> I agree with Becket that we should have a pluggable reloading strategy.
>>>> We can provide some common implementations, e.g., periodic reloading
>>>> and daily reloading.
>>>> But there will definitely be some connector- or business-specific
>>>> reloading strategies, e.g. being notified by a ZooKeeper watcher, or
>>>> reloading once a new Hive partition is complete.
>>>> 
>>>> Best,
>>>> Jark
>>>> 
>>>> On Thu, 26 May 2022 at 11:52, Becket Qin <be...@gmail.com> wrote:
>>>> 
>>>>> Hi Qingsheng,
>>>>> 
>>>>> Thanks for updating the FLIP. A few comments / questions below:
>>>>> 
>>>>> 1. Is there a reason that we have both "XXXFactory" and
>>>>> "XXXProvider"? What is the difference between them? If they are the
>>>>> same, can we just use XXXFactory everywhere?
>>>>> 
>>>>> 2. Regarding the FullCachingLookupProvider, should the reloading
>>>>> policy also be pluggable? Periodic reloading can sometimes be tricky
>>>>> in practice. For example, if a user sets 24 hours as the cache
>>>>> refresh interval and some nightly batch job is delayed, the cache
>>>>> update may still see stale data.
>>>>> 
>>>>> 3. In DefaultLookupCacheFactory, it looks like InitialCapacity should be
>>>>> removed.
>>>>> 
>>>>> 4. The purpose of LookupFunctionProvider#cacheMissingKey() seems a
>>>>> little confusing to me. If Optional<LookupCacheFactory>
>>>>> getCacheFactory() returns a non-empty factory, doesn't that already
>>>>> indicate to the framework that it should cache the missing keys?
>>>>> Also, why does this method return an Optional<Boolean> instead of a
>>>>> boolean?
>>>>> 
>>>>> Thanks,
>>>>> 
>>>>> Jiangjie (Becket) Qin
>>>>> 
>>>>> 
>>>>> 
>>>>> On Wed, May 25, 2022 at 5:07 PM Qingsheng Ren <re...@gmail.com> wrote:
>>>>> 
>>>>>> Hi Lincoln and Jark,
>>>>>> 
>>>>>> Thanks for the comments! If the community reaches a consensus that
>>>>>> we use SQL hints instead of table options to decide whether to use
>>>>>> sync or async mode, it’s indeed not necessary to introduce the
>>>>>> “lookup.async” option.
>>>>>>
>>>>>> I think it’s a good idea to let the decision about async be made at
>>>>>> the query level, which could enable better optimization with more
>>>>>> information gathered by the planner. Is there any FLIP describing
>>>>>> the issue in FLINK-27625? I thought FLIP-234 only proposes adding a
>>>>>> SQL hint for retry on missing results, rather than having the entire
>>>>>> async mode controlled by a hint.
>>>>>> 
>>>>>> Best regards,
>>>>>> 
>>>>>> Qingsheng
>>>>>> 
>>>>>>> On May 25, 2022, at 15:13, Lincoln Lee <li...@gmail.com> wrote:
>>>>>>> 
>>>>>>> Hi Jark,
>>>>>>> 
>>>>>>> Thanks for your reply!
>>>>>>> 
>>>>>>> Currently 'lookup.async' only exists in the HBase connector, and I
>>>>>>> have no idea whether or when to remove it (we can discuss that in
>>>>>>> another issue for the HBase connector after FLINK-27625 is done); I
>>>>>>> just prefer not to add it as a common option now.
>>>>>>> 
>>>>>>> Best,
>>>>>>> Lincoln Lee
>>>>>>> 
>>>>>>> 
>>>>>>> On Tue, 24 May 2022 at 20:14, Jark Wu <im...@gmail.com> wrote:
>>>>>>> 
>>>>>>>> Hi Lincoln,
>>>>>>>> 
>>>>>>>> I have taken a look at FLIP-234, and I agree with you that
>>>>>>>> connectors can provide both async and sync runtime providers
>>>>>>>> simultaneously instead of only one of them. At that point,
>>>>>>>> "lookup.async" looks redundant. If this option is planned to be
>>>>>>>> removed in the long term, I think it makes sense not to introduce
>>>>>>>> it in this FLIP.
>>>>>>>> 
>>>>>>>> Best,
>>>>>>>> Jark
>>>>>>>> 
>>>>>>>> On Tue, 24 May 2022 at 11:08, Lincoln Lee <li...@gmail.com> wrote:
>>>>>>>> 
>>>>>>>>> Hi Qingsheng,
>>>>>>>>> 
>>>>>>>>> Sorry for jumping into the discussion so late. It's a good idea
>>>>>>>>> that we can have a common table option. I have one minor comment
>>>>>>>>> on 'lookup.async': let's not make it a common option.
>>>>>>>>> 
>>>>>>>>> The table layer abstracts both sync and async lookup
>>>>>>>>> capabilities, and connector implementers can choose one or both.
>>>>>>>>> In the case of implementing only one capability (the status of
>>>>>>>>> most existing built-in connectors), 'lookup.async' will not be
>>>>>>>>> used. And when a connector has both capabilities, I think this
>>>>>>>>> choice is better made at the query level: for example, the table
>>>>>>>>> planner can choose the physical implementation of async or sync
>>>>>>>>> lookup based on its cost model, or users can give a query hint
>>>>>>>>> based on their own better understanding. If there is another
>>>>>>>>> common table option 'lookup.async', it may confuse users in the
>>>>>>>>> long run.
>>>>>>>>> 
>>>>>>>>> So, I prefer to leave the 'lookup.async' option as a private
>>>>>>>>> option (for the current HBase connector) and not turn it into a
>>>>>>>>> common one.
>>>>>>>>> 
>>>>>>>>> WDYT?
>>>>>>>>> 
>>>>>>>>> Best,
>>>>>>>>> Lincoln Lee
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>> On Mon, 23 May 2022 at 14:54, Qingsheng Ren <re...@gmail.com> wrote:
>>>>>>>>> 
>>>>>>>>>> Hi Alexander,
>>>>>>>>>> 
>>>>>>>>>> Thanks for the review! We recently updated the FLIP, and you can
>>>>>>>>>> find those changes in my latest email. Since some terminology
>>>>>>>>>> has changed, I’ll use the new concepts when replying to your
>>>>>>>>>> comments.
>>>>>>>>>> 
>>>>>>>>>> 1. Builder vs ‘of’
>>>>>>>>>> I’m OK with using the builder pattern if we have additional
>>>>>>>>>> optional parameters for full caching mode (“rescan” previously).
>>>>>>>>>> The schedule-with-delay idea looks reasonable to me, but I think
>>>>>>>>>> we need to redesign the builder API of full caching to make it
>>>>>>>>>> more descriptive for developers. Would you mind sharing your
>>>>>>>>>> ideas about the API? For access to the FLIP workspace you can
>>>>>>>>>> just provide your account ID and ping any PMC member, including
>>>>>>>>>> Jark.
>>>>>>>>>> 
>>>>>>>>>> 2. Common table options
>>>>>>>>>> We have had some discussions over the last few days and propose
>>>>>>>>>> introducing 8 common table options for caching. The FLIP has
>>>>>>>>>> been updated accordingly.
>>>>>>>>>> 
>>>>>>>>>> 3. Retries
>>>>>>>>>> I think we are on the same page :-)
>>>>>>>>>> 
>>>>>>>>>> For your additional concerns:
>>>>>>>>>> 1) The table option has been updated.
>>>>>>>>>> 2) We brought “lookup.cache” back for configuring whether to use
>>>>>>>>>> partial or full caching mode.
>>>>>>>>>> 
>>>>>>>>>> Best regards,
>>>>>>>>>> 
>>>>>>>>>> Qingsheng
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>>> On May 19, 2022, at 17:25, Александр Смирнов <smiralexan@gmail.com> wrote:
>>>>>>>>>>> 
>>>>>>>>>>> Also, I have a few additions:
>>>>>>>>>>> 1) Maybe rename 'lookup.cache.maximum-size' to
>>>>>>>>>>> 'lookup.cache.max-rows'? I think it will be clearer that we are
>>>>>>>>>>> talking not about bytes but about the number of rows. Plus it
>>>>>>>>>>> fits better, considering my optimization with filters.
>>>>>>>>>>> 2) How will users enable rescanning? Are we going to separate
>>>>>>>>>>> caching and rescanning from the options point of view?
>>>>>>>>>>> Initially we had one option 'lookup.cache' with values LRU /
>>>>>>>>>>> ALL. I think now we can make a boolean option 'lookup.rescan';
>>>>>>>>>>> the rescan interval can be 'lookup.rescan.interval', etc.
>>>>>>>>>>> 
>>>>>>>>>>> Best regards,
>>>>>>>>>>> Alexander
>>>>>>>>>>> 
>>>>>>>>>>> On Thu, 19 May 2022 at 14:50, Александр Смирнов <smiralexan@gmail.com> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>> Hi Qingsheng and Jark,
>>>>>>>>>>>> 
>>>>>>>>>>>> 1. Builders vs 'of'
>>>>>>>>>>>> I understand that builders are used when we have multiple
>>>>>>>>>>>> parameters. I suggested them because we could add parameters
>>>>>>>>>>>> later. To prevent the Builder for ScanRuntimeProvider from
>>>>>>>>>>>> looking redundant, I can suggest one more config now:
>>>>>>>>>>>> "rescanStartTime".
>>>>>>>>>>>> It's a time in UTC (LocalTime class) at which the first reload
>>>>>>>>>>>> of the cache starts. This parameter can be thought of as the
>>>>>>>>>>>> 'initialDelay' (the difference between the current time and
>>>>>>>>>>>> rescanStartTime) in the method
>>>>>>>>>>>> ScheduledExecutorService#scheduleWithFixedDelay [1]. It can be
>>>>>>>>>>>> very useful when the dimension table is updated by some other
>>>>>>>>>>>> scheduled job at a certain time, or when the user simply wants
>>>>>>>>>>>> the second scan (the first cache reload) to be delayed. This
>>>>>>>>>>>> option can be used even without 'rescanInterval'; in that case
>>>>>>>>>>>> 'rescanInterval' will be one day.
>>>>>>>>>>>> If you are fine with this option, I would be very glad if you
>>>>>>>>>>>> would give me access to edit the FLIP page, so I could add it
>>>>>>>>>>>> myself.
>>>>>>>>>>>> 
>>>>>>>>>>>> 2. Common table options
>>>>>>>>>>>> I also think that FactoryUtil would be overloaded by all the
>>>>>>>>>>>> cache options. But maybe we could unify all the suggested
>>>>>>>>>>>> options, not only those for the default cache? I.e. a class
>>>>>>>>>>>> 'LookupOptions' that unifies the default cache options, the
>>>>>>>>>>>> rescan options, 'async', and 'maxRetries'. WDYT?
>>>>>>>>>>>> 
>>>>>>>>>>>> 3. Retries
>>>>>>>>>>>> I'm fine with a suggestion close to RetryUtils#tryTimes(times, call).
>>>>>>>>>>>> 
>>>>>>>>>>>> [1] https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/ScheduledExecutorService.html#scheduleWithFixedDelay-java.lang.Runnable-long-long-java.util.concurrent.TimeUnit-
>>>>>>>>>>>> 
>>>>>>>>>>>> Best regards,
>>>>>>>>>>>> Alexander
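The 'rescanStartTime'-to-'initialDelay' computation described in the message above can be sketched like this (the helper class and method names are hypothetical, chosen for illustration):

```java
import java.time.Duration;
import java.time.LocalTime;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;

/** Illustrative helper: turn a UTC wall-clock time into an initial delay. */
class RescanScheduling {

    /**
     * Returns the delay from 'now' until the next occurrence of
     * rescanStartTime (interpreted in UTC), rolling over to the next day
     * if that time has already passed today.
     */
    static Duration initialDelay(LocalTime rescanStartTime, ZonedDateTime now) {
        ZonedDateTime nowUtc = now.withZoneSameInstant(ZoneOffset.UTC);
        ZonedDateTime next = nowUtc.with(rescanStartTime);
        if (!next.isAfter(nowUtc)) {
            next = next.plusDays(1);
        }
        return Duration.between(nowUtc, next);
    }
}
```

The result would then be fed straight into the scheduler, e.g. `scheduler.scheduleWithFixedDelay(reloadTask, initialDelay.toMillis(), rescanIntervalMillis, TimeUnit.MILLISECONDS)`, with the interval defaulting to one day as proposed.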
>>>>>>>>>>>> 
>>>>>>>>>>>> ср, 18 мая 2022 г. в 16:04, Qingsheng Ren <re...@gmail.com>:
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Hi Jark and Alexander,
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Thanks for your comments! I’m also OK with introducing common
>>>>>>>>>>>>> table options. I prefer to introduce a new
>>>>>>>>>>>>> DefaultLookupCacheOptions class for holding these option
>>>>>>>>>>>>> definitions, because putting all the options into FactoryUtil
>>>>>>>>>>>>> would make it a bit ”crowded” and not well categorized.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> The FLIP has been updated according to the suggestions above:
>>>>>>>>>>>>> 1. Use a static “of” method for constructing
>>>>>>>>>>>>> RescanRuntimeProvider, since both arguments are required.
>>>>>>>>>>>>> 2. Introduce new table options matching
>>>>>>>>>>>>> DefaultLookupCacheFactory.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Best,
>>>>>>>>>>>>> Qingsheng
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On Wed, May 18, 2022 at 2:57 PM Jark Wu <im...@gmail.com> wrote:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Hi Alex,
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 1) retry logic
>>>>>>>>>>>>>> I think we can extract some common retry logic into utilities,
>>>>>>>> e.g.
>>>>>>>>>> RetryUtils#tryTimes(times, call).
>>>>>>>>>>>>>> This seems independent of this FLIP and can be reused by
>>>>>>>> DataStream
>>>>>>>>>> users.
>>>>>>>>>>>>>> Maybe we can open an issue to discuss this and where to put
>>>> it.
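[Editor's note] For concreteness, a utility like the `RetryUtils#tryTimes(times, call)` mentioned above could be sketched as follows. This is a hedged sketch: the class and method names are only the ones proposed in this thread, not an existing Flink API.

```java
import java.util.concurrent.Callable;

// Hypothetical sketch of the RetryUtils#tryTimes(times, call) utility
// proposed in this thread; not an actual Flink class.
public final class RetryUtils {

    private RetryUtils() {}

    /** Invokes 'call' up to 'times' times, rethrowing the last failure. */
    public static <T> T tryTimes(int times, Callable<T> call) throws Exception {
        Exception last = null;
        for (int attempt = 1; attempt <= times; attempt++) {
            try {
                return call.call();
            } catch (Exception e) {
                // A connector could hook custom recovery here,
                // e.g. re-establishing a JDBC connection.
                last = e;
            }
        }
        throw last;
    }
}
```

Such a helper would also be directly reusable by DataStream users, which matches the suggestion to keep it independent of this FLIP.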
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 2) cache ConfigOptions
>>>>>>>>>>>>>> I'm fine with defining cache config options in the framework.
>>>>>>>>>>>>>> A candidate place to put is FactoryUtil which also includes
>>>>>>>>>> "sink.parallelism", "format" options.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>> Jark
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On Wed, 18 May 2022 at 13:52, Александр Смирнов <
>>>>>>>>> smiralexan@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Hi Qingsheng,
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Thank you for considering my comments.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> there might be custom logic before making retry, such as
>>>>>>>>>> re-establish the connection
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Yes, I understand that. I meant that such logic can be
>>>> placed in
>>>>>>>> a
>>>>>>>>>>>>>>> separate function that can be implemented by connectors.
>>>> Just
>>>>>>>>> moving
>>>>>>>>>>>>>>> the retry logic would make connector's LookupFunction more
>>>>>>>> concise
>>>>>>>>> +
>>>>>>>>>>>>>>> avoid duplicate code. However, it's a minor change. The
>>>> decision
>>>>>>>> is
>>>>>>>>>> up
>>>>>>>>>>>>>>> to you.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> We decided not to provide common DDL options and to let
>>>>>>>>>>>>>>>> developers define their own options per connector, as we do now.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> What is the reason for that? One of the main goals of this
>>>> FLIP
>>>>>>>> was
>>>>>>>>>> to
>>>>>>>>>>>>>>> unify the configs, wasn't it? I understand that current cache
>>>>>>>>> design
>>>>>>>>>>>>>>> doesn't depend on ConfigOptions, like was before. But still
>>>> we
>>>>>>>> can
>>>>>>>>>> put
>>>>>>>>>>>>>>> these options into the framework, so connectors can reuse
>>>> them
>>>>>>>> and
>>>>>>>>>>>>>>> avoid code duplication, and, what is more significant, avoid
>>>>>>>>> possible
>>>>>>>>>>>>>>> different options naming. This moment can be pointed out in
>>>>>>>>>>>>>>> documentation for connector developers.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>>> Alexander
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> вт, 17 мая 2022 г. в 17:11, Qingsheng Ren <
>>>> renqschn@gmail.com>:
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Hi Alexander,
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Thanks for the review and glad to see we are on the same
>>>> page!
>>>>>> I
>>>>>>>>>> think you forgot to cc the dev mailing list so I’m also quoting
>>>> your
>>>>>>>>> reply
>>>>>>>>>> under this email.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> We can add 'maxRetryTimes' option into this class
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> In my opinion the retry logic should be implemented in
>>>> lookup()
>>>>>>>>>> instead of in LookupFunction#eval(). Retrying is only meaningful
>>>>>> under
>>>>>>>>> some
>>>>>>>>>> specific retriable failures, and there might be custom logic
>>>>>>>>>> before retrying, such as re-establishing the connection
>>>> (JdbcRowDataLookupFunction
>>>>>>>> is
>>>>>>>>> an
>>>>>>>>>> example), so it's more handy to leave it to the connector.
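[Editor's note] The split described here, a final `eval()` as the framework's hook point with retry and recovery left to the connector's `lookup()`, could look roughly like this. Names follow the FLIP discussion; the actual Flink interface may differ.

```java
import java.io.IOException;
import java.util.List;

// Hedged sketch of the eval()/lookup() separation discussed above: the
// final eval() is the framework hook point (where a cache layer could sit),
// while connectors implement lookup() including any connector-specific
// retry/reconnect logic. Not the actual Flink API.
public abstract class SketchLookupFunction<K, R> {

    /** Called by the join operator; caching and metrics could be added here. */
    public final List<R> eval(K key) {
        try {
            return lookup(key);
        } catch (IOException e) {
            throw new RuntimeException("Lookup failed for key " + key, e);
        }
    }

    /** Connector-specific lookup, free to retry or reconnect as needed. */
    protected abstract List<R> lookup(K key) throws IOException;
}
```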
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> I don't see DDL options, that were in previous version of
>>>>>> FLIP.
>>>>>>>>> Do
>>>>>>>>>> you have any special plans for them?
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> We decided not to provide common DDL options and to let
>>>>>>>>>>>>>>>> developers define their own options per connector, as we do now.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> The rest of comments sound great and I’ll update the FLIP.
>>>> Hope
>>>>>>>> we
>>>>>>>>>> can finalize our proposal soon!
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Qingsheng
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> On May 17, 2022, at 13:46, Александр Смирнов <
>>>>>>>>> smiralexan@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Hi Qingsheng and devs!
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> I like the overall design of updated FLIP, however I have
>>>>>>>> several
>>>>>>>>>>>>>>>>> suggestions and questions.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 1) Introducing LookupFunction as a subclass of
>>>> TableFunction
>>>>>>>> is a
>>>>>>>>>> good
>>>>>>>>>>>>>>>>> idea. We can add 'maxRetryTimes' option into this class.
>>>>>> 'eval'
>>>>>>>>>> method
>>>>>>>>>>>>>>>>> of new LookupFunction is great for this purpose. The same
>>>> is
>>>>>>>> for
>>>>>>>>>>>>>>>>> 'async' case.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 2) There might be other configs in future, such as
>>>>>>>>>> 'cacheMissingKey'
>>>>>>>>>>>>>>>>> in LookupFunctionProvider or 'rescanInterval' in
>>>>>>>>>> ScanRuntimeProvider.
>>>>>>>>>>>>>>>>> Maybe use Builder pattern in LookupFunctionProvider and
>>>>>>>>>>>>>>>>> RescanRuntimeProvider for more flexibility (use one 'build'
>>>>>>>>> method
>>>>>>>>>>>>>>>>> instead of many 'of' methods in future)?
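[Editor's note] A builder along the lines suggested above might look like this. This is a sketch: the option names `cacheMissingKey` and `maxRetryTimes` are taken from this thread, while the defaults and class shape are assumptions.

```java
// Hedged sketch of a builder-style LookupFunctionProvider as suggested
// above; field names, method names and defaults are assumptions, not the
// final Flink API.
public final class LookupFunctionProviderSketch {
    private final boolean cacheMissingKey;
    private final int maxRetryTimes;

    private LookupFunctionProviderSketch(Builder b) {
        this.cacheMissingKey = b.cacheMissingKey;
        this.maxRetryTimes = b.maxRetryTimes;
    }

    public boolean isCacheMissingKey() { return cacheMissingKey; }
    public int getMaxRetryTimes() { return maxRetryTimes; }

    public static Builder builder() { return new Builder(); }

    public static final class Builder {
        private boolean cacheMissingKey = true; // assumed default
        private int maxRetryTimes = 3;          // assumed default

        public Builder cacheMissingKey(boolean v) { this.cacheMissingKey = v; return this; }
        public Builder maxRetryTimes(int v) { this.maxRetryTimes = v; return this; }

        // New options can be added later without adding new 'of' overloads.
        public LookupFunctionProviderSketch build() {
            return new LookupFunctionProviderSketch(this);
        }
    }
}
```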
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 3) What are the plans for existing TableFunctionProvider
>>>> and
>>>>>>>>>>>>>>>>> AsyncTableFunctionProvider? I think they should be
>>>> deprecated.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 4) Am I right that the current design does not assume
>>>> usage of
>>>>>>>>>>>>>>>>> user-provided LookupCache in re-scanning? In this case, it
>>>> is
>>>>>>>> not
>>>>>>>>>> very
>>>>>>>>>>>>>>>>> clear why we need methods such as 'invalidate' or
>>>> 'putAll'
>>>>>>>> in
>>>>>>>>>>>>>>>>> LookupCache.
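[Editor's note] For reference, the interface under discussion could be shaped roughly like this. This is a hedged sketch based on the thread, including the questioned `invalidate`/`putAll` methods; signatures are assumptions, not the final FLIP API.

```java
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;

// Hedged sketch of the LookupCache interface being discussed; the
// invalidate/putAll methods are the ones questioned above.
interface LookupCacheSketch<K, V> {
    Collection<V> getIfPresent(K key);
    void put(K key, Collection<V> rows);
    void invalidate(K key);                     // per-entry eviction
    void putAll(Map<K, Collection<V>> entries); // bulk load, e.g. for ALL cache reloads
}

// Trivial in-memory implementation, for illustration only.
class MapLookupCache<K, V> implements LookupCacheSketch<K, V> {
    private final Map<K, Collection<V>> store = new HashMap<>();

    public Collection<V> getIfPresent(K key) { return store.get(key); }
    public void put(K key, Collection<V> rows) { store.put(key, rows); }
    public void invalidate(K key) { store.remove(key); }
    public void putAll(Map<K, Collection<V>> entries) { store.putAll(entries); }
}
```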
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 5) I don't see DDL options, that were in previous version
>>>> of
>>>>>>>>> FLIP.
>>>>>>>>>> Do
>>>>>>>>>>>>>>>>> you have any special plans for them?
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> If you don't mind, I would be glad to be able to make small
>>>>>>>>>>>>>>>>> adjustments to the FLIP document too. I think it's worth mentioning
>>>>>>>>>>>>>>>>> what optimizations exactly are planned for the future.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>>>>> Smirnov Alexander
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> пт, 13 мая 2022 г. в 20:27, Qingsheng Ren <
>>>> renqschn@gmail.com
>>>>>>>>> :
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Hi Alexander and devs,
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Thank you very much for the in-depth discussion! As Jark
>>>>>>>>>> mentioned we were inspired by Alexander's idea and made a
>>>> refactor on
>>>>>>>> our
>>>>>>>>>> design. FLIP-221 [1] has been updated to reflect our design now
>>>> and
>>>>>> we
>>>>>>>>> are
>>>>>>>>>> happy to hear more suggestions from you!
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Compared to the previous design:
>>>>>>>>>>>>>>>>>> 1. The lookup cache serves at table runtime level and is
>>>>>>>>>> integrated as a component of LookupJoinRunner as discussed
>>>>>> previously.
>>>>>>>>>>>>>>>>>> 2. Interfaces are renamed and re-designed to reflect the
>>>> new
>>>>>>>>>> design.
>>>>>>>>>>>>>>>>>> 3. We separate the all-caching case individually and
>>>>>>>> introduce a
>>>>>>>>>> new RescanRuntimeProvider to reuse the ability of scanning. We are
>>>>>>>>> planning
>>>>>>>>>> to support SourceFunction / InputFormat for now considering the
>>>>>>>>> complexity
>>>>>>>>>> of FLIP-27 Source API.
>>>>>>>>>>>>>>>>>> 4. A new interface LookupFunction is introduced to make
>>>> the
>>>>>>>>>> semantic of lookup more straightforward for developers.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> For replying to Alexander:
>>>>>>>>>>>>>>>>>>> However I'm a little confused whether InputFormat is
>>>>>>>> deprecated
>>>>>>>>>> or not. Am I right that it will be so in the future, but currently
>>>>>> it's
>>>>>>>>> not?
>>>>>>>>>>>>>>>>>> Yes you are right. InputFormat is not deprecated for now.
>>>> I
>>>>>>>>> think
>>>>>>>>>> it will be deprecated in the future but we don't have a clear plan
>>>>>> for
>>>>>>>>> that.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Thanks again for the discussion on this FLIP and looking
>>>>>>>> forward
>>>>>>>>>> to cooperating with you after we finalize the design and
>>>> interfaces!
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-221+Abstraction+for+lookup+source+cache+and+metric
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Qingsheng
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> On Fri, May 13, 2022 at 12:12 AM Александр Смирнов <
>>>>>>>>>> smiralexan@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Hi Jark, Qingsheng and Leonard!
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Glad to see that we came to a consensus on almost all
>>>>>> points!
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> However I'm a little confused whether InputFormat is
>>>>>>>> deprecated
>>>>>>>>>> or
>>>>>>>>>>>>>>>>>>> not. Am I right that it will be so in the future, but
>>>>>>>> currently
>>>>>>>>>> it's
>>>>>>>>>>>>>>>>>>> not? Actually I also think that for the first version
>>>> it's
>>>>>> OK
>>>>>>>>> to
>>>>>>>>>> use
>>>>>>>>>>>>>>>>>>> InputFormat in ALL cache realization, because supporting
>>>>>>>> rescan
>>>>>>>>>>>>>>>>>>> ability seems like a very distant prospect. But for this
>>>>>>>>>> decision we
>>>>>>>>>>>>>>>>>>> need a consensus among all discussion participants.
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> In general, I don't have something to argue with your
>>>>>>>>>> statements. All
>>>>>>>>>>>>>>>>>>> of them correspond my ideas. Looking ahead, it would be
>>>> nice
>>>>>>>> to
>>>>>>>>>> work
>>>>>>>>>>>>>>>>>>> on this FLIP cooperatively. I've already done a lot of
>>>> work
>>>>>>>> on
>>>>>>>>>> lookup
>>>>>>>>>>>>>>>>>>> join caching with realization very close to the one we
>>>> are
>>>>>>>>>> discussing,
>>>>>>>>>>>>>>>>>>> and want to share the results of this work. Anyway
>>>> looking
>>>>>>>>>> forward to
>>>>>>>>>>>>>>>>>>> the FLIP update!
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>>>>>>> Smirnov Alexander
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> чт, 12 мая 2022 г. в 17:38, Jark Wu <im...@gmail.com>:
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> Hi Alex,
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> Thanks for summarizing your points.
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> In the past week, Qingsheng, Leonard, and I have
>>>> discussed
>>>>>>>> it
>>>>>>>>>> several times
>>>>>>>>>>>>>>>>>>>> and we have totally refactored the design.
>>>>>>>>>>>>>>>>>>>> I'm glad to say we have reached a consensus on many of
>>>> your
>>>>>>>>>> points!
>>>>>>>>>>>>>>>>>>>> Qingsheng is still working on updating the design docs
>>>> and
>>>>>>>>>> maybe can be
>>>>>>>>>>>>>>>>>>>> available in the next few days.
>>>>>>>>>>>>>>>>>>>> I will share some conclusions from our discussions:
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> 1) we have refactored the design towards to "cache in
>>>>>>>>>> framework" way.
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> 2) a "LookupCache" interface for users to customize and
>>>> a
>>>>>>>>>> default
>>>>>>>>>>>>>>>>>>>> implementation with builder for users to easy-use.
>>>>>>>>>>>>>>>>>>>> This can both make it possible to both have flexibility
>>>> and
>>>>>>>>>> conciseness.
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> 3) Filter pushdown is important for ALL and LRU lookup
>>>>>>>> cache,
>>>>>>>>>> esp reducing
>>>>>>>>>>>>>>>>>>>> IO.
>>>>>>>>>>>>>>>>>>>> Filter pushdown should be the final state and the
>>>> unified
>>>>>>>> way
>>>>>>>>>> to both
>>>>>>>>>>>>>>>>>>>> support pruning ALL cache and LRU cache,
>>>>>>>>>>>>>>>>>>>> so I think we should make effort in this direction. If
>>>> we
>>>>>>>> need
>>>>>>>>>> to support
>>>>>>>>>>>>>>>>>>>> filter pushdown for ALL cache anyway, why not use
>>>>>>>>>>>>>>>>>>>> it for LRU cache as well? Either way, as we decide to
>>>>>>>>> implement
>>>>>>>>>> the cache
>>>>>>>>>>>>>>>>>>>> in the framework, we have the chance to support
>>>>>>>>>>>>>>>>>>>> filter on cache anytime. This is an optimization and it
>>>>>>>>> doesn't
>>>>>>>>>> affect the
>>>>>>>>>>>>>>>>>>>> public API. I think we can create a JIRA issue to
>>>>>>>>>>>>>>>>>>>> discuss it when the FLIP is accepted.
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> 4) The idea to support ALL cache is similar to your
>>>>>>>> proposal.
>>>>>>>>>>>>>>>>>>>> In the first version, we will only support InputFormat,
>>>>>>>>>> SourceFunction for
>>>>>>>>>>>>>>>>>>>> cache all (invoke InputFormat in join operator).
>>>>>>>>>>>>>>>>>>>> For FLIP-27 source, we need to join a true source
>>>> operator
>>>>>>>>>> instead of
>>>>>>>>>>>>>>>>>>>> calling it embedded in the join operator.
>>>>>>>>>>>>>>>>>>>> However, this needs another FLIP to support the re-scan
>>>>>>>>> ability
>>>>>>>>>> for FLIP-27
>>>>>>>>>>>>>>>>>>>> Source, and this can be a large work.
>>>>>>>>>>>>>>>>>>>> In order to not block this issue, we can put the effort
>>>> of
>>>>>>>>>> FLIP-27 source
>>>>>>>>>>>>>>>>>>>> integration into future work and integrate
>>>>>>>>>>>>>>>>>>>> InputFormat&SourceFunction for now.
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> I think it's fine to use InputFormat&SourceFunction, as
>>>>>> they
>>>>>>>>>> are not
>>>>>>>>>>>>>>>>>>>> deprecated, otherwise, we have to introduce another
>>>>>> function
>>>>>>>>>>>>>>>>>>>> similar to them which is meaningless. We need to plan
>>>>>>>> FLIP-27
>>>>>>>>>> source
>>>>>>>>>>>>>>>>>>>> integration ASAP before InputFormat & SourceFunction are
>>>>>>>>>> deprecated.
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>>>>>> Jark
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> On Thu, 12 May 2022 at 15:46, Александр Смирнов <
>>>>>>>>>> smiralexan@gmail.com>
>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> Hi Martijn!
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> Got it. Therefore, the realization with InputFormat is
>>>> not
>>>>>>>>>> considered.
>>>>>>>>>>>>>>>>>>>>> Thanks for clearing that up!
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>>>>>>>>> Smirnov Alexander
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> чт, 12 мая 2022 г. в 14:23, Martijn Visser <
>>>>>>>>>> martijn@ververica.com>:
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> With regards to:
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> But if there are plans to refactor all connectors to
>>>>>>>>> FLIP-27
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> Yes, FLIP-27 is the target for all connectors. The old
>>>>>>>>>> interfaces will be
>>>>>>>>>>>>>>>>>>>>>> deprecated and connectors will either be refactored to
>>>>>> use
>>>>>>>>>> the new ones
>>>>>>>>>>>>>>>>>>>>> or
>>>>>>>>>>>>>>>>>>>>>> dropped.
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> The caching should work for connectors that are using
>>>>>>>>> FLIP-27
>>>>>>>>>> interfaces,
>>>>>>>>>>>>>>>>>>>>>> we should not introduce new features for old
>>>> interfaces.
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> Martijn
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> On Thu, 12 May 2022 at 06:19, Александр Смирнов <
>>>>>>>>>> smiralexan@gmail.com>
>>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> Hi Jark!
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> Sorry for the late response. I would like to make
>>>> some
>>>>>>>>>> comments and
>>>>>>>>>>>>>>>>>>>>>>> clarify my points.
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> 1) I agree with your first statement. I think we can
>>>>>>>>> achieve
>>>>>>>>>> both
>>>>>>>>>>>>>>>>>>>>>>> advantages this way: put the Cache interface in
>>>>>>>>>> flink-table-common,
>>>>>>>>>>>>>>>>>>>>>>> but have implementations of it in
>>>> flink-table-runtime.
>>>>>>>>>> Therefore if a
>>>>>>>>>>>>>>>>>>>>>>> connector developer wants to use existing cache
>>>>>>>> strategies
>>>>>>>>>> and their
>>>>>>>>>>>>>>>>>>>>>>> implementations, he can just pass lookupConfig to the
>>>>>>>>>> planner, but if
>>>>>>>>>>>>>>>>>>>>>>> he wants to have its own cache implementation in his
>>>>>>>>>> TableFunction, it
>>>>>>>>>>>>>>>>>>>>>>> will be possible for him to use the existing
>>>> interface
>>>>>>>> for
>>>>>>>>>> this
>>>>>>>>>>>>>>>>>>>>>>> purpose (we can explicitly point this out in the
>>>>>>>>>> documentation). In
>>>>>>>>>>>>>>>>>>>>>>> this way all configs and metrics will be unified.
>>>> WDYT?
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> If a filter can prune 90% of data in the cache, we
>>>> will
>>>>>>>>>> have 90% of
>>>>>>>>>>>>>>>>>>>>>>> lookup requests that can never be cached
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> 2) Let me clarify the logic filters optimization in
>>>> case
>>>>>>>> of
>>>>>>>>>> LRU cache.
>>>>>>>>>>>>>>>>>>>>>>> It looks like Cache<RowData, Collection<RowData>>.
>>>> Here
>>>>>>>> we
>>>>>>>>>> always
>>>>>>>>>>>>>>>>>>>>>>> store the response of the dimension table in cache,
>>>> even
>>>>>>>>>> after
>>>>>>>>>>>>>>>>>>>>>>> applying calc function. I.e. if there are no rows
>>>> after
>>>>>>>>>> applying
>>>>>>>>>>>>>>>>>>>>>>> filters to the result of the 'eval' method of
>>>>>>>>> TableFunction,
>>>>>>>>>> we store
>>>>>>>>>>>>>>>>>>>>>>> the empty list by lookup keys. Therefore the cache
>>>> line
>>>>>>>>> will
>>>>>>>>>> be
>>>>>>>>>>>>>>>>>>>>>>> filled, but will require much less memory (in bytes).
>>>>>>>> I.e.
>>>>>>>>>> we don't
>>>>>>>>>>>>>>>>>>>>>>> completely filter keys, by which result was pruned,
>>>> but
>>>>>>>>>> significantly
>>>>>>>>>>>>>>>>>>>>>>> reduce required memory to store this result. If the
>>>> user
>>>>>>>>>> knows about
>>>>>>>>>>>>>>>>>>>>>>> this behavior, he can increase the 'max-rows' option
>>>>>>>> before
>>>>>>>>>> the start
>>>>>>>>>>>>>>>>>>>>>>> of the job. But actually I came up with the idea
>>>> that we
>>>>>>>>> can
>>>>>>>>>> do this
>>>>>>>>>>>>>>>>>>>>>>> automatically by using the 'maximumWeight' and
>>>> 'weigher'
>>>>>>>>>> methods of
>>>>>>>>>>>>>>>>>>>>>>> GuavaCache [1]. Weight can be the size of the
>>>> collection
>>>>>>>> of
>>>>>>>>>> rows
>>>>>>>>>>>>>>>>>>>>>>> (value of cache). Therefore cache can automatically
>>>> fit
>>>>>>>>> much
>>>>>>>>>> more
>>>>>>>>>>>>>>>>>>>>>>> records than before.
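[Editor's note] The `maximumWeight`/`weigher` idea can be illustrated without Guava by weighing each entry with the size of its row collection, so an empty result (all rows pruned by filters) costs almost nothing. This is a plain-Java sketch of the concept, not the Guava API; eviction order and the minimum weight are assumptions.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java illustration (NOT the Guava API) of weighing cache entries by
// the size of the cached row collection, as suggested above.
public final class WeightedLookupCache<K, V> {
    private final long maximumWeight;
    private long currentWeight = 0;
    // access-order map gives a simple LRU eviction order
    private final LinkedHashMap<K, List<V>> map = new LinkedHashMap<>(16, 0.75f, true);

    public WeightedLookupCache(long maximumWeight) {
        this.maximumWeight = maximumWeight;
    }

    public void put(K key, List<V> rows) {
        List<V> old = map.remove(key);
        if (old != null) {
            currentWeight -= weigh(old);
        }
        map.put(key, rows);
        currentWeight += weigh(rows);
        // evict least-recently-used entries until we are under the weight cap
        Iterator<Map.Entry<K, List<V>>> it = map.entrySet().iterator();
        while (currentWeight > maximumWeight && it.hasNext()) {
            Map.Entry<K, List<V>> eldest = it.next();
            currentWeight -= weigh(eldest.getValue());
            it.remove();
        }
    }

    public List<V> get(K key) {
        return map.get(key);
    }

    // the "weigher": an empty result (all rows filtered out) is nearly free
    private long weigh(List<V> rows) {
        return Math.max(1, rows.size());
    }
}
```

With Guava itself, the same effect would come from `CacheBuilder.maximumWeight(...)` plus a `Weigher` returning the row collection size, per the link in [1] below.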
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> Flink SQL has provided a standard way to do filters
>>>> and
>>>>>>>>>> projects
>>>>>>>>>>>>>>>>>>>>>>> pushdown, i.e., SupportsFilterPushDown and
>>>>>>>>>> SupportsProjectionPushDown.
>>>>>>>>>>>>>>>>>>>>>>>> Jdbc/hive/HBase haven't implemented the interfaces,
>>>>>>>> don't
>>>>>>>>>> mean it's
>>>>>>>>>>>>>>>>>>>>> hard
>>>>>>>>>>>>>>>>>>>>>>> to implement.
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> It's debatable how difficult it will be to implement
>>>>>>>> filter
>>>>>>>>>> pushdown.
>>>>>>>>>>>>>>>>>>>>>>> But I think the fact that currently there is no
>>>> database
>>>>>>>>>> connector
>>>>>>>>>>>>>>>>>>>>>>> with filter pushdown at least means that this feature
>>>>>>>> won't
>>>>>>>>>> be
>>>>>>>>>>>>>>>>>>>>>>> supported soon in connectors. Moreover, if we talk
>>>> about
>>>>>>>>>> other
>>>>>>>>>>>>>>>>>>>>>>> connectors (not in Flink repo), their databases might
>>>>>> not
>>>>>>>>>> support all
>>>>>>>>>>>>>>>>>>>>>>> Flink filters (or not support filters at all). I
>>>> think
>>>>>>>>> users
>>>>>>>>>> are
>>>>>>>>>>>>>>>>>>>>>>> interested in supporting cache filters optimization
>>>>>>>>>> independently of
>>>>>>>>>>>>>>>>>>>>>>> supporting other features and solving more complex
>>>>>>>> problems
>>>>>>>>>> (or
>>>>>>>>>>>>>>>>>>>>>>> unsolvable at all).
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> 3) I agree with your third statement. Actually in our
>>>>>>>>>> internal version
>>>>>>>>>>>>>>>>>>>>>>> I also tried to unify the logic of scanning and
>>>>>> reloading
>>>>>>>>>> data from
>>>>>>>>>>>>>>>>>>>>>>> connectors. But unfortunately, I didn't find a way to
>>>>>>>> unify
>>>>>>>>>> the logic
>>>>>>>>>>>>>>>>>>>>>>> of all ScanRuntimeProviders (InputFormat,
>>>>>> SourceFunction,
>>>>>>>>>> Source,...)
>>>>>>>>>>>>>>>>>>>>>>> and reuse it in reloading ALL cache. As a result I
>>>>>>>> settled
>>>>>>>>>> on using
>>>>>>>>>>>>>>>>>>>>>>> InputFormat, because it was used for scanning in all
>>>>>>>> lookup
>>>>>>>>>>>>>>>>>>>>>>> connectors. (I didn't know that there are plans to
>>>>>>>>> deprecate
>>>>>>>>>>>>>>>>>>>>>>> InputFormat in favor of FLIP-27 Source). IMO usage of
>>>>>>>>>> FLIP-27 source
>>>>>>>>>>>>>>>>>>>>>>> in ALL caching is not good idea, because this source
>>>> was
>>>>>>>>>> designed to
>>>>>>>>>>>>>>>>>>>>>>> work in distributed environment (SplitEnumerator on
>>>>>>>>>> JobManager and
>>>>>>>>>>>>>>>>>>>>>>> SourceReaders on TaskManagers), not in one operator
>>>>>>>> (lookup
>>>>>>>>>> join
>>>>>>>>>>>>>>>>>>>>>>> operator in our case). There is even no direct way to
>>>>>>>> pass
>>>>>>>>>> splits from
>>>>>>>>>>>>>>>>>>>>>>> SplitEnumerator to SourceReader (this logic works
>>>>>> through
>>>>>>>>>>>>>>>>>>>>>>> SplitEnumeratorContext, which requires
>>>>>>>>>>>>>>>>>>>>>>> OperatorCoordinator.SubtaskGateway to send
>>>>>>>> AddSplitEvents).
>>>>>>>>>> Usage of
>>>>>>>>>>>>>>>>>>>>>>> InputFormat for ALL cache seems much more clearer and
>>>>>>>>>> easier. But if
>>>>>>>>>>>>>>>>>>>>>>> there are plans to refactor all connectors to
>>>> FLIP-27, I
>>>>>>>>>> have the
>>>>>>>>>>>>>>>>>>>>>>> following ideas: maybe we can drop the lookup join ALL
>>>>>>>>>>>>>>>>>>>>>>> cache in favor of a simple join with multiple scans of the
>>>>>>>>>>>>>>>>>>>>>>> batch source? The point
>>>>>>>>>> The point
>>>>>>>>>>>>>>>>>>>>>>> is that the only difference between lookup join ALL
>>>>>> cache
>>>>>>>>>> and simple
>>>>>>>>>>>>>>>>>>>>>>> join with batch source is that in the first case
>>>>>> scanning
>>>>>>>>> is
>>>>>>>>>> performed
>>>>>>>>>>>>>>>>>>>>>>> multiple times, in between which state (cache) is
>>>>>> cleared
>>>>>>>>>> (correct me
>>>>>>>>>>>>>>>>>>>>>>> if I'm wrong). So what if we extend the
>>>> functionality of
>>>>>>>>>> simple join
>>>>>>>>>>>>>>>>>>>>>>> to support state reloading + extend the
>>>> functionality of
>>>>>>>>>> scanning
>>>>>>>>>>>>>>>>>>>>>>> batch source multiple times (this one should be easy
>>>>>> with
>>>>>>>>>> new FLIP-27
>>>>>>>>>>>>>>>>>>>>>>> source, that unifies streaming/batch reading - we
>>>> will
>>>>>>>> need
>>>>>>>>>> to change
>>>>>>>>>>>>>>>>>>>>>>> only SplitEnumerator, which will pass splits again
>>>> after
>>>>>>>>>> some TTL).
>>>>>>>>>>>>>>>>>>>>>>> WDYT? I must say that this looks like a long-term
>>>> goal
>>>>>>>> and
>>>>>>>>>> will make
>>>>>>>>>>>>>>>>>>>>>>> the scope of this FLIP even larger than you said.
>>>> Maybe
>>>>>>>> we
>>>>>>>>>> can limit
>>>>>>>>>>>>>>>>>>>>>>> ourselves to a simpler solution now (InputFormats).
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> So to sum up, my points is like this:
>>>>>>>>>>>>>>>>>>>>>>> 1) There is a way to make both concise and flexible
>>>>>>>>>> interfaces for
>>>>>>>>>>>>>>>>>>>>>>> caching in lookup join.
>>>>>>>>>>>>>>>>>>>>>>> 2) Cache filters optimization is important both in
>>>> LRU
>>>>>>>> and
>>>>>>>>>> ALL caches.
>>>>>>>>>>>>>>>>>>>>>>> 3) It is unclear when filter pushdown will be
>>>> supported
>>>>>>>> in
>>>>>>>>>> Flink
>>>>>>>>>>>>>>>>>>>>>>> connectors, some of the connectors might not have the
>>>>>>>>>> opportunity to
>>>>>>>>>>>>>>>>>>>>>>> support filter pushdown + as I know, currently filter
>>>>>>>>>> pushdown works
>>>>>>>>>>>>>>>>>>>>>>> only for scanning (not lookup). So cache filters +
>>>>>>>>>> projections
>>>>>>>>>>>>>>>>>>>>>>> optimization should be independent from other
>>>> features.
>>>>>>>>>>>>>>>>>>>>>>> 4) ALL cache realization is a complex topic that
>>>>>> involves
>>>>>>>>>> multiple
>>>>>>>>>>>>>>>>>>>>>>> aspects of how Flink is developing. Abandoning InputFormat in
>>>>>>>>>>>>>>>>>>>>>>> favor of the FLIP-27 Source will make the ALL cache
>>>>>>>>>>>>>>>>>>>>>>> implementation really complex and unclear, so maybe instead we
>>>>>>>>>>>>>>>>>>>>>>> can extend the functionality of the simple join, or keep
>>>>>>>>>>>>>>>>>>>>>>> InputFormat for the lookup join ALL cache?
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>>>>>>>>>>> Smirnov Alexander
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> [1] https://guava.dev/releases/18.0/api/docs/com/google/common/cache/CacheBuilder.html#weigher(com.google.common.cache.Weigher)
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> чт, 5 мая 2022 г. в 20:34, Jark Wu <imjark@gmail.com
>>>>> :
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> It's great to see the active discussion! I want to
>>>>>> share
>>>>>>>>> my
>>>>>>>>>> ideas:
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> 1) implement the cache in framework vs. connectors
>>>> base
>>>>>>>>>>>>>>>>>>>>>>>> I don't have a strong opinion on this. Both ways
>>>> should
>>>>>>>>>> work (e.g.,
>>>>>>>>>>>>>>>>>>>>> cache
>>>>>>>>>>>>>>>>>>>>>>>> pruning, compatibility).
>>>>>>>>>>>>>>>>>>>>>>>> The framework way can provide more concise
>>>> interfaces.
>>>>>>>>>>>>>>>>>>>>>>>> The connector base way can define more flexible
>>>> cache
>>>>>>>>>>>>>>>>>>>>>>>> strategies/implementations.
>>>>>>>>>>>>>>>>>>>>>>>> We are still investigating a way to see if we can
>>>> have
>>>>>>>>> both
>>>>>>>>>>>>>>>>>>>>> advantages.
>>>>>>>>>>>>>>>>>>>>>>>> We should reach a consensus that the way should be a
>>>>>>>> final
>>>>>>>>>> state,
>>>>>>>>>>>>>>>>>>>>> and we
>>>>>>>>>>>>>>>>>>>>>>>> are on the path to it.
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> 2) filters and projections pushdown:
>>>>>>>>>>>>>>>>>>>>>>>> I agree with Alex that the filter pushdown into
>>>> cache
>>>>>>>> can
>>>>>>>>>> benefit a
>>>>>>>>>>>>>>>>>>>>> lot
>>>>>>>>>>>>>>>>>>>>>>> for
>>>>>>>>>>>>>>>>>>>>>>>> ALL cache.
>>>>>>>>>>>>>>>>>>>>>>>> However, this is not true for LRU cache. Connectors
>>>> use
>>>>>>>>>> cache to
>>>>>>>>>>>>>>>>>>>>> reduce
>>>>>>>>>>>>>>>>>>>>>>> IO
>>>>>>>>>>>>>>>>>>>>>>>> requests to databases for better throughput.
>>>>>>>>>>>>>>>>>>>>>>>> If a filter can prune 90% of data in the cache, we
>>>> will
>>>>>>>>>> have 90% of
>>>>>>>>>>>>>>>>>>>>>>> lookup
>>>>>>>>>>>>>>>>>>>>>>>> requests that can never be cached
>>>>>>>>>>>>>>>>>>>>>>>> and hit directly to the databases. That means the
>>>> cache
>>>>>>>> is
>>>>>>>>>>>>>>>>>>>>> meaningless in
>>>>>>>>>>>>>>>>>>>>>>>> this case.
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> IMO, Flink SQL has provided a standard way to do
>>>>>> filters
>>>>>>>>>> and projects
>>>>>>>>>>>>>>>>>>>>>>>> pushdown, i.e., SupportsFilterPushDown and
>>>>>>>>>>>>>>>>>>>>> SupportsProjectionPushDown.
>>>>>>>>>>>>>>>>>>>>>>>> Jdbc/hive/HBase haven't implemented the interfaces,
>>>>>>>> don't
>>>>>>>>>> mean it's
>>>>>>>>>>>>>>>>>>>>> hard
>>>>>>>>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>>>>>>> implement.
>>>>>>>>>>>>>>>>>>>>>>>> They should implement the pushdown interfaces to
>>>> reduce
>>>>>>>> IO
>>>>>>>>>> and the
>>>>>>>>>>>>>>>>>>>>> cache
>>>>>>>>>>>>>>>>>>>>>>>> size.
>>>>>>>>>>>>>>>>>>>>>>>> That should be a final state that the scan source
>>>> and
>>>>>>>>>> lookup source
>>>>>>>>>>>>>>>>>>>>> share
>>>>>>>>>>>>>>>>>>>>>>>> the exact pushdown implementation.
>>>>>>>>>>>>>>>>>>>>>>>> I don't see why we need to duplicate the pushdown
>>>> logic
>>>>>>>> in
>>>>>>>>>> caches,
>>>>>>>>>>>>>>>>>>>>> which
>>>>>>>>>>>>>>>>>>>>>>>> will complex the lookup join design.
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> 3) ALL cache abstraction
>>>>>>>>>>>>>>>>>>>>>>>> All cache might be the most challenging part of this
>>>>>>>> FLIP.
>>>>>>>>>> We have
>>>>>>>>>>>>>>>>>>>>> never
>>>>>>>>>>>>>>>>>>>>>>>> provided a reload-lookup public interface.
>>>>>>>>>>>>>>>>>>>>>>>> Currently, we put the reload logic in the "eval"
>>>> method
>>>>>>>> of
>>>>>>>>>>>>>>>>>>>>> TableFunction.
>>>>>>>>>>>>>>>>>>>>>>>> That's hard for some sources (e.g., Hive).
>>>>>>>>>>>>>>>>>>>>>>>> Ideally, connector implementation should share the
>>>>>> logic
>>>>>>>>> of
>>>>>>>>>> reload
>>>>>>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>>>>>>>> scan, i.e. ScanTableSource with
>>>>>>>>>> InputFormat/SourceFunction/FLIP-27
>>>>>>>>>>>>>>>>>>>>>>> Source.
>>>>>>>>>>>>>>>>>>>>>>>> However, InputFormat/SourceFunction are deprecated,
>>>> and
>>>>>>>>> the
>>>>>>>>>> FLIP-27
>>>>>>>>>>>>>>>>>>>>>>> source
>>>>>>>>>>>>>>>>>>>>>>>> is deeply coupled with SourceOperator.
>>>>>>>>>>>>>>>>>>>>>>>> If we want to invoke the FLIP-27 source in
>>>> LookupJoin,
>>>>>>>>> this
>>>>>>>>>> may make
>>>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>>> scope of this FLIP much larger.
>>>>>>>>>>>>>>>>>>>>>>>> We are still investigating how to abstract the ALL
>>>>>> cache
>>>>>>>>>> logic and
>>>>>>>>>>>>>>>>>>>>> reuse
>>>>>>>>>>>>>>>>>>>>>>>> the existing source interfaces.
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>>>>>>>>>> Jark
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> On Thu, 5 May 2022 at 20:22, Roman Boyko <
>>>>>>>>>> ro.v.boyko@gmail.com>
>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> It's a much more complicated activity and lies out
>>>> of
>>>>>>>> the
>>>>>>>>>> scope of
>>>>>>>>>>>>>>>>>>>>> this
>>>>>>>>>>>>>>>>>>>>>>>>> improvement. Because such pushdowns should be done
>>>> for
>>>>>>>>> all
>>>>>>>>>>>>>>>>>>>>>>> ScanTableSource
>>>>>>>>>>>>>>>>>>>>>>>>> implementations (not only for Lookup ones).
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> On Thu, 5 May 2022 at 19:02, Martijn Visser <
>>>>>>>>>>>>>>>>>>>>> martijnvisser@apache.org>
>>>>>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> Hi everyone,
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> One question regarding "And Alexander correctly
>>>>>>>>> mentioned
>>>>>>>>>> that
>>>>>>>>>>>>>>>>>>>>> filter
>>>>>>>>>>>>>>>>>>>>>>>>>> pushdown still is not implemented for
>>>>>>>> jdbc/hive/hbase."
>>>>>>>>>> -> Would
>>>>>>>>>>>>>>>>>>>>> an
>>>>>>>>>>>>>>>>>>>>>>>>>> alternative solution be to actually implement
>>>> these
>>>>>>>>> filter
>>>>>>>>>>>>>>>>>>>>> pushdowns?
>>>>>>>>>>>>>>>>>>>>>>> I
>>>>>>>>>>>>>>>>>>>>>>>>>> can
>>>>>>>>>>>>>>>>>>>>>>>>>> imagine that there are many more benefits to doing
>>>>>>>> that,
>>>>>>>>>> outside
>>>>>>>>>>>>>>>>>>>>> of
>>>>>>>>>>>>>>>>>>>>>>> lookup
>>>>>>>>>>>>>>>>>>>>>>>>>> caching and metrics.
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> Martijn Visser
>>>>>>>>>>>>>>>>>>>>>>>>>> https://twitter.com/MartijnVisser82
>>>>>>>>>>>>>>>>>>>>>>>>>> https://github.com/MartijnVisser
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> On Thu, 5 May 2022 at 13:58, Roman Boyko <
>>>>>>>>>> ro.v.boyko@gmail.com>
>>>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>> Hi everyone!
>>> 
>>> Thanks for driving such a valuable improvement!
>>> 
>>> I do think that a single cache implementation would be a nice opportunity
>>> for users. And it will break the "FOR SYSTEM_TIME AS OF proc_time"
>>> semantics anyway - it doesn't matter how it is implemented.
>>> 
>>> Putting myself in the user's shoes, I can say that:
>>> 1) I would prefer to have the opportunity to cut off the cache size by
>>> simply filtering out unnecessary data. And the most handy way to do that
>>> is to apply it inside the LookupRunners. It would be a bit harder to pass
>>> it through the LookupJoin node to the TableFunction. And Alexander
>>> correctly mentioned that filter pushdown still is not implemented for
>>> jdbc/hive/hbase.
>>> 2) The ability to set different caching parameters for different tables
>>> is quite important. So I would prefer to set them through DDL rather than
>>> have the same TTL, strategy and other options for all lookup tables.
>>> 3) Providing the cache in the framework really deprives us of
>>> extensibility (users won't be able to implement their own cache). But
>>> most probably it might be solved by creating more cache strategies and a
>>> wider set of configurations.
>>> 
>>> All these points are much closer to the schema proposed by Alexander.
>>> Qingsheng Ren, please correct me if I'm not right and all these
>>> facilities might be simply implemented in your architecture?
>>> 
>>> Best regards,
>>> Roman Boyko
>>> e.: ro.v.boyko@gmail.com
>>> 
>>> On Wed, 4 May 2022 at 21:01, Martijn Visser <martijnvisser@apache.org>
>>> wrote:
>>> 
>>>> Hi everyone,
>>>> 
>>>> I don't have much to chip in, but just wanted to express that I really
>>>> appreciate the in-depth discussion on this topic and I hope that others
>>>> will join the conversation.
>>>> 
>>>> Best regards,
>>>> 
>>>> Martijn
>>>> 
>>>> On Tue, 3 May 2022 at 10:15, Александр Смирнов <smiralexan@gmail.com>
>>>> wrote:
>>>> 
>>>>> Hi Qingsheng, Leonard and Jark,
>>>>> 
>>>>> Thanks for your detailed feedback! However, I have questions about some
>>>>> of your statements (maybe I didn't get something?).
>>>>> 
>>>>>> Caching actually breaks the semantic of "FOR SYSTEM_TIME AS OF
>>>>>> proc_time"
>>>>> 
>>>>> I agree that the semantics of "FOR SYSTEM_TIME AS OF proc_time" is not
>>>>> fully implemented with caching, but as you said, users accept that
>>>>> consciously to achieve better performance (no one proposed to enable
>>>>> caching by default, etc.). Or by users do you mean other developers of
>>>>> connectors? In that case developers explicitly specify whether their
>>>>> connector supports caching or not (in the list of supported options);
>>>>> no one makes them do that if they don't want to. So what exactly is the
>>>>> difference between implementing caching in flink-table-runtime and in
>>>>> flink-table-common from this point of view? How does it affect whether
>>>>> the semantics of "FOR SYSTEM_TIME AS OF proc_time" are broken or not?
>>>>> 
>>>>>> confront a situation that allows table options in DDL to control the
>>>>>> behavior of the framework, which has never happened previously and
>>>>>> should be cautious
>>>>> 
>>>>> If we talk about the main semantic difference between DDL options and
>>>>> config options ("table.exec.xxx"), isn't it about limiting the scope of
>>>>> the options and their importance for the user's business logic, rather
>>>>> than the specific location of the corresponding logic in the framework?
>>>>> I mean that in my design, for example, putting an option with the
>>>>> lookup cache strategy in configurations would be the wrong decision,
>>>>> because it directly affects the user's business logic (not just
>>>>> performance optimization) and touches just several functions of ONE
>>>>> table (there can be multiple tables with different caches). Does it
>>>>> really matter for the user (or someone else) where the logic affected
>>>>> by the applied option is located?
>>>>> Also I can point to the DDL option 'sink.parallelism', which in some
>>>>> way "controls the behavior of the framework", and I don't see any
>>>>> problem there.
>>>>> 
>>>>>> introduce a new interface for this all-caching scenario and the design
>>>>>> would become more complex
>>>>> 
>>>>> This is a subject for a separate discussion, but actually in our
>>>>> internal version we solved this problem quite easily - we reused the
>>>>> InputFormat class (so there is no need for a new API). The point is
>>>>> that currently all lookup connectors use InputFormat for scanning the
>>>>> data in batch mode: HBase, JDBC and even Hive - it uses the class
>>>>> PartitionReader, which is actually just a wrapper around InputFormat.
>>>>> The advantage of this solution is the ability to reload cache data in
>>>>> parallel (the number of threads depends on the number of InputSplits,
>>>>> but has an upper limit). As a result the cache reload time is reduced
>>>>> significantly (as well as the time the input stream is blocked). I know
>>>>> that we usually try to avoid concurrency in Flink code, but maybe this
>>>>> one can be an exception. BTW I don't say that it's an ideal solution;
>>>>> maybe there are better ones.
>>>>> 
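The parallel reload described above can be sketched in plain Java. The `Split` interface below is a hypothetical stand-in for Flink's InputSplit/InputFormat pair, not the actual API; the point is only that independent splits of the dimension table can be read concurrently into one cache:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical stand-in for one split of the dimension table (e.g. a key range).
interface Split {
    List<String[]> readRows(); // each row: {key, value}
}

/** Rebuilds the whole lookup cache by reading all splits in parallel. */
class ParallelCacheReloader {
    static Map<String, String> reload(List<Split> splits, int maxThreads) {
        // Thread count follows the number of splits but has an upper limit.
        ExecutorService pool =
                Executors.newFixedThreadPool(Math.max(1, Math.min(maxThreads, splits.size())));
        try {
            ConcurrentHashMap<String, String> cache = new ConcurrentHashMap<>();
            List<Future<?>> tasks = new ArrayList<>();
            for (Split split : splits) {
                tasks.add(pool.submit(() -> {
                    for (String[] row : split.readRows()) {
                        cache.put(row[0], row[1]);
                    }
                }));
            }
            for (Future<?> task : tasks) {
                try {
                    task.get(); // block until the split is fully read
                } catch (InterruptedException | ExecutionException e) {
                    throw new RuntimeException("cache reload failed", e);
                }
            }
            return cache; // the old cache can now be swapped out atomically
        } finally {
            pool.shutdown();
        }
    }
}
```

Because every split is independent, the input-stream blocking window shrinks roughly with the number of splits, which is the effect described in the mail.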
>>>>>> Providing the cache in the framework might introduce compatibility
>>>>>> issues
>>>>> 
>>>>> It's possible only in cases where the developer of the connector
>>>>> doesn't properly refactor his code and uses the new cache options
>>>>> incorrectly (i.e. explicitly provides the same options in 2 different
>>>>> code places). For correct behavior all he needs to do is redirect the
>>>>> existing options to the framework's LookupConfig (+ maybe add an alias
>>>>> for options, if there was different naming); everything will be
>>>>> transparent for users. If the developer doesn't do the refactoring at
>>>>> all, nothing will change for the connector because of backward
>>>>> compatibility. Also if a developer wants to use his own cache logic, he
>>>>> can just refuse to pass some of the configs into the framework, and
>>>>> instead make his own implementation with the already existing configs
>>>>> and metrics (but actually I think that it's a rare case).
>>>>> 
>>>>>> filters and projections should be pushed all the way down to the
>>>>>> table function, like what we do in the scan source
>>>>> 
>>>>> It's a great goal. But the truth is that the ONLY connector that
>>>>> supports filter pushdown is FileSystemTableSource (no database
>>>>> connector supports it currently). Also for some databases it's simply
>>>>> impossible to push down such complex filters as we have in Flink.
>>>>> 
>>>>>> only applying these optimizations to the cache seems not quite useful
>>>>> 
>>>>> Filters can cut off an arbitrarily large amount of data from the
>>>>> dimension table. For a simple example, suppose in the dimension table
>>>>> 'users' we have a column 'age' with values from 20 to 40, and an input
>>>>> stream 'clicks' that is ~uniformly distributed by age of users. If we
>>>>> have the filter 'age > 30', there will be half as much data in the
>>>>> cache. This means the user can increase 'lookup.cache.max-rows' by
>>>>> almost 2 times, which will give a huge performance boost. Moreover,
>>>>> this optimization starts to really shine with the 'ALL' cache, where
>>>>> tables without filters and projections can't fit in memory, but with
>>>>> them - can. This opens up additional possibilities for users. And that
>>>>> doesn't sound like 'not quite useful'.
>>>>> 
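The sizing argument above is easy to see in a toy sketch: if the lookup runner applies the join's filter before admitting rows, only matching rows occupy cache slots. The class and names below are purely illustrative, not FLIP API:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Predicate;

/** Toy lookup cache that applies the join's filter before admitting rows. */
class FilteringLookupCache<K, V> {
    private final Map<K, V> cache = new HashMap<>();
    private final int maxRows;
    private final Predicate<V> filter;

    FilteringLookupCache(int maxRows, Predicate<V> filter) {
        this.maxRows = maxRows;
        this.filter = filter;
    }

    /** Caches a looked-up row only if it passes the filter and there is room. */
    void put(K key, V row) {
        if (filter.test(row) && cache.size() < maxRows) {
            cache.put(key, row);
        }
    }

    V get(K key) {
        return cache.get(key);
    }

    int size() {
        return cache.size();
    }
}
```

With the 'age > 30' filter from the example, roughly half of the looked-up 'users' rows are rejected before they consume a slot, so the same memory budget effectively doubles the hit capacity for the rows that actually matter.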
>>>>> It would be great to hear other voices regarding this topic! Because
>>>>> we have quite a lot of controversial points, and I think with the help
>>>>> of others it will be easier for us to come to a consensus.
>>>>> 
>>>>> Best regards,
>>>>> Smirnov Alexander
>>>>> 
>>>>> On Fri, 29 Apr 2022 at 22:33, Qingsheng Ren <renqschn@gmail.com> wrote:
>>>>> 
>>>>>> Hi Alexander and Arvid,
>>>>>> 
>>>>>> Thanks for the discussion and sorry for my late response! We had an
>>>>>> internal discussion together with Jark and Leonard and I'd like to
>>>>>> summarize our ideas. Instead of implementing the cache logic in the
>>>>>> table runtime layer or wrapping around the user-provided table
>>>>>> function, we prefer to introduce some new APIs extending
>>>>>> TableFunction, with these concerns:
>>>>>> 
>>>>>> 1. Caching actually breaks the semantic of "FOR SYSTEM_TIME AS OF
>>>>>> proc_time", because it couldn't truly reflect the content of the
>>>>>> lookup table at the moment of querying. If users choose to enable
>>>>>> caching on the lookup table, they implicitly indicate that this
>>>>>> breakage is acceptable in exchange for the performance. So we prefer
>>>>>> not to provide caching on the table runtime level.
>>>>>> 
>>>>>> 2. If we put the cache implementation in the framework (whether in a
>>>>>> runner or a wrapper around TableFunction), we have to confront a
>>>>>> situation that allows table options in DDL to control the behavior of
>>>>>> the framework, which has never happened previously and should be
>>>>>> treated cautiously. Under the current design the behavior of the
>>>>>> framework should only be specified by configurations
>>>>>> ("table.exec.xxx"), and it's hard to apply these general configs to a
>>>>>> specific table.
>>>>>> 
>>>>>> 3. We have use cases where the lookup source loads and refreshes all
>>>>>> records periodically into memory to achieve high lookup performance
>>>>>> (like the Hive connector in the community, and also widely used by
>>>>>> our internal connectors). Wrapping the cache around the user's
>>>>>> TableFunction works fine for LRU caches, but I think we would have to
>>>>>> introduce a new interface for this all-caching scenario and the
>>>>>> design would become more complex.
>>>>>> 
>>>>>> 4. Providing the cache in the framework might introduce compatibility
>>>>>> issues to existing lookup sources: there might exist two caches with
>>>>>> totally different strategies if the user incorrectly configures the
>>>>>> table (one in the framework and another implemented by the lookup
>>>>>> source).
>>>>>> 
>>>>>> As for the optimization mentioned by Alexander, I think filters and
>>>>>> projections should be pushed all the way down to the table function,
>>>>>> like what we do in the scan source, instead of to the runner with the
>>>>>> cache. The goal of using a cache is to reduce the network I/O and the
>>>>>> pressure on the external system, and only applying these
>>>>>> optimizations to the cache seems not quite useful.
>>>>>> 
>>>>>> I made some updates to the FLIP[1] to reflect our ideas. We prefer to
>>>>>> keep the cache implementation as a part of TableFunction, and we
>>>>>> could provide some helper classes (CachingTableFunction,
>>>>>> AllCachingTableFunction, CachingAsyncTableFunction) to developers and
>>>>>> regulate the metrics of the cache. Also, I made a POC[2] for your
>>>>>> reference.
>>>>>> 
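For reference, the shape such a helper class could take. This is purely a sketch of the extend-a-base-class design the mail describes (the base owns the cache and the metrics, the connector implements only the external lookup); all names and signatures are hypothetical, not the actual POC code:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/**
 * Sketch of a caching base class: the base owns an LRU cache plus hit/miss
 * counters, and asks the subclass only for the actual external lookup.
 */
abstract class SketchCachingTableFunction<K, R> {
    private final Map<K, List<R>> cache;
    private long hits;
    private long misses;

    protected SketchCachingTableFunction(int maxRows) {
        // accessOrder=true + removeEldestEntry gives a simple LRU policy
        this.cache = new LinkedHashMap<K, List<R>>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, List<R>> eldest) {
                return size() > maxRows;
            }
        };
    }

    /** Connector authors implement only the lookup against the external system. */
    protected abstract List<R> lookupExternal(K key);

    /** Entry point: serve from the cache, fall back to the connector on a miss. */
    public final List<R> eval(K key) {
        List<R> rows = cache.get(key);
        if (rows != null) {
            hits++;
            return rows;
        }
        misses++;
        rows = lookupExternal(key);
        cache.put(key, rows);
        return rows;
    }

    /** Uniform cache metrics the base class can report for every connector. */
    public final long hitCount() { return hits; }
    public final long missCount() { return misses; }
}
```

Because `eval` is final, every connector extending the base gets identical caching behavior and identical metrics, which is the "regulate metrics of the cache" point above.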
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> [1]
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>> 
>>>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-221+Abstraction+for+lookup+source+cache+and+metric
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> [2]
>>>>>>>>> https://github.com/PatrickRen/flink/tree/FLIP-221
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Qingsheng
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Tue, Apr 26, 2022 at 4:45 PM Александр
>>>> Смирнов
>>>>>>>> <
>>>>>>>>>>>>>>>>>>>>>>>>>>>> smiralexan@gmail.com>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>> Thanks for the response, Arvid!
>>>>>>> 
>>>>>>> I have a few comments on your message.
>>>>>>> 
>>>>>>>> but could also live with an easier solution as the first step:
>>>>>>> 
>>>>>>> I think that these 2 ways (the one originally proposed by Qingsheng
>>>>>>> and mine) are mutually exclusive, because conceptually they follow
>>>>>>> the same goal, but the implementation details are different. If we
>>>>>>> go one way, moving to the other way in the future will mean deleting
>>>>>>> existing code and once again changing the API for connectors. So I
>>>>>>> think we should reach a consensus with the community about that and
>>>>>>> then work together on this FLIP, i.e. divide the work into tasks for
>>>>>>> different parts of the FLIP (for example, LRU cache unification /
>>>>>>> introducing the proposed set of metrics / further work…). WDYT,
>>>>>>> Qingsheng?
>>>>>>> 
>>>>>>>> as the source will only receive the requests after filter
>>>>>>> 
>>>>>>> Actually, if filters are applied to fields of the lookup table, we
>>>>>>> first must do the requests, and only after that can we filter the
>>>>>>> responses, because lookup connectors don't have filter pushdown. So
>>>>>>> if filtering is done before caching, there will be far fewer rows in
>>>>>>> the cache.
>>>>>>> 
>>>>>>>> @Alexander unfortunately, your architecture is not shared. I don't
>>>>>>>> know the solution to share images to be honest.
>>>>>>> 
>>>>>>> Sorry for that, I'm a bit new to such kinds of conversations :)
>>>>>>> I have no write access to the confluence, so I made a Jira issue,
>>>>>>> where I described the proposed changes in more detail -
>>>>>>> https://issues.apache.org/jira/browse/FLINK-27411.
>>>>>>> 
>>>>>>> I'll be happy to get more feedback!
>>>>>>> 
>>>>>>> Best,
>>>>>>> Smirnov Alexander
>>>>>>> 
>>>>>>> On Mon, 25 Apr 2022 at 19:49, Arvid Heise <arvid@apache.org> wrote:
>>>>>>> 
>>>>>>>> Hi Qingsheng,
>>>>>>>> 
>>>>>>>> Thanks for driving this; the inconsistency was not satisfying for
>>>>>>>> me.
>>>>>>>> 
>>>>>>>> I second Alexander's idea, but could also live with an easier
>>>>>>>> solution as the first step: instead of making caching an
>>>>>>>> implementation detail of TableFunction X, rather devise a caching
>>>>>>>> layer around X. So the proposal would be a CachingTableFunction
>>>>>>>> that delegates to X in case of misses and otherwise manages the
>>>>>>>> cache. Lifting it into the operator model as proposed would be even
>>>>>>>> better but is probably unnecessary in the first step for a lookup
>>>>>>>> source (as the source will only receive the requests after the
>>>>>>>> filter; applying the projection may be more interesting to save
>>>>>>>> memory).
>>>>>>>> 
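The wrapper described above can be sketched as a cache that sits in front of an arbitrary lookup function "X" and delegates only on a miss or an expired entry. The interface below is a hypothetical minimal stand-in, not Flink's TableFunction; the TTL option name is illustrative:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/**
 * Caching layer around an arbitrary lookup function "X": fresh entries are
 * served from the cache, everything else is delegated to X.
 */
class CachingLookupWrapper<K, R> {
    private static final class Entry<R> {
        final List<R> rows;
        final long expireAtMillis;

        Entry(List<R> rows, long expireAtMillis) {
            this.rows = rows;
            this.expireAtMillis = expireAtMillis;
        }
    }

    private final Function<K, List<R>> delegate; // the wrapped lookup "X"
    private final long ttlMillis;                // e.g. from 'lookup.cache.ttl'
    private final Map<K, Entry<R>> cache = new HashMap<>();

    CachingLookupWrapper(Function<K, List<R>> delegate, long ttlMillis) {
        this.delegate = delegate;
        this.ttlMillis = ttlMillis;
    }

    List<R> lookup(K key) {
        long now = System.currentTimeMillis();
        Entry<R> entry = cache.get(key);
        if (entry == null || entry.expireAtMillis <= now) {
            // miss or expired: delegate to the wrapped lookup function
            entry = new Entry<>(delegate.apply(key), now + ttlMillis);
            cache.put(key, entry);
        }
        return entry.rows;
    }
}
```

Since the wrapper only needs a `Function`, it composes with any existing lookup implementation without that implementation knowing about the cache, which is what keeps the FLIP changes limited to options.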
>>>>>>>> Another advantage is that all the changes of this FLIP would be
>>>>>>>> limited to options; no need for new public interfaces. Everything
>>>>>>>> else remains an implementation detail of the Table runtime. That
>>>>>>>> means we can easily incorporate the optimization potential that
>>>>>>>> Alexander pointed out later.
>>>>>>>> 
>>>>>>>> @Alexander unfortunately, your architecture is not shared. I don't
>>>>>>>> know the solution to share images to be honest.
>>>>>>>> 
>>>>>>>> On Fri, Apr 22, 2022 at 5:04 PM Александр Смирнов
>>>>>>>> <smiralexan@gmail.com> wrote:
>>>>>>>> 
>>>>>>>>> Hi Qingsheng! My name is Alexander, I'm not a committer yet, but
>>>>>>>>> I'd really like to become one. And this FLIP really interested me.
>>>>>>>>> Actually I have worked on a similar feature in my company's Flink
>>>>>>>>> fork, and we would like to share our thoughts on this and make the
>>>>>>>>> code open source.
>>>>>>>>> 
>>>>>>>>> I think there is a better alternative than introducing an abstract
>>>>>>>>> class for TableFunction (CachingTableFunction). As you know,
>>>>>>>>> TableFunction exists in the flink-table-common module, which
>>>>>>>>> provides only an API for working with tables – it's very
>>>>>>>>> convenient for importing in connectors. In turn,
>>>>>>>>> CachingTableFunction contains logic for runtime execution, so this
>>>>>>>>> class and everything connected with it should be located in
>>>>>>>>> another module, probably in flink-table-runtime. But this would
>>>>>>>>> require connectors to depend on another module, which contains a
>>>>>>>>> lot of runtime logic, and that doesn't sound good.
>>>>>>>>> 
>>>>>>>>> I suggest adding a new method 'getLookupConfig' to
>>>>>>>>> LookupTableSource or LookupRuntimeProvider to allow connectors to
>>>>>>>>> only pass configurations to the planner, so that they won't depend
>>>>>>>>> on the runtime implementation. Based on these configs the planner
>>>>>>>>> will construct a lookup join operator with the corresponding
>>>>>>>>> runtime logic (ProcessFunctions in the module
>>>>>>>>> flink-table-runtime). Architecture looks like in the pinned
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> image (LookupConfig class there is actually
>>>>>>>> yours
>>>>>>>>>>>>>>>>>>>>>>>>>> CacheConfig).
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Classes in flink-table-planner, that will
>>>> be
>>>>>>>>>>>>>>>>>>>>> responsible
>>>>>>>>>>>>>>>>>>>>>>> for
>>>>>>>>>>>>>>>>>>>>>>>>>>> this
>>>>>>>>>>>>>>>>>>>>>>>>>>>> –
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> CommonPhysicalLookupJoin and his
>>>> inheritors.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Current classes for lookup join in
>>>>>>>>>>>>>>>>>>>>> flink-table-runtime
>>>>>>>>>>>>>>>>>>>>>>> -
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> LookupJoinRunner, AsyncLookupJoinRunner,
>>>>>>>>>>>>>>>>>>>>>>>>>>> LookupJoinRunnerWithCalc,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> AsyncLookupJoinRunnerWithCalc.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I suggest adding classes
>>>>>>>> LookupJoinCachingRunner,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> LookupJoinCachingRunnerWithCalc, etc.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> And here comes another more powerful
>>>> advantage
>>>>>>>> of
>>>>>>>>>>>>>>>>>>>>> such a
>>>>>>>>>>>>>>>>>>>>>>>>>>> solution.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> If
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> we have caching logic on a lower level, we
>>>> can
>>>>>>>>>> apply
>>>>>>>>>>>>>>>>>>>>> some
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> optimizations to it.
>>>> LookupJoinRunnerWithCalc
>>>>>>>> was
>>>>>>>>>>>>>>>>>>>>> named
>>>>>>>>>>>>>>>>>>>>>>> like
>>>>>>>>>>>>>>>>>>>>>>>>>>> this
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> because it uses the ‘calc’ function, which
>>>>>>>>> actually
>>>>>>>>>>>>>>>>>>>>>>> mostly
>>>>>>>>>>>>>>>>>>>>>>>>>>>> consists
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> of
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> filters and projections.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> For example, in join table A with lookup
>>>> table
>>>>>>>> B
>>>>>>>>>>>>>>>>>>>>>>> condition
>>>>>>>>>>>>>>>>>>>>>>>>>>> ‘JOIN …
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> ON
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> A.id = B.id AND A.age = B.age + 10 WHERE
>>>>>>>>> B.salary >
>>>>>>>>>>>>>>>>>>>>> 1000’
>>>>>>>>>>>>>>>>>>>>>>>>>>> ‘calc’
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> function will contain filters A.age =
>>>> B.age +
>>>>>>>> 10
>>>>>>>>>> and
>>>>>>>>>>>>>>>>>>>>>>>>>> B.salary >
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 1000.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> If we apply this function before storing
>>>>>>>> records
>>>>>>>>> in
>>>>>>>>>>>>>>>>>>>>>>> cache,
>>>>>>>>>>>>>>>>>>>>>>>>>> size
>>>>>>>>>>>>>>>>>>>>>>>>>>> of
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> cache will be significantly reduced:
>>>> filters =
>>>>>>>>>> avoid
>>>>>>>>>>>>>>>>>>>>>>> storing
>>>>>>>>>>>>>>>>>>>>>>>>>>>> useless
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> records in cache, projections = reduce
>>>>>> records’
>>>>>>>>>>>>>>>>>>>>> size. So
>>>>>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>>>>>>> initial
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> max number of records in cache can be
>>>>>> increased
>>>>>>>>> by
>>>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>> user.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> What do you think about it?
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On 2022/04/19 02:47:11 Qingsheng Ren wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi devs,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Yuan and I would like to start a
>>>> discussion
>>>>>>>>> about
>>>>>>>>>>>>>>>>>>>>>>>>>> FLIP-221[1],
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> which
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> introduces an abstraction of lookup table
>>>>>> cache
>>>>>>>>> and
>>>>>>>>>>>>>>>>>>>>> its
>>>>>>>>>>>>>>>>>>>>>>>>>> standard
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> metrics.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Currently each lookup table source should
>>>>>>>>>> implement
>>>>>>>>>>>>>>>>>>>>>>> their
>>>>>>>>>>>>>>>>>>>>>>>>>> own
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> cache to
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> store lookup results, and there isn’t a
>>>>>>>> standard
>>>>>>>>> of
>>>>>>>>>>>>>>>>>>>>>>> metrics
>>>>>>>>>>>>>>>>>>>>>>>>>> for
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> users and
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> developers to tuning their jobs with lookup
>>>>>>>>> joins,
>>>>>>>>>>>>>>>>>>>>> which
>>>>>>>>>>>>>>>>>>>>>>> is a
>>>>>>>>>>>>>>>>>>>>>>>>>>>> quite
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> common
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> use case in Flink table / SQL.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Therefore we propose some new APIs
>>>> including
>>>>>>>>>> cache,
>>>>>>>>>>>>>>>>>>>>>>>>>> metrics,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> wrapper
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> classes of TableFunction and new table
>>>>>> options.
>>>>>>>>>>>>>>>>>>>>> Please
>>>>>>>>>>>>>>>>>>>>>>> take a
>>>>>>>>>>>>>>>>>>>>>>>>>>> look
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> at the
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> FLIP page [1] to get more details. Any
>>>>>>>>> suggestions
>>>>>>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>>>>>>>>>> comments
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> would be
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> appreciated!
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> [1]
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>> 
>>>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-221+Abstraction+for+lookup+source+cache+and+metric
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Qingsheng
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Best Regards,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Qingsheng Ren
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Real-time Computing Team
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Alibaba Cloud
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Email: renqschn@gmail.com
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>>>>>>>>>>>>> Roman Boyko
>>>>>>>>>>>>>>>>>>>>>>>>> e.: ro.v.boyko@gmail.com
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>>> Best Regards,
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Qingsheng Ren
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Real-time Computing Team
>>>>>>>>>>>>>>>>>> Alibaba Cloud
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Email: renqschn@gmail.com
>>>>>>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>> 
>>>>>> 
>>>> 
>>> 
> 
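
The filter-before-cache optimization from the quoted message can be sketched in plain Java as follows; the row layout (id, age, salary) and all names here are illustrative, not the Flink API:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

/**
 * Illustrative sketch of applying the join's "calc" function
 * (filters + projection) BEFORE records enter the lookup cache, so the
 * cache never stores rows that cannot match anyway.
 */
class CalcBeforeCache {
    // Hypothetical lookup row layout for table B: {id, age, salary}.
    static final Map<Integer, List<int[]>> cache = new HashMap<>();

    // Corresponds to the "WHERE B.salary > 1000" filter from the example.
    static final Predicate<int[]> calcFilter = row -> row[2] > 1000;

    /** Keep only rows passing the filter and project away the salary column. */
    static List<int[]> putFiltered(int key, List<int[]> lookedUpRows) {
        List<int[]> kept = new ArrayList<>();
        for (int[] row : lookedUpRows) {
            if (calcFilter.test(row)) {
                kept.add(new int[] {row[0], row[1]}); // projection drops salary
            }
        }
        cache.put(key, kept);
        return kept;
    }
}
```

Rows that fail the filter never occupy a cache slot, which is why the effective maximum row count of the cache grows without extra memory.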


Re: [DISCUSS] FLIP-221 Abstraction for lookup source cache and metric

Posted by Alexander Smirnov <sm...@gmail.com>.
Hi Jingsong and devs!

I agree that custom reloading would be very useful, so I changed the
recently proposed ReloadTime to a customizable ReloadStrategy with a
default implementation, FixedDelayReloadStrategy. I updated the FLIP;
you can look at the new design [1]. From my point of view, the
disadvantage of my solution is that the framework passes its runtime
reloading logic (packed in a Runnable) into the connectors, but on the
other hand it looks pretty concise. I decided not to pass an
ExecutorService into 'scheduleReload' because it would be redundant in
cases where the connector already has an active connection to which a
listener can be added. As an alternative I considered having
connectors provide a CompletableFuture to which the framework could
apply callbacks (so they would be triggered once the future
completes), but it didn't look very clear. I would be glad to have
your feedback!
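
As an illustration of the customizable reload strategy described above, here is a minimal sketch. The names ReloadStrategy and FixedDelayReloadStrategy follow the discussion, but the exact signatures are assumptions, not the FLIP's final API:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

interface ReloadStrategy {
    /** Called once by the framework; the strategy decides when to run the reload task. */
    void scheduleReload(Runnable reloadTask);
}

class FixedDelayReloadStrategy implements ReloadStrategy {
    private final long delayMillis;
    private final ScheduledExecutorService executor =
            Executors.newSingleThreadScheduledExecutor();

    FixedDelayReloadStrategy(long delayMillis) {
        this.delayMillis = delayMillis;
    }

    @Override
    public void scheduleReload(Runnable reloadTask) {
        // Run the first reload immediately, then again after each fixed delay.
        executor.scheduleWithFixedDelay(reloadTask, 0, delayMillis, TimeUnit.MILLISECONDS);
    }
}
```

A connector watching an external trigger (e.g. a ZooKeeper listener) could implement scheduleReload by invoking the task from its own callback, which is why no ExecutorService needs to be passed in.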

Jingsong, I'm fine with your other suggestions (about unifying the
full / partial caches and making the full cache customizable). If
Qingsheng and the others agree, we can also change that in the FLIP.

Best regards,
Alexander

[1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-221%3A+Abstraction+for+lookup+source+cache+and+metric

пт, 27 мая 2022 г. в 16:12, Jingsong Li <ji...@gmail.com>:
>
> Hi all,
>
> I think the problems now are as follows:
> 1. The AllCache and PartialCache interfaces are not uniform: one needs
> to provide a LookupProvider, the other a CacheBuilder.
> 2. The AllCache definition is not flexible. For example, PartialCache
> can use any custom storage while AllCache cannot; AllCache could also
> store to memory or disk, so it needs a flexible strategy as well.
> 3. AllCache cannot customize its ReloadStrategy; currently there is
> only ScheduledReloadStrategy.
>
> In order to solve the above problems, the following are my ideas.
>
> ## Top level cache interfaces:
>
> ```
>
> public interface CacheLookupProvider extends
> LookupTableSource.LookupRuntimeProvider {
>
>     CacheBuilder createCacheBuilder();
> }
>
>
> public interface CacheBuilder {
>     Cache create();
> }
>
>
> public interface Cache {
>
>     /**
>      * Returns the value associated with key in this cache, or null if
> there is no cached value for
>      * key.
>      */
>     @Nullable
>     Collection<RowData> getIfPresent(RowData key);
>
>     /** Returns the number of key-value mappings in the cache. */
>     long size();
> }
>
> ```
>
> ## Partial cache
>
> ```
>
> public interface PartialCacheLookupFunction extends CacheLookupProvider {
>
>     @Override
>     PartialCacheBuilder createCacheBuilder();
>
>     /** Creates an {@link LookupFunction} instance. */
>     LookupFunction createLookupFunction();
> }
>
>
> public interface PartialCacheBuilder extends CacheBuilder {
>
>     PartialCache create();
> }
>
>
> public interface PartialCache extends Cache {
>
>     /**
>      * Associates the specified value rows with the specified key row
> in the cache. If the cache
>      * previously contained value associated with the key, the old
> value is replaced by the
>      * specified value.
>      *
>      * @return the previous value rows associated with key, or null if
> there was no mapping for key.
>      * @param key - key row with which the specified value is to be associated
>      * @param value – value rows to be associated with the specified key
>      */
>     Collection<RowData> put(RowData key, Collection<RowData> value);
>
>     /** Discards any cached value for the specified key. */
>     void invalidate(RowData key);
> }
>
> ```
>
> ## All cache
> ```
>
> public interface AllCacheLookupProvider extends CacheLookupProvider {
>
>     void registerReloadStrategy(ScheduledExecutorService
> executorService, Reloader reloader);
>
>     ScanTableSource.ScanRuntimeProvider getScanRuntimeProvider();
>
>     @Override
>     AllCacheBuilder createCacheBuilder();
> }
>
>
> public interface AllCacheBuilder extends CacheBuilder {
>
>     AllCache create();
> }
>
>
> public interface AllCache extends Cache {
>
>     void putAll(Iterator<Map<RowData, RowData>> allEntries);
>
>     void clearAll();
> }
>
>
> public interface Reloader {
>
>     void reload();
> }
>
> ```
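
For illustration only, a PartialCache along the lines of the interfaces quoted above could be backed by a LinkedHashMap in access order; the String key/value types and the LRU eviction policy here are simplifications of this sketch, not part of the proposal:

```java
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.Map;

/** Minimal LRU-bounded cache mirroring the getIfPresent/put/invalidate/size contract. */
class LruPartialCache {
    private final LinkedHashMap<String, Collection<String>> entries;

    LruPartialCache(int maxRows) {
        // accessOrder = true turns the LinkedHashMap into an LRU structure.
        this.entries = new LinkedHashMap<>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Collection<String>> eldest) {
                return size() > maxRows; // evict the least recently used entry beyond the bound
            }
        };
    }

    Collection<String> getIfPresent(String key) { return entries.get(key); }

    Collection<String> put(String key, Collection<String> value) { return entries.put(key, value); }

    void invalidate(String key) { entries.remove(key); }

    long size() { return entries.size(); }
}
```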
>
> Best,
> Jingsong
>
> On Fri, May 27, 2022 at 11:10 AM Jingsong Li <ji...@gmail.com> wrote:
>
> > Thanks Qingsheng and all for your discussion.
> >
> > Very sorry to jump in so late.
> >
> > Maybe I missed something?
> > My first impression when I saw the cache interface was: why don't we
> > provide an interface similar to Guava cache [1]? On top of Guava
> > cache, Caffeine also adds extensions for asynchronous calls [2], and
> > there is bulk loading in Caffeine too.
> >
> > I am also confused about why we go from LookupCacheFactory.Builder to
> > a Factory to create the Cache.
> >
> > [1] https://github.com/google/guava
> > [2] https://github.com/ben-manes/caffeine/wiki/Population
> >
> > Best,
> > Jingsong
> >
> > On Thu, May 26, 2022 at 11:17 PM Jark Wu <im...@gmail.com> wrote:
> >
> >> After looking at the newly introduced ReloadTime and Becket's comment,
> >> I agree with Becket that we should have a pluggable reloading strategy.
> >> We can provide some common implementations, e.g., periodic reloading
> >> and daily reloading, but there will definitely be some connector- or
> >> business-specific reloading strategies, e.g. notification by a
> >> ZooKeeper watcher, or reloading once a new Hive partition is complete.
> >>
> >> Best,
> >> Jark
> >>
> >> On Thu, 26 May 2022 at 11:52, Becket Qin <be...@gmail.com> wrote:
> >>
> >> > Hi Qingsheng,
> >> >
> >> > Thanks for updating the FLIP. A few comments / questions below:
> >> >
> >> > 1. Is there a reason that we have both "XXXFactory" and "XXXProvider"?
> >> > What is the difference between them? If they are the same, can we just
> >> > use XXXFactory everywhere?
> >> >
> >> > 2. Regarding the FullCachingLookupProvider, should the reloading policy
> >> > also be pluggable? Periodic reloading can sometimes be tricky in
> >> > practice. For example, if a user uses 24 hours as the cache refresh
> >> > interval and some nightly batch job is delayed, the cache update may
> >> > still see stale data.
> >> >
> >> > 3. In DefaultLookupCacheFactory, it looks like InitialCapacity should be
> >> > removed.
> >> >
> >> > 4. The purpose of LookupFunctionProvider#cacheMissingKey() seems a
> >> > little confusing to me. If Optional<LookupCacheFactory>
> >> > getCacheFactory() returns a non-empty factory, doesn't that already
> >> > tell the framework to cache the missing keys? Also, why does this
> >> > method return an Optional<Boolean> instead of a boolean?
> >> >
> >> > Thanks,
> >> >
> >> > Jiangjie (Becket) Qin
> >> >
> >> >
> >> >
> >> > On Wed, May 25, 2022 at 5:07 PM Qingsheng Ren <re...@gmail.com> wrote:
> >> >
> >> >> Hi Lincoln and Jark,
> >> >>
> >> >> Thanks for the comments! If the community reaches a consensus that we
> >> use
> >> >> SQL hint instead of table options to decide whether to use sync or
> >> async
> >> >> mode, it’s indeed not necessary to introduce the “lookup.async” option.
> >> >>
> >> >> I think it’s a good idea to let the async decision be made at the
> >> >> query level, which could enable better optimization with more
> >> >> information gathered by the planner. Is there any FLIP describing the
> >> >> issue in FLINK-27625? I thought FLIP-234 only proposes adding a SQL
> >> >> hint for retry on missing data, rather than having the entire async
> >> >> mode controlled by a hint.
> >> >>
> >> >> Best regards,
> >> >>
> >> >> Qingsheng
> >> >>
> >> >> > On May 25, 2022, at 15:13, Lincoln Lee <li...@gmail.com> wrote:
> >> >> >
> >> >> > Hi Jark,
> >> >> >
> >> >> > Thanks for your reply!
> >> >> >
> >> >> > Currently 'lookup.async' exists only in the HBase connector. I have
> >> >> > no idea whether or when to remove it (we can discuss that in another
> >> >> > issue for the HBase connector after FLINK-27625 is done); I just
> >> >> > suggest not adding it as a common option now.
> >> >> >
> >> >> > Best,
> >> >> > Lincoln Lee
> >> >> >
> >> >> >
> >> >> > Jark Wu <im...@gmail.com> wrote on Tue, 24 May 2022 at 20:14:
> >> >> >
> >> >> >> Hi Lincoln,
> >> >> >>
> >> >> >> I have taken a look at FLIP-234, and I agree with you that the
> >> >> connectors
> >> >> >> can
> >> >> >> provide both async and sync runtime providers simultaneously instead
> >> >> of one
> >> >> >> of them.
> >> >> >> At that point, "lookup.async" looks redundant. If this option is
> >> >> planned to
> >> >> >> be removed
> >> >> >> in the long term, I think it makes sense not to introduce it in this
> >> >> FLIP.
> >> >> >>
> >> >> >> Best,
> >> >> >> Jark
> >> >> >>
> >> >> >> On Tue, 24 May 2022 at 11:08, Lincoln Lee <li...@gmail.com>
> >> >> wrote:
> >> >> >>
> >> >> >>> Hi Qingsheng,
> >> >> >>>
> >> >> >>> Sorry for jumping into the discussion so late. It's a good idea
> >> >> >>> that we can have a common table option. I have a minor comment on
> >> >> >>> 'lookup.async': let's not make it a common option.
> >> >> >>>
> >> >> >>> The table layer abstracts both sync and async lookup capabilities;
> >> >> >>> connector implementers can choose one or both. In the case of
> >> >> >>> implementing only one capability (the status of most existing
> >> >> >>> built-in connectors), 'lookup.async' will not be used. And when a
> >> >> >>> connector has both capabilities, I think this choice is more
> >> >> >>> suitable for a decision at the query level: for example, the table
> >> >> >>> planner can choose the physical implementation of async or sync
> >> >> >>> lookup based on its cost model, or users can give a query hint based
> >> >> >>> on their own better understanding. If there is another common table
> >> >> >>> option 'lookup.async', it may confuse users in the long run.
> >> >> >>>
> >> >> >>> So, I prefer to leave the 'lookup.async' option in a private place
> >> >> >>> (the current HBase connector) and not turn it into a common option.
> >> >> >>>
> >> >> >>> WDYT?
> >> >> >>>
> >> >> >>> Best,
> >> >> >>> Lincoln Lee
> >> >> >>>
> >> >> >>>
> >> >> >>> Qingsheng Ren <re...@gmail.com> wrote on Mon, 23 May 2022 at 14:54:
> >> >> >>>
> >> >> >>>> Hi Alexander,
> >> >> >>>>
> >> >> >>>> Thanks for the review! We recently updated the FLIP and you can
> >> >> >>>> find those changes in my latest email. Since some terminology has
> >> >> >>>> changed, I'll use the new concepts when replying to your comments.
> >> >> >>>>
> >> >> >>>> 1. Builder vs ‘of’
> >> >> >>>> I'm OK with using the builder pattern if we have additional
> >> >> >>>> optional parameters for full caching mode ("rescan" previously).
> >> >> >>>> The schedule-with-delay idea looks reasonable to me, but I think we
> >> >> >>>> need to redesign the builder API of full caching to make it more
> >> >> >>>> descriptive for developers. Would you mind sharing your ideas about
> >> >> >>>> the API? To get access to the FLIP workspace you can just provide
> >> >> >>>> your account ID and ping any PMC member, including Jark.
> >> >> >>>>
> >> >> >>>> 2. Common table options
> >> >> >>>> We have had some discussions these days and propose introducing 8
> >> >> >>>> common table options for caching. The FLIP has been updated.
> >> >> >>>>
> >> >> >>>> 3. Retries
> >> >> >>>> I think we are on the same page :-)
> >> >> >>>>
> >> >> >>>> For your additional concerns:
> >> >> >>>> 1) The table option has been updated.
> >> >> >>>> 2) We brought “lookup.cache” back for configuring whether to use
> >> >> >>>> partial or full caching mode.
> >> >> >>>>
> >> >> >>>> Best regards,
> >> >> >>>>
> >> >> >>>> Qingsheng
> >> >> >>>>
> >> >> >>>>
> >> >> >>>>
> >> >> >>>>> On May 19, 2022, at 17:25, Александр Смирнов <smiralexan@gmail.com> wrote:
> >> >> >>>>>
> >> >> >>>>> Also I have a few additions:
> >> >> >>>>> 1) Maybe rename 'lookup.cache.maximum-size' to
> >> >> >>>>> 'lookup.cache.max-rows'? I think it makes it clearer that we are
> >> >> >>>>> talking not about bytes but about the number of rows. Plus it fits
> >> >> >>>>> better, considering my optimization with filters.
> >> >> >>>>> 2) How will users enable rescanning? Are we going to separate
> >> >> >>>>> caching and rescanning from the options point of view? Initially
> >> >> >>>>> we had one option 'lookup.cache' with values LRU / ALL. I think
> >> >> >>>>> now we can add a boolean option 'lookup.rescan'; the rescan
> >> >> >>>>> interval can be 'lookup.rescan.interval', etc.
> >> >> >>>>>
> >> >> >>>>> Best regards,
> >> >> >>>>> Alexander
> >> >> >>>>>
> >> >> >>>>> Thu, 19 May 2022 at 14:50, Александр Смирнов <smiralexan@gmail.com>:
> >> >> >>>>>>
> >> >> >>>>>> Hi Qingsheng and Jark,
> >> >> >>>>>>
> >> >> >>>>>> 1. Builders vs 'of'
> >> >> >>>>>> I understand that builders are used when we have multiple
> >> >> >>>>>> parameters. I suggested them because we could add parameters
> >> >> >>>>>> later. To prevent the Builder for ScanRuntimeProvider from
> >> >> >>>>>> looking redundant, I can suggest one more config now:
> >> >> >>>>>> "rescanStartTime". It's a time in UTC (LocalTime class) when the
> >> >> >>>>>> first reload of the cache starts. This parameter can be thought
> >> >> >>>>>> of as the 'initialDelay' (the difference between the current time
> >> >> >>>>>> and rescanStartTime) in
> >> >> >>>>>> ScheduledExecutorService#scheduleWithFixedDelay [1]. It can be
> >> >> >>>>>> very useful when the dimension table is updated by some other
> >> >> >>>>>> scheduled job at a certain time, or when the user simply wants
> >> >> >>>>>> the second scan (first cache reload) to be delayed. This option
> >> >> >>>>>> can be used even without 'rescanInterval' - in that case
> >> >> >>>>>> 'rescanInterval' will be one day. If you are fine with this
> >> >> >>>>>> option, I would be very glad if you would give me access to edit
> >> >> >>>>>> the FLIP page, so I could add it myself.
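
The 'initialDelay' arithmetic described above (the difference between the current UTC time and rescanStartTime, wrapping to the next day if the start time has already passed) can be sketched as follows; the class and method names are illustrative only:

```java
import java.time.Duration;
import java.time.LocalTime;

/** Sketch of computing the first-reload delay from a configured rescanStartTime. */
class RescanDelay {
    static Duration initialDelay(LocalTime nowUtc, LocalTime rescanStartTime) {
        Duration delay = Duration.between(nowUtc, rescanStartTime);
        if (delay.isNegative()) {
            delay = delay.plusDays(1); // start time already passed today: wait until tomorrow
        }
        return delay;
    }
}
```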
> >> >> >>>>>>
> >> >> >>>>>> 2. Common table options
> >> >> >>>>>> I also think that FactoryUtil would be overloaded by all the
> >> >> >>>>>> cache options. But maybe unify all the suggested options, not
> >> >> >>>>>> only those for the default cache? I.e. a class 'LookupOptions'
> >> >> >>>>>> that unifies the default cache options, rescan options, 'async'
> >> >> >>>>>> and 'maxRetries'. WDYT?
> >> >> >>>>>>
> >> >> >>>>>> 3. Retries
> >> >> >>>>>> I'm fine with a suggestion close to RetryUtils#tryTimes(times, call).
> >> >> >>>>>>
> >> >> >>>>>> [1]
> >> >> >>>>>> https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/ScheduledExecutorService.html#scheduleWithFixedDelay-java.lang.Runnable-long-long-java.util.concurrent.TimeUnit-
> >> >> >>>>>>
> >> >> >>>>>> Best regards,
> >> >> >>>>>> Alexander
> >> >> >>>>>>
> >> >> >>>>>> ср, 18 мая 2022 г. в 16:04, Qingsheng Ren <re...@gmail.com>:
> >> >> >>>>>>>
> >> >> >>>>>>> Hi Jark and Alexander,
> >> >> >>>>>>>
> >> >> >>>>>>> Thanks for your comments! I’m also OK with introducing common
> >> >> >>>>>>> table options. I prefer to introduce a new
> >> >> >>>>>>> DefaultLookupCacheOptions class for holding these option
> >> >> >>>>>>> definitions, because putting all options into FactoryUtil would
> >> >> >>>>>>> make it a bit ”crowded” and not well categorized.
> >> >> >>>>>>>
> >> >> >>>>>>> The FLIP has been updated according to the suggestions above:
> >> >> >>>>>>> 1. Use a static “of” method for constructing
> >> >> >>>>>>> RescanRuntimeProvider, considering both arguments are required.
> >> >> >>>>>>> 2. Introduce new table options matching DefaultLookupCacheFactory.
> >> >> >>>>>>>
> >> >> >>>>>>> Best,
> >> >> >>>>>>> Qingsheng
> >> >> >>>>>>>
> >> >> >>>>>>> On Wed, May 18, 2022 at 2:57 PM Jark Wu <im...@gmail.com>
> >> wrote:
> >> >> >>>>>>>>
> >> >> >>>>>>>> Hi Alex,
> >> >> >>>>>>>>
> >> >> >>>>>>>> 1) retry logic
> >> >> >>>>>>>> I think we can extract some common retry logic into utilities,
> >> >> >> e.g.
> >> >> >>>> RetryUtils#tryTimes(times, call).
> >> >> >>>>>>>> This seems independent of this FLIP and can be reused by
> >> >> >> DataStream
> >> >> >>>> users.
> >> >> >>>>>>>> Maybe we can open an issue to discuss this and where to put
> >> it.
> >> >> >>>>>>>>
> >> >> >>>>>>>> 2) cache ConfigOptions
> >> >> >>>>>>>> I'm fine with defining cache config options in the framework.
> >> >> >>>>>>>> A candidate place to put them is FactoryUtil, which also
> >> >> >>>>>>>> includes the "sink.parallelism" and "format" options.
> >> >> >>>>>>>>
> >> >> >>>>>>>> Best,
> >> >> >>>>>>>> Jark
> >> >> >>>>>>>>
> >> >> >>>>>>>>
> >> >> >>>>>>>> On Wed, 18 May 2022 at 13:52, Александр Смирнов <smiralexan@gmail.com> wrote:
> >> >> >>>>>>>>>
> >> >> >>>>>>>>> Hi Qingsheng,
> >> >> >>>>>>>>>
> >> >> >>>>>>>>> Thank you for considering my comments.
> >> >> >>>>>>>>>
> >> >> >>>>>>>>>> there might be custom logic before making a retry, such as
> >> >> >>>>>>>>>> re-establishing the connection
> >> >> >>>>>>>>>
> >> >> >>>>>>>>> Yes, I understand that. I meant that such logic can be placed
> >> >> >>>>>>>>> in a separate function that can be implemented by connectors.
> >> >> >>>>>>>>> Just moving the retry logic would make the connectors'
> >> >> >>>>>>>>> LookupFunction more concise and avoid duplicate code. However,
> >> >> >>>>>>>>> it's a minor change; the decision is up to you.
> >> >> >>>>>>>>>
> >> >> >>>>>>>>>> We decided not to provide common DDL options and to let
> >> >> >>>>>>>>>> developers define their own options, as we do now per connector.
> >> >> >>>>>>>>>
> >> >> >>>>>>>>> What is the reason for that? One of the main goals of this
> >> >> >>>>>>>>> FLIP was to unify the configs, wasn't it? I understand that
> >> >> >>>>>>>>> the current cache design doesn't depend on ConfigOptions, as
> >> >> >>>>>>>>> it did before. But we can still put these options into the
> >> >> >>>>>>>>> framework, so connectors can reuse them and avoid code
> >> >> >>>>>>>>> duplication and, more significantly, avoid possibly
> >> >> >>>>>>>>> inconsistent option naming. This can be pointed out in the
> >> >> >>>>>>>>> documentation for connector developers.
> >> >> >>>>>>>>>
> >> >> >>>>>>>>> Best regards,
> >> >> >>>>>>>>> Alexander
> >> >> >>>>>>>>>
> >> >> >>>>>>>>> Tue, 17 May 2022 at 17:11, Qingsheng Ren <renqschn@gmail.com>:
> >> >> >>>>>>>>>>
> >> >> >>>>>>>>>> Hi Alexander,
> >> >> >>>>>>>>>>
> >> >> >>>>>>>>>> Thanks for the review, and glad to see we are on the same
> >> >> >>>>>>>>>> page! I think you forgot to cc the dev mailing list, so I’m
> >> >> >>>>>>>>>> also quoting your reply under this email.
> >> >> >>>>>>>>>>
> >> >> >>>>>>>>>>> We can add 'maxRetryTimes' option into this class
> >> >> >>>>>>>>>>
> >> >> >>>>>>>>>> In my opinion the retry logic should be implemented in
> >> >> >>>>>>>>>> lookup() instead of in LookupFunction#eval(). Retrying is
> >> >> >>>>>>>>>> only meaningful under some specific retriable failures, and
> >> >> >>>>>>>>>> there might be custom logic before making a retry, such as
> >> >> >>>>>>>>>> re-establishing the connection (JdbcRowDataLookupFunction is
> >> >> >>>>>>>>>> an example), so it's handier to leave it to the connector.
> >> >> >>>>>>>>>>
> >> >> >>>>>>>>>>> I don't see the DDL options that were in the previous
> >> >> >>>>>>>>>>> version of the FLIP. Do you have any special plans for them?
> >> >> >>>>>>>>>>
> >> >> >>>>>>>>>> We decided not to provide common DDL options and to let
> >> >> >>>>>>>>>> developers define their own options, as we do now per connector.
> >> >> >>>>>>>>>>
> >> >> >>>>>>>>>> The rest of the comments sound great and I’ll update the
> >> >> >>>>>>>>>> FLIP. I hope we can finalize our proposal soon!
> >> >> >>>>>>>>>>
> >> >> >>>>>>>>>> Best,
> >> >> >>>>>>>>>>
> >> >> >>>>>>>>>> Qingsheng
> >> >> >>>>>>>>>>
> >> >> >>>>>>>>>>
> >> >> >>>>>>>>>>> On May 17, 2022, at 13:46, Александр Смирнов <smiralexan@gmail.com> wrote:
> >> >> >>>>>>>>>>>
> >> >> >>>>>>>>>>> Hi Qingsheng and devs!
> >> >> >>>>>>>>>>>
> >> >> >>>>>>>>>>> I like the overall design of the updated FLIP; however, I
> >> >> >>>>>>>>>>> have several suggestions and questions.
> >> >> >>>>>>>>>>>
> >> >> >>>>>>>>>>> 1) Introducing LookupFunction as a subclass of TableFunction
> >> >> >>>>>>>>>>> is a good idea. We can add a 'maxRetryTimes' option to this
> >> >> >>>>>>>>>>> class; the 'eval' method of the new LookupFunction is great
> >> >> >>>>>>>>>>> for this purpose. The same goes for the 'async' case.
> >> >> >>>>>>>>>>>
> >> >> >>>>>>>>>>> 2) There might be other configs in the future, such as
> >> >> >>>>>>>>>>> 'cacheMissingKey' in LookupFunctionProvider or
> >> >> >>>>>>>>>>> 'rescanInterval' in ScanRuntimeProvider. Maybe use the
> >> >> >>>>>>>>>>> builder pattern in LookupFunctionProvider and
> >> >> >>>>>>>>>>> RescanRuntimeProvider for more flexibility (one 'build'
> >> >> >>>>>>>>>>> method instead of many 'of' methods in the future)?
> >> >> >>>>>>>>>>>
> >> >> >>>>>>>>>>> 3) What are the plans for the existing TableFunctionProvider
> >> >> >>>>>>>>>>> and AsyncTableFunctionProvider? I think they should be
> >> >> >>>>>>>>>>> deprecated.
> >> >> >>>>>>>>>>>
> >> >> >>>>>>>>>>> 4) Am I right that the current design does not assume usage
> >> >> >>>>>>>>>>> of a user-provided LookupCache in re-scanning? In that case,
> >> >> >>>>>>>>>>> it is not very clear why we need methods such as
> >> >> >>>>>>>>>>> 'invalidate' or 'putAll' in LookupCache.
> >> >> >>>>>>>>>>>
> >> >> >>>>>>>>>>> 5) I don't see the DDL options that were in the previous
> >> >> >>>>>>>>>>> version of the FLIP. Do you have any special plans for them?
> >> >> >>>>>>>>>>>
> >> >> >>>>>>>>>>> If you don't mind, I would be glad to make small adjustments
> >> >> >>>>>>>>>>> to the FLIP document too. I think it's worth mentioning
> >> >> >>>>>>>>>>> exactly what optimizations are planned for the future.
> >> >> >>>>>>>>>>>
> >> >> >>>>>>>>>>> Best regards,
> >> >> >>>>>>>>>>> Smirnov Alexander
> >> >> >>>>>>>>>>>
> >> >> >>>>>>>>>>> пт, 13 мая 2022 г. в 20:27, Qingsheng Ren <
> >> renqschn@gmail.com
> >> >> >>> :
> >> >> >>>>>>>>>>>>
> >> >> >>>>>>>>>>>> Hi Alexander and devs,
> >> >> >>>>>>>>>>>>
> >> >> >>>>>>>>>>>> Thank you very much for the in-depth discussion! As Jark
> >> >> >>>> mentioned we were inspired by Alexander's idea and made a
> >> refactor on
> >> >> >> our
> >> >> >>>> design. FLIP-221 [1] has been updated to reflect our design now
> >> and
> >> >> we
> >> >> >>> are
> >> >> >>>> happy to hear more suggestions from you!
> >> >> >>>>>>>>>>>>
> >> >> >>>>>>>>>>>> Compared to the previous design:
> >> >> >>>>>>>>>>>> 1. The lookup cache serves at table runtime level and is
> >> >> >>>> integrated as a component of LookupJoinRunner as discussed
> >> >> previously.
> >> >> >>>>>>>>>>>> 2. Interfaces are renamed and re-designed to reflect the
> >> new
> >> >> >>>> design.
> >> >> >>>>>>>>>>>> 3. We handle the all-caching case separately and
> >> >> >> introduce a
> >> >> >>>> new RescanRuntimeProvider to reuse the ability of scanning. We are
> >> >> >>> planning
> >> >> >>>> to support SourceFunction / InputFormat for now considering the
> >> >> >>> complexity
> >> >> >>>> of FLIP-27 Source API.
> >> >> >>>>>>>>>>>> 4. A new interface LookupFunction is introduced to make
> >> the
> >> >> >>>> semantic of lookup more straightforward for developers.
> >> >> >>>>>>>>>>>>
> >> >> >>>>>>>>>>>> For replying to Alexander:
> >> >> >>>>>>>>>>>>> However I'm a little confused whether InputFormat is
> >> >> >> deprecated
> >> >> >>>> or not. Am I right that it will be so in the future, but currently
> >> >> it's
> >> >> >>> not?
> >> >> >>>>>>>>>>>> Yes you are right. InputFormat is not deprecated for now.
> >> I
> >> >> >>> think
> >> >> >>>> it will be deprecated in the future but we don't have a clear plan
> >> >> for
> >> >> >>> that.
> >> >> >>>>>>>>>>>>
> >> >> >>>>>>>>>>>> Thanks again for the discussion on this FLIP and looking
> >> >> >> forward
> >> >> >>>> to cooperating with you after we finalize the design and
> >> interfaces!
> >> >> >>>>>>>>>>>>
> >> >> >>>>>>>>>>>> [1]
> >> >> >>>>
> >> >> >>>
> >> >> >>
> >> >>
> >> https://cwiki.apache.org/confluence/display/FLINK/FLIP-221+Abstraction+for+lookup+source+cache+and+metric
> >> >> >>>>>>>>>>>>
> >> >> >>>>>>>>>>>> Best regards,
> >> >> >>>>>>>>>>>>
> >> >> >>>>>>>>>>>> Qingsheng
> >> >> >>>>>>>>>>>>
> >> >> >>>>>>>>>>>>
> >> >> >>>>>>>>>>>> On Fri, May 13, 2022 at 12:12 AM Александр Смирнов <
> >> >> >>>> smiralexan@gmail.com> wrote:
> >> >> >>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>> Hi Jark, Qingsheng and Leonard!
> >> >> >>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>> Glad to see that we came to a consensus on almost all
> >> >> points!
> >> >> >>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>> However I'm a little confused whether InputFormat is
> >> >> >> deprecated
> >> >> >>>> or
> >> >> >>>>>>>>>>>>> not. Am I right that it will be so in the future, but
> >> >> >> currently
> >> >> >>>> it's
> >> >> >>>>>>>>>>>>> not? Actually I also think that for the first version
> >> it's
> >> >> OK
> >> >> >>> to
> >> >> >>>> use
> >> >> >>>>>>>>>>>>> InputFormat in ALL cache realization, because supporting
> >> >> >> rescan
> >> >> >>>>>>>>>>>>> ability seems like a very distant prospect. But for this
> >> >> >>>> decision we
> >> >> >>>>>>>>>>>>> need a consensus among all discussion participants.
> >> >> >>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>> In general, I don't have something to argue with your
> >> >> >>>> statements. All
> >> >> >>>>>>>>>>>>> of them correspond my ideas. Looking ahead, it would be
> >> nice
> >> >> >> to
> >> >> >>>> work
> >> >> >>>>>>>>>>>>> on this FLIP cooperatively. I've already done a lot of
> >> work
> >> >> >> on
> >> >> >>>> lookup
> >> >> >>>>>>>>>>>>> join caching, with an implementation very close to the one we
> >> are
> >> >> >>>> discussing,
> >> >> >>>>>>>>>>>>> and want to share the results of this work. Anyway
> >> looking
> >> >> >>>> forward for
> >> >> >>>>>>>>>>>>> the FLIP update!
> >> >> >>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>> Best regards,
> >> >> >>>>>>>>>>>>> Smirnov Alexander
> >> >> >>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>> чт, 12 мая 2022 г. в 17:38, Jark Wu <im...@gmail.com>:
> >> >> >>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>> Hi Alex,
> >> >> >>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>> Thanks for summarizing your points.
> >> >> >>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>> In the past week, Qingsheng, Leonard, and I have
> >> discussed
> >> >> >> it
> >> >> >>>> several times
> >> >> >>>>>>>>>>>>>> and we have totally refactored the design.
> >> >> >>>>>>>>>>>>>> I'm glad to say we have reached a consensus on many of
> >> your
> >> >> >>>> points!
> >> >> >>>>>>>>>>>>>> Qingsheng is still working on updating the design docs
> >> and
> >> >> >>>> maybe can be
> >> >> >>>>>>>>>>>>>> available in the next few days.
> >> >> >>>>>>>>>>>>>> I will share some conclusions from our discussions:
> >> >> >>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>> 1) we have refactored the design towards to "cache in
> >> >> >>>> framework" way.
> >> >> >>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>> 2) a "LookupCache" interface for users to customize and
> >> a
> >> >> >>>> default
> >> >> >>>>>>>>>>>>>> implementation with builder for users to easy-use.
> >> >> >>>>>>>>>>>>>> This makes it possible to have both flexibility
> >> and
> >> >> >>>> conciseness.
> >> >> >>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>> 3) Filter pushdown is important for ALL and LRU lookup
> >> >> >> cache,
> >> >> >>>> esp reducing
> >> >> >>>>>>>>>>>>>> IO.
> >> >> >>>>>>>>>>>>>> Filter pushdown should be the final state and the
> >> unified
> >> >> >> way
> >> >> >>>> to both
> >> >> >>>>>>>>>>>>>> support pruning ALL cache and LRU cache,
> >> >> >>>>>>>>>>>>>> so I think we should make an effort in this direction. If
> >> we
> >> >> >> need
> >> >> >>>> to support
> >> >> >>>>>>>>>>>>>> filter pushdown for ALL cache anyway, why not use
> >> >> >>>>>>>>>>>>>> it for LRU cache as well? Either way, as we decide to
> >> >> >>> implement
> >> >> >>>> the cache
> >> >> >>>>>>>>>>>>>> in the framework, we have the chance to support
> >> >> >>>>>>>>>>>>>> filter on cache anytime. This is an optimization and it
> >> >> >>> doesn't
> >> >> >>>> affect the
> >> >> >>>>>>>>>>>>>> public API. I think we can create a JIRA issue to
> >> >> >>>>>>>>>>>>>> discuss it when the FLIP is accepted.
> >> >> >>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>> 4) The idea to support ALL cache is similar to your
> >> >> >> proposal.
> >> >> >>>>>>>>>>>>>> In the first version, we will only support InputFormat,
> >> >> >>>> SourceFunction for
> >> >> >>>>>>>>>>>>>> cache all (invoke InputFormat in join operator).
> >> >> >>>>>>>>>>>>>> For FLIP-27 source, we need to join a true source
> >> operator
> >> >> >>>> instead of
> >> >> >>>>>>>>>>>>>> calling it embedded in the join operator.
> >> >> >>>>>>>>>>>>>> However, this needs another FLIP to support the re-scan
> >> >> >>> ability
> >> >> >>>> for FLIP-27
> >> >> >>>>>>>>>>>>>> Source, and this can be a large work.
> >> >> >>>>>>>>>>>>>> In order to not block this issue, we can put the effort
> >> of
> >> >> >>>> FLIP-27 source
> >> >> >>>>>>>>>>>>>> integration into future work and integrate
> >> >> >>>>>>>>>>>>>> InputFormat&SourceFunction for now.
> >> >> >>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>> I think it's fine to use InputFormat&SourceFunction, as
> >> >> they
> >> >> >>>> are not
> >> >> >>>>>>>>>>>>>> deprecated, otherwise, we have to introduce another
> >> >> function
> >> >> >>>>>>>>>>>>>> similar to them which is meaningless. We need to plan
> >> >> >> FLIP-27
> >> >> >>>> source
> >> >> >>>>>>>>>>>>>> integration ASAP before InputFormat & SourceFunction are
> >> >> >>>> deprecated.
> >> >> >>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>> Best,
> >> >> >>>>>>>>>>>>>> Jark
> >> >> >>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>> On Thu, 12 May 2022 at 15:46, Александр Смирнов <
> >> >> >>>> smiralexan@gmail.com>
> >> >> >>>>>>>>>>>>>> wrote:
> >> >> >>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>> Hi Martijn!
> >> >> >>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>> Got it. Therefore, the realization with InputFormat is
> >> not
> >> >> >>>> considered.
> >> >> >>>>>>>>>>>>>>> Thanks for clearing that up!
> >> >> >>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>> Best regards,
> >> >> >>>>>>>>>>>>>>> Smirnov Alexander
> >> >> >>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>> чт, 12 мая 2022 г. в 14:23, Martijn Visser <
> >> >> >>>> martijn@ververica.com>:
> >> >> >>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>> Hi,
> >> >> >>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>> With regards to:
> >> >> >>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>>> But if there are plans to refactor all connectors to
> >> >> >>> FLIP-27
> >> >> >>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>> Yes, FLIP-27 is the target for all connectors. The old
> >> >> >>>> interfaces will be
> >> >> >>>>>>>>>>>>>>>> deprecated and connectors will either be refactored to
> >> >> use
> >> >> >>>> the new ones
> >> >> >>>>>>>>>>>>>>> or
> >> >> >>>>>>>>>>>>>>>> dropped.
> >> >> >>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>> The caching should work for connectors that are using
> >> >> >>> FLIP-27
> >> >> >>>> interfaces,
> >> >> >>>>>>>>>>>>>>>> we should not introduce new features for old
> >> interfaces.
> >> >> >>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>> Best regards,
> >> >> >>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>> Martijn
> >> >> >>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>> On Thu, 12 May 2022 at 06:19, Александр Смирнов <
> >> >> >>>> smiralexan@gmail.com>
> >> >> >>>>>>>>>>>>>>>> wrote:
> >> >> >>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>>> Hi Jark!
> >> >> >>>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>>> Sorry for the late response. I would like to make
> >> some
> >> >> >>>> comments and
> >> >> >>>>>>>>>>>>>>>>> clarify my points.
> >> >> >>>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>>> 1) I agree with your first statement. I think we can
> >> >> >>> achieve
> >> >> >>>> both
> >> >> >>>>>>>>>>>>>>>>> advantages this way: put the Cache interface in
> >> >> >>>> flink-table-common,
> >> >> >>>>>>>>>>>>>>>>> but have implementations of it in
> >> flink-table-runtime.
> >> >> >>>> Therefore if a
> >> >> >>>>>>>>>>>>>>>>> connector developer wants to use existing cache
> >> >> >> strategies
> >> >> >>>> and their
> >> >> >>>>>>>>>>>>>>>>> implementations, he can just pass lookupConfig to the
> >> >> >>>> planner, but if
> >> >> >>>>>>>>>>>>>>>>> he wants to have its own cache implementation in his
> >> >> >>>> TableFunction, it
> >> >> >>>>>>>>>>>>>>>>> will be possible for him to use the existing
> >> interface
> >> >> >> for
> >> >> >>>> this
> >> >> >>>>>>>>>>>>>>>>> purpose (we can explicitly point this out in the
> >> >> >>>> documentation). In
> >> >> >>>>>>>>>>>>>>>>> this way all configs and metrics will be unified.
> >> WDYT?
> >> >> >>>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>>>> If a filter can prune 90% of data in the cache, we
> >> will
> >> >> >>>> have 90% of
> >> >> >>>>>>>>>>>>>>>>> lookup requests that can never be cached
> >> >> >>>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>>> 2) Let me clarify the logic of the filters optimization in case
> >> case
> >> >> >> of
> >> >> >>>> LRU cache.
> >> >> >>>>>>>>>>>>>>>>> It looks like Cache<RowData, Collection<RowData>>.
> >> Here
> >> >> >> we
> >> >> >>>> always
> >> >> >>>>>>>>>>>>>>>>> store the response of the dimension table in cache,
> >> even
> >> >> >>>> after
> >> >> >>>>>>>>>>>>>>>>> applying calc function. I.e. if there are no rows
> >> after
> >> >> >>>> applying
> >> >> >>>>>>>>>>>>>>>>> filters to the result of the 'eval' method of
> >> >> >>> TableFunction,
> >> >> >>>> we store
> >> >> >>>>>>>>>>>>>>>>> the empty list by lookup keys. Therefore the cache
> >> line
> >> >> >>> will
> >> >> >>>> be
> >> >> >>>>>>>>>>>>>>>>> filled, but will require much less memory (in bytes).
> >> >> >> I.e.
> >> >> >>>> we don't
> >> >> >>>>>>>>>>>>>>>>> completely filter out keys whose result was pruned,
> >> but
> >> >> >>>> significantly
> >> >> >>>>>>>>>>>>>>>>> reduce required memory to store this result. If the
> >> user
> >> >> >>>> knows about
> >> >> >>>>>>>>>>>>>>>>> this behavior, he can increase the 'max-rows' option
> >> >> >> before
> >> >> >>>> the start
> >> >> >>>>>>>>>>>>>>>>> of the job. But actually I came up with the idea
> >> that we
> >> >> >>> can
> >> >> >>>> do this
> >> >> >>>>>>>>>>>>>>>>> automatically by using the 'maximumWeight' and
> >> 'weigher'
> >> >> >>>> methods of
> >> >> >>>>>>>>>>>>>>>>> GuavaCache [1]. Weight can be the size of the
> >> collection
> >> >> >> of
> >> >> >>>> rows
> >> >> >>>>>>>>>>>>>>>>> (value of cache). Therefore cache can automatically
> >> fit
> >> >> >>> much
> >> >> >>>> more
> >> >> >>>>>>>>>>>>>>>>> records than before.
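The weigher idea above can be sketched without Guava, using a plain access-ordered LinkedHashMap (illustrative only; Guava's 'maximumWeight'/'weigher' would replace all of this bookkeeping): each entry weighs as many units as it has rows, so empty results for pruned keys are nearly free and many more keys fit under the same cap.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Weight-bounded LRU cache sketch: weight = number of rows per entry (min 1).
public class WeighedLruCacheSketch {

    static final class WeighedCache<K, V> {
        private final long maxWeight;
        private long currentWeight = 0;
        private final LinkedHashMap<K, List<V>> map =
                new LinkedHashMap<>(16, 0.75f, true); // access order = LRU

        WeighedCache(long maxWeight) { this.maxWeight = maxWeight; }

        static long weigh(List<?> rows) { return Math.max(1, rows.size()); }

        void put(K key, List<V> rows) {
            List<V> old = map.remove(key);
            if (old != null) currentWeight -= weigh(old);
            map.put(key, rows);
            currentWeight += weigh(rows);
            // Evict least-recently-used entries until back under the cap.
            Iterator<Map.Entry<K, List<V>>> it = map.entrySet().iterator();
            while (currentWeight > maxWeight && it.hasNext()) {
                currentWeight -= weigh(it.next().getValue());
                it.remove();
            }
        }

        List<V> get(K key) { return map.get(key); }
        int size() { return map.size(); }
    }

    public static void main(String[] args) {
        WeighedCache<String, String> cache = new WeighedCache<>(4);
        cache.put("bulky", List.of("r1", "r2", "r3")); // weight 3
        cache.put("empty1", List.of());                // weight 1 (pruned key)
        cache.put("empty2", List.of());                // exceeds cap, evicts "bulky"
        System.out.println(cache.get("bulky")); // null
        System.out.println(cache.size());       // 2
    }
}
```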
> >> >> >>>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>>>> Flink SQL has provided a standard way to do filters
> >> and
> >> >> >>>> projects
> >> >> >>>>>>>>>>>>>>>>> pushdown, i.e., SupportsFilterPushDown and
> >> >> >>>> SupportsProjectionPushDown.
> >> >> >>>>>>>>>>>>>>>>>> Jdbc/hive/HBase haven't implemented the interfaces,
> >> >> >> don't
> >> >> >>>> mean it's
> >> >> >>>>>>>>>>>>>>> hard
> >> >> >>>>>>>>>>>>>>>>> to implement.
> >> >> >>>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>>> It's debatable how difficult it will be to implement
> >> >> >> filter
> >> >> >>>> pushdown.
> >> >> >>>>>>>>>>>>>>>>> But I think the fact that currently there is no
> >> database
> >> >> >>>> connector
> >> >> >>>>>>>>>>>>>>>>> with filter pushdown at least means that this feature
> >> >> >> won't
> >> >> >>>> be
> >> >> >>>>>>>>>>>>>>>>> supported soon in connectors. Moreover, if we talk
> >> about
> >> >> >>>> other
> >> >> >>>>>>>>>>>>>>>>> connectors (not in Flink repo), their databases might
> >> >> not
> >> >> >>>> support all
> >> >> >>>>>>>>>>>>>>>>> Flink filters (or not support filters at all). I
> >> think
> >> >> >>> users
> >> >> >>>> are
> >> >> >>>>>>>>>>>>>>>>> interested in supporting cache filters optimization
> >> >> >>>> independently of
> >> >> >>>>>>>>>>>>>>>>> supporting other features and solving more complex
> >> >> >> problems
> >> >> >>>> (or
> >> >> >>>>>>>>>>>>>>>>> unsolvable at all).
> >> >> >>>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>>> 3) I agree with your third statement. Actually in our
> >> >> >>>> internal version
> >> >> >>>>>>>>>>>>>>>>> I also tried to unify the logic of scanning and
> >> >> reloading
> >> >> >>>> data from
> >> >> >>>>>>>>>>>>>>>>> connectors. But unfortunately, I didn't find a way to
> >> >> >> unify
> >> >> >>>> the logic
> >> >> >>>>>>>>>>>>>>>>> of all ScanRuntimeProviders (InputFormat,
> >> >> SourceFunction,
> >> >> >>>> Source,...)
> >> >> >>>>>>>>>>>>>>>>> and reuse it in reloading ALL cache. As a result I
> >> >> >> settled
> >> >> >>>> on using
> >> >> >>>>>>>>>>>>>>>>> InputFormat, because it was used for scanning in all
> >> >> >> lookup
> >> >> >>>>>>>>>>>>>>>>> connectors. (I didn't know that there are plans to
> >> >> >>> deprecate
> >> >> >>>>>>>>>>>>>>>>> InputFormat in favor of FLIP-27 Source). IMO usage of
> >> >> >>>> FLIP-27 source
> >> >> >>>>>>>>>>>>>>>>> in ALL caching is not a good idea, because this source
> >> was
> >> >> >>>> designed to
> >> >> >>>>>>>>>>>>>>>>> work in distributed environment (SplitEnumerator on
> >> >> >>>> JobManager and
> >> >> >>>>>>>>>>>>>>>>> SourceReaders on TaskManagers), not in one operator
> >> >> >> (lookup
> >> >> >>>> join
> >> >> >>>>>>>>>>>>>>>>> operator in our case). There is even no direct way to
> >> >> >> pass
> >> >> >>>> splits from
> >> >> >>>>>>>>>>>>>>>>> SplitEnumerator to SourceReader (this logic works
> >> >> through
> >> >> >>>>>>>>>>>>>>>>> SplitEnumeratorContext, which requires
> >> >> >>>>>>>>>>>>>>>>> OperatorCoordinator.SubtaskGateway to send
> >> >> >> AddSplitEvents).
> >> >> >>>> Usage of
> >> >> >>>>>>>>>>>>>>>>> InputFormat for ALL cache seems much more clearer and
> >> >> >>>> easier. But if
> >> >> >>>>>>>>>>>>>>>>> there are plans to refactor all connectors to
> >> FLIP-27, I
> >> >> >>>> have the
> >> >> >>>>>>>>>>>>>>>>> following ideas: maybe we can drop the lookup join ALL
> >> >> >>>>>>>>>>>>>>>>> cache in favor of a simple join with multiple scans of the
> >> >> >>>>>>>>>>>>>>>>> batch source? The point
> >> >> >>>> The point
> >> >> >>>>>>>>>>>>>>>>> is that the only difference between lookup join ALL
> >> >> cache
> >> >> >>>> and simple
> >> >> >>>>>>>>>>>>>>>>> join with batch source is that in the first case
> >> >> scanning
> >> >> >>> is
> >> >> >>>> performed
> >> >> >>>>>>>>>>>>>>>>> multiple times, in between which state (cache) is
> >> >> cleared
> >> >> >>>> (correct me
> >> >> >>>>>>>>>>>>>>>>> if I'm wrong). So what if we extend the
> >> functionality of
> >> >> >>>> simple join
> >> >> >>>>>>>>>>>>>>>>> to support state reloading + extend the
> >> functionality of
> >> >> >>>> scanning
> >> >> >>>>>>>>>>>>>>>>> batch source multiple times (this one should be easy
> >> >> with
> >> >> >>>> new FLIP-27
> >> >> >>>>>>>>>>>>>>>>> source, that unifies streaming/batch reading - we
> >> will
> >> >> >> need
> >> >> >>>> to change
> >> >> >>>>>>>>>>>>>>>>> only SplitEnumerator, which will pass splits again
> >> after
> >> >> >>>> some TTL).
> >> >> >>>>>>>>>>>>>>>>> WDYT? I must say that this looks like a long-term
> >> goal
> >> >> >> and
> >> >> >>>> will make
> >> >> >>>>>>>>>>>>>>>>> the scope of this FLIP even larger than you said.
> >> Maybe
> >> >> >> we
> >> >> >>>> can limit
> >> >> >>>>>>>>>>>>>>>>> ourselves to a simpler solution now (InputFormats).
> >> >> >>>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>>> So to sum up, my points is like this:
> >> >> >>>>>>>>>>>>>>>>> 1) There is a way to make both concise and flexible
> >> >> >>>> interfaces for
> >> >> >>>>>>>>>>>>>>>>> caching in lookup join.
> >> >> >>>>>>>>>>>>>>>>> 2) Cache filters optimization is important both in
> >> LRU
> >> >> >> and
> >> >> >>>> ALL caches.
> >> >> >>>>>>>>>>>>>>>>> 3) It is unclear when filter pushdown will be
> >> supported
> >> >> >> in
> >> >> >>>> Flink
> >> >> >>>>>>>>>>>>>>>>> connectors, some of the connectors might not have the
> >> >> >>>> opportunity to
> >> >> >>>>>>>>>>>>>>>>> support filter pushdown + as I know, currently filter
> >> >> >>>> pushdown works
> >> >> >>>>>>>>>>>>>>>>> only for scanning (not lookup). So cache filters +
> >> >> >>>> projections
> >> >> >>>>>>>>>>>>>>>>> optimization should be independent from other
> >> features.
> >> >> >>>>>>>>>>>>>>>>> 4) ALL cache implementation is a complex topic that
> >> >> involves
> >> >> >>>> multiple
> >> >> >>>>>>>>>>>>>>>>> aspects of how Flink is developing. Abandoning InputFormat
> >> >> >>>>>>>>>>>>>>>>> in favor of the FLIP-27 Source will make the ALL cache
> >> >> >>>>>>>>>>>>>>>>> implementation really complex and unclear, so maybe instead we can extend the
> >> >> >>>> functionality of
> >> >> >>>>>>>>>>>>>>>>> simple join, or keep InputFormat for the lookup join ALL
> >> >> >>>>>>>>>>>>>>>>> cache?
> >> >> >>>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>>> Best regards,
> >> >> >>>>>>>>>>>>>>>>> Smirnov Alexander
> >> >> >>>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>>> [1]
> >> >> >>>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>
> >> >> >>>>
> >> >> >>>
> >> >> >>
> >> >>
> >> https://guava.dev/releases/18.0/api/docs/com/google/common/cache/CacheBuilder.html#weigher(com.google.common.cache.Weigher)
> >> >> >>>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>>> чт, 5 мая 2022 г. в 20:34, Jark Wu <imjark@gmail.com
> >> >:
> >> >> >>>>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>>>> It's great to see the active discussion! I want to
> >> >> share
> >> >> >>> my
> >> >> >>>> ideas:
> >> >> >>>>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>>>> 1) implement the cache in framework vs. connectors
> >> base
> >> >> >>>>>>>>>>>>>>>>>> I don't have a strong opinion on this. Both ways
> >> should
> >> >> >>>> work (e.g.,
> >> >> >>>>>>>>>>>>>>> cache
> >> >> >>>>>>>>>>>>>>>>>> pruning, compatibility).
> >> >> >>>>>>>>>>>>>>>>>> The framework way can provide more concise
> >> interfaces.
> >> >> >>>>>>>>>>>>>>>>>> The connector base way can define more flexible
> >> cache
> >> >> >>>>>>>>>>>>>>>>>> strategies/implementations.
> >> >> >>>>>>>>>>>>>>>>>> We are still investigating a way to see if we can
> >> have
> >> >> >>> both
> >> >> >>>>>>>>>>>>>>> advantages.
> >> >> >>>>>>>>>>>>>>>>>> We should reach a consensus that the way should be a
> >> >> >> final
> >> >> >>>> state,
> >> >> >>>>>>>>>>>>>>> and we
> >> >> >>>>>>>>>>>>>>>>>> are on the path to it.
> >> >> >>>>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>>>> 2) filters and projections pushdown:
> >> >> >>>>>>>>>>>>>>>>>> I agree with Alex that the filter pushdown into
> >> cache
> >> >> >> can
> >> >> >>>> benefit a
> >> >> >>>>>>>>>>>>>>> lot
> >> >> >>>>>>>>>>>>>>>>> for
> >> >> >>>>>>>>>>>>>>>>>> ALL cache.
> >> >> >>>>>>>>>>>>>>>>>> However, this is not true for LRU cache. Connectors
> >> use
> >> >> >>>> cache to
> >> >> >>>>>>>>>>>>>>> reduce
> >> >> >>>>>>>>>>>>>>>>> IO
> >> >> >>>>>>>>>>>>>>>>>> requests to databases for better throughput.
> >> >> >>>>>>>>>>>>>>>>>> If a filter can prune 90% of data in the cache, we
> >> will
> >> >> >>>> have 90% of
> >> >> >>>>>>>>>>>>>>>>> lookup
> >> >> >>>>>>>>>>>>>>>>>> requests that can never be cached
> >> >> >>>>>>>>>>>>>>>>>> and hit the databases directly. That means the
> >> cache
> >> >> >> is
> >> >> >>>>>>>>>>>>>>> meaningless in
> >> >> >>>>>>>>>>>>>>>>>> this case.
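The two positions in this thread can be contrasted with a toy model (hypothetical counters standing in for real IO, not Flink code): caching empty results for filtered-out keys, as proposed earlier, still absorbs repeated lookups for pruned keys, while refusing to cache them sends every repeated lookup to the database.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Counts simulated database hits with and without "cache missing key".
public class FilterCacheInteractionSketch {

    static int lookupsToDatabase(List<String> keys, Predicate<String> filter,
                                 boolean cacheEmptyResults) {
        Map<String, List<String>> cache = new HashMap<>();
        int dbCalls = 0;
        for (String key : keys) {
            if (cache.containsKey(key)) continue; // served from cache
            dbCalls++;                            // simulated database hit
            List<String> rows = filter.test(key) ? List.of(key + "-row") : List.of();
            if (!rows.isEmpty() || cacheEmptyResults) {
                cache.put(key, rows);
            }
        }
        return dbCalls;
    }

    public static void main(String[] args) {
        // 90% of keys are pruned by the filter; each key is looked up 3 times.
        List<String> keys = new ArrayList<>();
        for (int round = 0; round < 3; round++) {
            for (int i = 0; i < 10; i++) keys.add("k" + i);
        }
        Predicate<String> filter = k -> k.equals("k0"); // only k0 survives
        System.out.println(lookupsToDatabase(keys, filter, true));  // 10
        System.out.println(lookupsToDatabase(keys, filter, false)); // 28
    }
}
```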
> >> >> >>>>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>>>> IMO, Flink SQL has provided a standard way to do
> >> >> filters
> >> >> >>>> and projects
> >> >> >>>>>>>>>>>>>>>>>> pushdown, i.e., SupportsFilterPushDown and
> >> >> >>>>>>>>>>>>>>> SupportsProjectionPushDown.
> >> >> >>>>>>>>>>>>>>>>>> Jdbc/hive/HBase haven't implemented the interfaces,
> >> >> >> don't
> >> >> >>>> mean it's
> >> >> >>>>>>>>>>>>>>> hard
> >> >> >>>>>>>>>>>>>>>>> to
> >> >> >>>>>>>>>>>>>>>>>> implement.
> >> >> >>>>>>>>>>>>>>>>>> They should implement the pushdown interfaces to
> >> reduce
> >> >> >> IO
> >> >> >>>> and the
> >> >> >>>>>>>>>>>>>>> cache
> >> >> >>>>>>>>>>>>>>>>>> size.
> >> >> >>>>>>>>>>>>>>>>>> That should be a final state that the scan source
> >> and
> >> >> >>>> lookup source
> >> >> >>>>>>>>>>>>>>> share
> >> >> >>>>>>>>>>>>>>>>>> the exact same pushdown implementation.
> >> >> >>>>>>>>>>>>>>>>>> I don't see why we need to duplicate the pushdown
> >> logic
> >> >> >> in
> >> >> >>>> caches,
> >> >> >>>>>>>>>>>>>>> which
> >> >> >>>>>>>>>>>>>>>>>> will complicate the lookup join design.
> >> >> >>>>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>>>> 3) ALL cache abstraction
> >> >> >>>>>>>>>>>>>>>>>> All cache might be the most challenging part of this
> >> >> >> FLIP.
> >> >> >>>> We have
> >> >> >>>>>>>>>>>>>>> never
> >> >> >>>>>>>>>>>>>>>>>> provided a reload-lookup public interface.
> >> >> >>>>>>>>>>>>>>>>>> Currently, we put the reload logic in the "eval"
> >> method
> >> >> >> of
> >> >> >>>>>>>>>>>>>>> TableFunction.
> >> >> >>>>>>>>>>>>>>>>>> That's hard for some sources (e.g., Hive).
> >> >> >>>>>>>>>>>>>>>>>> Ideally, connector implementation should share the
> >> >> logic
> >> >> >>> of
> >> >> >>>> reload
> >> >> >>>>>>>>>>>>>>> and
> >> >> >>>>>>>>>>>>>>>>>> scan, i.e. ScanTableSource with
> >> >> >>>> InputFormat/SourceFunction/FLIP-27
> >> >> >>>>>>>>>>>>>>>>> Source.
> >> >> >>>>>>>>>>>>>>>>>> However, InputFormat/SourceFunction are deprecated,
> >> and
> >> >> >>> the
> >> >> >>>> FLIP-27
> >> >> >>>>>>>>>>>>>>>>> source
> >> >> >>>>>>>>>>>>>>>>>> is deeply coupled with SourceOperator.
> >> >> >>>>>>>>>>>>>>>>>> If we want to invoke the FLIP-27 source in
> >> LookupJoin,
> >> >> >>> this
> >> >> >>>> may make
> >> >> >>>>>>>>>>>>>>> the
> >> >> >>>>>>>>>>>>>>>>>> scope of this FLIP much larger.
> >> >> >>>>>>>>>>>>>>>>>> We are still investigating how to abstract the ALL
> >> >> cache
> >> >> >>>> logic and
> >> >> >>>>>>>>>>>>>>> reuse
> >> >> >>>>>>>>>>>>>>>>>> the existing source interfaces.
> >> >> >>>>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>>>> Best,
> >> >> >>>>>>>>>>>>>>>>>> Jark
> >> >> >>>>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>>>> On Thu, 5 May 2022 at 20:22, Roman Boyko <
> >> >> >>>> ro.v.boyko@gmail.com>
> >> >> >>>>>>>>>>>>>>> wrote:
> >> >> >>>>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>>>>> It's a much more complicated activity and lies outside the
> >> >> >>>>>>>>>>>>>>>>>>> scope of this improvement, because such pushdowns should be done
> >> for
> >> >> >>> all
> >> >> >>>>>>>>>>>>>>>>> ScanTableSource
> >> >> >>>>>>>>>>>>>>>>>>> implementations (not only for Lookup ones).
> >> >> >>>>>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>>>>> On Thu, 5 May 2022 at 19:02, Martijn Visser <
> >> >> >>>>>>>>>>>>>>> martijnvisser@apache.org>
> >> >> >>>>>>>>>>>>>>>>>>> wrote:
> >> >> >>>>>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>>>>>> Hi everyone,
> >> >> >>>>>>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>>>>>> One question regarding "And Alexander correctly
> >> >> >>> mentioned
> >> >> >>>> that
> >> >> >>>>>>>>>>>>>>> filter
> >> >> >>>>>>>>>>>>>>>>>>>> pushdown still is not implemented for
> >> >> >> jdbc/hive/hbase."
> >> >> >>>> -> Would
> >> >> >>>>>>>>>>>>>>> an
> >> >> >>>>>>>>>>>>>>>>>>>> alternative solution be to actually implement
> >> these
> >> >> >>> filter
> >> >> >>>>>>>>>>>>>>> pushdowns?
> >> >> >>>>>>>>>>>>>>>>> I
> >> >> >>>>>>>>>>>>>>>>>>>> can
> >> >> >>>>>>>>>>>>>>>>>>>> imagine that there are many more benefits to doing
> >> >> >> that,
> >> >> >>>> outside
> >> >> >>>>>>>>>>>>>>> of
> >> >> >>>>>>>>>>>>>>>>> lookup
> >> >> >>>>>>>>>>>>>>>>>>>> caching and metrics.
> >> >> >>>>>>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>>>>>> Best regards,
> >> >> >>>>>>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>>>>>> Martijn Visser
> >> >> >>>>>>>>>>>>>>>>>>>> https://twitter.com/MartijnVisser82
> >> >> >>>>>>>>>>>>>>>>>>>> https://github.com/MartijnVisser
> >> >> >>>>>>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>>>>>> On Thu, 5 May 2022 at 13:58, Roman Boyko <
> >> >> >>>> ro.v.boyko@gmail.com>
> >> >> >>>>>>>>>>>>>>>>> wrote:
> >> >> >>>>>>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>>>>>>> Hi everyone!
> >> >> >>>>>>>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>>>>>>> Thanks for driving such a valuable improvement!
> >> >> >>>>>>>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>>>>>>> I do think that single cache implementation
> >> would be
> >> >> >> a
> >> >> >>>> nice
> >> >> >>>>>>>>>>>>>>>>> opportunity
> >> >> >>>>>>>>>>>>>>>>>>>> for
> >> >> >>>>>>>>>>>>>>>>>>>>> users. And it will break the "FOR SYSTEM_TIME AS
> >> OF
> >> >> >>>> proc_time"
> >> >> >>>>>>>>>>>>>>>>> semantics
> >> >> >>>>>>>>>>>>>>>>>>>>> anyway - doesn't matter how it will be
> >> implemented.
> >> >> >>>>>>>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>>>>>>> Putting myself in the user's shoes, I can say
> >> that:
> >> >> >>>>>>>>>>>>>>>>>>>>> 1) I would prefer to have the opportunity to cut
> >> off
> >> >> >>> the
> >> >> >>>> cache
> >> >> >>>>>>>>>>>>>>> size
> >> >> >>>>>>>>>>>>>>>>> by
> >> >> >>>>>>>>>>>>>>>>>>>>> simply filtering unnecessary data. And the most
> >> >> handy
> >> >> >>>> way to do
> >> >> >>>>>>>>>>>>>>> it
> >> >> >>>>>>>>>>>>>>>>> is
> >> >> >>>>>>>>>>>>>>>>>>>> apply
> >> >> >>>>>>>>>>>>>>>>>>>>> it inside LookupRunners. It would be a bit
> >> harder to
> >> >> >>>> pass it
> >> >> >>>>>>>>>>>>>>>>> through the
> >> >> >>>>>>>>>>>>>>>>>>>>> LookupJoin node to TableFunction. And Alexander
> >> >> >>> correctly
> >> >> >>>>>>>>>>>>>>> mentioned
> >> >> >>>>>>>>>>>>>>>>> that
> >> >> >>>>>>>>>>>>>>>>>>>>> filter pushdown still is not implemented for
> >> >> >>>> jdbc/hive/hbase.
> >> >> >>>>>>>>>>>>>>>>>>>>> 2) The ability to set the different caching
> >> >> >> parameters
> >> >> >>>> for
> >> >> >>>>>>>>>>>>>>> different
> >> >> >>>>>>>>>>>>>>>>>>>> tables
> >> >> >>>>>>>>>>>>>>>>>>>>> is quite important. So I would prefer to set it
> >> >> >> through
> >> >> >>>> DDL
> >> >> >>>>>>>>>>>>>>> rather
> >> >> >>>>>>>>>>>>>>>>> than
> >> >> >>>>>>>>>>>>>>>>>>>>> have similar TTL, strategy and other options for
> >> >> all
> >> >> >>>> lookup
> >> >> >>>>>>>>>>>>>>> tables.
> >> >> >>>>>>>>>>>>>>>>>>>>> 3) Putting the cache into the framework really
> >> >> >>>> deprives us of
> >> >> >>>>>>>>>>>>>>>>>>>>> extensibility (users won't be able to implement
> >> >> their
> >> >> >>> own
> >> >> >>>>>>>>>>>>>>> cache).
> >> >> >>>>>>>>>>>>>>>>> But
> >> >> >>>>>>>>>>>>>>>>>>>> most
> >> >> >>>>>>>>>>>>>>>>>>>>> probably it might be solved by creating more
> >> >> >> different
> >> >> >>>> cache
> >> >> >>>>>>>>>>>>>>>>> strategies
> >> >> >>>>>>>>>>>>>>>>>>>> and
> >> >> >>>>>>>>>>>>>>>>>>>>> a wider set of configurations.
> >> >> >>>>>>>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>>>>>>> All these points are much closer to the schema
> >> >> >> proposed
> >> >> >>>> by
> >> >> >>>>>>>>>>>>>>>>> Alexander.
> >> >> >>>>>>>>>>>>>>>>>>>>> Qingshen Ren, please correct me if I'm not right
> >> and
> >> >> >>> all
> >> >> >>>> these
> >> >> >>>>>>>>>>>>>>>>>>>> facilities
> >> >> >>>>>>>>>>>>>>>>>>>>> might be simply implemented in your architecture?
> >> >> >>>>>>>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>>>>>>> Best regards,
> >> >> >>>>>>>>>>>>>>>>>>>>> Roman Boyko
> >> >> >>>>>>>>>>>>>>>>>>>>> e.: ro.v.boyko@gmail.com
> >> >> >>>>>>>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>>>>>>> On Wed, 4 May 2022 at 21:01, Martijn Visser <
> >> >> >>>>>>>>>>>>>>>>> martijnvisser@apache.org>
> >> >> >>>>>>>>>>>>>>>>>>>>> wrote:
> >> >> >>>>>>>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>>>>>>>> Hi everyone,
> >> >> >>>>>>>>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>>>>>>>> I don't have much to chip in, but just wanted to
> >> >> >>>> express that
> >> >> >>>>>>>>>>>>>>> I
> >> >> >>>>>>>>>>>>>>>>> really
> >> >> >>>>>>>>>>>>>>>>>>>>>> appreciate the in-depth discussion on this topic
> >> >> >> and I
> >> >> >>>> hope
> >> >> >>>>>>>>>>>>>>> that
> >> >> >>>>>>>>>>>>>>>>>>>> others
> >> >> >>>>>>>>>>>>>>>>>>>>>> will join the conversation.
> >> >> >>>>>>>>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>>>>>>>> Best regards,
> >> >> >>>>>>>>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>>>>>>>> Martijn
> >> >> >>>>>>>>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>>>>>>>> On Tue, 3 May 2022 at 10:15, Александр Смирнов <
> >> >> >>>>>>>>>>>>>>>>> smiralexan@gmail.com>
> >> >> >>>>>>>>>>>>>>>>>>>>>> wrote:
> >> >> >>>>>>>>>>>>>>>>>>>>>>
>> Hi Qingsheng, Leonard and Jark,
>>
>> Thanks for your detailed feedback! However, I have questions about some
>> of your statements (maybe I misunderstood something?).
>>
>> > Caching actually breaks the semantic of "FOR SYSTEM_TIME AS OF
>> > proc_time"
>>
>> I agree that the semantics of "FOR SYSTEM_TIME AS OF proc_time" are not
>> fully preserved with caching, but as you said, users accept this
>> consciously to achieve better performance (no one has proposed enabling
>> caching by default, etc.). Or by "users" do you mean other developers of
>> connectors? In that case developers explicitly declare whether their
>> connector supports caching (in the list of supported options); no one
>> forces them to do so if they don't want to. So what exactly is the
>> difference, from this point of view, between implementing caching in the
>> flink-table-runtime module and in flink-table-common? How does the
>> location affect whether the semantics of "FOR SYSTEM_TIME AS OF
>> proc_time" are broken or not?
>>
>> > confront a situation that allows table options in DDL to control the
>> > behavior of the framework, which has never happened previously and
>> > should be cautious
>>
>> If we talk about the main semantic difference between DDL options and
>> config options ("table.exec.xxx"), isn't it about the scope of the
>> option and its importance to the user's business logic, rather than the
>> specific location of the corresponding logic in the framework? I mean
>> that in my design, for example, putting the lookup cache strategy into
>> configurations would be the wrong decision, because it directly affects
>> the user's business logic (not just performance optimization) and
>> touches only a few functions of ONE table (there can be multiple tables
>> with different caches). Does it really matter to the user (or anyone
>> else) where the logic affected by the applied option is located?
>> I can also point to the DDL option 'sink.parallelism', which in some way
>> "controls the behavior of the framework", and I don't see any problem
>> there.
>>
>> > introduce a new interface for this all-caching scenario and the design
>> > would become more complex
>>
>> This is a subject for a separate discussion, but in our internal version
>> we actually solved this problem quite easily: we reused the InputFormat
>> class (so there is no need for a new API). The point is that all lookup
>> connectors currently use InputFormat for scanning the data in batch
>> mode: HBase, JDBC and even Hive - it uses the class PartitionReader,
>> which is actually just a wrapper around InputFormat. The advantage of
>> this solution is the ability to reload cache data in parallel (the
>> number of threads depends on the number of InputSplits, but has an upper
>> limit). As a result, cache reload time is significantly reduced (as well
>> as the time the input stream is blocked). I know that we usually try to
>> avoid concurrency in Flink code, but maybe this case can be an
>> exception. BTW, I don't claim it's an ideal solution; maybe there are
>> better ones.
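[Editor's note: the parallel cache reload described above can be sketched roughly as follows. This is a minimal plain-Java illustration, where the integer "splits" and the `readSplit` helper are hypothetical stand-ins for Flink's `InputSplit`/`InputFormat` API; it is not the actual internal implementation.]

```java
import java.util.*;
import java.util.concurrent.*;

/** Sketch: reload a lookup cache by reading "splits" in parallel, with an
 *  upper bound on the number of threads. readSplit(...) is a stand-in for
 *  opening an InputSplit with an InputFormat and reading all its records. */
public class ParallelCacheReload {

    static List<String> readSplit(int split) {
        // stand-in for InputFormat.open(split) + reading every record
        return Arrays.asList("row-" + split + "-a", "row-" + split + "-b");
    }

    public static Map<String, String> reload(int numSplits, int maxThreads)
            throws Exception {
        ExecutorService pool =
                Executors.newFixedThreadPool(Math.min(numSplits, maxThreads));
        try {
            // one read task per split, executed by at most maxThreads threads
            List<Future<List<String>>> futures = new ArrayList<>();
            for (int s = 0; s < numSplits; s++) {
                final int split = s;
                futures.add(pool.submit(() -> readSplit(split)));
            }
            // collect all rows into the new cache snapshot
            Map<String, String> cache = new ConcurrentHashMap<>();
            for (Future<List<String>> f : futures) {
                for (String row : f.get()) {
                    cache.put(row, row); // key extraction elided for brevity
                }
            }
            return cache;
        } finally {
            pool.shutdown();
        }
    }
}
```

With 4 splits and a 2-thread limit, the reload produces one cache entry per row while never running more than two split readers concurrently.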
>>
>> > Providing the cache in the framework might introduce compatibility
>> > issues
>>
>> That is possible only if the developer of the connector doesn't properly
>> refactor the code and uses the new cache options incorrectly (i.e.
>> explicitly wires the same options into two different places). For
>> correct behavior, all they need to do is redirect the existing options
>> to the framework's LookupConfig (plus maybe add an alias for options if
>> the naming differed); everything will be transparent to users. If the
>> developer doesn't refactor at all, nothing changes for the connector,
>> thanks to backward compatibility. Also, if a developer wants to use
>> their own cache logic, they can simply refuse to pass some of the
>> configs to the framework and instead provide their own implementation
>> with the already existing configs and metrics (but I think that's a rare
>> case).
>>
>> > filters and projections should be pushed all the way down to the table
>> > function, like what we do in the scan source
>>
>> That is a great goal. But the truth is that the ONLY connector that
>> currently supports filter pushdown is FileSystemTableSource (no database
>> connector supports it). Also, for some databases it's simply impossible
>> to push down filters as complex as the ones we have in Flink.
>>
>> > only applying these optimizations to the cache seems not quite useful
>>
>> Filters can cut off an arbitrarily large amount of data from the
>> dimension table. As a simple example, suppose the dimension table
>> 'users' has a column 'age' with values from 20 to 40, and the input
>> stream 'clicks' is roughly uniformly distributed by age of users. With
>> the filter 'age > 30' there will be half as much data in the cache,
>> which means the user can increase 'lookup.cache.max-rows' by almost 2x -
>> a huge performance boost. Moreover, this optimization really starts to
>> shine with the 'ALL' cache, where tables that can't fit in memory
>> without filters and projections can fit with them. This opens up
>> additional possibilities for users, and it doesn't sound "not quite
>> useful" to me.
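[Editor's note: the 'age > 30' example above can be made concrete with a small sketch. This is plain Java with an illustrative `[id, age]` row layout, not Flink code; it only shows that admitting rows through the filter before caching roughly halves the number of cached entries for a uniform age distribution.]

```java
import java.util.*;
import java.util.function.Predicate;

/** Sketch: only rows passing the filter are admitted to the lookup cache.
 *  Rows are int[] {id, age}; the cache maps id -> age. */
public class FilteredCacheDemo {

    static Map<Integer, Integer> buildCache(List<int[]> rows,
                                            Predicate<int[]> filter) {
        Map<Integer, Integer> cache = new HashMap<>();
        for (int[] row : rows) {
            if (filter.test(row)) { // filter applied BEFORE caching
                cache.put(row[0], row[1]);
            }
        }
        return cache;
    }

    public static void main(String[] args) {
        // 100 users with ages spread uniformly over 20..40
        List<int[]> users = new ArrayList<>();
        for (int id = 0; id < 100; id++) {
            users.add(new int[] {id, 20 + id % 21});
        }
        Map<Integer, Integer> unfiltered = buildCache(users, r -> true);
        Map<Integer, Integer> filtered = buildCache(users, r -> r[1] > 30);
        System.out.println(unfiltered.size()); // 100
        System.out.println(filtered.size());   // roughly half (45)
    }
}
```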
>>
>> It would be great to hear other voices on this topic! We have quite a
>> few controversial points, and I think with the help of others it will be
>> easier for us to come to a consensus.
>>
>> Best regards,
>> Smirnov Alexander
>>
>>
>> On Fri, 29 Apr 2022 at 22:33, Qingsheng Ren <renqschn@gmail.com> wrote:
>>>
>>> Hi Alexander and Arvid,
>>>
>>> Thanks for the discussion and sorry for my late response! We had an
>>> internal discussion together with Jark and Leonard, and I'd like to
>>> summarize our ideas. Instead of implementing the cache logic in the
>>> table runtime layer or wrapping it around the user-provided table
>>> function, we prefer to introduce some new APIs extending TableFunction,
>>> with these concerns:
>>>
>>> 1. Caching actually breaks the semantic of "FOR SYSTEM_TIME AS OF
>>> proc_time", because it can't truly reflect the content of the lookup
>>> table at the moment of querying. If users choose to enable caching on
>>> the lookup table, they implicitly indicate that this breakage is
>>> acceptable in exchange for performance. So we prefer not to provide
>>> caching at the table runtime level.
>>>
>>> 2. If we put the cache implementation in the framework (whether in a
>>> runner or in a wrapper around TableFunction), we have to confront a
>>> situation that allows table options in DDL to control the behavior of
>>> the framework, which has never happened previously and should be
>>> treated with caution. Under the current design, the behavior of the
>>> framework should only be specified by configurations
>>> ("table.exec.xxx"), and it's hard to apply these general configs to a
>>> specific table.
>>>
>>> 3. We have use cases where the lookup source loads and periodically
>>> refreshes all records in memory to achieve high lookup performance
>>> (like the Hive connector in the community, and also widely used by our
>>> internal connectors). Wrapping the cache around the user's
>>> TableFunction works fine for LRU caches, but I think we'd have to
>>> introduce a new interface for this all-caching scenario, and the design
>>> would become more complex.
>>>
>>> 4. Providing the cache in the framework might introduce compatibility
>>> issues for existing lookup sources: there might be two caches with
>>> totally different strategies if the user configures the table
>>> incorrectly (one in the framework and another implemented by the lookup
>>> source).
>>>
>>> As for the optimization mentioned by Alexander, I think filters and
>>> projections should be pushed all the way down to the table function,
>>> like what we do in the scan source, instead of being applied in the
>>> runner with the cache. The goal of using a cache is to reduce network
>>> I/O and the pressure on the external system, and only applying these
>>> optimizations to the cache seems not quite useful.
>>>
>>> I made some updates to the FLIP [1] to reflect our ideas. We prefer to
>>> keep the cache implementation as part of TableFunction, and we could
>>> provide some helper classes (CachingTableFunction,
>>> AllCachingTableFunction, CachingAsyncTableFunction) to developers and
>>> regulate the metrics of the cache. Also, I made a POC [2] for your
>>> reference.
>>>
>>> Looking forward to your ideas!
>>>
>>> [1]
>>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-221+Abstraction+for+lookup+source+cache+and+metric
>>> [2] https://github.com/PatrickRen/flink/tree/FLIP-221
>>>
>>> Best regards,
>>>
>>> Qingsheng
>>>
>>> On Tue, Apr 26, 2022 at 4:45 PM Александр Смирнов
>>> <smiralexan@gmail.com> wrote:
>>>>
>>>> Thanks for the response, Arvid!
>>>>
>>>> I have a few comments on your message.
>>>>
>>>> > but could also live with an easier solution as the first step:
>>>>
>>>> I think these two approaches (the one originally proposed by Qingsheng
>>>> and mine) are mutually exclusive, because conceptually they pursue the
>>>> same goal but differ in implementation details. If we go one way,
>>>> moving to the other in the future will mean deleting existing code and
>>>> once again changing the API for connectors. So I think we should reach
>>>> a consensus with the community first and then work together on this
>>>> FLIP, i.e. divide the work into tasks for the different parts of the
>>>> FLIP (for example, LRU cache unification / introducing the proposed
>>>> set of metrics / further work…). WDYT, Qingsheng?
>>>>
>>>> > as the source will only receive the requests after filter
>>>>
>>>> Actually, if filters are applied to fields of the lookup table, we
>>>> must first make the requests, and only after that can we filter the
>>>> responses, because lookup connectors don't have filter pushdown. So if
>>>> filtering is done before caching, there will be far fewer rows in the
>>>> cache.
>>>>
>>>> > @Alexander unfortunately, your architecture is not shared. I don't
>>>> > know the solution to share images to be honest.
>>>>
>>>> Sorry for that, I'm a bit new to these kinds of conversations :)
>>>> I have no write access to Confluence, so I created a Jira issue where
>>>> I described the proposed changes in more detail:
>>>> https://issues.apache.org/jira/browse/FLINK-27411.
>>>>
>>>> Happy to get more feedback!
>>>>
>>>> Best,
>>>> Smirnov Alexander
>>>>
>>>> On Mon, 25 Apr 2022 at 19:49, Arvid Heise <arvid@apache.org> wrote:
>>>>>
>>>>> Hi Qingsheng,
>>>>>
>>>>> Thanks for driving this; the inconsistency was not satisfying for me.
>>>>>
>>>>> I second Alexander's idea but could also live with an easier solution
>>>>> as the first step: instead of making caching an implementation detail
>>>>> of a TableFunction X, devise a caching layer around X. So the
>>>>> proposal would be a CachingTableFunction that delegates to X on
>>>>> misses and otherwise manages the cache. Lifting it into the operator
>>>>> model as proposed would be even better, but is probably unnecessary
>>>>> as a first step for a lookup source (as the source will only receive
>>>>> the requests after the filter; applying the projection may be more
>>>>> interesting, to save memory).
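[Editor's note: a caching layer that delegates to the underlying lookup function on misses can be sketched as below. This is a minimal plain-Java illustration, assuming a hypothetical `Function<K, V>` delegate in place of Flink's actual `TableFunction`; the eviction policy is a simple LRU built on an access-ordered `LinkedHashMap`.]

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

/** Sketch: an LRU cache wrapped around a lookup delegate. On a cache miss
 *  the delegate (stand-in for the underlying TableFunction) is invoked;
 *  on a hit the external system is not touched at all. */
public class CachingLookup<K, V> {
    private final Function<K, V> delegate;
    private final Map<K, V> cache;
    private long misses = 0;

    public CachingLookup(int maxRows, Function<K, V> delegate) {
        this.delegate = delegate;
        // access-ordered LinkedHashMap that evicts the eldest entry = LRU
        this.cache = new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                return size() > maxRows;
            }
        };
    }

    public V lookup(K key) {
        V value = cache.get(key);
        if (value == null) {            // miss: delegate, then cache
            misses++;
            value = delegate.apply(key);
            cache.put(key, value);
        }
        return value;
    }

    public long getMisses() {
        return misses;
    }
}
```

For instance, two consecutive lookups of the same key should hit the delegate only once; once the cache exceeds `maxRows`, the least-recently-used key is evicted and must be fetched again.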
>>>>>
>>>>> Another advantage is that all the changes of this FLIP would be
>>>>> limited to options; no need for new public interfaces. Everything
>>>>> else remains an implementation detail of the Table runtime. That
>>>>> means we can easily incorporate the optimization potential that
>>>>> Alexander pointed out later.
>>>>>
>>>>> @Alexander unfortunately, your architecture is not shared. I don't
>>>>> know the solution to share images, to be honest.
>>>>>
>>>>> On Fri, Apr 22, 2022 at 5:04 PM Александр Смирнов
>>>>> <smiralexan@gmail.com> wrote:
>>>>>>
>>>>>> Hi Qingsheng! My name is Alexander; I'm not a committer yet, but I'd
>>>>>> really like to become one, and this FLIP really interested me.
>>>>>> I have actually worked on a similar feature in my company's Flink
>>>>>> fork, and we would like to share our thoughts on this and open
>>>>>> source the code.
>>>>>>
>>>>>> I think there is a better alternative than introducing an abstract
>>>>>> class for TableFunction (CachingTableFunction). As you know,
>>>>>> TableFunction lives in the flink-table-common module, which provides
>>>>>> only an API for working with tables - very convenient to import in
>>>>>> connectors. In turn, CachingTableFunction contains logic for runtime
>>>>>> execution, so this class and everything connected with it should be
>>>>>> located in another module, probably flink-table-runtime. But that
>>>>>> would require connectors to depend on a module containing a lot of
>>>>>> runtime logic, which doesn't sound good.
>>>>>>
>>>>>> I suggest adding a new method 'getLookupConfig' to LookupTableSource
>>>>>> or LookupRuntimeProvider so that connectors only pass configurations
>>>>>> to the planner, and therefore don't depend on the runtime
>>>>>> realization. Based on these configs, the planner will construct a
>>>>>> lookup join operator with the corresponding runtime logic
>>>>>> (ProcessFunctions in module flink-table-runtime). The architecture
>>>>>> looks like the pinned image (the LookupConfig class there is
>>>>>> actually your CacheConfig).
>>>>>>
>>>>>> The classes in flink-table-planner responsible for this are
>>>>>> CommonPhysicalLookupJoin and its inheritors.
>>>>>> The current classes for lookup join in flink-table-runtime are
>>>>>> LookupJoinRunner, AsyncLookupJoinRunner, LookupJoinRunnerWithCalc
>>>>>> and AsyncLookupJoinRunnerWithCalc.
>>>>>>
>>>>>> I suggest adding classes LookupJoinCachingRunner,
>>>>>> LookupJoinCachingRunnerWithCalc, etc.
>>>>>>
>>>>>> And here comes another, more powerful advantage of such a solution.
>>>>>> If we have the caching logic on a lower level, we can apply some
>>>>>> optimizations to it. LookupJoinRunnerWithCalc is named that way
>>>>>> because it uses the 'calc' function, which mostly consists of
>>>>>> filters and projections.
>>>>>>
>>>>>> For example, when joining table A with lookup table B on the
>>>>>> condition 'JOIN … ON A.id = B.id AND A.age = B.age + 10 WHERE
>>>>>> B.salary > 1000', the 'calc' function will contain the filters
>>>>>> A.age = B.age + 10 and B.salary > 1000.
>>>>>>
>>>>>> If we apply this function before storing records in the cache, the
>>>>>> size of the cache will be significantly reduced: filters avoid
>>>>>> storing useless records, and projections reduce each record's size.
>>>>>> So the initial maximum number of records in the cache can be
>>>>>> increased by the user.
> >> >> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>>>>>>>>>>>>> What do you think about it?
> >> >> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>>>>>>>>>>>>> On 2022/04/19 02:47:11 Qingsheng Ren wrote:
> >> >> >>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi devs,
> >> >> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>>>>>>>>>>>>>> Yuan and I would like to start a
> >> discussion
> >> >> >>> about
> >> >> >>>>>>>>>>>>>>>>>>>> FLIP-221[1],
> >> >> >>>>>>>>>>>>>>>>>>>>>>> which
> >> >> >>>>>>>>>>>>>>>>>>>>>>>>>>> introduces an abstraction of lookup table
> >> >> cache
> >> >> >>> and
> >> >> >>>>>>>>>>>>>>> its
> >> >> >>>>>>>>>>>>>>>>>>>> standard
> >> >> >>>>>>>>>>>>>>>>>>>>>>> metrics.
> >> >> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>>>>>>>>>>>>>> Currently each lookup table source should
> >> >> >>>> implement
> >> >> >>>>>>>>>>>>>>>>> their
> >> >> >>>>>>>>>>>>>>>>>>>> own
> >> >> >>>>>>>>>>>>>>>>>>>>>>> cache to
> >> >> >>>>>>>>>>>>>>>>>>>>>>>>>>> store lookup results, and there isn’t a
> >> >> >> standard
> >> >> >>> of
> >> >> >>>>>>>>>>>>>>>>> metrics
> >> >> >>>>>>>>>>>>>>>>>>>> for
> >> >> >>>>>>>>>>>>>>>>>>>>>>> users and
> >> >> >>>>>>>>>>>>>>>>>>>>>>>>>>> developers to tuning their jobs with lookup
> >> >> >>> joins,
> >> >> >>>>>>>>>>>>>>> which
> >> >> >>>>>>>>>>>>>>>>> is a
> >> >> >>>>>>>>>>>>>>>>>>>>>> quite
> >> >> >>>>>>>>>>>>>>>>>>>>>>> common
> >> >> >>>>>>>>>>>>>>>>>>>>>>>>>>> use case in Flink table / SQL.
> >> >> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>>>>>>>>>>>>>> Therefore we propose some new APIs
> >> including
> >> >> >>>> cache,
> >> >> >>>>>>>>>>>>>>>>>>>> metrics,
> >> >> >>>>>>>>>>>>>>>>>>>>>>> wrapper
> >> >> >>>>>>>>>>>>>>>>>>>>>>>>>>> classes of TableFunction and new table
> >> >> options.
> >> >> >>>>>>>>>>>>>>> Please
> >> >> >>>>>>>>>>>>>>>>> take a
> >> >> >>>>>>>>>>>>>>>>>>>>> look
> >> >> >>>>>>>>>>>>>>>>>>>>>>> at the
> >> >> >>>>>>>>>>>>>>>>>>>>>>>>>>> FLIP page [1] to get more details. Any
> >> >> >>> suggestions
> >> >> >>>>>>>>>>>>>>> and
> >> >> >>>>>>>>>>>>>>>>>>>> comments
> >> >> >>>>>>>>>>>>>>>>>>>>>>> would be
> >> >> >>>>>>>>>>>>>>>>>>>>>>>>>>> appreciated!
> >> >> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>>>>>>>>>>>>>> [1]
> >> >> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>
> >> >> >>>>
> >> >> >>>
> >> >> >>
> >> >>
> >> https://cwiki.apache.org/confluence/display/FLINK/FLIP-221+Abstraction+for+lookup+source+cache+and+metric
> >> >> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>>>>>>>>>>>>>> Best regards,
> >> >> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>>>>>>>>>>>>>> Qingsheng
> >> >> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>>>>>>>>>> --
> >> >> >>>>>>>>>>>>>>>>>>>>>>>> Best Regards,
> >> >> >>>>>>>>>>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>>>>>>>>>> Qingsheng Ren
> >> >> >>>>>>>>>>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>>>>>>>>>> Real-time Computing Team
> >> >> >>>>>>>>>>>>>>>>>>>>>>>> Alibaba Cloud
> >> >> >>>>>>>>>>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>>>>>>>>>> Email: renqschn@gmail.com
> >> >> >>>>>>>>>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>>>>> --
> >> >> >>>>>>>>>>>>>>>>>>> Best regards,
> >> >> >>>>>>>>>>>>>>>>>>> Roman Boyko
> >> >> >>>>>>>>>>>>>>>>>>> e.: ro.v.boyko@gmail.com
> >> >> >>>>>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>>
> >> >> >>>>>>>>>>>>
> >> >> >>>>>>>>>>>>
> >> >> >>>>>>>>>>>> --
> >> >> >>>>>>>>>>>> Best Regards,
> >> >> >>>>>>>>>>>>
> >> >> >>>>>>>>>>>> Qingsheng Ren
> >> >> >>>>>>>>>>>>
> >> >> >>>>>>>>>>>> Real-time Computing Team
> >> >> >>>>>>>>>>>> Alibaba Cloud
> >> >> >>>>>>>>>>>>
> >> >> >>>>>>>>>>>> Email: renqschn@gmail.com
> >> >> >>>>>>>>>>
> >> >> >>>>
> >> >> >>>>
> >> >> >>>
> >> >> >>
> >> >>
> >> >>
> >>
> >
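The cache-size optimization described in the quoted proposal (applying the join's filters and projections before records enter the cache) can be sketched in isolation. This is an illustrative, dependency-free sketch: FilteringCache is a hypothetical class, and plain Object[] rows stand in for Flink's RowData.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.function.Predicate;

/**
 * Sketch: run the join's "calc" (filter + projection) before rows enter the
 * lookup cache, so rows that can never match are not stored, and stored rows
 * shrink to only the needed columns. In Flink the calc function is generated
 * by the planner; here it is modeled as a Predicate plus a Function.
 */
final class FilteringCache {
    private final Map<String, List<Object[]>> cache = new HashMap<>();
    private final Predicate<Object[]> filter;            // e.g. B.salary > 1000
    private final Function<Object[], Object[]> project;  // keep needed columns

    FilteringCache(Predicate<Object[]> filter, Function<Object[], Object[]> project) {
        this.filter = filter;
        this.project = project;
    }

    /** Rows failing the filter are never cached; kept rows are projected first. */
    void put(String key, List<Object[]> lookedUpRows) {
        List<Object[]> kept = new ArrayList<>();
        for (Object[] row : lookedUpRows) {
            if (filter.test(row)) {
                kept.add(project.apply(row));
            }
        }
        cache.put(key, kept);
    }

    List<Object[]> getIfPresent(String key) {
        return cache.get(key);
    }
}
```

With the B.salary > 1000 filter from the example above, rows with salary 500 are never cached and surviving rows keep only the projected columns, so the same memory budget fits more keys.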


Re: [DISCUSS] FLIP-221 Abstraction for lookup source cache and metric

Posted by Jingsong Li <ji...@gmail.com>.
Hi all,

I think the problems now are as follows:
1. The AllCache and PartialCache interfaces are not uniform: one needs to
provide a LookupProvider, the other needs to provide a CacheBuilder.
2. The AllCache definition is not flexible. For example, PartialCache can use
any custom storage while AllCache cannot; AllCache should also be able to
store to memory or disk, which likewise needs a flexible strategy.
3. AllCache cannot customize its ReloadStrategy; currently there is only
ScheduledReloadStrategy.

In order to solve the above problems, the following are my ideas.

## Top level cache interfaces:

```

public interface CacheLookupProvider extends
LookupTableSource.LookupRuntimeProvider {

    CacheBuilder createCacheBuilder();
}


public interface CacheBuilder {
    Cache create();
}


public interface Cache {

    /**
     * Returns the value associated with key in this cache, or null if there
     * is no cached value for key.
     */
    @Nullable
    Collection<RowData> getIfPresent(RowData key);

    /** Returns the number of key-value mappings in the cache. */
    long size();
}

```

## Partial cache

```

public interface PartialCacheLookupFunction extends CacheLookupProvider {

    @Override
    PartialCacheBuilder createCacheBuilder();

    /** Creates an {@link LookupFunction} instance. */
    LookupFunction createLookupFunction();
}


public interface PartialCacheBuilder extends CacheBuilder {

    PartialCache create();
}


public interface PartialCache extends Cache {

    /**
     * Associates the specified value rows with the specified key row in the
     * cache. If the cache previously contained a value associated with the
     * key, the old value is replaced by the specified value.
     *
     * @param key key row with which the specified value is to be associated
     * @param value value rows to be associated with the specified key
     * @return the previous value rows associated with the key, or null if
     *     there was no mapping for the key
     */
    Collection<RowData> put(RowData key, Collection<RowData> value);

    /** Discards any cached value for the specified key. */
    void invalidate(RowData key);
}

```
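To make the proposed shape concrete, here is a minimal, dependency-free sketch of a PartialCache implementation. String keys and List<String> values stand in for Flink's RowData, the backing store is a plain HashMap, and there is no eviction policy; MapPartialCache is an illustrative name, not part of the proposal.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/**
 * Minimal in-memory sketch of the proposed PartialCache: get-if-present,
 * put returning the previous mapping, invalidate, and size.
 */
final class MapPartialCache {
    private final Map<String, List<String>> entries = new HashMap<>();

    /** Returns cached rows for the key, or null if nothing is cached. */
    List<String> getIfPresent(String key) {
        return entries.get(key);
    }

    /** Stores rows for the key; returns the previously cached rows, or null. */
    List<String> put(String key, List<String> value) {
        return entries.put(key, new ArrayList<>(value));
    }

    /** Discards any cached value for the key. */
    void invalidate(String key) {
        entries.remove(key);
    }

    /** Number of key-value mappings in the cache. */
    long size() {
        return entries.size();
    }
}
```

A real implementation would bound the size (e.g. LRU eviction, TTL) and use RowData keys; the point here is only the interface shape.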

## All cache
```

public interface AllCacheLookupProvider extends CacheLookupProvider {

    void registerReloadStrategy(
            ScheduledExecutorService executorService, Reloader reloader);

    ScanTableSource.ScanRuntimeProvider getScanRuntimeProvider();

    @Override
    AllCacheBuilder createCacheBuilder();
}


public interface AllCacheBuilder extends CacheBuilder {

    AllCache create();
}


public interface AllCache extends Cache {

    void putAll(Iterator<Map<RowData, RowData>> allEntries);

    void clearAll();
}


public interface Reloader {

    void reload();
}

```
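How a caller might register a periodic reload strategy through the proposed registerReloadStrategy(...) could look roughly like this. PeriodicReloadExample and registerPeriodic are illustrative names only, and the Reloader interface is mirrored locally so the sketch is self-contained.

```java
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

/** Sketch: schedule a Reloader at a fixed delay, one possible ReloadStrategy. */
final class PeriodicReloadExample {

    /** Mirrors the Reloader interface proposed above. */
    interface Reloader {
        void reload();
    }

    /** Runs reloader.reload() immediately, then after every intervalMs pause. */
    static ScheduledFuture<?> registerPeriodic(
            ScheduledExecutorService executorService, Reloader reloader, long intervalMs) {
        return executorService.scheduleWithFixedDelay(
                reloader::reload, 0L, intervalMs, TimeUnit.MILLISECONDS);
    }
}
```

scheduleWithFixedDelay with an initial delay of 0 triggers the first reload immediately and then waits intervalMs between the end of one reload and the start of the next; other strategies (daily at a fixed time, external notification) would plug in the same way.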

Best,
Jingsong

On Fri, May 27, 2022 at 11:10 AM Jingsong Li <ji...@gmail.com> wrote:

> Thanks Qingsheng and all for your discussion.
>
> Very sorry to jump in so late.
>
> Maybe I missed something?
> My first impression when I saw the cache interface was: why don't we
> provide an interface similar to the Guava cache [1]? On top of the Guava
> cache, Caffeine also adds extensions for asynchronous calls [2], and
> Caffeine supports bulk loading too.
>
> I am also confused about why we first go from LookupCacheFactory.Builder
> to a Factory and only then create the Cache.
>
> [1] https://github.com/google/guava
> [2] https://github.com/ben-manes/caffeine/wiki/Population
>
> Best,
> Jingsong
>
> On Thu, May 26, 2022 at 11:17 PM Jark Wu <im...@gmail.com> wrote:
>
>> After looking at the new introduced ReloadTime and Becket's comment,
>> I agree with Becket we should have a pluggable reloading strategy.
>> We can provide some common implementations, e.g., periodic reloading, and
>> daily reloading.
>> But there will definitely be some connector- or business-specific
>> reloading strategies, e.g. being notified by a ZooKeeper watcher, or
>> reloading once a new Hive partition is complete.
>>
>> Best,
>> Jark
>>
>> On Thu, 26 May 2022 at 11:52, Becket Qin <be...@gmail.com> wrote:
>>
>> > Hi Qingsheng,
>> >
>> > Thanks for updating the FLIP. A few comments / questions below:
>> >
>> > 1. Is there a reason that we have both "XXXFactory" and "XXXProvider"?
>> > What is the difference between them? If they are the same, can we just
>> use
>> > XXXFactory everywhere?
>> >
>> > 2. Regarding the FullCachingLookupProvider, should the reloading policy
>> > also be pluggable? Periodic reloading can sometimes be tricky in
>> > practice. For example, if a user sets 24 hours as the cache refresh
>> > interval and some nightly batch job is delayed, the cache update may
>> > still see stale data.
>> >
>> > 3. In DefaultLookupCacheFactory, it looks like InitialCapacity should be
>> > removed.
>> >
>> > 4. The purpose of LookupFunctionProvider#cacheMissingKey() seems a
>> > little confusing to me. If Optional<LookupCacheFactory> getCacheFactory()
>> > returns a non-empty factory, doesn't that already instruct the framework
>> > to cache the missing keys? Also, why does this method return an
>> > Optional<Boolean> instead of a boolean?
>> >
>> > Thanks,
>> >
>> > Jiangjie (Becket) Qin
>> >
>> >
>> >
>> > On Wed, May 25, 2022 at 5:07 PM Qingsheng Ren <re...@gmail.com>
>> wrote:
>> >
>> >> Hi Lincoln and Jark,
>> >>
>> >> Thanks for the comments! If the community reaches a consensus that we
>> use
>> >> SQL hint instead of table options to decide whether to use sync or
>> async
>> >> mode, it’s indeed not necessary to introduce the “lookup.async” option.
>> >>
>> >> I think it’s a good idea to let the decision about async mode be made
>> >> at the query level, which could enable better optimization with more
>> >> information gathered by the planner. Is there any FLIP describing the
>> >> issue in FLINK-27625? I thought FLIP-234 only proposed adding a SQL
>> >> hint for retry on missing data, rather than having the entire async
>> >> mode controlled by a hint.
>> >>
>> >> Best regards,
>> >>
>> >> Qingsheng
>> >>
>> >> > On May 25, 2022, at 15:13, Lincoln Lee <li...@gmail.com>
>> wrote:
>> >> >
>> >> > Hi Jark,
>> >> >
>> >> > Thanks for your reply!
>> >> >
>> >> > Currently 'lookup.async' just lies in HBase connector, I have no idea
>> >> > whether or when to remove it (we can discuss it in another issue for
>> the
>> >> > HBase connector after FLINK-27625 is done), just not add it into a
>> >> common
>> >> > option now.
>> >> >
>> >> > Best,
>> >> > Lincoln Lee
>> >> >
>> >> >
>> >> > Jark Wu <im...@gmail.com> 于2022年5月24日周二 20:14写道:
>> >> >
>> >> >> Hi Lincoln,
>> >> >>
>> >> >> I have taken a look at FLIP-234, and I agree with you that the
>> >> connectors
>> >> >> can
>> >> >> provide both async and sync runtime providers simultaneously instead
>> >> of one
>> >> >> of them.
>> >> >> At that point, "lookup.async" looks redundant. If this option is
>> >> planned to
>> >> >> be removed
>> >> >> in the long term, I think it makes sense not to introduce it in this
>> >> FLIP.
>> >> >>
>> >> >> Best,
>> >> >> Jark
>> >> >>
>> >> >> On Tue, 24 May 2022 at 11:08, Lincoln Lee <li...@gmail.com>
>> >> wrote:
>> >> >>
>> >> >>> Hi Qingsheng,
>> >> >>>
>> >> >>> Sorry for jumping into the discussion so late. It's a good idea
>> that
>> >> we
>> >> >> can
>> >> >>> have a common table option. I have a minor comments on
>> 'lookup.async'
>> >> >> that
>> >> >>> not make it a common option:
>> >> >>>
>> >> >>> The table layer abstracts both sync and async lookup capabilities;
>> >> >>> connector implementers can choose one or both. In the case of
>> >> >>> implementing only one capability (the status of most existing
>> >> >>> built-in connectors), 'lookup.async' will not be used. And when a
>> >> >>> connector has both
>> >> >>> capabilities, I think this choice is more suitable for making
>> >> decisions
>> >> >> at
>> >> >>> the query level, for example, table planner can choose the physical
>> >> >>> implementation of async lookup or sync lookup based on its cost
>> >> model, or
>> >> >>> users can give query hint based on their own better
>> understanding.  If
>> >> >>> there is another common table option 'lookup.async', it may confuse
>> >> the
>> >> >>> users in the long run.
>> >> >>>
>> >> >>> So, I prefer to leave the 'lookup.async' option in private place
>> (for
>> >> the
>> >> >>> current hbase connector) and not turn it into a common option.
>> >> >>>
>> >> >>> WDYT?
>> >> >>>
>> >> >>> Best,
>> >> >>> Lincoln Lee
>> >> >>>
>> >> >>>
>> >> >>> Qingsheng Ren <re...@gmail.com> 于2022年5月23日周一 14:54写道:
>> >> >>>
>> >> >>>> Hi Alexander,
>> >> >>>>
>> >> >>>> Thanks for the review! We recently updated the FLIP and you can
>> find
>> >> >>> those
>> >> >>>> changes from my latest email. Since some terminologies has
>> changed so
>> >> >>> I’ll
>> >> >>>> use the new concept for replying your comments.
>> >> >>>>
>> >> >>>> 1. Builder vs ‘of’
>> >> >>>> I’m OK to use builder pattern if we have additional optional
>> >> parameters
>> >> >>>> for full caching mode (“rescan” previously). The
>> schedule-with-delay
>> >> >> idea
>> >> >>>> looks reasonable to me, but I think we need to redesign the
>> builder
>> >> API
>> >> >>> of
>> >> >>>> full caching to make it more descriptive for developers. Would you
>> >> mind
>> >> >>>> sharing your ideas about the API? For accessing the FLIP workspace
>> >> you
>> >> >>> can
>> >> >>>> just provide your account ID and ping any PMC member including
>> Jark.
>> >> >>>>
>> >> >>>> 2. Common table options
>> >> >>>> We have some discussions these days and propose to introduce 8
>> common
>> >> >>>> table options about caching. It has been updated on the FLIP.
>> >> >>>>
>> >> >>>> 3. Retries
>> >> >>>> I think we are on the same page :-)
>> >> >>>>
>> >> >>>> For your additional concerns:
>> >> >>>> 1) The table option has been updated.
>> >> >>>> 2) We got “lookup.cache” back for configuring whether to use
>> partial
>> >> or
>> >> >>>> full caching mode.
>> >> >>>>
>> >> >>>> Best regards,
>> >> >>>>
>> >> >>>> Qingsheng
>> >> >>>>
>> >> >>>>
>> >> >>>>
>> >> >>>>> On May 19, 2022, at 17:25, Александр Смирнов <
>> smiralexan@gmail.com>
>> >> >>>> wrote:
>> >> >>>>>
>> >> >>>>> Also I have a few additions:
>> >> >>>>> 1) maybe rename 'lookup.cache.maximum-size' to
>> >> >>>>> 'lookup.cache.max-rows'? I think it will be more clear that we
>> talk
>> >> >>>>> not about bytes, but about the number of rows. Plus it fits more,
>> >> >>>>> considering my optimization with filters.
>> >> >>>>> 2) How will users enable rescanning? Are we going to separate
>> >> caching
>> >> >>>>> and rescanning from the options point of view? Like initially we
>> had
>> >> >>>>> one option 'lookup.cache' with values LRU / ALL. I think now we
>> can
>> >> >>>>> make a boolean option 'lookup.rescan'. RescanInterval can be
>> >> >>>>> 'lookup.rescan.interval', etc.
>> >> >>>>>
>> >> >>>>> Best regards,
>> >> >>>>> Alexander
>> >> >>>>>
>> >> >>>>> чт, 19 мая 2022 г. в 14:50, Александр Смирнов <
>> smiralexan@gmail.com
>> >> >>> :
>> >> >>>>>>
>> >> >>>>>> Hi Qingsheng and Jark,
>> >> >>>>>>
>> >> >>>>>> 1. Builders vs 'of'
>> >> >>>>>> I understand that builders are used when we have multiple
>> >> >> parameters.
>> >> >>>>>> I suggested them because we could add parameters later. To
>> prevent
>> >> >>>>>> Builder for ScanRuntimeProvider from looking redundant I can
>> >> suggest
>> >> >>>>>> one more config now - "rescanStartTime".
>> >> >>>>>> It's a time in UTC (LocalTime class) when the first reload of
>> cache
>> >> >>>>>> starts. This parameter can be thought of as 'initialDelay' (diff
>> >> >>>>>> between current time and rescanStartTime) in method
>> >> >>>>>> ScheduleExecutorService#scheduleWithFixedDelay [1] . It can be
>> very
>> >> >>>>>> useful when the dimension table is updated by some other
>> scheduled
>> >> >> job
>> >> >>>>>> at a certain time. Or when the user simply wants a second scan
>> >> >> (first
>> >> >>>>>> cache reload) be delayed. This option can be used even without
>> >> >>>>>> 'rescanInterval' - in this case 'rescanInterval' will be one
>> day.
>> >> >>>>>> If you are fine with this option, I would be very glad if you
>> would
>> >> >>>>>> give me access to edit FLIP page, so I could add it myself
>> >> >>>>>>
>> >> >>>>>> 2. Common table options
>> >> >>>>>> I also think that FactoryUtil would be overloaded by all cache
>> >> >>>>>> options. But maybe unify all suggested options, not only for
>> >> default
>> >> >>>>>> cache? I.e. class 'LookupOptions', that unifies default cache
>> >> >> options,
>> >> >>>>>> rescan options, 'async', 'maxRetries'. WDYT?
>> >> >>>>>>
>> >> >>>>>> 3. Retries
>> >> >>>>>> I'm fine with suggestion close to RetryUtils#tryTimes(times,
>> call)
>> >> >>>>>>
>> >> >>>>>> [1]
>> >> >>>>
>> >> >>>
>> >> >>
>> >>
>> https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/ScheduledExecutorService.html#scheduleWithFixedDelay-java.lang.Runnable-long-long-java.util.concurrent.TimeUnit-
>> >> >>>>>>
>> >> >>>>>> Best regards,
>> >> >>>>>> Alexander
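The initialDelay computation Alexander describes (the difference between the current time and a UTC rescanStartTime, fed into ScheduledExecutorService#scheduleWithFixedDelay) might be sketched like this; RescanDelay is an illustrative name, not part of the proposal.

```java
import java.time.Duration;
import java.time.Instant;
import java.time.LocalTime;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;

/** Sketch: derive scheduleWithFixedDelay's initialDelay from a UTC start time. */
final class RescanDelay {

    /** Milliseconds from 'now' until the next occurrence of startTime in UTC. */
    static long initialDelayMillis(LocalTime startTime, Instant now) {
        ZonedDateTime nowUtc = now.atZone(ZoneOffset.UTC);
        ZonedDateTime next = nowUtc.with(startTime);
        if (!next.isAfter(nowUtc)) {
            // The start time already passed today, so wait until tomorrow.
            next = next.plusDays(1);
        }
        return Duration.between(nowUtc, next).toMillis();
    }
}
```

The result would be passed as the initialDelay argument of scheduleWithFixedDelay, with rescanInterval (defaulting to one day, as proposed) as the fixed delay.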
>> >> >>>>>>
>> >> >>>>>> ср, 18 мая 2022 г. в 16:04, Qingsheng Ren <re...@gmail.com>:
>> >> >>>>>>>
>> >> >>>>>>> Hi Jark and Alexander,
>> >> >>>>>>>
>> >> >>>>>>> Thanks for your comments! I’m also OK to introduce common table
>> >> >>>> options. I prefer to introduce a new DefaultLookupCacheOptions
>> class
>> >> >> for
>> >> >>>> holding these option definitions because putting all options into
>> >> >>>> FactoryUtil would make it a bit ”crowded” and not well
>> categorized.
>> >> >>>>>>>
>> >> >>>>>>> FLIP has been updated according to suggestions above:
>> >> >>>>>>> 1. Use static “of” method for constructing
>> RescanRuntimeProvider
>> >> >>>> considering both arguments are required.
>> >> >>>>>>> 2. Introduce new table options matching
>> DefaultLookupCacheFactory
>> >> >>>>>>>
>> >> >>>>>>> Best,
>> >> >>>>>>> Qingsheng
>> >> >>>>>>>
>> >> >>>>>>> On Wed, May 18, 2022 at 2:57 PM Jark Wu <im...@gmail.com>
>> wrote:
>> >> >>>>>>>>
>> >> >>>>>>>> Hi Alex,
>> >> >>>>>>>>
>> >> >>>>>>>> 1) retry logic
>> >> >>>>>>>> I think we can extract some common retry logic into utilities,
>> >> >> e.g.
>> >> >>>> RetryUtils#tryTimes(times, call).
>> >> >>>>>>>> This seems independent of this FLIP and can be reused by
>> >> >> DataStream
>> >> >>>> users.
>> >> >>>>>>>> Maybe we can open an issue to discuss this and where to put
>> it.
>> >> >>>>>>>>
>> >> >>>>>>>> 2) cache ConfigOptions
>> >> >>>>>>>> I'm fine with defining cache config options in the framework.
>> >> >>>>>>>> A candidate place to put is FactoryUtil which also includes
>> >> >>>> "sink.parallelism", "format" options.
>> >> >>>>>>>>
>> >> >>>>>>>> Best,
>> >> >>>>>>>> Jark
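The RetryUtils#tryTimes(times, call) helper Jark mentions does not exist in Flink at this point in the thread; a minimal sketch of what such a utility could look like:

```java
import java.util.concurrent.Callable;

/** Sketch of the proposed RetryUtils#tryTimes(times, call) helper. */
final class RetryUtils {

    /** Invokes call up to 'times' attempts, rethrowing the last failure. */
    static <T> T tryTimes(int times, Callable<T> call) throws Exception {
        if (times <= 0) {
            throw new IllegalArgumentException("times must be positive");
        }
        Exception lastFailure = null;
        for (int attempt = 0; attempt < times; attempt++) {
            try {
                return call.call();
            } catch (Exception e) {
                // A real implementation would likely retry only retriable errors.
                lastFailure = e;
            }
        }
        throw lastFailure;
    }
}
```

A connector's lookup() could then wrap its query as tryTimes(maxRetryTimes, () -> queryDatabase(key)); per the discussion above, real code would also want connector-specific hooks, e.g. re-establishing the connection between attempts.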
>> >> >>>>>>>>
>> >> >>>>>>>>
>> >> >>>>>>>> On Wed, 18 May 2022 at 13:52, Александр Смирнов <
>> >> >>> smiralexan@gmail.com>
>> >> >>>> wrote:
>> >> >>>>>>>>>
>> >> >>>>>>>>> Hi Qingsheng,
>> >> >>>>>>>>>
>> >> >>>>>>>>> Thank you for considering my comments.
>> >> >>>>>>>>>
>> >> >>>>>>>>>> there might be custom logic before making retry, such as
>> >> >>>> re-establish the connection
>> >> >>>>>>>>>
>> >> >>>>>>>>> Yes, I understand that. I meant that such logic can be
>> placed in
>> >> >> a
>> >> >>>>>>>>> separate function, that can be implemented by connectors.
>> Just
>> >> >>> moving
>> >> >>>>>>>>> the retry logic would make connector's LookupFunction more
>> >> >> concise
>> >> >>> +
>> >> >>>>>>>>> avoid duplicate code. However, it's a minor change. The
>> decision
>> >> >> is
>> >> >>>> up
>> >> >>>>>>>>> to you.
>> >> >>>>>>>>>
>> >> >>>>>>>>>> We decide not to provide common DDL options and let
>> developers
>> >> >> to
>> >> >>>> define their own options as we do now per connector.
>> >> >>>>>>>>>
>> >> >>>>>>>>> What is the reason for that? One of the main goals of this
>> FLIP
>> >> >> was
>> >> >>>> to
>> >> >>>>>>>>> unify the configs, wasn't it? I understand that current cache
>> >> >>> design
>> >> >>>>>>>>> doesn't depend on ConfigOptions, like was before. But still
>> we
>> >> >> can
>> >> >>>> put
>> >> >>>>>>>>> these options into the framework, so connectors can reuse
>> them
>> >> >> and
>> >> >>>>>>>>> avoid code duplication, and, what is more significant, avoid
>> >> >>> possible
>> >> >>>>>>>>> different options naming. This moment can be pointed out in
>> >> >>>>>>>>> documentation for connector developers.
>> >> >>>>>>>>>
>> >> >>>>>>>>> Best regards,
>> >> >>>>>>>>> Alexander
>> >> >>>>>>>>>
>> >> >>>>>>>>> вт, 17 мая 2022 г. в 17:11, Qingsheng Ren <
>> renqschn@gmail.com>:
>> >> >>>>>>>>>>
>> >> >>>>>>>>>> Hi Alexander,
>> >> >>>>>>>>>>
>> >> >>>>>>>>>> Thanks for the review and glad to see we are on the same
>> page!
>> >> I
>> >> >>>> think you forgot to cc the dev mailing list so I’m also quoting
>> your
>> >> >>> reply
>> >> >>>> under this email.
>> >> >>>>>>>>>>
>> >> >>>>>>>>>>> We can add 'maxRetryTimes' option into this class
>> >> >>>>>>>>>>
>> >> >>>>>>>>>> In my opinion the retry logic should be implemented in
>> lookup()
>> >> >>>> instead of in LookupFunction#eval(). Retrying is only meaningful
>> >> under
>> >> >>> some
>> >> >>>> specific retriable failures, and there might be custom logic
>> before
>> >> >>> making
>> >> >>>> retry, such as re-establish the connection
>> (JdbcRowDataLookupFunction
>> >> >> is
>> >> >>> an
>> >> >>>> example), so it's more handy to leave it to the connector.
>> >> >>>>>>>>>>
>> >> >>>>>>>>>>> I don't see DDL options, that were in previous version of
>> >> FLIP.
>> >> >>> Do
>> >> >>>> you have any special plans for them?
>> >> >>>>>>>>>>
>> >> >>>>>>>>>> We decide not to provide common DDL options and let
>> developers
>> >> >> to
>> >> >>>> define their own options as we do now per connector.
>> >> >>>>>>>>>>
>> >> >>>>>>>>>> The rest of comments sound great and I’ll update the FLIP.
>> Hope
>> >> >> we
>> >> >>>> can finalize our proposal soon!
>> >> >>>>>>>>>>
>> >> >>>>>>>>>> Best,
>> >> >>>>>>>>>>
>> >> >>>>>>>>>> Qingsheng
>> >> >>>>>>>>>>
>> >> >>>>>>>>>>
>> >> >>>>>>>>>>> On May 17, 2022, at 13:46, Александр Смирнов <
>> >> >>> smiralexan@gmail.com>
>> >> >>>> wrote:
>> >> >>>>>>>>>>>
>> >> >>>>>>>>>>> Hi Qingsheng and devs!
>> >> >>>>>>>>>>>
>> >> >>>>>>>>>>> I like the overall design of updated FLIP, however I have
>> >> >> several
>> >> >>>>>>>>>>> suggestions and questions.
>> >> >>>>>>>>>>>
>> >> >>>>>>>>>>> 1) Introducing LookupFunction as a subclass of
>> TableFunction
>> >> >> is a
>> >> >>>> good
>> >> >>>>>>>>>>> idea. We can add 'maxRetryTimes' option into this class.
>> >> 'eval'
>> >> >>>> method
>> >> >>>>>>>>>>> of new LookupFunction is great for this purpose. The same
>> is
>> >> >> for
>> >> >>>>>>>>>>> 'async' case.
>> >> >>>>>>>>>>>
>> >> >>>>>>>>>>> 2) There might be other configs in future, such as
>> >> >>>> 'cacheMissingKey'
>> >> >>>>>>>>>>> in LookupFunctionProvider or 'rescanInterval' in
>> >> >>>> ScanRuntimeProvider.
>> >> >>>>>>>>>>> Maybe use Builder pattern in LookupFunctionProvider and
>> >> >>>>>>>>>>> RescanRuntimeProvider for more flexibility (use one 'build'
>> >> >>> method
>> >> >>>>>>>>>>> instead of many 'of' methods in future)?
>> >> >>>>>>>>>>>
>> >> >>>>>>>>>>> 3) What are the plans for existing TableFunctionProvider
>> and
>> >> >>>>>>>>>>> AsyncTableFunctionProvider? I think they should be
>> deprecated.
>> >> >>>>>>>>>>>
>> >> >>>>>>>>>>> 4) Am I right that the current design does not assume
>> usage of
>> >> >>>>>>>>>>> user-provided LookupCache in re-scanning? In this case, it
>> is
>> >> >> not
>> >> >>>> very
>> >> >>>>>>>>>>> clear why do we need methods such as 'invalidate' or
>> 'putAll'
>> >> >> in
>> >> >>>>>>>>>>> LookupCache.
>> >> >>>>>>>>>>>
>> >> >>>>>>>>>>> 5) I don't see DDL options, that were in previous version
>> of
>> >> >>> FLIP.
>> >> >>>> Do
>> >> >>>>>>>>>>> you have any special plans for them?
>> >> >>>>>>>>>>>
>> >> >>>>>>>>>>> If you don't mind, I would be glad to be able to make small
>> >> >>>>>>>>>>> adjustments to the FLIP document too. I think it's worth
>> >> >>> mentioning
>> >> >>>>>>>>>>> about what exactly optimizations are planning in the
>> future.
>> >> >>>>>>>>>>>
>> >> >>>>>>>>>>> Best regards,
>> >> >>>>>>>>>>> Smirnov Alexander
>> >> >>>>>>>>>>>
>> >> >>>>>>>>>>> пт, 13 мая 2022 г. в 20:27, Qingsheng Ren <
>> renqschn@gmail.com
>> >> >>> :
>> >> >>>>>>>>>>>>
>> >> >>>>>>>>>>>> Hi Alexander and devs,
>> >> >>>>>>>>>>>>
>> >> >>>>>>>>>>>> Thank you very much for the in-depth discussion! As Jark
>> >> >>>> mentioned we were inspired by Alexander's idea and made a
>> refactor on
>> >> >> our
>> >> >>>> design. FLIP-221 [1] has been updated to reflect our design now
>> and
>> >> we
>> >> >>> are
>> >> >>>> happy to hear more suggestions from you!
>> >> >>>>>>>>>>>>
>> >> >>>>>>>>>>>> Compared to the previous design:
>> >> >>>>>>>>>>>> 1. The lookup cache serves at table runtime level and is
>> >> >>>> integrated as a component of LookupJoinRunner as discussed
>> >> previously.
>> >> >>>>>>>>>>>> 2. Interfaces are renamed and re-designed to reflect the
>> new
>> >> >>>> design.
>> >> >>>>>>>>>>>> 3. We separate the all-caching case individually and
>> >> >> introduce a
>> >> >>>> new RescanRuntimeProvider to reuse the ability of scanning. We are
>> >> >>> planning
>> >> >>>> to support SourceFunction / InputFormat for now considering the
>> >> >>> complexity
>> >> >>>> of FLIP-27 Source API.
>> >> >>>>>>>>>>>> 4. A new interface LookupFunction is introduced to make
>> the
>> >> >>>> semantic of lookup more straightforward for developers.
>> >> >>>>>>>>>>>>
>> >> >>>>>>>>>>>> For replying to Alexander:
>> >> >>>>>>>>>>>>> However I'm a little confused whether InputFormat is
>> >> >> deprecated
>> >> >>>> or not. Am I right that it will be so in the future, but currently
>> >> it's
>> >> >>> not?
>> >> >>>>>>>>>>>> Yes you are right. InputFormat is not deprecated for now.
>> I
>> >> >>> think
>> >> >>>> it will be deprecated in the future but we don't have a clear plan
>> >> for
>> >> >>> that.
>> >> >>>>>>>>>>>>
>> >> >>>>>>>>>>>> Thanks again for the discussion on this FLIP and looking
>> >> >> forward
>> >> >>>> to cooperating with you after we finalize the design and
>> interfaces!
>> >> >>>>>>>>>>>>
>> >> >>>>>>>>>>>> [1]
>> >> >>>>
>> >> >>>
>> >> >>
>> >>
>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-221+Abstraction+for+lookup+source+cache+and+metric
>> >> >>>>>>>>>>>>
>> >> >>>>>>>>>>>> Best regards,
>> >> >>>>>>>>>>>>
>> >> >>>>>>>>>>>> Qingsheng
>> >> >>>>>>>>>>>>
>> >> >>>>>>>>>>>>
>> >> >>>>>>>>>>>> On Fri, May 13, 2022 at 12:12 AM Александр Смирнов <
>> >> >>>> smiralexan@gmail.com> wrote:
>> >> >>>>>>>>>>>>>
>> >> >>>>>>>>>>>>> Hi Jark, Qingsheng and Leonard!
>> >> >>>>>>>>>>>>>
>> >> >>>>>>>>>>>>> Glad to see that we came to a consensus on almost all
>> >> points!
>> >> >>>>>>>>>>>>>
>> >> >>>>>>>>>>>>> However I'm a little confused whether InputFormat is
>> >> >> deprecated
>> >> >>>> or
>> >> >>>>>>>>>>>>> not. Am I right that it will be so in the future, but
>> >> >> currently
>> >> >>>> it's
>> >> >>>>>>>>>>>>> not? Actually I also think that for the first version
>> it's
>> >> OK
>> >> >>> to
>> >> >>>> use
>> >> >>>>>>>>>>>>> InputFormat in ALL cache realization, because supporting
>> >> >> rescan
>> >> >>>>>>>>>>>>> ability seems like a very distant prospect. But for this
>> >> >>>> decision we
>> >> >>>>>>>>>>>>> need a consensus among all discussion participants.
>> >> >>>>>>>>>>>>>
>> >> >>>>>>>>>>>>> In general, I don't have something to argue with your
>> >> >>>> statements. All
>> >> >>>>>>>>>>>>> of them correspond my ideas. Looking ahead, it would be
>> nice
>> >> >> to
>> >> >>>> work
>> >> >>>>>>>>>>>>> on this FLIP cooperatively. I've already done a lot of
>> work
>> >> >> on
>> >> >>>> lookup
>> >> >>>>>>>>>>>>> join caching with realization very close to the one we
>> are
>> >> >>>> discussing,
>> >> >>>>>>>>>>>>> and want to share the results of this work. Anyway
>> looking
>> >> >>>> forward for
>> >> >>>>>>>>>>>>> the FLIP update!
>> >> >>>>>>>>>>>>>
>> >> >>>>>>>>>>>>> Best regards,
>> >> >>>>>>>>>>>>> Smirnov Alexander
>> >> >>>>>>>>>>>>>
>> >> >>>>>>>>>>>>> чт, 12 мая 2022 г. в 17:38, Jark Wu <im...@gmail.com>:
>> >> >>>>>>>>>>>>>>
>> >> >>>>>>>>>>>>>> Hi Alex,
>> >> >>>>>>>>>>>>>>
>> >> >>>>>>>>>>>>>> Thanks for summarizing your points.
>> >> >>>>>>>>>>>>>>
>> >> >>>>>>>>>>>>>> In the past week, Qingsheng, Leonard, and I have
>> discussed
>> >> >> it
>> >> >>>> several times
>> >> >>>>>>>>>>>>>> and we have totally refactored the design.
>> >> >>>>>>>>>>>>>> I'm glad to say we have reached a consensus on many of
>> your
>> >> >>>> points!
>> >> >>>>>>>>>>>>>> Qingsheng is still working on updating the design docs
>> and
>> >> >>>> maybe can be
>> >> >>>>>>>>>>>>>> available in the next few days.
>> >> >>>>>>>>>>>>>> I will share some conclusions from our discussions:
>> >> >>>>>>>>>>>>>>
>> >> >>>>>>>>>>>>>> 1) we have refactored the design towards the "cache in
>> >> >>>> framework" way.
>> >> >>>>>>>>>>>>>>
>> >> >>>>>>>>>>>>>> 2) a "LookupCache" interface for users to customize and
>> a
>> >> >>>> default
>> >> >>>>>>>>>>>>>> implementation with builder for users to easy-use.
>> >> >>>>>>>>>>>>>> This can both make it possible to both have flexibility
>> and
>> >> >>>> conciseness.
>> >> >>>>>>>>>>>>>>
>> >> >>>>>>>>>>>>>> 3) Filter pushdown is important for ALL and LRU lookup
>> >> >> cache,
>> >> >>>> esp reducing
>> >> >>>>>>>>>>>>>> IO.
>> >> >>>>>>>>>>>>>> Filter pushdown should be the final state and the
>> unified
>> >> >> way
>> >> >>>> to both
>> >> >>>>>>>>>>>>>> support pruning ALL cache and LRU cache,
>> >> >>>>>>>>>>>>>> so I think we should make effort in this direction. If
>> we
>> >> >> need
>> >> >>>> to support
>> >> >>>>>>>>>>>>>> filter pushdown for ALL cache anyway, why not use
>> >> >>>>>>>>>>>>>> it for LRU cache as well? Either way, as we decide to
>> >> >>> implement
>> >> >>>> the cache
>> >> >>>>>>>>>>>>>> in the framework, we have the chance to support
>> >> >>>>>>>>>>>>>> filter on cache anytime. This is an optimization and it
>> >> >>> doesn't
>> >> >>>> affect the
>> >> >>>>>>>>>>>>>> public API. I think we can create a JIRA issue to
>> >> >>>>>>>>>>>>>> discuss it when the FLIP is accepted.
>> >> >>>>>>>>>>>>>>
>> >> >>>>>>>>>>>>>> 4) The idea to support ALL cache is similar to your
>> >> >> proposal.
>> >> >>>>>>>>>>>>>> In the first version, we will only support InputFormat,
>> >> >>>> SourceFunction for
>> >> >>>>>>>>>>>>>> cache all (invoke InputFormat in join operator).
>> >> >>>>>>>>>>>>>> For FLIP-27 source, we need to join a true source
>> operator
>> >> >>>> instead of
>> >> >>>>>>>>>>>>>> calling it embedded in the join operator.
>> >> >>>>>>>>>>>>>> However, this needs another FLIP to support the re-scan
>> >> >>> ability
>> >> >>>> for FLIP-27
>> >> >>>>>>>>>>>>>> Source, and this can be a large work.
>> >> >>>>>>>>>>>>>> In order to not block this issue, we can put the effort
>> of
>> >> >>>> FLIP-27 source
>> >> >>>>>>>>>>>>>> integration into future work and integrate
>> >> >>>>>>>>>>>>>> InputFormat&SourceFunction for now.
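The "invoke InputFormat in join operator" idea from point 4 can be sketched roughly like this (Python for brevity; `FakeInputFormat`, `load_all_cache` and the split-handling shape are illustrative stand-ins for the real InputFormat contract, not Flink API):

```python
class FakeInputFormat:
    """Stand-in for an InputFormat: create splits, then read each split's records."""

    def __init__(self, rows_per_split):
        self._rows_per_split = rows_per_split

    def create_splits(self):
        return list(range(len(self._rows_per_split)))

    def read_split(self, split_id):
        return iter(self._rows_per_split[split_id])


def load_all_cache(input_format, key_fn):
    """Build the ALL cache inside the join operator: iterate every split of
    the dimension table and index all rows by the join key."""
    cache = {}
    for split in input_format.create_splits():
        for row in input_format.read_split(split):
            cache.setdefault(key_fn(row), []).append(row)
    return cache
```

On each reload the operator would call `load_all_cache` again and swap in the new snapshot.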
>> >> >>>>>>>>>>>>>>
>> >> >>>>>>>>>>>>>> I think it's fine to use InputFormat&SourceFunction, as
>> >> they
>> >> >>>> are not
>> >> >>>>>>>>>>>>>> deprecated, otherwise, we have to introduce another
>> >> function
>> >> >>>>>>>>>>>>>> similar to them which is meaningless. We need to plan
>> >> >> FLIP-27
>> >> >>>> source
>> >> >>>>>>>>>>>>>> integration ASAP before InputFormat & SourceFunction are
>> >> >>>> deprecated.
>> >> >>>>>>>>>>>>>>
>> >> >>>>>>>>>>>>>> Best,
>> >> >>>>>>>>>>>>>> Jark
>> >> >>>>>>>>>>>>>>
>> >> >>>>>>>>>>>>>>
>> >> >>>>>>>>>>>>>>
>> >> >>>>>>>>>>>>>>
>> >> >>>>>>>>>>>>>>
>> >> >>>>>>>>>>>>>> On Thu, 12 May 2022 at 15:46, Александр Смирнов <
>> >> >>>> smiralexan@gmail.com>
>> >> >>>>>>>>>>>>>> wrote:
>> >> >>>>>>>>>>>>>>
>> >> >>>>>>>>>>>>>>> Hi Martijn!
>> >> >>>>>>>>>>>>>>>
>> >> >>>>>>>>>>>>>>> Got it. Therefore, the implementation with InputFormat is
>> not
>> >> >>>> considered.
>> >> >>>>>>>>>>>>>>> Thanks for clearing that up!
>> >> >>>>>>>>>>>>>>>
>> >> >>>>>>>>>>>>>>> Best regards,
>> >> >>>>>>>>>>>>>>> Smirnov Alexander
>> >> >>>>>>>>>>>>>>>
>> >> >>>>>>>>>>>>>>> чт, 12 мая 2022 г. в 14:23, Martijn Visser <
>> >> >>>> martijn@ververica.com>:
>> >> >>>>>>>>>>>>>>>>
>> >> >>>>>>>>>>>>>>>> Hi,
>> >> >>>>>>>>>>>>>>>>
>> >> >>>>>>>>>>>>>>>> With regards to:
>> >> >>>>>>>>>>>>>>>>
>> >> >>>>>>>>>>>>>>>>> But if there are plans to refactor all connectors to
>> >> >>> FLIP-27
>> >> >>>>>>>>>>>>>>>>
>> >> >>>>>>>>>>>>>>>> Yes, FLIP-27 is the target for all connectors. The old
>> >> >>>> interfaces will be
>> >> >>>>>>>>>>>>>>>> deprecated and connectors will either be refactored to
>> >> use
>> >> >>>> the new ones
>> >> >>>>>>>>>>>>>>> or
>> >> >>>>>>>>>>>>>>>> dropped.
>> >> >>>>>>>>>>>>>>>>
>> >> >>>>>>>>>>>>>>>> The caching should work for connectors that are using
>> >> >>> FLIP-27
>> >> >>>> interfaces,
>> >> >>>>>>>>>>>>>>>> we should not introduce new features for old
>> interfaces.
>> >> >>>>>>>>>>>>>>>>
>> >> >>>>>>>>>>>>>>>> Best regards,
>> >> >>>>>>>>>>>>>>>>
>> >> >>>>>>>>>>>>>>>> Martijn
>> >> >>>>>>>>>>>>>>>>
>> >> >>>>>>>>>>>>>>>> On Thu, 12 May 2022 at 06:19, Александр Смирнов <
>> >> >>>> smiralexan@gmail.com>
>> >> >>>>>>>>>>>>>>>> wrote:
>> >> >>>>>>>>>>>>>>>>
>> >> >>>>>>>>>>>>>>>>> Hi Jark!
>> >> >>>>>>>>>>>>>>>>>
>> >> >>>>>>>>>>>>>>>>> Sorry for the late response. I would like to make
>> some
>> >> >>>> comments and
>> >> >>>>>>>>>>>>>>>>> clarify my points.
>> >> >>>>>>>>>>>>>>>>>
>> >> >>>>>>>>>>>>>>>>> 1) I agree with your first statement. I think we can
>> >> >>> achieve
>> >> >>>> both
>> >> >>>>>>>>>>>>>>>>> advantages this way: put the Cache interface in
>> >> >>>> flink-table-common,
>> >> >>>>>>>>>>>>>>>>> but have implementations of it in
>> flink-table-runtime.
>> >> >>>> Therefore if a
>> >> >>>>>>>>>>>>>>>>> connector developer wants to use existing cache
>> >> >> strategies
>> >> >>>> and their
>> >> >>>>>>>>>>>>>>>>> implementations, he can just pass lookupConfig to the
>> >> >>>> planner, but if
>> >> >>>>>>>>>>>>>>>>> he wants to have its own cache implementation in his
>> >> >>>> TableFunction, it
>> >> >>>>>>>>>>>>>>>>> will be possible for him to use the existing
>> interface
>> >> >> for
>> >> >>>> this
>> >> >>>>>>>>>>>>>>>>> purpose (we can explicitly point this out in the
>> >> >>>> documentation). In
>> >> >>>>>>>>>>>>>>>>> this way all configs and metrics will be unified.
>> WDYT?
>> >> >>>>>>>>>>>>>>>>>
>> >> >>>>>>>>>>>>>>>>>> If a filter can prune 90% of data in the cache, we
>> will
>> >> >>>> have 90% of
>> >> >>>>>>>>>>>>>>>>> lookup requests that can never be cached
>> >> >>>>>>>>>>>>>>>>>
>> >> >>>>>>>>>>>>>>>>> 2) Let me clarify the logic of the filters optimization in
>> case
>> >> >> of
>> >> >>>> LRU cache.
>> >> >>>>>>>>>>>>>>>>> It looks like Cache<RowData, Collection<RowData>>.
>> Here
>> >> >> we
>> >> >>>> always
>> >> >>>>>>>>>>>>>>>>> store the response of the dimension table in cache,
>> even
>> >> >>>> after
>> >> >>>>>>>>>>>>>>>>> applying calc function. I.e. if there are no rows
>> after
>> >> >>>> applying
>> >> >>>>>>>>>>>>>>>>> filters to the result of the 'eval' method of
>> >> >>> TableFunction,
>> >> >>>> we store
>> >> >>>>>>>>>>>>>>>>> the empty list by lookup keys. Therefore the cache
>> line
>> >> >>> will
>> >> >>>> be
>> >> >>>>>>>>>>>>>>>>> filled, but will require much less memory (in bytes).
>> >> >> I.e.
>> >> >>>> we don't
>> >> >>>>>>>>>>>>>>>>> completely filter keys, by which result was pruned,
>> but
>> >> >>>> significantly
>> >> >>>>>>>>>>>>>>>>> reduce required memory to store this result. If the
>> user
>> >> >>>> knows about
>> >> >>>>>>>>>>>>>>>>> this behavior, he can increase the 'max-rows' option
>> >> >> before
>> >> >>>> the start
>> >> >>>>>>>>>>>>>>>>> of the job. But actually I came up with the idea
>> that we
>> >> >>> can
>> >> >>>> do this
>> >> >>>>>>>>>>>>>>>>> automatically by using the 'maximumWeight' and
>> 'weigher'
>> >> >>>> methods of
>> >> >>>>>>>>>>>>>>>>> GuavaCache [1]. Weight can be the size of the
>> collection
>> >> >> of
>> >> >>>> rows
>> >> >>>>>>>>>>>>>>>>> (value of cache). Therefore cache can automatically
>> fit
>> >> >>> much
>> >> >>>> more
>> >> >>>>>>>>>>>>>>>>> records than before.
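The 'maximumWeight' + 'weigher' idea described above can be illustrated with a toy weight-bounded cache (Python for brevity; this mimics the Guava behavior — charging each entry by its row count so that cached empty results cost almost nothing — rather than using the actual Guava API):

```python
from collections import OrderedDict


class WeightBoundedCache:
    """Toy weight-bounded cache: entries are weighed by row count, so keys
    that map to empty or small result collections fit far more densely than
    under a plain max-entries limit."""

    def __init__(self, max_weight):
        self.max_weight = max_weight
        self.total_weight = 0
        self._store = OrderedDict()  # key -> collection of rows

    @staticmethod
    def _weigh(rows):
        # Charge at least 1 so cached empty results still occupy a slot.
        return max(1, len(rows))

    def put(self, key, rows):
        if key in self._store:
            self.total_weight -= self._weigh(self._store.pop(key))
        self._store[key] = rows
        self.total_weight += self._weigh(rows)
        # Evict oldest entries until the total weight is under the cap.
        while self.total_weight > self.max_weight:
            _, evicted = self._store.popitem(last=False)
            self.total_weight -= self._weigh(evicted)

    def get(self, key):
        return self._store.get(key)
```

Note how an empty result (a lookup whose rows were all pruned by the filters) is still cached, but at the minimum weight.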
>> >> >>>>>>>>>>>>>>>>>
>> >> >>>>>>>>>>>>>>>>>> Flink SQL has provided a standard way to do filters
>> and
>> >> >>>> projects
>> >> >>>>>>>>>>>>>>>>> pushdown, i.e., SupportsFilterPushDown and
>> >> >>>> SupportsProjectionPushDown.
>> >> >>>>>>>>>>>>>>>>>> Jdbc/hive/HBase haven't implemented the interfaces,
>> >> >> don't
>> >> >>>> mean it's
>> >> >>>>>>>>>>>>>>> hard
>> >> >>>>>>>>>>>>>>>>> to implement.
>> >> >>>>>>>>>>>>>>>>>
>> >> >>>>>>>>>>>>>>>>> It's debatable how difficult it will be to implement
>> >> >> filter
>> >> >>>> pushdown.
>> >> >>>>>>>>>>>>>>>>> But I think the fact that currently there is no
>> database
>> >> >>>> connector
>> >> >>>>>>>>>>>>>>>>> with filter pushdown at least means that this feature
>> >> >> won't
>> >> >>>> be
>> >> >>>>>>>>>>>>>>>>> supported soon in connectors. Moreover, if we talk
>> about
>> >> >>>> other
>> >> >>>>>>>>>>>>>>>>> connectors (not in Flink repo), their databases might
>> >> not
>> >> >>>> support all
>> >> >>>>>>>>>>>>>>>>> Flink filters (or not support filters at all). I
>> think
>> >> >>> users
>> >> >>>> are
>> >> >>>>>>>>>>>>>>>>> interested in supporting cache filters optimization
>> >> >>>> independently of
>> >> >>>>>>>>>>>>>>>>> supporting other features and solving more complex
>> >> >> problems
>> >> >>>> (or
>> >> >>>>>>>>>>>>>>>>> unsolvable at all).
>> >> >>>>>>>>>>>>>>>>>
>> >> >>>>>>>>>>>>>>>>> 3) I agree with your third statement. Actually in our
>> >> >>>> internal version
>> >> >>>>>>>>>>>>>>>>> I also tried to unify the logic of scanning and
>> >> reloading
>> >> >>>> data from
>> >> >>>>>>>>>>>>>>>>> connectors. But unfortunately, I didn't find a way to
>> >> >> unify
>> >> >>>> the logic
>> >> >>>>>>>>>>>>>>>>> of all ScanRuntimeProviders (InputFormat,
>> >> SourceFunction,
>> >> >>>> Source,...)
>> >> >>>>>>>>>>>>>>>>> and reuse it in reloading ALL cache. As a result I
>> >> >> settled
>> >> >>>> on using
>> >> >>>>>>>>>>>>>>>>> InputFormat, because it was used for scanning in all
>> >> >> lookup
>> >> >>>>>>>>>>>>>>>>> connectors. (I didn't know that there are plans to
>> >> >>> deprecate
>> >> >>>>>>>>>>>>>>>>> InputFormat in favor of FLIP-27 Source). IMO usage of
>> >> >>>> FLIP-27 source
>> >> >>>>>>>>>>>>>>>>> in ALL caching is not good idea, because this source
>> was
>> >> >>>> designed to
>> >> >>>>>>>>>>>>>>>>> work in distributed environment (SplitEnumerator on
>> >> >>>> JobManager and
>> >> >>>>>>>>>>>>>>>>> SourceReaders on TaskManagers), not in one operator
>> >> >> (lookup
>> >> >>>> join
>> >> >>>>>>>>>>>>>>>>> operator in our case). There is even no direct way to
>> >> >> pass
>> >> >>>> splits from
>> >> >>>>>>>>>>>>>>>>> SplitEnumerator to SourceReader (this logic works
>> >> through
>> >> >>>>>>>>>>>>>>>>> SplitEnumeratorContext, which requires
>> >> >>>>>>>>>>>>>>>>> OperatorCoordinator.SubtaskGateway to send
>> >> >> AddSplitEvents).
>> >> >>>> Usage of
>> >> >>>>>>>>>>>>>>>>> InputFormat for ALL cache seems much clearer and
>> >> >>>> easier. But if
>> >> >>>>>>>>>>>>>>>>> there are plans to refactor all connectors to
>> FLIP-27, I
>> >> >>>> have the
>> >> >>>>>>>>>>>>>>>>> following ideas: maybe we can abandon the lookup join
>> >> ALL
>> >> >>>> cache in
>> >> >>>>>>>>>>>>>>>>> favor of a simple join with multiple scans of the batch
>> >> >>> source?
>> >> >>>> The point
>> >> >>>>>>>>>>>>>>>>> is that the only difference between lookup join ALL
>> >> cache
>> >> >>>> and simple
>> >> >>>>>>>>>>>>>>>>> join with batch source is that in the first case
>> >> scanning
>> >> >>> is
>> >> >>>> performed
>> >> >>>>>>>>>>>>>>>>> multiple times, in between which state (cache) is
>> >> cleared
>> >> >>>> (correct me
>> >> >>>>>>>>>>>>>>>>> if I'm wrong). So what if we extend the
>> functionality of
>> >> >>>> simple join
>> >> >>>>>>>>>>>>>>>>> to support state reloading + extend the
>> functionality of
>> >> >>>> scanning
>> >> >>>>>>>>>>>>>>>>> batch source multiple times (this one should be easy
>> >> with
>> >> >>>> new FLIP-27
>> >> >>>>>>>>>>>>>>>>> source, that unifies streaming/batch reading - we
>> will
>> >> >> need
>> >> >>>> to change
>> >> >>>>>>>>>>>>>>>>> only SplitEnumerator, which will pass splits again
>> after
>> >> >>>> some TTL).
>> >> >>>>>>>>>>>>>>>>> WDYT? I must say that this looks like a long-term
>> goal
>> >> >> and
>> >> >>>> will make
>> >> >>>>>>>>>>>>>>>>> the scope of this FLIP even larger than you said.
>> Maybe
>> >> >> we
>> >> >>>> can limit
>> >> >>>>>>>>>>>>>>>>> ourselves to a simpler solution now (InputFormats).
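The periodic re-scan idea above can be sketched as follows (Python for brevity; `scan_fn` stands in for the InputFormat/scan source, and the real join operator would also block the input stream while the snapshot is being reloaded):

```python
import time


class AllCache:
    """Minimal sketch of the ALL-cache reload pattern: the whole dimension
    table is re-scanned once per TTL, and lookups in between are served from
    the in-memory snapshot."""

    def __init__(self, scan_fn, ttl_seconds):
        self._scan_fn = scan_fn        # yields (key, rows) pairs for the whole table
        self._ttl = ttl_seconds
        self._snapshot = {}
        self._loaded_at = float("-inf")  # forces a scan on the first lookup

    def lookup(self, key, now=None):
        now = time.monotonic() if now is None else now
        if now - self._loaded_at >= self._ttl:
            # TTL expired: drop the old state and scan the whole table again.
            self._snapshot = dict(self._scan_fn())
            self._loaded_at = now
        return self._snapshot.get(key, [])
```

The same shape maps onto the "simple join with multiple scans" alternative: the SplitEnumerator would just emit all splits again after the TTL instead of the operator calling `scan_fn` itself.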
>> >> >>>>>>>>>>>>>>>>>
>> >> >>>>>>>>>>>>>>>>> So to sum up, my points is like this:
>> >> >>>>>>>>>>>>>>>>> 1) There is a way to make both concise and flexible
>> >> >>>> interfaces for
>> >> >>>>>>>>>>>>>>>>> caching in lookup join.
>> >> >>>>>>>>>>>>>>>>> 2) Cache filters optimization is important both in
>> LRU
>> >> >> and
>> >> >>>> ALL caches.
>> >> >>>>>>>>>>>>>>>>> 3) It is unclear when filter pushdown will be
>> supported
>> >> >> in
>> >> >>>> Flink
>> >> >>>>>>>>>>>>>>>>> connectors, some of the connectors might not have the
>> >> >>>> opportunity to
>> >> >>>>>>>>>>>>>>>>> support filter pushdown + as I know, currently filter
>> >> >>>> pushdown works
>> >> >>>>>>>>>>>>>>>>> only for scanning (not lookup). So cache filters +
>> >> >>>> projections
>> >> >>>>>>>>>>>>>>>>> optimization should be independent from other
>> features.
>> >> >>>>>>>>>>>>>>>>> 4) The ALL cache implementation is a complex topic that
>> >> involves
>> >> >>>> multiple
>> >> >>>>>>>>>>>>>>>>> aspects of how Flink is developing. Abandoning
>> >> >>>> InputFormat in favor
>> >> >>>>>>>>>>>>>>>>> of FLIP-27 Source will make the ALL cache implementation
>> really
>> >> >>>> complex and
>> >> >>>>>>>>>>>>>>>>> unclear, so maybe instead we can extend the
>> >> >>>> functionality of
>> >> >>>>>>>>>>>>>>>>> simple join, or keep InputFormat in the case of the
>> >> >>> lookup
>> >> >>>> join ALL
>> >> >>>>>>>>>>>>>>>>> cache?
>> >> >>>>>>>>>>>>>>>>>
>> >> >>>>>>>>>>>>>>>>> Best regards,
>> >> >>>>>>>>>>>>>>>>> Smirnov Alexander
>> >> >>>>>>>>>>>>>>>>>
>> >> >>>>>>>>>>>>>>>>>
>> >> >>>>>>>>>>>>>>>>>
>> >> >>>>>>>>>>>>>>>>>
>> >> >>>>>>>>>>>>>>>>>
>> >> >>>>>>>>>>>>>>>>>
>> >> >>>>>>>>>>>>>>>>>
>> >> >>>>>>>>>>>>>>>>> [1]
>> >> >>>>>>>>>>>>>>>>>
>> >> >>>>>>>>>>>>>>>
>> >> >>>>
>> >> >>>
>> >> >>
>> >>
>> https://guava.dev/releases/18.0/api/docs/com/google/common/cache/CacheBuilder.html#weigher(com.google.common.cache.Weigher)
>> >> >>>>>>>>>>>>>>>>>
>> >> >>>>>>>>>>>>>>>>> чт, 5 мая 2022 г. в 20:34, Jark Wu <imjark@gmail.com
>> >:
>> >> >>>>>>>>>>>>>>>>>>
>> >> >>>>>>>>>>>>>>>>>> It's great to see the active discussion! I want to
>> >> share
>> >> >>> my
>> >> >>>> ideas:
>> >> >>>>>>>>>>>>>>>>>>
>> >> >>>>>>>>>>>>>>>>>> 1) implement the cache in framework vs. connectors
>> base
>> >> >>>>>>>>>>>>>>>>>> I don't have a strong opinion on this. Both ways
>> should
>> >> >>>> work (e.g.,
>> >> >>>>>>>>>>>>>>> cache
>> >> >>>>>>>>>>>>>>>>>> pruning, compatibility).
>> >> >>>>>>>>>>>>>>>>>> The framework way can provide more concise
>> interfaces.
>> >> >>>>>>>>>>>>>>>>>> The connector base way can define more flexible
>> cache
>> >> >>>>>>>>>>>>>>>>>> strategies/implementations.
>> >> >>>>>>>>>>>>>>>>>> We are still investigating a way to see if we can
>> have
>> >> >>> both
>> >> >>>>>>>>>>>>>>> advantages.
>> >> >>>>>>>>>>>>>>>>>> We should reach a consensus that the way should be a
>> >> >> final
>> >> >>>> state,
>> >> >>>>>>>>>>>>>>> and we
>> >> >>>>>>>>>>>>>>>>>> are on the path to it.
>> >> >>>>>>>>>>>>>>>>>>
>> >> >>>>>>>>>>>>>>>>>> 2) filters and projections pushdown:
>> >> >>>>>>>>>>>>>>>>>> I agree with Alex that the filter pushdown into
>> cache
>> >> >> can
>> >> >>>> benefit a
>> >> >>>>>>>>>>>>>>> lot
>> >> >>>>>>>>>>>>>>>>> for
>> >> >>>>>>>>>>>>>>>>>> ALL cache.
>> >> >>>>>>>>>>>>>>>>>> However, this is not true for LRU cache. Connectors
>> use
>> >> >>>> cache to
>> >> >>>>>>>>>>>>>>> reduce
>> >> >>>>>>>>>>>>>>>>> IO
>> >> >>>>>>>>>>>>>>>>>> requests to databases for better throughput.
>> >> >>>>>>>>>>>>>>>>>> If a filter can prune 90% of data in the cache, we
>> will
>> >> >>>> have 90% of
>> >> >>>>>>>>>>>>>>>>> lookup
>> >> >>>>>>>>>>>>>>>>>> requests that can never be cached
>> >> >>>>>>>>>>>>>>>>>> and hit directly to the databases. That means the
>> cache
>> >> >> is
>> >> >>>>>>>>>>>>>>> meaningless in
>> >> >>>>>>>>>>>>>>>>>> this case.
>> >> >>>>>>>>>>>>>>>>>>
>> >> >>>>>>>>>>>>>>>>>> IMO, Flink SQL has provided a standard way to do
>> >> filters
>> >> >>>> and projects
>> >> >>>>>>>>>>>>>>>>>> pushdown, i.e., SupportsFilterPushDown and
>> >> >>>>>>>>>>>>>>> SupportsProjectionPushDown.
>> >> >>>>>>>>>>>>>>>>>> Jdbc/hive/HBase haven't implemented the interfaces,
>> >> >> don't
>> >> >>>> mean it's
>> >> >>>>>>>>>>>>>>> hard
>> >> >>>>>>>>>>>>>>>>> to
>> >> >>>>>>>>>>>>>>>>>> implement.
>> >> >>>>>>>>>>>>>>>>>> They should implement the pushdown interfaces to
>> reduce
>> >> >> IO
>> >> >>>> and the
>> >> >>>>>>>>>>>>>>> cache
>> >> >>>>>>>>>>>>>>>>>> size.
>> >> >>>>>>>>>>>>>>>>>> That should be a final state that the scan source
>> and
>> >> >>>> lookup source
>> >> >>>>>>>>>>>>>>> share
>> >> >>>>>>>>>>>>>>>>>> the exact pushdown implementation.
>> >> >>>>>>>>>>>>>>>>>> I don't see why we need to duplicate the pushdown
>> logic
>> >> >> in
>> >> >>>> caches,
>> >> >>>>>>>>>>>>>>> which
>> >> >>>>>>>>>>>>>>>>>> will complicate the lookup join design.
>> >> >>>>>>>>>>>>>>>>>>
>> >> >>>>>>>>>>>>>>>>>> 3) ALL cache abstraction
>> >> >>>>>>>>>>>>>>>>>> All cache might be the most challenging part of this
>> >> >> FLIP.
>> >> >>>> We have
>> >> >>>>>>>>>>>>>>> never
>> >> >>>>>>>>>>>>>>>>>> provided a reload-lookup public interface.
>> >> >>>>>>>>>>>>>>>>>> Currently, we put the reload logic in the "eval"
>> method
>> >> >> of
>> >> >>>>>>>>>>>>>>> TableFunction.
>> >> >>>>>>>>>>>>>>>>>> That's hard for some sources (e.g., Hive).
>> >> >>>>>>>>>>>>>>>>>> Ideally, connector implementation should share the
>> >> logic
>> >> >>> of
>> >> >>>> reload
>> >> >>>>>>>>>>>>>>> and
>> >> >>>>>>>>>>>>>>>>>> scan, i.e. ScanTableSource with
>> >> >>>> InputFormat/SourceFunction/FLIP-27
>> >> >>>>>>>>>>>>>>>>> Source.
>> >> >>>>>>>>>>>>>>>>>> However, InputFormat/SourceFunction are deprecated,
>> and
>> >> >>> the
>> >> >>>> FLIP-27
>> >> >>>>>>>>>>>>>>>>> source
>> >> >>>>>>>>>>>>>>>>>> is deeply coupled with SourceOperator.
>> >> >>>>>>>>>>>>>>>>>> If we want to invoke the FLIP-27 source in
>> LookupJoin,
>> >> >>> this
>> >> >>>> may make
>> >> >>>>>>>>>>>>>>> the
>> >> >>>>>>>>>>>>>>>>>> scope of this FLIP much larger.
>> >> >>>>>>>>>>>>>>>>>> We are still investigating how to abstract the ALL
>> >> cache
>> >> >>>> logic and
>> >> >>>>>>>>>>>>>>> reuse
>> >> >>>>>>>>>>>>>>>>>> the existing source interfaces.
>> >> >>>>>>>>>>>>>>>>>>
>> >> >>>>>>>>>>>>>>>>>>
>> >> >>>>>>>>>>>>>>>>>> Best,
>> >> >>>>>>>>>>>>>>>>>> Jark
>> >> >>>>>>>>>>>>>>>>>>
>> >> >>>>>>>>>>>>>>>>>>
>> >> >>>>>>>>>>>>>>>>>>
>> >> >>>>>>>>>>>>>>>>>> On Thu, 5 May 2022 at 20:22, Roman Boyko <
>> >> >>>> ro.v.boyko@gmail.com>
>> >> >>>>>>>>>>>>>>> wrote:
>> >> >>>>>>>>>>>>>>>>>>
>> >> >>>>>>>>>>>>>>>>>>> It's a much more complicated activity and lies out
>> of
>> >> >> the
>> >> >>>> scope of
>> >> >>>>>>>>>>>>>>> this
>> >> >>>>>>>>>>>>>>>>>>> improvement. Because such pushdowns should be done
>> for
>> >> >>> all
>> >> >>>>>>>>>>>>>>>>> ScanTableSource
>> >> >>>>>>>>>>>>>>>>>>> implementations (not only for Lookup ones).
>> >> >>>>>>>>>>>>>>>>>>>
>> >> >>>>>>>>>>>>>>>>>>> On Thu, 5 May 2022 at 19:02, Martijn Visser <
>> >> >>>>>>>>>>>>>>> martijnvisser@apache.org>
>> >> >>>>>>>>>>>>>>>>>>> wrote:
>> >> >>>>>>>>>>>>>>>>>>>
>> >> >>>>>>>>>>>>>>>>>>>> Hi everyone,
>> >> >>>>>>>>>>>>>>>>>>>>
>> >> >>>>>>>>>>>>>>>>>>>> One question regarding "And Alexander correctly
>> >> >>> mentioned
>> >> >>>> that
>> >> >>>>>>>>>>>>>>> filter
>> >> >>>>>>>>>>>>>>>>>>>> pushdown still is not implemented for
>> >> >> jdbc/hive/hbase."
>> >> >>>> -> Would
>> >> >>>>>>>>>>>>>>> an
>> >> >>>>>>>>>>>>>>>>>>>> alternative solution be to actually implement
>> these
>> >> >>> filter
>> >> >>>>>>>>>>>>>>> pushdowns?
>> >> >>>>>>>>>>>>>>>>> I
>> >> >>>>>>>>>>>>>>>>>>>> can
>> >> >>>>>>>>>>>>>>>>>>>> imagine that there are many more benefits to doing
>> >> >> that,
>> >> >>>> outside
>> >> >>>>>>>>>>>>>>> of
>> >> >>>>>>>>>>>>>>>>> lookup
>> >> >>>>>>>>>>>>>>>>>>>> caching and metrics.
>> >> >>>>>>>>>>>>>>>>>>>>
>> >> >>>>>>>>>>>>>>>>>>>> Best regards,
>> >> >>>>>>>>>>>>>>>>>>>>
>> >> >>>>>>>>>>>>>>>>>>>> Martijn Visser
>> >> >>>>>>>>>>>>>>>>>>>> https://twitter.com/MartijnVisser82
>> >> >>>>>>>>>>>>>>>>>>>> https://github.com/MartijnVisser
>> >> >>>>>>>>>>>>>>>>>>>>
>> >> >>>>>>>>>>>>>>>>>>>>
>> >> >>>>>>>>>>>>>>>>>>>> On Thu, 5 May 2022 at 13:58, Roman Boyko <
>> >> >>>> ro.v.boyko@gmail.com>
>> >> >>>>>>>>>>>>>>>>> wrote:
>> >> >>>>>>>>>>>>>>>>>>>>
>> >> >>>>>>>>>>>>>>>>>>>>> Hi everyone!
>> >> >>>>>>>>>>>>>>>>>>>>>
>> >> >>>>>>>>>>>>>>>>>>>>> Thanks for driving such a valuable improvement!
>> >> >>>>>>>>>>>>>>>>>>>>>
>> >> >>>>>>>>>>>>>>>>>>>>> I do think that a single cache implementation
>> would be
>> >> >> a
>> >> >>>> nice
>> >> >>>>>>>>>>>>>>>>> opportunity
>> >> >>>>>>>>>>>>>>>>>>>> for
>> >> >>>>>>>>>>>>>>>>>>>>> users. And it will break the "FOR SYSTEM_TIME AS
>> OF
>> >> >>>> proc_time"
>> >> >>>>>>>>>>>>>>>>> semantics
>> >> >>>>>>>>>>>>>>>>>>>>> anyway - no matter how it is
>> implemented.
>> >> >>>>>>>>>>>>>>>>>>>>>
>> >> >>>>>>>>>>>>>>>>>>>>> Putting myself in the user's shoes, I can say
>> that:
>> >> >>>>>>>>>>>>>>>>>>>>> 1) I would prefer to have the opportunity to cut
>> off
>> >> >>> the
>> >> >>>> cache
>> >> >>>>>>>>>>>>>>> size
>> >> >>>>>>>>>>>>>>>>> by
>> >> >>>>>>>>>>>>>>>>>>>>> simply filtering unnecessary data. And the most
>> >> handy
>> >> >>>> way to do
>> >> >>>>>>>>>>>>>>> it
>> >> >>>>>>>>>>>>>>>>> is
>> >> >>>>>>>>>>>>>>>>>>>> apply
>> >> >>>>>>>>>>>>>>>>>>>>> it inside LookupRunners. It would be a bit
>> harder to
>> >> >>>> pass it
>> >> >>>>>>>>>>>>>>>>> through the
>> >> >>>>>>>>>>>>>>>>>>>>> LookupJoin node to TableFunction. And Alexander
>> >> >>> correctly
>> >> >>>>>>>>>>>>>>> mentioned
>> >> >>>>>>>>>>>>>>>>> that
>> >> >>>>>>>>>>>>>>>>>>>>> filter pushdown still is not implemented for
>> >> >>>> jdbc/hive/hbase.
>> >> >>>>>>>>>>>>>>>>>>>>> 2) The ability to set the different caching
>> >> >> parameters
>> >> >>>> for
>> >> >>>>>>>>>>>>>>> different
>> >> >>>>>>>>>>>>>>>>>>>> tables
>> >> >>>>>>>>>>>>>>>>>>>>> is quite important. So I would prefer to set it
>> >> >> through
>> >> >>>> DDL
>> >> >>>>>>>>>>>>>>> rather
>> >> >>>>>>>>>>>>>>>>> than
>> >> >>>>>>>>>>>>>>>>>>>>> have similar TTL, strategy and other options for
>> >> all
>> >> >>>> lookup
>> >> >>>>>>>>>>>>>>> tables.
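For illustration, such per-table cache settings in DDL might look like this (using the 'lookup.cache.*' option names the JDBC connector already exposes at the time of this discussion; the FLIP may rename them, and the connection details are made up):

```sql
-- Each lookup table carries its own cache configuration:
CREATE TABLE users (
  user_id BIGINT,
  age INT,
  PRIMARY KEY (user_id) NOT ENFORCED
) WITH (
  'connector' = 'jdbc',
  'url' = 'jdbc:mysql://localhost:3306/mydb',
  'table-name' = 'users',
  'lookup.cache.max-rows' = '10000',
  'lookup.cache.ttl' = '10min'
);
```

A second lookup table in the same job could declare a different max-rows/TTL, which a single global `table.exec.xxx` config option could not express.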
>> >> >>>>>>>>>>>>>>>>>>>>> 3) Providing the cache into the framework really
>> >> >>>> deprives us of
>> >> >>>>>>>>>>>>>>>>>>>>> extensibility (users won't be able to implement
>> >> their
>> >> >>> own
>> >> >>>>>>>>>>>>>>> cache).
>> >> >>>>>>>>>>>>>>>>> But
>> >> >>>>>>>>>>>>>>>>>>>> most
>> >> >>>>>>>>>>>>>>>>>>>>> probably it might be solved by creating more
>> >> >> different
>> >> >>>> cache
>> >> >>>>>>>>>>>>>>>>> strategies
>> >> >>>>>>>>>>>>>>>>>>>> and
>> >> >>>>>>>>>>>>>>>>>>>>> a wider set of configurations.
>> >> >>>>>>>>>>>>>>>>>>>>>
>> >> >>>>>>>>>>>>>>>>>>>>> All these points are much closer to the schema
>> >> >> proposed
>> >> >>>> by
>> >> >>>>>>>>>>>>>>>>> Alexander.
>> >> >>>>>>>>>>>>>>>>>>>>> Qingshen Ren, please correct me if I'm not right
>> and
>> >> >>> all
>> >> >>>> these
>> >> >>>>>>>>>>>>>>>>>>>> facilities
>> >> >>>>>>>>>>>>>>>>>>>>> might be simply implemented in your architecture?
>> >> >>>>>>>>>>>>>>>>>>>>>
>> >> >>>>>>>>>>>>>>>>>>>>> Best regards,
>> >> >>>>>>>>>>>>>>>>>>>>> Roman Boyko
>> >> >>>>>>>>>>>>>>>>>>>>> e.: ro.v.boyko@gmail.com
>> >> >>>>>>>>>>>>>>>>>>>>>
>> >> >>>>>>>>>>>>>>>>>>>>> On Wed, 4 May 2022 at 21:01, Martijn Visser <
>> >> >>>>>>>>>>>>>>>>> martijnvisser@apache.org>
>> >> >>>>>>>>>>>>>>>>>>>>> wrote:
>> >> >>>>>>>>>>>>>>>>>>>>>
>> >> >>>>>>>>>>>>>>>>>>>>>> Hi everyone,
>> >> >>>>>>>>>>>>>>>>>>>>>>
>> >> >>>>>>>>>>>>>>>>>>>>>> I don't have much to chip in, but just wanted to
>> >> >>>> express that
>> >> >>>>>>>>>>>>>>> I
>> >> >>>>>>>>>>>>>>>>> really
>> >> >>>>>>>>>>>>>>>>>>>>>> appreciate the in-depth discussion on this topic
>> >> >> and I
>> >> >>>> hope
>> >> >>>>>>>>>>>>>>> that
>> >> >>>>>>>>>>>>>>>>>>>> others
>> >> >>>>>>>>>>>>>>>>>>>>>> will join the conversation.
>> >> >>>>>>>>>>>>>>>>>>>>>>
>> >> >>>>>>>>>>>>>>>>>>>>>> Best regards,
>> >> >>>>>>>>>>>>>>>>>>>>>>
>> >> >>>>>>>>>>>>>>>>>>>>>> Martijn
>> >> >>>>>>>>>>>>>>>>>>>>>>
>> >> >>>>>>>>>>>>>>>>>>>>>> On Tue, 3 May 2022 at 10:15, Александр Смирнов <
>> >> >>>>>>>>>>>>>>>>> smiralexan@gmail.com>
>> >> >>>>>>>>>>>>>>>>>>>>>> wrote:
>> >> >>>>>>>>>>>>>>>>>>>>>>
>> >> >>>>>>>>>>>>>>>>>>>>>>> Hi Qingsheng, Leonard and Jark,
>> >> >>>>>>>>>>>>>>>>>>>>>>>
>> >> >>>>>>>>>>>>>>>>>>>>>>> Thanks for your detailed feedback! However, I
>> have
>> >> >>>> questions
>> >> >>>>>>>>>>>>>>>>> about
>> >> >>>>>>>>>>>>>>>>>>>>>>> some of your statements (maybe I didn't get
>> >> >>>> something?).
>> >> >>>>>>>>>>>>>>>>>>>>>>>
>> >> >>>>>>>>>>>>>>>>>>>>>>>> Caching actually breaks the semantic of "FOR
>> >> >>>> SYSTEM_TIME
>> >> >>>>>>>>>>>>>>> AS OF
>> >> >>>>>>>>>>>>>>>>>>>>>> proc_time”
>> >> >>>>>>>>>>>>>>>>>>>>>>>
>> >> >>>>>>>>>>>>>>>>>>>>>>> I agree that the semantics of "FOR SYSTEM_TIME
>> AS
>> >> >> OF
>> >> >>>>>>>>>>>>>>> proc_time"
>> >> >>>>>>>>>>>>>>>>> is
>> >> >>>>>>>>>>>>>>>>>>>> not
>> >> >>>>>>>>>>>>>>>>>>>>>>> fully implemented with caching, but as you
>> said,
>> >> >>> users
>> >> >>>> go
>> >> >>>>>>>>>>>>>>> on it
>> >> >>>>>>>>>>>>>>>>>>>>>>> consciously to achieve better performance (no
>> one
>> >> >>>> proposed
>> >> >>>>>>>>>>>>>>> to
>> >> >>>>>>>>>>>>>>>>> enable
>> >> >>>>>>>>>>>>>>>>>>>>>>> caching by default, etc.). Or by users do you
>> mean
>> >> >>>> other
>> >> >>>>>>>>>>>>>>>>> developers
>> >> >>>>>>>>>>>>>>>>>>>> of
>> >> >>>>>>>>>>>>>>>>>>>>>>> connectors? In this case developers explicitly
>> >> >>> specify
>> >> >>>>>>>>>>>>>>> whether
>> >> >>>>>>>>>>>>>>>>> their
>> >> >>>>>>>>>>>>>>>>>>>>>>> connector supports caching or not (in the list
>> of
>> >> >>>> supported
>> >> >>>>>>>>>>>>>>>>>>>> options),
>> >> >>>>>>>>>>>>>>>>>>>>>>> no one makes them do that if they don't want
>> to.
>> >> So
>> >> >>>> what
>> >> >>>>>>>>>>>>>>>>> exactly is
>> >> >>>>>>>>>>>>>>>>>>>>>>> the difference between implementing caching in
>> >> >>> modules
>> >> >>>>>>>>>>>>>>>>>>>>>>> flink-table-runtime and in flink-table-common
>> from
>> >> >>> the
>> >> >>>>>>>>>>>>>>>>> considered
>> >> >>>>>>>>>>>>>>>>>>>>>>> point of view? How does it affect the
>> >> >>>> breaking/non-breaking of
>> >> >>>>>>>>>>>>>>> the
>> >> >>>>>>>>>>>>>>>>>>>>>>> semantics of "FOR SYSTEM_TIME AS OF proc_time"?
>> >> >>>>>>>>>>>>>>>>>>>>>>>
>> >> >>>>>>>>>>>>>>>>>>>>>>>> confront a situation that allows table
>> options in
>> >> >>> DDL
>> >> >>>> to
>> >> >>>>>>>>>>>>>>>>> control
>> >> >>>>>>>>>>>>>>>>>>>> the
>> >> >>>>>>>>>>>>>>>>>>>>>>> behavior of the framework, which has never
>> >> happened
>> >> >>>>>>>>>>>>>>> previously
>> >> >>>>>>>>>>>>>>>>> and
>> >> >>>>>>>>>>>>>>>>>>>>> should
>> >> >>>>>>>>>>>>>>>>>>>>>>> be cautious
>> >> >>>>>>>>>>>>>>>>>>>>>>>
>> >> >>>>>>>>>>>>>>>>>>>>>>> If we talk about main differences of semantics
>> of
>> >> >> DDL
>> >> >>>>>>>>>>>>>>> options
>> >> >>>>>>>>>>>>>>>>> and
>> >> >>>>>>>>>>>>>>>>>>>>>>> config options("table.exec.xxx"), isn't it
>> about
>> >> >>>> limiting
>> >> >>>>>>>>>>>>>>> the
>> >> >>>>>>>>>>>>>>>>> scope
>> >> >>>>>>>>>>>>>>>>>>>> of
>> >> >>>>>>>>>>>>>>>>>>>>>>> the options + importance for the user business
>> >> >> logic
>> >> >>>> rather
>> >> >>>>>>>>>>>>>>> than
>> >> >>>>>>>>>>>>>>>>>>>>>>> specific location of corresponding logic in the
>> >> >>>> framework? I
>> >> >>>>>>>>>>>>>>>>> mean
>> >> >>>>>>>>>>>>>>>>>>>> that
>> >> >>>>>>>>>>>>>>>>>>>>>>> in my design, for example, putting an option
>> with
>> >> >>>> lookup
>> >> >>>>>>>>>>>>>>> cache
>> >> >>>>>>>>>>>>>>>>>>>>>>> strategy in configurations would be the wrong
>> >> >>>> decision,
>> >> >>>>>>>>>>>>>>>>> because it
>> >> >>>>>>>>>>>>>>>>>>>>>>> directly affects the user's business logic (not
>> >> >> just
>> >> >>>>>>>>>>>>>>> performance
>> >> >>>>>>>>>>>>>>>>>>>>>>> optimization) + touches just several functions
>> of
>> >> >> ONE
>> >> >>>> table
>> >> >>>>>>>>>>>>>>>>> (there
>> >> >>>>>>>>>>>>>>>>>>>> can
>> >> >>>>>>>>>>>>>>>>>>>>>>> be multiple tables with different caches).
>> Does it
>> >> >>>> really
>> >> >>>>>>>>>>>>>>>>> matter for
>> >> >>>>>>>>>>>>>>>>>>>>>>> the user (or someone else) where the logic is
>> >> >>> located,
>> >> >>>>>>>>>>>>>>> which is
>> >> >>>>>>>>>>>>>>>>>>>>>>> affected by the applied option?
>> >> >>>>>>>>>>>>>>>>>>>>>>> Also I can remember DDL option
>> 'sink.parallelism',
>> >> >>>> which in
>> >> >>>>>>>>>>>>>>>>> some way
>> >> >>>>>>>>>>>>>>>>>>>>>>> "controls the behavior of the framework" and I
>> >> >> don't
>> >> >>>> see any
>> >> >>>>>>>>>>>>>>>>> problem
>> >> >>>>>>>>>>>>>>>>>>>>>>> here.
>> >> >>>>>>>>>>>>>>>>>>>>>>>
>> >> >>>>>>>>>>>>>>>>>>>>>>>> introduce a new interface for this all-caching
>> >> >>>> scenario
>> >> >>>>>>>>>>>>>>> and
>> >> >>>>>>>>>>>>>>>>> the
>> >> >>>>>>>>>>>>>>>>>>>>> design
>> >> >>>>>>>>>>>>>>>>>>>>>>> would become more complex
>> >> >>>>>>>>>>>>>>>>>>>>>>>
>> >> >>>>>>>>>>>>>>>>>>>>>>> This is a subject for a separate discussion,
>> but
>> >> >>>> actually
>> >> >>>>>>>>>>>>>>> in our
>> >> >>>>>>>>>>>>>>>>>>>>>>> internal version we solved this problem quite
>> >> >> easily
>> >> >>> -
>> >> >>>> we
>> >> >>>>>>>>>>>>>>> reused
>> >> >>>>>>>>>>>>>>>>>>>>>>> InputFormat class (so there is no need for a
>> new
>> >> >>> API).
>> >> >>>> The
>> >> >>>>>>>>>>>>>>>>> point is
>> >> >>>>>>>>>>>>>>>>>>>>>>> that currently all lookup connectors use
>> >> >> InputFormat
>> >> >>>> for
>> >> >>>>>>>>>>>>>>>>> scanning
>> >> >>>>>>>>>>>>>>>>>>>> the
>> >> >>>>>>>>>>>>>>>>>>>>>>> data in batch mode: HBase, JDBC and even Hive
>> - it
>> >> >>> uses
>> >> >>>>>>>>>>>>>>> class
>> >> >>>>>>>>>>>>>>>>>>>>>>> PartitionReader, that is actually just a
>> wrapper
>> >> >>> around
>> >> >>>>>>>>>>>>>>>>> InputFormat.
>> >> >>>>>>>>>>>>>>>>>>>>>>> The advantage of this solution is the ability
>> to
>> >> >>> reload
>> >> >>>>>>>>>>>>>>> cache
>> >> >>>>>>>>>>>>>>>>> data
>> >> >>>>>>>>>>>>>>>>>>>> in
>> >> >>>>>>>>>>>>>>>>>>>>>>> parallel (number of threads depends on number
>> of
>> >> >>>>>>>>>>>>>>> InputSplits,
>> >> >>>>>>>>>>>>>>>>> but
>> >> >>>>>>>>>>>>>>>>>>>> has
>> >> >>>>>>>>>>>>>>>>>>>>>>> an upper limit). As a result cache reload time
>> >> >>>> significantly
>> >> >>>>>>>>>>>>>>>>> reduces
>> >> >>>>>>>>>>>>>>>>>>>>>>> (as well as time of input stream blocking). I
>> know
>> >> >>> that
>> >> >>>>>>>>>>>>>>> usually
>> >> >>>>>>>>>>>>>>>>> we
>> >> >>>>>>>>>>>>>>>>>>>> try
>> >> >>>>>>>>>>>>>>>>>>>>>>> to avoid usage of concurrency in Flink code,
>> but
>> >> >>> maybe
>> >> >>>> this
>> >> >>>>>>>>>>>>>>> one
>> >> >>>>>>>>>>>>>>>>> can
>> >> >>>>>>>>>>>>>>>>>>>> be
>> >> >>>>>>>>>>>>>>>>>>>>>>> an exception. BTW I don't say that it's an
>> ideal
>> >> >>>> solution,
>> >> >>>>>>>>>>>>>>> maybe
> there are better ones.
>
>> Providing the cache in the framework might introduce compatibility issues
>
> It's possible only in cases where the developer of the connector doesn't
> properly refactor his code and uses the new cache options incorrectly
> (i.e. explicitly provides the same options in 2 different places in the
> code). For correct behavior, all he needs to do is redirect the existing
> options to the framework's LookupConfig (+ maybe add an alias for the
> options if the naming differed); everything will be transparent for
> users. If the developer doesn't do the refactoring at all, nothing
> changes for the connector because of backward compatibility. Also, if a
> developer wants to use his own cache logic, he can simply refuse to pass
> some of the configs to the framework and instead make his own
> implementation with the already existing configs and metrics (but I
> actually think that's a rare case).
>
>> filters and projections should be pushed all the way down to the table
>> function, like what we do in the scan source
>
> It's a great goal. But the truth is that the ONLY connector that
> supports filter pushdown is FileSystemTableSource (no database connector
> supports it currently). Also, for some databases it's simply impossible
> to push down filters as complex as the ones we have in Flink.
>
>> only applying these optimizations to the cache seems not quite useful
>
> Filters can cut off an arbitrarily large amount of data from the
> dimension table. For a simple example, suppose in the dimension table
> 'users' we have a column 'age' with values from 20 to 40, and an input
> stream 'clicks' that is roughly uniformly distributed by the age of
> users. If we have the filter 'age > 30', there will be half as much data
> in the cache. This means the user can increase 'lookup.cache.max-rows'
> by almost 2 times, which will give a huge performance boost. Moreover,
> this optimization really starts to shine with the 'ALL' cache, where
> tables that can't fit in memory without filters and projections can fit
> with them. This opens up additional possibilities for users. And that
> doesn't sound like 'not quite useful'.
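The cache-size effect described above — filtering before insertion so that filtered-out rows never consume cache slots — can be sketched in plain Java. All names here (UserRow, LruCache, maxRows) are hypothetical illustrations, not Flink API:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Predicate;

// Sketch, not Flink API: an LRU lookup cache that applies a pushed-down
// filter BEFORE storing rows, so rows failing 'age > 30' never occupy slots.
public class FilteredLruCacheSketch {

    record UserRow(long id, int age) {}

    // Minimal LRU built on LinkedHashMap's access order.
    static class LruCache<K, V> extends LinkedHashMap<K, V> {
        private final int maxRows; // plays the role of 'lookup.cache.max-rows'
        LruCache(int maxRows) {
            super(16, 0.75f, true);
            this.maxRows = maxRows;
        }
        @Override
        protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
            return size() > maxRows;
        }
    }

    public static void main(String[] args) {
        Predicate<UserRow> filter = row -> row.age() > 30; // WHERE age > 30
        LruCache<Long, UserRow> cache = new LruCache<>(100);

        // Simulate lookups over users with ages uniformly spread in [20, 40).
        for (long id = 0; id < 100; id++) {
            UserRow fetched = new UserRow(id, 20 + (int) (id % 20)); // "remote" fetch
            if (filter.test(fetched)) { // filter BEFORE caching
                cache.put(id, fetched);
            }
        }
        // Ages 31..39 pass the filter: 9 out of every 20 ids are cached.
        System.out.println(cache.size()); // 45
    }
}
```

With filtered rows excluded up front, the same max-rows budget covers nearly twice as many useful keys, which is the headroom argument made above.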
>
> It would be great to hear other voices regarding this topic! We have
> quite a lot of controversial points, and I think with the help of others
> it will be easier for us to come to a consensus.
>
> Best regards,
> Smirnov Alexander
>
> On Fri, 29 Apr 2022 at 22:33, Qingsheng Ren <renqschn@gmail.com> wrote:
>> Hi Alexander and Arvid,
>>
>> Thanks for the discussion and sorry for my late response! We had an
>> internal discussion together with Jark and Leonard, and I'd like to
>> summarize our ideas. Instead of implementing the cache logic in the
>> table runtime layer or wrapping around the user-provided table function,
>> we prefer to introduce some new APIs extending TableFunction, with these
>> concerns:
>>
>> 1. Caching actually breaks the semantic of "FOR SYSTEM_TIME AS OF
>> proc_time", because it can't truly reflect the content of the lookup
>> table at the moment of querying. If users choose to enable caching on
>> the lookup table, they implicitly indicate that this breakage is
>> acceptable in exchange for the performance. So we prefer not to provide
>> caching at the table runtime level.
>>
>> 2. If we put the cache implementation in the framework (whether in a
>> runner or in a wrapper around TableFunction), we have to confront a
>> situation that allows table options in DDL to control the behavior of
>> the framework, which has never happened previously and should be treated
>> cautiously. Under the current design, the behavior of the framework
>> should only be specified by configurations ("table.exec.xxx"), and it's
>> hard to apply these general configs to a specific table.
>>
>> 3. We have use cases where the lookup source loads and refreshes all
>> records periodically into memory to achieve high lookup performance
>> (like the Hive connector in the community; this is also widely used by
>> our internal connectors). Wrapping the cache around the user's
>> TableFunction works fine for LRU caches, but I think we would have to
>> introduce a new interface for this all-caching scenario, and the design
>> would become more complex.
>>
>> 4. Providing the cache in the framework might introduce compatibility
>> issues to existing lookup sources: there might exist two caches with
>> totally different strategies if the user configures the table
>> incorrectly (one in the framework and another implemented by the lookup
>> source).
>>
>> As for the optimization mentioned by Alexander, I think filters and
>> projections should be pushed all the way down to the table function,
>> like what we do in the scan source, instead of into the runner with the
>> cache. The goal of using a cache is to reduce the network I/O and the
>> pressure on the external system, and only applying these optimizations
>> to the cache seems not quite useful.
>>
>> I made some updates to the FLIP[1] to reflect our ideas. We prefer to
>> keep the cache implementation as a part of TableFunction, and we could
>> provide some helper classes (CachingTableFunction,
>> AllCachingTableFunction, CachingAsyncTableFunction) to developers and
>> regulate the metrics of the cache. Also, I made a POC[2] for your
>> reference.
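The wrapper shape under discussion — consult a cache first and delegate to the user's function only on a miss, while tracking hit/miss metrics — can be sketched independently of Flink's real classes. Everything below is a hypothetical plain-Java illustration, not the FLIP-221 API:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Sketch only, NOT the FLIP-221 API: the shape of a "CachingTableFunction"
// helper that consults a cache first and delegates to the user-provided
// lookup (modeled here as a plain Function) only on a miss.
public class CachingLookupSketch<K, V> {
    private final Map<K, List<V>> cache = new HashMap<>();
    private final Function<K, List<V>> userLookup; // stands in for the user's TableFunction
    private long hits;
    private long misses;

    public CachingLookupSketch(Function<K, List<V>> userLookup) {
        this.userLookup = userLookup;
    }

    public List<V> eval(K key) {
        List<V> rows = cache.get(key);
        if (rows != null) {
            hits++; // served from cache: no external I/O
            return rows;
        }
        misses++;
        rows = userLookup.apply(key); // delegate to the external system
        cache.put(key, rows); // note: also caches empty results for missing keys
        return rows;
    }

    public long hitCount() { return hits; }

    public long missCount() { return misses; }

    public static void main(String[] args) {
        CachingLookupSketch<Integer, String> fn =
                new CachingLookupSketch<>(k -> List.of("row-" + k));
        fn.eval(1);
        fn.eval(1); // second lookup of key 1 hits the cache
        fn.eval(2);
        System.out.println(fn.hitCount() + " hit, " + fn.missCount() + " misses"); // 1 hit, 2 misses
    }
}
```

This sketch has no eviction or TTL; a real helper would bound the cache (as in the LRU discussion elsewhere in this thread) and expose the hit/miss counters as standard metrics.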
>>
>> Looking forward to your ideas!
>>
>> [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-221+Abstraction+for+lookup+source+cache+and+metric
>> [2] https://github.com/PatrickRen/flink/tree/FLIP-221
>>
>> Best regards,
>>
>> Qingsheng
>>
>> On Tue, Apr 26, 2022 at 4:45 PM Александр Смирнов <smiralexan@gmail.com> wrote:
>>> Thanks for the response, Arvid!
>>>
>>> I have a few comments on your message.
>>>
>>>> but could also live with an easier solution as the first step
>>>
>>> I think these 2 ways are mutually exclusive (the one originally
>>> proposed by Qingsheng and mine), because conceptually they follow the
>>> same goal, but the implementation details are different. If we go one
>>> way, moving to the other way in the future will mean deleting existing
>>> code and once again changing the API for connectors. So I think we
>>> should reach a consensus with the community about that and then work
>>> together on this FLIP, i.e. divide the work into tasks for different
>>> parts of the FLIP (for example, LRU cache unification / introducing the
>>> proposed set of metrics / further work…). WDYT, Qingsheng?
>>>
>>>> as the source will only receive the requests after filter
>>>
>>> Actually, if filters are applied to fields of the lookup table, we
>>> first have to make the requests, and only after that can we filter the
>>> responses, because lookup connectors don't have filter pushdown. So if
>>> filtering is done before caching, there will be far fewer rows in the
>>> cache.
>>>
>>>> @Alexander unfortunately, your architecture is not shared. I don't
>>>> know the solution to share images to be honest.
>>>
>>> Sorry for that, I'm a bit new to these kinds of conversations :)
>>> I have no write access to the confluence, so I made a Jira issue where
>>> I described the proposed changes in more detail -
>>> https://issues.apache.org/jira/browse/FLINK-27411.
>>>
>>> I will be happy to get more feedback!
>>>
>>> Best,
>>> Smirnov Alexander
>>>
>>> On Mon, 25 Apr 2022 at 19:49, Arvid Heise <arvid@apache.org> wrote:
>>>> Hi Qingsheng,
>>>>
>>>> Thanks for driving this; the inconsistency was not satisfying for me.
>>>>
>>>> I second Alexander's idea, though I could also live with an easier
>>>> solution as the first step: instead of making caching an
>>>> implementation detail of TableFunction X, rather devise a caching
>>>> layer around X. So the proposal would be a CachingTableFunction that
>>>> delegates to X in case of misses and otherwise manages the cache.
>>>> Lifting it into the operator model as proposed would be even better
>>>> but is probably unnecessary in the first step for a lookup source (as
>>>> the source will only receive the requests after the filter; applying
>>>> the projection may be more interesting to save memory).
>>>>
>>>> Another advantage is that all the changes of this FLIP would be
>>>> limited to options; no need for new public interfaces. Everything
>>>> else remains an implementation detail of the table runtime. That
>>>> means we can easily incorporate the optimization potential that
>>>> Alexander pointed out later.
>>>>
>>>> @Alexander unfortunately, your architecture is not shared. I don't
>>>> know the solution to share images, to be honest.
>>>>
>>>> On Fri, Apr 22, 2022 at 5:04 PM Александр Смирнов <smiralexan@gmail.com> wrote:
>>>>>
>>>>> Hi Qingsheng! My name is Alexander; I'm not a committer yet, but I'd
>>>>> really like to become one. And this FLIP really interested me.
>>>>> Actually, I have worked on a similar feature in my company's Flink
>>>>> fork, and we would like to share our thoughts on this and make the
>>>>> code open source.
>>>>>
>>>>> I think there is a better alternative to introducing an abstract
>>>>> class for TableFunction (CachingTableFunction). As you know,
>>>>> TableFunction lives in the flink-table-common module, which provides
>>>>> only an API for working with tables – that's very convenient for
>>>>> importing in connectors. In turn, CachingTableFunction contains logic
>>>>> for runtime execution, so this class and everything connected with it
>>>>> should be located in another module, probably flink-table-runtime.
>>>>> But this would require connectors to depend on another module that
>>>>> contains a lot of runtime logic, which doesn't sound good.
>>>>>
>>>>> I suggest adding a new method 'getLookupConfig' to LookupTableSource
>>>>> or LookupRuntimeProvider to allow connectors to only pass
>>>>> configurations to the planner, so that they won't depend on the
>>>>> runtime implementation. Based on these configs, the planner will
>>>>> construct a lookup join operator with the corresponding runtime logic
>>>>> (ProcessFunctions in the flink-table-runtime module). The
>>>>> architecture looks like the pinned image (the LookupConfig class
>>>>> there is actually your CacheConfig).
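The 'getLookupConfig' idea — the connector hands the planner a plain config object and never touches runtime classes — can be sketched with hypothetical types. Neither interface below is Flink's actual API; they only illustrate the proposed separation:

```java
import java.time.Duration;

// Sketch only -- these types are hypothetical, not Flink's actual API.
// The idea: the connector exposes a plain config object to the planner and
// never depends on runtime classes; the planner then picks a caching runner.
public class LookupConfigSketch {

    /** Hypothetical stand-in for the config object a connector would expose. */
    record LookupConfig(long maxRows, Duration ttl, boolean cacheMissingKey) {}

    /** Hypothetical addition to LookupTableSource / LookupRuntimeProvider. */
    interface ConfigurableLookupSource {
        LookupConfig getLookupConfig();
    }

    public static void main(String[] args) {
        ConfigurableLookupSource source =
                () -> new LookupConfig(10_000, Duration.ofMinutes(10), true);
        // Planner side: read the config and choose the runtime implementation,
        // e.g. a hypothetical LookupJoinCachingRunner when maxRows > 0.
        LookupConfig cfg = source.getLookupConfig();
        System.out.println(cfg.maxRows() > 0 ? "caching runner" : "plain runner");
    }
}
```

The point of this shape is the dependency direction: the connector module only needs the config type, while all runner classes stay in the runtime module.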
>>>>>
>>>>> The classes in flink-table-planner that will be responsible for this
>>>>> are CommonPhysicalLookupJoin and its inheritors.
>>>>> The current classes for lookup join in flink-table-runtime are
>>>>> LookupJoinRunner, AsyncLookupJoinRunner, LookupJoinRunnerWithCalc,
>>>>> and AsyncLookupJoinRunnerWithCalc.
>>>>>
>>>>> I suggest adding classes LookupJoinCachingRunner,
>>>>> LookupJoinCachingRunnerWithCalc, etc.
>>>>>
>>>>> And here comes another, more powerful advantage of such a solution.
>>>>> If we have the caching logic at a lower level, we can apply some
>>>>> optimizations to it. LookupJoinRunnerWithCalc was named like this
>>>>> because it uses the 'calc' function, which actually consists mostly
>>>>> of filters and projections.
>>>>>
>>>>> For example, in a join of table A with lookup table B with the
>>>>> condition 'JOIN … ON A.id = B.id AND A.age = B.age + 10 WHERE
>>>>> B.salary > 1000', the 'calc' function will contain the filters
>>>>> A.age = B.age + 10 and B.salary > 1000.
>>>>>
>>>>> If we apply this function before storing records in the cache, the
>>>>> size of the cache will be significantly reduced: filters = avoid
>>>>> storing useless records in the cache, projections = reduce the
>>>>> records' size. So the initial max number of records in the cache can
>>>>> be increased by the user.
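A sketch of this pre-cache 'calc', using hypothetical types rather than the actual runner classes: only the lookup-side filter B.salary > 1000 is applied before insertion, together with a projection that drops columns the query doesn't need. The probe-side condition A.age = B.age + 10 depends on the input row and still has to run per lookup, after the cache:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Sketch with hypothetical types (not the actual runner classes): the
// lookup-side part of the 'calc' -- the filter B.salary > 1000 plus a
// projection dropping unused columns -- applied BEFORE the row enters the
// cache, so the cache stores fewer and smaller records. The probe-side
// condition A.age = B.age + 10 still runs per input row after the lookup.
public class CalcBeforeCacheSketch {

    record BRow(long id, int age, long salary, String payload) {}

    record ProjectedRow(long id, int age) {} // projection drops salary/payload

    private final Map<Long, Optional<ProjectedRow>> cache = new HashMap<>();

    Optional<ProjectedRow> cacheFiltered(BRow fetched) {
        return cache.computeIfAbsent(fetched.id(), k ->
                fetched.salary() > 1000 // lookup-side filter from the calc
                        ? Optional.of(new ProjectedRow(fetched.id(), fetched.age()))
                        : Optional.empty()); // filtered-out row costs almost nothing
    }

    public static void main(String[] args) {
        CalcBeforeCacheSketch c = new CalcBeforeCacheSketch();
        System.out.println(c.cacheFiltered(new BRow(1, 35, 2000, "wide-row...")).isPresent()); // true
        System.out.println(c.cacheFiltered(new BRow(2, 25, 500, "wide-row...")).isPresent());  // false
    }
}
```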
>>>>>
>>>>> What do you think about it?
>>>>>
>>>>> On 2022/04/19 02:47:11 Qingsheng Ren wrote:
>>>>>> Hi devs,
>>>>>>
>>>>>> Yuan and I would like to start a discussion about FLIP-221[1],
>>>>>> which introduces an abstraction of lookup table cache and its
>>>>>> standard metrics.
>>>>>>
>>>>>> Currently each lookup table source has to implement its own cache
>>>>>> to store lookup results, and there isn't a standard set of metrics
>>>>>> for users and developers to tune their jobs with lookup joins,
>>>>>> which is a quite common use case in Flink Table / SQL.
>>>>>>
>>>>>> Therefore we propose some new APIs including a cache, metrics,
>>>>>> wrapper classes of TableFunction, and new table options. Please
>>>>>> take a look at the FLIP page [1] to get more details. Any
>>>>>> suggestions and comments would be appreciated!
>>>>>>
>>>>>> [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-221+Abstraction+for+lookup+source+cache+and+metric
>>>>>>
>>>>>> Best regards,
>>>>>>
>>>>>> Qingsheng
>>
>> --
>> Best Regards,
>>
>> Qingsheng Ren
>>
>> Real-time Computing Team
>> Alibaba Cloud
>>
>> Email: renqschn@gmail.com
>
> --
> Best regards,
> Roman Boyko
> e.: ro.v.boyko@gmail.com
>
> --
> Best Regards,
>
> Qingsheng Ren
>
> Real-time Computing Team
> Alibaba Cloud
>
> Email: renqschn@gmail.com

Re: [DISCUSS] FLIP-221 Abstraction for lookup source cache and metric

Posted by Jingsong Li <ji...@gmail.com>.
Thanks Qingsheng and all for your discussion.

Very sorry to jump in so late.

Maybe I missed something?
My first impression when I saw the cache interface was: why don't we
provide an interface similar to the Guava cache [1]? On top of the Guava
cache, Caffeine also adds extensions for asynchronous calls [2], and
Caffeine supports bulk loading as well.

I am also confused about why we first go through
LookupCacheFactory.Builder and then through the Factory to create the
Cache.

[1] https://github.com/google/guava
[2] https://github.com/ben-manes/caffeine/wiki/Population

Best,
Jingsong
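
The "loading cache" shape Jingsong refers to — a loader registered once, with get() loading on a miss — can be imitated with stdlib types for illustration. This is not Guava or Caffeine code; the real libraries add eviction, weighing, async and bulk loading on top of this shape:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Illustrative stdlib imitation of the Guava/Caffeine "loading cache" shape:
// the loader is registered once, and get() loads on a miss. Real libraries
// add eviction, async and bulk loading on top of this basic contract.
public class LoadingCacheShapeSketch {

    interface LoadingCache<K, V> {
        V get(K key); // loads via the registered loader on a miss
    }

    static <K, V> LoadingCache<K, V> build(Function<K, V> loader) {
        Map<K, V> store = new HashMap<>();
        return key -> store.computeIfAbsent(key, loader);
    }

    public static void main(String[] args) {
        LoadingCache<String, Integer> cache = build(String::length);
        System.out.println(cache.get("flink")); // 5, loaded on first access
        System.out.println(cache.get("flink")); // 5, now served from the map
    }
}
```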

On Thu, May 26, 2022 at 11:17 PM Jark Wu <im...@gmail.com> wrote:

> After looking at the newly introduced ReloadTime and Becket's comment,
> I agree with Becket that we should have a pluggable reloading strategy.
> We can provide some common implementations, e.g., periodic reloading and
> daily reloading.
> But there will definitely be some connector- or business-specific
> reloading strategies, e.g. being notified by a ZooKeeper watcher, or
> reloading once a new Hive partition is complete.
>
> Best,
> Jark
>
> On Thu, 26 May 2022 at 11:52, Becket Qin <be...@gmail.com> wrote:
>
> > Hi Qingsheng,
> >
> > Thanks for updating the FLIP. A few comments / questions below:
> >
> > 1. Is there a reason that we have both "XXXFactory" and "XXXProvider"?
> > What is the difference between them? If they are the same, can we just
> use
> > XXXFactory everywhere?
> >
> > 2. Regarding the FullCachingLookupProvider, should the reloading policy
> > also be pluggable? Periodic reloading can sometimes be tricky in
> > practice. For example, if a user sets 24 hours as the cache refresh
> > interval and some nightly batch job is delayed, the cache update may
> > still see stale data.
> >
> > 3. In DefaultLookupCacheFactory, it looks like InitialCapacity should be
> > removed.
> >
> > 4. The purpose of LookupFunctionProvider#cacheMissingKey() seems a little
> > confusing to me. If Optional<LookupCacheFactory> getCacheFactory() returns
> > a non-empty factory, doesn't that already indicate that the framework
> > should cache the missing keys? Also, why does this method return an
> > Optional<Boolean> instead of a boolean?
> >
> > Thanks,
> >
> > Jiangjie (Becket) Qin
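>
The pluggable reloading policy raised in Becket's point 2 can be sketched with a hypothetical interface (this is not in FLIP-221 as written): instead of a fixed periodic ReloadTime, the strategy that decides when the next full reload happens becomes an injection point, so a "wait for the upstream partition" policy could be swapped in:

```java
import java.time.Duration;
import java.time.Instant;

// Sketch of a pluggable reload policy (hypothetical interface, not part of
// FLIP-221 as written): the strategy deciding when the next full-cache
// reload happens is injected, so connector- or business-specific policies
// (e.g. "reload once the nightly Hive partition is complete") can replace
// the simple fixed-interval default.
public class ReloadStrategySketch {

    /** Decides when the next full reload should happen. */
    interface ReloadStrategy {
        Instant nextReload(Instant lastReload);
    }

    /** The simple built-in policy: reload every fixed interval. */
    static final class PeriodicReload implements ReloadStrategy {
        private final Duration interval;
        PeriodicReload(Duration interval) {
            this.interval = interval;
        }
        @Override
        public Instant nextReload(Instant lastReload) {
            return lastReload.plus(interval);
        }
    }

    public static void main(String[] args) {
        ReloadStrategy every24h = new PeriodicReload(Duration.ofHours(24));
        Instant last = Instant.parse("2022-05-26T00:00:00Z");
        System.out.println(every24h.nextReload(last)); // 2022-05-27T00:00:00Z
    }
}
```

A delayed-batch-aware policy would implement the same interface but return a time derived from an external completeness signal rather than a fixed offset.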
> >
> >
> >
> > On Wed, May 25, 2022 at 5:07 PM Qingsheng Ren <re...@gmail.com>
> wrote:
> >
> >> Hi Lincoln and Jark,
> >>
> >> Thanks for the comments! If the community reaches a consensus that we
> >> use SQL hints instead of table options to decide whether to use sync or
> >> async mode, it's indeed not necessary to introduce the "lookup.async"
> >> option.
> >>
> >> I think it's a good idea to let the decision on async be made at the
> >> query level, which enables better optimization with more information
> >> gathered by the planner. Is there any FLIP describing the issue in
> >> FLINK-27625? I thought FLIP-234 only proposes adding a SQL hint for
> >> retry on miss, rather than having the entire async mode controlled by a
> >> hint.
> >>
> >> Best regards,
> >>
> >> Qingsheng
> >>
> >> > On May 25, 2022, at 15:13, Lincoln Lee <li...@gmail.com>
> wrote:
> >> >
> >> > Hi Jark,
> >> >
> >> > Thanks for your reply!
> >> >
> >> > Currently 'lookup.async' only exists in the HBase connector; I have no
> >> > idea whether or when to remove it (we can discuss that in another issue
> >> > for the HBase connector after FLINK-27625 is done). Let's just not add
> >> > it as a common option now.
> >> >
> >> > Best,
> >> > Lincoln Lee
> >> >
> >> >
> >> > Jark Wu <im...@gmail.com> 于2022年5月24日周二 20:14写道:
> >> >
> >> >> Hi Lincoln,
> >> >>
> >> >> I have taken a look at FLIP-234, and I agree with you that the
> >> connectors
> >> >> can
> >> >> provide both async and sync runtime providers simultaneously instead
> >> of one
> >> >> of them.
> >> >> At that point, "lookup.async" looks redundant. If this option is
> >> planned to
> >> >> be removed
> >> >> in the long term, I think it makes sense not to introduce it in this
> >> FLIP.
> >> >>
> >> >> Best,
> >> >> Jark
> >> >>
> >> >> On Tue, 24 May 2022 at 11:08, Lincoln Lee <li...@gmail.com>
> >> wrote:
> >> >>
> >> >>> Hi Qingsheng,
> >> >>>
> >> >>> Sorry for jumping into the discussion so late. It's a good idea that
> >> we
> >> >> can
> >> >>> have a common table option. I have a minor comment on
> 'lookup.async'
> >> >> that
> >> >>> it should not be made a common option:
> >> >>>
> >> >>> The table layer abstracts both sync and async lookup capabilities,
> >> >>> connector implementers can choose one or both; in the case of
> >> >> implementing
> >> >>> only one capability (the status of most existing builtin
> connectors)
> >> >>> 'lookup.async' will not be used.  And when a connector has both
> >> >>> capabilities, I think this choice is more suitable for making
> >> decisions
> >> >> at
> >> >>> the query level, for example, table planner can choose the physical
> >> >>> implementation of async lookup or sync lookup based on its cost
> >> model, or
> >> >>> users can give a query hint based on their own better understanding.
> If
> >> >>> there is another common table option 'lookup.async', it may confuse
> >> the
> >> >>> users in the long run.
> >> >>>
> >> >>> So, I prefer to leave the 'lookup.async' option in private place
> (for
> >> the
> >> >>> current hbase connector) and not turn it into a common option.
> >> >>>
> >> >>> WDYT?
> >> >>>
> >> >>> Best,
> >> >>> Lincoln Lee
> >> >>>
> >> >>>
> >> >>> Qingsheng Ren <re...@gmail.com> 于2022年5月23日周一 14:54写道:
> >> >>>
> >> >>>> Hi Alexander,
> >> >>>>
> >> >>>> Thanks for the review! We recently updated the FLIP and you can
> find
> >> >>> those
> >> >>>> changes from my latest email. Since some terminology has changed,
> >> >>>> I’ll use the new concepts when replying to your comments.
> >> >>>>
> >> >>>> 1. Builder vs ‘of’
> >> >>>> I’m OK to use the builder pattern if we have additional optional
> >> parameters
> >> >>>> for full caching mode (“rescan” previously). The
> schedule-with-delay
> >> >> idea
> >> >>>> looks reasonable to me, but I think we need to redesign the builder
> >> API
> >> >>> of
> >> >>>> full caching to make it more descriptive for developers. Would you
> >> mind
> >> >>>> sharing your ideas about the API? For accessing the FLIP workspace
> >> you
> >> >>> can
> >> >>>> just provide your account ID and ping any PMC member including
> Jark.
> >> >>>>
> >> >>>> 2. Common table options
> >> >>>> We have some discussions these days and propose to introduce 8
> common
> >> >>>> table options about caching. It has been updated on the FLIP.
> >> >>>>
> >> >>>> 3. Retries
> >> >>>> I think we are on the same page :-)
> >> >>>>
> >> >>>> For your additional concerns:
> >> >>>> 1) The table option has been updated.
> >> >>>> 2) We got “lookup.cache” back for configuring whether to use
> partial
> >> or
> >> >>>> full caching mode.
> >> >>>>
> >> >>>> Best regards,
> >> >>>>
> >> >>>> Qingsheng
> >> >>>>
> >> >>>>
> >> >>>>
> >> >>>>> On May 19, 2022, at 17:25, Александр Смирнов <
> smiralexan@gmail.com>
> >> >>>> wrote:
> >> >>>>>
> >> >>>>> Also I have a few additions:
> >> >>>>> 1) maybe rename 'lookup.cache.maximum-size' to
> >> >>>>> 'lookup.cache.max-rows'? I think it will be clearer that we are
> >> >>>>> talking not about bytes, but about the number of rows. Plus it fits
> >> >>>>> better,
> >> >>>>> considering my optimization with filters.
> >> >>>>> 2) How will users enable rescanning? Are we going to separate
> >> caching
> >> >>>>> and rescanning from the options point of view? Like initially we
> had
> >> >>>>> one option 'lookup.cache' with values LRU / ALL. I think now we
> can
> >> >>>>> make a boolean option 'lookup.rescan'. RescanInterval can be
> >> >>>>> 'lookup.rescan.interval', etc.
> >> >>>>>
> >> >>>>> Best regards,
> >> >>>>> Alexander
> >> >>>>>
> >> >>>>> чт, 19 мая 2022 г. в 14:50, Александр Смирнов <
> smiralexan@gmail.com
> >> >>> :
> >> >>>>>>
> >> >>>>>> Hi Qingsheng and Jark,
> >> >>>>>>
> >> >>>>>> 1. Builders vs 'of'
> >> >>>>>> I understand that builders are used when we have multiple
> >> >> parameters.
> >> >>>>>> I suggested them because we could add parameters later. To
> prevent
> >> >>>>>> Builder for ScanRuntimeProvider from looking redundant I can
> >> suggest
> >> >>>>>> one more config now - "rescanStartTime".
> >> >>>>>> It's the time in UTC (the LocalTime class) when the first reload of the
> cache
> >> >>>>>> starts. This parameter can be thought of as 'initialDelay' (diff
> >> >>>>>> between current time and rescanStartTime) in method
> >> >>>>>> ScheduleExecutorService#scheduleWithFixedDelay [1] . It can be
> very
> >> >>>>>> useful when the dimension table is updated by some other
> scheduled
> >> >> job
> >> >>>>>> at a certain time. Or when the user simply wants a second scan
> >> >> (first
> >> >>>>>> cache reload) to be delayed. This option can be used even without
> >> >>>>>> 'rescanInterval' - in this case 'rescanInterval' will be one day.
> >> >>>>>> If you are fine with this option, I would be very glad if you
> would
> >> >>>>>> give me access to edit FLIP page, so I could add it myself
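The 'rescanStartTime'-to-'initialDelay' mapping described in point 1 above can be sketched like this. It assumes UTC and a default interval of one day; 'rescanStartTime' is a proposed option from this thread, not an existing Flink one:

```java
import java.time.Duration;
import java.time.LocalTime;
import java.time.ZoneOffset;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class RescanStartTimeSketch {
    // Delay from `now` until the next occurrence of `startTime` (UTC),
    // wrapping to the next day if the start time has already passed today.
    static Duration initialDelay(LocalTime now, LocalTime startTime) {
        Duration delay = Duration.between(now, startTime);
        return delay.isNegative() ? delay.plusDays(1) : delay;
    }

    public static void main(String[] args) {
        // First reload at 02:00 UTC, then once per rescanInterval (1 day here).
        Duration delay = initialDelay(LocalTime.now(ZoneOffset.UTC), LocalTime.of(2, 0));
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleWithFixedDelay(
                () -> System.out.println("reloading cache ..."),
                delay.toMillis(), Duration.ofDays(1).toMillis(), TimeUnit.MILLISECONDS);
        scheduler.shutdown(); // demo only: don't actually wait for the reload
    }
}
```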
> >> >>>>>>
> >> >>>>>> 2. Common table options
> >> >>>>>> I also think that FactoryUtil would be overloaded by all cache
> >> >>>>>> options. But maybe unify all suggested options, not only for
> >> default
> >> >>>>>> cache? I.e. class 'LookupOptions', that unifies default cache
> >> >> options,
> >> >>>>>> rescan options, 'async', 'maxRetries'. WDYT?
> >> >>>>>>
> >> >>>>>> 3. Retries
> >> >>>>>> I'm fine with suggestion close to RetryUtils#tryTimes(times,
> call)
> >> >>>>>>
> >> >>>>>> [1]
> >> >>>>
> >> >>>
> >> >>
> >>
> https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/ScheduledExecutorService.html#scheduleWithFixedDelay-java.lang.Runnable-long-long-java.util.concurrent.TimeUnit-
> >> >>>>>>
> >> >>>>>> Best regards,
> >> >>>>>> Alexander
> >> >>>>>>
> >> >>>>>> ср, 18 мая 2022 г. в 16:04, Qingsheng Ren <re...@gmail.com>:
> >> >>>>>>>
> >> >>>>>>> Hi Jark and Alexander,
> >> >>>>>>>
> >> >>>>>>> Thanks for your comments! I’m also OK to introduce common table
> >> >>>> options. I prefer to introduce a new DefaultLookupCacheOptions
> class
> >> >> for
> >> >>>> holding these option definitions because putting all options into
> >> >>>> FactoryUtil would make it a bit ”crowded” and not well categorized.
> >> >>>>>>>
> >> >>>>>>> FLIP has been updated according to suggestions above:
> >> >>>>>>> 1. Use static “of” method for constructing RescanRuntimeProvider
> >> >>>> considering both arguments are required.
> >> >>>>>>> 2. Introduce new table options matching
> DefaultLookupCacheFactory
> >> >>>>>>>
> >> >>>>>>> Best,
> >> >>>>>>> Qingsheng
> >> >>>>>>>
> >> >>>>>>> On Wed, May 18, 2022 at 2:57 PM Jark Wu <im...@gmail.com>
> wrote:
> >> >>>>>>>>
> >> >>>>>>>> Hi Alex,
> >> >>>>>>>>
> >> >>>>>>>> 1) retry logic
> >> >>>>>>>> I think we can extract some common retry logic into utilities,
> >> >> e.g.
> >> >>>> RetryUtils#tryTimes(times, call).
> >> >>>>>>>> This seems independent of this FLIP and can be reused by
> >> >> DataStream
> >> >>>> users.
> >> >>>>>>>> Maybe we can open an issue to discuss this and where to put it.
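A RetryUtils#tryTimes helper of the kind mentioned above does not exist in Flink at this point; a minimal sketch of what such a utility could look like (the class and method names are hypothetical):

```java
import java.util.concurrent.Callable;

public class RetryUtils {
    // Hypothetical tryTimes helper: retries `call` up to `times` attempts,
    // wrapping the last failure if every attempt fails.
    public static <T> T tryTimes(int times, Callable<T> call) {
        Exception last = null;
        for (int attempt = 1; attempt <= times; attempt++) {
            try {
                return call.call();
            } catch (Exception e) {
                last = e; // a real implementation would only retry retriable failures
            }
        }
        throw new RuntimeException("all " + times + " attempts failed", last);
    }

    public static void main(String[] args) {
        int[] attempts = {0};
        String result = tryTimes(3, () -> {
            if (++attempts[0] < 3) {
                throw new IllegalStateException("transient failure");
            }
            return "ok";
        });
        System.out.println(result + " after " + attempts[0] + " attempts");
    }
}
```

As noted in the email, connector-specific logic (such as re-establishing a connection before retrying) would still live in the connector's own callback passed to the helper.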
> >> >>>>>>>>
> >> >>>>>>>> 2) cache ConfigOptions
> >> >>>>>>>> I'm fine with defining cache config options in the framework.
> >> >>>>>>>> A candidate place to put is FactoryUtil which also includes
> >> >>>> "sink.parallelism", "format" options.
> >> >>>>>>>>
> >> >>>>>>>> Best,
> >> >>>>>>>> Jark
> >> >>>>>>>>
> >> >>>>>>>>
> >> >>>>>>>> On Wed, 18 May 2022 at 13:52, Александр Смирнов <
> >> >>> smiralexan@gmail.com>
> >> >>>> wrote:
> >> >>>>>>>>>
> >> >>>>>>>>> Hi Qingsheng,
> >> >>>>>>>>>
> >> >>>>>>>>> Thank you for considering my comments.
> >> >>>>>>>>>
> >> >>>>>>>>>> there might be custom logic before making retry, such as
> >> >>>> re-establish the connection
> >> >>>>>>>>>
> >> >>>>>>>>> Yes, I understand that. I meant that such logic can be placed
> in
> >> >> a
> >> >>>>>>>>> separate function, that can be implemented by connectors. Just
> >> >>> moving
> >> >>>>>>>>> the retry logic would make connector's LookupFunction more
> >> >> concise
> >> >>> +
> >> >>>>>>>>> avoid duplicate code. However, it's a minor change. The
> decision
> >> >> is
> >> >>>> up
> >> >>>>>>>>> to you.
> >> >>>>>>>>>
> >> >>>>>>>>>> We decide not to provide common DDL options and let
> developers
> >> >>>> define their own options as we do now per connector.
> >> >>>>>>>>>
> >> >>>>>>>>> What is the reason for that? One of the main goals of this
> FLIP
> >> >> was
> >> >>>> to
> >> >>>>>>>>> unify the configs, wasn't it? I understand that current cache
> >> >>> design
> >> >>>>>>>>> doesn't depend on ConfigOptions, as it did before. But still we
> >> >> can
> >> >>>> put
> >> >>>>>>>>> these options into the framework, so connectors can reuse them
> >> >> and
> >> >>>>>>>>> avoid code duplication, and, what is more significant, avoid
> >> >>> possible
> >> >>>>>>>>> different option naming. This can be pointed out in the
> >> >>>>>>>>> documentation for connector developers.
> >> >>>>>>>>>
> >> >>>>>>>>> Best regards,
> >> >>>>>>>>> Alexander
> >> >>>>>>>>>
> >> >>>>>>>>> вт, 17 мая 2022 г. в 17:11, Qingsheng Ren <renqschn@gmail.com
> >:
> >> >>>>>>>>>>
> >> >>>>>>>>>> Hi Alexander,
> >> >>>>>>>>>>
> >> >>>>>>>>>> Thanks for the review and glad to see we are on the same
> page!
> >> I
> >> >>>> think you forgot to cc the dev mailing list so I’m also quoting
> your
> >> >>> reply
> >> >>>> under this email.
> >> >>>>>>>>>>
> >> >>>>>>>>>>> We can add 'maxRetryTimes' option into this class
> >> >>>>>>>>>>
> >> >>>>>>>>>> In my opinion the retry logic should be implemented in
> lookup()
> >> >>>> instead of in LookupFunction#eval(). Retrying is only meaningful
> >> under
> >> >>> some
> >> >>>> specific retriable failures, and there might be custom logic before
> >> >>> making
> >> >>>> retry, such as re-establish the connection
> (JdbcRowDataLookupFunction
> >> >> is
> >> >>> an
> >> >>>> example), so it's more handy to leave it to the connector.
> >> >>>>>>>>>>
> >> >>>>>>>>>>> I don't see DDL options, that were in previous version of
> >> FLIP.
> >> >>> Do
> >> >>>> you have any special plans for them?
> >> >>>>>>>>>>
> >> >>>>>>>>>> We decide not to provide common DDL options and let
> developers
> >> >>>> define their own options as we do now per connector.
> >> >>>>>>>>>>
> >> >>>>>>>>>> The rest of comments sound great and I’ll update the FLIP.
> Hope
> >> >> we
> >> >>>> can finalize our proposal soon!
> >> >>>>>>>>>>
> >> >>>>>>>>>> Best,
> >> >>>>>>>>>>
> >> >>>>>>>>>> Qingsheng
> >> >>>>>>>>>>
> >> >>>>>>>>>>
> >> >>>>>>>>>>> On May 17, 2022, at 13:46, Александр Смирнов <
> >> >>> smiralexan@gmail.com>
> >> >>>> wrote:
> >> >>>>>>>>>>>
> >> >>>>>>>>>>> Hi Qingsheng and devs!
> >> >>>>>>>>>>>
> >> >>>>>>>>>>> I like the overall design of updated FLIP, however I have
> >> >> several
> >> >>>>>>>>>>> suggestions and questions.
> >> >>>>>>>>>>>
> >> >>>>>>>>>>> 1) Introducing LookupFunction as a subclass of TableFunction
> >> >> is a
> >> >>>> good
> >> >>>>>>>>>>> idea. We can add 'maxRetryTimes' option into this class.
> >> 'eval'
> >> >>>> method
> >> >>>>>>>>>>> of new LookupFunction is great for this purpose. The same is
> >> >> for
> >> >>>>>>>>>>> 'async' case.
> >> >>>>>>>>>>>
> >> >>>>>>>>>>> 2) There might be other configs in future, such as
> >> >>>> 'cacheMissingKey'
> >> >>>>>>>>>>> in LookupFunctionProvider or 'rescanInterval' in
> >> >>>> ScanRuntimeProvider.
> >> >>>>>>>>>>> Maybe use Builder pattern in LookupFunctionProvider and
> >> >>>>>>>>>>> RescanRuntimeProvider for more flexibility (use one 'build'
> >> >>> method
> >> >>>>>>>>>>> instead of many 'of' methods in future)?
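The trade-off raised in point 2 above (static 'of' factories vs. a builder that can grow optional parameters without breaking callers) can be illustrated with a small hypothetical config object; all names here are made up for illustration, not part of the FLIP:

```java
import java.time.Duration;
import java.time.LocalTime;
import java.util.Optional;

// Hypothetical config object: the builder can gain optional knobs
// (e.g. rescanStartTime) later without adding new "of" overloads.
public final class RescanConfig {
    private final Duration rescanInterval;
    private final Optional<LocalTime> rescanStartTime;

    private RescanConfig(Duration interval, Optional<LocalTime> startTime) {
        this.rescanInterval = interval;
        this.rescanStartTime = startTime;
    }

    public Duration rescanInterval() { return rescanInterval; }

    public Optional<LocalTime> rescanStartTime() { return rescanStartTime; }

    public static Builder newBuilder(Duration rescanInterval) {
        return new Builder(rescanInterval);
    }

    public static final class Builder {
        private final Duration interval;
        private Optional<LocalTime> startTime = Optional.empty();

        private Builder(Duration interval) { this.interval = interval; }

        // Optional parameter added later; "of" factories would need a new overload.
        public Builder rescanStartTime(LocalTime t) {
            this.startTime = Optional.of(t);
            return this;
        }

        public RescanConfig build() { return new RescanConfig(interval, startTime); }
    }

    public static void main(String[] args) {
        RescanConfig cfg = RescanConfig.newBuilder(Duration.ofHours(24))
                .rescanStartTime(LocalTime.of(2, 0))
                .build();
        System.out.println(cfg.rescanInterval());
    }
}
```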
> >> >>>>>>>>>>>
> >> >>>>>>>>>>> 3) What are the plans for existing TableFunctionProvider and
> >> >>>>>>>>>>> AsyncTableFunctionProvider? I think they should be
> deprecated.
> >> >>>>>>>>>>>
> >> >>>>>>>>>>> 4) Am I right that the current design does not assume usage
> of
> >> >>>>>>>>>>> user-provided LookupCache in re-scanning? In this case, it
> is
> >> >> not
> >> >>>> very
> >> >>>>>>>>>>> clear why we need methods such as 'invalidate' or
> 'putAll'
> >> >> in
> >> >>>>>>>>>>> LookupCache.
> >> >>>>>>>>>>>
> >> >>>>>>>>>>> 5) I don't see DDL options, that were in previous version of
> >> >>> FLIP.
> >> >>>> Do
> >> >>>>>>>>>>> you have any special plans for them?
> >> >>>>>>>>>>>
> >> >>>>>>>>>>> If you don't mind, I would be glad to be able to make small
> >> >>>>>>>>>>> adjustments to the FLIP document too. I think it's worth
> >> >>> mentioning
> >> >>>>>>>>>>> which optimizations exactly are planned for the future.
> >> >>>>>>>>>>>
> >> >>>>>>>>>>> Best regards,
> >> >>>>>>>>>>> Smirnov Alexander
> >> >>>>>>>>>>>
> >> >>>>>>>>>>> пт, 13 мая 2022 г. в 20:27, Qingsheng Ren <
> renqschn@gmail.com
> >> >>> :
> >> >>>>>>>>>>>>
> >> >>>>>>>>>>>> Hi Alexander and devs,
> >> >>>>>>>>>>>>
> >> >>>>>>>>>>>> Thank you very much for the in-depth discussion! As Jark
> >> >>>> mentioned we were inspired by Alexander's idea and made a refactor
> on
> >> >> our
> >> >>>> design. FLIP-221 [1] has been updated to reflect our design now and
> >> we
> >> >>> are
> >> >>>> happy to hear more suggestions from you!
> >> >>>>>>>>>>>>
> >> >>>>>>>>>>>> Compared to the previous design:
> >> >>>>>>>>>>>> 1. The lookup cache serves at table runtime level and is
> >> >>>> integrated as a component of LookupJoinRunner as discussed
> >> previously.
> >> >>>>>>>>>>>> 2. Interfaces are renamed and re-designed to reflect the
> new
> >> >>>> design.
> >> >>>>>>>>>>>> 3. We separate the all-caching case individually and
> >> >> introduce a
> >> >>>> new RescanRuntimeProvider to reuse the ability of scanning. We are
> >> >>> planning
> >> >>>> to support SourceFunction / InputFormat for now considering the
> >> >>> complexity
> >> >>>> of FLIP-27 Source API.
> >> >>>>>>>>>>>> 4. A new interface LookupFunction is introduced to make the
> >> >>>> semantic of lookup more straightforward for developers.
> >> >>>>>>>>>>>>
> >> >>>>>>>>>>>> For replying to Alexander:
> >> >>>>>>>>>>>>> However I'm a little confused whether InputFormat is
> >> >> deprecated
> >> >>>> or not. Am I right that it will be so in the future, but currently
> >> it's
> >> >>> not?
> >> >>>>>>>>>>>> Yes you are right. InputFormat is not deprecated for now. I
> >> >>> think
> >> >>>> it will be deprecated in the future but we don't have a clear plan
> >> for
> >> >>> that.
> >> >>>>>>>>>>>>
> >> >>>>>>>>>>>> Thanks again for the discussion on this FLIP and looking
> >> >> forward
> >> >>>> to cooperating with you after we finalize the design and
> interfaces!
> >> >>>>>>>>>>>>
> >> >>>>>>>>>>>> [1]
> >> >>>>
> >> >>>
> >> >>
> >>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-221+Abstraction+for+lookup+source+cache+and+metric
> >> >>>>>>>>>>>>
> >> >>>>>>>>>>>> Best regards,
> >> >>>>>>>>>>>>
> >> >>>>>>>>>>>> Qingsheng
> >> >>>>>>>>>>>>
> >> >>>>>>>>>>>>
> >> >>>>>>>>>>>> On Fri, May 13, 2022 at 12:12 AM Александр Смирнов <
> >> >>>> smiralexan@gmail.com> wrote:
> >> >>>>>>>>>>>>>
> >> >>>>>>>>>>>>> Hi Jark, Qingsheng and Leonard!
> >> >>>>>>>>>>>>>
> >> >>>>>>>>>>>>> Glad to see that we came to a consensus on almost all
> >> points!
> >> >>>>>>>>>>>>>
> >> >>>>>>>>>>>>> However I'm a little confused whether InputFormat is
> >> >> deprecated
> >> >>>> or
> >> >>>>>>>>>>>>> not. Am I right that it will be so in the future, but
> >> >> currently
> >> >>>> it's
> >> >>>>>>>>>>>>> not? Actually I also think that for the first version it's
> >> OK
> >> >>> to
> >> >>>> use
> >> >>>>>>>>>>>>> InputFormat in ALL cache realization, because supporting
> >> >> rescan
> >> >>>>>>>>>>>>> ability seems like a very distant prospect. But for this
> >> >>>> decision we
> >> >>>>>>>>>>>>> need a consensus among all discussion participants.
> >> >>>>>>>>>>>>>
> >> >>>>>>>>>>>>> In general, I don't have something to argue with your
> >> >>>> statements. All
> >> >>>>>>>>>>>>> of them correspond my ideas. Looking ahead, it would be
> nice
> >> >> to
> >> >>>> work
> >> >>>>>>>>>>>>> on this FLIP cooperatively. I've already done a lot of
> work
> >> >> on
> >> >>>> lookup
> >> >>>>>>>>>>>>> join caching with realization very close to the one we are
> >> >>>> discussing,
> >> >>>>>>>>>>>>> and want to share the results of this work. Anyway looking
> >> >>>> forward for
> >> >>>>>>>>>>>>> the FLIP update!
> >> >>>>>>>>>>>>>
> >> >>>>>>>>>>>>> Best regards,
> >> >>>>>>>>>>>>> Smirnov Alexander
> >> >>>>>>>>>>>>>
> >> >>>>>>>>>>>>> чт, 12 мая 2022 г. в 17:38, Jark Wu <im...@gmail.com>:
> >> >>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>> Hi Alex,
> >> >>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>> Thanks for summarizing your points.
> >> >>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>> In the past week, Qingsheng, Leonard, and I have
> discussed
> >> >> it
> >> >>>> several times
> >> >>>>>>>>>>>>>> and we have totally refactored the design.
> >> >>>>>>>>>>>>>> I'm glad to say we have reached a consensus on many of
> your
> >> >>>> points!
> >> >>>>>>>>>>>>>> Qingsheng is still working on updating the design docs
> and
> >> >>>> maybe can be
> >> >>>>>>>>>>>>>> available in the next few days.
> >> >>>>>>>>>>>>>> I will share some conclusions from our discussions:
> >> >>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>> 1) we have refactored the design towards the "cache in
> >> >>>> framework" way.
> >> >>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>> 2) a "LookupCache" interface for users to customize and a
> >> >>>> default
> >> >>>>>>>>>>>>>> implementation with a builder for ease of use.
> >> >>>>>>>>>>>>>> This can make it possible to have both flexibility
> and
> >> >>>> conciseness.
> >> >>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>> 3) Filter pushdown is important for ALL and LRU lookup
> >> >> cache,
> >>>> especially reducing
> >> >>>>>>>>>>>>>> IO.
> >> >>>>>>>>>>>>>> Filter pushdown should be the final state and the unified
> >> >> way
> >> >>>> to both
> >> >>>>>>>>>>>>>> support pruning ALL cache and LRU cache,
> >> >>>>>>>>>>>>>> so I think we should make effort in this direction. If we
> >> >> need
> >> >>>> to support
> >> >>>>>>>>>>>>>> filter pushdown for ALL cache anyway, why not use
> >> >>>>>>>>>>>>>> it for LRU cache as well? Either way, as we decide to
> >> >>> implement
> >> >>>> the cache
> >> >>>>>>>>>>>>>> in the framework, we have the chance to support
> >> >>>>>>>>>>>>>> filter on cache anytime. This is an optimization and it
> >> >>> doesn't
> >> >>>> affect the
> >> >>>>>>>>>>>>>> public API. I think we can create a JIRA issue to
> >> >>>>>>>>>>>>>> discuss it when the FLIP is accepted.
> >> >>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>> 4) The idea to support ALL cache is similar to your
> >> >> proposal.
> >> >>>>>>>>>>>>>> In the first version, we will only support InputFormat,
> >> >>>> SourceFunction for
> >> >>>>>>>>>>>>>> cache all (invoke InputFormat in join operator).
> >> >>>>>>>>>>>>>> For FLIP-27 source, we need to join a true source
> operator
> >> >>>> instead of
> >> >>>>>>>>>>>>>> calling it embedded in the join operator.
> >> >>>>>>>>>>>>>> However, this needs another FLIP to support the re-scan
> >> >>> ability
> >> >>>> for FLIP-27
> >> >>>>>>>>>>>>>> Source, and this can be a large work.
> >> >>>>>>>>>>>>>> In order to not block this issue, we can put the effort
> of
> >> >>>> FLIP-27 source
> >> >>>>>>>>>>>>>> integration into future work and integrate
> >> >>>>>>>>>>>>>> InputFormat&SourceFunction for now.
> >> >>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>> I think it's fine to use InputFormat&SourceFunction, as
> >> they
> >> >>>> are not
> >> >>>>>>>>>>>>>> deprecated, otherwise, we have to introduce another
> >> function
> >> >>>>>>>>>>>>>> similar to them which is meaningless. We need to plan
> >> >> FLIP-27
> >> >>>> source
> >> >>>>>>>>>>>>>> integration ASAP before InputFormat & SourceFunction are
> >> >>>> deprecated.
> >> >>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>> Best,
> >> >>>>>>>>>>>>>> Jark
> >> >>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>> On Thu, 12 May 2022 at 15:46, Александр Смирнов <
> >> >>>> smiralexan@gmail.com>
> >> >>>>>>>>>>>>>> wrote:
> >> >>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>> Hi Martijn!
> >> >>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>> Got it. Therefore, the realization with InputFormat is
> not
> >> >>>> considered.
> >> >>>>>>>>>>>>>>> Thanks for clearing that up!
> >> >>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>> Best regards,
> >> >>>>>>>>>>>>>>> Smirnov Alexander
> >> >>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>> чт, 12 мая 2022 г. в 14:23, Martijn Visser <
> >> >>>> martijn@ververica.com>:
> >> >>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>> Hi,
> >> >>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>> With regards to:
> >> >>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>> But if there are plans to refactor all connectors to
> >> >>> FLIP-27
> >> >>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>> Yes, FLIP-27 is the target for all connectors. The old
> >> >>>> interfaces will be
> >> >>>>>>>>>>>>>>>> deprecated and connectors will either be refactored to
> >> use
> >> >>>> the new ones
> >> >>>>>>>>>>>>>>> or
> >> >>>>>>>>>>>>>>>> dropped.
> >> >>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>> The caching should work for connectors that are using
> >> >>> FLIP-27
> >> >>>> interfaces,
> >> >>>>>>>>>>>>>>>> we should not introduce new features for old
> interfaces.
> >> >>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>> Best regards,
> >> >>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>> Martijn
> >> >>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>> On Thu, 12 May 2022 at 06:19, Александр Смирнов <
> >> >>>> smiralexan@gmail.com>
> >> >>>>>>>>>>>>>>>> wrote:
> >> >>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>> Hi Jark!
> >> >>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>> Sorry for the late response. I would like to make some
> >> >>>> comments and
> >> >>>>>>>>>>>>>>>>> clarify my points.
> >> >>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>> 1) I agree with your first statement. I think we can
> >> >>> achieve
> >> >>>> both
> >> >>>>>>>>>>>>>>>>> advantages this way: put the Cache interface in
> >> >>>> flink-table-common,
> >> >>>>>>>>>>>>>>>>> but have implementations of it in flink-table-runtime.
> >> >>>> Therefore if a
> >> >>>>>>>>>>>>>>>>> connector developer wants to use existing cache
> >> >> strategies
> >> >>>> and their
> >> >>>>>>>>>>>>>>>>> implementations, he can just pass lookupConfig to the
> >> >>>> planner, but if
> >> >>>>>>>>>>>>>>>>> he wants to have its own cache implementation in his
> >> >>>> TableFunction, it
> >> >>>>>>>>>>>>>>>>> will be possible for him to use the existing interface
> >> >> for
> >> >>>> this
> >> >>>>>>>>>>>>>>>>> purpose (we can explicitly point this out in the
> >> >>>> documentation). In
> >> >>>>>>>>>>>>>>>>> this way all configs and metrics will be unified.
> WDYT?
> >> >>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>> If a filter can prune 90% of data in the cache, we
> will
> >> >>>> have 90% of
> >> >>>>>>>>>>>>>>>>> lookup requests that can never be cached
> >> >>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>> 2) Let me clarify the logic of the filters optimization in case
> case
> >> >> of
> >> >>>> LRU cache.
> >> >>>>>>>>>>>>>>>>> It looks like Cache<RowData, Collection<RowData>>.
> Here
> >> >> we
> >> >>>> always
> >> >>>>>>>>>>>>>>>>> store the response of the dimension table in cache,
> even
> >> >>>> after
> >> >>>>>>>>>>>>>>>>> applying calc function. I.e. if there are no rows
> after
> >> >>>> applying
> >> >>>>>>>>>>>>>>>>> filters to the result of the 'eval' method of
> >> >>> TableFunction,
> >> >>>> we store
> >> >>>>>>>>>>>>>>>>> the empty list by lookup keys. Therefore the cache
> line
> >> >>> will
> >> >>>> be
> >> >>>>>>>>>>>>>>>>> filled, but will require much less memory (in bytes).
> >> >> I.e.
> >> >>>> we don't
> >> >>>>>>>>>>>>>>>>> completely filter keys, by which result was pruned,
> but
> >> >>>> significantly
> >> >>>>>>>>>>>>>>>>> reduce required memory to store this result. If the
> user
> >> >>>> knows about
> >> >>>>>>>>>>>>>>>>> this behavior, he can increase the 'max-rows' option
> >> >> before
> >> >>>> the start
> >> >>>>>>>>>>>>>>>>> of the job. But actually I came up with the idea that
> we
> >> >>> can
> >> >>>> do this
> >> >>>>>>>>>>>>>>>>> automatically by using the 'maximumWeight' and
> 'weigher'
> >> >>>> methods of
> >> >>>>>>>>>>>>>>>>> GuavaCache [1]. Weight can be the size of the
> collection
> >> >> of
> >> >>>> rows
> >> >>>>>>>>>>>>>>>>> (value of cache). Therefore cache can automatically
> fit
> >> >>> much
> >> >>>> more
> >> >>>>>>>>>>>>>>>>> records than before.
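Guava's CacheBuilder does expose maximumWeight(long) and weigher(...) for exactly this purpose, as the [1] link shows. To keep things dependency-free, the same weight-based LRU idea can be sketched in plain Java; this is an illustrative toy, not the Guava implementation:

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.ToIntBiFunction;

public class WeightedLruCache<K, V> {
    // Access-ordered map: iteration order is least-recently-used first.
    private final LinkedHashMap<K, V> map = new LinkedHashMap<>(16, 0.75f, true);
    private final long maxWeight;
    private final ToIntBiFunction<K, V> weigher;
    private long totalWeight = 0;

    public WeightedLruCache(long maxWeight, ToIntBiFunction<K, V> weigher) {
        this.maxWeight = maxWeight;
        this.weigher = weigher;
    }

    public V get(K key) { return map.get(key); }

    public void put(K key, V value) {
        V old = map.put(key, value);
        if (old != null) totalWeight -= weigher.applyAsInt(key, old);
        totalWeight += weigher.applyAsInt(key, value);
        // Evict least-recently-used entries until back under the weight budget
        // (a toy: an entry heavier than the whole budget evicts everything).
        Iterator<Map.Entry<K, V>> it = map.entrySet().iterator();
        while (totalWeight > maxWeight && it.hasNext()) {
            Map.Entry<K, V> eldest = it.next();
            totalWeight -= weigher.applyAsInt(eldest.getKey(), eldest.getValue());
            it.remove();
        }
    }

    public static void main(String[] args) {
        // Weigh each entry by the number of rows in its cached result
        // (minimum 1, so cached empty results still occupy a small slot).
        WeightedLruCache<String, List<Integer>> cache =
                new WeightedLruCache<>(4, (key, rows) -> Math.max(1, rows.size()));
        cache.put("a", Arrays.asList(1, 2, 3));  // weight 3
        cache.put("b", Collections.emptyList()); // weight 1 -> total 4
        cache.put("c", Arrays.asList(4, 5));     // weight 2 -> total 6, evicts "a"
        System.out.println("a evicted: " + (cache.get("a") == null));
    }
}
```

This shows why an empty lookup result is cheap to keep: it costs one weight unit instead of a full per-row budget, which is the point made in the email above.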
> >> >>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>> Flink SQL has provided a standard way to do filters
> and
> >> >>>> projects
> >> >>>>>>>>>>>>>>>>> pushdown, i.e., SupportsFilterPushDown and
> >> >>>> SupportsProjectionPushDown.
> >> >>>>>>>>>>>>>>>>>> Jdbc/hive/HBase haven't implemented the interfaces,
> >> >> don't
> >> >>>> mean it's
> >> >>>>>>>>>>>>>>> hard
> >> >>>>>>>>>>>>>>>>> to implement.
> >> >>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>> It's debatable how difficult it will be to implement
> >> >> filter
> >> >>>> pushdown.
> >> >>>>>>>>>>>>>>>>> But I think the fact that currently there is no
> database
> >> >>>> connector
> >> >>>>>>>>>>>>>>>>> with filter pushdown at least means that this feature
> >> >> won't
> >> >>>> be
> >> >>>>>>>>>>>>>>>>> supported soon in connectors. Moreover, if we talk
> about
> >> >>>> other
> >> >>>>>>>>>>>>>>>>> connectors (not in Flink repo), their databases might
> >> not
> >> >>>> support all
> >> >>>>>>>>>>>>>>>>> Flink filters (or not support filters at all). I think
> >> >>> users
> >> >>>> are
> >> >>>>>>>>>>>>>>>>> interested in supporting cache filters optimization
> >> >>>> independently of
> >> >>>>>>>>>>>>>>>>> supporting other features and solving more complex
> >> >> problems
> >> >>>> (or
> >> >>>>>>>>>>>>>>>>> unsolvable at all).
> >> >>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>> 3) I agree with your third statement. Actually in our
> >> >>>> internal version
> >> >>>>>>>>>>>>>>>>> I also tried to unify the logic of scanning and
> >> reloading
> >> >>>> data from
> >> >>>>>>>>>>>>>>>>> connectors. But unfortunately, I didn't find a way to
> >> >> unify
> >> >>>> the logic
> >> >>>>>>>>>>>>>>>>> of all ScanRuntimeProviders (InputFormat,
> >> SourceFunction,
> >> >>>> Source,...)
> >> >>>>>>>>>>>>>>>>> and reuse it in reloading ALL cache. As a result I
> >> >> settled
> >> >>>> on using
> >> >>>>>>>>>>>>>>>>> InputFormat, because it was used for scanning in all
> >> >> lookup
> >> >>>>>>>>>>>>>>>>> connectors. (I didn't know that there are plans to
> >> >>> deprecate
> >> >>>>>>>>>>>>>>>>> InputFormat in favor of FLIP-27 Source). IMO usage of
> >> >>>> FLIP-27 source
> >> >>>>>>>>>>>>>>>>> in ALL caching is not good idea, because this source
> was
> >> >>>> designed to
> >> >>>>>>>>>>>>>>>>> work in distributed environment (SplitEnumerator on
> >> >>>> JobManager and
> >> >>>>>>>>>>>>>>>>> SourceReaders on TaskManagers), not in one operator
> >> >> (lookup
> >> >>>> join
> >> >>>>>>>>>>>>>>>>> operator in our case). There is even no direct way to
> >> >> pass
> >> >>>> splits from
> >> >>>>>>>>>>>>>>>>> SplitEnumerator to SourceReader (this logic works
> >> through
> >> >>>>>>>>>>>>>>>>> SplitEnumeratorContext, which requires
> >> >>>>>>>>>>>>>>>>> OperatorCoordinator.SubtaskGateway to send
> >> >> AddSplitEvents).
> >> >>>> Usage of
> >> >>>>>>>>>>>>>>>>> InputFormat for ALL cache seems much more clearer and
> >> >>>> easier. But if
> >> >>>>>>>>>>>>>>>>> there are plans to refactor all connectors to
> FLIP-27, I
> >> >>>> have the
> >> >>>>>>>>>>>>>>>>> following ideas: maybe we can refuse from lookup join
> >> ALL
> >> >>>> cache in
> >> >>>>>>>>>>>>>>>>> favor of simple join with multiple scanning of batch
> >> >>> source?
> >> >>>> The point
> >> >>>>>>>>>>>>>>>>> is that the only difference between lookup join ALL
> >> cache
> >> >>>> and simple
> >> >>>>>>>>>>>>>>>>> join with batch source is that in the first case
> >> scanning
> >> >>> is
> >> >>>> performed
> >> >>>>>>>>>>>>>>>>> multiple times, in between which state (cache) is
> >> cleared
> >> >>>> (correct me
> >> >>>>>>>>>>>>>>>>> if I'm wrong). So what if we extend the functionality
> of
> >> >>>> simple join
> >> >>>>>>>>>>>>>>>>> to support state reloading + extend the functionality
> of
> >> >>>> scanning
> >> >>>>>>>>>>>>>>>>> batch source multiple times (this one should be easy
> >> with
> >> >>>> new FLIP-27
> >> >>>>>>>>>>>>>>>>> source, that unifies streaming/batch reading - we will
> >> >> need
> >> >>>> to change
> >> >>>>>>>>>>>>>>>>> only SplitEnumerator, which will pass splits again
> after
> >> >>>> some TTL).
> >> >>>>>>>>>>>>>>>>> WDYT? I must say that this looks like a long-term goal
> >> >> and
> >> >>>> will make
> >> >>>>>>>>>>>>>>>>> the scope of this FLIP even larger than you said.
> Maybe
> >> >> we
> >> >>>> can limit
> >> >>>>>>>>>>>>>>>>> ourselves to a simpler solution now (InputFormats).
> >> >>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>> So to sum up, my points is like this:
> >> >>>>>>>>>>>>>>>>> 1) There is a way to make both concise and flexible
> >> >>>> interfaces for
> >> >>>>>>>>>>>>>>>>> caching in lookup join.
> >> >>>>>>>>>>>>>>>>> 2) Cache filters optimization is important both in LRU
> >> >> and
> >> >>>> ALL caches.
> >> >>>>>>>>>>>>>>>>> 3) It is unclear when filter pushdown will be
> supported
> >> >> in
> >> >>>> Flink
> >> >>>>>>>>>>>>>>>>> connectors, some of the connectors might not have the
> >> >>>> opportunity to
> >> >>>>>>>>>>>>>>>>> support filter pushdown + as I know, currently filter
> >> >>>> pushdown works
> >> >>>>>>>>>>>>>>>>> only for scanning (not lookup). So cache filters +
> >> >>>> projections
> >> >>>>>>>>>>>>>>>>> optimization should be independent from other
> features.
> >> >>>>>>>>>>>>>>>>> 4) ALL cache realization is a complex topic that
> >> involves
> >> >>>> multiple
> >> >>>>>>>>>>>>>>>>> aspects of how Flink is developing. Refusing from
> >> >>>> InputFormat in favor
> >> >>>>>>>>>>>>>>>>> of FLIP-27 Source will make ALL cache realization
> really
> >> >>>> complex and
> >> >>>>>>>>>>>>>>>>> not clear, so maybe instead of that we can extend the
> >> >>>> functionality of
> >> >>>>>>>>>>>>>>>>> simple join or not refuse from InputFormat in case of
> >> >>> lookup
> >> >>>> join ALL
> >> >>>>>>>>>>>>>>>>> cache?
> >> >>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>> Best regards,
> >> >>>>>>>>>>>>>>>>> Smirnov Alexander
> >> >>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>> [1]
> >> >>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>
> >> >>>>
> >> >>>
> >> >>
> >>
> https://guava.dev/releases/18.0/api/docs/com/google/common/cache/CacheBuilder.html#weigher(com.google.common.cache.Weigher)
> >> >>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>> чт, 5 мая 2022 г. в 20:34, Jark Wu <imjark@gmail.com
> >:
> >> >>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>> It's great to see the active discussion! I want to
> >> share
> >> >>> my
> >> >>>> ideas:
> >> >>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>> 1) implement the cache in framework vs. connectors
> base
> >> >>>>>>>>>>>>>>>>>> I don't have a strong opinion on this. Both ways
> should
> >> >>>> work (e.g.,
> >> >>>>>>>>>>>>>>> cache
> >> >>>>>>>>>>>>>>>>>> pruning, compatibility).
> >> >>>>>>>>>>>>>>>>>> The framework way can provide more concise
> interfaces.
> >> >>>>>>>>>>>>>>>>>> The connector base way can define more flexible cache
> >> >>>>>>>>>>>>>>>>>> strategies/implementations.
> >> >>>>>>>>>>>>>>>>>> We are still investigating a way to see if we can
> have
> >> >>> both
> >> >>>>>>>>>>>>>>> advantages.
> >> >>>>>>>>>>>>>>>>>> We should reach a consensus that the way should be a
> >> >> final
> >> >>>> state,
> >> >>>>>>>>>>>>>>> and we
> >> >>>>>>>>>>>>>>>>>> are on the path to it.
> >> >>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>> 2) filters and projections pushdown:
> >> >>>>>>>>>>>>>>>>>> I agree with Alex that the filter pushdown into cache
> >> >> can
> >> >>>> benefit a
> >> >>>>>>>>>>>>>>> lot
> >> >>>>>>>>>>>>>>>>> for
> >> >>>>>>>>>>>>>>>>>> ALL cache.
> >> >>>>>>>>>>>>>>>>>> However, this is not true for the LRU cache. Connectors
> >> >>>>>>>>>>>>>>>>>> use a cache to reduce IO requests to databases for better
> >> >>>>>>>>>>>>>>>>>> throughput. If a filter can prune 90% of the data in the
> >> >>>>>>>>>>>>>>>>>> cache, then 90% of lookup requests can never be cached and
> >> >>>>>>>>>>>>>>>>>> will hit the databases directly. That means the cache is
> >> >>>>>>>>>>>>>>>>>> meaningless in this case.
> >> >>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>> IMO, Flink SQL has provided a standard way to do
> >> filters
> >> >>>> and projects
> >> >>>>>>>>>>>>>>>>>> pushdown, i.e., SupportsFilterPushDown and
> >> >>>>>>>>>>>>>>> SupportsProjectionPushDown.
> >> >>>>>>>>>>>>>>>>>> Jdbc/hive/HBase haven't implemented the interfaces,
> >> >> don't
> >> >>>> mean it's
> >> >>>>>>>>>>>>>>> hard
> >> >>>>>>>>>>>>>>>>> to
> >> >>>>>>>>>>>>>>>>>> implement.
> >> >>>>>>>>>>>>>>>>>> They should implement the pushdown interfaces to
> reduce
> >> >> IO
> >> >>>> and the
> >> >>>>>>>>>>>>>>> cache
> >> >>>>>>>>>>>>>>>>>> size.
> >> >>>>>>>>>>>>>>>>>> That should be a final state that the scan source and
> >> >>>> lookup source
> >> >>>>>>>>>>>>>>> share
> >> >>>>>>>>>>>>>>>>>> the exact pushdown implementation.
> >> >>>>>>>>>>>>>>>>>> I don't see why we need to duplicate the pushdown logic in
> >> >>>>>>>>>>>>>>>>>> caches, which would complicate the lookup join design.
> >> >>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>> 3) ALL cache abstraction
> >> >>>>>>>>>>>>>>>>>> All cache might be the most challenging part of this
> >> >> FLIP.
> >> >>>> We have
> >> >>>>>>>>>>>>>>> never
> >> >>>>>>>>>>>>>>>>>> provided a reload-lookup public interface.
> >> >>>>>>>>>>>>>>>>>> Currently, we put the reload logic in the "eval"
> method
> >> >> of
> >> >>>>>>>>>>>>>>> TableFunction.
> >> >>>>>>>>>>>>>>>>>> That's hard for some sources (e.g., Hive).
> >> >>>>>>>>>>>>>>>>>> Ideally, connector implementation should share the
> >> logic
> >> >>> of
> >> >>>> reload
> >> >>>>>>>>>>>>>>> and
> >> >>>>>>>>>>>>>>>>>> scan, i.e. ScanTableSource with
> >> >>>> InputFormat/SourceFunction/FLIP-27
> >> >>>>>>>>>>>>>>>>> Source.
> >> >>>>>>>>>>>>>>>>>> However, InputFormat/SourceFunction are deprecated,
> and
> >> >>> the
> >> >>>> FLIP-27
> >> >>>>>>>>>>>>>>>>> source
> >> >>>>>>>>>>>>>>>>>> is deeply coupled with SourceOperator.
> >> >>>>>>>>>>>>>>>>>> If we want to invoke the FLIP-27 source in
> LookupJoin,
> >> >>> this
> >> >>>> may make
> >> >>>>>>>>>>>>>>> the
> >> >>>>>>>>>>>>>>>>>> scope of this FLIP much larger.
> >> >>>>>>>>>>>>>>>>>> We are still investigating how to abstract the ALL
> >> cache
> >> >>>> logic and
> >> >>>>>>>>>>>>>>> reuse
> >> >>>>>>>>>>>>>>>>>> the existing source interfaces.
> >> >>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>> Best,
> >> >>>>>>>>>>>>>>>>>> Jark
> >> >>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>> On Thu, 5 May 2022 at 20:22, Roman Boyko <
> >> >>>> ro.v.boyko@gmail.com>
> >> >>>>>>>>>>>>>>> wrote:
> >> >>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>> That is a much more complicated effort and lies outside
> >> >>>>>>>>>>>>>>>>>>> the scope of this improvement, because such pushdowns
> >> >>>>>>>>>>>>>>>>>>> would have to be implemented for all ScanTableSource
> >> >>>>>>>>>>>>>>>>>>> implementations (not only the lookup ones).
> >> >>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>> On Thu, 5 May 2022 at 19:02, Martijn Visser <
> >> >>>>>>>>>>>>>>> martijnvisser@apache.org>
> >> >>>>>>>>>>>>>>>>>>> wrote:
> >> >>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>>> Hi everyone,
> >> >>>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>>> One question regarding "And Alexander correctly
> >> >>> mentioned
> >> >>>> that
> >> >>>>>>>>>>>>>>> filter
> >> >>>>>>>>>>>>>>>>>>>> pushdown still is not implemented for
> >> >> jdbc/hive/hbase."
> >> >>>> -> Would
> >> >>>>>>>>>>>>>>> an
> >> >>>>>>>>>>>>>>>>>>>> alternative solution be to actually implement these
> >> >>> filter
> >> >>>>>>>>>>>>>>> pushdowns?
> >> >>>>>>>>>>>>>>>>> I
> >> >>>>>>>>>>>>>>>>>>>> can
> >> >>>>>>>>>>>>>>>>>>>> imagine that there are many more benefits to doing
> >> >> that,
> >> >>>> outside
> >> >>>>>>>>>>>>>>> of
> >> >>>>>>>>>>>>>>>>> lookup
> >> >>>>>>>>>>>>>>>>>>>> caching and metrics.
> >> >>>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>>> Best regards,
> >> >>>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>>> Martijn Visser
> >> >>>>>>>>>>>>>>>>>>>> https://twitter.com/MartijnVisser82
> >> >>>>>>>>>>>>>>>>>>>> https://github.com/MartijnVisser
> >> >>>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>>> On Thu, 5 May 2022 at 13:58, Roman Boyko <
> >> >>>> ro.v.boyko@gmail.com>
> >> >>>>>>>>>>>>>>>>> wrote:
> >> >>>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>>>> Hi everyone!
> >> >>>>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>>>> Thanks for driving such a valuable improvement!
> >> >>>>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>>>> I do think that a single cache implementation would be a
> >> >>>>>>>>>>>>>>>>>>>>> nice opportunity for users. And it will break the "FOR
> >> >>>>>>>>>>>>>>>>>>>>> SYSTEM_TIME AS OF proc_time" semantics anyway, no matter
> >> >>>>>>>>>>>>>>>>>>>>> how it is implemented.
> >> >>>>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>>>> Putting myself in the user's shoes, I can say
> that:
> >> >>>>>>>>>>>>>>>>>>>>> 1) I would prefer to have the option to reduce the cache
> >> >>>>>>>>>>>>>>>>>>>>> size by simply filtering out unnecessary data, and the
> >> >>>>>>>>>>>>>>>>>>>>> handiest way to do that is inside the LookupRunners. It
> >> >>>>>>>>>>>>>>>>>>>>> would be a bit harder to pass the filter through the
> >> >>>>>>>>>>>>>>>>>>>>> LookupJoin node to the TableFunction. And Alexander
> >> >>>>>>>>>>>>>>>>>>>>> correctly mentioned that filter pushdown is still not
> >> >>>>>>>>>>>>>>>>>>>>> implemented for jdbc/hive/hbase.
> >> >>>>>>>>>>>>>>>>>>>>> 2) The ability to set different caching parameters for
> >> >>>>>>>>>>>>>>>>>>>>> different tables is quite important, so I would prefer to
> >> >>>>>>>>>>>>>>>>>>>>> set them through DDL rather than have the same TTL,
> >> >>>>>>>>>>>>>>>>>>>>> strategy, and other options for all lookup tables.
> >> >>>>>>>>>>>>>>>>>>>>> 3) Moving the cache into the framework does reduce
> >> >>>>>>>>>>>>>>>>>>>>> extensibility (users won't be able to implement their own
> >> >>>>>>>>>>>>>>>>>>>>> cache), but most probably that can be addressed by
> >> >>>>>>>>>>>>>>>>>>>>> providing more cache strategies and a wider set of
> >> >>>>>>>>>>>>>>>>>>>>> configurations.
> >> >>>>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>>>> All these points are much closer to the scheme proposed
> >> >>>>>>>>>>>>>>>>>>>>> by Alexander. Qingsheng Ren, please correct me if I'm
> >> >>>>>>>>>>>>>>>>>>>>> wrong - can all of these facilities be easily implemented
> >> >>>>>>>>>>>>>>>>>>>>> in your architecture?
> >> >>>>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>>>> Best regards,
> >> >>>>>>>>>>>>>>>>>>>>> Roman Boyko
> >> >>>>>>>>>>>>>>>>>>>>> e.: ro.v.boyko@gmail.com
> >> >>>>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>>>> On Wed, 4 May 2022 at 21:01, Martijn Visser <
> >> >>>>>>>>>>>>>>>>> martijnvisser@apache.org>
> >> >>>>>>>>>>>>>>>>>>>>> wrote:
> >> >>>>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>>>>> Hi everyone,
> >> >>>>>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>>>>> I don't have much to chip in, but just wanted to
> >> >>>> express that
> >> >>>>>>>>>>>>>>> I
> >> >>>>>>>>>>>>>>>>> really
> >> >>>>>>>>>>>>>>>>>>>>>> appreciate the in-depth discussion on this topic
> >> >> and I
> >> >>>> hope
> >> >>>>>>>>>>>>>>> that
> >> >>>>>>>>>>>>>>>>>>>> others
> >> >>>>>>>>>>>>>>>>>>>>>> will join the conversation.
> >> >>>>>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>>>>> Best regards,
> >> >>>>>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>>>>> Martijn
> >> >>>>>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>>>>> On Tue, 3 May 2022 at 10:15, Александр Смирнов <
> >> >>>>>>>>>>>>>>>>> smiralexan@gmail.com>
> >> >>>>>>>>>>>>>>>>>>>>>> wrote:
> >> >>>>>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>>>>>> Hi Qingsheng, Leonard and Jark,
> >> >>>>>>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>>>>>> Thanks for your detailed feedback! However, I
> have
> >> >>>> questions
> >> >>>>>>>>>>>>>>>>> about
> >> >>>>>>>>>>>>>>>>>>>>>>> some of your statements (maybe I didn't get
> >> >>>> something?).
> >> >>>>>>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>>>>>>> Caching actually breaks the semantic of "FOR
> >> >>>> SYSTEM_TIME
> >> >>>>>>>>>>>>>>> AS OF
> >> >>>>>>>>>>>>>>>>>>>>>> proc_time”
> >> >>>>>>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>>>>>> I agree that the semantics of "FOR SYSTEM_TIME
> AS
> >> >> OF
> >> >>>>>>>>>>>>>>> proc_time"
> >> >>>>>>>>>>>>>>>>> is
> >> >>>>>>>>>>>>>>>>>>>> not
> >> >>>>>>>>>>>>>>>>>>>>>>> fully implemented with caching, but as you said, users
> >> >>>>>>>>>>>>>>>>>>>>>>> consciously accept this trade-off to achieve better
> >> >>>>>>>>>>>>>>>>>>>>>>> performance (no one proposed to enable caching by
> >> >>>>>>>>>>>>>>>>>>>>>>> default, etc.). Or by users do you mean the developers
> >> >>>>>>>>>>>>>>>>>>>>>>> of connectors? In that case developers explicitly
> >> >>>>>>>>>>>>>>>>>>>>>>> specify whether their connector supports caching (in
> >> >>>>>>>>>>>>>>>>>>>>>>> the list of supported options); no one forces them to
> >> >>>>>>>>>>>>>>>>>>>>>>> do so if they don't want to. So what exactly is the
> >> >>>>>>>>>>>>>>>>>>>>>>> difference, from this point of view, between
> >> >>>>>>>>>>>>>>>>>>>>>>> implementing caching in flink-table-runtime and in
> >> >>>>>>>>>>>>>>>>>>>>>>> flink-table-common? How does it change whether the
> >> >>>>>>>>>>>>>>>>>>>>>>> semantics of "FOR SYSTEM_TIME AS OF proc_time" are
> >> >>>>>>>>>>>>>>>>>>>>>>> broken or not?
> >> >>>>>>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>>>>>>> confront a situation that allows table options
> in
> >> >>> DDL
> >> >>>> to
> >> >>>>>>>>>>>>>>>>> control
> >> >>>>>>>>>>>>>>>>>>>> the
> >> >>>>>>>>>>>>>>>>>>>>>>> behavior of the framework, which has never
> >> happened
> >> >>>>>>>>>>>>>>> previously
> >> >>>>>>>>>>>>>>>>> and
> >> >>>>>>>>>>>>>>>>>>>>> should
> >> >>>>>>>>>>>>>>>>>>>>>>> be cautious
> >> >>>>>>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>>>>>> If we talk about the main semantic differences between
> >> >>>>>>>>>>>>>>>>>>>>>>> DDL options and config options ("table.exec.xxx"),
> >> >>>>>>>>>>>>>>>>>>>>>>> isn't it about the scope of the options and their
> >> >>>>>>>>>>>>>>>>>>>>>>> importance for the user's business logic, rather than
> >> >>>>>>>>>>>>>>>>>>>>>>> the specific location of the corresponding logic in the
> >> >>>>>>>>>>>>>>>>>>>>>>> framework? For example, in my design, putting the
> >> >>>>>>>>>>>>>>>>>>>>>>> lookup cache strategy into the configuration would be
> >> >>>>>>>>>>>>>>>>>>>>>>> the wrong decision, because it directly affects the
> >> >>>>>>>>>>>>>>>>>>>>>>> user's business logic (not just performance
> >> >>>>>>>>>>>>>>>>>>>>>>> optimization) and touches only a few functions of ONE
> >> >>>>>>>>>>>>>>>>>>>>>>> table (there can be multiple tables with different
> >> >>>>>>>>>>>>>>>>>>>>>>> caches). Does it really matter to the user (or anyone
> >> >>>>>>>>>>>>>>>>>>>>>>> else) where the logic affected by the applied option is
> >> >>>>>>>>>>>>>>>>>>>>>>> located?
> >> >>>>>>>>>>>>>>>>>>>>>>> Also, I can recall the DDL option 'sink.parallelism',
> >> >>>>>>>>>>>>>>>>>>>>>>> which in some way "controls the behavior of the
> >> >>>>>>>>>>>>>>>>>>>>>>> framework", and I don't see any problem there.
> >> >>>>>>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>>>>>>> introduce a new interface for this all-caching
> >> >>>> scenario
> >> >>>>>>>>>>>>>>> and
> >> >>>>>>>>>>>>>>>>> the
> >> >>>>>>>>>>>>>>>>>>>>> design
> >> >>>>>>>>>>>>>>>>>>>>>>> would become more complex
> >> >>>>>>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>>>>>> This is a subject for a separate discussion, but in our
> >> >>>>>>>>>>>>>>>>>>>>>>> internal version we actually solved this problem quite
> >> >>>>>>>>>>>>>>>>>>>>>>> easily - we reused the InputFormat class (so there is no
> >> >>>>>>>>>>>>>>>>>>>>>>> need for a new API). The point is that currently all
> >> >>>>>>>>>>>>>>>>>>>>>>> lookup connectors use InputFormat for scanning data in
> >> >>>>>>>>>>>>>>>>>>>>>>> batch mode: HBase, JDBC and even Hive - the latter uses
> >> >>>>>>>>>>>>>>>>>>>>>>> the PartitionReader class, which is actually just a
> >> >>>>>>>>>>>>>>>>>>>>>>> wrapper around InputFormat. The advantage of this
> >> >>>>>>>>>>>>>>>>>>>>>>> solution is the ability to reload the cache data in
> >> >>>>>>>>>>>>>>>>>>>>>>> parallel (the number of threads depends on the number of
> >> >>>>>>>>>>>>>>>>>>>>>>> InputSplits, up to an upper limit). As a result, the
> >> >>>>>>>>>>>>>>>>>>>>>>> cache reload time drops significantly (as does the time
> >> >>>>>>>>>>>>>>>>>>>>>>> the input stream is blocked). I know that we usually try
> >> >>>>>>>>>>>>>>>>>>>>>>> to avoid concurrency in Flink code, but maybe this can
> >> >>>>>>>>>>>>>>>>>>>>>>> be an exception. BTW, I don't claim it's an ideal
> >> >>>>>>>>>>>>>>>>>>>>>>> solution; maybe there are better ones.
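The parallel reload described above can be sketched roughly as follows. Here a "split" is just a list of key/value pairs standing in for an InputSplit read by an InputFormat, and the thread cap mirrors the upper limit mentioned; all names are illustrative, not the internal implementation:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Sketch: reload an ALL cache by reading every split concurrently
// into one shared map, then hand the map to the lookup join.
public class ParallelAllCacheReload {
    public static <K, V> Map<K, V> reload(
            List<List<Map.Entry<K, V>>> splits, int maxThreads) throws Exception {
        Map<K, V> cache = new ConcurrentHashMap<>();
        // one task per split, but never more threads than maxThreads
        ExecutorService pool = Executors.newFixedThreadPool(
                Math.min(maxThreads, Math.max(1, splits.size())));
        try {
            List<Callable<Void>> tasks = new ArrayList<>();
            for (List<Map.Entry<K, V>> split : splits) {
                tasks.add(() -> {
                    for (Map.Entry<K, V> e : split) {
                        cache.put(e.getKey(), e.getValue());
                    }
                    return null;
                });
            }
            pool.invokeAll(tasks); // blocks until every split is loaded
        } finally {
            pool.shutdown();
        }
        return cache;
    }
}
```

In the real connector each task would open an InputFormat on its InputSplit instead of iterating an in-memory list; the shape of the coordination is the same.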
> >> >>>>>>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>>>>>>> Providing the cache in the framework might
> >> >> introduce
> >> >>>>>>>>>>>>>>>>> compatibility
> >> >>>>>>>>>>>>>>>>>>>>>> issues
> >> >>>>>>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>>>>>> That is possible only if the developer of the connector
> >> >>>>>>>>>>>>>>>>>>>>>>> does not refactor the code properly and uses the new
> >> >>>>>>>>>>>>>>>>>>>>>>> cache options incorrectly (i.e. explicitly passes the
> >> >>>>>>>>>>>>>>>>>>>>>>> same options to two different places in the code). For
> >> >>>>>>>>>>>>>>>>>>>>>>> correct behavior, all they need to do is redirect the
> >> >>>>>>>>>>>>>>>>>>>>>>> existing options to the framework's LookupConfig (and
> >> >>>>>>>>>>>>>>>>>>>>>>> maybe add option aliases if the naming differed);
> >> >>>>>>>>>>>>>>>>>>>>>>> everything will be transparent to users. If the
> >> >>>>>>>>>>>>>>>>>>>>>>> developer does no refactoring at all, nothing changes
> >> >>>>>>>>>>>>>>>>>>>>>>> for the connector thanks to backward compatibility.
> >> >>>>>>>>>>>>>>>>>>>>>>> Also, if a developer wants their own cache logic, they
> >> >>>>>>>>>>>>>>>>>>>>>>> can simply not pass some of the configs to the framework
> >> >>>>>>>>>>>>>>>>>>>>>>> and instead provide their own implementation with the
> >> >>>>>>>>>>>>>>>>>>>>>>> already existing configs and metrics (though I think
> >> >>>>>>>>>>>>>>>>>>>>>>> that is a rare case).
> >> >>>>>>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>>>>>>> filters and projections should be pushed all
> the
> >> >> way
> >> >>>> down
> >> >>>>>>>>>>>>>>> to
> >> >>>>>>>>>>>>>>>>> the
> >> >>>>>>>>>>>>>>>>>>>>> table
> >> >>>>>>>>>>>>>>>>>>>>>>> function, like what we do in the scan source
> >> >>>>>>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>>>>>> That is a great goal. But the truth is that the ONLY
> >> >>>>>>>>>>>>>>>>>>>>>>> connector that currently supports filter pushdown is
> >> >>>>>>>>>>>>>>>>>>>>>>> FileSystemTableSource (no database connector supports it
> >> >>>>>>>>>>>>>>>>>>>>>>> yet). Also, for some databases it is simply impossible
> >> >>>>>>>>>>>>>>>>>>>>>>> to push down filters as complex as the ones we have in
> >> >>>>>>>>>>>>>>>>>>>>>>> Flink.
> >> >>>>>>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>>>>>>> only applying these optimizations to the cache
> >> >> seems
> >> >>>> not
> >> >>>>>>>>>>>>>>>>> quite
> >> >>>>>>>>>>>>>>>>>>>>> useful
> >> >>>>>>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>>>>>> Filters can cut off an arbitrarily large amount of data
> >> >>>>>>>>>>>>>>>>>>>>>>> from the dimension table. For a simple example, suppose
> >> >>>>>>>>>>>>>>>>>>>>>>> the dimension table 'users' has a column 'age' with
> >> >>>>>>>>>>>>>>>>>>>>>>> values from 20 to 40, and the input stream 'clicks' is
> >> >>>>>>>>>>>>>>>>>>>>>>> roughly uniformly distributed by user age. With the
> >> >>>>>>>>>>>>>>>>>>>>>>> filter 'age > 30' there will be half as much data in the
> >> >>>>>>>>>>>>>>>>>>>>>>> cache, which means the user can almost double
> >> >>>>>>>>>>>>>>>>>>>>>>> 'lookup.cache.max-rows' and gain a huge performance
> >> >>>>>>>>>>>>>>>>>>>>>>> boost. Moreover, this optimization really starts to
> >> >>>>>>>>>>>>>>>>>>>>>>> shine with the 'ALL' cache, where tables that cannot fit
> >> >>>>>>>>>>>>>>>>>>>>>>> in memory without filters and projections can fit with
> >> >>>>>>>>>>>>>>>>>>>>>>> them. This opens up additional possibilities for users,
> >> >>>>>>>>>>>>>>>>>>>>>>> and it doesn't sound like 'not quite useful'.
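The filter-before-cache idea being debated can be sketched as follows. `fetcher` stands in for a point lookup against the external database and `filter` for the local predicate (e.g. age > 30); both names, and the Row-as-Integer simplification, are illustrative only:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import java.util.function.Function;
import java.util.function.Predicate;

// Sketch: apply the lookup table's local filter to fetched rows
// *before* they enter the cache, so rows the join would discard
// anyway never occupy cache space.
public class FilteringLookupCache<K, R> {
    private final Map<K, Optional<R>> cache = new HashMap<>();
    private final Function<K, R> fetcher;   // e.g. a JDBC point lookup
    private final Predicate<R> filter;      // e.g. row -> row.age > 30

    public FilteringLookupCache(Function<K, R> fetcher, Predicate<R> filter) {
        this.fetcher = fetcher;
        this.filter = filter;
    }

    public Optional<R> lookup(K key) {
        return cache.computeIfAbsent(key, k -> {
            R row = fetcher.apply(k);
            // Filtered-out (or missing) rows are cached as empty so the
            // same key does not hit the database again.
            return (row != null && filter.test(row))
                    ? Optional.of(row) : Optional.empty();
        });
    }
}
```

Note the design choice: caching an empty marker for filtered-out keys avoids repeated database hits, while the row payload itself is only stored for rows that pass the filter.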
> >> >>>>>>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>>>>>> It would be great to hear other voices regarding
> >> >> this
> >> >>>> topic!
> >> >>>>>>>>>>>>>>>>> Because
> >> >>>>>>>>>>>>>>>>>>>>>>> we have quite a lot of controversial points,
> and I
> >> >>>> think
> >> >>>>>>>>>>>>>>> with
> >> >>>>>>>>>>>>>>>>> the
> >> >>>>>>>>>>>>>>>>>>>> help
> >> >>>>>>>>>>>>>>>>>>>>>>> of others it will be easier for us to come to a
> >> >>>> consensus.
> >> >>>>>>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>>>>>> Best regards,
> >> >>>>>>>>>>>>>>>>>>>>>>> Smirnov Alexander
> >> >>>>>>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>>>>>> пт, 29 апр. 2022 г. в 22:33, Qingsheng Ren <
> >> >>>>>>>>>>>>>>> renqschn@gmail.com
> >> >>>>>>>>>>>>>>>>>> :
> >> >>>>>>>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>>>>>>> Hi Alexander and Arvid,
> >> >>>>>>>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>>>>>>> Thanks for the discussion and sorry for my late
> >> >>>> response!
> >> >>>>>>>>>>>>>>> We
> >> >>>>>>>>>>>>>>>>> had
> >> >>>>>>>>>>>>>>>>>>>> an
> >> >>>>>>>>>>>>>>>>>>>>>>> internal discussion together with Jark and
> Leonard
> >> >>> and
> >> >>>> I’d
> >> >>>>>>>>>>>>>>> like
> >> >>>>>>>>>>>>>>>>> to
> >> >>>>>>>>>>>>>>>>>>>>>>> summarize our ideas. Instead of implementing the
> >> >>> cache
> >> >>>>>>>>>>>>>>> logic in
> >> >>>>>>>>>>>>>>>>> the
> >> >>>>>>>>>>>>>>>>>>>>> table
> >> >>>>>>>>>>>>>>>>>>>>>>> runtime layer or wrapping around the
> user-provided
> >> >>>> table
> >> >>>>>>>>>>>>>>>>> function,
> >> >>>>>>>>>>>>>>>>>>>> we
> >> >>>>>>>>>>>>>>>>>>>>>>> prefer to introduce some new APIs extending
> >> >>>> TableFunction
> >> >>>>>>>>>>>>>>> with
> >> >>>>>>>>>>>>>>>>> these
> >> >>>>>>>>>>>>>>>>>>>>>>> concerns:
> >> >>>>>>>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>>>>>>> 1. Caching actually breaks the semantic of "FOR
> >> >>>>>>>>>>>>>>> SYSTEM_TIME
> >> >>>>>>>>>>>>>>>>> AS OF
> >> >>>>>>>>>>>>>>>>>>>>>>> proc_time”, because it couldn’t truly reflect
> the
> >> >>>> content
> >> >>>>>>>>>>>>>>> of the
> >> >>>>>>>>>>>>>>>>>>>> lookup
> >> >>>>>>>>>>>>>>>>>>>>>>> table at the moment of querying. If users choose
> >> to
> >> >>>> enable
> >> >>>>>>>>>>>>>>>>> caching
> >> >>>>>>>>>>>>>>>>>>>> on
> >> >>>>>>>>>>>>>>>>>>>>> the
> >> >>>>>>>>>>>>>>>>>>>>>>> lookup table, they implicitly indicate that this
> >> >>>> breakage is
> >> >>>>>>>>>>>>>>>>>>>> acceptable
> >> >>>>>>>>>>>>>>>>>>>>>> in
> >> >>>>>>>>>>>>>>>>>>>>>>> exchange for the performance. So we prefer not
> to
> >> >>>> provide
> >> >>>>>>>>>>>>>>>>> caching on
> >> >>>>>>>>>>>>>>>>>>>>> the
> >> >>>>>>>>>>>>>>>>>>>>>>> table runtime level.
> >> >>>>>>>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>>>>>>> 2. If we make the cache implementation in the
> >> >>>> framework
> >> >>>>>>>>>>>>>>>>> (whether
> >> >>>>>>>>>>>>>>>>>>>> in a
> >> >>>>>>>>>>>>>>>>>>>>>>> runner or a wrapper around TableFunction), we
> have
> >> >> to
> >> >>>>>>>>>>>>>>> confront a
> >> >>>>>>>>>>>>>>>>>>>>>> situation
> >> >>>>>>>>>>>>>>>>>>>>>>> that allows table options in DDL to control the
> >> >>>> behavior of
> >> >>>>>>>>>>>>>>> the
> >> >>>>>>>>>>>>>>>>>>>>>> framework,
> >> >>>>>>>>>>>>>>>>>>>>>>> which has never happened before, so we should be
> >> >>>>>>>>>>>>>>>>>>>>>>> cautious.
> >> >>>>>>>>>>>>>>>>> Under
> >> >>>>>>>>>>>>>>>>>>>> the
> >> >>>>>>>>>>>>>>>>>>>>>>> current design the behavior of the framework
> >> should
> >> >>>> only be
> >> >>>>>>>>>>>>>>>>>>>> specified
> >> >>>>>>>>>>>>>>>>>>>>> by
> >> >>>>>>>>>>>>>>>>>>>>>>> configurations (“table.exec.xxx”), and it’s hard
> >> to
> >> >>>> apply
> >> >>>>>>>>>>>>>>> these
> >> >>>>>>>>>>>>>>>>>>>> general
> >> >>>>>>>>>>>>>>>>>>>>>>> configs to a specific table.
> >> >>>>>>>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>>>>>>> 3. We have use cases that lookup source loads
> and
> >> >>>> refresh
> >> >>>>>>>>>>>>>>> all
> >> >>>>>>>>>>>>>>>>>>>> records
> >> >>>>>>>>>>>>>>>>>>>>>>> periodically into the memory to achieve high
> >> lookup
> >> >>>>>>>>>>>>>>> performance
> >> >>>>>>>>>>>>>>>>>>>> (like
> >> >>>>>>>>>>>>>>>>>>>>>> Hive
> >> >>>>>>>>>>>>>>>>>>>>>>> connector in the community, and also widely used
> >> by
> >> >>> our
> >> >>>>>>>>>>>>>>> internal
> >> >>>>>>>>>>>>>>>>>>>>>>> connectors). Wrapping the cache around the
> user’s
> >> >>>>>>>>>>>>>>> TableFunction
> >> >>>>>>>>>>>>>>>>>>>> works
> >> >>>>>>>>>>>>>>>>>>>>>> fine
> >> >>>>>>>>>>>>>>>>>>>>>>> for LRU caches, but I think we have to
> introduce a
> >> >>> new
> >> >>>>>>>>>>>>>>>>> interface for
> >> >>>>>>>>>>>>>>>>>>>>> this
> >> >>>>>>>>>>>>>>>>>>>>>>> all-caching scenario and the design would become
> >> >> more
> >> >>>>>>>>>>>>>>> complex.
> >> >>>>>>>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>>>>>>> 4. Providing the cache in the framework might
> >> >>>> introduce
> >> >>>>>>>>>>>>>>>>>>>> compatibility
> >> >>>>>>>>>>>>>>>>>>>>>>> issues to existing lookup sources like there
> might
> >> >>>> exist two
> >> >>>>>>>>>>>>>>>>> caches
> >> >>>>>>>>>>>>>>>>>>>>> with
> >> >>>>>>>>>>>>>>>>>>>>>>> totally different strategies if the user
> >> >> incorrectly
> >> >>>>>>>>>>>>>>> configures
> >> >>>>>>>>>>>>>>>>> the
> >> >>>>>>>>>>>>>>>>>>>>> table
> >> >>>>>>>>>>>>>>>>>>>>>>> (one in the framework and another implemented by
> >> >> the
> >> >>>> lookup
> >> >>>>>>>>>>>>>>>>> source).
> >> >>>>>>>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>>>>>>> As for the optimization mentioned by
> Alexander, I
> >> >>>> think
> >> >>>>>>>>>>>>>>>>> filters
> >> >>>>>>>>>>>>>>>>>>>> and
> >> >>>>>>>>>>>>>>>>>>>>>>> projections should be pushed all the way down to
> >> >> the
> >> >>>> table
> >> >>>>>>>>>>>>>>>>> function,
> >> >>>>>>>>>>>>>>>>>>>>> like
> >> >>>>>>>>>>>>>>>>>>>>>>> what we do in the scan source, instead of the
> >> >> runner
> >> >>>> with
> >> >>>>>>>>>>>>>>> the
> >> >>>>>>>>>>>>>>>>> cache.
> >> >>>>>>>>>>>>>>>>>>>>> The
> >> >>>>>>>>>>>>>>>>>>>>>>> goal of using cache is to reduce the network I/O
> >> >> and
> >> >>>>>>>>>>>>>>> pressure
> >> >>>>>>>>>>>>>>>>> on the
> >> >>>>>>>>>>>>>>>>>>>>>>> external system, and only applying these
> >> >>> optimizations
> >> >>>> to
> >> >>>>>>>>>>>>>>> the
> >> >>>>>>>>>>>>>>>>> cache
> >> >>>>>>>>>>>>>>>>>>>>> seems
> >> >>>>>>>>>>>>>>>>>>>>>>> not quite useful.
> >> >>>>>>>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>>>>>>> I made some updates to the FLIP[1] to reflect
> our
> >> >>>> ideas.
> >> >>>>>>>>>>>>>>> We
> >> >>>>>>>>>>>>>>>>>>>> prefer to
> >> >>>>>>>>>>>>>>>>>>>>>>> keep the cache implementation as a part of
> >> >>>> TableFunction,
> >> >>>>>>>>>>>>>>> and we
> >> >>>>>>>>>>>>>>>>>>>> could
> >> >>>>>>>>>>>>>>>>>>>>>>> provide some helper classes
> (CachingTableFunction,
> >> >>>>>>>>>>>>>>>>>>>>>> AllCachingTableFunction,
> >> >>>>>>>>>>>>>>>>>>>>>>> CachingAsyncTableFunction) to developers and
> >> >> regulate
> >> >>>>>>>>>>>>>>> metrics
> >> >>>>>>>>>>>>>>>>> of the
> >> >>>>>>>>>>>>>>>>>>>>>> cache.
> >> >>>>>>>>>>>>>>>>>>>>>>> Also, I made a POC[2] for your reference.
> >> >>>>>>>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>>>>>>> Looking forward to your ideas!
> >> >>>>>>>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>>>>>>> [1]
> >> >>>>>>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>
> >> >>>>
> >> >>>
> >> >>
> >>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-221+Abstraction+for+lookup+source+cache+and+metric
> >> >>>>>>>>>>>>>>>>>>>>>>>> [2]
> >> >>> https://github.com/PatrickRen/flink/tree/FLIP-221
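Greatly simplified, the CachingTableFunction idea sketched in the FLIP and POC boils down to a wrapper that consults a cache and only delegates to the underlying lookup function on a miss, while exposing hit/miss counts for the proposed cache metrics. Here the lookup function is modeled as a plain `Function`; the real helper classes wrap a Flink TableFunction, so every name below is an illustrative stand-in:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Sketch of the caching-wrapper pattern: cache first, delegate on miss.
public class CachingLookupFunction<K, R> implements Function<K, List<R>> {
    private final Function<K, List<R>> delegate; // user-provided lookup
    private final Map<K, List<R>> cache = new HashMap<>();
    private long hits;
    private long misses;

    public CachingLookupFunction(Function<K, List<R>> delegate) {
        this.delegate = delegate;
    }

    @Override
    public List<R> apply(K key) {
        List<R> rows = cache.get(key);
        if (rows != null) {
            hits++;                    // served from cache
            return rows;
        }
        misses++;
        rows = delegate.apply(key);    // hit the external system
        cache.put(key, rows);
        return rows;
    }

    public long hitCount() { return hits; }
    public long missCount() { return misses; }
}
```

An eviction policy (TTL, max rows) would slot into the `cache` map; the counters correspond to the kind of standardized cache metrics this FLIP proposes to regulate.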
> >> >>>>>>>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>>>>>>> Best regards,
> >> >>>>>>>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>>>>>>> Qingsheng
> >> >>>>>>>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>>>>>>> On Tue, Apr 26, 2022 at 4:45 PM Александр
> Смирнов
> >> >> <
> >> >>>>>>>>>>>>>>>>>>>>>> smiralexan@gmail.com>
> >> >>>>>>>>>>>>>>>>>>>>>>> wrote:
> >> >>>>>>>>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>>>>>>>> Thanks for the response, Arvid!
> >> >>>>>>>>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>>>>>>>> I have few comments on your message.
> >> >>>>>>>>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>>>>>>>>> but could also live with an easier solution
> as
> >> >> the
> >> >>>>>>>>>>>>>>> first
> >> >>>>>>>>>>>>>>>>> step:
> >> >>>>>>>>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>>>>>>>> I think these two approaches (the one originally
> >> >>>>>>>>>>>>>>>>>>>>>>>>> proposed by Qingsheng and mine) are mutually
> >> >>>>>>>>>>>>>>>>>>>>>>>>> exclusive, because conceptually
> >> they
> >> >>>> follow
> >> >>>>>>>>>>>>>>> the
> >> >>>>>>>>>>>>>>>>> same
> >> >>>>>>>>>>>>>>>>>>>>>>>>> goal, but implementation details are
> different.
> >> >> If
> >> >>> we
> >> >>>>>>>>>>>>>>> will
> >> >>>>>>>>>>>>>>>>> go one
> >> >>>>>>>>>>>>>>>>>>>>> way,
> >> >>>>>>>>>>>>>>>>>>>>>>>>> moving to another way in the future will mean
> >> >>>> deleting
> >> >>>>>>>>>>>>>>>>> existing
> >> >>>>>>>>>>>>>>>>>>>> code
> >> >>>>>>>>>>>>>>>>>>>>>>>>> and once again changing the API for
> connectors.
> >> >> So
> >> >>> I
> >> >>>>>>>>>>>>>>> think we
> >> >>>>>>>>>>>>>>>>>>>> should
> >> >>>>>>>>>>>>>>>>>>>>>>>>> reach a consensus with the community about
> that
> >> >> and
> >> >>>> then
> >> >>>>>>>>>>>>>>> work
> >> >>>>>>>>>>>>>>>>>>>>> together
> >> >>>>>>>>>>>>>>>>>>>>>>>>> on this FLIP, i.e. divide the work on tasks
> for
> >> >>>> different
> >> >>>>>>>>>>>>>>>>> parts
> >> >>>>>>>>>>>>>>>>>>>> of
> >> >>>>>>>>>>>>>>>>>>>>> the
> >> >>>>>>>>>>>>>>>>>>>>>>>>> flip (for example, LRU cache unification /
> >> >>>> introducing
> >> >>>>>>>>>>>>>>>>> proposed
> >> >>>>>>>>>>>>>>>>>>>> set
> >> >>>>>>>>>>>>>>>>>>>>> of
> >> >>>>>>>>>>>>>>>>>>>>>>>>> metrics / further work…). WDYT, Qingsheng?
> >> >>>>>>>>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>>>>>>>>> as the source will only receive the requests
> >> >> after
> >> >>>>>>>>>>>>>>> filter
> >> >>>>>>>>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>>>>>>>> Actually, if filters are applied to fields of the
> >> >>>>>>>>>>>>>>>>>>>>>>>>> lookup table, we first have to make the requests and
> >> >>>>>>>>>>>>>>>>>>>>>>>>> only then filter the responses, because lookup
> >> >>>>>>>>>>>>>>>>>>>>>>>>> connectors have no filter pushdown. So if filtering is
> >> >>>>>>>>>>>>>>>>>>>>>>>>> done before caching, there will be far fewer rows in
> >> >>>>>>>>>>>>>>>>>>>>>>>>> the cache.
> >> >>>>>>>>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>>>>>>>>> @Alexander unfortunately, your architecture
> is
> >> >> not
> >> >>>>>>>>>>>>>>> shared.
> >> >>>>>>>>>>>>>>>>> I
> >> >>>>>>>>>>>>>>>>>>>> don't
> >> >>>>>>>>>>>>>>>>>>>>>>> know the
> >> >>>>>>>>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>>>>>>>>> solution to share images to be honest.
> >> >>>>>>>>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>>>>>>>> Sorry for that, I’m a bit new to such kinds of
> >> >>>>>>>>>>>>>>> conversations
> >> >>>>>>>>>>>>>>>>> :)
> >> >>>>>>>>>>>>>>>>>>>>>>>>> I have no write access to the confluence, so I
> >> >>> made a
> >> >>>>>>>>>>>>>>> Jira
> >> >>>>>>>>>>>>>>>>> issue,
> >> >>>>>>>>>>>>>>>>>>>>>>>>> where described the proposed changes in more
> >> >>> details
> >> >>>> -
> >> >>>>>>>>>>>>>>>>>>>>>>>>>
> >> >> https://issues.apache.org/jira/browse/FLINK-27411.
> >> >>>>>>>>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>>>>>>>> Will happy to get more feedback!
> >> >>>>>>>>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>>>>>>>> Best,
> >> >>>>>>>>>>>>>>>>>>>>>>>>> Smirnov Alexander
> >> >>>>>>>>>>>>>>>>>>>>>>>>>
пн, 25 апр. 2022 г. в 19:49, Arvid Heise <arvid@apache.org>:
>
> Hi Qingsheng,
>
> Thanks for driving this; the inconsistency was not satisfying for me.
>
> I second Alexander's idea though but could also live with an easier
> solution as the first step: Instead of making caching an implementation
> detail of TableFunction X, rather devise a caching layer around X. So the
> proposal would be a CachingTableFunction that delegates to X in case of
> misses and else manages the cache. Lifting it into the operator model as
> proposed would be even better but is probably unnecessary in the first
> step for a lookup source (as the source will only receive the requests
> after filter; applying projection may be more interesting to save memory).
>
> Another advantage is that all the changes of this FLIP would be limited
> to options, with no need for new public interfaces. Everything else
> remains an implementation detail of the Table runtime. That means we can
> easily incorporate the optimization potential that Alexander pointed out
> later.
>
> @Alexander unfortunately, your architecture is not shared. I don't know
> the solution to share images to be honest.
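A minimal sketch of such a delegating caching layer, using a plain
java.util.function.Function as a stand-in for the wrapped "TableFunction X"
(illustrative only, not the actual Flink API; class and method names here
are assumptions):

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

/** Sketch of a caching layer that delegates to an underlying lookup
 *  function on cache misses (not the real Flink CachingTableFunction). */
public class CachingLookup<K, V> implements Function<K, V> {
    private final Function<K, V> delegate; // the wrapped "TableFunction X"
    private final LinkedHashMap<K, V> cache;
    private long hits = 0;
    private long misses = 0; // the kind of metrics FLIP-221 standardizes

    public CachingLookup(Function<K, V> delegate, int maxRows) {
        this.delegate = delegate;
        // An access-ordered LinkedHashMap gives LRU eviction for free.
        this.cache = new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                return size() > maxRows;
            }
        };
    }

    @Override
    public V apply(K key) {
        V cached = cache.get(key);
        if (cached != null) {
            hits++;
            return cached; // cache hit: no remote lookup
        }
        misses++;
        V looked = delegate.apply(key); // delegate only on a miss
        cache.put(key, looked);
        return looked;
    }

    public long hits() { return hits; }
    public long misses() { return misses; }
}
```

With this shape, repeated lookups of the same key are served from memory
and the connector's own lookup code stays untouched.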
> On Fri, Apr 22, 2022 at 5:04 PM Александр Смирнов <smiralexan@gmail.com>
> wrote:
>
>> Hi Qingsheng! My name is Alexander, I'm not a committer yet, but I'd
>> really like to become one. And this FLIP really interested me.
>> Actually I have worked on a similar feature in my company's Flink
>> fork, and we would like to share our thoughts on this and make the
>> code open source.
>>
>> I think there is a better alternative than introducing an abstract
>> class for TableFunction (CachingTableFunction). As you know,
>> TableFunction exists in the flink-table-common module, which provides
>> only an API for working with tables – it's very convenient for
>> importing in connectors. In turn, CachingTableFunction contains logic
>> for runtime execution, so this class and everything connected with it
>> should be located in another module, probably in flink-table-runtime.
>> But this would require connectors to depend on another module, which
>> contains a lot of runtime logic, and that doesn't sound good.
>>
>> I suggest adding a new method 'getLookupConfig' to LookupTableSource
>> or LookupRuntimeProvider to allow connectors to only pass
>> configurations to the planner, so that they won't depend on the
>> runtime realization. Based on these configs the planner will construct
>> a lookup join operator with the corresponding runtime logic
>> (ProcessFunctions in module flink-table-runtime). The architecture
>> looks like in the pinned image (the LookupConfig class there is
>> actually your CacheConfig).
>>
>> Classes in flink-table-planner that will be responsible for this –
>> CommonPhysicalLookupJoin and its inheritors.
>> Current classes for lookup join in flink-table-runtime -
>> LookupJoinRunner, AsyncLookupJoinRunner, LookupJoinRunnerWithCalc,
>> AsyncLookupJoinRunnerWithCalc.
>>
>> I suggest adding classes LookupJoinCachingRunner,
>> LookupJoinCachingRunnerWithCalc, etc.
>>
>> And here comes another, more powerful advantage of such a solution. If
>> we have caching logic on a lower level, we can apply some
>> optimizations to it. LookupJoinRunnerWithCalc was named like this
>> because it uses the 'calc' function, which actually mostly consists of
>> filters and projections.
>>
>> For example, when joining table A with lookup table B on the condition
>> 'JOIN … ON A.id = B.id AND A.age = B.age + 10 WHERE B.salary > 1000',
>> the 'calc' function will contain the filters A.age = B.age + 10 and
>> B.salary > 1000.
>>
>> If we apply this function before storing records in the cache, the
>> size of the cache will be significantly reduced: filters = avoid
>> storing useless records in the cache, projections = reduce the
>> records' size. So the initial max number of records in the cache can
>> be increased by the user.
>>
>> What do you think about it?
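The calc-before-cache idea above can be illustrated with a small sketch
(the Row/CacheEntry types and the toCacheEntries helper are hypothetical;
only the probe-independent filter B.salary > 1000 is applied here, since
A.age = B.age + 10 needs the probe-side row):

```java
import java.util.List;
import java.util.stream.Collectors;

/** Hypothetical sketch of the calc-before-cache optimization: apply the
 *  join's filters and projection before a row is stored in the cache. */
public class CalcBeforeCache {
    /** A full row of lookup table B: (id, age, salary). */
    record Row(int id, int age, int salary) {}

    /** The projected cache entry: only the fields needed downstream. */
    record CacheEntry(int id, int age) {}

    // Only the probe-independent filter B.salary > 1000 is applied here;
    // A.age = B.age + 10 needs the probe row and is evaluated at join time.
    static List<CacheEntry> toCacheEntries(List<Row> fetched) {
        return fetched.stream()
                .filter(r -> r.salary() > 1000)            // drop useless rows
                .map(r -> new CacheEntry(r.id(), r.age())) // shrink each row
                .collect(Collectors.toList());
    }
}
```

Fewer and smaller entries mean the user can raise the maximum cache size
without increasing memory consumption.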
>> On 2022/04/19 02:47:11 Qingsheng Ren wrote:
>>> Hi devs,
>>>
>>> Yuan and I would like to start a discussion about FLIP-221[1], which
>>> introduces an abstraction of lookup table cache and its standard
>>> metrics.
>>>
>>> Currently each lookup table source has to implement its own cache to
>>> store lookup results, and there isn't a standard of metrics for users
>>> and developers to tune their jobs with lookup joins, which is a quite
>>> common use case in Flink table / SQL.
>>>
>>> Therefore we propose some new APIs including cache, metrics, wrapper
>>> classes of TableFunction and new table options. Please take a look at
>>> the FLIP page [1] to get more details. Any suggestions and comments
>>> would be appreciated!
>>>
>>> [1]
>>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-221+Abstraction+for+lookup+source+cache+and+metric
>>>
>>> Best regards,
>>>
>>> Qingsheng
>
> --
> Best Regards,
>
> Qingsheng Ren
>
> Real-time Computing Team
> Alibaba Cloud
>
> Email: renqschn@gmail.com
>
> --
> Best regards,
> Roman Boyko
> e.: ro.v.boyko@gmail.com

Re: [DISCUSS] FLIP-221 Abstraction for lookup source cache and metric

Posted by Jark Wu <im...@gmail.com>.
After looking at the newly introduced ReloadTime and Becket's comment,
I agree with Becket that we should have a pluggable reloading strategy.
We can provide some common implementations, e.g., periodic reloading and
daily reloading.
But there will definitely be some connector- or business-specific reloading
strategies, e.g. notification by a ZooKeeper watcher, or reloading once a
new Hive partition is complete.

Best,
Jark
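Such a pluggable strategy could be modeled roughly like this (the interface
and method names are illustrative assumptions, not the FLIP's final API):

```java
import java.time.Duration;

/** Sketch of a pluggable full-cache reload trigger (illustrative only). */
public interface CacheReloadTrigger {

    /** Called once when the lookup join starts; implementations use the
     *  context to fire or schedule cache reloads. */
    void open(Context context) throws Exception;

    /** Facilities the framework would hand to a trigger implementation. */
    interface Context {
        /** Ask the framework to reload the full cache now. */
        void triggerReload();

        /** Schedule recurring reloads, e.g. for the periodic strategy. */
        void scheduleWithFixedDelay(Runnable task, Duration initialDelay, Duration delay);
    }

    /** The common periodic implementation mentioned above; custom triggers
     *  (ZooKeeper watcher, Hive partition listener) implement open() instead. */
    static CacheReloadTrigger periodic(Duration interval) {
        return ctx -> ctx.scheduleWithFixedDelay(ctx::triggerReload, Duration.ZERO, interval);
    }
}
```

A connector- or business-specific strategy would then only implement
open(), while the framework keeps ownership of the cache itself.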

On Thu, 26 May 2022 at 11:52, Becket Qin <be...@gmail.com> wrote:

> Hi Qingsheng,
>
> Thanks for updating the FLIP. A few comments / questions below:
>
> 1. Is there a reason that we have both "XXXFactory" and "XXXProvider".
> What is the difference between them? If they are the same, can we just use
> XXXFactory everywhere?
>
> 2. Regarding the FullCachingLookupProvider, should the reloading policy
> also be pluggable? Periodic reloading can sometimes be tricky in
> practice. For example, if a user uses 24 hours as the cache refresh
> interval and a nightly batch job is delayed, the cache update may still
> see stale data.
>
> 3. In DefaultLookupCacheFactory, it looks like InitialCapacity should be
> removed.
>
> 4. The purpose of LookupFunctionProvider#cacheMissingKey() seems a little
> confusing to me. If Optional<LookupCacheFactory> getCacheFactory() returns
> a non-empty factory, doesn't that already tell the framework to cache
> the missing keys? Also, why is this method returning an Optional<Boolean>
> instead of boolean?
>
> Thanks,
>
> Jiangjie (Becket) Qin
>
>
>
> On Wed, May 25, 2022 at 5:07 PM Qingsheng Ren <re...@gmail.com> wrote:
>
>> Hi Lincoln and Jark,
>>
>> Thanks for the comments! If the community reaches a consensus that we use
>> SQL hint instead of table options to decide whether to use sync or async
>> mode, it’s indeed not necessary to introduce the “lookup.async” option.
>>
>> I think it’s a good idea to let the decision on async be made at the
>> query level, which could enable better optimization with more
>> information gathered by the planner. Is there any FLIP describing the
>> issue in FLINK-27625? I thought FLIP-234 only proposes adding a SQL hint
>> for retry on missing, instead of the entire async mode being controlled
>> by a hint.
>>
>> Best regards,
>>
>> Qingsheng
>>
>> > On May 25, 2022, at 15:13, Lincoln Lee <li...@gmail.com> wrote:
>> >
>> > Hi Jark,
>> >
>> > Thanks for your reply!
>> >
>> > Currently 'lookup.async' exists only in the HBase connector. I have no
>> > idea whether or when to remove it (we can discuss it in another issue
>> > for the HBase connector after FLINK-27625 is done); I just suggest not
>> > adding it as a common option now.
>> >
>> > Best,
>> > Lincoln Lee
>> >
>> >
>> > Jark Wu <im...@gmail.com> 于2022年5月24日周二 20:14写道:
>> >
>> >> Hi Lincoln,
>> >>
>> >> I have taken a look at FLIP-234, and I agree with you that the
>> connectors
>> >> can
>> >> provide both async and sync runtime providers simultaneously instead
>> of one
>> >> of them.
>> >> At that point, "lookup.async" looks redundant. If this option is
>> planned to
>> >> be removed
>> >> in the long term, I think it makes sense not to introduce it in this
>> FLIP.
>> >>
>> >> Best,
>> >> Jark
>> >>
>> >> On Tue, 24 May 2022 at 11:08, Lincoln Lee <li...@gmail.com>
>> wrote:
>> >>
>> >>> Hi Qingsheng,
>> >>>
>> >>> Sorry for jumping into the discussion so late. It's a good idea that
>> we
>> >> can
>> >>> have a common table option. I have a minor comment on 'lookup.async',
>> >>> suggesting not to make it a common option:
>> >>>
>> >>> The table layer abstracts both sync and async lookup capabilities;
>> >>> connector implementers can choose one or both. In the case of
>> >>> implementing only one capability (the status of most existing
>> >>> built-in connectors), 'lookup.async' will not be used. And when a
>> >>> connector has both capabilities, I think this choice is better made
>> >>> at the query level: for example, the table planner can choose the
>> >>> physical implementation of async or sync lookup based on its cost
>> >>> model, or users can give a query hint based on their own better
>> >>> understanding. If there is another common table option
>> >>> 'lookup.async', it may confuse the users in the long run.
>> >>>
>> >>> So, I prefer to leave the 'lookup.async' option in private place (for
>> the
>> >>> current hbase connector) and not turn it into a common option.
>> >>>
>> >>> WDYT?
>> >>>
>> >>> Best,
>> >>> Lincoln Lee
>> >>>
>> >>>
>> >>> Qingsheng Ren <re...@gmail.com> 于2022年5月23日周一 14:54写道:
>> >>>
>> >>>> Hi Alexander,
>> >>>>
>> >>>> Thanks for the review! We recently updated the FLIP and you can find
>> >>>> those changes in my latest email. Since some terminology has changed,
>> >>>> I’ll use the new concepts when replying to your comments.
>> >>>>
>> >>>> 1. Builder vs ‘of’
>> >>>> I’m OK with using the builder pattern if we have additional optional
>> >>>> parameters for full caching mode (“rescan” previously). The
>> >>>> schedule-with-delay idea looks reasonable to me, but I think we need
>> >>>> to redesign the builder API of full caching to make it more
>> >>>> descriptive for developers. Would you mind sharing your ideas about
>> >>>> the API? For accessing the FLIP workspace you can just provide your
>> >>>> account ID and ping any PMC member, including Jark.
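For illustration, a cache-configuration builder along these lines (class
name, option names, and defaults are all hypothetical, not the final
DefaultLookupCache API):

```java
import java.time.Duration;

/** Hypothetical sketch of builder-style cache configuration. */
public class LookupCacheConfig {
    private final long maxRows;
    private final Duration expireAfterWrite;
    private final boolean cacheMissingKey;

    private LookupCacheConfig(Builder b) {
        this.maxRows = b.maxRows;
        this.expireAfterWrite = b.expireAfterWrite;
        this.cacheMissingKey = b.cacheMissingKey;
    }

    public long maxRows() { return maxRows; }
    public Duration expireAfterWrite() { return expireAfterWrite; }
    public boolean cacheMissingKey() { return cacheMissingKey; }

    public static Builder newBuilder() { return new Builder(); }

    public static final class Builder {
        // Defaults below are purely illustrative.
        private long maxRows = 10_000L;
        private Duration expireAfterWrite = Duration.ofMinutes(10);
        private boolean cacheMissingKey = true;

        public Builder maxRows(long v) { this.maxRows = v; return this; }
        public Builder expireAfterWrite(Duration v) { this.expireAfterWrite = v; return this; }
        public Builder cacheMissingKey(boolean v) { this.cacheMissingKey = v; return this; }
        public LookupCacheConfig build() { return new LookupCacheConfig(this); }
    }
}
```

Compared with a growing family of static 'of' methods, new optional
parameters can later be added to the builder without breaking callers.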
>> >>>>
>> >>>> 2. Common table options
>> >>>> We have some discussions these days and propose to introduce 8 common
>> >>>> table options about caching. It has been updated on the FLIP.
>> >>>>
>> >>>> 3. Retries
>> >>>> I think we are on the same page :-)
>> >>>>
>> >>>> For your additional concerns:
>> >>>> 1) The table option has been updated.
>> >>>> 2) We got “lookup.cache” back for configuring whether to use partial
>> or
>> >>>> full caching mode.
>> >>>>
>> >>>> Best regards,
>> >>>>
>> >>>> Qingsheng
>> >>>>
>> >>>>
>> >>>>
>> >>>>> On May 19, 2022, at 17:25, Александр Смирнов <sm...@gmail.com>
>> >>>> wrote:
>> >>>>>
>> >>>>> Also I have a few additions:
>> >>>>> 1) maybe rename 'lookup.cache.maximum-size' to
>> >>>>> 'lookup.cache.max-rows'? I think it will be clearer that we are
>> >>>>> talking about the number of rows, not bytes. Plus it fits better,
>> >>>>> considering my optimization with filters.
>> >>>>> 2) How will users enable rescanning? Are we going to separate
>> caching
>> >>>>> and rescanning from the options point of view? Like initially we had
>> >>>>> one option 'lookup.cache' with values LRU / ALL. I think now we can
>> >>>>> make a boolean option 'lookup.rescan'. RescanInterval can be
>> >>>>> 'lookup.rescan.interval', etc.
>> >>>>>
>> >>>>> Best regards,
>> >>>>> Alexander
>> >>>>>
>> >>>>> чт, 19 мая 2022 г. в 14:50, Александр Смирнов <smiralexan@gmail.com
>> >>> :
>> >>>>>>
>> >>>>>> Hi Qingsheng and Jark,
>> >>>>>>
>> >>>>>> 1. Builders vs 'of'
>> >>>>>> I understand that builders are used when we have multiple
>> >> parameters.
>> >>>>>> I suggested them because we could add parameters later. To prevent
>> >>>>>> Builder for ScanRuntimeProvider from looking redundant I can
>> suggest
>> >>>>>> one more config now - "rescanStartTime".
>> >>>>>> It's a time in UTC (LocalTime class) when the first reload of cache
>> >>>>>> starts. This parameter can be thought of as 'initialDelay' (diff
>> >>>>>> between current time and rescanStartTime) in method
>> >>>>>> ScheduledExecutorService#scheduleWithFixedDelay [1]. It can be very
>> >>>>>> useful when the dimension table is updated by some other scheduled
>> >> job
>> >>>>>> at a certain time. Or when the user simply wants the second scan
>> >>>>>> (first cache reload) to be delayed. This option can be used even without
>> >>>>>> 'rescanInterval' - in this case 'rescanInterval' will be one day.
>> >>>>>> If you are fine with this option, I would be very glad if you would
>> >>>>>> give me access to edit FLIP page, so I could add it myself
>> >>>>>>
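The initial-delay computation described above could look like this (a
hypothetical helper, assuming rescanStartTime is a LocalTime in UTC; it is
not part of the FLIP):

```java
import java.time.Duration;
import java.time.LocalTime;

/** Sketch: computing the initial delay for the first cache reload from a
 *  rescanStartTime in UTC (hypothetical helper). */
public class RescanDelay {
    static Duration initialDelay(LocalTime nowUtc, LocalTime rescanStartTime) {
        Duration diff = Duration.between(nowUtc, rescanStartTime);
        // If the start time has already passed today, wait until tomorrow.
        return diff.isNegative() ? diff.plusDays(1) : diff;
    }
}
```

The result would then be passed as the initialDelay argument of
ScheduledExecutorService#scheduleWithFixedDelay.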
>> >>>>>> 2. Common table options
>> >>>>>> I also think that FactoryUtil would be overloaded by all cache
>> >>>>>> options. But maybe unify all suggested options, not only for
>> default
>> >>>>>> cache? I.e. class 'LookupOptions', that unifies default cache
>> >> options,
>> >>>>>> rescan options, 'async', 'maxRetries'. WDYT?
>> >>>>>>
>> >>>>>> 3. Retries
>> >>>>>> I'm fine with suggestion close to RetryUtils#tryTimes(times, call)
>> >>>>>>
>> >>>>>> [1]
>> >>>>
>> >>>
>> >>
>> https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/ScheduledExecutorService.html#scheduleWithFixedDelay-java.lang.Runnable-long-long-java.util.concurrent.TimeUnit-
>> >>>>>>
>> >>>>>> Best regards,
>> >>>>>> Alexander
>> >>>>>>
>> >>>>>> ср, 18 мая 2022 г. в 16:04, Qingsheng Ren <re...@gmail.com>:
>> >>>>>>>
>> >>>>>>> Hi Jark and Alexander,
>> >>>>>>>
>> >>>>>>> Thanks for your comments! I’m also OK to introduce common table
>> >>>> options. I prefer to introduce a new DefaultLookupCacheOptions class
>> >> for
>> >>>> holding these option definitions because putting all options into
>> >>>> FactoryUtil would make it a bit ”crowded” and not well categorized.
>> >>>>>>>
>> >>>>>>> FLIP has been updated according to suggestions above:
>> >>>>>>> 1. Use static “of” method for constructing RescanRuntimeProvider
>> >>>> considering both arguments are required.
>> >>>>>>> 2. Introduce new table options matching DefaultLookupCacheFactory
>> >>>>>>>
>> >>>>>>> Best,
>> >>>>>>> Qingsheng
>> >>>>>>>
>> >>>>>>> On Wed, May 18, 2022 at 2:57 PM Jark Wu <im...@gmail.com> wrote:
>> >>>>>>>>
>> >>>>>>>> Hi Alex,
>> >>>>>>>>
>> >>>>>>>> 1) retry logic
>> >>>>>>>> I think we can extract some common retry logic into utilities,
>> >> e.g.
>> >>>> RetryUtils#tryTimes(times, call).
>> >>>>>>>> This seems independent of this FLIP and can be reused by
>> >> DataStream
>> >>>> users.
>> >>>>>>>> Maybe we can open an issue to discuss this and where to put it.
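A sketch of the suggested utility (the RetryUtils#tryTimes name comes from
the discussion itself; the implementation below is an assumption, not an
existing Flink class):

```java
import java.util.concurrent.Callable;

/** Sketch of the proposed common retry utility (illustrative only). */
public class RetryUtils {
    /** Invokes the call up to the given number of times, rethrowing the
     *  last failure once all attempts are exhausted. */
    static <T> T tryTimes(int times, Callable<T> call) throws Exception {
        Exception last = null;
        for (int i = 0; i < times; i++) {
            try {
                return call.call();
            } catch (Exception e) {
                last = e; // retriable failure: remember and try again
            }
        }
        throw last;
    }
}
```

Connector-specific steps (e.g. re-establishing a connection before a retry,
as JdbcRowDataLookupFunction does) would stay inside the Callable.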
>> >>>>>>>>
>> >>>>>>>> 2) cache ConfigOptions
>> >>>>>>>> I'm fine with defining cache config options in the framework.
>> >>>>>>>> A candidate place to put is FactoryUtil which also includes
>> >>>> "sink.parallelism", "format" options.
>> >>>>>>>>
>> >>>>>>>> Best,
>> >>>>>>>> Jark
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>> On Wed, 18 May 2022 at 13:52, Александр Смирнов <
>> >>> smiralexan@gmail.com>
>> >>>> wrote:
>> >>>>>>>>>
>> >>>>>>>>> Hi Qingsheng,
>> >>>>>>>>>
>> >>>>>>>>> Thank you for considering my comments.
>> >>>>>>>>>
>> >>>>>>>>>> there might be custom logic before making retry, such as
>> >>>> re-establish the connection
>> >>>>>>>>>
>> >>>>>>>>> Yes, I understand that. I meant that such logic can be placed in
>> >> a
>> >>>>>>>>> separate function, that can be implemented by connectors. Just
>> >>> moving
>> >>>>>>>>> the retry logic would make connector's LookupFunction more
>> >> concise
>> >>> +
>> >>>>>>>>> avoid duplicate code. However, it's a minor change. The decision
>> >> is
>> >>>> up
>> >>>>>>>>> to you.
>> >>>>>>>>>
>> >>>>>>>>>> We decide not to provide common DDL options and let developers
>> >> to
>> >>>> define their own options as we do now per connector.
>> >>>>>>>>>
>> >>>>>>>>> What is the reason for that? One of the main goals of this FLIP
>> >> was
>> >>>> to
>> >>>>>>>>> unify the configs, wasn't it? I understand that current cache
>> >>> design
>> >>>>>>>>> doesn't depend on ConfigOptions, like was before. But still we
>> >> can
>> >>>> put
>> >>>>>>>>> these options into the framework, so connectors can reuse them
>> >> and
>> >>>>>>>>> avoid code duplication, and, what is more significant, avoid
>> >>> possible
>> >>>>>>>>> different options naming. This moment can be pointed out in
>> >>>>>>>>> documentation for connector developers.
>> >>>>>>>>>
>> >>>>>>>>> Best regards,
>> >>>>>>>>> Alexander
>> >>>>>>>>>
>> >>>>>>>>> вт, 17 мая 2022 г. в 17:11, Qingsheng Ren <re...@gmail.com>:
>> >>>>>>>>>>
>> >>>>>>>>>> Hi Alexander,
>> >>>>>>>>>>
>> >>>>>>>>>> Thanks for the review and glad to see we are on the same page!
>> I
>> >>>> think you forgot to cc the dev mailing list so I’m also quoting your
>> >>> reply
>> >>>> under this email.
>> >>>>>>>>>>
>> >>>>>>>>>>> We can add 'maxRetryTimes' option into this class
>> >>>>>>>>>>
>> >>>>>>>>>> In my opinion the retry logic should be implemented in lookup()
>> >>>> instead of in LookupFunction#eval(). Retrying is only meaningful
>> under
>> >>> some
>> >>>> specific retriable failures, and there might be custom logic before
>> >>> making
>> >>>> retry, such as re-establish the connection (JdbcRowDataLookupFunction
>> >> is
>> >>> an
>> >>>> example), so it's more handy to leave it to the connector.
>> >>>>>>>>>>
>> >>>>>>>>>>> I don't see DDL options, that were in previous version of
>> FLIP.
>> >>> Do
>> >>>> you have any special plans for them?
>> >>>>>>>>>>
>> >>>>>>>>>> We decide not to provide common DDL options and let developers
>> >> to
>> >>>> define their own options as we do now per connector.
>> >>>>>>>>>>
>> >>>>>>>>>> The rest of comments sound great and I’ll update the FLIP. Hope
>> >> we
>> >>>> can finalize our proposal soon!
>> >>>>>>>>>>
>> >>>>>>>>>> Best,
>> >>>>>>>>>>
>> >>>>>>>>>> Qingsheng
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>>> On May 17, 2022, at 13:46, Александр Смирнов <
>> >>> smiralexan@gmail.com>
>> >>>> wrote:
>> >>>>>>>>>>>
>> >>>>>>>>>>> Hi Qingsheng and devs!
>> >>>>>>>>>>>
>> >>>>>>>>>>> I like the overall design of updated FLIP, however I have
>> >> several
>> >>>>>>>>>>> suggestions and questions.
>> >>>>>>>>>>>
>> >>>>>>>>>>> 1) Introducing LookupFunction as a subclass of TableFunction
>> >> is a
>> >>>> good
>> >>>>>>>>>>> idea. We can add 'maxRetryTimes' option into this class.
>> 'eval'
>> >>>> method
>> >>>>>>>>>>> of new LookupFunction is great for this purpose. The same is
>> >> for
>> >>>>>>>>>>> 'async' case.
>> >>>>>>>>>>>
>> >>>>>>>>>>> 2) There might be other configs in future, such as
>> >>>> 'cacheMissingKey'
>> >>>>>>>>>>> in LookupFunctionProvider or 'rescanInterval' in
>> >>>> ScanRuntimeProvider.
>> >>>>>>>>>>> Maybe use Builder pattern in LookupFunctionProvider and
>> >>>>>>>>>>> RescanRuntimeProvider for more flexibility (use one 'build'
>> >>> method
>> >>>>>>>>>>> instead of many 'of' methods in future)?
>> >>>>>>>>>>>
>> >>>>>>>>>>> 3) What are the plans for existing TableFunctionProvider and
>> >>>>>>>>>>> AsyncTableFunctionProvider? I think they should be deprecated.
>> >>>>>>>>>>>
>> >>>>>>>>>>> 4) Am I right that the current design does not assume usage of
>> >>>>>>>>>>> user-provided LookupCache in re-scanning? In this case, it is
>> >> not
>> >>>> very
>> >>>>>>>>>>> clear why do we need methods such as 'invalidate' or 'putAll'
>> >> in
>> >>>>>>>>>>> LookupCache.
>> >>>>>>>>>>>
>> >>>>>>>>>>> 5) I don't see DDL options, that were in previous version of
>> >>> FLIP.
>> >>>> Do
>> >>>>>>>>>>> you have any special plans for them?
>> >>>>>>>>>>>
>> >>>>>>>>>>> If you don't mind, I would be glad to be able to make small
>> >>>>>>>>>>> adjustments to the FLIP document too. I think it's worth
>> >>> mentioning
>> >>>>>>>>>>> about what exactly optimizations are planning in the future.
>> >>>>>>>>>>>
>> >>>>>>>>>>> Best regards,
>> >>>>>>>>>>> Smirnov Alexander
>> >>>>>>>>>>>
>> >>>>>>>>>>> пт, 13 мая 2022 г. в 20:27, Qingsheng Ren <renqschn@gmail.com
>> >>> :
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> Hi Alexander and devs,
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> Thank you very much for the in-depth discussion! As Jark
>> >>>> mentioned we were inspired by Alexander's idea and made a refactor on
>> >> our
>> >>>> design. FLIP-221 [1] has been updated to reflect our design now and
>> we
>> >>> are
>> >>>> happy to hear more suggestions from you!
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> Compared to the previous design:
>> >>>>>>>>>>>> 1. The lookup cache serves at table runtime level and is
>> >>>> integrated as a component of LookupJoinRunner as discussed
>> previously.
>> >>>>>>>>>>>> 2. Interfaces are renamed and re-designed to reflect the new
>> >>>> design.
>> >>>>>>>>>>>> 3. We separate the all-caching case individually and
>> >> introduce a
>> >>>> new RescanRuntimeProvider to reuse the ability of scanning. We are
>> >>> planning
>> >>>> to support SourceFunction / InputFormat for now considering the
>> >>> complexity
>> >>>> of FLIP-27 Source API.
>> >>>>>>>>>>>> 4. A new interface LookupFunction is introduced to make the
>> >>>> semantic of lookup more straightforward for developers.
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> For replying to Alexander:
>> >>>>>>>>>>>>> However I'm a little confused whether InputFormat is
>> >> deprecated
>> >>>> or not. Am I right that it will be so in the future, but currently
>> it's
>> >>> not?
>> >>>>>>>>>>>> Yes you are right. InputFormat is not deprecated for now. I
>> >>> think
>> >>>> it will be deprecated in the future but we don't have a clear plan
>> for
>> >>> that.
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> Thanks again for the discussion on this FLIP and looking
>> >> forward
>> >>>> to cooperating with you after we finalize the design and interfaces!
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> [1]
>> >>>>
>> >>>
>> >>
>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-221+Abstraction+for+lookup+source+cache+and+metric
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> Best regards,
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> Qingsheng
>> >>>>>>>>>>>>
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> On Fri, May 13, 2022 at 12:12 AM Александр Смирнов <
>> >>>> smiralexan@gmail.com> wrote:
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>> Hi Jark, Qingsheng and Leonard!
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>> Glad to see that we came to a consensus on almost all
>> points!
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>> However I'm a little confused whether InputFormat is
>> >> deprecated
>> >>>> or
>> >>>>>>>>>>>>> not. Am I right that it will be so in the future, but
>> >> currently
>> >>>> it's
>> >>>>>>>>>>>>> not? Actually I also think that for the first version it's
>> OK
>> >>> to
>> >>>> use
>> >>>>>>>>>>>>> InputFormat in ALL cache realization, because supporting
>> >> rescan
>> >>>>>>>>>>>>> ability seems like a very distant prospect. But for this
>> >>>> decision we
>> >>>>>>>>>>>>> need a consensus among all discussion participants.
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>> In general, I don't have anything to argue with in your
>> >>>>>>>>>>>>> statements. All of them correspond to my ideas. Looking ahead,
>> >>>>>>>>>>>>> it would be nice to work on this FLIP cooperatively. I've
>> >>>>>>>>>>>>> already done a lot of work on lookup join caching with a
>> >>>>>>>>>>>>> realization very close to the one we are discussing, and want
>> >>>>>>>>>>>>> to share the results of this work. Anyway, looking forward to
>> >>>>>>>>>>>>> the FLIP update!
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>> Best regards,
>> >>>>>>>>>>>>> Smirnov Alexander
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>> чт, 12 мая 2022 г. в 17:38, Jark Wu <im...@gmail.com>:
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>> Hi Alex,
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>> Thanks for summarizing your points.
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>> In the past week, Qingsheng, Leonard, and I have discussed
>> >> it
>> >>>> several times
>> >>>>>>>>>>>>>> and we have totally refactored the design.
>> >>>>>>>>>>>>>> I'm glad to say we have reached a consensus on many of your
>> >>>> points!
>> >>>>>>>>>>>>>> Qingsheng is still working on updating the design docs and
>> >>>> maybe can be
>> >>>>>>>>>>>>>> available in the next few days.
>> >>>>>>>>>>>>>> I will share some conclusions from our discussions:
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>> 1) we have refactored the design towards to "cache in
>> >>>> framework" way.
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>> 2) a "LookupCache" interface for users to customize and a
>> >>>> default
>> >>>>>>>>>>>>>> implementation with builder for users to easy-use.
>> >>>>>>>>>>>>>> This can both make it possible to both have flexibility and
>> >>>> conciseness.
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>> 3) Filter pushdown is important for ALL and LRU lookup
>> >> cache,
>> >>>> esp reducing
>> >>>>>>>>>>>>>> IO.
>> >>>>>>>>>>>>>> Filter pushdown should be the final state and the unified
>> >> way
>> >>>> to both
>> >>>>>>>>>>>>>> support pruning ALL cache and LRU cache,
>> >>>>>>>>>>>>>> so I think we should make effort in this direction. If we
>> >> need
>> >>>> to support
>> >>>>>>>>>>>>>> filter pushdown for ALL cache anyway, why not use
>> >>>>>>>>>>>>>> it for LRU cache as well? Either way, as we decide to
>> >>> implement
>> >>>> the cache
>> >>>>>>>>>>>>>> in the framework, we have the chance to support
>> >>>>>>>>>>>>>> filter on cache anytime. This is an optimization and it
>> >>> doesn't
>> >>>> affect the
>> >>>>>>>>>>>>>> public API. I think we can create a JIRA issue to
>> >>>>>>>>>>>>>> discuss it when the FLIP is accepted.
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>> 4) The idea to support ALL cache is similar to your
>> >> proposal.
>> >>>>>>>>>>>>>> In the first version, we will only support InputFormat,
>> >>>> SourceFunction for
>> >>>>>>>>>>>>>> cache all (invoke InputFormat in join operator).
>> >>>>>>>>>>>>>> For FLIP-27 source, we need to join a true source operator
>> >>>> instead of
>> >>>>>>>>>>>>>> calling it embedded in the join operator.
>> >>>>>>>>>>>>>> However, this needs another FLIP to support the re-scan
>> >>> ability
>> >>>> for FLIP-27
>> >>>>>>>>>>>>>> Source, and this can be a large work.
>> >>>>>>>>>>>>>> In order to not block this issue, we can put the effort of
>> >>>> FLIP-27 source
>> >>>>>>>>>>>>>> integration into future work and integrate
>> >>>>>>>>>>>>>> InputFormat&SourceFunction for now.
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>> I think it's fine to use InputFormat&SourceFunction, as
>> they
>> >>>> are not
>> >>>>>>>>>>>>>> deprecated, otherwise, we have to introduce another
>> function
>> >>>>>>>>>>>>>> similar to them which is meaningless. We need to plan
>> >> FLIP-27
>> >>>> source
>> >>>>>>>>>>>>>> integration ASAP before InputFormat & SourceFunction are
>> >>>> deprecated.
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>> Best,
>> >>>>>>>>>>>>>> Jark
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>> On Thu, 12 May 2022 at 15:46, Александр Смирнов <
>> >>>> smiralexan@gmail.com>
>> >>>>>>>>>>>>>> wrote:
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> Hi Martijn!
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> Got it. Therefore, the realization with InputFormat is not
>> >>>> considered.
>> >>>>>>>>>>>>>>> Thanks for clearing that up!
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> Best regards,
>> >>>>>>>>>>>>>>> Smirnov Alexander
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> чт, 12 мая 2022 г. в 14:23, Martijn Visser <
>> >>>> martijn@ververica.com>:
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>> Hi,
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>> With regards to:
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>> But if there are plans to refactor all connectors to
>> >>> FLIP-27
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>> Yes, FLIP-27 is the target for all connectors. The old
>> >>>> interfaces will be
>> >>>>>>>>>>>>>>>> deprecated and connectors will either be refactored to
>> use
>> >>>> the new ones
>> >>>>>>>>>>>>>>> or
>> >>>>>>>>>>>>>>>> dropped.
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>> The caching should work for connectors that are using
>> >>> FLIP-27
>> >>>> interfaces,
>> >>>>>>>>>>>>>>>> we should not introduce new features for old interfaces.
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>> Best regards,
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>> Martijn
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>> On Thu, 12 May 2022 at 06:19, Александр Смирнов <
>> >>>> smiralexan@gmail.com>
>> >>>>>>>>>>>>>>>> wrote:
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>> Hi Jark!
>> >>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>> Sorry for the late response. I would like to make some
>> >>>> comments and
>> >>>>>>>>>>>>>>>>> clarify my points.
>> >>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>> 1) I agree with your first statement. I think we can
>> >>> achieve
>> >>>> both
>> >>>>>>>>>>>>>>>>> advantages this way: put the Cache interface in
>> >>>> flink-table-common,
>> >>>>>>>>>>>>>>>>> but have implementations of it in flink-table-runtime.
>> >>>> Therefore if a
>> >>>>>>>>>>>>>>>>> connector developer wants to use existing cache
>> >> strategies
>> >>>> and their
>> >>>>>>>>>>>>>>>>> implementations, he can just pass lookupConfig to the
>> >>>> planner, but if
>> >>>>>>>>>>>>>>>>> he wants to have its own cache implementation in his
>> >>>> TableFunction, it
>> >>>>>>>>>>>>>>>>> will be possible for him to use the existing interface
>> >> for
>> >>>> this
>> >>>>>>>>>>>>>>>>> purpose (we can explicitly point this out in the
>> >>>> documentation). In
>> >>>>>>>>>>>>>>>>> this way all configs and metrics will be unified. WDYT?
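The interface split described in that point (interface in flink-table-common, implementations in flink-table-runtime) could look roughly like the sketch below. This is illustrative only; the names and methods are hypothetical and not the final FLIP-221 API.

```java
import java.util.Collection;

/**
 * Hypothetical sketch of the proposed split: a small cache interface that could
 * live in flink-table-common, with concrete strategies (LRU, ALL, ...) shipped
 * in flink-table-runtime. Shape and names are illustrative, not the FLIP-221 API.
 */
public interface LookupCache<K, V> {
    /** Returns cached rows for the key, or null if the key was never looked up. */
    Collection<V> getIfPresent(K key);

    /**
     * Stores the (possibly empty) lookup result; empty results are cached too,
     * so repeated misses do not hit the external system again.
     */
    void put(K key, Collection<V> rows);

    /** Drops all entries, e.g. before a periodic full reload. */
    void invalidateAll();
}
```

A connector that wants its own cache logic could implement this interface directly, while others rely on the framework-provided strategies.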
>> >>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>> If a filter can prune 90% of data in the cache, we will
>> >>>> have 90% of
>> >>>>>>>>>>>>>>>>> lookup requests that can never be cached
>> >>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>> 2) Let me clarify the logic filters optimization in case
>> >> of
>> >>>> LRU cache.
>> >>>>>>>>>>>>>>>>> It looks like Cache<RowData, Collection<RowData>>. Here
>> >> we
>> >>>> always
>> >>>>>>>>>>>>>>>>> store the response of the dimension table in cache, even
>> >>>> after
>> >>>>>>>>>>>>>>>>> applying calc function. I.e. if there are no rows after
>> >>>> applying
>> >>>>>>>>>>>>>>>>> filters to the result of the 'eval' method of
>> >>> TableFunction,
>> >>>> we store
>> >>>>>>>>>>>>>>>>> the empty list by lookup keys. Therefore the cache line
>> >>> will
>> >>>> be
>> >>>>>>>>>>>>>>>>> filled, but will require much less memory (in bytes).
>> >> I.e.
>> >>>> we don't
>> >>>>>>>>>>>>>>>>> completely filter keys, by which result was pruned, but
>> >>>> significantly
>> >>>>>>>>>>>>>>>>> reduce required memory to store this result. If the user
>> >>>> knows about
>> >>>>>>>>>>>>>>>>> this behavior, he can increase the 'max-rows' option
>> >> before
>> >>>> the start
>> >>>>>>>>>>>>>>>>> of the job. But actually I came up with the idea that we
>> >>> can
>> >>>> do this
>> >>>>>>>>>>>>>>>>> automatically by using the 'maximumWeight' and 'weigher'
>> >>>> methods of
>> >>>>>>>>>>>>>>>>> GuavaCache [1]. Weight can be the size of the collection
>> >> of
>> >>>> rows
>> >>>>>>>>>>>>>>>>> (value of cache). Therefore cache can automatically fit
>> >>> much
>> >>>> more
>> >>>>>>>>>>>>>>>>> records than before.
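The weigher idea above can be sketched with a stdlib-only stand-in for Guava's maximumWeight/weigher mechanism (the class below is hypothetical, not Guava's CacheBuilder): the cache is bounded by the total number of cached rows rather than by the number of keys, so keys whose rows were pruned by filters (empty results) consume almost none of the budget.

```java
import java.util.Collection;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/**
 * Hypothetical sketch: an LRU cache bounded by total weight (sum of rows per
 * entry), mirroring Guava's maximumWeight/weigher idea. Empty results weigh
 * only 1, so keys with fully filtered-out rows barely consume the budget.
 */
public class WeightedLruCache<K, V> {
    private final long maxWeight;
    private long currentWeight = 0;
    // Access-order LinkedHashMap gives least-recently-used iteration order.
    private final LinkedHashMap<K, Collection<V>> map =
            new LinkedHashMap<>(16, 0.75f, true);

    public WeightedLruCache(long maxWeight) { this.maxWeight = maxWeight; }

    public Collection<V> get(K key) { return map.get(key); }

    public void put(K key, Collection<V> rows) {
        Collection<V> old = map.put(key, rows);
        if (old != null) currentWeight -= weigh(old);
        currentWeight += weigh(rows);
        // Evict least-recently-used entries until the weight budget is met.
        Iterator<Map.Entry<K, Collection<V>>> it = map.entrySet().iterator();
        while (currentWeight > maxWeight && it.hasNext()) {
            Map.Entry<K, Collection<V>> eldest = it.next();
            if (eldest.getKey().equals(key)) continue; // keep the entry just added
            currentWeight -= weigh(eldest.getValue());
            it.remove();
        }
    }

    private long weigh(Collection<V> rows) {
        // Weight = number of rows, with a minimum of 1 per entry.
        return Math.max(1, rows.size());
    }
}
```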
>> >>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>> Flink SQL has provided a standard way to do filters and
>> >>>> projects
>> >>>>>>>>>>>>>>>>> pushdown, i.e., SupportsFilterPushDown and
>> >>>> SupportsProjectionPushDown.
>> >>>>>>>>>>>>>>>>>> Jdbc/hive/HBase haven't implemented the interfaces,
>> >> don't
>> >>>> mean it's
>> >>>>>>>>>>>>>>> hard
>> >>>>>>>>>>>>>>>>> to implement.
>> >>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>> It's debatable how difficult it will be to implement
>> >> filter
>> >>>> pushdown.
>> >>>>>>>>>>>>>>>>> But I think the fact that currently there is no database
>> >>>> connector
>> >>>>>>>>>>>>>>>>> with filter pushdown at least means that this feature
>> >> won't
>> >>>> be
>> >>>>>>>>>>>>>>>>> supported soon in connectors. Moreover, if we talk about
>> >>>> other
>> >>>>>>>>>>>>>>>>> connectors (not in Flink repo), their databases might
>> not
>> >>>> support all
>> >>>>>>>>>>>>>>>>> Flink filters (or not support filters at all). I think
>> >>> users
>> >>>> are
>> >>>>>>>>>>>>>>>>> interested in supporting cache filters optimization
>> >>>> independently of
>> >>>>>>>>>>>>>>>>> supporting other features and solving more complex
>> >> problems
>> >>>> (or
>> >>>>>>>>>>>>>>>>> unsolvable at all).
>> >>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>> 3) I agree with your third statement. Actually in our
>> >>>> internal version
>> >>>>>>>>>>>>>>>>> I also tried to unify the logic of scanning and
>> reloading
>> >>>> data from
>> >>>>>>>>>>>>>>>>> connectors. But unfortunately, I didn't find a way to
>> >> unify
>> >>>> the logic
>> >>>>>>>>>>>>>>>>> of all ScanRuntimeProviders (InputFormat,
>> SourceFunction,
>> >>>> Source,...)
>> >>>>>>>>>>>>>>>>> and reuse it in reloading ALL cache. As a result I
>> >> settled
>> >>>> on using
>> >>>>>>>>>>>>>>>>> InputFormat, because it was used for scanning in all
>> >> lookup
>> >>>>>>>>>>>>>>>>> connectors. (I didn't know that there are plans to
>> >>> deprecate
>> >>>>>>>>>>>>>>>>> InputFormat in favor of FLIP-27 Source). IMO usage of
>> >>>> FLIP-27 source
>> >>>>>>>>>>>>>>>>> in ALL caching is not good idea, because this source was
>> >>>> designed to
>> >>>>>>>>>>>>>>>>> work in distributed environment (SplitEnumerator on
>> >>>> JobManager and
>> >>>>>>>>>>>>>>>>> SourceReaders on TaskManagers), not in one operator
>> >> (lookup
>> >>>> join
>> >>>>>>>>>>>>>>>>> operator in our case). There is even no direct way to
>> >> pass
>> >>>> splits from
>> >>>>>>>>>>>>>>>>> SplitEnumerator to SourceReader (this logic works
>> through
>> >>>>>>>>>>>>>>>>> SplitEnumeratorContext, which requires
>> >>>>>>>>>>>>>>>>> OperatorCoordinator.SubtaskGateway to send
>> >> AddSplitEvents).
>> >>>> Usage of
>> >>>>>>>>>>>>>>>>> InputFormat for ALL cache seems much clearer and
>> >>>> easier. But if
>> >>>>>>>>>>>>>>>>> there are plans to refactor all connectors to FLIP-27, I
>> >>>> have the
>> >>>>>>>>>>>>>>>>> following ideas: maybe we can refuse from lookup join
>> ALL
>> >>>> cache in
>> >>>>>>>>>>>>>>>>> favor of simple join with multiple scanning of batch
>> >>> source?
>> >>>> The point
>> >>>>>>>>>>>>>>>>> is that the only difference between lookup join ALL
>> cache
>> >>>> and simple
>> >>>>>>>>>>>>>>>>> join with batch source is that in the first case
>> scanning
>> >>> is
>> >>>> performed
>> >>>>>>>>>>>>>>>>> multiple times, in between which state (cache) is
>> cleared
>> >>>> (correct me
>> >>>>>>>>>>>>>>>>> if I'm wrong). So what if we extend the functionality of
>> >>>> simple join
>> >>>>>>>>>>>>>>>>> to support state reloading + extend the functionality of
>> >>>> scanning
>> >>>>>>>>>>>>>>>>> batch source multiple times (this one should be easy
>> with
>> >>>> new FLIP-27
>> >>>>>>>>>>>>>>>>> source, that unifies streaming/batch reading - we will
>> >> need
>> >>>> to change
>> >>>>>>>>>>>>>>>>> only SplitEnumerator, which will pass splits again after
>> >>>> some TTL).
>> >>>>>>>>>>>>>>>>> WDYT? I must say that this looks like a long-term goal
>> >> and
>> >>>> will make
>> >>>>>>>>>>>>>>>>> the scope of this FLIP even larger than you said. Maybe
>> >> we
>> >>>> can limit
>> >>>>>>>>>>>>>>>>> ourselves to a simpler solution now (InputFormats).
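The ALL-cache behaviour debated here (periodic full rescan of the dimension table, with the old state dropped on each reload) can be sketched independently of which scan API is ultimately chosen. The class below is a hypothetical stand-in; the "scanner" iterator represents whatever InputFormat/Source produces.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/**
 * Hypothetical sketch of an ALL cache: the whole dimension table is re-read
 * periodically and the lookup map is swapped atomically, so readers always see
 * either the old or the new snapshot, never a mix.
 */
public class AllCache<K, V> {
    private volatile Map<K, List<V>> snapshot = Map.of();
    private final Function<V, K> keyExtractor;

    public AllCache(Function<V, K> keyExtractor) {
        this.keyExtractor = keyExtractor;
    }

    /** Rebuilds the cache from a full scan; called once per reload interval (TTL). */
    public void reload(Iterator<V> scanner) {
        Map<K, List<V>> fresh = new HashMap<>();
        while (scanner.hasNext()) {
            V row = scanner.next();
            fresh.computeIfAbsent(keyExtractor.apply(row), k -> new ArrayList<>()).add(row);
        }
        snapshot = fresh; // atomic reference swap replaces the old state
    }

    public List<V> lookup(K key) {
        return snapshot.getOrDefault(key, List.of());
    }
}
```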
>> >>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>> So to sum up, my points is like this:
>> >>>>>>>>>>>>>>>>> 1) There is a way to make both concise and flexible
>> >>>> interfaces for
>> >>>>>>>>>>>>>>>>> caching in lookup join.
>> >>>>>>>>>>>>>>>>> 2) Cache filters optimization is important both in LRU
>> >> and
>> >>>> ALL caches.
>> >>>>>>>>>>>>>>>>> 3) It is unclear when filter pushdown will be supported
>> >> in
>> >>>> Flink
>> >>>>>>>>>>>>>>>>> connectors, some of the connectors might not have the
>> >>>> opportunity to
>> >>>>>>>>>>>>>>>>> support filter pushdown + as I know, currently filter
>> >>>> pushdown works
>> >>>>>>>>>>>>>>>>> only for scanning (not lookup). So cache filters +
>> >>>> projections
>> >>>>>>>>>>>>>>>>> optimization should be independent from other features.
>> >>>>>>>>>>>>>>>>> 4) ALL cache realization is a complex topic that
>> involves
>> >>>> multiple
>> >>>>>>>>>>>>>>>>> aspects of how Flink is developing. Refusing from
>> >>>> InputFormat in favor
>> >>>>>>>>>>>>>>>>> of FLIP-27 Source will make ALL cache realization really
>> >>>> complex and
>> >>>>>>>>>>>>>>>>> not clear, so maybe instead of that we can extend the
>> >>>> functionality of
>> >>>>>>>>>>>>>>>>> simple join or not refuse from InputFormat in case of
>> >>> lookup
>> >>>> join ALL
>> >>>>>>>>>>>>>>>>> cache?
>> >>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>> Best regards,
>> >>>>>>>>>>>>>>>>> Smirnov Alexander
>> >>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>> [1]
>> https://guava.dev/releases/18.0/api/docs/com/google/common/cache/CacheBuilder.html#weigher(com.google.common.cache.Weigher)
>> >>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>> чт, 5 мая 2022 г. в 20:34, Jark Wu <im...@gmail.com>:
>> >>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>> It's great to see the active discussion! I want to
>> share
>> >>> my
>> >>>> ideas:
>> >>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>> 1) implement the cache in framework vs. connectors base
>> >>>>>>>>>>>>>>>>>> I don't have a strong opinion on this. Both ways should
>> >>>> work (e.g.,
>> >>>>>>>>>>>>>>> cache
>> >>>>>>>>>>>>>>>>>> pruning, compatibility).
>> >>>>>>>>>>>>>>>>>> The framework way can provide more concise interfaces.
>> >>>>>>>>>>>>>>>>>> The connector base way can define more flexible cache
>> >>>>>>>>>>>>>>>>>> strategies/implementations.
>> >>>>>>>>>>>>>>>>>> We are still investigating a way to see if we can have
>> >>> both
>> >>>>>>>>>>>>>>> advantages.
>> >>>>>>>>>>>>>>>>>> We should reach a consensus that the way should be a
>> >> final
>> >>>> state,
>> >>>>>>>>>>>>>>> and we
>> >>>>>>>>>>>>>>>>>> are on the path to it.
>> >>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>> 2) filters and projections pushdown:
>> >>>>>>>>>>>>>>>>>> I agree with Alex that the filter pushdown into cache
>> >> can
>> >>>> benefit a
>> >>>>>>>>>>>>>>> lot
>> >>>>>>>>>>>>>>>>> for
>> >>>>>>>>>>>>>>>>>> ALL cache.
>> >>>>>>>>>>>>>>>>>> However, this is not true for LRU cache. Connectors use
>> >>>> cache to
>> >>>>>>>>>>>>>>> reduce
>> >>>>>>>>>>>>>>>>> IO
>> >>>>>>>>>>>>>>>>>> requests to databases for better throughput.
>> >>>>>>>>>>>>>>>>>> If a filter can prune 90% of data in the cache, we will
>> >>>> have 90% of
>> >>>>>>>>>>>>>>>>> lookup
>> >>>>>>>>>>>>>>>>>> requests that can never be cached
>> >>>>>>>>>>>>>>>>>> and hit directly to the databases. That means the cache
>> >> is
>> >>>>>>>>>>>>>>> meaningless in
>> >>>>>>>>>>>>>>>>>> this case.
>> >>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>> IMO, Flink SQL has provided a standard way to do
>> filters
>> >>>> and projects
>> >>>>>>>>>>>>>>>>>> pushdown, i.e., SupportsFilterPushDown and
>> >>>>>>>>>>>>>>> SupportsProjectionPushDown.
>> >>>>>>>>>>>>>>>>>> Jdbc/hive/HBase haven't implemented the interfaces,
>> >> don't
>> >>>> mean it's
>> >>>>>>>>>>>>>>> hard
>> >>>>>>>>>>>>>>>>> to
>> >>>>>>>>>>>>>>>>>> implement.
>> >>>>>>>>>>>>>>>>>> They should implement the pushdown interfaces to reduce
>> >> IO
>> >>>> and the
>> >>>>>>>>>>>>>>> cache
>> >>>>>>>>>>>>>>>>>> size.
>> >>>>>>>>>>>>>>>>>> That should be a final state that the scan source and
>> >>>> lookup source
>> >>>>>>>>>>>>>>> share
>> >>>>>>>>>>>>>>>>>> the exact pushdown implementation.
>> >>>>>>>>>>>>>>>>>> I don't see why we need to duplicate the pushdown logic
>> >> in
>> >>>> caches,
>> >>>>>>>>>>>>>>> which
>> >>>>>>>>>>>>>>>>>>>> will complicate the lookup join design.
>> >>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>> 3) ALL cache abstraction
>> >>>>>>>>>>>>>>>>>> All cache might be the most challenging part of this
>> >> FLIP.
>> >>>> We have
>> >>>>>>>>>>>>>>> never
>> >>>>>>>>>>>>>>>>>> provided a reload-lookup public interface.
>> >>>>>>>>>>>>>>>>>> Currently, we put the reload logic in the "eval" method
>> >> of
>> >>>>>>>>>>>>>>> TableFunction.
>> >>>>>>>>>>>>>>>>>> That's hard for some sources (e.g., Hive).
>> >>>>>>>>>>>>>>>>>> Ideally, connector implementation should share the
>> logic
>> >>> of
>> >>>> reload
>> >>>>>>>>>>>>>>> and
>> >>>>>>>>>>>>>>>>>> scan, i.e. ScanTableSource with
>> >>>> InputFormat/SourceFunction/FLIP-27
>> >>>>>>>>>>>>>>>>> Source.
>> >>>>>>>>>>>>>>>>>> However, InputFormat/SourceFunction are deprecated, and
>> >>> the
>> >>>> FLIP-27
>> >>>>>>>>>>>>>>>>> source
>> >>>>>>>>>>>>>>>>>> is deeply coupled with SourceOperator.
>> >>>>>>>>>>>>>>>>>> If we want to invoke the FLIP-27 source in LookupJoin,
>> >>> this
>> >>>> may make
>> >>>>>>>>>>>>>>> the
>> >>>>>>>>>>>>>>>>>> scope of this FLIP much larger.
>> >>>>>>>>>>>>>>>>>> We are still investigating how to abstract the ALL
>> cache
>> >>>> logic and
>> >>>>>>>>>>>>>>> reuse
>> >>>>>>>>>>>>>>>>>> the existing source interfaces.
>> >>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>> Best,
>> >>>>>>>>>>>>>>>>>> Jark
>> >>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>> On Thu, 5 May 2022 at 20:22, Roman Boyko <
>> >>>> ro.v.boyko@gmail.com>
>> >>>>>>>>>>>>>>> wrote:
>> >>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>> It's a much more complicated activity and lies outside
>> >> the
>> >>>> scope of
>> >>>>>>>>>>>>>>> this
>> >>>>>>>>>>>>>>>>>>> improvement. Because such pushdowns should be done for
>> >>> all
>> >>>>>>>>>>>>>>>>> ScanTableSource
>> >>>>>>>>>>>>>>>>>>> implementations (not only for Lookup ones).
>> >>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>> On Thu, 5 May 2022 at 19:02, Martijn Visser <
>> >>>>>>>>>>>>>>> martijnvisser@apache.org>
>> >>>>>>>>>>>>>>>>>>> wrote:
>> >>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>> Hi everyone,
>> >>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>> One question regarding "And Alexander correctly
>> >>> mentioned
>> >>>> that
>> >>>>>>>>>>>>>>> filter
>> >>>>>>>>>>>>>>>>>>>> pushdown still is not implemented for
>> >> jdbc/hive/hbase."
>> >>>> -> Would
>> >>>>>>>>>>>>>>> an
>> >>>>>>>>>>>>>>>>>>>> alternative solution be to actually implement these
>> >>> filter
>> >>>>>>>>>>>>>>> pushdowns?
>> >>>>>>>>>>>>>>>>> I
>> >>>>>>>>>>>>>>>>>>>> can
>> >>>>>>>>>>>>>>>>>>>> imagine that there are many more benefits to doing
>> >> that,
>> >>>> outside
>> >>>>>>>>>>>>>>> of
>> >>>>>>>>>>>>>>>>> lookup
>> >>>>>>>>>>>>>>>>>>>> caching and metrics.
>> >>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>> Best regards,
>> >>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>> Martijn Visser
>> >>>>>>>>>>>>>>>>>>>> https://twitter.com/MartijnVisser82
>> >>>>>>>>>>>>>>>>>>>> https://github.com/MartijnVisser
>> >>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>> On Thu, 5 May 2022 at 13:58, Roman Boyko <
>> >>>> ro.v.boyko@gmail.com>
>> >>>>>>>>>>>>>>>>> wrote:
>> >>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>> Hi everyone!
>> >>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>> Thanks for driving such a valuable improvement!
>> >>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>> I do think that single cache implementation would be
>> >> a
>> >>>> nice
>> >>>>>>>>>>>>>>>>> opportunity
>> >>>>>>>>>>>>>>>>>>>> for
>> >>>>>>>>>>>>>>>>>>>>> users. And it will break the "FOR SYSTEM_TIME AS OF
>> >>>> proc_time"
>> >>>>>>>>>>>>>>>>> semantics
>> >>>>>>>>>>>>>>>>>>>>> anyway - no matter how it is implemented.
>> >>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>> Putting myself in the user's shoes, I can say that:
>> >>>>>>>>>>>>>>>>>>>>> 1) I would prefer to have the opportunity to cut off
>> >>> the
>> >>>> cache
>> >>>>>>>>>>>>>>> size
>> >>>>>>>>>>>>>>>>> by
>> >>>>>>>>>>>>>>>>>>>>> simply filtering unnecessary data. And the most
>> handy
>> >>>> way to do
>> >>>>>>>>>>>>>>> it
>> >>>>>>>>>>>>>>>>> is
>> >>>>>>>>>>>>>>>>>>>> apply
>> >>>>>>>>>>>>>>>>>>>>> it inside LookupRunners. It would be a bit harder to
>> >>>> pass it
>> >>>>>>>>>>>>>>>>> through the
>> >>>>>>>>>>>>>>>>>>>>> LookupJoin node to TableFunction. And Alexander
>> >>> correctly
>> >>>>>>>>>>>>>>> mentioned
>> >>>>>>>>>>>>>>>>> that
>> >>>>>>>>>>>>>>>>>>>>> filter pushdown still is not implemented for
>> >>>> jdbc/hive/hbase.
>> >>>>>>>>>>>>>>>>>>>>> 2) The ability to set the different caching
>> >> parameters
>> >>>> for
>> >>>>>>>>>>>>>>> different
>> >>>>>>>>>>>>>>>>>>>> tables
>> >>>>>>>>>>>>>>>>>>>>> is quite important. So I would prefer to set it
>> >> through
>> >>>> DDL
>> >>>>>>>>>>>>>>> rather
>> >>>>>>>>>>>>>>>>> than
>> >>>>>>>>>>>>>>>>>>>>> have similar TTL, strategy and other options for
>> all
>> >>>> lookup
>> >>>>>>>>>>>>>>> tables.
>> >>>>>>>>>>>>>>>>>>>>> 3) Providing the cache into the framework really
>> >>>> deprives us of
>> >>>>>>>>>>>>>>>>>>>>> extensibility (users won't be able to implement
>> their
>> >>> own
>> >>>>>>>>>>>>>>> cache).
>> >>>>>>>>>>>>>>>>> But
>> >>>>>>>>>>>>>>>>>>>> most
>> >>>>>>>>>>>>>>>>>>>>> probably it might be solved by creating more
>> >> different
>> >>>> cache
>> >>>>>>>>>>>>>>>>> strategies
>> >>>>>>>>>>>>>>>>>>>> and
>> >>>>>>>>>>>>>>>>>>>>> a wider set of configurations.
>> >>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>> All these points are much closer to the schema
>> >> proposed
>> >>>> by
>> >>>>>>>>>>>>>>>>> Alexander.
>> >>>>>>>>>>>>>>>>>>>>> Qingshen Ren, please correct me if I'm not right and
>> >>> all
>> >>>> these
>> >>>>>>>>>>>>>>>>>>>> facilities
>> >>>>>>>>>>>>>>>>>>>>> might be simply implemented in your architecture?
>> >>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>> Best regards,
>> >>>>>>>>>>>>>>>>>>>>> Roman Boyko
>> >>>>>>>>>>>>>>>>>>>>> e.: ro.v.boyko@gmail.com
>> >>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>> On Wed, 4 May 2022 at 21:01, Martijn Visser <
>> >>>>>>>>>>>>>>>>> martijnvisser@apache.org>
>> >>>>>>>>>>>>>>>>>>>>> wrote:
>> >>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>> Hi everyone,
>> >>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>> I don't have much to chip in, but just wanted to
>> >>>> express that
>> >>>>>>>>>>>>>>> I
>> >>>>>>>>>>>>>>>>> really
>> >>>>>>>>>>>>>>>>>>>>>> appreciate the in-depth discussion on this topic
>> >> and I
>> >>>> hope
>> >>>>>>>>>>>>>>> that
>> >>>>>>>>>>>>>>>>>>>> others
>> >>>>>>>>>>>>>>>>>>>>>> will join the conversation.
>> >>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>> Best regards,
>> >>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>> Martijn
>> >>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>> On Tue, 3 May 2022 at 10:15, Александр Смирнов <
>> >>>>>>>>>>>>>>>>> smiralexan@gmail.com>
>> >>>>>>>>>>>>>>>>>>>>>> wrote:
>> >>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>> Hi Qingsheng, Leonard and Jark,
>> >>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>> Thanks for your detailed feedback! However, I have
>> >>>> questions
>> >>>>>>>>>>>>>>>>> about
>> >>>>>>>>>>>>>>>>>>>>>>> some of your statements (maybe I didn't get
>> >>>> something?).
>> >>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>> Caching actually breaks the semantic of "FOR
>> >>>> SYSTEM_TIME
>> >>>>>>>>>>>>>>> AS OF
>> >>>>>>>>>>>>>>>>>>>>>> proc_time”
>> >>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>> I agree that the semantics of "FOR SYSTEM_TIME AS
>> >> OF
>> >>>>>>>>>>>>>>> proc_time"
>> >>>>>>>>>>>>>>>>> is
>> >>>>>>>>>>>>>>>>>>>> not
>> >>>>>>>>>>>>>>>>>>>>>>> fully implemented with caching, but as you said,
>> >>> users
>> >>>> go
>> >>>>>>>>>>>>>>> on it
>> >>>>>>>>>>>>>>>>>>>>>>> consciously to achieve better performance (no one
>> >>>> proposed
>> >>>>>>>>>>>>>>> to
>> >>>>>>>>>>>>>>>>> enable
>> >>>>>>>>>>>>>>>>>>>>>>> caching by default, etc.). Or by users do you mean
>> >>>> other
>> >>>>>>>>>>>>>>>>> developers
>> >>>>>>>>>>>>>>>>>>>> of
>> >>>>>>>>>>>>>>>>>>>>>>> connectors? In this case developers explicitly
>> >>> specify
>> >>>>>>>>>>>>>>> whether
>> >>>>>>>>>>>>>>>>> their
>> >>>>>>>>>>>>>>>>>>>>>>> connector supports caching or not (in the list of
>> >>>> supported
>> >>>>>>>>>>>>>>>>>>>> options),
>> >>>>>>>>>>>>>>>>>>>>>>> no one makes them do that if they don't want to.
>> So
>> >>>> what
>> >>>>>>>>>>>>>>>>> exactly is
>> >>>>>>>>>>>>>>>>>>>>>>> the difference between implementing caching in
>> >>> modules
>> >>>>>>>>>>>>>>>>>>>>>>> flink-table-runtime and in flink-table-common from
>> >>> the
>> >>>>>>>>>>>>>>>>> considered
>> >>>>>>>>>>>>>>>>>>>>>>> point of view? How does it affect
>> >>>> breaking/non-breaking
>> >>>>>>>>>>>>>>> the
>> >>>>>>>>>>>>>>>>>>>>>>> semantics of "FOR SYSTEM_TIME AS OF proc_time"?
>> >>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>> confront a situation that allows table options in
>> >>> DDL
>> >>>> to
>> >>>>>>>>>>>>>>>>> control
>> >>>>>>>>>>>>>>>>>>>> the
>> >>>>>>>>>>>>>>>>>>>>>>> behavior of the framework, which has never
>> happened
>> >>>>>>>>>>>>>>> previously
>> >>>>>>>>>>>>>>>>> and
>> >>>>>>>>>>>>>>>>>>>>> should
>> >>>>>>>>>>>>>>>>>>>>>>> be cautious
>> >>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>> If we talk about main differences of semantics of
>> >> DDL
>> >>>>>>>>>>>>>>> options
>> >>>>>>>>>>>>>>>>> and
>> >>>>>>>>>>>>>>>>>>>>>>> config options("table.exec.xxx"), isn't it about
>> >>>> limiting
>> >>>>>>>>>>>>>>> the
>> >>>>>>>>>>>>>>>>> scope
>> >>>>>>>>>>>>>>>>>>>> of
>> >>>>>>>>>>>>>>>>>>>>>>> the options + importance for the user business
>> >> logic
>> >>>> rather
>> >>>>>>>>>>>>>>> than
>> >>>>>>>>>>>>>>>>>>>>>>> specific location of corresponding logic in the
>> >>>> framework? I
>> >>>>>>>>>>>>>>>>> mean
>> >>>>>>>>>>>>>>>>>>>> that
>> >>>>>>>>>>>>>>>>>>>>>>> in my design, for example, putting an option with
>> >>>> lookup
>> >>>>>>>>>>>>>>> cache
>> >>>>>>>>>>>>>>>>>>>>>>> strategy in configurations would be the wrong
>> >>>> decision,
>> >>>>>>>>>>>>>>>>> because it
>> >>>>>>>>>>>>>>>>>>>>>>> directly affects the user's business logic (not
>> >> just
>> >>>>>>>>>>>>>>> performance
>> >>>>>>>>>>>>>>>>>>>>>>> optimization) + touches just several functions of
>> >> ONE
>> >>>> table
>> >>>>>>>>>>>>>>>>> (there
>> >>>>>>>>>>>>>>>>>>>> can
>> >>>>>>>>>>>>>>>>>>>>>>> be multiple tables with different caches). Does it
>> >>>> really
>> >>>>>>>>>>>>>>>>> matter for
>> >>>>>>>>>>>>>>>>>>>>>>> the user (or someone else) where the logic is
>> >>> located,
>> >>>>>>>>>>>>>>> which is
>> >>>>>>>>>>>>>>>>>>>>>>> affected by the applied option?
>> >>>>>>>>>>>>>>>>>>>>>>> Also I can remember DDL option 'sink.parallelism',
>> >>>> which in
>> >>>>>>>>>>>>>>>>> some way
>> >>>>>>>>>>>>>>>>>>>>>>> "controls the behavior of the framework" and I
>> >> don't
>> >>>> see any
>> >>>>>>>>>>>>>>>>> problem
>> >>>>>>>>>>>>>>>>>>>>>>> here.
>> >>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>> introduce a new interface for this all-caching
>> >>>> scenario
>> >>>>>>>>>>>>>>> and
>> >>>>>>>>>>>>>>>>> the
>> >>>>>>>>>>>>>>>>>>>>> design
>> >>>>>>>>>>>>>>>>>>>>>>> would become more complex
>> >>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>> This is a subject for a separate discussion, but
>> >>>> actually
>> >>>>>>>>>>>>>>> in our
>> >>>>>>>>>>>>>>>>>>>>>>> internal version we solved this problem quite
>> >> easily
>> >>> -
>> >>>> we
>> >>>>>>>>>>>>>>> reused
>> >>>>>>>>>>>>>>>>>>>>>>> InputFormat class (so there is no need for a new
>> >>> API).
>> >>>> The
>> >>>>>>>>>>>>>>>>> point is
>> >>>>>>>>>>>>>>>>>>>>>>> that currently all lookup connectors use
>> >> InputFormat
>> >>>> for
>> >>>>>>>>>>>>>>>>> scanning
>> >>>>>>>>>>>>>>>>>>>> the
>> >>>>>>>>>>>>>>>>>>>>>>> data in batch mode: HBase, JDBC and even Hive - it
>> >>> uses
>> >>>>>>>>>>>>>>> class
>> >>>>>>>>>>>>>>>>>>>>>>> PartitionReader, that is actually just a wrapper
>> >>> around
>> >>>>>>>>>>>>>>>>> InputFormat.
>> >>>>>>>>>>>>>>>>>>>>>>> The advantage of this solution is the ability to
>> >>> reload
>> >>>>>>>>>>>>>>> cache
>> >>>>>>>>>>>>>>>>> data
>> >>>>>>>>>>>>>>>>>>>> in
>> >>>>>>>>>>>>>>>>>>>>>>> parallel (number of threads depends on number of
>> >>>>>>>>>>>>>>> InputSplits,
>> >>>>>>>>>>>>>>>>> but
>> >>>>>>>>>>>>>>>>>>>> has
>> >>>>>>>>>>>>>>>>>>>>>>> an upper limit). As a result cache reload time
>> >>>> significantly
>> >>>>>>>>>>>>>>>>> reduces
>> >>>>>>>>>>>>>>>>>>>>>>> (as well as time of input stream blocking). I know
>> >>> that
>> >>>>>>>>>>>>>>> usually
>> >>>>>>>>>>>>>>>>> we
>> >>>>>>>>>>>>>>>>>>>> try
>> >>>>>>>>>>>>>>>>>>>>>>> to avoid usage of concurrency in Flink code, but
>> >>> maybe
>> >>>> this
>> >>>>>>>>>>>>>>> one
>> >>>>>>>>>>>>>>>>> can
>> >>>>>>>>>>>>>>>>>>>> be
>> >>>>>>>>>>>>>>>>>>>>>>> an exception. BTW I don't say that it's an ideal
>> >>>> solution,
>> >>>>>>>>>>>>>>> maybe
>> >>>>>>>>>>>>>>>>>>>> there
>> >>>>>>>>>>>>>>>>>>>>>>> are better ones.
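The parallel-reload idea described there can be sketched with a plain thread pool. SplitReader below is a hypothetical stand-in for InputFormat's split handling, not a real Flink interface: each split is read by its own task, the number of threads is capped, and the partial results are merged before the new cache is published.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/**
 * Hypothetical sketch of a parallel ALL-cache reload: one reader task per
 * "split" of the dimension table, with an upper bound on the thread count.
 */
public class ParallelReload {
    /** Stand-in for an InputFormat-style split reader. */
    public interface SplitReader<T> {
        List<T> readSplit();
    }

    public static <T> List<T> reloadAll(List<SplitReader<T>> splits, int maxThreads) {
        ExecutorService pool = Executors.newFixedThreadPool(
                Math.min(maxThreads, Math.max(1, splits.size())));
        try {
            List<Future<List<T>>> futures = new ArrayList<>();
            for (SplitReader<T> split : splits) {
                futures.add(pool.submit(split::readSplit));
            }
            List<T> all = new ArrayList<>();
            for (Future<List<T>> f : futures) {
                try {
                    all.addAll(f.get()); // block until every split is read
                } catch (InterruptedException | ExecutionException e) {
                    throw new RuntimeException("Cache reload failed", e);
                }
            }
            return all;
        } finally {
            pool.shutdown();
        }
    }
}
```

Because the results are merged in submission order, the reloaded data is deterministic even though splits are read concurrently.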
>> >>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>> Providing the cache in the framework might
>> >> introduce
>> >>>>>>>>>>>>>>>>> compatibility
>> >>>>>>>>>>>>>>>>>>>>>> issues
>> >>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>> It's possible only in cases when the developer of
>> >> the
>> >>>>>>>>>>>>>>> connector
>> >>>>>>>>>>>>>>>>>>>> won't
>> >>>>>>>>>>>>>>>>>>>>>>> properly refactor his code and will use new cache
>> >>>> options
>> >>>>>>>>>>>>>>>>>>>> incorrectly
>> >>>>>>>>>>>>>>>>>>>>>>> (i.e. explicitly provide the same options into 2
>> >>>> different
>> >>>>>>>>>>>>>>> code
>> >>>>>>>>>>>>>>>>>>>>>>> places). For correct behavior all he will need to
>> >> do
>> >>>> is to
>> >>>>>>>>>>>>>>>>> redirect
>> >>>>>>>>>>>>>>>>>>>>>>> existing options to the framework's LookupConfig
>> (+
>> >>>> maybe
>> >>>>>>>>>>>>>>> add an
>> >>>>>>>>>>>>>>>>>>>> alias
>> >>>>>>>>>>>>>>>>>>>>>>> for options, if there was different naming),
>> >>> everything
>> >>>>>>>>>>>>>>> will be
>> >>>>>>>>>>>>>>>>>>>>>>> transparent for users. If the developer won't do
>> >>>>>>>>>>>>>>> refactoring at
>> >>>>>>>>>>>>>>>>> all,
>> >>>>>>>>>>>>>>>>>>>>>>> nothing will be changed for the connector because
>> >> of
>> >>>>>>>>>>>>>>> backward
>> >>>>>>>>>>>>>>>>>>>>>>> compatibility. Also if a developer wants to use
>> his
>> >>> own
>> >>>>>>>>>>>>>>> cache
>> >>>>>>>>>>>>>>>>> logic,
>> >>>>>>>>>>>>>>>>>>>>>>> he just can refuse to pass some of the configs
>> into
>> >>> the
>> >>>>>>>>>>>>>>>>> framework,
>> >>>>>>>>>>>>>>>>>>>> and
>> >>>>>>>>>>>>>>>>>>>>>>> instead make his own implementation with already
>> >>>> existing
>> >>>>>>>>>>>>>>>>> configs
>> >>>>>>>>>>>>>>>>>>>> and
>> >>>>>>>>>>>>>>>>>>>>>>> metrics (but actually I think that it's a rare
>> >> case).
>> >>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>> filters and projections should be pushed all the way down to the table
>> >>>>>>>>>>>>>>>>>>>>>>>> function, like what we do in the scan source
>> >>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>> It's a great goal. But the truth is that the ONLY connector that
>> >>>>>>>>>>>>>>>>>>>>>>> supports filter pushdown is FileSystemTableSource (no database
>> >>>>>>>>>>>>>>>>>>>>>>> connector supports it currently). Also, for some databases it's simply
>> >>>>>>>>>>>>>>>>>>>>>>> impossible to push down such complex filters as we have in Flink.
>> >>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>> only applying these optimizations to the cache seems not quite useful
>> >>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>> Filters can cut off an arbitrarily large amount of data from the
>> >>>>>>>>>>>>>>>>>>>>>>> dimension table. For a simple example, suppose in the dimension table
>> >>>>>>>>>>>>>>>>>>>>>>> 'users' we have a column 'age' with values from 20 to 40, and an input
>> >>>>>>>>>>>>>>>>>>>>>>> stream 'clicks' that is ~uniformly distributed by age of users. If we
>> >>>>>>>>>>>>>>>>>>>>>>> have the filter 'age > 30', there will be half as much data in the
>> >>>>>>>>>>>>>>>>>>>>>>> cache. This means the user can increase 'lookup.cache.max-rows' by
>> >>>>>>>>>>>>>>>>>>>>>>> almost 2 times, which gains a huge performance boost. Moreover, this
>> >>>>>>>>>>>>>>>>>>>>>>> optimization starts to really shine with the 'ALL' cache, where tables
>> >>>>>>>>>>>>>>>>>>>>>>> without filters and projections can't fit in memory, but with them -
>> >>>>>>>>>>>>>>>>>>>>>>> can. This opens up additional possibilities for users. And that doesn't
>> >>>>>>>>>>>>>>>>>>>>>>> sound like 'not quite useful'.
>> >>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>> It would be great to hear other voices regarding this topic! Because we
>> >>>>>>>>>>>>>>>>>>>>>>> have quite a lot of controversial points, and I think with the help of
>> >>>>>>>>>>>>>>>>>>>>>>> others it will be easier for us to come to a consensus.
>> >>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>> Best regards,
>> >>>>>>>>>>>>>>>>>>>>>>> Smirnov Alexander
>> >>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>> On Fri, 29 Apr 2022 at 22:33, Qingsheng Ren <renqschn@gmail.com> wrote:
>> >>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>> Hi Alexander and Arvid,
>> >>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>> Thanks for the discussion and sorry for my late response! We had an
>> >>>>>>>>>>>>>>>>>>>>>>>> internal discussion together with Jark and Leonard and I'd like to
>> >>>>>>>>>>>>>>>>>>>>>>>> summarize our ideas. Instead of implementing the cache logic in the
>> >>>>>>>>>>>>>>>>>>>>>>>> table runtime layer or wrapping around the user-provided table
>> >>>>>>>>>>>>>>>>>>>>>>>> function, we prefer to introduce some new APIs extending
>> >>>>>>>>>>>>>>>>>>>>>>>> TableFunction, with these concerns:
>> >>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>> 1. Caching actually breaks the semantics of "FOR SYSTEM_TIME AS OF
>> >>>>>>>>>>>>>>>>>>>>>>>> proc_time", because it couldn't truly reflect the content of the
>> >>>>>>>>>>>>>>>>>>>>>>>> lookup table at the moment of querying. If users choose to enable
>> >>>>>>>>>>>>>>>>>>>>>>>> caching on the lookup table, they implicitly indicate that this
>> >>>>>>>>>>>>>>>>>>>>>>>> breakage is acceptable in exchange for the performance. So we prefer
>> >>>>>>>>>>>>>>>>>>>>>>>> not to provide caching at the table runtime level.
>> >>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>> 2. If we put the cache implementation in the framework (whether in a
>> >>>>>>>>>>>>>>>>>>>>>>>> runner or a wrapper around TableFunction), we have to confront a
>> >>>>>>>>>>>>>>>>>>>>>>>> situation that allows table options in DDL to control the behavior of
>> >>>>>>>>>>>>>>>>>>>>>>>> the framework, which has never happened previously and should be
>> >>>>>>>>>>>>>>>>>>>>>>>> treated cautiously. Under the current design the behavior of the
>> >>>>>>>>>>>>>>>>>>>>>>>> framework should only be specified by configurations
>> >>>>>>>>>>>>>>>>>>>>>>>> ("table.exec.xxx"), and it's hard to apply these general configs to a
>> >>>>>>>>>>>>>>>>>>>>>>>> specific table.
>> >>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>> 3. We have use cases where the lookup source loads and refreshes all
>> >>>>>>>>>>>>>>>>>>>>>>>> records periodically into memory to achieve high lookup performance
>> >>>>>>>>>>>>>>>>>>>>>>>> (like the Hive connector in the community, and also widely used by our
>> >>>>>>>>>>>>>>>>>>>>>>>> internal connectors). Wrapping the cache around the user's
>> >>>>>>>>>>>>>>>>>>>>>>>> TableFunction works fine for LRU caches, but I think we have to
>> >>>>>>>>>>>>>>>>>>>>>>>> introduce a new interface for this all-caching scenario, and the
>> >>>>>>>>>>>>>>>>>>>>>>>> design would become more complex.
>> >>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>> 4. Providing the cache in the framework might introduce compatibility
>> >>>>>>>>>>>>>>>>>>>>>>>> issues to existing lookup sources: there might exist two caches with
>> >>>>>>>>>>>>>>>>>>>>>>>> totally different strategies if the user incorrectly configures the
>> >>>>>>>>>>>>>>>>>>>>>>>> table (one in the framework and another implemented by the lookup
>> >>>>>>>>>>>>>>>>>>>>>>>> source).
>> >>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>> As for the optimization mentioned by Alexander, I think filters and
>> >>>>>>>>>>>>>>>>>>>>>>>> projections should be pushed all the way down to the table function,
>> >>>>>>>>>>>>>>>>>>>>>>>> like what we do in the scan source, instead of into the runner with
>> >>>>>>>>>>>>>>>>>>>>>>>> the cache. The goal of using a cache is to reduce the network I/O and
>> >>>>>>>>>>>>>>>>>>>>>>>> pressure on the external system, and only applying these optimizations
>> >>>>>>>>>>>>>>>>>>>>>>>> to the cache seems not quite useful.
>> >>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>> I made some updates to the FLIP[1] to reflect our ideas. We prefer to
>> >>>>>>>>>>>>>>>>>>>>>>>> keep the cache implementation as a part of TableFunction, and we could
>> >>>>>>>>>>>>>>>>>>>>>>>> provide some helper classes (CachingTableFunction,
>> >>>>>>>>>>>>>>>>>>>>>>>> AllCachingTableFunction, CachingAsyncTableFunction) to developers and
>> >>>>>>>>>>>>>>>>>>>>>>>> regulate metrics of the cache. Also, I made a POC[2] for your
>> >>>>>>>>>>>>>>>>>>>>>>>> reference.
>> >>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>> Looking forward to your ideas!
>> >>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>> [1]
>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-221+Abstraction+for+lookup+source+cache+and+metric
>> >>>>>>>>>>>>>>>>>>>>>>>> [2] https://github.com/PatrickRen/flink/tree/FLIP-221
>> >>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>> Best regards,
>> >>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>> Qingsheng
>> >>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>> On Tue, Apr 26, 2022 at 4:45 PM Александр Смирнов
>> >>>>>>>>>>>>>>>>>>>>>>>> <smiralexan@gmail.com> wrote:
>> >>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>> Thanks for the response, Arvid!
>> >>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>> I have a few comments on your message.
>> >>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>> but could also live with an easier solution as the first step:
>> >>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>> I think that these two approaches (the one originally proposed by
>> >>>>>>>>>>>>>>>>>>>>>>>>> Qingsheng and mine) are mutually exclusive, because conceptually they
>> >>>>>>>>>>>>>>>>>>>>>>>>> follow the same goal, but the implementation details are different.
>> >>>>>>>>>>>>>>>>>>>>>>>>> If we go one way, moving to the other way in the future will mean
>> >>>>>>>>>>>>>>>>>>>>>>>>> deleting existing code and once again changing the API for
>> >>>>>>>>>>>>>>>>>>>>>>>>> connectors. So I think we should reach a consensus with the community
>> >>>>>>>>>>>>>>>>>>>>>>>>> about that and then work together on this FLIP, i.e. divide the work
>> >>>>>>>>>>>>>>>>>>>>>>>>> into tasks for different parts of the FLIP (for example, LRU cache
>> >>>>>>>>>>>>>>>>>>>>>>>>> unification / introducing the proposed set of metrics / further
>> >>>>>>>>>>>>>>>>>>>>>>>>> work…). WDYT, Qingsheng?
>> >>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>> as the source will only receive the requests after filter
>> >>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>> Actually, if filters are applied to fields of the lookup table, we
>> >>>>>>>>>>>>>>>>>>>>>>>>> must first do the requests, and only after that can we filter the
>> >>>>>>>>>>>>>>>>>>>>>>>>> responses, because lookup connectors don't have filter pushdown. So if
>> >>>>>>>>>>>>>>>>>>>>>>>>> filtering is done before caching, there will be many fewer rows in the
>> >>>>>>>>>>>>>>>>>>>>>>>>> cache.
>> >>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>> @Alexander unfortunately, your architecture is not shared. I don't
>> >>>>>>>>>>>>>>>>>>>>>>>>>> know the solution to share images to be honest.
>> >>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>> Sorry for that, I'm a bit new to such kinds of conversations :)
>> >>>>>>>>>>>>>>>>>>>>>>>>> I have no write access to the Confluence, so I made a Jira issue where
>> >>>>>>>>>>>>>>>>>>>>>>>>> I described the proposed changes in more detail -
>> >>>>>>>>>>>>>>>>>>>>>>>>> https://issues.apache.org/jira/browse/FLINK-27411.
>> >>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>> I'll be happy to get more feedback!
>> >>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>> Best,
>> >>>>>>>>>>>>>>>>>>>>>>>>> Smirnov Alexander
>> >>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>> On Mon, 25 Apr 2022 at 19:49, Arvid Heise <arvid@apache.org> wrote:
>> >>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>> Hi Qingsheng,
>> >>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for driving this; the inconsistency was not satisfying for me.
>> >>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>> I second Alexander's idea but could also live with an easier solution
>> >>>>>>>>>>>>>>>>>>>>>>>>>> as the first step: instead of making caching an implementation detail
>> >>>>>>>>>>>>>>>>>>>>>>>>>> of TableFunction X, rather devise a caching layer around X. So the
>> >>>>>>>>>>>>>>>>>>>>>>>>>> proposal would be a CachingTableFunction that delegates to X in case
>> >>>>>>>>>>>>>>>>>>>>>>>>>> of misses and otherwise manages the cache. Lifting it into the
>> >>>>>>>>>>>>>>>>>>>>>>>>>> operator model as proposed would be even better but is probably
>> >>>>>>>>>>>>>>>>>>>>>>>>>> unnecessary in the first step for a lookup source (as the source will
>> >>>>>>>>>>>>>>>>>>>>>>>>>> only receive the requests after the filter; applying projection may be
>> >>>>>>>>>>>>>>>>>>>>>>>>>> more interesting to save memory).
>> >>>>>>>>>>>>>>>>>>>>>>>>>>
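A rough sketch of the delegating cache-on-miss layer described above (a hedged illustration only; `CachingLookup` and its wiring are invented for this example and are not the actual CachingTableFunction API):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/**
 * Hypothetical cache-on-miss wrapper: delegates to the underlying lookup
 * function X only when the key is absent from the cache, otherwise serves
 * the cached rows without touching the external system.
 */
class CachingLookup<K, V> {
    private final Function<K, List<V>> delegate; // the wrapped lookup function "X"
    private final Map<K, List<V>> cache = new HashMap<>();

    CachingLookup(Function<K, List<V>> delegate) {
        this.delegate = delegate;
    }

    List<V> lookup(K key) {
        // A cache hit short-circuits the external request; a miss delegates
        // to X and stores the result for subsequent lookups of the same key.
        return cache.computeIfAbsent(key, delegate);
    }
}
```

A real implementation would additionally need an eviction policy (max rows / TTL) and the cache metrics discussed in this thread.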
>> >>>>>>>>>>>>>>>>>>>>>>>>>> Another advantage is that all the changes of this FLIP would be
>> >>>>>>>>>>>>>>>>>>>>>>>>>> limited to options; no need for new public interfaces. Everything else
>> >>>>>>>>>>>>>>>>>>>>>>>>>> remains an implementation detail of the Table runtime. That means we
>> >>>>>>>>>>>>>>>>>>>>>>>>>> can easily incorporate the optimization potential that Alexander
>> >>>>>>>>>>>>>>>>>>>>>>>>>> pointed out later.
>> >>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>> @Alexander unfortunately, your architecture is not shared. I don't
>> >>>>>>>>>>>>>>>>>>>>>>>>>> know the solution to share images to be honest.
>> >>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>> On Fri, Apr 22, 2022 at 5:04 PM Александр Смирнов
>> >>>>>>>>>>>>>>>>>>>>>>>>>> <smiralexan@gmail.com> wrote:
>> >>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>>> Hi Qingsheng! My name is Alexander; I'm not a committer yet, but I'd
>> >>>>>>>>>>>>>>>>>>>>>>>>>>> really like to become one. And this FLIP really interested me.
>> >>>>>>>>>>>>>>>>>>>>>>>>>>> Actually, I have worked on a similar feature in my company's Flink
>> >>>>>>>>>>>>>>>>>>>>>>>>>>> fork, and we would like to share our thoughts on this and make the
>> >>>>>>>>>>>>>>>>>>>>>>>>>>> code open source.
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>>> I think there is a better alternative than introducing an abstract
>> >>>>>>>>>>>>>>>>>>>>>>>>>>> class for TableFunction (CachingTableFunction). As you know,
>> >>>>>>>>>>>>>>>>>>>>>>>>>>> TableFunction lives in the flink-table-common module, which provides
>> >>>>>>>>>>>>>>>>>>>>>>>>>>> only an API for working with tables – it's very convenient to import
>> >>>>>>>>>>>>>>>>>>>>>>>>>>> in connectors. In turn, CachingTableFunction contains logic for
>> >>>>>>>>>>>>>>>>>>>>>>>>>>> runtime execution, so this class and everything connected with it
>> >>>>>>>>>>>>>>>>>>>>>>>>>>> should be located in another module, probably flink-table-runtime.
>> >>>>>>>>>>>>>>>>>>>>>>>>>>> But that would require connectors to depend on a module that contains
>> >>>>>>>>>>>>>>>>>>>>>>>>>>> a lot of runtime logic, which doesn't sound good.
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>>> I suggest adding a new method 'getLookupConfig' to LookupTableSource
>> >>>>>>>>>>>>>>>>>>>>>>>>>>> or LookupRuntimeProvider to allow connectors to only pass
>> >>>>>>>>>>>>>>>>>>>>>>>>>>> configurations to the planner, so they won't depend on the runtime
>> >>>>>>>>>>>>>>>>>>>>>>>>>>> implementation. Based on these configs the planner will construct a
>> >>>>>>>>>>>>>>>>>>>>>>>>>>> lookup join operator with the corresponding runtime logic
>> >>>>>>>>>>>>>>>>>>>>>>>>>>> (ProcessFunctions in the flink-table-runtime module). The
>> >>>>>>>>>>>>>>>>>>>>>>>>>>> architecture looks like the pinned image (the LookupConfig class
>> >>>>>>>>>>>>>>>>>>>>>>>>>>> there is actually your CacheConfig).
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>>> The classes in flink-table-planner that will be responsible for this
>> >>>>>>>>>>>>>>>>>>>>>>>>>>> are CommonPhysicalLookupJoin and its inheritors.
>> >>>>>>>>>>>>>>>>>>>>>>>>>>> The current classes for lookup join in flink-table-runtime are
>> >>>>>>>>>>>>>>>>>>>>>>>>>>> LookupJoinRunner, AsyncLookupJoinRunner, LookupJoinRunnerWithCalc and
>> >>>>>>>>>>>>>>>>>>>>>>>>>>> AsyncLookupJoinRunnerWithCalc.
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>>> I suggest adding classes LookupJoinCachingRunner,
>> >>>>>>>>>>>>>>>>>>>>>>>>>>> LookupJoinCachingRunnerWithCalc, etc.
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>>> And here comes another, more powerful advantage of such a solution.
>> >>>>>>>>>>>>>>>>>>>>>>>>>>> If we have the caching logic at a lower level, we can apply some
>> >>>>>>>>>>>>>>>>>>>>>>>>>>> optimizations to it. LookupJoinRunnerWithCalc was named like this
>> >>>>>>>>>>>>>>>>>>>>>>>>>>> because it uses the 'calc' function, which mostly consists of filters
>> >>>>>>>>>>>>>>>>>>>>>>>>>>> and projections.
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>>> For example, when joining table A with lookup table B with the
>> >>>>>>>>>>>>>>>>>>>>>>>>>>> condition 'JOIN … ON A.id = B.id AND A.age = B.age + 10 WHERE
>> >>>>>>>>>>>>>>>>>>>>>>>>>>> B.salary > 1000', the 'calc' function will contain the filters
>> >>>>>>>>>>>>>>>>>>>>>>>>>>> A.age = B.age + 10 and B.salary > 1000.
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>>> If we apply this function before storing records in the cache, the
>> >>>>>>>>>>>>>>>>>>>>>>>>>>> size of the cache will be significantly reduced: filters = avoid
>> >>>>>>>>>>>>>>>>>>>>>>>>>>> storing useless records in the cache, projections = reduce the
>> >>>>>>>>>>>>>>>>>>>>>>>>>>> records' size. So the initial max number of records in the cache can
>> >>>>>>>>>>>>>>>>>>>>>>>>>>> be increased by the user.
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>>> What do you think about it?
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>
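The cache-size optimization described above — running the calc function's filters before a row is admitted to the cache — can be sketched as follows (an illustration only; `FilteringLookupCache` is a hypothetical class, not part of Flink):

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Predicate;

/**
 * Illustrative LRU cache that applies the join's residual filter before
 * storing lookup results, so rows the join would discard anyway never
 * occupy cache slots.
 */
class FilteringLookupCache<K, V> {
    private final int maxRows;
    private final Predicate<V> calcFilter; // e.g. row -> row.salary > 1000
    private final LinkedHashMap<K, V> cache;

    FilteringLookupCache(int maxRows, Predicate<V> calcFilter) {
        this.maxRows = maxRows;
        this.calcFilter = calcFilter;
        // accessOrder=true + removeEldestEntry gives a simple LRU eviction
        this.cache = new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                return size() > FilteringLookupCache.this.maxRows;
            }
        };
    }

    /** Store a looked-up row only if it passes the residual filter. */
    void put(K key, V row) {
        if (calcFilter.test(row)) {
            cache.put(key, row);
        }
    }

    V get(K key) { return cache.get(key); }

    int size() { return cache.size(); }
}
```

A rejected row is simply not cached here, so the same key would be requested again on the next lookup; a real implementation would also have to decide whether to record such keys as known misses.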
>> >>>>>>>>>>>>>>>>>>>>>>>>>>> On 2022/04/19 02:47:11 Qingsheng Ren wrote:
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi devs,
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> Yuan and I would like to start a discussion about FLIP-221[1], which
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> introduces an abstraction of the lookup table cache and its standard
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> metrics.
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> Currently each lookup table source has to implement its own cache to
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> store lookup results, and there isn't a standard set of metrics for
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> users and developers to tune their jobs with lookup joins, which is
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> a quite common use case in Flink Table / SQL.
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> Therefore we propose some new APIs including cache, metrics, wrapper
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> classes of TableFunction and new table options. Please take a look
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> at the FLIP page [1] to get more details. Any suggestions and
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> comments would be appreciated!
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> [1]
>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-221+Abstraction+for+lookup+source+cache+and+metric
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> Best regards,
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> Qingsheng
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>> --
>> >>>>>>>>>>>>>>>>>>>>>>>> Best Regards,
>> >>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>> Qingsheng Ren
>> >>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>> Real-time Computing Team
>> >>>>>>>>>>>>>>>>>>>>>>>> Alibaba Cloud
>> >>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>> Email: renqschn@gmail.com
>> >>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>> --
>> >>>>>>>>>>>>>>>>>>> Best regards,
>> >>>>>>>>>>>>>>>>>>> Roman Boyko
>> >>>>>>>>>>>>>>>>>>> e.: ro.v.boyko@gmail.com
>> >>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> --
>> >>>>>>>>>>>> Best Regards,
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> Qingsheng Ren
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> Real-time Computing Team
>> >>>>>>>>>>>> Alibaba Cloud
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> Email: renqschn@gmail.com
>> >>>>>>>>>>
>> >>>>
>> >>>>
>> >>>
>> >>
>>
>>

Re: [DISCUSS] FLIP-221 Abstraction for lookup source cache and metric

Posted by Becket Qin <be...@gmail.com>.
Hi Qingsheng,

Thanks for updating the FLIP. A few comments / questions below:

1. Is there a reason that we have both "XXXFactory" and "XXXProvider"? What
is the difference between them? If they are the same, can we just use
XXXFactory everywhere?

2. Regarding the FullCachingLookupProvider, should the reloading policy
also be pluggable? Periodic reloading can sometimes be tricky in practice.
For example, if a user sets 24 hours as the cache refresh interval and the
nightly batch job feeding the table is delayed, the cache update may still
see the stale data.
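One way the reload policy could be made pluggable, as questioned above, is a small trigger abstraction in which periodic reloading is just one implementation; a smarter trigger (e.g. one waiting for a "data ready" marker written by the upstream batch job) could then avoid loading stale data. All names here are hypothetical sketches, not part of the FLIP:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Hypothetical pluggable trigger deciding when a full cache reload runs. */
interface CacheReloadTrigger {
    void open(Runnable reloadTask);
    void close();
}

/** Interval-based reloading as one trigger implementation. */
class PeriodicReloadTrigger implements CacheReloadTrigger {
    private final long intervalMillis;
    private ScheduledExecutorService executor;

    PeriodicReloadTrigger(long intervalMillis) {
        this.intervalMillis = intervalMillis;
    }

    @Override
    public void open(Runnable reloadTask) {
        executor = Executors.newSingleThreadScheduledExecutor();
        // First reload immediately, then repeat with a fixed delay between runs.
        executor.scheduleWithFixedDelay(reloadTask, 0, intervalMillis, TimeUnit.MILLISECONDS);
    }

    @Override
    public void close() {
        executor.shutdownNow();
    }
}
```

A delay-tolerant trigger (watching a marker file or a timestamp column) would implement the same interface, leaving the framework code unchanged.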

3. In DefaultLookupCacheFactory, it looks like InitialCapacity should be
removed.

4. The purpose of LookupFunctionProvider#cacheMissingKey() seems a little
confusing to me. If Optional<LookupCacheFactory> getCacheFactory() returns
a non-empty factory, doesn't that already indicate that the framework should
cache the missing keys? Also, why does this method return an
Optional<Boolean> instead of a boolean?

Thanks,

Jiangjie (Becket) Qin



On Wed, May 25, 2022 at 5:07 PM Qingsheng Ren <re...@gmail.com> wrote:

> Hi Lincoln and Jark,
>
> Thanks for the comments! If the community reaches a consensus that we use
> SQL hint instead of table options to decide whether to use sync or async
> mode, it’s indeed not necessary to introduce the “lookup.async” option.
>
> I think it's a good idea to let the decision on async mode be made at the
> query level, which could enable better optimization with more information
> gathered by the planner. Is there any FLIP describing the issue in
> FLINK-27625? I thought FLIP-234 was only proposing to add a SQL hint for
> retry on missing data, not to have the entire async mode controlled by a
> hint.
>
> Best regards,
>
> Qingsheng
>
> > On May 25, 2022, at 15:13, Lincoln Lee <li...@gmail.com> wrote:
> >
> > Hi Jark,
> >
> > Thanks for your reply!
> >
> > Currently 'lookup.async' only exists in the HBase connector. I have no
> > idea whether or when to remove it (we can discuss that in another issue
> > for the HBase connector after FLINK-27625 is done); for now, let's just
> > not add it as a common option.
> >
> > Best,
> > Lincoln Lee
> >
> >
> >> On Tue, 24 May 2022 at 20:14, Jark Wu <im...@gmail.com> wrote:
> >
> >> Hi Lincoln,
> >>
> >> I have taken a look at FLIP-234, and I agree with you that the
> connectors
> >> can
> >> provide both async and sync runtime providers simultaneously instead of
> one
> >> of them.
> >> At that point, "lookup.async" looks redundant. If this option is
> planned to
> >> be removed
> >> in the long term, I think it makes sense not to introduce it in this
> FLIP.
> >>
> >> Best,
> >> Jark
> >>
> >> On Tue, 24 May 2022 at 11:08, Lincoln Lee <li...@gmail.com>
> wrote:
> >>
> >>> Hi Qingsheng,
> >>>
> >>> Sorry for jumping into the discussion so late. It's a good idea that we
> >> can
> >>> have a common table option. I have a minor comments on  'lookup.async'
> >> that
> >>> not make it a common option:
> >>>
> >>> The table layer abstracts both sync and async lookup capabilities;
> >>> connector implementers can choose one or both. When only one capability
> >>> is implemented (the status of most existing built-in connectors),
> >>> 'lookup.async' will not be used. And when a connector has both
> >>> capabilities, I think this choice is better made at the query level: for
> >>> example, the table planner can choose the physical implementation of
> >>> async or sync lookup based on its cost model, or users can give a query
> >>> hint based on their own better understanding. If there is another common
> >>> table option 'lookup.async', it may confuse users in the long run.
> >>>
> >>> So, I prefer to leave the 'lookup.async' option in a private place (for
> >>> the current HBase connector) and not turn it into a common option.
> >>>
> >>> WDYT?
> >>>
> >>> Best,
> >>> Lincoln Lee
> >>>
> >>>
> >>>> On Mon, 23 May 2022 at 14:54, Qingsheng Ren <re...@gmail.com> wrote:
> >>>
> >>>> Hi Alexander,
> >>>>
> >>>> Thanks for the review! We recently updated the FLIP and you can find
> >>> those
> >>>> changes from my latest email. Since some terminologies has changed so
> >>> I’ll
> >>>> use the new concept for replying your comments.
> >>>>
> >>>> 1. Builder vs ‘of’
> >>>> I’m OK to use builder pattern if we have additional optional
> parameters
> >>>> for full caching mode (“rescan” previously). The schedule-with-delay
> >> idea
> >>>> looks reasonable to me, but I think we need to redesign the builder
> API
> >>> of
> >>>> full caching to make it more descriptive for developers. Would you
> mind
> >>>> sharing your ideas about the API? For accessing the FLIP workspace you
> >>> can
> >>>> just provide your account ID and ping any PMC member including Jark.
> >>>>
> >>>> 2. Common table options
> >>>> We have some discussions these days and propose to introduce 8 common
> >>>> table options about caching. It has been updated on the FLIP.
> >>>>
> >>>> 3. Retries
> >>>> I think we are on the same page :-)
> >>>>
> >>>> For your additional concerns:
> >>>> 1) The table option has been updated.
> >>>> 2) We got “lookup.cache” back for configuring whether to use partial
> or
> >>>> full caching mode.
> >>>>
> >>>> Best regards,
> >>>>
> >>>> Qingsheng
> >>>>
> >>>>
> >>>>
> >>>>> On May 19, 2022, at 17:25, Александр Смирнов <sm...@gmail.com>
> >>>> wrote:
> >>>>>
> >>>>> Also I have a few additions:
> >>>>> 1) maybe rename 'lookup.cache.maximum-size' to
> >>>>> 'lookup.cache.max-rows'? I think it will be more clear that we talk
> >>>>> not about bytes, but about the number of rows. Plus it fits more,
> >>>>> considering my optimization with filters.
> >>>>> 2) How will users enable rescanning? Are we going to separate caching
> >>>>> and rescanning from the options point of view? Like initially we had
> >>>>> one option 'lookup.cache' with values LRU / ALL. I think now we can
> >>>>> make a boolean option 'lookup.rescan'. RescanInterval can be
> >>>>> 'lookup.rescan.interval', etc.
> >>>>>
> >>>>> Best regards,
> >>>>> Alexander
> >>>>>
> >>>>> On Thu, 19 May 2022 at 14:50, Александр Смирнов <smiralexan@gmail.com> wrote:
> >>>>>>
> >>>>>> Hi Qingsheng and Jark,
> >>>>>>
> >>>>>> 1. Builders vs 'of'
> >>>>>> I understand that builders are used when we have multiple
> >> parameters.
> >>>>>> I suggested them because we could add parameters later. To prevent
> >>>>>> Builder for ScanRuntimeProvider from looking redundant I can suggest
> >>>>>> one more config now - "rescanStartTime".
> >>>>>> It's a time in UTC (LocalTime class) at which the first reload of the
> >>>>>> cache starts. This parameter can be thought of as the 'initialDelay'
> >>>>>> (the diff between the current time and rescanStartTime) in the method
> >>>>>> ScheduledExecutorService#scheduleWithFixedDelay [1]. It can be very
> >>>>>> useful when the dimension table is updated by some other scheduled job
> >>>>>> at a certain time. Or when the user simply wants the second scan (first
> >>>>>> cache reload) to be delayed. This option can be used even without
> >>>>>> 'rescanInterval' - in this case 'rescanInterval' will default to one day.
> >>>>>> If you are fine with this option, I would be very glad if you could
> >>>>>> give me access to edit the FLIP page, so I could add it myself.
> >>>>>>
> >>>>>> 2. Common table options
> >>>>>> I also think that FactoryUtil would be overloaded by all the cache
> >>>>>> options. But how about unifying all the suggested options, not only
> >>>>>> those for the default cache? I.e. a class 'LookupOptions' that unifies
> >>>>>> the default cache options, rescan options, 'async' and 'maxRetries'. WDYT?
> >>>>>>
> >>>>>> 3. Retries
> >>>>>> I'm fine with the suggested RetryUtils#tryTimes(times, call) approach.
> >>>>>>
> >>>>>> [1]
> >>>>
> >>>
> >>
> https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/ScheduledExecutorService.html#scheduleWithFixedDelay-java.lang.Runnable-long-long-java.util.concurrent.TimeUnit-
> >>>>>>
> >>>>>> Best regards,
> >>>>>> Alexander
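[Editor's note] The 'initialDelay' computation described in point 1 above (the diff between the current UTC time and rescanStartTime, fed into ScheduledExecutorService#scheduleWithFixedDelay) could be sketched roughly as follows. The helper class and method names are illustrative only and not part of the proposed FLIP API:

```java
import java.time.Duration;
import java.time.LocalTime;

// Hypothetical helper: computes how long to wait before the first cache
// reload so that it fires at rescanStartTime (UTC). If that time has
// already passed today, the delay rolls over to the same time tomorrow.
public class RescanSchedule {
    public static Duration initialDelay(LocalTime rescanStartTime, LocalTime nowUtc) {
        Duration delay = Duration.between(nowUtc, rescanStartTime);
        if (delay.isNegative()) {
            delay = delay.plusDays(1); // roll over to tomorrow
        }
        return delay;
    }
}
```

The resulting duration would then serve as the initialDelay argument of scheduleWithFixedDelay, with 'rescanInterval' (defaulting to one day) as the fixed delay between reloads.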
> >>>>>>
> >>>>>> Wed, May 18, 2022 at 16:04, Qingsheng Ren <re...@gmail.com>:
> >>>>>>>
> >>>>>>> Hi Jark and Alexander,
> >>>>>>>
> >>>>>>> Thanks for your comments! I’m also OK to introduce common table
> >>>> options. I prefer to introduce a new DefaultLookupCacheOptions class
> >> for
> >>>> holding these option definitions because putting all options into
> >>>> FactoryUtil would make it a bit "crowded" and not well categorized.
> >>>>>>>
> >>>>>>> FLIP has been updated according to suggestions above:
> >>>>>>> 1. Use static “of” method for constructing RescanRuntimeProvider
> >>>> considering both arguments are required.
> >>>>>>> 2. Introduce new table options matching DefaultLookupCacheFactory
> >>>>>>>
> >>>>>>> Best,
> >>>>>>> Qingsheng
> >>>>>>>
> >>>>>>> On Wed, May 18, 2022 at 2:57 PM Jark Wu <im...@gmail.com> wrote:
> >>>>>>>>
> >>>>>>>> Hi Alex,
> >>>>>>>>
> >>>>>>>> 1) retry logic
> >>>>>>>> I think we can extract some common retry logic into utilities,
> >> e.g.
> >>>> RetryUtils#tryTimes(times, call).
> >>>>>>>> This seems independent of this FLIP and can be reused by
> >> DataStream
> >>>> users.
> >>>>>>>> Maybe we can open an issue to discuss this and where to put it.
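[Editor's note] A utility along the lines of the RetryUtils#tryTimes(times, call) mentioned above might look roughly like this. It is only a sketch: no such class exists in Flink at this point, and the signature is an assumption:

```java
import java.util.concurrent.Callable;

// Illustrative sketch of a generic retry helper (not an existing Flink API).
public final class RetryUtils {

    private RetryUtils() {}

    /** Invokes the callable up to {@code times} times, wrapping the last failure. */
    public static <T> T tryTimes(int times, Callable<T> call) {
        Exception last = null;
        for (int attempt = 1; attempt <= times; attempt++) {
            try {
                return call.call();
            } catch (Exception e) {
                last = e; // a connector could re-establish its connection here
            }
        }
        throw new RuntimeException("All " + times + " attempts failed", last);
    }
}
```

A connector's lookup() could then wrap its I/O call in tryTimes, keeping any connection re-establishment logic inside the catch path.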
> >>>>>>>>
> >>>>>>>> 2) cache ConfigOptions
> >>>>>>>> I'm fine with defining cache config options in the framework.
> >>>>>>>> A candidate place to put is FactoryUtil which also includes
> >>>> "sink.parallelism", "format" options.
> >>>>>>>>
> >>>>>>>> Best,
> >>>>>>>> Jark
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On Wed, 18 May 2022 at 13:52, Александр Смирнов <
> >>> smiralexan@gmail.com>
> >>>> wrote:
> >>>>>>>>>
> >>>>>>>>> Hi Qingsheng,
> >>>>>>>>>
> >>>>>>>>> Thank you for considering my comments.
> >>>>>>>>>
> >>>>>>>>>> there might be custom logic before making retry, such as
> >>>> re-establish the connection
> >>>>>>>>>
> >>>>>>>>> Yes, I understand that. I meant that such logic can be placed in a
> >>>>>>>>> separate function that can be implemented by connectors. Just moving
> >>>>>>>>> the retry logic would make connectors' LookupFunctions more concise
> >>>>>>>>> and avoid duplicated code. However, it's a minor change. The decision
> >>>>>>>>> is up to you.
> >>>>>>>>>
> >>>>>>>>>> We decided not to provide common DDL options and let developers
> >>>>>>>>>> define their own options per connector, as we do now.
> >>>>>>>>>
> >>>>>>>>> What is the reason for that? One of the main goals of this FLIP was
> >>>>>>>>> to unify the configs, wasn't it? I understand that the current cache
> >>>>>>>>> design doesn't depend on ConfigOptions, like before. But we can still
> >>>>>>>>> put these options into the framework, so connectors can reuse them,
> >>>>>>>>> avoid code duplication, and, more significantly, avoid possibly
> >>>>>>>>> inconsistent option naming. This can be pointed out in the
> >>>>>>>>> documentation for connector developers.
> >>>>>>>>>
> >>>>>>>>> Best regards,
> >>>>>>>>> Alexander
> >>>>>>>>>
> >>>>>>>>> Tue, May 17, 2022 at 17:11, Qingsheng Ren <re...@gmail.com>:
> >>>>>>>>>>
> >>>>>>>>>> Hi Alexander,
> >>>>>>>>>>
> >>>>>>>>>> Thanks for the review and glad to see we are on the same page! I
> >>>> think you forgot to cc the dev mailing list so I’m also quoting your
> >>> reply
> >>>> under this email.
> >>>>>>>>>>
> >>>>>>>>>>> We can add a 'maxRetryTimes' option to this class
> >>>>>>>>>>
> >>>>>>>>>> In my opinion the retry logic should be implemented in lookup()
> >>>> instead of in LookupFunction#eval(). Retrying is only meaningful under
> >>> some
> >>>> specific retriable failures, and there might be custom logic before
> >>> making
> >>>> retry, such as re-establish the connection (JdbcRowDataLookupFunction
> >> is
> >>> an
> >>>> example), so it's more handy to leave it to the connector.
> >>>>>>>>>>
> >>>>>>>>>>> I don't see the DDL options that were in the previous version of
> >>>>>>>>>>> the FLIP. Do you have any special plans for them?
> >>>>>>>>>>
> >>>>>>>>>> We decided not to provide common DDL options and let developers
> >>>>>>>>>> define their own options per connector, as we do now.
> >>>>>>>>>>
> >>>>>>>>>> The rest of comments sound great and I’ll update the FLIP. Hope
> >> we
> >>>> can finalize our proposal soon!
> >>>>>>>>>>
> >>>>>>>>>> Best,
> >>>>>>>>>>
> >>>>>>>>>> Qingsheng
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>> On May 17, 2022, at 13:46, Александр Смирнов <
> >>> smiralexan@gmail.com>
> >>>> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>> Hi Qingsheng and devs!
> >>>>>>>>>>>
> >>>>>>>>>>> I like the overall design of updated FLIP, however I have
> >> several
> >>>>>>>>>>> suggestions and questions.
> >>>>>>>>>>>
> >>>>>>>>>>> 1) Introducing LookupFunction as a subclass of TableFunction
> >> is a
> >>>> good
> >>>>>>>>>>> idea. We can add a 'maxRetryTimes' option to this class. The 'eval'
> >>>>>>>>>>> method of the new LookupFunction is great for this purpose. The same
> >>>>>>>>>>> goes for the 'async' case.
> >>>>>>>>>>>
> >>>>>>>>>>> 2) There might be other configs in future, such as
> >>>> 'cacheMissingKey'
> >>>>>>>>>>> in LookupFunctionProvider or 'rescanInterval' in
> >>>> ScanRuntimeProvider.
> >>>>>>>>>>> Maybe use Builder pattern in LookupFunctionProvider and
> >>>>>>>>>>> RescanRuntimeProvider for more flexibility (use one 'build'
> >>> method
> >>>>>>>>>>> instead of many 'of' methods in future)?
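[Editor's note] The builder shape suggested in point 2 could look roughly like the sketch below. The class name matches the FLIP's LookupFunctionProvider, but the builder and option names ('cacheMissingKey' etc.) are illustrative assumptions, not the final API:

```java
// Illustrative builder sketch for a provider with optional parameters;
// the option names and defaults are hypothetical.
public class LookupFunctionProvider {
    private final boolean cacheMissingKey;

    private LookupFunctionProvider(boolean cacheMissingKey) {
        this.cacheMissingKey = cacheMissingKey;
    }

    public boolean isCacheMissingKey() {
        return cacheMissingKey;
    }

    public static Builder newBuilder() {
        return new Builder();
    }

    public static class Builder {
        private boolean cacheMissingKey = true; // hypothetical default

        public Builder cacheMissingKey(boolean cache) {
            this.cacheMissingKey = cache;
            return this;
        }

        public LookupFunctionProvider build() {
            return new LookupFunctionProvider(cacheMissingKey);
        }
    }
}
```

New optional parameters could then be added to the Builder later without breaking the single build() entry point, which is the flexibility argument made above.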
> >>>>>>>>>>>
> >>>>>>>>>>> 3) What are the plans for existing TableFunctionProvider and
> >>>>>>>>>>> AsyncTableFunctionProvider? I think they should be deprecated.
> >>>>>>>>>>>
> >>>>>>>>>>> 4) Am I right that the current design does not assume usage of a
> >>>>>>>>>>> user-provided LookupCache in re-scanning? In that case, it is not
> >>>>>>>>>>> very clear why we need methods such as 'invalidate' or 'putAll' in
> >>>>>>>>>>> LookupCache.
> >>>>>>>>>>>
> >>>>>>>>>>> 5) I don't see the DDL options that were in the previous version
> >>>>>>>>>>> of the FLIP. Do you have any special plans for them?
> >>>>>>>>>>>
> >>>>>>>>>>> If you don't mind, I would be glad to be able to make small
> >>>>>>>>>>> adjustments to the FLIP document too. I think it's worth mentioning
> >>>>>>>>>>> what optimizations exactly are planned for the future.
> >>>>>>>>>>>
> >>>>>>>>>>> Best regards,
> >>>>>>>>>>> Smirnov Alexander
> >>>>>>>>>>>
> >>>>>>>>>>> Fri, May 13, 2022 at 20:27, Qingsheng Ren <renqschn@gmail.com
> >>> :
> >>>>>>>>>>>>
> >>>>>>>>>>>> Hi Alexander and devs,
> >>>>>>>>>>>>
> >>>>>>>>>>>> Thank you very much for the in-depth discussion! As Jark
> >>>> mentioned we were inspired by Alexander's idea and made a refactor on
> >> our
> >>>> design. FLIP-221 [1] has been updated to reflect our design now and we
> >>> are
> >>>> happy to hear more suggestions from you!
> >>>>>>>>>>>>
> >>>>>>>>>>>> Compared to the previous design:
> >>>>>>>>>>>> 1. The lookup cache serves at table runtime level and is
> >>>> integrated as a component of LookupJoinRunner as discussed previously.
> >>>>>>>>>>>> 2. Interfaces are renamed and re-designed to reflect the new
> >>>> design.
> >>>>>>>>>>>> 3. We handle the all-caching case separately and introduce a new
> >>>>>>>>>>>> RescanRuntimeProvider to reuse the scanning ability. We are
> >>>>>>>>>>>> planning to support SourceFunction / InputFormat for now,
> >>>>>>>>>>>> considering the complexity of the FLIP-27 Source API.
> >>>>>>>>>>>> 4. A new interface LookupFunction is introduced to make the
> >>>> semantic of lookup more straightforward for developers.
> >>>>>>>>>>>>
> >>>>>>>>>>>> For replying to Alexander:
> >>>>>>>>>>>>> However I'm a little confused whether InputFormat is
> >> deprecated
> >>>> or not. Am I right that it will be so in the future, but currently
> it's
> >>> not?
> >>>>>>>>>>>> Yes you are right. InputFormat is not deprecated for now. I
> >>> think
> >>>> it will be deprecated in the future but we don't have a clear plan for
> >>> that.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Thanks again for the discussion on this FLIP and looking
> >> forward
> >>>> to cooperating with you after we finalize the design and interfaces!
> >>>>>>>>>>>>
> >>>>>>>>>>>> [1]
> >>>>
> >>>
> >>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-221+Abstraction+for+lookup+source+cache+and+metric
> >>>>>>>>>>>>
> >>>>>>>>>>>> Best regards,
> >>>>>>>>>>>>
> >>>>>>>>>>>> Qingsheng
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Fri, May 13, 2022 at 12:12 AM Александр Смирнов <
> >>>> smiralexan@gmail.com> wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Hi Jark, Qingsheng and Leonard!
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Glad to see that we came to a consensus on almost all points!
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> However I'm a little confused whether InputFormat is
> >> deprecated
> >>>> or
> >>>>>>>>>>>>> not. Am I right that it will be so in the future, but
> >> currently
> >>>> it's
> >>>>>>>>>>>>> not? Actually I also think that for the first version it's OK to
> >>>>>>>>>>>>> use InputFormat in the ALL cache implementation, because
> >>>>>>>>>>>>> supporting the rescan ability seems like a very distant prospect.
> >>>>>>>>>>>>> But for this decision we need a consensus among all discussion
> >>>>>>>>>>>>> participants.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> In general, I don't have anything to argue with in your
> >>>>>>>>>>>>> statements. All of them correspond to my ideas. Looking ahead, it
> >>>>>>>>>>>>> would be nice to work on this FLIP cooperatively. I've already
> >>>>>>>>>>>>> done a lot of work on lookup join caching with an implementation
> >>>>>>>>>>>>> very close to the one we are discussing, and want to share the
> >>>>>>>>>>>>> results of this work. Anyway, looking forward to the FLIP update!
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Best regards,
> >>>>>>>>>>>>> Smirnov Alexander
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Thu, May 12, 2022 at 17:38, Jark Wu <im...@gmail.com>:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Hi Alex,
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Thanks for summarizing your points.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> In the past week, Qingsheng, Leonard, and I have discussed
> >> it
> >>>> several times
> >>>>>>>>>>>>>> and we have totally refactored the design.
> >>>>>>>>>>>>>> I'm glad to say we have reached a consensus on many of your
> >>>> points!
> >>>>>>>>>>>>>> Qingsheng is still working on updating the design docs and
> >>>> maybe can be
> >>>>>>>>>>>>>> available in the next few days.
> >>>>>>>>>>>>>> I will share some conclusions from our discussions:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> 1) we have refactored the design towards the "cache in
> >>>>>>>>>>>>>> framework" way.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> 2) a "LookupCache" interface for users to customize, and a
> >>>>>>>>>>>>>> default implementation with a builder for ease of use.
> >>>>>>>>>>>>>> This makes it possible to have both flexibility and conciseness.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> 3) Filter pushdown is important for ALL and LRU lookup
> >> cache,
> >>>> esp reducing
> >>>>>>>>>>>>>> IO.
> >>>>>>>>>>>>>> Filter pushdown should be the final state and the unified
> >> way
> >>>> to both
> >>>>>>>>>>>>>> support pruning ALL cache and LRU cache,
> >>>>>>>>>>>>>> so I think we should make effort in this direction. If we
> >> need
> >>>> to support
> >>>>>>>>>>>>>> filter pushdown for ALL cache anyway, why not use
> >>>>>>>>>>>>>> it for LRU cache as well? Either way, as we decide to
> >>> implement
> >>>> the cache
> >>>>>>>>>>>>>> in the framework, we have the chance to support
> >>>>>>>>>>>>>> filter on cache anytime. This is an optimization and it
> >>> doesn't
> >>>> affect the
> >>>>>>>>>>>>>> public API. I think we can create a JIRA issue to
> >>>>>>>>>>>>>> discuss it when the FLIP is accepted.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> 4) The idea to support ALL cache is similar to your
> >> proposal.
> >>>>>>>>>>>>>> In the first version, we will only support InputFormat,
> >>>> SourceFunction for
> >>>>>>>>>>>>>> cache all (invoke InputFormat in join operator).
> >>>>>>>>>>>>>> For FLIP-27 source, we need to join a true source operator
> >>>> instead of
> >>>>>>>>>>>>>> calling it embedded in the join operator.
> >>>>>>>>>>>>>> However, this needs another FLIP to support the re-scan
> >>> ability
> >>>> for FLIP-27
> >>>>>>>>>>>>>> Source, and this can be a large work.
> >>>>>>>>>>>>>> In order to not block this issue, we can put the effort of
> >>>> FLIP-27 source
> >>>>>>>>>>>>>> integration into future work and integrate
> >>>>>>>>>>>>>> InputFormat&SourceFunction for now.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> I think it's fine to use InputFormat&SourceFunction, as they
> >>>> are not
> >>>>>>>>>>>>>> deprecated, otherwise, we have to introduce another function
> >>>>>>>>>>>>>> similar to them which is meaningless. We need to plan
> >> FLIP-27
> >>>> source
> >>>>>>>>>>>>>> integration ASAP before InputFormat & SourceFunction are
> >>>> deprecated.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Best,
> >>>>>>>>>>>>>> Jark
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> On Thu, 12 May 2022 at 15:46, Александр Смирнов <
> >>>> smiralexan@gmail.com>
> >>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Hi Martijn!
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Got it. Therefore, the realization with InputFormat is not
> >>>> considered.
> >>>>>>>>>>>>>>> Thanks for clearing that up!
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Best regards,
> >>>>>>>>>>>>>>> Smirnov Alexander
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Thu, May 12, 2022 at 14:23, Martijn Visser <
> >>>> martijn@ververica.com>:
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Hi,
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> With regards to:
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> But if there are plans to refactor all connectors to
> >>> FLIP-27
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Yes, FLIP-27 is the target for all connectors. The old
> >>>> interfaces will be
> >>>>>>>>>>>>>>>> deprecated and connectors will either be refactored to use
> >>>> the new ones
> >>>>>>>>>>>>>>> or
> >>>>>>>>>>>>>>>> dropped.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> The caching should work for connectors that are using
> >>> FLIP-27
> >>>> interfaces,
> >>>>>>>>>>>>>>>> we should not introduce new features for old interfaces.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Best regards,
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Martijn
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> On Thu, 12 May 2022 at 06:19, Александр Смирнов <
> >>>> smiralexan@gmail.com>
> >>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Hi Jark!
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Sorry for the late response. I would like to make some
> >>>> comments and
> >>>>>>>>>>>>>>>>> clarify my points.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> 1) I agree with your first statement. I think we can
> >>> achieve
> >>>> both
> >>>>>>>>>>>>>>>>> advantages this way: put the Cache interface in
> >>>> flink-table-common,
> >>>>>>>>>>>>>>>>> but have implementations of it in flink-table-runtime.
> >>>> Therefore if a
> >>>>>>>>>>>>>>>>> connector developer wants to use existing cache
> >> strategies
> >>>> and their
> >>>>>>>>>>>>>>>>> implementations, he can just pass lookupConfig to the
> >>>> planner, but if
> >>>>>>>>>>>>>>>>> he wants to have its own cache implementation in his
> >>>> TableFunction, it
> >>>>>>>>>>>>>>>>> will be possible for him to use the existing interface
> >> for
> >>>> this
> >>>>>>>>>>>>>>>>> purpose (we can explicitly point this out in the
> >>>> documentation). In
> >>>>>>>>>>>>>>>>> this way all configs and metrics will be unified. WDYT?
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> If a filter can prune 90% of data in the cache, we will
> >>>> have 90% of
> >>>>>>>>>>>>>>>>> lookup requests that can never be cached
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> 2) Let me clarify the logic of the filters optimization in
> >>>>>>>>>>>>>>>>> the case of the LRU cache.
> >>>>>>>>>>>>>>>>> It looks like Cache<RowData, Collection<RowData>>. Here
> >> we
> >>>> always
> >>>>>>>>>>>>>>>>> store the response of the dimension table in cache, even
> >>>> after
> >>>>>>>>>>>>>>>>> applying calc function. I.e. if there are no rows after
> >>>> applying
> >>>>>>>>>>>>>>>>> filters to the result of the 'eval' method of
> >>> TableFunction,
> >>>> we store
> >>>>>>>>>>>>>>>>> the empty list by lookup keys. Therefore the cache line
> >>> will
> >>>> be
> >>>>>>>>>>>>>>>>> filled, but will require much less memory (in bytes).
> >> I.e.
> >>>> we don't
> >>>>>>>>>>>>>>>>> completely filter keys, by which result was pruned, but
> >>>> significantly
> >>>>>>>>>>>>>>>>> reduce required memory to store this result. If the user
> >>>> knows about
> >>>>>>>>>>>>>>>>> this behavior, he can increase the 'max-rows' option
> >> before
> >>>> the start
> >>>>>>>>>>>>>>>>> of the job. But actually I came up with the idea that we
> >>> can
> >>>> do this
> >>>>>>>>>>>>>>>>> automatically by using the 'maximumWeight' and 'weigher'
> >>>> methods of
> >>>>>>>>>>>>>>>>> GuavaCache [1]. The weight can be the size of the collection
> >>>>>>>>>>>>>>>>> of rows (the cache value). Therefore the cache can
> >>>>>>>>>>>>>>>>> automatically fit many more records than before.
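[Editor's note] The maximumWeight/weigher behavior described above - bounding the cache by the total number of cached rows rather than by key count, so that empty (pruned) lookup results stay almost free - can be illustrated without Guava by a toy row-weighted LRU map. This is only a sketch of the eviction semantics, not the proposed implementation (which would rely on Guava's CacheBuilder#weigher directly):

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy row-weighted LRU cache: the bound is the total number of cached rows
// across all values, not the number of keys; an empty lookup result is
// cached with a minimal weight of 1.
public class RowWeightedCache<K, V> {
    private final long maxRows;
    private long currentRows = 0;
    private final LinkedHashMap<K, List<V>> map =
            new LinkedHashMap<>(16, 0.75f, true); // access-order for LRU

    public RowWeightedCache(long maxRows) {
        this.maxRows = maxRows;
    }

    public void put(K key, List<V> rows) {
        List<V> old = map.remove(key);
        if (old != null) {
            currentRows -= weight(old);
        }
        map.put(key, rows);
        currentRows += weight(rows);
        evictIfNeeded();
    }

    public List<V> get(K key) {
        return map.get(key);
    }

    public int size() {
        return map.size();
    }

    private long weight(List<V> rows) {
        return Math.max(1, rows.size());
    }

    // Evicts least-recently-used entries until the total row weight fits.
    private void evictIfNeeded() {
        Iterator<Map.Entry<K, List<V>>> it = map.entrySet().iterator();
        while (currentRows > maxRows && it.hasNext()) {
            Map.Entry<K, List<V>> eldest = it.next();
            currentRows -= weight(eldest.getValue());
            it.remove();
        }
    }
}
```

With a bound of, say, 3 rows, caching one empty result and two two-row results evicts only the least-recently-used heavy entry, so the cached empty results survive cheaply - the effect Alexander describes.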
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Flink SQL has provided a standard way to do filters and
> >>>> projects
> >>>>>>>>>>>>>>>>> pushdown, i.e., SupportsFilterPushDown and
> >>>> SupportsProjectionPushDown.
> >>>>>>>>>>>>>>>>>> Jdbc/hive/HBase haven't implemented the interfaces; that
> >>>>>>>>>>>>>>>>>> doesn't mean it's hard to implement.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> It's debatable how difficult it will be to implement
> >> filter
> >>>> pushdown.
> >>>>>>>>>>>>>>>>> But I think the fact that currently there is no database
> >>>> connector
> >>>>>>>>>>>>>>>>> with filter pushdown at least means that this feature
> >> won't
> >>>> be
> >>>>>>>>>>>>>>>>> supported soon in connectors. Moreover, if we talk about
> >>>> other
> >>>>>>>>>>>>>>>>> connectors (not in Flink repo), their databases might not
> >>>> support all
> >>>>>>>>>>>>>>>>> Flink filters (or not support filters at all). I think
> >>> users
> >>>> are
> >>>>>>>>>>>>>>>>> interested in supporting cache filters optimization
> >>>> independently of
> >>>>>>>>>>>>>>>>> supporting other features and solving more complex
> >> problems
> >>>> (or
> >>>>>>>>>>>>>>>>> unsolvable at all).
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> 3) I agree with your third statement. Actually in our
> >>>> internal version
> >>>>>>>>>>>>>>>>> I also tried to unify the logic of scanning and reloading
> >>>> data from
> >>>>>>>>>>>>>>>>> connectors. But unfortunately, I didn't find a way to
> >> unify
> >>>> the logic
> >>>>>>>>>>>>>>>>> of all ScanRuntimeProviders (InputFormat, SourceFunction,
> >>>> Source,...)
> >>>>>>>>>>>>>>>>> and reuse it in reloading ALL cache. As a result I
> >> settled
> >>>> on using
> >>>>>>>>>>>>>>>>> InputFormat, because it was used for scanning in all
> >> lookup
> >>>>>>>>>>>>>>>>> connectors. (I didn't know that there are plans to
> >>> deprecate
> >>>>>>>>>>>>>>>>> InputFormat in favor of FLIP-27 Source). IMO usage of
> >>>> FLIP-27 source
> >>>>>>>>>>>>>>>>> in ALL caching is not a good idea, because this source was
> >>>> designed to
> >>>>>>>>>>>>>>>>> work in distributed environment (SplitEnumerator on
> >>>> JobManager and
> >>>>>>>>>>>>>>>>> SourceReaders on TaskManagers), not in one operator
> >> (lookup
> >>>> join
> >>>>>>>>>>>>>>>>> operator in our case). There is even no direct way to
> >> pass
> >>>> splits from
> >>>>>>>>>>>>>>>>> SplitEnumerator to SourceReader (this logic works through
> >>>>>>>>>>>>>>>>> SplitEnumeratorContext, which requires
> >>>>>>>>>>>>>>>>> OperatorCoordinator.SubtaskGateway to send
> >> AddSplitEvents).
> >>>> Usage of
> >>>>>>>>>>>>>>>>> InputFormat for ALL cache seems much more clearer and
> >>>> easier. But if
> >>>>>>>>>>>>>>>>> there are plans to refactor all connectors to FLIP-27, I
> >>>> have the
> >>>>>>>>>>>>>>>>> following ideas: maybe we can drop the lookup join ALL cache
> >>>>>>>>>>>>>>>>> in favor of a simple join with multiple scans of the batch
> >>>>>>>>>>>>>>>>> source?
> >>>> The point
> >>>>>>>>>>>>>>>>> is that the only difference between lookup join ALL cache
> >>>> and simple
> >>>>>>>>>>>>>>>>> join with batch source is that in the first case scanning
> >>> is
> >>>> performed
> >>>>>>>>>>>>>>>>> multiple times, in between which state (cache) is cleared
> >>>> (correct me
> >>>>>>>>>>>>>>>>> if I'm wrong). So what if we extend the functionality of
> >>>> simple join
> >>>>>>>>>>>>>>>>> to support state reloading + extend the functionality of
> >>>> scanning
> >>>>>>>>>>>>>>>>> batch source multiple times (this one should be easy with
> >>>> new FLIP-27
> >>>>>>>>>>>>>>>>> source, that unifies streaming/batch reading - we will
> >> need
> >>>> to change
> >>>>>>>>>>>>>>>>> only SplitEnumerator, which will pass splits again after
> >>>> some TTL).
> >>>>>>>>>>>>>>>>> WDYT? I must say that this looks like a long-term goal
> >> and
> >>>> will make
> >>>>>>>>>>>>>>>>> the scope of this FLIP even larger than you said. Maybe
> >> we
> >>>> can limit
> >>>>>>>>>>>>>>>>> ourselves to a simpler solution now (InputFormats).
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> So to sum up, my points is like this:
> >>>>>>>>>>>>>>>>> 1) There is a way to make both concise and flexible
> >>>> interfaces for
> >>>>>>>>>>>>>>>>> caching in lookup join.
> >>>>>>>>>>>>>>>>> 2) Cache filters optimization is important both in LRU
> >> and
> >>>> ALL caches.
> >>>>>>>>>>>>>>>>> 3) It is unclear when filter pushdown will be supported
> >> in
> >>>> Flink
> >>>>>>>>>>>>>>>>> connectors, some of the connectors might not have the
> >>>> opportunity to
> >>>>>>>>>>>>>>>>> support filter pushdown + as I know, currently filter
> >>>> pushdown works
> >>>>>>>>>>>>>>>>> only for scanning (not lookup). So cache filters +
> >>>> projections
> >>>>>>>>>>>>>>>>> optimization should be independent from other features.
> >>>>>>>>>>>>>>>>> 4) The ALL cache implementation is a complex topic that
> >>>>>>>>>>>>>>>>> involves multiple aspects of how Flink is developing.
> >>>>>>>>>>>>>>>>> Abandoning InputFormat in favor of the FLIP-27 Source would
> >>>>>>>>>>>>>>>>> make the ALL cache implementation really complex and unclear,
> >>>>>>>>>>>>>>>>> so maybe instead we can extend the functionality of the simple
> >>>>>>>>>>>>>>>>> join, or keep InputFormat in the case of the lookup join ALL
> >>>>>>>>>>>>>>>>> cache?
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Best regards,
> >>>>>>>>>>>>>>>>> Smirnov Alexander
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> [1]
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>
> >>>
> >>
> https://guava.dev/releases/18.0/api/docs/com/google/common/cache/CacheBuilder.html#weigher(com.google.common.cache.Weigher)
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Thu, May 5, 2022 at 20:34, Jark Wu <im...@gmail.com>:
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> It's great to see the active discussion! I want to share
> >>> my
> >>>> ideas:
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> 1) implement the cache in framework vs. connectors base
> >>>>>>>>>>>>>>>>>> I don't have a strong opinion on this. Both ways should
> >>>> work (e.g.,
> >>>>>>>>>>>>>>> cache
> >>>>>>>>>>>>>>>>>> pruning, compatibility).
> >>>>>>>>>>>>>>>>>> The framework way can provide more concise interfaces.
> >>>>>>>>>>>>>>>>>> The connector base way can define more flexible cache
> >>>>>>>>>>>>>>>>>> strategies/implementations.
> >>>>>>>>>>>>>>>>>> We are still investigating a way to see if we can have
> >>> both
> >>>>>>>>>>>>>>> advantages.
> >>>>>>>>>>>>>>>>>> We should reach a consensus that the way should be a
> >> final
> >>>> state,
> >>>>>>>>>>>>>>> and we
> >>>>>>>>>>>>>>>>>> are on the path to it.
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> 2) filters and projections pushdown:
> >>>>>>>>>>>>>>>>>> I agree with Alex that the filter pushdown into cache
> >> can
> >>>> benefit a
> >>>>>>>>>>>>>>> lot
> >>>>>>>>>>>>>>>>> for
> >>>>>>>>>>>>>>>>>> ALL cache.
> >>>>>>>>>>>>>>>>>> However, this is not true for LRU cache. Connectors use
> >>>> cache to
> >>>>>>>>>>>>>>> reduce
> >>>>>>>>>>>>>>>>> IO
> >>>>>>>>>>>>>>>>>> requests to databases for better throughput.
> >>>>>>>>>>>>>>>>>> If a filter can prune 90% of data in the cache, we will
> >>>> have 90% of
> >>>>>>>>>>>>>>>>> lookup
> >>>>>>>>>>>>>>>>>> requests that can never be cached
> >>>>>>>>>>>>>>>>>> and hit directly to the databases. That means the cache
> >> is
> >>>>>>>>>>>>>>> meaningless in
> >>>>>>>>>>>>>>>>>> this case.
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> IMO, Flink SQL has provided a standard way to do filters
> >>>> and projects
> >>>>>>>>>>>>>>>>>> pushdown, i.e., SupportsFilterPushDown and
> >>>>>>>>>>>>>>> SupportsProjectionPushDown.
> >>>>>>>>>>>>>>>>>> Jdbc/hive/HBase haven't implemented the interfaces; that
> >>>>>>>>>>>>>>>>>> doesn't mean it's hard to implement.
> >>>>>>>>>>>>>>>>>> They should implement the pushdown interfaces to reduce
> >> IO
> >>>> and the
> >>>>>>>>>>>>>>> cache
> >>>>>>>>>>>>>>>>>> size.
> >>>>>>>>>>>>>>>>>> That should be a final state that the scan source and
> >>>> lookup source
> >>>>>>>>>>>>>>> share
> >>>>>>>>>>>>>>>>>> the exact pushdown implementation.
> >>>>>>>>>>>>>>>>>> I don't see why we need to duplicate the pushdown logic
> >> in
> >>>> caches,
> >>>>>>>>>>>>>>> which
> >>>>>>>>>>>>>>>>>> will complex the lookup join design.
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> 3) ALL cache abstraction
> >>>>>>>>>>>>>>>>>> All cache might be the most challenging part of this
> >> FLIP.
> >>>> We have
> >>>>>>>>>>>>>>> never
> >>>>>>>>>>>>>>>>>> provided a reload-lookup public interface.
> >>>>>>>>>>>>>>>>>> Currently, we put the reload logic in the "eval" method
> >> of
> >>>>>>>>>>>>>>> TableFunction.
> >>>>>>>>>>>>>>>>>> That's hard for some sources (e.g., Hive).
> >>>>>>>>>>>>>>>>>> Ideally, connector implementation should share the logic
> >>> of
> >>>> reload
> >>>>>>>>>>>>>>> and
> >>>>>>>>>>>>>>>>>> scan, i.e. ScanTableSource with
> >>>> InputFormat/SourceFunction/FLIP-27
> >>>>>>>>>>>>>>>>> Source.
> >>>>>>>>>>>>>>>>>> However, InputFormat/SourceFunction are deprecated, and
> >>> the
> >>>> FLIP-27
> >>>>>>>>>>>>>>>>> source
> >>>>>>>>>>>>>>>>>> is deeply coupled with SourceOperator.
> >>>>>>>>>>>>>>>>>> If we want to invoke the FLIP-27 source in LookupJoin,
> >>> this
> >>>> may make
> >>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>> scope of this FLIP much larger.
> >>>>>>>>>>>>>>>>>> We are still investigating how to abstract the ALL cache
> >>>> logic and
> >>>>>>>>>>>>>>> reuse
> >>>>>>>>>>>>>>>>>> the existing source interfaces.
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Best,
> >>>>>>>>>>>>>>>>>> Jark
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> On Thu, 5 May 2022 at 20:22, Roman Boyko <
> >>>> ro.v.boyko@gmail.com>
> >>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> It's a much more complicated activity and lies outside the
> >>>>>>>>>>>>>>>>>>> scope of this improvement, because such pushdowns should be
> >>>>>>>>>>>>>>>>>>> done for all ScanTableSource implementations (not only for
> >>>>>>>>>>>>>>>>>>> lookup ones).
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> On Thu, 5 May 2022 at 19:02, Martijn Visser <
> >>>>>>>>>>>>>>> martijnvisser@apache.org>
> >>>>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> Hi everyone,
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> One question regarding "And Alexander correctly
> >>> mentioned
> >>>> that
> >>>>>>>>>>>>>>> filter
> >>>>>>>>>>>>>>>>>>>> pushdown still is not implemented for
> >> jdbc/hive/hbase."
> >>>> -> Would
> >>>>>>>>>>>>>>> an
> >>>>>>>>>>>>>>>>>>>> alternative solution be to actually implement these
> >>> filter
> >>>>>>>>>>>>>>> pushdowns?
> >>>>>>>>>>>>>>>>> I
> >>>>>>>>>>>>>>>>>>>> can
> >>>>>>>>>>>>>>>>>>>> imagine that there are many more benefits to doing
> >> that,
> >>>> outside
> >>>>>>>>>>>>>>> of
> >>>>>>>>>>>>>>>>> lookup
> >>>>>>>>>>>>>>>>>>>> caching and metrics.
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> Best regards,
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> Martijn Visser
> >>>>>>>>>>>>>>>>>>>> https://twitter.com/MartijnVisser82
> >>>>>>>>>>>>>>>>>>>> https://github.com/MartijnVisser
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> On Thu, 5 May 2022 at 13:58, Roman Boyko <
> >>>> ro.v.boyko@gmail.com>
> >>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> Hi everyone!
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> Thanks for driving such a valuable improvement!
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> I do think that single cache implementation would be
> >> a
> >>>> nice
> >>>>>>>>>>>>>>>>> opportunity
> >>>>>>>>>>>>>>>>>>>> for
> >>>>>>>>>>>>>>>>>>>>> users. And it will break the "FOR SYSTEM_TIME AS OF
> >>>>>>>>>>>>>>>>>>>>> proc_time" semantics anyway - no matter how it is
> >>>>>>>>>>>>>>>>>>>>> implemented.
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> Putting myself in the user's shoes, I can say that:
> >>>>>>>>>>>>>>>>>>>>> 1) I would prefer to have the opportunity to cut down the
> >>>>>>>>>>>>>>>>>>>>> cache size by simply filtering out unnecessary data. And
> >>>>>>>>>>>>>>>>>>>>> the handiest way to do that is to apply it inside the
> >>>>>>>>>>>>>>>>>>>>> LookupRunners. It would be a bit harder to pass it through the
> >>>>>>>>>>>>>>>>>>>>> LookupJoin node to TableFunction. And Alexander
> >>> correctly
> >>>>>>>>>>>>>>> mentioned
> >>>>>>>>>>>>>>>>> that
> >>>>>>>>>>>>>>>>>>>>> filter pushdown still is not implemented for
> >>>> jdbc/hive/hbase.
> >>>>>>>>>>>>>>>>>>>>> 2) The ability to set the different caching
> >> parameters
> >>>> for
> >>>>>>>>>>>>>>> different
> >>>>>>>>>>>>>>>>>>>> tables
> >>>>>>>>>>>>>>>>>>>>> is quite important. So I would prefer to set it
> >> through
> >>>> DDL
> >>>>>>>>>>>>>>> rather
> >>>>>>>>>>>>>>>>> than
> >>>>>>>>>>>>>>>>>>>>> have similar ttl, strategy and other options for all
> >>>> lookup
> >>>>>>>>>>>>>>> tables.
> >>>>>>>>>>>>>>>>>>>>> 3) Providing the cache into the framework really
> >>>> deprives us of
> >>>>>>>>>>>>>>>>>>>>> extensibility (users won't be able to implement their
> >>> own
> >>>>>>>>>>>>>>> cache).
> >>>>>>>>>>>>>>>>> But
> >>>>>>>>>>>>>>>>>>>> most
> >>>>>>>>>>>>>>>>>>>>> probably it might be solved by creating more
> >> different
> >>>> cache
> >>>>>>>>>>>>>>>>> strategies
> >>>>>>>>>>>>>>>>>>>> and
> >>>>>>>>>>>>>>>>>>>>> a wider set of configurations.
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> All these points are much closer to the schema
> >> proposed
> >>>> by
> >>>>>>>>>>>>>>>>> Alexander.
> >>>>>>>>>>>>>>>>>>>>> Qingsheng Ren, please correct me if I'm wrong and
> >>> all
> >>>> these
> >>>>>>>>>>>>>>>>>>>> facilities
> >>>>>>>>>>>>>>>>>>>>> might be simply implemented in your architecture?
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> Best regards,
> >>>>>>>>>>>>>>>>>>>>> Roman Boyko
> >>>>>>>>>>>>>>>>>>>>> e.: ro.v.boyko@gmail.com
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> On Wed, 4 May 2022 at 21:01, Martijn Visser <
> >>>>>>>>>>>>>>>>> martijnvisser@apache.org>
> >>>>>>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> Hi everyone,
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> I don't have much to chip in, but just wanted to
> >>>> express that
> >>>>>>>>>>>>>>> I
> >>>>>>>>>>>>>>>>> really
> >>>>>>>>>>>>>>>>>>>>>> appreciate the in-depth discussion on this topic
> >> and I
> >>>> hope
> >>>>>>>>>>>>>>> that
> >>>>>>>>>>>>>>>>>>>> others
> >>>>>>>>>>>>>>>>>>>>>> will join the conversation.
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> Best regards,
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> Martijn
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> On Tue, 3 May 2022 at 10:15, Александр Смирнов <
> >>>>>>>>>>>>>>>>> smiralexan@gmail.com>
> >>>>>>>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> Hi Qingsheng, Leonard and Jark,
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> Thanks for your detailed feedback! However, I have
> >>>> questions
> >>>>>>>>>>>>>>>>> about
> >>>>>>>>>>>>>>>>>>>>>>> some of your statements (maybe I didn't get
> >>>> something?).
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> Caching actually breaks the semantic of "FOR
> >>>> SYSTEM_TIME
> >>>>>>>>>>>>>>> AS OF
> >>>>>>>>>>>>>>>>>>>>>> proc_time”
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> I agree that the semantics of "FOR SYSTEM_TIME AS
> >> OF
> >>>>>>>>>>>>>>> proc_time"
> >>>>>>>>>>>>>>>>> is
> >>>>>>>>>>>>>>>>>>>> not
> >>>>>>>>>>>>>>>>>>>>>>> fully implemented with caching, but as you said,
> >>> users
> >>>> go
> >>>>>>>>>>>>>>> on it
> >>>>>>>>>>>>>>>>>>>>>>> consciously to achieve better performance (no one
> >>>> proposed
> >>>>>>>>>>>>>>> to
> >>>>>>>>>>>>>>>>> enable
> >>>>>>>>>>>>>>>>>>>>>>> caching by default, etc.). Or by users do you mean
> >>>> other
> >>>>>>>>>>>>>>>>> developers
> >>>>>>>>>>>>>>>>>>>> of
> >>>>>>>>>>>>>>>>>>>>>>> connectors? In this case developers explicitly
> >>> specify
> >>>>>>>>>>>>>>> whether
> >>>>>>>>>>>>>>>>> their
> >>>>>>>>>>>>>>>>>>>>>>> connector supports caching or not (in the list of
> >>>> supported
> >>>>>>>>>>>>>>>>>>>> options),
> >>>>>>>>>>>>>>>>>>>>>>> no one makes them do that if they don't want to. So
> >>>> what
> >>>>>>>>>>>>>>>>> exactly is
> >>>>>>>>>>>>>>>>>>>>>>> the difference between implementing caching in
> >>> modules
> >>>>>>>>>>>>>>>>>>>>>>> flink-table-runtime and in flink-table-common from
> >>> the
> >>>>>>>>>>>>>>>>> considered
> >>>>>>>>>>>>>>>>>>>>>>> point of view? How does it affect
> >>>> breaking/non-breaking
> >>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>>>>>> semantics of "FOR SYSTEM_TIME AS OF proc_time"?
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> confront a situation that allows table options in
> >>> DDL
> >>>> to
> >>>>>>>>>>>>>>>>> control
> >>>>>>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>>>>>> behavior of the framework, which has never happened
> >>>>>>>>>>>>>>> previously
> >>>>>>>>>>>>>>>>> and
> >>>>>>>>>>>>>>>>>>>>> should
> >>>>>>>>>>>>>>>>>>>>>>> be cautious
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> If we talk about main differences of semantics of
> >> DDL
> >>>>>>>>>>>>>>> options
> >>>>>>>>>>>>>>>>> and
> >>>>>>>>>>>>>>>>>>>>>>> config options("table.exec.xxx"), isn't it about
> >>>> limiting
> >>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>> scope
> >>>>>>>>>>>>>>>>>>>> of
> >>>>>>>>>>>>>>>>>>>>>>> the options + importance for the user business
> >> logic
> >>>> rather
> >>>>>>>>>>>>>>> than
> >>>>>>>>>>>>>>>>>>>>>>> specific location of corresponding logic in the
> >>>> framework? I
> >>>>>>>>>>>>>>>>> mean
> >>>>>>>>>>>>>>>>>>>> that
> >>>>>>>>>>>>>>>>>>>>>>> in my design, for example, putting an option with
> >>>> lookup
> >>>>>>>>>>>>>>> cache
> >>>>>>>>>>>>>>>>>>>>>>> strategy in configurations would be the wrong
> >>>> decision,
> >>>>>>>>>>>>>>>>> because it
> >>>>>>>>>>>>>>>>>>>>>>> directly affects the user's business logic (not
> >> just
> >>>>>>>>>>>>>>> performance
> >>>>>>>>>>>>>>>>>>>>>>> optimization) + touches just several functions of
> >> ONE
> >>>> table
> >>>>>>>>>>>>>>>>> (there
> >>>>>>>>>>>>>>>>>>>> can
> >>>>>>>>>>>>>>>>>>>>>>> be multiple tables with different caches). Does it
> >>>> really
> >>>>>>>>>>>>>>>>> matter for
> >>>>>>>>>>>>>>>>>>>>>>> the user (or someone else) where the logic is
> >>> located,
> >>>>>>>>>>>>>>> which is
> >>>>>>>>>>>>>>>>>>>>>>> affected by the applied option?
> >>>>>>>>>>>>>>>>>>>>>>> Also I can remember DDL option 'sink.parallelism',
> >>>> which in
> >>>>>>>>>>>>>>>>> some way
> >>>>>>>>>>>>>>>>>>>>>>> "controls the behavior of the framework" and I
> >> don't
> >>>> see any
> >>>>>>>>>>>>>>>>> problem
> >>>>>>>>>>>>>>>>>>>>>>> here.
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> introduce a new interface for this all-caching
> >>>> scenario
> >>>>>>>>>>>>>>> and
> >>>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>>>> design
> >>>>>>>>>>>>>>>>>>>>>>> would become more complex
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> This is a subject for a separate discussion, but
> >>>> actually
> >>>>>>>>>>>>>>> in our
> >>>>>>>>>>>>>>>>>>>>>>> internal version we solved this problem quite
> >> easily
> >>> -
> >>>> we
> >>>>>>>>>>>>>>> reused
> >>>>>>>>>>>>>>>>>>>>>>> InputFormat class (so there is no need for a new
> >>> API).
> >>>> The
> >>>>>>>>>>>>>>>>> point is
> >>>>>>>>>>>>>>>>>>>>>>> that currently all lookup connectors use
> >> InputFormat
> >>>> for
> >>>>>>>>>>>>>>>>> scanning
> >>>>>>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>>>>>> data in batch mode: HBase, JDBC and even Hive - it
> >>> uses
> >>>>>>>>>>>>>>> class
> >>>>>>>>>>>>>>>>>>>>>>> PartitionReader, that is actually just a wrapper
> >>> around
> >>>>>>>>>>>>>>>>> InputFormat.
> >>>>>>>>>>>>>>>>>>>>>>> The advantage of this solution is the ability to
> >>> reload
> >>>>>>>>>>>>>>> cache
> >>>>>>>>>>>>>>>>> data
> >>>>>>>>>>>>>>>>>>>> in
> >>>>>>>>>>>>>>>>>>>>>>> parallel (the number of threads depends on the number of
> >>>>>>>>>>>>>>> InputSplits,
> >>>>>>>>>>>>>>>>> but
> >>>>>>>>>>>>>>>>>>>> has
> >>>>>>>>>>>>>>>>>>>>>>> an upper limit). As a result cache reload time
> >>>> significantly
> >>>>>>>>>>>>>>>>> reduces
> >>>>>>>>>>>>>>>>>>>>>>> (as well as time of input stream blocking). I know
> >>> that
> >>>>>>>>>>>>>>> usually
> >>>>>>>>>>>>>>>>> we
> >>>>>>>>>>>>>>>>>>>> try
> >>>>>>>>>>>>>>>>>>>>>>> to avoid usage of concurrency in Flink code, but
> >>> maybe
> >>>> this
> >>>>>>>>>>>>>>> one
> >>>>>>>>>>>>>>>>> can
> >>>>>>>>>>>>>>>>>>>> be
> >>>>>>>>>>>>>>>>>>>>>>> an exception. BTW I don't say that it's an ideal
> >>>> solution,
> >>>>>>>>>>>>>>> maybe
> >>>>>>>>>>>>>>>>>>>> there
> >>>>>>>>>>>>>>>>>>>>>>> are better ones.
> >>>>>>>>>>>>>>>>>>>>>>>
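The parallel "ALL" cache reload described above can be sketched in plain Java. The `SplitReader` interface and split ids below stand in for Flink's InputFormat/InputSplit pair; all names are illustrative, not the real connector API, and error handling is kept minimal:

```java
import java.util.*;
import java.util.concurrent.*;

// Sketch of reloading an "ALL" cache from several splits in parallel. The
// SplitReader interface stands in for Flink's InputFormat/InputSplit pair;
// every name here is illustrative, not a released Flink API.
public class ParallelCacheReload {

    /** Reads one split of the dimension table, e.g. one JDBC partition. */
    interface SplitReader {
        Map<String, String> readSplit(int splitId);
    }

    static Map<String, String> reload(SplitReader reader, int numSplits, int maxThreads)
            throws InterruptedException, ExecutionException {
        // Thread count follows the number of splits but has an upper limit,
        // as described in the thread.
        int threads = Math.max(1, Math.min(numSplits, maxThreads));
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            ConcurrentMap<String, String> cache = new ConcurrentHashMap<>();
            List<Future<?>> pending = new ArrayList<>();
            for (int i = 0; i < numSplits; i++) {
                final int splitId = i;
                pending.add(pool.submit(() -> cache.putAll(reader.readSplit(splitId))));
            }
            for (Future<?> f : pending) {
                f.get(); // surface any failure from a split before the cache is swapped in
            }
            return cache;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        SplitReader reader = splitId ->
                Collections.singletonMap("key-" + splitId, "value-" + splitId);
        System.out.println(reload(reader, 4, 2).size()); // 4
    }
}
```

Loading splits concurrently is what shortens the reload window and thus the time the input stream stays blocked.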
> >>>>>>>>>>>>>>>>>>>>>>>> Providing the cache in the framework might
> >> introduce
> >>>>>>>>>>>>>>>>> compatibility
> >>>>>>>>>>>>>>>>>>>>>> issues
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> It's possible only in cases when the developer of
> >> the
> >>>>>>>>>>>>>>> connector
> >>>>>>>>>>>>>>>>>>>> won't
> >>>>>>>>>>>>>>>>>>>>>>> properly refactor his code and will use new cache
> >>>> options
> >>>>>>>>>>>>>>>>>>>> incorrectly
> >>>>>>>>>>>>>>>>>>>>>>> (i.e. explicitly provide the same options into 2
> >>>> different
> >>>>>>>>>>>>>>> code
> >>>>>>>>>>>>>>>>>>>>>>> places). For correct behavior all he will need to
> >> do
> >>>> is to
> >>>>>>>>>>>>>>>>> redirect
> >>>>>>>>>>>>>>>>>>>>>>> existing options to the framework's LookupConfig (+
> >>>> maybe
> >>>>>>>>>>>>>>> add an
> >>>>>>>>>>>>>>>>>>>> alias
> >>>>>>>>>>>>>>>>>>>>>>> for options, if there was different naming),
> >>> everything
> >>>>>>>>>>>>>>> will be
> >>>>>>>>>>>>>>>>>>>>>>> transparent for users. If the developer won't do
> >>>>>>>>>>>>>>> refactoring at
> >>>>>>>>>>>>>>>>> all,
> >>>>>>>>>>>>>>>>>>>>>>> nothing will be changed for the connector because
> >> of
> >>>>>>>>>>>>>>> backward
> >>>>>>>>>>>>>>>>>>>>>>> compatibility. Also if a developer wants to use his
> >>> own
> >>>>>>>>>>>>>>> cache
> >>>>>>>>>>>>>>>>> logic,
> >>>>>>>>>>>>>>>>>>>>>>> he just can refuse to pass some of the configs into
> >>> the
> >>>>>>>>>>>>>>>>> framework,
> >>>>>>>>>>>>>>>>>>>> and
> >>>>>>>>>>>>>>>>>>>>>>> instead make his own implementation with already
> >>>> existing
> >>>>>>>>>>>>>>>>> configs
> >>>>>>>>>>>>>>>>>>>> and
> >>>>>>>>>>>>>>>>>>>>>>> metrics (but actually I think that it's a rare
> >> case).
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> filters and projections should be pushed all the
> >> way
> >>>> down
> >>>>>>>>>>>>>>> to
> >>>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>>>> table
> >>>>>>>>>>>>>>>>>>>>>>> function, like what we do in the scan source
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> It's the great purpose. But the truth is that the
> >>> ONLY
> >>>>>>>>>>>>>>> connector
> >>>>>>>>>>>>>>>>>>>> that
> >>>>>>>>>>>>>>>>>>>>>>> supports filter pushdown is FileSystemTableSource
> >>>>>>>>>>>>>>>>>>>>>>> (no database connector supports it currently). Also
> >>>> for some
> >>>>>>>>>>>>>>>>>>>> databases
> >>>>>>>>>>>>>>>>>>>>>>> it's simply impossible to push down such complex
> >>> filters
> >>>>>>>>>>>>>>> that we
> >>>>>>>>>>>>>>>>> have
> >>>>>>>>>>>>>>>>>>>>>>> in Flink.
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> only applying these optimizations to the cache
> >> seems
> >>>> not
> >>>>>>>>>>>>>>>>> quite
> >>>>>>>>>>>>>>>>>>>>> useful
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> Filters can cut off an arbitrarily large amount of
> >>> data
> >>>>>>>>>>>>>>> from the
> >>>>>>>>>>>>>>>>>>>>>>> dimension table. For a simple example, suppose in
> >>>> dimension
> >>>>>>>>>>>>>>>>> table
> >>>>>>>>>>>>>>>>>>>>>>> 'users'
> >>>>>>>>>>>>>>>>>>>>>>> we have column 'age' with values from 20 to 40, and
> >>>> input
> >>>>>>>>>>>>>>> stream
> >>>>>>>>>>>>>>>>>>>>>>> 'clicks' that is ~uniformly distributed by age of
> >>>> users. If
> >>>>>>>>>>>>>>> we
> >>>>>>>>>>>>>>>>> have
> >>>>>>>>>>>>>>>>>>>>>>> filter 'age > 30',
> >>>>>>>>>>>>>>>>>>>>>>> there will be half as much data in cache. This means
> >>> the
> >>>> user
> >>>>>>>>>>>>>>> can
> >>>>>>>>>>>>>>>>>>>>>>> increase 'lookup.cache.max-rows' by almost 2 times.
> >>> It
> >>>> will
> >>>>>>>>>>>>>>>>> gain a
> >>>>>>>>>>>>>>>>>>>>>>> huge
> >>>>>>>>>>>>>>>>>>>>>>> performance boost. Moreover, this optimization
> >> starts
> >>>> to
> >>>>>>>>>>>>>>> really
> >>>>>>>>>>>>>>>>>>>> shine
> >>>>>>>>>>>>>>>>>>>>>>> in 'ALL' cache, where tables without filters and
> >>>> projections
> >>>>>>>>>>>>>>>>> can't
> >>>>>>>>>>>>>>>>>>>> fit
> >>>>>>>>>>>>>>>>>>>>>>> in memory, but with them - can. This opens up
> >>>> additional
> >>>>>>>>>>>>>>>>>>>> possibilities
> >>>>>>>>>>>>>>>>>>>>>>> for users. And this doesn't sound as 'not quite
> >>>> useful'.
> >>>>>>>>>>>>>>>>>>>>>>>
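The sizing argument can be checked with a few lines of Java. The 20..40 age range and the age > 30 predicate are taken from the example above; the uniform distribution is its stated assumption:

```java
// Checks the sizing argument: with ages uniform over 20..40, the pushed
// filter age > 30 keeps roughly half of the dimension rows, so the same
// 'lookup.cache.max-rows' budget covers almost twice as many entries.
public class CacheSizing {

    static long survivors(int lo, int hi, int threshold) {
        long kept = 0;
        for (int age = lo; age <= hi; age++) {
            if (age > threshold) {
                kept++;
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        long total = 40 - 20 + 1;          // 21 distinct ages in the table
        long kept = survivors(20, 40, 30); // ages 31..40 -> 10 rows survive
        System.out.println(kept + " of " + total + " rows reach the cache");
    }
}
```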
> >>>>>>>>>>>>>>>>>>>>>>> It would be great to hear other voices regarding
> >> this
> >>>> topic!
> >>>>>>>>>>>>>>>>> Because
> >>>>>>>>>>>>>>>>>>>>>>> we have quite a lot of controversial points, and I
> >>>> think
> >>>>>>>>>>>>>>> with
> >>>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>>> help
> >>>>>>>>>>>>>>>>>>>>>>> of others it will be easier for us to come to a
> >>>> consensus.
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> Best regards,
> >>>>>>>>>>>>>>>>>>>>>>> Smirnov Alexander
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> Fri, 29 Apr 2022 at 22:33, Qingsheng Ren <
> >>>>>>>>>>>>>>> renqschn@gmail.com
> >>>>>>>>>>>>>>>>>> :
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> Hi Alexander and Arvid,
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> Thanks for the discussion and sorry for my late
> >>>> response!
> >>>>>>>>>>>>>>> We
> >>>>>>>>>>>>>>>>> had
> >>>>>>>>>>>>>>>>>>>> an
> >>>>>>>>>>>>>>>>>>>>>>> internal discussion together with Jark and Leonard
> >>> and
> >>>> I’d
> >>>>>>>>>>>>>>> like
> >>>>>>>>>>>>>>>>> to
> >>>>>>>>>>>>>>>>>>>>>>> summarize our ideas. Instead of implementing the
> >>> cache
> >>>>>>>>>>>>>>> logic in
> >>>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>>>> table
> >>>>>>>>>>>>>>>>>>>>>>> runtime layer or wrapping around the user-provided
> >>>> table
> >>>>>>>>>>>>>>>>> function,
> >>>>>>>>>>>>>>>>>>>> we
> >>>>>>>>>>>>>>>>>>>>>>> prefer to introduce some new APIs extending
> >>>> TableFunction
> >>>>>>>>>>>>>>> with
> >>>>>>>>>>>>>>>>> these
> >>>>>>>>>>>>>>>>>>>>>>> concerns:
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> 1. Caching actually breaks the semantic of "FOR
> >>>>>>>>>>>>>>> SYSTEM_TIME
> >>>>>>>>>>>>>>>>> AS OF
> >>>>>>>>>>>>>>>>>>>>>>> proc_time”, because it couldn’t truly reflect the
> >>>> content
> >>>>>>>>>>>>>>> of the
> >>>>>>>>>>>>>>>>>>>> lookup
> >>>>>>>>>>>>>>>>>>>>>>> table at the moment of querying. If users choose to
> >>>> enable
> >>>>>>>>>>>>>>>>> caching
> >>>>>>>>>>>>>>>>>>>> on
> >>>>>>>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>>>>>> lookup table, they implicitly indicate that this
> >>>> breakage is
> >>>>>>>>>>>>>>>>>>>> acceptable
> >>>>>>>>>>>>>>>>>>>>>> in
> >>>>>>>>>>>>>>>>>>>>>>> exchange for the performance. So we prefer not to
> >>>> provide
> >>>>>>>>>>>>>>>>> caching on
> >>>>>>>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>>>>>> table runtime level.
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> 2. If we make the cache implementation in the
> >>>> framework
> >>>>>>>>>>>>>>>>> (whether
> >>>>>>>>>>>>>>>>>>>> in a
> >>>>>>>>>>>>>>>>>>>>>>> runner or a wrapper around TableFunction), we have
> >> to
> >>>>>>>>>>>>>>> confront a
> >>>>>>>>>>>>>>>>>>>>>> situation
> >>>>>>>>>>>>>>>>>>>>>>> that allows table options in DDL to control the
> >>>> behavior of
> >>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>>>>> framework,
> >>>>>>>>>>>>>>>>>>>>>>> which has never happened previously and should be
> >>>> cautious.
> >>>>>>>>>>>>>>>>> Under
> >>>>>>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>>>>>> current design the behavior of the framework should
> >>>> only be
> >>>>>>>>>>>>>>>>>>>> specified
> >>>>>>>>>>>>>>>>>>>>> by
> >>>>>>>>>>>>>>>>>>>>>>> configurations (“table.exec.xxx”), and it’s hard to
> >>>> apply
> >>>>>>>>>>>>>>> these
> >>>>>>>>>>>>>>>>>>>> general
> >>>>>>>>>>>>>>>>>>>>>>> configs to a specific table.
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> 3. We have use cases that lookup source loads and
> >>>> refresh
> >>>>>>>>>>>>>>> all
> >>>>>>>>>>>>>>>>>>>> records
> >>>>>>>>>>>>>>>>>>>>>>> periodically into the memory to achieve high lookup
> >>>>>>>>>>>>>>> performance
> >>>>>>>>>>>>>>>>>>>> (like
> >>>>>>>>>>>>>>>>>>>>>> Hive
> >>>>>>>>>>>>>>>>>>>>>>> connector in the community, and also widely used by
> >>> our
> >>>>>>>>>>>>>>> internal
> >>>>>>>>>>>>>>>>>>>>>>> connectors). Wrapping the cache around the user’s
> >>>>>>>>>>>>>>> TableFunction
> >>>>>>>>>>>>>>>>>>>> works
> >>>>>>>>>>>>>>>>>>>>>> fine
> >>>>>>>>>>>>>>>>>>>>>>> for LRU caches, but I think we have to introduce a
> >>> new
> >>>>>>>>>>>>>>>>> interface for
> >>>>>>>>>>>>>>>>>>>>> this
> >>>>>>>>>>>>>>>>>>>>>>> all-caching scenario and the design would become
> >> more
> >>>>>>>>>>>>>>> complex.
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> 4. Providing the cache in the framework might
> >>>> introduce
> >>>>>>>>>>>>>>>>>>>> compatibility
> >>>>>>>>>>>>>>>>>>>>>>> issues to existing lookup sources like there might
> >>>> exist two
> >>>>>>>>>>>>>>>>> caches
> >>>>>>>>>>>>>>>>>>>>> with
> >>>>>>>>>>>>>>>>>>>>>>> totally different strategies if the user
> >> incorrectly
> >>>>>>>>>>>>>>> configures
> >>>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>>>> table
> >>>>>>>>>>>>>>>>>>>>>>> (one in the framework and another implemented by
> >> the
> >>>> lookup
> >>>>>>>>>>>>>>>>> source).
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> As for the optimization mentioned by Alexander, I
> >>>> think
> >>>>>>>>>>>>>>>>> filters
> >>>>>>>>>>>>>>>>>>>> and
> >>>>>>>>>>>>>>>>>>>>>>> projections should be pushed all the way down to
> >> the
> >>>> table
> >>>>>>>>>>>>>>>>> function,
> >>>>>>>>>>>>>>>>>>>>> like
> >>>>>>>>>>>>>>>>>>>>>>> what we do in the scan source, instead of the
> >> runner
> >>>> with
> >>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>> cache.
> >>>>>>>>>>>>>>>>>>>>> The
> >>>>>>>>>>>>>>>>>>>>>>> goal of using cache is to reduce the network I/O
> >> and
> >>>>>>>>>>>>>>> pressure
> >>>>>>>>>>>>>>>>> on the
> >>>>>>>>>>>>>>>>>>>>>>> external system, and only applying these
> >>> optimizations
> >>>> to
> >>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>> cache
> >>>>>>>>>>>>>>>>>>>>> seems
> >>>>>>>>>>>>>>>>>>>>>>> not quite useful.
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> I made some updates to the FLIP[1] to reflect our
> >>>> ideas.
> >>>>>>>>>>>>>>> We
> >>>>>>>>>>>>>>>>>>>> prefer to
> >>>>>>>>>>>>>>>>>>>>>>> keep the cache implementation as a part of
> >>>> TableFunction,
> >>>>>>>>>>>>>>> and we
> >>>>>>>>>>>>>>>>>>>> could
> >>>>>>>>>>>>>>>>>>>>>>> provide some helper classes (CachingTableFunction,
> >>>>>>>>>>>>>>>>>>>>>> AllCachingTableFunction,
> >>>>>>>>>>>>>>>>>>>>>>> CachingAsyncTableFunction) to developers and
> >> regulate
> >>>>>>>>>>>>>>> metrics
> >>>>>>>>>>>>>>>>> of the
> >>>>>>>>>>>>>>>>>>>>>> cache.
> >>>>>>>>>>>>>>>>>>>>>>> Also, I made a POC[2] for your reference.
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> Looking forward to your ideas!
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> [1]
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>
> >>>
> >>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-221+Abstraction+for+lookup+source+cache+and+metric
> >>>>>>>>>>>>>>>>>>>>>>>> [2]
> >>> https://github.com/PatrickRen/flink/tree/FLIP-221
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> Best regards,
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> Qingsheng
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> On Tue, Apr 26, 2022 at 4:45 PM Александр Смирнов
> >> <
> >>>>>>>>>>>>>>>>>>>>>> smiralexan@gmail.com>
> >>>>>>>>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>> Thanks for the response, Arvid!
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>> I have few comments on your message.
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> but could also live with an easier solution as
> >> the
> >>>>>>>>>>>>>>> first
> >>>>>>>>>>>>>>>>> step:
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>> I think that these 2 ways are mutually exclusive
> >>>>>>>>>>>>>>> (originally
> >>>>>>>>>>>>>>>>>>>>> proposed
> >>>>>>>>>>>>>>>>>>>>>>>>> by Qingsheng and mine), because conceptually they
> >>>> follow
> >>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>> same
> >>>>>>>>>>>>>>>>>>>>>>>>> goal, but implementation details are different.
> >> If
> >>> we
> >>>>>>>>>>>>>>> will
> >>>>>>>>>>>>>>>>> go one
> >>>>>>>>>>>>>>>>>>>>> way,
> >>>>>>>>>>>>>>>>>>>>>>>>> moving to another way in the future will mean
> >>>> deleting
> >>>>>>>>>>>>>>>>> existing
> >>>>>>>>>>>>>>>>>>>> code
> >>>>>>>>>>>>>>>>>>>>>>>>> and once again changing the API for connectors.
> >> So
> >>> I
> >>>>>>>>>>>>>>> think we
> >>>>>>>>>>>>>>>>>>>> should
> >>>>>>>>>>>>>>>>>>>>>>>>> reach a consensus with the community about that
> >> and
> >>>> then
> >>>>>>>>>>>>>>> work
> >>>>>>>>>>>>>>>>>>>>> together
> >>>>>>>>>>>>>>>>>>>>>>>>> on this FLIP, i.e. divide the work on tasks for
> >>>> different
> >>>>>>>>>>>>>>>>> parts
> >>>>>>>>>>>>>>>>>>>> of
> >>>>>>>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>>>>>>>> flip (for example, LRU cache unification /
> >>>> introducing
> >>>>>>>>>>>>>>>>> proposed
> >>>>>>>>>>>>>>>>>>>> set
> >>>>>>>>>>>>>>>>>>>>> of
> >>>>>>>>>>>>>>>>>>>>>>>>> metrics / further work…). WDYT, Qingsheng?
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> as the source will only receive the requests
> >> after
> >>>>>>>>>>>>>>> filter
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>> Actually if filters are applied to fields of the
> >>>> lookup
> >>>>>>>>>>>>>>>>> table, we
> >>>>>>>>>>>>>>>>>>>>>>>>> firstly must do requests, and only after that we
> >>> can
> >>>>>>>>>>>>>>> filter
> >>>>>>>>>>>>>>>>>>>>> responses,
> >>>>>>>>>>>>>>>>>>>>>>>>> because lookup connectors don't have filter
> >>>> pushdown. So
> >>>>>>>>>>>>>>> if
> >>>>>>>>>>>>>>>>>>>>> filtering
> >>>>>>>>>>>>>>>>>>>>>>>>> is done before caching, there will be far fewer
> >>> rows
> >>>> in
> >>>>>>>>>>>>>>>>> cache.
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> @Alexander unfortunately, your architecture is
> >> not
> >>>>>>>>>>>>>>> shared.
> >>>>>>>>>>>>>>>>> I
> >>>>>>>>>>>>>>>>>>>> don't
> >>>>>>>>>>>>>>>>>>>>>>> know the
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> solution to share images to be honest.
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>> Sorry for that, I’m a bit new to such kinds of
> >>>>>>>>>>>>>>> conversations
> >>>>>>>>>>>>>>>>> :)
> >>>>>>>>>>>>>>>>>>>>>>>>> I have no write access to the confluence, so I
> >>> made a
> >>>>>>>>>>>>>>> Jira
> >>>>>>>>>>>>>>>>> issue,
> >>>>>>>>>>>>>>>>>>>>>>>>> where I described the proposed changes in more
> >>> details
> >>>> -
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >> https://issues.apache.org/jira/browse/FLINK-27411.
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>> Will be happy to get more feedback!
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>> Best,
> >>>>>>>>>>>>>>>>>>>>>>>>> Smirnov Alexander
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>> Mon, 25 Apr 2022 at 19:49, Arvid Heise <
> >>>>>>>>>>>>>>> arvid@apache.org>:
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> Hi Qingsheng,
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for driving this; the inconsistency was
> >> not
> >>>>>>>>>>>>>>>>> satisfying
> >>>>>>>>>>>>>>>>>>>> for
> >>>>>>>>>>>>>>>>>>>>>> me.
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> I second Alexander's idea though but could also
> >>> live
> >>>>>>>>>>>>>>> with
> >>>>>>>>>>>>>>>>> an
> >>>>>>>>>>>>>>>>>>>>> easier
> >>>>>>>>>>>>>>>>>>>>>>>>>> solution as the first step: Instead of making
> >>>> caching
> >>>>>>>>>>>>>>> an
> >>>>>>>>>>>>>>>>>>>>>>> implementation
> >>>>>>>>>>>>>>>>>>>>>>>>>> detail of TableFunction X, rather devise a
> >> caching
> >>>>>>>>>>>>>>> layer
> >>>>>>>>>>>>>>>>>>>> around X.
> >>>>>>>>>>>>>>>>>>>>>> So
> >>>>>>>>>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>>>>>>>>> proposal would be a CachingTableFunction that
> >>>>>>>>>>>>>>> delegates to
> >>>>>>>>>>>>>>>>> X in
> >>>>>>>>>>>>>>>>>>>>> case
> >>>>>>>>>>>>>>>>>>>>>>> of
> >>>>>>>>>>>>>>>>>>>>>>>>>> misses and else manages the cache. Lifting it
> >> into
> >>>> the
> >>>>>>>>>>>>>>>>> operator
> >>>>>>>>>>>>>>>>>>>>>> model
> >>>>>>>>>>>>>>>>>>>>>>> as
> >>>>>>>>>>>>>>>>>>>>>>>>>> proposed would be even better but is probably
> >>>>>>>>>>>>>>> unnecessary
> >>>>>>>>>>>>>>>>> in
> >>>>>>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>>>>>> first step
> >>>>>>>>>>>>>>>>>>>>>>>>>> for a lookup source (as the source will only
> >>> receive
> >>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>>> requests
> >>>>>>>>>>>>>>>>>>>>>>> after
> >>>>>>>>>>>>>>>>>>>>>>>>>> filter; applying projection may be more
> >>> interesting
> >>>> to
> >>>>>>>>>>>>>>> save
> >>>>>>>>>>>>>>>>>>>>> memory).
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> Another advantage is that all the changes of
> >> this
> >>>> FLIP
> >>>>>>>>>>>>>>>>> would be
> >>>>>>>>>>>>>>>>>>>>>>> limited to
> >>>>>>>>>>>>>>>>>>>>>>>>>> options, no need for new public interfaces.
> >>>> Everything
> >>>>>>>>>>>>>>> else
> >>>>>>>>>>>>>>>>>>>>> remains
> >>>>>>>>>>>>>>>>>>>>>> an
> >>>>>>>>>>>>>>>>>>>>>>>>>> implementation of Table runtime. That means we
> >> can
> >>>>>>>>>>>>>>> easily
> >>>>>>>>>>>>>>>>>>>>>> incorporate
> >>>>>>>>>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>>>>>>>>> optimization potential that Alexander pointed
> >> out
> >>>>>>>>>>>>>>> later.
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> @Alexander unfortunately, your architecture is
> >> not
> >>>>>>>>>>>>>>> shared.
> >>>>>>>>>>>>>>>>> I
> >>>>>>>>>>>>>>>>>>>> don't
> >>>>>>>>>>>>>>>>>>>>>>> know the
> >>>>>>>>>>>>>>>>>>>>>>>>>> solution to share images to be honest.
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> On Fri, Apr 22, 2022 at 5:04 PM Александр
> >> Смирнов
> >>> <
> >>>>>>>>>>>>>>>>>>>>>>> smiralexan@gmail.com>
> >>>>>>>>>>>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>> Hi Qingsheng! My name is Alexander, I'm not a
> >>>>>>>>>>>>>>> committer
> >>>>>>>>>>>>>>>>> yet,
> >>>>>>>>>>>>>>>>>>>> but
> >>>>>>>>>>>>>>>>>>>>>> I'd
> >>>>>>>>>>>>>>>>>>>>>>>>>>> really like to become one. And this FLIP really
> >>>>>>>>>>>>>>>>> interested
> >>>>>>>>>>>>>>>>>>>> me.
> >>>>>>>>>>>>>>>>>>>>>>>>>>> Actually I have worked on a similar feature in
> >> my
> >>>>>>>>>>>>>>>>> company’s
> >>>>>>>>>>>>>>>>>>>>> Flink
> >>>>>>>>>>>>>>>>>>>>>>>>>>> fork, and we would like to share our thoughts
> >> on
> >>>>>>>>>>>>>>> this and
> >>>>>>>>>>>>>>>>>>>> make
> >>>>>>>>>>>>>>>>>>>>>> code
> >>>>>>>>>>>>>>>>>>>>>>>>>>> open source.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>> I think there is a better alternative than
> >>>>>>>>>>>>>>> introducing an
> >>>>>>>>>>>>>>>>>>>>> abstract
> >>>>>>>>>>>>>>>>>>>>>>>>>>> class for TableFunction (CachingTableFunction).
> >>> As
> >>>>>>>>>>>>>>> you
> >>>>>>>>>>>>>>>>> know,
> >>>>>>>>>>>>>>>>>>>>>>>>>>> TableFunction exists in the flink-table-common
> >>>>>>>>>>>>>>> module,
> >>>>>>>>>>>>>>>>> which
> >>>>>>>>>>>>>>>>>>>>>>> provides
> >>>>>>>>>>>>>>>>>>>>>>>>>>> only an API for working with tables – it’s very
> >>>>>>>>>>>>>>>>> convenient
> >>>>>>>>>>>>>>>>>>>> for
> >>>>>>>>>>>>>>>>>>>>>>> importing
> >>>>>>>>>>>>>>>>>>>>>>>>>>> in connectors. In turn, CachingTableFunction
> >>>> contains
> >>>>>>>>>>>>>>>>> logic
> >>>>>>>>>>>>>>>>>>>> for
> >>>>>>>>>>>>>>>>>>>>>>>>>>> runtime execution,  so this class and
> >> everything
> >>>>>>>>>>>>>>>>> connected
> >>>>>>>>>>>>>>>>>>>> with
> >>>>>>>>>>>>>>>>>>>>> it
> >>>>>>>>>>>>>>>>>>>>>>>>>>> should be located in another module, probably
> >> in
> >>>>>>>>>>>>>>>>>>>>>>> flink-table-runtime.
> >>>>>>>>>>>>>>>>>>>>>>>>>>> But this will require connectors to depend on
> >>>> another
> >>>>>>>>>>>>>>>>> module,
> >>>>>>>>>>>>>>>>>>>>>> which
> >>>>>>>>>>>>>>>>>>>>>>>>>>> contains a lot of runtime logic, which doesn’t
> >>>> sound
> >>>>>>>>>>>>>>>>> good.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>> I suggest adding a new method ‘getLookupConfig’
> >>> to
> >>>>>>>>>>>>>>>>>>>>>> LookupTableSource
> >>>>>>>>>>>>>>>>>>>>>>>>>>> or LookupRuntimeProvider to allow connectors to
> >>>> only
> >>>>>>>>>>>>>>> pass
> >>>>>>>>>>>>>>>>>>>>>>>>>>> configurations to the planner, therefore they
> >>> won’t
> >>>>>>>>>>>>>>>>> depend on
> >>>>>>>>>>>>>>>>>>>>>>> runtime
> >>>>>>>>>>>>>>>>>>>>>>>>>>> realization. Based on these configs planner
> >> will
> >>>>>>>>>>>>>>>>> construct a
> >>>>>>>>>>>>>>>>>>>>>> lookup
> >>>>>>>>>>>>>>>>>>>>>>>>>>> join operator with corresponding runtime logic
> >>>>>>>>>>>>>>>>>>>> (ProcessFunctions
> >>>>>>>>>>>>>>>>>>>>>> in
> >>>>>>>>>>>>>>>>>>>>>>>>>>> module flink-table-runtime). Architecture looks
> >>>> like
> >>>>>>>>>>>>>>> in
> >>>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>>>> pinned
> >>>>>>>>>>>>>>>>>>>>>>>>>>> image (LookupConfig class there is actually
> >> yours
> >>>>>>>>>>>>>>>>>>>> CacheConfig).
> >>>>>>>>>>>>>>>>>>>>>>>>>>>
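A rough sketch of what such a `getLookupConfig` contract could look like in plain Java. Every name below (`LookupConfig`, `CacheStrategy`, `getLookupConfig`, `JdbcLikeSource`) is illustrative, taken from this discussion rather than from any released Flink API:

```java
import java.time.Duration;

// Sketch of the proposal: the connector only *describes* its caching needs
// via a config object, while the planner picks the runtime operator. Names
// are illustrative, not a released Flink API.
public class LookupConfigSketch {

    enum CacheStrategy { NONE, LRU, ALL }

    static final class LookupConfig {
        final CacheStrategy strategy;
        final long maxRows;
        final Duration ttl;

        LookupConfig(CacheStrategy strategy, long maxRows, Duration ttl) {
            this.strategy = strategy;
            this.maxRows = maxRows;
            this.ttl = ttl;
        }
    }

    /** What a connector would implement instead of extending a runtime class. */
    interface LookupTableSource {
        LookupConfig getLookupConfig();
    }

    /** A JDBC-like connector simply forwards its DDL cache options. */
    static final class JdbcLikeSource implements LookupTableSource {
        @Override
        public LookupConfig getLookupConfig() {
            return new LookupConfig(CacheStrategy.LRU, 10_000, Duration.ofMinutes(10));
        }
    }

    public static void main(String[] args) {
        LookupConfig config = new JdbcLikeSource().getLookupConfig();
        // The planner would now choose e.g. a caching lookup-join runner for LRU.
        System.out.println(config.strategy + " maxRows=" + config.maxRows);
    }
}
```

This keeps the connector's compile-time dependency limited to API types, which is the point of the argument above.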
> >>>>>>>>>>>>>>>>>>>>>>>>>>> Classes in flink-table-planner, that will be
> >>>>>>>>>>>>>>> responsible
> >>>>>>>>>>>>>>>>> for
> >>>>>>>>>>>>>>>>>>>>> this
> >>>>>>>>>>>>>>>>>>>>>> –
> >>>>>>>>>>>>>>>>>>>>>>>>>>> CommonPhysicalLookupJoin and his inheritors.
> >>>>>>>>>>>>>>>>>>>>>>>>>>> Current classes for lookup join in
> >>>>>>>>>>>>>>> flink-table-runtime
> >>>>>>>>>>>>>>>>> -
> >>>>>>>>>>>>>>>>>>>>>>>>>>> LookupJoinRunner, AsyncLookupJoinRunner,
> >>>>>>>>>>>>>>>>>>>>> LookupJoinRunnerWithCalc,
> >>>>>>>>>>>>>>>>>>>>>>>>>>> AsyncLookupJoinRunnerWithCalc.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>> I suggest adding classes
> >> LookupJoinCachingRunner,
> >>>>>>>>>>>>>>>>>>>>>>>>>>> LookupJoinCachingRunnerWithCalc, etc.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>> And here comes another more powerful advantage
> >> of
> >>>>>>>>>>>>>>> such a
> >>>>>>>>>>>>>>>>>>>>> solution.
> >>>>>>>>>>>>>>>>>>>>>>> If
> >>>>>>>>>>>>>>>>>>>>>>>>>>> we have caching logic on a lower level, we can
> >>>> apply
> >>>>>>>>>>>>>>> some
> >>>>>>>>>>>>>>>>>>>>>>>>>>> optimizations to it. LookupJoinRunnerWithCalc
> >> was
> >>>>>>>>>>>>>>> named
> >>>>>>>>>>>>>>>>> like
> >>>>>>>>>>>>>>>>>>>>> this
> >>>>>>>>>>>>>>>>>>>>>>>>>>> because it uses the ‘calc’ function, which
> >>> actually
> >>>>>>>>>>>>>>>>> mostly
> >>>>>>>>>>>>>>>>>>>>>> consists
> >>>>>>>>>>>>>>>>>>>>>>> of
> >>>>>>>>>>>>>>>>>>>>>>>>>>> filters and projections.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>> For example, in join table A with lookup table
> >> B
> >>>>>>>>>>>>>>>>> condition
> >>>>>>>>>>>>>>>>>>>>> ‘JOIN …
> >>>>>>>>>>>>>>>>>>>>>>> ON
> >>>>>>>>>>>>>>>>>>>>>>>>>>> A.id = B.id AND A.age = B.age + 10 WHERE
> >>> B.salary >
> >>>>>>>>>>>>>>> 1000’
> >>>>>>>>>>>>>>>>>>>>> ‘calc’
> >>>>>>>>>>>>>>>>>>>>>>>>>>> function will contain filters A.age = B.age +
> >> 10
> >>>> and
> >>>>>>>>>>>>>>>>>>>> B.salary >
> >>>>>>>>>>>>>>>>>>>>>>> 1000.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>> If we apply this function before storing
> >> records
> >>> in
> >>>>>>>>>>>>>>>>> cache,
> >>>>>>>>>>>>>>>>>>>> size
> >>>>>>>>>>>>>>>>>>>>> of
> >>>>>>>>>>>>>>>>>>>>>>>>>>> cache will be significantly reduced: filters =
> >>>> avoid
> >>>>>>>>>>>>>>>>> storing
> >>>>>>>>>>>>>>>>>>>>>> useless
> >>>>>>>>>>>>>>>>>>>>>>>>>>> records in cache, projections = reduce records’
> >>>>>>>>>>>>>>> size. So
> >>>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>>>>> initial
> >>>>>>>>>>>>>>>>>>>>>>>>>>> max number of records in cache can be increased
> >>> by
> >>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>> user.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>> What do you think about it?
> >>>>>>>>>>>>>>>>>>>>>>>>>>>
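A minimal sketch of the "apply the calc function before storing records in cache" idea, in plain Java: rows failing the filter never enter the cache, so the same max-rows budget covers more keys. The class and member names are hypothetical, not the actual LookupJoinCachingRunnerWithCalc:

```java
import java.util.*;
import java.util.function.*;

// Hedged sketch of applying the pushed-in filter ("calc") before storing rows
// in an LRU cache. Names are illustrative; this is not the real
// LookupJoinCachingRunnerWithCalc from flink-table-runtime.
public class FilteringLookupCache<K, R> {
    private final Function<K, List<R>> lookup; // the underlying lookup, e.g. a JDBC query
    private final Predicate<R> filter;         // filter from the join condition / WHERE clause
    private final Map<K, List<R>> cache;

    public FilteringLookupCache(Function<K, List<R>> lookup, Predicate<R> filter, int maxKeys) {
        this.lookup = lookup;
        this.filter = filter;
        // An access-ordered LinkedHashMap gives a simple LRU eviction policy.
        this.cache = new LinkedHashMap<K, List<R>>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, List<R>> eldest) {
                return size() > maxKeys;
            }
        };
    }

    public List<R> get(K key) {
        return cache.computeIfAbsent(key, k -> {
            List<R> kept = new ArrayList<>();
            for (R row : lookup.apply(k)) {
                if (filter.test(row)) {
                    kept.add(row); // rows failing the filter are never cached
                }
            }
            return kept;
        });
    }

    public static void main(String[] args) {
        // Rows are salaries for a user id; the pushed filter is salary > 1000.
        Function<Integer, List<Integer>> lookup = id -> Arrays.asList(500, 1500);
        FilteringLookupCache<Integer, Integer> cache =
                new FilteringLookupCache<>(lookup, salary -> salary > 1000, 100);
        System.out.println(cache.get(1)); // only the 1500 row is cached
    }
}
```

Projections would slot in the same way, mapping each kept row to a narrower record before the `kept.add` call.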
> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> On 2022/04/19 02:47:11 Qingsheng Ren wrote:
>> Hi devs,
>>
>> Yuan and I would like to start a discussion about FLIP-221[1], which
>> introduces an abstraction of lookup table cache and its standard
>> metrics.
>>
>> Currently each lookup table source should implement their own cache to
>> store lookup results, and there isn’t a standard of metrics for users
>> and developers to tuning their jobs with lookup joins, which is a
>> quite common use case in Flink table / SQL.
>>
>> Therefore we propose some new APIs including cache, metrics, wrapper
>> classes of TableFunction and new table options. Please take a look at
>> the FLIP page [1] to get more details. Any suggestions and comments
>> would be appreciated!
>>
>> [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-221+Abstraction+for+lookup+source+cache+and+metric
>>
>> Best regards,
>>
>> Qingsheng
>
> --
> Best Regards,
>
> Qingsheng Ren
>
> Real-time Computing Team
> Alibaba Cloud
>
> Email: renqschn@gmail.com
>
> --
> Best regards,
> Roman Boyko
> e.: ro.v.boyko@gmail.com

Re: [DISCUSS] FLIP-221 Abstraction for lookup source cache and metric

Posted by Qingsheng Ren <re...@gmail.com>.
Hi Lincoln and Jark, 

Thanks for the comments! If the community reaches a consensus that we use a SQL hint instead of table options to decide whether to use sync or async mode, it’s indeed not necessary to introduce the “lookup.async” option.

I think it’s a good idea to have the async decision made at the query level, which could enable better optimization with more information gathered by the planner. Is there any FLIP describing the issue in FLINK-27625? I thought FLIP-234 was only proposing to add a SQL hint for retry on missing keys, rather than putting the entire async mode under hint control.

Best regards, 

Qingsheng

> On May 25, 2022, at 15:13, Lincoln Lee <li...@gmail.com> wrote:
> 
> Hi Jark,
> 
> Thanks for your reply!
> 
> Currently 'lookup.async' exists only in the HBase connector, and I have no
> idea whether or when to remove it (we can discuss it in another issue for
> the HBase connector after FLINK-27625 is done); let's just not add it as a
> common option now.
> 
> Best,
> Lincoln Lee
> 
> 
> Jark Wu <im...@gmail.com> wrote on Tue, 24 May 2022 at 20:14:
> 
>> Hi Lincoln,
>> 
>> I have taken a look at FLIP-234, and I agree with you that the connectors
>> can
>> provide both async and sync runtime providers simultaneously instead of one
>> of them.
>> At that point, "lookup.async" looks redundant. If this option is planned to
>> be removed
>> in the long term, I think it makes sense not to introduce it in this FLIP.
>> 
>> Best,
>> Jark
>> 
>> On Tue, 24 May 2022 at 11:08, Lincoln Lee <li...@gmail.com> wrote:
>> 
>>> Hi Qingsheng,
>>> 
>>> Sorry for jumping into the discussion so late. It's a good idea that we
>>> can have a common table option. I have a minor comment on
>>> 'lookup.async': let's not make it a common option.
>>> 
>>> The table layer abstracts both sync and async lookup capabilities, and
>>> connector implementers can choose one or both. In the case of
>>> implementing only one capability (the status of most of the existing
>>> built-in connectors), 'lookup.async' will not be used. And when a
>>> connector has both capabilities, I think this choice is better made at
>>> the query level: for example, the table planner can choose the physical
>>> implementation of async or sync lookup based on its cost model, or
>>> users can give a query hint based on their own better understanding. If
>>> there is another common table option 'lookup.async', it may confuse
>>> users in the long run.
>>> 
>>> So, I prefer to leave the 'lookup.async' option in a private place (for
>>> the current HBase connector) and not turn it into a common option.
>>> 
>>> WDYT?
>>> 
>>> Best,
>>> Lincoln Lee
>>> 
>>> 
>>> Qingsheng Ren <re...@gmail.com> wrote on Mon, 23 May 2022 at 14:54:
>>> 
>>>> Hi Alexander,
>>>> 
>>>> Thanks for the review! We recently updated the FLIP and you can find
>>>> the changes in my latest email. Since some of the terminology has
>>>> changed, I'll use the new concepts when replying to your comments.
>>>> 
>>>> 1. Builder vs ‘of’
>>>> I'm OK with using the builder pattern if we have additional optional
>>>> parameters for the full caching mode ("rescan" previously). The
>>>> schedule-with-delay idea looks reasonable to me, but I think we need
>>>> to redesign the builder API of full caching to make it more
>>>> descriptive for developers. Would you mind sharing your ideas about
>>>> the API? For access to the FLIP workspace, you can just provide your
>>>> account ID and ping any PMC member, including Jark.
>>>> 
>>>> 2. Common table options
>>>> We have some discussions these days and propose to introduce 8 common
>>>> table options about caching. It has been updated on the FLIP.
>>>> 
>>>> 3. Retries
>>>> I think we are on the same page :-)
>>>> 
>>>> For your additional concerns:
>>>> 1) The table option has been updated.
>>>> 2) We got “lookup.cache” back for configuring whether to use partial or
>>>> full caching mode.
>>>> 
>>>> Best regards,
>>>> 
>>>> Qingsheng
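[Editorial note] To make the builder-vs-static-'of' trade-off discussed above concrete, here is a minimal, hedged sketch. The names (`Provider`, `rescanInterval`, `rescanStartTime`) are illustrative assumptions, not the actual FLIP-221 API; the point is only that a single `build()` call absorbs optional parameters that would otherwise force a growing family of `of(...)` overloads.

```java
import java.time.Duration;
import java.time.LocalTime;

// Hypothetical sketch only: fields and names are assumptions for
// illustration, not the actual FLIP-221 interfaces.
public class FullCachingProviderSketch {

    static final class Provider {
        final Duration rescanInterval;
        final LocalTime rescanStartTime; // may be null (optional parameter)

        private Provider(Builder b) {
            this.rescanInterval = b.rescanInterval;
            this.rescanStartTime = b.rescanStartTime;
        }
    }

    static final class Builder {
        // default: reload once a day, matching the behavior described in
        // the thread when only 'rescanStartTime' is set
        private Duration rescanInterval = Duration.ofDays(1);
        private LocalTime rescanStartTime;

        Builder rescanInterval(Duration interval) { this.rescanInterval = interval; return this; }
        Builder rescanStartTime(LocalTime startUtc) { this.rescanStartTime = startUtc; return this; }
        Provider build() { return new Provider(this); }
    }

    public static void main(String[] args) {
        // One build() call instead of many 'of' overloads as knobs grow.
        Provider p = new Builder()
                .rescanInterval(Duration.ofHours(6))
                .rescanStartTime(LocalTime.of(2, 0))
                .build();
        System.out.println(p.rescanInterval);
        System.out.println(p.rescanStartTime);
    }
}
```

Adding a later optional parameter then only adds one builder method, without breaking existing call sites.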
>>>> 
>>>> 
>>>> 
>>>>> On May 19, 2022, at 17:25, Александр Смирнов <sm...@gmail.com> wrote:
>>>>> 
>>>>> Also I have a few additions:
>>>>> 1) maybe rename 'lookup.cache.maximum-size' to
>>>>> 'lookup.cache.max-rows'? I think it will be clearer that we are
>>>>> talking not about bytes, but about the number of rows. Plus it fits
>>>>> better, considering my optimization with filters.
>>>>> 2) How will users enable rescanning? Are we going to separate caching
>>>>> and rescanning from the options point of view? Like initially we had
>>>>> one option 'lookup.cache' with values LRU / ALL. I think now we can
>>>>> make a boolean option 'lookup.rescan'. RescanInterval can be
>>>>> 'lookup.rescan.interval', etc.
>>>>> 
>>>>> Best regards,
>>>>> Alexander
>>>>> 
>>>>> On Thu, 19 May 2022 at 14:50, Александр Смирнов <smiralexan@gmail.com>:
>>>>>> 
>>>>>> Hi Qingsheng and Jark,
>>>>>> 
>>>>>> 1. Builders vs 'of'
>>>>>> I understand that builders are used when we have multiple
>> parameters.
>>>>>> I suggested them because we could add parameters later. To prevent
>>>>>> Builder for ScanRuntimeProvider from looking redundant I can suggest
>>>>>> one more config now - "rescanStartTime".
>>>>>> It's a time in UTC (LocalTime class) at which the first reload of
>>>>>> the cache starts. This parameter can be thought of as the
>>>>>> 'initialDelay' (the diff between the current time and
>>>>>> rescanStartTime) in the method
>>>>>> ScheduledExecutorService#scheduleWithFixedDelay [1]. It can be very
>>>>>> useful when the dimension table is updated by some other scheduled
>>>>>> job at a certain time, or when the user simply wants the second scan
>>>>>> (first cache reload) to be delayed. This option can be used even
>>>>>> without 'rescanInterval' - in that case 'rescanInterval' will be one
>>>>>> day.
>>>>>> If you are fine with this option, I would be very glad if you would
>>>>>> give me access to edit the FLIP page, so I could add it myself.
>>>>>> If you are fine with this option, I would be very glad if you would
>>>>>> give me access to edit FLIP page, so I could add it myself
>>>>>> 
>>>>>> 2. Common table options
>>>>>> I also think that FactoryUtil would be overloaded by all the cache
>>>>>> options. But maybe we could unify all the suggested options, not
>>>>>> only those for the default cache? I.e. a class 'LookupOptions' that
>>>>>> unifies the default cache options, rescan options, 'async', and
>>>>>> 'maxRetries'. WDYT?
>>>>>> 
>>>>>> 3. Retries
>>>>>> I'm fine with suggestion close to RetryUtils#tryTimes(times, call)
>>>>>> 
>>>>>> [1]
>>>> 
>>> 
>> https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/ScheduledExecutorService.html#scheduleWithFixedDelay-java.lang.Runnable-long-long-java.util.concurrent.TimeUnit-
>>>>>> 
>>>>>> Best regards,
>>>>>> Alexander
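[Editorial note] The 'rescanStartTime'-as-initialDelay idea proposed above can be sketched as follows. This is a hedged illustration: 'rescanStartTime' is only a proposed option in this thread, and the helper below is not Flink code; it just shows how a UTC LocalTime would translate into the `initialDelay` argument of `ScheduledExecutorService#scheduleWithFixedDelay`.

```java
import java.time.Duration;
import java.time.LocalTime;

// Illustrative helper, not Flink code: compute the initial delay before
// the first cache reload from a proposed 'rescanStartTime' (UTC).
public class RescanDelay {

    static Duration initialDelay(LocalTime nowUtc, LocalTime rescanStartTime) {
        Duration d = Duration.between(nowUtc, rescanStartTime);
        // if the start time has already passed today, schedule for tomorrow
        if (d.isNegative()) d = d.plusDays(1);
        return d;
    }

    public static void main(String[] args) {
        // start time still ahead today -> wait 2h30m
        System.out.println(initialDelay(LocalTime.of(10, 0), LocalTime.of(12, 30)));
        // start time already passed -> wait until tomorrow, 23h30m
        System.out.println(initialDelay(LocalTime.of(13, 0), LocalTime.of(12, 30)));
    }
}
```

The resulting Duration would then be passed as the `initialDelay` (with the rescan interval as the `delay`) to `scheduleWithFixedDelay`.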
>>>>>> 
>>>>>> On Wed, 18 May 2022 at 16:04, Qingsheng Ren <re...@gmail.com>:
>>>>>>> 
>>>>>>> Hi Jark and Alexander,
>>>>>>> 
>>>>>>> Thanks for your comments! I’m also OK to introduce common table
>>>> options. I prefer to introduce a new DefaultLookupCacheOptions class
>> for
>>>> holding these option definitions because putting all options into
>>>> FactoryUtil would make it a bit ”crowded” and not well categorized.
>>>>>>> 
>>>>>>> FLIP has been updated according to suggestions above:
>>>>>>> 1. Use static “of” method for constructing RescanRuntimeProvider
>>>> considering both arguments are required.
>>>>>>> 2. Introduce new table options matching DefaultLookupCacheFactory
>>>>>>> 
>>>>>>> Best,
>>>>>>> Qingsheng
>>>>>>> 
>>>>>>> On Wed, May 18, 2022 at 2:57 PM Jark Wu <im...@gmail.com> wrote:
>>>>>>>> 
>>>>>>>> Hi Alex,
>>>>>>>> 
>>>>>>>> 1) retry logic
>>>>>>>> I think we can extract some common retry logic into utilities,
>> e.g.
>>>> RetryUtils#tryTimes(times, call).
>>>>>>>> This seems independent of this FLIP and can be reused by
>> DataStream
>>>> users.
>>>>>>>> Maybe we can open an issue to discuss this and where to put it.
>>>>>>>> 
>>>>>>>> 2) cache ConfigOptions
>>>>>>>> I'm fine with defining cache config options in the framework.
>>>>>>>> A candidate place to put is FactoryUtil which also includes
>>>> "sink.parallelism", "format" options.
>>>>>>>> 
>>>>>>>> Best,
>>>>>>>> Jark
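[Editorial note] A minimal sketch of the `RetryUtils#tryTimes(times, call)` helper suggested above. At the time of this thread no such utility exists in Flink; the name and signature are taken from the email as an assumption, and a real implementation would likely retry only on retriable failures, as discussed later in the thread.

```java
import java.util.concurrent.Callable;

// Hypothetical sketch of the RetryUtils#tryTimes(times, call) helper
// suggested in the thread; not an actual Flink utility.
public class RetryUtils {

    public static <T> T tryTimes(int times, Callable<T> call) throws Exception {
        Exception last = null;
        for (int i = 0; i < times; i++) {
            try {
                return call.call();
            } catch (Exception e) {
                // a real implementation might retry only retriable failures
                // and run custom logic (e.g. reconnect) before retrying
                last = e;
            }
        }
        throw last;
    }

    public static void main(String[] args) throws Exception {
        int[] attempts = {0};
        String result = tryTimes(3, () -> {
            if (++attempts[0] < 3) throw new RuntimeException("transient failure");
            return "ok";
        });
        System.out.println(result + " after " + attempts[0] + " attempts");
    }
}
```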
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Wed, 18 May 2022 at 13:52, Александр Смирнов <smiralexan@gmail.com> wrote:
>>>>>>>>> 
>>>>>>>>> Hi Qingsheng,
>>>>>>>>> 
>>>>>>>>> Thank you for considering my comments.
>>>>>>>>> 
>>>>>>>>>> there might be custom logic before making retry, such as
>>>> re-establish the connection
>>>>>>>>> 
>>>>>>>>> Yes, I understand that. I meant that such logic can be placed in
>> a
>>>>>>>>> separate function, that can be implemented by connectors. Just
>>> moving
>>>>>>>>> the retry logic would make connector's LookupFunction more
>> concise
>>> +
>>>>>>>>> avoid duplicate code. However, it's a minor change. The decision
>> is
>>>> up
>>>>>>>>> to you.
>>>>>>>>> 
>>>>>>>>>> We decided not to provide common DDL options and to let
>>>>>>>>>> developers define their own options per connector, as we do now.
>>>>>>>>> 
>>>>>>>>> What is the reason for that? One of the main goals of this FLIP
>>>>>>>>> was to unify the configs, wasn't it? I understand that the current
>>>>>>>>> cache design doesn't depend on ConfigOptions, as it did before.
>>>>>>>>> But we can still put these options into the framework, so
>>>>>>>>> connectors can reuse them and avoid code duplication, and, what is
>>>>>>>>> more significant, avoid possibly inconsistent option naming. This
>>>>>>>>> can be pointed out in the documentation for connector developers.
>>>>>>>>> 
>>>>>>>>> Best regards,
>>>>>>>>> Alexander
>>>>>>>>> 
>>>>>>>>> On Tue, 17 May 2022 at 17:11, Qingsheng Ren <re...@gmail.com>:
>>>>>>>>>> 
>>>>>>>>>> Hi Alexander,
>>>>>>>>>> 
>>>>>>>>>> Thanks for the review and glad to see we are on the same page! I
>>>> think you forgot to cc the dev mailing list so I’m also quoting your
>>> reply
>>>> under this email.
>>>>>>>>>> 
>>>>>>>>>>> We can add 'maxRetryTimes' option into this class
>>>>>>>>>> 
>>>>>>>>>> In my opinion the retry logic should be implemented in lookup()
>>>>>>>>>> instead of in LookupFunction#eval(). Retrying is only meaningful
>>>>>>>>>> for specific retriable failures, and there might be custom logic
>>>>>>>>>> before making a retry, such as re-establishing the connection
>>>>>>>>>> (JdbcRowDataLookupFunction is an example), so it's handier to
>>>>>>>>>> leave it to the connector.
>>>>>>>>>> 
>>>>>>>>>>> I don't see DDL options, that were in previous version of FLIP.
>>> Do
>>>> you have any special plans for them?
>>>>>>>>>> 
>>>>>>>>>> We decided not to provide common DDL options and to let
>>>>>>>>>> developers define their own options per connector, as we do now.
>>>>>>>>>> 
>>>>>>>>>> The rest of the comments sound great and I’ll update the FLIP.
>>>>>>>>>> Hope we can finalize our proposal soon!
>>>>>>>>>> 
>>>>>>>>>> Best,
>>>>>>>>>> 
>>>>>>>>>> Qingsheng
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>>> On May 17, 2022, at 13:46, Александр Смирнов <smiralexan@gmail.com> wrote:
>>>>>>>>>>> 
>>>>>>>>>>> Hi Qingsheng and devs!
>>>>>>>>>>> 
>>>>>>>>>>> I like the overall design of updated FLIP, however I have
>> several
>>>>>>>>>>> suggestions and questions.
>>>>>>>>>>> 
>>>>>>>>>>> 1) Introducing LookupFunction as a subclass of TableFunction is
>>>>>>>>>>> a good idea. We can add a 'maxRetryTimes' option to this class;
>>>>>>>>>>> the 'eval' method of the new LookupFunction is great for this
>>>>>>>>>>> purpose. The same goes for the 'async' case.
>>>>>>>>>>> 
>>>>>>>>>>> 2) There might be other configs in future, such as
>>>> 'cacheMissingKey'
>>>>>>>>>>> in LookupFunctionProvider or 'rescanInterval' in
>>>> ScanRuntimeProvider.
>>>>>>>>>>> Maybe use Builder pattern in LookupFunctionProvider and
>>>>>>>>>>> RescanRuntimeProvider for more flexibility (use one 'build'
>>> method
>>>>>>>>>>> instead of many 'of' methods in future)?
>>>>>>>>>>> 
>>>>>>>>>>> 3) What are the plans for existing TableFunctionProvider and
>>>>>>>>>>> AsyncTableFunctionProvider? I think they should be deprecated.
>>>>>>>>>>> 
>>>>>>>>>>> 4) Am I right that the current design does not assume usage of
>>>>>>>>>>> user-provided LookupCache in re-scanning? In this case, it is
>> not
>>>> very
>>>>>>>>>>> clear why do we need methods such as 'invalidate' or 'putAll'
>> in
>>>>>>>>>>> LookupCache.
>>>>>>>>>>> 
>>>>>>>>>>> 5) I don't see DDL options, that were in previous version of
>>> FLIP.
>>>> Do
>>>>>>>>>>> you have any special plans for them?
>>>>>>>>>>> 
>>>>>>>>>>> If you don't mind, I would be glad to be able to make small
>>>>>>>>>>> adjustments to the FLIP document too. I think it's worth
>>> mentioning
>>>>>>>>>>> about what exactly optimizations are planning in the future.
>>>>>>>>>>> 
>>>>>>>>>>> Best regards,
>>>>>>>>>>> Smirnov Alexander
>>>>>>>>>>> 
>>>>>>>>>>> On Fri, 13 May 2022 at 20:27, Qingsheng Ren <renqschn@gmail.com>:
>>>>>>>>>>>> 
>>>>>>>>>>>> Hi Alexander and devs,
>>>>>>>>>>>> 
>>>>>>>>>>>> Thank you very much for the in-depth discussion! As Jark
>>>> mentioned we were inspired by Alexander's idea and made a refactor on
>> our
>>>> design. FLIP-221 [1] has been updated to reflect our design now and we
>>> are
>>>> happy to hear more suggestions from you!
>>>>>>>>>>>> 
>>>>>>>>>>>> Compared to the previous design:
>>>>>>>>>>>> 1. The lookup cache serves at table runtime level and is
>>>> integrated as a component of LookupJoinRunner as discussed previously.
>>>>>>>>>>>> 2. Interfaces are renamed and re-designed to reflect the new
>>>> design.
>>>>>>>>>>>> 3. We separate the all-caching case individually and
>> introduce a
>>>> new RescanRuntimeProvider to reuse the ability of scanning. We are
>>> planning
>>>> to support SourceFunction / InputFormat for now considering the
>>> complexity
>>>> of FLIP-27 Source API.
>>>>>>>>>>>> 4. A new interface LookupFunction is introduced to make the
>>>> semantic of lookup more straightforward for developers.
>>>>>>>>>>>> 
>>>>>>>>>>>> For replying to Alexander:
>>>>>>>>>>>>> However I'm a little confused whether InputFormat is
>> deprecated
>>>> or not. Am I right that it will be so in the future, but currently it's
>>> not?
>>>>>>>>>>>> Yes you are right. InputFormat is not deprecated for now. I
>>> think
>>>> it will be deprecated in the future but we don't have a clear plan for
>>> that.
>>>>>>>>>>>> 
>>>>>>>>>>>> Thanks again for the discussion on this FLIP and looking
>> forward
>>>> to cooperating with you after we finalize the design and interfaces!
>>>>>>>>>>>> 
>>>>>>>>>>>> [1]
>>>> 
>>> 
>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-221+Abstraction+for+lookup+source+cache+and+metric
>>>>>>>>>>>> 
>>>>>>>>>>>> Best regards,
>>>>>>>>>>>> 
>>>>>>>>>>>> Qingsheng
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> On Fri, May 13, 2022 at 12:12 AM Александр Смирнов <smiralexan@gmail.com> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Hi Jark, Qingsheng and Leonard!
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Glad to see that we came to a consensus on almost all points!
>>>>>>>>>>>>> 
>>>>>>>>>>>>> However I'm a little confused whether InputFormat is
>> deprecated
>>>> or
>>>>>>>>>>>>> not. Am I right that it will be so in the future, but
>> currently
>>>> it's
>>>>>>>>>>>>> not? Actually I also think that for the first version it's OK
>>> to
>>>> use
>>>>>>>>>>>>> InputFormat in ALL cache realization, because supporting
>> rescan
>>>>>>>>>>>>> ability seems like a very distant prospect. But for this
>>>> decision we
>>>>>>>>>>>>> need a consensus among all discussion participants.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> In general, I don't have anything to argue with in your
>>>>>>>>>>>>> statements. All of them correspond to my ideas. Looking ahead,
>>>>>>>>>>>>> it would be nice to work on this FLIP cooperatively. I've
>>>>>>>>>>>>> already done a lot of work on lookup join caching with an
>>>>>>>>>>>>> implementation very close to the one we are discussing, and I
>>>>>>>>>>>>> want to share the results of this work. Anyway, looking
>>>>>>>>>>>>> forward to the FLIP update!
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>> Smirnov Alexander
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On Thu, 12 May 2022 at 17:38, Jark Wu <im...@gmail.com>:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Hi Alex,
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Thanks for summarizing your points.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> In the past week, Qingsheng, Leonard, and I have discussed
>> it
>>>> several times
>>>>>>>>>>>>>> and we have totally refactored the design.
>>>>>>>>>>>>>> I'm glad to say we have reached a consensus on many of your
>>>> points!
>>>>>>>>>>>>>> Qingsheng is still working on updating the design docs and
>>>> maybe can be
>>>>>>>>>>>>>> available in the next few days.
>>>>>>>>>>>>>> I will share some conclusions from our discussions:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 1) we have refactored the design towards to "cache in
>>>> framework" way.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 2) a "LookupCache" interface for users to customize and a
>>>> default
>>>>>>>>>>>>>> implementation with builder for users to easy-use.
>>>>>>>>>>>>>> This can both make it possible to both have flexibility and
>>>> conciseness.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 3) Filter pushdown is important for ALL and LRU lookup
>> cache,
>>>> esp reducing
>>>>>>>>>>>>>> IO.
>>>>>>>>>>>>>> Filter pushdown should be the final state and the unified
>> way
>>>> to both
>>>>>>>>>>>>>> support pruning ALL cache and LRU cache,
>>>>>>>>>>>>>> so I think we should make effort in this direction. If we
>> need
>>>> to support
>>>>>>>>>>>>>> filter pushdown for ALL cache anyway, why not use
>>>>>>>>>>>>>> it for LRU cache as well? Either way, as we decide to
>>> implement
>>>> the cache
>>>>>>>>>>>>>> in the framework, we have the chance to support
>>>>>>>>>>>>>> filter on cache anytime. This is an optimization and it
>>> doesn't
>>>> affect the
>>>>>>>>>>>>>> public API. I think we can create a JIRA issue to
>>>>>>>>>>>>>> discuss it when the FLIP is accepted.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 4) The idea to support ALL cache is similar to your
>> proposal.
>>>>>>>>>>>>>> In the first version, we will only support InputFormat,
>>>> SourceFunction for
>>>>>>>>>>>>>> cache all (invoke InputFormat in join operator).
>>>>>>>>>>>>>> For FLIP-27 source, we need to join a true source operator
>>>> instead of
>>>>>>>>>>>>>> calling it embedded in the join operator.
>>>>>>>>>>>>>> However, this needs another FLIP to support the re-scan
>>> ability
>>>> for FLIP-27
>>>>>>>>>>>>>> Source, and this can be a large work.
>>>>>>>>>>>>>> In order to not block this issue, we can put the effort of
>>>> FLIP-27 source
>>>>>>>>>>>>>> integration into future work and integrate
>>>>>>>>>>>>>> InputFormat&SourceFunction for now.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> I think it's fine to use InputFormat&SourceFunction, as they
>>>> are not
>>>>>>>>>>>>>> deprecated, otherwise, we have to introduce another function
>>>>>>>>>>>>>> similar to them which is meaningless. We need to plan
>> FLIP-27
>>>> source
>>>>>>>>>>>>>> integration ASAP before InputFormat & SourceFunction are
>>>> deprecated.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>> Jark
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
> On Thu, 12 May 2022 at 15:46, Александр Смирнов <smiralexan@gmail.com> wrote:
>
>> Hi Martijn!
>>
>> Got it. Therefore, the realization with InputFormat is not considered.
>> Thanks for clearing that up!
>>
>> Best regards,
>> Smirnov Alexander
>>
>> On Thu, 12 May 2022 at 14:23, Martijn Visser <martijn@ververica.com> wrote:
>>>
>>> Hi,
>>>
>>> With regards to:
>>>
>>>> But if there are plans to refactor all connectors to FLIP-27
>>>
>>> Yes, FLIP-27 is the target for all connectors. The old interfaces will
>>> be deprecated and connectors will either be refactored to use the new
>>> ones or dropped.
>>>
>>> The caching should work for connectors that are using FLIP-27
>>> interfaces; we should not introduce new features for old interfaces.
>>>
>>> Best regards,
>>>
>>> Martijn
>>>
>>> On Thu, 12 May 2022 at 06:19, Александр Смирнов <smiralexan@gmail.com> wrote:
>>>>
>>>> Hi Jark!
>>>>
>>>> Sorry for the late response. I would like to make some comments and
>>>> clarify my points.
>>>>
>>>> 1) I agree with your first statement. I think we can achieve both
>>>> advantages this way: put the Cache interface in flink-table-common,
>>>> but have implementations of it in flink-table-runtime. Therefore if
>>>> a connector developer wants to use existing cache strategies and
>>>> their implementations, he can just pass the lookupConfig to the
>>>> planner, but if he wants to have his own cache implementation in his
>>>> TableFunction, it will be possible for him to use the existing
>>>> interface for this purpose (we can explicitly point this out in the
>>>> documentation). In this way all configs and metrics will be unified.
>>>> WDYT?
>>>>
>>>>> If a filter can prune 90% of data in the cache, we will have 90% of
>>>>> lookup requests that can never be cached
>>>>
>>>> 2) Let me clarify the logic of the filters optimization in the case
>>>> of the LRU cache. It looks like Cache<RowData, Collection<RowData>>.
>>>> Here we always store the response of the dimension table in the
>>>> cache, even after applying the calc function. I.e. if there are no
>>>> rows left after applying filters to the result of the 'eval' method
>>>> of TableFunction, we store an empty list under the lookup keys.
>>>> Therefore the cache line will be filled, but will require much less
>>>> memory (in bytes). I.e. we don't completely filter out the keys
>>>> whose result was pruned, but we significantly reduce the memory
>>>> required to store that result. If the user knows about this
>>>> behavior, he can increase the 'max-rows' option before the start of
>>>> the job. But actually I came up with the idea that we can do this
>>>> automatically by using the 'maximumWeight' and 'weigher' methods of
>>>> the Guava cache [1]. The weight can be the size of the collection of
>>>> rows (the cache value). Therefore the cache can automatically fit
>>>> many more records than before.
>>>>
>>>>> Flink SQL has provided a standard way to do filters and projects
>>>>> pushdown, i.e., SupportsFilterPushDown and
>>>>> SupportsProjectionPushDown. Jdbc/hive/HBase haven't implemented the
>>>>> interfaces, but that doesn't mean they are hard to implement.
>>>>
>>>> It's debatable how difficult it will be to implement filter
>>>> pushdown. But I think the fact that currently there is no database
>>>> connector with filter pushdown at least means that this feature
>>>> won't be supported in connectors soon. Moreover, if we talk about
>>>> other connectors (not in the Flink repo), their databases might not
>>>> support all Flink filters (or not support filters at all). I think
>>>> users are interested in the cache filters optimization independently
>>>> of support for other features and of solving more complex (or
>>>> unsolvable) problems.
>>>>
>>>> 3) I agree with your third statement. Actually in our internal
>>>> version I also tried to unify the logic of scanning and reloading
>>>> data from connectors. But unfortunately, I didn't find a way to
>>>> unify the logic of all ScanRuntimeProviders (InputFormat,
>>>> SourceFunction, Source, ...) and reuse it in reloading the ALL
>>>> cache. As a result I settled on using InputFormat, because it was
>>>> used for scanning in all lookup connectors. (I didn't know that
>>>> there are plans to deprecate InputFormat in favor of the FLIP-27
>>>> Source.) IMO usage of the FLIP-27 source in ALL caching is not a
>>>> good idea, because this source was designed to work in a distributed
>>>> environment (SplitEnumerator on the JobManager and SourceReaders on
>>>> the TaskManagers), not in one operator (the lookup join operator in
>>>> our case). There is even no direct way to pass splits from
>>>> SplitEnumerator to SourceReader (this logic works through
>>>> SplitEnumeratorContext, which requires
>>>> OperatorCoordinator.SubtaskGateway to send AddSplitEvents). Usage of
>>>> InputFormat for the ALL cache seems much clearer and easier. But if
>>>> there are plans to refactor all connectors to FLIP-27, I have the
>>>> following idea: maybe we can drop the lookup join ALL cache in favor
>>>> of a simple join with multiple scans of the batch source? The point
>>>> is that the only difference between a lookup join ALL cache and a
>>>> simple join with a batch source is that in the first case scanning
>>>> is performed multiple times, in between which the state (cache) is
>>>> cleared (correct me if I'm wrong). So what if we extend the
>>>> functionality of the simple join to support state reloading + extend
>>>> the functionality of scanning the batch source multiple times (this
>>>> should be easy with the new FLIP-27 source, which unifies
>>>> streaming/batch reading - we would only need to change the
>>>> SplitEnumerator, which would pass the splits again after some TTL)?
>>>> WDYT? I must say that this looks like a long-term goal and would
>>>> make the scope of this FLIP even larger than you said. Maybe we can
>>>> limit ourselves to a simpler solution now (InputFormats).
>>>>
>>>> So to sum up, my points are:
>>>> 1) There is a way to make both concise and flexible interfaces for
>>>> caching in lookup join.
>>>> 2) The cache filters optimization is important both in LRU and ALL
>>>> caches.
>>>> 3) It is unclear when filter pushdown will be supported in Flink
>>>> connectors, and some connectors might not have the opportunity to
>>>> support filter pushdown + as far as I know, currently filter
>>>> pushdown works only for scanning (not lookup). So the cache filters
>>>> + projections optimization should be independent from other
>>>> features.
>>>> 4) The ALL cache realization is a complex topic that involves
>>>> multiple aspects of how Flink is developing. Dropping InputFormat in
>>>> favor of the FLIP-27 Source would make the ALL cache realization
>>>> really complex and unclear, so maybe instead of that we can extend
>>>> the functionality of the simple join, or keep InputFormat in the
>>>> case of the lookup join ALL cache?
>>>>
>>>> Best regards,
>>>> Smirnov Alexander
>>>>
>>>> [1] https://guava.dev/releases/18.0/api/docs/com/google/common/cache/CacheBuilder.html#weigher(com.google.common.cache.Weigher)
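[Editorial note] The maximumWeight/weigher idea referenced above can be illustrated with a toy sketch in plain Java (deliberately not Guava and not Flink code): weigh each cache entry by its row count, so that empty lookup results are nearly free and many more keys fit in the same budget. The class and its behavior are assumptions for illustration only.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy weight-bounded LRU cache illustrating the maximumWeight/weigher idea
// from the thread; not Guava, not Flink code.
public class WeightedLruCache<K, V> {
    private final long maxWeight;
    private long currentWeight = 0;
    // accessOrder=true gives LRU iteration order (eldest first)
    private final LinkedHashMap<K, List<V>> map = new LinkedHashMap<>(16, 0.75f, true);

    public WeightedLruCache(long maxWeight) { this.maxWeight = maxWeight; }

    /** Weight of an entry = number of rows; an empty lookup result still weighs 1. */
    private long weigh(List<V> rows) { return Math.max(1, rows.size()); }

    public void put(K key, List<V> rows) {
        List<V> old = map.remove(key);
        if (old != null) currentWeight -= weigh(old);
        map.put(key, rows);
        currentWeight += weigh(rows);
        // evict least-recently-used entries until total weight fits the budget
        Iterator<Map.Entry<K, List<V>>> it = map.entrySet().iterator();
        while (currentWeight > maxWeight && it.hasNext()) {
            currentWeight -= weigh(it.next().getValue());
            it.remove();
        }
    }

    public List<V> get(K key) { return map.get(key); }

    public int size() { return map.size(); }

    public static void main(String[] args) {
        WeightedLruCache<String, String> cache = new WeightedLruCache<>(5);
        cache.put("k1", new ArrayList<>());               // empty result: weight 1
        cache.put("k2", List.of("row1", "row2", "row3")); // weight 3 (total 4)
        cache.put("k3", List.of("row4", "row5"));         // weight 2 -> evicts LRU "k1"
        System.out.println(cache.size());
        System.out.println(cache.get("k1"));
    }
}
```

In Guava this would correspond to `CacheBuilder.maximumWeight(...)` with a `Weigher` that returns the row collection's size, as the linked Javadoc [1] describes.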
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> чт, 5 мая 2022 г. в 20:34, Jark Wu <im...@gmail.com>:
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> It's great to see the active discussion! I want to share
>>> my
>>>> ideas:
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 1) implement the cache in framework vs. connectors base
>>>>>>>>>>>>>>>>>> I don't have a strong opinion on this. Both ways should
>>>> work (e.g.,
>>>>>>>>>>>>>>> cache
>>>>>>>>>>>>>>>>>> pruning, compatibility).
>>>>>>>>>>>>>>>>>> The framework way can provide more concise interfaces.
>>>>>>>>>>>>>>>>>> The connector base way can define more flexible cache
>>>>>>>>>>>>>>>>>> strategies/implementations.
>>>>>>>>>>>>>>>>>> We are still investigating a way to see if we can have
>>> both
>>>>>>>>>>>>>>> advantages.
>>>>>>>>>>>>>>>>>> We should reach a consensus that the way should be a
>> final
>>>> state,
>>>>>>>>>>>>>>> and we
>>>>>>>>>>>>>>>>>> are on the path to it.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 2) filters and projections pushdown:
>>>>>>>>>>>>>>>>>> I agree with Alex that the filter pushdown into cache
>> can
>>>> benefit a
>>>>>>>>>>>>>>> lot
>>>>>>>>>>>>>>>>> for
>>>>>>>>>>>>>>>>>> ALL cache.
>>>>>>>>>>>>>>>>>> However, this is not true for LRU cache. Connectors use
>>>> cache to
>>>>>>>>>>>>>>> reduce
>>>>>>>>>>>>>>>>> IO
>>>>>>>>>>>>>>>>>> requests to databases for better throughput.
>>>>>>>>>>>>>>>>>> If a filter can prune 90% of data in the cache, we will
>>>> have 90% of
>>>>>>>>>>>>>>>>> lookup
>>>>>>>>>>>>>>>>>> requests that can never be cached
>>>>>>>>>>>>>>>>>> and hit the databases directly. That means the cache is
>>>>>>>>>>>>>>>>>> meaningless in this case.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> IMO, Flink SQL has provided a standard way to do filters
>>>> and projects
>>>>>>>>>>>>>>>>>> pushdown, i.e., SupportsFilterPushDown and
>>>>>>>>>>>>>>> SupportsProjectionPushDown.
>>>>>>>>>>>>>>>>>> JDBC/Hive/HBase haven't implemented the interfaces, but that
>>>>>>>>>>>>>>>>>> doesn't mean they are hard to implement.
>>>>>>>>>>>>>>>>>> They should implement the pushdown interfaces to reduce
>> IO
>>>> and the
>>>>>>>>>>>>>>> cache
>>>>>>>>>>>>>>>>>> size.
>>>>>>>>>>>>>>>>>> That should be a final state that the scan source and
>>>> lookup source
>>>>>>>>>>>>>>> share
>>>>>>>>>>>>>>>>>> the exact pushdown implementation.
>>>>>>>>>>>>>>>>>> I don't see why we need to duplicate the pushdown logic in
>>>>>>>>>>>>>>>>>> caches, which would complicate the lookup join design.
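Whether a filter makes an LRU cache useless depends in part on whether the "row was pruned" outcome is itself cached. A minimal sketch of that idea, with purely hypothetical names (this is not the FLIP's API), where an empty result is stored so that repeated lookups for pruned keys still hit the cache:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

/** Hypothetical sketch: "negative caching" of keys whose rows were all pruned. */
public class NegativeCachingLookup {
    private final Map<String, List<String>> cache = new HashMap<>();
    private final Function<String, List<String>> dbLookup;
    public final AtomicInteger dbCalls = new AtomicInteger();

    public NegativeCachingLookup(Function<String, List<String>> dbLookup) {
        this.dbLookup = dbLookup;
    }

    public List<String> lookup(String key) {
        // An empty list is a valid cached value: it records that the key
        // matched nothing (e.g. every row was dropped by the filter).
        List<String> hit = cache.get(key);
        if (hit != null) {
            return hit;
        }
        dbCalls.incrementAndGet();
        List<String> rows = dbLookup.apply(key);
        cache.put(key, rows);
        return rows;
    }
}
```

With this, each pruned key costs exactly one database request instead of one per lookup, which changes the trade-off discussed above.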
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 3) ALL cache abstraction
>>>>>>>>>>>>>>>>>> All cache might be the most challenging part of this
>> FLIP.
>>>> We have
>>>>>>>>>>>>>>> never
>>>>>>>>>>>>>>>>>> provided a reload-lookup public interface.
>>>>>>>>>>>>>>>>>> Currently, we put the reload logic in the "eval" method
>> of
>>>>>>>>>>>>>>> TableFunction.
>>>>>>>>>>>>>>>>>> That's hard for some sources (e.g., Hive).
>>>>>>>>>>>>>>>>>> Ideally, connector implementation should share the logic
>>> of
>>>> reload
>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>> scan, i.e. ScanTableSource with
>>>> InputFormat/SourceFunction/FLIP-27
>>>>>>>>>>>>>>>>> Source.
>>>>>>>>>>>>>>>>>> However, InputFormat/SourceFunction are deprecated, and
>>> the
>>>> FLIP-27
>>>>>>>>>>>>>>>>> source
>>>>>>>>>>>>>>>>>> is deeply coupled with SourceOperator.
>>>>>>>>>>>>>>>>>> If we want to invoke the FLIP-27 source in LookupJoin,
>>> this
>>>> may make
>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>> scope of this FLIP much larger.
>>>>>>>>>>>>>>>>>> We are still investigating how to abstract the ALL cache
>>>> logic and
>>>>>>>>>>>>>>> reuse
>>>>>>>>>>>>>>>>>> the existing source interfaces.
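The all-caching scenario described above — periodically loading the whole table and serving lookups from memory — can be sketched like this. The class is an illustration under assumed names, not a proposed interface:

```java
import java.util.Map;
import java.util.function.Supplier;

/** Hypothetical sketch of an "ALL" cache: the whole dimension table is loaded
 *  into memory and swapped atomically on each reload. */
public class AllCache<K, V> {
    private final Supplier<Map<K, V>> fullScan; // e.g. backed by an InputFormat
    private volatile Map<K, V> snapshot = Map.of();

    public AllCache(Supplier<Map<K, V>> fullScan) {
        this.fullScan = fullScan;
    }

    /** In a real operator this would run on a timer (the reload interval);
     *  lookups keep reading the old snapshot until the swap happens. */
    public void reload() {
        snapshot = Map.copyOf(fullScan.get());
    }

    public V lookup(K key) {
        return snapshot.get(key); // never touches the external system
    }
}
```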
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>>>> Jark
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> On Thu, 5 May 2022 at 20:22, Roman Boyko <
>>>> ro.v.boyko@gmail.com>
>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> It's a much more complicated activity and lies out of
>> the
>>>> scope of
>>>>>>>>>>>>>>> this
>>>>>>>>>>>>>>>>>>> improvement. Because such pushdowns should be done for
>>> all
>>>>>>>>>>>>>>>>> ScanTableSource
>>>>>>>>>>>>>>>>>>> implementations (not only for Lookup ones).
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> On Thu, 5 May 2022 at 19:02, Martijn Visser <
>>>>>>>>>>>>>>> martijnvisser@apache.org>
>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> Hi everyone,
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> One question regarding "And Alexander correctly
>>> mentioned
>>>> that
>>>>>>>>>>>>>>> filter
>>>>>>>>>>>>>>>>>>>> pushdown still is not implemented for
>> jdbc/hive/hbase."
>>>> -> Would
>>>>>>>>>>>>>>> an
>>>>>>>>>>>>>>>>>>>> alternative solution be to actually implement these
>>> filter
>>>>>>>>>>>>>>> pushdowns?
>>>>>>>>>>>>>>>>> I
>>>>>>>>>>>>>>>>>>>> can
>>>>>>>>>>>>>>>>>>>> imagine that there are many more benefits to doing
>> that,
>>>> outside
>>>>>>>>>>>>>>> of
>>>>>>>>>>>>>>>>> lookup
>>>>>>>>>>>>>>>>>>>> caching and metrics.
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> Martijn Visser
>>>>>>>>>>>>>>>>>>>> https://twitter.com/MartijnVisser82
>>>>>>>>>>>>>>>>>>>> https://github.com/MartijnVisser
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> On Thu, 5 May 2022 at 13:58, Roman Boyko <
>>>> ro.v.boyko@gmail.com>
>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> Hi everyone!
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> Thanks for driving such a valuable improvement!
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> I do think that single cache implementation would be
>> a
>>>> nice
>>>>>>>>>>>>>>>>> opportunity
>>>>>>>>>>>>>>>>>>>> for
>>>>>>>>>>>>>>>>>>>>> users. And it will break the "FOR SYSTEM_TIME AS OF
>>>> proc_time"
>>>>>>>>>>>>>>>>> semantics
>>>>>>>>>>>>>>>>>>>>> anyway - doesn't matter how it will be implemented.
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> Putting myself in the user's shoes, I can say that:
>>>>>>>>>>>>>>>>>>>>> 1) I would prefer to have the opportunity to cut off
>>> the
>>>> cache
>>>>>>>>>>>>>>> size
>>>>>>>>>>>>>>>>> by
>>>>>>>>>>>>>>>>>>>>> simply filtering unnecessary data. And the most handy
>>>> way to do
>>>>>>>>>>>>>>> it
>>>>>>>>>>>>>>>>> is
>>>>>>>>>>>>>>>>>>>> apply
>>>>>>>>>>>>>>>>>>>>> it inside LookupRunners. It would be a bit harder to
>>>> pass it
>>>>>>>>>>>>>>>>> through the
>>>>>>>>>>>>>>>>>>>>> LookupJoin node to TableFunction. And Alexander
>>> correctly
>>>>>>>>>>>>>>> mentioned
>>>>>>>>>>>>>>>>> that
>>>>>>>>>>>>>>>>>>>>> filter pushdown still is not implemented for
>>>> jdbc/hive/hbase.
>>>>>>>>>>>>>>>>>>>>> 2) The ability to set the different caching
>> parameters
>>>> for
>>>>>>>>>>>>>>> different
>>>>>>>>>>>>>>>>>>>> tables
>>>>>>>>>>>>>>>>>>>>> is quite important. So I would prefer to set it
>> through
>>>> DDL
>>>>>>>>>>>>>>> rather
>>>>>>>>>>>>>>>>> than
>>>>>>>>>>>>>>>>>>>>> have the same TTL, strategy and other options for all
>>>>>>>>>>>>>>>>>>>>> lookup tables.
>>>>>>>>>>>>>>>>>>>>> 3) Providing the cache into the framework really
>>>> deprives us of
>>>>>>>>>>>>>>>>>>>>> extensibility (users won't be able to implement their
>>> own
>>>>>>>>>>>>>>> cache).
>>>>>>>>>>>>>>>>> But
>>>>>>>>>>>>>>>>>>>> most
>>>>>>>>>>>>>>>>>>>>> probably it might be solved by creating more
>> different
>>>> cache
>>>>>>>>>>>>>>>>> strategies
>>>>>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>>>>> a wider set of configurations.
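The per-table configuration in point 2 boils down to parsing each table's DDL options into its own cache config. A rough sketch, assuming option names in the style of the discussion ('lookup.cache.max-rows' etc. — the exact keys and defaults here are illustrative, not the final FLIP option set):

```java
import java.time.Duration;
import java.util.Map;

/** Hypothetical sketch: per-table cache settings derived from DDL options,
 *  so two lookup tables can carry different caches. */
public class LookupCacheOptions {
    public final long maxRows;
    public final Duration ttl;
    public final String strategy; // e.g. "LRU" or "ALL"

    private LookupCacheOptions(long maxRows, Duration ttl, String strategy) {
        this.maxRows = maxRows;
        this.ttl = ttl;
        this.strategy = strategy;
    }

    public static LookupCacheOptions fromDdl(Map<String, String> tableOptions) {
        long maxRows =
                Long.parseLong(tableOptions.getOrDefault("lookup.cache.max-rows", "10000"));
        Duration ttl =
                Duration.parse(tableOptions.getOrDefault("lookup.cache.ttl", "PT10M"));
        String strategy = tableOptions.getOrDefault("lookup.cache.strategy", "LRU");
        return new LookupCacheOptions(maxRows, ttl, strategy);
    }
}
```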
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> All these points are much closer to the schema
>> proposed
>>>> by
>>>>>>>>>>>>>>>>> Alexander.
>>>>>>>>>>>>>>>>>>>>> Qingshen Ren, please correct me if I'm not right and
>>> all
>>>> these
>>>>>>>>>>>>>>>>>>>> facilities
>>>>>>>>>>>>>>>>>>>>> might be simply implemented in your architecture?
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>>>>>>>>> Roman Boyko
>>>>>>>>>>>>>>>>>>>>> e.: ro.v.boyko@gmail.com
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> On Wed, 4 May 2022 at 21:01, Martijn Visser <
>>>>>>>>>>>>>>>>> martijnvisser@apache.org>
>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> Hi everyone,
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> I don't have much to chip in, but just wanted to
>>>> express that
>>>>>>>>>>>>>>> I
>>>>>>>>>>>>>>>>> really
>>>>>>>>>>>>>>>>>>>>>> appreciate the in-depth discussion on this topic
>> and I
>>>> hope
>>>>>>>>>>>>>>> that
>>>>>>>>>>>>>>>>>>>> others
>>>>>>>>>>>>>>>>>>>>>> will join the conversation.
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> Martijn
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> On Tue, 3 May 2022 at 10:15, Александр Смирнов <
>>>>>>>>>>>>>>>>> smiralexan@gmail.com>
>>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> Hi Qingsheng, Leonard and Jark,
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> Thanks for your detailed feedback! However, I have
>>>> questions
>>>>>>>>>>>>>>>>> about
>>>>>>>>>>>>>>>>>>>>>>> some of your statements (maybe I didn't get
>>>> something?).
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> Caching actually breaks the semantic of "FOR
>>>> SYSTEM_TIME
>>>>>>>>>>>>>>> AS OF
>>>>>>>>>>>>>>>>>>>>>> proc_time”
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> I agree that the semantics of "FOR SYSTEM_TIME AS
>> OF
>>>>>>>>>>>>>>> proc_time"
>>>>>>>>>>>>>>>>> is
>>>>>>>>>>>>>>>>>>>> not
>>>>>>>>>>>>>>>>>>>>>>> fully implemented with caching, but as you said,
>>> users
>>>> go
>>>>>>>>>>>>>>> on it
>>>>>>>>>>>>>>>>>>>>>>> consciously to achieve better performance (no one
>>>> proposed
>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>> enable
>>>>>>>>>>>>>>>>>>>>>>> caching by default, etc.). Or by users do you mean
>>>> other
>>>>>>>>>>>>>>>>> developers
>>>>>>>>>>>>>>>>>>>> of
>>>>>>>>>>>>>>>>>>>>>>> connectors? In this case developers explicitly
>>> specify
>>>>>>>>>>>>>>> whether
>>>>>>>>>>>>>>>>> their
>>>>>>>>>>>>>>>>>>>>>>> connector supports caching or not (in the list of
>>>> supported
>>>>>>>>>>>>>>>>>>>> options),
>>>>>>>>>>>>>>>>>>>>>>> no one makes them do that if they don't want to. So
>>>> what
>>>>>>>>>>>>>>>>> exactly is
>>>>>>>>>>>>>>>>>>>>>>> the difference between implementing caching in
>>> modules
>>>>>>>>>>>>>>>>>>>>>>> flink-table-runtime and in flink-table-common from
>>> the
>>>>>>>>>>>>>>>>> considered
>>>>>>>>>>>>>>>>>>>>>>> point of view? How does it affect whether the semantics
>>>>>>>>>>>>>>>>>>>>>>> of "FOR SYSTEM_TIME AS OF proc_time" are broken or not?
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> confront a situation that allows table options in
>>> DDL
>>>> to
>>>>>>>>>>>>>>>>> control
>>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>> behavior of the framework, which has never happened
>>>>>>>>>>>>>>> previously
>>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>>>>> should
>>>>>>>>>>>>>>>>>>>>>>> be cautious
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> If we talk about main differences of semantics of
>> DDL
>>>>>>>>>>>>>>> options
>>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>>>>>>> config options("table.exec.xxx"), isn't it about
>>>> limiting
>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>> scope
>>>>>>>>>>>>>>>>>>>> of
>>>>>>>>>>>>>>>>>>>>>>> the options + importance for the user business
>> logic
>>>> rather
>>>>>>>>>>>>>>> than
>>>>>>>>>>>>>>>>>>>>>>> specific location of corresponding logic in the
>>>> framework? I
>>>>>>>>>>>>>>>>> mean
>>>>>>>>>>>>>>>>>>>> that
>>>>>>>>>>>>>>>>>>>>>>> in my design, for example, putting an option with
>>>> lookup
>>>>>>>>>>>>>>> cache
>>>>>>>>>>>>>>>>>>>>>>> strategy in configurations would be the wrong
>>>> decision,
>>>>>>>>>>>>>>>>> because it
>>>>>>>>>>>>>>>>>>>>>>> directly affects the user's business logic (not
>> just
>>>>>>>>>>>>>>> performance
>>>>>>>>>>>>>>>>>>>>>>> optimization) + touches just several functions of
>> ONE
>>>> table
>>>>>>>>>>>>>>>>> (there
>>>>>>>>>>>>>>>>>>>> can
>>>>>>>>>>>>>>>>>>>>>>> be multiple tables with different caches). Does it
>>>> really
>>>>>>>>>>>>>>>>> matter for
>>>>>>>>>>>>>>>>>>>>>>> the user (or someone else) where the logic is
>>> located,
>>>>>>>>>>>>>>> which is
>>>>>>>>>>>>>>>>>>>>>>> affected by the applied option?
>>>>>>>>>>>>>>>>>>>>>>> Also I can remember DDL option 'sink.parallelism',
>>>> which in
>>>>>>>>>>>>>>>>> some way
>>>>>>>>>>>>>>>>>>>>>>> "controls the behavior of the framework" and I
>> don't
>>>> see any
>>>>>>>>>>>>>>>>> problem
>>>>>>>>>>>>>>>>>>>>>>> here.
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> introduce a new interface for this all-caching
>>>> scenario
>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>> design
>>>>>>>>>>>>>>>>>>>>>>> would become more complex
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> This is a subject for a separate discussion, but
>>>> actually
>>>>>>>>>>>>>>> in our
>>>>>>>>>>>>>>>>>>>>>>> internal version we solved this problem quite
>> easily
>>> -
>>>> we
>>>>>>>>>>>>>>> reused
>>>>>>>>>>>>>>>>>>>>>>> InputFormat class (so there is no need for a new
>>> API).
>>>> The
>>>>>>>>>>>>>>>>> point is
>>>>>>>>>>>>>>>>>>>>>>> that currently all lookup connectors use
>> InputFormat
>>>> for
>>>>>>>>>>>>>>>>> scanning
>>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>> data in batch mode: HBase, JDBC and even Hive - it
>>> uses
>>>>>>>>>>>>>>> class
>>>>>>>>>>>>>>>>>>>>>>> PartitionReader, that is actually just a wrapper
>>> around
>>>>>>>>>>>>>>>>> InputFormat.
>>>>>>>>>>>>>>>>>>>>>>> The advantage of this solution is the ability to
>>> reload
>>>>>>>>>>>>>>> cache
>>>>>>>>>>>>>>>>> data
>>>>>>>>>>>>>>>>>>>> in
>>>>>>>>>>>>>>>>>>>>>>> parallel (number of threads depends on number of
>>>>>>>>>>>>>>> InputSplits,
>>>>>>>>>>>>>>>>> but
>>>>>>>>>>>>>>>>>>>> has
>>>>>>>>>>>>>>>>>>>>>>> an upper limit). As a result, the cache reload time is
>>>>>>>>>>>>>>>>>>>>>>> significantly reduced (as is the input stream blocking
>>>>>>>>>>>>>>>>>>>>>>> time). I know
>>> that
>>>>>>>>>>>>>>> usually
>>>>>>>>>>>>>>>>> we
>>>>>>>>>>>>>>>>>>>> try
>>>>>>>>>>>>>>>>>>>>>>> to avoid usage of concurrency in Flink code, but
>>> maybe
>>>> this
>>>>>>>>>>>>>>> one
>>>>>>>>>>>>>>>>> can
>>>>>>>>>>>>>>>>>>>> be
>>>>>>>>>>>>>>>>>>>>>>> an exception. BTW I don't say that it's an ideal
>>>> solution,
>>>>>>>>>>>>>>> maybe
>>>>>>>>>>>>>>>>>>>> there
>>>>>>>>>>>>>>>>>>>>>>> are better ones.
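The parallel reload described above — one reader task per InputSplit, merged into a shared map — can be sketched as follows. The names and the generic split type are illustrative, assuming nothing beyond the JDK:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

/** Hypothetical sketch of a parallel ALL-cache reload over input splits. */
public class ParallelReloader {
    public static <S, K, V> Map<K, V> reload(
            List<S> splits, Function<S, Map<K, V>> readSplit, int parallelism) {
        Map<K, V> result = new ConcurrentHashMap<>();
        ExecutorService pool =
                Executors.newFixedThreadPool(Math.max(1, Math.min(parallelism, splits.size())));
        try {
            List<Future<?>> futures = new ArrayList<>();
            for (S split : splits) {
                // one task per split, like one reader per InputSplit
                futures.add(pool.submit(() -> result.putAll(readSplit.apply(split))));
            }
            for (Future<?> f : futures) {
                try {
                    f.get(); // wait for all splits; propagate read failures
                } catch (Exception e) {
                    throw new RuntimeException("split read failed", e);
                }
            }
        } finally {
            pool.shutdown();
        }
        return result;
    }
}
```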
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> Providing the cache in the framework might
>> introduce
>>>>>>>>>>>>>>>>> compatibility
>>>>>>>>>>>>>>>>>>>>>> issues
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> It's possible only in cases when the developer of
>> the
>>>>>>>>>>>>>>> connector
>>>>>>>>>>>>>>>>>>>> won't
>>>>>>>>>>>>>>>>>>>>>>> properly refactor his code and will use new cache
>>>> options
>>>>>>>>>>>>>>>>>>>> incorrectly
>>>>>>>>>>>>>>>>>>>>>>> (i.e. explicitly provide the same options into 2
>>>> different
>>>>>>>>>>>>>>> code
>>>>>>>>>>>>>>>>>>>>>>> places). For correct behavior all he will need to
>> do
>>>> is to
>>>>>>>>>>>>>>>>> redirect
>>>>>>>>>>>>>>>>>>>>>>> existing options to the framework's LookupConfig (+
>>>> maybe
>>>>>>>>>>>>>>> add an
>>>>>>>>>>>>>>>>>>>> alias
>>>>>>>>>>>>>>>>>>>>>>> for options, if there was different naming),
>>> everything
>>>>>>>>>>>>>>> will be
>>>>>>>>>>>>>>>>>>>>>>> transparent for users. If the developer won't do
>>>>>>>>>>>>>>> refactoring at
>>>>>>>>>>>>>>>>> all,
>>>>>>>>>>>>>>>>>>>>>>> nothing will be changed for the connector because
>> of
>>>>>>>>>>>>>>> backward
>>>>>>>>>>>>>>>>>>>>>>> compatibility. Also if a developer wants to use his
>>> own
>>>>>>>>>>>>>>> cache
>>>>>>>>>>>>>>>>> logic,
>>>>>>>>>>>>>>>>>>>>>>> he just can refuse to pass some of the configs into
>>> the
>>>>>>>>>>>>>>>>> framework,
>>>>>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>>>>>>> instead make his own implementation with already
>>>> existing
>>>>>>>>>>>>>>>>> configs
>>>>>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>>>>>>> metrics (but actually I think that it's a rare
>> case).
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> filters and projections should be pushed all the
>> way
>>>> down
>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>> table
>>>>>>>>>>>>>>>>>>>>>>> function, like what we do in the scan source
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> It's the great purpose. But the truth is that the
>>> ONLY
>>>>>>>>>>>>>>> connector
>>>>>>>>>>>>>>>>>>>> that
>>>>>>>>>>>>>>>>>>>>>>> supports filter pushdown is FileSystemTableSource
>>>>>>>>>>>>>>>>>>>>>>> (no database connector supports it currently). Also
>>>> for some
>>>>>>>>>>>>>>>>>>>> databases
>>>>>>>>>>>>>>>>>>>>>>> it's simply impossible to pushdown such complex
>>> filters
>>>>>>>>>>>>>>> that we
>>>>>>>>>>>>>>>>> have
>>>>>>>>>>>>>>>>>>>>>>> in Flink.
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> only applying these optimizations to the cache
>> seems
>>>> not
>>>>>>>>>>>>>>>>> quite
>>>>>>>>>>>>>>>>>>>>> useful
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> Filters can cut off an arbitrarily large amount of
>>> data
>>>>>>>>>>>>>>> from the
>>>>>>>>>>>>>>>>>>>>>>> dimension table. For a simple example, suppose in
>>>> dimension
>>>>>>>>>>>>>>>>> table
>>>>>>>>>>>>>>>>>>>>>>> 'users'
>>>>>>>>>>>>>>>>>>>>>>> we have column 'age' with values from 20 to 40, and
>>>> input
>>>>>>>>>>>>>>> stream
>>>>>>>>>>>>>>>>>>>>>>> 'clicks' that is ~uniformly distributed by age of
>>>> users. If
>>>>>>>>>>>>>>> we
>>>>>>>>>>>>>>>>> have
>>>>>>>>>>>>>>>>>>>>>>> filter 'age > 30',
>>>>>>>>>>>>>>>>>>>>>>> there will be twice less data in cache. This means
>>> the
>>>> user
>>>>>>>>>>>>>>> can
>>>>>>>>>>>>>>>>>>>>>>> increase 'lookup.cache.max-rows' by almost 2 times,
>>>>>>>>>>>>>>>>>>>>>>> which will yield a huge performance boost. Moreover,
>>>>>>>>>>>>>>>>>>>>>>> this optimization really starts to shine
>>>>>>>>>>>>>>>>>>>>>>> in 'ALL' cache, where tables without filters and
>>>> projections
>>>>>>>>>>>>>>>>> can't
>>>>>>>>>>>>>>>>>>>> fit
>>>>>>>>>>>>>>>>>>>>>>> in memory, but with them - can. This opens up
>>>> additional
>>>>>>>>>>>>>>>>>>>> possibilities
>>>>>>>>>>>>>>>>>>>>>>> for users. And this doesn't sound as 'not quite
>>>> useful'.
>>>>>>>>>>>>>>>>>>>>>>> 
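The 'age > 30' example above can be checked with a one-line count: with ages uniform over [20, 40], the filter keeps roughly half of the rows, so the same max-rows budget covers about twice as many live entries (a toy illustration only):

```java
import java.util.function.IntPredicate;
import java.util.stream.IntStream;

/** Toy illustration of filter selectivity for the uniform-age example. */
public class FilterSelectivity {
    public static long cachedRows(int ageFrom, int ageTo, IntPredicate filter) {
        // count how many distinct ages survive the filter
        return IntStream.rangeClosed(ageFrom, ageTo).filter(filter).count();
    }
}
```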
>>>>>>>>>>>>>>>>>>>>>>> It would be great to hear other voices regarding
>> this
>>>> topic!
>>>>>>>>>>>>>>>>> Because
>>>>>>>>>>>>>>>>>>>>>>> we have quite a lot of controversial points, and I
>>>> think
>>>>>>>>>>>>>>> with
>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>> help
>>>>>>>>>>>>>>>>>>>>>>> of others it will be easier for us to come to a
>>>> consensus.
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>>>>>>>>>>> Smirnov Alexander
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> Fri, 29 Apr 2022 at 22:33, Qingsheng Ren <
>>>>>>>>>>>>>>> renqschn@gmail.com
>>>>>>>>>>>>>>>>>> :
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> Hi Alexander and Arvid,
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> Thanks for the discussion and sorry for my late
>>>> response!
>>>>>>>>>>>>>>> We
>>>>>>>>>>>>>>>>> had
>>>>>>>>>>>>>>>>>>>> an
>>>>>>>>>>>>>>>>>>>>>>> internal discussion together with Jark and Leonard
>>> and
>>>> I’d
>>>>>>>>>>>>>>> like
>>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>>>>>> summarize our ideas. Instead of implementing the
>>> cache
>>>>>>>>>>>>>>> logic in
>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>> table
>>>>>>>>>>>>>>>>>>>>>>> runtime layer or wrapping around the user-provided
>>>> table
>>>>>>>>>>>>>>>>> function,
>>>>>>>>>>>>>>>>>>>> we
>>>>>>>>>>>>>>>>>>>>>>> prefer to introduce some new APIs extending
>>>> TableFunction
>>>>>>>>>>>>>>> with
>>>>>>>>>>>>>>>>> these
>>>>>>>>>>>>>>>>>>>>>>> concerns:
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> 1. Caching actually breaks the semantic of "FOR
>>>>>>>>>>>>>>> SYSTEM_TIME
>>>>>>>>>>>>>>>>> AS OF
>>>>>>>>>>>>>>>>>>>>>>> proc_time”, because it couldn’t truly reflect the
>>>> content
>>>>>>>>>>>>>>> of the
>>>>>>>>>>>>>>>>>>>> lookup
>>>>>>>>>>>>>>>>>>>>>>> table at the moment of querying. If users choose to
>>>> enable
>>>>>>>>>>>>>>>>> caching
>>>>>>>>>>>>>>>>>>>> on
>>>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>> lookup table, they implicitly indicate that this
>>>> breakage is
>>>>>>>>>>>>>>>>>>>> acceptable
>>>>>>>>>>>>>>>>>>>>>> in
>>>>>>>>>>>>>>>>>>>>>>> exchange for the performance. So we prefer not to
>>>> provide
>>>>>>>>>>>>>>>>> caching on
>>>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>> table runtime level.
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> 2. If we make the cache implementation in the
>>>> framework
>>>>>>>>>>>>>>>>> (whether
>>>>>>>>>>>>>>>>>>>> in a
>>>>>>>>>>>>>>>>>>>>>>> runner or a wrapper around TableFunction), we have
>> to
>>>>>>>>>>>>>>> confront a
>>>>>>>>>>>>>>>>>>>>>> situation
>>>>>>>>>>>>>>>>>>>>>>> that allows table options in DDL to control the
>>>> behavior of
>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>> framework,
>>>>>>>>>>>>>>>>>>>>>>> which has never happened previously and should be
>>>> cautious.
>>>>>>>>>>>>>>>>> Under
>>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>> current design the behavior of the framework should
>>>> only be
>>>>>>>>>>>>>>>>>>>> specified
>>>>>>>>>>>>>>>>>>>>> by
>>>>>>>>>>>>>>>>>>>>>>> configurations (“table.exec.xxx”), and it’s hard to
>>>> apply
>>>>>>>>>>>>>>> these
>>>>>>>>>>>>>>>>>>>> general
>>>>>>>>>>>>>>>>>>>>>>> configs to a specific table.
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> 3. We have use cases that lookup source loads and
>>>> refresh
>>>>>>>>>>>>>>> all
>>>>>>>>>>>>>>>>>>>> records
>>>>>>>>>>>>>>>>>>>>>>> periodically into the memory to achieve high lookup
>>>>>>>>>>>>>>> performance
>>>>>>>>>>>>>>>>>>>> (like
>>>>>>>>>>>>>>>>>>>>>> Hive
>>>>>>>>>>>>>>>>>>>>>>> connector in the community, and also widely used by
>>> our
>>>>>>>>>>>>>>> internal
>>>>>>>>>>>>>>>>>>>>>>> connectors). Wrapping the cache around the user’s
>>>>>>>>>>>>>>> TableFunction
>>>>>>>>>>>>>>>>>>>> works
>>>>>>>>>>>>>>>>>>>>>> fine
>>>>>>>>>>>>>>>>>>>>>>> for LRU caches, but I think we have to introduce a
>>> new
>>>>>>>>>>>>>>>>> interface for
>>>>>>>>>>>>>>>>>>>>> this
>>>>>>>>>>>>>>>>>>>>>>> all-caching scenario and the design would become
>> more
>>>>>>>>>>>>>>> complex.
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> 4. Providing the cache in the framework might
>>>> introduce
>>>>>>>>>>>>>>>>>>>> compatibility
>>>>>>>>>>>>>>>>>>>>>>> issues to existing lookup sources like there might
>>>> exist two
>>>>>>>>>>>>>>>>> caches
>>>>>>>>>>>>>>>>>>>>> with
>>>>>>>>>>>>>>>>>>>>>>> totally different strategies if the user
>> incorrectly
>>>>>>>>>>>>>>> configures
>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>> table
>>>>>>>>>>>>>>>>>>>>>>> (one in the framework and another implemented by
>> the
>>>> lookup
>>>>>>>>>>>>>>>>> source).
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> As for the optimization mentioned by Alexander, I
>>>> think
>>>>>>>>>>>>>>>>> filters
>>>>>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>>>>>>> projections should be pushed all the way down to
>> the
>>>> table
>>>>>>>>>>>>>>>>> function,
>>>>>>>>>>>>>>>>>>>>> like
>>>>>>>>>>>>>>>>>>>>>>> what we do in the scan source, instead of the
>> runner
>>>> with
>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>> cache.
>>>>>>>>>>>>>>>>>>>>> The
>>>>>>>>>>>>>>>>>>>>>>> goal of using cache is to reduce the network I/O
>> and
>>>>>>>>>>>>>>> pressure
>>>>>>>>>>>>>>>>> on the
>>>>>>>>>>>>>>>>>>>>>>> external system, and only applying these
>>> optimizations
>>>> to
>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>> cache
>>>>>>>>>>>>>>>>>>>>> seems
>>>>>>>>>>>>>>>>>>>>>>> not quite useful.
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> I made some updates to the FLIP[1] to reflect our
>>>> ideas.
>>>>>>>>>>>>>>> We
>>>>>>>>>>>>>>>>>>>> prefer to
>>>>>>>>>>>>>>>>>>>>>>> keep the cache implementation as a part of
>>>> TableFunction,
>>>>>>>>>>>>>>> and we
>>>>>>>>>>>>>>>>>>>> could
>>>>>>>>>>>>>>>>>>>>>>> provide some helper classes (CachingTableFunction,
>>>>>>>>>>>>>>>>>>>>>> AllCachingTableFunction,
>>>>>>>>>>>>>>>>>>>>>>> CachingAsyncTableFunction) to developers and
>> regulate
>>>>>>>>>>>>>>> metrics
>>>>>>>>>>>>>>>>> of the
>>>>>>>>>>>>>>>>>>>>>> cache.
>>>>>>>>>>>>>>>>>>>>>>> Also, I made a POC[2] for your reference.
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> Looking forward to your ideas!
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> [1]
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>> 
>>> 
>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-221+Abstraction+for+lookup+source+cache+and+metric
>>>>>>>>>>>>>>>>>>>>>>>> [2]
>>> https://github.com/PatrickRen/flink/tree/FLIP-221
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> Qingsheng
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> On Tue, Apr 26, 2022 at 4:45 PM Александр Смирнов
>> <
>>>>>>>>>>>>>>>>>>>>>> smiralexan@gmail.com>
>>>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for the response, Arvid!
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> I have few comments on your message.
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> but could also live with an easier solution as
>> the
>>>>>>>>>>>>>>> first
>>>>>>>>>>>>>>>>> step:
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> I think that these 2 ways are mutually exclusive
>>>>>>>>>>>>>>> (originally
>>>>>>>>>>>>>>>>>>>>> proposed
>>>>>>>>>>>>>>>>>>>>>>>>> by Qingsheng and mine), because conceptually they
>>>> follow
>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>> same
>>>>>>>>>>>>>>>>>>>>>>>>> goal, but implementation details are different.
>> If
>>> we
>>>>>>>>>>>>>>> will
>>>>>>>>>>>>>>>>> go one
>>>>>>>>>>>>>>>>>>>>> way,
>>>>>>>>>>>>>>>>>>>>>>>>> moving to another way in the future will mean
>>>> deleting
>>>>>>>>>>>>>>>>> existing
>>>>>>>>>>>>>>>>>>>> code
>>>>>>>>>>>>>>>>>>>>>>>>> and once again changing the API for connectors.
>> So
>>> I
>>>>>>>>>>>>>>> think we
>>>>>>>>>>>>>>>>>>>> should
>>>>>>>>>>>>>>>>>>>>>>>>> reach a consensus with the community about that
>> and
>>>> then
>>>>>>>>>>>>>>> work
>>>>>>>>>>>>>>>>>>>>> together
>>>>>>>>>>>>>>>>>>>>>>>>> on this FLIP, i.e. divide the work on tasks for
>>>> different
>>>>>>>>>>>>>>>>> parts
>>>>>>>>>>>>>>>>>>>> of
>>>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>>>> flip (for example, LRU cache unification /
>>>> introducing
>>>>>>>>>>>>>>>>> proposed
>>>>>>>>>>>>>>>>>>>> set
>>>>>>>>>>>>>>>>>>>>> of
>>>>>>>>>>>>>>>>>>>>>>>>> metrics / further work…). WDYT, Qingsheng?
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> as the source will only receive the requests
>> after
>>>>>>>>>>>>>>> filter
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> Actually if filters are applied to fields of the
>>>> lookup
>>>>>>>>>>>>>>>>> table, we
>>>>>>>>>>>>>>>>>>>>>>>>> firstly must do requests, and only after that we
>>> can
>>>>>>>>>>>>>>> filter
>>>>>>>>>>>>>>>>>>>>> responses,
>>>>>>>>>>>>>>>>>>>>>>>>> because lookup connectors don't have filter
>>>> pushdown. So
>>>>>>>>>>>>>>> if
>>>>>>>>>>>>>>>>>>>>> filtering
>>>>>>>>>>>>>>>>>>>>>>>>> is done before caching, there will be much less
>>> rows
>>>> in
>>>>>>>>>>>>>>>>> cache.
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> @Alexander unfortunately, your architecture is
>> not
>>>>>>>>>>>>>>> shared.
>>>>>>>>>>>>>>>>> I
>>>>>>>>>>>>>>>>>>>> don't
>>>>>>>>>>>>>>>>>>>>>>> know the
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> solution to share images to be honest.
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> Sorry for that, I’m a bit new to such kinds of
>>>>>>>>>>>>>>> conversations
>>>>>>>>>>>>>>>>> :)
>>>>>>>>>>>>>>>>>>>>>>>>> I have no write access to the confluence, so I
>>> made a
>>>>>>>>>>>>>>> Jira
>>>>>>>>>>>>>>>>> issue,
>>>>>>>>>>>>>>>>>>>>>>>>> where described the proposed changes in more
>>> details
>>>> -
>>>>>>>>>>>>>>>>>>>>>>>>> 
>> https://issues.apache.org/jira/browse/FLINK-27411.
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> Will happy to get more feedback!
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>>>>>>>>>>> Smirnov Alexander
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> Mon, 25 Apr 2022 at 19:49, Arvid Heise <
>>>>>>>>>>>>>>> arvid@apache.org>:
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> Hi Qingsheng,
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for driving this; the inconsistency was
>> not
>>>>>>>>>>>>>>>>> satisfying
>>>>>>>>>>>>>>>>>>>> for
>>>>>>>>>>>>>>>>>>>>>> me.
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> I second Alexander's idea though but could also
>>> live
>>>>>>>>>>>>>>> with
>>>>>>>>>>>>>>>>> an
>>>>>>>>>>>>>>>>>>>>> easier
>>>>>>>>>>>>>>>>>>>>>>>>>> solution as the first step: Instead of making
>>>> caching
>>>>>>>>>>>>>>> an
>>>>>>>>>>>>>>>>>>>>>>> implementation
>>>>>>>>>>>>>>>>>>>>>>>>>> detail of TableFunction X, rather devise a
>> caching
>>>>>>>>>>>>>>> layer
>>>>>>>>>>>>>>>>>>>> around X.
>>>>>>>>>>>>>>>>>>>>>> So
>>>>>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>>>>> proposal would be a CachingTableFunction that
>>>>>>>>>>>>>>> delegates to
>>>>>>>>>>>>>>>>> X in
>>>>>>>>>>>>>>>>>>>>> case
>>>>>>>>>>>>>>>>>>>>>>> of
>>>>>>>>>>>>>>>>>>>>>>>>>> misses and else manages the cache. Lifting it
>> into
>>>> the
>>>>>>>>>>>>>>>>> operator
>>>>>>>>>>>>>>>>>>>>>> model
>>>>>>>>>>>>>>>>>>>>>>> as
>>>>>>>>>>>>>>>>>>>>>>>>>> proposed would be even better but is probably
>>>>>>>>>>>>>>> unnecessary
>>>>>>>>>>>>>>>>> in
>>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>> first step
>>>>>>>>>>>>>>>>>>>>>>>>>> for a lookup source (as the source will only
>>> receive
>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>> requests
>>>>>>>>>>>>>>>>>>>>>>> after
>>>>>>>>>>>>>>>>>>>>>>>>>> filter; applying projection may be more
>>> interesting
>>>> to
>>>>>>>>>>>>>>> save
>>>>>>>>>>>>>>>>>>>>> memory).
>>>>>>>>>>>>>>>>>>>>>>>>>> 
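The caching layer proposed above — a wrapper that delegates to X only on misses — can be sketched with a plain LRU map. The name and the simplified signature are illustrative (the real CachingTableFunction would extend Flink's TableFunction); this only shows the delegation shape:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical sketch of a caching wrapper around a lookup function "X". */
public class CachingLookupWrapper<K, V> {
    private final Function<K, List<V>> delegate; // "X", e.g. the JDBC lookup
    private final Map<K, List<V>> lru;

    public CachingLookupWrapper(Function<K, List<V>> delegate, int maxRows) {
        this.delegate = delegate;
        // access-order LinkedHashMap gives a minimal LRU eviction policy
        this.lru = new LinkedHashMap<K, List<V>>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, List<V>> eldest) {
                return size() > maxRows;
            }
        };
    }

    public List<V> eval(K key) {
        List<V> cached = lru.get(key); // get() records the access for LRU order
        if (cached == null) {
            cached = delegate.apply(key); // miss: delegate to X
            lru.put(key, cached);         // put() may evict the eldest entry
        }
        return cached;
    }
}
```

Because the wrapper never looks inside the delegate, the same layer would work for any TableFunction, which is exactly the appeal of this design.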
>>>>>>>>>>>>>>>>>>>>>>>>>> Another advantage is that all the changes of
>> this
>>>> FLIP
>>>>>>>>>>>>>>>>> would be
>>>>>>>>>>>>>>>>>>>>>>> limited to
>>>>>>>>>>>>>>>>>>>>>>>>>> options, no need for new public interfaces.
>>>> Everything
>>>>>>>>>>>>>>> else
>>>>>>>>>>>>>>>>>>>>> remains
>>>>>>>>>>>>>>>>>>>>>> an
>>>>>>>>>>>>>>>>>>>>>>>>>> implementation of Table runtime. That means we
>> can
>>>>>>>>>>>>>>> easily
>> incorporate the optimization potential that Alexander pointed out later.
>>
>> @Alexander unfortunately, your architecture is not shared. I don't know
>> the solution to share images, to be honest.
>>
>> On Fri, Apr 22, 2022 at 5:04 PM Александр Смирнов <smiralexan@gmail.com> wrote:
>>> Hi Qingsheng! My name is Alexander, I'm not a committer yet, but I'd
>>> really like to become one. And this FLIP really interested me.
>>> Actually I have worked on a similar feature in my company's Flink
>>> fork, and we would like to share our thoughts on this and make the
>>> code open source.
>>>
>>> I think there is a better alternative than introducing an abstract
>>> class for TableFunction (CachingTableFunction). As you know,
>>> TableFunction exists in the flink-table-common module, which provides
>>> only an API for working with tables – it's very convenient for
>>> importing in connectors. In turn, CachingTableFunction contains logic
>>> for runtime execution, so this class and everything connected with it
>>> should be located in another module, probably in flink-table-runtime.
>>> But this would require connectors to depend on another module that
>>> contains a lot of runtime logic, which doesn't sound good.
>>>
>>> I suggest adding a new method 'getLookupConfig' to LookupTableSource
>>> or LookupRuntimeProvider to allow connectors to only pass
>>> configurations to the planner, so that they won't depend on the
>>> runtime implementation. Based on these configs the planner will
>>> construct a lookup join operator with the corresponding runtime logic
>>> (ProcessFunctions in module flink-table-runtime). The architecture
>>> looks like in the pinned image (the LookupConfig class there is
>>> actually your CacheConfig).
>>>
>>> The classes in flink-table-planner that will be responsible for this
>>> are CommonPhysicalLookupJoin and its inheritors.
>>> The current classes for lookup join in flink-table-runtime are
>>> LookupJoinRunner, AsyncLookupJoinRunner, LookupJoinRunnerWithCalc and
>>> AsyncLookupJoinRunnerWithCalc.
>>>
>>> I suggest adding classes LookupJoinCachingRunner,
>>> LookupJoinCachingRunnerWithCalc, etc.
>>>
>>> And here comes another, more powerful advantage of such a solution.
>>> If we have caching logic on a lower level, we can apply some
>>> optimizations to it. LookupJoinRunnerWithCalc was named like this
>>> because it uses the 'calc' function, which actually mostly consists
>>> of filters and projections.
>>>
>>> For example, in a join of table A with lookup table B with condition
>>> 'JOIN … ON A.id = B.id AND A.age = B.age + 10 WHERE B.salary > 1000',
>>> the 'calc' function will contain the filters A.age = B.age + 10 and
>>> B.salary > 1000.
>>>
>>> If we apply this function before storing records in the cache, the
>>> size of the cache will be significantly reduced: filters = avoid
>>> storing useless records in the cache, projections = reduce the
>>> records' size. So the initial max number of records in the cache can
>>> be increased by the user.
>>>
>>> What do you think about it?
>>>
>>> On 2022/04/19 02:47:11 Qingsheng Ren wrote:
>>>> Hi devs,
>>>>
>>>> Yuan and I would like to start a discussion about FLIP-221[1], which
>>>> introduces an abstraction of lookup table cache and its standard
>>>> metrics.
>>>>
>>>> Currently each lookup table source has to implement its own cache to
>>>> store lookup results, and there isn't a standard set of metrics for
>>>> users and developers to tune their jobs with lookup joins, which is
>>>> a quite common use case in Flink table / SQL.
>>>>
>>>> Therefore we propose some new APIs including cache, metrics, wrapper
>>>> classes of TableFunction and new table options. Please take a look
>>>> at the FLIP page [1] to get more details. Any suggestions and
>>>> comments would be appreciated!
>>>>
>>>> [1]
>>>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-221+Abstraction+for+lookup+source+cache+and+metric
>>>>
>>>> Best regards,
>>>>
>>>> Qingsheng
>>
>> --
>> Best Regards,
>>
>> Qingsheng Ren
>>
>> Real-time Computing Team
>> Alibaba Cloud
>>
>> Email: renqschn@gmail.com
>
> --
> Best regards,
> Roman Boyko
> e.: ro.v.boyko@gmail.com
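The calc-before-cache optimization quoted above (apply the filters and projections of the 'calc' function to the dimension-table response before it enters the cache) can be sketched in plain Java. All names here are illustrative, not Flink API:

```java
import java.util.List;
import java.util.function.Function;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Sketch of what a hypothetical LookupJoinCachingRunner could do with the
// "calc" parts: drop useless rows (filters) and shrink the surviving rows
// (projections) BEFORE the result is stored in the cache.
class CalcBeforeCache<R, P> {
    private final Predicate<R> filter;       // e.g. B.salary > 1000
    private final Function<R, P> projection; // keep only the joined columns

    CalcBeforeCache(Predicate<R> filter, Function<R, P> projection) {
        this.filter = filter;
        this.projection = projection;
    }

    // The value that would be cached for one lookup key: an empty list is a
    // valid (and very cheap) cache entry when all rows were pruned.
    List<P> toCacheValue(List<R> lookupResult) {
        return lookupResult.stream()
                .filter(filter)
                .map(projection)
                .collect(Collectors.toList());
    }
}
```

With the example condition above, rows with `salary <= 1000` never occupy cache memory, so the user can raise the cache's max-rows limit accordingly.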


Re: [DISCUSS] FLIP-221 Abstraction for lookup source cache and metric

Posted by Lincoln Lee <li...@gmail.com>.
Hi Jark,

Thanks for your reply!

Currently 'lookup.async' exists only in the HBase connector, and I have no idea
whether or when to remove it (we can discuss that in another issue for the
HBase connector after FLINK-27625 is done); let's just not add it as a common
option now.

Best,
Lincoln Lee


Jark Wu <im...@gmail.com> wrote on Tue, May 24, 2022 at 20:14:

> Hi Lincoln,
>
> I have taken a look at FLIP-234, and I agree with you that the connectors
> can
> provide both async and sync runtime providers simultaneously instead of one
> of them.
> At that point, "lookup.async" looks redundant. If this option is planned to
> be removed
> in the long term, I think it makes sense not to introduce it in this FLIP.
>
> Best,
> Jark
>
> On Tue, 24 May 2022 at 11:08, Lincoln Lee <li...@gmail.com> wrote:
>
> > Hi Qingsheng,
> >
> > Sorry for jumping into the discussion so late. It's a good idea that we
> > can have a common table option. I have one minor comment on 'lookup.async':
> > let's not make it a common option.
> >
> > The table layer abstracts both sync and async lookup capabilities;
> > connector implementers can choose one or both. In the case of implementing
> > only one capability (the status of most existing builtin connectors),
> > 'lookup.async' will not be used. And when a connector has both
> > capabilities, I think this choice is better made at the query level: for
> > example, the table planner can choose the physical implementation of async
> > or sync lookup based on its cost model, or users can give a query hint
> > based on their own better understanding. If there is also a common table
> > option 'lookup.async', it may confuse users in the long run.
> >
> > So, I prefer to leave the 'lookup.async' option in private place (for the
> > current hbase connector) and not turn it into a common option.
> >
> > WDYT?
> >
> > Best,
> > Lincoln Lee
> >
> >
> Qingsheng Ren <re...@gmail.com> wrote on Mon, May 23, 2022 at 14:54:
> >
> > > Hi Alexander,
> > >
> > > Thanks for the review! We recently updated the FLIP and you can find
> > > those changes in my latest email. Since some terminology has changed,
> > > I'll use the new concepts when replying to your comments.
> > >
> > > 1. Builder vs ‘of’
> > > I’m OK to use builder pattern if we have additional optional parameters
> > > for full caching mode (“rescan” previously). The schedule-with-delay
> idea
> > > looks reasonable to me, but I think we need to redesign the builder API
> > of
> > > full caching to make it more descriptive for developers. Would you mind
> > > sharing your ideas about the API? For accessing the FLIP workspace you
> > can
> > > just provide your account ID and ping any PMC member including Jark.
> > >
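A builder along the lines discussed in point 1 might look like the sketch below. It is purely illustrative: the class name, methods, and the one-day default interval (taken from the discussion) are assumptions, not the final FLIP API:

```java
import java.time.Duration;
import java.time.LocalTime;

// Hypothetical configuration object for "full caching" (rescan) mode.
// Optional parameters make a builder preferable to static of() factories.
final class FullCachingConfig {
    final Duration rescanInterval;
    final LocalTime rescanStartTime; // nullable: no fixed first-reload time

    private FullCachingConfig(Duration interval, LocalTime startTime) {
        this.rescanInterval = interval;
        this.rescanStartTime = startTime;
    }

    static Builder newBuilder() {
        return new Builder();
    }

    static final class Builder {
        // Per the discussion: without an explicit interval, rescan once a day.
        private Duration rescanInterval = Duration.ofDays(1);
        private LocalTime rescanStartTime;

        Builder rescanInterval(Duration interval) {
            this.rescanInterval = interval;
            return this;
        }

        Builder rescanStartTime(LocalTime startTime) {
            this.rescanStartTime = startTime;
            return this;
        }

        FullCachingConfig build() {
            return new FullCachingConfig(rescanInterval, rescanStartTime);
        }
    }
}
```

Adding a future option then only means adding one builder method instead of another `of()` overload.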
> > > 2. Common table options
> > > We have some discussions these days and propose to introduce 8 common
> > > table options about caching. It has been updated on the FLIP.
> > >
> > > 3. Retries
> > > I think we are on the same page :-)
> > >
> > > For your additional concerns:
> > > 1) The table option has been updated.
> > > 2) We got “lookup.cache” back for configuring whether to use partial or
> > > full caching mode.
> > >
> > > Best regards,
> > >
> > > Qingsheng
> > >
> > >
> > >
> > > > On May 19, 2022, at 17:25, Александр Смирнов <sm...@gmail.com>
> > > wrote:
> > > >
> > > > Also I have a few additions:
> > > > 1) maybe rename 'lookup.cache.maximum-size' to
> > > > 'lookup.cache.max-rows'? I think it will be clearer that we are talking
> > > > not about bytes but about the number of rows. Plus it fits better,
> > > > considering my optimization with filters.
> > > > 2) How will users enable rescanning? Are we going to separate caching
> > > > and rescanning from the options point of view? Like initially we had
> > > > one option 'lookup.cache' with values LRU / ALL. I think now we can
> > > > make a boolean option 'lookup.rescan'. RescanInterval can be
> > > > 'lookup.rescan.interval', etc.
> > > >
> > > > Best regards,
> > > > Alexander
> > > >
> > > > On Thu, 19 May 2022 at 14:50, Александр Смирнов <smiralexan@gmail.com> wrote:
> > > >>
> > > >> Hi Qingsheng and Jark,
> > > >>
> > > >> 1. Builders vs 'of'
> > > >> I understand that builders are used when we have multiple
> parameters.
> > > >> I suggested them because we could add parameters later. To prevent
> > > >> Builder for ScanRuntimeProvider from looking redundant I can suggest
> > > >> one more config now - "rescanStartTime".
> > > >> It's a time in UTC (LocalTime class) when the first reload of cache
> > > >> starts. This parameter can be thought of as 'initialDelay' (diff
> > > >> between current time and rescanStartTime) in method
> > > >> ScheduledExecutorService#scheduleWithFixedDelay [1]. It can be very
> > > >> useful when the dimension table is updated by some other scheduled
> job
> > > >> at a certain time. Or when the user simply wants the second scan
> > > >> (first cache reload) to be delayed. This option can be used even without
> > > >> 'rescanInterval' - in this case 'rescanInterval' will be one day.
> > > >> If you are fine with this option, I would be very glad if you would
> > > >> give me access to edit FLIP page, so I could add it myself
> > > >>
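The 'rescanStartTime' / initialDelay relationship described in point 1 can be sketched with plain JDK time types (the class and method names here are illustrative, not proposed API):

```java
import java.time.Duration;
import java.time.LocalTime;

class RescanDelay {
    // Computes the initial delay in seconds until the next occurrence of
    // rescanStartTime, given the current UTC time of day. This is the
    // 'initialDelay' argument one would pass to
    // ScheduledExecutorService#scheduleWithFixedDelay.
    static long initialDelaySeconds(LocalTime nowUtc, LocalTime rescanStartTime) {
        long diff = Duration.between(nowUtc, rescanStartTime).getSeconds();
        // If the start time has already passed today, schedule for tomorrow.
        return diff >= 0 ? diff : diff + Duration.ofDays(1).getSeconds();
    }
}
```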
> > > >> 2. Common table options
> > > >> I also think that FactoryUtil would be overloaded by all cache
> > > >> options. But maybe unify all suggested options, not only for default
> > > >> cache? I.e. class 'LookupOptions', that unifies default cache
> options,
> > > >> rescan options, 'async', 'maxRetries'. WDYT?
> > > >>
> > > >> 3. Retries
> > > >> I'm fine with suggestion close to RetryUtils#tryTimes(times, call)
> > > >>
> > > >> [1]
> > >
> >
> https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/ScheduledExecutorService.html#scheduleWithFixedDelay-java.lang.Runnable-long-long-java.util.concurrent.TimeUnit-
> > > >>
> > > >> Best regards,
> > > >> Alexander
> > > >>
> > > >> On Wed, 18 May 2022 at 16:04, Qingsheng Ren <re...@gmail.com> wrote:
> > > >>>
> > > >>> Hi Jark and Alexander,
> > > >>>
> > > >>> Thanks for your comments! I’m also OK to introduce common table
> > > options. I prefer to introduce a new DefaultLookupCacheOptions class
> for
> > > holding these option definitions because putting all options into
> > > FactoryUtil would make it a bit "crowded" and not well categorized.
> > > >>>
> > > >>> FLIP has been updated according to suggestions above:
> > > >>> 1. Use static “of” method for constructing RescanRuntimeProvider
> > > considering both arguments are required.
> > > >>> 2. Introduce new table options matching DefaultLookupCacheFactory
> > > >>>
> > > >>> Best,
> > > >>> Qingsheng
> > > >>>
> > > >>> On Wed, May 18, 2022 at 2:57 PM Jark Wu <im...@gmail.com> wrote:
> > > >>>>
> > > >>>> Hi Alex,
> > > >>>>
> > > >>>> 1) retry logic
> > > >>>> I think we can extract some common retry logic into utilities,
> e.g.
> > > RetryUtils#tryTimes(times, call).
> > > >>>> This seems independent of this FLIP and can be reused by
> DataStream
> > > users.
> > > >>>> Maybe we can open an issue to discuss this and where to put it.
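A minimal JDK-only sketch of what a shared `RetryUtils#tryTimes(times, call)` could look like; the exact signature and home of the utility are up to the JIRA discussion mentioned above:

```java
import java.util.concurrent.Callable;

class RetryUtils {
    // Invokes `call` up to `times` attempts and returns the first successful
    // result; if every attempt fails, the last exception is rethrown.
    static <T> T tryTimes(int times, Callable<T> call) throws Exception {
        Exception last = null;
        for (int attempt = 1; attempt <= times; attempt++) {
            try {
                return call.call();
            } catch (Exception e) {
                last = e;
            }
        }
        throw last;
    }
}
```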
> > > >>>>
> > > >>>> 2) cache ConfigOptions
> > > >>>> I'm fine with defining cache config options in the framework.
> > > >>>> A candidate place to put is FactoryUtil which also includes
> > > "sink.parallelism", "format" options.
> > > >>>>
> > > >>>> Best,
> > > >>>> Jark
> > > >>>>
> > > >>>>
> > > >>>> On Wed, 18 May 2022 at 13:52, Александр Смирнов <
> > smiralexan@gmail.com>
> > > wrote:
> > > >>>>>
> > > >>>>> Hi Qingsheng,
> > > >>>>>
> > > >>>>> Thank you for considering my comments.
> > > >>>>>
> > > >>>>>> there might be custom logic before making retry, such as
> > > re-establish the connection
> > > >>>>>
> > > >>>>> Yes, I understand that. I meant that such logic can be placed in
> a
> > > >>>>> separate function, that can be implemented by connectors. Just
> > moving
> > > >>>>> the retry logic would make connector's LookupFunction more
> concise
> > +
> > > >>>>> avoid duplicate code. However, it's a minor change. The decision
> is
> > > up
> > > >>>>> to you.
> > > >>>>>
> > > >>>>>> We decide not to provide common DDL options and let developers
> to
> > > define their own options as we do now per connector.
> > > >>>>>
> > > >>>>> What is the reason for that? One of the main goals of this FLIP
> was
> > > to
> > > >>>>> unify the configs, wasn't it? I understand that current cache
> > design
> > > >>>>> doesn't depend on ConfigOptions, like was before. But still we
> can
> > > put
> > > >>>>> these options into the framework, so connectors can reuse them
> and
> > > >>>>> avoid code duplication, and, what is more significant, avoid
> > possible
> > > >>>>> different option naming. This can be pointed out in the
> > > >>>>> documentation for connector developers.
> > > >>>>>
> > > >>>>> Best regards,
> > > >>>>> Alexander
> > > >>>>>
> > > >>>>> On Tue, 17 May 2022 at 17:11, Qingsheng Ren <re...@gmail.com> wrote:
> > > >>>>>>
> > > >>>>>> Hi Alexander,
> > > >>>>>>
> > > >>>>>> Thanks for the review and glad to see we are on the same page! I
> > > think you forgot to cc the dev mailing list so I’m also quoting your
> > reply
> > > under this email.
> > > >>>>>>
> > > >>>>>>> We can add 'maxRetryTimes' option into this class
> > > >>>>>>
> > > >>>>>> In my opinion the retry logic should be implemented in lookup()
> > > instead of in LookupFunction#eval(). Retrying is only meaningful under
> > some
> > > specific retriable failures, and there might be custom logic before
> > making
> > > retry, such as re-establish the connection (JdbcRowDataLookupFunction
> is
> > an
> > > example), so it's more handy to leave it to the connector.
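To illustrate that point, here is a minimal self-contained sketch (NOT the Flink API; all names are illustrative) where the connector-specific `lookup()` owns retries and reconnection internally, while `eval()` only adapts the result:

```java
import java.io.IOException;
import java.util.Collection;
import java.util.Collections;
import java.util.List;

// Sketch of the proposed contract: the framework calls eval(), connectors
// implement lookup() and keep retry/connection handling inside it.
abstract class SketchLookupFunction<K, R> {
    public abstract Collection<R> lookup(K key) throws IOException;

    public Collection<R> eval(K key) {
        try {
            Collection<R> rows = lookup(key);
            return rows == null ? Collections.emptyList() : rows;
        } catch (IOException e) {
            throw new RuntimeException("Lookup failed for key " + key, e);
        }
    }
}

// Example connector: retries internally, pretending to re-establish a
// connection between attempts (mirroring what JdbcRowDataLookupFunction
// does with its JDBC connection).
class RetryingLookup extends SketchLookupFunction<Integer, String> {
    private int transientFailuresLeft = 2;

    @Override
    public Collection<String> lookup(Integer key) throws IOException {
        for (int attempt = 0; attempt < 3; attempt++) {
            if (transientFailuresLeft > 0) {
                transientFailuresLeft--; // a real connector would reconnect here
                continue;
            }
            return List.of("row-" + key);
        }
        throw new IOException("exhausted retries");
    }
}
```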
> > > >>>>>>
> > > >>>>>>> I don't see DDL options, that were in previous version of FLIP.
> > Do
> > > you have any special plans for them?
> > > >>>>>>
> > > >>>>>> We decided not to provide common DDL options and to let
> > > >>>>>> developers define their own options per connector, as we do now.
> > > >>>>>>
> > > >>>>>> The rest of comments sound great and I’ll update the FLIP. Hope
> we
> > > can finalize our proposal soon!
> > > >>>>>>
> > > >>>>>> Best,
> > > >>>>>>
> > > >>>>>> Qingsheng
> > > >>>>>>
> > > >>>>>>
> > > >>>>>>> On May 17, 2022, at 13:46, Александр Смирнов <
> > smiralexan@gmail.com>
> > > wrote:
> > > >>>>>>>
> > > >>>>>>> Hi Qingsheng and devs!
> > > >>>>>>>
> > > >>>>>>> I like the overall design of the updated FLIP; however, I have
> > > >>>>>>> several suggestions and questions.
> > > >>>>>>>
> > > >>>>>>> 1) Introducing LookupFunction as a subclass of TableFunction
> is a
> > > good
> > > >>>>>>> idea. We can add 'maxRetryTimes' option into this class. 'eval'
> > > method
> > > >>>>>>> of new LookupFunction is great for this purpose. The same is
> for
> > > >>>>>>> 'async' case.
> > > >>>>>>>
> > > >>>>>>> 2) There might be other configs in future, such as
> > > 'cacheMissingKey'
> > > >>>>>>> in LookupFunctionProvider or 'rescanInterval' in
> > > ScanRuntimeProvider.
> > > >>>>>>> Maybe use Builder pattern in LookupFunctionProvider and
> > > >>>>>>> RescanRuntimeProvider for more flexibility (use one 'build'
> > method
> > > >>>>>>> instead of many 'of' methods in future)?
> > > >>>>>>>
> > > >>>>>>> 3) What are the plans for existing TableFunctionProvider and
> > > >>>>>>> AsyncTableFunctionProvider? I think they should be deprecated.
> > > >>>>>>>
> > > >>>>>>> 4) Am I right that the current design does not assume usage of
> > > >>>>>>> user-provided LookupCache in re-scanning? In this case, it is
> not
> > > very
> > > >>>>>>> clear why do we need methods such as 'invalidate' or 'putAll'
> in
> > > >>>>>>> LookupCache.
> > > >>>>>>>
> > > >>>>>>> 5) I don't see DDL options, that were in previous version of
> > FLIP.
> > > Do
> > > >>>>>>> you have any special plans for them?
> > > >>>>>>>
> > > >>>>>>> If you don't mind, I would be glad to be able to make small
> > > >>>>>>> adjustments to the FLIP document too. I think it's worth
> > mentioning
> > > >>>>>>> about what exactly optimizations are planning in the future.
> > > >>>>>>>
> > > >>>>>>> Best regards,
> > > >>>>>>> Smirnov Alexander
> > > >>>>>>>
> > > >>>>>>> On Fri, 13 May 2022 at 20:27, Qingsheng Ren <renqschn@gmail.com> wrote:
> > > >>>>>>>>
> > > >>>>>>>> Hi Alexander and devs,
> > > >>>>>>>>
> > > >>>>>>>> Thank you very much for the in-depth discussion! As Jark
> > > mentioned, we were inspired by Alexander's idea and refactored our
> > > design. FLIP-221 [1] has been updated to reflect the new design, and we
> > > are happy to hear more suggestions from you!
> > > >>>>>>>>
> > > >>>>>>>> Compared to the previous design:
> > > >>>>>>>> 1. The lookup cache serves at table runtime level and is
> > > integrated as a component of LookupJoinRunner as discussed previously.
> > > >>>>>>>> 2. Interfaces are renamed and re-designed to reflect the new
> > > design.
> > > >>>>>>>> 3. We separate the all-caching case individually and
> introduce a
> > > new RescanRuntimeProvider to reuse the ability of scanning. We are
> > planning
> > > to support SourceFunction / InputFormat for now considering the
> > complexity
> > > of FLIP-27 Source API.
> > > >>>>>>>> 4. A new interface LookupFunction is introduced to make the
> > > semantic of lookup more straightforward for developers.
> > > >>>>>>>>
> > > >>>>>>>> For replying to Alexander:
> > > >>>>>>>>> However I'm a little confused whether InputFormat is
> deprecated
> > > or not. Am I right that it will be so in the future, but currently it's
> > not?
> > > >>>>>>>> Yes you are right. InputFormat is not deprecated for now. I
> > think
> > > it will be deprecated in the future but we don't have a clear plan for
> > that.
> > > >>>>>>>>
> > > >>>>>>>> Thanks again for the discussion on this FLIP and looking
> forward
> > > to cooperating with you after we finalize the design and interfaces!
> > > >>>>>>>>
> > > >>>>>>>> [1]
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-221+Abstraction+for+lookup+source+cache+and+metric
> > > >>>>>>>>
> > > >>>>>>>> Best regards,
> > > >>>>>>>>
> > > >>>>>>>> Qingsheng
> > > >>>>>>>>
> > > >>>>>>>>
> > > >>>>>>>> On Fri, May 13, 2022 at 12:12 AM Александр Смирнов <
> > > smiralexan@gmail.com> wrote:
> > > >>>>>>>>>
> > > >>>>>>>>> Hi Jark, Qingsheng and Leonard!
> > > >>>>>>>>>
> > > >>>>>>>>> Glad to see that we came to a consensus on almost all points!
> > > >>>>>>>>>
> > > >>>>>>>>> However I'm a little confused whether InputFormat is
> deprecated
> > > or
> > > >>>>>>>>> not. Am I right that it will be so in the future, but
> currently
> > > it's
> > > >>>>>>>>> not? Actually I also think that for the first version it's OK
> > to
> > > use
> > > >>>>>>>>> InputFormat in ALL cache realization, because supporting
> rescan
> > > >>>>>>>>> ability seems like a very distant prospect. But for this
> > > decision we
> > > >>>>>>>>> need a consensus among all discussion participants.
> > > >>>>>>>>>
> > > >>>>>>>>> In general, I don't have anything to argue with in your
> > > >>>>>>>>> statements. All of them correspond to my ideas. Looking
> > > >>>>>>>>> ahead, it would be nice
> to
> > > work
> > > >>>>>>>>> on this FLIP cooperatively. I've already done a lot of work
> on
> > > lookup
> > > >>>>>>>>> join caching with realization very close to the one we are
> > > discussing,
> > > >>>>>>>>> and want to share the results of this work. Anyway looking
> > > forward for
> > > >>>>>>>>> the FLIP update!
> > > >>>>>>>>>
> > > >>>>>>>>> Best regards,
> > > >>>>>>>>> Smirnov Alexander
> > > >>>>>>>>>
> > > >>>>>>>>> On Thu, 12 May 2022 at 17:38, Jark Wu <im...@gmail.com> wrote:
> > > >>>>>>>>>>
> > > >>>>>>>>>> Hi Alex,
> > > >>>>>>>>>>
> > > >>>>>>>>>> Thanks for summarizing your points.
> > > >>>>>>>>>>
> > > >>>>>>>>>> In the past week, Qingsheng, Leonard, and I have discussed
> it
> > > several times
> > > >>>>>>>>>> and we have totally refactored the design.
> > > >>>>>>>>>> I'm glad to say we have reached a consensus on many of your
> > > points!
> > > >>>>>>>>>> Qingsheng is still working on updating the design docs and
> > > maybe can be
> > > >>>>>>>>>> available in the next few days.
> > > >>>>>>>>>> I will share some conclusions from our discussions:
> > > >>>>>>>>>>
> > > >>>>>>>>>> 1) we have refactored the design towards to "cache in
> > > framework" way.
> > > >>>>>>>>>>
> > > >>>>>>>>>> 2) a "LookupCache" interface for users to customize and a
> > > default
> > > >>>>>>>>>> implementation with builder for users to easy-use.
> > > >>>>>>>>>> This can both make it possible to both have flexibility and
> > > conciseness.
> > > >>>>>>>>>>
> > > >>>>>>>>>> 3) Filter pushdown is important for ALL and LRU lookup
> cache,
> > > esp reducing
> > > >>>>>>>>>> IO.
> > > >>>>>>>>>> Filter pushdown should be the final state and the unified
> way
> > > to both
> > > >>>>>>>>>> support pruning ALL cache and LRU cache,
> > > >>>>>>>>>> so I think we should make effort in this direction. If we
> need
> > > to support
> > > >>>>>>>>>> filter pushdown for ALL cache anyway, why not use
> > > >>>>>>>>>> it for LRU cache as well? Either way, as we decide to
> > implement
> > > the cache
> > > >>>>>>>>>> in the framework, we have the chance to support
> > > >>>>>>>>>> filter on cache anytime. This is an optimization and it
> > doesn't
> > > affect the
> > > >>>>>>>>>> public API. I think we can create a JIRA issue to
> > > >>>>>>>>>> discuss it when the FLIP is accepted.
> > > >>>>>>>>>>
> > > >>>>>>>>>> 4) The idea to support ALL cache is similar to your
> proposal.
> > > >>>>>>>>>> In the first version, we will only support InputFormat,
> > > SourceFunction for
> > > >>>>>>>>>> cache all (invoke InputFormat in join operator).
> > > >>>>>>>>>> For FLIP-27 source, we need to join a true source operator
> > > instead of
> > > >>>>>>>>>> calling it embedded in the join operator.
> > > >>>>>>>>>> However, this needs another FLIP to support the re-scan
> > ability
> > > for FLIP-27
> > > >>>>>>>>>> Source, and this can be a large work.
> > > >>>>>>>>>> In order to not block this issue, we can put the effort of
> > > FLIP-27 source
> > > >>>>>>>>>> integration into future work and integrate
> > > >>>>>>>>>> InputFormat&SourceFunction for now.
> > > >>>>>>>>>>
> > > >>>>>>>>>> I think it's fine to use InputFormat&SourceFunction, as they
> > > are not
> > > >>>>>>>>>> deprecated, otherwise, we have to introduce another function
> > > >>>>>>>>>> similar to them which is meaningless. We need to plan
> FLIP-27
> > > source
> > > >>>>>>>>>> integration ASAP before InputFormat & SourceFunction are
> > > deprecated.
> > > >>>>>>>>>>
> > > >>>>>>>>>> Best,
> > > >>>>>>>>>> Jark
> > > >>>>>>>>>>
> > > >>>>>>>>>>
> > > >>>>>>>>>>
> > > >>>>>>>>>>
> > > >>>>>>>>>>
> > > >>>>>>>>>> On Thu, 12 May 2022 at 15:46, Александр Смирнов <
> > > smiralexan@gmail.com>
> > > >>>>>>>>>> wrote:
> > > >>>>>>>>>>
> > > >>>>>>>>>>> Hi Martijn!
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> Got it. Therefore, the realization with InputFormat is not
> > > considered.
> > > >>>>>>>>>>> Thanks for clearing that up!
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> Best regards,
> > > >>>>>>>>>>> Smirnov Alexander
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> On Thu, 12 May 2022 at 14:23, Martijn Visser <martijn@ververica.com> wrote:
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> Hi,
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> With regards to:
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>>> But if there are plans to refactor all connectors to
> > FLIP-27
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> Yes, FLIP-27 is the target for all connectors. The old
> > > interfaces will be
> > > >>>>>>>>>>>> deprecated and connectors will either be refactored to use
> > > the new ones
> > > >>>>>>>>>>> or
> > > >>>>>>>>>>>> dropped.
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> The caching should work for connectors that are using
> > FLIP-27
> > > interfaces,
> > > >>>>>>>>>>>> we should not introduce new features for old interfaces.
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> Best regards,
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> Martijn
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> On Thu, 12 May 2022 at 06:19, Александр Смирнов <
> > > smiralexan@gmail.com>
> > > >>>>>>>>>>>> wrote:
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>>> Hi Jark!
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>> Sorry for the late response. I would like to make some
> > > comments and
> > > >>>>>>>>>>>>> clarify my points.
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>> 1) I agree with your first statement. I think we can
> > achieve
> > > both
> > > >>>>>>>>>>>>> advantages this way: put the Cache interface in
> > > flink-table-common,
> > > >>>>>>>>>>>>> but have implementations of it in flink-table-runtime.
> > > Therefore if a
> > > >>>>>>>>>>>>> connector developer wants to use existing cache
> strategies
> > > and their
> > > >>>>>>>>>>>>> implementations, he can just pass lookupConfig to the
> > > planner, but if
> > > >>>>>>>>>>>>> he wants to have its own cache implementation in his
> > > TableFunction, it
> > > >>>>>>>>>>>>> will be possible for him to use the existing interface
> for
> > > this
> > > >>>>>>>>>>>>> purpose (we can explicitly point this out in the
> > > documentation). In
> > > >>>>>>>>>>>>> this way all configs and metrics will be unified. WDYT?
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> If a filter can prune 90% of data in the cache, we will
> > > have 90% of
> > > >>>>>>>>>>>>> lookup requests that can never be cached
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>> 2) Let me clarify the logic filters optimization in case
> of
> > > LRU cache.
> > > >>>>>>>>>>>>> It looks like Cache<RowData, Collection<RowData>>. Here
> we
> > > always
> > > >>>>>>>>>>>>> store the response of the dimension table in cache, even
> > > after
> > > >>>>>>>>>>>>> applying calc function. I.e. if there are no rows after
> > > applying
> > > >>>>>>>>>>>>> filters to the result of the 'eval' method of
> > TableFunction,
> > > we store
> > > >>>>>>>>>>>>> the empty list by lookup keys. Therefore the cache line
> > will
> > > be
> > > >>>>>>>>>>>>> filled, but will require much less memory (in bytes).
> I.e.
> > > we don't
> > > >>>>>>>>>>>>> completely filter keys, by which result was pruned, but
> > > significantly
> > > >>>>>>>>>>>>> reduce required memory to store this result. If the user
> > > knows about
> > > >>>>>>>>>>>>> this behavior, he can increase the 'max-rows' option
> before
> > > the start
> > > >>>>>>>>>>>>> of the job. But actually I came up with the idea that we
> > can
> > > do this
> > > >>>>>>>>>>>>> automatically by using the 'maximumWeight' and 'weigher'
> > > methods of
> > > >>>>>>>>>>>>> GuavaCache [1]. Weight can be the size of the collection
> of
> > > rows
> > > >>>>>>>>>>>>> (value of cache). Therefore cache can automatically fit
> > much
> > > more
> > > >>>>>>>>>>>>> records than before.
> > > >>>>>>>>>>>>>
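A JDK-only illustration of the weighing idea above. In practice Guava's `CacheBuilder.maximumWeight`/`weigher` would be used; `RowWeightedCache` is a hypothetical name. The point is that eviction is driven by the total number of cached rows rather than the number of keys, so keys whose result was pruned to an empty list cost almost nothing:

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class RowWeightedCache<K, V> {
    private final long maxTotalRows;
    private long totalRows = 0;
    // Access-order map gives LRU iteration order (eldest entry first).
    private final LinkedHashMap<K, List<V>> map = new LinkedHashMap<>(16, 0.75f, true);

    RowWeightedCache(long maxTotalRows) {
        this.maxTotalRows = maxTotalRows;
    }

    void put(K key, List<V> rows) {
        List<V> old = map.put(key, rows);
        totalRows += rows.size() - (old == null ? 0 : old.size());
        // Evict least-recently-used entries until the total row count fits;
        // the number of KEYS is unbounded, only the number of ROWS is capped.
        Iterator<Map.Entry<K, List<V>>> it = map.entrySet().iterator();
        while (totalRows > maxTotalRows && it.hasNext()) {
            totalRows -= it.next().getValue().size();
            it.remove();
        }
    }

    List<V> get(K key) {
        return map.get(key);
    }

    int keyCount() {
        return map.size();
    }
}
```

An entry whose value is an empty list (all rows pruned by the calc filters) has weight zero, so such keys can stay cached essentially for free.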
> > > >>>>>>>>>>>>>> Flink SQL has provided a standard way to do filters and
> > > projects
> > > >>>>>>>>>>>>> pushdown, i.e., SupportsFilterPushDown and
> > > SupportsProjectionPushDown.
> > > >>>>>>>>>>>>>> Jdbc/hive/HBase haven't implemented the interfaces,
> don't
> > > mean it's
> > > >>>>>>>>>>> hard
> > > >>>>>>>>>>>>> to implement.
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>> It's debatable how difficult it will be to implement
> filter
> > > pushdown.
> > > >>>>>>>>>>>>> But I think the fact that currently there is no database
> > > connector
> > > >>>>>>>>>>>>> with filter pushdown at least means that this feature
> won't
> > > be
> > > >>>>>>>>>>>>> supported soon in connectors. Moreover, if we talk about
> > > other
> > > >>>>>>>>>>>>> connectors (not in Flink repo), their databases might not
> > > support all
> > > >>>>>>>>>>>>> Flink filters (or not support filters at all). I think
> > users
> > > are
> > > >>>>>>>>>>>>> interested in supporting cache filters optimization
> > > independently of
> > > >>>>>>>>>>>>> supporting other features and solving more complex
> problems
> > > (or
> > > >>>>>>>>>>>>> unsolvable at all).
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>> 3) I agree with your third statement. Actually in our
> > > internal version
> > > >>>>>>>>>>>>> I also tried to unify the logic of scanning and reloading
> > > data from
> > > >>>>>>>>>>>>> connectors. But unfortunately, I didn't find a way to
> unify
> > > the logic
> > > >>>>>>>>>>>>> of all ScanRuntimeProviders (InputFormat, SourceFunction,
> > > Source,...)
> > > >>>>>>>>>>>>> and reuse it in reloading ALL cache. As a result I
> settled
> > > on using
> > > >>>>>>>>>>>>> InputFormat, because it was used for scanning in all
> lookup
> > > >>>>>>>>>>>>> connectors. (I didn't know that there are plans to
> > deprecate
> > > >>>>>>>>>>>>> InputFormat in favor of FLIP-27 Source). IMO usage of
> > > FLIP-27 source
> > > >>>>>>>>>>>>> in ALL caching is not good idea, because this source was
> > > designed to
> > > >>>>>>>>>>>>> work in distributed environment (SplitEnumerator on
> > > JobManager and
> > > >>>>>>>>>>>>> SourceReaders on TaskManagers), not in one operator
> (lookup
> > > join
> > > >>>>>>>>>>>>> operator in our case). There is even no direct way to
> pass
> > > splits from
> > > >>>>>>>>>>>>> SplitEnumerator to SourceReader (this logic works through
> > > >>>>>>>>>>>>> SplitEnumeratorContext, which requires
> > > >>>>>>>>>>>>> OperatorCoordinator.SubtaskGateway to send
> AddSplitEvents).
> > > Usage of
> > > >>>>>>>>>>>>> InputFormat for ALL cache seems much more clearer and
> > > easier. But if
> > > >>>>>>>>>>>>> there are plans to refactor all connectors to FLIP-27, I
> > > have the
> > > >>>>>>>>>>>>> following ideas: maybe we can abandon the lookup join ALL
> > > cache in
> > > >>>>>>>>>>>>> favor of simple join with multiple scanning of batch
> > source?
> > > The point
> > > >>>>>>>>>>>>> is that the only difference between lookup join ALL cache
> > > and simple
> > > >>>>>>>>>>>>> join with batch source is that in the first case scanning
> > is
> > > performed
> > > >>>>>>>>>>>>> multiple times, in between which state (cache) is cleared
> > > (correct me
> > > >>>>>>>>>>>>> if I'm wrong). So what if we extend the functionality of
> > > simple join
> > > >>>>>>>>>>>>> to support state reloading + extend the functionality of
> > > scanning
> > > >>>>>>>>>>>>> batch source multiple times (this one should be easy with
> > > new FLIP-27
> > > >>>>>>>>>>>>> source, that unifies streaming/batch reading - we will
> need
> > > to change
> > > >>>>>>>>>>>>> only SplitEnumerator, which will pass splits again after
> > > some TTL).
> > > >>>>>>>>>>>>> WDYT? I must say that this looks like a long-term goal
> and
> > > will make
> > > >>>>>>>>>>>>> the scope of this FLIP even larger than you said. Maybe
> we
> > > can limit
> > > >>>>>>>>>>>>> ourselves to a simpler solution now (InputFormats).
> > > >>>>>>>>>>>>>
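The parallel reload idea described above can be sketched in plain Java, independent of Flink's InputFormat API (class and method names here are assumptions for illustration): each "split" is loaded by its own task from a capped thread pool, and the freshly built map replaces the visible cache only after every split has finished.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

/**
 * Sketch of a parallel ALL-cache reload: every "split" is loaded by its own
 * task (thread count has an upper limit), and the fresh map replaces the
 * old cache atomically once all splits are done.
 */
class AllCacheReloader<K, R> {
    private volatile Map<K, List<R>> cache = Map.of();

    Map<K, List<R>> snapshot() {
        return cache;
    }

    void reload(List<List<R>> splits, Function<R, K> keyExtractor, int maxThreads) {
        ExecutorService pool = Executors.newFixedThreadPool(
                Math.max(1, Math.min(maxThreads, splits.size())));
        try {
            Map<K, List<R>> fresh = new ConcurrentHashMap<>();
            List<Future<?>> tasks = new ArrayList<>();
            for (List<R> split : splits) {
                tasks.add(pool.submit(() -> {
                    for (R row : split) {
                        fresh.computeIfAbsent(
                                        keyExtractor.apply(row),
                                        k -> Collections.synchronizedList(new ArrayList<>()))
                                .add(row);
                    }
                }));
            }
            for (Future<?> task : tasks) {
                task.get(); // propagate a failure from any split
            }
            cache = fresh; // atomic swap: readers never see a half-built cache
        } catch (Exception e) {
            throw new RuntimeException("cache reload failed", e);
        } finally {
            pool.shutdown();
        }
    }
}
```

The volatile swap keeps the old snapshot readable while the reload runs, which is what keeps input-stream blocking short in this scheme.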
> > > >>>>>>>>>>>>> So to sum up, my points is like this:
> > > >>>>>>>>>>>>> 1) There is a way to make both concise and flexible
> > > interfaces for
> > > >>>>>>>>>>>>> caching in lookup join.
> > > >>>>>>>>>>>>> 2) Cache filters optimization is important both in LRU
> and
> > > ALL caches.
> > > >>>>>>>>>>>>> 3) It is unclear when filter pushdown will be supported
> in
> > > Flink
> > > >>>>>>>>>>>>> connectors, some of the connectors might not have the
> > > opportunity to
> > > >>>>>>>>>>>>> support filter pushdown + as I know, currently filter
> > > pushdown works
> > > >>>>>>>>>>>>> only for scanning (not lookup). So cache filters +
> > > projections
> > > >>>>>>>>>>>>> optimization should be independent from other features.
> > > >>>>>>>>>>>>> 4) ALL cache realization is a complex topic that involves
> > > multiple
> > > >>>>>>>>>>>>> aspects of how Flink is developing. Moving away from
> > > InputFormat in favor
> > > >>>>>>>>>>>>> of FLIP-27 Source will make ALL cache realization really
> > > complex and
> > > >>>>>>>>>>>>> not clear, so maybe instead of that we can extend the
> > > functionality of
> > > >>>>>>>>>>>>> simple join or keep using InputFormat in the case of
> > lookup
> > > join ALL
> > > >>>>>>>>>>>>> cache?
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>> Best regards,
> > > >>>>>>>>>>>>> Smirnov Alexander
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>> [1]
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>
> > >
> >
> https://guava.dev/releases/18.0/api/docs/com/google/common/cache/CacheBuilder.html#weigher(com.google.common.cache.Weigher)
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>> чт, 5 мая 2022 г. в 20:34, Jark Wu <im...@gmail.com>:
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> It's great to see the active discussion! I want to share
> > my
> > > ideas:
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> 1) implement the cache in framework vs. connectors base
> > > >>>>>>>>>>>>>> I don't have a strong opinion on this. Both ways should
> > > work (e.g.,
> > > >>>>>>>>>>> cache
> > > >>>>>>>>>>>>>> pruning, compatibility).
> > > >>>>>>>>>>>>>> The framework way can provide more concise interfaces.
> > > >>>>>>>>>>>>>> The connector base way can define more flexible cache
> > > >>>>>>>>>>>>>> strategies/implementations.
> > > >>>>>>>>>>>>>> We are still investigating a way to see if we can have
> > both
> > > >>>>>>>>>>> advantages.
> > > >>>>>>>>>>>>>> We should reach a consensus that the way should be a
> final
> > > state,
> > > >>>>>>>>>>> and we
> > > >>>>>>>>>>>>>> are on the path to it.
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> 2) filters and projections pushdown:
> > > >>>>>>>>>>>>>> I agree with Alex that the filter pushdown into cache
> can
> > > benefit a
> > > >>>>>>>>>>> lot
> > > >>>>>>>>>>>>> for
> > > >>>>>>>>>>>>>> ALL cache.
> > > >>>>>>>>>>>>>> However, this is not true for LRU cache. Connectors use
> > > cache to
> > > >>>>>>>>>>> reduce
> > > >>>>>>>>>>>>> IO
> > > >>>>>>>>>>>>>> requests to databases for better throughput.
> > > >>>>>>>>>>>>>> If a filter can prune 90% of data in the cache, we will
> > > have 90% of
> > > >>>>>>>>>>>>> lookup
> > > >>>>>>>>>>>>>> requests that can never be cached
> > > >>>>>>>>>>>>>> and hit directly to the databases. That means the cache
> is
> > > >>>>>>>>>>> meaningless in
> > > >>>>>>>>>>>>>> this case.
> > > >>>>>>>>>>>>>>
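This hit-rate argument can be made concrete with a small simulation (a sketch, not Flink code): an LRU cache that only retains rows passing the filter can never serve lookups whose rows were pruned, so those requests always reach the database, however hot the keys are.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.IntPredicate;

/**
 * Sketch of why pruning an LRU cache with a filter can defeat it: rows
 * rejected by the filter are never cached, so lookups for those keys
 * always hit the database, no matter how frequently they occur.
 */
class FilteredLruSimulation {
    static int countDbRequests(int[] lookupKeys, IntPredicate filter, int maxRows) {
        Map<Integer, Integer> cache = new LinkedHashMap<>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<Integer, Integer> eldest) {
                return size() > maxRows;
            }
        };
        int dbRequests = 0;
        for (int key : lookupKeys) {
            if (cache.containsKey(key)) {
                continue; // served from cache
            }
            dbRequests++;  // cache miss: query the database
            int row = key; // pretend the database returned this row
            if (filter.test(row)) {
                cache.put(key, row); // filtered-out rows are never cached
            }
        }
        return dbRequests;
    }
}
```

With uniform lookups over ten keys and a filter that keeps only one of them, about 90% of the requests bypass the cache entirely; without the filter, each distinct key is fetched only once.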
> > > >>>>>>>>>>>>>> IMO, Flink SQL has provided a standard way to do filters
> > > and projects
> > > >>>>>>>>>>>>>> pushdown, i.e., SupportsFilterPushDown and
> > > >>>>>>>>>>> SupportsProjectionPushDown.
> > > >>>>>>>>>>>>>> Jdbc/hive/HBase haven't implemented the interfaces,
> don't
> > > mean it's
> > > >>>>>>>>>>> hard
> > > >>>>>>>>>>>>> to
> > > >>>>>>>>>>>>>> implement.
> > > >>>>>>>>>>>>>> They should implement the pushdown interfaces to reduce
> IO
> > > and the
> > > >>>>>>>>>>> cache
> > > >>>>>>>>>>>>>> size.
> > > >>>>>>>>>>>>>> That should be a final state that the scan source and
> > > lookup source
> > > >>>>>>>>>>> share
> > > >>>>>>>>>>>>>> the exact pushdown implementation.
> > > >>>>>>>>>>>>>> I don't see why we need to duplicate the pushdown logic
> in
> > > caches,
> > > >>>>>>>>>>> which
> > > >>>>>>>>>>>>>> will complex the lookup join design.
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> 3) ALL cache abstraction
> > > >>>>>>>>>>>>>> All cache might be the most challenging part of this
> FLIP.
> > > We have
> > > >>>>>>>>>>> never
> > > >>>>>>>>>>>>>> provided a reload-lookup public interface.
> > > >>>>>>>>>>>>>> Currently, we put the reload logic in the "eval" method
> of
> > > >>>>>>>>>>> TableFunction.
> > > >>>>>>>>>>>>>> That's hard for some sources (e.g., Hive).
> > > >>>>>>>>>>>>>> Ideally, connector implementation should share the logic
> > of
> > > reload
> > > >>>>>>>>>>> and
> > > >>>>>>>>>>>>>> scan, i.e. ScanTableSource with
> > > InputFormat/SourceFunction/FLIP-27
> > > >>>>>>>>>>>>> Source.
> > > >>>>>>>>>>>>>> However, InputFormat/SourceFunction are deprecated, and
> > the
> > > FLIP-27
> > > >>>>>>>>>>>>> source
> > > >>>>>>>>>>>>>> is deeply coupled with SourceOperator.
> > > >>>>>>>>>>>>>> If we want to invoke the FLIP-27 source in LookupJoin,
> > this
> > > may make
> > > >>>>>>>>>>> the
> > > >>>>>>>>>>>>>> scope of this FLIP much larger.
> > > >>>>>>>>>>>>>> We are still investigating how to abstract the ALL cache
> > > logic and
> > > >>>>>>>>>>> reuse
> > > >>>>>>>>>>>>>> the existing source interfaces.
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> Best,
> > > >>>>>>>>>>>>>> Jark
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> On Thu, 5 May 2022 at 20:22, Roman Boyko <
> > > ro.v.boyko@gmail.com>
> > > >>>>>>>>>>> wrote:
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>> It's a much more complicated activity and lies out of
> the
> > > scope of
> > > >>>>>>>>>>> this
> > > >>>>>>>>>>>>>>> improvement. Because such pushdowns should be done for
> > all
> > > >>>>>>>>>>>>> ScanTableSource
> > > >>>>>>>>>>>>>>> implementations (not only for Lookup ones).
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>> On Thu, 5 May 2022 at 19:02, Martijn Visser <
> > > >>>>>>>>>>> martijnvisser@apache.org>
> > > >>>>>>>>>>>>>>> wrote:
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>> Hi everyone,
> > > >>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>> One question regarding "And Alexander correctly
> > mentioned
> > > that
> > > >>>>>>>>>>> filter
> > > >>>>>>>>>>>>>>>> pushdown still is not implemented for
> jdbc/hive/hbase."
> > > -> Would
> > > >>>>>>>>>>> an
> > > >>>>>>>>>>>>>>>> alternative solution be to actually implement these
> > filter
> > > >>>>>>>>>>> pushdowns?
> > > >>>>>>>>>>>>> I
> > > >>>>>>>>>>>>>>>> can
> > > >>>>>>>>>>>>>>>> imagine that there are many more benefits to doing
> that,
> > > outside
> > > >>>>>>>>>>> of
> > > >>>>>>>>>>>>> lookup
> > > >>>>>>>>>>>>>>>> caching and metrics.
> > > >>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>> Best regards,
> > > >>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>> Martijn Visser
> > > >>>>>>>>>>>>>>>> https://twitter.com/MartijnVisser82
> > > >>>>>>>>>>>>>>>> https://github.com/MartijnVisser
> > > >>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>> On Thu, 5 May 2022 at 13:58, Roman Boyko <
> > > ro.v.boyko@gmail.com>
> > > >>>>>>>>>>>>> wrote:
> > > >>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>> Hi everyone!
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>> Thanks for driving such a valuable improvement!
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>> I do think that a single cache implementation would be
> a
> > > nice
> > > >>>>>>>>>>>>> opportunity
> > > >>>>>>>>>>>>>>>> for
> > > >>>>>>>>>>>>>>>>> users. And it will break the "FOR SYSTEM_TIME AS OF
> > > proc_time"
> > > >>>>>>>>>>>>> semantics
> > > >>>>>>>>>>>>>>>>> anyway - doesn't matter how it will be implemented.
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>> Putting myself in the user's shoes, I can say that:
> > > >>>>>>>>>>>>>>>>> 1) I would prefer to have the opportunity to cut off
> > the
> > > cache
> > > >>>>>>>>>>> size
> > > >>>>>>>>>>>>> by
> > > >>>>>>>>>>>>>>>>> simply filtering unnecessary data. And the most handy
> > > way to do
> > > >>>>>>>>>>> it
> > > >>>>>>>>>>>>> is
> > > >>>>>>>>>>>>>>>> apply
> > > >>>>>>>>>>>>>>>>> it inside LookupRunners. It would be a bit harder to
> > > pass it
> > > >>>>>>>>>>>>> through the
> > > >>>>>>>>>>>>>>>>> LookupJoin node to TableFunction. And Alexander
> > correctly
> > > >>>>>>>>>>> mentioned
> > > >>>>>>>>>>>>> that
> > > >>>>>>>>>>>>>>>>> filter pushdown still is not implemented for
> > > jdbc/hive/hbase.
> > > >>>>>>>>>>>>>>>>> 2) The ability to set the different caching
> parameters
> > > for
> > > >>>>>>>>>>> different
> > > >>>>>>>>>>>>>>>> tables
> > > >>>>>>>>>>>>>>>>> is quite important. So I would prefer to set it
> through
> > > DDL
> > > >>>>>>>>>>> rather
> > > >>>>>>>>>>>>> than
> > > >>>>>>>>>>>>>>>>> have similar ttl, strategy and other options for all
> > > lookup
> > > >>>>>>>>>>> tables.
> > > >>>>>>>>>>>>>>>>> 3) Providing the cache into the framework really
> > > deprives us of
> > > >>>>>>>>>>>>>>>>> extensibility (users won't be able to implement their
> > own
> > > >>>>>>>>>>> cache).
> > > >>>>>>>>>>>>> But
> > > >>>>>>>>>>>>>>>> most
> > > >>>>>>>>>>>>>>>>> probably it might be solved by creating more
> different
> > > cache
> > > >>>>>>>>>>>>> strategies
> > > >>>>>>>>>>>>>>>> and
> > > >>>>>>>>>>>>>>>>> a wider set of configurations.
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>> All these points are much closer to the schema
> proposed
> > > by
> > > >>>>>>>>>>>>> Alexander.
> > > >>>>>>>>>>>>>>>>> Qingshen Ren, please correct me if I'm not right and
> > all
> > > these
> > > >>>>>>>>>>>>>>>> facilities
> > > >>>>>>>>>>>>>>>>> might be simply implemented in your architecture?
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>> Best regards,
> > > >>>>>>>>>>>>>>>>> Roman Boyko
> > > >>>>>>>>>>>>>>>>> e.: ro.v.boyko@gmail.com
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>> On Wed, 4 May 2022 at 21:01, Martijn Visser <
> > > >>>>>>>>>>>>> martijnvisser@apache.org>
> > > >>>>>>>>>>>>>>>>> wrote:
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>> Hi everyone,
> > > >>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>> I don't have much to chip in, but just wanted to
> > > express that
> > > >>>>>>>>>>> I
> > > >>>>>>>>>>>>> really
> > > >>>>>>>>>>>>>>>>>> appreciate the in-depth discussion on this topic
> and I
> > > hope
> > > >>>>>>>>>>> that
> > > >>>>>>>>>>>>>>>> others
> > > >>>>>>>>>>>>>>>>>> will join the conversation.
> > > >>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>> Best regards,
> > > >>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>> Martijn
> > > >>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>> On Tue, 3 May 2022 at 10:15, Александр Смирнов <
> > > >>>>>>>>>>>>> smiralexan@gmail.com>
> > > >>>>>>>>>>>>>>>>>> wrote:
> > > >>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>> Hi Qingsheng, Leonard and Jark,
> > > >>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>> Thanks for your detailed feedback! However, I have
> > > questions
> > > >>>>>>>>>>>>> about
> > > >>>>>>>>>>>>>>>>>>> some of your statements (maybe I didn't get
> > > something?).
> > > >>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>> Caching actually breaks the semantic of "FOR
> > > SYSTEM_TIME
> > > >>>>>>>>>>> AS OF
> > > >>>>>>>>>>>>>>>>>> proc_time”
> > > >>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>> I agree that the semantics of "FOR SYSTEM_TIME AS
> OF
> > > >>>>>>>>>>> proc_time"
> > > >>>>>>>>>>>>> is
> > > >>>>>>>>>>>>>>>> not
> > > >>>>>>>>>>>>>>>>>>> fully implemented with caching, but as you said,
> > users
> > > go
> > > >>>>>>>>>>> on it
> > > >>>>>>>>>>>>>>>>>>> consciously to achieve better performance (no one
> > > proposed
> > > >>>>>>>>>>> to
> > > >>>>>>>>>>>>> enable
> > > >>>>>>>>>>>>>>>>>>> caching by default, etc.). Or by users do you mean
> > > other
> > > >>>>>>>>>>>>> developers
> > > >>>>>>>>>>>>>>>> of
> > > >>>>>>>>>>>>>>>>>>> connectors? In this case developers explicitly
> > specify
> > > >>>>>>>>>>> whether
> > > >>>>>>>>>>>>> their
> > > >>>>>>>>>>>>>>>>>>> connector supports caching or not (in the list of
> > > supported
> > > >>>>>>>>>>>>>>>> options),
> > > >>>>>>>>>>>>>>>>>>> no one makes them do that if they don't want to. So
> > > what
> > > >>>>>>>>>>>>> exactly is
> > > >>>>>>>>>>>>>>>>>>> the difference between implementing caching in
> > modules
> > > >>>>>>>>>>>>>>>>>>> flink-table-runtime and in flink-table-common from
> > the
> > > >>>>>>>>>>>>> considered
> > > >>>>>>>>>>>>>>>>>>> point of view? How does it affect
> > > breaking/non-breaking
> > > >>>>>>>>>>> the
> > > >>>>>>>>>>>>>>>>>>> semantics of "FOR SYSTEM_TIME AS OF proc_time"?
> > > >>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>> confront a situation that allows table options in
> > DDL
> > > to
> > > >>>>>>>>>>>>> control
> > > >>>>>>>>>>>>>>>> the
> > > >>>>>>>>>>>>>>>>>>> behavior of the framework, which has never happened
> > > >>>>>>>>>>> previously
> > > >>>>>>>>>>>>> and
> > > >>>>>>>>>>>>>>>>> should
> > > >>>>>>>>>>>>>>>>>>> be cautious
> > > >>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>> If we talk about main differences of semantics of
> DDL
> > > >>>>>>>>>>> options
> > > >>>>>>>>>>>>> and
> > > >>>>>>>>>>>>>>>>>>> config options("table.exec.xxx"), isn't it about
> > > limiting
> > > >>>>>>>>>>> the
> > > >>>>>>>>>>>>> scope
> > > >>>>>>>>>>>>>>>> of
> > > >>>>>>>>>>>>>>>>>>> the options + importance for the user business
> logic
> > > rather
> > > >>>>>>>>>>> than
> > > >>>>>>>>>>>>>>>>>>> specific location of corresponding logic in the
> > > framework? I
> > > >>>>>>>>>>>>> mean
> > > >>>>>>>>>>>>>>>> that
> > > >>>>>>>>>>>>>>>>>>> in my design, for example, putting an option with
> > > lookup
> > > >>>>>>>>>>> cache
> > > >>>>>>>>>>>>>>>>>>> strategy in configurations would be the wrong
> > > decision,
> > > >>>>>>>>>>>>> because it
> > > >>>>>>>>>>>>>>>>>>> directly affects the user's business logic (not
> just
> > > >>>>>>>>>>> performance
> > > >>>>>>>>>>>>>>>>>>> optimization) + touches just several functions of
> ONE
> > > table
> > > >>>>>>>>>>>>> (there
> > > >>>>>>>>>>>>>>>> can
> > > >>>>>>>>>>>>>>>>>>> be multiple tables with different caches). Does it
> > > really
> > > >>>>>>>>>>>>> matter for
> > > >>>>>>>>>>>>>>>>>>> the user (or someone else) where the logic is
> > located,
> > > >>>>>>>>>>> which is
> > > >>>>>>>>>>>>>>>>>>> affected by the applied option?
> > > >>>>>>>>>>>>>>>>>>> Also I can remember DDL option 'sink.parallelism',
> > > which in
> > > >>>>>>>>>>>>> some way
> > > >>>>>>>>>>>>>>>>>>> "controls the behavior of the framework" and I
> don't
> > > see any
> > > >>>>>>>>>>>>> problem
> > > >>>>>>>>>>>>>>>>>>> here.
> > > >>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>> introduce a new interface for this all-caching
> > > scenario
> > > >>>>>>>>>>> and
> > > >>>>>>>>>>>>> the
> > > >>>>>>>>>>>>>>>>> design
> > > >>>>>>>>>>>>>>>>>>> would become more complex
> > > >>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>> This is a subject for a separate discussion, but
> > > actually
> > > >>>>>>>>>>> in our
> > > >>>>>>>>>>>>>>>>>>> internal version we solved this problem quite
> easily
> > -
> > > we
> > > >>>>>>>>>>> reused
> > > >>>>>>>>>>>>>>>>>>> InputFormat class (so there is no need for a new
> > API).
> > > The
> > > >>>>>>>>>>>>> point is
> > > >>>>>>>>>>>>>>>>>>> that currently all lookup connectors use
> InputFormat
> > > for
> > > >>>>>>>>>>>>> scanning
> > > >>>>>>>>>>>>>>>> the
> > > >>>>>>>>>>>>>>>>>>> data in batch mode: HBase, JDBC and even Hive - it
> > uses
> > > >>>>>>>>>>> class
> > > >>>>>>>>>>>>>>>>>>> PartitionReader, that is actually just a wrapper
> > around
> > > >>>>>>>>>>>>> InputFormat.
> > > >>>>>>>>>>>>>>>>>>> The advantage of this solution is the ability to
> > reload
> > > >>>>>>>>>>> cache
> > > >>>>>>>>>>>>> data
> > > >>>>>>>>>>>>>>>> in
> > > >>>>>>>>>>>>>>>>>>> parallel (number of threads depends on number of
> > > >>>>>>>>>>> InputSplits,
> > > >>>>>>>>>>>>> but
> > > >>>>>>>>>>>>>>>> has
> > > >>>>>>>>>>>>>>>>>>> an upper limit). As a result cache reload time
> > > significantly
> > > >>>>>>>>>>>>> reduces
> > > >>>>>>>>>>>>>>>>>>> (as well as time of input stream blocking). I know
> > that
> > > >>>>>>>>>>> usually
> > > >>>>>>>>>>>>> we
> > > >>>>>>>>>>>>>>>> try
> > > >>>>>>>>>>>>>>>>>>> to avoid usage of concurrency in Flink code, but
> > maybe
> > > this
> > > >>>>>>>>>>> one
> > > >>>>>>>>>>>>> can
> > > >>>>>>>>>>>>>>>> be
> > > >>>>>>>>>>>>>>>>>>> an exception. BTW I don't say that it's an ideal
> > > solution,
> > > >>>>>>>>>>> maybe
> > > >>>>>>>>>>>>>>>> there
> > > >>>>>>>>>>>>>>>>>>> are better ones.
> > > >>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>> Providing the cache in the framework might
> introduce
> > > >>>>>>>>>>>>> compatibility
> > > >>>>>>>>>>>>>>>>>> issues
> > > >>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>> It's possible only in cases when the developer of
> the
> > > >>>>>>>>>>> connector
> > > >>>>>>>>>>>>>>>> won't
> > > >>>>>>>>>>>>>>>>>>> properly refactor his code and will use new cache
> > > options
> > > >>>>>>>>>>>>>>>> incorrectly
> > > >>>>>>>>>>>>>>>>>>> (i.e. explicitly provide the same options into 2
> > > different
> > > >>>>>>>>>>> code
> > > >>>>>>>>>>>>>>>>>>> places). For correct behavior all he will need to
> do
> > > is to
> > > >>>>>>>>>>>>> redirect
> > > >>>>>>>>>>>>>>>>>>> existing options to the framework's LookupConfig (+
> > > maybe
> > > >>>>>>>>>>> add an
> > > >>>>>>>>>>>>>>>> alias
> > > >>>>>>>>>>>>>>>>>>> for options, if there was different naming),
> > everything
> > > >>>>>>>>>>> will be
> > > >>>>>>>>>>>>>>>>>>> transparent for users. If the developer won't do
> > > >>>>>>>>>>> refactoring at
> > > >>>>>>>>>>>>> all,
> > > >>>>>>>>>>>>>>>>>>> nothing will be changed for the connector because
> of
> > > >>>>>>>>>>> backward
> > > >>>>>>>>>>>>>>>>>>> compatibility. Also if a developer wants to use his
> > own
> > > >>>>>>>>>>> cache
> > > >>>>>>>>>>>>> logic,
> > > >>>>>>>>>>>>>>>>>>> he just can refuse to pass some of the configs into
> > the
> > > >>>>>>>>>>>>> framework,
> > > >>>>>>>>>>>>>>>> and
> > > >>>>>>>>>>>>>>>>>>> instead make his own implementation with already
> > > existing
> > > >>>>>>>>>>>>> configs
> > > >>>>>>>>>>>>>>>> and
> > > >>>>>>>>>>>>>>>>>>> metrics (but actually I think that it's a rare
> case).
> > > >>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>> filters and projections should be pushed all the
> way
> > > down
> > > >>>>>>>>>>> to
> > > >>>>>>>>>>>>> the
> > > >>>>>>>>>>>>>>>>> table
> > > >>>>>>>>>>>>>>>>>>> function, like what we do in the scan source
> > > >>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>> It's the great purpose. But the truth is that the
> > ONLY
> > > >>>>>>>>>>> connector
> > > >>>>>>>>>>>>>>>> that
> > > >>>>>>>>>>>>>>>>>>> supports filter pushdown is FileSystemTableSource
> > > >>>>>>>>>>>>>>>>>>> (no database connector supports it currently). Also
> > > for some
> > > >>>>>>>>>>>>>>>> databases
> > > >>>>>>>>>>>>>>>>>>> it's simply impossible to pushdown such complex
> > filters
> > > >>>>>>>>>>> that we
> > > >>>>>>>>>>>>> have
> > > >>>>>>>>>>>>>>>>>>> in Flink.
> > > >>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>> only applying these optimizations to the cache
> seems
> > > not
> > > >>>>>>>>>>>>> quite
> > > >>>>>>>>>>>>>>>>> useful
> > > >>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>> Filters can cut off an arbitrarily large amount of
> > data
> > > >>>>>>>>>>> from the
> > > >>>>>>>>>>>>>>>>>>> dimension table. For a simple example, suppose in
> > > dimension
> > > >>>>>>>>>>>>> table
> > > >>>>>>>>>>>>>>>>>>> 'users'
> > > >>>>>>>>>>>>>>>>>>> we have column 'age' with values from 20 to 40, and
> > > input
> > > >>>>>>>>>>> stream
> > > >>>>>>>>>>>>>>>>>>> 'clicks' that is ~uniformly distributed by age of
> > > users. If
> > > >>>>>>>>>>> we
> > > >>>>>>>>>>>>> have
> > > >>>>>>>>>>>>>>>>>>> filter 'age > 30',
> > > >>>>>>>>>>>>>>>>>>> there will be twice less data in cache. This means
> > the
> > > user
> > > >>>>>>>>>>> can
> > > >>>>>>>>>>>>>>>>>>> increase 'lookup.cache.max-rows' by almost 2 times.
> > It
> > > will
> > > >>>>>>>>>>>>> gain a
> > > >>>>>>>>>>>>>>>>>>> huge
> > > >>>>>>>>>>>>>>>>>>> performance boost. Moreover, this optimization
> starts
> > > to
> > > >>>>>>>>>>> really
> > > >>>>>>>>>>>>>>>> shine
> > > >>>>>>>>>>>>>>>>>>> in 'ALL' cache, where tables without filters and
> > > projections
> > > >>>>>>>>>>>>> can't
> > > >>>>>>>>>>>>>>>> fit
> > > >>>>>>>>>>>>>>>>>>> in memory, but with them - can. This opens up
> > > additional
> > > >>>>>>>>>>>>>>>> possibilities
> > > >>>>>>>>>>>>>>>>>>> for users. And this doesn't sound as 'not quite
> > > useful'.
> > > >>>>>>>>>>>>>>>>>>>
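The 'age > 30' example above can be sketched as a filtered ALL-cache load (illustrative code, not a connector implementation): applying the join's filter while loading drops the rows that could never match, so the in-memory footprint shrinks by the filter's selectivity, roughly half in this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

/**
 * Sketch: loading an ALL cache with the join's filter applied up front,
 * so rows that can never satisfy the filter are dropped during the load.
 */
class FilteredAllCacheLoad {
    static final class User {
        final int id;
        final int age;

        User(int id, int age) {
            this.id = id;
            this.age = age;
        }
    }

    static List<User> loadAllCache(List<User> scannedRows, Predicate<User> filter) {
        List<User> cache = new ArrayList<>();
        for (User row : scannedRows) {
            if (filter.test(row)) {
                cache.add(row); // filter applied before the row enters the cache
            }
        }
        return cache;
    }
}
```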
> > > >>>>>>>>>>>>>>>>>>> It would be great to hear other voices regarding
> this
> > > topic!
> > > >>>>>>>>>>>>> Because
> > > >>>>>>>>>>>>>>>>>>> we have quite a lot of controversial points, and I
> > > think
> > > >>>>>>>>>>> with
> > > >>>>>>>>>>>>> the
> > > >>>>>>>>>>>>>>>> help
> > > >>>>>>>>>>>>>>>>>>> of others it will be easier for us to come to a
> > > consensus.
> > > >>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>> Best regards,
> > > >>>>>>>>>>>>>>>>>>> Smirnov Alexander
> > > >>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>> пт, 29 апр. 2022 г. в 22:33, Qingsheng Ren <
> > > >>>>>>>>>>> renqschn@gmail.com
> > > >>>>>>>>>>>>>> :
> > > >>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>> Hi Alexander and Arvid,
> > > >>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>> Thanks for the discussion and sorry for my late
> > > response!
> > > >>>>>>>>>>> We
> > > >>>>>>>>>>>>> had
> > > >>>>>>>>>>>>>>>> an
> > > >>>>>>>>>>>>>>>>>>> internal discussion together with Jark and Leonard
> > and
> > > I’d
> > > >>>>>>>>>>> like
> > > >>>>>>>>>>>>> to
> > > >>>>>>>>>>>>>>>>>>> summarize our ideas. Instead of implementing the
> > cache
> > > >>>>>>>>>>> logic in
> > > >>>>>>>>>>>>> the
> > > >>>>>>>>>>>>>>>>> table
> > > >>>>>>>>>>>>>>>>>>> runtime layer or wrapping around the user-provided
> > > table
> > > >>>>>>>>>>>>> function,
> > > >>>>>>>>>>>>>>>> we
> > > >>>>>>>>>>>>>>>>>>> prefer to introduce some new APIs extending
> > > TableFunction
> > > >>>>>>>>>>> with
> > > >>>>>>>>>>>>> these
> > > >>>>>>>>>>>>>>>>>>> concerns:
> > > >>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>> 1. Caching actually breaks the semantic of "FOR
> > > >>>>>>>>>>> SYSTEM_TIME
> > > >>>>>>>>>>>>> AS OF
> > > >>>>>>>>>>>>>>>>>>> proc_time”, because it couldn’t truly reflect the
> > > content
> > > >>>>>>>>>>> of the
> > > >>>>>>>>>>>>>>>> lookup
> > > >>>>>>>>>>>>>>>>>>> table at the moment of querying. If users choose to
> > > enable
> > > >>>>>>>>>>>>> caching
> > > >>>>>>>>>>>>>>>> on
> > > >>>>>>>>>>>>>>>>> the
> > > >>>>>>>>>>>>>>>>>>> lookup table, they implicitly indicate that this
> > > breakage is
> > > >>>>>>>>>>>>>>>> acceptable
> > > >>>>>>>>>>>>>>>>>> in
> > > >>>>>>>>>>>>>>>>>>> exchange for the performance. So we prefer not to
> > > provide
> > > >>>>>>>>>>>>> caching on
> > > >>>>>>>>>>>>>>>>> the
> > > >>>>>>>>>>>>>>>>>>> table runtime level.
> > > >>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>> 2. If we make the cache implementation in the
> > > framework
> > > >>>>>>>>>>>>> (whether
> > > >>>>>>>>>>>>>>>> in a
> > > >>>>>>>>>>>>>>>>>>> runner or a wrapper around TableFunction), we have
> to
> > > >>>>>>>>>>> confront a
> > > >>>>>>>>>>>>>>>>>> situation
> > > >>>>>>>>>>>>>>>>>>> that allows table options in DDL to control the
> > > behavior of
> > > >>>>>>>>>>> the
> > > >>>>>>>>>>>>>>>>>> framework,
> > > >>>>>>>>>>>>>>>>>>> which has never happened previously and should be
> > > cautious.
> > > >>>>>>>>>>>>> Under
> > > >>>>>>>>>>>>>>>> the
> > > >>>>>>>>>>>>>>>>>>> current design the behavior of the framework should
> > > only be
> > > >>>>>>>>>>>>>>>> specified
> > > >>>>>>>>>>>>>>>>> by
> > > >>>>>>>>>>>>>>>>>>> configurations (“table.exec.xxx”), and it’s hard to
> > > apply
> > > >>>>>>>>>>> these
> > > >>>>>>>>>>>>>>>> general
> > > >>>>>>>>>>>>>>>>>>> configs to a specific table.
> > > >>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>> 3. We have use cases that lookup source loads and
> > > refresh
> > > >>>>>>>>>>> all
> > > >>>>>>>>>>>>>>>> records
> > > >>>>>>>>>>>>>>>>>>> periodically into the memory to achieve high lookup
> > > >>>>>>>>>>> performance
> > > >>>>>>>>>>>>>>>> (like
> > > >>>>>>>>>>>>>>>>>> Hive
> > > >>>>>>>>>>>>>>>>>>> connector in the community, and also widely used by
> > our
> > > >>>>>>>>>>> internal
> > > >>>>>>>>>>>>>>>>>>> connectors). Wrapping the cache around the user’s
> > > >>>>>>>>>>> TableFunction
> > > >>>>>>>>>>>>>>>> works
> > > >>>>>>>>>>>>>>>>>> fine
> > > >>>>>>>>>>>>>>>>>>> for LRU caches, but I think we have to introduce a
> > new
> > > >>>>>>>>>>>>> interface for
> > > >>>>>>>>>>>>>>>>> this
> > > >>>>>>>>>>>>>>>>>>> all-caching scenario and the design would become
> more
> > > >>>>>>>>>>> complex.
> > > >>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>> 4. Providing the cache in the framework might
> > > introduce
> > > >>>>>>>>>>>>>>>> compatibility
> > > >>>>>>>>>>>>>>>>>>> issues to existing lookup sources like there might
> > > exist two
> > > >>>>>>>>>>>>> caches
> > > >>>>>>>>>>>>>>>>> with
> > > >>>>>>>>>>>>>>>>>>> totally different strategies if the user
> incorrectly
> > > >>>>>>>>>>> configures
> > > >>>>>>>>>>>>> the
> > > >>>>>>>>>>>>>>>>> table
> > > >>>>>>>>>>>>>>>>>>> (one in the framework and another implemented by
> the
> > > lookup
> > > >>>>>>>>>>>>> source).
> > > >>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>> As for the optimization mentioned by Alexander, I
> > > think
> > > >>>>>>>>>>>>> filters
> > > >>>>>>>>>>>>>>>> and
> > > >>>>>>>>>>>>>>>>>>> projections should be pushed all the way down to
> the
> > > table
> > > >>>>>>>>>>>>> function,
> > > >>>>>>>>>>>>>>>>> like
> > > >>>>>>>>>>>>>>>>>>> what we do in the scan source, instead of the
> runner
> > > with
> > > >>>>>>>>>>> the
> > > >>>>>>>>>>>>> cache.
> > > >>>>>>>>>>>>>>>>> The
> > > >>>>>>>>>>>>>>>>>>> goal of using cache is to reduce the network I/O
> and
> > > >>>>>>>>>>> pressure
> > > >>>>>>>>>>>>> on the
> > > >>>>>>>>>>>>>>>>>>> external system, and only applying these
> > optimizations
> > > to
> > > >>>>>>>>>>> the
> > > >>>>>>>>>>>>> cache
> > > >>>>>>>>>>>>>>>>> seems
> > > >>>>>>>>>>>>>>>>>>> not quite useful.
> > > >>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>> I made some updates to the FLIP[1] to reflect our
> > > ideas.
> > > >>>>>>>>>>> We
> > > >>>>>>>>>>>>>>>> prefer to
> > > >>>>>>>>>>>>>>>>>>> keep the cache implementation as a part of
> > > TableFunction,
> > > >>>>>>>>>>> and we
> > > >>>>>>>>>>>>>>>> could
> > > >>>>>>>>>>>>>>>>>>> provide some helper classes (CachingTableFunction,
> > > >>>>>>>>>>>>>>>>>> AllCachingTableFunction,
> > > >>>>>>>>>>>>>>>>>>> CachingAsyncTableFunction) to developers and
> regulate
> > > >>>>>>>>>>> metrics
> > > >>>>>>>>>>>>> of the
> > > >>>>>>>>>>>>>>>>>> cache.
> > > >>>>>>>>>>>>>>>>>>> Also, I made a POC[2] for your reference.
> > > >>>>>>>>>>>>>>>>>>>>
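The CachingTableFunction idea mentioned above can be sketched as a generic LRU wrapper around a connector's lookup function, with hit/miss counters standing in for the FLIP's regulated cache metrics (all names here are illustrative assumptions, not the FLIP's actual API).

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/**
 * Sketch of the CachingTableFunction idea: an LRU cache consulted before
 * delegating to the connector's lookup, plus hit/miss counters in place
 * of real cache metrics.
 */
class CachingLookup<K, R> {
    private final Function<K, List<R>> backendLookup;
    private final Map<K, List<R>> cache;
    private long hits;
    private long misses;

    CachingLookup(Function<K, List<R>> backendLookup, int maxRows) {
        this.backendLookup = backendLookup;
        this.cache = new LinkedHashMap<>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, List<R>> eldest) {
                return size() > maxRows; // analogue of a max-rows option
            }
        };
    }

    /** Plays the role of TableFunction#eval: resolve a key, cache first. */
    List<R> lookup(K key) {
        List<R> rows = cache.get(key);
        if (rows != null) {
            hits++;
            return rows;
        }
        misses++;
        rows = backendLookup.apply(key); // go to the external system
        cache.put(key, rows);            // empty results are cached too
        return rows;
    }

    long hitCount() { return hits; }

    long missCount() { return misses; }
}
```

A repeated key is served from the cache, so the external system sees one request per distinct key until eviction.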
> > > >>>>>>>>>>>>>>>>>>>> Looking forward to your ideas!
> > > >>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>> [1]
> > > >>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-221+Abstraction+for+lookup+source+cache+and+metric
> > > >>>>>>>>>>>>>>>>>>>> [2]
> > https://github.com/PatrickRen/flink/tree/FLIP-221
> > > >>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>> Best regards,
> > > >>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>> Qingsheng
> > > >>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>> On Tue, Apr 26, 2022 at 4:45 PM Александр Смирнов
> <
> > > >>>>>>>>>>>>>>>>>> smiralexan@gmail.com>
> > > >>>>>>>>>>>>>>>>>>> wrote:
> > > >>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>> Thanks for the response, Arvid!
> > > >>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>> I have few comments on your message.
> > > >>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>> but could also live with an easier solution as
> the
> > > >>>>>>>>>>> first
> > > >>>>>>>>>>>>> step:
> > > >>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>> I think that these 2 ways are mutually exclusive
> > > >>>>>>>>>>> (originally
> > > >>>>>>>>>>>>>>>>> proposed
> > > >>>>>>>>>>>>>>>>>>>>> by Qingsheng and mine), because conceptually they
> > > follow
> > > >>>>>>>>>>> the
> > > >>>>>>>>>>>>> same
> > > >>>>>>>>>>>>>>>>>>>>> goal, but implementation details are different.
> If
> > we
> > > >>>>>>>>>>> will
> > > >>>>>>>>>>>>> go one
> > > >>>>>>>>>>>>>>>>> way,
> > > >>>>>>>>>>>>>>>>>>>>> moving to another way in the future will mean
> > > deleting
> > > >>>>>>>>>>>>> existing
> > > >>>>>>>>>>>>>>>> code
> > > >>>>>>>>>>>>>>>>>>>>> and once again changing the API for connectors.
> So
> > I
> > > >>>>>>>>>>> think we
> > > >>>>>>>>>>>>>>>> should
> > > >>>>>>>>>>>>>>>>>>>>> reach a consensus with the community about that
> and
> > > then
> > > >>>>>>>>>>> work
> > > >>>>>>>>>>>>>>>>> together
> > > >>>>>>>>>>>>>>>>>>>>> on this FLIP, i.e. divide the work on tasks for
> > > different
> > > >>>>>>>>>>>>> parts
> > > >>>>>>>>>>>>>>>> of
> > > >>>>>>>>>>>>>>>>> the
> > > >>>>>>>>>>>>>>>>>>>>> flip (for example, LRU cache unification /
> > > introducing
> > > >>>>>>>>>>>>> proposed
> > > >>>>>>>>>>>>>>>> set
> > > >>>>>>>>>>>>>>>>> of
> > > >>>>>>>>>>>>>>>>>>>>> metrics / further work…). WDYT, Qingsheng?
> > > >>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>> as the source will only receive the requests
> after
> > > >>>>>>>>>>> filter
> > > >>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>> Actually if filters are applied to fields of the
> > > lookup
> > > >>>>>>>>>>>>> table, we
> > > >>>>>>>>>>>>>>>>>>>>> must first do requests, and only after that we
> > can
> > > >>>>>>>>>>> filter
> > > >>>>>>>>>>>>>>>>> responses,
> > > >>>>>>>>>>>>>>>>>>>>> because lookup connectors don't have filter
> > > pushdown. So
> > > >>>>>>>>>>> if
> > > >>>>>>>>>>>>>>>>> filtering
> > > >>>>>>>>>>>>>>>>>>>>> is done before caching, there will be much less
> > rows
> > > in
> > > >>>>>>>>>>>>> cache.
> > > >>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>> @Alexander unfortunately, your architecture is
> not
> > > >>>>>>>>>>> shared.
> > > >>>>>>>>>>>>> I
> > > >>>>>>>>>>>>>>>> don't
> > > >>>>>>>>>>>>>>>>>>> know the
> > > >>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>> solution to share images to be honest.
> > > >>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>> Sorry for that, I’m a bit new to such kinds of
> > > >>>>>>>>>>> conversations
> > > >>>>>>>>>>>>> :)
> > > >>>>>>>>>>>>>>>>>>>>> I have no write access to the confluence, so I
> > made a
> > > >>>>>>>>>>> Jira
> > > >>>>>>>>>>>>> issue,
> > > >>>>>>>>>>>>>>>>>>>>> where described the proposed changes in more
> > details
> > > -
> > > >>>>>>>>>>>>>>>>>>>>>
> https://issues.apache.org/jira/browse/FLINK-27411.
> > > >>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>> Will be happy to get more feedback!
> > > >>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>> Best,
> > > >>>>>>>>>>>>>>>>>>>>> Smirnov Alexander
> > > >>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>> пн, 25 апр. 2022 г. в 19:49, Arvid Heise <
> > > >>>>>>>>>>> arvid@apache.org>:
> > > >>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>> Hi Qingsheng,
> > > >>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>> Thanks for driving this; the inconsistency was
> not
> > > >>>>>>>>>>>>> satisfying
> > > >>>>>>>>>>>>>>>> for
> > > >>>>>>>>>>>>>>>>>> me.
> > > >>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>> I second Alexander's idea though but could also
> > live
> > > >>>>>>>>>>> with
> > > >>>>>>>>>>>>> an
> > > >>>>>>>>>>>>>>>>> easier
> > > >>>>>>>>>>>>>>>>>>>>>> solution as the first step: Instead of making
> > > caching
> > > >>>>>>>>>>> an
> > > >>>>>>>>>>>>>>>>>>> implementation
> > > >>>>>>>>>>>>>>>>>>>>>> detail of TableFunction X, rather devise a
> caching
> > > >>>>>>>>>>> layer
> > > >>>>>>>>>>>>>>>> around X.
> > > >>>>>>>>>>>>>>>>>> So
> > > >>>>>>>>>>>>>>>>>>> the
> > > >>>>>>>>>>>>>>>>>>>>>> proposal would be a CachingTableFunction that
> > > >>>>>>>>>>> delegates to
> > > >>>>>>>>>>>>> X in
> > > >>>>>>>>>>>>>>>>> case
> > > >>>>>>>>>>>>>>>>>>> of
> > > >>>>>>>>>>>>>>>>>>>>>> misses and else manages the cache. Lifting it
> into
> > > the
> > > >>>>>>>>>>>>> operator
> > > >>>>>>>>>>>>>>>>>> model
> > > >>>>>>>>>>>>>>>>>>> as
> > > >>>>>>>>>>>>>>>>>>>>>> proposed would be even better but is probably
> > > >>>>>>>>>>> unnecessary
> > > >>>>>>>>>>>>> in
> > > >>>>>>>>>>>>>>>> the
> > > >>>>>>>>>>>>>>>>>>> first step
> > > >>>>>>>>>>>>>>>>>>>>>> for a lookup source (as the source will only
> > receive
> > > >>>>>>>>>>> the
> > > >>>>>>>>>>>>>>>> requests
> > > >>>>>>>>>>>>>>>>>>> after
> > > >>>>>>>>>>>>>>>>>>>>>> filter; applying projection may be more
> > interesting
> > > to
> > > >>>>>>>>>>> save
> > > >>>>>>>>>>>>>>>>> memory).
> > > >>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>> Another advantage is that all the changes of
> this
> > > FLIP
> > > >>>>>>>>>>>>> would be
> > > >>>>>>>>>>>>>>>>>>> limited to
> > > >>>>>>>>>>>>>>>>>>>>>> options, no need for new public interfaces.
> > > Everything
> > > >>>>>>>>>>> else
> > > >>>>>>>>>>>>>>>>> remains
> > > >>>>>>>>>>>>>>>>>> an
> > > >>>>>>>>>>>>>>>>>>>>>> implementation of Table runtime. That means we
> can
> > > >>>>>>>>>>> easily
> > > >>>>>>>>>>>>>>>>>> incorporate
> > > >>>>>>>>>>>>>>>>>>> the
> > > >>>>>>>>>>>>>>>>>>>>>> optimization potential that Alexander pointed
> out
> > > >>>>>>>>>>> later.
> > > >>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>> @Alexander unfortunately, your architecture is
> not
> > > >>>>>>>>>>> shared.
> > > >>>>>>>>>>>>> I
> > > >>>>>>>>>>>>>>>> don't
> > > >>>>>>>>>>>>>>>>>>> know the
> > > >>>>>>>>>>>>>>>>>>>>>> solution to share images to be honest.
> > > >>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>> On Fri, Apr 22, 2022 at 5:04 PM Александр
> Смирнов
> > <
> > > >>>>>>>>>>>>>>>>>>> smiralexan@gmail.com>
> > > >>>>>>>>>>>>>>>>>>>>>> wrote:
> > > >>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>> Hi Qingsheng! My name is Alexander, I'm not a
> > > >>>>>>>>>>> committer
> > > >>>>>>>>>>>>> yet,
> > > >>>>>>>>>>>>>>>> but
> > > >>>>>>>>>>>>>>>>>> I'd
> > > >>>>>>>>>>>>>>>>>>>>>>> really like to become one. And this FLIP really
> > > >>>>>>>>>>>>> interested
> > > >>>>>>>>>>>>>>>> me.
> > > >>>>>>>>>>>>>>>>>>>>>>> Actually I have worked on a similar feature in
> my
> > > >>>>>>>>>>>>> company’s
> > > >>>>>>>>>>>>>>>>> Flink
> > > >>>>>>>>>>>>>>>>>>>>>>> fork, and we would like to share our thoughts
> on
> > > >>>>>>>>>>> this and
> > > >>>>>>>>>>>>>>>> make
> > > >>>>>>>>>>>>>>>>>> code
> > > >>>>>>>>>>>>>>>>>>>>>>> open source.
> > > >>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>> I think there is a better alternative than
> > > >>>>>>>>>>> introducing an
> > > >>>>>>>>>>>>>>>>> abstract
> > > >>>>>>>>>>>>>>>>>>>>>>> class for TableFunction (CachingTableFunction).
> > As
> > > >>>>>>>>>>> you
> > > >>>>>>>>>>>>> know,
> > > >>>>>>>>>>>>>>>>>>>>>>> TableFunction exists in the flink-table-common
> > > >>>>>>>>>>> module,
> > > >>>>>>>>>>>>> which
> > > >>>>>>>>>>>>>>>>>>> provides
> > > >>>>>>>>>>>>>>>>>>>>>>> only an API for working with tables – it’s very
> > > >>>>>>>>>>>>> convenient
> > > >>>>>>>>>>>>>>>> for
> > > >>>>>>>>>>>>>>>>>>> importing
> > > >>>>>>>>>>>>>>>>>>>>>>> in connectors. In turn, CachingTableFunction
> > > contains
> > > >>>>>>>>>>>>> logic
> > > >>>>>>>>>>>>>>>> for
> > > >>>>>>>>>>>>>>>>>>>>>>> runtime execution,  so this class and
> everything
> > > >>>>>>>>>>>>> connected
> > > >>>>>>>>>>>>>>>> with
> > > >>>>>>>>>>>>>>>>> it
> > > >>>>>>>>>>>>>>>>>>>>>>> should be located in another module, probably
> in
> > > >>>>>>>>>>>>>>>>>>> flink-table-runtime.
> > > >>>>>>>>>>>>>>>>>>>>>>> But this will require connectors to depend on
> > > another
> > > >>>>>>>>>>>>> module,
> > > >>>>>>>>>>>>>>>>>> which
> > > >>>>>>>>>>>>>>>>>>>>>>> contains a lot of runtime logic, which doesn’t
> > > sound
> > > >>>>>>>>>>>>> good.
> > > >>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>> I suggest adding a new method ‘getLookupConfig’
> > to
> > > >>>>>>>>>>>>>>>>>> LookupTableSource
> > > >>>>>>>>>>>>>>>>>>>>>>> or LookupRuntimeProvider to allow connectors to
> > > only
> > > >>>>>>>>>>> pass
> > > >>>>>>>>>>>>>>>>>>>>>>> configurations to the planner, therefore they
> > won’t
> > > >>>>>>>>>>>>> depend on
> > > >>>>>>>>>>>>>>>>>>> runtime
> > > >>>>>>>>>>>>>>>>>>>>>>> realization. Based on these configs planner
> will
> > > >>>>>>>>>>>>> construct a
> > > >>>>>>>>>>>>>>>>>> lookup
> > > >>>>>>>>>>>>>>>>>>>>>>> join operator with corresponding runtime logic
> > > >>>>>>>>>>>>>>>> (ProcessFunctions
> > > >>>>>>>>>>>>>>>>>> in
> > > >>>>>>>>>>>>>>>>>>>>>>> module flink-table-runtime). Architecture looks
> > > like
> > > >>>>>>>>>>> in
> > > >>>>>>>>>>>>> the
> > > >>>>>>>>>>>>>>>>> pinned
> > > >>>>>>>>>>>>>>>>>>>>>>> image (LookupConfig class there is actually
> yours
> > > >>>>>>>>>>>>>>>> CacheConfig).
> > > >>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>> Classes in flink-table-planner, that will be
> > > >>>>>>>>>>> responsible
> > > >>>>>>>>>>>>> for
> > > >>>>>>>>>>>>>>>>> this
> > > >>>>>>>>>>>>>>>>>> –
> > > >>>>>>>>>>>>>>>>>>>>>>> CommonPhysicalLookupJoin and his inheritors.
> > > >>>>>>>>>>>>>>>>>>>>>>> Current classes for lookup join in
> > > >>>>>>>>>>> flink-table-runtime
> > > >>>>>>>>>>>>> -
> > > >>>>>>>>>>>>>>>>>>>>>>> LookupJoinRunner, AsyncLookupJoinRunner,
> > > >>>>>>>>>>>>>>>>> LookupJoinRunnerWithCalc,
> > > >>>>>>>>>>>>>>>>>>>>>>> AsyncLookupJoinRunnerWithCalc.
> > > >>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>> I suggest adding classes
> LookupJoinCachingRunner,
> > > >>>>>>>>>>>>>>>>>>>>>>> LookupJoinCachingRunnerWithCalc, etc.
> > > >>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>> And here comes another more powerful advantage
> of
> > > >>>>>>>>>>> such a
> > > >>>>>>>>>>>>>>>>> solution.
> > > >>>>>>>>>>>>>>>>>>> If
> > > >>>>>>>>>>>>>>>>>>>>>>> we have caching logic on a lower level, we can
> > > apply
> > > >>>>>>>>>>> some
> > > >>>>>>>>>>>>>>>>>>>>>>> optimizations to it. LookupJoinRunnerWithCalc
> was
> > > >>>>>>>>>>> named
> > > >>>>>>>>>>>>> like
> > > >>>>>>>>>>>>>>>>> this
> > > >>>>>>>>>>>>>>>>>>>>>>> because it uses the ‘calc’ function, which
> > actually
> > > >>>>>>>>>>>>> mostly
> > > >>>>>>>>>>>>>>>>>> consists
> > > >>>>>>>>>>>>>>>>>>> of
> > > >>>>>>>>>>>>>>>>>>>>>>> filters and projections.
> > > >>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>> For example, in join table A with lookup table
> B
> > > >>>>>>>>>>>>> condition
> > > >>>>>>>>>>>>>>>>> ‘JOIN …
> > > >>>>>>>>>>>>>>>>>>> ON
> > > >>>>>>>>>>>>>>>>>>>>>>> A.id = B.id AND A.age = B.age + 10 WHERE
> > B.salary >
> > > >>>>>>>>>>> 1000’
> > > >>>>>>>>>>>>>>>>> ‘calc’
> > > >>>>>>>>>>>>>>>>>>>>>>> function will contain filters A.age = B.age +
> 10
> > > and
> > > >>>>>>>>>>>>>>>> B.salary >
> > > >>>>>>>>>>>>>>>>>>> 1000.
> > > >>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>> If we apply this function before storing
> records
> > in
> > > >>>>>>>>>>>>> cache,
> > > >>>>>>>>>>>>>>>> size
> > > >>>>>>>>>>>>>>>>> of
> > > >>>>>>>>>>>>>>>>>>>>>>> cache will be significantly reduced: filters =
> > > avoid
> > > >>>>>>>>>>>>> storing
> > > >>>>>>>>>>>>>>>>>> useless
> > > >>>>>>>>>>>>>>>>>>>>>>> records in cache, projections = reduce records’
> > > >>>>>>>>>>> size. So
> > > >>>>>>>>>>>>> the
> > > >>>>>>>>>>>>>>>>>> initial
> > > >>>>>>>>>>>>>>>>>>>>>>> max number of records in cache can be increased
> > by
> > > >>>>>>>>>>> the
> > > >>>>>>>>>>>>> user.
> > > >>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>> What do you think about it?
> > > >>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>> On 2022/04/19 02:47:11 Qingsheng Ren wrote:
> > > >>>>>>>>>>>>>>>>>>>>>>>> Hi devs,
> > > >>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>> Yuan and I would like to start a discussion
> > about
> > > >>>>>>>>>>>>>>>> FLIP-221[1],
> > > >>>>>>>>>>>>>>>>>>> which
> > > >>>>>>>>>>>>>>>>>>>>>>> introduces an abstraction of lookup table cache
> > and
> > > >>>>>>>>>>> its
> > > >>>>>>>>>>>>>>>> standard
> > > >>>>>>>>>>>>>>>>>>> metrics.
> > > >>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>> Currently each lookup table source should
> > > implement
> > > >>>>>>>>>>>>> their
> > > >>>>>>>>>>>>>>>> own
> > > >>>>>>>>>>>>>>>>>>> cache to
> > > >>>>>>>>>>>>>>>>>>>>>>> store lookup results, and there isn’t a
> standard
> > of
> > > >>>>>>>>>>>>> metrics
> > > >>>>>>>>>>>>>>>> for
> > > >>>>>>>>>>>>>>>>>>> users and
> > > >>>>>>>>>>>>>>>>>>>>>>> developers to tuning their jobs with lookup
> > joins,
> > > >>>>>>>>>>> which
> > > >>>>>>>>>>>>> is a
> > > >>>>>>>>>>>>>>>>>> quite
> > > >>>>>>>>>>>>>>>>>>> common
> > > >>>>>>>>>>>>>>>>>>>>>>> use case in Flink table / SQL.
> > > >>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>> Therefore we propose some new APIs including
> > > cache,
> > > >>>>>>>>>>>>>>>> metrics,
> > > >>>>>>>>>>>>>>>>>>> wrapper
> > > >>>>>>>>>>>>>>>>>>>>>>> classes of TableFunction and new table options.
> > > >>>>>>>>>>> Please
> > > >>>>>>>>>>>>> take a
> > > >>>>>>>>>>>>>>>>> look
> > > >>>>>>>>>>>>>>>>>>> at the
> > > >>>>>>>>>>>>>>>>>>>>>>> FLIP page [1] to get more details. Any
> > suggestions
> > > >>>>>>>>>>> and
> > > >>>>>>>>>>>>>>>> comments
> > > >>>>>>>>>>>>>>>>>>> would be
> > > >>>>>>>>>>>>>>>>>>>>>>> appreciated!
> > > >>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>> [1]
> > > >>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-221+Abstraction+for+lookup+source+cache+and+metric
> > > >>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>> Best regards,
> > > >>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>> Qingsheng
> > > >>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>> --
> > > >>>>>>>>>>>>>>>>>>>> Best Regards,
> > > >>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>> Qingsheng Ren
> > > >>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>> Real-time Computing Team
> > > >>>>>>>>>>>>>>>>>>>> Alibaba Cloud
> > > >>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>> Email: renqschn@gmail.com
> > > >>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>> --
> > > >>>>>>>>>>>>>>> Best regards,
> > > >>>>>>>>>>>>>>> Roman Boyko
> > > >>>>>>>>>>>>>>> e.: ro.v.boyko@gmail.com
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>
> > > >>>>>>>>>>>
> > > >>>>>>>>>
> > > >>>>>>>>
> > > >>>>>>>>
> > > >>>>>>>> --
> > > >>>>>>>> Best Regards,
> > > >>>>>>>>
> > > >>>>>>>> Qingsheng Ren
> > > >>>>>>>>
> > > >>>>>>>> Real-time Computing Team
> > > >>>>>>>> Alibaba Cloud
> > > >>>>>>>>
> > > >>>>>>>> Email: renqschn@gmail.com
> > > >>>>>>
> > >
> > >
> >
>

Re: [DISCUSS] FLIP-221 Abstraction for lookup source cache and metric

Posted by Jark Wu <im...@gmail.com>.
Hi Lincoln,

I have taken a look at FLIP-234, and I agree with you that the connectors
can
provide both async and sync runtime providers simultaneously instead of one
of them.
At that point, "lookup.async" looks redundant. If this option is planned to
be removed
in the long term, I think it makes sense not to introduce it in this FLIP.

Best,
Jark

On Tue, 24 May 2022 at 11:08, Lincoln Lee <li...@gmail.com> wrote:

> Hi Qingsheng,
>
> Sorry for jumping into the discussion so late. It's a good idea that we can
> have a common table option. I have a minor comment on 'lookup.async': I
> suggest not making it a common option:
>
> The table layer abstracts both sync and async lookup capabilities, and
> connector implementers can choose one or both. In the case of implementing
> only one capability (the status of most existing built-in connectors),
> 'lookup.async' will not be used. And when a connector has both
> capabilities, I think this choice is better made at
> the query level: for example, the table planner can choose the physical
> implementation of async or sync lookup based on its cost model, or
> users can give a query hint based on their own better understanding. If
> there were another common table option 'lookup.async', it could confuse
> users in the long run.
>
> So, I prefer to leave 'lookup.async' as a private option (for the
> current HBase connector) and not turn it into a common option.
>
> WDYT?
>
> Best,
> Lincoln Lee
>
>
> Qingsheng Ren <re...@gmail.com> 于2022年5月23日周一 14:54写道:
>
> > Hi Alexander,
> >
> > Thanks for the review! We recently updated the FLIP and you can find
> those
> > changes from my latest email. Since some terminology has changed, I'll
> > use the new concepts when replying to your comments.
> >
> > 1. Builder vs ‘of’
> > I’m OK to use builder pattern if we have additional optional parameters
> > for full caching mode (“rescan” previously). The schedule-with-delay idea
> > looks reasonable to me, but I think we need to redesign the builder API
> of
> > full caching to make it more descriptive for developers. Would you mind
> > sharing your ideas about the API? For accessing the FLIP workspace you
> can
> > just provide your account ID and ping any PMC member including Jark.
> >
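[Editor's note: to make the builder discussion in point 1 concrete, here is a minimal sketch in plain Java. All names here — `FullCachingConfig`, `rescanInterval`, `rescanStartTime` — are hypothetical, not the FLIP's final interfaces.]

```java
import java.time.Duration;
import java.time.LocalTime;
import java.util.Optional;

// Hypothetical sketch of a builder-style config holder for the full caching
// ("rescan") mode. Names are illustrative only, not the FLIP's actual API.
public class FullCachingConfig {
    private final Duration rescanInterval;
    private final LocalTime rescanStartTime; // optional first-reload time in UTC

    private FullCachingConfig(Duration rescanInterval, LocalTime rescanStartTime) {
        this.rescanInterval = rescanInterval;
        this.rescanStartTime = rescanStartTime;
    }

    public Duration getRescanInterval() {
        return rescanInterval;
    }

    public Optional<LocalTime> getRescanStartTime() {
        return Optional.ofNullable(rescanStartTime);
    }

    public static Builder builder() {
        return new Builder();
    }

    public static class Builder {
        // As discussed in the thread: without an explicit interval, default to one day.
        private Duration rescanInterval = Duration.ofDays(1);
        private LocalTime rescanStartTime;

        public Builder rescanInterval(Duration interval) {
            this.rescanInterval = interval;
            return this;
        }

        public Builder rescanStartTime(LocalTime startTime) {
            this.rescanStartTime = startTime;
            return this;
        }

        public FullCachingConfig build() {
            return new FullCachingConfig(rescanInterval, rescanStartTime);
        }
    }
}
```

The advantage over static 'of' methods is that new optional parameters can be added later without breaking existing callers.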
> > 2. Common table options
> > We have had some discussions these days and propose to introduce 8 common
> > table options for caching. The FLIP has been updated accordingly.
> >
> > 3. Retries
> > I think we are on the same page :-)
> >
> > For your additional concerns:
> > 1) The table option has been updated.
> > 2) We got “lookup.cache” back for configuring whether to use partial or
> > full caching mode.
> >
> > Best regards,
> >
> > Qingsheng
> >
> >
> >
> > > On May 19, 2022, at 17:25, Александр Смирнов <sm...@gmail.com>
> > wrote:
> > >
> > > Also I have a few additions:
> > > 1) maybe rename 'lookup.cache.maximum-size' to
> > > 'lookup.cache.max-rows'? I think it would be clearer that we are
> > > talking about the number of rows, not bytes. Plus it fits better,
> > > considering my optimization with filters.
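[Editor's note: a minimal plain-Java sketch of an LRU cache bounded by row count, illustrating the semantics behind a 'lookup.cache.max-rows'-style option. This is just a `LinkedHashMap` sketch, not the FLIP's cache implementation.]

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Minimal sketch of an LRU cache bounded by the number of entries (rows).
// Illustrative only; not the proposed LookupCache implementation.
public class LruRowCache<K, V> extends LinkedHashMap<K, V> {
    private final int maxRows;

    public LruRowCache(int maxRows) {
        super(16, 0.75f, true); // access-order = true gives LRU iteration order
        this.maxRows = maxRows;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // Evict the least-recently-used entry once the row count is exceeded.
        return size() > maxRows;
    }
}
```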
> > > 2) How will users enable rescanning? Are we going to separate caching
> > > and rescanning from the options point of view? Like initially we had
> > > one option 'lookup.cache' with values LRU / ALL. I think now we can
> > > make a boolean option 'lookup.rescan'. RescanInterval can be
> > > 'lookup.rescan.interval', etc.
> > >
> > > Best regards,
> > > Alexander
> > >
> > > чт, 19 мая 2022 г. в 14:50, Александр Смирнов <sm...@gmail.com>:
> > >>
> > >> Hi Qingsheng and Jark,
> > >>
> > >> 1. Builders vs 'of'
> > >> I understand that builders are used when we have multiple parameters.
> > >> I suggested them because we could add parameters later. To prevent
> > >> Builder for ScanRuntimeProvider from looking redundant I can suggest
> > >> one more config now - "rescanStartTime".
> > >> It's a time in UTC (LocalTime class) when the first reload of cache
> > >> starts. This parameter can be thought of as 'initialDelay' (diff
> > >> between current time and rescanStartTime) in method
> > >> ScheduleExecutorService#scheduleWithFixedDelay [1] . It can be very
> > >> useful when the dimension table is updated by some other scheduled job
> > >> at a certain time. Or when the user simply wants a second scan (first
> > >> cache reload) be delayed. This option can be used even without
> > >> 'rescanInterval' - in this case 'rescanInterval' will be one day.
> > >> If you are fine with this option, I would be very glad if you would
> > >> give me access to edit FLIP page, so I could add it myself
> > >>
> > >> 2. Common table options
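[Editor's note: the 'rescanStartTime' idea above boils down to computing the 'initialDelay' argument of `ScheduledExecutorService#scheduleWithFixedDelay` [1]. A sketch under that assumption; the class and method names are hypothetical.]

```java
import java.time.Duration;
import java.time.LocalTime;

// Hypothetical helper showing how 'rescanStartTime' could be turned into the
// initial delay before the first cache reload. If the configured start time
// has already passed today, the first reload is scheduled for tomorrow.
public class RescanDelay {
    public static Duration computeInitialDelay(LocalTime nowUtc, LocalTime rescanStartTime) {
        Duration delay = Duration.between(nowUtc, rescanStartTime);
        if (delay.isNegative()) {
            delay = delay.plusDays(1); // wrap around to the next day
        }
        return delay;
    }
}
```

The result would then feed into something like `executor.scheduleWithFixedDelay(reloadTask, delay.toMillis(), rescanInterval.toMillis(), TimeUnit.MILLISECONDS)`.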
> > >> I also think that FactoryUtil would be overloaded by all cache
> > >> options. But maybe unify all suggested options, not only for default
> > >> cache? I.e. a 'LookupOptions' class that unifies default cache options,
> > >> rescan options, 'async', 'maxRetries'. WDYT?
> > >>
> > >> 3. Retries
> > >> I'm fine with suggestion close to RetryUtils#tryTimes(times, call)
> > >>
> > >> [1]
> >
> https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/ScheduledExecutorService.html#scheduleWithFixedDelay-java.lang.Runnable-long-long-java.util.concurrent.TimeUnit-
> > >>
> > >> Best regards,
> > >> Alexander
> > >>
> > >> ср, 18 мая 2022 г. в 16:04, Qingsheng Ren <re...@gmail.com>:
> > >>>
> > >>> Hi Jark and Alexander,
> > >>>
> > >>> Thanks for your comments! I’m also OK to introduce common table
> > options. I prefer to introduce a new DefaultLookupCacheOptions class for
> > holding these option definitions because putting all options into
> > FactoryUtil would make it a bit ”crowded” and not well categorized.
> > >>>
> > >>> FLIP has been updated according to suggestions above:
> > >>> 1. Use static “of” method for constructing RescanRuntimeProvider
> > considering both arguments are required.
> > >>> 2. Introduce new table options matching DefaultLookupCacheFactory
> > >>>
> > >>> Best,
> > >>> Qingsheng
> > >>>
> > >>> On Wed, May 18, 2022 at 2:57 PM Jark Wu <im...@gmail.com> wrote:
> > >>>>
> > >>>> Hi Alex,
> > >>>>
> > >>>> 1) retry logic
> > >>>> I think we can extract some common retry logic into utilities, e.g.
> > RetryUtils#tryTimes(times, call).
> > >>>> This seems independent of this FLIP and can be reused by DataStream
> > users.
> > >>>> Maybe we can open an issue to discuss this and where to put it.
> > >>>>
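[Editor's note: to make the proposed `RetryUtils#tryTimes(times, call)` concrete, a minimal sketch could look like the following. The class is hypothetical, not an existing Flink utility.]

```java
import java.util.concurrent.Callable;

// Sketch of the proposed RetryUtils#tryTimes(times, call): attempt the
// callable up to 'times' times and rethrow the last failure if all attempts
// fail. Hypothetical code, not an existing Flink class.
public class RetryUtils {
    public static <T> T tryTimes(int times, Callable<T> call) throws Exception {
        if (times <= 0) {
            throw new IllegalArgumentException("times must be positive");
        }
        Exception last = null;
        for (int attempt = 0; attempt < times; attempt++) {
            try {
                return call.call();
            } catch (Exception e) {
                last = e; // remember the failure; custom back-off could go here
            }
        }
        throw last; // all attempts failed
    }
}
```

A connector could still run custom recovery logic (e.g. re-establishing a connection, as mentioned for JdbcRowDataLookupFunction) inside the callable before each retry.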
> > >>>> 2) cache ConfigOptions
> > >>>> I'm fine with defining cache config options in the framework.
> > >>>> A candidate place to put is FactoryUtil which also includes
> > "sink.parallelism", "format" options.
> > >>>>
> > >>>> Best,
> > >>>> Jark
> > >>>>
> > >>>>
> > >>>> On Wed, 18 May 2022 at 13:52, Александр Смирнов <
> smiralexan@gmail.com>
> > wrote:
> > >>>>>
> > >>>>> Hi Qingsheng,
> > >>>>>
> > >>>>> Thank you for considering my comments.
> > >>>>>
> > >>>>>> there might be custom logic before making retry, such as
> > re-establish the connection
> > >>>>>
> > >>>>> Yes, I understand that. I meant that such logic can be placed in a
> > >>>>> separate function, that can be implemented by connectors. Just
> moving
> > >>>>> the retry logic would make connector's LookupFunction more concise
> +
> > >>>>> avoid duplicate code. However, it's a minor change. The decision is
> > up
> > >>>>> to you.
> > >>>>>
> > >>>>>> We decided not to provide common DDL options and to let developers
> > >>>>>> define their own options per connector, as we do now.
> > >>>>>
> > >>>>> What is the reason for that? One of the main goals of this FLIP was
> > to
> > >>>>> unify the configs, wasn't it? I understand that current cache
> design
> > >>>>> doesn't depend on ConfigOptions, as it did before. But we can still
> > >>>>> put these options into the framework, so connectors can reuse them and
> > >>>>> avoid code duplication and, more significantly, avoid inconsistent
> > >>>>> option naming. This point can be highlighted in the
> > >>>>> documentation for connector developers.
> > >>>>>
> > >>>>> Best regards,
> > >>>>> Alexander
> > >>>>>
> > >>>>> вт, 17 мая 2022 г. в 17:11, Qingsheng Ren <re...@gmail.com>:
> > >>>>>>
> > >>>>>> Hi Alexander,
> > >>>>>>
> > >>>>>> Thanks for the review and glad to see we are on the same page! I
> > think you forgot to cc the dev mailing list so I’m also quoting your
> reply
> > under this email.
> > >>>>>>
> > >>>>>>> We can add 'maxRetryTimes' option into this class
> > >>>>>>
> > >>>>>> In my opinion the retry logic should be implemented in lookup()
> > instead of in LookupFunction#eval(). Retrying is only meaningful under
> some
> > specific retriable failures, and there might be custom logic before
> making
> > retry, such as re-establish the connection (JdbcRowDataLookupFunction is
> an
> > example), so it's more handy to leave it to the connector.
> > >>>>>>
> > >>>>>>> I don't see DDL options, that were in previous version of FLIP.
> Do
> > you have any special plans for them?
> > >>>>>>
> > >>>>>> We decided not to provide common DDL options and to let developers
> > >>>>>> define their own options per connector, as we do now.
> > >>>>>>
> > >>>>>> The rest of the comments sound great and I’ll update the FLIP. Hope we
> > can finalize our proposal soon!
> > >>>>>>
> > >>>>>> Best,
> > >>>>>>
> > >>>>>> Qingsheng
> > >>>>>>
> > >>>>>>
> > >>>>>>> On May 17, 2022, at 13:46, Александр Смирнов <
> smiralexan@gmail.com>
> > wrote:
> > >>>>>>>
> > >>>>>>> Hi Qingsheng and devs!
> > >>>>>>>
> > >>>>>>> I like the overall design of updated FLIP, however I have several
> > >>>>>>> suggestions and questions.
> > >>>>>>>
> > >>>>>>> 1) Introducing LookupFunction as a subclass of TableFunction is a
> > >>>>>>> good idea. We can add a 'maxRetryTimes' option to this class. The
> > >>>>>>> 'eval' method of the new LookupFunction is great for this purpose.
> > >>>>>>> The same goes for the 'async' case.
> > >>>>>>>
> > >>>>>>> 2) There might be other configs in future, such as
> > 'cacheMissingKey'
> > >>>>>>> in LookupFunctionProvider or 'rescanInterval' in
> > ScanRuntimeProvider.
> > >>>>>>> Maybe use Builder pattern in LookupFunctionProvider and
> > >>>>>>> RescanRuntimeProvider for more flexibility (use one 'build'
> method
> > >>>>>>> instead of many 'of' methods in future)?
> > >>>>>>>
> > >>>>>>> 3) What are the plans for existing TableFunctionProvider and
> > >>>>>>> AsyncTableFunctionProvider? I think they should be deprecated.
> > >>>>>>>
> > >>>>>>> 4) Am I right that the current design does not assume usage of
> > >>>>>>> user-provided LookupCache in re-scanning? In this case, it is not
> > >>>>>>> very clear why we need methods such as 'invalidate' or 'putAll' in
> > >>>>>>> LookupCache.
> > >>>>>>>
> > >>>>>>> 5) I don't see DDL options, that were in previous version of
> FLIP.
> > Do
> > >>>>>>> you have any special plans for them?
> > >>>>>>>
> > >>>>>>> If you don't mind, I would be glad to be able to make small
> > >>>>>>> adjustments to the FLIP document too. I think it's worth mentioning
> > >>>>>>> exactly what optimizations are planned for the future.
> > >>>>>>>
> > >>>>>>> Best regards,
> > >>>>>>> Smirnov Alexander
> > >>>>>>>
> > >>>>>>> пт, 13 мая 2022 г. в 20:27, Qingsheng Ren <re...@gmail.com>:
> > >>>>>>>>
> > >>>>>>>> Hi Alexander and devs,
> > >>>>>>>>
> > >>>>>>>> Thank you very much for the in-depth discussion! As Jark
> > mentioned we were inspired by Alexander's idea and made a refactor on our
> > design. FLIP-221 [1] has been updated to reflect our design now and we
> are
> > happy to hear more suggestions from you!
> > >>>>>>>>
> > >>>>>>>> Compared to the previous design:
> > >>>>>>>> 1. The lookup cache serves at table runtime level and is
> > integrated as a component of LookupJoinRunner as discussed previously.
> > >>>>>>>> 2. Interfaces are renamed and re-designed to reflect the new
> > design.
> > >>>>>>>> 3. We separate the all-caching case individually and introduce a
> > new RescanRuntimeProvider to reuse the ability of scanning. We are
> planning
> > to support SourceFunction / InputFormat for now considering the
> complexity
> > of FLIP-27 Source API.
> > >>>>>>>> 4. A new interface LookupFunction is introduced to make the
> > semantic of lookup more straightforward for developers.
> > >>>>>>>>
> > >>>>>>>> For replying to Alexander:
> > >>>>>>>>> However I'm a little confused whether InputFormat is deprecated
> > or not. Am I right that it will be so in the future, but currently it's
> not?
> > >>>>>>>> Yes you are right. InputFormat is not deprecated for now. I
> think
> > it will be deprecated in the future but we don't have a clear plan for
> that.
> > >>>>>>>>
> > >>>>>>>> Thanks again for the discussion on this FLIP and looking forward
> > to cooperating with you after we finalize the design and interfaces!
> > >>>>>>>>
> > >>>>>>>> [1]
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-221+Abstraction+for+lookup+source+cache+and+metric
> > >>>>>>>>
> > >>>>>>>> Best regards,
> > >>>>>>>>
> > >>>>>>>> Qingsheng
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>> On Fri, May 13, 2022 at 12:12 AM Александр Смирнов <
> > smiralexan@gmail.com> wrote:
> > >>>>>>>>>
> > >>>>>>>>> Hi Jark, Qingsheng and Leonard!
> > >>>>>>>>>
> > >>>>>>>>> Glad to see that we came to a consensus on almost all points!
> > >>>>>>>>>
> > >>>>>>>>> However I'm a little confused whether InputFormat is deprecated or
> > >>>>>>>>> not. Am I right that it will be in the future, but currently it's
> > >>>>>>>>> not? Actually I also think that for the first version it's OK to use
> > >>>>>>>>> InputFormat in the ALL cache implementation, because supporting the
> > >>>>>>>>> rescan ability seems like a very distant prospect. But for this
> > >>>>>>>>> decision we need a consensus among all discussion participants.
> > >>>>>>>>>
> > >>>>>>>>> In general, I don't have anything to argue with in your statements.
> > >>>>>>>>> All of them correspond to my ideas. Looking ahead, it would be nice
> > >>>>>>>>> to work on this FLIP cooperatively. I've already done a lot of work
> > >>>>>>>>> on lookup join caching, with an implementation very close to the one
> > >>>>>>>>> we are discussing, and want to share the results of this work.
> > >>>>>>>>> Anyway, looking forward to the FLIP update!
> > >>>>>>>>>
> > >>>>>>>>> Best regards,
> > >>>>>>>>> Smirnov Alexander
> > >>>>>>>>>
> > >>>>>>>>> чт, 12 мая 2022 г. в 17:38, Jark Wu <im...@gmail.com>:
> > >>>>>>>>>>
> > >>>>>>>>>> Hi Alex,
> > >>>>>>>>>>
> > >>>>>>>>>> Thanks for summarizing your points.
> > >>>>>>>>>>
> > >>>>>>>>>> In the past week, Qingsheng, Leonard, and I have discussed it
> > several times
> > >>>>>>>>>> and we have totally refactored the design.
> > >>>>>>>>>> I'm glad to say we have reached a consensus on many of your
> > points!
> > >>>>>>>>>> Qingsheng is still working on updating the design docs and
> > maybe can be
> > >>>>>>>>>> available in the next few days.
> > >>>>>>>>>> I will share some conclusions from our discussions:
> > >>>>>>>>>>
> > >>>>>>>>>> 1) we have refactored the design towards the "cache in
> > >>>>>>>>>> framework" way.
> > >>>>>>>>>>
> > >>>>>>>>>> 2) a "LookupCache" interface for users to customize, and a default
> > >>>>>>>>>> implementation with a builder for ease of use.
> > >>>>>>>>>> This makes it possible to have both flexibility and conciseness.
> > >>>>>>>>>>
> > >>>>>>>>>> 3) Filter pushdown is important for ALL and LRU lookup cache,
> > esp reducing
> > >>>>>>>>>> IO.
> > >>>>>>>>>> Filter pushdown should be the final state and the unified way
> > to both
> > >>>>>>>>>> support pruning ALL cache and LRU cache,
> > >>>>>>>>>> so I think we should make effort in this direction. If we need
> > to support
> > >>>>>>>>>> filter pushdown for ALL cache anyway, why not use
> > >>>>>>>>>> it for the LRU cache as well? Either way, as we have decided to
> > >>>>>>>>>> implement the cache in the framework, we have the chance to support
> > >>>>>>>>>> filtering on the cache at any time. This is an optimization and it
> > >>>>>>>>>> doesn't affect the public API. I think we can create a JIRA issue
> > >>>>>>>>>> to discuss it when the FLIP is accepted.
> > >>>>>>>>>>
> > >>>>>>>>>> 4) The idea to support ALL cache is similar to your proposal.
> > >>>>>>>>>> In the first version, we will only support InputFormat,
> > SourceFunction for
> > >>>>>>>>>> cache all (invoke InputFormat in join operator).
> > >>>>>>>>>> For FLIP-27 source, we need to join a true source operator
> > instead of
> > >>>>>>>>>> calling it embedded in the join operator.
> > >>>>>>>>>> However, this needs another FLIP to support the re-scan
> ability
> > for FLIP-27
> > >>>>>>>>>> Source, and this can be a large work.
> > >>>>>>>>>> In order to not block this issue, we can put the effort of
> > FLIP-27 source
> > >>>>>>>>>> integration into future work and integrate
> > >>>>>>>>>> InputFormat&SourceFunction for now.
> > >>>>>>>>>>
> > >>>>>>>>>> I think it's fine to use InputFormat & SourceFunction, as they are
> > >>>>>>>>>> not deprecated; otherwise we would have to introduce another
> > >>>>>>>>>> function similar to them, which would be meaningless. We need to
> > >>>>>>>>>> plan FLIP-27 source integration ASAP before InputFormat &
> > >>>>>>>>>> SourceFunction are deprecated.
> > >>>>>>>>>>
> > >>>>>>>>>> Best,
> > >>>>>>>>>> Jark
> > >>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>> On Thu, 12 May 2022 at 15:46, Александр Смирнов <
> > smiralexan@gmail.com>
> > >>>>>>>>>> wrote:
> > >>>>>>>>>>
> > >>>>>>>>>>> Hi Martijn!
> > >>>>>>>>>>>
> > >>>>>>>>>>> Got it. Therefore, the implementation with InputFormat is not
> > considered.
> > >>>>>>>>>>> Thanks for clearing that up!
> > >>>>>>>>>>>
> > >>>>>>>>>>> Best regards,
> > >>>>>>>>>>> Smirnov Alexander
> > >>>>>>>>>>>
> > >>>>>>>>>>> чт, 12 мая 2022 г. в 14:23, Martijn Visser <
> > martijn@ververica.com>:
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Hi,
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> With regards to:
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>> But if there are plans to refactor all connectors to
> FLIP-27
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Yes, FLIP-27 is the target for all connectors. The old
> > interfaces will be
> > >>>>>>>>>>>> deprecated and connectors will either be refactored to use
> > the new ones
> > >>>>>>>>>>> or
> > >>>>>>>>>>>> dropped.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> The caching should work for connectors that are using
> FLIP-27
> > interfaces,
> > >>>>>>>>>>>> we should not introduce new features for old interfaces.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Best regards,
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Martijn
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> On Thu, 12 May 2022 at 06:19, Александр Смирнов <
> > smiralexan@gmail.com>
> > >>>>>>>>>>>> wrote:
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>> Hi Jark!
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> Sorry for the late response. I would like to make some
> > comments and
> > >>>>>>>>>>>>> clarify my points.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> 1) I agree with your first statement. I think we can
> achieve
> > both
> > >>>>>>>>>>>>> advantages this way: put the Cache interface in
> > flink-table-common,
> > >>>>>>>>>>>>> but have implementations of it in flink-table-runtime.
> > Therefore if a
> > >>>>>>>>>>>>> connector developer wants to use existing cache strategies
> > and their
> > >>>>>>>>>>>>> implementations, he can just pass lookupConfig to the
> > planner, but if
> > >>>>>>>>>>>>> he wants to have its own cache implementation in his
> > TableFunction, it
> > >>>>>>>>>>>>> will be possible for him to use the existing interface for
> > this
> > >>>>>>>>>>>>> purpose (we can explicitly point this out in the
> > documentation). In
> > >>>>>>>>>>>>> this way all configs and metrics will be unified. WDYT?
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>> If a filter can prune 90% of data in the cache, we will
> > have 90% of
> > >>>>>>>>>>>>> lookup requests that can never be cached
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> 2) Let me clarify the logic of the filters optimization in the
> > >>>>>>>>>>>>> case of an LRU cache.
> > >>>>>>>>>>>>> It looks like Cache<RowData, Collection<RowData>>. Here we
> > always
> > >>>>>>>>>>>>> store the response of the dimension table in cache, even
> > after
> > >>>>>>>>>>>>> applying calc function. I.e. if there are no rows after
> > applying
> > >>>>>>>>>>>>> filters to the result of the 'eval' method of
> TableFunction,
> > we store
> > >>>>>>>>>>>>> the empty list by lookup keys. Therefore the cache line
> will
> > be
> > >>>>>>>>>>>>> filled, but will require much less memory (in bytes). I.e.
> > we don't
> > >>>>>>>>>>>>> completely filter keys, by which result was pruned, but
> > significantly
> > >>>>>>>>>>>>> reduce required memory to store this result. If the user
> > knows about
> > >>>>>>>>>>>>> this behavior, he can increase the 'max-rows' option before
> > the start
> > >>>>>>>>>>>>> of the job. But actually I came up with the idea that we
> can
> > do this
> > >>>>>>>>>>>>> automatically by using the 'maximumWeight' and 'weigher'
> > methods of
> > >>>>>>>>>>>>> GuavaCache [1]. Weight can be the size of the collection of
> > rows
> > >>>>>>>>>>>>> (value of cache). Therefore cache can automatically fit
> much
> > more
> > >>>>>>>>>>>>> records than before.
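[Editorial illustration, not code from the FLIP or from Flink/Guava: the weight-based eviction idea described above — each cached entry weighing as many units as it has rows, so filtered-down or empty results cost almost nothing to keep — can be sketched in plain Java roughly like this; class and method names are purely illustrative.]

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch of a weight-based LRU cache where each entry's weight is the number
// of rows it holds, so keys whose lookup result was fully pruned by filters
// occupy almost no space.
class WeightedLruCache<K, V> {
    // access-order map: iteration starts from the least recently used entry
    private final LinkedHashMap<K, List<V>> map = new LinkedHashMap<>(16, 0.75f, true);
    private final long maxWeight;
    private long currentWeight = 0;

    WeightedLruCache(long maxWeight) {
        this.maxWeight = maxWeight;
    }

    void put(K key, List<V> rows) {
        List<V> old = map.put(key, rows);
        if (old != null) {
            currentWeight -= weigh(old);
        }
        currentWeight += weigh(rows);
        // evict least-recently-used entries until under the weight budget
        Iterator<Map.Entry<K, List<V>>> it = map.entrySet().iterator();
        while (currentWeight > maxWeight && it.hasNext()) {
            Map.Entry<K, List<V>> eldest = it.next();
            if (eldest.getKey().equals(key)) {
                continue; // never evict the entry that was just inserted
            }
            currentWeight -= weigh(eldest.getValue());
            it.remove();
        }
    }

    List<V> get(K key) {
        return map.get(key);
    }

    long weight() {
        return currentWeight;
    }

    // one row = one unit of weight; an empty (fully filtered) result still costs 1
    private long weigh(List<V> rows) {
        return Math.max(1, rows.size());
    }
}
```

[With Guava, the same effect is obtained by combining `CacheBuilder.maximumWeight(...)` with a `Weigher` that returns the size of the cached row collection, as described in the Javadoc linked under [1] at the end of this message.]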
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>> Flink SQL has provided a standard way to do filters and
> > projects
> > >>>>>>>>>>>>> pushdown, i.e., SupportsFilterPushDown and
> > SupportsProjectionPushDown.
> > >>>>>>>>>>>>> Jdbc/hive/HBase haven't implemented the interfaces, but that
> > >>>>>>>>>>>>> doesn't mean they are hard to implement.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> It's debatable how difficult it will be to implement filter
> > pushdown.
> > >>>>>>>>>>>>> But I think the fact that currently there is no database
> > connector
> > >>>>>>>>>>>>> with filter pushdown at least means that this feature won't
> > be
> > >>>>>>>>>>>>> supported soon in connectors. Moreover, if we talk about
> > other
> > >>>>>>>>>>>>> connectors (not in Flink repo), their databases might not
> > support all
> > >>>>>>>>>>>>> Flink filters (or not support filters at all). I think
> users
> > are
> > >>>>>>>>>>>>> interested in supporting cache filters optimization
> > independently of
> > >>>>>>>>>>>>> supporting other features and solving more complex problems
> > (or
> > >>>>>>>>>>>>> unsolvable at all).
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> 3) I agree with your third statement. Actually in our
> > internal version
> > >>>>>>>>>>>>> I also tried to unify the logic of scanning and reloading
> > data from
> > >>>>>>>>>>>>> connectors. But unfortunately, I didn't find a way to unify
> > the logic
> > >>>>>>>>>>>>> of all ScanRuntimeProviders (InputFormat, SourceFunction,
> > Source,...)
> > >>>>>>>>>>>>> and reuse it in reloading ALL cache. As a result I settled
> > on using
> > >>>>>>>>>>>>> InputFormat, because it was used for scanning in all lookup
> > >>>>>>>>>>>>> connectors. (I didn't know that there are plans to
> deprecate
> > >>>>>>>>>>>>> InputFormat in favor of FLIP-27 Source). IMO usage of
> > FLIP-27 source
> > >>>>>>>>>>>>> in ALL caching is not good idea, because this source was
> > designed to
> > >>>>>>>>>>>>> work in distributed environment (SplitEnumerator on
> > JobManager and
> > >>>>>>>>>>>>> SourceReaders on TaskManagers), not in one operator (lookup
> > join
> > >>>>>>>>>>>>> operator in our case). There is even no direct way to pass
> > splits from
> > >>>>>>>>>>>>> SplitEnumerator to SourceReader (this logic works through
> > >>>>>>>>>>>>> SplitEnumeratorContext, which requires
> > >>>>>>>>>>>>> OperatorCoordinator.SubtaskGateway to send AddSplitEvents).
> > Usage of
> > >>>>>>>>>>>>> InputFormat for ALL cache seems much more clearer and
> > easier. But if
> > >>>>>>>>>>>>> there are plans to refactor all connectors to FLIP-27, I
> > have the
> > >>>>>>>>>>>>> following ideas: maybe we can drop the lookup join ALL cache
> > >>>>>>>>>>>>> in favor of a simple join with multiple scans of the batch
> > >>>>>>>>>>>>> source?
> > The point
> > >>>>>>>>>>>>> is that the only difference between lookup join ALL cache
> > and simple
> > >>>>>>>>>>>>> join with batch source is that in the first case scanning
> is
> > performed
> > >>>>>>>>>>>>> multiple times, in between which state (cache) is cleared
> > (correct me
> > >>>>>>>>>>>>> if I'm wrong). So what if we extend the functionality of
> > simple join
> > >>>>>>>>>>>>> to support state reloading + extend the functionality of
> > scanning
> > >>>>>>>>>>>>> batch source multiple times (this one should be easy with
> > new FLIP-27
> > >>>>>>>>>>>>> source, that unifies streaming/batch reading - we will need
> > to change
> > >>>>>>>>>>>>> only SplitEnumerator, which will pass splits again after
> > some TTL).
> > >>>>>>>>>>>>> WDYT? I must say that this looks like a long-term goal and
> > will make
> > >>>>>>>>>>>>> the scope of this FLIP even larger than you said. Maybe we
> > can limit
> > >>>>>>>>>>>>> ourselves to a simpler solution now (InputFormats).
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> So to sum up, my points is like this:
> > >>>>>>>>>>>>> 1) There is a way to make both concise and flexible
> > interfaces for
> > >>>>>>>>>>>>> caching in lookup join.
> > >>>>>>>>>>>>> 2) Cache filters optimization is important both in LRU and
> > ALL caches.
> > >>>>>>>>>>>>> 3) It is unclear when filter pushdown will be supported in
> > Flink
> > >>>>>>>>>>>>> connectors, some of the connectors might not have the
> > opportunity to
> > >>>>>>>>>>>>> support filter pushdown + as I know, currently filter
> > pushdown works
> > >>>>>>>>>>>>> only for scanning (not lookup). So cache filters +
> > projections
> > >>>>>>>>>>>>> optimization should be independent from other features.
> > >>>>>>>>>>>>> 4) ALL cache realization is a complex topic that involves
> > multiple
> > >>>>>>>>>>>>> aspects of how Flink is developing. Dropping InputFormat in
> > >>>>>>>>>>>>> favor of the FLIP-27 Source would make the ALL cache
> > >>>>>>>>>>>>> implementation really complex and unclear, so maybe instead we
> > >>>>>>>>>>>>> can extend the functionality of the simple join, or keep
> > >>>>>>>>>>>>> InputFormat for the lookup join ALL cache?
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> Best regards,
> > >>>>>>>>>>>>> Smirnov Alexander
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> [1]
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>
> >
> https://guava.dev/releases/18.0/api/docs/com/google/common/cache/CacheBuilder.html#weigher(com.google.common.cache.Weigher)
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> Thu, 5 May 2022 at 20:34, Jark Wu <im...@gmail.com>:
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> It's great to see the active discussion! I want to share
> my
> > ideas:
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> 1) implement the cache in framework vs. connectors base
> > >>>>>>>>>>>>>> I don't have a strong opinion on this. Both ways should
> > work (e.g.,
> > >>>>>>>>>>> cache
> > >>>>>>>>>>>>>> pruning, compatibility).
> > >>>>>>>>>>>>>> The framework way can provide more concise interfaces.
> > >>>>>>>>>>>>>> The connector base way can define more flexible cache
> > >>>>>>>>>>>>>> strategies/implementations.
> > >>>>>>>>>>>>>> We are still investigating a way to see if we can have
> both
> > >>>>>>>>>>> advantages.
> > >>>>>>>>>>>>>> We should reach a consensus that the way should be a final
> > state,
> > >>>>>>>>>>> and we
> > >>>>>>>>>>>>>> are on the path to it.
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> 2) filters and projections pushdown:
> > >>>>>>>>>>>>>> I agree with Alex that the filter pushdown into cache can
> > benefit a
> > >>>>>>>>>>> lot
> > >>>>>>>>>>>>> for
> > >>>>>>>>>>>>>> ALL cache.
> > >>>>>>>>>>>>>> However, this is not true for LRU cache. Connectors use
> > cache to
> > >>>>>>>>>>> reduce
> > >>>>>>>>>>>>> IO
> > >>>>>>>>>>>>>> requests to databases for better throughput.
> > >>>>>>>>>>>>>> If a filter can prune 90% of data in the cache, we will
> > have 90% of
> > >>>>>>>>>>>>> lookup
> > >>>>>>>>>>>>>> requests that can never be cached
> > >>>>>>>>>>>>>> and hit directly to the databases. That means the cache is
> > >>>>>>>>>>> meaningless in
> > >>>>>>>>>>>>>> this case.
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> IMO, Flink SQL has provided a standard way to do filters
> > and projects
> > >>>>>>>>>>>>>> pushdown, i.e., SupportsFilterPushDown and
> > >>>>>>>>>>> SupportsProjectionPushDown.
> > >>>>>>>>>>>>>> Jdbc/hive/HBase haven't implemented the interfaces, but that
> > >>>>>>>>>>>>>> doesn't mean they are hard to implement.
> > >>>>>>>>>>>>>> They should implement the pushdown interfaces to reduce IO
> > and the
> > >>>>>>>>>>> cache
> > >>>>>>>>>>>>>> size.
> > >>>>>>>>>>>>>> That should be a final state that the scan source and
> > lookup source
> > >>>>>>>>>>> share
> > >>>>>>>>>>>>>> the exact pushdown implementation.
> > >>>>>>>>>>>>>> I don't see why we need to duplicate the pushdown logic in
> > caches,
> > >>>>>>>>>>> which
> > >>>>>>>>>>>>>> will complex the lookup join design.
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> 3) ALL cache abstraction
> > >>>>>>>>>>>>>> All cache might be the most challenging part of this FLIP.
> > We have
> > >>>>>>>>>>> never
> > >>>>>>>>>>>>>> provided a reload-lookup public interface.
> > >>>>>>>>>>>>>> Currently, we put the reload logic in the "eval" method of
> > >>>>>>>>>>> TableFunction.
> > >>>>>>>>>>>>>> That's hard for some sources (e.g., Hive).
> > >>>>>>>>>>>>>> Ideally, connector implementation should share the logic
> of
> > reload
> > >>>>>>>>>>> and
> > >>>>>>>>>>>>>> scan, i.e. ScanTableSource with
> > InputFormat/SourceFunction/FLIP-27
> > >>>>>>>>>>>>> Source.
> > >>>>>>>>>>>>>> However, InputFormat/SourceFunction are deprecated, and
> the
> > FLIP-27
> > >>>>>>>>>>>>> source
> > >>>>>>>>>>>>>> is deeply coupled with SourceOperator.
> > >>>>>>>>>>>>>> If we want to invoke the FLIP-27 source in LookupJoin,
> this
> > may make
> > >>>>>>>>>>> the
> > >>>>>>>>>>>>>> scope of this FLIP much larger.
> > >>>>>>>>>>>>>> We are still investigating how to abstract the ALL cache
> > logic and
> > >>>>>>>>>>> reuse
> > >>>>>>>>>>>>>> the existing source interfaces.
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> Best,
> > >>>>>>>>>>>>>> Jark
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> On Thu, 5 May 2022 at 20:22, Roman Boyko <
> > ro.v.boyko@gmail.com>
> > >>>>>>>>>>> wrote:
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> It's a much more complicated activity and lies out of the
> > scope of
> > >>>>>>>>>>> this
> > >>>>>>>>>>>>>>> improvement. Because such pushdowns should be done for
> all
> > >>>>>>>>>>>>> ScanTableSource
> > >>>>>>>>>>>>>>> implementations (not only for Lookup ones).
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> On Thu, 5 May 2022 at 19:02, Martijn Visser <
> > >>>>>>>>>>> martijnvisser@apache.org>
> > >>>>>>>>>>>>>>> wrote:
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> Hi everyone,
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> One question regarding "And Alexander correctly
> mentioned
> > that
> > >>>>>>>>>>> filter
> > >>>>>>>>>>>>>>>> pushdown still is not implemented for jdbc/hive/hbase."
> > -> Would
> > >>>>>>>>>>> an
> > >>>>>>>>>>>>>>>> alternative solution be to actually implement these
> filter
> > >>>>>>>>>>> pushdowns?
> > >>>>>>>>>>>>> I
> > >>>>>>>>>>>>>>>> can
> > >>>>>>>>>>>>>>>> imagine that there are many more benefits to doing that,
> > outside
> > >>>>>>>>>>> of
> > >>>>>>>>>>>>> lookup
> > >>>>>>>>>>>>>>>> caching and metrics.
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> Best regards,
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> Martijn Visser
> > >>>>>>>>>>>>>>>> https://twitter.com/MartijnVisser82
> > >>>>>>>>>>>>>>>> https://github.com/MartijnVisser
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> On Thu, 5 May 2022 at 13:58, Roman Boyko <
> > ro.v.boyko@gmail.com>
> > >>>>>>>>>>>>> wrote:
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> Hi everyone!
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> Thanks for driving such a valuable improvement!
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> I do think that single cache implementation would be a
> > nice
> > >>>>>>>>>>>>> opportunity
> > >>>>>>>>>>>>>>>> for
> > >>>>>>>>>>>>>>>>> users. And it will break the "FOR SYSTEM_TIME AS OF
> > proc_time"
> > >>>>>>>>>>>>> semantics
> > >>>>>>>>>>>>>>>>> anyway - doesn't matter how it will be implemented.
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> Putting myself in the user's shoes, I can say that:
> > >>>>>>>>>>>>>>>>> 1) I would prefer to have the opportunity to cut off
> the
> > cache
> > >>>>>>>>>>> size
> > >>>>>>>>>>>>> by
> > >>>>>>>>>>>>>>>>> simply filtering unnecessary data. And the handiest way to
> > >>>>>>>>>>>>>>>>> do it is to apply
> > >>>>>>>>>>>>>>>>> it inside LookupRunners. It would be a bit harder to
> > pass it
> > >>>>>>>>>>>>> through the
> > >>>>>>>>>>>>>>>>> LookupJoin node to TableFunction. And Alexander
> correctly
> > >>>>>>>>>>> mentioned
> > >>>>>>>>>>>>> that
> > >>>>>>>>>>>>>>>>> filter pushdown still is not implemented for
> > jdbc/hive/hbase.
> > >>>>>>>>>>>>>>>>> 2) The ability to set the different caching parameters
> > for
> > >>>>>>>>>>> different
> > >>>>>>>>>>>>>>>> tables
> > >>>>>>>>>>>>>>>>> is quite important. So I would prefer to set it through
> > DDL
> > >>>>>>>>>>> rather
> > >>>>>>>>>>>>> than
> > >>>>>>>>>>>>>>>>> have similar ttl, strategy and other options for all
> > lookup
> > >>>>>>>>>>> tables.
> > >>>>>>>>>>>>>>>>> 3) Providing the cache into the framework really
> > deprives us of
> > >>>>>>>>>>>>>>>>> extensibility (users won't be able to implement their
> own
> > >>>>>>>>>>> cache).
> > >>>>>>>>>>>>> But
> > >>>>>>>>>>>>>>>> most
> > >>>>>>>>>>>>>>>>> probably it might be solved by creating more different
> > cache
> > >>>>>>>>>>>>> strategies
> > >>>>>>>>>>>>>>>> and
> > >>>>>>>>>>>>>>>>> a wider set of configurations.
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> All these points are much closer to the schema proposed
> > by
> > >>>>>>>>>>>>> Alexander.
> > >>>>>>>>>>>>>>>>> Qingshen Ren, please correct me if I'm not right and
> all
> > these
> > >>>>>>>>>>>>>>>> facilities
> > >>>>>>>>>>>>>>>>> might be simply implemented in your architecture?
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> Best regards,
> > >>>>>>>>>>>>>>>>> Roman Boyko
> > >>>>>>>>>>>>>>>>> e.: ro.v.boyko@gmail.com
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> On Wed, 4 May 2022 at 21:01, Martijn Visser <
> > >>>>>>>>>>>>> martijnvisser@apache.org>
> > >>>>>>>>>>>>>>>>> wrote:
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> Hi everyone,
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> I don't have much to chip in, but just wanted to
> > express that
> > >>>>>>>>>>> I
> > >>>>>>>>>>>>> really
> > >>>>>>>>>>>>>>>>>> appreciate the in-depth discussion on this topic and I
> > hope
> > >>>>>>>>>>> that
> > >>>>>>>>>>>>>>>> others
> > >>>>>>>>>>>>>>>>>> will join the conversation.
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> Best regards,
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> Martijn
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> On Tue, 3 May 2022 at 10:15, Александр Смирнов <
> > >>>>>>>>>>>>> smiralexan@gmail.com>
> > >>>>>>>>>>>>>>>>>> wrote:
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> Hi Qingsheng, Leonard and Jark,
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> Thanks for your detailed feedback! However, I have
> > questions
> > >>>>>>>>>>>>> about
> > >>>>>>>>>>>>>>>>>>> some of your statements (maybe I didn't get
> > something?).
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>> Caching actually breaks the semantic of "FOR
> > SYSTEM_TIME
> > >>>>>>>>>>> AS OF
> > >>>>>>>>>>>>>>>>>> proc_time”
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> I agree that the semantics of "FOR SYSTEM_TIME AS OF
> > >>>>>>>>>>> proc_time"
> > >>>>>>>>>>>>> is
> > >>>>>>>>>>>>>>>> not
> > >>>>>>>>>>>>>>>>>>> fully implemented with caching, but as you said,
> users
> > go
> > >>>>>>>>>>> on it
> > >>>>>>>>>>>>>>>>>>> consciously to achieve better performance (no one
> > proposed
> > >>>>>>>>>>> to
> > >>>>>>>>>>>>> enable
> > >>>>>>>>>>>>>>>>>>> caching by default, etc.). Or by users do you mean
> > other
> > >>>>>>>>>>>>> developers
> > >>>>>>>>>>>>>>>> of
> > >>>>>>>>>>>>>>>>>>> connectors? In this case developers explicitly
> specify
> > >>>>>>>>>>> whether
> > >>>>>>>>>>>>> their
> > >>>>>>>>>>>>>>>>>>> connector supports caching or not (in the list of
> > supported
> > >>>>>>>>>>>>>>>> options),
> > >>>>>>>>>>>>>>>>>>> no one makes them do that if they don't want to. So
> > what
> > >>>>>>>>>>>>> exactly is
> > >>>>>>>>>>>>>>>>>>> the difference between implementing caching in
> modules
> > >>>>>>>>>>>>>>>>>>> flink-table-runtime and in flink-table-common from
> the
> > >>>>>>>>>>>>> considered
> > >>>>>>>>>>>>>>>>>>> point of view? How does it affect on
> > breaking/non-breaking
> > >>>>>>>>>>> the
> > >>>>>>>>>>>>>>>>>>> semantics of "FOR SYSTEM_TIME AS OF proc_time"?
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>> confront a situation that allows table options in
> DDL
> > to
> > >>>>>>>>>>>>> control
> > >>>>>>>>>>>>>>>> the
> > >>>>>>>>>>>>>>>>>>> behavior of the framework, which has never happened
> > >>>>>>>>>>> previously
> > >>>>>>>>>>>>> and
> > >>>>>>>>>>>>>>>>> should
> > >>>>>>>>>>>>>>>>>>> be cautious
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> If we talk about main differences of semantics of DDL
> > >>>>>>>>>>> options
> > >>>>>>>>>>>>> and
> > >>>>>>>>>>>>>>>>>>> config options("table.exec.xxx"), isn't it about
> > limiting
> > >>>>>>>>>>> the
> > >>>>>>>>>>>>> scope
> > >>>>>>>>>>>>>>>> of
> > >>>>>>>>>>>>>>>>>>> the options + importance for the user business logic
> > rather
> > >>>>>>>>>>> than
> > >>>>>>>>>>>>>>>>>>> specific location of corresponding logic in the
> > framework? I
> > >>>>>>>>>>>>> mean
> > >>>>>>>>>>>>>>>> that
> > >>>>>>>>>>>>>>>>>>> in my design, for example, putting an option with
> > lookup
> > >>>>>>>>>>> cache
> > >>>>>>>>>>>>>>>>>>> strategy in configurations would  be the wrong
> > decision,
> > >>>>>>>>>>>>> because it
> > >>>>>>>>>>>>>>>>>>> directly affects the user's business logic (not just
> > >>>>>>>>>>> performance
> > >>>>>>>>>>>>>>>>>>> optimization) + touches just several functions of ONE
> > table
> > >>>>>>>>>>>>> (there
> > >>>>>>>>>>>>>>>> can
> > >>>>>>>>>>>>>>>>>>> be multiple tables with different caches). Does it
> > really
> > >>>>>>>>>>>>> matter for
> > >>>>>>>>>>>>>>>>>>> the user (or someone else) where the logic is
> located,
> > >>>>>>>>>>> which is
> > >>>>>>>>>>>>>>>>>>> affected by the applied option?
> > >>>>>>>>>>>>>>>>>>> Also I can remember DDL option 'sink.parallelism',
> > which in
> > >>>>>>>>>>>>> some way
> > >>>>>>>>>>>>>>>>>>> "controls the behavior of the framework" and I don't
> > see any
> > >>>>>>>>>>>>> problem
> > >>>>>>>>>>>>>>>>>>> here.
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>> introduce a new interface for this all-caching
> > scenario
> > >>>>>>>>>>> and
> > >>>>>>>>>>>>> the
> > >>>>>>>>>>>>>>>>> design
> > >>>>>>>>>>>>>>>>>>> would become more complex
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> This is a subject for a separate discussion, but
> > actually
> > >>>>>>>>>>> in our
> > >>>>>>>>>>>>>>>>>>> internal version we solved this problem quite easily
> -
> > we
> > >>>>>>>>>>> reused
> > >>>>>>>>>>>>>>>>>>> InputFormat class (so there is no need for a new
> API).
> > The
> > >>>>>>>>>>>>> point is
> > >>>>>>>>>>>>>>>>>>> that currently all lookup connectors use InputFormat
> > for
> > >>>>>>>>>>>>> scanning
> > >>>>>>>>>>>>>>>> the
> > >>>>>>>>>>>>>>>>>>> data in batch mode: HBase, JDBC and even Hive - it
> uses
> > >>>>>>>>>>> class
> > >>>>>>>>>>>>>>>>>>> PartitionReader, that is actually just a wrapper
> around
> > >>>>>>>>>>>>> InputFormat.
> > >>>>>>>>>>>>>>>>>>> The advantage of this solution is the ability to
> reload
> > >>>>>>>>>>> cache
> > >>>>>>>>>>>>> data
> > >>>>>>>>>>>>>>>> in
> > >>>>>>>>>>>>>>>>>>> parallel (number of threads depends on number of
> > >>>>>>>>>>> InputSplits,
> > >>>>>>>>>>>>> but
> > >>>>>>>>>>>>>>>> has
> > >>>>>>>>>>>>>>>>>>> an upper limit). As a result, cache reload time is
> > >>>>>>>>>>>>>>>>>>> significantly reduced (as well as the time of input
> > >>>>>>>>>>>>>>>>>>> stream blocking). I know
> that
> > >>>>>>>>>>> usually
> > >>>>>>>>>>>>> we
> > >>>>>>>>>>>>>>>> try
> > >>>>>>>>>>>>>>>>>>> to avoid usage of concurrency in Flink code, but
> maybe
> > this
> > >>>>>>>>>>> one
> > >>>>>>>>>>>>> can
> > >>>>>>>>>>>>>>>> be
> > >>>>>>>>>>>>>>>>>>> an exception. BTW I don't say that it's an ideal
> > solution,
> > >>>>>>>>>>> maybe
> > >>>>>>>>>>>>>>>> there
> > >>>>>>>>>>>>>>>>>>> are better ones.
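[Editorial sketch of the parallel reload idea described above — one worker per InputSplit, with the thread count capped by an upper limit. `SplitReader` and `reload` are illustrative stand-ins, not Flink's actual InputFormat API.]

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// "SplitReader" stands in for reading one InputSplit via
// InputFormat.open()/nextRecord(); this is not Flink's real API.
class ParallelCacheReloader {
    interface SplitReader<K, V> {
        Map<K, V> readSplit(int splitId) throws Exception;
    }

    // Reload the whole cache by reading splits in parallel; the thread count
    // depends on the number of splits but has an upper limit.
    static <K, V> Map<K, V> reload(int numSplits, int maxThreads,
                                   SplitReader<K, V> reader) throws Exception {
        ExecutorService pool =
            Executors.newFixedThreadPool(Math.min(numSplits, maxThreads));
        try {
            List<Future<Map<K, V>>> futures = new ArrayList<>();
            for (int i = 0; i < numSplits; i++) {
                final int splitId = i;
                futures.add(pool.submit(() -> reader.readSplit(splitId)));
            }
            Map<K, V> newCache = new ConcurrentHashMap<>();
            for (Future<Map<K, V>> f : futures) {
                newCache.putAll(f.get()); // merge each split's rows
            }
            return newCache; // swap in for the old cache once fully loaded
        } finally {
            pool.shutdown();
        }
    }
}
```

[Building the new cache fully before swapping it in keeps the input stream blocked only for the duration of the parallel load, which is the gain described above.]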
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>> Providing the cache in the framework might introduce
> > >>>>>>>>>>>>> compatibility
> > >>>>>>>>>>>>>>>>>> issues
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> It's possible only in cases when the developer of the
> > >>>>>>>>>>> connector
> > >>>>>>>>>>>>>>>> won't
> > >>>>>>>>>>>>>>>>>>> properly refactor his code and will use new cache
> > options
> > >>>>>>>>>>>>>>>> incorrectly
> > >>>>>>>>>>>>>>>>>>> (i.e. explicitly provide the same options into 2
> > different
> > >>>>>>>>>>> code
> > >>>>>>>>>>>>>>>>>>> places). For correct behavior all he will need to do
> > is to
> > >>>>>>>>>>>>> redirect
> > >>>>>>>>>>>>>>>>>>> existing options to the framework's LookupConfig (+
> > maybe
> > >>>>>>>>>>> add an
> > >>>>>>>>>>>>>>>> alias
> > >>>>>>>>>>>>>>>>>>> for options, if there was different naming),
> everything
> > >>>>>>>>>>> will be
> > >>>>>>>>>>>>>>>>>>> transparent for users. If the developer won't do
> > >>>>>>>>>>> refactoring at
> > >>>>>>>>>>>>> all,
> > >>>>>>>>>>>>>>>>>>> nothing will be changed for the connector because of
> > >>>>>>>>>>> backward
> > >>>>>>>>>>>>>>>>>>> compatibility. Also if a developer wants to use his
> own
> > >>>>>>>>>>> cache
> > >>>>>>>>>>>>> logic,
> > >>>>>>>>>>>>>>>>>>> he just can refuse to pass some of the configs into
> the
> > >>>>>>>>>>>>> framework,
> > >>>>>>>>>>>>>>>> and
> > >>>>>>>>>>>>>>>>>>> instead make his own implementation with already
> > existing
> > >>>>>>>>>>>>> configs
> > >>>>>>>>>>>>>>>> and
> > >>>>>>>>>>>>>>>>>>> metrics (but actually I think that it's a rare case).
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>> filters and projections should be pushed all the way
> > down
> > >>>>>>>>>>> to
> > >>>>>>>>>>>>> the
> > >>>>>>>>>>>>>>>>> table
> > >>>>>>>>>>>>>>>>>>> function, like what we do in the scan source
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> It's the great purpose. But the truth is that the
> ONLY
> > >>>>>>>>>>> connector
> > >>>>>>>>>>>>>>>> that
> > >>>>>>>>>>>>>>>>>>> supports filter pushdown is FileSystemTableSource
> > >>>>>>>>>>>>>>>>>>> (no database connector supports it currently). Also
> > for some
> > >>>>>>>>>>>>>>>> databases
> > >>>>>>>>>>>>>>>>>>> it's simply impossible to pushdown such complex
> filters
> > >>>>>>>>>>> that we
> > >>>>>>>>>>>>> have
> > >>>>>>>>>>>>>>>>>>> in Flink.
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>> only applying these optimizations to the cache seems
> > not
> > >>>>>>>>>>>>> quite
> > >>>>>>>>>>>>>>>>> useful
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> Filters can cut off an arbitrarily large amount of
> data
> > >>>>>>>>>>> from the
> > >>>>>>>>>>>>>>>>>>> dimension table. For a simple example, suppose in
> > dimension
> > >>>>>>>>>>>>> table
> > >>>>>>>>>>>>>>>>>>> 'users'
> > >>>>>>>>>>>>>>>>>>> we have column 'age' with values from 20 to 40, and
> > input
> > >>>>>>>>>>> stream
> > >>>>>>>>>>>>>>>>>>> 'clicks' that is ~uniformly distributed by age of
> > users. If
> > >>>>>>>>>>> we
> > >>>>>>>>>>>>> have
> > >>>>>>>>>>>>>>>>>>> filter 'age > 30',
> > >>>>>>>>>>>>>>>>>>> there will be twice less data in cache. This means
> the
> > user
> > >>>>>>>>>>> can
> > >>>>>>>>>>>>>>>>>>> increase 'lookup.cache.max-rows' by almost 2 times.
> It
> > will
> > >>>>>>>>>>>>> gain a
> > >>>>>>>>>>>>>>>>>>> huge
> > >>>>>>>>>>>>>>>>>>> performance boost. Moreover, this optimization starts
> > to
> > >>>>>>>>>>> really
> > >>>>>>>>>>>>>>>> shine
> > >>>>>>>>>>>>>>>>>>> in 'ALL' cache, where tables without filters and
> > projections
> > >>>>>>>>>>>>> can't
> > >>>>>>>>>>>>>>>> fit
> > >>>>>>>>>>>>>>>>>>> in memory, but with them - can. This opens up
> > additional
> > >>>>>>>>>>>>>>>> possibilities
> > >>>>>>>>>>>>>>>>>>> for users. And this doesn't sound as 'not quite
> > useful'.
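[To make the above concrete, here is an editorial plain-Java sketch — names are illustrative, not Flink API — of applying the join's filter before rows enter the cache, so lookups whose rows are all pruned are still cached, but as cheap empty entries.]

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.function.Predicate;

// Sketch: the join's filter is applied before rows are stored, so only
// matching rows occupy cache memory.
class FilteredLookupCache {
    private final Map<String, List<Integer>> cache = new HashMap<>();
    private final Function<String, List<Integer>> lookup; // stand-in for TableFunction.eval
    private final Predicate<Integer> filter;              // e.g. age > 30

    FilteredLookupCache(Function<String, List<Integer>> lookup,
                        Predicate<Integer> filter) {
        this.lookup = lookup;
        this.filter = filter;
    }

    List<Integer> get(String key) {
        return cache.computeIfAbsent(key, k -> {
            List<Integer> kept = new ArrayList<>();
            for (Integer row : lookup.apply(k)) {
                if (filter.test(row)) {
                    kept.add(row); // only matching rows are stored
                }
            }
            return kept; // possibly empty, but the key is still marked as cached
        });
    }

    int cachedKeys() {
        return cache.size();
    }
}
```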
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> It would be great to hear other voices regarding this
> > topic!
> > >>>>>>>>>>>>> Because
> > >>>>>>>>>>>>>>>>>>> we have quite a lot of controversial points, and I
> > think
> > >>>>>>>>>>> with
> > >>>>>>>>>>>>> the
> > >>>>>>>>>>>>>>>> help
> > >>>>>>>>>>>>>>>>>>> of others it will be easier for us to come to a
> > consensus.
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> Best regards,
> > >>>>>>>>>>>>>>>>>>> Smirnov Alexander
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> Fri, 29 Apr 2022 at 22:33, Qingsheng Ren <
> > >>>>>>>>>>> renqschn@gmail.com
> > >>>>>>>>>>>>>> :
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>> Hi Alexander and Arvid,
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>> Thanks for the discussion and sorry for my late
> > response!
> > >>>>>>>>>>> We
> > >>>>>>>>>>>>> had
> > >>>>>>>>>>>>>>>> an
> > >>>>>>>>>>>>>>>>>>> internal discussion together with Jark and Leonard
> and
> > I’d
> > >>>>>>>>>>> like
> > >>>>>>>>>>>>> to
> > >>>>>>>>>>>>>>>>>>> summarize our ideas. Instead of implementing the
> cache
> > >>>>>>>>>>> logic in
> > >>>>>>>>>>>>> the
> > >>>>>>>>>>>>>>>>> table
> > >>>>>>>>>>>>>>>>>>> runtime layer or wrapping around the user-provided
> > table
> > >>>>>>>>>>>>> function,
> > >>>>>>>>>>>>>>>> we
> > >>>>>>>>>>>>>>>>>>> prefer to introduce some new APIs extending
> > TableFunction
> > >>>>>>>>>>> with
> > >>>>>>>>>>>>> these
> > >>>>>>>>>>>>>>>>>>> concerns:
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>> 1. Caching actually breaks the semantic of "FOR
> > >>>>>>>>>>> SYSTEM_TIME
> > >>>>>>>>>>>>> AS OF
> > >>>>>>>>>>>>>>>>>>> proc_time”, because it couldn’t truly reflect the
> > content
> > >>>>>>>>>>> of the
> > >>>>>>>>>>>>>>>> lookup
> > >>>>>>>>>>>>>>>>>>> table at the moment of querying. If users choose to
> > enable
> > >>>>>>>>>>>>> caching
> > >>>>>>>>>>>>>>>> on
> > >>>>>>>>>>>>>>>>> the
> > >>>>>>>>>>>>>>>>>>> lookup table, they implicitly indicate that this
> > breakage is
> > >>>>>>>>>>>>>>>> acceptable
> > >>>>>>>>>>>>>>>>>> in
> > >>>>>>>>>>>>>>>>>>> exchange for the performance. So we prefer not to
> > provide
> > >>>>>>>>>>>>> caching on
> > >>>>>>>>>>>>>>>>> the
> > >>>>>>>>>>>>>>>>>>> table runtime level.
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>> 2. If we make the cache implementation in the
> > framework
> > >>>>>>>>>>>>> (whether
> > >>>>>>>>>>>>>>>> in a
> > >>>>>>>>>>>>>>>>>>> runner or a wrapper around TableFunction), we have to
> > >>>>>>>>>>> confront a
> > >>>>>>>>>>>>>>>>>> situation
> > >>>>>>>>>>>>>>>>>>> that allows table options in DDL to control the
> > behavior of
> > >>>>>>>>>>> the
> > >>>>>>>>>>>>>>>>>> framework,
> > >>>>>>>>>>>>>>>>>>> which has never happened previously and should be
> > cautious.
> > >>>>>>>>>>>>> Under
> > >>>>>>>>>>>>>>>> the
> > >>>>>>>>>>>>>>>>>>> current design the behavior of the framework should
> > only be
> > >>>>>>>>>>>>>>>> specified
> > >>>>>>>>>>>>>>>>> by
> > >>>>>>>>>>>>>>>>>>> configurations (“table.exec.xxx”), and it’s hard to
> > apply
> > >>>>>>>>>>> these
> > >>>>>>>>>>>>>>>> general
> > >>>>>>>>>>>>>>>>>>> configs to a specific table.
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>> 3. We have use cases that lookup source loads and
> > refresh
> > >>>>>>>>>>> all
> > >>>>>>>>>>>>>>>> records
> > >>>>>>>>>>>>>>>>>>> periodically into the memory to achieve high lookup
> > >>>>>>>>>>> performance
> > >>>>>>>>>>>>>>>> (like
> > >>>>>>>>>>>>>>>>>> Hive
> > >>>>>>>>>>>>>>>>>>> connector in the community, and also widely used by
> our
> > >>>>>>>>>>> internal
> > >>>>>>>>>>>>>>>>>>> connectors). Wrapping the cache around the user’s
> > >>>>>>>>>>> TableFunction
> > >>>>>>>>>>>>>>>> works
> > >>>>>>>>>>>>>>>>>> fine
> > >>>>>>>>>>>>>>>>>>> for LRU caches, but I think we have to introduce a
> new
> > >>>>>>>>>>>>> interface for
> > >>>>>>>>>>>>>>>>> this
> > >>>>>>>>>>>>>>>>>>> all-caching scenario and the design would become more
> > >>>>>>>>>>> complex.
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>> 4. Providing the cache in the framework might
> > introduce
> > >>>>>>>>>>>>>>>> compatibility
> > >>>>>>>>>>>>>>>>>>> issues to existing lookup sources like there might
> > exist two
> > >>>>>>>>>>>>> caches
> > >>>>>>>>>>>>>>>>> with
> > >>>>>>>>>>>>>>>>>>> totally different strategies if the user incorrectly
> > >>>>>>>>>>> configures
> > >>>>>>>>>>>>> the
> > >>>>>>>>>>>>>>>>> table
> > >>>>>>>>>>>>>>>>>>> (one in the framework and another implemented by the
> > lookup
> > >>>>>>>>>>>>> source).
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>> As for the optimization mentioned by Alexander, I
> > think
> > >>>>>>>>>>>>> filters
> > >>>>>>>>>>>>>>>> and
> > >>>>>>>>>>>>>>>>>>> projections should be pushed all the way down to the
> > table
> > >>>>>>>>>>>>> function,
> > >>>>>>>>>>>>>>>>> like
> > >>>>>>>>>>>>>>>>>>> what we do in the scan source, instead of the runner
> > with
> > >>>>>>>>>>> the
> > >>>>>>>>>>>>> cache.
> > >>>>>>>>>>>>>>>>> The
> > >>>>>>>>>>>>>>>>>>> goal of using cache is to reduce the network I/O and
> > >>>>>>>>>>> pressure
> > >>>>>>>>>>>>> on the
> > >>>>>>>>>>>>>>>>>>> external system, and only applying these
> optimizations
> > to
> > >>>>>>>>>>> the
> > >>>>>>>>>>>>> cache
> > >>>>>>>>>>>>>>>>> seems
> > >>>>>>>>>>>>>>>>>>> not quite useful.
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>> I made some updates to the FLIP[1] to reflect our
> > ideas.
> > >>>>>>>>>>> We
> > >>>>>>>>>>>>>>>> prefer to
> > >>>>>>>>>>>>>>>>>>> keep the cache implementation as a part of
> > TableFunction,
> > >>>>>>>>>>> and we
> > >>>>>>>>>>>>>>>> could
> > >>>>>>>>>>>>>>>>>>> provide some helper classes (CachingTableFunction,
> > >>>>>>>>>>>>>>>>>> AllCachingTableFunction,
> > >>>>>>>>>>>>>>>>>>> CachingAsyncTableFunction) to developers and regulate
> > >>>>>>>>>>> metrics
> > >>>>>>>>>>>>> of the
> > >>>>>>>>>>>>>>>>>> cache.
> > >>>>>>>>>>>>>>>>>>> Also, I made a POC[2] for your reference.
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>> Looking forward to your ideas!
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>> [1]
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-221+Abstraction+for+lookup+source+cache+and+metric
> > >>>>>>>>>>>>>>>>>>>> [2]
> https://github.com/PatrickRen/flink/tree/FLIP-221
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>> Best regards,
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>> Qingsheng
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>> On Tue, Apr 26, 2022 at 4:45 PM Александр Смирнов <
> > >>>>>>>>>>>>>>>>>> smiralexan@gmail.com>
> > >>>>>>>>>>>>>>>>>>> wrote:
> > >>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>> Thanks for the response, Arvid!
> > >>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>> I have few comments on your message.
> > >>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>> but could also live with an easier solution as the
> > >>>>>>>>>>> first
> > >>>>>>>>>>>>> step:
> > >>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>> I think that these 2 ways are mutually exclusive
> > >>>>>>>>>>> (originally
> > >>>>>>>>>>>>>>>>> proposed
> > >>>>>>>>>>>>>>>>>>>>> by Qingsheng and mine), because conceptually they
> > follow
> > >>>>>>>>>>> the
> > >>>>>>>>>>>>> same
> > >>>>>>>>>>>>>>>>>>>>> goal, but implementation details are different. If
> we
> > >>>>>>>>>>> will
> > >>>>>>>>>>>>> go one
> > >>>>>>>>>>>>>>>>> way,
> > >>>>>>>>>>>>>>>>>>>>> moving to another way in the future will mean
> > deleting
> > >>>>>>>>>>>>> existing
> > >>>>>>>>>>>>>>>> code
> > >>>>>>>>>>>>>>>>>>>>> and once again changing the API for connectors. So
> I
> > >>>>>>>>>>> think we
> > >>>>>>>>>>>>>>>> should
> > >>>>>>>>>>>>>>>>>>>>> reach a consensus with the community about that and
> > then
> > >>>>>>>>>>> work
> > >>>>>>>>>>>>>>>>> together
> > >>>>>>>>>>>>>>>>>>>>> on this FLIP, i.e. divide the work on tasks for
> > different
> > >>>>>>>>>>>>> parts
> > >>>>>>>>>>>>>>>> of
> > >>>>>>>>>>>>>>>>> the
> > >>>>>>>>>>>>>>>>>>>>> flip (for example, LRU cache unification /
> > introducing
> > >>>>>>>>>>>>> proposed
> > >>>>>>>>>>>>>>>> set
> > >>>>>>>>>>>>>>>>> of
> > >>>>>>>>>>>>>>>>>>>>> metrics / further work…). WDYT, Qingsheng?
> > >>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>> as the source will only receive the requests after
> > >>>>>>>>>>> filter
> > >>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>> Actually if filters are applied to fields of the
> > lookup
> > >>>>>>>>>>>>> table, we
> > >>>>>>>>>>>>>>>>>>>>> firstly must do requests, and only after that we
> can
> > >>>>>>>>>>> filter
> > >>>>>>>>>>>>>>>>> responses,
> > >>>>>>>>>>>>>>>>>>>>> because lookup connectors don't have filter
> > pushdown. So
> > >>>>>>>>>>> if
> > >>>>>>>>>>>>>>>>> filtering
> > >>>>>>>>>>>>>>>>>>>>> is done before caching, there will be much less
> rows
> > in
> > >>>>>>>>>>>>> cache.
> > >>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>> @Alexander unfortunately, your architecture is not
> > >>>>>>>>>>> shared.
> > >>>>>>>>>>>>> I
> > >>>>>>>>>>>>>>>> don't
> > >>>>>>>>>>>>>>>>>>> know the
> > >>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>> solution to share images to be honest.
> > >>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>> Sorry for that, I’m a bit new to such kinds of
> > >>>>>>>>>>> conversations
> > >>>>>>>>>>>>> :)
> > >>>>>>>>>>>>>>>>>>>>> I have no write access to the confluence, so I
> made a
> > >>>>>>>>>>> Jira
> > >>>>>>>>>>>>> issue,
> > >>>>>>>>>>>>>>>>>>>>> where described the proposed changes in more
> details
> > -
> > >>>>>>>>>>>>>>>>>>>>> https://issues.apache.org/jira/browse/FLINK-27411.
> > >>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>> Will happy to get more feedback!
> > >>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>> Best,
> > >>>>>>>>>>>>>>>>>>>>> Smirnov Alexander
> > >>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>> пн, 25 апр. 2022 г. в 19:49, Arvid Heise <
> > >>>>>>>>>>> arvid@apache.org>:
> > >>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>> Hi Qingsheng,
> > >>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>> Thanks for driving this; the inconsistency was not
> > >>>>>>>>>>>>> satisfying
> > >>>>>>>>>>>>>>>> for
> > >>>>>>>>>>>>>>>>>> me.
> > >>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>> I second Alexander's idea though but could also
> live
> > >>>>>>>>>>> with
> > >>>>>>>>>>>>> an
> > >>>>>>>>>>>>>>>>> easier
> > >>>>>>>>>>>>>>>>>>>>>> solution as the first step: Instead of making
> > caching
> > >>>>>>>>>>> an
> > >>>>>>>>>>>>>>>>>>> implementation
> > >>>>>>>>>>>>>>>>>>>>>> detail of TableFunction X, rather devise a caching
> > >>>>>>>>>>> layer
> > >>>>>>>>>>>>>>>> around X.
> > >>>>>>>>>>>>>>>>>> So
> > >>>>>>>>>>>>>>>>>>> the
> > >>>>>>>>>>>>>>>>>>>>>> proposal would be a CachingTableFunction that
> > >>>>>>>>>>> delegates to
> > >>>>>>>>>>>>> X in
> > >>>>>>>>>>>>>>>>> case
> > >>>>>>>>>>>>>>>>>>> of
> > >>>>>>>>>>>>>>>>>>>>>> misses and else manages the cache. Lifting it into
> > the
> > >>>>>>>>>>>>> operator
> > >>>>>>>>>>>>>>>>>> model
> > >>>>>>>>>>>>>>>>>>> as
> > >>>>>>>>>>>>>>>>>>>>>> proposed would be even better but is probably
> > >>>>>>>>>>> unnecessary
> > >>>>>>>>>>>>> in
> > >>>>>>>>>>>>>>>> the
> > >>>>>>>>>>>>>>>>>>> first step
> > >>>>>>>>>>>>>>>>>>>>>> for a lookup source (as the source will only
> receive
> > >>>>>>>>>>> the
> > >>>>>>>>>>>>>>>> requests
> > >>>>>>>>>>>>>>>>>>> after
> > >>>>>>>>>>>>>>>>>>>>>> filter; applying projection may be more
> interesting
> > to
> > >>>>>>>>>>> save
> > >>>>>>>>>>>>>>>>> memory).
> > >>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>> Another advantage is that all the changes of this
> > FLIP
> > >>>>>>>>>>>>> would be
> > >>>>>>>>>>>>>>>>>>> limited to
> > >>>>>>>>>>>>>>>>>>>>>> options, no need for new public interfaces.
> > Everything
> > >>>>>>>>>>> else
> > >>>>>>>>>>>>>>>>> remains
> > >>>>>>>>>>>>>>>>>> an
> > >>>>>>>>>>>>>>>>>>>>>> implementation of Table runtime. That means we can
> > >>>>>>>>>>> easily
> > >>>>>>>>>>>>>>>>>> incorporate
> > >>>>>>>>>>>>>>>>>>> the
> > >>>>>>>>>>>>>>>>>>>>>> optimization potential that Alexander pointed out
> > >>>>>>>>>>> later.
> > >>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>> @Alexander unfortunately, your architecture is not
> > >>>>>>>>>>> shared.
> > >>>>>>>>>>>>> I
> > >>>>>>>>>>>>>>>> don't
> > >>>>>>>>>>>>>>>>>>> know the
> > >>>>>>>>>>>>>>>>>>>>>> solution to share images to be honest.
> > >>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>> On Fri, Apr 22, 2022 at 5:04 PM Александр Смирнов
> <
> > >>>>>>>>>>>>>>>>>>> smiralexan@gmail.com>
> > >>>>>>>>>>>>>>>>>>>>>> wrote:
> > >>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>> Hi Qingsheng! My name is Alexander, I'm not a
> > >>>>>>>>>>> committer
> > >>>>>>>>>>>>> yet,
> > >>>>>>>>>>>>>>>> but
> > >>>>>>>>>>>>>>>>>> I'd
> > >>>>>>>>>>>>>>>>>>>>>>> really like to become one. And this FLIP really
> > >>>>>>>>>>>>> interested
> > >>>>>>>>>>>>>>>> me.
> > >>>>>>>>>>>>>>>>>>>>>>> Actually I have worked on a similar feature in my
> > >>>>>>>>>>>>> company’s
> > >>>>>>>>>>>>>>>>> Flink
> > >>>>>>>>>>>>>>>>>>>>>>> fork, and we would like to share our thoughts on
> > >>>>>>>>>>> this and
> > >>>>>>>>>>>>>>>> make
> > >>>>>>>>>>>>>>>>>> code
> > >>>>>>>>>>>>>>>>>>>>>>> open source.
> > >>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>> I think there is a better alternative than
> > >>>>>>>>>>> introducing an
> > >>>>>>>>>>>>>>>>> abstract
> > >>>>>>>>>>>>>>>>>>>>>>> class for TableFunction (CachingTableFunction).
> As
> > >>>>>>>>>>> you
> > >>>>>>>>>>>>> know,
> > >>>>>>>>>>>>>>>>>>>>>>> TableFunction exists in the flink-table-common
> > >>>>>>>>>>> module,
> > >>>>>>>>>>>>> which
> > >>>>>>>>>>>>>>>>>>> provides
> > >>>>>>>>>>>>>>>>>>>>>>> only an API for working with tables – it’s very
> > >>>>>>>>>>>>> convenient
> > >>>>>>>>>>>>>>>> for
> > >>>>>>>>>>>>>>>>>>> importing
> > >>>>>>>>>>>>>>>>>>>>>>> in connectors. In turn, CachingTableFunction
> > contains
> > >>>>>>>>>>>>> logic
> > >>>>>>>>>>>>>>>> for
> > >>>>>>>>>>>>>>>>>>>>>>> runtime execution,  so this class and everything
> > >>>>>>>>>>>>> connected
> > >>>>>>>>>>>>>>>> with
> > >>>>>>>>>>>>>>>>> it
> > >>>>>>>>>>>>>>>>>>>>>>> should be located in another module, probably in
> > >>>>>>>>>>>>>>>>>>> flink-table-runtime.
> > >>>>>>>>>>>>>>>>>>>>>>> But this will require connectors to depend on
> > another
> > >>>>>>>>>>>>> module,
> > >>>>>>>>>>>>>>>>>> which
> > >>>>>>>>>>>>>>>>>>>>>>> contains a lot of runtime logic, which doesn’t
> > sound
> > >>>>>>>>>>>>> good.
> > >>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>> I suggest adding a new method ‘getLookupConfig’
> to
> > >>>>>>>>>>>>>>>>>> LookupTableSource
> > >>>>>>>>>>>>>>>>>>>>>>> or LookupRuntimeProvider to allow connectors to
> > only
> > >>>>>>>>>>> pass
> > >>>>>>>>>>>>>>>>>>>>>>> configurations to the planner, therefore they
> won’t
> > >>>>>>>>>>>>> depend on
> > >>>>>>>>>>>>>>>>>>> runtime
> > >>>>>>>>>>>>>>>>>>>>>>> realization. Based on these configs planner will
> > >>>>>>>>>>>>> construct a
> > >>>>>>>>>>>>>>>>>> lookup
> > >>>>>>>>>>>>>>>>>>>>>>> join operator with corresponding runtime logic
> > >>>>>>>>>>>>>>>> (ProcessFunctions
> > >>>>>>>>>>>>>>>>>> in
> > >>>>>>>>>>>>>>>>>>>>>>> module flink-table-runtime). Architecture looks
> > like
> > >>>>>>>>>>> in
> > >>>>>>>>>>>>> the
> > >>>>>>>>>>>>>>>>> pinned
> > >>>>>>>>>>>>>>>>>>>>>>> image (LookupConfig class there is actually yours
> > >>>>>>>>>>>>>>>> CacheConfig).
> > >>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>> Classes in flink-table-planner, that will be
> > >>>>>>>>>>> responsible
> > >>>>>>>>>>>>> for
> > >>>>>>>>>>>>>>>>> this
> > >>>>>>>>>>>>>>>>>> –
> > >>>>>>>>>>>>>>>>>>>>>>> CommonPhysicalLookupJoin and his inheritors.
> > >>>>>>>>>>>>>>>>>>>>>>> Current classes for lookup join in
> > >>>>>>>>>>> flink-table-runtime
> > >>>>>>>>>>>>> -
> > >>>>>>>>>>>>>>>>>>>>>>> LookupJoinRunner, AsyncLookupJoinRunner,
> > >>>>>>>>>>>>>>>>> LookupJoinRunnerWithCalc,
> > >>>>>>>>>>>>>>>>>>>>>>> AsyncLookupJoinRunnerWithCalc.
> > >>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>> I suggest adding classes LookupJoinCachingRunner,
> > >>>>>>>>>>>>>>>>>>>>>>> LookupJoinCachingRunnerWithCalc, etc.
> > >>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>> And here comes another more powerful advantage of
> > >>>>>>>>>>> such a
> > >>>>>>>>>>>>>>>>> solution.
> > >>>>>>>>>>>>>>>>>>> If
> > >>>>>>>>>>>>>>>>>>>>>>> we have caching logic on a lower level, we can
> > apply
> > >>>>>>>>>>> some
> > >>>>>>>>>>>>>>>>>>>>>>> optimizations to it. LookupJoinRunnerWithCalc was
> > >>>>>>>>>>> named
> > >>>>>>>>>>>>> like
> > >>>>>>>>>>>>>>>>> this
> > >>>>>>>>>>>>>>>>>>>>>>> because it uses the ‘calc’ function, which
> actually
> > >>>>>>>>>>>>> mostly
> > >>>>>>>>>>>>>>>>>> consists
> > >>>>>>>>>>>>>>>>>>> of
> > >>>>>>>>>>>>>>>>>>>>>>> filters and projections.
> > >>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>> For example, in join table A with lookup table B
> > >>>>>>>>>>>>> condition
> > >>>>>>>>>>>>>>>>> ‘JOIN …
> > >>>>>>>>>>>>>>>>>>> ON
> > >>>>>>>>>>>>>>>>>>>>>>> A.id = B.id AND A.age = B.age + 10 WHERE
> B.salary >
> > >>>>>>>>>>> 1000’
> > >>>>>>>>>>>>>>>>> ‘calc’
> > >>>>>>>>>>>>>>>>>>>>>>> function will contain filters A.age = B.age + 10
> > and
> > >>>>>>>>>>>>>>>> B.salary >
> > >>>>>>>>>>>>>>>>>>> 1000.
> > >>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>> If we apply this function before storing records
> in
> > >>>>>>>>>>>>> cache,
> > >>>>>>>>>>>>>>>> size
> > >>>>>>>>>>>>>>>>> of
> > >>>>>>>>>>>>>>>>>>>>>>> cache will be significantly reduced: filters =
> > avoid
> > >>>>>>>>>>>>> storing
> > >>>>>>>>>>>>>>>>>> useless
> > >>>>>>>>>>>>>>>>>>>>>>> records in cache, projections = reduce records’
> > >>>>>>>>>>> size. So
> > >>>>>>>>>>>>> the
> > >>>>>>>>>>>>>>>>>> initial
> > >>>>>>>>>>>>>>>>>>>>>>> max number of records in cache can be increased
> by
> > >>>>>>>>>>> the
> > >>>>>>>>>>>>> user.
> > >>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>> What do you think about it?
> > >>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>
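The filter-before-cache optimization described above can be sketched framework-independently. This is only an illustration of the idea (all names here, such as FilteringLookupCache, are hypothetical and not Flink API): rows that fail the pushed-down 'calc' filter are never stored, so the cache holds fewer rows for the same key budget.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.function.Predicate;

// Sketch of "apply calc before caching": the filter from the join
// condition is evaluated on lookup results before they enter the cache.
public class FilteringLookupCache<K, V> {
    private final Map<K, List<V>> cache = new HashMap<>();
    private final Function<K, List<V>> backend;   // the actual lookup (e.g. JDBC)
    private final Predicate<V> calcFilter;        // filters from the join condition

    public FilteringLookupCache(Function<K, List<V>> backend, Predicate<V> calcFilter) {
        this.backend = backend;
        this.calcFilter = calcFilter;
    }

    public List<V> lookup(K key) {
        return cache.computeIfAbsent(key, k -> {
            List<V> filtered = new ArrayList<>();
            for (V row : backend.apply(k)) {
                if (calcFilter.test(row)) {  // drop useless rows before caching
                    filtered.add(row);
                }
            }
            return filtered;
        });
    }

    public int cachedRows(K key) {
        List<V> rows = cache.get(key);
        return rows == null ? 0 : rows.size();
    }
}
```

With the 'B.salary > 1000' filter from the example above, a backend row set of {500, 1500, 2500} would cache only two rows instead of three.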
> > >>>>>>>>>>>>>>>>>>>>>>> On 2022/04/19 02:47:11 Qingsheng Ren wrote:
> > >>>>>>>>>>>>>>>>>>>>>>>> Hi devs,
> > >>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>> Yuan and I would like to start a discussion
> about
> > >>>>>>>>>>>>>>>> FLIP-221[1],
> > >>>>>>>>>>>>>>>>>>> which
> > >>>>>>>>>>>>>>>>>>>>>>> introduces an abstraction of lookup table cache
> and
> > >>>>>>>>>>> its
> > >>>>>>>>>>>>>>>> standard
> > >>>>>>>>>>>>>>>>>>> metrics.
> > >>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>> Currently each lookup table source should
> > implement
> > >>>>>>>>>>>>> their
> > >>>>>>>>>>>>>>>> own
> > >>>>>>>>>>>>>>>>>>> cache to
> > >>>>>>>>>>>>>>>>>>>>>>> store lookup results, and there isn’t a standard
> of
> > >>>>>>>>>>>>> metrics
> > >>>>>>>>>>>>>>>> for
> > >>>>>>>>>>>>>>>>>>> users and
> > >>>>>>>>>>>>>>>>>>>>>>> developers to tuning their jobs with lookup
> joins,
> > >>>>>>>>>>> which
> > >>>>>>>>>>>>> is a
> > >>>>>>>>>>>>>>>>>> quite
> > >>>>>>>>>>>>>>>>>>> common
> > >>>>>>>>>>>>>>>>>>>>>>> use case in Flink table / SQL.
> > >>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>> Therefore we propose some new APIs including
> > cache,
> > >>>>>>>>>>>>>>>> metrics,
> > >>>>>>>>>>>>>>>>>>> wrapper
> > >>>>>>>>>>>>>>>>>>>>>>> classes of TableFunction and new table options.
> > >>>>>>>>>>> Please
> > >>>>>>>>>>>>> take a
> > >>>>>>>>>>>>>>>>> look
> > >>>>>>>>>>>>>>>>>>> at the
> > >>>>>>>>>>>>>>>>>>>>>>> FLIP page [1] to get more details. Any
> suggestions
> > >>>>>>>>>>> and
> > >>>>>>>>>>>>>>>> comments
> > >>>>>>>>>>>>>>>>>>> would be
> > >>>>>>>>>>>>>>>>>>>>>>> appreciated!
> > >>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>> [1]
> > >>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-221+Abstraction+for+lookup+source+cache+and+metric
> > >>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>> Best regards,
> > >>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>> Qingsheng
> > >>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>> --
> > >>>>>>>>>>>>>>>>>>>> Best Regards,
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>> Qingsheng Ren
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>> Real-time Computing Team
> > >>>>>>>>>>>>>>>>>>>> Alibaba Cloud
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>> Email: renqschn@gmail.com
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> --
> > >>>>>>>>>>>>>>> Best regards,
> > >>>>>>>>>>>>>>> Roman Boyko
> > >>>>>>>>>>>>>>> e.: ro.v.boyko@gmail.com
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>> --
> > >>>>>>>> Best Regards,
> > >>>>>>>>
> > >>>>>>>> Qingsheng Ren
> > >>>>>>>>
> > >>>>>>>> Real-time Computing Team
> > >>>>>>>> Alibaba Cloud
> > >>>>>>>>
> > >>>>>>>> Email: renqschn@gmail.com
> > >>>>>>
> >
> >
>

Re: [DISCUSS] FLIP-221 Abstraction for lookup source cache and metric

Posted by Lincoln Lee <li...@gmail.com>.
Hi Qingsheng,

Sorry for jumping into the discussion so late. It's a good idea that we can
have common table options. I have a minor comment on 'lookup.async': I suggest
not making it a common option.

The table layer abstracts both sync and async lookup capabilities, and
connector implementers can choose one or both. In the case of implementing
only one capability (the status of most existing built-in connectors),
'lookup.async' will not be used. And when a connector has both
capabilities, I think this choice is more suitable for making decisions at
the query level: for example, the table planner can choose the physical
implementation of async or sync lookup based on its cost model, or
users can give a query hint based on their own better understanding. If
'lookup.async' were another common table option, it might confuse
users in the long run.

So, I prefer to leave the 'lookup.async' option as a connector-specific one
(for the current HBase connector) and not turn it into a common option.

WDYT?

Best,
Lincoln Lee


On Mon, May 23, 2022 at 14:54, Qingsheng Ren <re...@gmail.com> wrote:

> Hi Alexander,
>
> Thanks for the review! We recently updated the FLIP and you can find those
> changes in my latest email. Since some terminology has changed, I'll use
> the new concepts when replying to your comments.
>
> 1. Builder vs ‘of’
> I’m OK to use builder pattern if we have additional optional parameters
> for full caching mode (“rescan” previously). The schedule-with-delay idea
> looks reasonable to me, but I think we need to redesign the builder API of
> full caching to make it more descriptive for developers. Would you mind
> sharing your ideas about the API? For accessing the FLIP workspace you can
> just provide your account ID and ping any PMC member including Jark.
>
> 2. Common table options
> We have some discussions these days and propose to introduce 8 common
> table options about caching. It has been updated on the FLIP.
>
> 3. Retries
> I think we are on the same page :-)
>
> For your additional concerns:
> 1) The table option has been updated.
> 2) We got “lookup.cache” back for configuring whether to use partial or
> full caching mode.
>
> Best regards,
>
> Qingsheng
>
>
>
> > On May 19, 2022, at 17:25, Александр Смирнов <sm...@gmail.com>
> wrote:
> >
> > Also I have a few additions:
> > 1) maybe rename 'lookup.cache.maximum-size' to
> > 'lookup.cache.max-rows'? I think it will be clearer that we are talking
> > not about bytes but about the number of rows. Plus it fits better,
> > considering my optimization with filters.
> > 2) How will users enable rescanning? Are we going to separate caching
> > and rescanning from the options point of view? Like initially we had
> > one option 'lookup.cache' with values LRU / ALL. I think now we can
> > make a boolean option 'lookup.rescan'. RescanInterval can be
> > 'lookup.rescan.interval', etc.
> >
> > Best regards,
> > Alexander
> >
> > On Thu, 19 May 2022 at 14:50, Александр Смирнов <sm...@gmail.com> wrote:
> >>
> >> Hi Qingsheng and Jark,
> >>
> >> 1. Builders vs 'of'
> >> I understand that builders are used when we have multiple parameters.
> >> I suggested them because we could add parameters later. To prevent
> >> Builder for ScanRuntimeProvider from looking redundant I can suggest
> >> one more config now - "rescanStartTime".
> >> It's a time in UTC (LocalTime class) when the first reload of cache
> >> starts. This parameter can be thought of as 'initialDelay' (diff
> >> between current time and rescanStartTime) in method
> >> ScheduledExecutorService#scheduleWithFixedDelay [1]. It can be very
> >> useful when the dimension table is updated by some other scheduled job
> >> at a certain time, or when the user simply wants the second scan (the
> >> first cache reload) to be delayed. This option can be used even without
> >> 'rescanInterval' - in this case 'rescanInterval' will be one day.
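The initialDelay computation described above might look roughly like the following JDK-only sketch (the class and method names are illustrative, not part of the FLIP):

```java
import java.time.Duration;
import java.time.LocalTime;

// Computes the initial delay before the first cache reload, given the
// current UTC time and the configured rescanStartTime. If the start time
// has already passed today, the first reload happens tomorrow.
public class RescanDelays {
    public static Duration initialDelay(LocalTime nowUtc, LocalTime rescanStartTime) {
        Duration delay = Duration.between(nowUtc, rescanStartTime);
        if (delay.isNegative()) {
            delay = delay.plusDays(1);  // wrap around to the next day
        }
        return delay;
    }
}
```

The result would then feed the first argument of scheduleWithFixedDelay, with 'rescanInterval' (defaulting to one day) as the period.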
> >> If you are fine with this option, I would be very glad if you would
> >> give me access to edit FLIP page, so I could add it myself
> >>
> >> 2. Common table options
> >> I also think that FactoryUtil would be overloaded by all cache
> >> options. But maybe unify all suggested options, not only for default
> >> cache? I.e. a 'LookupOptions' class that unifies default cache options,
> >> rescan options, 'async' and 'maxRetries'. WDYT?
> >>
> >> 3. Retries
> >> I'm fine with suggestion close to RetryUtils#tryTimes(times, call)
> >>
> >> [1]
> https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/ScheduledExecutorService.html#scheduleWithFixedDelay-java.lang.Runnable-long-long-java.util.concurrent.TimeUnit-
> >>
> >> Best regards,
> >> Alexander
> >>
> >> On Wed, 18 May 2022 at 16:04, Qingsheng Ren <re...@gmail.com> wrote:
> >>>
> >>> Hi Jark and Alexander,
> >>>
> >>> Thanks for your comments! I’m also OK to introduce common table
> options. I prefer to introduce a new DefaultLookupCacheOptions class for
> holding these option definitions because putting all options into
> FactoryUtil would make it a bit ”crowded” and not well categorized.
> >>>
> >>> FLIP has been updated according to suggestions above:
> >>> 1. Use static “of” method for constructing RescanRuntimeProvider
> considering both arguments are required.
> >>> 2. Introduce new table options matching DefaultLookupCacheFactory
> >>>
> >>> Best,
> >>> Qingsheng
> >>>
> >>> On Wed, May 18, 2022 at 2:57 PM Jark Wu <im...@gmail.com> wrote:
> >>>>
> >>>> Hi Alex,
> >>>>
> >>>> 1) retry logic
> >>>> I think we can extract some common retry logic into utilities, e.g.
> RetryUtils#tryTimes(times, call).
> >>>> This seems independent of this FLIP and can be reused by DataStream
> users.
> >>>> Maybe we can open an issue to discuss this and where to put it.
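Such a utility might look roughly like this (a sketch; 'RetryUtils' is a name proposed in the discussion, not an existing Flink class, and the exact signature here is an assumption):

```java
import java.util.function.Supplier;

// Retries 'call' up to 'times' attempts and rethrows the last failure
// once all attempts are exhausted. Assumes times >= 1.
public class RetryUtils {
    public static <T> T tryTimes(int times, Supplier<T> call) {
        RuntimeException last = null;
        for (int attempt = 1; attempt <= times; attempt++) {
            try {
                return call.get();
            } catch (RuntimeException e) {
                last = e;  // a connector could re-establish its connection here
            }
        }
        throw last;
    }
}
```

A connector's lookup could then wrap its I/O call, e.g. `RetryUtils.tryTimes(maxRetryTimes, () -> queryBackend(key))`, keeping per-connector recovery logic (such as reconnecting) inside the supplier.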
> >>>>
> >>>> 2) cache ConfigOptions
> >>>> I'm fine with defining cache config options in the framework.
> >>>> A candidate place to put is FactoryUtil which also includes
> "sink.parallelism", "format" options.
> >>>>
> >>>> Best,
> >>>> Jark
> >>>>
> >>>>
> >>>> On Wed, 18 May 2022 at 13:52, Александр Смирнов <sm...@gmail.com>
> wrote:
> >>>>>
> >>>>> Hi Qingsheng,
> >>>>>
> >>>>> Thank you for considering my comments.
> >>>>>
> >>>>>> there might be custom logic before retrying, such as
> re-establishing the connection
> >>>>>
> >>>>> Yes, I understand that. I meant that such logic can be placed in a
> >>>>> separate function that can be implemented by connectors. Just moving
> >>>>> the retry logic there would make connectors' LookupFunctions more
> >>>>> concise and avoid duplicate code. However, it's a minor change. The decision is
> up
> >>>>> to you.
> >>>>>
> >>>>>> We decided not to provide common DDL options and to let developers
> define their own options per connector, as we do now.
> >>>>>
> >>>>> What is the reason for that? One of the main goals of this FLIP was
> to
> >>>>> unify the configs, wasn't it? I understand that the current cache design
> >>>>> doesn't depend on ConfigOptions, as it did before. But we can still put
> >>>>> these options into the framework, so connectors can reuse them and
> >>>>> avoid code duplication and, more significantly, avoid possibly
> >>>>> inconsistent option naming. This could be pointed out in the
> >>>>> documentation for connector developers.
> >>>>>
> >>>>> Best regards,
> >>>>> Alexander
> >>>>>
> >>>>> On Tue, 17 May 2022 at 17:11, Qingsheng Ren <re...@gmail.com> wrote:
> >>>>>>
> >>>>>> Hi Alexander,
> >>>>>>
> >>>>>> Thanks for the review and glad to see we are on the same page! I
> think you forgot to cc the dev mailing list so I’m also quoting your reply
> under this email.
> >>>>>>
> >>>>>>> We can add 'maxRetryTimes' option into this class
> >>>>>>
> >>>>>> In my opinion the retry logic should be implemented in lookup()
> instead of in LookupFunction#eval(). Retrying is only meaningful under some
> specific retriable failures, and there might be custom logic before
> retrying, such as re-establishing the connection (JdbcRowDataLookupFunction
> is an example), so it's handier to leave it to the connector.
> >>>>>>
> >>>>>>> I don't see DDL options, that were in previous version of FLIP. Do
> you have any special plans for them?
> >>>>>>
> >>>>>> We decided not to provide common DDL options and to let developers
> define their own options per connector, as we do now.
> >>>>>>
> >>>>>> The rest of the comments sound great and I’ll update the FLIP. Hope we
> can finalize our proposal soon!
> >>>>>>
> >>>>>> Best,
> >>>>>>
> >>>>>> Qingsheng
> >>>>>>
> >>>>>>
> >>>>>>> On May 17, 2022, at 13:46, Александр Смирнов <sm...@gmail.com>
> wrote:
> >>>>>>>
> >>>>>>> Hi Qingsheng and devs!
> >>>>>>>
> >>>>>>> I like the overall design of updated FLIP, however I have several
> >>>>>>> suggestions and questions.
> >>>>>>>
> >>>>>>> 1) Introducing LookupFunction as a subclass of TableFunction is a
> good
> >>>>>>> idea. We can add a 'maxRetryTimes' option to this class. The 'eval'
> method
> >>>>>>> of the new LookupFunction is great for this purpose. The same holds
> >>>>>>> for the 'async' case.
> >>>>>>>
> >>>>>>> 2) There might be other configs in future, such as
> 'cacheMissingKey'
> >>>>>>> in LookupFunctionProvider or 'rescanInterval' in
> ScanRuntimeProvider.
> >>>>>>> Maybe use Builder pattern in LookupFunctionProvider and
> >>>>>>> RescanRuntimeProvider for more flexibility (use one 'build' method
> >>>>>>> instead of many 'of' methods in future)?
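For illustration, the builder idea could take a shape like the sketch below. The class and field names (and the defaults) are hypothetical, not the FLIP's final API; the point is that optional settings such as 'cacheMissingKey' can be added later without breaking existing callers, unlike a fixed-arity static 'of(...)' factory.

```java
// Hypothetical builder-style provider for lookup runtime configuration.
public class LookupProviderSketch {
    private final boolean cacheMissingKey;
    private final long rescanIntervalMs;

    private LookupProviderSketch(Builder b) {
        this.cacheMissingKey = b.cacheMissingKey;
        this.rescanIntervalMs = b.rescanIntervalMs;
    }

    public boolean isCacheMissingKey() { return cacheMissingKey; }
    public long getRescanIntervalMs() { return rescanIntervalMs; }

    public static Builder builder() { return new Builder(); }

    public static class Builder {
        // Defaults keep the conciseness of an 'of(...)' call.
        private boolean cacheMissingKey = true;
        private long rescanIntervalMs = 24L * 60 * 60 * 1000; // one day

        public Builder cacheMissingKey(boolean v) { this.cacheMissingKey = v; return this; }
        public Builder rescanIntervalMs(long v) { this.rescanIntervalMs = v; return this; }
        public LookupProviderSketch build() { return new LookupProviderSketch(this); }
    }
}
```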
> >>>>>>>
> >>>>>>> 3) What are the plans for existing TableFunctionProvider and
> >>>>>>> AsyncTableFunctionProvider? I think they should be deprecated.
> >>>>>>>
> >>>>>>> 4) Am I right that the current design does not assume usage of
> >>>>>>> user-provided LookupCache in re-scanning? In this case, it is not
> very
> >>>>>>> clear why do we need methods such as 'invalidate' or 'putAll' in
> >>>>>>> LookupCache.
> >>>>>>>
> >>>>>>> 5) I don't see DDL options, that were in previous version of FLIP.
> Do
> >>>>>>> you have any special plans for them?
> >>>>>>>
> >>>>>>> If you don't mind, I would be glad to be able to make small
> >>>>>>> adjustments to the FLIP document too. I think it's worth mentioning
> >>>>>>> what optimizations exactly are planned for the future.
> >>>>>>>
> >>>>>>> Best regards,
> >>>>>>> Smirnov Alexander
> >>>>>>>
> >>>>>>> On Fri, 13 May 2022 at 20:27, Qingsheng Ren <re...@gmail.com> wrote:
> >>>>>>>>
> >>>>>>>> Hi Alexander and devs,
> >>>>>>>>
> >>>>>>>> Thank you very much for the in-depth discussion! As Jark
> mentioned we were inspired by Alexander's idea and made a refactor on our
> design. FLIP-221 [1] has been updated to reflect our design now and we are
> happy to hear more suggestions from you!
> >>>>>>>>
> >>>>>>>> Compared to the previous design:
> >>>>>>>> 1. The lookup cache serves at table runtime level and is
> integrated as a component of LookupJoinRunner as discussed previously.
> >>>>>>>> 2. Interfaces are renamed and re-designed to reflect the new
> design.
> >>>>>>>> 3. We separate the all-caching case individually and introduce a
> new RescanRuntimeProvider to reuse the ability of scanning. We are planning
> to support SourceFunction / InputFormat for now considering the complexity
> of FLIP-27 Source API.
> >>>>>>>> 4. A new interface LookupFunction is introduced to make the
> semantic of lookup more straightforward for developers.
> >>>>>>>>
> >>>>>>>> For replying to Alexander:
> >>>>>>>>> However I'm a little confused whether InputFormat is deprecated
> or not. Am I right that it will be so in the future, but currently it's not?
> >>>>>>>> Yes you are right. InputFormat is not deprecated for now. I think
> it will be deprecated in the future but we don't have a clear plan for that.
> >>>>>>>>
> >>>>>>>> Thanks again for the discussion on this FLIP and looking forward
> to cooperating with you after we finalize the design and interfaces!
> >>>>>>>>
> >>>>>>>> [1]
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-221+Abstraction+for+lookup+source+cache+and+metric
> >>>>>>>>
> >>>>>>>> Best regards,
> >>>>>>>>
> >>>>>>>> Qingsheng
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On Fri, May 13, 2022 at 12:12 AM Александр Смирнов <
> smiralexan@gmail.com> wrote:
> >>>>>>>>>
> >>>>>>>>> Hi Jark, Qingsheng and Leonard!
> >>>>>>>>>
> >>>>>>>>> Glad to see that we came to a consensus on almost all points!
> >>>>>>>>>
> >>>>>>>>> However I'm a little confused whether InputFormat is deprecated
> or
> >>>>>>>>> not. Am I right that it will be so in the future, but currently
> it's
> >>>>>>>>> not? Actually I also think that for the first version it's OK to
> use
> >>>>>>>>> InputFormat in ALL cache realization, because supporting rescan
> >>>>>>>>> ability seems like a very distant prospect. But for this
> decision we
> >>>>>>>>> need a consensus among all discussion participants.
> >>>>>>>>>
> >>>>>>>>> In general, I don't have anything to argue with in your
> statements. All
> >>>>>>>>> of them correspond to my ideas. Looking ahead, it would be nice to
> work
> >>>>>>>>> on this FLIP cooperatively. I've already done a lot of work on
> lookup
> >>>>>>>>> join caching with an implementation very close to the one we are
> discussing,
> >>>>>>>>> and want to share the results of this work. Anyway, looking
> forward to
> >>>>>>>>> the FLIP update!
> >>>>>>>>>
> >>>>>>>>> Best regards,
> >>>>>>>>> Smirnov Alexander
> >>>>>>>>>
> >>>>>>>>> On Thu, 12 May 2022 at 17:38, Jark Wu <im...@gmail.com> wrote:
> >>>>>>>>>>
> >>>>>>>>>> Hi Alex,
> >>>>>>>>>>
> >>>>>>>>>> Thanks for summarizing your points.
> >>>>>>>>>>
> >>>>>>>>>> In the past week, Qingsheng, Leonard, and I have discussed it
> several times
> >>>>>>>>>> and we have totally refactored the design.
> >>>>>>>>>> I'm glad to say we have reached a consensus on many of your
> points!
> >>>>>>>>>> Qingsheng is still working on updating the design docs and
> maybe can be
> >>>>>>>>>> available in the next few days.
> >>>>>>>>>> I will share some conclusions from our discussions:
> >>>>>>>>>>
> >>>>>>>>>> 1) we have refactored the design towards the "cache in
> framework" way.
> >>>>>>>>>>
> >>>>>>>>>> 2) a "LookupCache" interface for users to customize and a
> default
> >>>>>>>>>> implementation with a builder for ease of use.
> >>>>>>>>>> This makes it possible to have both flexibility and
> conciseness.
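For illustration, a customizable lookup cache of the kind discussed here might take a shape like the following. The method names are inferred from ones mentioned in this thread ('invalidate', 'putAll'); the actual FLIP-221 LookupCache interface may well differ, so treat this purely as a sketch.

```java
import java.util.HashMap;
import java.util.Map;

// Assumed shape of a customizable lookup cache (illustrative only).
interface LookupCacheSketch<K, V> {
    V getIfPresent(K key);
    void put(K key, V value);
    void putAll(Map<K, V> entries);   // useful for bulk loads, e.g. ALL caching
    void invalidate(K key);
    long size();
}

// A trivial in-memory default implementation for illustration.
class MapLookupCache<K, V> implements LookupCacheSketch<K, V> {
    private final Map<K, V> map = new HashMap<>();
    public V getIfPresent(K key) { return map.get(key); }
    public void put(K key, V value) { map.put(key, value); }
    public void putAll(Map<K, V> entries) { map.putAll(entries); }
    public void invalidate(K key) { map.remove(key); }
    public long size() { return map.size(); }
}
```

Under this split, users would customize by implementing the interface, while the framework ships a default implementation built via a builder.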
> >>>>>>>>>>
> >>>>>>>>>> 3) Filter pushdown is important for ALL and LRU lookup cache,
> especially for reducing
> >>>>>>>>>> IO.
> >>>>>>>>>> Filter pushdown should be the final state and the unified way
> to both
> >>>>>>>>>> support pruning ALL cache and LRU cache,
> >>>>>>>>>> so I think we should make effort in this direction. If we need
> to support
> >>>>>>>>>> filter pushdown for ALL cache anyway, why not use
> >>>>>>>>>> it for LRU cache as well? Either way, as we decide to implement
> the cache
> >>>>>>>>>> in the framework, we have the chance to support
> >>>>>>>>>> filter on cache anytime. This is an optimization and it doesn't
> affect the
> >>>>>>>>>> public API. I think we can create a JIRA issue to
> >>>>>>>>>> discuss it when the FLIP is accepted.
> >>>>>>>>>>
> >>>>>>>>>> 4) The idea to support ALL cache is similar to your proposal.
> >>>>>>>>>> In the first version, we will only support InputFormat,
> SourceFunction for
> >>>>>>>>>> cache all (invoke InputFormat in join operator).
> >>>>>>>>>> For FLIP-27 source, we need to join a true source operator
> instead of
> >>>>>>>>>> calling it embedded in the join operator.
> >>>>>>>>>> However, this needs another FLIP to support the re-scan ability
> for FLIP-27
> >>>>>>>>>> Source, and this can be a large piece of work.
> >>>>>>>>>> In order to not block this issue, we can put the effort of
> FLIP-27 source
> >>>>>>>>>> integration into future work and integrate
> >>>>>>>>>> InputFormat&SourceFunction for now.
> >>>>>>>>>>
> >>>>>>>>>> I think it's fine to use InputFormat&SourceFunction, as they
> are not
> >>>>>>>>>> deprecated; otherwise, we would have to introduce another function
> >>>>>>>>>> similar to them, which is meaningless.
> source
> >>>>>>>>>> integration ASAP before InputFormat & SourceFunction are
> deprecated.
> >>>>>>>>>>
> >>>>>>>>>> Best,
> >>>>>>>>>> Jark
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> On Thu, 12 May 2022 at 15:46, Александр Смирнов <
> smiralexan@gmail.com>
> >>>>>>>>>> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> Hi Martijn!
> >>>>>>>>>>>
> >>>>>>>>>>> Got it. Therefore, the InputFormat-based implementation is not
> considered.
> >>>>>>>>>>> Thanks for clearing that up!
> >>>>>>>>>>>
> >>>>>>>>>>> Best regards,
> >>>>>>>>>>> Smirnov Alexander
> >>>>>>>>>>>
> >>>>>>>>>>> чт, 12 мая 2022 г. в 14:23, Martijn Visser <
> martijn@ververica.com>:
> >>>>>>>>>>>>
> >>>>>>>>>>>> Hi,
> >>>>>>>>>>>>
> >>>>>>>>>>>> With regards to:
> >>>>>>>>>>>>
> >>>>>>>>>>>>> But if there are plans to refactor all connectors to FLIP-27
> >>>>>>>>>>>>
> >>>>>>>>>>>> Yes, FLIP-27 is the target for all connectors. The old
> interfaces will be
> >>>>>>>>>>>> deprecated and connectors will either be refactored to use
> the new ones
> >>>>>>>>>>> or
> >>>>>>>>>>>> dropped.
> >>>>>>>>>>>>
> >>>>>>>>>>>> The caching should work for connectors that are using FLIP-27
> interfaces,
> >>>>>>>>>>>> we should not introduce new features for old interfaces.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Best regards,
> >>>>>>>>>>>>
> >>>>>>>>>>>> Martijn
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Thu, 12 May 2022 at 06:19, Александр Смирнов <
> smiralexan@gmail.com>
> >>>>>>>>>>>> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>>> Hi Jark!
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Sorry for the late response. I would like to make some
> comments and
> >>>>>>>>>>>>> clarify my points.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> 1) I agree with your first statement. I think we can achieve
> both
> >>>>>>>>>>>>> advantages this way: put the Cache interface in
> flink-table-common,
> >>>>>>>>>>>>> but have implementations of it in flink-table-runtime.
> Therefore if a
> >>>>>>>>>>>>> connector developer wants to use existing cache strategies
> and their
> >>>>>>>>>>>>> implementations, he can just pass lookupConfig to the
> planner, but if
> >>>>>>>>>>>>> he wants to have his own cache implementation in his
> TableFunction, it
> >>>>>>>>>>>>> will be possible for him to use the existing interface for
> this
> >>>>>>>>>>>>> purpose (we can explicitly point this out in the
> documentation). In
> >>>>>>>>>>>>> this way all configs and metrics will be unified. WDYT?
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> If a filter can prune 90% of data in the cache, we will
> have 90% of
> >>>>>>>>>>>>> lookup requests that can never be cached
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> 2) Let me clarify the logic filters optimization in case of
> LRU cache.
> >>>>>>>>>>>>> It looks like Cache<RowData, Collection<RowData>>. Here we
> always
> >>>>>>>>>>>>> store the response of the dimension table in cache, even
> after
> >>>>>>>>>>>>> applying calc function. I.e. if there are no rows after
> applying
> >>>>>>>>>>>>> filters to the result of the 'eval' method of TableFunction,
> we store
> >>>>>>>>>>>>> an empty list under the lookup keys. Therefore the cache entry will
> be
> >>>>>>>>>>>>> filled, but will require much less memory (in bytes). I.e.
> we don't
> >>>>>>>>>>>>> completely filter out keys whose results were pruned, but we
> significantly
> >>>>>>>>>>>>> reduce the memory required to store those results. If the user
> knows about
> >>>>>>>>>>>>> this behavior, he can increase the 'max-rows' option before
> the start
> >>>>>>>>>>>>> of the job. But actually I came up with the idea that we can
> do this
> >>>>>>>>>>>>> automatically by using the 'maximumWeight' and 'weigher'
> methods of
> >>>>>>>>>>>>> GuavaCache [1]. Weight can be the size of the collection of
> rows
> >>>>>>>>>>>>> (value of cache). Therefore cache can automatically fit much
> more
> >>>>>>>>>>>>> records than before.
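[Editor's note: to make the weigher idea above concrete, here is a minimal pure-Java sketch of a weight-bounded LRU cache that mirrors the behavior of Guava's maximumWeight/weigher; it is an illustration only, not Guava's or Flink's actual API.]

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.ToLongFunction;

// Weight-bounded LRU sketch: an entry's weight is the size of its row
// collection, so keys whose results were pruned by filters (empty lists)
// cost almost nothing, and the cache fits many more entries overall.
class WeightedLruCache<K, V> {
    private final long maxWeight;
    private final ToLongFunction<V> weigher;
    private long currentWeight = 0;
    // accessOrder = true makes iteration order least-recently-used first.
    private final LinkedHashMap<K, V> map = new LinkedHashMap<>(16, 0.75f, true);

    WeightedLruCache(long maxWeight, ToLongFunction<V> weigher) {
        this.maxWeight = maxWeight;
        this.weigher = weigher;
    }

    V get(K key) {
        return map.get(key);
    }

    void put(K key, V value) {
        V old = map.remove(key);
        if (old != null) {
            currentWeight -= weigher.applyAsLong(old);
        }
        map.put(key, value);
        currentWeight += weigher.applyAsLong(value);
        // Evict least-recently-used entries until back under the weight bound.
        Iterator<Map.Entry<K, V>> it = map.entrySet().iterator();
        while (currentWeight > maxWeight && it.hasNext()) {
            currentWeight -= weigher.applyAsLong(it.next().getValue());
            it.remove();
        }
    }

    int size() {
        return map.size();
    }
}
```

With a weigher like `rows -> Math.max(1, rows.size())`, an empty result still occupies one unit (so negative lookups stay cached), while large results are charged proportionally to their row count.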
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> Flink SQL has provided a standard way to do filters and
> projects
> >>>>>>>>>>>>> pushdown, i.e., SupportsFilterPushDown and
> SupportsProjectionPushDown.
> >>>>>>>>>>>>>> Jdbc/hive/HBase haven't implemented the interfaces, which
> doesn't mean it's
> >>>>>>>>>>> hard
> >>>>>>>>>>>>> to implement.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> It's debatable how difficult it will be to implement filter
> pushdown.
> >>>>>>>>>>>>> But I think the fact that currently there is no database
> connector
> >>>>>>>>>>>>> with filter pushdown at least means that this feature won't
> be
> >>>>>>>>>>>>> supported soon in connectors. Moreover, if we talk about
> other
> >>>>>>>>>>>>> connectors (not in Flink repo), their databases might not
> support all
> >>>>>>>>>>>>> Flink filters (or not support filters at all). I think users
> are
> >>>>>>>>>>>>> interested in supporting cache filters optimization
> independently of
> >>>>>>>>>>>>> supporting other features and solving more complex problems
> (or
> >>>>>>>>>>>>> unsolvable at all).
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> 3) I agree with your third statement. Actually in our
> internal version
> >>>>>>>>>>>>> I also tried to unify the logic of scanning and reloading
> data from
> >>>>>>>>>>>>> connectors. But unfortunately, I didn't find a way to unify
> the logic
> >>>>>>>>>>>>> of all ScanRuntimeProviders (InputFormat, SourceFunction,
> Source,...)
> >>>>>>>>>>>>> and reuse it in reloading ALL cache. As a result I settled
> on using
> >>>>>>>>>>>>> InputFormat, because it was used for scanning in all lookup
> >>>>>>>>>>>>> connectors. (I didn't know that there are plans to deprecate
> >>>>>>>>>>>>> InputFormat in favor of FLIP-27 Source). IMO usage of
> FLIP-27 source
> >>>>>>>>>>>>> in ALL caching is not a good idea, because this source was
> designed to
> >>>>>>>>>>>>> work in distributed environment (SplitEnumerator on
> JobManager and
> >>>>>>>>>>>>> SourceReaders on TaskManagers), not in one operator (lookup
> join
> >>>>>>>>>>>>> operator in our case). There is even no direct way to pass
> splits from
> >>>>>>>>>>>>> SplitEnumerator to SourceReader (this logic works through
> >>>>>>>>>>>>> SplitEnumeratorContext, which requires
> >>>>>>>>>>>>> OperatorCoordinator.SubtaskGateway to send AddSplitEvents).
> Usage of
> >>>>>>>>>>>>> InputFormat for ALL cache seems much clearer and
> easier. But if
> >>>>>>>>>>>>> there are plans to refactor all connectors to FLIP-27, I
> have the
> >>>>>>>>>>>>> following ideas: maybe we can abandon the lookup join ALL
> cache in
> >>>>>>>>>>>>> favor of simple join with multiple scanning of batch source?
> The point
> >>>>>>>>>>>>> is that the only difference between lookup join ALL cache
> and simple
> >>>>>>>>>>>>> join with batch source is that in the first case scanning is
> performed
> >>>>>>>>>>>>> multiple times, in between which state (cache) is cleared
> (correct me
> >>>>>>>>>>>>> if I'm wrong). So what if we extend the functionality of
> simple join
> >>>>>>>>>>>>> to support state reloading + extend the functionality of
> scanning
> >>>>>>>>>>>>> batch source multiple times (this one should be easy with
> new FLIP-27
> >>>>>>>>>>>>> source, which unifies streaming/batch reading - we will need
> to change
> >>>>>>>>>>>>> only SplitEnumerator, which will pass splits again after
> some TTL).
> >>>>>>>>>>>>> WDYT? I must say that this looks like a long-term goal and
> will make
> >>>>>>>>>>>>> the scope of this FLIP even larger than you said. Maybe we
> can limit
> >>>>>>>>>>>>> ourselves to a simpler solution now (InputFormats).
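[Editor's note: the "pass splits again after some TTL" idea could be sketched, very roughly, with a plain scheduler; the names below are illustrative stand-ins, not Flink's SplitEnumerator API.]

```java
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

// Rough sketch: run the initial split assignment immediately, then re-run
// it after every TTL so downstream state (the cache) can be rebuilt.
class PeriodicReScanner {
    static ScheduledFuture<?> scheduleReScan(
            ScheduledExecutorService scheduler,
            Runnable assignAllSplits,
            long ttlMillis) {
        return scheduler.scheduleAtFixedRate(
                assignAllSplits, 0, ttlMillis, TimeUnit.MILLISECONDS);
    }
}
```

In a real enumerator the `assignAllSplits` step would re-create the splits and hand them to readers; here it is just a `Runnable` to keep the sketch self-contained.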
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> So to sum up, my points is like this:
> >>>>>>>>>>>>> 1) There is a way to make both concise and flexible
> interfaces for
> >>>>>>>>>>>>> caching in lookup join.
> >>>>>>>>>>>>> 2) Cache filters optimization is important both in LRU and
> ALL caches.
> >>>>>>>>>>>>> 3) It is unclear when filter pushdown will be supported in
> Flink
> >>>>>>>>>>>>> connectors, some of the connectors might not have the
> opportunity to
> >>>>>>>>>>>>> support filter pushdown + as far as I know, currently filter
> pushdown works
> >>>>>>>>>>>>> only for scanning (not lookup). So cache filters +
> projections
> >>>>>>>>>>>>> optimization should be independent from other features.
> >>>>>>>>>>>>> 4) ALL cache implementation is a complex topic that involves
> multiple
> >>>>>>>>>>>>> aspects of how Flink is developing. Abandoning
> InputFormat in favor
> >>>>>>>>>>>>> of FLIP-27 Source will make the ALL cache implementation really
> complex and
> >>>>>>>>>>>>> unclear, so maybe instead we can extend the
> functionality of
> >>>>>>>>>>>>> simple join, or keep InputFormat in the case of lookup
> join ALL
> >>>>>>>>>>>>> cache?
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Best regards,
> >>>>>>>>>>>>> Smirnov Alexander
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> [1]
> >>>>>>>>>>>>>
> >>>>>>>>>>>
> https://guava.dev/releases/18.0/api/docs/com/google/common/cache/CacheBuilder.html#weigher(com.google.common.cache.Weigher)
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> чт, 5 мая 2022 г. в 20:34, Jark Wu <im...@gmail.com>:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> It's great to see the active discussion! I want to share my
> ideas:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> 1) implement the cache in framework vs. connectors base
> >>>>>>>>>>>>>> I don't have a strong opinion on this. Both ways should
> work (e.g.,
> >>>>>>>>>>> cache
> >>>>>>>>>>>>>> pruning, compatibility).
> >>>>>>>>>>>>>> The framework way can provide more concise interfaces.
> >>>>>>>>>>>>>> The connector base way can define more flexible cache
> >>>>>>>>>>>>>> strategies/implementations.
> >>>>>>>>>>>>>> We are still investigating a way to see if we can have both
> >>>>>>>>>>> advantages.
> >>>>>>>>>>>>>> We should reach a consensus that the way should be a final
> state,
> >>>>>>>>>>> and we
> >>>>>>>>>>>>>> are on the path to it.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> 2) filters and projections pushdown:
> >>>>>>>>>>>>>> I agree with Alex that the filter pushdown into cache can
> benefit a
> >>>>>>>>>>> lot
> >>>>>>>>>>>>> for
> >>>>>>>>>>>>>> ALL cache.
> >>>>>>>>>>>>>> However, this is not true for LRU cache. Connectors use
> cache to
> >>>>>>>>>>> reduce
> >>>>>>>>>>>>> IO
> >>>>>>>>>>>>>> requests to databases for better throughput.
> >>>>>>>>>>>>>> If a filter can prune 90% of data in the cache, we will
> have 90% of
> >>>>>>>>>>>>> lookup
> >>>>>>>>>>>>>> requests that can never be cached
> >>>>>>>>>>>>>> and hit directly to the databases. That means the cache is
> >>>>>>>>>>> meaningless in
> >>>>>>>>>>>>>> this case.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> IMO, Flink SQL has provided a standard way to do filters
> and projects
> >>>>>>>>>>>>>> pushdown, i.e., SupportsFilterPushDown and
> >>>>>>>>>>> SupportsProjectionPushDown.
> >>>>>>>>>>>>>> Jdbc/hive/HBase haven't implemented the interfaces, which
> doesn't mean it's
> >>>>>>>>>>> hard
> >>>>>>>>>>>>> to
> >>>>>>>>>>>>>> implement.
> >>>>>>>>>>>>>> They should implement the pushdown interfaces to reduce IO
> and the
> >>>>>>>>>>> cache
> >>>>>>>>>>>>>> size.
> >>>>>>>>>>>>>> That should be a final state that the scan source and
> lookup source
> >>>>>>>>>>> share
> >>>>>>>>>>>>>> the exact pushdown implementation.
> >>>>>>>>>>>>>> I don't see why we need to duplicate the pushdown logic in
> caches,
> >>>>>>>>>>> which
> >>>>>>>>>>>>>> will complicate the lookup join design.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> 3) ALL cache abstraction
> >>>>>>>>>>>>>> All cache might be the most challenging part of this FLIP.
> We have
> >>>>>>>>>>> never
> >>>>>>>>>>>>>> provided a reload-lookup public interface.
> >>>>>>>>>>>>>> Currently, we put the reload logic in the "eval" method of
> >>>>>>>>>>> TableFunction.
> >>>>>>>>>>>>>> That's hard for some sources (e.g., Hive).
> >>>>>>>>>>>>>> Ideally, connector implementation should share the logic of
> reload
> >>>>>>>>>>> and
> >>>>>>>>>>>>>> scan, i.e. ScanTableSource with
> InputFormat/SourceFunction/FLIP-27
> >>>>>>>>>>>>> Source.
> >>>>>>>>>>>>>> However, InputFormat/SourceFunction are deprecated, and the
> FLIP-27
> >>>>>>>>>>>>> source
> >>>>>>>>>>>>>> is deeply coupled with SourceOperator.
> >>>>>>>>>>>>>> If we want to invoke the FLIP-27 source in LookupJoin, this
> may make
> >>>>>>>>>>> the
> >>>>>>>>>>>>>> scope of this FLIP much larger.
> >>>>>>>>>>>>>> We are still investigating how to abstract the ALL cache
> logic and
> >>>>>>>>>>> reuse
> >>>>>>>>>>>>>> the existing source interfaces.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Best,
> >>>>>>>>>>>>>> Jark
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> On Thu, 5 May 2022 at 20:22, Roman Boyko <
> ro.v.boyko@gmail.com>
> >>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> It's a much more complicated activity and lies outside the
> scope of
> >>>>>>>>>>> this
> >>>>>>>>>>>>>>> improvement. Because such pushdowns should be done for all
> >>>>>>>>>>>>> ScanTableSource
> >>>>>>>>>>>>>>> implementations (not only for Lookup ones).
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> On Thu, 5 May 2022 at 19:02, Martijn Visser <
> >>>>>>>>>>> martijnvisser@apache.org>
> >>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Hi everyone,
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> One question regarding "And Alexander correctly mentioned
> that
> >>>>>>>>>>> filter
> >>>>>>>>>>>>>>>> pushdown still is not implemented for jdbc/hive/hbase."
> -> Would
> >>>>>>>>>>> an
> >>>>>>>>>>>>>>>> alternative solution be to actually implement these filter
> >>>>>>>>>>> pushdowns?
> >>>>>>>>>>>>> I
> >>>>>>>>>>>>>>>> can
> >>>>>>>>>>>>>>>> imagine that there are many more benefits to doing that,
> outside
> >>>>>>>>>>> of
> >>>>>>>>>>>>> lookup
> >>>>>>>>>>>>>>>> caching and metrics.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Best regards,
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Martijn Visser
> >>>>>>>>>>>>>>>> https://twitter.com/MartijnVisser82
> >>>>>>>>>>>>>>>> https://github.com/MartijnVisser
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> On Thu, 5 May 2022 at 13:58, Roman Boyko <
> ro.v.boyko@gmail.com>
> >>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Hi everyone!
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Thanks for driving such a valuable improvement!
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> I do think that a single cache implementation would be a
> nice
> >>>>>>>>>>>>> opportunity
> >>>>>>>>>>>>>>>> for
> >>>>>>>>>>>>>>>>> users. And it will break the "FOR SYSTEM_TIME AS OF
> proc_time"
> >>>>>>>>>>>>> semantics
> >>>>>>>>>>>>>>>>> anyway - doesn't matter how it will be implemented.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Putting myself in the user's shoes, I can say that:
> >>>>>>>>>>>>>>>>> 1) I would prefer to have the opportunity to cut off the
> cache
> >>>>>>>>>>> size
> >>>>>>>>>>>>> by
> >>>>>>>>>>>>>>>>> simply filtering unnecessary data. And the most handy
> way to do
> >>>>>>>>>>> it
> >>>>>>>>>>>>> is
> >>>>>>>>>>>>>>>> to apply
> >>>>>>>>>>>>>>>>> it inside LookupRunners. It would be a bit harder to
> pass it
> >>>>>>>>>>>>> through the
> >>>>>>>>>>>>>>>>> LookupJoin node to TableFunction. And Alexander correctly
> >>>>>>>>>>> mentioned
> >>>>>>>>>>>>> that
> >>>>>>>>>>>>>>>>> filter pushdown still is not implemented for
> jdbc/hive/hbase.
> >>>>>>>>>>>>>>>>> 2) The ability to set the different caching parameters
> for
> >>>>>>>>>>> different
> >>>>>>>>>>>>>>>> tables
> >>>>>>>>>>>>>>>>> is quite important. So I would prefer to set it through
> DDL
> >>>>>>>>>>> rather
> >>>>>>>>>>>>> than
> >>>>>>>>>>>>>>>>> have similar TTL, strategy and other options for all
> lookup
> >>>>>>>>>>> tables.
> >>>>>>>>>>>>>>>>> 3) Providing the cache in the framework really
> deprives us of
> >>>>>>>>>>>>>>>>> extensibility (users won't be able to implement their own
> >>>>>>>>>>> cache).
> >>>>>>>>>>>>> But
> >>>>>>>>>>>>>>>> most
> >>>>>>>>>>>>>>>>> probably it might be solved by creating more different
> cache
> >>>>>>>>>>>>> strategies
> >>>>>>>>>>>>>>>> and
> >>>>>>>>>>>>>>>>> a wider set of configurations.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> All these points are much closer to the schema proposed
> by
> >>>>>>>>>>>>> Alexander.
> >>>>>>>>>>>>>>>>> Qingsheng Ren, please correct me if I'm wrong and all
> these
> >>>>>>>>>>>>>>>> facilities
> >>>>>>>>>>>>>>>>> might be simply implemented in your architecture?
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Best regards,
> >>>>>>>>>>>>>>>>> Roman Boyko
> >>>>>>>>>>>>>>>>> e.: ro.v.boyko@gmail.com
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> On Wed, 4 May 2022 at 21:01, Martijn Visser <
> >>>>>>>>>>>>> martijnvisser@apache.org>
> >>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Hi everyone,
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> I don't have much to chip in, but just wanted to
> express that
> >>>>>>>>>>> I
> >>>>>>>>>>>>> really
> >>>>>>>>>>>>>>>>>> appreciate the in-depth discussion on this topic and I
> hope
> >>>>>>>>>>> that
> >>>>>>>>>>>>>>>> others
> >>>>>>>>>>>>>>>>>> will join the conversation.
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Best regards,
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Martijn
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> On Tue, 3 May 2022 at 10:15, Александр Смирнов <
> >>>>>>>>>>>>> smiralexan@gmail.com>
> >>>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Hi Qingsheng, Leonard and Jark,
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Thanks for your detailed feedback! However, I have
> questions
> >>>>>>>>>>>>> about
> >>>>>>>>>>>>>>>>>>> some of your statements (maybe I didn't get
> something?).
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> Caching actually breaks the semantic of "FOR
> SYSTEM_TIME
> >>>>>>>>>>> AS OF
> >>>>>>>>>>>>>>>>>> proc_time”
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> I agree that the semantics of "FOR SYSTEM_TIME AS OF
> >>>>>>>>>>> proc_time"
> >>>>>>>>>>>>> is
> >>>>>>>>>>>>>>>> not
> >>>>>>>>>>>>>>>>>>> fully implemented with caching, but as you said, users
> go
> >>>>>>>>>>> on it
> >>>>>>>>>>>>>>>>>>> consciously to achieve better performance (no one
> proposed
> >>>>>>>>>>> to
> >>>>>>>>>>>>> enable
> >>>>>>>>>>>>>>>>>>> caching by default, etc.). Or by users do you mean
> other
> >>>>>>>>>>>>> developers
> >>>>>>>>>>>>>>>> of
> >>>>>>>>>>>>>>>>>>> connectors? In this case developers explicitly specify
> >>>>>>>>>>> whether
> >>>>>>>>>>>>> their
> >>>>>>>>>>>>>>>>>>> connector supports caching or not (in the list of
> supported
> >>>>>>>>>>>>>>>> options),
> >>>>>>>>>>>>>>>>>>> no one makes them do that if they don't want to. So
> what
> >>>>>>>>>>>>> exactly is
> >>>>>>>>>>>>>>>>>>> the difference between implementing caching in modules
> >>>>>>>>>>>>>>>>>>> flink-table-runtime and in flink-table-common from the
> >>>>>>>>>>>>> considered
> >>>>>>>>>>>>>>>>>>> point of view? How does it affect on
> breaking/non-breaking
> >>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>> semantics of "FOR SYSTEM_TIME AS OF proc_time"?
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> confront a situation that allows table options in DDL
> to
> >>>>>>>>>>>>> control
> >>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>> behavior of the framework, which has never happened
> >>>>>>>>>>> previously
> >>>>>>>>>>>>> and
> >>>>>>>>>>>>>>>>> should
> >>>>>>>>>>>>>>>>>>> be cautious
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> If we talk about main differences of semantics of DDL
> >>>>>>>>>>> options
> >>>>>>>>>>>>> and
> >>>>>>>>>>>>>>>>>>> config options("table.exec.xxx"), isn't it about
> limiting
> >>>>>>>>>>> the
> >>>>>>>>>>>>> scope
> >>>>>>>>>>>>>>>> of
> >>>>>>>>>>>>>>>>>>> the options + importance for the user business logic
> rather
> >>>>>>>>>>> than
> >>>>>>>>>>>>>>>>>>> specific location of corresponding logic in the
> framework? I
> >>>>>>>>>>>>> mean
> >>>>>>>>>>>>>>>> that
> >>>>>>>>>>>>>>>>>>> in my design, for example, putting an option with
> lookup
> >>>>>>>>>>> cache
> >>>>>>>>>>>>>>>>>>> strategy in configurations would  be the wrong
> decision,
> >>>>>>>>>>>>> because it
> >>>>>>>>>>>>>>>>>>> directly affects the user's business logic (not just
> >>>>>>>>>>> performance
> >>>>>>>>>>>>>>>>>>> optimization) + touches just several functions of ONE
> table
> >>>>>>>>>>>>> (there
> >>>>>>>>>>>>>>>> can
> >>>>>>>>>>>>>>>>>>> be multiple tables with different caches). Does it
> really
> >>>>>>>>>>>>> matter for
> >>>>>>>>>>>>>>>>>>> the user (or someone else) where the logic is located,
> >>>>>>>>>>> which is
> >>>>>>>>>>>>>>>>>>> affected by the applied option?
> >>>>>>>>>>>>>>>>>>> Also I can remember DDL option 'sink.parallelism',
> which in
> >>>>>>>>>>>>> some way
> >>>>>>>>>>>>>>>>>>> "controls the behavior of the framework" and I don't
> see any
> >>>>>>>>>>>>> problem
> >>>>>>>>>>>>>>>>>>> here.
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> introduce a new interface for this all-caching
> scenario
> >>>>>>>>>>> and
> >>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>> design
> >>>>>>>>>>>>>>>>>>> would become more complex
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> This is a subject for a separate discussion, but
> actually
> >>>>>>>>>>> in our
> >>>>>>>>>>>>>>>>>>> internal version we solved this problem quite easily -
> we
> >>>>>>>>>>> reused
> >>>>>>>>>>>>>>>>>>> InputFormat class (so there is no need for a new API).
> The
> >>>>>>>>>>>>> point is
> >>>>>>>>>>>>>>>>>>> that currently all lookup connectors use InputFormat
> for
> >>>>>>>>>>>>> scanning
> >>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>> data in batch mode: HBase, JDBC and even Hive - it uses
> >>>>>>>>>>> class
> >>>>>>>>>>>>>>>>>>> PartitionReader, which is actually just a wrapper around
> >>>>>>>>>>>>> InputFormat.
> >>>>>>>>>>>>>>>>>>> The advantage of this solution is the ability to reload
> >>>>>>>>>>> cache
> >>>>>>>>>>>>> data
> >>>>>>>>>>>>>>>> in
> >>>>>>>>>>>>>>>>>>> parallel (number of threads depends on number of
> >>>>>>>>>>> InputSplits,
> >>>>>>>>>>>>> but
> >>>>>>>>>>>>>>>> has
> >>>>>>>>>>>>>>>>>>> an upper limit). As a result, cache reload time is
> significantly
> >>>>>>>>>>>>> reduced
> >>>>>>>>>>>>> (as well as the time the input stream is blocked). I know that
> >>>>>>>>>>> usually
> >>>>>>>>>>>>> we
> >>>>>>>>>>>>>>>> try
> >>>>>>>>>>>>>>>>>>> to avoid usage of concurrency in Flink code, but maybe
> this
> >>>>>>>>>>> one
> >>>>>>>>>>>>> can
> >>>>>>>>>>>>>>>> be
> >>>>>>>>>>>>>>>>>>> an exception. BTW I don't say that it's an ideal
> solution,
> >>>>>>>>>>> maybe
> >>>>>>>>>>>>>>>> there
> >>>>>>>>>>>>>>>>>>> are better ones.
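[Editor's note: the parallel cache reload described above could look roughly like this, with split readers modeled as plain Callables; this is a sketch of the approach under those assumptions, not the internal implementation.]

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Sketch: read independent "splits" concurrently with a bounded thread pool,
// then merge the results into the new cache snapshot on the caller thread.
class ParallelCacheReloader {
    static <K, V> Map<K, List<V>> reload(
            List<Callable<Map<K, List<V>>>> splitReaders, int maxThreads)
            throws Exception {
        ExecutorService pool =
                Executors.newFixedThreadPool(Math.min(maxThreads, splitReaders.size()));
        try {
            Map<K, List<V>> cache = new HashMap<>();
            // invokeAll blocks until every split has been read.
            for (Future<Map<K, List<V>>> f : pool.invokeAll(splitReaders)) {
                f.get().forEach((key, rows) ->
                        cache.computeIfAbsent(key, k -> new ArrayList<>()).addAll(rows));
            }
            return cache;
        } finally {
            pool.shutdown();
        }
    }
}
```

Because the merge happens sequentially on the caller thread after all futures complete, no synchronization is needed on the cache map itself; only the split reads run concurrently.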
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> Providing the cache in the framework might introduce
> >>>>>>>>>>>>> compatibility
> >>>>>>>>>>>>>>>>>> issues
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> It's possible only in cases where the developer of the
> >>>>>>>>>>> connector
> >>>>>>>>>>>>>>>> doesn't
> >>>>>>>>>>>>>>>>>>> properly refactor his code and uses the new cache
> options
> >>>>>>>>>>>>>>>> incorrectly
> >>>>>>>>>>>>>>>>>>> (i.e. explicitly provide the same options into 2
> different
> >>>>>>>>>>> code
> >>>>>>>>>>>>>>>>>>> places). For correct behavior all he will need to do
> is to
> >>>>>>>>>>>>> redirect
> >>>>>>>>>>>>>>>>>>> existing options to the framework's LookupConfig (+
> maybe
> >>>>>>>>>>> add an
> >>>>>>>>>>>>>>>> alias
> >>>>>>>>>>>>>>>>>>> for options, if there was different naming), everything
> >>>>>>>>>>> will be
> >>>>>>>>>>>>>>>>>>> transparent for users. If the developer doesn't do
> >>>>>>>>>>> refactoring at
> >>>>>>>>>>>>> all,
> >>>>>>>>>>>>>>>>>>> nothing will be changed for the connector because of
> >>>>>>>>>>> backward
> >>>>>>>>>>>>>>>>>>> compatibility. Also if a developer wants to use his own
> >>>>>>>>>>> cache
> >>>>>>>>>>>>> logic,
> >>>>>>>>>>>>>>>>>>> he just can refuse to pass some of the configs into the
> >>>>>>>>>>>>> framework,
> >>>>>>>>>>>>>>>> and
> >>>>>>>>>>>>>>>>>>> instead make his own implementation with already
> existing
> >>>>>>>>>>>>> configs
> >>>>>>>>>>>>>>>> and
> >>>>>>>>>>>>>>>>>>> metrics (but actually I think that it's a rare case).
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> filters and projections should be pushed all the way
> down
> >>>>>>>>>>> to
> >>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>> table
> >>>>>>>>>>>>>>>>>>> function, like what we do in the scan source
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> It's the great purpose. But the truth is that the ONLY
> >>>>>>>>>>> connector
> >>>>>>>>>>>>>>>> that
> >>>>>>>>>>>>>>>>>>> supports filter pushdown is FileSystemTableSource
> >>>>>>>>>>>>>>>>>>> (no database connector supports it currently). Also
> for some
> >>>>>>>>>>>>>>>> databases
> >>>>>>>>>>>>>>>>>>> it's simply impossible to pushdown such complex filters
> >>>>>>>>>>> that we
> >>>>>>>>>>>>> have
> >>>>>>>>>>>>>>>>>>> in Flink.
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> only applying these optimizations to the cache seems
> not
> >>>>>>>>>>>>> quite
> >>>>>>>>>>>>>>>>> useful
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Filters can cut off an arbitrarily large amount of data
> >>>>>>>>>>> from the
> >>>>>>>>>>>>>>>>>>> dimension table. For a simple example, suppose in
> dimension
> >>>>>>>>>>>>> table
> >>>>>>>>>>>>>>>>>>> 'users'
> >>>>>>>>>>>>>>>>>>> we have column 'age' with values from 20 to 40, and
> input
> >>>>>>>>>>> stream
> >>>>>>>>>>>>>>>>>>> 'clicks' that is ~uniformly distributed by age of
> users. If
> >>>>>>>>>>> we
> >>>>>>>>>>>>> have
> >>>>>>>>>>>>>>>>>>> filter 'age > 30',
> >>>>>>>>>>>>>>>>>>> there will be twice less data in cache. This means the
> user
> >>>>>>>>>>> can
> >>>>>>>>>>>>>>>>>>> increase 'lookup.cache.max-rows' by almost 2 times. It
> will
> >>>>>>>>>>>>> give a
> >>>>>>>>>>>>>>>>>>> huge
> >>>>>>>>>>>>>>>>>>> performance boost. Moreover, this optimization starts
> to
> >>>>>>>>>>> really
> >>>>>>>>>>>>>>>> shine
> >>>>>>>>>>>>>>>>>>> in 'ALL' cache, where tables without filters and
> projections
> >>>>>>>>>>>>> can't
> >>>>>>>>>>>>>>>> fit
> >>>>>>>>>>>>>>>>>>> in memory, but with them - can. This opens up
> additional
> >>>>>>>>>>>>>>>> possibilities
> >>>>>>>>>>>>>>>>>>> for users. And this doesn't sound like 'not quite
> useful'.
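[Editor's note: the point about filters shrinking the cache can be illustrated with a short sketch: apply the calc/filter step to the lookup result before caching, so pruned rows never occupy cache memory. The User row type and the age filter are made up for the example.]

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

class FilteredCacheExample {
    static class User {
        final String name;
        final int age;
        User(String name, int age) {
            this.name = name;
            this.age = age;
        }
    }

    static final Map<String, List<User>> cache = new HashMap<>();

    static void cacheLookupResult(String key, List<User> result, Predicate<User> filter) {
        List<User> kept = new ArrayList<>();
        for (User u : result) {
            // Rows failing the pushed-down filter are dropped before caching.
            if (filter.test(u)) {
                kept.add(u);
            }
        }
        // An empty list is still stored, so repeated misses don't hit the DB,
        // but it costs almost no memory.
        cache.put(key, kept);
    }
}
```

With the 'age > 30' filter from the example, a key whose rows are all pruned still produces a (cheap) cache entry, while surviving rows take up only half the memory they otherwise would.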
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> It would be great to hear other voices regarding this
> topic!
> >>>>>>>>>>>>> Because
> >>>>>>>>>>>>>>>>>>> we have quite a lot of controversial points, and I
> think
> >>>>>>>>>>> with
> >>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>> help
> >>>>>>>>>>>>>>>>>>> of others it will be easier for us to come to a
> consensus.
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Best regards,
> >>>>>>>>>>>>>>>>>>> Smirnov Alexander
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> пт, 29 апр. 2022 г. в 22:33, Qingsheng Ren <
> >>>>>>>>>>> renqschn@gmail.com
> >>>>>>>>>>>>>> :
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> Hi Alexander and Arvid,
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> Thanks for the discussion and sorry for my late
> response!
> >>>>>>>>>>> We
> >>>>>>>>>>>>> had
> >>>>>>>>>>>>>>>> an
> >>>>>>>>>>>>>>>>>>> internal discussion together with Jark and Leonard and
> I’d
> >>>>>>>>>>> like
> >>>>>>>>>>>>> to
> >>>>>>>>>>>>>>>>>>> summarize our ideas. Instead of implementing the cache
> >>>>>>>>>>> logic in
> >>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>> table
> >>>>>>>>>>>>>>>>>>> runtime layer or wrapping around the user-provided
> table
> >>>>>>>>>>>>> function,
> >>>>>>>>>>>>>>>> we
> >>>>>>>>>>>>>>>>>>> prefer to introduce some new APIs extending
> TableFunction
> >>>>>>>>>>> with
> >>>>>>>>>>>>> these
> >>>>>>>>>>>>>>>>>>> concerns:
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> 1. Caching actually breaks the semantic of "FOR
> >>>>>>>>>>> SYSTEM_TIME
> >>>>>>>>>>>>> AS OF
> >>>>>>>>>>>>>>>>>>> proc_time”, because it couldn’t truly reflect the
> content
> >>>>>>>>>>> of the
> >>>>>>>>>>>>>>>> lookup
> >>>>>>>>>>>>>>>>>>> table at the moment of querying. If users choose to
> enable
> >>>>>>>>>>>>> caching
> >>>>>>>>>>>>>>>> on
> >>>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>> lookup table, they implicitly indicate that this
> breakage is
> >>>>>>>>>>>>>>>> acceptable
> >>>>>>>>>>>>>>>>>> in
> >>>>>>>>>>>>>>>>>>> exchange for the performance. So we prefer not to
> provide
> >>>>>>>>>>>>> caching on
> >>>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>> table runtime level.
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> 2. If we make the cache implementation in the
> framework
> >>>>>>>>>>>>> (whether
> >>>>>>>>>>>>>>>> in a
> >>>>>>>>>>>>>>>>>>> runner or a wrapper around TableFunction), we have to
> >>>>>>>>>>> confront a
> >>>>>>>>>>>>>>>>>> situation
> >>>>>>>>>>>>>>>>>>> that allows table options in DDL to control the
> behavior of
> >>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>> framework,
> >>>>>>>>>>>>>>>>>>> which has never happened previously and should be
> cautious.
> >>>>>>>>>>>>> Under
> >>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>> current design the behavior of the framework should
> only be
> >>>>>>>>>>>>>>>> specified
> >>>>>>>>>>>>>>>>> by
> >>>>>>>>>>>>>>>>>>> configurations (“table.exec.xxx”), and it’s hard to
> apply
> >>>>>>>>>>> these
> >>>>>>>>>>>>>>>> general
> >>>>>>>>>>>>>>>>>>> configs to a specific table.
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> 3. We have use cases that lookup source loads and
> refresh
> >>>>>>>>>>> all
> >>>>>>>>>>>>>>>> records
> >>>>>>>>>>>>>>>>>>> periodically into the memory to achieve high lookup
> >>>>>>>>>>> performance
> >>>>>>>>>>>>>>>> (like
> >>>>>>>>>>>>>>>>>> Hive
> >>>>>>>>>>>>>>>>>>> connector in the community, and also widely used by our
> >>>>>>>>>>> internal
> >>>>>>>>>>>>>>>>>>> connectors). Wrapping the cache around the user’s
> >>>>>>>>>>> TableFunction
> >>>>>>>>>>>>>>>> works
> >>>>>>>>>>>>>>>>>> fine
> >>>>>>>>>>>>>>>>>>> for LRU caches, but I think we have to introduce a new
> >>>>>>>>>>>>> interface for
> >>>>>>>>>>>>>>>>> this
> >>>>>>>>>>>>>>>>>>> all-caching scenario and the design would become more
> >>>>>>>>>>> complex.
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> 4. Providing the cache in the framework might
> introduce
> >>>>>>>>>>>>>>>> compatibility
> >>>>>>>>>>>>>>>>>>> issues to existing lookup sources like there might
> exist two
> >>>>>>>>>>>>> caches
> >>>>>>>>>>>>>>>>> with
> >>>>>>>>>>>>>>>>>>>> totally different strategies if the user incorrectly configures the table (one in the framework and another implemented by the lookup source).
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> As for the optimization mentioned by Alexander, I think filters and projections should be pushed all the way down to the table function, like what we do in the scan source, instead of the runner with the cache. The goal of using the cache is to reduce the network I/O and pressure on the external system, and only applying these optimizations to the cache seems not quite useful.
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> I made some updates to the FLIP[1] to reflect our ideas. We prefer to keep the cache implementation as a part of TableFunction, and we could provide some helper classes (CachingTableFunction, AllCachingTableFunction, CachingAsyncTableFunction) to developers and regulate the metrics of the cache. Also, I made a POC[2] for your reference.
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> Looking forward to your ideas!
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-221+Abstraction+for+lookup+source+cache+and+metric
> >>>>>>>>>>>>>>>>>>>> [2] https://github.com/PatrickRen/flink/tree/FLIP-221
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> Best regards,
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> Qingsheng
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> On Tue, Apr 26, 2022 at 4:45 PM Александр Смирнов <smiralexan@gmail.com> wrote:
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> Thanks for the response, Arvid!
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> I have a few comments on your message.
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> but could also live with an easier solution as the first step:
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> I think that these 2 ways are mutually exclusive (the one originally proposed by Qingsheng and mine), because conceptually they follow the same goal, but the implementation details are different. If we go one way, moving to the other way in the future will mean deleting existing code and once again changing the API for connectors. So I think we should reach a consensus with the community about that and then work together on this FLIP, i.e. divide the work into tasks for different parts of the FLIP (for example, LRU cache unification / introducing the proposed set of metrics / further work…). WDYT, Qingsheng?
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> as the source will only receive the requests after filter
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> Actually if filters are applied to fields of the lookup table, we first must do requests, and only after that we can filter the responses, because lookup connectors don't have filter pushdown. So if filtering is done before caching, there will be many fewer rows in the cache.
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> @Alexander unfortunately, your architecture is not shared. I don't know the solution to share images to be honest.
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> Sorry for that, I'm a bit new to such kinds of conversations :)
> >>>>>>>>>>>>>>>>>>>>> I have no write access to the confluence, so I made a Jira issue, where I described the proposed changes in more detail - https://issues.apache.org/jira/browse/FLINK-27411.
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> Will be happy to get more feedback!
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> Best,
> >>>>>>>>>>>>>>>>>>>>> Smirnов Alexander
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> пн, 25 апр. 2022 г. в 19:49, Arvid Heise <arvid@apache.org>:
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> Hi Qingsheng,
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> Thanks for driving this; the inconsistency was not satisfying for me.
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> I second Alexander's idea though, but could also live with an easier solution as the first step: Instead of making caching an implementation detail of TableFunction X, rather devise a caching layer around X. So the proposal would be a CachingTableFunction that delegates to X in case of misses and else manages the cache. Lifting it into the operator model as proposed would be even better but is probably unnecessary in the first step for a lookup source (as the source will only receive the requests after filter; applying projection may be more interesting to save memory).
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> Another advantage is that all the changes of this FLIP would be limited to options; no need for new public interfaces. Everything else remains an implementation detail of the Table runtime. That means we can easily incorporate the optimization potential that Alexander pointed out later.
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> @Alexander unfortunately, your architecture is not shared. I don't know the solution to share images to be honest.
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> On Fri, Apr 22, 2022 at 5:04 PM Александр Смирнов <smiralexan@gmail.com> wrote:
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> Hi Qingsheng! My name is Alexander, I'm not a committer yet, but I'd really like to become one. And this FLIP really interested me. Actually I have worked on a similar feature in my company's Flink fork, and we would like to share our thoughts on this and make the code open source.
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> I think there is a better alternative than introducing an abstract class for TableFunction (CachingTableFunction). As you know, TableFunction exists in the flink-table-common module, which provides only an API for working with tables – it's very convenient for importing in connectors. In turn, CachingTableFunction contains logic for runtime execution, so this class and everything connected with it should be located in another module, probably in flink-table-runtime. But this would require connectors to depend on another module, which contains a lot of runtime logic, which doesn't sound good.
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> I suggest adding a new method 'getLookupConfig' to LookupTableSource or LookupRuntimeProvider to allow connectors to only pass configurations to the planner, so they won't depend on the runtime implementation. Based on these configs the planner will construct a lookup join operator with the corresponding runtime logic (ProcessFunctions in the module flink-table-runtime). The architecture looks like in the pinned image (the LookupConfig class there is actually your CacheConfig).
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> The classes in flink-table-planner that will be responsible for this are CommonPhysicalLookupJoin and its inheritors.
> >>>>>>>>>>>>>>>>>>>>>>> The current classes for lookup join in flink-table-runtime are LookupJoinRunner, AsyncLookupJoinRunner, LookupJoinRunnerWithCalc, AsyncLookupJoinRunnerWithCalc.
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> I suggest adding classes LookupJoinCachingRunner, LookupJoinCachingRunnerWithCalc, etc.
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> And here comes another, more powerful advantage of such a solution. If we have the caching logic on a lower level, we can apply some optimizations to it. LookupJoinRunnerWithCalc was named like this because it uses the 'calc' function, which actually mostly consists of filters and projections.
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> For example, in a join of table A with lookup table B with the condition 'JOIN … ON A.id = B.id AND A.age = B.age + 10 WHERE B.salary > 1000', the 'calc' function will contain the filters A.age = B.age + 10 and B.salary > 1000.
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> If we apply this function before storing records in the cache, the size of the cache will be significantly reduced: filters = avoid storing useless records in the cache, projections = reduce the records' size. So the initial max number of records in the cache can be increased by the user.
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> What do you think about it?
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> On 2022/04/19 02:47:11 Qingsheng Ren wrote:
> >>>>>>>>>>>>>>>>>>>>>>>> Hi devs,
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> Yuan and I would like to start a discussion about FLIP-221[1], which introduces an abstraction of lookup table cache and its standard metrics.
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> Currently each lookup table source has to implement its own cache to store lookup results, and there isn't a standard set of metrics for users and developers to tune their jobs with lookup joins, which is a quite common use case in Flink table / SQL.
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> Therefore we propose some new APIs including cache, metrics, wrapper classes of TableFunction and new table options. Please take a look at the FLIP page [1] to get more details. Any suggestions and comments would be appreciated!
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-221+Abstraction+for+lookup+source+cache+and+metric
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> Best regards,
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> Qingsheng
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> --
> >>>>>>>>>>>>>>>>>>>> Best Regards,
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> Qingsheng Ren
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> Real-time Computing Team
> >>>>>>>>>>>>>>>>>>>> Alibaba Cloud
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> Email: renqschn@gmail.com
> 
> >>>>>>>>>>>>>>> --
> >>>>>>>>>>>>>>> Best regards,
> >>>>>>>>>>>>>>> Roman Boyko
> >>>>>>>>>>>>>>> e.: ro.v.boyko@gmail.com
> 
> >>>>>>>> --
> >>>>>>>> Best Regards,
> >>>>>>>>
> >>>>>>>> Qingsheng Ren
> >>>>>>>>
> >>>>>>>> Real-time Computing Team
> >>>>>>>> Alibaba Cloud
> >>>>>>>>
> >>>>>>>> Email: renqschn@gmail.com
> 

Re: [DISCUSS] FLIP-221 Abstraction for lookup source cache and metric

Posted by Qingsheng Ren <re...@gmail.com>.
Hi Alexander,

Thanks for the review! We recently updated the FLIP and you can find those changes in my latest email. Since some terminology has changed, I'll use the new concepts when replying to your comments.

1. Builder vs ‘of’
I'm OK with using the builder pattern if we have additional optional parameters for full caching mode (“rescan” previously). The schedule-with-delay idea looks reasonable to me, but I think we need to redesign the builder API of full caching to make it more descriptive for developers. Would you mind sharing your ideas about the API? For access to the FLIP workspace you can just provide your account ID and ping any PMC member, including Jark.
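To make the discussion more concrete, here is a rough sketch of what such a builder could look like. All names here (FullCachingConfig, reloadInterval, reloadStartTime) are illustrative assumptions for discussion, not the final FLIP API:

```java
import java.time.Duration;
import java.time.LocalTime;

// Illustrative sketch only: a possible builder-style API for configuring
// full caching ("rescan"). Class and method names are assumptions.
final class FullCachingConfig {
    private final Duration reloadInterval;
    private final LocalTime reloadStartTime; // UTC time of the first reload; null = start immediately

    private FullCachingConfig(Duration reloadInterval, LocalTime reloadStartTime) {
        this.reloadInterval = reloadInterval;
        this.reloadStartTime = reloadStartTime;
    }

    static Builder newBuilder() {
        return new Builder();
    }

    Duration reloadInterval() {
        return reloadInterval;
    }

    LocalTime reloadStartTime() {
        return reloadStartTime;
    }

    static final class Builder {
        private Duration reloadInterval = Duration.ofDays(1); // default: reload once a day
        private LocalTime reloadStartTime; // optional

        Builder reloadInterval(Duration interval) {
            this.reloadInterval = interval;
            return this;
        }

        Builder reloadStartTime(LocalTime startTimeUtc) {
            this.reloadStartTime = startTimeUtc;
            return this;
        }

        FullCachingConfig build() {
            return new FullCachingConfig(reloadInterval, reloadStartTime);
        }
    }
}
```

A builder keeps the API open for adding more optional knobs later without breaking existing callers, which a fixed-arity "of" method cannot do.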

2. Common table options
We have had some discussions these days and propose to introduce 8 common table options for caching. The FLIP has been updated accordingly.

3. Retries
I think we are on the same page :-)

For your additional concerns:
1) The table option has been updated.
2) We got “lookup.cache” back for configuring whether to use partial or full caching mode.

Best regards, 

Qingsheng



> On May 19, 2022, at 17:25, Александр Смирнов <sm...@gmail.com> wrote:
> 
> Also I have a few additions:
> 1) maybe rename 'lookup.cache.maximum-size' to
> 'lookup.cache.max-rows'? I think it will make it clearer that we are talking
> not about bytes, but about the number of rows. Plus it fits better,
> considering my optimization with filters.
> 2) How will users enable rescanning? Are we going to separate caching
> and rescanning from the options point of view? Like initially we had
> one option 'lookup.cache' with values LRU / ALL. I think now we can
> make a boolean option 'lookup.rescan'. RescanInterval can be
> 'lookup.rescan.interval', etc.
> 
> Best regards,
> Alexander
> 
> чт, 19 мая 2022 г. в 14:50, Александр Смирнов <sm...@gmail.com>:
>> 
>> Hi Qingsheng and Jark,
>> 
>> 1. Builders vs 'of'
>> I understand that builders are used when we have multiple parameters.
>> I suggested them because we could add parameters later. To prevent the
>> Builder for ScanRuntimeProvider from looking redundant I can suggest
>> one more config now - "rescanStartTime".
>> It's a time in UTC (LocalTime class) at which the first reload of the cache
>> starts. This parameter can be thought of as the 'initialDelay' (the diff
>> between the current time and rescanStartTime) in the method
>> ScheduledExecutorService#scheduleWithFixedDelay [1]. It can be very
>> useful when the dimension table is updated by some other scheduled job
>> at a certain time. Or when the user simply wants the second scan (the first
>> cache reload) to be delayed. This option can be used even without
>> 'rescanInterval' - in this case 'rescanInterval' will be one day.
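As an illustration of how such a "rescanStartTime" could map to the initial delay, here is a plain-Java sketch (the class and method names are assumptions for discussion, not proposed API):

```java
import java.time.Duration;
import java.time.LocalTime;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;

// Sketch: derive the 'initialDelay' for ScheduledExecutorService#scheduleWithFixedDelay
// from a UTC "rescanStartTime". If today's occurrence has already passed (or is
// exactly now), the first reload is scheduled for the same time tomorrow.
final class RescanDelays {
    static Duration initialDelay(LocalTime rescanStartTimeUtc, ZonedDateTime nowUtc) {
        ZonedDateTime firstReload = nowUtc.with(rescanStartTimeUtc);
        if (!firstReload.isAfter(nowUtc)) {
            firstReload = firstReload.plusDays(1);
        }
        return Duration.between(nowUtc, firstReload);
    }
}
```

The result would then be passed as the initialDelay argument of scheduleWithFixedDelay, with 'rescanInterval' as the period between subsequent reloads.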
>> If you are fine with this option, I would be very glad if you would
>> give me access to edit FLIP page, so I could add it myself
>> 
>> 2. Common table options
>> I also think that FactoryUtil would be overloaded by all the cache
>> options. But maybe we could unify all the suggested options, not only those
>> for the default cache? I.e. a class 'LookupOptions' that unifies the default
>> cache options, rescan options, 'async', and 'maxRetries'. WDYT?
>> 
>> 3. Retries
>> I'm fine with suggestion close to RetryUtils#tryTimes(times, call)
>> 
>> [1] https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/ScheduledExecutorService.html#scheduleWithFixedDelay-java.lang.Runnable-long-long-java.util.concurrent.TimeUnit-
>> 
>> Best regards,
>> Alexander
>> 
>> ср, 18 мая 2022 г. в 16:04, Qingsheng Ren <re...@gmail.com>:
>>> 
>>> Hi Jark and Alexander,
>>> 
>>> Thanks for your comments! I'm also OK with introducing common table options. I prefer to introduce a new DefaultLookupCacheOptions class for holding these option definitions, because putting all options into FactoryUtil would make it a bit "crowded" and not well categorized.
>>> 
>>> FLIP has been updated according to suggestions above:
>>> 1. Use static “of” method for constructing RescanRuntimeProvider considering both arguments are required.
>>> 2. Introduce new table options matching DefaultLookupCacheFactory
>>> 
>>> Best,
>>> Qingsheng
>>> 
>>> On Wed, May 18, 2022 at 2:57 PM Jark Wu <im...@gmail.com> wrote:
>>>> 
>>>> Hi Alex,
>>>> 
>>>> 1) retry logic
>>>> I think we can extract some common retry logic into utilities, e.g. RetryUtils#tryTimes(times, call).
>>>> This seems independent of this FLIP and can be reused by DataStream users.
>>>> Maybe we can open an issue to discuss this and where to put it.
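A minimal sketch of such a utility follows; the signature is assumed from the name RetryUtils#tryTimes(times, call) mentioned above, and a real implementation would likely add logging and backoff:

```java
import java.util.concurrent.Callable;

// Sketch of the proposed utility: retry a call a fixed number of times and
// rethrow the last failure once the attempts are exhausted.
final class RetryUtils {
    static <T> T tryTimes(int times, Callable<T> call) throws Exception {
        Exception lastFailure = null;
        for (int attempt = 1; attempt <= times; attempt++) {
            try {
                return call.call();
            } catch (Exception e) {
                lastFailure = e; // a real implementation might log and back off here
            }
        }
        throw lastFailure;
    }
}
```

Keeping this in a shared utility would let each connector's lookup function stay free of hand-rolled retry loops, while connector-specific recovery (e.g. reopening a connection) can still happen inside the retried callable.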
>>>> 
>>>> 2) cache ConfigOptions
>>>> I'm fine with defining cache config options in the framework.
>>>> A candidate place to put is FactoryUtil which also includes "sink.parallelism", "format" options.
>>>> 
>>>> Best,
>>>> Jark
>>>> 
>>>> 
>>>> On Wed, 18 May 2022 at 13:52, Александр Смирнов <sm...@gmail.com> wrote:
>>>>> 
>>>>> Hi Qingsheng,
>>>>> 
>>>>> Thank you for considering my comments.
>>>>> 
>>>>>> there might be custom logic before making a retry, such as re-establishing the connection
>>>>> 
>>>>> Yes, I understand that. I meant that such logic can be placed in a
>>>>> separate function that can be implemented by connectors. Just moving
>>>>> the retry logic would make the connector's LookupFunction more concise and
>>>>> avoid duplicated code. However, it's a minor change. The decision is up
>>>>> to you.
>>>>> 
>>>>>> We decided not to provide common DDL options and to let developers define their own options per connector, as we do now.
>>>>> 
>>>>> What is the reason for that? One of the main goals of this FLIP was to
>>>>> unify the configs, wasn't it? I understand that the current cache design
>>>>> doesn't depend on ConfigOptions, as it did before. But we can still put
>>>>> these options into the framework, so connectors can reuse them and
>>>>> avoid code duplication, and, more significantly, avoid possibly
>>>>> different option naming. This can be pointed out in the
>>>>> documentation for connector developers.
>>>>> 
>>>>> Best regards,
>>>>> Alexander
>>>>> 
>>>>> вт, 17 мая 2022 г. в 17:11, Qingsheng Ren <re...@gmail.com>:
>>>>>> 
>>>>>> Hi Alexander,
>>>>>> 
>>>>>> Thanks for the review and glad to see we are on the same page! I think you forgot to cc the dev mailing list so I’m also quoting your reply under this email.
>>>>>> 
>>>>>>> We can add the 'maxRetryTimes' option to this class
>>>>>> 
>>>>>> In my opinion the retry logic should be implemented in lookup() instead of in LookupFunction#eval(). Retrying is only meaningful for some specific retriable failures, and there might be custom logic before making a retry, such as re-establishing the connection (JdbcRowDataLookupFunction is an example), so it's handier to leave it to the connector.
>>>>>> 
>>>>>>> I don't see the DDL options that were in the previous version of the FLIP. Do you have any special plans for them?
>>>>>> 
>>>>>> We decided not to provide common DDL options and to let developers define their own options per connector, as we do now.
>>>>>> 
>>>>>> The rest of the comments sound great and I'll update the FLIP. Hope we can finalize our proposal soon!
>>>>>> 
>>>>>> Best,
>>>>>> 
>>>>>> Qingsheng
>>>>>> 
>>>>>> 
>>>>>>> On May 17, 2022, at 13:46, Александр Смирнов <sm...@gmail.com> wrote:
>>>>>>> 
>>>>>>> Hi Qingsheng and devs!
>>>>>>> 
>>>>>>> I like the overall design of the updated FLIP; however, I have several
>>>>>>> suggestions and questions.
>>>>>>> 
>>>>>>> 1) Introducing LookupFunction as a subclass of TableFunction is a good
>>>>>>> idea. We can add the 'maxRetryTimes' option to this class. The 'eval' method
>>>>>>> of the new LookupFunction is great for this purpose. The same goes for the
>>>>>>> 'async' case.
>>>>>>> 
>>>>>>> 2) There might be other configs in the future, such as 'cacheMissingKey'
>>>>>>> in LookupFunctionProvider or 'rescanInterval' in ScanRuntimeProvider.
>>>>>>> Maybe use the Builder pattern in LookupFunctionProvider and
>>>>>>> RescanRuntimeProvider for more flexibility (use one 'build' method
>>>>>>> instead of many 'of' methods in the future)?
>>>>>>> 
>>>>>>> 3) What are the plans for existing TableFunctionProvider and
>>>>>>> AsyncTableFunctionProvider? I think they should be deprecated.
>>>>>>> 
>>>>>>> 4) Am I right that the current design does not assume usage of a
>>>>>>> user-provided LookupCache in re-scanning? In this case, it is not very
>>>>>>> clear why we need methods such as 'invalidate' or 'putAll' in
>>>>>>> LookupCache.
>>>>>>> 
>>>>>>> 5) I don't see the DDL options that were in the previous version of the
>>>>>>> FLIP. Do you have any special plans for them?
>>>>>>> 
>>>>>>> If you don't mind, I would be glad to be able to make small
>>>>>>> adjustments to the FLIP document too. I think it's worth mentioning
>>>>>>> what optimizations exactly are planned for the future.
>>>>>>> 
>>>>>>> Best regards,
>>>>>>> Smirnov Alexander
>>>>>>> 
>>>>>>> пт, 13 мая 2022 г. в 20:27, Qingsheng Ren <re...@gmail.com>:
>>>>>>>> 
>>>>>>>> Hi Alexander and devs,
>>>>>>>> 
>>>>>>>> Thank you very much for the in-depth discussion! As Jark mentioned, we were inspired by Alexander's idea and refactored our design. FLIP-221 [1] has been updated to reflect the new design and we are happy to hear more suggestions from you!
>>>>>>>> 
>>>>>>>> Compared to the previous design:
>>>>>>>> 1. The lookup cache serves at table runtime level and is integrated as a component of LookupJoinRunner as discussed previously.
>>>>>>>> 2. Interfaces are renamed and re-designed to reflect the new design.
>>>>>>>> 3. We handle the all-caching case separately and introduce a new RescanRuntimeProvider to reuse the scanning ability. We are planning to support SourceFunction / InputFormat for now, considering the complexity of the FLIP-27 Source API.
>>>>>>>> 4. A new interface LookupFunction is introduced to make the semantic of lookup more straightforward for developers.
>>>>>>>> 
>>>>>>>> For replying to Alexander:
>>>>>>>>> However I'm a little confused whether InputFormat is deprecated or not. Am I right that it will be so in the future, but currently it's not?
>>>>>>>> Yes you are right. InputFormat is not deprecated for now. I think it will be deprecated in the future but we don't have a clear plan for that.
>>>>>>>> 
>>>>>>>> Thanks again for the discussion on this FLIP and looking forward to cooperating with you after we finalize the design and interfaces!
>>>>>>>> 
>>>>>>>> [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-221+Abstraction+for+lookup+source+cache+and+metric
>>>>>>>> 
>>>>>>>> Best regards,
>>>>>>>> 
>>>>>>>> Qingsheng
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Fri, May 13, 2022 at 12:12 AM Александр Смирнов <sm...@gmail.com> wrote:
>>>>>>>>> 
>>>>>>>>> Hi Jark, Qingsheng and Leonard!
>>>>>>>>> 
>>>>>>>>> Glad to see that we came to a consensus on almost all points!
>>>>>>>>> 
>>>>>>>>> However I'm a little confused whether InputFormat is deprecated or
>>>>>>>>> not. Am I right that it will be so in the future, but currently it's
>>>>>>>>> not? Actually I also think that for the first version it's OK to use
>>>>>>>>> InputFormat in ALL cache realization, because supporting rescan
>>>>>>>>> ability seems like a very distant prospect. But for this decision we
>>>>>>>>> need a consensus among all discussion participants.
>>>>>>>>> 
>>>>>>>>> In general, I don't have anything to argue with in your statements. All
>>>>>>>>> of them correspond to my ideas. Looking ahead, it would be nice to work
>>>>>>>>> on this FLIP cooperatively. I've already done a lot of work on lookup
>>>>>>>>> join caching with an implementation very close to the one we are discussing,
>>>>>>>>> and want to share the results of this work. Anyway, looking forward to
>>>>>>>>> the FLIP update!
>>>>>>>>> 
>>>>>>>>> Best regards,
>>>>>>>>> Smirnov Alexander
>>>>>>>>> 
>>>>>>>>> чт, 12 мая 2022 г. в 17:38, Jark Wu <im...@gmail.com>:
>>>>>>>>>> 
>>>>>>>>>> Hi Alex,
>>>>>>>>>> 
>>>>>>>>>> Thanks for summarizing your points.
>>>>>>>>>> 
>>>>>>>>>> In the past week, Qingsheng, Leonard, and I have discussed it several times
>>>>>>>>>> and we have totally refactored the design.
>>>>>>>>>> I'm glad to say we have reached a consensus on many of your points!
>>>>>>>>>> Qingsheng is still working on updating the design docs, which may be
>>>>>>>>>> available in the next few days.
>>>>>>>>>> I will share some conclusions from our discussions:
>>>>>>>>>> 
>>>>>>>>>> 1) we have refactored the design towards the "cache in framework" way.
>>>>>>>>>> 
>>>>>>>>>> 2) a "LookupCache" interface for users to customize, and a default
>>>>>>>>>> implementation with a builder that is easy to use.
>>>>>>>>>> This makes it possible to have both flexibility and conciseness.
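A rough, dependency-free sketch of that shape follows; the method and class names are assumptions for discussion, not the final interface:

```java
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of the discussed shape: a small cache interface that connectors may
// implement themselves, plus a framework-provided default implementation.
// Names are assumptions, not the final FLIP API.
interface LookupCache<K, V> {
    Collection<V> getIfPresent(K key);

    void put(K key, Collection<V> rows);
}

// Default implementation: LRU eviction bounded by the number of cached keys.
final class DefaultLookupCache<K, V> implements LookupCache<K, V> {
    private final Map<K, Collection<V>> cache;

    DefaultLookupCache(final int maxRows) {
        // accessOrder=true turns LinkedHashMap into an LRU map;
        // removeEldestEntry evicts once the key count exceeds maxRows.
        this.cache = new LinkedHashMap<K, Collection<V>>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, Collection<V>> eldest) {
                return size() > maxRows;
            }
        };
    }

    @Override
    public Collection<V> getIfPresent(K key) {
        return cache.get(key);
    }

    @Override
    public void put(K key, Collection<V> rows) {
        cache.put(key, rows);
    }
}
```

With a shape like this, most connectors would just configure the default via its builder, while a connector with special needs could plug in its own LookupCache implementation and still report the same standard metrics.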
>>>>>>>>>> 
>>>>>>>>>> 3) Filter pushdown is important for both the ALL and LRU lookup caches,
>>>>>>>>>> especially for reducing I/O. Filter pushdown should be the final state and
>>>>>>>>>> the unified way to support pruning both the ALL cache and the LRU cache,
>>>>>>>>>> so I think we should make an effort in this direction. If we need to support
>>>>>>>>>> filter pushdown for the ALL cache anyway, why not use it for the LRU cache
>>>>>>>>>> as well? Either way, as we have decided to implement the cache in the
>>>>>>>>>> framework, we have the chance to support filters on the cache at any time.
>>>>>>>>>> This is an optimization and it doesn't affect the public API. I think we can
>>>>>>>>>> create a JIRA issue to discuss it when the FLIP is accepted.
>>>>>>>>>> 
>>>>>>>>>> 4) The idea to support ALL cache is similar to your proposal.
>>>>>>>>>> In the first version, we will only support InputFormat and SourceFunction for
>>>>>>>>>> cache-all (invoking InputFormat in the join operator).
>>>>>>>>>> For the FLIP-27 source, we need to join a true source operator instead of
>>>>>>>>>> calling it embedded in the join operator.
>>>>>>>>>> However, this needs another FLIP to support the re-scan ability for the
>>>>>>>>>> FLIP-27 Source, and this can be a large piece of work.
>>>>>>>>>> In order not to block this issue, we can put the effort of FLIP-27 source
>>>>>>>>>> integration into future work and integrate
>>>>>>>>>> InputFormat & SourceFunction for now.
>>>>>>>>>> 
>>>>>>>>>> I think it's fine to use InputFormat & SourceFunction, as they are not
>>>>>>>>>> deprecated; otherwise, we would have to introduce another function
>>>>>>>>>> similar to them, which would be meaningless. We need to plan FLIP-27 source
>>>>>>>>>> integration ASAP, before InputFormat & SourceFunction are deprecated.
>>>>>>>>>> 
>>>>>>>>>> Best,
>>>>>>>>>> Jark
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> On Thu, 12 May 2022 at 15:46, Александр Смирнов <sm...@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>> 
>>>>>>>>>>> Hi Martijn!
>>>>>>>>>>> 
>>>>>>>>>>> Got it. So the implementation with InputFormat will not be considered.
>>>>>>>>>>> Thanks for clearing that up!
>>>>>>>>>>> 
>>>>>>>>>>> Best regards,
>>>>>>>>>>> Smirnov Alexander
>>>>>>>>>>> 
>>>>>>>>>>> чт, 12 мая 2022 г. в 14:23, Martijn Visser <ma...@ververica.com>:
>>>>>>>>>>>> 
>>>>>>>>>>>> Hi,
>>>>>>>>>>>> 
>>>>>>>>>>>> With regards to:
>>>>>>>>>>>> 
>>>>>>>>>>>>> But if there are plans to refactor all connectors to FLIP-27
>>>>>>>>>>>> 
>>>>>>>>>>>> Yes, FLIP-27 is the target for all connectors. The old interfaces will be
>>>>>>>>>>>> deprecated and connectors will either be refactored to use the new ones or
>>>>>>>>>>>> dropped.
>>>>>>>>>>>> 
>>>>>>>>>>>> The caching should work for connectors that are using FLIP-27 interfaces,
>>>>>>>>>>>> we should not introduce new features for old interfaces.
>>>>>>>>>>>> 
>>>>>>>>>>>> Best regards,
>>>>>>>>>>>> 
>>>>>>>>>>>> Martijn
>>>>>>>>>>>> 
>>>>>>>>>>>> On Thu, 12 May 2022 at 06:19, Александр Смирнов <sm...@gmail.com>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>>> Hi Jark!
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Sorry for the late response. I would like to make some comments and
>>>>>>>>>>>>> clarify my points.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 1) I agree with your first statement. I think we can achieve both
>>>>>>>>>>>>> advantages this way: put the Cache interface in flink-table-common,
>>>>>>>>>>>>> but have implementations of it in flink-table-runtime. Therefore if a
>>>>>>>>>>>>> connector developer wants to use existing cache strategies and their
>>>>>>>>>>>>> implementations, he can just pass lookupConfig to the planner, but if
>>>>>>>>>>>>> he wants to have his own cache implementation in his TableFunction, it
>>>>>>>>>>>>> will be possible for him to use the existing interface for this
>>>>>>>>>>>>> purpose (we can explicitly point this out in the documentation). In
>>>>>>>>>>>>> this way all configs and metrics will be unified. WDYT?
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> If a filter can prune 90% of data in the cache, we will have 90% of
>>>>>>>>>>>>>> lookup requests that can never be cached
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 2) Let me clarify the logic of the filters optimization in the case of the LRU cache.
>>>>>>>>>>>>> It looks like Cache<RowData, Collection<RowData>>. Here we always
>>>>>>>>>>>>> store the response of the dimension table in cache, even after
>>>>>>>>>>>>> applying calc function. I.e. if there are no rows after applying
>>>>>>>>>>>>> filters to the result of the 'eval' method of TableFunction, we store
>>>>>>>>>>>>> the empty list by lookup keys. Therefore the cache line will be
>>>>>>>>>>>>> filled, but will require much less memory (in bytes). I.e. we don't
>>>>>>>>>>>>> completely filter keys, by which result was pruned, but significantly
>>>>>>>>>>>>> reduce required memory to store this result. If the user knows about
>>>>>>>>>>>>> this behavior, he can increase the 'max-rows' option before the start
>>>>>>>>>>>>> of the job. But actually I came up with the idea that we can do this
>>>>>>>>>>>>> automatically by using the 'maximumWeight' and 'weigher' methods of
>>>>>>>>>>>>> GuavaCache [1]. Weight can be the size of the collection of rows
>>>>>>>>>>>>> (value of cache). Therefore cache can automatically fit much more
>>>>>>>>>>>>> records than before.
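The maximumWeight/weigher idea from [1] can be sketched with the JDK alone (RowWeightedCache is an illustrative name; a real implementation would likely use Guava's CacheBuilder directly): the cache is bounded by the total number of cached rows rather than by the number of keys, so a key whose filtered result is an empty list costs almost nothing.

```java
import java.util.Collection;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

// JDK-only model of a weight-bounded cache: the "weight" of an entry is the
// size of its row collection, and LRU entries are evicted until the total
// weight is back under the bound.
class RowWeightedCache<K, V> {
    private final LinkedHashMap<K, Collection<V>> map =
            new LinkedHashMap<>(16, 0.75f, true);
    private final long maxWeight;
    private long weight;

    RowWeightedCache(long maxWeight) {
        this.maxWeight = maxWeight;
    }

    synchronized void put(K key, Collection<V> rows) {
        Collection<V> old = map.put(key, rows);
        weight += rows.size() - (old == null ? 0 : old.size());
        // Evict least-recently-used entries until back under the weight bound.
        Iterator<Map.Entry<K, Collection<V>>> it = map.entrySet().iterator();
        while (weight > maxWeight && it.hasNext()) {
            weight -= it.next().getValue().size();
            it.remove();
        }
    }

    synchronized Collection<V> get(K key) {
        return map.get(key);
    }

    synchronized int keyCount() {
        return map.size();
    }
}
```

With this bound, a filtered-out key caches an empty list of weight zero, so the cache automatically fits many more keys than a plain per-key limit would.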
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Flink SQL has provided a standard way to do filters and projects
>>>>>>>>>>>>> pushdown, i.e., SupportsFilterPushDown and SupportsProjectionPushDown.
>>>>>>>>>>>>>> That Jdbc/hive/HBase haven't implemented the interfaces doesn't mean it's
>>>>>>>>>>> hard
>>>>>>>>>>>>> to implement.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> It's debatable how difficult it will be to implement filter pushdown.
>>>>>>>>>>>>> But I think the fact that currently there is no database connector
>>>>>>>>>>>>> with filter pushdown at least means that this feature won't be
>>>>>>>>>>>>> supported soon in connectors. Moreover, if we talk about other
>>>>>>>>>>>>> connectors (not in Flink repo), their databases might not support all
>>>>>>>>>>>>> Flink filters (or not support filters at all). I think users are
>>>>>>>>>>>>> interested in supporting the cache filters optimization independently of
>>>>>>>>>>>>> support for other features that involve solving more complex (or even
>>>>>>>>>>>>> unsolvable) problems.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 3) I agree with your third statement. Actually in our internal version
>>>>>>>>>>>>> I also tried to unify the logic of scanning and reloading data from
>>>>>>>>>>>>> connectors. But unfortunately, I didn't find a way to unify the logic
>>>>>>>>>>>>> of all ScanRuntimeProviders (InputFormat, SourceFunction, Source,...)
>>>>>>>>>>>>> and reuse it in reloading ALL cache. As a result I settled on using
>>>>>>>>>>>>> InputFormat, because it was used for scanning in all lookup
>>>>>>>>>>>>> connectors. (I didn't know that there are plans to deprecate
>>>>>>>>>>>>> InputFormat in favor of FLIP-27 Source). IMO usage of FLIP-27 source
>>>>>>>>>>>>> in ALL caching is not good idea, because this source was designed to
>>>>>>>>>>>>> work in distributed environment (SplitEnumerator on JobManager and
>>>>>>>>>>>>> SourceReaders on TaskManagers), not in one operator (lookup join
>>>>>>>>>>>>> operator in our case). There is even no direct way to pass splits from
>>>>>>>>>>>>> SplitEnumerator to SourceReader (this logic works through
>>>>>>>>>>>>> SplitEnumeratorContext, which requires
>>>>>>>>>>>>> OperatorCoordinator.SubtaskGateway to send AddSplitEvents). Usage of
>>>>>>>>>>>>> InputFormat for ALL cache seems much more clearer and easier. But if
>>>>>>>>>>>>> there are plans to refactor all connectors to FLIP-27, I have the
>>>>>>>>>>>>> following ideas: maybe we can abandon the lookup join ALL cache in
>>>>>>>>>>>>> favor of a simple join with multiple scans of the batch source? The point
>>>>>>>>>>>>> is that the only difference between lookup join ALL cache and simple
>>>>>>>>>>>>> join with batch source is that in the first case scanning is performed
>>>>>>>>>>>>> multiple times, in between which state (cache) is cleared (correct me
>>>>>>>>>>>>> if I'm wrong). So what if we extend the functionality of simple join
>>>>>>>>>>>>> to support state reloading + extend the functionality of scanning
>>>>>>>>>>>>> batch source multiple times (this one should be easy with new FLIP-27
>>>>>>>>>>>>> source, that unifies streaming/batch reading - we will need to change
>>>>>>>>>>>>> only SplitEnumerator, which will pass splits again after some TTL).
>>>>>>>>>>>>> WDYT? I must say that this looks like a long-term goal and will make
>>>>>>>>>>>>> the scope of this FLIP even larger than you said. Maybe we can limit
>>>>>>>>>>>>> ourselves to a simpler solution now (InputFormats).
>>>>>>>>>>>>> 
>>>>>>>>>>>>> So to sum up, my points are these:
>>>>>>>>>>>>> 1) There is a way to make both concise and flexible interfaces for
>>>>>>>>>>>>> caching in lookup join.
>>>>>>>>>>>>> 2) Cache filters optimization is important both in LRU and ALL caches.
>>>>>>>>>>>>> 3) It is unclear when filter pushdown will be supported in Flink
>>>>>>>>>>>>> connectors, some of the connectors might not have the opportunity to
>>>>>>>>>>>>> support filter pushdown + as I know, currently filter pushdown works
>>>>>>>>>>>>> only for scanning (not lookup). So cache filters + projections
>>>>>>>>>>>>> optimization should be independent from other features.
>>>>>>>>>>>>> 4) The ALL cache implementation is a complex topic that touches multiple
>>>>>>>>>>>>> aspects of how Flink is developing. Dropping InputFormat in favor
>>>>>>>>>>>>> of the FLIP-27 Source will make the ALL cache implementation really complex
>>>>>>>>>>>>> and unclear, so maybe instead we can extend the functionality of the
>>>>>>>>>>>>> simple join, or keep InputFormat in the case of the lookup join ALL
>>>>>>>>>>>>> cache?
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>> Smirnov Alexander
>>>>>>>>>>>>> 
>>>>>>>>>>>>> [1]
>>>>>>>>>>>>> 
>>>>>>>>>>> https://guava.dev/releases/18.0/api/docs/com/google/common/cache/CacheBuilder.html#weigher(com.google.common.cache.Weigher)
>>>>>>>>>>>>> 
>>>>>>>>>>>>> чт, 5 мая 2022 г. в 20:34, Jark Wu <im...@gmail.com>:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> It's great to see the active discussion! I want to share my ideas:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 1) implement the cache in framework vs. connectors base
>>>>>>>>>>>>>> I don't have a strong opinion on this. Both ways should work (e.g.,
>>>>>>>>>>> cache
>>>>>>>>>>>>>> pruning, compatibility).
>>>>>>>>>>>>>> The framework way can provide more concise interfaces.
>>>>>>>>>>>>>> The connector base way can define more flexible cache
>>>>>>>>>>>>>> strategies/implementations.
>>>>>>>>>>>>>> We are still investigating a way to see if we can have both
>>>>>>>>>>> advantages.
>>>>>>>>>>>>>> We should reach a consensus that the way should be a final state,
>>>>>>>>>>> and we
>>>>>>>>>>>>>> are on the path to it.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 2) filters and projections pushdown:
>>>>>>>>>>>>>> I agree with Alex that the filter pushdown into cache can benefit a
>>>>>>>>>>> lot
>>>>>>>>>>>>> for
>>>>>>>>>>>>>> ALL cache.
>>>>>>>>>>>>>> However, this is not true for LRU cache. Connectors use cache to
>>>>>>>>>>> reduce
>>>>>>>>>>>>> IO
>>>>>>>>>>>>>> requests to databases for better throughput.
>>>>>>>>>>>>>> If a filter can prune 90% of data in the cache, we will have 90% of
>>>>>>>>>>>>> lookup
>>>>>>>>>>>>>> requests that can never be cached
>>>>>>>>>>>>>> and hit directly to the databases. That means the cache is
>>>>>>>>>>> meaningless in
>>>>>>>>>>>>>> this case.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> IMO, Flink SQL has provided a standard way to do filters and projects
>>>>>>>>>>>>>> pushdown, i.e., SupportsFilterPushDown and
>>>>>>>>>>> SupportsProjectionPushDown.
>>>>>>>>>>>>>> That Jdbc/hive/HBase haven't implemented the interfaces doesn't mean it's
>>>>>>>>>>> hard
>>>>>>>>>>>>> to
>>>>>>>>>>>>>> implement.
>>>>>>>>>>>>>> They should implement the pushdown interfaces to reduce IO and the
>>>>>>>>>>> cache
>>>>>>>>>>>>>> size.
>>>>>>>>>>>>>> That should be a final state that the scan source and lookup source
>>>>>>>>>>> share
>>>>>>>>>>>>>> the exact pushdown implementation.
>>>>>>>>>>>>>> I don't see why we need to duplicate the pushdown logic in caches,
>>>>>>>>>>> which
>>>>>>>>>>>>>> will complex the lookup join design.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 3) ALL cache abstraction
>>>>>>>>>>>>>> All cache might be the most challenging part of this FLIP. We have
>>>>>>>>>>> never
>>>>>>>>>>>>>> provided a reload-lookup public interface.
>>>>>>>>>>>>>> Currently, we put the reload logic in the "eval" method of
>>>>>>>>>>> TableFunction.
>>>>>>>>>>>>>> That's hard for some sources (e.g., Hive).
>>>>>>>>>>>>>> Ideally, connector implementation should share the logic of reload
>>>>>>>>>>> and
>>>>>>>>>>>>>> scan, i.e. ScanTableSource with InputFormat/SourceFunction/FLIP-27
>>>>>>>>>>>>> Source.
>>>>>>>>>>>>>> However, InputFormat/SourceFunction are deprecated, and the FLIP-27
>>>>>>>>>>>>> source
>>>>>>>>>>>>>> is deeply coupled with SourceOperator.
>>>>>>>>>>>>>> If we want to invoke the FLIP-27 source in LookupJoin, this may make
>>>>>>>>>>> the
>>>>>>>>>>>>>> scope of this FLIP much larger.
>>>>>>>>>>>>>> We are still investigating how to abstract the ALL cache logic and
>>>>>>>>>>> reuse
>>>>>>>>>>>>>> the existing source interfaces.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>> Jark
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On Thu, 5 May 2022 at 20:22, Roman Boyko <ro...@gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> It's a much more complicated activity and lies out of the scope of
>>>>>>>>>>> this
>>>>>>>>>>>>>>> improvement. Because such pushdowns should be done for all
>>>>>>>>>>>>> ScanTableSource
>>>>>>>>>>>>>>> implementations (not only for Lookup ones).
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> On Thu, 5 May 2022 at 19:02, Martijn Visser <
>>>>>>>>>>> martijnvisser@apache.org>
>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Hi everyone,
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> One question regarding "And Alexander correctly mentioned that
>>>>>>>>>>> filter
>>>>>>>>>>>>>>>> pushdown still is not implemented for jdbc/hive/hbase." -> Would
>>>>>>>>>>> an
>>>>>>>>>>>>>>>> alternative solution be to actually implement these filter
>>>>>>>>>>> pushdowns?
>>>>>>>>>>>>> I
>>>>>>>>>>>>>>>> can
>>>>>>>>>>>>>>>> imagine that there are many more benefits to doing that, outside
>>>>>>>>>>> of
>>>>>>>>>>>>> lookup
>>>>>>>>>>>>>>>> caching and metrics.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Martijn Visser
>>>>>>>>>>>>>>>> https://twitter.com/MartijnVisser82
>>>>>>>>>>>>>>>> https://github.com/MartijnVisser
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> On Thu, 5 May 2022 at 13:58, Roman Boyko <ro...@gmail.com>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Hi everyone!
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Thanks for driving such a valuable improvement!
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> I do think that single cache implementation would be a nice
>>>>>>>>>>>>> opportunity
>>>>>>>>>>>>>>>> for
>>>>>>>>>>>>>>>>> users. And it will break the "FOR SYSTEM_TIME AS OF proc_time"
>>>>>>>>>>>>> semantics
>>>>>>>>>>>>>>>>> anyway, no matter how it is implemented.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Putting myself in the user's shoes, I can say that:
>>>>>>>>>>>>>>>>> 1) I would prefer to have the opportunity to cut off the cache
>>>>>>>>>>> size
>>>>>>>>>>>>> by
>>>>>>>>>>>>>>>>> simply filtering unnecessary data. And the handiest way to do
>>>>>>>>>>> it
>>>>>>>>>>>>> is
>>>>>>>>>>>>>>>> to apply
>>>>>>>>>>>>>>>>> it inside LookupRunners. It would be a bit harder to pass it
>>>>>>>>>>>>> through the
>>>>>>>>>>>>>>>>> LookupJoin node to TableFunction. And Alexander correctly
>>>>>>>>>>> mentioned
>>>>>>>>>>>>> that
>>>>>>>>>>>>>>>>> filter pushdown still is not implemented for jdbc/hive/hbase.
>>>>>>>>>>>>>>>>> 2) The ability to set the different caching parameters for
>>>>>>>>>>> different
>>>>>>>>>>>>>>>> tables
>>>>>>>>>>>>>>>>> is quite important. So I would prefer to set it through DDL
>>>>>>>>>>> rather
>>>>>>>>>>>>> than
>>>>>>>>>>>>>>>>> have the same TTL, strategy and other options for all lookup
>>>>>>>>>>> tables.
>>>>>>>>>>>>>>>>> 3) Providing the cache into the framework really deprives us of
>>>>>>>>>>>>>>>>> extensibility (users won't be able to implement their own
>>>>>>>>>>> cache).
>>>>>>>>>>>>> But
>>>>>>>>>>>>>>>> most
>>>>>>>>>>>>>>>>> probably it might be solved by creating more different cache
>>>>>>>>>>>>> strategies
>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>> a wider set of configurations.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> All these points are much closer to the schema proposed by
>>>>>>>>>>>>> Alexander.
>>>>>>>>>>>>>>>>> Qingshen Ren, please correct me if I'm not right and all these
>>>>>>>>>>>>>>>> facilities
>>>>>>>>>>>>>>>>> might be simply implemented in your architecture?
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>>>>> Roman Boyko
>>>>>>>>>>>>>>>>> e.: ro.v.boyko@gmail.com
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> On Wed, 4 May 2022 at 21:01, Martijn Visser <
>>>>>>>>>>>>> martijnvisser@apache.org>
>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Hi everyone,
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> I don't have much to chip in, but just wanted to express that
>>>>>>>>>>> I
>>>>>>>>>>>>> really
>>>>>>>>>>>>>>>>>> appreciate the in-depth discussion on this topic and I hope
>>>>>>>>>>> that
>>>>>>>>>>>>>>>> others
>>>>>>>>>>>>>>>>>> will join the conversation.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Martijn
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> On Tue, 3 May 2022 at 10:15, Александр Смирнов <
>>>>>>>>>>>>> smiralexan@gmail.com>
>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Hi Qingsheng, Leonard and Jark,
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Thanks for your detailed feedback! However, I have questions
>>>>>>>>>>>>> about
>>>>>>>>>>>>>>>>>>> some of your statements (maybe I didn't get something?).
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> Caching actually breaks the semantic of "FOR SYSTEM_TIME
>>>>>>>>>>> AS OF
>>>>>>>>>>>>>>>>>> proc_time”
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> I agree that the semantics of "FOR SYSTEM_TIME AS OF
>>>>>>>>>>> proc_time"
>>>>>>>>>>>>> is
>>>>>>>>>>>>>>>> not
>>>>>>>>>>>>>>>>>>> fully implemented with caching, but as you said, users go
>>>>>>>>>>> on it
>>>>>>>>>>>>>>>>>>> consciously to achieve better performance (no one proposed
>>>>>>>>>>> to
>>>>>>>>>>>>> enable
>>>>>>>>>>>>>>>>>>> caching by default, etc.). Or by users do you mean other
>>>>>>>>>>>>> developers
>>>>>>>>>>>>>>>> of
>>>>>>>>>>>>>>>>>>> connectors? In this case developers explicitly specify
>>>>>>>>>>> whether
>>>>>>>>>>>>> their
>>>>>>>>>>>>>>>>>>> connector supports caching or not (in the list of supported
>>>>>>>>>>>>>>>> options),
>>>>>>>>>>>>>>>>>>> no one makes them do that if they don't want to. So what
>>>>>>>>>>>>> exactly is
>>>>>>>>>>>>>>>>>>> the difference between implementing caching in modules
>>>>>>>>>>>>>>>>>>> flink-table-runtime and in flink-table-common from the
>>>>>>>>>>>>> considered
>>>>>>>>>>>>>>>>>>> point of view? How does it affect the breaking/non-breaking of
>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>> semantics of "FOR SYSTEM_TIME AS OF proc_time"?
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> confront a situation that allows table options in DDL to
>>>>>>>>>>>>> control
>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>> behavior of the framework, which has never happened
>>>>>>>>>>> previously
>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>> should
>>>>>>>>>>>>>>>>>>> be cautious
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> If we talk about main differences of semantics of DDL
>>>>>>>>>>> options
>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>>> config options("table.exec.xxx"), isn't it about limiting
>>>>>>>>>>> the
>>>>>>>>>>>>> scope
>>>>>>>>>>>>>>>> of
>>>>>>>>>>>>>>>>>>> the options + importance for the user business logic rather
>>>>>>>>>>> than
>>>>>>>>>>>>>>>>>>> specific location of corresponding logic in the framework? I
>>>>>>>>>>>>> mean
>>>>>>>>>>>>>>>> that
>>>>>>>>>>>>>>>>>>> in my design, for example, putting an option with lookup
>>>>>>>>>>> cache
>>>>>>>>>>>>>>>>>>> strategy in configurations would  be the wrong decision,
>>>>>>>>>>>>> because it
>>>>>>>>>>>>>>>>>>> directly affects the user's business logic (not just
>>>>>>>>>>> performance
>>>>>>>>>>>>>>>>>>> optimization) + touches just several functions of ONE table
>>>>>>>>>>>>> (there
>>>>>>>>>>>>>>>> can
>>>>>>>>>>>>>>>>>>> be multiple tables with different caches). Does it really
>>>>>>>>>>>>> matter for
>>>>>>>>>>>>>>>>>>> the user (or someone else) where the logic is located,
>>>>>>>>>>> which is
>>>>>>>>>>>>>>>>>>> affected by the applied option?
>>>>>>>>>>>>>>>>>>> Also I can remember DDL option 'sink.parallelism', which in
>>>>>>>>>>>>> some way
>>>>>>>>>>>>>>>>>>> "controls the behavior of the framework" and I don't see any
>>>>>>>>>>>>> problem
>>>>>>>>>>>>>>>>>>> here.
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> introduce a new interface for this all-caching scenario
>>>>>>>>>>> and
>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>> design
>>>>>>>>>>>>>>>>>>> would become more complex
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> This is a subject for a separate discussion, but actually
>>>>>>>>>>> in our
>>>>>>>>>>>>>>>>>>> internal version we solved this problem quite easily - we
>>>>>>>>>>> reused
>>>>>>>>>>>>>>>>>>> InputFormat class (so there is no need for a new API). The
>>>>>>>>>>>>> point is
>>>>>>>>>>>>>>>>>>> that currently all lookup connectors use InputFormat for
>>>>>>>>>>>>> scanning
>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>> data in batch mode: HBase, JDBC and even Hive - it uses
>>>>>>>>>>> class
>>>>>>>>>>>>>>>>>>> PartitionReader, that is actually just a wrapper around
>>>>>>>>>>>>> InputFormat.
>>>>>>>>>>>>>>>>>>> The advantage of this solution is the ability to reload
>>>>>>>>>>> cache
>>>>>>>>>>>>> data
>>>>>>>>>>>>>>>> in
>>>>>>>>>>>>>>>>>>> parallel (number of threads depends on number of
>>>>>>>>>>> InputSplits,
>>>>>>>>>>>>> but
>>>>>>>>>>>>>>>> has
>>>>>>>>>>>>>>>>>>> an upper limit). As a result the cache reload time is significantly
>>>>>>>>>>>>> reduced
>>>>>>>>>>>>>>>>>>> (as well as the time the input stream is blocked). I know that
>>>>>>>>>>> usually
>>>>>>>>>>>>> we
>>>>>>>>>>>>>>>> try
>>>>>>>>>>>>>>>>>>> to avoid usage of concurrency in Flink code, but maybe this
>>>>>>>>>>> one
>>>>>>>>>>>>> can
>>>>>>>>>>>>>>>> be
>>>>>>>>>>>>>>>>>>> an exception. BTW I don't say that it's an ideal solution,
>>>>>>>>>>> maybe
>>>>>>>>>>>>>>>> there
>>>>>>>>>>>>>>>>>>> are better ones.
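The parallel reload described above can be modeled roughly as follows. Flink's real InputFormat/InputSplit API is more involved; in this sketch a "split" is just a list of keys and the loader is a stand-in for reading one split from the dimension table.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Models a parallel ALL-cache reload: each split is loaded by a worker
// thread, and the partial maps are merged into one cache snapshot.
class ParallelReloader {
    static <K, V> Map<K, V> reload(
            List<List<K>> splits, Function<K, V> loadOne, int parallelism) {
        ExecutorService pool = Executors.newFixedThreadPool(parallelism);
        try {
            List<Future<Map<K, V>>> futures = new ArrayList<>();
            for (List<K> split : splits) {
                futures.add(pool.submit(() -> {
                    Map<K, V> part = new HashMap<>();
                    for (K key : split) {
                        part.put(key, loadOne.apply(key));
                    }
                    return part;
                }));
            }
            Map<K, V> cache = new HashMap<>();
            for (Future<Map<K, V>> f : futures) {
                cache.putAll(f.get()); // merge the per-split results
            }
            return cache;
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }
}
```

Because the new snapshot is built on the side and swapped in at the end, the input stream is only blocked for the final swap, not for the whole reload.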
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> Providing the cache in the framework might introduce
>>>>>>>>>>>>> compatibility
>>>>>>>>>>>>>>>>>> issues
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> It's possible only in cases where the developer of the
>>>>>>>>>>> connector
>>>>>>>>>>>>>>>> won't
>>>>>>>>>>>>>>>>>>> properly refactor his code and will use new cache options
>>>>>>>>>>>>>>>> incorrectly
>>>>>>>>>>>>>>>>>>> (i.e. explicitly provide the same options into 2 different
>>>>>>>>>>> code
>>>>>>>>>>>>>>>>>>> places). For correct behavior all he will need to do is to
>>>>>>>>>>>>> redirect
>>>>>>>>>>>>>>>>>>> existing options to the framework's LookupConfig (+ maybe
>>>>>>>>>>> add an
>>>>>>>>>>>>>>>> alias
>>>>>>>>>>>>>>>>>>> for options, if there was different naming), everything
>>>>>>>>>>> will be
>>>>>>>>>>>>>>>>>>> transparent for users. If the developer won't do
>>>>>>>>>>> refactoring at
>>>>>>>>>>>>> all,
>>>>>>>>>>>>>>>>>>> nothing will be changed for the connector because of
>>>>>>>>>>> backward
>>>>>>>>>>>>>>>>>>> compatibility. Also if a developer wants to use his own
>>>>>>>>>>> cache
>>>>>>>>>>>>> logic,
>>>>>>>>>>>>>>>>>>> he just can refuse to pass some of the configs into the
>>>>>>>>>>>>> framework,
>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>>> instead make his own implementation with already existing
>>>>>>>>>>>>> configs
>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>>> metrics (but actually I think that it's a rare case).
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> filters and projections should be pushed all the way down
>>>>>>>>>>> to
>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>> table
>>>>>>>>>>>>>>>>>>> function, like what we do in the scan source
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> It's the great purpose. But the truth is that the ONLY
>>>>>>>>>>> connector
>>>>>>>>>>>>>>>> that
>>>>>>>>>>>>>>>>>>> supports filter pushdown is FileSystemTableSource
>>>>>>>>>>>>>>>>>>> (no database connector supports it currently). Also for some
>>>>>>>>>>>>>>>> databases
>>>>>>>>>>>>>>>>>>> it's simply impossible to push down such complex filters
>>>>>>>>>>> that we
>>>>>>>>>>>>> have
>>>>>>>>>>>>>>>>>>> in Flink.
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> only applying these optimizations to the cache seems not
>>>>>>>>>>>>> quite
>>>>>>>>>>>>>>>>> useful
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Filters can cut off an arbitrarily large amount of data
>>>>>>>>>>> from the
>>>>>>>>>>>>>>>>>>> dimension table. For a simple example, suppose in dimension
>>>>>>>>>>>>> table
>>>>>>>>>>>>>>>>>>> 'users'
>>>>>>>>>>>>>>>>>>> we have column 'age' with values from 20 to 40, and input
>>>>>>>>>>> stream
>>>>>>>>>>>>>>>>>>> 'clicks' that is ~uniformly distributed by age of users. If
>>>>>>>>>>> we
>>>>>>>>>>>>> have
>>>>>>>>>>>>>>>>>>> filter 'age > 30',
>>>>>>>>>>>>>>>>>>> there will be half as much data in the cache. This means the user
>>>>>>>>>>> can
>>>>>>>>>>>>>>>>>>> increase 'lookup.cache.max-rows' by almost 2 times. This will
>>>>>>>>>>>>> give a
>>>>>>>>>>>>>>>>>>> huge
>>>>>>>>>>>>>>>>>>> performance boost. Moreover, this optimization starts to
>>>>>>>>>>> really
>>>>>>>>>>>>>>>> shine
>>>>>>>>>>>>>>>>>>> in 'ALL' cache, where tables without filters and projections
>>>>>>>>>>>>> can't
>>>>>>>>>>>>>>>> fit
>>>>>>>>>>>>>>>>>>> in memory, but with them - can. This opens up additional
>>>>>>>>>>>>>>>> possibilities
>>>>>>>>>>>>>>>>>>> for users. And this hardly sounds like 'not quite useful'.
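A tiny sketch of the arithmetic above, assuming ages uniform over 20..40 and the join's filter applied to each lookup response before it is cached (all names are illustrative):

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

// Demonstrates filter-before-caching: with 'age > 30' applied to the lookup
// response before insertion, the cache stores roughly half the rows for the
// same set of lookup keys.
class FilterBeforeCaching {
    // The rows that would actually be cached for one lookup response.
    static List<Integer> cacheEntry(List<Integer> lookedUpAges, int minAgeExclusive) {
        return lookedUpAges.stream()
                .filter(age -> age > minAgeExclusive)
                .collect(Collectors.toList());
    }

    // Dimension table values: one row per age from 20 to 40 inclusive.
    static List<Integer> allAges() {
        return IntStream.rangeClosed(20, 40).boxed().collect(Collectors.toList());
    }
}
```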
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> It would be great to hear other voices regarding this topic!
>>>>>>>>>>>>> Because
>>>>>>>>>>>>>>>>>>> we have quite a lot of controversial points, and I think
>>>>>>>>>>> with
>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>> help
>>>>>>>>>>>>>>>>>>> of others it will be easier for us to come to a consensus.
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>>>>>>> Smirnov Alexander
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> пт, 29 апр. 2022 г. в 22:33, Qingsheng Ren <
>>>>>>>>>>> renqschn@gmail.com
>>>>>>>>>>>>>> :
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> Hi Alexander and Arvid,
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> Thanks for the discussion and sorry for my late response!
>>>>>>>>>>> We
>>>>>>>>>>>>> had
>>>>>>>>>>>>>>>> an
>>>>>>>>>>>>>>>>>>> internal discussion together with Jark and Leonard and I’d
>>>>>>>>>>> like
>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>> summarize our ideas. Instead of implementing the cache
>>>>>>>>>>> logic in
>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>> table
>>>>>>>>>>>>>>>>>>> runtime layer or wrapping around the user-provided table
>>>>>>>>>>>>> function,
>>>>>>>>>>>>>>>> we
>>>>>>>>>>>>>>>>>>> prefer to introduce some new APIs extending TableFunction
>>>>>>>>>>> with
>>>>>>>>>>>>> these
>>>>>>>>>>>>>>>>>>> concerns:
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> 1. Caching actually breaks the semantic of "FOR
>>>>>>>>>>> SYSTEM_TIME
>>>>>>>>>>>>> AS OF
>>>>>>>>>>>>>>>>>>> proc_time”, because it couldn’t truly reflect the content
>>>>>>>>>>> of the
>>>>>>>>>>>>>>>> lookup
>>>>>>>>>>>>>>>>>>> table at the moment of querying. If users choose to enable
>>>>>>>>>>>>> caching
>>>>>>>>>>>>>>>> on
>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>> lookup table, they implicitly indicate that this breakage is
>>>>>>>>>>>>>>>> acceptable
>>>>>>>>>>>>>>>>>> in
>>>>>>>>>>>>>>>>>>> exchange for the performance. So we prefer not to provide
>>>>>>>>>>>>> caching on
>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>> table runtime level.
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> 2. If we make the cache implementation in the framework
>>>>>>>>>>>>> (whether
>>>>>>>>>>>>>>>> in a
>>>>>>>>>>>>>>>>>>> runner or a wrapper around TableFunction), we have to
>>>>>>>>>>> confront a
>>>>>>>>>>>>>>>>>> situation
>>>>>>>>>>>>>>>>>>> that allows table options in DDL to control the behavior of
>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>> framework,
>>>>>>>>>>>>>>>>>>> which has never happened previously and should be cautious.
>>>>>>>>>>>>> Under
>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>> current design the behavior of the framework should only be
>>>>>>>>>>>>>>>> specified
>>>>>>>>>>>>>>>>> by
>>>>>>>>>>>>>>>>>>> configurations (“table.exec.xxx”), and it’s hard to apply
>>>>>>>>>>> these
>>>>>>>>>>>>>>>> general
>>>>>>>>>>>>>>>>>>> configs to a specific table.
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> 3. We have use cases where the lookup source loads and refreshes
>>>>>>>>>>> all
>>>>>>>>>>>>>>>> records
>>>>>>>>>>>>>>>>>>> periodically into the memory to achieve high lookup
>>>>>>>>>>> performance
>>>>>>>>>>>>>>>> (like
>>>>>>>>>>>>>>>>>> Hive
>>>>>>>>>>>>>>>>>>> connector in the community, and also widely used by our
>>>>>>>>>>> internal
>>>>>>>>>>>>>>>>>>> connectors). Wrapping the cache around the user’s
>>>>>>>>>>> TableFunction
>>>>>>>>>>>>>>>> works
>>>>>>>>>>>>>>>>>> fine
>>>>>>>>>>>>>>>>>>> for LRU caches, but I think we have to introduce a new
>>>>>>>>>>>>> interface for
>>>>>>>>>>>>>>>>> this
>>>>>>>>>>>>>>>>>>> all-caching scenario and the design would become more
>>>>>>>>>>> complex.
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> 4. Providing the cache in the framework might introduce
>>>>>>>>>>>>>>>> compatibility
>>>>>>>>>>>>>>>>>>> issues to existing lookup sources like there might exist two
>>>>>>>>>>>>> caches
>>>>>>>>>>>>>>>>> with
>>>>>>>>>>>>>>>>>>> totally different strategies if the user incorrectly
>>>>>>>>>>> configures
>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>> table
>>>>>>>>>>>>>>>>>>> (one in the framework and another implemented by the lookup
>>>>>>>>>>>>> source).
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> As for the optimization mentioned by Alexander, I think
>>>>>>>>>>>>> filters
>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>>> projections should be pushed all the way down to the table
>>>>>>>>>>>>> function,
>>>>>>>>>>>>>>>>> like
>>>>>>>>>>>>>>>>>>> what we do in the scan source, instead of the runner with
>>>>>>>>>>> the
>>>>>>>>>>>>> cache.
>>>>>>>>>>>>>>>>> The
>>>>>>>>>>>>>>>>>>> goal of using cache is to reduce the network I/O and
>>>>>>>>>>> pressure
>>>>>>>>>>>>> on the
>>>>>>>>>>>>>>>>>>> external system, and only applying these optimizations to
>>>>>>>>>>> the
>>>>>>>>>>>>> cache
>>>>>>>>>>>>>>>>> seems
>>>>>>>>>>>>>>>>>>> not quite useful.
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> I made some updates to the FLIP[1] to reflect our ideas.
>>>>>>>>>>> We
>>>>>>>>>>>>>>>> prefer to
>>>>>>>>>>>>>>>>>>> keep the cache implementation as a part of TableFunction,
>>>>>>>>>>> and we
>>>>>>>>>>>>>>>> could
>>>>>>>>>>>>>>>>>>> provide some helper classes (CachingTableFunction,
>>>>>>>>>>>>>>>>>> AllCachingTableFunction,
>>>>>>>>>>>>>>>>>>> CachingAsyncTableFunction) to developers and regulate
>>>>>>>>>>> metrics
>>>>>>>>>>>>> of the
>>>>>>>>>>>>>>>>>> cache.
>>>>>>>>>>>>>>>>>>> Also, I made a POC[2] for your reference.
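To make the CachingTableFunction idea concrete, here is a hedged JDK-only sketch in which a plain Function stands in for TableFunction.eval; the names are illustrative, not the POC's actual API.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// A wrapper that consults the cache first and delegates to the underlying
// lookup function only on a miss. Empty results are cached too, so repeated
// misses for absent keys don't hit the external system again.
class CachingLookup<K, V> {
    private final Function<K, List<V>> delegate; // stand-in for TableFunction.eval
    private final Map<K, List<V>> cache = new HashMap<>();
    int delegateCalls = 0; // exposed here only to observe cache behavior

    CachingLookup(Function<K, List<V>> delegate) {
        this.delegate = delegate;
    }

    List<V> lookup(K key) {
        List<V> hit = cache.get(key);
        if (hit != null) {
            return hit;
        }
        delegateCalls++;
        List<V> rows = delegate.apply(key); // miss: query the external system
        cache.put(key, rows == null ? List.of() : rows);
        return cache.get(key);
    }
}
```

The helper classes named above (CachingTableFunction, AllCachingTableFunction, CachingAsyncTableFunction) would follow this delegate-on-miss shape while also registering the standard cache metrics.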
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> Looking forward to your ideas!
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> [1]
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-221+Abstraction+for+lookup+source+cache+and+metric
>>>>>>>>>>>>>>>>>>>> [2] https://github.com/PatrickRen/flink/tree/FLIP-221
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> Qingsheng
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> On Tue, Apr 26, 2022 at 4:45 PM Александр Смирнов <
>>>>>>>>>>>>>>>>>> smiralexan@gmail.com>
>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> Thanks for the response, Arvid!
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> I have few comments on your message.
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> but could also live with an easier solution as the
>>>>>>>>>>> first
>>>>>>>>>>>>> step:
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> I think that these 2 ways are mutually exclusive
>>>>>>>>>>> (originally
>>>>>>>>>>>>>>>>> proposed
>>>>>>>>>>>>>>>>>>>>> by Qingsheng and mine), because conceptually they follow
>>>>>>>>>>> the
>>>>>>>>>>>>> same
>>>>>>>>>>>>>>>>>>>>> goal, but implementation details are different. If we
>>>>>>>>>>> will
>>>>>>>>>>>>> go one
>>>>>>>>>>>>>>>>> way,
>>>>>>>>>>>>>>>>>>>>> moving to another way in the future will mean deleting
>>>>>>>>>>>>> existing
>>>>>>>>>>>>>>>> code
>>>>>>>>>>>>>>>>>>>>> and once again changing the API for connectors. So I
>>>>>>>>>>> think we
>>>>>>>>>>>>>>>> should
>>>>>>>>>>>>>>>>>>>>> reach a consensus with the community about that and then
>>>>>>>>>>> work
>>>>>>>>>>>>>>>>> together
>>>>>>>>>>>>>>>>>>>>> on this FLIP, i.e. divide the work on tasks for different
>>>>>>>>>>>>> parts
>>>>>>>>>>>>>>>> of
>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>> flip (for example, LRU cache unification / introducing
>>>>>>>>>>>>> proposed
>>>>>>>>>>>>>>>> set
>>>>>>>>>>>>>>>>> of
>>>>>>>>>>>>>>>>>>>>> metrics / further work…). WDYT, Qingsheng?
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> as the source will only receive the requests after
>>>>>>>>>>> filter
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> Actually if filters are applied to fields of the lookup
>>>>>>>>>>>>> table, we
>>>>>>>>>>>>>>>>>>>>> firstly must do requests, and only after that we can
>>>>>>>>>>> filter
>>>>>>>>>>>>>>>>> responses,
>>>>>>>>>>>>>>>>>>>>> because lookup connectors don't have filter pushdown. So
>>>>>>>>>>> if
>>>>>>>>>>>>>>>>> filtering
>>>>>>>>>>>>>>>>>>>>> is done before caching, there will be far fewer rows in the
>>>>>>>>>>>>> cache.
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> @Alexander unfortunately, your architecture is not
>>>>>>>>>>> shared.
>>>>>>>>>>>>> I
>>>>>>>>>>>>>>>> don't
>>>>>>>>>>>>>>>>>>> know the
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> solution to share images to be honest.
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> Sorry for that, I’m a bit new to such kinds of
>>>>>>>>>>> conversations
>>>>>>>>>>>>> :)
>>>>>>>>>>>>>>>>>>>>> I have no write access to the confluence, so I made a
>>>>>>>>>>> Jira
>>>>>>>>>>>>> issue,
>>>>>>>>>>>>>>>>>>>>> where I described the proposed changes in more detail -
>>>>>>>>>>>>>>>>>>>>> https://issues.apache.org/jira/browse/FLINK-27411.
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> Will be happy to get more feedback!
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>>>>>>> Smirnov Alexander
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> пн, 25 апр. 2022 г. в 19:49, Arvid Heise <
> arvid@apache.org>:
>
> Hi Qingsheng,
>
> Thanks for driving this; the inconsistency was not satisfying for me.
>
> I second Alexander's idea though, but could also live with an easier
> solution as the first step: instead of making caching an implementation
> detail of TableFunction X, rather devise a caching layer around X. So
> the proposal would be a CachingTableFunction that delegates to X in
> case of misses and otherwise manages the cache. Lifting it into the
> operator model as proposed would be even better, but is probably
> unnecessary in the first step for a lookup source (as the source will
> only receive the requests after the filter; applying the projection may
> be more interesting to save memory).
>
> Another advantage is that all the changes of this FLIP would be limited
> to options, with no need for new public interfaces. Everything else
> remains an implementation detail of the Table runtime. That means we
> can easily incorporate the optimization potential that Alexander
> pointed out later.
>
> @Alexander unfortunately, your architecture is not shared. I don't
> know the solution to share images, to be honest.
>
> On Fri, Apr 22, 2022 at 5:04 PM Александр Смирнов <smiralexan@gmail.com> wrote:
>
>> Hi Qingsheng! My name is Alexander, I'm not a committer yet, but I'd
>> really like to become one. And this FLIP really interested me.
>> Actually I have worked on a similar feature in my company's Flink
>> fork, and we would like to share our thoughts on this and make the
>> code open source.
>>
>> I think there is a better alternative than introducing an abstract
>> class for TableFunction (CachingTableFunction). As you know,
>> TableFunction exists in the flink-table-common module, which provides
>> only an API for working with tables – it's very convenient for
>> importing in connectors. In turn, CachingTableFunction contains logic
>> for runtime execution, so this class and everything connected with it
>> should be located in another module, probably in flink-table-runtime.
>> But this would require connectors to depend on another module, which
>> contains a lot of runtime logic, which doesn't sound good.
>>
>> I suggest adding a new method 'getLookupConfig' to LookupTableSource
>> or LookupRuntimeProvider to allow connectors to only pass
>> configurations to the planner, so that they won't depend on the
>> runtime realization. Based on these configs the planner will construct
>> a lookup join operator with the corresponding runtime logic
>> (ProcessFunctions in module flink-table-runtime). The architecture
>> looks like in the pinned image (the LookupConfig class there is
>> actually your CacheConfig).
>>
>> The classes in flink-table-planner that will be responsible for this
>> are CommonPhysicalLookupJoin and its inheritors.
>> The current classes for lookup join in flink-table-runtime are
>> LookupJoinRunner, AsyncLookupJoinRunner, LookupJoinRunnerWithCalc and
>> AsyncLookupJoinRunnerWithCalc.
>>
>> I suggest adding classes LookupJoinCachingRunner,
>> LookupJoinCachingRunnerWithCalc, etc.
>>
>> And here comes another, more powerful advantage of such a solution. If
>> we have the caching logic on a lower level, we can apply some
>> optimizations to it. LookupJoinRunnerWithCalc was named like this
>> because it uses the 'calc' function, which actually mostly consists of
>> filters and projections.
>>
>> For example, in a join of table A with lookup table B with the
>> condition 'JOIN … ON A.id = B.id AND A.age = B.age + 10 WHERE
>> B.salary > 1000', the 'calc' function will contain the filters
>> A.age = B.age + 10 and B.salary > 1000.
>>
>> If we apply this function before storing records in the cache, the
>> size of the cache will be significantly reduced: filters = avoid
>> storing useless records in the cache, projections = reduce the
>> records' size. So the initial max number of records in the cache can
>> be increased by the user.
>>
>> What do you think about it?
>>
>> On 2022/04/19 02:47:11 Qingsheng Ren wrote:
>>> Hi devs,
>>>
>>> Yuan and I would like to start a discussion about FLIP-221 [1], which
>>> introduces an abstraction of the lookup table cache and its standard
>>> metrics.
>>>
>>> Currently each lookup table source has to implement its own cache to
>>> store lookup results, and there isn't a standard set of metrics for
>>> users and developers to tune their jobs with lookup joins, which is a
>>> quite common use case in Flink Table / SQL.
>>>
>>> Therefore we propose some new APIs including a cache, metrics,
>>> wrapper classes of TableFunction and new table options. Please take a
>>> look at the FLIP page [1] to get more details. Any suggestions and
>>> comments would be appreciated!
>>>
>>> [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-221+Abstraction+for+lookup+source+cache+and+metric
>>>
>>> Best regards,
>>>
>>> Qingsheng
>>
>
> --
> Best Regards,
>
> Qingsheng Ren
>
> Real-time Computing Team
> Alibaba Cloud
>
> Email: renqschn@gmail.com
>
> --
> Best regards,
> Roman Boyko
> e.: ro.v.boyko@gmail.com


Re: [DISCUSS] FLIP-221 Abstraction for lookup source cache and metric

Posted by Александр Смирнов <sm...@gmail.com>.
Also I have a few additions:
1) Maybe rename 'lookup.cache.maximum-size' to 'lookup.cache.max-rows'?
That would make it clearer that we are talking about the number of
rows, not about bytes. It also fits better with the filter
optimization I proposed.
2) How will users enable rescanning? Are we going to separate caching
and rescanning from the options point of view? Initially we had a
single option 'lookup.cache' with values LRU / ALL; I think now we can
introduce a boolean option 'lookup.rescan', with the rescan interval
exposed as 'lookup.rescan.interval', etc.
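For illustration, a dimension table DDL using the option names suggested above might look like this (a sketch only — the table schema and connector options are made up, and the 'lookup.*' keys are just the proposals from this thread, not final API):

```sql
CREATE TABLE dim_users (
    id BIGINT,
    name STRING
) WITH (
    'connector' = 'jdbc',
    -- suggested renaming: budget counted in rows, not bytes
    'lookup.cache.max-rows' = '10000',
    -- suggested split of rescanning from caching
    'lookup.rescan' = 'true',
    'lookup.rescan.interval' = '1h'
);
```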

Best regards,
Alexander

Thu, 19 May 2022 at 14:50, Александр Смирнов <sm...@gmail.com>:
>
> Hi Qingsheng and Jark,
>
> 1. Builders vs 'of'
> I understand that builders are used when we have multiple parameters.
> I suggested them because we could add parameters later. To prevent
> Builder for ScanRuntimeProvider from looking redundant I can suggest
> one more config now - "rescanStartTime".
> It's a time in UTC (LocalTime class) when the first reload of cache
> starts. This parameter can be thought of as 'initialDelay' (diff
> between current time and rescanStartTime) in method
> ScheduleExecutorService#scheduleWithFixedDelay [1] . It can be very
> useful when the dimension table is updated by some other scheduled job
> at a certain time. Or when the user simply wants a second scan (first
> cache reload) be delayed. This option can be used even without
> 'rescanInterval' - in this case 'rescanInterval' will be one day.
> If you are fine with this option, I would be very glad if you would
> give me access to edit FLIP page, so I could add it myself
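To illustrate the semantics, the initial delay implied by 'rescanStartTime' could be computed roughly like this (plain Java; RescanDelay and initialDelay are illustrative names, not proposed API):

```java
import java.time.Duration;
import java.time.LocalTime;

class RescanDelay {
    // Computes the delay between 'now' and the configured rescanStartTime,
    // both UTC wall-clock times. If the start time has already passed today,
    // the first reload is scheduled for the same time tomorrow.
    static Duration initialDelay(LocalTime now, LocalTime rescanStartTime) {
        Duration delay = Duration.between(now, rescanStartTime);
        if (delay.isNegative()) {
            delay = delay.plus(Duration.ofDays(1));
        }
        return delay;
    }

    public static void main(String[] args) {
        // e.g. now = 10:00 UTC, rescanStartTime = 02:00 UTC
        LocalTime now = LocalTime.of(10, 0);
        LocalTime rescanStart = LocalTime.of(2, 0);
        System.out.println("first reload in " + initialDelay(now, rescanStart).toHours() + "h");
    }
}
```

The resulting duration would then be passed as the initialDelay argument of ScheduledExecutorService#scheduleWithFixedDelay, together with the rescan interval.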
>
> 2. Common table options
> I also think that FactoryUtil would be overloaded by all the cache
> options. But maybe we should unify all of the suggested options, not
> only those for the default cache? I.e. a class 'LookupOptions' that
> unifies the default cache options, the rescan options, 'async' and
> 'maxRetries'. WDYT?
>
> 3. Retries
> I'm fine with a suggestion along the lines of RetryUtils#tryTimes(times, call)
>
> [1] https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/ScheduledExecutorService.html#scheduleWithFixedDelay-java.lang.Runnable-long-long-java.util.concurrent.TimeUnit-
>
> Best regards,
> Alexander
>
> Wed, 18 May 2022 at 16:04, Qingsheng Ren <re...@gmail.com>:
> >
> > Hi Jark and Alexander,
> >
> > Thanks for your comments! I'm also OK with introducing common table options. I prefer to introduce a new DefaultLookupCacheOptions class to hold these option definitions, because putting all options into FactoryUtil would make it a bit "crowded" and not well categorized.
> >
> > FLIP has been updated according to suggestions above:
> > 1. Use a static "of" method for constructing RescanRuntimeProvider, since both arguments are required.
> > 2. Introduce new table options matching DefaultLookupCacheFactory
> >
> > Best,
> > Qingsheng
> >
> > On Wed, May 18, 2022 at 2:57 PM Jark Wu <im...@gmail.com> wrote:
> >>
> >> Hi Alex,
> >>
> >> 1) retry logic
> >> I think we can extract some common retry logic into utilities, e.g. RetryUtils#tryTimes(times, call).
> >> This seems independent of this FLIP and can be reused by DataStream users.
> >> Maybe we can open an issue to discuss this and where to put it.
> >>
> >> 2) cache ConfigOptions
> >> I'm fine with defining cache config options in the framework.
> >> A candidate place to put is FactoryUtil which also includes "sink.parallelism", "format" options.
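A minimal sketch of the utility shape floated above (RetryUtils and tryTimes are only the names suggested in this thread — no such class exists in Flink yet, and a real version would likely add backoff between attempts):

```java
import java.util.concurrent.Callable;

final class RetryUtils {
    private RetryUtils() {}

    // Invokes 'call' up to 'times' attempts, returning the first successful
    // result and rethrowing the last failure if all attempts fail.
    static <T> T tryTimes(int times, Callable<T> call) throws Exception {
        if (times <= 0) {
            throw new IllegalArgumentException("times must be positive");
        }
        Exception last = null;
        for (int attempt = 1; attempt <= times; attempt++) {
            try {
                return call.call();
            } catch (Exception e) {
                last = e; // a real implementation might sleep/backoff here
            }
        }
        throw last;
    }

    public static void main(String[] args) throws Exception {
        int[] attempts = {0};
        String result = tryTimes(3, () -> {
            attempts[0]++;
            if (attempts[0] < 2) throw new RuntimeException("transient failure");
            return "ok";
        });
        System.out.println(result + " after " + attempts[0] + " attempts");
    }
}
```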
> >>
> >> Best,
> >> Jark
> >>
> >>
> >> On Wed, 18 May 2022 at 13:52, Александр Смирнов <sm...@gmail.com> wrote:
> >>>
> >>> Hi Qingsheng,
> >>>
> >>> Thank you for considering my comments.
> >>>
> >>> >  there might be custom logic before making retry, such as re-establish the connection
> >>>
> >>> Yes, I understand that. I meant that such logic can be placed in a
> >>> separate function that can be implemented by connectors. Just moving
> >>> the retry logic would make connectors' LookupFunctions more concise
> >>> and avoid duplicate code. However, it's a minor change. The decision
> >>> is up to you.
> >>>
> >>> > We decided not to provide common DDL options and to let developers define their own options per connector, as we do now.
> >>>
> >>> What is the reason for that? One of the main goals of this FLIP was to
> >>> unify the configs, wasn't it? I understand that the current cache design
> >>> doesn't depend on ConfigOptions like it did before. But we can still put
> >>> these options into the framework, so connectors can reuse them and
> >>> avoid code duplication and, more significantly, avoid inconsistent
> >>> option naming. This can be pointed out in the
> >>> documentation for connector developers.
> >>>
> >>> Best regards,
> >>> Alexander
> >>>
> >>> Tue, 17 May 2022 at 17:11, Qingsheng Ren <re...@gmail.com>:
> >>> >
> >>> > Hi Alexander,
> >>> >
> >>> > Thanks for the review and glad to see we are on the same page! I think you forgot to cc the dev mailing list so I’m also quoting your reply under this email.
> >>> >
> >>> > >  We can add 'maxRetryTimes' option into this class
> >>> >
> >>> > In my opinion the retry logic should be implemented in lookup() instead of in LookupFunction#eval(). Retrying is only meaningful under some specific retriable failures, and there might be custom logic before making a retry, such as re-establishing the connection (JdbcRowDataLookupFunction is an example), so it's handier to leave it to the connector.
> >>> >
> >>> > > I don't see DDL options, that were in previous version of FLIP. Do you have any special plans for them?
> >>> >
> >>> > We decided not to provide common DDL options and to let developers define their own options per connector, as we do now.
> >>> >
> >>> > The rest of comments sound great and I’ll update the FLIP. Hope we can finalize our proposal soon!
> >>> >
> >>> > Best,
> >>> >
> >>> > Qingsheng
> >>> >
> >>> >
> >>> > > On May 17, 2022, at 13:46, Александр Смирнов <sm...@gmail.com> wrote:
> >>> > >
> >>> > > Hi Qingsheng and devs!
> >>> > >
> >>> > > I like the overall design of updated FLIP, however I have several
> >>> > > suggestions and questions.
> >>> > >
> >>> > > 1) Introducing LookupFunction as a subclass of TableFunction is a good
> >>> > > idea. We can add 'maxRetryTimes' option into this class. 'eval' method
> >>> > > of new LookupFunction is great for this purpose. The same is for
> >>> > > 'async' case.
> >>> > >
> >>> > > 2) There might be other configs in future, such as 'cacheMissingKey'
> >>> > > in LookupFunctionProvider or 'rescanInterval' in ScanRuntimeProvider.
> >>> > > Maybe use Builder pattern in LookupFunctionProvider and
> >>> > > RescanRuntimeProvider for more flexibility (use one 'build' method
> >>> > > instead of many 'of' methods in future)?
> >>> > >
> >>> > > 3) What are the plans for existing TableFunctionProvider and
> >>> > > AsyncTableFunctionProvider? I think they should be deprecated.
> >>> > >
> >>> > > 4) Am I right that the current design does not assume usage of
> >>> > > user-provided LookupCache in re-scanning? In this case, it is not very
> >>> > > clear why we need methods such as 'invalidate' or 'putAll' in
> >>> > > LookupCache.
> >>> > >
> >>> > > 5) I don't see DDL options, that were in previous version of FLIP. Do
> >>> > > you have any special plans for them?
> >>> > >
> >>> > > If you don't mind, I would be glad to be able to make small
> >>> > > adjustments to the FLIP document too. I think it's worth mentioning
> >>> > > what optimizations exactly are planned for the future.
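The builder idea from point (2) above could take roughly this shape (a plain-Java sketch, not the proposed Flink interface itself; the concrete fields are illustrative — only cacheMissingKey and maxRetryTimes come from the discussion in this thread):

```java
// Sketch of a builder-style provider, so that new options can be added later
// without multiplying static "of" factory methods.
final class LookupFunctionProvider {
    private final boolean cacheMissingKey;
    private final int maxRetryTimes;

    private LookupFunctionProvider(Builder b) {
        this.cacheMissingKey = b.cacheMissingKey;
        this.maxRetryTimes = b.maxRetryTimes;
    }

    boolean isCacheMissingKey() { return cacheMissingKey; }
    int getMaxRetryTimes() { return maxRetryTimes; }

    static Builder newBuilder() { return new Builder(); }

    static final class Builder {
        private boolean cacheMissingKey = true; // illustrative defaults
        private int maxRetryTimes = 3;

        Builder cacheMissingKey(boolean v) { this.cacheMissingKey = v; return this; }
        Builder maxRetryTimes(int v) { this.maxRetryTimes = v; return this; }
        LookupFunctionProvider build() { return new LookupFunctionProvider(this); }
    }
}
```

A new option then only needs one extra builder method instead of a new overloaded "of" variant.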
> >>> > >
> >>> > > Best regards,
> >>> > > Smirnov Alexander
> >>> > >
> >>> > > Fri, 13 May 2022 at 20:27, Qingsheng Ren <re...@gmail.com>:
> >>> > >>
> >>> > >> Hi Alexander and devs,
> >>> > >>
> >>> > >> Thank you very much for the in-depth discussion! As Jark mentioned we were inspired by Alexander's idea and made a refactor on our design. FLIP-221 [1] has been updated to reflect our design now and we are happy to hear more suggestions from you!
> >>> > >>
> >>> > >> Compared to the previous design:
> >>> > >> 1. The lookup cache serves at table runtime level and is integrated as a component of LookupJoinRunner as discussed previously.
> >>> > >> 2. Interfaces are renamed and re-designed to reflect the new design.
> >>> > >> 3. We separate the all-caching case individually and introduce a new RescanRuntimeProvider to reuse the ability of scanning. We are planning to support SourceFunction / InputFormat for now considering the complexity of FLIP-27 Source API.
> >>> > >> 4. A new interface LookupFunction is introduced to make the semantic of lookup more straightforward for developers.
> >>> > >>
> >>> > >> For replying to Alexander:
> >>> > >>> However I'm a little confused whether InputFormat is deprecated or not. Am I right that it will be so in the future, but currently it's not?
> >>> > >> Yes you are right. InputFormat is not deprecated for now. I think it will be deprecated in the future but we don't have a clear plan for that.
> >>> > >>
> >>> > >> Thanks again for the discussion on this FLIP and looking forward to cooperating with you after we finalize the design and interfaces!
> >>> > >>
> >>> > >> [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-221+Abstraction+for+lookup+source+cache+and+metric
> >>> > >>
> >>> > >> Best regards,
> >>> > >>
> >>> > >> Qingsheng
> >>> > >>
> >>> > >>
> >>> > >> On Fri, May 13, 2022 at 12:12 AM Александр Смирнов <sm...@gmail.com> wrote:
> >>> > >>>
> >>> > >>> Hi Jark, Qingsheng and Leonard!
> >>> > >>>
> >>> > >>> Glad to see that we came to a consensus on almost all points!
> >>> > >>>
> >>> > >>> However I'm a little confused whether InputFormat is deprecated or
> >>> > >>> not. Am I right that it will be so in the future, but currently it's
> >>> > >>> not? Actually I also think that for the first version it's OK to use
> >>> > >>> InputFormat in ALL cache realization, because supporting rescan
> >>> > >>> ability seems like a very distant prospect. But for this decision we
> >>> > >>> need a consensus among all discussion participants.
> >>> > >>>
> >>> > >>> In general, I don't have something to argue with your statements. All
> >>> > >>> of them correspond my ideas. Looking ahead, it would be nice to work
> >>> > >>> on this FLIP cooperatively. I've already done a lot of work on lookup
> >>> > >>> join caching with realization very close to the one we are discussing,
> >>> > >>> and want to share the results of this work. Anyway looking forward for
> >>> > >>> the FLIP update!
> >>> > >>>
> >>> > >>> Best regards,
> >>> > >>> Smirnov Alexander
> >>> > >>>
> >>> > >>>> Thu, 12 May 2022 at 17:38, Jark Wu <im...@gmail.com>:
> >>> > >>>>
> >>> > >>>> Hi Alex,
> >>> > >>>>
> >>> > >>>> Thanks for summarizing your points.
> >>> > >>>>
> >>> > >>>> In the past week, Qingsheng, Leonard, and I have discussed it several times
> >>> > >>>> and we have totally refactored the design.
> >>> > >>>> I'm glad to say we have reached a consensus on many of your points!
> >>> > >>>> Qingsheng is still working on updating the design docs and maybe can be
> >>> > >>>> available in the next few days.
> >>> > >>>> I will share some conclusions from our discussions:
> >>> > >>>>
> >>> > >>>> 1) we have refactored the design towards the "cache in framework" way.
> >>> > >>>>
> >>> > >>>> 2) a "LookupCache" interface for users to customize and a default
> >>> > >>>> implementation with a builder for ease of use.
> >>> > >>>> This makes it possible to have both flexibility and conciseness.
> >>> > >>>>
> >>> > >>>> 3) Filter pushdown is important for ALL and LRU lookup cache, esp reducing
> >>> > >>>> IO.
> >>> > >>>> Filter pushdown should be the final state and the unified way to both
> >>> > >>>> support pruning ALL cache and LRU cache,
> >>> > >>>> so I think we should make effort in this direction. If we need to support
> >>> > >>>> filter pushdown for ALL cache anyway, why not use
> >>> > >>>> it for LRU cache as well? Either way, as we decide to implement the cache
> >>> > >>>> in the framework, we have the chance to support
> >>> > >>>> filter on cache anytime. This is an optimization and it doesn't affect the
> >>> > >>>> public API. I think we can create a JIRA issue to
> >>> > >>>> discuss it when the FLIP is accepted.
> >>> > >>>>
> >>> > >>>> 4) The idea to support ALL cache is similar to your proposal.
> >>> > >>>> In the first version, we will only support InputFormat, SourceFunction for
> >>> > >>>> cache all (invoke InputFormat in join operator).
> >>> > >>>> For FLIP-27 source, we need to join a true source operator instead of
> >>> > >>>> calling it embedded in the join operator.
> >>> > >>>> However, this needs another FLIP to support the re-scan ability for FLIP-27
> >>> > >>>> Source, and this can be a large work.
> >>> > >>>> In order to not block this issue, we can put the effort of FLIP-27 source
> >>> > >>>> integration into future work and integrate
> >>> > >>>> InputFormat&SourceFunction for now.
> >>> > >>>>
> >>> > >>>> I think it's fine to use InputFormat&SourceFunction, as they are not
> >>> > >>>> deprecated, otherwise, we have to introduce another function
> >>> > >>>> similar to them which is meaningless. We need to plan FLIP-27 source
> >>> > >>>> integration ASAP before InputFormat & SourceFunction are deprecated.
> >>> > >>>>
> >>> > >>>> Best,
> >>> > >>>> Jark
> >>> > >>>>
> >>> > >>>>
> >>> > >>>>
> >>> > >>>>
> >>> > >>>>
> >>> > >>>> On Thu, 12 May 2022 at 15:46, Александр Смирнов <sm...@gmail.com>
> >>> > >>>> wrote:
> >>> > >>>>
> >>> > >>>>> Hi Martijn!
> >>> > >>>>>
> >>> > >>>>> Got it. Therefore, the realization with InputFormat is not considered.
> >>> > >>>>> Thanks for clearing that up!
> >>> > >>>>>
> >>> > >>>>> Best regards,
> >>> > >>>>> Smirnov Alexander
> >>> > >>>>>
> >>> > >>>>>> Thu, 12 May 2022 at 14:23, Martijn Visser <ma...@ververica.com>:
> >>> > >>>>>>
> >>> > >>>>>> Hi,
> >>> > >>>>>>
> >>> > >>>>>> With regards to:
> >>> > >>>>>>
> >>> > >>>>>>> But if there are plans to refactor all connectors to FLIP-27
> >>> > >>>>>>
> >>> > >>>>>> Yes, FLIP-27 is the target for all connectors. The old interfaces will be
> >>> > >>>>>> deprecated and connectors will either be refactored to use the new ones
> >>> > >>>>>> or dropped.
> >>> > >>>>>>
> >>> > >>>>>> The caching should work for connectors that are using FLIP-27 interfaces,
> >>> > >>>>>> we should not introduce new features for old interfaces.
> >>> > >>>>>>
> >>> > >>>>>> Best regards,
> >>> > >>>>>>
> >>> > >>>>>> Martijn
> >>> > >>>>>>
> >>> > >>>>>> On Thu, 12 May 2022 at 06:19, Александр Смирнов <sm...@gmail.com>
> >>> > >>>>>> wrote:
> >>> > >>>>>>
> >>> > >>>>>>> Hi Jark!
> >>> > >>>>>>>
> >>> > >>>>>>> Sorry for the late response. I would like to make some comments and
> >>> > >>>>>>> clarify my points.
> >>> > >>>>>>>
> >>> > >>>>>>> 1) I agree with your first statement. I think we can achieve both
> >>> > >>>>>>> advantages this way: put the Cache interface in flink-table-common,
> >>> > >>>>>>> but have implementations of it in flink-table-runtime. Therefore if a
> >>> > >>>>>>> connector developer wants to use existing cache strategies and their
> >>> > >>>>>>> implementations, he can just pass lookupConfig to the planner, but if
> >>> > >>>>>>> he wants to have its own cache implementation in his TableFunction, it
> >>> > >>>>>>> will be possible for him to use the existing interface for this
> >>> > >>>>>>> purpose (we can explicitly point this out in the documentation). In
> >>> > >>>>>>> this way all configs and metrics will be unified. WDYT?
> >>> > >>>>>>>
> >>> > >>>>>>>> If a filter can prune 90% of data in the cache, we will have 90% of
> >>> > >>>>>>>> lookup requests that can never be cached
> >>> > >>>>>>>
> >>> > >>>>>>> 2) Let me clarify the logic of the filter optimization in the case of an LRU cache.
> >>> > >>>>>>> It looks like Cache<RowData, Collection<RowData>>. Here we always
> >>> > >>>>>>> store the response of the dimension table in cache, even after
> >>> > >>>>>>> applying calc function. I.e. if there are no rows after applying
> >>> > >>>>>>> filters to the result of the 'eval' method of TableFunction, we store
> >>> > >>>>>>> the empty list by lookup keys. Therefore the cache line will be
> >>> > >>>>>>> filled, but will require much less memory (in bytes). I.e. we don't
> >>> > >>>>>>> completely filter keys, by which result was pruned, but significantly
> >>> > >>>>>>> reduce required memory to store this result. If the user knows about
> >>> > >>>>>>> this behavior, he can increase the 'max-rows' option before the start
> >>> > >>>>>>> of the job. But actually I came up with the idea that we can do this
> >>> > >>>>>>> automatically by using the 'maximumWeight' and 'weigher' methods of
> >>> > >>>>>>> GuavaCache [1]. Weight can be the size of the collection of rows
> >>> > >>>>>>> (value of cache). Therefore cache can automatically fit much more
> >>> > >>>>>>> records than before.
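The weighing idea above can be illustrated with a plain-Java stand-in for Guava's maximumWeight/weigher mechanism (illustrative code, not proposed API): each entry is weighed by the number of rows it holds, so empty results produced by filter pruning consume almost none of the budget.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Stand-in for a weight-bounded LRU cache: entries are weighed by row count
// instead of being counted one-by-one.
class WeightedLookupCache {
    private final long maxWeight;
    private long currentWeight = 0;
    // access-order LinkedHashMap iterates from least-recently-used to newest
    private final LinkedHashMap<String, List<String>> cache =
            new LinkedHashMap<>(16, 0.75f, true);

    WeightedLookupCache(long maxWeight) {
        this.maxWeight = maxWeight;
    }

    private static long weight(List<String> rows) {
        return Math.max(1, rows.size()); // an empty result still costs a minimum
    }

    void put(String key, List<String> rows) {
        List<String> old = cache.put(key, rows);
        if (old != null) {
            currentWeight -= weight(old);
        }
        currentWeight += weight(rows);
        // evict least-recently-used entries until back under the weight budget
        Iterator<Map.Entry<String, List<String>>> it = cache.entrySet().iterator();
        while (currentWeight > maxWeight && it.hasNext()) {
            Map.Entry<String, List<String>> eldest = it.next();
            if (eldest.getKey().equals(key)) {
                continue; // never evict the entry we just inserted
            }
            currentWeight -= weight(eldest.getValue());
            it.remove();
        }
    }

    List<String> get(String key) {
        return cache.get(key);
    }
}
```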
> >>> > >>>>>>>
> >>> > >>>>>>>> Flink SQL has provided a standard way to do filter and projection
> >>> > >>>>>>>> pushdown, i.e., SupportsFilterPushDown and SupportsProjectionPushDown.
> >>> > >>>>>>>> Jdbc/hive/HBase haven't implemented the interfaces; that doesn't mean
> >>> > >>>>>>>> they are hard to implement.
> >>> > >>>>>>>
> >>> > >>>>>>> It's debatable how difficult it will be to implement filter pushdown.
> >>> > >>>>>>> But I think the fact that currently there is no database connector
> >>> > >>>>>>> with filter pushdown at least means that this feature won't be
> >>> > >>>>>>> supported soon in connectors. Moreover, if we talk about other
> >>> > >>>>>>> connectors (not in Flink repo), their databases might not support all
> >>> > >>>>>>> Flink filters (or not support filters at all). I think users are
> >>> > >>>>>>> interested in supporting the cache filter optimization independently of
> >>> > >>>>>>> supporting other features and solving more complex problems (or
> >>> > >>>>>>> problems that cannot be solved at all).
> >>> > >>>>>>>
> >>> > >>>>>>> 3) I agree with your third statement. Actually in our internal version
> >>> > >>>>>>> I also tried to unify the logic of scanning and reloading data from
> >>> > >>>>>>> connectors. But unfortunately, I didn't find a way to unify the logic
> >>> > >>>>>>> of all ScanRuntimeProviders (InputFormat, SourceFunction, Source,...)
> >>> > >>>>>>> and reuse it in reloading ALL cache. As a result I settled on using
> >>> > >>>>>>> InputFormat, because it was used for scanning in all lookup
> >>> > >>>>>>> connectors. (I didn't know that there are plans to deprecate
> >>> > >>>>>>> InputFormat in favor of FLIP-27 Source). IMO usage of FLIP-27 source
> >>> > >>>>>>> in ALL caching is not good idea, because this source was designed to
> >>> > >>>>>>> work in distributed environment (SplitEnumerator on JobManager and
> >>> > >>>>>>> SourceReaders on TaskManagers), not in one operator (lookup join
> >>> > >>>>>>> operator in our case). There is even no direct way to pass splits from
> >>> > >>>>>>> SplitEnumerator to SourceReader (this logic works through
> >>> > >>>>>>> SplitEnumeratorContext, which requires
> >>> > >>>>>>> OperatorCoordinator.SubtaskGateway to send AddSplitEvents). Usage of
> >>> > >>>>>>> InputFormat for ALL cache seems much more clearer and easier. But if
> >>> > >>>>>>> there are plans to refactor all connectors to FLIP-27, I have the
> >>> > >>>>>>> following ideas: maybe we can refuse from lookup join ALL cache in
> >>> > >>>>>>> favor of simple join with multiple scanning of batch source? The point
> >>> > >>>>>>> is that the only difference between lookup join ALL cache and simple
> >>> > >>>>>>> join with batch source is that in the first case scanning is performed
> >>> > >>>>>>> multiple times, in between which state (cache) is cleared (correct me
> >>> > >>>>>>> if I'm wrong). So what if we extend the functionality of simple join
> >>> > >>>>>>> to support state reloading + extend the functionality of scanning
> >>> > >>>>>>> batch source multiple times (this one should be easy with new FLIP-27
> >>> > >>>>>>> source, that unifies streaming/batch reading - we will need to change
> >>> > >>>>>>> only SplitEnumerator, which will pass splits again after some TTL).
> >>> > >>>>>>> WDYT? I must say that this looks like a long-term goal and will make
> >>> > >>>>>>> the scope of this FLIP even larger than you said. Maybe we can limit
> >>> > >>>>>>> ourselves to a simpler solution now (InputFormats).
> >>> > >>>>>>>
> >>> > >>>>>>> So to sum up, my points is like this:
> >>> > >>>>>>> 1) There is a way to make both concise and flexible interfaces for
> >>> > >>>>>>> caching in lookup join.
> >>> > >>>>>>> 2) Cache filters optimization is important both in LRU and ALL caches.
> >>> > >>>>>>> 3) It is unclear when filter pushdown will be supported in Flink
> >>> > >>>>>>> connectors; some of the connectors might not have the opportunity to
> >>> > >>>>>>> support filter pushdown, and as far as I know, filter pushdown currently
> >>> > >>>>>>> works only for scanning (not lookup). So the cache filter + projection
> >>> > >>>>>>> optimization should be independent of other features.
> >>> > >>>>>>> 4) ALL cache realization is a complex topic that involves multiple
> >>> > >>>>>>> aspects of how Flink is developing. Refusing from InputFormat in favor
> >>> > >>>>>>> of FLIP-27 Source will make ALL cache realization really complex and
> >>> > >>>>>>> not clear, so maybe instead of that we can extend the functionality of
> >>> > >>>>>>> simple join or not refuse from InputFormat in case of lookup join ALL
> >>> > >>>>>>> cache?
> >>> > >>>>>>>
> >>> > >>>>>>> Best regards,
> >>> > >>>>>>> Smirnov Alexander
> >>> > >>>>>>>
> >>> > >>>>>>>
> >>> > >>>>>>>
> >>> > >>>>>>>
> >>> > >>>>>>>
> >>> > >>>>>>>
> >>> > >>>>>>>
> >>> > >>>>>>> [1] https://guava.dev/releases/18.0/api/docs/com/google/common/cache/CacheBuilder.html#weigher(com.google.common.cache.Weigher)
> >>> > >>>>>>>
> >>> > >>>>>>> Thu, 5 May 2022 at 20:34, Jark Wu <im...@gmail.com>:
> >>> > >>>>>>>>
> >>> > >>>>>>>> It's great to see the active discussion! I want to share my ideas:
> >>> > >>>>>>>>
> >>> > >>>>>>>> 1) implement the cache in framework vs. connectors base
> >>> > >>>>>>>> I don't have a strong opinion on this. Both ways should work (e.g.,
> >>> > >>>>>>>> cache pruning, compatibility).
> >>> > >>>>>>>> The framework way can provide more concise interfaces.
> >>> > >>>>>>>> The connector base way can define more flexible cache
> >>> > >>>>>>>> strategies/implementations.
> >>> > >>>>>>>> We are still investigating a way to see if we can have both
> >>> > >>>>>>>> advantages.
> >>> > >>>>>>>> We should reach a consensus that the way should be a final state,
> >>> > >>>>>>>> and we are on the path to it.
> >>> > >>>>>>>>
> >>> > >>>>>>>> 2) filters and projections pushdown:
> >>> > >>>>>>>> I agree with Alex that filter pushdown into the cache can benefit
> >>> > >>>>>>>> ALL cache a lot.
> >>> > >>>>>>>> However, this is not true for LRU cache. Connectors use the cache to
> >>> > >>>>>>>> reduce IO requests to databases for better throughput.
> >>> > >>>>>>>> If a filter can prune 90% of data in the cache, we will have 90% of
> >>> > >>>>>>>> lookup requests that can never be cached and will hit the databases
> >>> > >>>>>>>> directly. That means the cache is meaningless in this case.
> >>> > >>>>>>>>
> >>> > >>>>>>>> IMO, Flink SQL has provided a standard way to do filters and projects
> >>> > >>>>>>>> pushdown, i.e., SupportsFilterPushDown and
> >>> > >>>>> SupportsProjectionPushDown.
> >>> > >>>>>>>> Jdbc/hive/HBase haven't implemented the interfaces, but that doesn't mean it's
> >>> > >>>>> hard
> >>> > >>>>>>> to
> >>> > >>>>>>>> implement.
> >>> > >>>>>>>> They should implement the pushdown interfaces to reduce IO and the
> >>> > >>>>> cache
> >>> > >>>>>>>> size.
> >>> > >>>>>>>> That should be a final state that the scan source and lookup source
> >>> > >>>>> share
> >>> > >>>>>>>> the exact pushdown implementation.
> >>> > >>>>>>>> I don't see why we need to duplicate the pushdown logic in caches,
> >>> > >>>>> which
> >>> > >>>>>>>> will complicate the lookup join design.
> >>> > >>>>>>>>
> >>> > >>>>>>>> 3) ALL cache abstraction
> >>> > >>>>>>>> All cache might be the most challenging part of this FLIP. We have
> >>> > >>>>> never
> >>> > >>>>>>>> provided a reload-lookup public interface.
> >>> > >>>>>>>> Currently, we put the reload logic in the "eval" method of
> >>> > >>>>> TableFunction.
> >>> > >>>>>>>> That's hard for some sources (e.g., Hive).
> >>> > >>>>>>>> Ideally, connector implementation should share the logic of reload
> >>> > >>>>> and
> >>> > >>>>>>>> scan, i.e. ScanTableSource with InputFormat/SourceFunction/FLIP-27
> >>> > >>>>>>> Source.
> >>> > >>>>>>>> However, InputFormat/SourceFunction are deprecated, and the FLIP-27
> >>> > >>>>>>> source
> >>> > >>>>>>>> is deeply coupled with SourceOperator.
> >>> > >>>>>>>> If we want to invoke the FLIP-27 source in LookupJoin, this may make
> >>> > >>>>> the
> >>> > >>>>>>>> scope of this FLIP much larger.
> >>> > >>>>>>>> We are still investigating how to abstract the ALL cache logic and
> >>> > >>>>> reuse
> >>> > >>>>>>>> the existing source interfaces.
> >>> > >>>>>>>>
> >>> > >>>>>>>>
> >>> > >>>>>>>> Best,
> >>> > >>>>>>>> Jark
> >>> > >>>>>>>>
> >>> > >>>>>>>>
> >>> > >>>>>>>>
> >>> > >>>>>>>> On Thu, 5 May 2022 at 20:22, Roman Boyko <ro...@gmail.com>
> >>> > >>>>> wrote:
> >>> > >>>>>>>>
> >>> > >>>>>>>>> It's a much more complicated activity and lies out of the scope of
> >>> > >>>>> this
> >>> > >>>>>>>>> improvement. Because such pushdowns should be done for all
> >>> > >>>>>>> ScanTableSource
> >>> > >>>>>>>>> implementations (not only for Lookup ones).
> >>> > >>>>>>>>>
> >>> > >>>>>>>>> On Thu, 5 May 2022 at 19:02, Martijn Visser <
> >>> > >>>>> martijnvisser@apache.org>
> >>> > >>>>>>>>> wrote:
> >>> > >>>>>>>>>
> >>> > >>>>>>>>>> Hi everyone,
> >>> > >>>>>>>>>>
> >>> > >>>>>>>>>> One question regarding "And Alexander correctly mentioned that
> >>> > >>>>> filter
> >>> > >>>>>>>>>> pushdown still is not implemented for jdbc/hive/hbase." -> Would
> >>> > >>>>> an
> >>> > >>>>>>>>>> alternative solution be to actually implement these filter
> >>> > >>>>> pushdowns?
> >>> > >>>>>>> I
> >>> > >>>>>>>>>> can
> >>> > >>>>>>>>>> imagine that there are many more benefits to doing that, outside
> >>> > >>>>> of
> >>> > >>>>>>> lookup
> >>> > >>>>>>>>>> caching and metrics.
> >>> > >>>>>>>>>>
> >>> > >>>>>>>>>> Best regards,
> >>> > >>>>>>>>>>
> >>> > >>>>>>>>>> Martijn Visser
> >>> > >>>>>>>>>> https://twitter.com/MartijnVisser82
> >>> > >>>>>>>>>> https://github.com/MartijnVisser
> >>> > >>>>>>>>>>
> >>> > >>>>>>>>>>
> >>> > >>>>>>>>>> On Thu, 5 May 2022 at 13:58, Roman Boyko <ro...@gmail.com>
> >>> > >>>>>>> wrote:
> >>> > >>>>>>>>>>
> >>> > >>>>>>>>>>> Hi everyone!
> >>> > >>>>>>>>>>>
> >>> > >>>>>>>>>>> Thanks for driving such a valuable improvement!
> >>> > >>>>>>>>>>>
> >>> > >>>>>>>>>>> I do think that a single cache implementation would be a nice
> >>> > >>>>>>> opportunity
> >>> > >>>>>>>>>> for
> >>> > >>>>>>>>>>> users. And it will break the "FOR SYSTEM_TIME AS OF proc_time"
> >>> > >>>>>>> semantics
> >>> > >>>>>>>>>>> anyway - doesn't matter how it will be implemented.
> >>> > >>>>>>>>>>>
> >>> > >>>>>>>>>>> Putting myself in the user's shoes, I can say that:
> >>> > >>>>>>>>>>> 1) I would prefer to have the opportunity to cut off the cache
> >>> > >>>>> size
> >>> > >>>>>>> by
> >>> > >>>>>>>>>>> simply filtering unnecessary data. And the most handy way to do
> >>> > >>>>> it
> >>> > >>>>>>> is
> >>> > >>>>>>>>>> to apply
> >>> > >>>>>>>>>>> it inside LookupRunners. It would be a bit harder to pass it
> >>> > >>>>>>> through the
> >>> > >>>>>>>>>>> LookupJoin node to TableFunction. And Alexander correctly
> >>> > >>>>> mentioned
> >>> > >>>>>>> that
> >>> > >>>>>>>>>>> filter pushdown still is not implemented for jdbc/hive/hbase.
> >>> > >>>>>>>>>>> 2) The ability to set the different caching parameters for
> >>> > >>>>> different
> >>> > >>>>>>>>>> tables
> >>> > >>>>>>>>>>> is quite important. So I would prefer to set it through DDL
> >>> > >>>>> rather
> >>> > >>>>>>> than
> >>> > >>>>>>>>>>> have the same ttl, strategy and other options for all lookup
> >>> > >>>>> tables.
> >>> > >>>>>>>>>>> 3) Providing the cache into the framework really deprives us of
> >>> > >>>>>>>>>>> extensibility (users won't be able to implement their own
> >>> > >>>>> cache).
> >>> > >>>>>>> But
> >>> > >>>>>>>>>> most
> >>> > >>>>>>>>>>> probably it might be solved by creating more different cache
> >>> > >>>>>>> strategies
> >>> > >>>>>>>>>> and
> >>> > >>>>>>>>>>> a wider set of configurations.
> >>> > >>>>>>>>>>>
> >>> > >>>>>>>>>>> All these points are much closer to the schema proposed by
> >>> > >>>>>>> Alexander.
> >>> > >>>>>>>>>>> Qingsheng Ren, please correct me if I'm wrong and all these
> >>> > >>>>>>>>>> facilities
> >>> > >>>>>>>>>>> might be simply implemented in your architecture?
> >>> > >>>>>>>>>>>
> >>> > >>>>>>>>>>> Best regards,
> >>> > >>>>>>>>>>> Roman Boyko
> >>> > >>>>>>>>>>> e.: ro.v.boyko@gmail.com
> >>> > >>>>>>>>>>>
> >>> > >>>>>>>>>>> On Wed, 4 May 2022 at 21:01, Martijn Visser <
> >>> > >>>>>>> martijnvisser@apache.org>
> >>> > >>>>>>>>>>> wrote:
> >>> > >>>>>>>>>>>
> >>> > >>>>>>>>>>>> Hi everyone,
> >>> > >>>>>>>>>>>>
> >>> > >>>>>>>>>>>> I don't have much to chip in, but just wanted to express that
> >>> > >>>>> I
> >>> > >>>>>>> really
> >>> > >>>>>>>>>>>> appreciate the in-depth discussion on this topic and I hope
> >>> > >>>>> that
> >>> > >>>>>>>>>> others
> >>> > >>>>>>>>>>>> will join the conversation.
> >>> > >>>>>>>>>>>>
> >>> > >>>>>>>>>>>> Best regards,
> >>> > >>>>>>>>>>>>
> >>> > >>>>>>>>>>>> Martijn
> >>> > >>>>>>>>>>>>
> >>> > >>>>>>>>>>>> On Tue, 3 May 2022 at 10:15, Александр Смирнов <
> >>> > >>>>>>> smiralexan@gmail.com>
> >>> > >>>>>>>>>>>> wrote:
> >>> > >>>>>>>>>>>>
> >>> > >>>>>>>>>>>>> Hi Qingsheng, Leonard and Jark,
> >>> > >>>>>>>>>>>>>
> >>> > >>>>>>>>>>>>> Thanks for your detailed feedback! However, I have questions
> >>> > >>>>>>> about
> >>> > >>>>>>>>>>>>> some of your statements (maybe I didn't get something?).
> >>> > >>>>>>>>>>>>>
> >>> > >>>>>>>>>>>>>> Caching actually breaks the semantic of "FOR SYSTEM_TIME
> >>> > >>>>> AS OF
> >>> > >>>>>>>>>>>> proc_time”
> >>> > >>>>>>>>>>>>>
> >>> > >>>>>>>>>>>>> I agree that the semantics of "FOR SYSTEM_TIME AS OF
> >>> > >>>>> proc_time"
> >>> > >>>>>>> is
> >>> > >>>>>>>>>> not
> >>> > >>>>>>>>>>>>> fully implemented with caching, but as you said, users go
> >>> > >>>>> on it
> >>> > >>>>>>>>>>>>> consciously to achieve better performance (no one proposed
> >>> > >>>>> to
> >>> > >>>>>>> enable
> >>> > >>>>>>>>>>>>> caching by default, etc.). Or by users do you mean other
> >>> > >>>>>>> developers
> >>> > >>>>>>>>>> of
> >>> > >>>>>>>>>>>>> connectors? In this case developers explicitly specify
> >>> > >>>>> whether
> >>> > >>>>>>> their
> >>> > >>>>>>>>>>>>> connector supports caching or not (in the list of supported
> >>> > >>>>>>>>>> options),
> >>> > >>>>>>>>>>>>> no one makes them do that if they don't want to. So what
> >>> > >>>>>>> exactly is
> >>> > >>>>>>>>>>>>> the difference between implementing caching in modules
> >>> > >>>>>>>>>>>>> flink-table-runtime and in flink-table-common from the
> >>> > >>>>>>> considered
> >>> > >>>>>>>>>>>>> point of view? How does it affect breaking/not breaking
> >>> > >>>>> the
> >>> > >>>>>>>>>>>>> semantics of "FOR SYSTEM_TIME AS OF proc_time"?
> >>> > >>>>>>>>>>>>>
> >>> > >>>>>>>>>>>>>> confront a situation that allows table options in DDL to
> >>> > >>>>>>> control
> >>> > >>>>>>>>>> the
> >>> > >>>>>>>>>>>>> behavior of the framework, which has never happened
> >>> > >>>>> previously
> >>> > >>>>>>> and
> >>> > >>>>>>>>>>> should
> >>> > >>>>>>>>>>>>> be cautious
> >>> > >>>>>>>>>>>>>
> >>> > >>>>>>>>>>>>> If we talk about main differences of semantics of DDL
> >>> > >>>>> options
> >>> > >>>>>>> and
> >>> > >>>>>>>>>>>>> config options("table.exec.xxx"), isn't it about limiting
> >>> > >>>>> the
> >>> > >>>>>>> scope
> >>> > >>>>>>>>>> of
> >>> > >>>>>>>>>>>>> the options + importance for the user business logic rather
> >>> > >>>>> than
> >>> > >>>>>>>>>>>>> specific location of corresponding logic in the framework? I
> >>> > >>>>>>> mean
> >>> > >>>>>>>>>> that
> >>> > >>>>>>>>>>>>> in my design, for example, putting an option with lookup
> >>> > >>>>> cache
> >>> > >>>>>>>>>>>>> strategy in configurations would be the wrong decision,
> >>> > >>>>>>> because it
> >>> > >>>>>>>>>>>>> directly affects the user's business logic (not just
> >>> > >>>>> performance
> >>> > >>>>>>>>>>>>> optimization) + touches just several functions of ONE table
> >>> > >>>>>>> (there
> >>> > >>>>>>>>>> can
> >>> > >>>>>>>>>>>>> be multiple tables with different caches). Does it really
> >>> > >>>>>>> matter for
> >>> > >>>>>>>>>>>>> the user (or someone else) where the logic is located,
> >>> > >>>>> which is
> >>> > >>>>>>>>>>>>> affected by the applied option?
> >>> > >>>>>>>>>>>>> Also I can remember DDL option 'sink.parallelism', which in
> >>> > >>>>>>> some way
> >>> > >>>>>>>>>>>>> "controls the behavior of the framework" and I don't see any
> >>> > >>>>>>> problem
> >>> > >>>>>>>>>>>>> here.
> >>> > >>>>>>>>>>>>>
> >>> > >>>>>>>>>>>>>> introduce a new interface for this all-caching scenario
> >>> > >>>>> and
> >>> > >>>>>>> the
> >>> > >>>>>>>>>>> design
> >>> > >>>>>>>>>>>>> would become more complex
> >>> > >>>>>>>>>>>>>
> >>> > >>>>>>>>>>>>> This is a subject for a separate discussion, but actually
> >>> > >>>>> in our
> >>> > >>>>>>>>>>>>> internal version we solved this problem quite easily - we
> >>> > >>>>> reused
> >>> > >>>>>>>>>>>>> InputFormat class (so there is no need for a new API). The
> >>> > >>>>>>> point is
> >>> > >>>>>>>>>>>>> that currently all lookup connectors use InputFormat for
> >>> > >>>>>>> scanning
> >>> > >>>>>>>>>> the
> >>> > >>>>>>>>>>>>> data in batch mode: HBase, JDBC and even Hive - it uses
> >>> > >>>>> class
> >>> > >>>>>>>>>>>>> PartitionReader, that is actually just a wrapper around
> >>> > >>>>>>> InputFormat.
> >>> > >>>>>>>>>>>>> The advantage of this solution is the ability to reload
> >>> > >>>>> cache
> >>> > >>>>>>> data
> >>> > >>>>>>>>>> in
> >>> > >>>>>>>>>>>>> parallel (number of threads depends on number of
> >>> > >>>>> InputSplits,
> >>> > >>>>>>> but
> >>> > >>>>>>>>>> has
> >>> > >>>>>>>>>>>>> an upper limit). As a result cache reload time significantly
> >>> > >>>>>>> reduces
> >>> > >>>>>>>>>>>>> (as well as time of input stream blocking). I know that
> >>> > >>>>> usually
> >>> > >>>>>>> we
> >>> > >>>>>>>>>> try
> >>> > >>>>>>>>>>>>> to avoid usage of concurrency in Flink code, but maybe this
> >>> > >>>>> one
> >>> > >>>>>>> can
> >>> > >>>>>>>>>> be
> >>> > >>>>>>>>>>>>> an exception. BTW I don't say that it's an ideal solution,
> >>> > >>>>> maybe
> >>> > >>>>>>>>>> there
> >>> > >>>>>>>>>>>>> are better ones.
> >>> > >>>>>>>>>>>>>
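The parallel reload described above can be sketched roughly as follows. This is only an illustration of the idea, not the internal fork's code; `loadSplit` is a hypothetical stand-in for reading one InputSplit:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

/** Sketch: reload an ALL cache by reading splits in parallel (illustrative only). */
class ParallelCacheReloader {
    static <K, V> Map<K, V> reload(
            List<Integer> splitIds,                 // stand-ins for InputSplits
            Function<Integer, Map<K, V>> loadSplit, // hypothetical per-split reader
            int maxThreads) {
        ExecutorService pool =
                Executors.newFixedThreadPool(Math.min(maxThreads, splitIds.size()));
        try {
            ConcurrentHashMap<K, V> cache = new ConcurrentHashMap<>();
            List<Future<?>> futures = new ArrayList<>();
            for (Integer split : splitIds) {
                futures.add(pool.submit(() -> cache.putAll(loadSplit.apply(split))));
            }
            for (Future<?> f : futures) {
                f.get(); // propagate any split-loading failure
            }
            return cache;
        } catch (Exception e) {
            throw new RuntimeException("Cache reload failed", e);
        } finally {
            pool.shutdown();
        }
    }
}
```

The thread count is capped at the number of splits, matching the "upper limit" mentioned above; the input stream only needs to block until `reload` returns.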
> >>> > >>>>>>>>>>>>>> Providing the cache in the framework might introduce
> >>> > >>>>>>> compatibility
> >>> > >>>>>>>>>>>> issues
> >>> > >>>>>>>>>>>>>
> >>> > >>>>>>>>>>>>> It's possible only in cases when the developer of the
> >>> > >>>>> connector
> >>> > >>>>>>>>>> doesn't
> >>> > >>>>>>>>>>>>> properly refactor his code and will use new cache options
> >>> > >>>>>>>>>> incorrectly
> >>> > >>>>>>>>>>>>> (i.e. explicitly provide the same options into 2 different
> >>> > >>>>> code
> >>> > >>>>>>>>>>>>> places). For correct behavior all he will need to do is to
> >>> > >>>>>>> redirect
> >>> > >>>>>>>>>>>>> existing options to the framework's LookupConfig (+ maybe
> >>> > >>>>> add an
> >>> > >>>>>>>>>> alias
> >>> > >>>>>>>>>>>>> for options, if there was different naming), everything
> >>> > >>>>> will be
> >>> > >>>>>>>>>>>>> transparent for users. If the developer doesn't do
> >>> > >>>>> refactoring at
> >>> > >>>>>>> all,
> >>> > >>>>>>>>>>>>> nothing will be changed for the connector because of
> >>> > >>>>> backward
> >>> > >>>>>>>>>>>>> compatibility. Also if a developer wants to use his own
> >>> > >>>>> cache
> >>> > >>>>>>> logic,
> >>> > >>>>>>>>>>>>> he just can refuse to pass some of the configs into the
> >>> > >>>>>>> framework,
> >>> > >>>>>>>>>> and
> >>> > >>>>>>>>>>>>> instead make his own implementation with already existing
> >>> > >>>>>>> configs
> >>> > >>>>>>>>>> and
> >>> > >>>>>>>>>>>>> metrics (but actually I think that it's a rare case).
> >>> > >>>>>>>>>>>>>
> >>> > >>>>>>>>>>>>>> filters and projections should be pushed all the way down
> >>> > >>>>> to
> >>> > >>>>>>> the
> >>> > >>>>>>>>>>> table
> >>> > >>>>>>>>>>>>> function, like what we do in the scan source
> >>> > >>>>>>>>>>>>>
> >>> > >>>>>>>>>>>>> It's the great purpose. But the truth is that the ONLY
> >>> > >>>>> connector
> >>> > >>>>>>>>>> that
> >>> > >>>>>>>>>>>>> supports filter pushdown is FileSystemTableSource
> >>> > >>>>>>>>>>>>> (no database connector supports it currently). Also for some
> >>> > >>>>>>>>>> databases
> >>> > >>>>>>>>>>>>> it's simply impossible to push down such complex filters
> >>> > >>>>> that we
> >>> > >>>>>>> have
> >>> > >>>>>>>>>>>>> in Flink.
> >>> > >>>>>>>>>>>>>
> >>> > >>>>>>>>>>>>>> only applying these optimizations to the cache seems not
> >>> > >>>>>>> quite
> >>> > >>>>>>>>>>> useful
> >>> > >>>>>>>>>>>>>
> >>> > >>>>>>>>>>>>> Filters can cut off an arbitrarily large amount of data
> >>> > >>>>> from the
> >>> > >>>>>>>>>>>>> dimension table. For a simple example, suppose in dimension
> >>> > >>>>>>> table
> >>> > >>>>>>>>>>>>> 'users'
> >>> > >>>>>>>>>>>>> we have column 'age' with values from 20 to 40, and input
> >>> > >>>>> stream
> >>> > >>>>>>>>>>>>> 'clicks' that is ~uniformly distributed by age of users. If
> >>> > >>>>> we
> >>> > >>>>>>> have
> >>> > >>>>>>>>>>>>> filter 'age > 30',
> >>> > >>>>>>>>>>>>> there will be half as much data in the cache. This means the user
> >>> > >>>>> can
> >>> > >>>>>>>>>>>>> increase 'lookup.cache.max-rows' by almost 2 times, which yields a
> >>> > >>>>>>>>>>>>> huge
> >>> > >>>>>>>>>>>>> performance boost. Moreover, this optimization starts to
> >>> > >>>>> really
> >>> > >>>>>>>>>> shine
> >>> > >>>>>>>>>>>>> in 'ALL' cache, where tables without filters and projections
> >>> > >>>>>>> can't
> >>> > >>>>>>>>>> fit
> >>> > >>>>>>>>>>>>> in memory, but with them - can. This opens up additional
> >>> > >>>>>>>>>> possibilities
> >>> > >>>>>>>>>>>>> for users. And this doesn't sound as 'not quite useful'.
> >>> > >>>>>>>>>>>>>
> >>> > >>>>>>>>>>>>> It would be great to hear other voices regarding this topic!
> >>> > >>>>>>> Because
> >>> > >>>>>>>>>>>>> we have quite a lot of controversial points, and I think
> >>> > >>>>> with
> >>> > >>>>>>> the
> >>> > >>>>>>>>>> help
> >>> > >>>>>>>>>>>>> of others it will be easier for us to come to a consensus.
> >>> > >>>>>>>>>>>>>
> >>> > >>>>>>>>>>>>> Best regards,
> >>> > >>>>>>>>>>>>> Smirnov Alexander
> >>> > >>>>>>>>>>>>>
> >>> > >>>>>>>>>>>>>
> >>> > >>>>>>>>>>>>> пт, 29 апр. 2022 г. в 22:33, Qingsheng Ren <
> >>> > >>>>> renqschn@gmail.com
> >>> > >>>>>>>> :
> >>> > >>>>>>>>>>>>>>
> >>> > >>>>>>>>>>>>>> Hi Alexander and Arvid,
> >>> > >>>>>>>>>>>>>>
> >>> > >>>>>>>>>>>>>> Thanks for the discussion and sorry for my late response!
> >>> > >>>>> We
> >>> > >>>>>>> had
> >>> > >>>>>>>>>> an
> >>> > >>>>>>>>>>>>> internal discussion together with Jark and Leonard and I’d
> >>> > >>>>> like
> >>> > >>>>>>> to
> >>> > >>>>>>>>>>>>> summarize our ideas. Instead of implementing the cache
> >>> > >>>>> logic in
> >>> > >>>>>>> the
> >>> > >>>>>>>>>>> table
> >>> > >>>>>>>>>>>>> runtime layer or wrapping around the user-provided table
> >>> > >>>>>>> function,
> >>> > >>>>>>>>>> we
> >>> > >>>>>>>>>>>>> prefer to introduce some new APIs extending TableFunction
> >>> > >>>>> with
> >>> > >>>>>>> these
> >>> > >>>>>>>>>>>>> concerns:
> >>> > >>>>>>>>>>>>>>
> >>> > >>>>>>>>>>>>>> 1. Caching actually breaks the semantic of "FOR
> >>> > >>>>> SYSTEM_TIME
> >>> > >>>>>>> AS OF
> >>> > >>>>>>>>>>>>> proc_time”, because it couldn’t truly reflect the content
> >>> > >>>>> of the
> >>> > >>>>>>>>>> lookup
> >>> > >>>>>>>>>>>>> table at the moment of querying. If users choose to enable
> >>> > >>>>>>> caching
> >>> > >>>>>>>>>> on
> >>> > >>>>>>>>>>> the
> >>> > >>>>>>>>>>>>> lookup table, they implicitly indicate that this breakage is
> >>> > >>>>>>>>>> acceptable
> >>> > >>>>>>>>>>>> in
> >>> > >>>>>>>>>>>>> exchange for the performance. So we prefer not to provide
> >>> > >>>>>>> caching on
> >>> > >>>>>>>>>>> the
> >>> > >>>>>>>>>>>>> table runtime level.
> >>> > >>>>>>>>>>>>>>
> >>> > >>>>>>>>>>>>>> 2. If we make the cache implementation in the framework
> >>> > >>>>>>> (whether
> >>> > >>>>>>>>>> in a
> >>> > >>>>>>>>>>>>> runner or a wrapper around TableFunction), we have to
> >>> > >>>>> confront a
> >>> > >>>>>>>>>>>> situation
> >>> > >>>>>>>>>>>>> that allows table options in DDL to control the behavior of
> >>> > >>>>> the
> >>> > >>>>>>>>>>>> framework,
> >>> > >>>>>>>>>>>>> which has never happened previously and should be cautious.
> >>> > >>>>>>> Under
> >>> > >>>>>>>>>> the
> >>> > >>>>>>>>>>>>> current design the behavior of the framework should only be
> >>> > >>>>>>>>>> specified
> >>> > >>>>>>>>>>> by
> >>> > >>>>>>>>>>>>> configurations (“table.exec.xxx”), and it’s hard to apply
> >>> > >>>>> these
> >>> > >>>>>>>>>> general
> >>> > >>>>>>>>>>>>> configs to a specific table.
> >>> > >>>>>>>>>>>>>>
> >>> > >>>>>>>>>>>>>> 3. We have use cases that lookup source loads and refresh
> >>> > >>>>> all
> >>> > >>>>>>>>>> records
> >>> > >>>>>>>>>>>>> periodically into the memory to achieve high lookup
> >>> > >>>>> performance
> >>> > >>>>>>>>>> (like
> >>> > >>>>>>>>>>>> Hive
> >>> > >>>>>>>>>>>>> connector in the community, and also widely used by our
> >>> > >>>>> internal
> >>> > >>>>>>>>>>>>> connectors). Wrapping the cache around the user’s
> >>> > >>>>> TableFunction
> >>> > >>>>>>>>>> works
> >>> > >>>>>>>>>>>> fine
> >>> > >>>>>>>>>>>>> for LRU caches, but I think we have to introduce a new
> >>> > >>>>>>> interface for
> >>> > >>>>>>>>>>> this
> >>> > >>>>>>>>>>>>> all-caching scenario and the design would become more
> >>> > >>>>> complex.
> >>> > >>>>>>>>>>>>>>
> >>> > >>>>>>>>>>>>>> 4. Providing the cache in the framework might introduce
> >>> > >>>>>>>>>> compatibility
> >>> > >>>>>>>>>>>>> issues to existing lookup sources like there might exist two
> >>> > >>>>>>> caches
> >>> > >>>>>>>>>>> with
> >>> > >>>>>>>>>>>>> totally different strategies if the user incorrectly
> >>> > >>>>> configures
> >>> > >>>>>>> the
> >>> > >>>>>>>>>>> table
> >>> > >>>>>>>>>>>>> (one in the framework and another implemented by the lookup
> >>> > >>>>>>> source).
> >>> > >>>>>>>>>>>>>>
> >>> > >>>>>>>>>>>>>> As for the optimization mentioned by Alexander, I think
> >>> > >>>>>>> filters
> >>> > >>>>>>>>>> and
> >>> > >>>>>>>>>>>>> projections should be pushed all the way down to the table
> >>> > >>>>>>> function,
> >>> > >>>>>>>>>>> like
> >>> > >>>>>>>>>>>>> what we do in the scan source, instead of the runner with
> >>> > >>>>> the
> >>> > >>>>>>> cache.
> >>> > >>>>>>>>>>> The
> >>> > >>>>>>>>>>>>> goal of using cache is to reduce the network I/O and
> >>> > >>>>> pressure
> >>> > >>>>>>> on the
> >>> > >>>>>>>>>>>>> external system, and only applying these optimizations to
> >>> > >>>>> the
> >>> > >>>>>>> cache
> >>> > >>>>>>>>>>> seems
> >>> > >>>>>>>>>>>>> not quite useful.
> >>> > >>>>>>>>>>>>>>
> >>> > >>>>>>>>>>>>>> I made some updates to the FLIP[1] to reflect our ideas.
> >>> > >>>>> We
> >>> > >>>>>>>>>> prefer to
> >>> > >>>>>>>>>>>>> keep the cache implementation as a part of TableFunction,
> >>> > >>>>> and we
> >>> > >>>>>>>>>> could
> >>> > >>>>>>>>>>>>> provide some helper classes (CachingTableFunction,
> >>> > >>>>>>>>>>>> AllCachingTableFunction,
> >>> > >>>>>>>>>>>>> CachingAsyncTableFunction) to developers and regulate
> >>> > >>>>> metrics
> >>> > >>>>>>> of the
> >>> > >>>>>>>>>>>> cache.
> >>> > >>>>>>>>>>>>> Also, I made a POC[2] for your reference.
> >>> > >>>>>>>>>>>>>>
> >>> > >>>>>>>>>>>>>> Looking forward to your ideas!
> >>> > >>>>>>>>>>>>>>
> >>> > >>>>>>>>>>>>>> [1]
> >>> > >>>>>>>>>>>>>
> >>> > >>>>>>>>>>>>
> >>> > >>>>>>>>>>>
> >>> > >>>>>>>>>>
> >>> > >>>>>>>
> >>> > >>>>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-221+Abstraction+for+lookup+source+cache+and+metric
> >>> > >>>>>>>>>>>>>> [2] https://github.com/PatrickRen/flink/tree/FLIP-221
> >>> > >>>>>>>>>>>>>>
> >>> > >>>>>>>>>>>>>> Best regards,
> >>> > >>>>>>>>>>>>>>
> >>> > >>>>>>>>>>>>>> Qingsheng
> >>> > >>>>>>>>>>>>>>
> >>> > >>>>>>>>>>>>>> On Tue, Apr 26, 2022 at 4:45 PM Александр Смирнов <
> >>> > >>>>>>>>>>>> smiralexan@gmail.com>
> >>> > >>>>>>>>>>>>> wrote:
> >>> > >>>>>>>>>>>>>>>
> >>> > >>>>>>>>>>>>>>> Thanks for the response, Arvid!
> >>> > >>>>>>>>>>>>>>>
> >>> > >>>>>>>>>>>>>>> I have few comments on your message.
> >>> > >>>>>>>>>>>>>>>
> >>> > >>>>>>>>>>>>>>>> but could also live with an easier solution as the
> >>> > >>>>> first
> >>> > >>>>>>> step:
> >>> > >>>>>>>>>>>>>>>
> >>> > >>>>>>>>>>>>>>> I think that these 2 ways are mutually exclusive
> >>> > >>>>> (originally
> >>> > >>>>>>>>>>> proposed
> >>> > >>>>>>>>>>>>>>> by Qingsheng and mine), because conceptually they follow
> >>> > >>>>> the
> >>> > >>>>>>> same
> >>> > >>>>>>>>>>>>>>> goal, but implementation details are different. If we
> >>> > >>>>> will
> >>> > >>>>>>> go one
> >>> > >>>>>>>>>>> way,
> >>> > >>>>>>>>>>>>>>> moving to another way in the future will mean deleting
> >>> > >>>>>>> existing
> >>> > >>>>>>>>>> code
> >>> > >>>>>>>>>>>>>>> and once again changing the API for connectors. So I
> >>> > >>>>> think we
> >>> > >>>>>>>>>> should
> >>> > >>>>>>>>>>>>>>> reach a consensus with the community about that and then
> >>> > >>>>> work
> >>> > >>>>>>>>>>> together
> >>> > >>>>>>>>>>>>>>> on this FLIP, i.e. divide the work on tasks for different
> >>> > >>>>>>> parts
> >>> > >>>>>>>>>> of
> >>> > >>>>>>>>>>> the
> >>> > >>>>>>>>>>>>>>> flip (for example, LRU cache unification / introducing
> >>> > >>>>>>> proposed
> >>> > >>>>>>>>>> set
> >>> > >>>>>>>>>>> of
> >>> > >>>>>>>>>>>>>>> metrics / further work…). WDYT, Qingsheng?
> >>> > >>>>>>>>>>>>>>>
> >>> > >>>>>>>>>>>>>>>> as the source will only receive the requests after
> >>> > >>>>> filter
> >>> > >>>>>>>>>>>>>>>
> >>> > >>>>>>>>>>>>>>> Actually if filters are applied to fields of the lookup
> >>> > >>>>>>> table, we
> >>> > >>>>>>>>>>>>>>> must first make the requests, and only after that we can
> >>> > >>>>> filter
> >>> > >>>>>>>>>>> responses,
> >>> > >>>>>>>>>>>>>>> because lookup connectors don't have filter pushdown. So
> >>> > >>>>> if
> >>> > >>>>>>>>>>> filtering
> >>> > >>>>>>>>>>>>>>> is done before caching, there will be far fewer rows in
> >>> > >>>>>>> cache.
> >>> > >>>>>>>>>>>>>>>
> >>> > >>>>>>>>>>>>>>>> @Alexander unfortunately, your architecture is not
> >>> > >>>>> shared.
> >>> > >>>>>>> I
> >>> > >>>>>>>>>> don't
> >>> > >>>>>>>>>>>>> know the
> >>> > >>>>>>>>>>>>>>>
> >>> > >>>>>>>>>>>>>>>> solution to share images to be honest.
> >>> > >>>>>>>>>>>>>>>
> >>> > >>>>>>>>>>>>>>> Sorry for that, I’m a bit new to such kinds of
> >>> > >>>>> conversations
> >>> > >>>>>>> :)
> >>> > >>>>>>>>>>>>>>> I have no write access to the confluence, so I made a
> >>> > >>>>> Jira
> >>> > >>>>>>> issue,
> >>> > >>>>>>>>>>>>>>> where described the proposed changes in more details -
> >>> > >>>>>>>>>>>>>>> https://issues.apache.org/jira/browse/FLINK-27411.
> >>> > >>>>>>>>>>>>>>>
> >>> > >>>>>>>>>>>>>>> Will happy to get more feedback!
> >>> > >>>>>>>>>>>>>>>
> >>> > >>>>>>>>>>>>>>> Best,
> >>> > >>>>>>>>>>>>>>> Smirnov Alexander
> >>> > >>>>>>>>>>>>>>>
> >>> > >>>>>>>>>>>>>>> пн, 25 апр. 2022 г. в 19:49, Arvid Heise <
> >>> > >>>>> arvid@apache.org>:
> >>> > >>>>>>>>>>>>>>>>
> >>> > >>>>>>>>>>>>>>>> Hi Qingsheng,
> >>> > >>>>>>>>>>>>>>>>
> >>> > >>>>>>>>>>>>>>>> Thanks for driving this; the inconsistency was not
> >>> > >>>>>>> satisfying
> >>> > >>>>>>>>>> for
> >>> > >>>>>>>>>>>> me.
> >>> > >>>>>>>>>>>>>>>>
> >>> > >>>>>>>>>>>>>>>> I second Alexander's idea though but could also live
> >>> > >>>>> with
> >>> > >>>>>>> an
> >>> > >>>>>>>>>>> easier
> >>> > >>>>>>>>>>>>>>>> solution as the first step: Instead of making caching
> >>> > >>>>> an
> >>> > >>>>>>>>>>>>> implementation
> >>> > >>>>>>>>>>>>>>>> detail of TableFunction X, rather devise a caching
> >>> > >>>>> layer
> >>> > >>>>>>>>>> around X.
> >>> > >>>>>>>>>>>> So
> >>> > >>>>>>>>>>>>> the
> >>> > >>>>>>>>>>>>>>>> proposal would be a CachingTableFunction that
> >>> > >>>>> delegates to
> >>> > >>>>>>> X in
> >>> > >>>>>>>>>>> case
> >>> > >>>>>>>>>>>>> of
> >>> > >>>>>>>>>>>>>>>> misses and else manages the cache. Lifting it into the
> >>> > >>>>>>> operator
> >>> > >>>>>>>>>>>> model
> >>> > >>>>>>>>>>>>> as
> >>> > >>>>>>>>>>>>>>>> proposed would be even better but is probably
> >>> > >>>>> unnecessary
> >>> > >>>>>>> in
> >>> > >>>>>>>>>> the
> >>> > >>>>>>>>>>>>> first step
> >>> > >>>>>>>>>>>>>>>> for a lookup source (as the source will only receive
> >>> > >>>>> the
> >>> > >>>>>>>>>> requests
> >>> > >>>>>>>>>>>>> after
> >>> > >>>>>>>>>>>>>>>> filter; applying projection may be more interesting to
> >>> > >>>>> save
> >>> > >>>>>>>>>>> memory).
> >>> > >>>>>>>>>>>>>>>>
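The delegation idea above — a caching layer around X rather than inside it — might look roughly like this, using a plain `Function` as a stand-in for Flink's TableFunction (all names here are illustrative, not the FLIP's API):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Sketch: wrap an uncached lookup function "X" with a simple caching layer. */
class CachingLookup<K, V> implements Function<K, List<V>> {
    private final Function<K, List<V>> delegate; // the underlying lookup, "X"
    private final Map<K, List<V>> cache = new HashMap<>();
    int misses = 0; // exposed for illustration; a real impl would report cache metrics

    CachingLookup(Function<K, List<V>> delegate) {
        this.delegate = delegate;
    }

    @Override
    public List<V> apply(K key) {
        // On a hit, serve from the cache; on a miss, delegate to X and remember the result.
        return cache.computeIfAbsent(key, k -> {
            misses++;
            return delegate.apply(k);
        });
    }
}
```

The delegate never sees repeated keys, which is the whole point: the second lookup for the same key is served from the cache without touching the external system.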
> >>> > >>>>>>>>>>>>>>>> Another advantage is that all the changes of this FLIP
> >>> > >>>>>>> would be
> >>> > >>>>>>>>>>>>> limited to
> >>> > >>>>>>>>>>>>>>>> options, no need for new public interfaces. Everything
> >>> > >>>>> else
> >>> > >>>>>>>>>>> remains
> >>> > >>>>>>>>>>>> an
> >>> > >>>>>>>>>>>>>>>> implementation of Table runtime. That means we can
> >>> > >>>>> easily
> >>> > >>>>>>>>>>>> incorporate
> >>> > >>>>>>>>>>>>> the
> >>> > >>>>>>>>>>>>>>>> optimization potential that Alexander pointed out
> >>> > >>>>> later.
> >>> > >>>>>>>>>>>>>>>>
> >>> > >>>>>>>>>>>>>>>> @Alexander unfortunately, your architecture is not
> >>> > >>>>> shared.
> >>> > >>>>>>> I
> >>> > >>>>>>>>>> don't
> >>> > >>>>>>>>>>>>> know the
> >>> > >>>>>>>>>>>>>>>> solution to share images to be honest.
> >>> > >>>>>>>>>>>>>>>>
> >>> > >>>>>>>>>>>>>>>> On Fri, Apr 22, 2022 at 5:04 PM Александр Смирнов <
> >>> > >>>>>>>>>>>>> smiralexan@gmail.com>
> >>> > >>>>>>>>>>>>>>>> wrote:
> >>> > >>>>>>>>>>>>>>>>
> >>> > >>>>>>>>>>>>>>>>> Hi Qingsheng! My name is Alexander, I'm not a
> >>> > >>>>> committer
> >>> > >>>>>>> yet,
> >>> > >>>>>>>>>> but
> >>> > >>>>>>>>>>>> I'd
> >>> > >>>>>>>>>>>>>>>>> really like to become one. And this FLIP really
> >>> > >>>>>>> interested
> >>> > >>>>>>>>>> me.
> >>> > >>>>>>>>>>>>>>>>> Actually I have worked on a similar feature in my
> >>> > >>>>>>> company’s
> >>> > >>>>>>>>>>> Flink
> >>> > >>>>>>>>>>>>>>>>> fork, and we would like to share our thoughts on
> >>> > >>>>> this and
> >>> > >>>>>>>>>> make
> >>> > >>>>>>>>>>>> code
> >>> > >>>>>>>>>>>>>>>>> open source.
> >>> > >>>>>>>>>>>>>>>>>
> >>> > >>>>>>>>>>>>>>>>> I think there is a better alternative than
> >>> > >>>>> introducing an
> >>> > >>>>>>>>>>> abstract
> >>> > >>>>>>>>>>>>>>>>> class for TableFunction (CachingTableFunction). As
> >>> > >>>>> you
> >>> > >>>>>>> know,
> >>> > >>>>>>>>>>>>>>>>> TableFunction exists in the flink-table-common
> >>> > >>>>> module,
> >>> > >>>>>>> which
> >>> > >>>>>>>>>>>>> provides
> >>> > >>>>>>>>>>>>>>>>> only an API for working with tables – it’s very
> >>> > >>>>>>> convenient
> >>> > >>>>>>>>>> for
> >>> > >>>>>>>>>>>>> importing
> >>> > >>>>>>>>>>>>>>>>> in connectors. In turn, CachingTableFunction contains
> >>> > >>>>>>> logic
> >>> > >>>>>>>>>> for
> >>> > >>>>>>>>>>>>>>>>> runtime execution,  so this class and everything
> >>> > >>>>>>> connected
> >>> > >>>>>>>>>> with
> >>> > >>>>>>>>>>> it
> >>> > >>>>>>>>>>>>>>>>> should be located in another module, probably in
> >>> > >>>>>>>>>>>>> flink-table-runtime.
> >>> > >>>>>>>>>>>>>>>>> But this will require connectors to depend on another
> >>> > >>>>>>> module,
> >>> > >>>>>>>>>>>> which
> >>> > >>>>>>>>>>>>>>>>> contains a lot of runtime logic, which doesn’t sound
> >>> > >>>>>>> good.
> >>> > >>>>>>>>>>>>>>>>>
> >>> > >>>>>>>>>>>>>>>>> I suggest adding a new method ‘getLookupConfig’ to
> >>> > >>>>>>>>>>>> LookupTableSource
> >>> > >>>>>>>>>>>>>>>>> or LookupRuntimeProvider to allow connectors to only
> >>> > >>>>> pass
> >>> > >>>>>>>>>>>>>>>>> configurations to the planner, therefore they won’t
> >>> > >>>>>>> depend on
> >>> > >>>>>>>>>>>>> runtime
> >>> > >>>>>>>>>>>>>>>>> realization. Based on these configs planner will
> >>> > >>>>>>> construct a
> >>> > >>>>>>>>>>>> lookup
> >>> > >>>>>>>>>>>>>>>>> join operator with corresponding runtime logic
> >>> > >>>>>>>>>> (ProcessFunctions
> >>> > >>>>>>>>>>>> in
> >>> > >>>>>>>>>>>>>>>>> module flink-table-runtime). Architecture looks like
> >>> > >>>>> in
> >>> > >>>>>>> the
> >>> > >>>>>>>>>>> pinned
> >>> > >>>>>>>>>>>>>>>>> image (LookupConfig class there is actually yours
> >>> > >>>>>>>>>> CacheConfig).
> >>> > >>>>>>>>>>>>>>>>>
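The shape of such a config-only handover could be sketched as below; all names are illustrative, not the interfaces proposed in the FLIP:

```java
import java.time.Duration;

/** Sketch: a config-only handover from connector to planner (names are illustrative). */
class LookupConfig {
    enum CacheStrategy { NONE, LRU, ALL }

    final CacheStrategy strategy;
    final long maxRows;  // bound for the LRU strategy
    final Duration ttl;  // expire-after-write for cached rows

    LookupConfig(CacheStrategy strategy, long maxRows, Duration ttl) {
        this.strategy = strategy;
        this.maxRows = maxRows;
        this.ttl = ttl;
    }
}

/** The connector only declares its caching wishes; the planner builds the runtime operator. */
interface ConfigurableLookupSource {
    LookupConfig getLookupConfig();
}
```

Because the connector returns plain configuration, it keeps depending only on the API module, while all caching runtime logic stays in flink-table-runtime.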
> >>> > >>>>>>>>>>>>>>>>> Classes in flink-table-planner that will be
> >>> > >>>>> responsible
> >>> > >>>>>>> for
> >>> > >>>>>>>>>>> this
> >>> > >>>>>>>>>>>> –
> >>> > >>>>>>>>>>>>>>>>> CommonPhysicalLookupJoin and its subclasses.
> >>> > >>>>>>>>>>>>>>>>> Current classes for lookup join in
> >>> > >>>>> flink-table-runtime
> >>> > >>>>>>> -
> >>> > >>>>>>>>>>>>>>>>> LookupJoinRunner, AsyncLookupJoinRunner,
> >>> > >>>>>>>>>>> LookupJoinRunnerWithCalc,
> >>> > >>>>>>>>>>>>>>>>> AsyncLookupJoinRunnerWithCalc.
> >>> > >>>>>>>>>>>>>>>>>
> >>> > >>>>>>>>>>>>>>>>> I suggest adding classes LookupJoinCachingRunner,
> >>> > >>>>>>>>>>>>>>>>> LookupJoinCachingRunnerWithCalc, etc.
> >>> > >>>>>>>>>>>>>>>>>
> >>> > >>>>>>>>>>>>>>>>> And here comes another more powerful advantage of
> >>> > >>>>> such a
> >>> > >>>>>>>>>>> solution.
> >>> > >>>>>>>>>>>>> If
> >>> > >>>>>>>>>>>>>>>>> we have caching logic on a lower level, we can apply
> >>> > >>>>> some
> >>> > >>>>>>>>>>>>>>>>> optimizations to it. LookupJoinRunnerWithCalc was
> >>> > >>>>> named
> >>> > >>>>>>> like
> >>> > >>>>>>>>>>> this
> >>> > >>>>>>>>>>>>>>>>> because it uses the ‘calc’ function, which actually
> >>> > >>>>>>> mostly
> >>> > >>>>>>>>>>>> consists
> >>> > >>>>>>>>>>>>> of
> >>> > >>>>>>>>>>>>>>>>> filters and projections.
> >>> > >>>>>>>>>>>>>>>>>
> >>> > >>>>>>>>>>>>>>>>> For example, in join table A with lookup table B
> >>> > >>>>>>> condition
> >>> > >>>>>>>>>>> ‘JOIN …
> >>> > >>>>>>>>>>>>> ON
> >>> > >>>>>>>>>>>>>>>>> A.id = B.id AND A.age = B.age + 10 WHERE B.salary >
> >>> > >>>>> 1000’
> >>> > >>>>>>>>>>> ‘calc’
> >>> > >>>>>>>>>>>>>>>>> function will contain filters A.age = B.age + 10 and
> >>> > >>>>>>>>>> B.salary >
> >>> > >>>>>>>>>>>>> 1000.
> >>> > >>>>>>>>>>>>>>>>>
> >>> > >>>>>>>>>>>>>>>>> If we apply this function before storing records in the cache, the
> >>> > >>>>>>>>>>>>>>>>> size of the cache will be significantly reduced: filters = avoid
> >>> > >>>>>>>>>>>>>>>>> storing useless records in cache, projections = reduce the records’
> >>> > >>>>>>>>>>>>>>>>> size. So the initial max number of records in cache can be increased
> >>> > >>>>>>>>>>>>>>>>> by the user.
> >>> > >>>>>>>>>>>>>>>>>
> >>> > >>>>>>>>>>>>>>>>> What do you think about it?
> >>> > >>>>>>>>>>>>>>>>>
> >>> > >>>>>>>>>>>>>>>>>
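The calc-before-caching optimization described above can be sketched as follows. All class and member names here are illustrative, not actual Flink internals; the only point being demonstrated is that running the calc (filter + projection) before populating the cache shrinks what gets stored and avoids re-querying keys whose rows were entirely pruned.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.function.Predicate;
import java.util.stream.Collectors;

/** Illustrative sketch: cache the *calc'ed* lookup result, not the raw rows. */
class CachingLookupRunner<R, P> {
    private final Function<Long, List<R>> lookup;  // e.g. query dim table by B.id
    private final Predicate<R> filter;             // e.g. B.salary > 1000
    private final Function<R, P> projection;       // keep only the needed columns
    private final Map<Long, List<P>> cache = new HashMap<>();

    CachingLookupRunner(Function<Long, List<R>> lookup,
                        Predicate<R> filter,
                        Function<R, P> projection) {
        this.lookup = lookup;
        this.filter = filter;
        this.projection = projection;
    }

    List<P> eval(Long key) {
        // Filtered-out rows are never stored, and an empty list is still
        // cached so pruned keys are not re-queried on the next lookup.
        return cache.computeIfAbsent(key, k ->
                lookup.apply(k).stream()
                        .filter(filter)
                        .map(projection)
                        .collect(Collectors.toList()));
    }
}
```

The cache entry for a fully pruned key costs almost nothing, which is why the user-visible 'max-rows' limit can be raised.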
> >>> > >>>>>>>>>>>>>>>>> On 2022/04/19 02:47:11 Qingsheng Ren wrote:
> >>> > >>>>>>>>>>>>>>>>>> Hi devs,
> >>> > >>>>>>>>>>>>>>>>>>
> >>> > >>>>>>>>>>>>>>>>>> Yuan and I would like to start a discussion about FLIP-221[1], which
> >>> > >>>>>>>>>>>>>>>>>> introduces an abstraction of lookup table cache and its standard
> >>> > >>>>>>>>>>>>>>>>>> metrics.
> >>> > >>>>>>>>>>>>>>>>>>
> >>> > >>>>>>>>>>>>>>>>>> Currently each lookup table source has to implement its own cache to
> >>> > >>>>>>>>>>>>>>>>>> store lookup results, and there isn’t a standard set of metrics for
> >>> > >>>>>>>>>>>>>>>>>> users and developers to tune their jobs with lookup joins, which is a
> >>> > >>>>>>>>>>>>>>>>>> quite common use case in Flink table / SQL.
> >>> > >>>>>>>>>>>>>>>>>>
> >>> > >>>>>>>>>>>>>>>>>> Therefore we propose some new APIs including cache, metrics, wrapper
> >>> > >>>>>>>>>>>>>>>>>> classes of TableFunction and new table options. Please take a look at
> >>> > >>>>>>>>>>>>>>>>>> the FLIP page [1] to get more details. Any suggestions and comments
> >>> > >>>>>>>>>>>>>>>>>> would be appreciated!
> >>> > >>>>>>>>>>>>>>>>>>
> >>> > >>>>>>>>>>>>>>>>>> [1]
> >>> > >>>>>>>>>>>>>>>>>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-221+Abstraction+for+lookup+source+cache+and+metric
> >>> > >>>>>>>>>>>>>>>>>>
> >>> > >>>>>>>>>>>>>>>>>> Best regards,
> >>> > >>>>>>>>>>>>>>>>>>
> >>> > >>>>>>>>>>>>>>>>>> Qingsheng
> >>> > >>>>>>>>>>>>>>>>>>
> >>> > >>>>>>>>>>>>>>>>>>
> >>> > >>>>>>>>>>>>>>>>>
> >>> > >>>>>>>>>>>>>>
> >>> > >>>>>>>>>>>>>>
> >>> > >>>>>>>>>>>>>>
> >>> > >>>>>>>>>>>>>> --
> >>> > >>>>>>>>>>>>>> Best Regards,
> >>> > >>>>>>>>>>>>>>
> >>> > >>>>>>>>>>>>>> Qingsheng Ren
> >>> > >>>>>>>>>>>>>>
> >>> > >>>>>>>>>>>>>> Real-time Computing Team
> >>> > >>>>>>>>>>>>>> Alibaba Cloud
> >>> > >>>>>>>>>>>>>>
> >>> > >>>>>>>>>>>>>> Email: renqschn@gmail.com
> >>> > >>>>>>>>>>>>>
> >>> > >>>>>>>>>>>>
> >>> > >>>>>>>>>>>
> >>> > >>>>>>>>>>
> >>> > >>>>>>>>>
> >>> > >>>>>>>>>
> >>> > >>>>>>>>> --
> >>> > >>>>>>>>> Best regards,
> >>> > >>>>>>>>> Roman Boyko
> >>> > >>>>>>>>> e.: ro.v.boyko@gmail.com
> >>> > >>>>>>>>>
> >>> > >>>>>>>
> >>> > >>>>>
> >>> > >>>>>
> >>> > >>>
> >>> > >>
> >>> > >>
> >>> > >> --
> >>> > >> Best Regards,
> >>> > >>
> >>> > >> Qingsheng Ren
> >>> > >>
> >>> > >> Real-time Computing Team
> >>> > >> Alibaba Cloud
> >>> > >>
> >>> > >> Email: renqschn@gmail.com
> >>> >

Re: [DISCUSS] FLIP-221 Abstraction for lookup source cache and metric

Posted by Александр Смирнов <sm...@gmail.com>.
Hi Qingsheng and Jark,

1. Builders vs 'of'
I understand that builders are used when we have multiple parameters.
I suggested them because we could add parameters later. To keep the
Builder for ScanRuntimeProvider from looking redundant, I can suggest
one more config now - "rescanStartTime".
It's a time in UTC (LocalTime class) when the first reload of the cache
starts. This parameter can be thought of as the 'initialDelay' (the diff
between the current time and rescanStartTime) in the method
ScheduledExecutorService#scheduleWithFixedDelay [1]. It can be very
useful when the dimension table is updated by some other scheduled job
at a certain time. Or when the user simply wants the second scan (the first
cache reload) to be delayed. This option can be used even without
'rescanInterval' - in this case 'rescanInterval' will be one day.
If you are fine with this option, I would be very glad if you would
give me access to edit the FLIP page, so I could add it myself.
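A minimal sketch of how the proposed 'rescanStartTime' could translate into the initialDelay of scheduleWithFixedDelay. The RescanScheduler class and its helper are illustrative names, not part of the FLIP; only ScheduledExecutorService#scheduleWithFixedDelay is an existing JDK API.

```java
import java.time.Duration;
import java.time.LocalTime;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

class RescanScheduler {
    /** Millis from 'now' (UTC) until the next occurrence of 'rescanStartTime' (UTC). */
    static long initialDelayMillis(LocalTime now, LocalTime rescanStartTime) {
        long delay = Duration.between(now, rescanStartTime).toMillis();
        // If the start time has already passed today, wait until it comes
        // around tomorrow.
        return delay >= 0 ? delay : delay + Duration.ofDays(1).toMillis();
    }

    public static void main(String[] args) {
        LocalTime now = LocalTime.of(22, 0);
        LocalTime start = LocalTime.of(3, 0); // dim table refreshed at 03:00 UTC
        long initialDelay = initialDelayMillis(now, start); // 5 hours here
        ScheduledExecutorService executor =
                Executors.newSingleThreadScheduledExecutor();
        executor.scheduleWithFixedDelay(
                () -> System.out.println("reloading ALL cache"),
                initialDelay,
                Duration.ofDays(1).toMillis(), // default rescanInterval of one day
                TimeUnit.MILLISECONDS);
        executor.shutdown();
    }
}
```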

2. Common table options
I also think that FactoryUtil would be overloaded by all the cache
options. But maybe we could unify all the suggested options, not only
the default cache ones? I.e. a class 'LookupOptions' that unifies the
default cache options, rescan options, 'async' and 'maxRetries'. WDYT?

3. Retries
I'm fine with a suggestion close to RetryUtils#tryTimes(times, call).

[1] https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/ScheduledExecutorService.html#scheduleWithFixedDelay-java.lang.Runnable-long-long-java.util.concurrent.TimeUnit-
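The RetryUtils#tryTimes helper mentioned in point 3 could look like the sketch below. This is hypothetical: no such utility exists in Flink at the time of writing, and the exact signature (checked vs. unchecked exceptions, backoff, which failures count as retriable) is up for discussion.

```java
import java.util.function.Supplier;

final class RetryUtils {
    private RetryUtils() {}

    /** Retries 'call' up to 'times' attempts, rethrowing the last failure. */
    static <T> T tryTimes(int times, Supplier<T> call) {
        if (times < 1) {
            throw new IllegalArgumentException("times must be >= 1");
        }
        RuntimeException last = null;
        for (int attempt = 1; attempt <= times; attempt++) {
            try {
                return call.get();
            } catch (RuntimeException e) {
                last = e;
                // A connector could hook custom recovery here before the next
                // attempt, e.g. re-establishing the connection as
                // JdbcRowDataLookupFunction does today.
            }
        }
        throw last;
    }
}
```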

Best regards,
Alexander

ср, 18 мая 2022 г. в 16:04, Qingsheng Ren <re...@gmail.com>:
>
> Hi Jark and Alexander,
>
> Thanks for your comments! I’m also OK to introduce common table options. I prefer to introduce a new DefaultLookupCacheOptions class for holding these option definitions because putting all options into FactoryUtil would make it a bit ”crowded” and not well categorized.
>
> FLIP has been updated according to suggestions above:
> 1. Use static “of” method for constructing RescanRuntimeProvider considering both arguments are required.
> 2. Introduce new table options matching DefaultLookupCacheFactory
>
> Best,
> Qingsheng
>
> On Wed, May 18, 2022 at 2:57 PM Jark Wu <im...@gmail.com> wrote:
>>
>> Hi Alex,
>>
>> 1) retry logic
>> I think we can extract some common retry logic into utilities, e.g. RetryUtils#tryTimes(times, call).
>> This seems independent of this FLIP and can be reused by DataStream users.
>> Maybe we can open an issue to discuss this and where to put it.
>>
>> 2) cache ConfigOptions
>> I'm fine with defining cache config options in the framework.
>> A candidate place to put is FactoryUtil which also includes "sink.parallelism", "format" options.
>>
>> Best,
>> Jark
>>
>>
>> On Wed, 18 May 2022 at 13:52, Александр Смирнов <sm...@gmail.com> wrote:
>>>
>>> Hi Qingsheng,
>>>
>>> Thank you for considering my comments.
>>>
>>> >  there might be custom logic before making retry, such as re-establish the connection
>>>
>>> Yes, I understand that. I meant that such logic can be placed in a
>>> separate function that can be implemented by connectors. Just moving
>>> the retry logic would make the connectors' LookupFunctions more concise and
>>> avoid duplicated code. However, it's a minor change. The decision is up
>>> to you.
>>>
>>> > We decided not to provide common DDL options and to let developers define their own options per connector, as we do now.
>>>
>>> What is the reason for that? One of the main goals of this FLIP was to
>>> unify the configs, wasn't it? I understand that the current cache design
>>> doesn't depend on ConfigOptions, like it did before. But we can still put
>>> these options into the framework, so connectors can reuse them and
>>> avoid code duplication and, what is more significant, avoid possibly
>>> inconsistent option naming. This can be pointed out in the
>>> documentation for connector developers.
>>>
>>> Best regards,
>>> Alexander
>>>
>>> вт, 17 мая 2022 г. в 17:11, Qingsheng Ren <re...@gmail.com>:
>>> >
>>> > Hi Alexander,
>>> >
>>> > Thanks for the review and glad to see we are on the same page! I think you forgot to cc the dev mailing list so I’m also quoting your reply under this email.
>>> >
>>> > >  We can add 'maxRetryTimes' option into this class
>>> >
>>> > In my opinion the retry logic should be implemented in lookup() instead of in LookupFunction#eval(). Retrying is only meaningful under some specific retriable failures, and there might be custom logic before making retry, such as re-establish the connection (JdbcRowDataLookupFunction is an example), so it's more handy to leave it to the connector.
>>> >
>>> > > I don't see DDL options, that were in previous version of FLIP. Do you have any special plans for them?
>>> >
>>> > We decided not to provide common DDL options and to let developers define their own options per connector, as we do now.
>>> >
>>> > The rest of comments sound great and I’ll update the FLIP. Hope we can finalize our proposal soon!
>>> >
>>> > Best,
>>> >
>>> > Qingsheng
>>> >
>>> >
>>> > > On May 17, 2022, at 13:46, Александр Смирнов <sm...@gmail.com> wrote:
>>> > >
>>> > > Hi Qingsheng and devs!
>>> > >
>>> > > I like the overall design of updated FLIP, however I have several
>>> > > suggestions and questions.
>>> > >
>>> > > 1) Introducing LookupFunction as a subclass of TableFunction is a good
>>> > > idea. We can add 'maxRetryTimes' option into this class. 'eval' method
>>> > > of new LookupFunction is great for this purpose. The same is for
>>> > > 'async' case.
>>> > >
>>> > > 2) There might be other configs in future, such as 'cacheMissingKey'
>>> > > in LookupFunctionProvider or 'rescanInterval' in ScanRuntimeProvider.
>>> > > Maybe use Builder pattern in LookupFunctionProvider and
>>> > > RescanRuntimeProvider for more flexibility (use one 'build' method
>>> > > instead of many 'of' methods in future)?
>>> > >
>>> > > 3) What are the plans for existing TableFunctionProvider and
>>> > > AsyncTableFunctionProvider? I think they should be deprecated.
>>> > >
>>> > > 4) Am I right that the current design does not assume usage of
>>> > > user-provided LookupCache in re-scanning? In this case, it is not very
>>> > > clear why do we need methods such as 'invalidate' or 'putAll' in
>>> > > LookupCache.
>>> > >
>>> > > 5) I don't see DDL options, that were in previous version of FLIP. Do
>>> > > you have any special plans for them?
>>> > >
>>> > > If you don't mind, I would be glad to be able to make small
>>> > > adjustments to the FLIP document too. I think it's worth mentioning
>>> > > what optimizations exactly are planned for the future.
>>> > >
>>> > > Best regards,
>>> > > Smirnov Alexander
>>> > >
>>> > > пт, 13 мая 2022 г. в 20:27, Qingsheng Ren <re...@gmail.com>:
>>> > >>
>>> > >> Hi Alexander and devs,
>>> > >>
>>> > >> Thank you very much for the in-depth discussion! As Jark mentioned we were inspired by Alexander's idea and made a refactor on our design. FLIP-221 [1] has been updated to reflect our design now and we are happy to hear more suggestions from you!
>>> > >>
>>> > >> Compared to the previous design:
>>> > >> 1. The lookup cache serves at table runtime level and is integrated as a component of LookupJoinRunner as discussed previously.
>>> > >> 2. Interfaces are renamed and re-designed to reflect the new design.
>>> > >> 3. We separate the all-caching case individually and introduce a new RescanRuntimeProvider to reuse the ability of scanning. We are planning to support SourceFunction / InputFormat for now considering the complexity of FLIP-27 Source API.
>>> > >> 4. A new interface LookupFunction is introduced to make the semantic of lookup more straightforward for developers.
>>> > >>
>>> > >> For replying to Alexander:
>>> > >>> However I'm a little confused whether InputFormat is deprecated or not. Am I right that it will be so in the future, but currently it's not?
>>> > >> Yes you are right. InputFormat is not deprecated for now. I think it will be deprecated in the future but we don't have a clear plan for that.
>>> > >>
>>> > >> Thanks again for the discussion on this FLIP and looking forward to cooperating with you after we finalize the design and interfaces!
>>> > >>
>>> > >> [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-221+Abstraction+for+lookup+source+cache+and+metric
>>> > >>
>>> > >> Best regards,
>>> > >>
>>> > >> Qingsheng
>>> > >>
>>> > >>
>>> > >> On Fri, May 13, 2022 at 12:12 AM Александр Смирнов <sm...@gmail.com> wrote:
>>> > >>>
>>> > >>> Hi Jark, Qingsheng and Leonard!
>>> > >>>
>>> > >>> Glad to see that we came to a consensus on almost all points!
>>> > >>>
>>> > >>> However I'm a little confused whether InputFormat is deprecated or
>>> > >>> not. Am I right that it will be so in the future, but currently it's
>>> > >>> not? Actually I also think that for the first version it's OK to use
>>> > >>> InputFormat in the ALL cache implementation, because supporting the rescan
>>> > >>> ability seems like a very distant prospect. But for this decision we
>>> > >>> need a consensus among all discussion participants.
>>> > >>>
>>> > >>> In general, I don't have anything to argue with in your statements. All
>>> > >>> of them match my ideas. Looking ahead, it would be nice to work
>>> > >>> on this FLIP cooperatively. I've already done a lot of work on lookup
>>> > >>> join caching, with an implementation very close to the one we are
>>> > >>> discussing, and want to share the results of this work. Anyway, looking
>>> > >>> forward to the FLIP update!
>>> > >>>
>>> > >>> Best regards,
>>> > >>> Smirnov Alexander
>>> > >>>
>>> > >>> чт, 12 мая 2022 г. в 17:38, Jark Wu <im...@gmail.com>:
>>> > >>>>
>>> > >>>> Hi Alex,
>>> > >>>>
>>> > >>>> Thanks for summarizing your points.
>>> > >>>>
>>> > >>>> In the past week, Qingsheng, Leonard, and I have discussed it several times
>>> > >>>> and we have totally refactored the design.
>>> > >>>> I'm glad to say we have reached a consensus on many of your points!
>>> > >>>> Qingsheng is still working on updating the design docs and maybe can be
>>> > >>>> available in the next few days.
>>> > >>>> I will share some conclusions from our discussions:
>>> > >>>>
>>> > >>>> 1) we have refactored the design towards to "cache in framework" way.
>>> > >>>>
>>> > >>>> 2) a "LookupCache" interface for users to customize and a default
>>> > >>>> implementation with a builder for ease of use.
>>> > >>>> This makes it possible to have both flexibility and conciseness.
>>> > >>>>
>>> > >>>> 3) Filter pushdown is important for ALL and LRU lookup cache, esp reducing
>>> > >>>> IO.
>>> > >>>> Filter pushdown should be the final state and the unified way to both
>>> > >>>> support pruning ALL cache and LRU cache,
>>> > >>>> so I think we should make effort in this direction. If we need to support
>>> > >>>> filter pushdown for ALL cache anyway, why not use
>>> > >>>> it for LRU cache as well? Either way, as we decide to implement the cache
>>> > >>>> in the framework, we have the chance to support
>>> > >>>> filter on cache anytime. This is an optimization and it doesn't affect the
>>> > >>>> public API. I think we can create a JIRA issue to
>>> > >>>> discuss it when the FLIP is accepted.
>>> > >>>>
>>> > >>>> 4) The idea to support ALL cache is similar to your proposal.
>>> > >>>> In the first version, we will only support InputFormat, SourceFunction for
>>> > >>>> cache all (invoke InputFormat in join operator).
>>> > >>>> For FLIP-27 source, we need to join a true source operator instead of
>>> > >>>> calling it embedded in the join operator.
>>> > >>>> However, this needs another FLIP to support the re-scan ability for FLIP-27
>>> > >>>> Source, and this can be a large work.
>>> > >>>> In order to not block this issue, we can put the effort of FLIP-27 source
>>> > >>>> integration into future work and integrate
>>> > >>>> InputFormat&SourceFunction for now.
>>> > >>>>
>>> > >>>> I think it's fine to use InputFormat & SourceFunction, as they are not
>>> > >>>> deprecated; otherwise, we would have to introduce another function
>>> > >>>> similar to them, which is meaningless. We need to plan the FLIP-27 source
>>> > >>>> integration ASAP, before InputFormat & SourceFunction are deprecated.
>>> > >>>>
>>> > >>>> Best,
>>> > >>>> Jark
>>> > >>>>
>>> > >>>>
>>> > >>>>
>>> > >>>>
>>> > >>>>
>>> > >>>> On Thu, 12 May 2022 at 15:46, Александр Смирнов <sm...@gmail.com>
>>> > >>>> wrote:
>>> > >>>>
>>> > >>>>> Hi Martijn!
>>> > >>>>>
>>> > >>>>> Got it. Therefore, the realization with InputFormat is not considered.
>>> > >>>>> Thanks for clearing that up!
>>> > >>>>>
>>> > >>>>> Best regards,
>>> > >>>>> Smirnov Alexander
>>> > >>>>>
>>> > >>>>> чт, 12 мая 2022 г. в 14:23, Martijn Visser <ma...@ververica.com>:
>>> > >>>>>>
>>> > >>>>>> Hi,
>>> > >>>>>>
>>> > >>>>>> With regards to:
>>> > >>>>>>
>>> > >>>>>>> But if there are plans to refactor all connectors to FLIP-27
>>> > >>>>>>
>>> > >>>>>> Yes, FLIP-27 is the target for all connectors. The old interfaces will be
>>> > >>>>>> deprecated and connectors will either be refactored to use the new ones
>>> > >>>>> or
>>> > >>>>>> dropped.
>>> > >>>>>>
>>> > >>>>>> The caching should work for connectors that are using FLIP-27 interfaces,
>>> > >>>>>> we should not introduce new features for old interfaces.
>>> > >>>>>>
>>> > >>>>>> Best regards,
>>> > >>>>>>
>>> > >>>>>> Martijn
>>> > >>>>>>
>>> > >>>>>> On Thu, 12 May 2022 at 06:19, Александр Смирнов <sm...@gmail.com>
>>> > >>>>>> wrote:
>>> > >>>>>>
>>> > >>>>>>> Hi Jark!
>>> > >>>>>>>
>>> > >>>>>>> Sorry for the late response. I would like to make some comments and
>>> > >>>>>>> clarify my points.
>>> > >>>>>>>
>>> > >>>>>>> 1) I agree with your first statement. I think we can achieve both
>>> > >>>>>>> advantages this way: put the Cache interface in flink-table-common,
>>> > >>>>>>> but have implementations of it in flink-table-runtime. Therefore if a
>>> > >>>>>>> connector developer wants to use existing cache strategies and their
>>> > >>>>>>> implementations, he can just pass lookupConfig to the planner, but if
>>> > >>>>>>> he wants to have its own cache implementation in his TableFunction, it
>>> > >>>>>>> will be possible for him to use the existing interface for this
>>> > >>>>>>> purpose (we can explicitly point this out in the documentation). In
>>> > >>>>>>> this way all configs and metrics will be unified. WDYT?
>>> > >>>>>>>
>>> > >>>>>>>> If a filter can prune 90% of data in the cache, we will have 90% of
>>> > >>>>>>> lookup requests that can never be cached
>>> > >>>>>>>
>>> > >>>>>>> 2) Let me clarify the logic of the filters optimization in the case of
>>> > >>>>>>> an LRU cache. It looks like Cache<RowData, Collection<RowData>>. Here we
>>> > >>>>>>> always store the response of the dimension table in the cache, even after
>>> > >>>>>>> applying the calc function. I.e. if there are no rows left after applying
>>> > >>>>>>> filters to the result of the 'eval' method of TableFunction, we store an
>>> > >>>>>>> empty list under the lookup keys. Therefore the cache line will still be
>>> > >>>>>>> filled, but will require much less memory (in bytes). I.e. we don't
>>> > >>>>>>> completely filter out keys whose result was pruned, but significantly
>>> > >>>>>>> reduce the memory required to store this result. If the user knows about
>>> > >>>>>>> this behavior, he can increase the 'max-rows' option before the start
>>> > >>>>>>> of the job. But actually I came up with the idea that we can do this
>>> > >>>>>>> automatically by using the 'maximumWeight' and 'weigher' methods of
>>> > >>>>>>> the Guava Cache [1]. The weight can be the size of the collection of
>>> > >>>>>>> rows (the cache value). Therefore the cache can automatically fit many
>>> > >>>>>>> more records than before.
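The weigher idea quoted above can be illustrated with a pure-Java sketch. In practice Guava's CacheBuilder#maximumWeight/#weigher would be used; this standalone version (all names illustrative) just mimics those semantics so that empty, filtered-out results are cheap to keep and the cache is bounded by total row count rather than key count.

```java
import java.util.Collection;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative LRU cache bounded by total weight (row count), not key count. */
class WeightedLruCache<K, V> {
    private final long maximumWeight;
    private long currentWeight = 0;
    // accessOrder=true makes iteration order least-recently-used first.
    private final LinkedHashMap<K, Collection<V>> map =
            new LinkedHashMap<>(16, 0.75f, true);

    WeightedLruCache(long maximumWeight) {
        this.maximumWeight = maximumWeight;
    }

    // Weigh an entry by its row count; an empty (filtered-out) result still
    // costs 1, so misses are remembered but remain cheap.
    private long weigh(Collection<V> rows) {
        return Math.max(1, rows.size());
    }

    void put(K key, Collection<V> rows) {
        Collection<V> old = map.put(key, rows);
        if (old != null) {
            currentWeight -= weigh(old);
        }
        currentWeight += weigh(rows);
        // Evict least-recently-used entries until total weight fits the budget.
        Iterator<Map.Entry<K, Collection<V>>> it = map.entrySet().iterator();
        while (currentWeight > maximumWeight && it.hasNext()) {
            currentWeight -= weigh(it.next().getValue());
            it.remove();
        }
    }

    Collection<V> get(K key) {
        return map.get(key);
    }

    int size() {
        return map.size();
    }
}
```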
>>> > >>>>>>>
>>> > >>>>>>>> Flink SQL has provided a standard way to do filters and projects
>>> > >>>>>>> pushdown, i.e., SupportsFilterPushDown and SupportsProjectionPushDown.
>>> > >>>>>>>> Jdbc/hive/HBase haven't implemented the interfaces, don't mean it's
>>> > >>>>> hard
>>> > >>>>>>> to implement.
>>> > >>>>>>>
>>> > >>>>>>> It's debatable how difficult it will be to implement filter pushdown.
>>> > >>>>>>> But I think the fact that currently there is no database connector
>>> > >>>>>>> with filter pushdown at least means that this feature won't be
>>> > >>>>>>> supported soon in connectors. Moreover, if we talk about other
>>> > >>>>>>> connectors (not in Flink repo), their databases might not support all
>>> > >>>>>>> Flink filters (or not support filters at all). I think users are
>>> > >>>>>>> interested in supporting cache filters optimization  independently of
>>> > >>>>>>> supporting other features and solving more complex problems (or
>>> > >>>>>>> unsolvable at all).
>>> > >>>>>>>
>>> > >>>>>>> 3) I agree with your third statement. Actually in our internal version
>>> > >>>>>>> I also tried to unify the logic of scanning and reloading data from
>>> > >>>>>>> connectors. But unfortunately, I didn't find a way to unify the logic
>>> > >>>>>>> of all ScanRuntimeProviders (InputFormat, SourceFunction, Source,...)
>>> > >>>>>>> and reuse it in reloading ALL cache. As a result I settled on using
>>> > >>>>>>> InputFormat, because it was used for scanning in all lookup
>>> > >>>>>>> connectors. (I didn't know that there are plans to deprecate
>>> > >>>>>>> InputFormat in favor of FLIP-27 Source). IMO usage of FLIP-27 source
>>> > >>>>>>> in ALL caching is not good idea, because this source was designed to
>>> > >>>>>>> work in distributed environment (SplitEnumerator on JobManager and
>>> > >>>>>>> SourceReaders on TaskManagers), not in one operator (lookup join
>>> > >>>>>>> operator in our case). There is even no direct way to pass splits from
>>> > >>>>>>> SplitEnumerator to SourceReader (this logic works through
>>> > >>>>>>> SplitEnumeratorContext, which requires
>>> > >>>>>>> OperatorCoordinator.SubtaskGateway to send AddSplitEvents). Usage of
>>> > >>>>>>> InputFormat for ALL cache seems much more clearer and easier. But if
>>> > >>>>>>> there are plans to refactor all connectors to FLIP-27, I have the
>>> > >>>>>>> following ideas: maybe we can drop the lookup join ALL cache in
>>> > >>>>>>> favor of a simple join with multiple scans of a batch source? The point
>>> > >>>>>>> is that the only difference between lookup join ALL cache and simple
>>> > >>>>>>> join with batch source is that in the first case scanning is performed
>>> > >>>>>>> multiple times, in between which state (cache) is cleared (correct me
>>> > >>>>>>> if I'm wrong). So what if we extend the functionality of simple join
>>> > >>>>>>> to support state reloading + extend the functionality of scanning
>>> > >>>>>>> batch source multiple times (this one should be easy with new FLIP-27
>>> > >>>>>>> source, that unifies streaming/batch reading - we will need to change
>>> > >>>>>>> only SplitEnumerator, which will pass splits again after some TTL).
>>> > >>>>>>> WDYT? I must say that this looks like a long-term goal and will make
>>> > >>>>>>> the scope of this FLIP even larger than you said. Maybe we can limit
>>> > >>>>>>> ourselves to a simpler solution now (InputFormats).
>>> > >>>>>>>
>>> > >>>>>>> So to sum up, my points is like this:
>>> > >>>>>>> 1) There is a way to make both concise and flexible interfaces for
>>> > >>>>>>> caching in lookup join.
>>> > >>>>>>> 2) Cache filters optimization is important both in LRU and ALL caches.
>>> > >>>>>>> 3) It is unclear when filter pushdown will be supported in Flink
>>> > >>>>>>> connectors, some of the connectors might not have the opportunity to
>>> > >>>>>>> support filter pushdown + as I know, currently filter pushdown works
>>> > >>>>>>> only for scanning (not lookup). So cache filters + projections
>>> > >>>>>>> optimization should be independent from other features.
>>> > >>>>>>> 4) The ALL cache implementation is a complex topic that involves multiple
>>> > >>>>>>> aspects of how Flink is developing. Dropping InputFormat in favor
>>> > >>>>>>> of the FLIP-27 Source will make the ALL cache implementation really complex
>>> > >>>>>>> and unclear, so maybe instead we can extend the functionality of
>>> > >>>>>>> the simple join, or keep InputFormat in the case of the lookup join ALL
>>> > >>>>>>> cache?
>>> > >>>>>>>
>>> > >>>>>>> Best regards,
>>> > >>>>>>> Smirnov Alexander
>>> > >>>>>>>
>>> > >>>>>>>
>>> > >>>>>>>
>>> > >>>>>>>
>>> > >>>>>>>
>>> > >>>>>>>
>>> > >>>>>>>
>>> > >>>>>>> [1]
>>> > >>>>>>>
>>> > >>>>> https://guava.dev/releases/18.0/api/docs/com/google/common/cache/CacheBuilder.html#weigher(com.google.common.cache.Weigher)
>>> > >>>>>>>
>>> > >>>>>>> чт, 5 мая 2022 г. в 20:34, Jark Wu <im...@gmail.com>:
>>> > >>>>>>>>
>>> > >>>>>>>> It's great to see the active discussion! I want to share my ideas:
>>> > >>>>>>>>
>>> > >>>>>>>> 1) implement the cache in framework vs. connectors base
>>> > >>>>>>>> I don't have a strong opinion on this. Both ways should work (e.g.,
>>> > >>>>> cache
>>> > >>>>>>>> pruning, compatibility).
>>> > >>>>>>>> The framework way can provide more concise interfaces.
>>> > >>>>>>>> The connector base way can define more flexible cache
>>> > >>>>>>>> strategies/implementations.
>>> > >>>>>>>> We are still investigating a way to see if we can have both
>>> > >>>>> advantages.
>>> > >>>>>>>> We should reach a consensus that the way should be a final state,
>>> > >>>>> and we
>>> > >>>>>>>> are on the path to it.
>>> > >>>>>>>>
>>> > >>>>>>>> 2) filters and projections pushdown:
>>> > >>>>>>>> I agree with Alex that the filter pushdown into cache can benefit a
>>> > >>>>> lot
>>> > >>>>>>> for
>>> > >>>>>>>> ALL cache.
>>> > >>>>>>>> However, this is not true for LRU cache. Connectors use cache to
>>> > >>>>> reduce
>>> > >>>>>>> IO
>>> > >>>>>>>> requests to databases for better throughput.
>>> > >>>>>>>> If a filter can prune 90% of data in the cache, we will have 90% of
>>> > >>>>>>> lookup
>>> > >>>>>>>> requests that can never be cached
>>> > >>>>>>>> and hit directly to the databases. That means the cache is
>>> > >>>>> meaningless in
>>> > >>>>>>>> this case.
>>> > >>>>>>>>
>>> > >>>>>>>> IMO, Flink SQL has provided a standard way to do filters and projects
>>> > >>>>>>>> pushdown, i.e., SupportsFilterPushDown and
>>> > >>>>> SupportsProjectionPushDown.
>>> > >>>>>>>> Jdbc/hive/HBase haven't implemented the interfaces, but that doesn't
>>> > >>>>>>>> mean they are hard to implement.
>>> > >>>>>>>> They should implement the pushdown interfaces to reduce IO and the
>>> > >>>>> cache
>>> > >>>>>>>> size.
>>> > >>>>>>>> That should be a final state that the scan source and lookup source
>>> > >>>>> share
>>> > >>>>>>>> the exact pushdown implementation.
>>> > >>>>>>>> I don't see why we need to duplicate the pushdown logic in caches,
>>> > >>>>> which
>>> > >>>>>>>> will complex the lookup join design.
>>> > >>>>>>>>
>>> > >>>>>>>> 3) ALL cache abstraction
>>> > >>>>>>>> All cache might be the most challenging part of this FLIP. We have
>>> > >>>>> never
>>> > >>>>>>>> provided a reload-lookup public interface.
>>> > >>>>>>>> Currently, we put the reload logic in the "eval" method of
>>> > >>>>> TableFunction.
>>> > >>>>>>>> That's hard for some sources (e.g., Hive).
>>> > >>>>>>>> Ideally, connector implementation should share the logic of reload
>>> > >>>>> and
>>> > >>>>>>>> scan, i.e. ScanTableSource with InputFormat/SourceFunction/FLIP-27
>>> > >>>>>>> Source.
>>> > >>>>>>>> However, InputFormat/SourceFunction are deprecated, and the FLIP-27
>>> > >>>>>>> source
>>> > >>>>>>>> is deeply coupled with SourceOperator.
>>> > >>>>>>>> If we want to invoke the FLIP-27 source in LookupJoin, this may make
>>> > >>>>> the
>>> > >>>>>>>> scope of this FLIP much larger.
>>> > >>>>>>>> We are still investigating how to abstract the ALL cache logic and
>>> > >>>>> reuse
>>> > >>>>>>>> the existing source interfaces.
>>> > >>>>>>>>
>>> > >>>>>>>>
>>> > >>>>>>>> Best,
>>> > >>>>>>>> Jark
>>> > >>>>>>>>
>>> > >>>>>>>>
>>> > >>>>>>>>
>>> > >>>>>>>> On Thu, 5 May 2022 at 20:22, Roman Boyko <ro...@gmail.com>
>>> > >>>>> wrote:
>>> > >>>>>>>>
>>> > >>>>>>>>> It's a much more complicated activity and lies outside the scope of
>>> > >>>>>>>>> this improvement, because such pushdowns should be done for all
>>> > >>>>>>>>> ScanTableSource implementations (not only for lookup ones).
>>> > >>>>>>>>>
>>> > >>>>>>>>> On Thu, 5 May 2022 at 19:02, Martijn Visser <
>>> > >>>>> martijnvisser@apache.org>
>>> > >>>>>>>>> wrote:
>>> > >>>>>>>>>
>>> > >>>>>>>>>> Hi everyone,
>>> > >>>>>>>>>>
>>> > >>>>>>>>>> One question regarding "And Alexander correctly mentioned that
>>> > >>>>> filter
>>> > >>>>>>>>>> pushdown still is not implemented for jdbc/hive/hbase." -> Would
>>> > >>>>> an
>>> > >>>>>>>>>> alternative solution be to actually implement these filter
>>> > >>>>> pushdowns?
>>> > >>>>>>> I
>>> > >>>>>>>>>> can
>>> > >>>>>>>>>> imagine that there are many more benefits to doing that, outside
>>> > >>>>> of
>>> > >>>>>>> lookup
>>> > >>>>>>>>>> caching and metrics.
>>> > >>>>>>>>>>
>>> > >>>>>>>>>> Best regards,
>>> > >>>>>>>>>>
>>> > >>>>>>>>>> Martijn Visser
>>> > >>>>>>>>>> https://twitter.com/MartijnVisser82
>>> > >>>>>>>>>> https://github.com/MartijnVisser
>>> > >>>>>>>>>>
>>> > >>>>>>>>>>
>>> > >>>>>>>>>> On Thu, 5 May 2022 at 13:58, Roman Boyko <ro...@gmail.com>
>>> > >>>>>>> wrote:
>>> > >>>>>>>>>>
>>> > >>>>>>>>>>> Hi everyone!
>>> > >>>>>>>>>>>
>>> > >>>>>>>>>>> Thanks for driving such a valuable improvement!
>>> > >>>>>>>>>>>
>>> > >>>>>>>>>>> I do think that a single cache implementation would be a nice
>>> > >>>>>>>>>>> opportunity for users. And it will break the "FOR SYSTEM_TIME AS OF
>>> > >>>>>>>>>>> proc_time" semantics anyway, no matter how it is implemented.
>>> > >>>>>>>>>>>
>>> > >>>>>>>>>>> Putting myself in the user's shoes, I can say that:
>>> > >>>>>>>>>>> 1) I would prefer to have the opportunity to cut down the cache size by
>>> > >>>>>>>>>>> simply filtering out unnecessary data. And the handiest way to do it is
>>> > >>>>>>>>>>> to apply it inside the LookupRunners. It would be a bit harder to pass
>>> > >>>>>>>>>>> it through the LookupJoin node to the TableFunction. And Alexander
>>> > >>>>>>>>>>> correctly mentioned that filter pushdown is still not implemented for
>>> > >>>>>>>>>>> jdbc/hive/hbase.
>>> > >>>>>>>>>>> 2) The ability to set different caching parameters for different tables
>>> > >>>>>>>>>>> is quite important. So I would prefer to set them through DDL rather
>>> > >>>>>>>>>>> than have the same TTL, strategy and other options for all lookup
>>> > >>>>>>>>>>> tables.
>>> > >>>>>>>>>>> 3) Moving the cache into the framework really deprives us of
>>> > >>>>>>>>>>> extensibility (users won't be able to implement their own cache). But
>>> > >>>>>>>>>>> most probably this can be solved by creating more different cache
>>> > >>>>>>>>>>> strategies and a wider set of configurations.
>>> > >>>>>>>>>>>
>>> > >>>>>>>>>>> All these points are much closer to the schema proposed by Alexander.
>>> > >>>>>>>>>>> Qingsheng Ren, please correct me if I'm wrong - can all these
>>> > >>>>>>>>>>> facilities be simply implemented in your architecture?
>>> > >>>>>>>>>>>
>>> > >>>>>>>>>>> Best regards,
>>> > >>>>>>>>>>> Roman Boyko
>>> > >>>>>>>>>>> e.: ro.v.boyko@gmail.com
>>> > >>>>>>>>>>>
>>> > >>>>>>>>>>> On Wed, 4 May 2022 at 21:01, Martijn Visser <
>>> > >>>>>>> martijnvisser@apache.org>
>>> > >>>>>>>>>>> wrote:
>>> > >>>>>>>>>>>
>>> > >>>>>>>>>>>> Hi everyone,
>>> > >>>>>>>>>>>>
>>> > >>>>>>>>>>>> I don't have much to chip in, but just wanted to express that
>>> > >>>>> I
>>> > >>>>>>> really
>>> > >>>>>>>>>>>> appreciate the in-depth discussion on this topic and I hope
>>> > >>>>> that
>>> > >>>>>>>>>> others
>>> > >>>>>>>>>>>> will join the conversation.
>>> > >>>>>>>>>>>>
>>> > >>>>>>>>>>>> Best regards,
>>> > >>>>>>>>>>>>
>>> > >>>>>>>>>>>> Martijn
>>> > >>>>>>>>>>>>
>>> > >>>>>>>>>>>> On Tue, 3 May 2022 at 10:15, Александр Смирнов <
>>> > >>>>>>> smiralexan@gmail.com>
>>> > >>>>>>>>>>>> wrote:
>>> > >>>>>>>>>>>>
>>> > >>>>>>>>>>>>> Hi Qingsheng, Leonard and Jark,
>>> > >>>>>>>>>>>>>
>>> > >>>>>>>>>>>>> Thanks for your detailed feedback! However, I have questions
>>> > >>>>>>> about
>>> > >>>>>>>>>>>>> some of your statements (maybe I didn't get something?).
>>> > >>>>>>>>>>>>>
>>> > >>>>>>>>>>>>>> Caching actually breaks the semantic of "FOR SYSTEM_TIME
>>> > >>>>> AS OF
>>> > >>>>>>>>>>>> proc_time”
>>> > >>>>>>>>>>>>>
>>> > >>>>>>>>>>>>> I agree that the semantics of "FOR SYSTEM_TIME AS OF
>>> > >>>>> proc_time"
>>> > >>>>>>> is
>>> > >>>>>>>>>> not
>>> > >>>>>>>>>>>>> fully implemented with caching, but as you said, users accept
>>> > >>>>>>>>>>>>> this consciously to achieve better performance (no one proposed
>>> > >>>>> to
>>> > >>>>>>> enable
>>> > >>>>>>>>>>>>> caching by default, etc.). Or by users do you mean other
>>> > >>>>>>> developers
>>> > >>>>>>>>>> of
>>> > >>>>>>>>>>>>> connectors? In this case developers explicitly specify
>>> > >>>>> whether
>>> > >>>>>>> their
>>> > >>>>>>>>>>>>> connector supports caching or not (in the list of supported
>>> > >>>>>>>>>> options),
>>> > >>>>>>>>>>>>> no one makes them do that if they don't want to. So what
>>> > >>>>>>> exactly is
>>> > >>>>>>>>>>>>> the difference between implementing caching in modules
>>> > >>>>>>>>>>>>> flink-table-runtime and in flink-table-common from the
>>> > >>>>>>> considered
>>> > >>>>>>>>>>>>> point of view? How does it affect whether the semantics of
>>> > >>>>>>>>>>>>> "FOR SYSTEM_TIME AS OF proc_time" are broken or not?
>>> > >>>>>>>>>>>>>
>>> > >>>>>>>>>>>>>> confront a situation that allows table options in DDL to
>>> > >>>>>>> control
>>> > >>>>>>>>>> the
>>> > >>>>>>>>>>>>> behavior of the framework, which has never happened
>>> > >>>>> previously
>>> > >>>>>>> and
>>> > >>>>>>>>>>> should
>>> > >>>>>>>>>>>>> be cautious
>>> > >>>>>>>>>>>>>
>>> > >>>>>>>>>>>>> If we talk about main differences of semantics of DDL
>>> > >>>>> options
>>> > >>>>>>> and
>>> > >>>>>>>>>>>>> config options("table.exec.xxx"), isn't it about limiting
>>> > >>>>> the
>>> > >>>>>>> scope
>>> > >>>>>>>>>> of
>>> > >>>>>>>>>>>>> the options + importance for the user business logic rather
>>> > >>>>> than
>>> > >>>>>>>>>>>>> specific location of corresponding logic in the framework? I
>>> > >>>>>>> mean
>>> > >>>>>>>>>> that
>>> > >>>>>>>>>>>>> in my design, for example, putting an option with lookup
>>> > >>>>> cache
>>> > >>>>>>>>>>>>> strategy in configurations would be the wrong decision,
>>> > >>>>>>> because it
>>> > >>>>>>>>>>>>> directly affects the user's business logic (not just
>>> > >>>>> performance
>>> > >>>>>>>>>>>>> optimization) + touches just several functions of ONE table
>>> > >>>>>>> (there
>>> > >>>>>>>>>> can
>>> > >>>>>>>>>>>>> be multiple tables with different caches). Does it really
>>> > >>>>>>> matter for
>>> > >>>>>>>>>>>>> the user (or someone else) where the logic is located,
>>> > >>>>> which is
>>> > >>>>>>>>>>>>> affected by the applied option?
>>> > >>>>>>>>>>>>> Also I can remember DDL option 'sink.parallelism', which in
>>> > >>>>>>> some way
>>> > >>>>>>>>>>>>> "controls the behavior of the framework" and I don't see any
>>> > >>>>>>> problem
>>> > >>>>>>>>>>>>> here.
>>> > >>>>>>>>>>>>>
>>> > >>>>>>>>>>>>>> introduce a new interface for this all-caching scenario
>>> > >>>>> and
>>> > >>>>>>> the
>>> > >>>>>>>>>>> design
>>> > >>>>>>>>>>>>> would become more complex
>>> > >>>>>>>>>>>>>
>>> > >>>>>>>>>>>>> This is a subject for a separate discussion, but actually
>>> > >>>>> in our
>>> > >>>>>>>>>>>>> internal version we solved this problem quite easily - we
>>> > >>>>> reused
>>> > >>>>>>>>>>>>> InputFormat class (so there is no need for a new API). The
>>> > >>>>>>> point is
>>> > >>>>>>>>>>>>> that currently all lookup connectors use InputFormat for
>>> > >>>>>>> scanning
>>> > >>>>>>>>>> the
>>> > >>>>>>>>>>>>> data in batch mode: HBase, JDBC and even Hive - it uses
>>> > >>>>> class
>>> > >>>>>>>>>>>>> PartitionReader, that is actually just a wrapper around
>>> > >>>>>>> InputFormat.
>>> > >>>>>>>>>>>>> The advantage of this solution is the ability to reload
>>> > >>>>> cache
>>> > >>>>>>> data
>>> > >>>>>>>>>> in
>>> > >>>>>>>>>>>>> parallel (number of threads depends on number of
>>> > >>>>> InputSplits,
>>> > >>>>>>> but
>>> > >>>>>>>>>> has
>>> > >>>>>>>>>>>>> an upper limit). As a result cache reload time significantly
>>> > >>>>>>> reduces
>>> > >>>>>>>>>>>>> (as well as time of input stream blocking). I know that
>>> > >>>>> usually
>>> > >>>>>>> we
>>> > >>>>>>>>>> try
>>> > >>>>>>>>>>>>> to avoid usage of concurrency in Flink code, but maybe this
>>> > >>>>> one
>>> > >>>>>>> can
>>> > >>>>>>>>>> be
>>> > >>>>>>>>>>>>> an exception. BTW I don't say that it's an ideal solution,
>>> > >>>>> maybe
>>> > >>>>>>>>>> there
>>> > >>>>>>>>>>>>> are better ones.
>>> > >>>>>>>>>>>>>
>>> > >>>>>>>>>>>>>> Providing the cache in the framework might introduce
>>> > >>>>>>> compatibility
>>> > >>>>>>>>>>>> issues
>>> > >>>>>>>>>>>>>
>>> > >>>>>>>>>>>>> It's possible only in cases when the developer of the
>>> > >>>>> connector
>>> > >>>>>>>>>> won't
>>> > >>>>>>>>>>>>> properly refactor his code and will use new cache options
>>> > >>>>>>>>>> incorrectly
>>> > >>>>>>>>>>>>> (i.e. explicitly provide the same options into 2 different
>>> > >>>>> code
>>> > >>>>>>>>>>>>> places). For correct behavior all he will need to do is to
>>> > >>>>>>> redirect
>>> > >>>>>>>>>>>>> existing options to the framework's LookupConfig (+ maybe
>>> > >>>>> add an
>>> > >>>>>>>>>> alias
>>> > >>>>>>>>>>>>> for options, if there was different naming), everything
>>> > >>>>> will be
>>> > >>>>>>>>>>>>> transparent for users. If the developer won't do
>>> > >>>>> refactoring at
>>> > >>>>>>> all,
>>> > >>>>>>>>>>>>> nothing will be changed for the connector because of
>>> > >>>>> backward
>>> > >>>>>>>>>>>>> compatibility. Also if a developer wants to use his own
>>> > >>>>> cache
>>> > >>>>>>> logic,
>>> > >>>>>>>>>>>>> he just can refuse to pass some of the configs into the
>>> > >>>>>>> framework,
>>> > >>>>>>>>>> and
>>> > >>>>>>>>>>>>> instead make his own implementation with already existing
>>> > >>>>>>> configs
>>> > >>>>>>>>>> and
>>> > >>>>>>>>>>>>> metrics (but actually I think that it's a rare case).
>>> > >>>>>>>>>>>>>
>>> > >>>>>>>>>>>>>> filters and projections should be pushed all the way down
>>> > >>>>> to
>>> > >>>>>>> the
>>> > >>>>>>>>>>> table
>>> > >>>>>>>>>>>>> function, like what we do in the scan source
>>> > >>>>>>>>>>>>>
>>> > >>>>>>>>>>>>> It's the great purpose. But the truth is that the ONLY
>>> > >>>>> connector
>>> > >>>>>>>>>> that
>>> > >>>>>>>>>>>>> supports filter pushdown is FileSystemTableSource
>>> > >>>>>>>>>>>>> (no database connector supports it currently). Also for some
>>> > >>>>>>>>>> databases
>>> > >>>>>>>>>>>>> it's simply impossible to pushdown such complex filters
>>> > >>>>> that we
>>> > >>>>>>> have
>>> > >>>>>>>>>>>>> in Flink.
>>> > >>>>>>>>>>>>>
>>> > >>>>>>>>>>>>>> only applying these optimizations to the cache seems not
>>> > >>>>>>> quite
>>> > >>>>>>>>>>> useful
>>> > >>>>>>>>>>>>>
>>> > >>>>>>>>>>>>> Filters can cut off an arbitrarily large amount of data
>>> > >>>>> from the
>>> > >>>>>>>>>>>>> dimension table. For a simple example, suppose in dimension
>>> > >>>>>>> table
>>> > >>>>>>>>>>>>> 'users'
>>> > >>>>>>>>>>>>> we have column 'age' with values from 20 to 40, and input
>>> > >>>>> stream
>>> > >>>>>>>>>>>>> 'clicks' that is ~uniformly distributed by age of users. If
>>> > >>>>> we
>>> > >>>>>>> have
>>> > >>>>>>>>>>>>> filter 'age > 30',
>>> > >>>>>>>>>>>>> there will be twice less data in cache. This means the user
>>> > >>>>> can
>>> > >>>>>>>>>>>>> increase 'lookup.cache.max-rows' by almost 2 times. It will
>>> > >>>>>>> gain a
>>> > >>>>>>>>>>>>> huge
>>> > >>>>>>>>>>>>> performance boost. Moreover, this optimization starts to
>>> > >>>>> really
>>> > >>>>>>>>>> shine
>>> > >>>>>>>>>>>>> in 'ALL' cache, where tables without filters and projections
>>> > >>>>>>> can't
>>> > >>>>>>>>>> fit
>>> > >>>>>>>>>>>>> in memory, but with them - can. This opens up additional
>>> > >>>>>>>>>> possibilities
>>> > >>>>>>>>>>>>> for users. And this doesn't sound like 'not quite useful'.
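The effect Alexander describes — admitting only rows that pass the join filter, so the same 'lookup.cache.max-rows' budget holds more useful entries — can be sketched roughly like this (class and method names are illustrative, not actual Flink APIs):

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Predicate;

// Sketch of a bounded LRU lookup cache that applies a pushed-down filter
// before admitting rows. Names and structure are hypothetical.
class FilteringLookupCache<K, V> {
    private final Predicate<V> filter;
    private final LinkedHashMap<K, V> cache;

    FilteringLookupCache(int maxRows, Predicate<V> filter) {
        this.filter = filter;
        // access-order LinkedHashMap gives a simple LRU eviction policy
        this.cache = new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                return size() > maxRows;
            }
        };
    }

    // Rows failing the filter are never stored, leaving room for useful rows.
    void put(K key, V row) {
        if (filter.test(row)) {
            cache.put(key, row);
        }
    }

    V get(K key) {
        return cache.get(key);
    }

    int size() {
        return cache.size();
    }
}
```

With the 'age > 30' example from above, loading users aged 20..40 into such a cache stores only half of them, which is exactly why the max-rows limit can be raised.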
>>> > >>>>>>>>>>>>>
>>> > >>>>>>>>>>>>> It would be great to hear other voices regarding this topic!
>>> > >>>>>>> Because
>>> > >>>>>>>>>>>>> we have quite a lot of controversial points, and I think
>>> > >>>>> with
>>> > >>>>>>> the
>>> > >>>>>>>>>> help
>>> > >>>>>>>>>>>>> of others it will be easier for us to come to a consensus.
>>> > >>>>>>>>>>>>>
>>> > >>>>>>>>>>>>> Best regards,
>>> > >>>>>>>>>>>>> Smirnov Alexander
>>> > >>>>>>>>>>>>>
>>> > >>>>>>>>>>>>>
>>> > >>>>>>>>>>>>> пт, 29 апр. 2022 г. в 22:33, Qingsheng Ren <
>>> > >>>>> renqschn@gmail.com
>>> > >>>>>>>> :
>>> > >>>>>>>>>>>>>>
>>> > >>>>>>>>>>>>>> Hi Alexander and Arvid,
>>> > >>>>>>>>>>>>>>
>>> > >>>>>>>>>>>>>> Thanks for the discussion and sorry for my late response!
>>> > >>>>> We
>>> > >>>>>>> had
>>> > >>>>>>>>>> an
>>> > >>>>>>>>>>>>> internal discussion together with Jark and Leonard and I’d
>>> > >>>>> like
>>> > >>>>>>> to
>>> > >>>>>>>>>>>>> summarize our ideas. Instead of implementing the cache
>>> > >>>>> logic in
>>> > >>>>>>> the
>>> > >>>>>>>>>>> table
>>> > >>>>>>>>>>>>> runtime layer or wrapping around the user-provided table
>>> > >>>>>>> function,
>>> > >>>>>>>>>> we
>>> > >>>>>>>>>>>>> prefer to introduce some new APIs extending TableFunction
>>> > >>>>> with
>>> > >>>>>>> these
>>> > >>>>>>>>>>>>> concerns:
>>> > >>>>>>>>>>>>>>
>>> > >>>>>>>>>>>>>> 1. Caching actually breaks the semantic of "FOR
>>> > >>>>> SYSTEM_TIME
>>> > >>>>>>> AS OF
>>> > >>>>>>>>>>>>> proc_time”, because it couldn’t truly reflect the content
>>> > >>>>> of the
>>> > >>>>>>>>>> lookup
>>> > >>>>>>>>>>>>> table at the moment of querying. If users choose to enable
>>> > >>>>>>> caching
>>> > >>>>>>>>>> on
>>> > >>>>>>>>>>> the
>>> > >>>>>>>>>>>>> lookup table, they implicitly indicate that this breakage is
>>> > >>>>>>>>>> acceptable
>>> > >>>>>>>>>>>> in
>>> > >>>>>>>>>>>>> exchange for the performance. So we prefer not to provide
>>> > >>>>>>> caching on
>>> > >>>>>>>>>>> the
>>> > >>>>>>>>>>>>> table runtime level.
>>> > >>>>>>>>>>>>>>
>>> > >>>>>>>>>>>>>> 2. If we make the cache implementation in the framework
>>> > >>>>>>> (whether
>>> > >>>>>>>>>> in a
>>> > >>>>>>>>>>>>> runner or a wrapper around TableFunction), we have to
>>> > >>>>> confront a
>>> > >>>>>>>>>>>> situation
>>> > >>>>>>>>>>>>> that allows table options in DDL to control the behavior of
>>> > >>>>> the
>>> > >>>>>>>>>>>> framework,
>>> > >>>>>>>>>>>>> which has never happened previously and should be cautious.
>>> > >>>>>>> Under
>>> > >>>>>>>>>> the
>>> > >>>>>>>>>>>>> current design the behavior of the framework should only be
>>> > >>>>>>>>>> specified
>>> > >>>>>>>>>>> by
>>> > >>>>>>>>>>>>> configurations (“table.exec.xxx”), and it’s hard to apply
>>> > >>>>> these
>>> > >>>>>>>>>> general
>>> > >>>>>>>>>>>>> configs to a specific table.
>>> > >>>>>>>>>>>>>>
>>> > >>>>>>>>>>>>>> 3. We have use cases where the lookup source loads and refreshes
>>> > >>>>> all
>>> > >>>>>>>>>> records
>>> > >>>>>>>>>>>>> periodically into the memory to achieve high lookup
>>> > >>>>> performance
>>> > >>>>>>>>>> (like
>>> > >>>>>>>>>>>> Hive
>>> > >>>>>>>>>>>>> connector in the community, and also widely used by our
>>> > >>>>> internal
>>> > >>>>>>>>>>>>> connectors). Wrapping the cache around the user’s
>>> > >>>>> TableFunction
>>> > >>>>>>>>>> works
>>> > >>>>>>>>>>>> fine
>>> > >>>>>>>>>>>>> for LRU caches, but I think we have to introduce a new
>>> > >>>>>>> interface for
>>> > >>>>>>>>>>> this
>>> > >>>>>>>>>>>>> all-caching scenario and the design would become more
>>> > >>>>> complex.
>>> > >>>>>>>>>>>>>>
>>> > >>>>>>>>>>>>>> 4. Providing the cache in the framework might introduce
>>> > >>>>>>>>>> compatibility
>>> > >>>>>>>>>>>>> issues to existing lookup sources like there might exist two
>>> > >>>>>>> caches
>>> > >>>>>>>>>>> with
>>> > >>>>>>>>>>>>> totally different strategies if the user incorrectly
>>> > >>>>> configures
>>> > >>>>>>> the
>>> > >>>>>>>>>>> table
>>> > >>>>>>>>>>>>> (one in the framework and another implemented by the lookup
>>> > >>>>>>> source).
>>> > >>>>>>>>>>>>>>
>>> > >>>>>>>>>>>>>> As for the optimization mentioned by Alexander, I think
>>> > >>>>>>> filters
>>> > >>>>>>>>>> and
>>> > >>>>>>>>>>>>> projections should be pushed all the way down to the table
>>> > >>>>>>> function,
>>> > >>>>>>>>>>> like
>>> > >>>>>>>>>>>>> what we do in the scan source, instead of the runner with
>>> > >>>>> the
>>> > >>>>>>> cache.
>>> > >>>>>>>>>>> The
>>> > >>>>>>>>>>>>> goal of using cache is to reduce the network I/O and
>>> > >>>>> pressure
>>> > >>>>>>> on the
>>> > >>>>>>>>>>>>> external system, and only applying these optimizations to
>>> > >>>>> the
>>> > >>>>>>> cache
>>> > >>>>>>>>>>> seems
>>> > >>>>>>>>>>>>> not quite useful.
>>> > >>>>>>>>>>>>>>
>>> > >>>>>>>>>>>>>> I made some updates to the FLIP[1] to reflect our ideas.
>>> > >>>>> We
>>> > >>>>>>>>>> prefer to
>>> > >>>>>>>>>>>>> keep the cache implementation as a part of TableFunction,
>>> > >>>>> and we
>>> > >>>>>>>>>> could
>>> > >>>>>>>>>>>>> provide some helper classes (CachingTableFunction,
>>> > >>>>>>>>>>>> AllCachingTableFunction,
>>> > >>>>>>>>>>>>> CachingAsyncTableFunction) to developers and regulate
>>> > >>>>> metrics
>>> > >>>>>>> of the
>>> > >>>>>>>>>>>> cache.
>>> > >>>>>>>>>>>>> Also, I made a POC[2] for your reference.
>>> > >>>>>>>>>>>>>>
>>> > >>>>>>>>>>>>>> Looking forward to your ideas!
>>> > >>>>>>>>>>>>>>
>>> > >>>>>>>>>>>>>> [1]
>>> > >>>>>>>>>>>>>
>>> > >>>>>>>>>>>>
>>> > >>>>>>>>>>>
>>> > >>>>>>>>>>
>>> > >>>>>>>
>>> > >>>>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-221+Abstraction+for+lookup+source+cache+and+metric
>>> > >>>>>>>>>>>>>> [2] https://github.com/PatrickRen/flink/tree/FLIP-221
>>> > >>>>>>>>>>>>>>
>>> > >>>>>>>>>>>>>> Best regards,
>>> > >>>>>>>>>>>>>>
>>> > >>>>>>>>>>>>>> Qingsheng
>>> > >>>>>>>>>>>>>>
>>> > >>>>>>>>>>>>>> On Tue, Apr 26, 2022 at 4:45 PM Александр Смирнов <
>>> > >>>>>>>>>>>> smiralexan@gmail.com>
>>> > >>>>>>>>>>>>> wrote:
>>> > >>>>>>>>>>>>>>>
>>> > >>>>>>>>>>>>>>> Thanks for the response, Arvid!
>>> > >>>>>>>>>>>>>>>
>>> > >>>>>>>>>>>>>>> I have few comments on your message.
>>> > >>>>>>>>>>>>>>>
>>> > >>>>>>>>>>>>>>>> but could also live with an easier solution as the
>>> > >>>>> first
>>> > >>>>>>> step:
>>> > >>>>>>>>>>>>>>>
>>> > >>>>>>>>>>>>>>> I think that these 2 ways are mutually exclusive
>>> > >>>>> (originally
>>> > >>>>>>>>>>> proposed
>>> > >>>>>>>>>>>>>>> by Qingsheng and mine), because conceptually they follow
>>> > >>>>> the
>>> > >>>>>>> same
>>> > >>>>>>>>>>>>>>> goal, but implementation details are different. If we
>>> > >>>>> will
>>> > >>>>>>> go one
>>> > >>>>>>>>>>> way,
>>> > >>>>>>>>>>>>>>> moving to another way in the future will mean deleting
>>> > >>>>>>> existing
>>> > >>>>>>>>>> code
>>> > >>>>>>>>>>>>>>> and once again changing the API for connectors. So I
>>> > >>>>> think we
>>> > >>>>>>>>>> should
>>> > >>>>>>>>>>>>>>> reach a consensus with the community about that and then
>>> > >>>>> work
>>> > >>>>>>>>>>> together
>>> > >>>>>>>>>>>>>>> on this FLIP, i.e. divide the work on tasks for different
>>> > >>>>>>> parts
>>> > >>>>>>>>>> of
>>> > >>>>>>>>>>> the
>>> > >>>>>>>>>>>>>>> flip (for example, LRU cache unification / introducing
>>> > >>>>>>> proposed
>>> > >>>>>>>>>> set
>>> > >>>>>>>>>>> of
>>> > >>>>>>>>>>>>>>> metrics / further work…). WDYT, Qingsheng?
>>> > >>>>>>>>>>>>>>>
>>> > >>>>>>>>>>>>>>>> as the source will only receive the requests after
>>> > >>>>> filter
>>> > >>>>>>>>>>>>>>>
>>> > >>>>>>>>>>>>>>> Actually if filters are applied to fields of the lookup
>>> > >>>>>>> table, we
>>> > >>>>>>>>>>>>>>> firstly must do requests, and only after that we can
>>> > >>>>> filter
>>> > >>>>>>>>>>> responses,
>>> > >>>>>>>>>>>>>>> because lookup connectors don't have filter pushdown. So
>>> > >>>>> if
>>> > >>>>>>>>>>> filtering
>>> > >>>>>>>>>>>>>>> is done before caching, there will be far fewer rows in the
>>> > >>>>>>> cache.
>>> > >>>>>>>>>>>>>>>
>>> > >>>>>>>>>>>>>>>> @Alexander unfortunately, your architecture is not
>>> > >>>>> shared.
>>> > >>>>>>> I
>>> > >>>>>>>>>> don't
>>> > >>>>>>>>>>>>> know the
>>> > >>>>>>>>>>>>>>>
>>> > >>>>>>>>>>>>>>>> solution to share images to be honest.
>>> > >>>>>>>>>>>>>>>
>>> > >>>>>>>>>>>>>>> Sorry for that, I’m a bit new to such kinds of
>>> > >>>>> conversations
>>> > >>>>>>> :)
>>> > >>>>>>>>>>>>>>> I have no write access to the confluence, so I made a
>>> > >>>>> Jira
>>> > >>>>>>> issue,
>>> > >>>>>>>>>>>>>>> where described the proposed changes in more details -
>>> > >>>>>>>>>>>>>>> https://issues.apache.org/jira/browse/FLINK-27411.
>>> > >>>>>>>>>>>>>>>
>>> > >>>>>>>>>>>>>>> Will happy to get more feedback!
>>> > >>>>>>>>>>>>>>>
>>> > >>>>>>>>>>>>>>> Best,
>>> > >>>>>>>>>>>>>>> Smirnov Alexander
>>> > >>>>>>>>>>>>>>>
>>> > >>>>>>>>>>>>>>> пн, 25 апр. 2022 г. в 19:49, Arvid Heise <
>>> > >>>>> arvid@apache.org>:
>>> > >>>>>>>>>>>>>>>>
>>> > >>>>>>>>>>>>>>>> Hi Qingsheng,
>>> > >>>>>>>>>>>>>>>>
>>> > >>>>>>>>>>>>>>>> Thanks for driving this; the inconsistency was not
>>> > >>>>>>> satisfying
>>> > >>>>>>>>>> for
>>> > >>>>>>>>>>>> me.
>>> > >>>>>>>>>>>>>>>>
>>> > >>>>>>>>>>>>>>>> I second Alexander's idea though but could also live
>>> > >>>>> with
>>> > >>>>>>> an
>>> > >>>>>>>>>>> easier
>>> > >>>>>>>>>>>>>>>> solution as the first step: Instead of making caching
>>> > >>>>> an
>>> > >>>>>>>>>>>>> implementation
>>> > >>>>>>>>>>>>>>>> detail of TableFunction X, rather devise a caching
>>> > >>>>> layer
>>> > >>>>>>>>>> around X.
>>> > >>>>>>>>>>>> So
>>> > >>>>>>>>>>>>> the
>>> > >>>>>>>>>>>>>>>> proposal would be a CachingTableFunction that
>>> > >>>>> delegates to
>>> > >>>>>>> X in
>>> > >>>>>>>>>>> case
>>> > >>>>>>>>>>>>> of
>>> > >>>>>>>>>>>>>>>> misses and else manages the cache. Lifting it into the
>>> > >>>>>>> operator
>>> > >>>>>>>>>>>> model
>>> > >>>>>>>>>>>>> as
>>> > >>>>>>>>>>>>>>>> proposed would be even better but is probably
>>> > >>>>> unnecessary
>>> > >>>>>>> in
>>> > >>>>>>>>>> the
>>> > >>>>>>>>>>>>> first step
>>> > >>>>>>>>>>>>>>>> for a lookup source (as the source will only receive
>>> > >>>>> the
>>> > >>>>>>>>>> requests
>>> > >>>>>>>>>>>>> after
>>> > >>>>>>>>>>>>>>>> filter; applying projection may be more interesting to
>>> > >>>>> save
>>> > >>>>>>>>>>> memory).
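Arvid's proposed caching layer around X can be sketched as follows; here a plain Function stands in for the user-provided table function, and all names are illustrative rather than actual Flink interfaces:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Sketch of a caching layer wrapped around a lookup function "X": the cache
// answers hits locally and only delegates to X on a miss.
class CachingLookup<K, V> {
    private final Function<K, V> delegate; // stands in for the connector's TableFunction
    private final Map<K, V> cache = new HashMap<>();
    int delegateCalls = 0;                 // exposed only to observe behavior in the sketch

    CachingLookup(Function<K, V> delegate) {
        this.delegate = delegate;
    }

    V lookup(K key) {
        V cached = cache.get(key);
        if (cached != null) {
            return cached;                 // hit: the external system is not touched
        }
        delegateCalls++;
        V fetched = delegate.apply(key);   // miss: delegate to X and remember the result
        cache.put(key, fetched);
        return fetched;
    }
}
```

The point of the design is visible in the sketch: repeated lookups of the same key reach the delegate only once, and the wrapper needs no changes to X itself.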
>>> > >>>>>>>>>>>>>>>>
>>> > >>>>>>>>>>>>>>>> Another advantage is that all the changes of this FLIP
>>> > >>>>>>> would be
>>> > >>>>>>>>>>>>> limited to
>>> > >>>>>>>>>>>>>>>> options, no need for new public interfaces. Everything
>>> > >>>>> else
>>> > >>>>>>>>>>> remains
>>> > >>>>>>>>>>>> an
>>> > >>>>>>>>>>>>>>>> implementation of Table runtime. That means we can
>>> > >>>>> easily
>>> > >>>>>>>>>>>> incorporate
>>> > >>>>>>>>>>>>> the
>>> > >>>>>>>>>>>>>>>> optimization potential that Alexander pointed out
>>> > >>>>> later.
>>> > >>>>>>>>>>>>>>>>
>>> > >>>>>>>>>>>>>>>> @Alexander unfortunately, your architecture is not
>>> > >>>>> shared.
>>> > >>>>>>> I
>>> > >>>>>>>>>> don't
>>> > >>>>>>>>>>>>> know the
>>> > >>>>>>>>>>>>>>>> solution to share images to be honest.
>>> > >>>>>>>>>>>>>>>>
>>> > >>>>>>>>>>>>>>>> On Fri, Apr 22, 2022 at 5:04 PM Александр Смирнов <
>>> > >>>>>>>>>>>>> smiralexan@gmail.com>
>>> > >>>>>>>>>>>>>>>> wrote:
>>> > >>>>>>>>>>>>>>>>
>>> > >>>>>>>>>>>>>>>>> Hi Qingsheng! My name is Alexander, I'm not a
>>> > >>>>> committer
>>> > >>>>>>> yet,
>>> > >>>>>>>>>> but
>>> > >>>>>>>>>>>> I'd
>>> > >>>>>>>>>>>>>>>>> really like to become one. And this FLIP really
>>> > >>>>>>> interested
>>> > >>>>>>>>>> me.
>>> > >>>>>>>>>>>>>>>>> Actually I have worked on a similar feature in my
>>> > >>>>>>> company’s
>>> > >>>>>>>>>>> Flink
>>> > >>>>>>>>>>>>>>>>> fork, and we would like to share our thoughts on
>>> > >>>>> this and
>>> > >>>>>>>>>> make
>>> > >>>>>>>>>>>> code
>>> > >>>>>>>>>>>>>>>>> open source.
>>> > >>>>>>>>>>>>>>>>>
>>> > >>>>>>>>>>>>>>>>> I think there is a better alternative than
>>> > >>>>> introducing an
>>> > >>>>>>>>>>> abstract
>>> > >>>>>>>>>>>>>>>>> class for TableFunction (CachingTableFunction). As
>>> > >>>>> you
>>> > >>>>>>> know,
>>> > >>>>>>>>>>>>>>>>> TableFunction exists in the flink-table-common
>>> > >>>>> module,
>>> > >>>>>>> which
>>> > >>>>>>>>>>>>> provides
>>> > >>>>>>>>>>>>>>>>> only an API for working with tables – it’s very
>>> > >>>>>>> convenient
>>> > >>>>>>>>>> for
>>> > >>>>>>>>>>>>> importing
>>> > >>>>>>>>>>>>>>>>> in connectors. In turn, CachingTableFunction contains
>>> > >>>>>>> logic
>>> > >>>>>>>>>> for
>>> > >>>>>>>>>>>>>>>>> runtime execution,  so this class and everything
>>> > >>>>>>> connected
>>> > >>>>>>>>>> with
>>> > >>>>>>>>>>> it
>>> > >>>>>>>>>>>>>>>>> should be located in another module, probably in
>>> > >>>>>>>>>>>>> flink-table-runtime.
>>> > >>>>>>>>>>>>>>>>> But this will require connectors to depend on another
>>> > >>>>>>> module,
>>> > >>>>>>>>>>>> which
>>> > >>>>>>>>>>>>>>>>> contains a lot of runtime logic, which doesn’t sound
>>> > >>>>>>> good.
>>> > >>>>>>>>>>>>>>>>>
>>> > >>>>>>>>>>>>>>>>> I suggest adding a new method ‘getLookupConfig’ to
>>> > >>>>>>>>>>>> LookupTableSource
>>> > >>>>>>>>>>>>>>>>> or LookupRuntimeProvider to allow connectors to only
>>> > >>>>> pass
>>> > >>>>>>>>>>>>>>>>> configurations to the planner, therefore they won’t
>>> > >>>>>>> depend on
>>> > >>>>>>>>>>>>> runtime
>>> > >>>>>>>>>>>>>>>>> realization. Based on these configs planner will
>>> > >>>>>>> construct a
>>> > >>>>>>>>>>>> lookup
>>> > >>>>>>>>>>>>>>>>> join operator with corresponding runtime logic
>>> > >>>>>>>>>> (ProcessFunctions
>>> > >>>>>>>>>>>> in
>>> > >>>>>>>>>>>>>>>>> module flink-table-runtime). Architecture looks like
>>> > >>>>> in
>>> > >>>>>>> the
>>> > >>>>>>>>>>> pinned
>>> > >>>>>>>>>>>>>>>>> image (LookupConfig class there is actually yours
>>> > >>>>>>>>>> CacheConfig).
>>> > >>>>>>>>>>>>>>>>>
>>> > >>>>>>>>>>>>>>>>> Classes in flink-table-planner, that will be
>>> > >>>>> responsible
>>> > >>>>>>> for
>>> > >>>>>>>>>>> this
>>> > >>>>>>>>>>>> –
>>> > >>>>>>>>>>>>>>>>> CommonPhysicalLookupJoin and his inheritors.
>>> > >>>>>>>>>>>>>>>>> Current classes for lookup join in
>>> > >>>>> flink-table-runtime
>>> > >>>>>>> -
>>> > >>>>>>>>>>>>>>>>> LookupJoinRunner, AsyncLookupJoinRunner,
>>> > >>>>>>>>>>> LookupJoinRunnerWithCalc,
>>> > >>>>>>>>>>>>>>>>> AsyncLookupJoinRunnerWithCalc.
>>> > >>>>>>>>>>>>>>>>>
>>> > >>>>>>>>>>>>>>>>> I suggest adding classes LookupJoinCachingRunner,
>>> > >>>>>>>>>>>>>>>>> LookupJoinCachingRunnerWithCalc, etc.
>>> > >>>>>>>>>>>>>>>>>
>>> > >>>>>>>>>>>>>>>>> And here comes another more powerful advantage of
>>> > >>>>> such a
>>> > >>>>>>>>>>> solution.
>>> > >>>>>>>>>>>>> If
>>> > >>>>>>>>>>>>>>>>> we have caching logic on a lower level, we can apply
>>> > >>>>> some
>>> > >>>>>>>>>>>>>>>>> optimizations to it. LookupJoinRunnerWithCalc was
>>> > >>>>> named
>>> > >>>>>>> like
>>> > >>>>>>>>>>> this
>>> > >>>>>>>>>>>>>>>>> because it uses the ‘calc’ function, which actually
>>> > >>>>>>> mostly
>>> > >>>>>>>>>>>> consists
>>> > >>>>>>>>>>>>> of
>>> > >>>>>>>>>>>>>>>>> filters and projections.
>>> > >>>>>>>>>>>>>>>>>
>>> > >>>>>>>>>>>>>>>>> For example, in join table A with lookup table B
>>> > >>>>>>> condition
>>> > >>>>>>>>>>> ‘JOIN …
>>> > >>>>>>>>>>>>> ON
>>> > >>>>>>>>>>>>>>>>> A.id = B.id AND A.age = B.age + 10 WHERE B.salary >
>>> > >>>>> 1000’
>>> > >>>>>>>>>>> ‘calc’
>>> > >>>>>>>>>>>>>>>>> function will contain filters A.age = B.age + 10 and
>>> > >>>>>>>>>> B.salary >
>>> > >>>>>>>>>>>>> 1000.
>>> > >>>>>>>>>>>>>>>>>
>>> > >>>>>>>>>>>>>>>>> If we apply this function before storing records in
>>> > >>>>>>> cache,
>>> > >>>>>>>>>> size
>>> > >>>>>>>>>>> of
>>> > >>>>>>>>>>>>>>>>> cache will be significantly reduced: filters = avoid
>>> > >>>>>>> storing
>>> > >>>>>>>>>>>> useless
>>> > >>>>>>>>>>>>>>>>> records in cache, projections = reduce records’
>>> > >>>>> size. So
>>> > >>>>>>> the
>>> > >>>>>>>>>>>> initial
>>> > >>>>>>>>>>>>>>>>> max number of records in cache can be increased by
>>> > >>>>> the
>>> > >>>>>>> user.
>>> > >>>>>>>>>>>>>>>>>
>>> > >>>>>>>>>>>>>>>>> What do you think about it?
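The calc-before-cache idea above can be sketched as follows; the row layout, class and method names are hypothetical, and a real LookupJoinCachingRunner would operate on Flink's internal RowData rather than plain arrays:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;
import java.util.function.Predicate;

// Sketch of applying the join's "calc" (filter + projection) before caching,
// so fewer and smaller rows are stored.
class CalcBeforeCache<K, R, P> {
    private final Predicate<R> filter;        // e.g. B.salary > 1000
    private final Function<R, P> projection;  // keep only the columns the join needs
    private final Map<K, P> cache = new HashMap<>();

    CalcBeforeCache(Predicate<R> filter, Function<R, P> projection) {
        this.filter = filter;
        this.projection = projection;
    }

    void add(K key, R row) {
        if (filter.test(row)) {               // filtered rows never reach the cache
            cache.put(key, projection.apply(row));
        }
    }

    P get(K key) { return cache.get(key); }
    int size()  { return cache.size(); }
}
```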
>>> > >>>>>>>>>>>>>>>>>
>>> > >>>>>>>>>>>>>>>>>
>>> > >>>>>>>>>>>>>>>>> On 2022/04/19 02:47:11 Qingsheng Ren wrote:
>>> > >>>>>>>>>>>>>>>>>> Hi devs,
>>> > >>>>>>>>>>>>>>>>>>
>>> > >>>>>>>>>>>>>>>>>> Yuan and I would like to start a discussion about
>>> > >>>>>>>>>> FLIP-221[1],
>>> > >>>>>>>>>>>>> which
>>> > >>>>>>>>>>>>>>>>> introduces an abstraction of lookup table cache and
>>> > >>>>> its
>>> > >>>>>>>>>> standard
>>> > >>>>>>>>>>>>> metrics.
>>> > >>>>>>>>>>>>>>>>>>
>>> > >>>>>>>>>>>>>>>>>> Currently each lookup table source should implement
>>> > >>>>>>> their
>>> > >>>>>>>>>> own
>>> > >>>>>>>>>>>>> cache to
>>> > >>>>>>>>>>>>>>>>> store lookup results, and there isn’t a standard of
>>> > >>>>>>> metrics
>>> > >>>>>>>>>> for
>>> > >>>>>>>>>>>>> users and
>>> > >>>>>>>>>>>>>>>>> developers to tuning their jobs with lookup joins,
>>> > >>>>> which
>>> > >>>>>>> is a
>>> > >>>>>>>>>>>> quite
>>> > >>>>>>>>>>>>> common
>>> > >>>>>>>>>>>>>>>>> use case in Flink table / SQL.
>>> > >>>>>>>>>>>>>>>>>>
>>> > >>>>>>>>>>>>>>>>>> Therefore we propose some new APIs including cache,
>>> > >>>>>>>>>> metrics,
>>> > >>>>>>>>>>>>> wrapper
>>> > >>>>>>>>>>>>>>>>> classes of TableFunction and new table options.
>>> > >>>>> Please
>>> > >>>>>>> take a
>>> > >>>>>>>>>>> look
>>> > >>>>>>>>>>>>> at the
>>> > >>>>>>>>>>>>>>>>> FLIP page [1] to get more details. Any suggestions
>>> > >>>>> and
>>> > >>>>>>>>>> comments
>>> > >>>>>>>>>>>>> would be
>>> > >>>>>>>>>>>>>>>>> appreciated!
>>> > >>>>>>>>>>>>>>>>>>
>>> > >>>>>>>>>>>>>>>>>> [1]
>>> > >>>>>>>>>>>>>>>>>
>>> > >>>>>>>>>>>>>
>>> > >>>>>>>>>>>>
>>> > >>>>>>>>>>>
>>> > >>>>>>>>>>
>>> > >>>>>>>
>>> > >>>>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-221+Abstraction+for+lookup+source+cache+and+metric
>>> > >>>>>>>>>>>>>>>>>>
>>> > >>>>>>>>>>>>>>>>>> Best regards,
>>> > >>>>>>>>>>>>>>>>>>
>>> > >>>>>>>>>>>>>>>>>> Qingsheng
>>> > >>>>>>>>>>>>>>>>>>
>>> > >>>>>>>>>>>>>>>>>>
>>> > >>>>>>>>>>>>>>>>>
>>> > >>>>>>>>>>>>>>
>>> > >>>>>>>>>>>>>>
>>> > >>>>>>>>>>>>>>
>>> > >>>>>>>>>>>>>> --
>>> > >>>>>>>>>>>>>> Best Regards,
>>> > >>>>>>>>>>>>>>
>>> > >>>>>>>>>>>>>> Qingsheng Ren
>>> > >>>>>>>>>>>>>>
>>> > >>>>>>>>>>>>>> Real-time Computing Team
>>> > >>>>>>>>>>>>>> Alibaba Cloud
>>> > >>>>>>>>>>>>>>
>>> > >>>>>>>>>>>>>> Email: renqschn@gmail.com
>>> > >>>>>>>>>>>>>
>>> > >>>>>>>>>>>>
>>> > >>>>>>>>>>>
>>> > >>>>>>>>>>
>>> > >>>>>>>>>
>>> > >>>>>>>>>
>>> > >>>>>>>>> --
>>> > >>>>>>>>> Best regards,
>>> > >>>>>>>>> Roman Boyko
>>> > >>>>>>>>> e.: ro.v.boyko@gmail.com
>>> > >>>>>>>>>
>>> > >>>>>>>
>>> > >>>>>
>>> > >>>>>
>>> > >>>
>>> > >>
>>> > >>
>>> > >> --
>>> > >> Best Regards,
>>> > >>
>>> > >> Qingsheng Ren
>>> > >>
>>> > >> Real-time Computing Team
>>> > >> Alibaba Cloud
>>> > >>
>>> > >> Email: renqschn@gmail.com
>>> >

Re: [DISCUSS] FLIP-221 Abstraction for lookup source cache and metric

Posted by Qingsheng Ren <re...@gmail.com>.
Hi Jark and Alexander,

Thanks for your comments! I’m also OK to introduce common table options. I
prefer to introduce a new DefaultLookupCacheOptions class for holding these
option definitions because putting all options into FactoryUtil would make
it a bit ”crowded” and not well categorized.

FLIP has been updated according to suggestions above:
1. Use static “of” method for constructing RescanRuntimeProvider
considering both arguments are required.
2. Introduce new table options matching DefaultLookupCacheFactory

Best,
Qingsheng

On Wed, May 18, 2022 at 2:57 PM Jark Wu <im...@gmail.com> wrote:

> Hi Alex,
>
> 1) retry logic
> I think we can extract some common retry logic into utilities, e.g.
> RetryUtils#tryTimes(times, call).
> This seems independent of this FLIP and can be reused by DataStream users.
> Maybe we can open an issue to discuss this and where to put it.
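A minimal sketch of the suggested helper; note that RetryUtils#tryTimes is only a proposed name here, not an existing Flink API, and a real version would likely accept a Callable and deal with checked exceptions and backoff:

```java
import java.util.function.Supplier;

// Hedged sketch of a generic retry utility as proposed in the thread.
final class RetryUtils {
    static <T> T tryTimes(int times, Supplier<T> call) {
        RuntimeException last = null;
        for (int attempt = 1; attempt <= times; attempt++) {
            try {
                return call.get();
            } catch (RuntimeException e) {
                last = e; // a real implementation might back off or re-establish the connection here
            }
        }
        throw last; // all attempts failed: surface the last failure
    }
}
```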
>
> 2) cache ConfigOptions
> I'm fine with defining cache config options in the framework.
> A candidate place to put is FactoryUtil which also includes
> "sink.parallelism", "format" options.
>
> Best,
> Jark
>
>
> On Wed, 18 May 2022 at 13:52, Александр Смирнов <sm...@gmail.com>
> wrote:
>
>> Hi Qingsheng,
>>
>> Thank you for considering my comments.
>>
>> >  there might be custom logic before making retry, such as re-establish
>> the connection
>>
>> Yes, I understand that. I meant that such logic can be placed in a
>> separate function that can be implemented by connectors. Just moving
>> the retry logic would make the connector's LookupFunction more concise +
>> avoid duplicate code. However, it's a minor change. The decision is up
>> to you.
>>
>> > We decide not to provide common DDL options and let developers
>> define their own options as we do now per connector.
>>
>> What is the reason for that? One of the main goals of this FLIP was to
>> unify the configs, wasn't it? I understand that current cache design
>> doesn't depend on ConfigOptions, like was before. But still we can put
>> these options into the framework, so connectors can reuse them and
>> avoid code duplication, and, what is more significant, avoid possible
>> different options naming. This moment can be pointed out in
>> documentation for connector developers.
>>
>> Best regards,
>> Alexander
>>
>> вт, 17 мая 2022 г. в 17:11, Qingsheng Ren <re...@gmail.com>:
>> >
>> > Hi Alexander,
>> >
>> > Thanks for the review and glad to see we are on the same page! I think
>> you forgot to cc the dev mailing list so I’m also quoting your reply under
>> this email.
>> >
>> > >  We can add 'maxRetryTimes' option into this class
>> >
>> > In my opinion the retry logic should be implemented in lookup() instead
>> of in LookupFunction#eval(). Retrying is only meaningful under some
>> specific retriable failures, and there might be custom logic before making
>> retry, such as re-establish the connection (JdbcRowDataLookupFunction is an
>> example), so it's more handy to leave it to the connector.
>> >
>> > > I don't see DDL options, that were in previous version of FLIP. Do
>> you have any special plans for them?
>> >
>> > We decide not to provide common DDL options and let developers
>> define their own options as we do now per connector.
>> >
>> > The rest of comments sound great and I’ll update the FLIP. Hope we can
>> finalize our proposal soon!
>> >
>> > Best,
>> >
>> > Qingsheng
>> >
>> >
>> > > On May 17, 2022, at 13:46, Александр Смирнов <sm...@gmail.com>
>> wrote:
>> > >
>> > > Hi Qingsheng and devs!
>> > >
>> > > I like the overall design of updated FLIP, however I have several
>> > > suggestions and questions.
>> > >
>> > > 1) Introducing LookupFunction as a subclass of TableFunction is a good
>> > > idea. We can add 'maxRetryTimes' option into this class. 'eval' method
>> > > of new LookupFunction is great for this purpose. The same is for
>> > > 'async' case.
>> > >
>> > > 2) There might be other configs in future, such as 'cacheMissingKey'
>> > > in LookupFunctionProvider or 'rescanInterval' in ScanRuntimeProvider.
>> > > Maybe use Builder pattern in LookupFunctionProvider and
>> > > RescanRuntimeProvider for more flexibility (use one 'build' method
>> > > instead of many 'of' methods in future)?
>> > >
>> > > 3) What are the plans for existing TableFunctionProvider and
>> > > AsyncTableFunctionProvider? I think they should be deprecated.
>> > >
>> > > 4) Am I right that the current design does not assume usage of
>> > > user-provided LookupCache in re-scanning? In this case, it is not very
>> > > clear why do we need methods such as 'invalidate' or 'putAll' in
>> > > LookupCache.
>> > >
>> > > 5) I don't see DDL options, that were in previous version of FLIP. Do
>> > > you have any special plans for them?
>> > >
>> > > If you don't mind, I would be glad to be able to make small
>> > > adjustments to the FLIP document too. I think it's worth mentioning
>> > > what optimizations are planned for the future.
>> > >
>> > > Best regards,
>> > > Smirnov Alexander
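The builder-style provider suggested in point 2 above could look roughly like the sketch below. This is only an illustration of the pattern, not the actual FLIP-221 API; the class, option names, and defaults are assumptions. The point is that a new option ('cacheMissingKey', 'rescanInterval', ...) later only adds a builder method instead of yet another 'of' overload.

```java
// Hypothetical sketch of a builder-based LookupFunctionProvider.
// All names and defaults here are assumptions for illustration only.
public class LookupFunctionProviderSketch {

    /** Immutable provider configured through a builder instead of 'of' overloads. */
    public static final class LookupFunctionProvider {
        private final boolean cacheMissingKey;
        private final int maxRetryTimes;

        private LookupFunctionProvider(Builder b) {
            this.cacheMissingKey = b.cacheMissingKey;
            this.maxRetryTimes = b.maxRetryTimes;
        }

        public boolean isCacheMissingKey() { return cacheMissingKey; }
        public int getMaxRetryTimes() { return maxRetryTimes; }

        public static Builder newBuilder() { return new Builder(); }

        public static final class Builder {
            private boolean cacheMissingKey = true;
            private int maxRetryTimes = 3;

            public Builder cacheMissingKey(boolean v) { this.cacheMissingKey = v; return this; }
            public Builder maxRetryTimes(int v) { this.maxRetryTimes = v; return this; }

            // New options can be added here later without new 'of' overloads.
            public LookupFunctionProvider build() { return new LookupFunctionProvider(this); }
        }
    }

    public static LookupFunctionProvider demo() {
        return LookupFunctionProvider.newBuilder()
                .cacheMissingKey(false)
                .maxRetryTimes(5)
                .build();
    }
}
```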
>> > >
>> > > пт, 13 мая 2022 г. в 20:27, Qingsheng Ren <re...@gmail.com>:
>> > >>
>> > >> Hi Alexander and devs,
>> > >>
>> > >> Thank you very much for the in-depth discussion! As Jark mentioned
>> we were inspired by Alexander's idea and made a refactor on our design.
>> FLIP-221 [1] has been updated to reflect our design now and we are happy to
>> hear more suggestions from you!
>> > >>
>> > >> Compared to the previous design:
>> > >> 1. The lookup cache serves at table runtime level and is integrated
>> as a component of LookupJoinRunner as discussed previously.
>> > >> 2. Interfaces are renamed and re-designed to reflect the new design.
>> > >> 3. We separate the all-caching case individually and introduce a new
>> RescanRuntimeProvider to reuse the ability of scanning. We are planning to
>> support SourceFunction / InputFormat for now considering the complexity of
>> FLIP-27 Source API.
>> > >> 4. A new interface LookupFunction is introduced to make the semantic
>> of lookup more straightforward for developers.
>> > >>
>> > >> For replying to Alexander:
>> > >>> However I'm a little confused whether InputFormat is deprecated or
>> not. Am I right that it will be so in the future, but currently it's not?
>> > >> Yes you are right. InputFormat is not deprecated for now. I think it
>> will be deprecated in the future but we don't have a clear plan for that.
>> > >>
>> > >> Thanks again for the discussion on this FLIP and looking forward to
>> cooperating with you after we finalize the design and interfaces!
>> > >>
>> > >> [1]
>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-221+Abstraction+for+lookup+source+cache+and+metric
>> > >>
>> > >> Best regards,
>> > >>
>> > >> Qingsheng
>> > >>
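The LookupFunction interface mentioned in point 4 above could be sketched as follows. The method names ('lookup', 'eval') and the retry handling follow this discussion, but the shapes are assumptions about the eventual FLIP-221 API, and RowData is reduced to a toy stand-in: the idea is that connectors implement only the typed lookup while the framework-facing eval owns cross-cutting concerns like maxRetryTimes.

```java
// Hedged sketch of the LookupFunction contract discussed in this thread.
// RowData, KeyRow, ValueRow and the retry policy are illustrative stand-ins.
import java.util.Collection;
import java.util.List;

public class LookupFunctionSketch {

    /** Simplified stand-ins for Flink's RowData. */
    interface RowData {}
    record KeyRow(int key) implements RowData {}
    record ValueRow(int key, String name) implements RowData {}

    public abstract static class LookupFunction {
        private final int maxRetryTimes;

        protected LookupFunction(int maxRetryTimes) { this.maxRetryTimes = maxRetryTimes; }

        /** Connectors implement only the lookup; it may throw on transient errors. */
        public abstract Collection<RowData> lookup(RowData keyRow) throws Exception;

        /** Framework-facing entry point: retry handling lives here, not in connectors. */
        public final Collection<RowData> eval(RowData keyRow) {
            Exception last = null;
            for (int attempt = 0; attempt <= maxRetryTimes; attempt++) {
                try {
                    return lookup(keyRow);
                } catch (Exception e) {
                    last = e;
                }
            }
            throw new RuntimeException("Lookup failed after retries", last);
        }
    }

    public static int demo() {
        LookupFunction f = new LookupFunction(3) {
            private int failures = 2; // fail twice, then succeed
            @Override
            public Collection<RowData> lookup(RowData keyRow) throws Exception {
                if (failures-- > 0) throw new Exception("transient");
                return List.of(new ValueRow(((KeyRow) keyRow).key(), "alice"));
            }
        };
        return f.eval(new KeyRow(7)).size();
    }
}
```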
>> > >>
>> > >> On Fri, May 13, 2022 at 12:12 AM Александр Смирнов <
>> smiralexan@gmail.com> wrote:
>> > >>>
>> > >>> Hi Jark, Qingsheng and Leonard!
>> > >>>
>> > >>> Glad to see that we came to a consensus on almost all points!
>> > >>>
>> > >>> However I'm a little confused whether InputFormat is deprecated or
>> > >>> not. Am I right that it will be so in the future, but currently it's
>> > >>> not? Actually I also think that for the first version it's OK to use
>> > >>> InputFormat in ALL cache realization, because supporting rescan
>> > >>> ability seems like a very distant prospect. But for this decision we
>> > >>> need a consensus among all discussion participants.
>> > >>>
>> > >>> In general, I have nothing to argue with in your statements. All
>> > >>> of them correspond to my ideas. Looking ahead, it would be nice to work
>> > >>> on this FLIP cooperatively. I've already done a lot of work on lookup
>> > >>> join caching, with an implementation very close to the one we are
>> > >>> discussing, and want to share the results of this work. Anyway, looking
>> > >>> forward to the FLIP update!
>> > >>>
>> > >>> Best regards,
>> > >>> Smirnov Alexander
>> > >>>
>> > >>> чт, 12 мая 2022 г. в 17:38, Jark Wu <im...@gmail.com>:
>> > >>>>
>> > >>>> Hi Alex,
>> > >>>>
>> > >>>> Thanks for summarizing your points.
>> > >>>>
>> > >>>> In the past week, Qingsheng, Leonard, and I have discussed it
>> several times
>> > >>>> and we have totally refactored the design.
>> > >>>> I'm glad to say we have reached a consensus on many of your points!
>> > >>>> Qingsheng is still working on updating the design docs and maybe
>> can be
>> > >>>> available in the next few days.
>> > >>>> I will share some conclusions from our discussions:
>> > >>>>
>> > >>>> 1) we have refactored the design towards to "cache in framework"
>> way.
>> > >>>>
>> > >>>> 2) a "LookupCache" interface for users to customize and a default
>> > >>>> implementation with builder for users to easy-use.
>> > >>>> This can both make it possible to both have flexibility and
>> conciseness.
>> > >>>>
>> > >>>> 3) Filter pushdown is important for ALL and LRU lookup cache, esp
>> reducing
>> > >>>> IO.
>> > >>>> Filter pushdown should be the final state and the unified way to
>> both
>> > >>>> support pruning ALL cache and LRU cache,
>> > >>>> so I think we should make effort in this direction. If we need to
>> support
>> > >>>> filter pushdown for ALL cache anyway, why not use
>> > >>>> it for LRU cache as well? Either way, as we decide to implement
>> the cache
>> > >>>> in the framework, we have the chance to support
>> > >>>> filter on cache anytime. This is an optimization and it doesn't
>> affect the
>> > >>>> public API. I think we can create a JIRA issue to
>> > >>>> discuss it when the FLIP is accepted.
>> > >>>>
>> > >>>> 4) The idea to support ALL cache is similar to your proposal.
>> > >>>> In the first version, we will only support InputFormat,
>> SourceFunction for
>> > >>>> cache all (invoke InputFormat in join operator).
>> > >>>> For FLIP-27 source, we need to join a true source operator instead
>> of
>> > >>>> calling it embedded in the join operator.
>> > >>>> However, this needs another FLIP to support the re-scan ability
>> for FLIP-27
>> > >>>> Source, and this can be a large work.
>> > >>>> In order to not block this issue, we can put the effort of FLIP-27
>> source
>> > >>>> integration into future work and integrate
>> > >>>> InputFormat&SourceFunction for now.
>> > >>>>
>> > >>>> I think it's fine to use InputFormat&SourceFunction, as they are
>> not
>> > >>>> deprecated, otherwise, we have to introduce another function
>> > >>>> similar to them which is meaningless. We need to plan FLIP-27
>> source
>> > >>>> integration ASAP before InputFormat & SourceFunction are
>> deprecated.
>> > >>>>
>> > >>>> Best,
>> > >>>> Jark
>> > >>>>
>> > >>>>
>> > >>>>
>> > >>>>
>> > >>>>
>> > >>>> On Thu, 12 May 2022 at 15:46, Александр Смирнов <
>> smiralexan@gmail.com>
>> > >>>> wrote:
>> > >>>>
>> > >>>>> Hi Martijn!
>> > >>>>>
>> > >>>>> Got it. Therefore, the realization with InputFormat is not
>> considered.
>> > >>>>> Thanks for clearing that up!
>> > >>>>>
>> > >>>>> Best regards,
>> > >>>>> Smirnov Alexander
>> > >>>>>
>> > >>>>> чт, 12 мая 2022 г. в 14:23, Martijn Visser <martijn@ververica.com
>> >:
>> > >>>>>>
>> > >>>>>> Hi,
>> > >>>>>>
>> > >>>>>> With regards to:
>> > >>>>>>
>> > >>>>>>> But if there are plans to refactor all connectors to FLIP-27
>> > >>>>>>
>> > >>>>>> Yes, FLIP-27 is the target for all connectors. The old
>> interfaces will be
>> > >>>>>> deprecated and connectors will either be refactored to use the
>> new ones
>> > >>>>> or
>> > >>>>>> dropped.
>> > >>>>>>
>> > >>>>>> The caching should work for connectors that are using FLIP-27
>> interfaces,
>> > >>>>>> we should not introduce new features for old interfaces.
>> > >>>>>>
>> > >>>>>> Best regards,
>> > >>>>>>
>> > >>>>>> Martijn
>> > >>>>>>
>> > >>>>>> On Thu, 12 May 2022 at 06:19, Александр Смирнов <
>> smiralexan@gmail.com>
>> > >>>>>> wrote:
>> > >>>>>>
>> > >>>>>>> Hi Jark!
>> > >>>>>>>
>> > >>>>>>> Sorry for the late response. I would like to make some comments
>> and
>> > >>>>>>> clarify my points.
>> > >>>>>>>
>> > >>>>>>> 1) I agree with your first statement. I think we can achieve
>> both
>> > >>>>>>> advantages this way: put the Cache interface in
>> flink-table-common,
>> > >>>>>>> but have implementations of it in flink-table-runtime.
>> Therefore if a
>> > >>>>>>> connector developer wants to use existing cache strategies and
>> their
>> > >>>>>>> implementations, he can just pass lookupConfig to the planner,
>> but if
>> > >>>>>>> he wants to have its own cache implementation in his
>> TableFunction, it
>> > >>>>>>> will be possible for him to use the existing interface for this
>> > >>>>>>> purpose (we can explicitly point this out in the
>> documentation). In
>> > >>>>>>> this way all configs and metrics will be unified. WDYT?
>> > >>>>>>>
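The split proposed in the paragraph above (interface in flink-table-common, implementation in flink-table-runtime) can be sketched roughly as below. The interface and method names are illustrative only, loosely following the 'getIfPresent'/'invalidate' methods mentioned in this thread, not a definitive FLIP-221 API.

```java
// Rough sketch: a connector-facing cache interface that could live in
// flink-table-common, plus a default implementation that would live in
// flink-table-runtime. All names here are assumptions for illustration.
import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class CacheSplitSketch {

    /** Connector-facing contract (would sit in flink-table-common). */
    public interface LookupCache<K, V> {
        Collection<V> getIfPresent(K key);
        void put(K key, Collection<V> rows);
        void invalidate(K key);
        long size();
    }

    /** Default runtime-owned implementation (would sit in flink-table-runtime). */
    public static final class DefaultLookupCache<K, V> implements LookupCache<K, V> {
        private final Map<K, Collection<V>> delegate = new ConcurrentHashMap<>();

        @Override public Collection<V> getIfPresent(K key) { return delegate.get(key); }
        @Override public void put(K key, Collection<V> rows) { delegate.put(key, rows); }
        @Override public void invalidate(K key) { delegate.remove(key); }
        @Override public long size() { return delegate.size(); }
    }

    public static long demo() {
        LookupCache<String, String> cache = new DefaultLookupCache<>();
        cache.put("user-1", List.of("row-a", "row-b"));
        cache.put("user-2", List.of()); // cached miss: an empty result is stored too
        cache.invalidate("user-1");
        return cache.size();            // only "user-2" remains
    }
}
```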
>> > >>>>>>>> If a filter can prune 90% of data in the cache, we will have
>> 90% of
>> > >>>>>>> lookup requests that can never be cached
>> > >>>>>>>
>> > >>>>>>> 2) Let me clarify the logic filters optimization in case of LRU
>> cache.
>> > >>>>>>> It looks like Cache<RowData, Collection<RowData>>. Here we
>> always
>> > >>>>>>> store the response of the dimension table in cache, even after
>> > >>>>>>> applying calc function. I.e. if there are no rows after applying
>> > >>>>>>> filters to the result of the 'eval' method of TableFunction, we
>> store
>> > >>>>>>> the empty list under the lookup key. Therefore the cache line will be
>> > >>>>>>> filled, but will require much less memory (in bytes). I.e. we
>> don't
>> > >>>>>>> completely filter keys, by which result was pruned, but
>> significantly
>> > >>>>>>> reduce required memory to store this result. If the user knows
>> about
>> > >>>>>>> this behavior, he can increase the 'max-rows' option before the
>> start
>> > >>>>>>> of the job. But actually I came up with the idea that we can do
>> this
>> > >>>>>>> automatically by using the 'maximumWeight' and 'weigher'
>> methods of
>> > >>>>>>> GuavaCache [1]. Weight can be the size of the collection of rows
>> > >>>>>>> (value of cache). Therefore cache can automatically fit much
>> more
>> > >>>>>>> records than before.
>> > >>>>>>>
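The 'maximumWeight' + 'weigher' idea from the paragraph above can be illustrated with a self-contained sketch: cap the cache by total cached rows instead of by entry count, so an empty filtered result occupies almost no budget. Guava's CacheBuilder provides this natively (with LRU eviction); this stand-in only demonstrates the weight accounting with a simple FIFO eviction and invented names.

```java
// Self-contained sketch of a weight-bounded lookup cache where the "weigher"
// is the number of rows per entry. FIFO eviction is used for brevity; Guava's
// CacheBuilder#maximumWeight/#weigher would be the real mechanism.
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class WeightedLookupCache {
    private final long maximumWeight;
    private long currentWeight = 0;
    private final Map<String, List<String>> entries = new HashMap<>();
    private final Deque<String> insertionOrder = new ArrayDeque<>();

    public WeightedLookupCache(long maximumWeight) {
        this.maximumWeight = maximumWeight;
    }

    /** The "weigher": an entry costs as many units as it holds rows. */
    private static long weigh(List<String> rows) {
        return rows.size();
    }

    public void put(String key, List<String> rows) {
        List<String> previous = entries.put(key, rows);
        if (previous != null) {
            currentWeight -= weigh(previous);
        } else {
            insertionOrder.addLast(key);
        }
        currentWeight += weigh(rows);
        // Evict oldest entries until the total row count fits the budget again.
        while (currentWeight > maximumWeight && !insertionOrder.isEmpty()) {
            String oldest = insertionOrder.removeFirst();
            List<String> evicted = entries.remove(oldest);
            if (evicted != null) {
                currentWeight -= weigh(evicted);
            }
        }
    }

    public List<String> getIfPresent(String key) {
        return entries.get(key);
    }

    public static int demo() {
        WeightedLookupCache cache = new WeightedLookupCache(5);
        cache.put("k1", List.of());                  // filtered-out result: weight 0
        cache.put("k2", List.of("r1", "r2"));        // weight 2
        cache.put("k3", List.of("r3", "r4", "r5"));  // weight 3, total 5: all fit
        return cache.entries.size();
    }
}
```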
>> > >>>>>>>> Flink SQL has provided a standard way to do filters and
>> projects
>> > >>>>>>> pushdown, i.e., SupportsFilterPushDown and
>> SupportsProjectionPushDown.
>> > >>>>>>>> Jdbc/hive/HBase haven't implemented the interfaces, don't mean
>> it's
>> > >>>>> hard
>> > >>>>>>> to implement.
>> > >>>>>>>
>> > >>>>>>> It's debatable how difficult it will be to implement filter
>> pushdown.
>> > >>>>>>> But I think the fact that currently there is no database
>> connector
>> > >>>>>>> with filter pushdown at least means that this feature won't be
>> > >>>>>>> supported soon in connectors. Moreover, if we talk about other
>> > >>>>>>> connectors (not in Flink repo), their databases might not
>> support all
>> > >>>>>>> Flink filters (or not support filters at all). I think users are
>> > >>>>>>> interested in supporting cache filters optimization
>> independently of
>> > >>>>>>> supporting other features and solving more complex (or even
>> > >>>>>>> unsolvable) problems.
>> > >>>>>>>
>> > >>>>>>> 3) I agree with your third statement. Actually in our internal
>> version
>> > >>>>>>> I also tried to unify the logic of scanning and reloading data
>> from
>> > >>>>>>> connectors. But unfortunately, I didn't find a way to unify the
>> logic
>> > >>>>>>> of all ScanRuntimeProviders (InputFormat, SourceFunction,
>> Source,...)
>> > >>>>>>> and reuse it in reloading ALL cache. As a result I settled on
>> using
>> > >>>>>>> InputFormat, because it was used for scanning in all lookup
>> > >>>>>>> connectors. (I didn't know that there are plans to deprecate
>> > >>>>>>> InputFormat in favor of FLIP-27 Source). IMO usage of FLIP-27
>> source
>> > >>>>>>> in ALL caching is not a good idea, because this source was
>> designed to
>> > >>>>>>> work in distributed environment (SplitEnumerator on JobManager
>> and
>> > >>>>>>> SourceReaders on TaskManagers), not in one operator (lookup join
>> > >>>>>>> operator in our case). There is even no direct way to pass
>> splits from
>> > >>>>>>> SplitEnumerator to SourceReader (this logic works through
>> > >>>>>>> SplitEnumeratorContext, which requires
>> > >>>>>>> OperatorCoordinator.SubtaskGateway to send AddSplitEvents).
>> Usage of
>> > >>>>>>> InputFormat for ALL cache seems much more clearer and easier.
>> But if
>> > >>>>>>> there are plans to refactor all connectors to FLIP-27, I have
>> the
>> > >>>>>>> following idea: maybe we can drop the lookup join ALL cache in
>> > >>>>>>> favor of a simple join with multiple scans of the batch source?
>> The point
>> > >>>>>>> is that the only difference between lookup join ALL cache and
>> simple
>> > >>>>>>> join with batch source is that in the first case scanning is
>> performed
>> > >>>>>>> multiple times, in between which state (cache) is cleared
>> (correct me
>> > >>>>>>> if I'm wrong). So what if we extend the functionality of simple
>> join
>> > >>>>>>> to support state reloading + extend the functionality of
>> scanning
>> > >>>>>>> batch source multiple times (this one should be easy with new
>> FLIP-27
>> > >>>>>>> source, that unifies streaming/batch reading - we will need to
>> change
>> > >>>>>>> only SplitEnumerator, which will pass splits again after some
>> TTL).
>> > >>>>>>> WDYT? I must say that this looks like a long-term goal and will
>> make
>> > >>>>>>> the scope of this FLIP even larger than you said. Maybe we can
>> limit
>> > >>>>>>> ourselves to a simpler solution now (InputFormats).
>> > >>>>>>>
>> > >>>>>>> So to sum up, my points is like this:
>> > >>>>>>> 1) There is a way to make both concise and flexible interfaces
>> for
>> > >>>>>>> caching in lookup join.
>> > >>>>>>> 2) Cache filters optimization is important both in LRU and ALL
>> caches.
>> > >>>>>>> 3) It is unclear when filter pushdown will be supported in Flink
>> > >>>>>>> connectors, some of the connectors might not have the
>> opportunity to
>> > >>>>>>> support filter pushdown + as I know, currently filter pushdown
>> works
>> > >>>>>>> only for scanning (not lookup). So cache filters + projections
>> > >>>>>>> optimization should be independent from other features.
>> > >>>>>>> 4) ALL cache realization is a complex topic that involves
>> multiple
>> > >>>>>>> aspects of how Flink is developing. Dropping InputFormat in favor
>> > >>>>>>> of the FLIP-27 Source will make the ALL cache implementation really
>> > >>>>>>> complex and unclear, so maybe instead we can extend the
>> > >>>>>>> functionality of the simple join, or keep InputFormat for the
>> > >>>>>>> lookup join ALL cache?
>> > >>>>>>>
>> > >>>>>>> Best regards,
>> > >>>>>>> Smirnov Alexander
>> > >>>>>>>
>> > >>>>>>>
>> > >>>>>>>
>> > >>>>>>>
>> > >>>>>>>
>> > >>>>>>>
>> > >>>>>>>
>> > >>>>>>> [1]
>> > >>>>>>>
>> > >>>>>
>> https://guava.dev/releases/18.0/api/docs/com/google/common/cache/CacheBuilder.html#weigher(com.google.common.cache.Weigher)
>> > >>>>>>>
>> > >>>>>>> чт, 5 мая 2022 г. в 20:34, Jark Wu <im...@gmail.com>:
>> > >>>>>>>>
>> > >>>>>>>> It's great to see the active discussion! I want to share my
>> ideas:
>> > >>>>>>>>
>> > >>>>>>>> 1) implement the cache in framework vs. connectors base
>> > >>>>>>>> I don't have a strong opinion on this. Both ways should work
>> (e.g.,
>> > >>>>> cache
>> > >>>>>>>> pruning, compatibility).
>> > >>>>>>>> The framework way can provide more concise interfaces.
>> > >>>>>>>> The connector base way can define more flexible cache
>> > >>>>>>>> strategies/implementations.
>> > >>>>>>>> We are still investigating a way to see if we can have both
>> > >>>>> advantages.
>> > >>>>>>>> We should reach a consensus that the way should be a final
>> state,
>> > >>>>> and we
>> > >>>>>>>> are on the path to it.
>> > >>>>>>>>
>> > >>>>>>>> 2) filters and projections pushdown:
>> > >>>>>>>> I agree with Alex that the filter pushdown into cache can
>> benefit a
>> > >>>>> lot
>> > >>>>>>> for
>> > >>>>>>>> ALL cache.
>> > >>>>>>>> However, this is not true for LRU cache. Connectors use cache
>> to
>> > >>>>> reduce
>> > >>>>>>> IO
>> > >>>>>>>> requests to databases for better throughput.
>> > >>>>>>>> If a filter can prune 90% of data in the cache, we will have
>> 90% of
>> > >>>>>>> lookup
>> > >>>>>>>> requests that can never be cached
>> > >>>>>>>> and hit directly to the databases. That means the cache is
>> > >>>>> meaningless in
>> > >>>>>>>> this case.
>> > >>>>>>>>
>> > >>>>>>>> IMO, Flink SQL has provided a standard way to do filters and
>> projects
>> > >>>>>>>> pushdown, i.e., SupportsFilterPushDown and
>> > >>>>> SupportsProjectionPushDown.
>> > >>>>>>>> Jdbc/hive/HBase haven't implemented the interfaces, don't mean
>> it's
>> > >>>>> hard
>> > >>>>>>> to
>> > >>>>>>>> implement.
>> > >>>>>>>> They should implement the pushdown interfaces to reduce IO and
>> the
>> > >>>>> cache
>> > >>>>>>>> size.
>> > >>>>>>>> That should be a final state that the scan source and lookup
>> source
>> > >>>>> share
>> > >>>>>>>> the exact pushdown implementation.
>> > >>>>>>>> I don't see why we need to duplicate the pushdown logic in
>> caches,
>> > >>>>> which
>> > >>>>>>>> will complex the lookup join design.
>> > >>>>>>>>
>> > >>>>>>>> 3) ALL cache abstraction
>> > >>>>>>>> All cache might be the most challenging part of this FLIP. We
>> have
>> > >>>>> never
>> > >>>>>>>> provided a reload-lookup public interface.
>> > >>>>>>>> Currently, we put the reload logic in the "eval" method of
>> > >>>>> TableFunction.
>> > >>>>>>>> That's hard for some sources (e.g., Hive).
>> > >>>>>>>> Ideally, connector implementation should share the logic of
>> reload
>> > >>>>> and
>> > >>>>>>>> scan, i.e. ScanTableSource with
>> InputFormat/SourceFunction/FLIP-27
>> > >>>>>>> Source.
>> > >>>>>>>> However, InputFormat/SourceFunction are deprecated, and the
>> FLIP-27
>> > >>>>>>> source
>> > >>>>>>>> is deeply coupled with SourceOperator.
>> > >>>>>>>> If we want to invoke the FLIP-27 source in LookupJoin, this
>> may make
>> > >>>>> the
>> > >>>>>>>> scope of this FLIP much larger.
>> > >>>>>>>> We are still investigating how to abstract the ALL cache logic
>> and
>> > >>>>> reuse
>> > >>>>>>>> the existing source interfaces.
>> > >>>>>>>>
>> > >>>>>>>>
>> > >>>>>>>> Best,
>> > >>>>>>>> Jark
>> > >>>>>>>>
>> > >>>>>>>>
>> > >>>>>>>>
>> > >>>>>>>> On Thu, 5 May 2022 at 20:22, Roman Boyko <ro.v.boyko@gmail.com
>> >
>> > >>>>> wrote:
>> > >>>>>>>>
>> > >>>>>>>>> It's a much more complicated activity and lies out of the
>> scope of
>> > >>>>> this
>> > >>>>>>>>> improvement. Because such pushdowns should be done for all
>> > >>>>>>> ScanTableSource
>> > >>>>>>>>> implementations (not only for Lookup ones).
>> > >>>>>>>>>
>> > >>>>>>>>> On Thu, 5 May 2022 at 19:02, Martijn Visser <
>> > >>>>> martijnvisser@apache.org>
>> > >>>>>>>>> wrote:
>> > >>>>>>>>>
>> > >>>>>>>>>> Hi everyone,
>> > >>>>>>>>>>
>> > >>>>>>>>>> One question regarding "And Alexander correctly mentioned
>> that
>> > >>>>> filter
>> > >>>>>>>>>> pushdown still is not implemented for jdbc/hive/hbase." ->
>> Would
>> > >>>>> an
>> > >>>>>>>>>> alternative solution be to actually implement these filter
>> > >>>>> pushdowns?
>> > >>>>>>> I
>> > >>>>>>>>>> can
>> > >>>>>>>>>> imagine that there are many more benefits to doing that,
>> outside
>> > >>>>> of
>> > >>>>>>> lookup
>> > >>>>>>>>>> caching and metrics.
>> > >>>>>>>>>>
>> > >>>>>>>>>> Best regards,
>> > >>>>>>>>>>
>> > >>>>>>>>>> Martijn Visser
>> > >>>>>>>>>> https://twitter.com/MartijnVisser82
>> > >>>>>>>>>> https://github.com/MartijnVisser
>> > >>>>>>>>>>
>> > >>>>>>>>>>
>> > >>>>>>>>>> On Thu, 5 May 2022 at 13:58, Roman Boyko <
>> ro.v.boyko@gmail.com>
>> > >>>>>>> wrote:
>> > >>>>>>>>>>
>> > >>>>>>>>>>> Hi everyone!
>> > >>>>>>>>>>>
>> > >>>>>>>>>>> Thanks for driving such a valuable improvement!
>> > >>>>>>>>>>>
>> > >>>>>>>>>>> I do think that single cache implementation would be a nice
>> > >>>>>>> opportunity
>> > >>>>>>>>>> for
>> > >>>>>>>>>>> users. And it will break the "FOR SYSTEM_TIME AS OF
>> proc_time"
>> > >>>>>>> semantics
>> > >>>>>>>>>>> anyway - no matter how it is implemented.
>> > >>>>>>>>>>>
>> > >>>>>>>>>>> Putting myself in the user's shoes, I can say that:
>> > >>>>>>>>>>> 1) I would prefer to have the opportunity to cut off the
>> cache
>> > >>>>> size
>> > >>>>>>> by
>> > >>>>>>>>>>> simply filtering unnecessary data. And the handiest way to
>> > >>>>>>>>>>> do this is to apply
>> > >>>>>>>>>>> it inside LookupRunners. It would be a bit harder to pass it
>> > >>>>>>> through the
>> > >>>>>>>>>>> LookupJoin node to TableFunction. And Alexander correctly
>> > >>>>> mentioned
>> > >>>>>>> that
>> > >>>>>>>>>>> filter pushdown still is not implemented for
>> jdbc/hive/hbase.
>> > >>>>>>>>>>> 2) The ability to set the different caching parameters for
>> > >>>>> different
>> > >>>>>>>>>> tables
>> > >>>>>>>>>>> is quite important. So I would prefer to set it through DDL
>> > >>>>> rather
>> > >>>>>>> than
>> > >>>>>>>>>>> have the same TTL, strategy and other options for all lookup
>> > >>>>> tables.
>> > >>>>>>>>>>> 3) Moving the cache into the framework really deprives us of
>> > >>>>>>>>>>> extensibility (users won't be able to implement their own
>> > >>>>> cache).
>> > >>>>>>> But
>> > >>>>>>>>>> most
>> > >>>>>>>>>>> probably it might be solved by creating more different cache
>> > >>>>>>> strategies
>> > >>>>>>>>>> and
>> > >>>>>>>>>>> a wider set of configurations.
>> > >>>>>>>>>>>
>> > >>>>>>>>>>> All these points are much closer to the schema proposed by
>> > >>>>>>> Alexander.
>> > >>>>>>>>>>> Qingsheng Ren, please correct me if I'm wrong and all
>> these
>> > >>>>>>>>>> facilities
>> > >>>>>>>>>>> might be simply implemented in your architecture?
>> > >>>>>>>>>>>
>> > >>>>>>>>>>> Best regards,
>> > >>>>>>>>>>> Roman Boyko
>> > >>>>>>>>>>> e.: ro.v.boyko@gmail.com
>> > >>>>>>>>>>>
>> > >>>>>>>>>>> On Wed, 4 May 2022 at 21:01, Martijn Visser <
>> > >>>>>>> martijnvisser@apache.org>
>> > >>>>>>>>>>> wrote:
>> > >>>>>>>>>>>
>> > >>>>>>>>>>>> Hi everyone,
>> > >>>>>>>>>>>>
>> > >>>>>>>>>>>> I don't have much to chip in, but just wanted to express
>> that
>> > >>>>> I
>> > >>>>>>> really
>> > >>>>>>>>>>>> appreciate the in-depth discussion on this topic and I hope
>> > >>>>> that
>> > >>>>>>>>>> others
>> > >>>>>>>>>>>> will join the conversation.
>> > >>>>>>>>>>>>
>> > >>>>>>>>>>>> Best regards,
>> > >>>>>>>>>>>>
>> > >>>>>>>>>>>> Martijn
>> > >>>>>>>>>>>>
>> > >>>>>>>>>>>> On Tue, 3 May 2022 at 10:15, Александр Смирнов <
>> > >>>>>>> smiralexan@gmail.com>
>> > >>>>>>>>>>>> wrote:
>> > >>>>>>>>>>>>
>> > >>>>>>>>>>>>> Hi Qingsheng, Leonard and Jark,
>> > >>>>>>>>>>>>>
>> > >>>>>>>>>>>>> Thanks for your detailed feedback! However, I have
>> questions
>> > >>>>>>> about
>> > >>>>>>>>>>>>> some of your statements (maybe I didn't get something?).
>> > >>>>>>>>>>>>>
>> > >>>>>>>>>>>>>> Caching actually breaks the semantic of "FOR SYSTEM_TIME
>> > >>>>> AS OF
>> > >>>>>>>>>>>> proc_time”
>> > >>>>>>>>>>>>>
>> > >>>>>>>>>>>>> I agree that the semantics of "FOR SYSTEM_TIME AS OF
>> > >>>>> proc_time"
>> > >>>>>>> is
>> > >>>>>>>>>> not
>> > >>>>>>>>>>>>> fully implemented with caching, but as you said, users go
>> > >>>>> on it
>> > >>>>>>>>>>>>> consciously to achieve better performance (no one proposed
>> > >>>>> to
>> > >>>>>>> enable
>> > >>>>>>>>>>>>> caching by default, etc.). Or by users do you mean other
>> > >>>>>>> developers
>> > >>>>>>>>>> of
>> > >>>>>>>>>>>>> connectors? In this case developers explicitly specify
>> > >>>>> whether
>> > >>>>>>> their
>> > >>>>>>>>>>>>> connector supports caching or not (in the list of
>> supported
>> > >>>>>>>>>> options),
>> > >>>>>>>>>>>>> no one makes them do that if they don't want to. So what
>> > >>>>>>> exactly is
>> > >>>>>>>>>>>>> the difference between implementing caching in modules
>> > >>>>>>>>>>>>> flink-table-runtime and in flink-table-common from the
>> > >>>>>>> considered
>> > >>>>>>>>>>>>> point of view? How does it affect breaking/non-breaking
>> > >>>>> the
>> > >>>>>>>>>>>>> semantics of "FOR SYSTEM_TIME AS OF proc_time"?
>> > >>>>>>>>>>>>>
>> > >>>>>>>>>>>>>> confront a situation that allows table options in DDL to
>> > >>>>>>> control
>> > >>>>>>>>>> the
>> > >>>>>>>>>>>>> behavior of the framework, which has never happened
>> > >>>>> previously
>> > >>>>>>> and
>> > >>>>>>>>>>> should
>> > >>>>>>>>>>>>> be cautious
>> > >>>>>>>>>>>>>
>> > >>>>>>>>>>>>> If we talk about main differences of semantics of DDL
>> > >>>>> options
>> > >>>>>>> and
>> > >>>>>>>>>>>>> config options("table.exec.xxx"), isn't it about limiting
>> > >>>>> the
>> > >>>>>>> scope
>> > >>>>>>>>>> of
>> > >>>>>>>>>>>>> the options + importance for the user business logic
>> rather
>> > >>>>> than
>> > >>>>>>>>>>>>> specific location of corresponding logic in the
>> framework? I
>> > >>>>>>> mean
>> > >>>>>>>>>> that
>> > >>>>>>>>>>>>> in my design, for example, putting an option with lookup
>> > >>>>> cache
>> > >>>>>>>>>>>>> strategy in configurations would  be the wrong decision,
>> > >>>>>>> because it
>> > >>>>>>>>>>>>> directly affects the user's business logic (not just
>> > >>>>> performance
>> > >>>>>>>>>>>>> optimization) + touches just several functions of ONE
>> table
>> > >>>>>>> (there
>> > >>>>>>>>>> can
>> > >>>>>>>>>>>>> be multiple tables with different caches). Does it really
>> > >>>>>>> matter for
>> > >>>>>>>>>>>>> the user (or someone else) where the logic is located,
>> > >>>>> which is
>> > >>>>>>>>>>>>> affected by the applied option?
>> > >>>>>>>>>>>>> Also I can remember DDL option 'sink.parallelism', which
>> in
>> > >>>>>>> some way
>> > >>>>>>>>>>>>> "controls the behavior of the framework" and I don't see
>> any
>> > >>>>>>> problem
>> > >>>>>>>>>>>>> here.
>> > >>>>>>>>>>>>>
>> > >>>>>>>>>>>>>> introduce a new interface for this all-caching scenario
>> > >>>>> and
>> > >>>>>>> the
>> > >>>>>>>>>>> design
>> > >>>>>>>>>>>>> would become more complex
>> > >>>>>>>>>>>>>
>> > >>>>>>>>>>>>> This is a subject for a separate discussion, but actually
>> > >>>>> in our
>> > >>>>>>>>>>>>> internal version we solved this problem quite easily - we
>> > >>>>> reused
>> > >>>>>>>>>>>>> InputFormat class (so there is no need for a new API). The
>> > >>>>>>> point is
>> > >>>>>>>>>>>>> that currently all lookup connectors use InputFormat for
>> > >>>>>>> scanning
>> > >>>>>>>>>> the
>> > >>>>>>>>>>>>> data in batch mode: HBase, JDBC and even Hive - it uses
>> > >>>>> class
>> > >>>>>>>>>>>>> PartitionReader, that is actually just a wrapper around
>> > >>>>>>> InputFormat.
>> > >>>>>>>>>>>>> The advantage of this solution is the ability to reload
>> > >>>>> cache
>> > >>>>>>> data
>> > >>>>>>>>>> in
>> > >>>>>>>>>>>>> parallel (number of threads depends on number of
>> > >>>>> InputSplits,
>> > >>>>>>> but
>> > >>>>>>>>>> has
>> > >>>>>>>>>>>>> an upper limit). As a result cache reload time
>> significantly
>> > >>>>>>> reduces
>> > >>>>>>>>>>>>> (as well as time of input stream blocking). I know that
>> > >>>>> usually
>> > >>>>>>> we
>> > >>>>>>>>>> try
>> > >>>>>>>>>>>>> to avoid usage of concurrency in Flink code, but maybe
>> this
>> > >>>>> one
>> > >>>>>>> can
>> > >>>>>>>>>> be
>> > >>>>>>>>>>>>> an exception. BTW I don't say that it's an ideal solution,
>> > >>>>> maybe
>> > >>>>>>>>>> there
>> > >>>>>>>>>>>>> are better ones.
>> > >>>>>>>>>>>>>
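The parallel cache reload described above (one reading task per InputSplit, with an upper bound on threads) can be sketched as below. 'Split' and 'readSplit' are hypothetical stand-ins for InputFormat's InputSplit handling, not real Flink API; the sketch only shows the fan-out/join structure of the reload.

```java
// Rough sketch of a parallel ALL-cache reload: one task per input split on a
// bounded thread pool. Split/readSplit are invented stand-ins for
// InputFormat#createInputSplits and InputFormat#open/nextRecord.
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelCacheReload {

    /** Hypothetical stand-in for an InputSplit: a half-open id range. */
    record Split(int from, int to) {}

    /** Stand-in for reading one split of the dimension table. */
    static Map<Integer, String> readSplit(Split s) {
        Map<Integer, String> rows = new ConcurrentHashMap<>();
        for (int id = s.from(); id < s.to(); id++) {
            rows.put(id, "user-" + id);
        }
        return rows;
    }

    /** Reloads the whole ALL cache, one task per split, on a bounded pool. */
    static Map<Integer, String> reload(List<Split> splits, int maxThreads) throws Exception {
        ExecutorService pool =
                Executors.newFixedThreadPool(Math.min(splits.size(), maxThreads));
        try {
            Map<Integer, String> cache = new ConcurrentHashMap<>();
            List<Future<?>> tasks = new ArrayList<>();
            for (Split s : splits) {
                tasks.add(pool.submit(() -> cache.putAll(readSplit(s))));
            }
            for (Future<?> t : tasks) {
                t.get(); // surface read failures before the new cache replaces the old one
            }
            return cache;
        } finally {
            pool.shutdown();
        }
    }

    public static int demo() {
        try {
            return reload(List.of(new Split(0, 50), new Split(50, 100)), 4).size();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}
```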
>> > >>>>>>>>>>>>>> Providing the cache in the framework might introduce
>> > >>>>>>> compatibility
>> > >>>>>>>>>>>> issues
>> > >>>>>>>>>>>>>
>> > >>>>>>>>>>>>> It's possible only in cases when the developer of the
>> > >>>>> connector
>> > >>>>>>>>>> won't
>> > >>>>>>>>>>>>> properly refactor his code and will use new cache options
>> > >>>>>>>>>> incorrectly
>> > >>>>>>>>>>>>> (i.e. explicitly provide the same options into 2 different
>> > >>>>> code
>> > >>>>>>>>>>>>> places). For correct behavior all he will need to do is to
>> > >>>>>>> redirect
>> > >>>>>>>>>>>>> existing options to the framework's LookupConfig (+ maybe
>> > >>>>> add an
>> > >>>>>>>>>> alias
>> > >>>>>>>>>>>>> for options, if there was different naming), everything
>> > >>>>> will be
>> > >>>>>>>>>>>>> transparent for users. If the developer won't do
>> > >>>>> refactoring at
>> > >>>>>>> all,
>> > >>>>>>>>>>>>> nothing will be changed for the connector because of
>> > >>>>> backward
>> > >>>>>>>>>>>>> compatibility. Also if a developer wants to use his own
>> > >>>>> cache
>> > >>>>>>> logic,
>> > >>>>>>>>>>>>> he just can refuse to pass some of the configs into the
>> > >>>>>>> framework,
>> > >>>>>>>>>> and
>> > >>>>>>>>>>>>> instead make his own implementation with already existing
>> > >>>>>>> configs
>> > >>>>>>>>>> and
>> > >>>>>>>>>>>>> metrics (but actually I think that it's a rare case).
>> > >>>>>>>>>>>>>
>> > >>>>>>>>>>>>>> filters and projections should be pushed all the way down
>> > >>>>> to
>> > >>>>>>> the
>> > >>>>>>>>>>> table
>> > >>>>>>>>>>>>> function, like what we do in the scan source
>> > >>>>>>>>>>>>>
>> > >>>>>>>>>>>>> It's the great purpose. But the truth is that the ONLY
>> > >>>>> connector
>> > >>>>>>>>>> that
>> > >>>>>>>>>>>>> supports filter pushdown is FileSystemTableSource
>> > >>>>>>>>>>>>> (no database connector supports it currently). Also for
>> some
>> > >>>>>>>>>> databases
>> > >>>>>>>>>>>>> it's simply impossible to pushdown such complex filters
>> > >>>>> that we
>> > >>>>>>> have
>> > >>>>>>>>>>>>> in Flink.
>> > >>>>>>>>>>>>>
>> > >>>>>>>>>>>>>> only applying these optimizations to the cache seems not
>> > >>>>>>> quite
>> > >>>>>>>>>>> useful
>> > >>>>>>>>>>>>>
>> > >>>>>>>>>>>>> Filters can cut off an arbitrarily large amount of data
>> > >>>>> from the
>> > >>>>>>>>>>>>> dimension table. For a simple example, suppose in
>> dimension
>> > >>>>>>> table
>> > >>>>>>>>>>>>> 'users'
>> > >>>>>>>>>>>>> we have column 'age' with values from 20 to 40, and input
>> > >>>>> stream
>> > >>>>>>>>>>>>> 'clicks' that is ~uniformly distributed by age of users.
>> If
>> > >>>>> we
>> > >>>>>>> have
>> > >>>>>>>>>>>>> filter 'age > 30',
>> > >>>>>>>>>>>>> there will be twice less data in cache. This means the
>> user
>> > >>>>> can
>> > >>>>>>>>>>>>> increase 'lookup.cache.max-rows' by almost 2 times. It
>> will
>> > >>>>>>> gain a
>> > >>>>>>>>>>>>> huge
>> > >>>>>>>>>>>>> performance boost. Moreover, this optimization starts to
>> > >>>>> really
>> > >>>>>>>>>> shine
>> > >>>>>>>>>>>>> in 'ALL' cache, where tables without filters and
>> projections
>> > >>>>>>> can't
>> > >>>>>>>>>> fit
>> > >>>>>>>>>>>>> in memory, but with them - can. This opens up additional
>> > >>>>>>>>>> possibilities
>> > >>>>>>>>>>>>> for users. And this doesn't sound as 'not quite useful'.
>> > >>>>>>>>>>>>>
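The pruning argument above can be shown in miniature: applying the join's filter to lookup results before they enter the cache stores only rows that can ever match, so the same 'max-rows' budget covers roughly twice as many keys for a filter like age > 30 over a uniform 20..40 distribution. The names below are purely illustrative.

```java
// Tiny sketch: filter a lookup result before it is stored under its lookup
// key, so the cache never holds rows the join would discard anyway.
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

public class FilteredCachePut {

    record User(int id, int age) {}

    /** Prunes a lookup result before it is cached under its lookup key. */
    static List<User> pruneBeforeCaching(List<User> lookupResult, Predicate<User> filter) {
        return lookupResult.stream().filter(filter).collect(Collectors.toList());
    }

    public static int demo() {
        List<User> fromDatabase =
                List.of(new User(1, 25), new User(2, 35), new User(3, 40));
        // Pushed-down join filter: age > 30 -> only two of three rows are cached.
        return pruneBeforeCaching(fromDatabase, u -> u.age() > 30).size();
    }
}
```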
>> > >>>>>>>>>>>>> It would be great to hear other voices regarding this
>> topic!
>> > >>>>>>> Because
>> > >>>>>>>>>>>>> we have quite a lot of controversial points, and I think
>> > >>>>> with
>> > >>>>>>> the
>> > >>>>>>>>>> help
>> > >>>>>>>>>>>>> of others it will be easier for us to come to a consensus.
>> > >>>>>>>>>>>>>
>> > >>>>>>>>>>>>> Best regards,
>> > >>>>>>>>>>>>> Smirnov Alexander
>> > >>>>>>>>>>>>>
>> > >>>>>>>>>>>>>
>> > >>>>>>>>>>>>> пт, 29 апр. 2022 г. в 22:33, Qingsheng Ren <
>> > >>>>> renqschn@gmail.com
>> > >>>>>>>> :
>> > >>>>>>>>>>>>>>
>> > >>>>>>>>>>>>>> Hi Alexander and Arvid,
>> > >>>>>>>>>>>>>>
>> > >>>>>>>>>>>>>> Thanks for the discussion and sorry for my late response!
>> > >>>>> We
>> > >>>>>>> had
>> > >>>>>>>>>> an
>> > >>>>>>>>>>>>> internal discussion together with Jark and Leonard and I’d
>> > >>>>> like
>> > >>>>>>> to
>> > >>>>>>>>>>>>> summarize our ideas. Instead of implementing the cache
>> > >>>>> logic in
>> > >>>>>>> the
>> > >>>>>>>>>>> table
>> > >>>>>>>>>>>>> runtime layer or wrapping around the user-provided table
>> > >>>>>>> function,
>> > >>>>>>>>>> we
>> > >>>>>>>>>>>>> prefer to introduce some new APIs extending TableFunction
>> > >>>>> with
>> > >>>>>>> these
>> > >>>>>>>>>>>>> concerns:
>> > >>>>>>>>>>>>>>
>> > >>>>>>>>>>>>>> 1. Caching actually breaks the semantic of "FOR
>> > >>>>> SYSTEM_TIME
>> > >>>>>>> AS OF
>> > >>>>>>>>>>>>> proc_time”, because it couldn’t truly reflect the content
>> > >>>>> of the
>> > >>>>>>>>>> lookup
>> > >>>>>>>>>>>>> table at the moment of querying. If users choose to enable
>> > >>>>>>> caching
>> > >>>>>>>>>> on
>> > >>>>>>>>>>> the
>> > >>>>>>>>>>>>> lookup table, they implicitly indicate that this breakage
>> is
>> > >>>>>>>>>> acceptable
>> > >>>>>>>>>>>> in
>> > >>>>>>>>>>>>> exchange for the performance. So we prefer not to provide
>> > >>>>>>> caching on
>> > >>>>>>>>>>> the
>> > >>>>>>>>>>>>> table runtime level.
>> > >>>>>>>>>>>>>>
>> > >>>>>>>>>>>>>> 2. If we make the cache implementation in the framework
>> > >>>>>>>>>>>>>> (whether in a runner or a wrapper around TableFunction), we
>> > >>>>>>>>>>>>>> have to confront a situation that allows table options in DDL
>> > >>>>>>>>>>>>>> to control the behavior of the framework, which has never
>> > >>>>>>>>>>>>>> happened previously and should be treated cautiously. Under
>> > >>>>>>>>>>>>>> the current design the behavior of the framework should only
>> > >>>>>>>>>>>>>> be specified by configurations (“table.exec.xxx”), and it’s
>> > >>>>>>>>>>>>>> hard to apply these general configs to a specific table.
>> > >>>>>>>>>>>>>>
>> > >>>>>>>>>>>>>> 3. We have use cases where the lookup source loads and
>> > >>>>>>>>>>>>>> refreshes all records periodically into memory to achieve
>> > >>>>>>>>>>>>>> high lookup performance (like the Hive connector in the
>> > >>>>>>>>>>>>>> community, and also widely used by our internal connectors).
>> > >>>>>>>>>>>>>> Wrapping the cache around the user’s TableFunction works fine
>> > >>>>>>>>>>>>>> for LRU caches, but I think we have to introduce a new
>> > >>>>>>>>>>>>>> interface for this all-caching scenario and the design would
>> > >>>>>>>>>>>>>> become more complex.
>> > >>>>>>>>>>>>>>
>> > >>>>>>>>>>>>>> 4. Providing the cache in the framework might introduce
>> > >>>>>>>>>>>>>> compatibility issues for existing lookup sources, as there
>> > >>>>>>>>>>>>>> might exist two caches with totally different strategies if
>> > >>>>>>>>>>>>>> the user incorrectly configures the table (one in the
>> > >>>>>>>>>>>>>> framework and another implemented by the lookup source).
>> > >>>>>>>>>>>>>>
>> > >>>>>>>>>>>>>> As for the optimization mentioned by Alexander, I think
>> > >>>>>>>>>>>>>> filters and projections should be pushed all the way down to
>> > >>>>>>>>>>>>>> the table function, like what we do in the scan source,
>> > >>>>>>>>>>>>>> instead of the runner with the cache. The goal of using cache
>> > >>>>>>>>>>>>>> is to reduce the network I/O and pressure on the external
>> > >>>>>>>>>>>>>> system, and only applying these optimizations to the cache
>> > >>>>>>>>>>>>>> seems not quite useful.
>> > >>>>>>>>>>>>>>
>> > >>>>>>>>>>>>>> I made some updates to the FLIP[1] to reflect our ideas. We
>> > >>>>>>>>>>>>>> prefer to keep the cache implementation as a part of
>> > >>>>>>>>>>>>>> TableFunction, and we could provide some helper classes
>> > >>>>>>>>>>>>>> (CachingTableFunction, AllCachingTableFunction,
>> > >>>>>>>>>>>>>> CachingAsyncTableFunction) to developers and regulate metrics
>> > >>>>>>>>>>>>>> of the cache. Also, I made a POC[2] for your reference.
>> > >>>>>>>>>>>>>>
>> > >>>>>>>>>>>>>> Looking forward to your ideas!
>> > >>>>>>>>>>>>>>
>> > >>>>>>>>>>>>>> [1]
>> > >>>>>>>>>>>>>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-221+Abstraction+for+lookup+source+cache+and+metric
>> > >>>>>>>>>>>>>> [2] https://github.com/PatrickRen/flink/tree/FLIP-221
>> > >>>>>>>>>>>>>>
>> > >>>>>>>>>>>>>> Best regards,
>> > >>>>>>>>>>>>>>
>> > >>>>>>>>>>>>>> Qingsheng
>> > >>>>>>>>>>>>>>
>> > >>>>>>>>>>>>>> On Tue, Apr 26, 2022 at 4:45 PM Александр Смирнов <smiralexan@gmail.com> wrote:
>> > >>>>>>>>>>>>>>>
>> > >>>>>>>>>>>>>>> Thanks for the response, Arvid!
>> > >>>>>>>>>>>>>>>
>> > >>>>>>>>>>>>>>> I have a few comments on your message.
>> > >>>>>>>>>>>>>>>
>> > >>>>>>>>>>>>>>>> but could also live with an easier solution as the first step:
>> > >>>>>>>>>>>>>>>
>> > >>>>>>>>>>>>>>> I think that these 2 ways are mutually exclusive (the one
>> > >>>>>>>>>>>>>>> originally proposed by Qingsheng and mine), because
>> > >>>>>>>>>>>>>>> conceptually they follow the same goal, but the
>> > >>>>>>>>>>>>>>> implementation details are different. If we go one way,
>> > >>>>>>>>>>>>>>> moving to another way in the future will mean deleting
>> > >>>>>>>>>>>>>>> existing code and once again changing the API for
>> > >>>>>>>>>>>>>>> connectors. So I think we should reach a consensus with the
>> > >>>>>>>>>>>>>>> community about that and then work together on this FLIP,
>> > >>>>>>>>>>>>>>> i.e. divide the work into tasks for different parts of the
>> > >>>>>>>>>>>>>>> FLIP (for example, LRU cache unification / introducing the
>> > >>>>>>>>>>>>>>> proposed set of metrics / further work…). WDYT, Qingsheng?
>> > >>>>>>>>>>>>>>>
>> > >>>>>>>>>>>>>>>> as the source will only receive the requests after filter
>> > >>>>>>>>>>>>>>>
>> > >>>>>>>>>>>>>>> Actually if filters are applied to fields of the lookup
>> > >>>>>>>>>>>>>>> table, we must first do the requests, and only after that
>> > >>>>>>>>>>>>>>> can we filter the responses, because lookup connectors don't
>> > >>>>>>>>>>>>>>> have filter pushdown. So if filtering is done before
>> > >>>>>>>>>>>>>>> caching, there will be far fewer rows in the cache.
>> > >>>>>>>>>>>>>>>
>> > >>>>>>>>>>>>>>>> @Alexander unfortunately, your architecture is not shared. I don't
>> > >>>>>>>>>>>>>>>> know the solution to share images to be honest.
>> > >>>>>>>>>>>>>>>
>> > >>>>>>>>>>>>>>> Sorry for that, I’m a bit new to such kinds of
>> > >>>>>>>>>>>>>>> conversations :) I have no write access to the confluence,
>> > >>>>>>>>>>>>>>> so I made a Jira issue, where I described the proposed
>> > >>>>>>>>>>>>>>> changes in more detail -
>> > >>>>>>>>>>>>>>> https://issues.apache.org/jira/browse/FLINK-27411.
>> > >>>>>>>>>>>>>>>
>> > >>>>>>>>>>>>>>> Will be happy to get more feedback!
>> > >>>>>>>>>>>>>>>
>> > >>>>>>>>>>>>>>> Best,
>> > >>>>>>>>>>>>>>> Smirnov Alexander
>> > >>>>>>>>>>>>>>>
>> > >>>>>>>>>>>>>>> Mon, 25 Apr 2022 at 19:49, Arvid Heise <arvid@apache.org>:
>> > >>>>>>>>>>>>>>>>
>> > >>>>>>>>>>>>>>>> Hi Qingsheng,
>> > >>>>>>>>>>>>>>>>
>> > >>>>>>>>>>>>>>>> Thanks for driving this; the inconsistency was not satisfying for me.
>> > >>>>>>>>>>>>>>>>
>> > >>>>>>>>>>>>>>>> I second Alexander's idea though but could also live with
>> > >>>>>>>>>>>>>>>> an easier solution as the first step: Instead of making
>> > >>>>>>>>>>>>>>>> caching an implementation detail of TableFunction X, rather
>> > >>>>>>>>>>>>>>>> devise a caching layer around X. So the proposal would be a
>> > >>>>>>>>>>>>>>>> CachingTableFunction that delegates to X in case of misses
>> > >>>>>>>>>>>>>>>> and else manages the cache. Lifting it into the operator
>> > >>>>>>>>>>>>>>>> model as proposed would be even better but is probably
>> > >>>>>>>>>>>>>>>> unnecessary in the first step for a lookup source (as the
>> > >>>>>>>>>>>>>>>> source will only receive the requests after filter;
>> > >>>>>>>>>>>>>>>> applying projection may be more interesting to save
>> > >>>>>>>>>>>>>>>> memory).
>> > >>>>>>>>>>>>>>>>
>> > >>>>>>>>>>>>>>>> Another advantage is that all the changes of this FLIP
>> > >>>>>>>>>>>>>>>> would be limited to options, no need for new public
>> > >>>>>>>>>>>>>>>> interfaces. Everything else remains an implementation of
>> > >>>>>>>>>>>>>>>> Table runtime. That means we can easily incorporate the
>> > >>>>>>>>>>>>>>>> optimization potential that Alexander pointed out later.
>> > >>>>>>>>>>>>>>>>
>> > >>>>>>>>>>>>>>>> @Alexander unfortunately, your architecture is not shared. I don't
>> > >>>>>>>>>>>>>>>> know the solution to share images to be honest.
>> > >>>>>>>>>>>>>>>>
>> > >>>>>>>>>>>>>>>> On Fri, Apr 22, 2022 at 5:04 PM Александр Смирнов <smiralexan@gmail.com> wrote:
>> > >>>>>>>>>>>>>>>>
>> > >>>>>>>>>>>>>>>>> Hi Qingsheng! My name is Alexander, I'm not a committer
>> > >>>>>>>>>>>>>>>>> yet, but I'd really like to become one. And this FLIP
>> > >>>>>>>>>>>>>>>>> really interested me. Actually I have worked on a similar
>> > >>>>>>>>>>>>>>>>> feature in my company’s Flink fork, and we would like to
>> > >>>>>>>>>>>>>>>>> share our thoughts on this and make the code open source.
>> > >>>>>>>>>>>>>>>>>
>> > >>>>>>>>>>>>>>>>> I think there is a better alternative than introducing an
>> > >>>>>>>>>>>>>>>>> abstract class for TableFunction (CachingTableFunction).
>> > >>>>>>>>>>>>>>>>> As you know, TableFunction exists in the
>> > >>>>>>>>>>>>>>>>> flink-table-common module, which provides only an API for
>> > >>>>>>>>>>>>>>>>> working with tables – it’s very convenient for importing
>> > >>>>>>>>>>>>>>>>> in connectors. In turn, CachingTableFunction contains
>> > >>>>>>>>>>>>>>>>> logic for runtime execution, so this class and everything
>> > >>>>>>>>>>>>>>>>> connected with it should be located in another module,
>> > >>>>>>>>>>>>>>>>> probably in flink-table-runtime. But this will require
>> > >>>>>>>>>>>>>>>>> connectors to depend on another module, which contains a
>> > >>>>>>>>>>>>>>>>> lot of runtime logic, which doesn’t sound good.
>> > >>>>>>>>>>>>>>>>>
>> > >>>>>>>>>>>>>>>>> I suggest adding a new method ‘getLookupConfig’ to
>> > >>>>>>>>>>>>>>>>> LookupTableSource or LookupRuntimeProvider to allow
>> > >>>>>>>>>>>>>>>>> connectors to only pass configurations to the planner, so
>> > >>>>>>>>>>>>>>>>> they won’t depend on the runtime implementation. Based on
>> > >>>>>>>>>>>>>>>>> these configs the planner will construct a lookup join
>> > >>>>>>>>>>>>>>>>> operator with the corresponding runtime logic
>> > >>>>>>>>>>>>>>>>> (ProcessFunctions in module flink-table-runtime). The
>> > >>>>>>>>>>>>>>>>> architecture looks like in the pinned image (the
>> > >>>>>>>>>>>>>>>>> LookupConfig class there is actually your CacheConfig).
>> > >>>>>>>>>>>>>>>>>
>> > >>>>>>>>>>>>>>>>> The classes in flink-table-planner that will be
>> > >>>>>>>>>>>>>>>>> responsible for this are CommonPhysicalLookupJoin and its
>> > >>>>>>>>>>>>>>>>> inheritors. The current classes for lookup join in
>> > >>>>>>>>>>>>>>>>> flink-table-runtime are LookupJoinRunner,
>> > >>>>>>>>>>>>>>>>> AsyncLookupJoinRunner, LookupJoinRunnerWithCalc and
>> > >>>>>>>>>>>>>>>>> AsyncLookupJoinRunnerWithCalc.
>> > >>>>>>>>>>>>>>>>>
>> > >>>>>>>>>>>>>>>>> I suggest adding classes LookupJoinCachingRunner,
>> > >>>>>>>>>>>>>>>>> LookupJoinCachingRunnerWithCalc, etc.
>> > >>>>>>>>>>>>>>>>>
>> > >>>>>>>>>>>>>>>>> And here comes another, more powerful advantage of such a
>> > >>>>>>>>>>>>>>>>> solution. If we have the caching logic on a lower level,
>> > >>>>>>>>>>>>>>>>> we can apply some optimizations to it.
>> > >>>>>>>>>>>>>>>>> LookupJoinRunnerWithCalc was named like this because it
>> > >>>>>>>>>>>>>>>>> uses the ‘calc’ function, which actually mostly consists
>> > >>>>>>>>>>>>>>>>> of filters and projections.
>> > >>>>>>>>>>>>>>>>>
>> > >>>>>>>>>>>>>>>>> For example, in a join of table A with lookup table B
>> > >>>>>>>>>>>>>>>>> with the condition ‘JOIN … ON A.id = B.id AND A.age =
>> > >>>>>>>>>>>>>>>>> B.age + 10 WHERE B.salary > 1000’, the ‘calc’ function
>> > >>>>>>>>>>>>>>>>> will contain the filters A.age = B.age + 10 and B.salary
>> > >>>>>>>>>>>>>>>>> > 1000.
>> > >>>>>>>>>>>>>>>>>
>> > >>>>>>>>>>>>>>>>> If we apply this function before storing records in the
>> > >>>>>>>>>>>>>>>>> cache, the size of the cache will be significantly
>> > >>>>>>>>>>>>>>>>> reduced: filters = avoid storing useless records in the
>> > >>>>>>>>>>>>>>>>> cache, projections = reduce the records’ size. So the
>> > >>>>>>>>>>>>>>>>> initial max number of records in the cache can be
>> > >>>>>>>>>>>>>>>>> increased by the user.
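[Editor's note: a minimal, dependency-free sketch of the idea above — applying the 'calc' function (filters + projections) to lookup results before they enter the cache. The plain-string row type, predicate, and cache shape are illustrative stand-ins, not Flink's actual RowData or runner classes.]

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.function.Predicate;

/** Sketch: run the calc (filter + projection) before caching, so rows that the
 *  join would discard anyway never occupy cache memory. */
class CalcBeforeCache {
    private final Map<String, List<String>> cache = new HashMap<>();
    private final Predicate<String> filter;         // e.g. B.salary > 1000
    private final Function<String, String> project; // e.g. keep only needed columns

    CalcBeforeCache(Predicate<String> filter, Function<String, String> project) {
        this.filter = filter;
        this.project = project;
    }

    /** Applies the calc to freshly looked-up rows, then caches the reduced result. */
    List<String> cacheLookupResult(String key, List<String> lookedUpRows) {
        List<String> reduced = new ArrayList<>();
        for (String row : lookedUpRows) {
            if (filter.test(row)) {
                reduced.add(project.apply(row));
            }
        }
        cache.put(key, reduced); // possibly an empty list: a cheap negative entry
        return reduced;
    }

    List<String> get(String key) {
        return cache.get(key);
    }

    public static void main(String[] args) {
        CalcBeforeCache cache = new CalcBeforeCache(
                row -> row.startsWith("keep"), String::toUpperCase);
        System.out.println(cache.cacheLookupResult("id=1",
                java.util.Arrays.asList("keep-row", "drop-row"))); // prints "[KEEP-ROW]"
    }
}
```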
>> > >>>>>>>>>>>>>>>>>
>> > >>>>>>>>>>>>>>>>> What do you think about it?
>> > >>>>>>>>>>>>>>>>>
>> > >>>>>>>>>>>>>>>>>
>> > >>>>>>>>>>>>>>>>> On 2022/04/19 02:47:11 Qingsheng Ren wrote:
>> > >>>>>>>>>>>>>>>>>> Hi devs,
>> > >>>>>>>>>>>>>>>>>>
>> > >>>>>>>>>>>>>>>>>> Yuan and I would like to start a discussion about
>> > >>>>>>>>>>>>>>>>>> FLIP-221[1], which introduces an abstraction of lookup
>> > >>>>>>>>>>>>>>>>>> table cache and its standard metrics.
>> > >>>>>>>>>>>>>>>>>>
>> > >>>>>>>>>>>>>>>>>> Currently each lookup table source has to implement its
>> > >>>>>>>>>>>>>>>>>> own cache to store lookup results, and there isn’t a
>> > >>>>>>>>>>>>>>>>>> standard set of metrics for users and developers for
>> > >>>>>>>>>>>>>>>>>> tuning their jobs with lookup joins, which is a quite
>> > >>>>>>>>>>>>>>>>>> common use case in Flink table / SQL.
>> > >>>>>>>>>>>>>>>>>>
>> > >>>>>>>>>>>>>>>>>> Therefore we propose some new APIs including cache,
>> > >>>>>>>>>>>>>>>>>> metrics, wrapper classes of TableFunction and new table
>> > >>>>>>>>>>>>>>>>>> options. Please take a look at the FLIP page [1] to get
>> > >>>>>>>>>>>>>>>>>> more details. Any suggestions and comments would be
>> > >>>>>>>>>>>>>>>>>> appreciated!
>> > >>>>>>>>>>>>>>>>>>
>> > >>>>>>>>>>>>>>>>> [1]
>> > >>>>>>>>>>>>>>>>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-221+Abstraction+for+lookup+source+cache+and+metric
>> > >>>>>>>>>>>>>>>>>>
>> > >>>>>>>>>>>>>>>>>> Best regards,
>> > >>>>>>>>>>>>>>>>>>
>> > >>>>>>>>>>>>>>>>>> Qingsheng
>> > >>>>>>>>>>>>>>>>>>
>> > >>>>>>>>>>>>>>>>>>
>> > >>>>>>>>>>>>>>>>>
>> > >>>>>>>>>>>>>>
>> > >>>>>>>>>>>>>>
>> > >>>>>>>>>>>>>>
>> > >>>>>>>>>>>>>> --
>> > >>>>>>>>>>>>>> Best Regards,
>> > >>>>>>>>>>>>>>
>> > >>>>>>>>>>>>>> Qingsheng Ren
>> > >>>>>>>>>>>>>>
>> > >>>>>>>>>>>>>> Real-time Computing Team
>> > >>>>>>>>>>>>>> Alibaba Cloud
>> > >>>>>>>>>>>>>>
>> > >>>>>>>>>>>>>> Email: renqschn@gmail.com
>> > >>>>>>>>>>>>>
>> > >>>>>>>>>>>>
>> > >>>>>>>>>>>
>> > >>>>>>>>>>
>> > >>>>>>>>>
>> > >>>>>>>>>
>> > >>>>>>>>> --
>> > >>>>>>>>> Best regards,
>> > >>>>>>>>> Roman Boyko
>> > >>>>>>>>> e.: ro.v.boyko@gmail.com
>> > >>>>>>>>>
>> > >>>>>>>
>> > >>>>>
>> > >>>>>
>> > >>>
>> > >>
>> > >>
>> > >> --
>> > >> Best Regards,
>> > >>
>> > >> Qingsheng Ren
>> > >>
>> > >> Real-time Computing Team
>> > >> Alibaba Cloud
>> > >>
>> > >> Email: renqschn@gmail.com
>> >
>>
>

Re: [DISCUSS] FLIP-221 Abstraction for lookup source cache and metric

Posted by Jark Wu <im...@gmail.com>.
Hi Alex,

1) retry logic
I think we can extract some common retry logic into utilities, e.g.
RetryUtils#tryTimes(times, call).
This seems independent of this FLIP and can be reused by DataStream users.
Maybe we can open an issue to discuss this and where to put it.
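[Editor's note: for illustration, a minimal sketch of what such a RetryUtils#tryTimes helper could look like. The class and method names follow Jark's suggestion above; this is not an existing Flink utility.]

```java
import java.util.concurrent.Callable;

/** Hypothetical sketch of the RetryUtils#tryTimes helper suggested above. */
public final class RetryUtils {

    /** Invokes the call, retrying up to {@code times} attempts before rethrowing. */
    public static <T> T tryTimes(int times, Callable<T> call) throws Exception {
        if (times <= 0) {
            throw new IllegalArgumentException("times must be positive");
        }
        Exception last = null;
        for (int attempt = 1; attempt <= times; attempt++) {
            try {
                return call.call();
            } catch (Exception e) {
                last = e; // a real helper could back off or filter retriable errors here
            }
        }
        throw last;
    }

    public static void main(String[] args) throws Exception {
        final int[] failures = {2};
        // Fails twice, then succeeds on the third attempt.
        String result = tryTimes(3, () -> {
            if (failures[0]-- > 0) {
                throw new RuntimeException("transient failure");
            }
            return "ok";
        });
        System.out.println(result); // prints "ok"
    }
}
```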

2) cache ConfigOptions
I'm fine with defining cache config options in the framework.
A candidate place to put them is FactoryUtil, which also includes the
"sink.parallelism" and "format" options.
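[Editor's note: a hedged sketch of the shared-option idea. In Flink these would be ConfigOption instances declared alongside "sink.parallelism" in FactoryUtil; the option keys and defaults below are purely illustrative stand-ins, and the Option class is a simplified substitute for Flink's ConfigOption.]

```java
import java.time.Duration;

/** Hypothetical shared cache option keys that every connector could reuse,
 *  giving unified option naming; names and defaults are illustrative only. */
final class LookupCacheOptions {

    /** Simplified stand-in for Flink's ConfigOption, just for this sketch. */
    static final class Option<T> {
        final String key;
        final T defaultValue;

        Option(String key, T defaultValue) {
            this.key = key;
            this.defaultValue = defaultValue;
        }
    }

    static final Option<Long> CACHE_MAX_ROWS =
            new Option<>("lookup.cache.max-rows", 10_000L);
    static final Option<Duration> CACHE_TTL =
            new Option<>("lookup.cache.ttl", Duration.ofMinutes(10));

    public static void main(String[] args) {
        // Every connector referencing these constants uses identical option keys.
        System.out.println(CACHE_MAX_ROWS.key); // prints "lookup.cache.max-rows"
    }
}
```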

Best,
Jark


On Wed, 18 May 2022 at 13:52, Александр Смирнов <sm...@gmail.com>
wrote:

> Hi Qingsheng,
>
> Thank you for considering my comments.
>
> >  there might be custom logic before retrying, such as re-establishing
> the connection
>
> Yes, I understand that. I meant that such logic can be placed in a
> separate function that can be implemented by connectors. Just moving
> the retry logic would make the connector's LookupFunction more concise
> + avoid duplicate code. However, it's a minor change. The decision is
> up to you.
>
> > We decided not to provide common DDL options and to let developers define
> their own options per connector, as we do now.
>
> What is the reason for that? One of the main goals of this FLIP was to
> unify the configs, wasn't it? I understand that the current cache design
> doesn't depend on ConfigOptions as it did before. But we can still put
> these options into the framework, so connectors can reuse them and
> avoid code duplication and, more importantly, avoid inconsistent
> option naming. This can be pointed out in the documentation for
> connector developers.
>
> Best regards,
> Alexander
>
> Tue, 17 May 2022 at 17:11, Qingsheng Ren <re...@gmail.com>:
> >
> > Hi Alexander,
> >
> > Thanks for the review and glad to see we are on the same page! I think
> you forgot to cc the dev mailing list so I’m also quoting your reply under
> this email.
> >
> > >  We can add 'maxRetryTimes' option into this class
> >
> > In my opinion the retry logic should be implemented in lookup() instead
> of in LookupFunction#eval(). Retrying is only meaningful under some
> specific retriable failures, and there might be custom logic before
> retrying, such as re-establishing the connection (JdbcRowDataLookupFunction
> is an example), so it's more convenient to leave it to the connector.
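[Editor's note: a minimal sketch of that division of responsibility, using simplified stand-ins for the FLIP's interfaces rather than the real Flink classes: the framework-facing eval() delegates to lookup(), and the connector keeps its own retry and reconnection logic inside lookup().]

```java
import java.util.Collections;
import java.util.List;

/** Connector-side function: retries (with reconnection) live inside lookup(). */
class RetryingLookupFunction extends LookupFunction {
    private static final int MAX_RETRY_TIMES = 3;
    private int attemptsSeen = 0; // just for the demo below

    @Override
    protected List<String> lookup(String key) throws Exception {
        Exception last = null;
        for (int i = 0; i < MAX_RETRY_TIMES; i++) {
            try {
                return doLookup(key);
            } catch (Exception e) {
                last = e;
                reestablishConnection(); // custom recovery before the next attempt
            }
        }
        throw last;
    }

    private List<String> doLookup(String key) throws Exception {
        if (++attemptsSeen < 2) {
            throw new Exception("transient connection failure");
        }
        return Collections.singletonList(key + "-value");
    }

    private void reestablishConnection() { /* reopen the JDBC connection, etc. */ }

    public static void main(String[] args) throws Exception {
        System.out.println(new RetryingLookupFunction().eval("k1")); // prints "[k1-value]"
    }
}

/** Simplified stand-in for the FLIP's LookupFunction: eval() delegates to lookup(). */
abstract class LookupFunction {
    /** Framework entry point: one lookup per key. */
    public List<String> eval(String key) throws Exception {
        return lookup(key);
    }

    /** Connector-specific lookup; retries and reconnects live here. */
    protected abstract List<String> lookup(String key) throws Exception;
}
```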
> >
> > > I don't see DDL options, that were in previous version of FLIP. Do you
> have any special plans for them?
> >
> > We decided not to provide common DDL options and to let developers define
> their own options per connector, as we do now.
> >
> > The rest of the comments sound great and I’ll update the FLIP. Hope we can
> finalize our proposal soon!
> >
> > Best,
> >
> > Qingsheng
> >
> >
> > > On May 17, 2022, at 13:46, Александр Смирнов <sm...@gmail.com>
> wrote:
> > >
> > > Hi Qingsheng and devs!
> > >
> > > I like the overall design of updated FLIP, however I have several
> > > suggestions and questions.
> > >
> > > 1) Introducing LookupFunction as a subclass of TableFunction is a good
> > > idea. We can add the 'maxRetryTimes' option to this class. The 'eval'
> > > method of the new LookupFunction is great for this purpose. The same
> > > applies to the 'async' case.
> > >
> > > 2) There might be other configs in future, such as 'cacheMissingKey'
> > > in LookupFunctionProvider or 'rescanInterval' in ScanRuntimeProvider.
> > > Maybe use Builder pattern in LookupFunctionProvider and
> > > RescanRuntimeProvider for more flexibility (use one 'build' method
> > > instead of many 'of' methods in future)?
> > >
> > > 3) What are the plans for existing TableFunctionProvider and
> > > AsyncTableFunctionProvider? I think they should be deprecated.
> > >
> > > 4) Am I right that the current design does not involve using a
> > > user-provided LookupCache in re-scanning? In this case, it is not very
> > > clear why we need methods such as 'invalidate' or 'putAll' in
> > > LookupCache.
> > >
> > > 5) I don't see DDL options, that were in previous version of FLIP. Do
> > > you have any special plans for them?
> > >
> > > If you don't mind, I would be glad to be able to make small
> > > adjustments to the FLIP document too. I think it's worth mentioning
> > > exactly which optimizations are planned for the future.
> > >
> > > Best regards,
> > > Smirnov Alexander
> > >
> > > Fri, 13 May 2022 at 20:27, Qingsheng Ren <re...@gmail.com>:
> > >>
> > >> Hi Alexander and devs,
> > >>
> > >> Thank you very much for the in-depth discussion! As Jark mentioned, we
> were inspired by Alexander's idea and refactored our design.
> FLIP-221 [1] has been updated to reflect our design now and we are happy to
> hear more suggestions from you!
> > >>
> > >> Compared to the previous design:
> > >> 1. The lookup cache serves at the table runtime level and is integrated
> as a component of LookupJoinRunner, as discussed previously.
> > >> 2. Interfaces are renamed and re-designed to reflect the new design.
> > >> 3. We treat the all-caching case separately and introduce a new
> RescanRuntimeProvider to reuse the scanning ability. We are planning to
> support SourceFunction / InputFormat for now, considering the complexity of
> the FLIP-27 Source API.
> > >> 4. A new interface LookupFunction is introduced to make the semantic
> of lookup more straightforward for developers.
> > >>
> > >> For replying to Alexander:
> > >>> However I'm a little confused whether InputFormat is deprecated or
> not. Am I right that it will be so in the future, but currently it's not?
> > >> Yes you are right. InputFormat is not deprecated for now. I think it
> will be deprecated in the future but we don't have a clear plan for that.
> > >>
> > >> Thanks again for the discussion on this FLIP and looking forward to
> cooperating with you after we finalize the design and interfaces!
> > >>
> > >> [1]
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-221+Abstraction+for+lookup+source+cache+and+metric
> > >>
> > >> Best regards,
> > >>
> > >> Qingsheng
> > >>
> > >>
> > >> On Fri, May 13, 2022 at 12:12 AM Александр Смирнов <smiralexan@gmail.com> wrote:
> > >>>
> > >>> Hi Jark, Qingsheng and Leonard!
> > >>>
> > >>> Glad to see that we came to a consensus on almost all points!
> > >>>
> > >>> However I'm a little confused whether InputFormat is deprecated or
> > >>> not. Am I right that it will be so in the future, but currently it's
> > >>> not? Actually I also think that for the first version it's OK to use
> > >>> InputFormat in the ALL cache implementation, because supporting the
> > >>> rescan ability seems like a very distant prospect. But for this
> > >>> decision we need a consensus among all discussion participants.
> > >>>
> > >>> In general, I don't have anything to argue with in your statements.
> > >>> All of them correspond to my ideas. Looking ahead, it would be nice to
> > >>> work on this FLIP cooperatively. I've already done a lot of work on
> > >>> lookup join caching with an implementation very close to the one we
> > >>> are discussing, and want to share the results of this work. Anyway,
> > >>> looking forward to the FLIP update!
> > >>>
> > >>> Best regards,
> > >>> Smirnov Alexander
> > >>>
> > >>> Thu, 12 May 2022 at 17:38, Jark Wu <im...@gmail.com>:
> > >>>>
> > >>>> Hi Alex,
> > >>>>
> > >>>> Thanks for summarizing your points.
> > >>>>
> > >>>> In the past week, Qingsheng, Leonard, and I have discussed it
> several times
> > >>>> and we have totally refactored the design.
> > >>>> I'm glad to say we have reached a consensus on many of your points!
> > >>>> Qingsheng is still working on updating the design docs, which may be
> > >>>> available in the next few days.
> > >>>> I will share some conclusions from our discussions:
> > >>>>
> > >>>> 1) we have refactored the design towards to "cache in framework"
> way.
> > >>>>
> > >>>> 2) a "LookupCache" interface for users to customize and a default
> > >>>> implementation with builder for users to easy-use.
> > >>>> This can both make it possible to both have flexibility and
> conciseness.
> > >>>>
> > >>>> 3) Filter pushdown is important for ALL and LRU lookup caches,
> > >>>> especially for reducing IO.
> > >>>> Filter pushdown should be the final state and the unified way to
> both
> > >>>> support pruning ALL cache and LRU cache,
> > >>>> so I think we should make an effort in this direction. If we need to
> support
> > >>>> filter pushdown for ALL cache anyway, why not use
> > >>>> it for LRU cache as well? Either way, as we decide to implement the
> cache
> > >>>> in the framework, we have the chance to support
> > >>>> filter on cache anytime. This is an optimization and it doesn't
> affect the
> > >>>> public API. I think we can create a JIRA issue to
> > >>>> discuss it when the FLIP is accepted.
> > >>>>
> > >>>> 4) The idea to support ALL cache is similar to your proposal.
> > >>>> In the first version, we will only support InputFormat,
> SourceFunction for
> > >>>> cache all (invoke InputFormat in join operator).
> > >>>> For FLIP-27 source, we need to join a true source operator instead
> of
> > >>>> calling it embedded in the join operator.
> > >>>> However, this needs another FLIP to support the re-scan ability for
> > >>>> the FLIP-27 Source, and this can be a large amount of work.
> > >>>> In order to not block this issue, we can put the effort of FLIP-27
> source
> > >>>> integration into future work and integrate
> > >>>> InputFormat&SourceFunction for now.
> > >>>>
> > >>>> I think it's fine to use InputFormat&SourceFunction, as they are not
> > >>>> deprecated; otherwise, we would have to introduce another function
> > >>>> similar to them, which is meaningless. We need to plan FLIP-27 source
> > >>>> integration ASAP before InputFormat & SourceFunction are deprecated.
> > >>>>
> > >>>> Best,
> > >>>> Jark
> > >>>>
> > >>>>
> > >>>>
> > >>>>
> > >>>>
> > >>>> On Thu, 12 May 2022 at 15:46, Александр Смирнов <
> smiralexan@gmail.com>
> > >>>> wrote:
> > >>>>
> > >>>>> Hi Martijn!
> > >>>>>
> > >>>>> Got it. Therefore, the implementation with InputFormat is not
> > >>>>> considered.
> > >>>>> Thanks for clearing that up!
> > >>>>>
> > >>>>> Best regards,
> > >>>>> Smirnov Alexander
> > >>>>>
> > >>>>> Thu, 12 May 2022 at 14:23, Martijn Visser <martijn@ververica.com>:
> > >>>>>>
> > >>>>>> Hi,
> > >>>>>>
> > >>>>>> With regards to:
> > >>>>>>
> > >>>>>>> But if there are plans to refactor all connectors to FLIP-27
> > >>>>>>
> > >>>>>> Yes, FLIP-27 is the target for all connectors. The old interfaces
> will be
> > >>>>>> deprecated and connectors will either be refactored to use the
> new ones
> > >>>>> or
> > >>>>>> dropped.
> > >>>>>>
> > >>>>>> The caching should work for connectors that are using FLIP-27
> interfaces,
> > >>>>>> we should not introduce new features for old interfaces.
> > >>>>>>
> > >>>>>> Best regards,
> > >>>>>>
> > >>>>>> Martijn
> > >>>>>>
> > >>>>>> On Thu, 12 May 2022 at 06:19, Александр Смирнов <
> smiralexan@gmail.com>
> > >>>>>> wrote:
> > >>>>>>
> > >>>>>>> Hi Jark!
> > >>>>>>>
> > >>>>>>> Sorry for the late response. I would like to make some comments
> > >>>>>>> and clarify my points.
> > >>>>>>>
> > >>>>>>> 1) I agree with your first statement. I think we can achieve both
> > >>>>>>> advantages this way: put the Cache interface in flink-table-common,
> > >>>>>>> but have implementations of it in flink-table-runtime. Therefore if
> > >>>>>>> a connector developer wants to use existing cache strategies and
> > >>>>>>> their implementations, he can just pass lookupConfig to the
> > >>>>>>> planner, but if he wants to have his own cache implementation in
> > >>>>>>> his TableFunction, it will be possible for him to use the existing
> > >>>>>>> interface for this purpose (we can explicitly point this out in the
> > >>>>>>> documentation). In this way all configs and metrics will be
> > >>>>>>> unified. WDYT?
> > >>>>>>>
> > >>>>>>>> If a filter can prune 90% of data in the cache, we will have 90% of
> > >>>>>>>> lookup requests that can never be cached
> > >>>>>>>
> > >>>>>>> 2) Let me clarify the logic of the filters optimization in the
> > >>>>>>> case of LRU cache. It looks like Cache<RowData,
> > >>>>>>> Collection<RowData>>. Here we always store the response of the
> > >>>>>>> dimension table in cache, even after applying the calc function.
> > >>>>>>> I.e. if there are no rows after applying filters to the result of
> > >>>>>>> the 'eval' method of TableFunction, we store the empty list by
> > >>>>>>> lookup keys. Therefore the cache line will be filled, but will
> > >>>>>>> require much less memory (in bytes). I.e. we don't completely
> > >>>>>>> filter out keys whose result was pruned, but we significantly
> > >>>>>>> reduce the memory required to store this result. If the user knows
> > >>>>>>> about this behavior, he can increase the 'max-rows' option before
> > >>>>>>> the start of the job. But actually I came up with the idea that we
> > >>>>>>> can do this automatically by using the 'maximumWeight' and
> > >>>>>>> 'weigher' methods of the Guava cache [1]. The weight can be the
> > >>>>>>> size of the collection of rows (the cache value). Therefore the
> > >>>>>>> cache can automatically fit many more records than before.
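[Editor's note: Guava's CacheBuilder does provide maximumWeight(long) and weigher(...). To keep this sketch dependency-free, here is a plain-Java illustration of the same idea: the cache is bounded by the total number of cached rows rather than the number of keys, so empty post-filter results cost almost nothing. FIFO eviction stands in for Guava's real eviction policy, and re-puts of an existing key are ignored for brevity.]

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Weight-bounded cache: total weight = sum of rows per entry (the "weigher"). */
class WeightBoundedLookupCache {
    private final long maxWeight;
    private long currentWeight = 0;
    private final Map<String, List<String>> entries = new HashMap<>();
    private final Deque<String> insertionOrder = new ArrayDeque<>(); // FIFO eviction

    WeightBoundedLookupCache(long maxWeight) {
        this.maxWeight = maxWeight;
    }

    void put(String key, List<String> rows) {
        long weight = rows.size(); // the "weigher": weight = number of rows
        // Evict oldest entries until the new entry fits under the weight bound.
        while (currentWeight + weight > maxWeight && !insertionOrder.isEmpty()) {
            String evicted = insertionOrder.poll();
            currentWeight -= entries.remove(evicted).size();
        }
        entries.put(key, rows);
        insertionOrder.add(key);
        currentWeight += weight;
    }

    List<String> get(String key) {
        return entries.get(key);
    }

    int size() {
        return entries.size();
    }

    public static void main(String[] args) {
        WeightBoundedLookupCache cache = new WeightBoundedLookupCache(3);
        cache.put("a", java.util.Arrays.asList("r1", "r2"));
        cache.put("b", java.util.Collections.emptyList()); // empty result: weight 0
        cache.put("c", java.util.Arrays.asList("r3", "r4")); // evicts "a"
        System.out.println(cache.size()); // prints "2"
    }
}
```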
> > >>>>>>>
> > >>>>>>>> Flink SQL has provided a standard way to do filters and projects
> > >>>>>>>> pushdown, i.e., SupportsFilterPushDown and
> > >>>>>>>> SupportsProjectionPushDown. Jdbc/hive/HBase haven't implemented
> > >>>>>>>> the interfaces, which doesn't mean it's hard to implement.
> > >>>>>>>
> > >>>>>>> It's debatable how difficult it will be to implement filter
> > >>>>>>> pushdown. But I think the fact that currently there is no database
> > >>>>>>> connector with filter pushdown at least means that this feature
> > >>>>>>> won't be supported soon in connectors. Moreover, if we talk about
> > >>>>>>> other connectors (not in the Flink repo), their databases might not
> > >>>>>>> support all Flink filters (or not support filters at all). I think
> > >>>>>>> users are interested in supporting the cache filters optimization
> > >>>>>>> independently of support for other features and the solving of
> > >>>>>>> more complex (or unsolvable) problems.
> > >>>>>>>
> > >>>>>>> 3) I agree with your third statement. Actually in our internal
> > >>>>>>> version I also tried to unify the logic of scanning and reloading
> > >>>>>>> data from connectors. But unfortunately, I didn't find a way to
> > >>>>>>> unify the logic of all ScanRuntimeProviders (InputFormat,
> > >>>>>>> SourceFunction, Source,...) and reuse it in reloading the ALL
> > >>>>>>> cache. As a result I settled on using InputFormat, because it was
> > >>>>>>> used for scanning in all lookup connectors. (I didn't know that
> > >>>>>>> there are plans to deprecate InputFormat in favor of the FLIP-27
> > >>>>>>> Source). IMO usage of the FLIP-27 source in ALL caching is not a
> > >>>>>>> good idea, because this source was designed to work in a
> > >>>>>>> distributed environment (SplitEnumerator on the JobManager and
> > >>>>>>> SourceReaders on TaskManagers), not in one operator (the lookup
> > >>>>>>> join operator in our case). There is even no direct way to pass
> > >>>>>>> splits from SplitEnumerator to SourceReader (this logic works
> > >>>>>>> through SplitEnumeratorContext, which requires
> > >>>>>>> OperatorCoordinator.SubtaskGateway to send AddSplitEvents). Usage
> > >>>>>>> of InputFormat for the ALL cache seems much clearer and easier.
> > >>>>>>> But if there are plans to refactor all connectors to FLIP-27, I
> > >>>>>>> have the following idea: maybe we can drop the lookup join ALL
> > >>>>>>> cache in favor of a simple join with multiple scans of a batch
> > >>>>>>> source? The point is that the only difference between the lookup
> > >>>>>>> join ALL cache and a simple join with a batch source is that in
> > >>>>>>> the first case scanning is performed multiple times, in between
> > >>>>>>> which the state (cache) is cleared (correct me if I'm wrong). So
> > >>>>>>> what if we extend the functionality of the simple join to support
> > >>>>>>> state reloading + extend the functionality of scanning a batch
> > >>>>>>> source multiple times (this one should be easy with the new
> > >>>>>>> FLIP-27 source, which unifies streaming/batch reading - we would
> > >>>>>>> only need to change the SplitEnumerator, which would pass the
> > >>>>>>> splits again after some TTL). WDYT? I must say that this looks
> > >>>>>>> like a long-term goal and will make the scope of this FLIP even
> > >>>>>>> larger than you said. Maybe we can limit ourselves to a simpler
> > >>>>>>> solution now (InputFormats).
> > >>>>>>>
> > >>>>>>> So to sum up, my points are these:
> > >>>>>>> 1) There is a way to make both concise and flexible interfaces
> for
> > >>>>>>> caching in lookup join.
> > >>>>>>> 2) Cache filters optimization is important both in LRU and ALL
> caches.
> > >>>>>>> 3) It is unclear when filter pushdown will be supported in Flink
> > >>>>>>> connectors; some of the connectors might not have the
> opportunity to
> > >>>>>>> support filter pushdown + as I know, currently filter pushdown
> works
> > >>>>>>> only for scanning (not lookup). So cache filters + projections
> > >>>>>>> optimization should be independent of other features.
> > >>>>>>> 4) ALL cache realization is a complex topic that involves
> multiple
> > >>>>>>> aspects of how Flink is developing. Abandoning InputFormat in
> favor
> > >>>>>>> of FLIP-27 Source will make ALL cache realization really complex
> and
> > >>>>>>> not clear, so maybe instead of that we can extend the
> functionality of
> > >>>>>>> simple join, or keep InputFormat in the case of the lookup
> join ALL
> > >>>>>>> cache?
> > >>>>>>>
> > >>>>>>> Best regards,
> > >>>>>>> Smirnov Alexander
> > >>>>>>>
> > >>>>>>>
> > >>>>>>>
> > >>>>>>>
> > >>>>>>>
> > >>>>>>>
> > >>>>>>>
> > >>>>>>> [1]
> > >>>>>>>
> > >>>>>
> https://guava.dev/releases/18.0/api/docs/com/google/common/cache/CacheBuilder.html#weigher(com.google.common.cache.Weigher)
> > >>>>>>>
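The weigher/maximum-size idea behind the Guava link in [1] can be sketched with a plain size-bounded LRU map. This is illustrative only — it is not the Guava API and not any FLIP-221 interface; `maxRows` simply mirrors the semantics of the 'lookup.cache.max-rows' option discussed in this thread:

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Minimal sketch of an LRU cache bounded by a max-rows limit, similar in
 *  spirit to Guava's CacheBuilder.maximumSize. All names are illustrative. */
public class LruLookupCache<K, V> {
    private final Map<K, V> entries;

    public LruLookupCache(int maxRows) {
        // accessOrder=true turns the LinkedHashMap into an LRU structure;
        // removeEldestEntry evicts once maxRows is exceeded.
        this.entries = new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                return size() > maxRows;
            }
        };
    }

    public V get(K key) { return entries.get(key); }
    public void put(K key, V value) { entries.put(key, value); }
    public int size() { return entries.size(); }
}
```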
> > >>>>>>> Thu, 5 May 2022 at 20:34, Jark Wu <im...@gmail.com>:
> > >>>>>>>>
> > >>>>>>>> It's great to see the active discussion! I want to share my
> ideas:
> > >>>>>>>>
> > >>>>>>>> 1) implement the cache in framework vs. connectors base
> > >>>>>>>> I don't have a strong opinion on this. Both ways should work
> (e.g.,
> > >>>>> cache
> > >>>>>>>> pruning, compatibility).
> > >>>>>>>> The framework way can provide more concise interfaces.
> > >>>>>>>> The connector base way can define more flexible cache
> > >>>>>>>> strategies/implementations.
> > >>>>>>>> We are still investigating a way to see if we can have both
> > >>>>> advantages.
> > >>>>>>>> We should reach a consensus that the way should be a final
> state,
> > >>>>> and we
> > >>>>>>>> are on the path to it.
> > >>>>>>>>
> > >>>>>>>> 2) filters and projections pushdown:
> > >>>>>>>> I agree with Alex that the filter pushdown into cache can
> benefit a
> > >>>>> lot
> > >>>>>>> for
> > >>>>>>>> ALL cache.
> > >>>>>>>> However, this is not true for LRU cache. Connectors use cache to
> > >>>>> reduce
> > >>>>>>> IO
> > >>>>>>>> requests to databases for better throughput.
> > >>>>>>>> If a filter can prune 90% of data in the cache, we will have
> 90% of
> > >>>>>>> lookup
> > >>>>>>>> requests that can never be cached
> > >>>>>>>> and will hit the databases directly. That means the cache is
> > >>>>> meaningless in
> > >>>>>>>> this case.
> > >>>>>>>>
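Jark's point about filters hurting the LRU hit rate can be made concrete with a toy simulation (all names and numbers are invented purely for illustration): if rows pruned by a filter are never cached, every lookup for such a row goes back to the database.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.IntPredicate;

/** Tiny simulation of the argument above: rows failing the filter are never
 *  cached, so repeated lookups for them always reach the "database". */
public class FilteredLruSimulation {

    /** Returns how many lookups reached the database. */
    public static int databaseHits(int[] lookupKeys, IntPredicate filter) {
        Map<Integer, Boolean> cache = new HashMap<>(); // key -> cached?
        int dbHits = 0;
        for (int key : lookupKeys) {
            if (cache.containsKey(key)) {
                continue; // cache hit, no external I/O
            }
            dbHits++; // miss: query the database
            if (filter.test(key)) {
                cache.put(key, true); // only filtered-in rows are cached
            }
            // rows failing the filter are never cached, so the next lookup
            // for the same key will hit the database again
        }
        return dbHits;
    }
}
```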
> > >>>>>>>> IMO, Flink SQL has provided a standard way to do filters and
> projects
> > >>>>>>>> pushdown, i.e., SupportsFilterPushDown and
> > >>>>> SupportsProjectionPushDown.
> > >>>>>>>> JDBC/Hive/HBase haven't implemented the interfaces, but that doesn't mean
> it's
> > >>>>> hard
> > >>>>>>> to
> > >>>>>>>> implement.
> > >>>>>>>> They should implement the pushdown interfaces to reduce IO and
> the
> > >>>>> cache
> > >>>>>>>> size.
> > >>>>>>>> That should be a final state that the scan source and lookup
> source
> > >>>>> share
> > >>>>>>>> the exact pushdown implementation.
> > >>>>>>>> I don't see why we need to duplicate the pushdown logic in
> caches,
> > >>>>> which
> > >>>>>>>> would complicate the lookup join design.
> > >>>>>>>>
> > >>>>>>>> 3) ALL cache abstraction
> > >>>>>>>> All cache might be the most challenging part of this FLIP. We
> have
> > >>>>> never
> > >>>>>>>> provided a reload-lookup public interface.
> > >>>>>>>> Currently, we put the reload logic in the "eval" method of
> > >>>>> TableFunction.
> > >>>>>>>> That's hard for some sources (e.g., Hive).
> > >>>>>>>> Ideally, connector implementation should share the logic of
> reload
> > >>>>> and
> > >>>>>>>> scan, i.e. ScanTableSource with
> InputFormat/SourceFunction/FLIP-27
> > >>>>>>> Source.
> > >>>>>>>> However, InputFormat/SourceFunction are deprecated, and the
> FLIP-27
> > >>>>>>> source
> > >>>>>>>> is deeply coupled with SourceOperator.
> > >>>>>>>> If we want to invoke the FLIP-27 source in LookupJoin, this may
> make
> > >>>>> the
> > >>>>>>>> scope of this FLIP much larger.
> > >>>>>>>> We are still investigating how to abstract the ALL cache logic
> and
> > >>>>> reuse
> > >>>>>>>> the existing source interfaces.
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>> Best,
> > >>>>>>>> Jark
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>> On Thu, 5 May 2022 at 20:22, Roman Boyko <ro...@gmail.com>
> > >>>>> wrote:
> > >>>>>>>>
> > >>>>>>>>> It's a much more complicated effort and lies outside the
> scope of
> > >>>>> this
> > >>>>>>>>> improvement. Because such pushdowns should be done for all
> > >>>>>>> ScanTableSource
> > >>>>>>>>> implementations (not only for Lookup ones).
> > >>>>>>>>>
> > >>>>>>>>> On Thu, 5 May 2022 at 19:02, Martijn Visser <
> > >>>>> martijnvisser@apache.org>
> > >>>>>>>>> wrote:
> > >>>>>>>>>
> > >>>>>>>>>> Hi everyone,
> > >>>>>>>>>>
> > >>>>>>>>>> One question regarding "And Alexander correctly mentioned that
> > >>>>> filter
> > >>>>>>>>>> pushdown still is not implemented for jdbc/hive/hbase." ->
> Would
> > >>>>> an
> > >>>>>>>>>> alternative solution be to actually implement these filter
> > >>>>> pushdowns?
> > >>>>>>> I
> > >>>>>>>>>> can
> > >>>>>>>>>> imagine that there are many more benefits to doing that,
> outside
> > >>>>> of
> > >>>>>>> lookup
> > >>>>>>>>>> caching and metrics.
> > >>>>>>>>>>
> > >>>>>>>>>> Best regards,
> > >>>>>>>>>>
> > >>>>>>>>>> Martijn Visser
> > >>>>>>>>>> https://twitter.com/MartijnVisser82
> > >>>>>>>>>> https://github.com/MartijnVisser
> > >>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>> On Thu, 5 May 2022 at 13:58, Roman Boyko <
> ro.v.boyko@gmail.com>
> > >>>>>>> wrote:
> > >>>>>>>>>>
> > >>>>>>>>>>> Hi everyone!
> > >>>>>>>>>>>
> > >>>>>>>>>>> Thanks for driving such a valuable improvement!
> > >>>>>>>>>>>
> > >>>>>>>>>>> I do think that a single cache implementation would be a nice
> > >>>>>>> opportunity
> > >>>>>>>>>> for
> > >>>>>>>>>>> users. And it will break the "FOR SYSTEM_TIME AS OF
> proc_time"
> > >>>>>>> semantics
> > >>>>>>>>>>> anyway, no matter how it is implemented.
> > >>>>>>>>>>>
> > >>>>>>>>>>> Putting myself in the user's shoes, I can say that:
> > >>>>>>>>>>> 1) I would prefer to have the opportunity to cut down the
> cache
> > >>>>> size
> > >>>>>>> by
> > >>>>>>>>>>> simply filtering unnecessary data. And the most handy way to
> do
> > >>>>> it
> > >>>>>>> is
> > >>>>>>>>>> apply
> > >>>>>>>>>>> it inside LookupRunners. It would be a bit harder to pass it
> > >>>>>>> through the
> > >>>>>>>>>>> LookupJoin node to TableFunction. And Alexander correctly
> > >>>>> mentioned
> > >>>>>>> that
> > >>>>>>>>>>> filter pushdown still is not implemented for jdbc/hive/hbase.
> > >>>>>>>>>>> 2) The ability to set the different caching parameters for
> > >>>>> different
> > >>>>>>>>>> tables
> > >>>>>>>>>>> is quite important. So I would prefer to set it through DDL
> > >>>>> rather
> > >>>>>>> than
> > >>>>>>>>>>> have the same TTL, strategy and other options for all lookup
> > >>>>> tables.
> > >>>>>>>>>>> 3) Putting the cache into the framework really deprives us
> of
> > >>>>>>>>>>> extensibility (users won't be able to implement their own
> > >>>>> cache).
> > >>>>>>> But
> > >>>>>>>>>> most
> > >>>>>>>>>>> probably it might be solved by creating more different cache
> > >>>>>>> strategies
> > >>>>>>>>>> and
> > >>>>>>>>>>> a wider set of configurations.
> > >>>>>>>>>>>
> > >>>>>>>>>>> All these points are much closer to the schema proposed by
> > >>>>>>> Alexander.
> > >>>>>>>>>>> Qingsheng Ren, please correct me if I'm wrong and all
> these
> > >>>>>>>>>> facilities
> > >>>>>>>>>>> might be simply implemented in your architecture?
> > >>>>>>>>>>>
> > >>>>>>>>>>> Best regards,
> > >>>>>>>>>>> Roman Boyko
> > >>>>>>>>>>> e.: ro.v.boyko@gmail.com
> > >>>>>>>>>>>
> > >>>>>>>>>>> On Wed, 4 May 2022 at 21:01, Martijn Visser <
> > >>>>>>> martijnvisser@apache.org>
> > >>>>>>>>>>> wrote:
> > >>>>>>>>>>>
> > >>>>>>>>>>>> Hi everyone,
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> I don't have much to chip in, but just wanted to express
> that
> > >>>>> I
> > >>>>>>> really
> > >>>>>>>>>>>> appreciate the in-depth discussion on this topic and I hope
> > >>>>> that
> > >>>>>>>>>> others
> > >>>>>>>>>>>> will join the conversation.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Best regards,
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Martijn
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> On Tue, 3 May 2022 at 10:15, Александр Смирнов <
> > >>>>>>> smiralexan@gmail.com>
> > >>>>>>>>>>>> wrote:
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>> Hi Qingsheng, Leonard and Jark,
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> Thanks for your detailed feedback! However, I have
> questions
> > >>>>>>> about
> > >>>>>>>>>>>>> some of your statements (maybe I didn't get something?).
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>> Caching actually breaks the semantic of "FOR SYSTEM_TIME
> > >>>>> AS OF
> > >>>>>>>>>>>> proc_time”
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> I agree that the semantics of "FOR SYSTEM_TIME AS OF
> > >>>>> proc_time"
> > >>>>>>> is
> > >>>>>>>>>> not
> > >>>>>>>>>>>>> fully implemented with caching, but as you said, users go
> > >>>>> on it
> > >>>>>>>>>>>>> consciously to achieve better performance (no one proposed
> > >>>>> to
> > >>>>>>> enable
> > >>>>>>>>>>>>> caching by default, etc.). Or by users do you mean other
> > >>>>>>> developers
> > >>>>>>>>>> of
> > >>>>>>>>>>>>> connectors? In this case developers explicitly specify
> > >>>>> whether
> > >>>>>>> their
> > >>>>>>>>>>>>> connector supports caching or not (in the list of supported
> > >>>>>>>>>> options),
> > >>>>>>>>>>>>> no one makes them do that if they don't want to. So what
> > >>>>>>> exactly is
> > >>>>>>>>>>>>> the difference between implementing caching in modules
> > >>>>>>>>>>>>> flink-table-runtime and in flink-table-common from the
> > >>>>>>> considered
> > >>>>>>>>>>>>> point of view? How does it affect the breaking or preserving of
> > >>>>> the
> > >>>>>>>>>>>>> semantics of "FOR SYSTEM_TIME AS OF proc_time"?
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>> confront a situation that allows table options in DDL to
> > >>>>>>> control
> > >>>>>>>>>> the
> > >>>>>>>>>>>>> behavior of the framework, which has never happened
> > >>>>> previously
> > >>>>>>> and
> > >>>>>>>>>>> should
> > >>>>>>>>>>>>> be cautious
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> If we talk about main differences of semantics of DDL
> > >>>>> options
> > >>>>>>> and
> > >>>>>>>>>>>>> config options("table.exec.xxx"), isn't it about limiting
> > >>>>> the
> > >>>>>>> scope
> > >>>>>>>>>> of
> > >>>>>>>>>>>>> the options + importance for the user business logic rather
> > >>>>> than
> > >>>>>>>>>>>>> specific location of corresponding logic in the framework?
> I
> > >>>>>>> mean
> > >>>>>>>>>> that
> > >>>>>>>>>>>>> in my design, for example, putting an option with lookup
> > >>>>> cache
> > >>>>>>>>>>>>> strategy in configurations would be the wrong decision,
> > >>>>>>> because it
> > >>>>>>>>>>>>> directly affects the user's business logic (not just
> > >>>>> performance
> > >>>>>>>>>>>>> optimization) + touches just several functions of ONE table
> > >>>>>>> (there
> > >>>>>>>>>> can
> > >>>>>>>>>>>>> be multiple tables with different caches). Does it really matter
> > >>>>>>>>>>>>> to the user (or anyone else) where the logic affected by the
> > >>>>>>>>>>>>> applied option is located?
> > >>>>>>>>>>>>> Also I can remember DDL option 'sink.parallelism', which in
> > >>>>>>> some way
> > >>>>>>>>>>>>> "controls the behavior of the framework" and I don't see
> any
> > >>>>>>> problem
> > >>>>>>>>>>>>> here.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>> introduce a new interface for this all-caching scenario
> > >>>>> and
> > >>>>>>> the
> > >>>>>>>>>>> design
> > >>>>>>>>>>>>> would become more complex
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> This is a subject for a separate discussion, but actually
> > >>>>> in our
> > >>>>>>>>>>>>> internal version we solved this problem quite easily - we
> > >>>>> reused
> > >>>>>>>>>>>>> InputFormat class (so there is no need for a new API). The
> > >>>>>>> point is
> > >>>>>>>>>>>>> that currently all lookup connectors use InputFormat for
> > >>>>>>> scanning
> > >>>>>>>>>> the
> > >>>>>>>>>>>>> data in batch mode: HBase, JDBC and even Hive - it uses
> > >>>>> class
> > >>>>>>>>>>>>> PartitionReader, that is actually just a wrapper around
> > >>>>>>> InputFormat.
> > >>>>>>>>>>>>> The advantage of this solution is the ability to reload
> > >>>>> cache
> > >>>>>>> data
> > >>>>>>>>>> in
> > >>>>>>>>>>>>> parallel (number of threads depends on number of
> > >>>>> InputSplits,
> > >>>>>>> but
> > >>>>>>>>>> has
> > >>>>>>>>>>>>> an upper limit). As a result, cache reload time is significantly
> > >>>>>>>>>>>>> reduced (as is the time the input stream stays blocked). I know that
> > >>>>> usually
> > >>>>>>> we
> > >>>>>>>>>> try
> > >>>>>>>>>>>>> to avoid usage of concurrency in Flink code, but maybe this
> > >>>>> one
> > >>>>>>> can
> > >>>>>>>>>> be
> > >>>>>>>>>>>>> an exception. BTW I don't say that it's an ideal solution,
> > >>>>> maybe
> > >>>>>>>>>> there
> > >>>>>>>>>>>>> are better ones.
> > >>>>>>>>>>>>>
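The parallel reload described above could be sketched roughly like this (plain Java; the split type and per-split reader are hypothetical stand-ins for the InputFormat-based pieces, not any actual Flink API):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

/** Sketch of reloading an ALL cache from several input splits in parallel:
 *  one task per split, with an upper bound on the number of threads. */
public class ParallelAllCacheReload {

    public static <S, K, V> Map<K, V> reload(
            List<S> splits,
            Function<S, Map<K, V>> readSplit,
            int maxThreads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(
                Math.max(1, Math.min(splits.size(), maxThreads)));
        try {
            List<Future<Map<K, V>>> futures = new ArrayList<>();
            for (S split : splits) {
                futures.add(pool.submit(() -> readSplit.apply(split)));
            }
            // The new cache is built off to the side; the caller swaps it in
            // only once complete, keeping the blocking window short.
            Map<K, V> newCache = new HashMap<>();
            for (Future<Map<K, V>> f : futures) {
                newCache.putAll(f.get());
            }
            return newCache;
        } finally {
            pool.shutdown();
        }
    }
}
```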
> > >>>>>>>>>>>>>> Providing the cache in the framework might introduce
> > >>>>>>> compatibility
> > >>>>>>>>>>>> issues
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> It's possible only in cases when the developer of the
> > >>>>> connector
> > >>>>>>>>>> doesn't
> > >>>>>>>>>>>> properly refactor his code and uses the new cache options
> > >>>>>>>>>> incorrectly
> > >>>>>>>>>>>>> (i.e. explicitly provide the same options into 2 different
> > >>>>> code
> > >>>>>>>>>>>>> places). For correct behavior all he will need to do is to
> > >>>>>>> redirect
> > >>>>>>>>>>>>> existing options to the framework's LookupConfig (+ maybe
> > >>>>> add an
> > >>>>>>>>>> alias
> > >>>>>>>>>>>>> for options, if there was different naming), everything
> > >>>>> will be
> > >>>>>>>>>>>>> transparent for users. If the developer does no
> > >>>>> refactoring at
> > >>>>>>> all,
> > >>>>>>>>>>>>> nothing will be changed for the connector because of
> > >>>>> backward
> > >>>>>>>>>>>>> compatibility. Also if a developer wants to use his own
> > >>>>> cache
> > >>>>>>> logic,
> > >>>>>>>>>>>>> he just can refuse to pass some of the configs into the
> > >>>>>>> framework,
> > >>>>>>>>>> and
> > >>>>>>>>>>>>> instead make his own implementation with already existing
> > >>>>>>> configs
> > >>>>>>>>>> and
> > >>>>>>>>>>>>> metrics (but actually I think that it's a rare case).
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>> filters and projections should be pushed all the way down
> > >>>>> to
> > >>>>>>> the
> > >>>>>>>>>>> table
> > >>>>>>>>>>>>> function, like what we do in the scan source
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> It's the great purpose. But the truth is that the ONLY
> > >>>>> connector
> > >>>>>>>>>> that
> > >>>>>>>>>>>>> supports filter pushdown is FileSystemTableSource
> > >>>>>>>>>>>>> (no database connector supports it currently). Also for
> some
> > >>>>>>>>>> databases
> > >>>>>>>>>>>>> it's simply impossible to push down such complex filters
> > >>>>> that we
> > >>>>>>> have
> > >>>>>>>>>>>>> in Flink.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>> only applying these optimizations to the cache seems not
> > >>>>>>> quite
> > >>>>>>>>>>> useful
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> Filters can cut off an arbitrarily large amount of data
> > >>>>> from the
> > >>>>>>>>>>>>> dimension table. For a simple example, suppose in dimension
> > >>>>>>> table
> > >>>>>>>>>>>>> 'users'
> > >>>>>>>>>>>>> we have column 'age' with values from 20 to 40, and input
> > >>>>> stream
> > >>>>>>>>>>>>> 'clicks' that is ~uniformly distributed by age of users. If
> > >>>>> we
> > >>>>>>> have
> > >>>>>>>>>>>>> filter 'age > 30',
> > >>>>>>>>>>>>> the cache will hold half as much data. This means the user can
> > >>>>>>>>>>>>> increase 'lookup.cache.max-rows' by almost 2 times, which yields
> > >>>>>>>>>>>>> a huge performance boost. Moreover, this optimization really
> > >>>>>>>>>>>>> starts to shine
> > >>>>>>>>>>>>> in 'ALL' cache, where tables without filters and
> projections
> > >>>>>>> can't
> > >>>>>>>>>> fit
> > >>>>>>>>>>>>> in memory, but with them - can. This opens up additional
> > >>>>>>>>>> possibilities
> > >>>>>>>>>>>>> for users. And that hardly sounds like 'not quite useful'.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> It would be great to hear other voices regarding this
> topic!
> > >>>>>>> Because
> > >>>>>>>>>>>>> we have quite a lot of controversial points, and I think
> > >>>>> with
> > >>>>>>> the
> > >>>>>>>>>> help
> > >>>>>>>>>>>>> of others it will be easier for us to come to a consensus.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> Best regards,
> > >>>>>>>>>>>>> Smirnov Alexander
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> Fri, 29 Apr 2022 at 22:33, Qingsheng Ren <
> > >>>>> renqschn@gmail.com
> > >>>>>>>> :
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> Hi Alexander and Arvid,
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> Thanks for the discussion and sorry for my late response!
> > >>>>> We
> > >>>>>>> had
> > >>>>>>>>>> an
> > >>>>>>>>>>>>> internal discussion together with Jark and Leonard and I’d
> > >>>>> like
> > >>>>>>> to
> > >>>>>>>>>>>>> summarize our ideas. Instead of implementing the cache
> > >>>>> logic in
> > >>>>>>> the
> > >>>>>>>>>>> table
> > >>>>>>>>>>>>> runtime layer or wrapping around the user-provided table
> > >>>>>>> function,
> > >>>>>>>>>> we
> > >>>>>>>>>>>>> prefer to introduce some new APIs extending TableFunction
> > >>>>> with
> > >>>>>>> these
> > >>>>>>>>>>>>> concerns:
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> 1. Caching actually breaks the semantic of "FOR
> > >>>>> SYSTEM_TIME
> > >>>>>>> AS OF
> > >>>>>>>>>>>>> proc_time”, because it couldn’t truly reflect the content
> > >>>>> of the
> > >>>>>>>>>> lookup
> > >>>>>>>>>>>>> table at the moment of querying. If users choose to enable
> > >>>>>>> caching
> > >>>>>>>>>> on
> > >>>>>>>>>>> the
> > >>>>>>>>>>>>> lookup table, they implicitly indicate that this breakage
> is
> > >>>>>>>>>> acceptable
> > >>>>>>>>>>>> in
> > >>>>>>>>>>>>> exchange for the performance. So we prefer not to provide
> > >>>>>>> caching on
> > >>>>>>>>>>> the
> > >>>>>>>>>>>>> table runtime level.
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> 2. If we make the cache implementation in the framework
> > >>>>>>> (whether
> > >>>>>>>>>> in a
> > >>>>>>>>>>>>> runner or a wrapper around TableFunction), we have to
> > >>>>> confront a
> > >>>>>>>>>>>> situation
> > >>>>>>>>>>>>> that allows table options in DDL to control the behavior of
> > >>>>> the
> > >>>>>>>>>>>> framework,
> > >>>>>>>>>>>>> which has never happened previously and should be cautious.
> > >>>>>>> Under
> > >>>>>>>>>> the
> > >>>>>>>>>>>>> current design the behavior of the framework should only be
> > >>>>>>>>>> specified
> > >>>>>>>>>>> by
> > >>>>>>>>>>>>> configurations (“table.exec.xxx”), and it’s hard to apply
> > >>>>> these
> > >>>>>>>>>> general
> > >>>>>>>>>>>>> configs to a specific table.
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> 3. We have use cases where the lookup source loads and refreshes
> > >>>>> all
> > >>>>>>>>>> records
> > >>>>>>>>>>>>> periodically into the memory to achieve high lookup
> > >>>>> performance
> > >>>>>>>>>> (like
> > >>>>>>>>>>>> Hive
> > >>>>>>>>>>>>> connector in the community, and also widely used by our
> > >>>>> internal
> > >>>>>>>>>>>>> connectors). Wrapping the cache around the user’s
> > >>>>> TableFunction
> > >>>>>>>>>> works
> > >>>>>>>>>>>> fine
> > >>>>>>>>>>>>> for LRU caches, but I think we have to introduce a new
> > >>>>>>> interface for
> > >>>>>>>>>>> this
> > >>>>>>>>>>>>> all-caching scenario and the design would become more
> > >>>>> complex.
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> 4. Providing the cache in the framework might introduce
> > >>>>>>>>>> compatibility
> > >>>>>>>>>>>>> issues to existing lookup sources like there might exist
> two
> > >>>>>>> caches
> > >>>>>>>>>>> with
> > >>>>>>>>>>>>> totally different strategies if the user incorrectly
> > >>>>> configures
> > >>>>>>> the
> > >>>>>>>>>>> table
> > >>>>>>>>>>>>> (one in the framework and another implemented by the lookup
> > >>>>>>> source).
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> As for the optimization mentioned by Alexander, I think
> > >>>>>>> filters
> > >>>>>>>>>> and
> > >>>>>>>>>>>>> projections should be pushed all the way down to the table
> > >>>>>>> function,
> > >>>>>>>>>>> like
> > >>>>>>>>>>>>> what we do in the scan source, instead of the runner with
> > >>>>> the
> > >>>>>>> cache.
> > >>>>>>>>>>> The
> > >>>>>>>>>>>>> goal of using cache is to reduce the network I/O and
> > >>>>> pressure
> > >>>>>>> on the
> > >>>>>>>>>>>>> external system, and only applying these optimizations to
> > >>>>> the
> > >>>>>>> cache
> > >>>>>>>>>>> seems
> > >>>>>>>>>>>>> not quite useful.
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> I made some updates to the FLIP[1] to reflect our ideas.
> > >>>>> We
> > >>>>>>>>>> prefer to
> > >>>>>>>>>>>>> keep the cache implementation as a part of TableFunction,
> > >>>>> and we
> > >>>>>>>>>> could
> > >>>>>>>>>>>>> provide some helper classes (CachingTableFunction,
> > >>>>>>>>>>>> AllCachingTableFunction,
> > >>>>>>>>>>>>> CachingAsyncTableFunction) to developers and regulate
> > >>>>> metrics
> > >>>>>>> of the
> > >>>>>>>>>>>> cache.
> > >>>>>>>>>>>>> Also, I made a POC[2] for your reference.
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> Looking forward to your ideas!
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> [1]
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>
> > >>>>>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-221+Abstraction+for+lookup+source+cache+and+metric
> > >>>>>>>>>>>>>> [2] https://github.com/PatrickRen/flink/tree/FLIP-221
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> Best regards,
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> Qingsheng
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> On Tue, Apr 26, 2022 at 4:45 PM Александр Смирнов <
> > >>>>>>>>>>>> smiralexan@gmail.com>
> > >>>>>>>>>>>>> wrote:
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> Thanks for the response, Arvid!
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> I have few comments on your message.
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> but could also live with an easier solution as the
> > >>>>> first
> > >>>>>>> step:
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> I think that these 2 ways are mutually exclusive
> > >>>>> (originally
> > >>>>>>>>>>> proposed
> > >>>>>>>>>>>>>>> by Qingsheng and mine), because conceptually they follow
> > >>>>> the
> > >>>>>>> same
> > >>>>>>>>>>>>>>> goal, but implementation details are different. If we
> > >>>>> will
> > >>>>>>> go one
> > >>>>>>>>>>> way,
> > >>>>>>>>>>>>>>> moving to another way in the future will mean deleting
> > >>>>>>> existing
> > >>>>>>>>>> code
> > >>>>>>>>>>>>>>> and once again changing the API for connectors. So I
> > >>>>> think we
> > >>>>>>>>>> should
> > >>>>>>>>>>>>>>> reach a consensus with the community about that and then
> > >>>>> work
> > >>>>>>>>>>> together
> > >>>>>>>>>>>>>>> on this FLIP, i.e. divide the work into tasks for different
> > >>>>>>> parts
> > >>>>>>>>>> of
> > >>>>>>>>>>> the
> > >>>>>>>>>>>>>>> flip (for example, LRU cache unification / introducing
> > >>>>>>> proposed
> > >>>>>>>>>> set
> > >>>>>>>>>>> of
> > >>>>>>>>>>>>>>> metrics / further work…). WDYT, Qingsheng?
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> as the source will only receive the requests after
> > >>>>> filter
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> Actually if filters are applied to fields of the lookup
> > >>>>>>> table, we
> > >>>>>>>>>>>>>>> must first make the requests, and only after that can we
> > >>>>> filter
> > >>>>>>>>>>> responses,
> > >>>>>>>>>>>>>>> because lookup connectors don't have filter pushdown. So
> > >>>>> if
> > >>>>>>>>>>> filtering
> > >>>>>>>>>>>>>>> is done before caching, there will be far fewer rows in the
> > >>>>>>> cache.
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> @Alexander unfortunately, your architecture is not
> > >>>>> shared.
> > >>>>>>> I
> > >>>>>>>>>> don't
> > >>>>>>>>>>>>> know the
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> solution to share images to be honest.
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> Sorry for that, I’m a bit new to such kinds of
> > >>>>> conversations
> > >>>>>>> :)
> > >>>>>>>>>>>>>>> I have no write access to the confluence, so I made a
> > >>>>> Jira
> > >>>>>>> issue,
> > >>>>>>>>>>>>>>> where described the proposed changes in more details -
> > >>>>>>>>>>>>>>> https://issues.apache.org/jira/browse/FLINK-27411.
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> Will happy to get more feedback!
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> Best,
> > >>>>>>>>>>>>>>> Smirnov Alexander
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> Mon, 25 Apr 2022 at 19:49, Arvid Heise <
> > >>>>> arvid@apache.org>:
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> Hi Qingsheng,
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> Thanks for driving this; the inconsistency was not
> > >>>>>>> satisfying
> > >>>>>>>>>> for
> > >>>>>>>>>>>> me.
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> I second Alexander's idea though but could also live
> > >>>>> with
> > >>>>>>> an
> > >>>>>>>>>>> easier
> > >>>>>>>>>>>>>>>> solution as the first step: Instead of making caching
> > >>>>> an
> > >>>>>>>>>>>>> implementation
> > >>>>>>>>>>>>>>>> detail of TableFunction X, rather devise a caching
> > >>>>> layer
> > >>>>>>>>>> around X.
> > >>>>>>>>>>>> So
> > >>>>>>>>>>>>> the
> > >>>>>>>>>>>>>>>> proposal would be a CachingTableFunction that
> > >>>>> delegates to
> > >>>>>>> X in
> > >>>>>>>>>>> case
> > >>>>>>>>>>>>> of
> > >>>>>>>>>>>>>>>> misses and else manages the cache. Lifting it into the
> > >>>>>>> operator
> > >>>>>>>>>>>> model
> > >>>>>>>>>>>>> as
> > >>>>>>>>>>>>>>>> proposed would be even better but is probably
> > >>>>> unnecessary
> > >>>>>>> in
> > >>>>>>>>>> the
> > >>>>>>>>>>>>> first step
> > >>>>>>>>>>>>>>>> for a lookup source (as the source will only receive
> > >>>>> the
> > >>>>>>>>>> requests
> > >>>>>>>>>>>>> after
> > >>>>>>>>>>>>>>>> filter; applying projection may be more interesting to
> > >>>>> save
> > >>>>>>>>>>> memory).
> > >>>>>>>>>>>>>>>>
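Arvid's "caching layer around X" idea might look roughly like the following sketch, with the wrapped lookup function modeled as a plain Function and hit/miss counters standing in for the proposed metrics (all names here are illustrative, not the actual proposed API):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Sketch of a caching layer that delegates to the wrapped lookup
 *  function only on a miss and serves repeated keys from its cache. */
public class CachingLookup<K, R> {
    private final Function<K, List<R>> delegate;
    private final Map<K, List<R>> cache = new HashMap<>();
    private long hits;
    private long misses;

    public CachingLookup(Function<K, List<R>> delegate) {
        this.delegate = delegate;
    }

    public List<R> lookup(K key) {
        List<R> cached = cache.get(key);
        if (cached != null) {
            hits++;              // served from cache, no external I/O
            return cached;
        }
        misses++;
        List<R> rows = delegate.apply(key);
        cache.put(key, rows);    // also caches empty results
        return rows;
    }

    public long hits() { return hits; }
    public long misses() { return misses; }
}
```

A nice property of this shape is that the wrapped function never needs to know a cache exists, which is exactly why the change could stay limited to options rather than new public interfaces.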
> > >>>>>>>>>>>>>>>> Another advantage is that all the changes of this FLIP
> > >>>>>>> would be
> > >>>>>>>>>>>>> limited to
> > >>>>>>>>>>>>>>>> options, no need for new public interfaces. Everything
> > >>>>> else
> > >>>>>>>>>>> remains
> > >>>>>>>>>>>> an
> > >>>>>>>>>>>>>>>> implementation of Table runtime. That means we can
> > >>>>> easily
> > >>>>>>>>>>>> incorporate
> > >>>>>>>>>>>>> the
> > >>>>>>>>>>>>>>>> optimization potential that Alexander pointed out
> > >>>>> later.
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> @Alexander unfortunately, your architecture is not
> > >>>>> shared.
> > >>>>>>> I
> > >>>>>>>>>> don't
> > >>>>>>>>>>>>> know the
> > >>>>>>>>>>>>>>>> solution to share images to be honest.
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> On Fri, Apr 22, 2022 at 5:04 PM Александр Смирнов <
> > >>>>>>>>>>>>> smiralexan@gmail.com>
> > >>>>>>>>>>>>>>>> wrote:
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> Hi Qingsheng! My name is Alexander, I'm not a
> > >>>>> committer
> > >>>>>>> yet,
> > >>>>>>>>>> but
> > >>>>>>>>>>>> I'd
> > >>>>>>>>>>>>>>>>> really like to become one. And this FLIP really
> > >>>>>>> interested
> > >>>>>>>>>> me.
> > >>>>>>>>>>>>>>>>> Actually I have worked on a similar feature in my
> > >>>>>>> company’s
> > >>>>>>>>>>> Flink
> > >>>>>>>>>>>>>>>>> fork, and we would like to share our thoughts on
> > >>>>> this and
> > >>>>>>>>>> make
> > >>>>>>>>>>>> code
> > >>>>>>>>>>>>>>>>> open source.
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> I think there is a better alternative than
> > >>>>> introducing an
> > >>>>>>>>>>> abstract
> > >>>>>>>>>>>>>>>>> class for TableFunction (CachingTableFunction). As
> > >>>>> you
> > >>>>>>> know,
> > >>>>>>>>>>>>>>>>> TableFunction exists in the flink-table-common
> > >>>>> module,
> > >>>>>>> which
> > >>>>>>>>>>>>> provides
> > >>>>>>>>>>>>>>>>> only an API for working with tables – it’s very
> > >>>>>>> convenient
> > >>>>>>>>>> for
> > >>>>>>>>>>>>> importing
> > >>>>>>>>>>>>>>>>> in connectors. In turn, CachingTableFunction contains
> > >>>>>>> logic
> > >>>>>>>>>> for
> > >>>>>>>>>>>>>>>>> runtime execution,  so this class and everything
> > >>>>>>> connected
> > >>>>>>>>>> with
> > >>>>>>>>>>> it
> > >>>>>>>>>>>>>>>>> should be located in another module, probably in
> > >>>>>>>>>>>>> flink-table-runtime.
> > >>>>>>>>>>>>>>>>> But this will require connectors to depend on another
> > >>>>>>> module,
> > >>>>>>>>>>>> which
> > >>>>>>>>>>>>>>>>> contains a lot of runtime logic, which doesn’t sound
> > >>>>>>> good.
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> I suggest adding a new method ‘getLookupConfig’ to
> > >>>>>>>>>>>> LookupTableSource
> > >>>>>>>>>>>>>>>>> or LookupRuntimeProvider to allow connectors to only
> > >>>>> pass
> > >>>>>>>>>>>>>>>>> configurations to the planner, therefore they won’t
> > >>>>>>> depend on
> > >>>>>>>>>>>>> runtime
> > >>>>>>>>>>>>>>>>> realization. Based on these configs planner will
> > >>>>>>> construct a
> > >>>>>>>>>>>> lookup
> > >>>>>>>>>>>>>>>>> join operator with corresponding runtime logic
> > >>>>>>>>>> (ProcessFunctions
> > >>>>>>>>>>>> in
> > >>>>>>>>>>>>>>>>> module flink-table-runtime). Architecture looks like
> > >>>>> in
> > >>>>>>> the
> > >>>>>>>>>>> pinned
> > >>>>>>>>>>>>>>>>> image (LookupConfig class there is actually yours
> > >>>>>>>>>> CacheConfig).
> > >>>>>>>>>>>>>>>>>
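A hypothetical shape for the LookupConfig hand-off described above (all names are invented purely to illustrate the "connector passes only configuration to the planner" idea, not the actual FLIP-221 API):

```java
import java.time.Duration;

/** Sketch: the connector hands a plain config object to the planner,
 *  which then builds the caching lookup-join operator itself. */
public final class LookupConfig {
    public enum CacheStrategy { NONE, LRU, ALL }

    private final CacheStrategy strategy;
    private final long maxRows;
    private final Duration ttl;

    private LookupConfig(CacheStrategy strategy, long maxRows, Duration ttl) {
        this.strategy = strategy;
        this.maxRows = maxRows;
        this.ttl = ttl;
    }

    /** LRU cache bounded by row count, entries expiring after a TTL. */
    public static LookupConfig lru(long maxRows, Duration ttl) {
        return new LookupConfig(CacheStrategy.LRU, maxRows, ttl);
    }

    /** ALL cache: everything is loaded and reloaded on an interval. */
    public static LookupConfig all(Duration reloadInterval) {
        return new LookupConfig(CacheStrategy.ALL, Long.MAX_VALUE, reloadInterval);
    }

    public CacheStrategy strategy() { return strategy; }
    public long maxRows() { return maxRows; }
    public Duration ttl() { return ttl; }
}
```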
> > >>>>>>>>>>>>>>>>> Classes in flink-table-planner, that will be
> > >>>>> responsible
> > >>>>>>> for
> > >>>>>>>>>>> this
> > >>>>>>>>>>>> –
> > >>>>>>>>>>>>>>>>> CommonPhysicalLookupJoin and its subclasses.
> > >>>>>>>>>>>>>>>>> Current classes for lookup join in
> > >>>>> flink-table-runtime
> > >>>>>>> -
> > >>>>>>>>>>>>>>>>> LookupJoinRunner, AsyncLookupJoinRunner,
> > >>>>>>>>>>> LookupJoinRunnerWithCalc,
> > >>>>>>>>>>>>>>>>> AsyncLookupJoinRunnerWithCalc.
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> I suggest adding classes LookupJoinCachingRunner,
> > >>>>>>>>>>>>>>>>> LookupJoinCachingRunnerWithCalc, etc.
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> And here comes another more powerful advantage of
> > >>>>> such a
> > >>>>>>>>>>> solution.
> > >>>>>>>>>>>>> If
> > >>>>>>>>>>>>>>>>> we have caching logic on a lower level, we can apply
> > >>>>> some
> > >>>>>>>>>>>>>>>>> optimizations to it. LookupJoinRunnerWithCalc was
> > >>>>> named
> > >>>>>>> like
> > >>>>>>>>>>> this
> > >>>>>>>>>>>>>>>>> because it uses the ‘calc’ function, which actually
> > >>>>>>> mostly
> > >>>>>>>>>>>> consists
> > >>>>>>>>>>>>> of
> > >>>>>>>>>>>>>>>>> filters and projections.
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> For example, in join table A with lookup table B
> > >>>>>>> condition
> > >>>>>>>>>>> ‘JOIN …
> > >>>>>>>>>>>>> ON
> > >>>>>>>>>>>>>>>>> A.id = B.id AND A.age = B.age + 10 WHERE B.salary >
> > >>>>> 1000’
> > >>>>>>>>>>> ‘calc’
> > >>>>>>>>>>>>>>>>> function will contain filters A.age = B.age + 10 and
> > >>>>>>>>>> B.salary >
> > >>>>>>>>>>>>> 1000.
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> If we apply this function before storing records in
> > >>>>>>> cache,
> > >>>>>>>>>> size
> > >>>>>>>>>>> of
> > >>>>>>>>>>>>>>>>> cache will be significantly reduced: filters = avoid
> > >>>>>>> storing
> > >>>>>>>>>>>> useless
> > >>>>>>>>>>>>>>>>> records in cache, projections = reduce records’
> > >>>>> size. So
> > >>>>>>> the
> > >>>>>>>>>>>> initial
> > >>>>>>>>>>>>>>>>> max number of records in cache can be increased by
> > >>>>> the
> > >>>>>>> user.
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> What do you think about it?
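[Editor's note] The filter-before-caching idea proposed above can be sketched with the JDK alone. This is only an illustration of the proposal, not any actual FLIP-221 API: the class name is made up, and a table row is modeled as a plain List<Object> for brevity.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.function.Predicate;

// Sketch: run the join's filter and projection ("calc") over looked-up
// rows *before* caching them, so the cache stores fewer and smaller
// records. Hypothetical class, not the FLIP-221 interface.
public class FilteringLookupCache {
    private final Map<Object, List<List<Object>>> cache = new HashMap<>();
    private final Predicate<List<Object>> filter;                  // e.g. B.salary > 1000
    private final Function<List<Object>, List<Object>> projection; // keep only needed columns

    public FilteringLookupCache(Predicate<List<Object>> filter,
                                Function<List<Object>, List<Object>> projection) {
        this.filter = filter;
        this.projection = projection;
    }

    /** Applies calc before storing; an empty result is cached as well. */
    public void put(Object key, List<List<Object>> lookedUpRows) {
        List<List<Object>> reduced = new ArrayList<>();
        for (List<Object> row : lookedUpRows) {
            if (filter.test(row)) {
                reduced.add(projection.apply(row));
            }
        }
        cache.put(key, reduced);
    }

    public List<List<Object>> get(Object key) {
        return cache.get(key);
    }
}
```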
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> On 2022/04/19 02:47:11 Qingsheng Ren wrote:
> > >>>>>>>>>>>>>>>>>> Hi devs,
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> Yuan and I would like to start a discussion about
> > >>>>>>>>>> FLIP-221[1],
> > >>>>>>>>>>>>> which
> > >>>>>>>>>>>>>>>>> introduces an abstraction of lookup table cache and
> > >>>>> its
> > >>>>>>>>>> standard
> > >>>>>>>>>>>>> metrics.
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> Currently each lookup table source should implement
> > >>>>>>> their
> > >>>>>>>>>> own
> > >>>>>>>>>>>>> cache to
> > >>>>>>>>>>>>>>>>> store lookup results, and there isn’t a standard of
> > >>>>>>> metrics
> > >>>>>>>>>> for
> > >>>>>>>>>>>>> users and
> > >>>>>>>>>>>>>>>>> developers to tuning their jobs with lookup joins,
> > >>>>> which
> > >>>>>>> is a
> > >>>>>>>>>>>> quite
> > >>>>>>>>>>>>> common
> > >>>>>>>>>>>>>>>>> use case in Flink table / SQL.
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> Therefore we propose some new APIs including cache,
> > >>>>>>>>>> metrics,
> > >>>>>>>>>>>>> wrapper
> > >>>>>>>>>>>>>>>>> classes of TableFunction and new table options.
> > >>>>> Please
> > >>>>>>> take a
> > >>>>>>>>>>> look
> > >>>>>>>>>>>>> at the
> > >>>>>>>>>>>>>>>>> FLIP page [1] to get more details. Any suggestions
> > >>>>> and
> > >>>>>>>>>> comments
> > >>>>>>>>>>>>> would be
> > >>>>>>>>>>>>>>>>> appreciated!
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> [1]
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>
> > >>>>>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-221+Abstraction+for+lookup+source+cache+and+metric
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> Best regards,
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> Qingsheng
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> --
> > >>>>>>>>>>>>>> Best Regards,
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> Qingsheng Ren
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> Real-time Computing Team
> > >>>>>>>>>>>>>> Alibaba Cloud
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> Email: renqschn@gmail.com
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>> --
> > >>>>>>>>> Best regards,
> > >>>>>>>>> Roman Boyko
> > >>>>>>>>> e.: ro.v.boyko@gmail.com
> > >>>>>>>>>
> > >>>>>>>
> > >>>>>
> > >>>>>
> > >>>
> > >>
> > >>
> > >> --
> > >> Best Regards,
> > >>
> > >> Qingsheng Ren
> > >>
> > >> Real-time Computing Team
> > >> Alibaba Cloud
> > >>
> > >> Email: renqschn@gmail.com
> >
>

Re: [DISCUSS] FLIP-221 Abstraction for lookup source cache and metric

Posted by Александр Смирнов <sm...@gmail.com>.
Hi Qingsheng,

Thank you for considering my comments.

>  there might be custom logic before retrying, such as re-establishing the connection

Yes, I understand that. I meant that such logic could be placed in a
separate function that connectors implement. Moving only the retry
logic into the framework would make connectors' LookupFunctions more
concise and avoid duplicated code. However, it's a minor change, so
the decision is up to you.
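[Editor's note] To illustrate the suggestion, here is a minimal sketch of how a framework-level base class could own the retry loop while connectors only implement the lookup itself plus an optional recovery hook. All class and method names here are hypothetical; the real FLIP-221 interfaces may differ.

```java
import java.util.Collection;

// Hypothetical framework-level base class: eval() owns the retry loop,
// a connector implements lookup() and may override onRetry() for custom
// recovery such as re-establishing a connection.
public abstract class RetryableLookupFunction {
    private final int maxRetryTimes;

    protected RetryableLookupFunction(int maxRetryTimes) {
        this.maxRetryTimes = maxRetryTimes;
    }

    /** Connector-specific lookup, e.g. a JDBC query by key. */
    protected abstract Collection<String> lookup(String key) throws Exception;

    /** Connector-specific recovery hook, runs before the next attempt. */
    protected void onRetry(Exception cause) {}

    /** Framework-level eval: retries lookup() up to maxRetryTimes. */
    public Collection<String> eval(String key) {
        Exception last = null;
        for (int attempt = 0; attempt <= maxRetryTimes; attempt++) {
            try {
                return lookup(key);
            } catch (Exception e) {
                last = e;
                onRetry(e);
            }
        }
        throw new RuntimeException("Lookup failed after retries", last);
    }
}
```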

> We decided not to provide common DDL options and to let developers define their own options per connector, as we do now.

What is the reason for that? Unifying the configs was one of the main
goals of this FLIP, wasn't it? I understand that the current cache
design no longer depends on ConfigOptions as it did before. But we
could still put these options into the framework so that connectors
can reuse them, avoiding code duplication and, more importantly,
inconsistent option naming across connectors. This could be pointed
out in the documentation for connector developers.
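[Editor's note] As an illustration of reusable framework-provided options, the framework could ship a single set of option keys that every connector reuses in its DDL. The key names below are purely illustrative placeholders, not options defined by the FLIP.

```java
// Hypothetical framework-level option keys that connectors could reuse
// instead of declaring their own, keeping DDL option names consistent.
// The names are illustrative only.
public final class LookupOptions {
    public static final String CACHE_TYPE = "lookup.cache";
    public static final String PARTIAL_CACHE_MAX_ROWS = "lookup.partial-cache.max-rows";
    public static final String PARTIAL_CACHE_EXPIRE_AFTER_WRITE =
            "lookup.partial-cache.expire-after-write";
    public static final String MAX_RETRIES = "lookup.max-retries";

    private LookupOptions() {}
}
```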

Best regards,
Alexander

Tue, 17 May 2022 at 17:11, Qingsheng Ren <re...@gmail.com>:
>
> Hi Alexander,
>
> Thanks for the review and glad to see we are on the same page! I think you forgot to cc the dev mailing list so I’m also quoting your reply under this email.
>
> >  We can add 'maxRetryTimes' option into this class
>
> In my opinion the retry logic should be implemented in lookup() instead of in LookupFunction#eval(). Retrying is only meaningful for some specific retriable failures, and there might be custom logic before retrying, such as re-establishing the connection (JdbcRowDataLookupFunction is an example), so it's handier to leave it to the connector.
>
> > I don't see DDL options, that were in previous version of FLIP. Do you have any special plans for them?
>
> We decided not to provide common DDL options and to let developers define their own options per connector, as we do now.
>
> The rest of comments sound great and I’ll update the FLIP. Hope we can finalize our proposal soon!
>
> Best,
>
> Qingsheng
>
>
> > On May 17, 2022, at 13:46, Александр Смирнов <sm...@gmail.com> wrote:
> >
> > Hi Qingsheng and devs!
> >
> > I like the overall design of updated FLIP, however I have several
> > suggestions and questions.
> >
> > 1) Introducing LookupFunction as a subclass of TableFunction is a good
> > idea. We can add 'maxRetryTimes' option into this class. 'eval' method
> > of new LookupFunction is great for this purpose. The same is for
> > 'async' case.
> >
> > 2) There might be other configs in future, such as 'cacheMissingKey'
> > in LookupFunctionProvider or 'rescanInterval' in ScanRuntimeProvider.
> > Maybe use Builder pattern in LookupFunctionProvider and
> > RescanRuntimeProvider for more flexibility (use one 'build' method
> > instead of many 'of' methods in future)?
> >
> > 3) What are the plans for existing TableFunctionProvider and
> > AsyncTableFunctionProvider? I think they should be deprecated.
> >
> > 4) Am I right that the current design does not assume the use of a
> > user-provided LookupCache in re-scanning? In that case, it is not very
> > clear why we need methods such as 'invalidate' or 'putAll' in
> > LookupCache.
> >
> > 5) I don't see DDL options, that were in previous version of FLIP. Do
> > you have any special plans for them?
> >
> > If you don't mind, I would be glad to make small adjustments to the
> > FLIP document too. I think it's worth mentioning exactly which
> > optimizations are planned for the future.
> >
> > Best regards,
> > Smirnov Alexander
> >
> > Fri, 13 May 2022 at 20:27, Qingsheng Ren <re...@gmail.com>:
> >>
> >> Hi Alexander and devs,
> >>
> >> Thank you very much for the in-depth discussion! As Jark mentioned we were inspired by Alexander's idea and made a refactor on our design. FLIP-221 [1] has been updated to reflect our design now and we are happy to hear more suggestions from you!
> >>
> >> Compared to the previous design:
> >> 1. The lookup cache serves at table runtime level and is integrated as a component of LookupJoinRunner as discussed previously.
> >> 2. Interfaces are renamed and re-designed to reflect the new design.
> >> 3. We separate the all-caching case individually and introduce a new RescanRuntimeProvider to reuse the ability of scanning. We are planning to support SourceFunction / InputFormat for now considering the complexity of FLIP-27 Source API.
> >> 4. A new interface LookupFunction is introduced to make the semantic of lookup more straightforward for developers.
> >>
> >> For replying to Alexander:
> >>> However I'm a little confused whether InputFormat is deprecated or not. Am I right that it will be so in the future, but currently it's not?
> >> Yes you are right. InputFormat is not deprecated for now. I think it will be deprecated in the future but we don't have a clear plan for that.
> >>
> >> Thanks again for the discussion on this FLIP and looking forward to cooperating with you after we finalize the design and interfaces!
> >>
> >> [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-221+Abstraction+for+lookup+source+cache+and+metric
> >>
> >> Best regards,
> >>
> >> Qingsheng
> >>
> >>
> >> On Fri, May 13, 2022 at 12:12 AM Александр Смирнов <sm...@gmail.com> wrote:
> >>>
> >>> Hi Jark, Qingsheng and Leonard!
> >>>
> >>> Glad to see that we came to a consensus on almost all points!
> >>>
> >>> However I'm a little confused whether InputFormat is deprecated or
> >>> not. Am I right that it will be so in the future, but currently it's
> >>> not? Actually I also think that for the first version it's OK to use
> >>> InputFormat in ALL cache realization, because supporting rescan
> >>> ability seems like a very distant prospect. But for this decision we
> >>> need a consensus among all discussion participants.
> >>>
> >>> In general, I don't have something to argue with your statements. All
> >>> of them correspond my ideas. Looking ahead, it would be nice to work
> >>> on this FLIP cooperatively. I've already done a lot of work on lookup
> >>> join caching with an implementation very close to the one we are discussing,
> >>> and want to share the results of this work. Anyway looking forward for
> >>> the FLIP update!
> >>>
> >>> Best regards,
> >>> Smirnov Alexander
> >>>
> >>> Thu, 12 May 2022 at 17:38, Jark Wu <im...@gmail.com>:
> >>>>
> >>>> Hi Alex,
> >>>>
> >>>> Thanks for summarizing your points.
> >>>>
> >>>> In the past week, Qingsheng, Leonard, and I have discussed it several times
> >>>> and we have totally refactored the design.
> >>>> I'm glad to say we have reached a consensus on many of your points!
> >>>> Qingsheng is still working on updating the design docs and maybe can be
> >>>> available in the next few days.
> >>>> I will share some conclusions from our discussions:
> >>>>
> >>>> 1) we have refactored the design towards to "cache in framework" way.
> >>>>
> >>>> 2) a "LookupCache" interface for users to customize and a default
> >>>> implementation with builder for users to easy-use.
> >>>> This can both make it possible to both have flexibility and conciseness.
> >>>>
> >>>> 3) Filter pushdown is important for ALL and LRU lookup cache, esp reducing
> >>>> IO.
> >>>> Filter pushdown should be the final state and the unified way to both
> >>>> support pruning ALL cache and LRU cache,
> >>>> so I think we should make effort in this direction. If we need to support
> >>>> filter pushdown for ALL cache anyway, why not use
> >>>> it for LRU cache as well? Either way, as we decide to implement the cache
> >>>> in the framework, we have the chance to support
> >>>> filter on cache anytime. This is an optimization and it doesn't affect the
> >>>> public API. I think we can create a JIRA issue to
> >>>> discuss it when the FLIP is accepted.
> >>>>
> >>>> 4) The idea to support ALL cache is similar to your proposal.
> >>>> In the first version, we will only support InputFormat, SourceFunction for
> >>>> cache all (invoke InputFormat in join operator).
> >>>> For FLIP-27 source, we need to join a true source operator instead of
> >>>> calling it embedded in the join operator.
> >>>> However, this needs another FLIP to support the re-scan ability for FLIP-27
> >>>> Source, and this can be a large work.
> >>>> In order to not block this issue, we can put the effort of FLIP-27 source
> >>>> integration into future work and integrate
> >>>> InputFormat&SourceFunction for now.
> >>>>
> >>>> I think it's fine to use InputFormat&SourceFunction, as they are not
> >>>> deprecated, otherwise, we have to introduce another function
> >>>> similar to them which is meaningless. We need to plan FLIP-27 source
> >>>> integration ASAP before InputFormat & SourceFunction are deprecated.
> >>>>
> >>>> Best,
> >>>> Jark
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>
> >>>> On Thu, 12 May 2022 at 15:46, Александр Смирнов <sm...@gmail.com>
> >>>> wrote:
> >>>>
> >>>>> Hi Martijn!
> >>>>>
> >>>>> Got it. Therefore, the InputFormat-based implementation is not under consideration.
> >>>>> Thanks for clearing that up!
> >>>>>
> >>>>> Best regards,
> >>>>> Smirnov Alexander
> >>>>>
> >>>>> Thu, 12 May 2022 at 14:23, Martijn Visser <ma...@ververica.com>:
> >>>>>>
> >>>>>> Hi,
> >>>>>>
> >>>>>> With regards to:
> >>>>>>
> >>>>>>> But if there are plans to refactor all connectors to FLIP-27
> >>>>>>
> >>>>>> Yes, FLIP-27 is the target for all connectors. The old interfaces will be
> >>>>>> deprecated and connectors will either be refactored to use the new ones
> >>>>> or
> >>>>>> dropped.
> >>>>>>
> >>>>>> The caching should work for connectors that are using FLIP-27 interfaces,
> >>>>>> we should not introduce new features for old interfaces.
> >>>>>>
> >>>>>> Best regards,
> >>>>>>
> >>>>>> Martijn
> >>>>>>
> >>>>>> On Thu, 12 May 2022 at 06:19, Александр Смирнов <sm...@gmail.com>
> >>>>>> wrote:
> >>>>>>
> >>>>>>> Hi Jark!
> >>>>>>>
> >>>>>>> Sorry for the late response. I would like to make some comments and
> >>>>>>> clarify my points.
> >>>>>>>
> >>>>>>> 1) I agree with your first statement. I think we can achieve both
> >>>>>>> advantages this way: put the Cache interface in flink-table-common,
> >>>>>>> but have implementations of it in flink-table-runtime. Therefore if a
> >>>>>>> connector developer wants to use existing cache strategies and their
> >>>>>>> implementations, he can just pass lookupConfig to the planner, but if
> >>>>>>> he wants to have its own cache implementation in his TableFunction, it
> >>>>>>> will be possible for him to use the existing interface for this
> >>>>>>> purpose (we can explicitly point this out in the documentation). In
> >>>>>>> this way all configs and metrics will be unified. WDYT?
> >>>>>>>
> >>>>>>>> If a filter can prune 90% of data in the cache, we will have 90% of
> >>>>>>> lookup requests that can never be cached
> >>>>>>>
> >>>>>>> 2) Let me clarify the logic filters optimization in case of LRU cache.
> >>>>>>> It looks like Cache<RowData, Collection<RowData>>. Here we always
> >>>>>>> store the response of the dimension table in cache, even after
> >>>>>>> applying calc function. I.e. if there are no rows after applying
> >>>>>>> filters to the result of the 'eval' method of TableFunction, we store
> >>>>>>> the empty list by lookup keys. Therefore the cache line will be
> >>>>>>> filled, but will require much less memory (in bytes). I.e. we don't
> >>>>>>> completely filter keys, by which result was pruned, but significantly
> >>>>>>> reduce required memory to store this result. If the user knows about
> >>>>>>> this behavior, he can increase the 'max-rows' option before the start
> >>>>>>> of the job. But actually I came up with the idea that we can do this
> >>>>>>> automatically by using the 'maximumWeight' and 'weigher' methods of
> >>>>>>> GuavaCache [1]. Weight can be the size of the collection of rows
> >>>>>>> (value of cache). Therefore cache can automatically fit much more
> >>>>>>> records than before.
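[Editor's note] The weigher idea above (weight = number of rows stored per cached key, which Guava's CacheBuilder supports via maximumWeight and weigher) can be approximated with the JDK alone. The class below is an illustrative sketch, not the FLIP's cache interface: it bounds the cache by total cached rows rather than by number of keys, so keys whose filtered result is an empty list cost almost nothing.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch of a weight-bounded LRU cache: the bound is the total number of
// cached rows across all keys, not the number of keys.
public class WeightBoundedCache {
    private final long maxTotalRows;
    private long totalRows = 0;
    // access-order LinkedHashMap iterates least-recently-used entries first
    private final LinkedHashMap<Object, List<List<Object>>> map =
            new LinkedHashMap<>(16, 0.75f, true);

    public WeightBoundedCache(long maxTotalRows) {
        this.maxTotalRows = maxTotalRows;
    }

    public void put(Object key, List<List<Object>> rows) {
        List<List<Object>> old = map.put(key, rows);
        if (old != null) totalRows -= old.size();
        totalRows += rows.size();
        evictIfNeeded();
    }

    public List<List<Object>> get(Object key) {
        return map.get(key);
    }

    public int keyCount() {
        return map.size();
    }

    private void evictIfNeeded() {
        Iterator<Map.Entry<Object, List<List<Object>>>> it = map.entrySet().iterator();
        while (totalRows > maxTotalRows && it.hasNext()) {
            // evict the least-recently-used key and release its weight
            totalRows -= it.next().getValue().size();
            it.remove();
        }
    }
}
```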
> >>>>>>>
> >>>>>>>> Flink SQL has provided a standard way to do filters and projects
> >>>>>>> pushdown, i.e., SupportsFilterPushDown and SupportsProjectionPushDown.
> >>>>>>>> Jdbc/hive/HBase haven't implemented the interfaces, don't mean it's
> >>>>> hard
> >>>>>>> to implement.
> >>>>>>>
> >>>>>>> It's debatable how difficult it will be to implement filter pushdown.
> >>>>>>> But I think the fact that currently there is no database connector
> >>>>>>> with filter pushdown at least means that this feature won't be
> >>>>>>> supported soon in connectors. Moreover, if we talk about other
> >>>>>>> connectors (not in Flink repo), their databases might not support all
> >>>>>>> Flink filters (or not support filters at all). I think users are
> >>>>>>> interested in having the cache filter optimization independently of
> >>>>>>> support for other features that require solving more complex (or even
> >>>>>>> unsolvable) problems.
> >>>>>>>
> >>>>>>> 3) I agree with your third statement. Actually in our internal version
> >>>>>>> I also tried to unify the logic of scanning and reloading data from
> >>>>>>> connectors. But unfortunately, I didn't find a way to unify the logic
> >>>>>>> of all ScanRuntimeProviders (InputFormat, SourceFunction, Source,...)
> >>>>>>> and reuse it in reloading ALL cache. As a result I settled on using
> >>>>>>> InputFormat, because it was used for scanning in all lookup
> >>>>>>> connectors. (I didn't know that there are plans to deprecate
> >>>>>>> InputFormat in favor of FLIP-27 Source). IMO usage of the FLIP-27 source
> >>>>>>> in ALL caching is not a good idea, because this source was designed to
> >>>>>>> work in a distributed environment (SplitEnumerator on JobManager and
> >>>>>>> SourceReaders on TaskManagers), not in one operator (lookup join
> >>>>>>> operator in our case). There is even no direct way to pass splits from
> >>>>>>> SplitEnumerator to SourceReader (this logic works through
> >>>>>>> SplitEnumeratorContext, which requires
> >>>>>>> OperatorCoordinator.SubtaskGateway to send AddSplitEvents). Usage of
> >>>>>>> InputFormat for ALL cache seems much clearer and easier. But if
> >>>>>>> there are plans to refactor all connectors to FLIP-27, I have the
> >>>>>>> following idea: maybe we can drop the lookup join ALL cache in
> >>>>>>> favor of a simple join with repeated scanning of the batch source? The point
> >>>>>>> is that the only difference between lookup join ALL cache and simple
> >>>>>>> join with batch source is that in the first case scanning is performed
> >>>>>>> multiple times, in between which state (cache) is cleared (correct me
> >>>>>>> if I'm wrong). So what if we extend the functionality of simple join
> >>>>>>> to support state reloading + extend the functionality of scanning
> >>>>>>> batch source multiple times (this one should be easy with new FLIP-27
> >>>>>>> source, that unifies streaming/batch reading - we will need to change
> >>>>>>> only SplitEnumerator, which will pass splits again after some TTL).
> >>>>>>> WDYT? I must say that this looks like a long-term goal and will make
> >>>>>>> the scope of this FLIP even larger than you said. Maybe we can limit
> >>>>>>> ourselves to a simpler solution now (InputFormats).
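[Editor's note] The ALL-cache semantics under discussion (periodically re-scan the whole dimension table, then swap the state) reduce to a small sketch once the contentious parts (scheduling, parallel InputSplits, FLIP-27 integration) are abstracted away. The class and the Supplier standing in for InputFormat are illustrative assumptions, not FLIP API.

```java
import java.util.Map;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

// Sketch of ALL-cache semantics: a full scan (abstracted as a Supplier)
// builds a snapshot, and reload() atomically swaps it so lookups never
// observe a half-built cache. Timer/TTL scheduling is deliberately omitted.
public class AllCache<K, V> {
    private final Supplier<Map<K, V>> scan; // stands in for an InputFormat scan
    private final AtomicReference<Map<K, V>> snapshot = new AtomicReference<>(Map.of());

    public AllCache(Supplier<Map<K, V>> scan) {
        this.scan = scan;
    }

    /** Called on a timer: re-scan the source and swap in the new snapshot. */
    public void reload() {
        snapshot.set(scan.get());
    }

    public V lookup(K key) {
        return snapshot.get().get(key);
    }
}
```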
> >>>>>>>
> >>>>>>> So to sum up, my points is like this:
> >>>>>>> 1) There is a way to make both concise and flexible interfaces for
> >>>>>>> caching in lookup join.
> >>>>>>> 2) Cache filters optimization is important both in LRU and ALL caches.
> >>>>>>> 3) It is unclear when filter pushdown will be supported in Flink
> >>>>>>> connectors, some of the connectors might not have the opportunity to
> >>>>>>> support filter pushdown + as I know, currently filter pushdown works
> >>>>>>> only for scanning (not lookup). So cache filters + projections
> >>>>>>> optimization should be independent from other features.
> >>>>>>> 4) The ALL cache implementation is a complex topic that touches multiple
> >>>>>>> aspects of how Flink is evolving. Dropping InputFormat in favor of the
> >>>>>>> FLIP-27 Source would make the ALL cache implementation really complex
> >>>>>>> and unclear, so maybe instead we can extend the functionality of the
> >>>>>>> simple join, or keep InputFormat for the lookup join ALL cache?
> >>>>>>>
> >>>>>>> Best regards,
> >>>>>>> Smirnov Alexander
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> [1]
> >>>>>>>
> >>>>> https://guava.dev/releases/18.0/api/docs/com/google/common/cache/CacheBuilder.html#weigher(com.google.common.cache.Weigher)
> >>>>>>>
> >>>>>>> Thu, 5 May 2022 at 20:34, Jark Wu <im...@gmail.com>:
> >>>>>>>>
> >>>>>>>> It's great to see the active discussion! I want to share my ideas:
> >>>>>>>>
> >>>>>>>> 1) implement the cache in framework vs. connectors base
> >>>>>>>> I don't have a strong opinion on this. Both ways should work (e.g.,
> >>>>> cache
> >>>>>>>> pruning, compatibility).
> >>>>>>>> The framework way can provide more concise interfaces.
> >>>>>>>> The connector base way can define more flexible cache
> >>>>>>>> strategies/implementations.
> >>>>>>>> We are still investigating a way to see if we can have both
> >>>>> advantages.
> >>>>>>>> We should reach a consensus that the way should be a final state,
> >>>>> and we
> >>>>>>>> are on the path to it.
> >>>>>>>>
> >>>>>>>> 2) filters and projections pushdown:
> >>>>>>>> I agree with Alex that the filter pushdown into cache can benefit a
> >>>>> lot
> >>>>>>> for
> >>>>>>>> ALL cache.
> >>>>>>>> However, this is not true for LRU cache. Connectors use cache to
> >>>>> reduce
> >>>>>>> IO
> >>>>>>>> requests to databases for better throughput.
> >>>>>>>> If a filter can prune 90% of data in the cache, we will have 90% of
> >>>>>>> lookup
> >>>>>>>> requests that can never be cached
> >>>>>>>> and hit directly to the databases. That means the cache is
> >>>>> meaningless in
> >>>>>>>> this case.
> >>>>>>>>
> >>>>>>>> IMO, Flink SQL has provided a standard way to do filters and projects
> >>>>>>>> pushdown, i.e., SupportsFilterPushDown and
> >>>>> SupportsProjectionPushDown.
> >>>>>>>> Jdbc/hive/HBase haven't implemented the interfaces, don't mean it's
> >>>>> hard
> >>>>>>> to
> >>>>>>>> implement.
> >>>>>>>> They should implement the pushdown interfaces to reduce IO and the
> >>>>> cache
> >>>>>>>> size.
> >>>>>>>> That should be a final state that the scan source and lookup source
> >>>>> share
> >>>>>>>> the exact pushdown implementation.
> >>>>>>>> I don't see why we need to duplicate the pushdown logic in caches,
> >>>>> which
> >>>>>>>> will complex the lookup join design.
> >>>>>>>>
> >>>>>>>> 3) ALL cache abstraction
> >>>>>>>> All cache might be the most challenging part of this FLIP. We have
> >>>>> never
> >>>>>>>> provided a reload-lookup public interface.
> >>>>>>>> Currently, we put the reload logic in the "eval" method of
> >>>>> TableFunction.
> >>>>>>>> That's hard for some sources (e.g., Hive).
> >>>>>>>> Ideally, connector implementation should share the logic of reload
> >>>>> and
> >>>>>>>> scan, i.e. ScanTableSource with InputFormat/SourceFunction/FLIP-27
> >>>>>>> Source.
> >>>>>>>> However, InputFormat/SourceFunction are deprecated, and the FLIP-27
> >>>>>>> source
> >>>>>>>> is deeply coupled with SourceOperator.
> >>>>>>>> If we want to invoke the FLIP-27 source in LookupJoin, this may make
> >>>>> the
> >>>>>>>> scope of this FLIP much larger.
> >>>>>>>> We are still investigating how to abstract the ALL cache logic and
> >>>>> reuse
> >>>>>>>> the existing source interfaces.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> Best,
> >>>>>>>> Jark
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On Thu, 5 May 2022 at 20:22, Roman Boyko <ro...@gmail.com>
> >>>>> wrote:
> >>>>>>>>
> >>>>>>>>> It's a much more complicated activity and lies out of the scope of
> >>>>> this
> >>>>>>>>> improvement. Because such pushdowns should be done for all
> >>>>>>> ScanTableSource
> >>>>>>>>> implementations (not only for Lookup ones).
> >>>>>>>>>
> >>>>>>>>> On Thu, 5 May 2022 at 19:02, Martijn Visser <
> >>>>> martijnvisser@apache.org>
> >>>>>>>>> wrote:
> >>>>>>>>>
> >>>>>>>>>> Hi everyone,
> >>>>>>>>>>
> >>>>>>>>>> One question regarding "And Alexander correctly mentioned that
> >>>>> filter
> >>>>>>>>>> pushdown still is not implemented for jdbc/hive/hbase." -> Would
> >>>>> an
> >>>>>>>>>> alternative solution be to actually implement these filter
> >>>>> pushdowns?
> >>>>>>> I
> >>>>>>>>>> can
> >>>>>>>>>> imagine that there are many more benefits to doing that, outside
> >>>>> of
> >>>>>>> lookup
> >>>>>>>>>> caching and metrics.
> >>>>>>>>>>
> >>>>>>>>>> Best regards,
> >>>>>>>>>>
> >>>>>>>>>> Martijn Visser
> >>>>>>>>>> https://twitter.com/MartijnVisser82
> >>>>>>>>>> https://github.com/MartijnVisser
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> On Thu, 5 May 2022 at 13:58, Roman Boyko <ro...@gmail.com>
> >>>>>>> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> Hi everyone!
> >>>>>>>>>>>
> >>>>>>>>>>> Thanks for driving such a valuable improvement!
> >>>>>>>>>>>
> >>>>>>>>>>> I do think that single cache implementation would be a nice
> >>>>>>> opportunity
> >>>>>>>>>> for
> >>>>>>>>>>> users. And it will break the "FOR SYSTEM_TIME AS OF proc_time"
> >>>>>>> semantics
> >>>>>>>>>>> anyway - doesn't matter how it will be implemented.
> >>>>>>>>>>>
> >>>>>>>>>>> Putting myself in the user's shoes, I can say that:
> >>>>>>>>>>> 1) I would prefer to have the opportunity to cut off the cache
> >>>>> size
> >>>>>>> by
> >>>>>>>>>>> simply filtering unnecessary data. And the most handy way to do
> >>>>> it
> >>>>>>> is
> >>>>>>>>>> apply
> >>>>>>>>>>> it inside LookupRunners. It would be a bit harder to pass it
> >>>>>>> through the
> >>>>>>>>>>> LookupJoin node to TableFunction. And Alexander correctly
> >>>>> mentioned
> >>>>>>> that
> >>>>>>>>>>> filter pushdown still is not implemented for jdbc/hive/hbase.
> >>>>>>>>>>> 2) The ability to set the different caching parameters for
> >>>>> different
> >>>>>>>>>> tables
> >>>>>>>>>>> is quite important. So I would prefer to set it through DDL
> >>>>> rather
> >>>>>>> than
> >>>>>>>>>>> have the same TTL, strategy and other options for all lookup
> >>>>> tables.
> >>>>>>>>>>> 3) Putting the cache into the framework really deprives us of
> >>>>>>>>>>> extensibility (users won't be able to implement their own
> >>>>> cache).
> >>>>>>> But
> >>>>>>>>>> most
> >>>>>>>>>>> probably it might be solved by creating more different cache
> >>>>>>> strategies
> >>>>>>>>>> and
> >>>>>>>>>>> a wider set of configurations.
> >>>>>>>>>>>
> >>>>>>>>>>> All these points are much closer to the schema proposed by
> >>>>>>> Alexander.
> >>>>>>>>>>> Qingshen Ren, please correct me if I'm not right and all these
> >>>>>>>>>> facilities
> >>>>>>>>>>> might be simply implemented in your architecture?
> >>>>>>>>>>>
> >>>>>>>>>>> Best regards,
> >>>>>>>>>>> Roman Boyko
> >>>>>>>>>>> e.: ro.v.boyko@gmail.com
> >>>>>>>>>>>
> >>>>>>>>>>> On Wed, 4 May 2022 at 21:01, Martijn Visser <
> >>>>>>> martijnvisser@apache.org>
> >>>>>>>>>>> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>> Hi everyone,
> >>>>>>>>>>>>
> >>>>>>>>>>>> I don't have much to chip in, but just wanted to express that
> >>>>> I
> >>>>>>> really
> >>>>>>>>>>>> appreciate the in-depth discussion on this topic and I hope
> >>>>> that
> >>>>>>>>>> others
> >>>>>>>>>>>> will join the conversation.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Best regards,
> >>>>>>>>>>>>
> >>>>>>>>>>>> Martijn
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Tue, 3 May 2022 at 10:15, Александр Смирнов <
> >>>>>>> smiralexan@gmail.com>
> >>>>>>>>>>>> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>>> Hi Qingsheng, Leonard and Jark,
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Thanks for your detailed feedback! However, I have questions
> >>>>>>> about
> >>>>>>>>>>>>> some of your statements (maybe I didn't get something?).
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> Caching actually breaks the semantic of "FOR SYSTEM_TIME
> >>>>> AS OF
> >>>>>>>>>>>> proc_time”
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> I agree that the semantics of "FOR SYSTEM_TIME AS OF
> >>>>> proc_time"
> >>>>>>> is
> >>>>>>>>>> not
> >>>>>>>>>>>>> fully implemented with caching, but as you said, users go
> >>>>> on it
> >>>>>>>>>>>>> consciously to achieve better performance (no one proposed
> >>>>> to
> >>>>>>> enable
> >>>>>>>>>>>>> caching by default, etc.). Or by users do you mean other
> >>>>>>> developers
> >>>>>>>>>> of
> >>>>>>>>>>>>> connectors? In this case developers explicitly specify
> >>>>> whether
> >>>>>>> their
> >>>>>>>>>>>>> connector supports caching or not (in the list of supported
> >>>>>>>>>> options),
> >>>>>>>>>>>>> no one makes them do that if they don't want to. So what
> >>>>>>> exactly is
> >>>>>>>>>>>>> the difference between implementing caching in modules
> >>>>>>>>>>>>> flink-table-runtime and in flink-table-common from the
> >>>>>>> considered
> >>>>>>>>>>>>> point of view? How does it affect on breaking/non-breaking
> >>>>> the
> >>>>>>>>>>>>> semantics of "FOR SYSTEM_TIME AS OF proc_time"?
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> confront a situation that allows table options in DDL to
> >>>>>>> control
> >>>>>>>>>> the
> >>>>>>>>>>>>> behavior of the framework, which has never happened
> >>>>> previously
> >>>>>>> and
> >>>>>>>>>>> should
> >>>>>>>>>>>>> be cautious
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> If we talk about main differences of semantics of DDL
> >>>>> options
> >>>>>>> and
> >>>>>>>>>>>>> config options("table.exec.xxx"), isn't it about limiting
> >>>>> the
> >>>>>>> scope
> >>>>>>>>>> of
> >>>>>>>>>>>>> the options + importance for the user business logic rather
> >>>>> than
> >>>>>>>>>>>>> specific location of corresponding logic in the framework? I
> >>>>>>> mean
> >>>>>>>>>> that
> >>>>>>>>>>>>> in my design, for example, putting an option with lookup
> >>>>> cache
> >>>>>>>>>>>>> strategy in configurations would  be the wrong decision,
> >>>>>>> because it
> >>>>>>>>>>>>> directly affects the user's business logic (not just
> >>>>> performance
> >>>>>>>>>>>>> optimization) + touches just several functions of ONE table
> >>>>>>> (there
> >>>>>>>>>> can
> >>>>>>>>>>>>> be multiple tables with different caches). Does it really
> >>>>>>> matter for
> >>>>>>>>>>>>> the user (or someone else) where the logic is located,
> >>>>> which is
> >>>>>>>>>>>>> affected by the applied option?
> >>>>>>>>>>>>> Also I can remember DDL option 'sink.parallelism', which in
> >>>>>>> some way
> >>>>>>>>>>>>> "controls the behavior of the framework" and I don't see any
> >>>>>>> problem
> >>>>>>>>>>>>> here.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> introduce a new interface for this all-caching scenario
> >>>>> and
> >>>>>>> the
> >>>>>>>>>>> design
> >>>>>>>>>>>>> would become more complex
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> This is a subject for a separate discussion, but actually
> >>>>> in our
> >>>>>>>>>>>>> internal version we solved this problem quite easily - we
> >>>>> reused
> >>>>>>>>>>>>> InputFormat class (so there is no need for a new API). The
> >>>>>>> point is
> >>>>>>>>>>>>> that currently all lookup connectors use InputFormat for
> >>>>>>> scanning
> >>>>>>>>>> the
> >>>>>>>>>>>>> data in batch mode: HBase, JDBC and even Hive - it uses
> >>>>> class
> >>>>>>>>>>>>> PartitionReader, that is actually just a wrapper around
> >>>>>>> InputFormat.
> >>>>>>>>>>>>> The advantage of this solution is the ability to reload
> >>>>> cache
> >>>>>>> data
> >>>>>>>>>> in
> >>>>>>>>>>>>> parallel (number of threads depends on number of
> >>>>> InputSplits,
> >>>>>>> but
> >>>>>>>>>> has
> >>>>>>>>>>>>> an upper limit). As a result cache reload time significantly
> >>>>>>> reduces
> >>>>>>>>>>>>> (as well as time of input stream blocking). I know that
> >>>>> usually
> >>>>>>> we
> >>>>>>>>>> try
> >>>>>>>>>>>>> to avoid usage of concurrency in Flink code, but maybe this
> >>>>> one
> >>>>>>> can
> >>>>>>>>>> be
> >>>>>>>>>>>>> an exception. BTW I don't say that it's an ideal solution,
> >>>>> maybe
> >>>>>>>>>> there
> >>>>>>>>>>>>> are better ones.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> Providing the cache in the framework might introduce
> >>>>>>> compatibility
> >>>>>>>>>>>> issues
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> It's possible only in cases when the developer of the
> >>>>> connector
> >>>>>>>>>> won't
> >>>>>>>>>>>>> properly refactor his code and will use new cache options
> >>>>>>>>>> incorrectly
> >>>>>>>>>>>>> (i.e. explicitly provide the same options into 2 different
> >>>>> code
> >>>>>>>>>>>>> places). For correct behavior all he will need to do is to
> >>>>>>> redirect
> >>>>>>>>>>>>> existing options to the framework's LookupConfig (+ maybe
> >>>>> add an
> >>>>>>>>>> alias
> >>>>>>>>>>>>> for options, if there was different naming), everything
> >>>>> will be
> >>>>>>>>>>>>> transparent for users. If the developer doesn't do
> >>>>> refactoring at
> >>>>>>> all,
> >>>>>>>>>>>>> nothing will be changed for the connector because of
> >>>>> backward
> >>>>>>>>>>>>> compatibility. Also if a developer wants to use his own
> >>>>> cache
> >>>>>>> logic,
> >>>>>>>>>>>>> he just can refuse to pass some of the configs into the
> >>>>>>> framework,
> >>>>>>>>>> and
> >>>>>>>>>>>>> instead make his own implementation with already existing
> >>>>>>> configs
> >>>>>>>>>> and
> >>>>>>>>>>>>> metrics (but actually I think that it's a rare case).
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> filters and projections should be pushed all the way down
> >>>>> to
> >>>>>>> the
> >>>>>>>>>>> table
> >>>>>>>>>>>>> function, like what we do in the scan source
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> It's a worthy goal. But the truth is that the ONLY
> >>>>> connector
> >>>>>>>>>> that
> >>>>>>>>>>>>> supports filter pushdown is FileSystemTableSource
> >>>>>>>>>>>>> (no database connector supports it currently). Also for some
> >>>>>>>>>> databases
> >>>>>>>>>>>>> it's simply impossible to push down such complex filters
> >>>>> that we
> >>>>>>> have
> >>>>>>>>>>>>> in Flink.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> only applying these optimizations to the cache seems not
> >>>>>>> quite
> >>>>>>>>>>> useful
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Filters can cut off an arbitrarily large amount of data
> >>>>> from the
> >>>>>>>>>>>>> dimension table. For a simple example, suppose in dimension
> >>>>>>> table
> >>>>>>>>>>>>> 'users'
> >>>>>>>>>>>>> we have column 'age' with values from 20 to 40, and input
> >>>>> stream
> >>>>>>>>>>>>> 'clicks' that is ~uniformly distributed by age of users. If
> >>>>> we
> >>>>>>> have
> >>>>>>>>>>>>> filter 'age > 30',
> >>>>>>>>>>>>> there will be half as much data in the cache. This means the user
> >>>>> can
> >>>>>>>>>>>>> increase 'lookup.cache.max-rows' by almost 2 times. It will
> >>>>>>> gain a
> >>>>>>>>>>>>> huge
> >>>>>>>>>>>>> performance boost. Moreover, this optimization starts to
> >>>>> really
> >>>>>>>>>> shine
> >>>>>>>>>>>>> in 'ALL' cache, where tables without filters and projections
> >>>>>>> can't
> >>>>>>>>>> fit
> >>>>>>>>>>>>> in memory, but with them - can. This opens up additional
> >>>>>>>>>> possibilities
> >>>>>>>>>>>>> for users. And that doesn't sound like 'not quite useful'.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> It would be great to hear other voices regarding this topic!
> >>>>>>> Because
> >>>>>>>>>>>>> we have quite a lot of controversial points, and I think
> >>>>> with
> >>>>>>> the
> >>>>>>>>>> help
> >>>>>>>>>>>>> of others it will be easier for us to come to a consensus.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Best regards,
> >>>>>>>>>>>>> Smirnov Alexander
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> пт, 29 апр. 2022 г. в 22:33, Qingsheng Ren <
> >>>>> renqschn@gmail.com
> >>>>>>>> :
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Hi Alexander and Arvid,
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Thanks for the discussion and sorry for my late response!
> >>>>> We
> >>>>>>> had
> >>>>>>>>>> an
> >>>>>>>>>>>>> internal discussion together with Jark and Leonard and I’d
> >>>>> like
> >>>>>>> to
> >>>>>>>>>>>>> summarize our ideas. Instead of implementing the cache
> >>>>> logic in
> >>>>>>> the
> >>>>>>>>>>> table
> >>>>>>>>>>>>> runtime layer or wrapping around the user-provided table
> >>>>>>> function,
> >>>>>>>>>> we
> >>>>>>>>>>>>> prefer to introduce some new APIs extending TableFunction
> >>>>> with
> >>>>>>> these
> >>>>>>>>>>>>> concerns:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> 1. Caching actually breaks the semantic of "FOR
> >>>>> SYSTEM_TIME
> >>>>>>> AS OF
> >>>>>>>>>>>>> proc_time”, because it couldn’t truly reflect the content
> >>>>> of the
> >>>>>>>>>> lookup
> >>>>>>>>>>>>> table at the moment of querying. If users choose to enable
> >>>>>>> caching
> >>>>>>>>>> on
> >>>>>>>>>>> the
> >>>>>>>>>>>>> lookup table, they implicitly indicate that this breakage is
> >>>>>>>>>> acceptable
> >>>>>>>>>>>> in
> >>>>>>>>>>>>> exchange for the performance. So we prefer not to provide
> >>>>>>> caching on
> >>>>>>>>>>> the
> >>>>>>>>>>>>> table runtime level.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> 2. If we make the cache implementation in the framework
> >>>>>>> (whether
> >>>>>>>>>> in a
> >>>>>>>>>>>>> runner or a wrapper around TableFunction), we have to
> >>>>> confront a
> >>>>>>>>>>>> situation
> >>>>>>>>>>>>> that allows table options in DDL to control the behavior of
> >>>>> the
> >>>>>>>>>>>> framework,
> >>>>>>>>>>>>> which has never happened previously and should be cautious.
> >>>>>>> Under
> >>>>>>>>>> the
> >>>>>>>>>>>>> current design the behavior of the framework should only be
> >>>>>>>>>> specified
> >>>>>>>>>>> by
> >>>>>>>>>>>>> configurations (“table.exec.xxx”), and it’s hard to apply
> >>>>> these
> >>>>>>>>>> general
> >>>>>>>>>>>>> configs to a specific table.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> 3. We have use cases that lookup source loads and refresh
> >>>>> all
> >>>>>>>>>> records
> >>>>>>>>>>>>> periodically into the memory to achieve high lookup
> >>>>> performance
> >>>>>>>>>> (like
> >>>>>>>>>>>> Hive
> >>>>>>>>>>>>> connector in the community, and also widely used by our
> >>>>> internal
> >>>>>>>>>>>>> connectors). Wrapping the cache around the user’s
> >>>>> TableFunction
> >>>>>>>>>> works
> >>>>>>>>>>>> fine
> >>>>>>>>>>>>> for LRU caches, but I think we have to introduce a new
> >>>>>>> interface for
> >>>>>>>>>>> this
> >>>>>>>>>>>>> all-caching scenario and the design would become more
> >>>>> complex.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> 4. Providing the cache in the framework might introduce
> >>>>>>>>>> compatibility
> >>>>>>>>>>>>> issues to existing lookup sources like there might exist two
> >>>>>>> caches
> >>>>>>>>>>> with
> >>>>>>>>>>>>> totally different strategies if the user incorrectly
> >>>>> configures
> >>>>>>> the
> >>>>>>>>>>> table
> >>>>>>>>>>>>> (one in the framework and another implemented by the lookup
> >>>>>>> source).
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> As for the optimization mentioned by Alexander, I think
> >>>>>>> filters
> >>>>>>>>>> and
> >>>>>>>>>>>>> projections should be pushed all the way down to the table
> >>>>>>> function,
> >>>>>>>>>>> like
> >>>>>>>>>>>>> what we do in the scan source, instead of the runner with
> >>>>> the
> >>>>>>> cache.
> >>>>>>>>>>> The
> >>>>>>>>>>>>> goal of using cache is to reduce the network I/O and
> >>>>> pressure
> >>>>>>> on the
> >>>>>>>>>>>>> external system, and only applying these optimizations to
> >>>>> the
> >>>>>>> cache
> >>>>>>>>>>> seems
> >>>>>>>>>>>>> not quite useful.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> I made some updates to the FLIP[1] to reflect our ideas.
> >>>>> We
> >>>>>>>>>> prefer to
> >>>>>>>>>>>>> keep the cache implementation as a part of TableFunction,
> >>>>> and we
> >>>>>>>>>> could
> >>>>>>>>>>>>> provide some helper classes (CachingTableFunction,
> >>>>>>>>>>>> AllCachingTableFunction,
> >>>>>>>>>>>>> CachingAsyncTableFunction) to developers and regulate
> >>>>> metrics
> >>>>>>> of the
> >>>>>>>>>>>> cache.
> >>>>>>>>>>>>> Also, I made a POC[2] for your reference.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Looking forward to your ideas!
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> [1]
> >>>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>
> >>>>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-221+Abstraction+for+lookup+source+cache+and+metric
> >>>>>>>>>>>>>> [2] https://github.com/PatrickRen/flink/tree/FLIP-221
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Best regards,
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Qingsheng
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> On Tue, Apr 26, 2022 at 4:45 PM Александр Смирнов <
> >>>>>>>>>>>> smiralexan@gmail.com>
> >>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Thanks for the response, Arvid!
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> I have few comments on your message.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> but could also live with an easier solution as the
> >>>>> first
> >>>>>>> step:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> I think that these 2 ways are mutually exclusive
> >>>>> (originally
> >>>>>>>>>>> proposed
> >>>>>>>>>>>>>>> by Qingsheng and mine), because conceptually they follow
> >>>>> the
> >>>>>>> same
> >>>>>>>>>>>>>>> goal, but implementation details are different. If we
> >>>>> will
> >>>>>>> go one
> >>>>>>>>>>> way,
> >>>>>>>>>>>>>>> moving to another way in the future will mean deleting
> >>>>>>> existing
> >>>>>>>>>> code
> >>>>>>>>>>>>>>> and once again changing the API for connectors. So I
> >>>>> think we
> >>>>>>>>>> should
> >>>>>>>>>>>>>>> reach a consensus with the community about that and then
> >>>>> work
> >>>>>>>>>>> together
> >>>>>>>>>>>>>>> on this FLIP, i.e. divide the work on tasks for different
> >>>>>>> parts
> >>>>>>>>>> of
> >>>>>>>>>>> the
> >>>>>>>>>>>>>>> flip (for example, LRU cache unification / introducing
> >>>>>>> proposed
> >>>>>>>>>> set
> >>>>>>>>>>> of
> >>>>>>>>>>>>>>> metrics / further work…). WDYT, Qingsheng?
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> as the source will only receive the requests after
> >>>>> filter
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Actually if filters are applied to fields of the lookup
> >>>>>>> table, we
> >>>>>>>>>>>>>>> firstly must do requests, and only after that we can
> >>>>> filter
> >>>>>>>>>>> responses,
> >>>>>>>>>>>>>>> because lookup connectors don't have filter pushdown. So
> >>>>> if
> >>>>>>>>>>> filtering
> >>>>>>>>>>>>>>> is done before caching, there will be far fewer rows in the
> >>>>>>> cache.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> @Alexander unfortunately, your architecture is not
> >>>>> shared.
> >>>>>>> I
> >>>>>>>>>> don't
> >>>>>>>>>>>>> know the
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> solution to share images to be honest.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Sorry for that, I’m a bit new to such kinds of
> >>>>> conversations
> >>>>>>> :)
> >>>>>>>>>>>>>>> I have no write access to the confluence, so I made a
> >>>>> Jira
> >>>>>>> issue,
> >>>>>>>>>>>>>>> where described the proposed changes in more details -
> >>>>>>>>>>>>>>> https://issues.apache.org/jira/browse/FLINK-27411.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Will happy to get more feedback!
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Best,
> >>>>>>>>>>>>>>> Smirnov Alexander
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> пн, 25 апр. 2022 г. в 19:49, Arvid Heise <
> >>>>> arvid@apache.org>:
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Hi Qingsheng,
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Thanks for driving this; the inconsistency was not
> >>>>>>> satisfying
> >>>>>>>>>> for
> >>>>>>>>>>>> me.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> I second Alexander's idea though but could also live
> >>>>> with
> >>>>>>> an
> >>>>>>>>>>> easier
> >>>>>>>>>>>>>>>> solution as the first step: Instead of making caching
> >>>>> an
> >>>>>>>>>>>>> implementation
> >>>>>>>>>>>>>>>> detail of TableFunction X, rather devise a caching
> >>>>> layer
> >>>>>>>>>> around X.
> >>>>>>>>>>>> So
> >>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>> proposal would be a CachingTableFunction that
> >>>>> delegates to
> >>>>>>> X in
> >>>>>>>>>>> case
> >>>>>>>>>>>>> of
> >>>>>>>>>>>>>>>> misses and else manages the cache. Lifting it into the
> >>>>>>> operator
> >>>>>>>>>>>> model
> >>>>>>>>>>>>> as
> >>>>>>>>>>>>>>>> proposed would be even better but is probably
> >>>>> unnecessary
> >>>>>>> in
> >>>>>>>>>> the
> >>>>>>>>>>>>> first step
> >>>>>>>>>>>>>>>> for a lookup source (as the source will only receive
> >>>>> the
> >>>>>>>>>> requests
> >>>>>>>>>>>>> after
> >>>>>>>>>>>>>>>> filter; applying projection may be more interesting to
> >>>>> save
> >>>>>>>>>>> memory).
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Another advantage is that all the changes of this FLIP
> >>>>>>> would be
> >>>>>>>>>>>>> limited to
> >>>>>>>>>>>>>>>> options, no need for new public interfaces. Everything
> >>>>> else
> >>>>>>>>>>> remains
> >>>>>>>>>>>> an
> >>>>>>>>>>>>>>>> implementation of Table runtime. That means we can
> >>>>> easily
> >>>>>>>>>>>> incorporate
> >>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>> optimization potential that Alexander pointed out
> >>>>> later.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> @Alexander unfortunately, your architecture is not
> >>>>> shared.
> >>>>>>> I
> >>>>>>>>>> don't
> >>>>>>>>>>>>> know the
> >>>>>>>>>>>>>>>> solution to share images to be honest.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> On Fri, Apr 22, 2022 at 5:04 PM Александр Смирнов <
> >>>>>>>>>>>>> smiralexan@gmail.com>
> >>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Hi Qingsheng! My name is Alexander, I'm not a
> >>>>> committer
> >>>>>>> yet,
> >>>>>>>>>> but
> >>>>>>>>>>>> I'd
> >>>>>>>>>>>>>>>>> really like to become one. And this FLIP really
> >>>>>>> interested
> >>>>>>>>>> me.
> >>>>>>>>>>>>>>>>> Actually I have worked on a similar feature in my
> >>>>>>> company’s
> >>>>>>>>>>> Flink
> >>>>>>>>>>>>>>>>> fork, and we would like to share our thoughts on
> >>>>> this and
> >>>>>>>>>> make
> >>>>>>>>>>>> code
> >>>>>>>>>>>>>>>>> open source.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> I think there is a better alternative than
> >>>>> introducing an
> >>>>>>>>>>> abstract
> >>>>>>>>>>>>>>>>> class for TableFunction (CachingTableFunction). As
> >>>>> you
> >>>>>>> know,
> >>>>>>>>>>>>>>>>> TableFunction exists in the flink-table-common
> >>>>> module,
> >>>>>>> which
> >>>>>>>>>>>>> provides
> >>>>>>>>>>>>>>>>> only an API for working with tables – it’s very
> >>>>>>> convenient
> >>>>>>>>>> for
> >>>>>>>>>>>>> importing
> >>>>>>>>>>>>>>>>> in connectors. In turn, CachingTableFunction contains
> >>>>>>> logic
> >>>>>>>>>> for
> >>>>>>>>>>>>>>>>> runtime execution,  so this class and everything
> >>>>>>> connected
> >>>>>>>>>> with
> >>>>>>>>>>> it
> >>>>>>>>>>>>>>>>> should be located in another module, probably in
> >>>>>>>>>>>>> flink-table-runtime.
> >>>>>>>>>>>>>>>>> But this will require connectors to depend on another
> >>>>>>> module,
> >>>>>>>>>>>> which
> >>>>>>>>>>>>>>>>> contains a lot of runtime logic, which doesn’t sound
> >>>>>>> good.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> I suggest adding a new method ‘getLookupConfig’ to
> >>>>>>>>>>>> LookupTableSource
> >>>>>>>>>>>>>>>>> or LookupRuntimeProvider to allow connectors to only
> >>>>> pass
> >>>>>>>>>>>>>>>>> configurations to the planner, therefore they won’t
> >>>>>>> depend on
> >>>>>>>>>>>>> runtime
> >>>>>>>>>>>>>>>>> realization. Based on these configs planner will
> >>>>>>> construct a
> >>>>>>>>>>>> lookup
> >>>>>>>>>>>>>>>>> join operator with corresponding runtime logic
> >>>>>>>>>> (ProcessFunctions
> >>>>>>>>>>>> in
> >>>>>>>>>>>>>>>>> module flink-table-runtime). Architecture looks like
> >>>>> in
> >>>>>>> the
> >>>>>>>>>>> pinned
> >>>>>>>>>>>>>>>>> image (LookupConfig class there is actually yours
> >>>>>>>>>> CacheConfig).
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Classes in flink-table-planner, that will be
> >>>>> responsible
> >>>>>>> for
> >>>>>>>>>>> this
> >>>>>>>>>>>> –
> >>>>>>>>>>>>>>>>> CommonPhysicalLookupJoin and his inheritors.
> >>>>>>>>>>>>>>>>> Current classes for lookup join in
> >>>>> flink-table-runtime
> >>>>>>> -
> >>>>>>>>>>>>>>>>> LookupJoinRunner, AsyncLookupJoinRunner,
> >>>>>>>>>>> LookupJoinRunnerWithCalc,
> >>>>>>>>>>>>>>>>> AsyncLookupJoinRunnerWithCalc.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> I suggest adding classes LookupJoinCachingRunner,
> >>>>>>>>>>>>>>>>> LookupJoinCachingRunnerWithCalc, etc.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> And here comes another more powerful advantage of
> >>>>> such a
> >>>>>>>>>>> solution.
> >>>>>>>>>>>>> If
> >>>>>>>>>>>>>>>>> we have caching logic on a lower level, we can apply
> >>>>> some
> >>>>>>>>>>>>>>>>> optimizations to it. LookupJoinRunnerWithCalc was
> >>>>> named
> >>>>>>> like
> >>>>>>>>>>> this
> >>>>>>>>>>>>>>>>> because it uses the ‘calc’ function, which actually
> >>>>>>> mostly
> >>>>>>>>>>>> consists
> >>>>>>>>>>>>> of
> >>>>>>>>>>>>>>>>> filters and projections.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> For example, in join table A with lookup table B
> >>>>>>> condition
> >>>>>>>>>>> ‘JOIN …
> >>>>>>>>>>>>> ON
> >>>>>>>>>>>>>>>>> A.id = B.id AND A.age = B.age + 10 WHERE B.salary >
> >>>>> 1000’
> >>>>>>>>>>> ‘calc’
> >>>>>>>>>>>>>>>>> function will contain filters A.age = B.age + 10 and
> >>>>>>>>>> B.salary >
> >>>>>>>>>>>>> 1000.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> If we apply this function before storing records in
> >>>>>>> cache,
> >>>>>>>>>> size
> >>>>>>>>>>> of
> >>>>>>>>>>>>>>>>> cache will be significantly reduced: filters = avoid
> >>>>>>> storing
> >>>>>>>>>>>> useless
> >>>>>>>>>>>>>>>>> records in cache, projections = reduce records’
> >>>>> size. So
> >>>>>>> the
> >>>>>>>>>>>> initial
> >>>>>>>>>>>>>>>>> max number of records in cache can be increased by
> >>>>> the
> >>>>>>> user.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> What do you think about it?
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> On 2022/04/19 02:47:11 Qingsheng Ren wrote:
> >>>>>>>>>>>>>>>>>> Hi devs,
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Yuan and I would like to start a discussion about
> >>>>>>>>>> FLIP-221[1],
> >>>>>>>>>>>>> which
> >>>>>>>>>>>>>>>>> introduces an abstraction of lookup table cache and
> >>>>> its
> >>>>>>>>>> standard
> >>>>>>>>>>>>> metrics.
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Currently each lookup table source should implement
> >>>>>>> their
> >>>>>>>>>> own
> >>>>>>>>>>>>> cache to
> >>>>>>>>>>>>>>>>> store lookup results, and there isn’t a standard of
> >>>>>>> metrics
> >>>>>>>>>> for
> >>>>>>>>>>>>> users and
> >>>>>>>>>>>>>>>>> developers to tuning their jobs with lookup joins,
> >>>>> which
> >>>>>>> is a
> >>>>>>>>>>>> quite
> >>>>>>>>>>>>> common
> >>>>>>>>>>>>>>>>> use case in Flink table / SQL.
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Therefore we propose some new APIs including cache,
> >>>>>>>>>> metrics,
> >>>>>>>>>>>>> wrapper
> >>>>>>>>>>>>>>>>> classes of TableFunction and new table options.
> >>>>> Please
> >>>>>>> take a
> >>>>>>>>>>> look
> >>>>>>>>>>>>> at the
> >>>>>>>>>>>>>>>>> FLIP page [1] to get more details. Any suggestions
> >>>>> and
> >>>>>>>>>> comments
> >>>>>>>>>>>>> would be
> >>>>>>>>>>>>>>>>> appreciated!
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> [1]
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>
> >>>>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-221+Abstraction+for+lookup+source+cache+and+metric
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Best regards,
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Qingsheng
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> --
> >>>>>>>>>>>>>> Best Regards,
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Qingsheng Ren
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Real-time Computing Team
> >>>>>>>>>>>>>> Alibaba Cloud
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Email: renqschn@gmail.com
> >>>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> --
> >>>>>>>>> Best regards,
> >>>>>>>>> Roman Boyko
> >>>>>>>>> e.: ro.v.boyko@gmail.com
> >>>>>>>>>
> >>>>>>>
> >>>>>
> >>>>>
> >>>
> >>
> >>
> >> --
> >> Best Regards,
> >>
> >> Qingsheng Ren
> >>
> >> Real-time Computing Team
> >> Alibaba Cloud
> >>
> >> Email: renqschn@gmail.com
>

Re: [DISCUSS] FLIP-221 Abstraction for lookup source cache and metric

Posted by Qingsheng Ren <re...@gmail.com>.
Hi Alexander, 

Thanks for the review and glad to see we are on the same page! I think you forgot to cc the dev mailing list so I’m also quoting your reply under this email. 

>  We can add 'maxRetryTimes' option into this class

In my opinion the retry logic should be implemented in lookup() instead of in LookupFunction#eval(). Retrying is only meaningful for specific retriable failures, and there might be custom logic before retrying, such as re-establishing the connection (JdbcRowDataLookupFunction is an example), so it's more practical to leave it to the connector.
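As a rough illustration of this point, the retry loop can live entirely inside the connector's lookup path, where connection handling is available. This is only a sketch: `Backend`, `fetch`, and the class name are hypothetical stand-ins, not Flink API.

```java
import java.util.Collection;

/** Sketch: retry lives inside the connector's lookup path, not in the framework's eval(). */
public class RetryingLookupSketch {

    /** Hypothetical stand-in for the external system client (JDBC, HBase, ...). */
    public interface Backend {
        Collection<String> fetch(String key);
    }

    public static Collection<String> lookup(Backend backend, String key, int maxRetryTimes) {
        RuntimeException lastError = null;
        for (int attempt = 0; attempt <= maxRetryTimes; attempt++) {
            try {
                return backend.fetch(key);
            } catch (RuntimeException e) { // catch only failures known to be retriable
                lastError = e;
                // a real connector might re-establish its connection here,
                // as JdbcRowDataLookupFunction does, before the next attempt
            }
        }
        throw lastError; // retries exhausted
    }
}
```

A real connector would catch only its client's retriable exception types rather than all runtime failures.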

> I don't see DDL options, that were in previous version of FLIP. Do you have any special plans for them?

We decided not to provide common DDL options and to let developers define their own options per connector, as we do now.

The rest of comments sound great and I’ll update the FLIP. Hope we can finalize our proposal soon!

Best, 

Qingsheng


> On May 17, 2022, at 13:46, Александр Смирнов <sm...@gmail.com> wrote:
> 
> Hi Qingsheng and devs!
> 
> I like the overall design of updated FLIP, however I have several
> suggestions and questions.
> 
> 1) Introducing LookupFunction as a subclass of TableFunction is a good
> idea. We can add a 'maxRetryTimes' option to this class. The 'eval' method
> of the new LookupFunction is a good fit for this purpose; the same goes for
> the 'async' case.
> 
> 2) There might be other configs in future, such as 'cacheMissingKey'
> in LookupFunctionProvider or 'rescanInterval' in ScanRuntimeProvider.
> Maybe use the Builder pattern in LookupFunctionProvider and
> RescanRuntimeProvider for more flexibility (one 'build' method
> instead of many 'of' methods in the future)?
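The builder idea above could look roughly like this; all class, method, and option names here are illustrative, not the actual FLIP interfaces.

```java
import java.time.Duration;

/** Illustrative provider with a builder; all names here are hypothetical. */
public final class LookupProviderSketch {
    private final boolean cacheMissingKey;
    private final Duration rescanInterval;

    private LookupProviderSketch(Builder b) {
        this.cacheMissingKey = b.cacheMissingKey;
        this.rescanInterval = b.rescanInterval;
    }

    public static Builder builder() {
        return new Builder();
    }

    public boolean isCacheMissingKey() {
        return cacheMissingKey;
    }

    public Duration getRescanInterval() {
        return rescanInterval;
    }

    public static final class Builder {
        private boolean cacheMissingKey = true; // assumed default
        private Duration rescanInterval = null; // null = no periodic rescan

        public Builder cacheMissingKey(boolean cacheMissingKey) {
            this.cacheMissingKey = cacheMissingKey;
            return this;
        }

        public Builder rescanInterval(Duration rescanInterval) {
            this.rescanInterval = rescanInterval;
            return this;
        }

        // new options become new builder methods: no new 'of(...)' overloads needed
        public LookupProviderSketch build() {
            return new LookupProviderSketch(this);
        }
    }
}
```

The advantage over static `of(...)` factories is that adding an option later does not multiply overloads or break existing call sites.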
> 
> 3) What are the plans for existing TableFunctionProvider and
> AsyncTableFunctionProvider? I think they should be deprecated.
> 
> 4) Am I right that the current design does not assume usage of
> user-provided LookupCache in re-scanning? In this case, it is not very
> clear why we need methods such as 'invalidate' or 'putAll' in
> LookupCache.
> 
> 5) I don't see DDL options, that were in previous version of FLIP. Do
> you have any special plans for them?
> 
> If you don't mind, I would be glad to be able to make small
> adjustments to the FLIP document too. I think it's worth mentioning
> which optimizations exactly are planned for the future.
> 
> Best regards,
> Smirnov Alexander
> 
> пт, 13 мая 2022 г. в 20:27, Qingsheng Ren <re...@gmail.com>:
>> 
>> Hi Alexander and devs,
>> 
>> Thank you very much for the in-depth discussion! As Jark mentioned we were inspired by Alexander's idea and made a refactor on our design. FLIP-221 [1] has been updated to reflect our design now and we are happy to hear more suggestions from you!
>> 
>> Compared to the previous design:
>> 1. The lookup cache serves at table runtime level and is integrated as a component of LookupJoinRunner as discussed previously.
>> 2. Interfaces are renamed and re-designed to reflect the new design.
>> 3. We separate the all-caching case individually and introduce a new RescanRuntimeProvider to reuse the ability of scanning. We are planning to support SourceFunction / InputFormat for now considering the complexity of FLIP-27 Source API.
>> 4. A new interface LookupFunction is introduced to make the semantic of lookup more straightforward for developers.
>> 
>> For replying to Alexander:
>>> However I'm a little confused whether InputFormat is deprecated or not. Am I right that it will be so in the future, but currently it's not?
>> Yes you are right. InputFormat is not deprecated for now. I think it will be deprecated in the future but we don't have a clear plan for that.
>> 
>> Thanks again for the discussion on this FLIP and looking forward to cooperating with you after we finalize the design and interfaces!
>> 
>> [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-221+Abstraction+for+lookup+source+cache+and+metric
>> 
>> Best regards,
>> 
>> Qingsheng
>> 
>> 
>> On Fri, May 13, 2022 at 12:12 AM Александр Смирнов <sm...@gmail.com> wrote:
>>> 
>>> Hi Jark, Qingsheng and Leonard!
>>> 
>>> Glad to see that we came to a consensus on almost all points!
>>> 
>>> However I'm a little confused whether InputFormat is deprecated or
>>> not. Am I right that it will be so in the future, but currently it's
>>> not? Actually I also think that for the first version it's OK to use
>>> InputFormat in ALL cache realization, because supporting rescan
>>> ability seems like a very distant prospect. But for this decision we
>>> need a consensus among all discussion participants.
>>> 
>>> In general, I don't have something to argue with your statements. All
>>> of them correspond my ideas. Looking ahead, it would be nice to work
>>> on this FLIP cooperatively. I've already done a lot of work on lookup
>>> join caching, with an implementation very close to the one we are discussing,
>>> and want to share the results of this work. Anyway, looking forward to
>>> the FLIP update!
>>> 
>>> Best regards,
>>> Smirnov Alexander
>>> 
>>> чт, 12 мая 2022 г. в 17:38, Jark Wu <im...@gmail.com>:
>>>> 
>>>> Hi Alex,
>>>> 
>>>> Thanks for summarizing your points.
>>>> 
>>>> In the past week, Qingsheng, Leonard, and I have discussed it several times
>>>> and we have totally refactored the design.
>>>> I'm glad to say we have reached a consensus on many of your points!
>>>> Qingsheng is still working on updating the design docs and maybe can be
>>>> available in the next few days.
>>>> I will share some conclusions from our discussions:
>>>> 
>>>> 1) we have refactored the design towards to "cache in framework" way.
>>>> 
>>>> 2) a "LookupCache" interface for users to customize and a default
>>>> implementation with builder for users to easy-use.
>>>> This makes it possible to have both flexibility and conciseness.
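A hypothetical shape for such a customizable interface plus a shipped default; the method and type names below are guesses based on this thread, not the final FLIP API.

```java
import java.util.Collection;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical cache surface; the concrete FLIP interface may differ. */
public interface LookupCacheSketch<K, V> {
    Collection<V> getIfPresent(K key);

    void put(K key, Collection<V> rows);

    void invalidate(K key);

    /** A trivial default implementation that a framework could ship next to the interface. */
    static <K, V> LookupCacheSketch<K, V> unbounded() {
        ConcurrentHashMap<K, Collection<V>> store = new ConcurrentHashMap<>();
        return new LookupCacheSketch<K, V>() {
            @Override
            public Collection<V> getIfPresent(K key) {
                return store.get(key);
            }

            @Override
            public void put(K key, Collection<V> rows) {
                store.put(key, rows);
            }

            @Override
            public void invalidate(K key) {
                store.remove(key);
            }
        };
    }
}
```

Connector developers could implement the interface for custom strategies while most users rely on the default builder-configured implementation.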
>>>> 
>>>> 3) Filter pushdown is important for both ALL and LRU lookup caches, especially
>>>> for reducing I/O.
>>>> Filter pushdown should be the final state and the unified way to both
>>>> support pruning ALL cache and LRU cache,
>>>> so I think we should make effort in this direction. If we need to support
>>>> filter pushdown for ALL cache anyway, why not use
>>>> it for the LRU cache as well? Either way, since we decided to implement the cache
>>>> in the framework, we retain the chance to support
>>>> filtering on the cache at any time. This is an optimization and it doesn't affect the
>>>> public API. I think we can create a JIRA issue to
>>>> discuss it when the FLIP is accepted.
>>>> 
>>>> 4) The idea to support ALL cache is similar to your proposal.
>>>> In the first version, we will only support InputFormat, SourceFunction for
>>>> cache all (invoke InputFormat in join operator).
>>>> For FLIP-27 source, we need to join a true source operator instead of
>>>> calling it embedded in the join operator.
>>>> However, this needs another FLIP to support the re-scan ability for FLIP-27
>>>> Source, and this can be a large work.
>>>> In order to not block this issue, we can put the effort of FLIP-27 source
>>>> integration into future work and integrate
>>>> InputFormat&SourceFunction for now.
>>>> 
>>>> I think it's fine to use InputFormat & SourceFunction, as they are not
>>>> deprecated; otherwise, we would have to introduce another function
>>>> similar to them, which would be pointless.
>>>> integration ASAP before InputFormat & SourceFunction are deprecated.
>>>> 
>>>> Best,
>>>> Jark
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> On Thu, 12 May 2022 at 15:46, Александр Смирнов <sm...@gmail.com>
>>>> wrote:
>>>> 
>>>>> Hi Martijn!
>>>>> 
>>>>> Got it. Therefore, the implementation with InputFormat is not an option.
>>>>> Thanks for clearing that up!
>>>>> 
>>>>> Best regards,
>>>>> Smirnov Alexander
>>>>> 
>>>>> чт, 12 мая 2022 г. в 14:23, Martijn Visser <ma...@ververica.com>:
>>>>>> 
>>>>>> Hi,
>>>>>> 
>>>>>> With regards to:
>>>>>> 
>>>>>>> But if there are plans to refactor all connectors to FLIP-27
>>>>>> 
>>>>>> Yes, FLIP-27 is the target for all connectors. The old interfaces will be
>>>>>> deprecated and connectors will either be refactored to use the new ones
>>>>> or
>>>>>> dropped.
>>>>>> 
>>>>>> The caching should work for connectors that are using FLIP-27 interfaces,
>>>>>> we should not introduce new features for old interfaces.
>>>>>> 
>>>>>> Best regards,
>>>>>> 
>>>>>> Martijn
>>>>>> 
>>>>>> On Thu, 12 May 2022 at 06:19, Александр Смирнов <sm...@gmail.com>
>>>>>> wrote:
>>>>>> 
>>>>>>> Hi Jark!
>>>>>>> 
>>>>>>> Sorry for the late response. I would like to make some comments and
>>>>>>> clarify my points.
>>>>>>> 
>>>>>>> 1) I agree with your first statement. I think we can achieve both
>>>>>>> advantages this way: put the Cache interface in flink-table-common,
>>>>>>> but have implementations of it in flink-table-runtime. Therefore if a
>>>>>>> connector developer wants to use existing cache strategies and their
>>>>>>> implementations, he can just pass lookupConfig to the planner, but if
>>>>>>> he wants to have its own cache implementation in his TableFunction, it
>>>>>>> will be possible for him to use the existing interface for this
>>>>>>> purpose (we can explicitly point this out in the documentation). In
>>>>>>> this way all configs and metrics will be unified. WDYT?
>>>>>>> 
>>>>>>>> If a filter can prune 90% of data in the cache, we will have 90% of
>>>>>>> lookup requests that can never be cached
>>>>>>> 
>>>>>>> 2) Let me clarify the logic of the filters optimization in the case of an LRU cache.
>>>>>>> It looks like Cache<RowData, Collection<RowData>>. Here we always
>>>>>>> store the response of the dimension table in cache, even after
>>>>>>> applying calc function. I.e. if there are no rows after applying
>>>>>>> filters to the result of the 'eval' method of TableFunction, we store
>>>>>>> the empty list by lookup keys. Therefore the cache line will be
>>>>>>> filled, but will require much less memory (in bytes). I.e. we don't
>>>>>>> completely drop keys whose results were pruned, but we significantly
>>>>>>> reduce the memory required to store those results. If the user knows about
>>>>>>> this behavior, he can increase the 'max-rows' option before the start
>>>>>>> of the job. But actually I came up with the idea that we can do this
>>>>>>> automatically by using the 'maximumWeight' and 'weigher' methods of
>>>>>>> GuavaCache [1]. Weight can be the size of the collection of rows
>>>>>>> (the cache value). Therefore the cache can automatically fit many more
>>>>>>> records than before.
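Guava's `CacheBuilder` does expose `maximumWeight`/`weigher` for exactly this. As a self-contained illustration of the idea — bounding the cache by total cached rows instead of by key count, so empty (fully filtered) results weigh almost nothing — here is a minimal LRU sketch; it is not the Guava API itself.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Sketch: LRU cache bounded by total cached rows ("weight"), not by key count. */
public class RowWeightedLruCache<K, V> {
    private final long maxTotalRows;
    private long totalRows = 0;
    // access-order LinkedHashMap yields least-recently-used iteration order
    private final LinkedHashMap<K, List<V>> map = new LinkedHashMap<>(16, 0.75f, true);

    public RowWeightedLruCache(long maxTotalRows) {
        this.maxTotalRows = maxTotalRows;
    }

    public List<V> get(K key) {
        return map.get(key); // also refreshes recency
    }

    public void put(K key, List<V> rows) {
        List<V> old = map.remove(key);
        if (old != null) {
            totalRows -= old.size();
        }
        map.put(key, rows);
        totalRows += rows.size();
        // evict least-recently-used entries until back under the row budget;
        // an empty result (pruned by filters) weighs 0, so misses stay cached cheaply
        Iterator<Map.Entry<K, List<V>>> it = map.entrySet().iterator();
        while (totalRows > maxTotalRows && it.hasNext()) {
            Map.Entry<K, List<V>> eldest = it.next();
            if (eldest.getKey().equals(key)) {
                continue; // never evict the entry we just inserted
            }
            totalRows -= eldest.getValue().size();
            it.remove();
        }
    }

    public long rowCount() {
        return totalRows;
    }
}
```

With this weighting, a key whose filtered result is an empty list occupies a cache slot but contributes zero weight, matching the point above about caching pruned results cheaply.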
>>>>>>> 
>>>>>>>> Flink SQL has provided a standard way to do filters and projects
>>>>>>> pushdown, i.e., SupportsFilterPushDown and SupportsProjectionPushDown.
>>>>>>>> Jdbc/hive/HBase haven't implemented the interfaces, don't mean it's
>>>>> hard
>>>>>>> to implement.
>>>>>>> 
>>>>>>> It's debatable how difficult it will be to implement filter pushdown.
>>>>>>> But I think the fact that currently there is no database connector
>>>>>>> with filter pushdown at least means that this feature won't be
>>>>>>> supported soon in connectors. Moreover, if we talk about other
>>>>>>> connectors (not in Flink repo), their databases might not support all
>>>>>>> Flink filters (or not support filters at all). I think users are
>>>>>>> interested in supporting the cache filters optimization independently of
>>>>>>> supporting other features and solving more complex problems (or
>>>>>>> unsolvable at all).
>>>>>>> 
>>>>>>> 3) I agree with your third statement. Actually in our internal version
>>>>>>> I also tried to unify the logic of scanning and reloading data from
>>>>>>> connectors. But unfortunately, I didn't find a way to unify the logic
>>>>>>> of all ScanRuntimeProviders (InputFormat, SourceFunction, Source,...)
>>>>>>> and reuse it in reloading ALL cache. As a result I settled on using
>>>>>>> InputFormat, because it was used for scanning in all lookup
>>>>>>> connectors. (I didn't know that there are plans to deprecate
>>>>>>> InputFormat in favor of FLIP-27 Source). IMO usage of FLIP-27 source
>>>>>>> in ALL caching is not a good idea, because this source was designed to
>>>>>>> work in distributed environment (SplitEnumerator on JobManager and
>>>>>>> SourceReaders on TaskManagers), not in one operator (lookup join
>>>>>>> operator in our case). There is even no direct way to pass splits from
>>>>>>> SplitEnumerator to SourceReader (this logic works through
>>>>>>> SplitEnumeratorContext, which requires
>>>>>>> OperatorCoordinator.SubtaskGateway to send AddSplitEvents). Usage of
>>>>>>> InputFormat for the ALL cache seems much clearer and easier. But if
>>>>>>> there are plans to refactor all connectors to FLIP-27, I have the
>>>>>>> following idea: maybe we can drop the lookup join ALL cache in
>>>>>>> favor of a simple join that rescans the batch source? The point
>>>>>>> is that the only difference between lookup join ALL cache and simple
>>>>>>> join with batch source is that in the first case scanning is performed
>>>>>>> multiple times, in between which state (cache) is cleared (correct me
>>>>>>> if I'm wrong). So what if we extend the functionality of simple join
>>>>>>> to support state reloading + extend the functionality of scanning
>>>>>>> batch source multiple times (this one should be easy with new FLIP-27
>>>>>>> source, that unifies streaming/batch reading - we will need to change
>>>>>>> only SplitEnumerator, which will pass splits again after some TTL).
>>>>>>> WDYT? I must say that this looks like a long-term goal and will make
>>>>>>> the scope of this FLIP even larger than you said. Maybe we can limit
>>>>>>> ourselves to a simpler solution now (InputFormats).
>>>>>>> 
>>>>>>> So to sum up, my points are these:
>>>>>>> 1) There is a way to make both concise and flexible interfaces for
>>>>>>> caching in lookup join.
>>>>>>> 2) The cache filter optimization is important in both LRU and ALL caches.
>>>>>>> 3) It is unclear when filter pushdown will be supported in Flink
>>>>>>> connectors, and some connectors might never be able to support it. As
>>>>>>> far as I know, filter pushdown currently works only for scanning (not
>>>>>>> lookup). So the cache filter + projection optimization should be
>>>>>>> independent of other features.
>>>>>>> 4) The ALL cache realization is a complex topic that touches multiple
>>>>>>> aspects of Flink's evolution. Abandoning InputFormat in favor of the
>>>>>>> FLIP-27 Source would make the ALL cache implementation really complex
>>>>>>> and unclear, so maybe instead we can extend the functionality of the
>>>>>>> simple join, or keep InputFormat for the lookup join ALL cache?
>>>>>>> 
>>>>>>> Best regards,
>>>>>>> Smirnov Alexander
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> [1]
>>>>>>> 
>>>>> https://guava.dev/releases/18.0/api/docs/com/google/common/cache/CacheBuilder.html#weigher(com.google.common.cache.Weigher)
>>>>>>> 
>>>>>>> чт, 5 мая 2022 г. в 20:34, Jark Wu <im...@gmail.com>:
>>>>>>>> 
>>>>>>>> It's great to see the active discussion! I want to share my ideas:
>>>>>>>> 
>>>>>>>> 1) implement the cache in framework vs. connectors base
>>>>>>>> I don't have a strong opinion on this. Both ways should work (e.g.,
>>>>> cache
>>>>>>>> pruning, compatibility).
>>>>>>>> The framework way can provide more concise interfaces.
>>>>>>>> The connector base way can define more flexible cache
>>>>>>>> strategies/implementations.
>>>>>>>> We are still investigating a way to see if we can have both
>>>>> advantages.
>>>>>>>> We should reach a consensus that the chosen way is the final state,
>>>>> and we
>>>>>>>> are on the path to it.
>>>>>>>> 
>>>>>>>> 2) filters and projections pushdown:
>>>>>>>> I agree with Alex that the filter pushdown into cache can benefit a
>>>>> lot
>>>>>>> for
>>>>>>>> ALL cache.
>>>>>>>> However, this is not true for LRU cache. Connectors use cache to
>>>>> reduce
>>>>>>> IO
>>>>>>>> requests to databases for better throughput.
>>>>>>>> If a filter can prune 90% of data in the cache, we will have 90% of
>>>>>>> lookup
>>>>>>>> requests that can never be cached
>>>>>>>> and hit the databases directly. That means the cache is
>>>>> meaningless in
>>>>>>>> this case.
>>>>>>>> 
>>>>>>>> IMO, Flink SQL has provided a standard way to do filter and projection
>>>>>>>> pushdown, i.e., SupportsFilterPushDown and
>>>>> SupportsProjectionPushDown.
>>>>>>>> That JDBC/Hive/HBase haven't implemented the interfaces doesn't mean it's
>>>>> hard
>>>>>>> to
>>>>>>>> implement.
>>>>>>>> They should implement the pushdown interfaces to reduce IO and the
>>>>> cache
>>>>>>>> size.
>>>>>>>> That should be a final state that the scan source and lookup source
>>>>> share
>>>>>>>> the exact pushdown implementation.
>>>>>>>> I don't see why we need to duplicate the pushdown logic in caches,
>>>>> which
>>>>>>>> would complicate the lookup join design.
>>>>>>>> 
>>>>>>>> 3) ALL cache abstraction
>>>>>>>> All cache might be the most challenging part of this FLIP. We have
>>>>> never
>>>>>>>> provided a reload-lookup public interface.
>>>>>>>> Currently, we put the reload logic in the "eval" method of
>>>>> TableFunction.
>>>>>>>> That's hard for some sources (e.g., Hive).
>>>>>>>> Ideally, connector implementation should share the logic of reload
>>>>> and
>>>>>>>> scan, i.e. ScanTableSource with InputFormat/SourceFunction/FLIP-27
>>>>>>> Source.
>>>>>>>> However, InputFormat/SourceFunction are deprecated, and the FLIP-27
>>>>>>> source
>>>>>>>> is deeply coupled with SourceOperator.
>>>>>>>> If we want to invoke the FLIP-27 source in LookupJoin, this may make
>>>>> the
>>>>>>>> scope of this FLIP much larger.
>>>>>>>> We are still investigating how to abstract the ALL cache logic and
>>>>> reuse
>>>>>>>> the existing source interfaces.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Best,
>>>>>>>> Jark
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Thu, 5 May 2022 at 20:22, Roman Boyko <ro...@gmail.com>
>>>>> wrote:
>>>>>>>> 
>>>>>>>>> It's a much more complicated activity and lies outside the scope of
>>>>> this
>>>>>>>>> improvement. Because such pushdowns should be done for all
>>>>>>> ScanTableSource
>>>>>>>>> implementations (not only for Lookup ones).
>>>>>>>>> 
>>>>>>>>> On Thu, 5 May 2022 at 19:02, Martijn Visser <
>>>>> martijnvisser@apache.org>
>>>>>>>>> wrote:
>>>>>>>>> 
>>>>>>>>>> Hi everyone,
>>>>>>>>>> 
>>>>>>>>>> One question regarding "And Alexander correctly mentioned that
>>>>> filter
>>>>>>>>>> pushdown still is not implemented for jdbc/hive/hbase." -> Would
>>>>> an
>>>>>>>>>> alternative solution be to actually implement these filter
>>>>> pushdowns?
>>>>>>> I
>>>>>>>>>> can
>>>>>>>>>> imagine that there are many more benefits to doing that, outside
>>>>> of
>>>>>>> lookup
>>>>>>>>>> caching and metrics.
>>>>>>>>>> 
>>>>>>>>>> Best regards,
>>>>>>>>>> 
>>>>>>>>>> Martijn Visser
>>>>>>>>>> https://twitter.com/MartijnVisser82
>>>>>>>>>> https://github.com/MartijnVisser
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> On Thu, 5 May 2022 at 13:58, Roman Boyko <ro...@gmail.com>
>>>>>>> wrote:
>>>>>>>>>> 
>>>>>>>>>>> Hi everyone!
>>>>>>>>>>> 
>>>>>>>>>>> Thanks for driving such a valuable improvement!
>>>>>>>>>>> 
>>>>>>>>>>> I do think that a single cache implementation would be a nice
>>>>>>> opportunity
>>>>>>>>>> for
>>>>>>>>>>> users. And it will break the "FOR SYSTEM_TIME AS OF proc_time"
>>>>>>> semantics
>>>>>>>>>>> anyway - doesn't matter how it will be implemented.
>>>>>>>>>>> 
>>>>>>>>>>> Putting myself in the user's shoes, I can say that:
>>>>>>>>>>> 1) I would prefer to have the opportunity to cut off the cache
>>>>> size
>>>>>>> by
>>>>>>>>>>> simply filtering unnecessary data. And the most handy way to do
>>>>> it
>>>>>>> is
>>>>>>>>>> to apply
>>>>>>>>>>> it inside LookupRunners. It would be a bit harder to pass it
>>>>>>> through the
>>>>>>>>>>> LookupJoin node to TableFunction. And Alexander correctly
>>>>> mentioned
>>>>>>> that
>>>>>>>>>>> filter pushdown still is not implemented for jdbc/hive/hbase.
>>>>>>>>>>> 2) The ability to set the different caching parameters for
>>>>> different
>>>>>>>>>> tables
>>>>>>>>>>> is quite important. So I would prefer to set it through DDL
>>>>> rather
>>>>>>> than
>>>>>>>>>>> have the same TTL, strategy, and other options for all lookup
>>>>> tables.
>>>>>>>>>>> 3) Putting the cache into the framework deprives us of
>>>>>>>>>>> extensibility (users won't be able to implement their own
>>>>> cache).
>>>>>>> But
>>>>>>>>>> most
>>>>>>>>>>> probably it might be solved by creating more different cache
>>>>>>> strategies
>>>>>>>>>> and
>>>>>>>>>>> a wider set of configurations.
>>>>>>>>>>> 
>>>>>>>>>>> All these points are much closer to the schema proposed by
>>>>>>> Alexander.
>>>>>>>>>>> Qingsheng Ren, please correct me if I'm wrong: can all these
>>>>>>>>>> facilities
>>>>>>>>>>> be easily implemented in your architecture?
>>>>>>>>>>> 
>>>>>>>>>>> Best regards,
>>>>>>>>>>> Roman Boyko
>>>>>>>>>>> e.: ro.v.boyko@gmail.com
>>>>>>>>>>> 
>>>>>>>>>>> On Wed, 4 May 2022 at 21:01, Martijn Visser <
>>>>>>> martijnvisser@apache.org>
>>>>>>>>>>> wrote:
>>>>>>>>>>> 
>>>>>>>>>>>> Hi everyone,
>>>>>>>>>>>> 
>>>>>>>>>>>> I don't have much to chip in, but just wanted to express that
>>>>> I
>>>>>>> really
>>>>>>>>>>>> appreciate the in-depth discussion on this topic and I hope
>>>>> that
>>>>>>>>>> others
>>>>>>>>>>>> will join the conversation.
>>>>>>>>>>>> 
>>>>>>>>>>>> Best regards,
>>>>>>>>>>>> 
>>>>>>>>>>>> Martijn
>>>>>>>>>>>> 
>>>>>>>>>>>> On Tue, 3 May 2022 at 10:15, Александр Смирнов <
>>>>>>> smiralexan@gmail.com>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>>> Hi Qingsheng, Leonard and Jark,
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Thanks for your detailed feedback! However, I have questions
>>>>>>> about
>>>>>>>>>>>>> some of your statements (maybe I didn't get something?).
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Caching actually breaks the semantic of "FOR SYSTEM_TIME
>>>>> AS OF
>>>>>>>>>>>> proc_time”
>>>>>>>>>>>>> 
>>>>>>>>>>>>> I agree that the semantics of "FOR SYSTEM_TIME AS OF
>>>>> proc_time"
>>>>>>> is
>>>>>>>>>> not
>>>>>>>>>>>>> fully implemented with caching, but as you said, users go
>>>>> on it
>>>>>>>>>>>>> consciously to achieve better performance (no one proposed
>>>>> to
>>>>>>> enable
>>>>>>>>>>>>> caching by default, etc.). Or by users do you mean other
>>>>>>> developers
>>>>>>>>>> of
>>>>>>>>>>>>> connectors? In this case developers explicitly specify
>>>>> whether
>>>>>>> their
>>>>>>>>>>>>> connector supports caching or not (in the list of supported
>>>>>>>>>> options),
>>>>>>>>>>>>> no one makes them do that if they don't want to. So what
>>>>>>> exactly is
>>>>>>>>>>>>> the difference between implementing caching in modules
>>>>>>>>>>>>> flink-table-runtime and in flink-table-common from the
>>>>>>> considered
>>>>>>>>>>>>> point of view? How does it affect the breaking or non-breaking of
>>>>> the
>>>>>>>>>>>>> semantics of "FOR SYSTEM_TIME AS OF proc_time"?
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> confront a situation that allows table options in DDL to
>>>>>>> control
>>>>>>>>>> the
>>>>>>>>>>>>> behavior of the framework, which has never happened
>>>>> previously
>>>>>>> and
>>>>>>>>>>> should
>>>>>>>>>>>>> be cautious
>>>>>>>>>>>>> 
>>>>>>>>>>>>> If we talk about the main semantic differences between DDL
>>>>> options
>>>>>>> and
>>>>>>>>>>>>> config options("table.exec.xxx"), isn't it about limiting
>>>>> the
>>>>>>> scope
>>>>>>>>>> of
>>>>>>>>>>>>> the options + importance for the user business logic rather
>>>>> than
>>>>>>>>>>>>> specific location of corresponding logic in the framework? I
>>>>>>> mean
>>>>>>>>>> that
>>>>>>>>>>>>> in my design, for example, putting an option with lookup
>>>>> cache
>>>>>>>>>>>>> strategy in configurations would be the wrong decision,
>>>>>>> because it
>>>>>>>>>>>>> directly affects the user's business logic (not just
>>>>> performance
>>>>>>>>>>>>> optimization) + touches just several functions of ONE table
>>>>>>> (there
>>>>>>>>>> can
>>>>>>>>>>>>> be multiple tables with different caches). Does it really
>>>>>>> matter for
>>>>>>>>>>>>> the user (or someone else) where the logic is located,
>>>>> which is
>>>>>>>>>>>>> affected by the applied option?
>>>>>>>>>>>>> Also I can remember DDL option 'sink.parallelism', which in
>>>>>>> some way
>>>>>>>>>>>>> "controls the behavior of the framework" and I don't see any
>>>>>>> problem
>>>>>>>>>>>>> here.
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> introduce a new interface for this all-caching scenario
>>>>> and
>>>>>>> the
>>>>>>>>>>> design
>>>>>>>>>>>>> would become more complex
>>>>>>>>>>>>> 
>>>>>>>>>>>>> This is a subject for a separate discussion, but actually
>>>>> in our
>>>>>>>>>>>>> internal version we solved this problem quite easily - we
>>>>> reused
>>>>>>>>>>>>> InputFormat class (so there is no need for a new API). The
>>>>>>> point is
>>>>>>>>>>>>> that currently all lookup connectors use InputFormat for
>>>>>>> scanning
>>>>>>>>>> the
>>>>>>>>>>>>> data in batch mode: HBase, JDBC and even Hive - it uses
>>>>> class
>>>>>>>>>>>>> PartitionReader, that is actually just a wrapper around
>>>>>>> InputFormat.
>>>>>>>>>>>>> The advantage of this solution is the ability to reload
>>>>> cache
>>>>>>> data
>>>>>>>>>> in
>>>>>>>>>>>>> parallel (number of threads depends on number of
>>>>> InputSplits,
>>>>>>> but
>>>>>>>>>> has
>>>>>>>>>>>>> an upper limit). As a result cache reload time significantly
>>>>>>> reduces
>>>>>>>>>>>>> (as well as time of input stream blocking). I know that
>>>>> usually
>>>>>>> we
>>>>>>>>>> try
>>>>>>>>>>>>> to avoid usage of concurrency in Flink code, but maybe this
>>>>> one
>>>>>>> can
>>>>>>>>>> be
>>>>>>>>>>>>> an exception. BTW I don't say that it's an ideal solution,
>>>>> maybe
>>>>>>>>>> there
>>>>>>>>>>>>> are better ones.
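[The parallel reload described above could be sketched roughly as follows. This is an illustration, not a Flink API: loadSplit is a hypothetical placeholder for "open the InputFormat for one InputSplit and read all of its rows".]

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Sketch of reloading an ALL cache from N "splits" in parallel with a
// bounded thread pool (thread count depends on the number of splits but
// has an upper limit, as described in the mail above).
class ParallelCacheReloader {
    static <K, V> Map<K, V> reload(
            List<Integer> splitIds,
            Function<Integer, Map<K, V>> loadSplit,
            int maxThreads) {
        int threads = Math.max(1, Math.min(maxThreads, splitIds.size()));
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Map<K, V>>> futures = new ArrayList<>();
            for (Integer splitId : splitIds) {
                futures.add(pool.submit(() -> loadSplit.apply(splitId)));
            }
            Map<K, V> newCache = new ConcurrentHashMap<>();
            for (Future<Map<K, V>> future : futures) {
                newCache.putAll(future.get()); // fail the whole reload if any split fails
            }
            // The caller would atomically swap the old cache for newCache here,
            // which keeps the input-stream blocking window short.
            return newCache;
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException("Cache reload failed", e);
        } finally {
            pool.shutdown();
        }
    }
}
```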
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Providing the cache in the framework might introduce
>>>>>>> compatibility
>>>>>>>>>>>> issues
>>>>>>>>>>>>> 
>>>>>>>>>>>>> It's possible only in cases when the developer of the
>>>>> connector
>>>>>>>>>> doesn't
>>>>>>>>>>>>> properly refactor their code and uses the new cache options
>>>>>>>>>> incorrectly
>>>>>>>>>>>>> (i.e. explicitly provide the same options into 2 different
>>>>> code
>>>>>>>>>>>>> places). For correct behavior, all they need to do is to
>>>>>>> redirect
>>>>>>>>>>>>> existing options to the framework's LookupConfig (+ maybe
>>>>> add an
>>>>>>>>>> alias
>>>>>>>>>>>>> for options, if there was different naming), everything
>>>>> will be
>>>>>>>>>>>>> transparent for users. If the developer won't do
>>>>> refactoring at
>>>>>>> all,
>>>>>>>>>>>>> nothing will be changed for the connector because of
>>>>> backward
>>>>>>>>>>>>> compatibility. Also, if a developer wants to use their own
>>>>> cache
>>>>>>> logic,
>>>>>>>>>>>>> they can simply decline to pass some of the configs into the
>>>>>>> framework,
>>>>>>>>>> and
>>>>>>>>>>>>> instead make his own implementation with already existing
>>>>>>> configs
>>>>>>>>>> and
>>>>>>>>>>>>> metrics (but actually I think that it's a rare case).
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> filters and projections should be pushed all the way down
>>>>> to
>>>>>>> the
>>>>>>>>>>> table
>>>>>>>>>>>>> function, like what we do in the scan source
>>>>>>>>>>>>> 
>>>>>>>>>>>>> It's the great purpose. But the truth is that the ONLY
>>>>> connector
>>>>>>>>>> that
>>>>>>>>>>>>> supports filter pushdown is FileSystemTableSource
>>>>>>>>>>>>> (no database connector supports it currently). Also for some
>>>>>>>>>> databases
>>>>>>>>>>>>> it's simply impossible to pushdown such complex filters
>>>>> that we
>>>>>>> have
>>>>>>>>>>>>> in Flink.
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> only applying these optimizations to the cache seems not
>>>>>>> quite
>>>>>>>>>>> useful
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Filters can cut off an arbitrarily large amount of data
>>>>> from the
>>>>>>>>>>>>> dimension table. For a simple example, suppose in dimension
>>>>>>> table
>>>>>>>>>>>>> 'users'
>>>>>>>>>>>>> we have column 'age' with values from 20 to 40, and input
>>>>> stream
>>>>>>>>>>>>> 'clicks' that is ~uniformly distributed by age of users. If
>>>>> we
>>>>>>> have
>>>>>>>>>>>>> filter 'age > 30',
>>>>>>>>>>>>> there will be half as much data in the cache. This means the user
>>>>> can
>>>>>>>>>>>>> increase 'lookup.cache.max-rows' by almost 2 times. It will
>>>>>>> gain a
>>>>>>>>>>>>> huge
>>>>>>>>>>>>> performance boost. Moreover, this optimization starts to
>>>>> really
>>>>>>>>>> shine
>>>>>>>>>>>>> in 'ALL' cache, where tables without filters and projections
>>>>>>> can't
>>>>>>>>>> fit
>>>>>>>>>>>>> in memory, but with them - can. This opens up additional
>>>>>>>>>> possibilities
>>>>>>>>>>>>> for users. And this doesn't sound as 'not quite useful'.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> It would be great to hear other voices regarding this topic!
>>>>>>> Because
>>>>>>>>>>>>> we have quite a lot of controversial points, and I think
>>>>> with
>>>>>>> the
>>>>>>>>>> help
>>>>>>>>>>>>> of others it will be easier for us to come to a consensus.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>> Smirnov Alexander
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> пт, 29 апр. 2022 г. в 22:33, Qingsheng Ren <
>>>>> renqschn@gmail.com
>>>>>>>> :
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Hi Alexander and Arvid,
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Thanks for the discussion and sorry for my late response!
>>>>> We
>>>>>>> had
>>>>>>>>>> an
>>>>>>>>>>>>> internal discussion together with Jark and Leonard and I’d
>>>>> like
>>>>>>> to
>>>>>>>>>>>>> summarize our ideas. Instead of implementing the cache
>>>>> logic in
>>>>>>> the
>>>>>>>>>>> table
>>>>>>>>>>>>> runtime layer or wrapping around the user-provided table
>>>>>>> function,
>>>>>>>>>> we
>>>>>>>>>>>>> prefer to introduce some new APIs extending TableFunction
>>>>> with
>>>>>>> these
>>>>>>>>>>>>> concerns:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 1. Caching actually breaks the semantic of "FOR
>>>>> SYSTEM_TIME
>>>>>>> AS OF
>>>>>>>>>>>>> proc_time”, because it couldn’t truly reflect the content
>>>>> of the
>>>>>>>>>> lookup
>>>>>>>>>>>>> table at the moment of querying. If users choose to enable
>>>>>>> caching
>>>>>>>>>> on
>>>>>>>>>>> the
>>>>>>>>>>>>> lookup table, they implicitly indicate that this breakage is
>>>>>>>>>> acceptable
>>>>>>>>>>>> in
>>>>>>>>>>>>> exchange for the performance. So we prefer not to provide
>>>>>>> caching on
>>>>>>>>>>> the
>>>>>>>>>>>>> table runtime level.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 2. If we make the cache implementation in the framework
>>>>>>> (whether
>>>>>>>>>> in a
>>>>>>>>>>>>> runner or a wrapper around TableFunction), we have to
>>>>> confront a
>>>>>>>>>>>> situation
>>>>>>>>>>>>> that allows table options in DDL to control the behavior of
>>>>> the
>>>>>>>>>>>> framework,
>>>>>>>>>>>>> which has never happened previously and should be cautious.
>>>>>>> Under
>>>>>>>>>> the
>>>>>>>>>>>>> current design the behavior of the framework should only be
>>>>>>>>>> specified
>>>>>>>>>>> by
>>>>>>>>>>>>> configurations (“table.exec.xxx”), and it’s hard to apply
>>>>> these
>>>>>>>>>> general
>>>>>>>>>>>>> configs to a specific table.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>> 3. We have use cases where the lookup source loads and refreshes
>>>>> all
>>>>>>>>>> records
>>>>>>>>>>>>> periodically into the memory to achieve high lookup
>>>>> performance
>>>>>>>>>> (like
>>>>>>>>>>>> Hive
>>>>>>>>>>>>> connector in the community, and also widely used by our
>>>>> internal
>>>>>>>>>>>>> connectors). Wrapping the cache around the user’s
>>>>> TableFunction
>>>>>>>>>> works
>>>>>>>>>>>> fine
>>>>>>>>>>>>> for LRU caches, but I think we have to introduce a new
>>>>>>> interface for
>>>>>>>>>>> this
>>>>>>>>>>>>> all-caching scenario and the design would become more
>>>>> complex.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 4. Providing the cache in the framework might introduce
>>>>>>>>>> compatibility
>>>>>>>>>>>>> issues to existing lookup sources like there might exist two
>>>>>>> caches
>>>>>>>>>>> with
>>>>>>>>>>>>> totally different strategies if the user incorrectly
>>>>> configures
>>>>>>> the
>>>>>>>>>>> table
>>>>>>>>>>>>> (one in the framework and another implemented by the lookup
>>>>>>> source).
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> As for the optimization mentioned by Alexander, I think
>>>>>>> filters
>>>>>>>>>> and
>>>>>>>>>>>>> projections should be pushed all the way down to the table
>>>>>>> function,
>>>>>>>>>>> like
>>>>>>>>>>>>> what we do in the scan source, instead of the runner with
>>>>> the
>>>>>>> cache.
>>>>>>>>>>> The
>>>>>>>>>>>>> goal of using cache is to reduce the network I/O and
>>>>> pressure
>>>>>>> on the
>>>>>>>>>>>>> external system, and only applying these optimizations to
>>>>> the
>>>>>>> cache
>>>>>>>>>>> seems
>>>>>>>>>>>>> not quite useful.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> I made some updates to the FLIP[1] to reflect our ideas.
>>>>> We
>>>>>>>>>> prefer to
>>>>>>>>>>>>> keep the cache implementation as a part of TableFunction,
>>>>> and we
>>>>>>>>>> could
>>>>>>>>>>>>> provide some helper classes (CachingTableFunction,
>>>>>>>>>>>> AllCachingTableFunction,
>>>>>>>>>>>>> CachingAsyncTableFunction) to developers and regulate
>>>>> metrics
>>>>>>> of the
>>>>>>>>>>>> cache.
>>>>>>>>>>>>> Also, I made a POC[2] for your reference.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Looking forward to your ideas!
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> [1]
>>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>> 
>>>>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-221+Abstraction+for+lookup+source+cache+and+metric
>>>>>>>>>>>>>> [2] https://github.com/PatrickRen/flink/tree/FLIP-221
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Qingsheng
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On Tue, Apr 26, 2022 at 4:45 PM Александр Смирнов <
>>>>>>>>>>>> smiralexan@gmail.com>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Thanks for the response, Arvid!
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> I have few comments on your message.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> but could also live with an easier solution as the
>>>>> first
>>>>>>> step:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> I think that these 2 ways are mutually exclusive
>>>>> (originally
>>>>>>>>>>> proposed
>>>>>>>>>>>>>>> by Qingsheng and mine), because conceptually they follow
>>>>> the
>>>>>>> same
>>>>>>>>>>>>>>> goal, but implementation details are different. If we
>>>>>>> go one
>>>>>>>>>>> way,
>>>>>>>>>>>>>>> moving to another way in the future will mean deleting
>>>>>>> existing
>>>>>>>>>> code
>>>>>>>>>>>>>>> and once again changing the API for connectors. So I
>>>>> think we
>>>>>>>>>> should
>>>>>>>>>>>>>>> reach a consensus with the community about that and then
>>>>> work
>>>>>>>>>>> together
>>>>>>>>>>>>>>> on this FLIP, i.e. divide the work on tasks for different
>>>>>>> parts
>>>>>>>>>> of
>>>>>>>>>>> the
>>>>>>>>>>>>>>> flip (for example, LRU cache unification / introducing
>>>>>>> proposed
>>>>>>>>>> set
>>>>>>>>>>> of
>>>>>>>>>>>>>>> metrics / further work…). WDYT, Qingsheng?
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> as the source will only receive the requests after
>>>>> filter
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Actually if filters are applied to fields of the lookup
>>>>>>> table, we
>>>>>>>>>>>>>>> firstly must do requests, and only after that we can
>>>>> filter
>>>>>>>>>>> responses,
>>>>>>>>>>>>>>> because lookup connectors don't have filter pushdown. So
>>>>> if
>>>>>>>>>>> filtering
>>>>>>>>>>>>>>> is done before caching, there will be far fewer rows in
>>>>>>> cache.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> @Alexander unfortunately, your architecture is not
>>>>> shared.
>>>>>>> I
>>>>>>>>>> don't
>>>>>>>>>>>>> know the
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> solution to share images to be honest.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Sorry for that, I’m a bit new to such kinds of
>>>>> conversations
>>>>>>> :)
>>>>>>>>>>>>>>> I have no write access to the confluence, so I made a
>>>>> Jira
>>>>>>> issue,
>>>>>>>>>>>>>>> where described the proposed changes in more details -
>>>>>>>>>>>>>>> https://issues.apache.org/jira/browse/FLINK-27411.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Will happy to get more feedback!
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>> Smirnov Alexander
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> пн, 25 апр. 2022 г. в 19:49, Arvid Heise <
>>>>> arvid@apache.org>:
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Hi Qingsheng,
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Thanks for driving this; the inconsistency was not
>>>>>>> satisfying
>>>>>>>>>> for
>>>>>>>>>>>> me.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> I second Alexander's idea though but could also live
>>>>> with
>>>>>>> an
>>>>>>>>>>> easier
>>>>>>>>>>>>>>>> solution as the first step: Instead of making caching
>>>>> an
>>>>>>>>>>>>> implementation
>>>>>>>>>>>>>>>> detail of TableFunction X, rather devise a caching
>>>>> layer
>>>>>>>>>> around X.
>>>>>>>>>>>> So
>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>> proposal would be a CachingTableFunction that
>>>>> delegates to
>>>>>>> X in
>>>>>>>>>>> case
>>>>>>>>>>>>> of
>>>>>>>>>>>>>>>> misses and else manages the cache. Lifting it into the
>>>>>>> operator
>>>>>>>>>>>> model
>>>>>>>>>>>>> as
>>>>>>>>>>>>>>>> proposed would be even better but is probably
>>>>> unnecessary
>>>>>>> in
>>>>>>>>>> the
>>>>>>>>>>>>> first step
>>>>>>>>>>>>>>>> for a lookup source (as the source will only receive
>>>>> the
>>>>>>>>>> requests
>>>>>>>>>>>>> after
>>>>>>>>>>>>>>>> filter; applying projection may be more interesting to
>>>>> save
>>>>>>>>>>> memory).
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Another advantage is that all the changes of this FLIP
>>>>>>> would be
>>>>>>>>>>>>> limited to
>>>>>>>>>>>>>>>> options, no need for new public interfaces. Everything
>>>>> else
>>>>>>>>>>> remains
>>>>>>>>>>>> an
>>>>>>>>>>>>>>>> implementation of Table runtime. That means we can
>>>>> easily
>>>>>>>>>>>> incorporate
>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>> optimization potential that Alexander pointed out
>>>>> later.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> @Alexander unfortunately, your architecture is not
>>>>> shared.
>>>>>>> I
>>>>>>>>>> don't
>>>>>>>>>>>>> know the
>>>>>>>>>>>>>>>> solution to share images to be honest.
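[The "caching layer around X" suggested above could be sketched like this — plain Java, where Function stands in for TableFunction X; all names here are illustrative, not Flink APIs. The wrapper answers lookups from its cache and delegates to X only on a miss:]

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Sketch of a caching layer that delegates to the underlying lookup
// function only on cache misses. `Function<K, List<V>>` stands in for the
// user-provided TableFunction X; this is an illustration, not the Flink API.
class CachingLookup<K, V> implements Function<K, List<V>> {
    private final Function<K, List<V>> delegate;
    private final Map<K, List<V>> cache = new HashMap<>();
    long hits = 0, misses = 0; // the FLIP also proposes standard cache metrics

    CachingLookup(Function<K, List<V>> delegate) {
        this.delegate = delegate;
    }

    @Override
    public List<V> apply(K key) {
        List<V> cached = cache.get(key);
        if (cached != null) {
            hits++;
            return cached;
        }
        misses++;
        List<V> rows = delegate.apply(key); // e.g. a JDBC query to the dimension table
        cache.put(key, rows);               // eviction/TTL omitted for brevity
        return rows;
    }
}
```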
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> On Fri, Apr 22, 2022 at 5:04 PM Александр Смирнов <
>>>>>>>>>>>>> smiralexan@gmail.com>
>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Hi Qingsheng! My name is Alexander, I'm not a
>>>>> committer
>>>>>>> yet,
>>>>>>>>>> but
>>>>>>>>>>>> I'd
>>>>>>>>>>>>>>>>> really like to become one. And this FLIP really
>>>>>>> interested
>>>>>>>>>> me.
>>>>>>>>>>>>>>>>> Actually I have worked on a similar feature in my
>>>>>>> company’s
>>>>>>>>>>> Flink
>>>>>>>>>>>>>>>>> fork, and we would like to share our thoughts on
>>>>> this and
>>>>>>>>>> make
>>>>>>>>>>>> code
>>>>>>>>>>>>>>>>> open source.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> I think there is a better alternative than
>>>>> introducing an
>>>>>>>>>>> abstract
>>>>>>>>>>>>>>>>> class for TableFunction (CachingTableFunction). As
>>>>> you
>>>>>>> know,
>>>>>>>>>>>>>>>>> TableFunction exists in the flink-table-common
>>>>> module,
>>>>>>> which
>>>>>>>>>>>>> provides
>>>>>>>>>>>>>>>>> only an API for working with tables – it’s very
>>>>>>> convenient
>>>>>>>>>> for
>>>>>>>>>>>>> importing
>>>>>>>>>>>>>>>>> in connectors. In turn, CachingTableFunction contains
>>>>>>> logic
>>>>>>>>>> for
>>>>>>>>>>>>>>>>> runtime execution,  so this class and everything
>>>>>>> connected
>>>>>>>>>> with
>>>>>>>>>>> it
>>>>>>>>>>>>>>>>> should be located in another module, probably in
>>>>>>>>>>>>> flink-table-runtime.
>>>>>>>>>>>>>>>>> But this will require connectors to depend on another
>>>>>>> module,
>>>>>>>>>>>> which
>>>>>>>>>>>>>>>>> contains a lot of runtime logic, which doesn’t sound
>>>>>>> good.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> I suggest adding a new method ‘getLookupConfig’ to
>>>>>>>>>>>> LookupTableSource
>>>>>>>>>>>>>>>>> or LookupRuntimeProvider to allow connectors to only
>>>>> pass
>>>>>>>>>>>>>>>>> configurations to the planner, therefore they won’t
>>>>>>> depend on
>>>>>>>>>>>>> runtime
>>>>>>>>>>>>>>>>> realization. Based on these configs planner will
>>>>>>> construct a
>>>>>>>>>>>> lookup
>>>>>>>>>>>>>>>>> join operator with corresponding runtime logic
>>>>>>>>>> (ProcessFunctions
>>>>>>>>>>>> in
>>>>>>>>>>>>>>>>> module flink-table-runtime). Architecture looks like
>>>>> in
>>>>>>> the
>>>>>>>>>>> pinned
>>>>>>>>>>>>>>>>> image (LookupConfig class there is actually yours
>>>>>>>>>> CacheConfig).
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Classes in flink-table-planner, that will be
>>>>> responsible
>>>>>>> for
>>>>>>>>>>> this
>>>>>>>>>>>> –
>>>>>>>>>>>>>>>>> CommonPhysicalLookupJoin and his inheritors.
>>>>>>>>>>>>>>>>> Current classes for lookup join in
>>>>> flink-table-runtime
>>>>>>> -
>>>>>>>>>>>>>>>>> LookupJoinRunner, AsyncLookupJoinRunner,
>>>>>>>>>>> LookupJoinRunnerWithCalc,
>>>>>>>>>>>>>>>>> AsyncLookupJoinRunnerWithCalc.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> I suggest adding classes LookupJoinCachingRunner,
>>>>>>>>>>>>>>>>> LookupJoinCachingRunnerWithCalc, etc.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> And here comes another more powerful advantage of
>>>>> such a
>>>>>>>>>>> solution.
>>>>>>>>>>>>> If
>>>>>>>>>>>>>>>>> we have caching logic on a lower level, we can apply
>>>>> some
>>>>>>>>>>>>>>>>> optimizations to it. LookupJoinRunnerWithCalc was
>>>>> named
>>>>>>> like
>>>>>>>>>>> this
>>>>>>>>>>>>>>>>> because it uses the ‘calc’ function, which actually
>>>>>>> mostly
>>>>>>>>>>>> consists
>>>>>>>>>>>>> of
>>>>>>>>>>>>>>>>> filters and projections.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> For example, in join table A with lookup table B
>>>>>>> condition
>>>>>>>>>>> ‘JOIN …
>>>>>>>>>>>>> ON
>>>>>>>>>>>>>>>>> A.id = B.id AND A.age = B.age + 10 WHERE B.salary >
>>>>> 1000’
>>>>>>>>>>> ‘calc’
>>>>>>>>>>>>>>>>> function will contain filters A.age = B.age + 10 and
>>>>>>>>>> B.salary >
>>>>>>>>>>>>> 1000.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> If we apply this function before storing records in
>>>>>>> cache,
>>>>>>>>>> size
>>>>>>>>>>> of
>>>>>>>>>>>>>>>>> cache will be significantly reduced: filters = avoid
>>>>>>> storing
>>>>>>>>>>>> useless
>>>>>>>>>>>>>>>>> records in cache, projections = reduce records’
>>>>> size. So
>>>>>>> the
>>>>>>>>>>>> initial
>>>>>>>>>>>>>>>>> max number of records in cache can be increased by
>>>>> the
>>>>>>> user.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> What do you think about it?
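[The "calc before cache" optimization described above can be sketched in plain Java as follows. Predicate and Function stand in for the compiled 'calc' function's filters and projections; the names are illustrative, not Flink APIs.]

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Sketch: apply the join's filters and projections to looked-up rows BEFORE
// they enter the cache, so useless rows are never stored and stored rows
// are smaller.
class CalcBeforeCache<K, R, P> {
    private final Function<K, List<R>> lookup; // e.g. the JDBC lookup
    private final Predicate<R> filter;         // e.g. B.salary > 1000
    private final Function<R, P> project;      // keep only the needed columns
    private final Map<K, List<P>> cache = new HashMap<>();

    CalcBeforeCache(Function<K, List<R>> lookup, Predicate<R> filter, Function<R, P> project) {
        this.lookup = lookup;
        this.filter = filter;
        this.project = project;
    }

    List<P> get(K key) {
        return cache.computeIfAbsent(key, k ->
            lookup.apply(k).stream()
                  .filter(filter)   // drop rows the join would discard anyway
                  .map(project)     // shrink the surviving rows
                  .collect(Collectors.toList()));
    }
}
```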
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> On 2022/04/19 02:47:11 Qingsheng Ren wrote:
>>>>>>>>>>>>>>>>>> Hi devs,
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Yuan and I would like to start a discussion about
>>>>>>>>>> FLIP-221[1],
>>>>>>>>>>>>> which
>>>>>>>>>>>>>>>>> introduces an abstraction of lookup table cache and
>>>>> its
>>>>>>>>>> standard
>>>>>>>>>>>>> metrics.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Currently each lookup table source should implement
>>>>>>> their
>>>>>>>>>> own
>>>>>>>>>>>>> cache to
>>>>>>>>>>>>>>>>> store lookup results, and there isn’t a standard of
>>>>>>> metrics
>>>>>>>>>> for
>>>>>>>>>>>>> users and
>>>>>>>>>>>>>>>>> developers to tuning their jobs with lookup joins,
>>>>> which
>>>>>>> is a
>>>>>>>>>>>> quite
>>>>>>>>>>>>> common
>>>>>>>>>>>>>>>>> use case in Flink table / SQL.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Therefore we propose some new APIs including cache,
>>>>>>>>>> metrics,
>>>>>>>>>>>>> wrapper
>>>>>>>>>>>>>>>>> classes of TableFunction and new table options.
>>>>> Please
>>>>>>> take a
>>>>>>>>>>> look
>>>>>>>>>>>>> at the
>>>>>>>>>>>>>>>>> FLIP page [1] to get more details. Any suggestions
>>>>> and
>>>>>>>>>> comments
>>>>>>>>>>>>> would be
>>>>>>>>>>>>>>>>> appreciated!
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> [1]
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>> 
>>>>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-221+Abstraction+for+lookup+source+cache+and+metric
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Qingsheng
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> --
>>>>>>>>>>>>>> Best Regards,
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Qingsheng Ren
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Real-time Computing Team
>>>>>>>>>>>>>> Alibaba Cloud
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Email: renqschn@gmail.com
>>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> --
>>>>>>>>> Best regards,
>>>>>>>>> Roman Boyko
>>>>>>>>> e.: ro.v.boyko@gmail.com
>>>>>>>>> 
>>>>>>> 
>>>>> 
>>>>> 
>>> 
>> 
>> 
>> --
>> Best Regards,
>> 
>> Qingsheng Ren
>> 
>> Real-time Computing Team
>> Alibaba Cloud
>> 
>> Email: renqschn@gmail.com



Re: [DISCUSS] FLIP-221 Abstraction for lookup source cache and metric

Posted by Qingsheng Ren <re...@gmail.com>.
Hi Alexander and devs,

Thank you very much for the in-depth discussion! As Jark mentioned, we were
inspired by Alexander's idea and have refactored our design. FLIP-221 [1]
has been updated to reflect the new design, and we are happy to hear more
suggestions from you!

Compared to the previous design:
1. The lookup cache serves at the table runtime level and is integrated as a
component of LookupJoinRunner, as discussed previously.
2. Interfaces are renamed and re-designed to reflect the new design.
3. The all-caching case is handled separately, and a new
RescanRuntimeProvider is introduced to reuse the scanning ability. We plan
to support SourceFunction / InputFormat for now, considering the complexity
of the FLIP-27 Source API.
4. A new interface, LookupFunction, is introduced to make the semantics of
lookup more straightforward for developers.
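To make point 4 above concrete, here is a rough, self-contained sketch of the
shape such an interface could take. The names and signatures below
(LookupFunction, withCache) are illustrative assumptions, not the FLIP's
final API:

```java
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch only: a lookup function returns all rows matching a
// key, and the framework can wrap it with an LRU cache transparently.
public class LookupSketch {

    /** Simplified stand-in for a Flink-style lookup function. */
    public interface LookupFunction<K, V> {
        Collection<V> lookup(K key);
    }

    /** Wraps a lookup function with a bounded, access-ordered LRU cache. */
    public static <K, V> LookupFunction<K, V> withCache(
            LookupFunction<K, V> delegate, int maxRows) {
        Map<K, Collection<V>> cache = new LinkedHashMap<K, Collection<V>>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, Collection<V>> eldest) {
                return size() > maxRows; // evict the least recently used key
            }
        };
        // On a miss, delegate to the actual source and remember the result.
        return key -> cache.computeIfAbsent(key, delegate::lookup);
    }
}
```

With this shape, the join runner only ever sees a LookupFunction, and whether
the results come from the cache or the external system stays hidden behind it.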

Replying to Alexander:
> However I'm a little confused whether InputFormat is deprecated or not.
> Am I right that it will be so in the future, but currently it's not?
Yes, you are right. InputFormat is not deprecated for now. I think it will
be deprecated in the future, but we don't have a clear plan for that.

Thanks again for the discussion on this FLIP and looking forward to
cooperating with you after we finalize the design and interfaces!

[1]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-221+Abstraction+for+lookup+source+cache+and+metric

Best regards,

Qingsheng


On Fri, May 13, 2022 at 12:12 AM Александр Смирнов <sm...@gmail.com>
wrote:

> Hi Jark, Qingsheng and Leonard!
>
> Glad to see that we came to a consensus on almost all points!
>
> However, I'm a little confused about whether InputFormat is deprecated or
> not. Am I right that it will be in the future, but currently it's not?
> Actually, I also think that for the first version it's OK to use
> InputFormat in the ALL cache realization, because supporting the rescan
> ability seems like a very distant prospect. But for this decision we
> need a consensus among all discussion participants.
>
> In general, I don't have anything to argue with in your statements. All
> of them correspond to my ideas. Looking ahead, it would be nice to work
> on this FLIP cooperatively. I've already done a lot of work on lookup
> join caching with an implementation very close to the one we are
> discussing, and I want to share the results of this work. Anyway,
> looking forward to the FLIP update!
>
> Best regards,
> Smirnov Alexander
>
> Thu, 12 May 2022 at 17:38, Jark Wu <im...@gmail.com>:
> >
> > Hi Alex,
> >
> > Thanks for summarizing your points.
> >
> > In the past week, Qingsheng, Leonard, and I have discussed it several
> times
> > and we have totally refactored the design.
> > I'm glad to say we have reached a consensus on many of your points!
> > Qingsheng is still working on updating the design docs and maybe can be
> > available in the next few days.
> > I will share some conclusions from our discussions:
> >
> > 1) we have refactored the design towards the "cache in framework" way.
> >
> > 2) a "LookupCache" interface for users to customize, and a default
> > implementation with a builder for ease of use.
> > This makes it possible to have both flexibility and conciseness.
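As an illustration of that point, a builder-configured default cache might
look roughly like the following self-contained sketch. DefaultLookupCache,
maximumRows and ttlMillis are hypothetical names for this example, not the
FLIP's actual API:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: a framework-provided default cache, configured via a
// fluent builder, combining an LRU row limit with expire-after-write TTL.
public class DefaultLookupCache<K, V> {

    private static final class Entry<V> {
        final V value;
        final long writtenAt;
        Entry(V value, long writtenAt) { this.value = value; this.writtenAt = writtenAt; }
    }

    private final LinkedHashMap<K, Entry<V>> store;
    private final long ttlMillis;

    private DefaultLookupCache(final int maximumRows, long ttlMillis) {
        this.ttlMillis = ttlMillis;
        this.store = new LinkedHashMap<K, Entry<V>>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, Entry<V>> eldest) {
                return size() > maximumRows; // LRU eviction beyond the row limit
            }
        };
    }

    public V getIfPresent(K key) {
        Entry<V> e = store.get(key);
        if (e == null) return null;
        if (System.currentTimeMillis() - e.writtenAt > ttlMillis) {
            store.remove(key); // expire-after-write
            return null;
        }
        return e.value;
    }

    public void put(K key, V value) {
        store.put(key, new Entry<>(value, System.currentTimeMillis()));
    }

    public static Builder newBuilder() { return new Builder(); }

    public static final class Builder {
        private int maximumRows = 10_000;
        private long ttlMillis = Long.MAX_VALUE;

        public Builder maximumRows(int rows) { this.maximumRows = rows; return this; }
        public Builder ttlMillis(long ttl) { this.ttlMillis = ttl; return this; }
        public <K, V> DefaultLookupCache<K, V> build() {
            return new DefaultLookupCache<>(maximumRows, ttlMillis);
        }
    }
}
```

A connector that needs a different policy could still provide its own
LookupCache implementation instead of the default one.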
> >
> > 3) Filter pushdown is important for ALL and LRU lookup caches,
> > especially for reducing IO.
> > Filter pushdown should be the final state and the unified way to
> > support pruning both the ALL cache and the LRU cache, so I think we
> > should make an effort in this direction. If we need to support filter
> > pushdown for the ALL cache anyway, why not use it for the LRU cache as
> > well? Either way, as we have decided to implement the cache in the
> > framework, we have the chance to support filters on the cache at any
> > time. This is an optimization and it doesn't affect the public API. I
> > think we can create a JIRA issue to discuss it once the FLIP is
> > accepted.
> >
> > 4) The idea to support the ALL cache is similar to your proposal.
> > In the first version, we will only support InputFormat and
> > SourceFunction for cache-all (invoking the InputFormat in the join
> > operator).
> > For a FLIP-27 source, we need to join a true source operator instead of
> > calling it embedded in the join operator.
> > However, this needs another FLIP to support the re-scan ability for the
> > FLIP-27 Source, and this can be a large piece of work.
> > In order not to block this issue, we can put the effort of FLIP-27
> > source integration into future work and integrate InputFormat &
> > SourceFunction for now.
> >
> > I think it's fine to use InputFormat & SourceFunction, as they are not
> > deprecated; otherwise, we would have to introduce another function
> > similar to them, which would be meaningless. We need to plan the
> > FLIP-27 source integration ASAP, before InputFormat & SourceFunction
> > are deprecated.
> >
> > Best,
> > Jark
> >
> >
> >
> >
> >
> > On Thu, 12 May 2022 at 15:46, Александр Смирнов <sm...@gmail.com>
> > wrote:
> >
> > > Hi Martijn!
> > >
> > > Got it. Therefore, the realization with InputFormat is not considered.
> > > Thanks for clearing that up!
> > >
> > > Best regards,
> > > Smirnov Alexander
> > >
> > > Thu, 12 May 2022 at 14:23, Martijn Visser <ma...@ververica.com>:
> > > >
> > > > Hi,
> > > >
> > > > With regards to:
> > > >
> > > > > But if there are plans to refactor all connectors to FLIP-27
> > > >
> > > > Yes, FLIP-27 is the target for all connectors. The old interfaces
> will be
> > > > deprecated and connectors will either be refactored to use the new
> ones
> > > or
> > > > dropped.
> > > >
> > > > The caching should work for connectors that are using FLIP-27
> interfaces,
> > > > we should not introduce new features for old interfaces.
> > > >
> > > > Best regards,
> > > >
> > > > Martijn
> > > >
> > > > On Thu, 12 May 2022 at 06:19, Александр Смирнов <
> smiralexan@gmail.com>
> > > > wrote:
> > > >
> > > > > Hi Jark!
> > > > >
> > > > > Sorry for the late response. I would like to make some comments and
> > > > > clarify my points.
> > > > >
> > > > > 1) I agree with your first statement. I think we can achieve both
> > > > > advantages this way: put the Cache interface in flink-table-common,
> > > > > but have implementations of it in flink-table-runtime. Therefore,
> > > > > if a connector developer wants to use the existing cache strategies
> > > > > and their implementations, he can just pass the LookupConfig to the
> > > > > planner, but if he wants to have his own cache implementation in
> > > > > his TableFunction, it will be possible for him to use the existing
> > > > > interface for this purpose (we can explicitly point this out in the
> > > > > documentation). In this way all configs and metrics will be
> > > > > unified. WDYT?
> > > > >
> > > > > > If a filter can prune 90% of data in the cache, we will have 90%
> of
> > > > > lookup requests that can never be cached
> > > > >
> > > > > 2) Let me clarify the logic of the filter optimization in the case
> > > > > of an LRU cache. It looks like Cache<RowData, Collection<RowData>>.
> > > > > Here we always store the response of the dimension table in the
> > > > > cache, even after applying the calc function. I.e. if there are no
> > > > > rows left after applying filters to the result of the 'eval' method
> > > > > of the TableFunction, we store an empty list under the lookup keys.
> > > > > The cache line will therefore still be filled, but will require
> > > > > much less memory (in bytes). I.e. we don't completely filter out
> > > > > keys whose results were pruned, but we significantly reduce the
> > > > > memory required to store those results. If the user knows about
> > > > > this behavior, he can increase the 'max-rows' option before the
> > > > > start of the job. But actually I came up with the idea that we can
> > > > > do this automatically by using the 'maximumWeight' and 'weigher'
> > > > > methods of the Guava Cache [1]. The weight can be the size of the
> > > > > collection of rows (the value of the cache entry). The cache can
> > > > > therefore automatically fit many more records than before.
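A self-contained sketch of the weighing idea described above: Guava's
CacheBuilder provides this via maximumWeight()/weigher(), but the
LinkedHashMap-based emulation below avoids the third-party dependency; all
names are illustrative, and the row type stands in for RowData.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch: an LRU cache whose budget is the total number of cached rows, not
// the number of keys. Empty results (filtered-out keys) weigh ~0, so they
// barely consume budget, which is the point of the optimization above.
public class WeightedLookupCache<K, R> {

    private final LinkedHashMap<K, List<R>> store = new LinkedHashMap<>(16, 0.75f, true);
    private final long maxWeight; // total number of cached rows allowed
    private long currentWeight = 0;

    public WeightedLookupCache(long maxWeight) { this.maxWeight = maxWeight; }

    public List<R> getIfPresent(K key) { return store.get(key); }

    public void put(K key, List<R> rows) {
        List<R> old = store.put(key, rows);
        currentWeight += rows.size() - (old == null ? 0 : old.size());
        evictUntilWithinBudget();
    }

    private void evictUntilWithinBudget() {
        // Access-ordered map: iteration starts at the least recently used key.
        Iterator<Map.Entry<K, List<R>>> it = store.entrySet().iterator();
        while (currentWeight > maxWeight && it.hasNext()) {
            Map.Entry<K, List<R>> eldest = it.next();
            currentWeight -= eldest.getValue().size();
            it.remove();
        }
    }
}
```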
> > > > >
> > > > > > Flink SQL has provided a standard way to do filters and projects
> > > > > pushdown, i.e., SupportsFilterPushDown and
> SupportsProjectionPushDown.
> > > > > > Jdbc/hive/HBase haven't implemented the interfaces, don't mean
> it's
> > > hard
> > > > > to implement.
> > > > >
> > > > > It's debatable how difficult it will be to implement filter
> > > > > pushdown. But I think the fact that there is currently no database
> > > > > connector with filter pushdown at least means that this feature
> > > > > won't be supported in connectors soon. Moreover, if we talk about
> > > > > other connectors (not in the Flink repo), their databases might not
> > > > > support all Flink filters (or might not support filters at all). I
> > > > > think users are interested in the cache filter optimization being
> > > > > supported independently of other features, without having to solve
> > > > > more complex (or even unsolvable) problems.
> > > > >
> > > > > 3) I agree with your third statement. Actually, in our internal
> > > > > version I also tried to unify the logic of scanning and reloading
> > > > > data from connectors. But unfortunately, I didn't find a way to
> > > > > unify the logic of all ScanRuntimeProviders (InputFormat,
> > > > > SourceFunction, Source, ...) and reuse it for reloading the ALL
> > > > > cache. As a result I settled on using InputFormat, because it was
> > > > > used for scanning in all lookup connectors. (I didn't know that
> > > > > there are plans to deprecate InputFormat in favor of the FLIP-27
> > > > > Source.) IMO usage of the FLIP-27 source for ALL caching is not a
> > > > > good idea, because this source was designed to work in a
> > > > > distributed environment (SplitEnumerator on the JobManager and
> > > > > SourceReaders on TaskManagers), not in one operator (the lookup
> > > > > join operator in our case). There is not even a direct way to pass
> > > > > splits from the SplitEnumerator to a SourceReader (this logic works
> > > > > through the SplitEnumeratorContext, which requires an
> > > > > OperatorCoordinator.SubtaskGateway to send AddSplitEvents). Usage
> > > > > of InputFormat for the ALL cache seems much clearer and easier. But
> > > > > if there are plans to refactor all connectors to FLIP-27, I have
> > > > > the following idea: maybe we can give up the lookup join ALL cache
> > > > > in favor of a simple join with multiple scans of the batch source?
> > > > > The point is that the only difference between a lookup join ALL
> > > > > cache and a simple join with a batch source is that in the first
> > > > > case scanning is performed multiple times, in between which the
> > > > > state (cache) is cleared (correct me if I'm wrong). So what if we
> > > > > extend the functionality of the simple join to support state
> > > > > reloading + extend the functionality of scanning a batch source
> > > > > multiple times (this should be easy with the new FLIP-27 source,
> > > > > which unifies streaming/batch reading - we would only need to
> > > > > change the SplitEnumerator, which would pass the splits again after
> > > > > some TTL)? WDYT? I must say that this looks like a long-term goal
> > > > > and would make the scope of this FLIP even larger than you said.
> > > > > Maybe we can limit ourselves to a simpler solution now
> > > > > (InputFormats).
> > > > >
> > > > > So to sum up, my points are these:
> > > > > 1) There is a way to make both concise and flexible interfaces for
> > > > > caching in lookup join.
> > > > > 2) The cache filter optimization is important for both LRU and ALL
> > > > > caches.
> > > > > 3) It is unclear when filter pushdown will be supported in Flink
> > > > > connectors, and some connectors might not have the opportunity to
> > > > > support filter pushdown + as far as I know, filter pushdown
> > > > > currently works only for scanning (not lookup). So the cache filter
> > > > > + projection optimization should be independent of other features.
> > > > > 4) The ALL cache realization is a complex topic that involves
> > > > > multiple aspects of how Flink is developing. Abandoning InputFormat
> > > > > in favor of the FLIP-27 Source would make the ALL cache realization
> > > > > really complex and unclear, so maybe instead of that we can extend
> > > > > the functionality of the simple join, or keep InputFormat in the
> > > > > case of the lookup join ALL cache?
> > > > >
> > > > > Best regards,
> > > > > Smirnov Alexander
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > [1]
> > > > >
> > >
> https://guava.dev/releases/18.0/api/docs/com/google/common/cache/CacheBuilder.html#weigher(com.google.common.cache.Weigher)
> > > > >
> > > > > Thu, 5 May 2022 at 20:34, Jark Wu <im...@gmail.com>:
> > > > > >
> > > > > > It's great to see the active discussion! I want to share my
> ideas:
> > > > > >
> > > > > > 1) implement the cache in framework vs. connectors base
> > > > > > I don't have a strong opinion on this. Both ways should work
> (e.g.,
> > > cache
> > > > > > pruning, compatibility).
> > > > > > The framework way can provide more concise interfaces.
> > > > > > The connector base way can define more flexible cache
> > > > > > strategies/implementations.
> > > > > > We are still investigating a way to see if we can have both
> > > advantages.
> > > > > > We should reach a consensus that the way should be a final state,
> > > and we
> > > > > > are on the path to it.
> > > > > >
> > > > > > 2) filters and projections pushdown:
> > > > > > I agree with Alex that the filter pushdown into cache can
> benefit a
> > > lot
> > > > > for
> > > > > > ALL cache.
> > > > > > However, this is not true for the LRU cache. Connectors use the
> > > > > > cache to reduce IO requests to databases for better throughput.
> > > > > > If a filter can prune 90% of the data in the cache, we will have
> > > > > > 90% of lookup requests that can never be cached and that hit the
> > > > > > databases directly. That means the cache is meaningless in this
> > > > > > case.
> > > > > >
> > > > > > IMO, Flink SQL has provided a standard way to do filter and
> > > > > > projection pushdown, i.e., SupportsFilterPushDown and
> > > > > > SupportsProjectionPushDown.
> > > > > > JDBC/Hive/HBase haven't implemented these interfaces, but that
> > > > > > doesn't mean they are hard to implement.
> > > > > > They should implement the pushdown interfaces to reduce IO and
> > > > > > the cache size.
> > > > > > The final state should be that the scan source and the lookup
> > > > > > source share the exact same pushdown implementation.
> > > > > > I don't see why we need to duplicate the pushdown logic in
> > > > > > caches, which would complicate the lookup join design.
> > > > > >
> > > > > > 3) ALL cache abstraction
> > > > > > All cache might be the most challenging part of this FLIP. We
> have
> > > never
> > > > > > provided a reload-lookup public interface.
> > > > > > Currently, we put the reload logic in the "eval" method of
> > > TableFunction.
> > > > > > That's hard for some sources (e.g., Hive).
> > > > > > Ideally, connector implementation should share the logic of
> reload
> > > and
> > > > > > scan, i.e. ScanTableSource with
> InputFormat/SourceFunction/FLIP-27
> > > > > Source.
> > > > > > However, InputFormat/SourceFunction are deprecated, and the
> FLIP-27
> > > > > source
> > > > > > is deeply coupled with SourceOperator.
> > > > > > If we want to invoke the FLIP-27 source in LookupJoin, this may
> make
> > > the
> > > > > > scope of this FLIP much larger.
> > > > > > We are still investigating how to abstract the ALL cache logic
> and
> > > reuse
> > > > > > the existing source interfaces.
> > > > > >
> > > > > >
> > > > > > Best,
> > > > > > Jark
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Thu, 5 May 2022 at 20:22, Roman Boyko <ro...@gmail.com>
> > > wrote:
> > > > > >
> > > > > > > It's a much more complicated activity and lies outside the
> > > > > > > scope of this improvement, because such pushdowns should be
> > > > > > > done for all ScanTableSource implementations (not only for the
> > > > > > > Lookup ones).
> > > > > > >
> > > > > > > On Thu, 5 May 2022 at 19:02, Martijn Visser <
> > > martijnvisser@apache.org>
> > > > > > > wrote:
> > > > > > >
> > > > > > >> Hi everyone,
> > > > > > >>
> > > > > > >> One question regarding "And Alexander correctly mentioned that
> > > filter
> > > > > > >> pushdown still is not implemented for jdbc/hive/hbase." ->
> Would
> > > an
> > > > > > >> alternative solution be to actually implement these filter
> > > pushdowns?
> > > > > I
> > > > > > >> can
> > > > > > >> imagine that there are many more benefits to doing that,
> outside
> > > of
> > > > > lookup
> > > > > > >> caching and metrics.
> > > > > > >>
> > > > > > >> Best regards,
> > > > > > >>
> > > > > > >> Martijn Visser
> > > > > > >> https://twitter.com/MartijnVisser82
> > > > > > >> https://github.com/MartijnVisser
> > > > > > >>
> > > > > > >>
> > > > > > >> On Thu, 5 May 2022 at 13:58, Roman Boyko <
> ro.v.boyko@gmail.com>
> > > > > wrote:
> > > > > > >>
> > > > > > >> > Hi everyone!
> > > > > > >> >
> > > > > > >> > Thanks for driving such a valuable improvement!
> > > > > > >> >
> > > > > > >> > I do think that a single cache implementation would be a
> > > > > > >> > nice opportunity for users. And it will break the "FOR
> > > > > > >> > SYSTEM_TIME AS OF proc_time" semantics anyway - no matter
> > > > > > >> > how it is implemented.
> > > > > > >> >
> > > > > > >> > Putting myself in the user's shoes, I can say that:
> > > > > > >> > 1) I would prefer to have the opportunity to cut down the
> > > > > > >> > cache size by simply filtering unnecessary data. And the
> > > > > > >> > handiest way to do it is to apply it inside the
> > > > > > >> > LookupRunners. It would be a bit harder to pass it through
> > > > > > >> > the LookupJoin node to the TableFunction. And Alexander
> > > > > > >> > correctly mentioned that filter pushdown is still not
> > > > > > >> > implemented for jdbc/hive/hbase.
> > > > > > >> > 2) The ability to set different caching parameters for
> > > > > > >> > different tables is quite important. So I would prefer to
> > > > > > >> > set them through DDL rather than have the same TTL, strategy
> > > > > > >> > and other options for all lookup tables.
> > > > > > >> > 3) Providing the cache in the framework really deprives us
> > > > > > >> > of extensibility (users won't be able to implement their own
> > > > > > >> > cache). But most probably this can be solved by creating
> > > > > > >> > more cache strategies and a wider set of configurations.
> > > > > > >> >
> > > > > > >> > All these points are much closer to the schema proposed by
> > > > > > >> > Alexander. Qingsheng Ren, please correct me if I'm wrong -
> > > > > > >> > can all these facilities be simply implemented in your
> > > > > > >> > architecture?
> > > > > > >> >
> > > > > > >> > Best regards,
> > > > > > >> > Roman Boyko
> > > > > > >> > e.: ro.v.boyko@gmail.com
> > > > > > >> >
> > > > > > >> > On Wed, 4 May 2022 at 21:01, Martijn Visser <
> > > > > martijnvisser@apache.org>
> > > > > > >> > wrote:
> > > > > > >> >
> > > > > > >> > > Hi everyone,
> > > > > > >> > >
> > > > > > >> > > I don't have much to chip in, but just wanted to express
> that
> > > I
> > > > > really
> > > > > > >> > > appreciate the in-depth discussion on this topic and I
> hope
> > > that
> > > > > > >> others
> > > > > > >> > > will join the conversation.
> > > > > > >> > >
> > > > > > >> > > Best regards,
> > > > > > >> > >
> > > > > > >> > > Martijn
> > > > > > >> > >
> > > > > > >> > > On Tue, 3 May 2022 at 10:15, Александр Смирнов <
> > > > > smiralexan@gmail.com>
> > > > > > >> > > wrote:
> > > > > > >> > >
> > > > > > >> > > > Hi Qingsheng, Leonard and Jark,
> > > > > > >> > > >
> > > > > > >> > > > Thanks for your detailed feedback! However, I have
> questions
> > > > > about
> > > > > > >> > > > some of your statements (maybe I didn't get something?).
> > > > > > >> > > >
> > > > > > >> > > > > Caching actually breaks the semantic of "FOR
> SYSTEM_TIME
> > > AS OF
> > > > > > >> > > proc_time”
> > > > > > >> > > >
> > > > > > >> > > > I agree that the semantics of "FOR SYSTEM_TIME AS OF
> > > proc_time"
> > > > > is
> > > > > > >> not
> > > > > > >> > > > fully implemented with caching, but as you said, users
> go
> > > on it
> > > > > > >> > > > consciously to achieve better performance (no one
> proposed
> > > to
> > > > > enable
> > > > > > >> > > > caching by default, etc.). Or by users do you mean other
> > > > > developers
> > > > > > >> of
> > > > > > >> > > > connectors? In this case developers explicitly specify
> > > whether
> > > > > their
> > > > > > >> > > > connector supports caching or not (in the list of
> supported
> > > > > > >> options),
> > > > > > >> > > > no one makes them do that if they don't want to. So what
> > > > > exactly is
> > > > > > >> > > > the difference between implementing caching in modules
> > > > > > >> > > > flink-table-runtime and in flink-table-common from the
> > > > > considered
> > > > > > >> > > > point of view? How does it bear on breaking or not
> > > > > > >> > > > breaking the semantics of "FOR SYSTEM_TIME AS OF
> > > > > > >> > > > proc_time"?
> > > > > > >> > > >
> > > > > > >> > > > > confront a situation that allows table options in DDL
> to
> > > > > control
> > > > > > >> the
> > > > > > >> > > > behavior of the framework, which has never happened
> > > previously
> > > > > and
> > > > > > >> > should
> > > > > > >> > > > be cautious
> > > > > > >> > > >
> > > > > > >> > > > If we talk about main differences of semantics of DDL
> > > options
> > > > > and
> > > > > > >> > > > config options("table.exec.xxx"), isn't it about
> limiting
> > > the
> > > > > scope
> > > > > > >> of
> > > > > > >> > > > the options + importance for the user business logic
> rather
> > > than
> > > > > > >> > > > specific location of corresponding logic in the
> framework? I
> > > > > mean
> > > > > > >> that
> > > > > > >> > > > in my design, for example, putting an option with the
> > > > > > >> > > > lookup cache strategy in configurations would be the
> > > > > > >> > > > wrong decision, because it directly affects the user's
> > > > > > >> > > > business logic (not just performance optimization) +
> > > > > > >> > > > touches just several functions of ONE table (there can
> > > > > > >> > > > be multiple tables with different caches). Does it
> > > > > > >> > > > really matter to the user (or anyone else) where the
> > > > > > >> > > > logic affected by the applied option is located?
> > > > > > >> > > > Also I can remember DDL option 'sink.parallelism',
> which in
> > > > > some way
> > > > > > >> > > > "controls the behavior of the framework" and I don't
> see any
> > > > > problem
> > > > > > >> > > > here.
> > > > > > >> > > >
> > > > > > >> > > > > introduce a new interface for this all-caching
> scenario
> > > and
> > > > > the
> > > > > > >> > design
> > > > > > >> > > > would become more complex
> > > > > > >> > > >
> > > > > > >> > > > This is a subject for a separate discussion, but
> > > > > > >> > > > actually in our internal version we solved this problem
> > > > > > >> > > > quite easily - we reused the InputFormat class (so there
> > > > > > >> > > > is no need for a new API). The point is that currently
> > > > > > >> > > > all lookup connectors use InputFormat for scanning the
> > > > > > >> > > > data in batch mode: HBase, JDBC and even Hive - it uses
> > > > > > >> > > > the class PartitionReader, which is actually just a
> > > > > > >> > > > wrapper around InputFormat. The advantage of this
> > > > > > >> > > > solution is the ability to reload cache data in parallel
> > > > > > >> > > > (the number of threads depends on the number of
> > > > > > >> > > > InputSplits, but has an upper limit). As a result the
> > > > > > >> > > > cache reload time is significantly reduced (as well as
> > > > > > >> > > > the time the input stream is blocked). I know that we
> > > > > > >> > > > usually try to avoid the use of concurrency in Flink
> > > > > > >> > > > code, but maybe this one can be an exception. BTW I
> > > > > > >> > > > don't say that it's an ideal solution; maybe there are
> > > > > > >> > > > better ones.
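The parallel reload idea described above could look roughly like the
following self-contained sketch; the split readers are hypothetical stand-ins
for InputSplit-based readers, not Flink API:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Illustrative sketch (not Flink API) of a parallel ALL-cache reload: each
// "split" is read by its own task, bounded by a thread-pool limit, and the
// results are merged into the new cache snapshot.
public class ParallelCacheReload {

    public static <R> List<R> reload(List<Callable<List<R>>> splitReaders, int maxThreads)
            throws InterruptedException, ExecutionException {
        ExecutorService pool =
                Executors.newFixedThreadPool(Math.min(maxThreads, splitReaders.size()));
        try {
            List<R> all = new ArrayList<>();
            // invokeAll blocks until every split has been fully read.
            for (Future<List<R>> future : pool.invokeAll(splitReaders)) {
                all.addAll(future.get());
            }
            return all;
        } finally {
            pool.shutdown();
        }
    }
}
```

The input stream would only be blocked while reload() runs, which is exactly
the window the parallelism shortens.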
> > > > > > >> > > >
> > > > > > >> > > > > Providing the cache in the framework might introduce
> > > > > compatibility
> > > > > > >> > > issues
> > > > > > >> > > >
> > > > > > >> > > > It's possible only in cases where the developer of the
> > > > > > >> > > > connector doesn't properly refactor his code and uses
> > > > > > >> > > > the new cache options incorrectly (i.e. explicitly
> > > > > > >> > > > provides the same options in 2 different code places).
> > > > > > >> > > > For correct behavior, all he needs to do is redirect the
> > > > > > >> > > > existing options to the framework's LookupConfig (+
> > > > > > >> > > > maybe add an alias for options, if there was different
> > > > > > >> > > > naming); everything will be transparent for users. If
> > > > > > >> > > > the developer doesn't refactor at all, nothing changes
> > > > > > >> > > > for the connector because of backward compatibility.
> > > > > > >> > > > Also, if a developer wants to use his own cache logic,
> > > > > > >> > > > he can simply refuse to pass some of the configs to the
> > > > > > >> > > > framework, and instead make his own implementation with
> > > > > > >> > > > the already existing configs and metrics (but actually I
> > > > > > >> > > > think that's a rare case).
> > > > > > >> > > >
> > > > > > >> > > > > filters and projections should be pushed all the way
> down
> > > to
> > > > > the
> > > > > > >> > table
> > > > > > >> > > > function, like what we do in the scan source
> > > > > > >> > > >
> > > > > > >> > > > It's a great goal. But the truth is that the ONLY
> > > connector
> > > > > > >> that
> > > > > > >> > > > supports filter pushdown is FileSystemTableSource
> > > > > > >> > > > (no database connector supports it currently). Also for
> some
> > > > > > >> databases
> > > > > > >> > > > it's simply impossible to pushdown such complex filters
> > > that we
> > > > > have
> > > > > > >> > > > in Flink.
> > > > > > >> > > >
> > > > > > >> > > > >  only applying these optimizations to the cache seems
> not
> > > > > quite
> > > > > > >> > useful
> > > > > > >> > > >
> > > > > > >> > > > Filters can cut off an arbitrarily large amount of data
> > > > > > >> > > > from the dimension table. For a simple example, suppose
> > > > > > >> > > > in the dimension table 'users' we have a column 'age'
> > > > > > >> > > > with values from 20 to 40, and an input stream 'clicks'
> > > > > > >> > > > that is ~uniformly distributed by age of users. If we
> > > > > > >> > > > have the filter 'age > 30', there will be half as much
> > > > > > >> > > > data in the cache. This means the user can increase
> > > > > > >> > > > 'lookup.cache.max-rows' by almost 2 times, which will
> > > > > > >> > > > give a huge performance boost. Moreover, this
> > > > > > >> > > > optimization starts to really shine with the 'ALL'
> > > > > > >> > > > cache, where tables without filters and projections
> > > > > > >> > > > can't fit in memory, but with them they can. This opens
> > > > > > >> > > > up additional possibilities for users. And that doesn't
> > > > > > >> > > > sound like 'not quite useful'.
> > > > > > >> > > >
> > > > > > >> > > > It would be great to hear other voices regarding this
> topic!
> > > > > Because
> > > > > > >> > > > we have quite a lot of controversial points, and I think
> > > with
> > > > > the
> > > > > > >> help
> > > > > > >> > > > of others it will be easier for us to come to a
> consensus.
> > > > > > >> > > >
> > > > > > >> > > > Best regards,
> > > > > > >> > > > Smirnov Alexander
> > > > > > >> > > >
> > > > > > >> > > >
> > > > > > >> > > > Fri, 29 Apr 2022 at 22:33, Qingsheng Ren <renqschn@gmail.com>:
> > > > > > >> > > > >
> > > > > > >> > > > > Hi Alexander and Arvid,
> > > > > > >> > > > >
> > > > > > >> > > > > Thanks for the discussion and sorry for my late
> response!
> > > We
> > > > > had
> > > > > > >> an
> > > > > > >> > > > internal discussion together with Jark and Leonard and
> I’d
> > > like
> > > > > to
> > > > > > >> > > > summarize our ideas. Instead of implementing the cache
> > > logic in
> > > > > the
> > > > > > >> > table
> > > > > > >> > > > runtime layer or wrapping around the user-provided table
> > > > > function,
> > > > > > >> we
> > > > > > >> > > > prefer to introduce some new APIs extending
> TableFunction
> > > with
> > > > > these
> > > > > > >> > > > concerns:
> > > > > > >> > > > >
> > > > > > >> > > > > 1. Caching actually breaks the semantic of "FOR
> > > SYSTEM_TIME
> > > > > AS OF
> > > > > > >> > > > proc_time”, because it couldn’t truly reflect the
> content
> > > of the
> > > > > > >> lookup
> > > > > > >> > > > table at the moment of querying. If users choose to
> enable
> > > > > caching
> > > > > > >> on
> > > > > > >> > the
> > > > > > >> > > > lookup table, they implicitly indicate that this
> breakage is
> > > > > > >> acceptable
> > > > > > >> > > in
> > > > > > >> > > > exchange for the performance. So we prefer not to
> provide
> > > > > caching on
> > > > > > >> > the
> > > > > > >> > > > table runtime level.
> > > > > > >> > > > >
> > > > > > >> > > > > 2. If we make the cache implementation in the
> > > > > > >> > > > > framework (whether in a runner or a wrapper around the
> > > > > > >> > > > > TableFunction), we have to confront a situation that
> > > > > > >> > > > > allows table options in DDL to control the behavior of
> > > > > > >> > > > > the framework, which has never happened previously and
> > > > > > >> > > > > should be treated cautiously. Under the current design
> > > > > > >> > > > > the behavior of the framework should only be specified
> > > > > > >> > > > > by configurations ("table.exec.xxx"), and it's hard to
> > > > > > >> > > > > apply these general configs to a specific table.
> > > > > > >> > > > >
> > > > > > >> > > > > 3. We have use cases where the lookup source loads and
> > > > > > >> > > > > refreshes all records periodically into memory to
> > > > > > >> > > > > achieve high lookup performance (like the Hive
> > > > > > >> > > > > connector in the community; this is also widely used
> > > > > > >> > > > > by our internal connectors). Wrapping the cache around
> > > > > > >> > > > > the user's TableFunction works fine for LRU caches,
> > > > > > >> > > > > but I think we have to introduce a new interface for
> > > > > > >> > > > > this all-caching scenario, and the design would become
> > > > > > >> > > > > more complex.
> > > > > > >> > > > >
> > > > > > >> > > > > 4. Providing the cache in the framework might
> introduce
> > > > > > >> compatibility
> > > > > > >> > > > issues to existing lookup sources like there might
> exist two
> > > > > caches
> > > > > > >> > with
> > > > > > >> > > > totally different strategies if the user incorrectly
> > > configures
> > > > > the
> > > > > > >> > table
> > > > > > >> > > > (one in the framework and another implemented by the
> lookup
> > > > > source).
> > > > > > >> > > > >
> > > > > > >> > > > > As for the optimization mentioned by Alexander, I
> think
> > > > > filters
> > > > > > >> and
> > > > > > >> > > > projections should be pushed all the way down to the
> table
> > > > > function,
> > > > > > >> > like
> > > > > > >> > > > what we do in the scan source, instead of the runner
> with
> > > the
> > > > > cache.
> > > > > > >> > The
> > > > > > >> > > > goal of using cache is to reduce the network I/O and
> > > pressure
> > > > > on the
> > > > > > >> > > > external system, and only applying these optimizations
> to
> > > the
> > > > > cache
> > > > > > >> > seems
> > > > > > >> > > > not quite useful.
> > > > > > >> > > > >
> > > > > > >> > > > > I made some updates to the FLIP[1] to reflect our
> ideas.
> > > We
> > > > > > >> prefer to
> > > > > > >> > > > keep the cache implementation as a part of
> TableFunction,
> > > and we
> > > > > > >> could
> > > > > > >> > > > provide some helper classes (CachingTableFunction,
> > > > > > >> > > AllCachingTableFunction,
> > > > > > >> > > > CachingAsyncTableFunction) to developers and regulate
> > > metrics
> > > > > of the
> > > > > > >> > > cache.
> > > > > > >> > > > Also, I made a POC[2] for your reference.
> > > > > > >> > > > >
> > > > > > >> > > > > Looking forward to your ideas!
> > > > > > >> > > > >
> > > > > > >> > > > > [1]
> > > > > > >> > > >
> > > > > > >> > >
> > > > > > >> >
> > > > > > >>
> > > > >
> > >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-221+Abstraction+for+lookup+source+cache+and+metric
> > > > > > >> > > > > [2] https://github.com/PatrickRen/flink/tree/FLIP-221
> > > > > > >> > > > >
> > > > > > >> > > > > Best regards,
> > > > > > >> > > > >
> > > > > > >> > > > > Qingsheng
> > > > > > >> > > > >
> > > > > > >> > > > > On Tue, Apr 26, 2022 at 4:45 PM Александр Смирнов <
> > > > > > >> > > smiralexan@gmail.com>
> > > > > > >> > > > wrote:
> > > > > > >> > > > >>
> > > > > > >> > > > >> Thanks for the response, Arvid!
> > > > > > >> > > > >>
> > > > > > >> > > > >> I have few comments on your message.
> > > > > > >> > > > >>
> > > > > > >> > > > >> > but could also live with an easier solution as the
> > > first
> > > > > step:
> > > > > > >> > > > >>
> > > > > > >> > > > >> I think that these 2 ways are mutually exclusive
> > > (originally
> > > > > > >> > proposed
> > > > > > >> > > > >> by Qingsheng and mine), because conceptually they
> follow
> > > the
> > > > > same
> > > > > > >> > > > >> goal, but implementation details are different. If we
> > > will
> > > > > go one
> > > > > > >> > way,
> > > > > > >> > > > >> moving to another way in the future will mean
> deleting
> > > > > existing
> > > > > > >> code
> > > > > > >> > > > >> and once again changing the API for connectors. So I
> > > think we
> > > > > > >> should
> > > > > > >> > > > >> reach a consensus with the community about that and
> then
> > > work
> > > > > > >> > together
> > > > > > >> > > > >> on this FLIP, i.e. divide the work on tasks for
> different
> > > > > parts
> > > > > > >> of
> > > > > > >> > the
> > > > > > >> > > > >> flip (for example, LRU cache unification /
> introducing
> > > > > proposed
> > > > > > >> set
> > > > > > >> > of
> > > > > > >> > > > >> metrics / further work…). WDYT, Qingsheng?
> > > > > > >> > > > >>
> > > > > > >> > > > >> > as the source will only receive the requests after
> > > filter
> > > > > > >> > > > >>
> > > > > > >> > > > >> Actually if filters are applied to fields of the
> lookup
> > > > > table, we
> > > > > > >> > > > >> first must do the requests, and only after that we can
> > > filter
> > > > > > >> > responses,
> > > > > > >> > > > >> because lookup connectors don't have filter
> pushdown. So
> > > if
> > > > > > >> > filtering
> > > > > > >> > > > >> is done before caching, there will be far fewer rows
> in
> > > > > cache.
> > > > > > >> > > > >>
> > > > > > >> > > > >> > @Alexander unfortunately, your architecture is not
> > > shared.
> > > > > I
> > > > > > >> don't
> > > > > > >> > > > know the
> > > > > > >> > > > >>
> > > > > > >> > > > >> > solution to share images to be honest.
> > > > > > >> > > > >>
> > > > > > >> > > > >> Sorry for that, I’m a bit new to such kinds of
> > > conversations
> > > > > :)
> > > > > > >> > > > >> I have no write access to the confluence, so I made a
> > > Jira
> > > > > issue,
> > > > > > >> > > > >> where described the proposed changes in more details
> -
> > > > > > >> > > > >> https://issues.apache.org/jira/browse/FLINK-27411.
> > > > > > >> > > > >>
> > > > > > >> > > > >> Will be happy to get more feedback!
> > > > > > >> > > > >>
> > > > > > >> > > > >> Best,
> > > > > > >> > > > >> Smirnov Alexander
> > > > > > >> > > > >>
> > > > > > >> > > > >> пн, 25 апр. 2022 г. в 19:49, Arvid Heise <
> > > arvid@apache.org>:
> > > > > > >> > > > >> >
> > > > > > >> > > > >> > Hi Qingsheng,
> > > > > > >> > > > >> >
> > > > > > >> > > > >> > Thanks for driving this; the inconsistency was not
> > > > > satisfying
> > > > > > >> for
> > > > > > >> > > me.
> > > > > > >> > > > >> >
> > > > > > >> > > > >> > I second Alexander's idea though but could also
> live
> > > with
> > > > > an
> > > > > > >> > easier
> > > > > > >> > > > >> > solution as the first step: Instead of making
> caching
> > > an
> > > > > > >> > > > implementation
> > > > > > >> > > > >> > detail of TableFunction X, rather devise a caching
> > > layer
> > > > > > >> around X.
> > > > > > >> > > So
> > > > > > >> > > > the
> > > > > > >> > > > >> > proposal would be a CachingTableFunction that
> > > delegates to
> > > > > X in
> > > > > > >> > case
> > > > > > >> > > > of
> > > > > > >> > > > >> > misses and else manages the cache. Lifting it into
> the
> > > > > operator
> > > > > > >> > > model
> > > > > > >> > > > as
> > > > > > >> > > > >> > proposed would be even better but is probably
> > > unnecessary
> > > > > in
> > > > > > >> the
> > > > > > >> > > > first step
> > > > > > >> > > > >> > for a lookup source (as the source will only
> receive
> > > the
> > > > > > >> requests
> > > > > > >> > > > after
> > > > > > >> > > > >> > filter; applying projection may be more
> interesting to
> > > save
> > > > > > >> > memory).
> > > > > > >> > > > >> >
> > > > > > >> > > > >> > Another advantage is that all the changes of this
> FLIP
> > > > > would be
> > > > > > >> > > > limited to
> > > > > > >> > > > >> > options, no need for new public interfaces.
> Everything
> > > else
> > > > > > >> > remains
> > > > > > >> > > an
> > > > > > >> > > > >> > implementation of Table runtime. That means we can
> > > easily
> > > > > > >> > > incorporate
> > > > > > >> > > > the
> > > > > > >> > > > >> > optimization potential that Alexander pointed out
> > > later.
> > > > > > >> > > > >> >
> > > > > > >> > > > >> > @Alexander unfortunately, your architecture is not
> > > shared.
> > > > > I
> > > > > > >> don't
> > > > > > >> > > > know the
> > > > > > >> > > > >> > solution to share images to be honest.
> > > > > > >> > > > >> >
> > > > > > >> > > > >> > On Fri, Apr 22, 2022 at 5:04 PM Александр Смирнов <
> > > > > > >> > > > smiralexan@gmail.com>
> > > > > > >> > > > >> > wrote:
> > > > > > >> > > > >> >
> > > > > > >> > > > >> > > Hi Qingsheng! My name is Alexander, I'm not a
> > > committer
> > > > > yet,
> > > > > > >> but
> > > > > > >> > > I'd
> > > > > > >> > > > >> > > really like to become one. And this FLIP really
> > > > > interested
> > > > > > >> me.
> > > > > > >> > > > >> > > Actually I have worked on a similar feature in my
> > > > > company’s
> > > > > > >> > Flink
> > > > > > >> > > > >> > > fork, and we would like to share our thoughts on
> > > this and
> > > > > > >> make
> > > > > > >> > > code
> > > > > > >> > > > >> > > open source.
> > > > > > >> > > > >> > >
> > > > > > >> > > > >> > > I think there is a better alternative than
> > > introducing an
> > > > > > >> > abstract
> > > > > > >> > > > >> > > class for TableFunction (CachingTableFunction).
> As
> > > you
> > > > > know,
> > > > > > >> > > > >> > > TableFunction exists in the flink-table-common
> > > module,
> > > > > which
> > > > > > >> > > > provides
> > > > > > >> > > > >> > > only an API for working with tables – it’s very
> > > > > convenient
> > > > > > >> for
> > > > > > >> > > > importing
> > > > > > >> > > > >> > > in connectors. In turn, CachingTableFunction
> contains
> > > > > logic
> > > > > > >> for
> > > > > > >> > > > >> > > runtime execution,  so this class and everything
> > > > > connected
> > > > > > >> with
> > > > > > >> > it
> > > > > > >> > > > >> > > should be located in another module, probably in
> > > > > > >> > > > flink-table-runtime.
> > > > > > >> > > > >> > > But this will require connectors to depend on
> another
> > > > > module,
> > > > > > >> > > which
> > > > > > >> > > > >> > > contains a lot of runtime logic, which doesn’t
> sound
> > > > > good.
> > > > > > >> > > > >> > >
> > > > > > >> > > > >> > > I suggest adding a new method ‘getLookupConfig’
> to
> > > > > > >> > > LookupTableSource
> > > > > > >> > > > >> > > or LookupRuntimeProvider to allow connectors to
> only
> > > pass
> > > > > > >> > > > >> > > configurations to the planner, therefore they
> won’t
> > > > > depend on
> > > > > > >> > > > runtime
> > > > > > >> > > > >> > > realization. Based on these configs planner will
> > > > > construct a
> > > > > > >> > > lookup
> > > > > > >> > > > >> > > join operator with corresponding runtime logic
> > > > > > >> (ProcessFunctions
> > > > > > >> > > in
> > > > > > >> > > > >> > > module flink-table-runtime). Architecture looks
> like
> > > in
> > > > > the
> > > > > > >> > pinned
> > > > > > >> > > > >> > > image (LookupConfig class there is actually yours
> > > > > > >> CacheConfig).
> > > > > > >> > > > >> > >
> > > > > > >> > > > >> > > Classes in flink-table-planner, that will be
> > > responsible
> > > > > for
> > > > > > >> > this
> > > > > > >> > > –
> > > > > > >> > > > >> > > CommonPhysicalLookupJoin and his inheritors.
> > > > > > >> > > > >> > > Current classes for lookup join in
> > > flink-table-runtime
> > > > > -
> > > > > > >> > > > >> > > LookupJoinRunner, AsyncLookupJoinRunner,
> > > > > > >> > LookupJoinRunnerWithCalc,
> > > > > > >> > > > >> > > AsyncLookupJoinRunnerWithCalc.
> > > > > > >> > > > >> > >
> > > > > > >> > > > >> > > I suggest adding classes LookupJoinCachingRunner,
> > > > > > >> > > > >> > > LookupJoinCachingRunnerWithCalc, etc.
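The 'getLookupConfig' idea above — the connector hands a plain config object to the planner, which then picks a caching runner — could be sketched roughly as follows (all class names and option keys here are hypothetical illustrations, not an actual Flink API):

```java
// Hypothetical sketch: the connector only describes its caching wishes;
// the planner would map this onto a LookupJoin(Caching)Runner in
// flink-table-runtime, so the connector never depends on runtime classes.
enum CacheStrategy { NONE, LRU, ALL }

final class LookupConfig {
    final CacheStrategy strategy;
    final long cacheMaxRows;    // e.g. taken from a 'lookup.cache.max-rows' option
    final long cacheTtlMillis;  // e.g. taken from a 'lookup.cache.ttl' option

    LookupConfig(CacheStrategy strategy, long cacheMaxRows, long cacheTtlMillis) {
        this.strategy = strategy;
        this.cacheMaxRows = cacheMaxRows;
        this.cacheTtlMillis = cacheTtlMillis;
    }
}
```

The planner would then branch on 'strategy' when translating the lookup join, which is why the connector side stays a pure description with no runtime dependency.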
> > > > > > >> > > > >> > >
> > > > > > >> > > > >> > > And here comes another more powerful advantage of
> > > such a
> > > > > > >> > solution.
> > > > > > >> > > > If
> > > > > > >> > > > >> > > we have caching logic on a lower level, we can
> apply
> > > some
> > > > > > >> > > > >> > > optimizations to it. LookupJoinRunnerWithCalc was
> > > named
> > > > > like
> > > > > > >> > this
> > > > > > >> > > > >> > > because it uses the ‘calc’ function, which
> actually
> > > > > mostly
> > > > > > >> > > consists
> > > > > > >> > > > of
> > > > > > >> > > > >> > > filters and projections.
> > > > > > >> > > > >> > >
> > > > > > >> > > > >> > > For example, in join table A with lookup table B
> > > > > condition
> > > > > > >> > ‘JOIN …
> > > > > > >> > > > ON
> > > > > > >> > > > >> > > A.id = B.id AND A.age = B.age + 10 WHERE
> B.salary >
> > > 1000’
> > > > > > >> > ‘calc’
> > > > > > >> > > > >> > > function will contain filters A.age = B.age + 10
> and
> > > > > > >> B.salary >
> > > > > > >> > > > 1000.
> > > > > > >> > > > >> > >
> > > > > > >> > > > >> > > If we apply this function before storing records
> in
> > > > > cache,
> > > > > > >> size
> > > > > > >> > of
> > > > > > >> > > > >> > > cache will be significantly reduced: filters =
> avoid
> > > > > storing
> > > > > > >> > > useless
> > > > > > >> > > > >> > > records in cache, projections = reduce records’
> > > size. So
> > > > > the
> > > > > > >> > > initial
> > > > > > >> > > > >> > > max number of records in cache can be increased
> by
> > > the
> > > > > user.
> > > > > > >> > > > >> > >
> > > > > > >> > > > >> > > What do you think about it?
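The "apply the 'calc' function before caching" idea can be sketched like this (hypothetical names; a real runner operates on RowData, and filters that also reference probe-side columns such as A.age are left out of this simplification):

```java
import java.util.*;
import java.util.function.Function;
import java.util.function.Predicate;

/** Sketch: run the join's filter/projection ("calc") before caching lookup results. */
class FilteringLookupCache<K, R> {
    private final Map<K, List<R>> cache = new HashMap<>();
    private final Function<K, List<R>> lookup;   // the actual lookup into the dimension table
    private final Predicate<R> calcFilter;       // e.g. "B.salary > 1000" on the looked-up row
    private final Function<R, R> calcProjection; // e.g. keep only the columns the join needs

    FilteringLookupCache(Function<K, List<R>> lookup, Predicate<R> filter, Function<R, R> projection) {
        this.lookup = lookup;
        this.calcFilter = filter;
        this.calcProjection = projection;
    }

    List<R> get(K key) {
        return cache.computeIfAbsent(key, k -> {
            List<R> reduced = new ArrayList<>();
            for (R row : lookup.apply(k)) {
                if (calcFilter.test(row)) {
                    reduced.add(calcProjection.apply(row));
                }
            }
            // an empty list is still cached: a remembered "no match" that is cheap to keep
            return reduced;
        });
    }
}
```

Filtered-out rows never enter the cache, and an empty result is still remembered, so repeated lookups for non-matching keys stay cheap.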
> > > > > > >> > > > >> > >
> > > > > > >> > > > >> > >
> > > > > > >> > > > >> > > On 2022/04/19 02:47:11 Qingsheng Ren wrote:
> > > > > > >> > > > >> > > > Hi devs,
> > > > > > >> > > > >> > > >
> > > > > > >> > > > >> > > > Yuan and I would like to start a discussion
> about
> > > > > > >> FLIP-221[1],
> > > > > > >> > > > which
> > > > > > >> > > > >> > > introduces an abstraction of lookup table cache
> and
> > > its
> > > > > > >> standard
> > > > > > >> > > > metrics.
> > > > > > >> > > > >> > > >
> > > > > > >> > > > >> > > > Currently each lookup table source should
> implement
> > > > > their
> > > > > > >> own
> > > > > > >> > > > cache to
> > > > > > >> > > > >> > > store lookup results, and there isn’t a standard
> of
> > > > > metrics
> > > > > > >> for
> > > > > > >> > > > users and
> > > > > > >> > > > >> > > developers to tuning their jobs with lookup
> joins,
> > > which
> > > > > is a
> > > > > > >> > > quite
> > > > > > >> > > > common
> > > > > > >> > > > >> > > use case in Flink table / SQL.
> > > > > > >> > > > >> > > >
> > > > > > >> > > > >> > > > Therefore we propose some new APIs including
> cache,
> > > > > > >> metrics,
> > > > > > >> > > > wrapper
> > > > > > >> > > > >> > > classes of TableFunction and new table options.
> > > Please
> > > > > take a
> > > > > > >> > look
> > > > > > >> > > > at the
> > > > > > >> > > > >> > > FLIP page [1] to get more details. Any
> suggestions
> > > and
> > > > > > >> comments
> > > > > > >> > > > would be
> > > > > > >> > > > >> > > appreciated!
> > > > > > >> > > > >> > > >
> > > > > > >> > > > >> > > > [1]
> > > > > > >> > > > >> > >
> > > > > > >> > > >
> > > > > > >> > >
> > > > > > >> >
> > > > > > >>
> > > > >
> > >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-221+Abstraction+for+lookup+source+cache+and+metric
> > > > > > >> > > > >> > > >
> > > > > > >> > > > >> > > > Best regards,
> > > > > > >> > > > >> > > >
> > > > > > >> > > > >> > > > Qingsheng
> > > > > > >> > > > >> > > >
> > > > > > >> > > > >> > > >
> > > > > > >> > > > >> > >
> > > > > > >> > > > >
> > > > > > >> > > > >
> > > > > > >> > > > >
> > > > > > >> > > > > --
> > > > > > >> > > > > Best Regards,
> > > > > > >> > > > >
> > > > > > >> > > > > Qingsheng Ren
> > > > > > >> > > > >
> > > > > > >> > > > > Real-time Computing Team
> > > > > > >> > > > > Alibaba Cloud
> > > > > > >> > > > >
> > > > > > >> > > > > Email: renqschn@gmail.com
> > > > > > >> > > >
> > > > > > >> > >
> > > > > > >> >
> > > > > > >>
> > > > > > >
> > > > > > >
> > > > > > > --
> > > > > > > Best regards,
> > > > > > > Roman Boyko
> > > > > > > e.: ro.v.boyko@gmail.com
> > > > > > >
> > > > >
> > >
> > >
>
>

-- 
Best Regards,

*Qingsheng Ren*

Real-time Computing Team
Alibaba Cloud

Email: renqschn@gmail.com

Re: [DISCUSS] FLIP-221 Abstraction for lookup source cache and metric

Posted by Александр Смирнов <sm...@gmail.com>.
Hi Jark, Qingsheng and Leonard!

Glad to see that we came to a consensus on almost all points!

However, I'm a little confused about whether InputFormat is deprecated
or not. Am I right that it will be deprecated in the future, but
currently is not? Actually, I also think that for the first version
it's OK to use InputFormat in the ALL cache implementation, because
supporting the rescan ability seems like a very distant prospect. But
for this decision we need a consensus among all discussion participants.

In general, I don't have anything to argue about in your statements.
All of them correspond to my ideas. Looking ahead, it would be nice to
work on this FLIP cooperatively. I've already done a lot of work on
lookup join caching, with an implementation very close to the one we
are discussing, and I want to share the results of this work. Anyway,
looking forward to the FLIP update!

Best regards,
Smirnov Alexander

чт, 12 мая 2022 г. в 17:38, Jark Wu <im...@gmail.com>:
>
> Hi Alex,
>
> Thanks for summarizing your points.
>
> In the past week, Qingsheng, Leonard, and I have discussed it several times
> and we have totally refactored the design.
> I'm glad to say we have reached a consensus on many of your points!
> Qingsheng is still working on updating the design docs and maybe can be
> available in the next few days.
> I will share some conclusions from our discussions:
>
> 1) we have refactored the design towards the "cache in framework" way.
>
> 2) a "LookupCache" interface for users to customize and a default
> implementation with a builder for ease of use.
> This makes it possible to have both flexibility and conciseness.
>
> 3) Filter pushdown is important for ALL and LRU lookup cache, esp reducing
> IO.
> Filter pushdown should be the final state and the unified way to both
> support pruning ALL cache and LRU cache,
> so I think we should make effort in this direction. If we need to support
> filter pushdown for ALL cache anyway, why not use
> it for LRU cache as well? Either way, as we decide to implement the cache
> in the framework, we have the chance to support
> filter on cache anytime. This is an optimization and it doesn't affect the
> public API. I think we can create a JIRA issue to
> discuss it when the FLIP is accepted.
>
> 4) The idea to support ALL cache is similar to your proposal.
> In the first version, we will only support InputFormat, SourceFunction for
> cache all (invoke InputFormat in join operator).
> For FLIP-27 source, we need to join a true source operator instead of
> calling it embedded in the join operator.
> However, this needs another FLIP to support the re-scan ability for FLIP-27
> Source, and this can be a large work.
> In order to not block this issue, we can put the effort of FLIP-27 source
> integration into future work and integrate
> InputFormat&SourceFunction for now.
>
> I think it's fine to use InputFormat & SourceFunction, as they are not
> deprecated; otherwise, we would have to introduce another function
> similar to them, which is meaningless. We need to plan FLIP-27 source
> integration ASAP before InputFormat & SourceFunction are deprecated.
>
> Best,
> Jark
>
>
>
>
>
> On Thu, 12 May 2022 at 15:46, Александр Смирнов <sm...@gmail.com>
> wrote:
>
> > Hi Martijn!
> >
> > Got it. Therefore, the implementation based on InputFormat is not an option.
> > Thanks for clearing that up!
> >
> > Best regards,
> > Smirnov Alexander
> >
> > чт, 12 мая 2022 г. в 14:23, Martijn Visser <ma...@ververica.com>:
> > >
> > > Hi,
> > >
> > > With regards to:
> > >
> > > > But if there are plans to refactor all connectors to FLIP-27
> > >
> > > Yes, FLIP-27 is the target for all connectors. The old interfaces will be
> > > deprecated and connectors will either be refactored to use the new ones
> > or
> > > dropped.
> > >
> > > The caching should work for connectors that are using FLIP-27 interfaces,
> > > we should not introduce new features for old interfaces.
> > >
> > > Best regards,
> > >
> > > Martijn
> > >
> > > On Thu, 12 May 2022 at 06:19, Александр Смирнов <sm...@gmail.com>
> > > wrote:
> > >
> > > > Hi Jark!
> > > >
> > > > Sorry for the late response. I would like to make some comments and
> > > > clarify my points.
> > > >
> > > > 1) I agree with your first statement. I think we can achieve both
> > > > advantages this way: put the Cache interface in flink-table-common,
> > > > but have implementations of it in flink-table-runtime. Therefore if a
> > > > connector developer wants to use existing cache strategies and their
> > > > implementations, he can just pass lookupConfig to the planner, but if
> > > > he wants to have his own cache implementation in his TableFunction, it
> > > > will be possible for him to use the existing interface for this
> > > > purpose (we can explicitly point this out in the documentation). In
> > > > this way all configs and metrics will be unified. WDYT?
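Such a split could look roughly like the following hedged sketch: an interface that could live in flink-table-common, and implementations (here a trivial map-based one) that would live in flink-table-runtime or in a connector. All names are hypothetical:

```java
import java.util.*;

/** Sketch of a cache interface a connector could implement or reuse. */
interface LookupCache<K, V> {
    /** Returns the cached rows for the key, or null on a cache miss. */
    Collection<V> getIfPresent(K key);

    /** Stores looked-up rows (possibly empty, to remember a "no match"). */
    void put(K key, Collection<V> rows);

    /** Current number of cached keys, e.g. for standard cache metrics. */
    long size();
}

/** A trivial unbounded implementation, just to make the sketch runnable. */
class MapLookupCache<K, V> implements LookupCache<K, V> {
    private final Map<K, Collection<V>> map = new HashMap<>();
    public Collection<V> getIfPresent(K key) { return map.get(key); }
    public void put(K key, Collection<V> rows) { map.put(key, rows); }
    public long size() { return map.size(); }
}
```

Because only the interface sits in the common module, the framework can report unified metrics against it while connectors stay free to plug in their own implementation.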
> > > >
> > > > > If a filter can prune 90% of data in the cache, we will have 90% of
> > > > lookup requests that can never be cached
> > > >
> > > > 2) Let me clarify the logic of the filters optimization in the case of an LRU cache.
> > > > It looks like Cache<RowData, Collection<RowData>>. Here we always
> > > > store the response of the dimension table in cache, even after
> > > > applying calc function. I.e. if there are no rows after applying
> > > > filters to the result of the 'eval' method of TableFunction, we store
> > > > the empty list by lookup keys. Therefore the cache line will be
> > > > filled, but will require much less memory (in bytes). I.e. we don't
> > > > completely filter keys, by which result was pruned, but significantly
> > > > reduce required memory to store this result. If the user knows about
> > > > this behavior, he can increase the 'max-rows' option before the start
> > > > of the job. But actually I came up with the idea that we can do this
> > > > automatically by using the 'maximumWeight' and 'weigher' methods of
> > > > GuavaCache [1]. Weight can be the size of the collection of rows
> > > > (value of cache). Therefore cache can automatically fit much more
> > > > records than before.
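The weigher idea can be sketched with plain java.util instead of Guava (hypothetical names; in the real proposal Guava's CacheBuilder with maximumWeight/weigher would do this bookkeeping):

```java
import java.util.*;

/** Sketch of a weight-bounded LRU cache: weight = number of joined rows per key. */
class WeightedLruCache<K, V> {
    private final long maxWeight;
    private long currentWeight = 0L;
    // access-order LinkedHashMap iterates least-recently-used entries first
    private final LinkedHashMap<K, List<V>> map = new LinkedHashMap<>(16, 0.75f, true);

    WeightedLruCache(long maxWeight) { this.maxWeight = maxWeight; }

    List<V> get(K key) { return map.get(key); }

    void put(K key, List<V> rows) {
        List<V> old = map.put(key, rows);
        if (old != null) { currentWeight -= weigh(old); }
        currentWeight += weigh(rows);
        // Evict least-recently-used entries until back under the weight budget.
        Iterator<Map.Entry<K, List<V>>> it = map.entrySet().iterator();
        while (currentWeight > maxWeight && it.hasNext()) {
            Map.Entry<K, List<V>> eldest = it.next();
            if (eldest.getKey().equals(key)) { continue; } // never evict the entry just added
            currentWeight -= weigh(eldest.getValue());
            it.remove();
        }
    }

    /** Weight = row count; an empty (fully filtered) result still costs 1 unit. */
    private long weigh(List<V> rows) { return Math.max(1, rows.size()); }
}
```

With weight equal to the number of rows, a fully filtered (empty) result costs only one unit, so the same budget fits far more keys than a plain per-key max-rows limit.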
> > > >
> > > > > Flink SQL has provided a standard way to do filters and projects
> > > > pushdown, i.e., SupportsFilterPushDown and SupportsProjectionPushDown.
> > > > > Jdbc/hive/HBase haven't implemented the interfaces, don't mean it's
> > hard
> > > > to implement.
> > > >
> > > > It's debatable how difficult it will be to implement filter pushdown.
> > > > But I think the fact that currently there is no database connector
> > > > with filter pushdown at least means that this feature won't be
> > > > supported soon in connectors. Moreover, if we talk about other
> > > > connectors (not in Flink repo), their databases might not support all
> > > > Flink filters (or not support filters at all). I think users are
> > > > interested in supporting cache filters optimization  independently of
> > > > supporting other features and solving more complex problems (or
> > > > unsolvable at all).
> > > >
> > > > 3) I agree with your third statement. Actually in our internal version
> > > > I also tried to unify the logic of scanning and reloading data from
> > > > connectors. But unfortunately, I didn't find a way to unify the logic
> > > > of all ScanRuntimeProviders (InputFormat, SourceFunction, Source,...)
> > > > and reuse it in reloading ALL cache. As a result I settled on using
> > > > InputFormat, because it was used for scanning in all lookup
> > > > connectors. (I didn't know that there are plans to deprecate
> > > > InputFormat in favor of FLIP-27 Source). IMO usage of FLIP-27 source
> > > > in ALL caching is not good idea, because this source was designed to
> > > > work in distributed environment (SplitEnumerator on JobManager and
> > > > SourceReaders on TaskManagers), not in one operator (lookup join
> > > > operator in our case). There is even no direct way to pass splits from
> > > > SplitEnumerator to SourceReader (this logic works through
> > > > SplitEnumeratorContext, which requires
> > > > OperatorCoordinator.SubtaskGateway to send AddSplitEvents). Usage of
> > > > InputFormat for the ALL cache seems much clearer and easier. But if
> > > > there are plans to refactor all connectors to FLIP-27, I have the
> > > > following ideas: maybe we can abandon the lookup join ALL cache in
> > > > favor of simple join with multiple scanning of batch source? The point
> > > > is that the only difference between lookup join ALL cache and simple
> > > > join with batch source is that in the first case scanning is performed
> > > > multiple times, in between which state (cache) is cleared (correct me
> > > > if I'm wrong). So what if we extend the functionality of simple join
> > > > to support state reloading + extend the functionality of scanning
> > > > batch source multiple times (this one should be easy with new FLIP-27
> > > > source, that unifies streaming/batch reading - we will need to change
> > > > only SplitEnumerator, which will pass splits again after some TTL).
> > > > WDYT? I must say that this looks like a long-term goal and will make
> > > > the scope of this FLIP even larger than you said. Maybe we can limit
> > > > ourselves to a simpler solution now (InputFormats).
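The simpler InputFormat-based ALL cache boils down to a periodic full reload, roughly like this sketch (hypothetical names; a real operator would drain an InputFormat rather than call a Supplier):

```java
import java.util.*;
import java.util.function.Supplier;

/** Sketch of an ALL cache: periodically reload the whole dimension table. */
class AllCache<K, R> {
    private Map<K, List<R>> snapshot = Collections.emptyMap();
    private final Supplier<Map<K, List<R>>> scanAll; // stands in for draining an InputFormat
    private final long ttlMillis;
    private long lastReload = Long.MIN_VALUE;

    AllCache(Supplier<Map<K, List<R>>> scanAll, long ttlMillis) {
        this.scanAll = scanAll;
        this.ttlMillis = ttlMillis;
    }

    List<R> get(K key, long nowMillis) {
        if (lastReload == Long.MIN_VALUE || nowMillis - lastReload >= ttlMillis) {
            snapshot = scanAll.get(); // full re-scan replaces the previous state
            lastReload = nowMillis;
        }
        return snapshot.getOrDefault(key, Collections.emptyList());
    }
}
```

This is exactly the "scan again after some TTL, clearing the state in between" behavior described above, just collapsed into a single operator-local structure.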
> > > >
> > > > So to sum up, my points is like this:
> > > > 1) There is a way to make both concise and flexible interfaces for
> > > > caching in lookup join.
> > > > 2) Cache filters optimization is important both in LRU and ALL caches.
> > > > 3) It is unclear when filter pushdown will be supported in Flink
> > > > connectors, some of the connectors might not have the opportunity to
> > > > support filter pushdown + as I know, currently filter pushdown works
> > > > only for scanning (not lookup). So cache filters + projections
> > > > optimization should be independent from other features.
> > > > 4) ALL cache realization is a complex topic that involves multiple
> > > > aspects of how Flink is evolving. Moving away from InputFormat in favor
> > > > of FLIP-27 Source will make ALL cache realization really complex and
> > > > not clear, so maybe instead of that we can extend the functionality of
> > > > simple join, or keep InputFormat in the case of the lookup join ALL
> > > > cache?
> > > >
> > > > Best regards,
> > > > Smirnov Alexander
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > [1]
> > > >
> > https://guava.dev/releases/18.0/api/docs/com/google/common/cache/CacheBuilder.html#weigher(com.google.common.cache.Weigher)
> > > >
> > > > чт, 5 мая 2022 г. в 20:34, Jark Wu <im...@gmail.com>:
> > > > >
> > > > > It's great to see the active discussion! I want to share my ideas:
> > > > >
> > > > > 1) implement the cache in framework vs. connectors base
> > > > > I don't have a strong opinion on this. Both ways should work (e.g.,
> > cache
> > > > > pruning, compatibility).
> > > > > The framework way can provide more concise interfaces.
> > > > > The connector base way can define more flexible cache
> > > > > strategies/implementations.
> > > > > We are still investigating a way to see if we can have both
> > advantages.
> > > > > We should reach a consensus that the way should be a final state,
> > and we
> > > > > are on the path to it.
> > > > >
> > > > > 2) filters and projections pushdown:
> > > > > I agree with Alex that the filter pushdown into cache can benefit a
> > lot
> > > > for
> > > > > ALL cache.
> > > > > However, this is not true for LRU cache. Connectors use cache to
> > reduce
> > > > IO
> > > > > requests to databases for better throughput.
> > > > > If a filter can prune 90% of data in the cache, we will have 90% of
> > > > lookup
> > > > > requests that can never be cached
> > > > > and hit directly to the databases. That means the cache is
> > meaningless in
> > > > > this case.
> > > > >
> > > > > IMO, Flink SQL has provided a standard way to do filters and projects
> > > > > pushdown, i.e., SupportsFilterPushDown and
> > SupportsProjectionPushDown.
> > > > > Jdbc/hive/HBase haven't implemented the interfaces, which doesn't mean it's
> > hard
> > > > to
> > > > > implement.
> > > > > They should implement the pushdown interfaces to reduce IO and the
> > cache
> > > > > size.
> > > > > That should be a final state that the scan source and lookup source
> > share
> > > > > the exact pushdown implementation.
> > > > > I don't see why we need to duplicate the pushdown logic in caches,
> > which
> > > > > will complex the lookup join design.
> > > > >
> > > > > 3) ALL cache abstraction
> > > > > All cache might be the most challenging part of this FLIP. We have
> > never
> > > > > provided a reload-lookup public interface.
> > > > > Currently, we put the reload logic in the "eval" method of
> > TableFunction.
> > > > > That's hard for some sources (e.g., Hive).
> > > > > Ideally, connector implementation should share the logic of reload
> > and
> > > > > scan, i.e. ScanTableSource with InputFormat/SourceFunction/FLIP-27
> > > > Source.
> > > > > However, InputFormat/SourceFunction are deprecated, and the FLIP-27
> > > > source
> > > > > is deeply coupled with SourceOperator.
> > > > > If we want to invoke the FLIP-27 source in LookupJoin, this may make
> > the
> > > > > scope of this FLIP much larger.
> > > > > We are still investigating how to abstract the ALL cache logic and
> > reuse
> > > > > the existing source interfaces.
> > > > >
> > > > >
> > > > > Best,
> > > > > Jark
> > > > >
> > > > >
> > > > >
> > > > > On Thu, 5 May 2022 at 20:22, Roman Boyko <ro...@gmail.com>
> > wrote:
> > > > >
> > > > > > It's a much more complicated activity and lies outside the scope of
> > this
> > > > > > improvement. Because such pushdowns should be done for all
> > > > ScanTableSource
> > > > > > implementations (not only for Lookup ones).
> > > > > >
> > > > > > On Thu, 5 May 2022 at 19:02, Martijn Visser <
> > martijnvisser@apache.org>
> > > > > > wrote:
> > > > > >
> > > > > >> Hi everyone,
> > > > > >>
> > > > > >> One question regarding "And Alexander correctly mentioned that
> > filter
> > > > > >> pushdown still is not implemented for jdbc/hive/hbase." -> Would
> > an
> > > > > >> alternative solution be to actually implement these filter
> > pushdowns?
> > > > I
> > > > > >> can
> > > > > >> imagine that there are many more benefits to doing that, outside
> > of
> > > > lookup
> > > > > >> caching and metrics.
> > > > > >>
> > > > > >> Best regards,
> > > > > >>
> > > > > >> Martijn Visser
> > > > > >> https://twitter.com/MartijnVisser82
> > > > > >> https://github.com/MartijnVisser
> > > > > >>
> > > > > >>
> > > > > >> On Thu, 5 May 2022 at 13:58, Roman Boyko <ro...@gmail.com>
> > > > wrote:
> > > > > >>
> > > > > >> > Hi everyone!
> > > > > >> >
> > > > > >> > Thanks for driving such a valuable improvement!
> > > > > >> >
> > > > > >> > I do think that a single cache implementation would be a nice
> > > > opportunity
> > > > > >> for
> > > > > >> > users. And it will break the "FOR SYSTEM_TIME AS OF proc_time"
> > > > semantics
> > > > > >> > anyway - no matter how it is implemented.
> > > > > >> >
> > > > > >> > Putting myself in the user's shoes, I can say that:
> > > > > >> > 1) I would prefer to have the opportunity to cut off the cache
> > size
> > > > by
> > > > > >> > simply filtering unnecessary data. And the most handy way to do
> > it
> > > > is
> > > > > >> to apply
> > > > > >> > it inside LookupRunners. It would be a bit harder to pass it
> > > > through the
> > > > > >> > LookupJoin node to TableFunction. And Alexander correctly
> > mentioned
> > > > that
> > > > > >> > filter pushdown still is not implemented for jdbc/hive/hbase.
> > > > > >> > 2) The ability to set the different caching parameters for
> > different
> > > > > >> tables
> > > > > >> > is quite important. So I would prefer to set it through DDL
> > rather
> > > > than
> > > > > >> > have similar TTL, strategy and other options for all lookup
> > tables.
> > > > > >> > 3) Putting the cache into the framework really deprives us of
> > > > > >> > extensibility (users won't be able to implement their own
> > cache).
> > > > But
> > > > > >> most
> > > > > >> > probably it might be solved by creating more different cache
> > > > strategies
> > > > > >> and
> > > > > >> > a wider set of configurations.
> > > > > >> >
> > > > > >> > All these points are much closer to the schema proposed by
> > > > Alexander.
> > > > > >> > Qingsheng Ren, please correct me if I'm not right and all these
> > > > > >> facilities
> > > > > >> > might be simply implemented in your architecture?
> > > > > >> >
> > > > > >> > Best regards,
> > > > > >> > Roman Boyko
> > > > > >> > e.: ro.v.boyko@gmail.com
> > > > > >> >
> > > > > >> > On Wed, 4 May 2022 at 21:01, Martijn Visser <
> > > > martijnvisser@apache.org>
> > > > > >> > wrote:
> > > > > >> >
> > > > > >> > > Hi everyone,
> > > > > >> > >
> > > > > >> > > I don't have much to chip in, but just wanted to express that
> > I
> > > > really
> > > > > >> > > appreciate the in-depth discussion on this topic and I hope
> > that
> > > > > >> others
> > > > > >> > > will join the conversation.
> > > > > >> > >
> > > > > >> > > Best regards,
> > > > > >> > >
> > > > > >> > > Martijn
> > > > > >> > >
> > > > > >> > > On Tue, 3 May 2022 at 10:15, Александр Смирнов <
> > > > smiralexan@gmail.com>
> > > > > >> > > wrote:
> > > > > >> > >
> > > > > >> > > > Hi Qingsheng, Leonard and Jark,
> > > > > >> > > >
> > > > > >> > > > Thanks for your detailed feedback! However, I have questions
> > > > about
> > > > > >> > > > some of your statements (maybe I didn't get something?).
> > > > > >> > > >
> > > > > >> > > > > Caching actually breaks the semantic of "FOR SYSTEM_TIME
> > AS OF
> > > > > >> > > proc_time”
> > > > > >> > > >
> > > > > >> > > > I agree that the semantics of "FOR SYSTEM_TIME AS OF
> > proc_time"
> > > > is
> > > > > >> not
> > > > > >> > > > fully implemented with caching, but as you said, users go
> > on it
> > > > > >> > > > consciously to achieve better performance (no one proposed
> > to
> > > > enable
> > > > > >> > > > caching by default, etc.). Or by users do you mean other
> > > > developers
> > > > > >> of
> > > > > >> > > > connectors? In this case developers explicitly specify
> > whether
> > > > their
> > > > > >> > > > connector supports caching or not (in the list of supported
> > > > > >> options),
> > > > > >> > > > no one makes them do that if they don't want to. So what
> > > > exactly is
> > > > > >> > > > the difference between implementing caching in modules
> > > > > >> > > > flink-table-runtime and in flink-table-common from the
> > > > considered
> > > > > >> > > > point of view? How does it affect breaking/non-breaking
> > the
> > > > > >> > > > semantics of "FOR SYSTEM_TIME AS OF proc_time"?
> > > > > >> > > >
> > > > > >> > > > > confront a situation that allows table options in DDL to
> > > > control
> > > > > >> the
> > > > > >> > > > behavior of the framework, which has never happened
> > previously
> > > > and
> > > > > >> > should
> > > > > >> > > > be cautious
> > > > > >> > > >
> > > > > >> > > > If we talk about main differences of semantics of DDL
> > options
> > > > and
> > > > > >> > > > config options("table.exec.xxx"), isn't it about limiting
> > the
> > > > scope
> > > > > >> of
> > > > > >> > > > the options + importance for the user business logic rather
> > than
> > > > > >> > > > specific location of corresponding logic in the framework? I
> > > > mean
> > > > > >> that
> > > > > >> > > > in my design, for example, putting an option with lookup
> > cache
> > > > > >> > > > strategy in configurations would be the wrong decision,
> > > > because it
> > > > > >> > > > directly affects the user's business logic (not just
> > performance
> > > > > >> > > > optimization) + touches just several functions of ONE table
> > > > (there
> > > > > >> can
> > > > > >> > > > be multiple tables with different caches). Does it really
> > > > matter for
> > > > > >> > > > the user (or someone else) where the logic is located,
> > which is
> > > > > >> > > > affected by the applied option?
> > > > > >> > > > Also I can remember DDL option 'sink.parallelism', which in
> > > > some way
> > > > > >> > > > "controls the behavior of the framework" and I don't see any
> > > > problem
> > > > > >> > > > here.
> > > > > >> > > >
> > > > > >> > > > > introduce a new interface for this all-caching scenario
> > and
> > > > the
> > > > > >> > design
> > > > > >> > > > would become more complex
> > > > > >> > > >
> > > > > >> > > > This is a subject for a separate discussion, but actually
> > in our
> > > > > >> > > > internal version we solved this problem quite easily - we
> > reused
> > > > > >> > > > InputFormat class (so there is no need for a new API). The
> > > > point is
> > > > > >> > > > that currently all lookup connectors use InputFormat for
> > > > scanning
> > > > > >> the
> > > > > >> > > > data in batch mode: HBase, JDBC and even Hive - it uses
> > class
> > > > > >> > > > PartitionReader, that is actually just a wrapper around
> > > > InputFormat.
> > > > > >> > > > The advantage of this solution is the ability to reload
> > cache
> > > > data
> > > > > >> in
> > > > > >> > > > parallel (number of threads depends on number of
> > InputSplits,
> > > > but
> > > > > >> has
> > > > > >> > > > an upper limit). As a result cache reload time significantly
> > > > reduces
> > > > > >> > > > (as well as time of input stream blocking). I know that
> > usually
> > > > we
> > > > > >> try
> > > > > >> > > > to avoid usage of concurrency in Flink code, but maybe this
> > one
> > > > can
> > > > > >> be
> > > > > >> > > > an exception. BTW I don't say that it's an ideal solution,
> > maybe
> > > > > >> there
> > > > > >> > > > are better ones.
> > > > > >> > > >
> > > > > >> > > > > Providing the cache in the framework might introduce
> > > > compatibility
> > > > > >> > > issues
> > > > > >> > > >
> > > > > >> > > > It's possible only in cases when the developer of the
> > connector
> > > > > >> won't
> > > > > >> > > > properly refactor his code and will use new cache options
> > > > > >> incorrectly
> > > > > >> > > > (i.e. explicitly provide the same options into 2 different
> > code
> > > > > >> > > > places). For correct behavior all he will need to do is to
> > > > redirect
> > > > > >> > > > existing options to the framework's LookupConfig (+ maybe
> > add an
> > > > > >> alias
> > > > > >> > > > for options, if there was different naming), everything
> > will be
> > > > > >> > > > transparent for users. If the developer won't do
> > refactoring at
> > > > all,
> > > > > >> > > > nothing will be changed for the connector because of
> > backward
> > > > > >> > > > compatibility. Also if a developer wants to use his own
> > cache
> > > > logic,
> > > > > >> > > > he just can refuse to pass some of the configs into the
> > > > framework,
> > > > > >> and
> > > > > >> > > > instead make his own implementation with already existing
> > > > configs
> > > > > >> and
> > > > > >> > > > metrics (but actually I think that it's a rare case).
> > > > > >> > > >
> > > > > >> > > > > filters and projections should be pushed all the way down
> > to
> > > > the
> > > > > >> > table
> > > > > >> > > > function, like what we do in the scan source
> > > > > >> > > >
> > > > > >> > > > It's the great purpose. But the truth is that the ONLY
> > connector
> > > > > >> that
> > > > > >> > > > supports filter pushdown is FileSystemTableSource
> > > > > >> > > > (no database connector supports it currently). Also for some
> > > > > >> databases
> > > > > >> > > > it's simply impossible to pushdown such complex filters
> > that we
> > > > have
> > > > > >> > > > in Flink.
> > > > > >> > > >
> > > > > >> > > > >  only applying these optimizations to the cache seems not
> > > > quite
> > > > > >> > useful
> > > > > >> > > >
> > > > > >> > > > Filters can cut off an arbitrarily large amount of data
> > from the
> > > > > >> > > > dimension table. For a simple example, suppose in dimension
> > > > table
> > > > > >> > > > 'users'
> > > > > >> > > > we have column 'age' with values from 20 to 40, and input
> > stream
> > > > > >> > > > 'clicks' that is ~uniformly distributed by age of users. If
> > we
> > > > have
> > > > > >> > > > filter 'age > 30',
> > > > > >> > > > there will be twice less data in cache. This means the user
> > can
> > > > > >> > > > increase 'lookup.cache.max-rows' by almost 2 times. It will
> > > > gain a
> > > > > >> > > > huge
> > > > > >> > > > performance boost. Moreover, this optimization starts to
> > really
> > > > > >> shine
> > > > > >> > > > in 'ALL' cache, where tables without filters and projections
> > > > can't
> > > > > >> fit
> > > > > >> > > > in memory, but with them - can. This opens up additional
> > > > > >> possibilities
> > > > > >> > > > for users. And this doesn't sound as 'not quite useful'.
> > > > > >> > > >
> > > > > >> > > > It would be great to hear other voices regarding this topic!
> > > > Because
> > > > > >> > > > we have quite a lot of controversial points, and I think
> > with
> > > > the
> > > > > >> help
> > > > > >> > > > of others it will be easier for us to come to a consensus.
> > > > > >> > > >
> > > > > >> > > > Best regards,
> > > > > >> > > > Smirnov Alexander
> > > > > >> > > >
> > > > > >> > > >
> > > > > >> > > > Fri, 29 Apr 2022 at 22:33, Qingsheng Ren <
> > renqschn@gmail.com
> > > > >:
> > > > > >> > > > >
> > > > > >> > > > > Hi Alexander and Arvid,
> > > > > >> > > > >
> > > > > >> > > > > Thanks for the discussion and sorry for my late response!
> > We
> > > > had
> > > > > >> an
> > > > > >> > > > internal discussion together with Jark and Leonard and I’d
> > like
> > > > to
> > > > > >> > > > summarize our ideas. Instead of implementing the cache
> > logic in
> > > > the
> > > > > >> > table
> > > > > >> > > > runtime layer or wrapping around the user-provided table
> > > > function,
> > > > > >> we
> > > > > >> > > > prefer to introduce some new APIs extending TableFunction
> > with
> > > > these
> > > > > >> > > > concerns:
> > > > > >> > > > >
> > > > > >> > > > > 1. Caching actually breaks the semantic of "FOR
> > SYSTEM_TIME
> > > > AS OF
> > > > > >> > > > proc_time”, because it couldn’t truly reflect the content
> > of the
> > > > > >> lookup
> > > > > >> > > > table at the moment of querying. If users choose to enable
> > > > caching
> > > > > >> on
> > > > > >> > the
> > > > > >> > > > lookup table, they implicitly indicate that this breakage is
> > > > > >> acceptable
> > > > > >> > > in
> > > > > >> > > > exchange for the performance. So we prefer not to provide
> > > > caching on
> > > > > >> > the
> > > > > >> > > > table runtime level.
> > > > > >> > > > >
> > > > > >> > > > > 2. If we make the cache implementation in the framework
> > > > (whether
> > > > > >> in a
> > > > > >> > > > runner or a wrapper around TableFunction), we have to
> > confront a
> > > > > >> > > situation
> > > > > >> > > > that allows table options in DDL to control the behavior of
> > the
> > > > > >> > > framework,
> > > > > >> > > > which has never happened previously and should be cautious.
> > > > Under
> > > > > >> the
> > > > > >> > > > current design the behavior of the framework should only be
> > > > > >> specified
> > > > > >> > by
> > > > > >> > > > configurations (“table.exec.xxx”), and it’s hard to apply
> > these
> > > > > >> general
> > > > > >> > > > configs to a specific table.
> > > > > >> > > > >
> > > > > >> > > > > 3. We have use cases that lookup source loads and refresh
> > all
> > > > > >> records
> > > > > >> > > > periodically into the memory to achieve high lookup
> > performance
> > > > > >> (like
> > > > > >> > > Hive
> > > > > >> > > > connector in the community, and also widely used by our
> > internal
> > > > > >> > > > connectors). Wrapping the cache around the user’s
> > TableFunction
> > > > > >> works
> > > > > >> > > fine
> > > > > >> > > > for LRU caches, but I think we have to introduce a new
> > > > interface for
> > > > > >> > this
> > > > > >> > > > all-caching scenario and the design would become more
> > complex.
> > > > > >> > > > >
> > > > > >> > > > > 4. Providing the cache in the framework might introduce
> > > > > >> compatibility
> > > > > >> > > > issues to existing lookup sources like there might exist two
> > > > caches
> > > > > >> > with
> > > > > >> > > > totally different strategies if the user incorrectly
> > configures
> > > > the
> > > > > >> > table
> > > > > >> > > > (one in the framework and another implemented by the lookup
> > > > source).
> > > > > >> > > > >
> > > > > >> > > > > As for the optimization mentioned by Alexander, I think
> > > > filters
> > > > > >> and
> > > > > >> > > > projections should be pushed all the way down to the table
> > > > function,
> > > > > >> > like
> > > > > >> > > > what we do in the scan source, instead of the runner with
> > the
> > > > cache.
> > > > > >> > The
> > > > > >> > > > goal of using cache is to reduce the network I/O and
> > pressure
> > > > on the
> > > > > >> > > > external system, and only applying these optimizations to
> > the
> > > > cache
> > > > > >> > seems
> > > > > >> > > > not quite useful.
> > > > > >> > > > >
> > > > > >> > > > > I made some updates to the FLIP[1] to reflect our ideas.
> > We
> > > > > >> prefer to
> > > > > >> > > > keep the cache implementation as a part of TableFunction,
> > and we
> > > > > >> could
> > > > > >> > > > provide some helper classes (CachingTableFunction,
> > > > > >> > > AllCachingTableFunction,
> > > > > >> > > > CachingAsyncTableFunction) to developers and regulate
> > metrics
> > > > of the
> > > > > >> > > cache.
> > > > > >> > > > Also, I made a POC[2] for your reference.
> > > > > >> > > > >
> > > > > >> > > > > Looking forward to your ideas!
> > > > > >> > > > >
> > > > > >> > > > > [1]
> > > > > >> > > >
> > > > > >> > >
> > > > > >> >
> > > > > >>
> > > >
> > https://cwiki.apache.org/confluence/display/FLINK/FLIP-221+Abstraction+for+lookup+source+cache+and+metric
> > > > > >> > > > > [2] https://github.com/PatrickRen/flink/tree/FLIP-221
> > > > > >> > > > >
> > > > > >> > > > > Best regards,
> > > > > >> > > > >
> > > > > >> > > > > Qingsheng
> > > > > >> > > > >
> > > > > >> > > > > On Tue, Apr 26, 2022 at 4:45 PM Александр Смирнов <
> > > > > >> > > smiralexan@gmail.com>
> > > > > >> > > > wrote:
> > > > > >> > > > >>
> > > > > >> > > > >> Thanks for the response, Arvid!
> > > > > >> > > > >>
> > > > > >> > > > >> I have few comments on your message.
> > > > > >> > > > >>
> > > > > >> > > > >> > but could also live with an easier solution as the
> > first
> > > > step:
> > > > > >> > > > >>
> > > > > >> > > > >> I think that these 2 ways are mutually exclusive
> > (originally
> > > > > >> > proposed
> > > > > >> > > > >> by Qingsheng and mine), because conceptually they follow
> > the
> > > > same
> > > > > >> > > > >> goal, but implementation details are different. If we
> > will
> > > > go one
> > > > > >> > way,
> > > > > >> > > > >> moving to another way in the future will mean deleting
> > > > existing
> > > > > >> code
> > > > > >> > > > >> and once again changing the API for connectors. So I
> > think we
> > > > > >> should
> > > > > >> > > > >> reach a consensus with the community about that and then
> > work
> > > > > >> > together
> > > > > >> > > > >> on this FLIP, i.e. divide the work on tasks for different
> > > > parts
> > > > > >> of
> > > > > >> > the
> > > > > >> > > > >> flip (for example, LRU cache unification / introducing
> > > > proposed
> > > > > >> set
> > > > > >> > of
> > > > > >> > > > >> metrics / further work…). WDYT, Qingsheng?
> > > > > >> > > > >>
> > > > > >> > > > >> > as the source will only receive the requests after
> > filter
> > > > > >> > > > >>
> > > > > >> > > > >> Actually if filters are applied to fields of the lookup
> > > > table, we
> > > > > >> > > > >> firstly must do requests, and only after that we can
> > filter
> > > > > >> > responses,
> > > > > >> > > > >> because lookup connectors don't have filter pushdown. So
> > if
> > > > > >> > filtering
> > > > > >> > > > >> is done before caching, there will be much less rows in
> > > > cache.
> > > > > >> > > > >>
> > > > > >> > > > >> > @Alexander unfortunately, your architecture is not
> > shared.
> > > > I
> > > > > >> don't
> > > > > >> > > > know the
> > > > > >> > > > >>
> > > > > >> > > > >> > solution to share images to be honest.
> > > > > >> > > > >>
> > > > > >> > > > >> Sorry for that, I’m a bit new to such kinds of
> > conversations
> > > > :)
> > > > > >> > > > >> I have no write access to the confluence, so I made a
> > Jira
> > > > issue,
> > > > > >> > > > >> where described the proposed changes in more details -
> > > > > >> > > > >> https://issues.apache.org/jira/browse/FLINK-27411.
> > > > > >> > > > >>
> > > > > >> > > > >> Will happy to get more feedback!
> > > > > >> > > > >>
> > > > > >> > > > >> Best,
> > > > > >> > > > >> Smirnov Alexander
> > > > > >> > > > >>
> > > > > >> > > > >> Mon, 25 Apr 2022 at 19:49, Arvid Heise <
> > arvid@apache.org>:
> > > > > >> > > > >> >
> > > > > >> > > > >> > Hi Qingsheng,
> > > > > >> > > > >> >
> > > > > >> > > > >> > Thanks for driving this; the inconsistency was not
> > > > satisfying
> > > > > >> for
> > > > > >> > > me.
> > > > > >> > > > >> >
> > > > > >> > > > >> > I second Alexander's idea though but could also live
> > with
> > > > an
> > > > > >> > easier
> > > > > >> > > > >> > solution as the first step: Instead of making caching
> > an
> > > > > >> > > > implementation
> > > > > >> > > > >> > detail of TableFunction X, rather devise a caching
> > layer
> > > > > >> around X.
> > > > > >> > > So
> > > > > >> > > > the
> > > > > >> > > > >> > proposal would be a CachingTableFunction that
> > delegates to
> > > > X in
> > > > > >> > case
> > > > > >> > > > of
> > > > > >> > > > >> > misses and else manages the cache. Lifting it into the
> > > > operator
> > > > > >> > > model
> > > > > >> > > > as
> > > > > >> > > > >> > proposed would be even better but is probably
> > unnecessary
> > > > in
> > > > > >> the
> > > > > >> > > > first step
> > > > > >> > > > >> > for a lookup source (as the source will only receive
> > the
> > > > > >> requests
> > > > > >> > > > after
> > > > > >> > > > >> > filter; applying projection may be more interesting to
> > save
> > > > > >> > memory).
> > > > > >> > > > >> >
> > > > > >> > > > >> > Another advantage is that all the changes of this FLIP
> > > > would be
> > > > > >> > > > limited to
> > > > > >> > > > >> > options, no need for new public interfaces. Everything
> > else
> > > > > >> > remains
> > > > > >> > > an
> > > > > >> > > > >> > implementation of Table runtime. That means we can
> > easily
> > > > > >> > > incorporate
> > > > > >> > > > the
> > > > > >> > > > >> > optimization potential that Alexander pointed out
> > later.
> > > > > >> > > > >> >
> > > > > >> > > > >> > @Alexander unfortunately, your architecture is not
> > shared.
> > > > I
> > > > > >> don't
> > > > > >> > > > know the
> > > > > >> > > > >> > solution to share images to be honest.
> > > > > >> > > > >> >
> > > > > >> > > > >> > On Fri, Apr 22, 2022 at 5:04 PM Александр Смирнов <
> > > > > >> > > > smiralexan@gmail.com>
> > > > > >> > > > >> > wrote:
> > > > > >> > > > >> >
> > > > > >> > > > >> > > Hi Qingsheng! My name is Alexander, I'm not a
> > committer
> > > > yet,
> > > > > >> but
> > > > > >> > > I'd
> > > > > >> > > > >> > > really like to become one. And this FLIP really
> > > > interested
> > > > > >> me.
> > > > > >> > > > >> > > Actually I have worked on a similar feature in my
> > > > company’s
> > > > > >> > Flink
> > > > > >> > > > >> > > fork, and we would like to share our thoughts on
> > this and
> > > > > >> make
> > > > > >> > > code
> > > > > >> > > > >> > > open source.
> > > > > >> > > > >> > >
> > > > > >> > > > >> > > I think there is a better alternative than
> > introducing an
> > > > > >> > abstract
> > > > > >> > > > >> > > class for TableFunction (CachingTableFunction). As
> > you
> > > > know,
> > > > > >> > > > >> > > TableFunction exists in the flink-table-common
> > module,
> > > > which
> > > > > >> > > > provides
> > > > > >> > > > >> > > only an API for working with tables – it’s very
> > > > convenient
> > > > > >> for
> > > > > >> > > > importing
> > > > > >> > > > >> > > in connectors. In turn, CachingTableFunction contains
> > > > logic
> > > > > >> for
> > > > > >> > > > >> > > runtime execution,  so this class and everything
> > > > connected
> > > > > >> with
> > > > > >> > it
> > > > > >> > > > >> > > should be located in another module, probably in
> > > > > >> > > > flink-table-runtime.
> > > > > >> > > > >> > > But this will require connectors to depend on another
> > > > module,
> > > > > >> > > which
> > > > > >> > > > >> > > contains a lot of runtime logic, which doesn’t sound
> > > > good.
> > > > > >> > > > >> > >
> > > > > >> > > > >> > > I suggest adding a new method ‘getLookupConfig’ to
> > > > > >> > > LookupTableSource
> > > > > >> > > > >> > > or LookupRuntimeProvider to allow connectors to only
> > pass
> > > > > >> > > > >> > > configurations to the planner, therefore they won’t
> > > > depend on
> > > > > >> > > > runtime
> > > > > >> > > > >> > > realization. Based on these configs planner will
> > > > construct a
> > > > > >> > > lookup
> > > > > >> > > > >> > > join operator with corresponding runtime logic
> > > > > >> (ProcessFunctions
> > > > > >> > > in
> > > > > >> > > > >> > > module flink-table-runtime). Architecture looks like
> > in
> > > > the
> > > > > >> > pinned
> > > > > >> > > > >> > > image (LookupConfig class there is actually yours
> > > > > >> CacheConfig).
> > > > > >> > > > >> > >
> > > > > >> > > > >> > > Classes in flink-table-planner, that will be
> > responsible
> > > > for
> > > > > >> > this
> > > > > >> > > –
> > > > > >> > > > >> > > CommonPhysicalLookupJoin and his inheritors.
> > > > > >> > > > >> > > Current classes for lookup join in
> > flink-table-runtime
> > > > -
> > > > > >> > > > >> > > LookupJoinRunner, AsyncLookupJoinRunner,
> > > > > >> > LookupJoinRunnerWithCalc,
> > > > > >> > > > >> > > AsyncLookupJoinRunnerWithCalc.
> > > > > >> > > > >> > >
> > > > > >> > > > >> > > I suggest adding classes LookupJoinCachingRunner,
> > > > > >> > > > >> > > LookupJoinCachingRunnerWithCalc, etc.
> > > > > >> > > > >> > >
> > > > > >> > > > >> > > And here comes another more powerful advantage of
> > such a
> > > > > >> > solution.
> > > > > >> > > > If
> > > > > >> > > > >> > > we have caching logic on a lower level, we can apply
> > some
> > > > > >> > > > >> > > optimizations to it. LookupJoinRunnerWithCalc was
> > named
> > > > like
> > > > > >> > this
> > > > > >> > > > >> > > because it uses the ‘calc’ function, which actually
> > > > mostly
> > > > > >> > > consists
> > > > > >> > > > of
> > > > > >> > > > >> > > filters and projections.
> > > > > >> > > > >> > >
> > > > > >> > > > >> > > For example, in join table A with lookup table B
> > > > condition
> > > > > >> > ‘JOIN …
> > > > > >> > > > ON
> > > > > >> > > > >> > > A.id = B.id AND A.age = B.age + 10 WHERE B.salary >
> > 1000’
> > > > > >> > ‘calc’
> > > > > >> > > > >> > > function will contain filters A.age = B.age + 10 and
> > > > > >> B.salary >
> > > > > >> > > > 1000.
> > > > > >> > > > >> > >
> > > > > >> > > > >> > > If we apply this function before storing records in
> > > > cache,
> > > > > >> size
> > > > > >> > of
> > > > > >> > > > >> > > cache will be significantly reduced: filters = avoid
> > > > storing
> > > > > >> > > useless
> > > > > >> > > > >> > > records in cache, projections = reduce records’
> > size. So
> > > > the
> > > > > >> > > initial
> > > > > >> > > > >> > > max number of records in cache can be increased by
> > the
> > > > user.
> > > > > >> > > > >> > >
> > > > > >> > > > >> > > What do you think about it?
> > > > > >> > > > >> > >
> > > > > >> > > > >> > >
> > > > > >> > > > >> > > On 2022/04/19 02:47:11 Qingsheng Ren wrote:
> > > > > >> > > > >> > > > Hi devs,
> > > > > >> > > > >> > > >
> > > > > >> > > > >> > > > Yuan and I would like to start a discussion about
> > > > > >> FLIP-221[1],
> > > > > >> > > > which
> > > > > >> > > > >> > > introduces an abstraction of lookup table cache and
> > its
> > > > > >> standard
> > > > > >> > > > metrics.
> > > > > >> > > > >> > > >
> > > > > >> > > > >> > > > Currently each lookup table source should implement
> > > > their
> > > > > >> own
> > > > > >> > > > cache to
> > > > > >> > > > >> > > store lookup results, and there isn’t a standard of
> > > > metrics
> > > > > >> for
> > > > > >> > > > users and
> > > > > >> > > > >> > > developers to tuning their jobs with lookup joins,
> > which
> > > > is a
> > > > > >> > > quite
> > > > > >> > > > common
> > > > > >> > > > >> > > use case in Flink table / SQL.
> > > > > >> > > > >> > > >
> > > > > >> > > > >> > > > Therefore we propose some new APIs including cache,
> > > > > >> metrics,
> > > > > >> > > > wrapper
> > > > > >> > > > >> > > classes of TableFunction and new table options.
> > Please
> > > > take a
> > > > > >> > look
> > > > > >> > > > at the
> > > > > >> > > > >> > > FLIP page [1] to get more details. Any suggestions
> > and
> > > > > >> comments
> > > > > >> > > > would be
> > > > > >> > > > >> > > appreciated!
> > > > > >> > > > >> > > >
> > > > > >> > > > >> > > > [1]
> > > > > >> > > > >> > >
> > > > > >> > > >
> > > > > >> > >
> > > > > >> >
> > > > > >>
> > > >
> > https://cwiki.apache.org/confluence/display/FLINK/FLIP-221+Abstraction+for+lookup+source+cache+and+metric
> > > > > >> > > > >> > > >
> > > > > >> > > > >> > > > Best regards,
> > > > > >> > > > >> > > >
> > > > > >> > > > >> > > > Qingsheng
> > > > > >> > > > >> > > >
> > > > > >> > > > >> > > >
> > > > > >> > > > >> > >
> > > > > >> > > > >
> > > > > >> > > > >
> > > > > >> > > > >
> > > > > >> > > > > --
> > > > > >> > > > > Best Regards,
> > > > > >> > > > >
> > > > > >> > > > > Qingsheng Ren
> > > > > >> > > > >
> > > > > >> > > > > Real-time Computing Team
> > > > > >> > > > > Alibaba Cloud
> > > > > >> > > > >
> > > > > >> > > > > Email: renqschn@gmail.com
> > > > > >> > > >
> > > > > >> > >
> > > > > >> >
> > > > > >>
> > > > > >
> > > > > >
> > > > > > --
> > > > > > Best regards,
> > > > > > Roman Boyko
> > > > > > e.: ro.v.boyko@gmail.com
> > > > > >
> > > >
> >
> >


Re: [DISCUSS] FLIP-221 Abstraction for lookup source cache and metric

Posted by Jark Wu <im...@gmail.com>.
Hi Alex,

Thanks for summarizing your points.

In the past week, Qingsheng, Leonard, and I have discussed it several times
and we have totally refactored the design.
I'm glad to say we have reached a consensus on many of your points!
Qingsheng is still working on updating the design doc, and it may be
available in the next few days.
I will share some conclusions from our discussions:

1) we have refactored the design towards the "cache in framework" approach.

2) a "LookupCache" interface for users to customize, plus a default
implementation with a builder for ease of use.
This makes it possible to have both flexibility and conciseness.
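The interface-plus-builder idea could look roughly like the sketch below. All names here are illustrative assumptions, not the final FLIP-221 API; the LRU eviction is a simplified stand-in built on `LinkedHashMap` in access-order mode:

```java
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch of a customizable lookup cache interface. */
interface LookupCache<K, V> {
    Collection<V> getIfPresent(K key);
    void put(K key, Collection<V> rows);
}

/** A default implementation a builder could configure (max-rows LRU eviction). */
final class DefaultLookupCache<K, V> implements LookupCache<K, V> {
    private final Map<K, Collection<V>> cache;

    private DefaultLookupCache(int maxRows) {
        // Access-order LinkedHashMap gives simple LRU semantics;
        // removeEldestEntry enforces the max-rows bound.
        this.cache = new LinkedHashMap<K, Collection<V>>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, Collection<V>> eldest) {
                return size() > maxRows;
            }
        };
    }

    public static <K, V> Builder<K, V> builder() {
        return new Builder<>();
    }

    /** Builder keeps the user-facing configuration concise. */
    public static final class Builder<K, V> {
        private int maxRows = 10_000;

        public Builder<K, V> maximumSize(int maxRows) {
            this.maxRows = maxRows;
            return this;
        }

        public DefaultLookupCache<K, V> build() {
            return new DefaultLookupCache<>(maxRows);
        }
    }

    @Override
    public Collection<V> getIfPresent(K key) {
        return cache.get(key);
    }

    @Override
    public void put(K key, Collection<V> rows) {
        cache.put(key, rows);
    }
}
```

A connector that wants custom behavior would implement `LookupCache` directly; everyone else would use the builder.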

3) Filter pushdown is important for both the ALL and the LRU lookup cache,
especially for reducing I/O.
Filter pushdown should be the final state and the unified way to support
pruning both the ALL cache and the LRU cache,
so I think we should make an effort in this direction. If we need to support
filter pushdown for the ALL cache anyway, why not use
it for the LRU cache as well? Either way, since we have decided to implement
the cache in the framework, we have the chance to support
filtering on the cache at any time. This is an optimization and it doesn't
affect the public API. I think we can create a JIRA issue to
discuss it once the FLIP is accepted.
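The "filter on cache" optimization discussed above can be sketched as follows. This is a simplified stand-in, not the framework implementation: the predicate stands in for the join's pushed-down filter, and all names are hypothetical:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.function.Predicate;

/** Sketch: apply the join's filter before caching, so pruned rows never occupy cache memory. */
final class FilteringLookupCache<K, R> {
    private final Map<K, List<R>> cache = new HashMap<>();
    private final Predicate<R> filter;          // e.g. the WHERE clause on the lookup table
    private final Function<K, List<R>> loader;  // the actual lookup (JDBC query, HBase get, ...)

    FilteringLookupCache(Predicate<R> filter, Function<K, List<R>> loader) {
        this.filter = filter;
        this.loader = loader;
    }

    List<R> lookup(K key) {
        return cache.computeIfAbsent(key, k -> {
            List<R> filtered = new ArrayList<>();
            for (R row : loader.apply(k)) {
                if (filter.test(row)) {
                    filtered.add(row);
                }
            }
            // Even an empty result is cached, so repeated misses
            // still skip the round trip to the external system.
            return filtered;
        });
    }
}
```

Each key is fetched from the external system at most once, and only the rows surviving the filter are kept in memory.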

4) The idea for supporting the ALL cache is similar to your proposal.
In the first version, we will only support InputFormat and SourceFunction for
the ALL cache (invoking the InputFormat inside the join operator).
For the FLIP-27 source, we would need to join against a true source operator
instead of calling it embedded in the join operator.
However, this requires another FLIP to add re-scan ability to the FLIP-27
Source, which can be a large amount of work.
In order not to block this issue, we can defer FLIP-27 source
integration to future work and integrate
InputFormat & SourceFunction for now.

I think it's fine to use InputFormat & SourceFunction, as they are not
deprecated; otherwise, we would have to introduce another interface
similar to them, which would be pointless. We need to plan FLIP-27 source
integration ASAP, before InputFormat & SourceFunction are deprecated.

Best,
Jark





On Thu, 12 May 2022 at 15:46, Александр Смирнов <sm...@gmail.com>
wrote:

> Hi Martijn!
>
> Got it. So the implementation based on InputFormat is not an option.
> Thanks for clearing that up!
>
> Best regards,
> Smirnov Alexander
>
> Thu, 12 May 2022 at 14:23, Martijn Visser <ma...@ververica.com>:
> >
> > Hi,
> >
> > With regards to:
> >
> > > But if there are plans to refactor all connectors to FLIP-27
> >
> > Yes, FLIP-27 is the target for all connectors. The old interfaces will be
> > deprecated and connectors will either be refactored to use the new ones
> or
> > dropped.
> >
> > The caching should work for connectors that are using FLIP-27 interfaces,
> > we should not introduce new features for old interfaces.
> >
> > Best regards,
> >
> > Martijn
> >
> > On Thu, 12 May 2022 at 06:19, Александр Смирнов <sm...@gmail.com>
> > wrote:
> >
> > > Hi Jark!
> > >
> > > Sorry for the late response. I would like to make some comments and
> > > clarify my points.
> > >
> > > 1) I agree with your first statement. I think we can achieve both
> > > advantages this way: put the Cache interface in flink-table-common,
> > > but have implementations of it in flink-table-runtime. Therefore if a
> > > connector developer wants to use existing cache strategies and their
> > > implementations, he can just pass lookupConfig to the planner, but if
> > > he wants to have his own cache implementation in his TableFunction, it
> > > will be possible for him to use the existing interface for this
> > > purpose (we can explicitly point this out in the documentation). In
> > > this way all configs and metrics will be unified. WDYT?
> > >
> > > > If a filter can prune 90% of data in the cache, we will have 90% of
> > > lookup requests that can never be cached
> > >
> > > 2) Let me clarify the logic filters optimization in case of LRU cache.
> > > It looks like Cache<RowData, Collection<RowData>>. Here we always
> > > store the response of the dimension table in cache, even after
> > > applying the calc function. I.e. if no rows are left after applying
> > > filters to the result of the 'eval' method of TableFunction, we store
> > > an empty list under the lookup keys. Therefore the cache line will be
> > > filled, but will require much less memory (in bytes). I.e. we don't
> > > completely drop the keys whose results were pruned, but we significantly
> > > reduce the memory required to store those results. If the user knows about
> > > this behavior, he can increase the 'max-rows' option before the start
> > > of the job. But actually I came up with the idea that we can do this
> > > automatically by using the 'maximumWeight' and 'weigher' methods of
> > > Guava's CacheBuilder [1]. The weight can be the size of the collection
> > > of rows (the cache value). Therefore the cache can automatically fit
> > > many more records than before.
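The maximumWeight/weigher idea above can be sketched with plain collections. Guava's CacheBuilder provides `maximumWeight` and `weigher` natively; the stand-in below only illustrates the mechanism (weight = rows per entry), with all names assumed:

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/**
 * Sketch of weight-based eviction: each entry's weight is its row count,
 * so keys whose results were pruned to an empty list cost almost nothing.
 */
final class WeightedLookupCache<K> {
    // Access-order map: iteration starts from the least-recently-used entry.
    private final LinkedHashMap<K, List<?>> cache =
            new LinkedHashMap<>(16, 0.75f, true);
    private final long maxWeight;
    private long currentWeight = 0;

    WeightedLookupCache(long maxWeight) {
        this.maxWeight = maxWeight;
    }

    void put(K key, List<?> rows) {
        List<?> old = cache.put(key, rows);
        if (old != null) {
            currentWeight -= old.size();
        }
        currentWeight += rows.size();
        // Evict least-recently-used entries until the total row count fits the budget.
        Iterator<Map.Entry<K, List<?>>> it = cache.entrySet().iterator();
        while (currentWeight > maxWeight && it.hasNext()) {
            currentWeight -= it.next().getValue().size();
            it.remove();
        }
    }

    List<?> get(K key) {
        return cache.get(key);
    }

    int entryCount() {
        return cache.size();
    }
}
```

With row count as the weight, many small or empty entries can coexist under the same budget that would otherwise hold only a few large ones.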
> > >
> > > > Flink SQL has provided a standard way to do filters and projects
> > > pushdown, i.e., SupportsFilterPushDown and SupportsProjectionPushDown.
> > > > Jdbc/hive/HBase haven't implemented the interfaces, don't mean it's
> hard
> > > to implement.
> > >
> > > It's debatable how difficult it will be to implement filter pushdown.
> > > But I think the fact that currently there is no database connector
> > > with filter pushdown at least means that this feature won't be
> > > supported soon in connectors. Moreover, if we talk about other
> > > connectors (not in Flink repo), their databases might not support all
> > > Flink filters (or not support filters at all). I think users are
> > > interested in the cache filter optimization independently of
> > > support for other features, which involve more complex (or even
> > > unsolvable) problems.
> > >
> > > 3) I agree with your third statement. Actually in our internal version
> > > I also tried to unify the logic of scanning and reloading data from
> > > connectors. But unfortunately, I didn't find a way to unify the logic
> > > of all ScanRuntimeProviders (InputFormat, SourceFunction, Source,...)
> > > and reuse it in reloading ALL cache. As a result I settled on using
> > > InputFormat, because it was used for scanning in all lookup
> > > connectors. (I didn't know that there are plans to deprecate
> > > InputFormat in favor of FLIP-27 Source). IMO using the FLIP-27 source
> > > for ALL caching is not a good idea, because this source was designed to
> > > work in a distributed environment (SplitEnumerator on JobManager and
> > > SourceReaders on TaskManagers), not in one operator (lookup join
> > > operator in our case). There is even no direct way to pass splits from
> > > SplitEnumerator to SourceReader (this logic works through
> > > SplitEnumeratorContext, which requires
> > > OperatorCoordinator.SubtaskGateway to send AddSplitEvents). Usage of
> > > InputFormat for ALL cache seems much clearer and easier. But if
> > > there are plans to refactor all connectors to FLIP-27, I have the
> > > following idea: maybe we can drop the lookup join ALL cache in favor
> > > of a regular join that scans the batch source multiple times? The
> > > point is that the only difference between a lookup join with an ALL
> > > cache and a regular join with a batch source is that in the first
> > > case scanning is performed multiple times, with the state (cache)
> > > cleared in between (correct me if I'm wrong). So what if we extend
> > > the regular join to support state reloading + extend batch sources
> > > to support repeated scanning (the latter should be easy with the new
> > > FLIP-27 source, which unifies streaming/batch reading - we would
> > > only need to change the SplitEnumerator so that it passes the splits
> > > again after some TTL)? WDYT? I must say this looks like a long-term
> > > goal and would make the scope of this FLIP even larger than you
> > > said. Maybe we can limit ourselves to a simpler solution for now
> > > (InputFormats).
> > >
> > > So to sum up, my points are these:
> > > 1) There is a way to make the caching interfaces for lookup join
> > > both concise and flexible.
> > > 2) The cache filter optimization matters for both LRU and ALL caches.
> > > 3) It is unclear when filter pushdown will be supported in Flink
> > > connectors, some connectors might never be able to support it, and
> > > as far as I know filter pushdown currently works only for scanning
> > > (not lookup). So the cache filter + projection optimization should
> > > be independent of other features.
> > > 4) Implementing the ALL cache is a complex topic that touches
> > > multiple aspects of how Flink is evolving. Abandoning InputFormat in
> > > favor of the FLIP-27 Source would make the ALL cache implementation
> > > really complex and unclear, so maybe instead we can extend the
> > > regular join, or keep InputFormat for the lookup join ALL cache?
> > >
> > > Best regards,
> > > Smirnov Alexander
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > > [1]
> > >
> https://guava.dev/releases/18.0/api/docs/com/google/common/cache/CacheBuilder.html#weigher(com.google.common.cache.Weigher)
> > >
> > > Thu, 5 May 2022 at 20:34, Jark Wu <im...@gmail.com>:
> > > >
> > > > It's great to see the active discussion! I want to share my ideas:
> > > >
> > > > 1) implement the cache in framework vs. connectors base
> > > > I don't have a strong opinion on this. Both ways should work (e.g.,
> cache
> > > > pruning, compatibility).
> > > > The framework way can provide more concise interfaces.
> > > > The connector base way can define more flexible cache
> > > > strategies/implementations.
> > > > We are still investigating a way to see if we can have both
> advantages.
> > > > We should reach a consensus that the way should be a final state,
> and we
> > > > are on the path to it.
> > > >
> > > > 2) filters and projections pushdown:
> > > > I agree with Alex that the filter pushdown into cache can benefit a
> lot
> > > for
> > > > ALL cache.
> > > > However, this is not true for LRU cache. Connectors use cache to
> reduce
> > > IO
> > > > requests to databases for better throughput.
> > > > If a filter can prune 90% of data in the cache, we will have 90% of
> > > lookup
> > > > requests that can never be cached
> > > > and hit directly to the databases. That means the cache is
> meaningless in
> > > > this case.
> > > >
> > > > IMO, Flink SQL has provided a standard way to do filter and
> > > > projection pushdown, i.e., SupportsFilterPushDown and
> > > > SupportsProjectionPushDown.
> > > > The fact that JDBC/Hive/HBase haven't implemented the interfaces
> > > > doesn't mean they are hard to implement.
> > > > They should implement the pushdown interfaces to reduce IO and the
> > > > cache size.
> > > > The final state should be that the scan source and the lookup source
> > > > share the exact same pushdown implementation.
> > > > I don't see why we need to duplicate the pushdown logic in caches,
> > > > which would complicate the lookup join design.
> > > >
> > > > 3) ALL cache abstraction
> > > > The ALL cache might be the most challenging part of this FLIP. We
> > > > have never provided a public reload-lookup interface.
> > > > Currently, we put the reload logic in the "eval" method of
> TableFunction.
> > > > That's hard for some sources (e.g., Hive).
> > > > Ideally, connector implementation should share the logic of reload
> and
> > > > scan, i.e. ScanTableSource with InputFormat/SourceFunction/FLIP-27
> > > Source.
> > > > However, InputFormat/SourceFunction are deprecated, and the FLIP-27
> > > source
> > > > is deeply coupled with SourceOperator.
> > > > If we want to invoke the FLIP-27 source in LookupJoin, this may make
> the
> > > > scope of this FLIP much larger.
> > > > We are still investigating how to abstract the ALL cache logic and
> reuse
> > > > the existing source interfaces.
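To make the reload-lookup discussion concrete, here is a dependency-free sketch of the ALL-cache behavior being debated: the whole table is loaded into memory and reloaded from scratch once a TTL elapses. The class and method names are illustrative only, not the Flink interfaces under discussion:

```java
import java.util.*;
import java.util.function.Supplier;

/** Sketch of an ALL cache: the entire dimension table is loaded into memory
 *  and reloaded from scratch once the TTL elapses; lookups never hit the
 *  external system directly. */
class AllCache<K, V> {
    private final Supplier<Map<K, List<V>>> loader; // full-table scan, keyed by lookup key
    private final long ttlMillis;
    private Map<K, List<V>> snapshot = Collections.emptyMap();
    private boolean loaded = false;
    private long loadedAt = 0;

    AllCache(Supplier<Map<K, List<V>>> loader, long ttlMillis) {
        this.loader = loader;
        this.ttlMillis = ttlMillis;
    }

    List<V> lookup(K key, long nowMillis) {
        if (!loaded || nowMillis - loadedAt >= ttlMillis) { // stale: reload everything
            snapshot = loader.get();
            loaded = true;
            loadedAt = nowMillis;
        }
        return snapshot.getOrDefault(key, Collections.emptyList());
    }
}
```

The open question in the thread is precisely who implements `loader`: today that scan logic lives in the TableFunction's eval path, while ideally it would reuse the connector's existing scan implementation.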
> > > >
> > > >
> > > > Best,
> > > > Jark
> > > >
> > > >
> > > >
> > > > On Thu, 5 May 2022 at 20:22, Roman Boyko <ro...@gmail.com>
> wrote:
> > > >
> > > > > It's a much more complicated activity and lies out of the scope of
> this
> > > > > improvement. Because such pushdowns should be done for all
> > > ScanTableSource
> > > > > implementations (not only for Lookup ones).
> > > > >
> > > > > On Thu, 5 May 2022 at 19:02, Martijn Visser <
> martijnvisser@apache.org>
> > > > > wrote:
> > > > >
> > > > >> Hi everyone,
> > > > >>
> > > > >> One question regarding "And Alexander correctly mentioned that
> filter
> > > > >> pushdown still is not implemented for jdbc/hive/hbase." -> Would
> an
> > > > >> alternative solution be to actually implement these filter
> pushdowns?
> > > I
> > > > >> can
> > > > >> imagine that there are many more benefits to doing that, outside
> of
> > > lookup
> > > > >> caching and metrics.
> > > > >>
> > > > >> Best regards,
> > > > >>
> > > > >> Martijn Visser
> > > > >> https://twitter.com/MartijnVisser82
> > > > >> https://github.com/MartijnVisser
> > > > >>
> > > > >>
> > > > >> On Thu, 5 May 2022 at 13:58, Roman Boyko <ro...@gmail.com>
> > > wrote:
> > > > >>
> > > > >> > Hi everyone!
> > > > >> >
> > > > >> > Thanks for driving such a valuable improvement!
> > > > >> >
> > > > >> > I do think that a single cache implementation would be a nice
> > > > >> > opportunity for users. And it will break the "FOR SYSTEM_TIME AS
> > > > >> > OF proc_time" semantics anyway - no matter how it is implemented.
> > > > >> >
> > > > >> > Putting myself in the user's shoes, I can say that:
> > > > >> > 1) I would prefer to have the opportunity to cut down the cache
> > > > >> > size by simply filtering out unnecessary data, and the handiest
> > > > >> > way to do that is to apply the filters inside the LookupRunners.
> > > > >> > It would be a bit harder to pass them through the LookupJoin node
> > > > >> > to the TableFunction. And Alexander correctly mentioned that
> > > > >> > filter pushdown is still not implemented for JDBC/Hive/HBase.
> > > > >> > 2) The ability to set different caching parameters for different
> > > > >> > tables is quite important. So I would prefer to set them through
> > > > >> > DDL rather than have the same TTL, strategy and other options for
> > > > >> > all lookup tables.
> > > > >> > 3) Putting the cache into the framework deprives us of some
> > > > >> > extensibility (users won't be able to implement their own cache).
> > > > >> > But most probably that can be addressed by offering more cache
> > > > >> > strategies and a wider set of configurations.
> > > > >> >
> > > > >> > All these points are much closer to the scheme proposed by
> > > > >> > Alexander. Qingsheng Ren, please correct me if I'm wrong and all
> > > > >> > these facilities can easily be implemented in your architecture?
> > > > >> >
> > > > >> > Best regards,
> > > > >> > Roman Boyko
> > > > >> > e.: ro.v.boyko@gmail.com
> > > > >> >
> > > > >> > On Wed, 4 May 2022 at 21:01, Martijn Visser <
> > > martijnvisser@apache.org>
> > > > >> > wrote:
> > > > >> >
> > > > >> > > Hi everyone,
> > > > >> > >
> > > > >> > > I don't have much to chip in, but just wanted to express that
> I
> > > really
> > > > >> > > appreciate the in-depth discussion on this topic and I hope
> that
> > > > >> others
> > > > >> > > will join the conversation.
> > > > >> > >
> > > > >> > > Best regards,
> > > > >> > >
> > > > >> > > Martijn
> > > > >> > >
> > > > >> > > On Tue, 3 May 2022 at 10:15, Александр Смирнов <
> > > smiralexan@gmail.com>
> > > > >> > > wrote:
> > > > >> > >
> > > > >> > > > Hi Qingsheng, Leonard and Jark,
> > > > >> > > >
> > > > >> > > > Thanks for your detailed feedback! However, I have questions
> > > about
> > > > >> > > > some of your statements (maybe I didn't get something?).
> > > > >> > > >
> > > > >> > > > > Caching actually breaks the semantic of "FOR SYSTEM_TIME
> AS OF
> > > > >> > > proc_time”
> > > > >> > > >
> > > > >> > > > I agree that the semantics of "FOR SYSTEM_TIME AS OF
> proc_time"
> > > is
> > > > >> not
> > > > >> > > > fully implemented with caching, but as you said, users go
> on it
> > > > >> > > > consciously to achieve better performance (no one proposed
> to
> > > enable
> > > > >> > > > caching by default, etc.). Or by users do you mean other
> > > developers
> > > > >> of
> > > > >> > > > connectors? In this case developers explicitly specify
> whether
> > > their
> > > > >> > > > connector supports caching or not (in the list of supported
> > > > >> options),
> > > > >> > > > no one makes them do that if they don't want to. So what
> > > exactly is
> > > > >> > > > the difference between implementing caching in modules
> > > > >> > > > flink-table-runtime and in flink-table-common from the
> > > considered
> > > > >> > > > point of view? How does it affect whether the semantics of
> > > > >> > > > "FOR SYSTEM_TIME AS OF proc_time" are broken or not?
> > > > >> > > >
> > > > >> > > > > confront a situation that allows table options in DDL to
> > > control
> > > > >> the
> > > > >> > > > behavior of the framework, which has never happened
> previously
> > > and
> > > > >> > should
> > > > >> > > > be cautious
> > > > >> > > >
> > > > >> > > > If we talk about main differences of semantics of DDL
> options
> > > and
> > > > >> > > > config options("table.exec.xxx"), isn't it about limiting
> the
> > > scope
> > > > >> of
> > > > >> > > > the options + importance for the user business logic rather
> than
> > > > >> > > > specific location of corresponding logic in the framework? I
> > > mean
> > > > >> that
> > > > >> > > > in my design, for example, putting an option with lookup
> cache
> > > > >> > > > strategy in configurations would be the wrong decision,
> > > because it
> > > > >> > > > directly affects the user's business logic (not just
> performance
> > > > >> > > > optimization) + touches just several functions of ONE table
> > > (there
> > > > >> can
> > > > >> > > > be multiple tables with different caches). Does it really
> > > matter for
> > > > >> > > > the user (or someone else) where the logic is located,
> which is
> > > > >> > > > affected by the applied option?
> > > > >> > > > Also I can remember DDL option 'sink.parallelism', which in
> > > some way
> > > > >> > > > "controls the behavior of the framework" and I don't see any
> > > problem
> > > > >> > > > here.
> > > > >> > > >
> > > > >> > > > > introduce a new interface for this all-caching scenario
> and
> > > the
> > > > >> > design
> > > > >> > > > would become more complex
> > > > >> > > >
> > > > >> > > > This is a subject for a separate discussion, but actually
> in our
> > > > >> > > > internal version we solved this problem quite easily - we
> reused
> > > > >> > > > InputFormat class (so there is no need for a new API). The
> > > point is
> > > > >> > > > that currently all lookup connectors use InputFormat for
> > > scanning
> > > > >> the
> > > > >> > > > data in batch mode: HBase, JDBC and even Hive - it uses
> class
> > > > >> > > > PartitionReader, that is actually just a wrapper around
> > > InputFormat.
> > > > >> > > > The advantage of this solution is the ability to reload
> cache
> > > data
> > > > >> in
> > > > >> > > > parallel (number of threads depends on number of
> InputSplits,
> > > but
> > > > >> has
> > > > >> > > > an upper limit). As a result, the cache reload time (as well
> > > > >> > > > as the time the input stream is blocked) is significantly
> > > > >> > > > reduced. I know that we usually try to avoid concurrency in
> > > > >> > > > Flink code, but maybe this can be an exception. BTW I'm not
> > > > >> > > > saying it's an ideal solution; maybe there are better ones.
> > > > >> > > >
> > > > >> > > > > Providing the cache in the framework might introduce
> > > compatibility
> > > > >> > > issues
> > > > >> > > >
> > > > >> > > > That's possible only if the developer of the connector
> > > > >> > > > doesn't properly refactor their code and uses the new cache
> > > > >> > > > options incorrectly (i.e. explicitly passes the same options
> > > > >> > > > to two different places in the code). For correct behavior,
> > > > >> > > > all they need to do is redirect the existing options to the
> > > > >> > > > framework's LookupConfig (+ maybe add an alias for options
> > > > >> > > > that had different names); everything will be transparent for
> > > > >> > > > users. If the developer doesn't refactor at all, nothing
> > > > >> > > > changes for the connector thanks to backward compatibility.
> > > > >> > > > Also, if a developer wants to use their own cache logic, they
> > > > >> > > > can simply not pass some of the configs to the framework and
> > > > >> > > > instead provide their own implementation with the existing
> > > > >> > > > configs and metrics (but actually I think that's a rare case).
> > > > >> > > >
> > > > >> > > > > filters and projections should be pushed all the way down
> to
> > > the
> > > > >> > table
> > > > >> > > > function, like what we do in the scan source
> > > > >> > > >
> > > > >> > > > It's a great goal. But the truth is that the ONLY connector
> > > > >> > > > that supports filter pushdown is FileSystemTableSource (no
> > > > >> > > > database connector currently supports it). Also, for some
> > > > >> > > > databases it's simply impossible to push down filters as
> > > > >> > > > complex as the ones we have in Flink.
> > > > >> > > >
> > > > >> > > > >  only applying these optimizations to the cache seems not
> > > quite
> > > > >> > useful
> > > > >> > > >
> > > > >> > > > Filters can cut off an arbitrarily large amount of data from
> > > > >> > > > the dimension table. For a simple example, suppose the
> > > > >> > > > dimension table 'users' has a column 'age' with values from
> > > > >> > > > 20 to 40, and the input stream 'clicks' is roughly uniformly
> > > > >> > > > distributed by user age. With the filter 'age > 30' there
> > > > >> > > > will be half as much data in the cache. This means the user
> > > > >> > > > can almost double 'lookup.cache.max-rows', which gives a huge
> > > > >> > > > performance boost. Moreover, this optimization really shines
> > > > >> > > > with the 'ALL' cache, where tables that can't fit in memory
> > > > >> > > > without filters and projections can fit with them. This opens
> > > > >> > > > up additional possibilities for users, and it doesn't sound
> > > > >> > > > 'not quite useful' to me.
> > > > >> > > >
> > > > >> > > > It would be great to hear other voices regarding this topic!
> > > Because
> > > > >> > > > we have quite a lot of controversial points, and I think
> with
> > > the
> > > > >> help
> > > > >> > > > of others it will be easier for us to come to a consensus.
> > > > >> > > >
> > > > >> > > > Best regards,
> > > > >> > > > Smirnov Alexander
> > > > >> > > >
> > > > >> > > >
> > > > >> > > > Fri, 29 Apr 2022 at 22:33, Qingsheng Ren <renqschn@gmail.com>:
> > > > >> > > > >
> > > > >> > > > > Hi Alexander and Arvid,
> > > > >> > > > >
> > > > >> > > > > Thanks for the discussion and sorry for my late response!
> We
> > > had
> > > > >> an
> > > > >> > > > internal discussion together with Jark and Leonard and I’d
> like
> > > to
> > > > >> > > > summarize our ideas. Instead of implementing the cache
> logic in
> > > the
> > > > >> > table
> > > > >> > > > runtime layer or wrapping around the user-provided table
> > > function,
> > > > >> we
> > > > >> > > > prefer to introduce some new APIs extending TableFunction
> with
> > > these
> > > > >> > > > concerns:
> > > > >> > > > >
> > > > >> > > > > 1. Caching actually breaks the semantic of "FOR
> SYSTEM_TIME
> > > AS OF
> > > > >> > > > proc_time”, because it couldn’t truly reflect the content
> of the
> > > > >> lookup
> > > > >> > > > table at the moment of querying. If users choose to enable
> > > caching
> > > > >> on
> > > > >> > the
> > > > >> > > > lookup table, they implicitly indicate that this breakage is
> > > > >> acceptable
> > > > >> > > in
> > > > >> > > > exchange for the performance. So we prefer not to provide
> > > caching on
> > > > >> > the
> > > > >> > > > table runtime level.
> > > > >> > > > >
> > > > >> > > > > 2. If we make the cache implementation in the framework
> > > (whether
> > > > >> in a
> > > > >> > > > runner or a wrapper around TableFunction), we have to
> confront a
> > > > >> > > situation
> > > > >> > > > that allows table options in DDL to control the behavior of
> the
> > > > >> > > framework,
> > > > >> > > > which has never happened previously and should be cautious.
> > > Under
> > > > >> the
> > > > >> > > > current design the behavior of the framework should only be
> > > > >> specified
> > > > >> > by
> > > > >> > > > configurations (“table.exec.xxx”), and it’s hard to apply
> these
> > > > >> general
> > > > >> > > > configs to a specific table.
> > > > >> > > > >
> > > > >> > > > > 3. We have use cases that lookup source loads and refresh
> all
> > > > >> records
> > > > >> > > > periodically into the memory to achieve high lookup
> performance
> > > > >> (like
> > > > >> > > Hive
> > > > >> > > > connector in the community, and also widely used by our
> internal
> > > > >> > > > connectors). Wrapping the cache around the user’s
> TableFunction
> > > > >> works
> > > > >> > > fine
> > > > >> > > > for LRU caches, but I think we have to introduce a new
> > > interface for
> > > > >> > this
> > > > >> > > > all-caching scenario and the design would become more
> complex.
> > > > >> > > > >
> > > > >> > > > > 4. Providing the cache in the framework might introduce
> > > > >> compatibility
> > > > >> > > > issues to existing lookup sources like there might exist two
> > > caches
> > > > >> > with
> > > > >> > > > totally different strategies if the user incorrectly
> configures
> > > the
> > > > >> > table
> > > > >> > > > (one in the framework and another implemented by the lookup
> > > source).
> > > > >> > > > >
> > > > >> > > > > As for the optimization mentioned by Alexander, I think
> > > filters
> > > > >> and
> > > > >> > > > projections should be pushed all the way down to the table
> > > function,
> > > > >> > like
> > > > >> > > > what we do in the scan source, instead of the runner with
> the
> > > cache.
> > > > >> > The
> > > > >> > > > goal of using cache is to reduce the network I/O and
> pressure
> > > on the
> > > > >> > > > external system, and only applying these optimizations to
> the
> > > cache
> > > > >> > seems
> > > > >> > > > not quite useful.
> > > > >> > > > >
> > > > >> > > > > I made some updates to the FLIP[1] to reflect our ideas.
> We
> > > > >> prefer to
> > > > >> > > > keep the cache implementation as a part of TableFunction,
> and we
> > > > >> could
> > > > >> > > > provide some helper classes (CachingTableFunction,
> > > > >> > > AllCachingTableFunction,
> > > > >> > > > CachingAsyncTableFunction) to developers and regulate
> metrics
> > > of the
> > > > >> > > cache.
> > > > >> > > > Also, I made a POC[2] for your reference.
> > > > >> > > > >
> > > > >> > > > > Looking forward to your ideas!
> > > > >> > > > >
> > > > >> > > > > [1]
> > > > >> > > >
> > > > >> > >
> > > > >> >
> > > > >>
> > >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-221+Abstraction+for+lookup+source+cache+and+metric
> > > > >> > > > > [2] https://github.com/PatrickRen/flink/tree/FLIP-221
> > > > >> > > > >
> > > > >> > > > > Best regards,
> > > > >> > > > >
> > > > >> > > > > Qingsheng
> > > > >> > > > >
> > > > >> > > > > On Tue, Apr 26, 2022 at 4:45 PM Александр Смирнов <
> > > > >> > > smiralexan@gmail.com>
> > > > >> > > > wrote:
> > > > >> > > > >>
> > > > >> > > > >> Thanks for the response, Arvid!
> > > > >> > > > >>
> > > > >> > > > >> I have few comments on your message.
> > > > >> > > > >>
> > > > >> > > > >> > but could also live with an easier solution as the
> first
> > > step:
> > > > >> > > > >>
> > > > >> > > > >> I think these two approaches (the one originally proposed
> > > > >> > > > >> by Qingsheng and mine) are mutually exclusive, because
> > > > >> > > > >> conceptually they pursue the same goal but differ in
> > > > >> > > > >> implementation details. If we go one way, moving to the
> > > > >> > > > >> other in the future will mean deleting existing code and
> > > > >> > > > >> once again changing the API for connectors. So I think we
> > > > >> > > > >> should reach a consensus with the community on that and
> > > > >> > > > >> then work together on this FLIP, i.e. divide the work into
> > > > >> > > > >> tasks for different parts of the FLIP (for example, LRU
> > > > >> > > > >> cache unification / introducing the proposed set of
> > > > >> > > > >> metrics / further work…). WDYT, Qingsheng?
> > > > >> > > > >>
> > > > >> > > > >> > as the source will only receive the requests after
> filter
> > > > >> > > > >>
> > > > >> > > > >> Actually, if filters are applied to fields of the lookup
> > > > >> > > > >> table, we must first make the requests and only then
> > > > >> > > > >> filter the responses, because lookup connectors don't have
> > > > >> > > > >> filter pushdown. So if filtering is done before caching,
> > > > >> > > > >> there will be far fewer rows in the cache.
> > > > >> > > > >>
> > > > >> > > > >> > @Alexander unfortunately, your architecture is not
> shared.
> > > I
> > > > >> don't
> > > > >> > > > know the
> > > > >> > > > >>
> > > > >> > > > >> > solution to share images to be honest.
> > > > >> > > > >>
> > > > >> > > > >> Sorry for that, I’m a bit new to such kinds of
> conversations
> > > :)
> > > > >> > > > >> I have no write access to the confluence, so I made a
> Jira
> > > issue,
> > > > >> > > > >> where described the proposed changes in more details -
> > > > >> > > > >> https://issues.apache.org/jira/browse/FLINK-27411.
> > > > >> > > > >>
> > > > >> > > > >> Will happy to get more feedback!
> > > > >> > > > >>
> > > > >> > > > >> Best,
> > > > >> > > > >> Smirnov Alexander
> > > > >> > > > >>
> > > > >> > > > >> Mon, 25 Apr 2022 at 19:49, Arvid Heise <arvid@apache.org>:
> > > > >> > > > >> >
> > > > >> > > > >> > Hi Qingsheng,
> > > > >> > > > >> >
> > > > >> > > > >> > Thanks for driving this; the inconsistency was not
> > > satisfying
> > > > >> for
> > > > >> > > me.
> > > > >> > > > >> >
> > > > >> > > > >> > I second Alexander's idea, but could also live with an
> > > > >> > > > >> > easier solution as the first step: instead of making
> > > > >> > > > >> > caching an implementation detail of TableFunction X,
> > > > >> > > > >> > devise a caching layer around X. So the proposal would
> > > > >> > > > >> > be a CachingTableFunction that delegates to X on misses
> > > > >> > > > >> > and otherwise manages the cache. Lifting it into the
> > > > >> > > > >> > operator model as proposed would be even better, but is
> > > > >> > > > >> > probably unnecessary in the first step for a lookup
> > > > >> > > > >> > source (as the source will only receive the requests
> > > > >> > > > >> > after the filter; applying the projection may be more
> > > > >> > > > >> > interesting to save memory).
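The delegation idea above can be sketched in a few lines: a wrapper that calls the wrapped lookup function only on cache misses. This is a hand-rolled illustration, not the actual CachingTableFunction API; the class and field names are assumptions:

```java
import java.util.*;
import java.util.function.Function;

/** Illustrative caching layer around an arbitrary lookup function: on a
 *  cache miss it delegates to the wrapped function and stores the result;
 *  on a hit it serves the cached rows without an external request. */
class CachingLookup<K, V> {
    private final Function<K, List<V>> delegate; // the "real" lookup (TableFunction-like)
    private final Map<K, List<V>> cache = new HashMap<>();
    long hits = 0, misses = 0;                   // the metrics the FLIP wants standardized

    CachingLookup(Function<K, List<V>> delegate) { this.delegate = delegate; }

    List<V> lookup(K key) {
        List<V> cached = cache.get(key);
        if (cached != null) { hits++; return cached; }
        misses++;
        List<V> rows = delegate.apply(key); // delegate only on a miss
        cache.put(key, rows);               // empty results are cached too
        return rows;
    }
}
```

Because the wrapper never inspects how X works internally, the same layer (and its hit/miss metrics) applies uniformly to any connector's table function.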
> > > > >> > > > >> >
> > > > >> > > > >> > Another advantage is that all the changes of this FLIP
> > > would be
> > > > >> > > > limited to
> > > > >> > > > >> > options, no need for new public interfaces. Everything
> else
> > > > >> > remains
> > > > >> > > an
> > > > >> > > > >> > implementation of Table runtime. That means we can
> easily
> > > > >> > > incorporate
> > > > >> > > > the
> > > > >> > > > >> > optimization potential that Alexander pointed out
> later.
> > > > >> > > > >> >
> > > > >> > > > >> > @Alexander unfortunately, your architecture is not
> shared.
> > > I
> > > > >> don't
> > > > >> > > > know the
> > > > >> > > > >> > solution to share images to be honest.
> > > > >> > > > >> >
> > > > >> > > > >> > On Fri, Apr 22, 2022 at 5:04 PM Александр Смирнов <
> > > > >> > > > smiralexan@gmail.com>
> > > > >> > > > >> > wrote:
> > > > >> > > > >> >
> > > > >> > > > >> > > Hi Qingsheng! My name is Alexander, I'm not a
> committer
> > > yet,
> > > > >> but
> > > > >> > > I'd
> > > > >> > > > >> > > really like to become one. And this FLIP really
> > > interested
> > > > >> me.
> > > > >> > > > >> > > Actually I have worked on a similar feature in my
> > > company’s
> > > > >> > Flink
> > > > >> > > > >> > > fork, and we would like to share our thoughts on
> this and
> > > > >> make
> > > > >> > > code
> > > > >> > > > >> > > open source.
> > > > >> > > > >> > >
> > > > >> > > > >> > > I think there is a better alternative than
> introducing an
> > > > >> > abstract
> > > > >> > > > >> > > class for TableFunction (CachingTableFunction). As
> you
> > > know,
> > > > >> > > > >> > > TableFunction exists in the flink-table-common
> module,
> > > which
> > > > >> > > > provides
> > > > >> > > > >> > > only an API for working with tables – it’s very
> > > convenient
> > > > >> for
> > > > >> > > > importing
> > > > >> > > > >> > > in connectors. In turn, CachingTableFunction contains
> > > logic
> > > > >> for
> > > > >> > > > >> > > runtime execution,  so this class and everything
> > > connected
> > > > >> with
> > > > >> > it
> > > > >> > > > >> > > should be located in another module, probably in
> > > > >> > > > flink-table-runtime.
> > > > >> > > > >> > > But this will require connectors to depend on another
> > > module,
> > > > >> > > which
> > > > >> > > > >> > > contains a lot of runtime logic, which doesn’t sound
> > > good.
> > > > >> > > > >> > >
> > > > >> > > > >> > > I suggest adding a new method ‘getLookupConfig’ to
> > > > >> > > LookupTableSource
> > > > >> > > > >> > > or LookupRuntimeProvider to allow connectors to only
> pass
> > > > >> > > > >> > > configurations to the planner, therefore they won’t
> > > depend on
> > > > >> > > > runtime
> > > > >> > > > >> > > realization. Based on these configs planner will
> > > construct a
> > > > >> > > lookup
> > > > >> > > > >> > > join operator with corresponding runtime logic
> > > > >> (ProcessFunctions
> > > > >> > > in
> > > > >> > > > >> > > module flink-table-runtime). Architecture looks like
> in
> > > the
> > > > >> > pinned
> > > > >> > > > >> > > image (LookupConfig class there is actually yours
> > > > >> CacheConfig).
> > > > >> > > > >> > >
> > > > >> > > > >> > > Classes in flink-table-planner, that will be
> responsible
> > > for
> > > > >> > this
> > > > >> > > –
> > > > >> > > > CommonPhysicalLookupJoin and its subclasses.
> > > > >> > > > >> > > Current classes for lookup join in
> flink-table-runtime
> > > -
> > > > >> > > > >> > > LookupJoinRunner, AsyncLookupJoinRunner,
> > > > >> > LookupJoinRunnerWithCalc,
> > > > >> > > > >> > > AsyncLookupJoinRunnerWithCalc.
> > > > >> > > > >> > >
> > > > >> > > > >> > > I suggest adding classes LookupJoinCachingRunner,
> > > > >> > > > >> > > LookupJoinCachingRunnerWithCalc, etc.
> > > > >> > > > >> > >
> > > > >> > > > >> > > And here comes another more powerful advantage of
> such a
> > > > >> > solution.
> > > > >> > > > If
> > > > >> > > > >> > > we have caching logic on a lower level, we can apply
> some
> > > > >> > > > >> > > optimizations to it. LookupJoinRunnerWithCalc was
> named
> > > like
> > > > >> > this
> > > > >> > > > >> > > because it uses the ‘calc’ function, which actually
> > > mostly
> > > > >> > > consists
> > > > >> > > > of
> > > > >> > > > >> > > filters and projections.
> > > > >> > > > >> > >
> > > > >> > > > >> > > For example, in join table A with lookup table B
> > > condition
> > > > >> > ‘JOIN …
> > > > >> > > > ON
> > > > >> > > > >> > > A.id = B.id AND A.age = B.age + 10 WHERE B.salary >
> 1000’
> > > > >> > ‘calc’
> > > > >> > > > >> > > function will contain filters A.age = B.age + 10 and
> > > > >> B.salary >
> > > > >> > > > 1000.
> > > > >> > > > >> > >
> > > > >> > > > >> > > If we apply this function before storing records in
> > > cache,
> > > > >> size
> > > > >> > of
> > > > >> > > > >> > > cache will be significantly reduced: filters = avoid
> > > storing
> > > > >> > > useless
> > > > >> > > > >> > > records in cache, projections = reduce records’
> size. So
> > > the
> > > > >> > > initial
> > > > >> > > > >> > > max number of records in cache can be increased by
> the
> > > user.
> > > > >> > > > >> > >
> > > > >> > > > >> > > What do you think about it?
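The "apply the calc before caching" idea can be illustrated concretely. Note that in the example condition above, 'A.age = B.age + 10' depends on the probe row, so only the probe-independent parts ('B.salary > 1000' and the projection) can run before the rows enter the cache. All names below are illustrative, not Flink classes:

```java
import java.util.*;
import java.util.stream.Collectors;

/** A looked-up row from dimension table B, including a bulky payload column
 *  that the join itself never needs. */
class DimRow {
    final long id; final int age; final long salary; final String payload;
    DimRow(long id, int age, long salary, String payload) {
        this.id = id; this.age = age; this.salary = salary; this.payload = payload;
    }
}

class CalcBeforeCache {
    /** Probe-independent part of the calc, applied before caching:
     *  filter (WHERE B.salary > 1000) + projection to the needed columns.
     *  Fewer and smaller records are stored, so max-rows can be raised. */
    static List<long[]> calc(List<DimRow> lookedUp) {
        return lookedUp.stream()
            .filter(r -> r.salary > 1000)       // drop rows the join can never emit
            .map(r -> new long[] {r.id, r.age}) // keep only the columns the join uses
            .collect(Collectors.toList());
    }
}
```

The probe-dependent predicate ('A.age = B.age + 10') still has to run per input record after the cache lookup, exactly as LookupJoinRunnerWithCalc does today.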
> > > > >> > > > >> > >
> > > > >> > > > >> > >
> > > > >> > > > >> > > On 2022/04/19 02:47:11 Qingsheng Ren wrote:
> > > > >> > > > >> > > > Hi devs,
> > > > >> > > > >> > > >
> > > > >> > > > >> > > > Yuan and I would like to start a discussion about
> > > > >> FLIP-221[1],
> > > > >> > > > which
> > > > >> > > > >> > > introduces an abstraction of lookup table cache and
> its
> > > > >> standard
> > > > >> > > > metrics.
> > > > >> > > > >> > > >
> > > > >> > > > >> > > > Currently each lookup table source should implement
> > > their
> > > > >> own
> > > > >> > > > cache to
> > > > >> > > > >> > > store lookup results, and there isn’t a standard of
> > > metrics
> > > > >> for
> > > > >> > > > users and
> > > > >> > > > >> > > developers to tuning their jobs with lookup joins,
> which
> > > is a
> > > > >> > > quite
> > > > >> > > > common
> > > > >> > > > >> > > use case in Flink table / SQL.
> > > > >> > > > >> > > >
> > > > >> > > > >> > > > Therefore we propose some new APIs including cache,
> > > > >> metrics,
> > > > >> > > > wrapper
> > > > >> > > > >> > > classes of TableFunction and new table options.
> Please
> > > take a
> > > > >> > look
> > > > >> > > > at the
> > > > >> > > > >> > > FLIP page [1] to get more details. Any suggestions
> and
> > > > >> comments
> > > > >> > > > would be
> > > > >> > > > >> > > appreciated!
> > > > >> > > > >> > > >
> > > > >> > > > >> > > > [1]
> > > > >> > > > >> > >
> > > > >> > > >
> > > > >> > >
> > > > >> >
> > > > >>
> > >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-221+Abstraction+for+lookup+source+cache+and+metric
> > > > >> > > > >> > > >
> > > > >> > > > >> > > > Best regards,
> > > > >> > > > >> > > >
> > > > >> > > > >> > > > Qingsheng
> > > > >> > > > >> > > >
> > > > >> > > > >> > > >
> > > > >> > > > >> > >
> > > > >> > > > >
> > > > >> > > > >
> > > > >> > > > >
> > > > >> > > > > --
> > > > >> > > > > Best Regards,
> > > > >> > > > >
> > > > >> > > > > Qingsheng Ren
> > > > >> > > > >
> > > > >> > > > > Real-time Computing Team
> > > > >> > > > > Alibaba Cloud
> > > > >> > > > >
> > > > >> > > > > Email: renqschn@gmail.com
> > > > >> > > >
> > > > >> > >
> > > > >> >
> > > > >>
> > > > >
> > > > >
> > > > > --
> > > > > Best regards,
> > > > > Roman Boyko
> > > > > e.: ro.v.boyko@gmail.com
> > > > >
> > >
>
>

Re: [DISCUSS] FLIP-221 Abstraction for lookup source cache and metric

Posted by Александр Смирнов <sm...@gmail.com>.
Hi Martijn!

Got it. So the InputFormat-based implementation is not under consideration.
Thanks for clearing that up!

Best regards,
Smirnov Alexander

Thu, 12 May 2022 at 14:23, Martijn Visser <ma...@ververica.com>:
>
> Hi,
>
> With regards to:
>
> > But if there are plans to refactor all connectors to FLIP-27
>
> Yes, FLIP-27 is the target for all connectors. The old interfaces will be
> deprecated and connectors will either be refactored to use the new ones or
> dropped.
>
> The caching should work for connectors that are using FLIP-27 interfaces,
> we should not introduce new features for old interfaces.
>
> Best regards,
>
> Martijn
>
> On Thu, 12 May 2022 at 06:19, Александр Смирнов <sm...@gmail.com>
> wrote:
>
> > Hi Jark!
> >
> > Sorry for the late response. I would like to make some comments and
> > clarify my points.
> >
> > 1) I agree with your first statement. I think we can achieve both
> > advantages this way: put the Cache interface in flink-table-common,
> > but have implementations of it in flink-table-runtime. Therefore if a
> > connector developer wants to use existing cache strategies and their
> > implementations, he can just pass lookupConfig to the planner, but if
> > he wants to have its own cache implementation in his TableFunction, it
> > will be possible for him to use the existing interface for this
> > purpose (we can explicitly point this out in the documentation). In
> > this way all configs and metrics will be unified. WDYT?
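A rough sketch of that split, with an LRU implementation based on the JDK's LinkedHashMap. All names here are hypothetical illustrations, not the actual FLIP-221 API: the interface would sit in flink-table-common, the implementation and its hit/miss metrics in flink-table-runtime.

```java
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicLong;

/** Connector-facing interface (would live in flink-table-common). */
interface LookupCache<K, V> {
    Collection<V> getIfPresent(K key);
    void put(K key, Collection<V> rows);
    long hitCount();
    long missCount();
}

/** A default LRU implementation (would live in flink-table-runtime). */
class LruLookupCache<K, V> implements LookupCache<K, V> {
    private final Map<K, Collection<V>> cache;
    private final AtomicLong hits = new AtomicLong();
    private final AtomicLong misses = new AtomicLong();

    LruLookupCache(int maxRows) {
        // Access-order LinkedHashMap gives simple LRU eviction.
        this.cache = new LinkedHashMap<K, Collection<V>>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, Collection<V>> e) {
                return size() > maxRows;
            }
        };
    }

    public Collection<V> getIfPresent(K key) {
        Collection<V> rows = cache.get(key);
        if (rows == null) { misses.incrementAndGet(); } else { hits.incrementAndGet(); }
        return rows;
    }

    public void put(K key, Collection<V> rows) { cache.put(key, rows); }
    public long hitCount() { return hits.get(); }
    public long missCount() { return misses.get(); }
}
```

A connector that only passes a lookup config to the planner would never touch the implementation class, while a connector with custom caching could still implement the common interface and report the same standardized metrics.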
> >
> > > If a filter can prune 90% of data in the cache, we will have 90% of
> > lookup requests that can never be cached
> >
> > 2) Let me clarify the logic of the filters optimization in the case of an LRU cache.
> > It looks like Cache<RowData, Collection<RowData>>. Here we always
> > store the response of the dimension table in cache, even after
> > applying calc function. I.e. if there are no rows after applying
> > filters to the result of the 'eval' method of TableFunction, we store
> > the empty list by lookup keys. Therefore the cache line will be
> > filled, but will require much less memory (in bytes). I.e. we don't
> > completely filter keys, by which result was pruned, but significantly
> > reduce required memory to store this result. If the user knows about
> > this behavior, he can increase the 'max-rows' option before the start
> > of the job. But actually I came up with the idea that we can do this
> > automatically by using the 'maximumWeight' and 'weigher' methods of
> > GuavaCache [1]. Weight can be the size of the collection of rows
> > (value of cache). Therefore cache can automatically fit much more
> > records than before.
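A minimal, dependency-free sketch of that weigher idea (a stand-in for Guava's maximumWeight/weigher, with hypothetical names): the weight of an entry is the number of rows in its value, so an empty result stored for a pruned key is cached almost for free.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** LRU cache bounded by the total number of cached rows, not the number of keys. */
class WeightedRowCache<K, R> {
    private final long maxWeight;   // max total rows across all entries
    private long currentWeight = 0;
    // Access-order LinkedHashMap iterates least-recently-used entries first.
    private final LinkedHashMap<K, List<R>> cache = new LinkedHashMap<>(16, 0.75f, true);

    WeightedRowCache(long maxWeight) { this.maxWeight = maxWeight; }

    List<R> get(K key) { return cache.get(key); }

    void put(K key, List<R> rows) {
        List<R> old = cache.put(key, rows);
        currentWeight += rows.size() - (old == null ? 0 : old.size());
        // Evict LRU entries until the total weight fits; empty lists weigh nothing.
        Iterator<Map.Entry<K, List<R>>> it = cache.entrySet().iterator();
        while (currentWeight > maxWeight && it.hasNext()) {
            Map.Entry<K, List<R>> eldest = it.next();
            if (eldest.getKey().equals(key)) continue; // keep the entry just added
            currentWeight -= eldest.getValue().size();
            it.remove();
        }
    }
}
```

With such weighting the 'max-rows' limit effectively bounds rows rather than keys, which is what lets a filtered table keep many more keys cached in the same budget.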
> >
> > > Flink SQL has provided a standard way to do filters and projects
> > pushdown, i.e., SupportsFilterPushDown and SupportsProjectionPushDown.
> > > Jdbc/hive/HBase haven't implemented the interfaces, but that doesn't mean
> > they are hard to implement.
> >
> > It's debatable how difficult it will be to implement filter pushdown.
> > But I think the fact that currently there is no database connector
> > with filter pushdown at least means that this feature won't be
> > supported soon in connectors. Moreover, if we talk about other
> > connectors (not in Flink repo), their databases might not support all
> > Flink filters (or not support filters at all). I think users are
> > interested in the cache filter optimization independently of support
> > for other features, without having to solve more complex (or even
> > unsolvable) problems.
> >
> > 3) I agree with your third statement. Actually in our internal version
> > I also tried to unify the logic of scanning and reloading data from
> > connectors. But unfortunately, I didn't find a way to unify the logic
> > of all ScanRuntimeProviders (InputFormat, SourceFunction, Source,...)
> > and reuse it in reloading ALL cache. As a result I settled on using
> > InputFormat, because it was used for scanning in all lookup
> > connectors. (I didn't know that there are plans to deprecate
> > InputFormat in favor of FLIP-27 Source). IMO using the FLIP-27 source
> > in ALL caching is not a good idea, because this source was designed to
> > work in distributed environment (SplitEnumerator on JobManager and
> > SourceReaders on TaskManagers), not in one operator (lookup join
> > operator in our case). There is even no direct way to pass splits from
> > SplitEnumerator to SourceReader (this logic works through
> > SplitEnumeratorContext, which requires
> > OperatorCoordinator.SubtaskGateway to send AddSplitEvents). Usage of
> > InputFormat for ALL cache seems much more clearer and easier. But if
> > there are plans to refactor all connectors to FLIP-27, I have the
> > following idea: maybe we can drop the lookup-join ALL cache in favor
> > of a simple join with repeated scanning of the batch source? The point
> > is that the only difference between a lookup-join ALL cache and a
> > simple join with a batch source is that in the first case scanning is
> > performed multiple times, and the state (cache) is cleared in between
> > (correct me if I'm wrong). So what if we extend simple join to support
> > state reloading, and extend batch sources to support scanning multiple
> > times (the latter should be easy with the new FLIP-27 source, which
> > unifies streaming/batch reading - we would only need to change the
> > SplitEnumerator so that it passes the splits again after some TTL)?
> > WDYT? I must say that this looks like a long-term goal and would make
> > the scope of this FLIP even larger than you said. Maybe we can limit
> > ourselves to a simpler solution now (InputFormats).
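Independently of which scanning API wins, the runtime shape of an ALL cache is fairly simple: rebuild an immutable snapshot on a TTL and swap it in atomically. A sketch under that assumption, where the Supplier stands in for whatever InputFormat- or Source-based full scan is chosen (names hypothetical):

```java
import java.util.Map;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

/** ALL cache: the whole table is reloaded on a TTL and swapped in atomically. */
class AllCache<K, V> implements AutoCloseable {
    private final Supplier<Map<K, V>> loader;  // full-table scan of the dimension table
    private final AtomicReference<Map<K, V>> snapshot = new AtomicReference<>(Map.of());
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    AllCache(Supplier<Map<K, V>> loader, long ttlMillis) {
        this.loader = loader;
        reload(); // initial load before serving lookups
        scheduler.scheduleAtFixedRate(this::reload, ttlMillis, ttlMillis, TimeUnit.MILLISECONDS);
    }

    void reload() { snapshot.set(Map.copyOf(loader.get())); }

    V lookup(K key) { return snapshot.get().get(key); } // lock-free reads

    @Override
    public void close() { scheduler.shutdownNow(); }
}
```

Readers stay lock-free during a reload; the input stream only needs to be blocked or buffered if the job requires lookups to see a fully consistent new snapshot.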
> >
> > So to sum up, my points is like this:
> > 1) There is a way to make both concise and flexible interfaces for
> > caching in lookup join.
> > 2) The cache filter optimization is important for both LRU and ALL caches.
> > 3) It is unclear when filter pushdown will be supported in the Flink
> > connectors; some connectors might not be able to support filter
> > pushdown at all, and as far as I know, filter pushdown currently works
> > only for scanning (not lookup). So the cache filter + projection
> > optimization should be independent of other features.
> > 4) The ALL cache implementation is a complex topic that touches
> > multiple aspects of how Flink is evolving. Abandoning InputFormat in
> > favor of the FLIP-27 Source would make the ALL cache implementation
> > really complex and unclear, so maybe instead we can either extend the
> > functionality of the simple join or keep InputFormat for the
> > lookup-join ALL cache?
> >
> > Best regards,
> > Smirnov Alexander
> >
> >
> >
> >
> >
> >
> >
> > [1]
> > https://guava.dev/releases/18.0/api/docs/com/google/common/cache/CacheBuilder.html#weigher(com.google.common.cache.Weigher)
> >
> > Thu, 5 May 2022 at 20:34, Jark Wu <im...@gmail.com>:
> > >
> > > It's great to see the active discussion! I want to share my ideas:
> > >
> > > 1) implement the cache in framework vs. connectors base
> > > I don't have a strong opinion on this. Both ways should work (e.g., cache
> > > pruning, compatibility).
> > > The framework way can provide more concise interfaces.
> > > The connector base way can define more flexible cache
> > > strategies/implementations.
> > > We are still investigating a way to see if we can have both advantages.
> > > We should reach a consensus that the way should be a final state, and we
> > > are on the path to it.
> > >
> > > 2) filters and projections pushdown:
> > > I agree with Alex that the filter pushdown into cache can benefit a lot
> > for
> > > ALL cache.
> > > However, this is not true for LRU cache. Connectors use cache to reduce
> > IO
> > > requests to databases for better throughput.
> > > If a filter can prune 90% of data in the cache, we will have 90% of
> > lookup
> > > requests that can never be cached
> > > and hit directly to the databases. That means the cache is meaningless in
> > > this case.
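Whether those lookups really can never be cached depends on the caching policy: if the (possibly empty) post-filter result is cached per key, pruned keys are still served from the cache and only cost the memory of the key itself. A tiny simulation of the two policies (purely illustrative, hypothetical names):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Compares caching only non-empty results vs. also caching empty results. */
class EmptyResultCachingDemo {
    /** Returns the number of external lookups needed for the given key stream. */
    static int externalCalls(List<String> requests, boolean cacheEmptyResults) {
        Map<String, List<String>> cache = new HashMap<>();
        int calls = 0;
        for (String key : requests) {
            if (cache.containsKey(key)) continue;      // served from cache
            calls++;                                   // hits the database
            // Pretend the filter prunes every key except "kept".
            List<String> filtered = key.equals("kept") ? List.of("row") : List.of();
            if (!filtered.isEmpty() || cacheEmptyResults) {
                cache.put(key, filtered);
            }
        }
        return calls;
    }
}
```

In the heavily-pruned scenario the first policy keeps hitting the database for every pruned key, while the second pays exactly one external call per distinct key.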
> > >
> > > IMO, Flink SQL has provided a standard way to do filters and projects
> > > pushdown, i.e., SupportsFilterPushDown and SupportsProjectionPushDown.
> > > Jdbc/hive/HBase haven't implemented the interfaces, don't mean it's hard
> > to
> > > implement.
> > > They should implement the pushdown interfaces to reduce IO and the cache
> > > size.
> > > That should be a final state that the scan source and lookup source share
> > > the exact pushdown implementation.
> > > I don't see why we need to duplicate the pushdown logic in caches, which
> > > would complicate the lookup join design.
> > >
> > > 3) ALL cache abstraction
> > > The ALL cache might be the most challenging part of this FLIP. We have
> > > never provided a public reload-lookup interface.
> > > Currently, we put the reload logic in the "eval" method of TableFunction.
> > > That's hard for some sources (e.g., Hive).
> > > Ideally, connector implementation should share the logic of reload and
> > > scan, i.e. ScanTableSource with InputFormat/SourceFunction/FLIP-27
> > Source.
> > > However, InputFormat/SourceFunction are deprecated, and the FLIP-27
> > source
> > > is deeply coupled with SourceOperator.
> > > If we want to invoke the FLIP-27 source in LookupJoin, this may make the
> > > scope of this FLIP much larger.
> > > We are still investigating how to abstract the ALL cache logic and reuse
> > > the existing source interfaces.
> > >
> > >
> > > Best,
> > > Jark
> > >
> > >
> > >
> > > On Thu, 5 May 2022 at 20:22, Roman Boyko <ro...@gmail.com> wrote:
> > >
> > > > It's a much more complicated activity and lies outside the scope of this
> > > > improvement. Because such pushdowns should be done for all
> > ScanTableSource
> > > > implementations (not only for Lookup ones).
> > > >
> > > > On Thu, 5 May 2022 at 19:02, Martijn Visser <ma...@apache.org>
> > > > wrote:
> > > >
> > > >> Hi everyone,
> > > >>
> > > >> One question regarding "And Alexander correctly mentioned that filter
> > > >> pushdown still is not implemented for jdbc/hive/hbase." -> Would an
> > > >> alternative solution be to actually implement these filter pushdowns?
> > I
> > > >> can
> > > >> imagine that there are many more benefits to doing that, outside of
> > lookup
> > > >> caching and metrics.
> > > >>
> > > >> Best regards,
> > > >>
> > > >> Martijn Visser
> > > >> https://twitter.com/MartijnVisser82
> > > >> https://github.com/MartijnVisser
> > > >>
> > > >>
> > > >> On Thu, 5 May 2022 at 13:58, Roman Boyko <ro...@gmail.com>
> > wrote:
> > > >>
> > > >> > Hi everyone!
> > > >> >
> > > >> > Thanks for driving such a valuable improvement!
> > > >> >
> > > >> > I do think that a single cache implementation would be a nice
> > > >> > opportunity for users. And it will break the "FOR SYSTEM_TIME AS OF
> > > >> > proc_time" semantics anyway - no matter how it is implemented.
> > > >> >
> > > >> > Putting myself in the user's shoes, I can say that:
> > > >> > 1) I would prefer to have the opportunity to cut down the cache size
> > > >> > by simply filtering out unnecessary data. And the handiest way to do
> > > >> > that is to apply it inside the LookupRunners. It would be a bit
> > > >> > harder to pass it through the LookupJoin node to the TableFunction.
> > > >> > And Alexander correctly mentioned that filter pushdown is still not
> > > >> > implemented for jdbc/hive/hbase.
> > > >> > 2) The ability to set different caching parameters for different
> > > >> > tables is quite important. So I would prefer to set them through DDL
> > > >> > rather than have the same TTL, strategy and other options for all
> > > >> > lookup tables.
> > > >> > 3) Putting the cache into the framework does deprive us of
> > > >> > extensibility (users won't be able to implement their own cache). But
> > > >> > most probably this can be solved by providing more cache strategies
> > > >> > and a wider set of configurations.
> > > >> >
> > > >> > All these points are much closer to the scheme proposed by
> > > >> > Alexander. Qingsheng Ren, please correct me if I'm wrong - can all
> > > >> > these facilities be easily implemented in your architecture?
> > > >> >
> > > >> > Best regards,
> > > >> > Roman Boyko
> > > >> > e.: ro.v.boyko@gmail.com
> > > >> >
> > > >> > On Wed, 4 May 2022 at 21:01, Martijn Visser <
> > martijnvisser@apache.org>
> > > >> > wrote:
> > > >> >
> > > >> > > Hi everyone,
> > > >> > >
> > > >> > > I don't have much to chip in, but just wanted to express that I
> > really
> > > >> > > appreciate the in-depth discussion on this topic and I hope that
> > > >> others
> > > >> > > will join the conversation.
> > > >> > >
> > > >> > > Best regards,
> > > >> > >
> > > >> > > Martijn
> > > >> > >
> > > >> > > On Tue, 3 May 2022 at 10:15, Александр Смирнов <
> > smiralexan@gmail.com>
> > > >> > > wrote:
> > > >> > >
> > > >> > > > Hi Qingsheng, Leonard and Jark,
> > > >> > > >
> > > >> > > > Thanks for your detailed feedback! However, I have questions
> > about
> > > >> > > > some of your statements (maybe I didn't get something?).
> > > >> > > >
> > > >> > > > > Caching actually breaks the semantic of "FOR SYSTEM_TIME AS OF
> > > >> > > proc_time”
> > > >> > > >
> > > >> > > > I agree that the semantics of "FOR SYSTEM_TIME AS OF proc_time"
> > is
> > > >> not
> > > >> > > > fully implemented with caching, but as you said, users go on it
> > > >> > > > consciously to achieve better performance (no one proposed to
> > enable
> > > >> > > > caching by default, etc.). Or by users do you mean other
> > developers
> > > >> of
> > > >> > > > connectors? In this case developers explicitly specify whether
> > their
> > > >> > > > connector supports caching or not (in the list of supported
> > > >> options),
> > > >> > > > no one makes them do that if they don't want to. So what
> > exactly is
> > > >> > > > the difference between implementing caching in modules
> > > >> > > > flink-table-runtime and in flink-table-common from the
> > considered
> > > >> > > > point of view? How does it affect the breaking or non-breaking
> > > >> > > > of the semantics of "FOR SYSTEM_TIME AS OF proc_time"?
> > > >> > > >
> > > >> > > > > confront a situation that allows table options in DDL to
> > control
> > > >> the
> > > >> > > > behavior of the framework, which has never happened previously
> > and
> > > >> > should
> > > >> > > > be cautious
> > > >> > > >
> > > >> > > > If we talk about main differences of semantics of DDL options
> > and
> > > >> > > > config options("table.exec.xxx"), isn't it about limiting the
> > scope
> > > >> of
> > > >> > > > the options + importance for the user business logic rather than
> > > >> > > > specific location of corresponding logic in the framework? I
> > mean
> > > >> that
> > > >> > > > in my design, for example, putting an option with lookup cache
> > > >> > > > strategy in configurations would be the wrong decision,
> > because it
> > > >> > > > directly affects the user's business logic (not just performance
> > > >> > > > optimization) + touches just several functions of ONE table
> > (there
> > > >> can
> > > >> > > > be multiple tables with different caches). Does it really
> > matter for
> > > >> > > > the user (or someone else) where the logic is located, which is
> > > >> > > > affected by the applied option?
> > > >> > > > Also I can remember DDL option 'sink.parallelism', which in
> > some way
> > > >> > > > "controls the behavior of the framework" and I don't see any
> > problem
> > > >> > > > here.
> > > >> > > >
> > > >> > > > > introduce a new interface for this all-caching scenario and
> > the
> > > >> > design
> > > >> > > > would become more complex
> > > >> > > >
> > > >> > > > This is a subject for a separate discussion, but actually in our
> > > >> > > > internal version we solved this problem quite easily - we reused
> > > >> > > > InputFormat class (so there is no need for a new API). The
> > point is
> > > >> > > > that currently all lookup connectors use InputFormat for
> > scanning
> > > >> the
> > > >> > > > data in batch mode: HBase, JDBC and even Hive - it uses class
> > > >> > > > PartitionReader, that is actually just a wrapper around
> > InputFormat.
> > > >> > > > The advantage of this solution is the ability to reload cache
> > data
> > > >> in
> > > >> > > > parallel (number of threads depends on number of InputSplits,
> > but
> > > >> has
> > > >> > > > an upper limit). As a result cache reload time significantly
> > reduces
> > > >> > > > (as well as time of input stream blocking). I know that usually
> > we
> > > >> try
> > > >> > > > to avoid usage of concurrency in Flink code, but maybe this one
> > can
> > > >> be
> > > >> > > > an exception. BTW I don't say that it's an ideal solution, maybe
> > > >> there
> > > >> > > > are better ones.
> > > >> > > >
> > > >> > > > > Providing the cache in the framework might introduce
> > compatibility
> > > >> > > issues
> > > >> > > >
> > > >> > > > It's possible only in cases when the developer of the connector
> > > >> won't
> > > >> > > > properly refactor his code and will use new cache options
> > > >> incorrectly
> > > >> > > > (i.e. explicitly provide the same options into 2 different code
> > > >> > > > places). For correct behavior all he will need to do is to
> > redirect
> > > >> > > > existing options to the framework's LookupConfig (+ maybe add an
> > > >> alias
> > > >> > > > for options, if there was different naming), everything will be
> > > >> > > > transparent for users. If the developer won't do refactoring at
> > all,
> > > >> > > > nothing will be changed for the connector because of backward
> > > >> > > > compatibility. Also if a developer wants to use his own cache
> > logic,
> > > >> > > > he just can refuse to pass some of the configs into the
> > framework,
> > > >> and
> > > >> > > > instead make his own implementation with already existing
> > configs
> > > >> and
> > > >> > > > metrics (but actually I think that it's a rare case).
> > > >> > > >
> > > >> > > > > filters and projections should be pushed all the way down to
> > the
> > > >> > table
> > > >> > > > function, like what we do in the scan source
> > > >> > > >
> > > >> > > > It's the great purpose. But the truth is that the ONLY connector
> > > >> that
> > > >> > > > supports filter pushdown is FileSystemTableSource
> > > >> > > > (no database connector supports it currently). Also for some
> > > >> databases
> > > >> > > > it's simply impossible to pushdown such complex filters that we
> > have
> > > >> > > > in Flink.
> > > >> > > >
> > > >> > > > >  only applying these optimizations to the cache seems not
> > quite
> > > >> > useful
> > > >> > > >
> > > >> > > > Filters can cut off an arbitrarily large amount of data from the
> > > >> > > > dimension table. For a simple example, suppose in dimension
> > table
> > > >> > > > 'users'
> > > >> > > > we have column 'age' with values from 20 to 40, and input stream
> > > >> > > > 'clicks' that is ~uniformly distributed by age of users. If we
> > have
> > > >> > > > filter 'age > 30',
> > > >> > > > there will be twice less data in cache. This means the user can
> > > >> > > > increase 'lookup.cache.max-rows' by almost 2 times. It will
> > gain a
> > > >> > > > huge
> > > >> > > > performance boost. Moreover, this optimization starts to really
> > > >> shine
> > > >> > > > in 'ALL' cache, where tables without filters and projections
> > can't
> > > >> fit
> > > >> > > > in memory, but with them - can. This opens up additional
> > > >> possibilities
> > > >> > > > for users. And this doesn't sound as 'not quite useful'.
> > > >> > > >
> > > >> > > > It would be great to hear other voices regarding this topic!
> > Because
> > > >> > > > we have quite a lot of controversial points, and I think with
> > the
> > > >> help
> > > >> > > > of others it will be easier for us to come to a consensus.
> > > >> > > >
> > > >> > > > Best regards,
> > > >> > > > Smirnov Alexander
> > > >> > > >
> > > >> > > >
> > > >> > > > Fri, 29 Apr 2022 at 22:33, Qingsheng Ren <renqschn@gmail.com
> > >:
> > > >> > > > >
> > > >> > > > > Hi Alexander and Arvid,
> > > >> > > > >
> > > >> > > > > Thanks for the discussion and sorry for my late response! We
> > had
> > > >> an
> > > >> > > > internal discussion together with Jark and Leonard and I’d like
> > to
> > > >> > > > summarize our ideas. Instead of implementing the cache logic in
> > the
> > > >> > table
> > > >> > > > runtime layer or wrapping around the user-provided table
> > function,
> > > >> we
> > > >> > > > prefer to introduce some new APIs extending TableFunction with
> > these
> > > >> > > > concerns:
> > > >> > > > >
> > > >> > > > > 1. Caching actually breaks the semantic of "FOR SYSTEM_TIME
> > AS OF
> > > >> > > > proc_time”, because it couldn’t truly reflect the content of the
> > > >> lookup
> > > >> > > > table at the moment of querying. If users choose to enable
> > caching
> > > >> on
> > > >> > the
> > > >> > > > lookup table, they implicitly indicate that this breakage is
> > > >> acceptable
> > > >> > > in
> > > >> > > > exchange for the performance. So we prefer not to provide
> > caching on
> > > >> > the
> > > >> > > > table runtime level.
> > > >> > > > >
> > > >> > > > > 2. If we make the cache implementation in the framework
> > (whether
> > > >> in a
> > > >> > > > runner or a wrapper around TableFunction), we have to confront a
> > > >> > > situation
> > > >> > > > that allows table options in DDL to control the behavior of the
> > > >> > > framework,
> > > >> > > > which has never happened previously and should be cautious.
> > Under
> > > >> the
> > > >> > > > current design the behavior of the framework should only be
> > > >> specified
> > > >> > by
> > > >> > > > configurations (“table.exec.xxx”), and it’s hard to apply these
> > > >> general
> > > >> > > > configs to a specific table.
> > > >> > > > >
> > > >> > > > > 3. We have use cases that lookup source loads and refresh all
> > > >> records
> > > >> > > > periodically into the memory to achieve high lookup performance
> > > >> (like
> > > >> > > Hive
> > > >> > > > connector in the community, and also widely used by our internal
> > > >> > > > connectors). Wrapping the cache around the user’s TableFunction
> > > >> works
> > > >> > > fine
> > > >> > > > for LRU caches, but I think we have to introduce a new
> > interface for
> > > >> > this
> > > >> > > > all-caching scenario and the design would become more complex.
> > > >> > > > >
> > > >> > > > > 4. Providing the cache in the framework might introduce
> > > >> compatibility
> > > >> > > > issues to existing lookup sources like there might exist two
> > caches
> > > >> > with
> > > >> > > > totally different strategies if the user incorrectly configures
> > the
> > > >> > table
> > > >> > > > (one in the framework and another implemented by the lookup
> > source).
> > > >> > > > >
> > > >> > > > > As for the optimization mentioned by Alexander, I think
> > filters
> > > >> and
> > > >> > > > projections should be pushed all the way down to the table
> > function,
> > > >> > like
> > > >> > > > what we do in the scan source, instead of the runner with the
> > cache.
> > > >> > The
> > > >> > > > goal of using cache is to reduce the network I/O and pressure
> > on the
> > > >> > > > external system, and only applying these optimizations to the
> > cache
> > > >> > seems
> > > >> > > > not quite useful.
> > > >> > > > >
> > > >> > > > > I made some updates to the FLIP[1] to reflect our ideas. We
> > > >> prefer to
> > > >> > > > keep the cache implementation as a part of TableFunction, and we
> > > >> could
> > > >> > > > provide some helper classes (CachingTableFunction,
> > > >> > > AllCachingTableFunction,
> > > >> > > > CachingAsyncTableFunction) to developers and regulate metrics
> > of the
> > > >> > > cache.
> > > >> > > > Also, I made a POC[2] for your reference.
> > > >> > > > >
> > > >> > > > > Looking forward to your ideas!
> > > >> > > > >
> > > >> > > > > [1]
> > > >> > > >
> > > >> > >
> > > >> >
> > > >>
> > https://cwiki.apache.org/confluence/display/FLINK/FLIP-221+Abstraction+for+lookup+source+cache+and+metric
> > > >> > > > > [2] https://github.com/PatrickRen/flink/tree/FLIP-221
> > > >> > > > >
> > > >> > > > > Best regards,
> > > >> > > > >
> > > >> > > > > Qingsheng
> > > >> > > > >
> > > >> > > > > On Tue, Apr 26, 2022 at 4:45 PM Александр Смирнов <
> > > >> > > smiralexan@gmail.com>
> > > >> > > > wrote:
> > > >> > > > >>
> > > >> > > > >> Thanks for the response, Arvid!
> > > >> > > > >>
> > > >> > > > >> I have few comments on your message.
> > > >> > > > >>
> > > >> > > > >> > but could also live with an easier solution as the first
> > step:
> > > >> > > > >>
> > > >> > > > >> I think that these 2 ways are mutually exclusive (originally
> > > >> > proposed
> > > >> > > > >> by Qingsheng and mine), because conceptually they follow the
> > same
> > > >> > > > >> goal, but the implementation details are different. If we go
> > > >> > > > >> one way,
> > > >> > > > >> moving to another way in the future will mean deleting
> > existing
> > > >> code
> > > >> > > > >> and once again changing the API for connectors. So I think we
> > > >> should
> > > >> > > > >> reach a consensus with the community about that and then work
> > > >> > together
> > > >> > > > >> on this FLIP, i.e. divide the work on tasks for different
> > parts
> > > >> of
> > > >> > the
> > > >> > > > >> flip (for example, LRU cache unification / introducing
> > proposed
> > > >> set
> > > >> > of
> > > >> > > > >> metrics / further work…). WDYT, Qingsheng?
> > > >> > > > >>
> > > >> > > > >> > as the source will only receive the requests after filter
> > > >> > > > >>
> > > >> > > > >> Actually, if filters are applied to fields of the lookup
> > > >> > > > >> table, we must first do the requests, and only after that
> > > >> > > > >> can we filter the responses,
> > > >> > > > >> because lookup connectors don't have filter pushdown. So if
> > > >> > filtering
> > > >> > > > >> is done before caching, there will be much less rows in
> > cache.
> > > >> > > > >>
> > > >> > > > >> > @Alexander unfortunately, your architecture is not shared.
> > I
> > > >> don't
> > > >> > > > know the
> > > >> > > > >>
> > > >> > > > >> > solution to share images to be honest.
> > > >> > > > >>
> > > >> > > > >> Sorry for that, I’m a bit new to such kinds of conversations
> > :)
> > > >> > > > >> I have no write access to the confluence, so I made a Jira
> > issue,
> > > >> > > > >> where described the proposed changes in more details -
> > > >> > > > >> https://issues.apache.org/jira/browse/FLINK-27411.
> > > >> > > > >>
> > > >> > > > >> Will happy to get more feedback!
> > > >> > > > >>
> > > >> > > > >> Best,
> > > >> > > > >> Smirnov Alexander
> > > >> > > > >>
> > > >> > > > >> Mon, 25 Apr 2022 at 19:49, Arvid Heise <ar...@apache.org>:
> > > >> > > > >> >
> > > >> > > > >> > Hi Qingsheng,
> > > >> > > > >> >
> > > >> > > > >> > Thanks for driving this; the inconsistency was not
> > satisfying
> > > >> for
> > > >> > > me.
> > > >> > > > >> >
> > > >> > > > >> > I second Alexander's idea though but could also live with
> > an
> > > >> > easier
> > > >> > > > >> > solution as the first step: Instead of making caching an
> > > >> > > > implementation
> > > >> > > > >> > detail of TableFunction X, rather devise a caching layer
> > > >> around X.
> > > >> > > So
> > > >> > > > the
> > > >> > > > >> > proposal would be a CachingTableFunction that delegates to
> > X in
> > > >> > case
> > > >> > > > of
> > > >> > > > >> > misses and else manages the cache. Lifting it into the
> > operator
> > > >> > > model
> > > >> > > > as
> > > >> > > > >> > proposed would be even better but is probably unnecessary
> > in
> > > >> the
> > > >> > > > first step
> > > >> > > > >> > for a lookup source (as the source will only receive the
> > > >> requests
> > > >> > > > after
> > > >> > > > >> > filter; applying projection may be more interesting to save
> > > >> > memory).
> > > >> > > > >> >
> > > >> > > > >> > Another advantage is that all the changes of this FLIP
> > would be
> > > >> > > > limited to
> > > >> > > > >> > options, no need for new public interfaces. Everything else
> > > >> > remains
> > > >> > > an
> > > >> > > > >> > implementation of Table runtime. That means we can easily
> > > >> > > incorporate
> > > >> > > > the
> > > >> > > > >> > optimization potential that Alexander pointed out later.
> > > >> > > > >> >
> > > >> > > > >> > @Alexander unfortunately, your architecture is not shared.
> > I
> > > >> don't
> > > >> > > > know the
> > > >> > > > >> > solution to share images to be honest.
> > > >> > > > >> >
> > > >> > > > >> > On Fri, Apr 22, 2022 at 5:04 PM Александр Смирнов <
> > > >> > > > smiralexan@gmail.com>
> > > >> > > > >> > wrote:
> > > >> > > > >> >
> > > >> > > > >> > > Hi Qingsheng! My name is Alexander, I'm not a committer
> > yet,
> > > >> but
> > > >> > > I'd
> > > >> > > > >> > > really like to become one. And this FLIP really
> > interested
> > > >> me.
> > > >> > > > >> > > Actually I have worked on a similar feature in my
> > company’s
> > > >> > Flink
> > > >> > > > >> > > fork, and we would like to share our thoughts on this and
> > > >> make
> > > >> > > code
> > > >> > > > >> > > open source.
> > > >> > > > >> > >
> > > >> > > > >> > > I think there is a better alternative than introducing an
> > > >> > abstract
> > > >> > > > >> > > class for TableFunction (CachingTableFunction). As you
> > know,
> > > >> > > > >> > > TableFunction exists in the flink-table-common module,
> > which
> > > >> > > > provides
> > > >> > > > >> > > only an API for working with tables – it’s very
> > convenient
> > > >> for
> > > >> > > > importing
> > > >> > > > >> > > in connectors. In turn, CachingTableFunction contains
> > logic
> > > >> for
> > > >> > > > >> > > runtime execution,  so this class and everything
> > connected
> > > >> with
> > > >> > it
> > > >> > > > >> > > should be located in another module, probably in
> > > >> > > > flink-table-runtime.
> > > >> > > > >> > > But this will require connectors to depend on another
> > module,
> > > >> > > which
> > > >> > > > >> > > contains a lot of runtime logic, which doesn’t sound
> > good.
> > > >> > > > >> > >
> > > >> > > > >> > > I suggest adding a new method ‘getLookupConfig’ to
> > > >> > > LookupTableSource
> > > >> > > > >> > > or LookupRuntimeProvider to allow connectors to only pass
> > > >> > > > >> > > configurations to the planner, therefore they won’t
> > depend on
> > > >> > > > runtime
> > > >> > > > >> > > realization. Based on these configs planner will
> > construct a
> > > >> > > lookup
> > > >> > > > >> > > join operator with corresponding runtime logic
> > > >> (ProcessFunctions
> > > >> > > in
> > > >> > > > >> > > module flink-table-runtime). Architecture looks like in
> > the
> > > >> > pinned
> > > >> > > > >> > > image (LookupConfig class there is actually yours
> > > >> CacheConfig).
> > > >> > > > >> > >
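[Editor's note: the proposed `getLookupConfig` method could look roughly like the sketch below. All names here are illustrative assumptions drawn from the discussion, not actual Flink API: the connector describes its caching needs declaratively, and the planner picks a matching runner.]

```java
import java.io.Serializable;
import java.time.Duration;

// Sketch of the proposed declarative lookup config (hypothetical names, not Flink API).
final class LookupConfig implements Serializable {
    enum CacheStrategy { NONE, LRU, ALL }

    final CacheStrategy strategy;
    final long maxRows;   // only meaningful for LRU
    final Duration ttl;   // expiry for LRU, reload interval for ALL

    private LookupConfig(CacheStrategy strategy, long maxRows, Duration ttl) {
        this.strategy = strategy;
        this.maxRows = maxRows;
        this.ttl = ttl;
    }

    static LookupConfig none() { return new LookupConfig(CacheStrategy.NONE, 0, Duration.ZERO); }
    static LookupConfig lru(long maxRows, Duration ttl) { return new LookupConfig(CacheStrategy.LRU, maxRows, ttl); }
    static LookupConfig all(Duration reloadInterval) { return new LookupConfig(CacheStrategy.ALL, -1, reloadInterval); }
}

// The connector would then pass only configuration to the planner, not runtime classes:
interface LookupTableSourceSketch {
    default LookupConfig getLookupConfig() { return LookupConfig.none(); }
}
```

With such a method, a connector never touches flink-table-runtime; it only reports options, and the planner instantiates the caching runner on its side.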
> > > >> > > > >> > > Classes in flink-table-planner, that will be responsible
> > for
> > > >> > this
> > > >> > > –
> > > >> > > > >> > > CommonPhysicalLookupJoin and his inheritors.
> > > >> > > > >> > > Current classes for lookup join in  flink-table-runtime
> > -
> > > >> > > > >> > > LookupJoinRunner, AsyncLookupJoinRunner,
> > > >> > LookupJoinRunnerWithCalc,
> > > >> > > > >> > > AsyncLookupJoinRunnerWithCalc.
> > > >> > > > >> > >
> > > >> > > > >> > > I suggest adding classes LookupJoinCachingRunner,
> > > >> > > > >> > > LookupJoinCachingRunnerWithCalc, etc.
> > > >> > > > >> > >
> > > >> > > > >> > > And here comes another more powerful advantage of such a
> > > >> > solution.
> > > >> > > > If
> > > >> > > > >> > > we have caching logic on a lower level, we can apply some
> > > >> > > > >> > > optimizations to it. LookupJoinRunnerWithCalc was named
> > like
> > > >> > this
> > > >> > > > >> > > because it uses the ‘calc’ function, which actually
> > mostly
> > > >> > > consists
> > > >> > > > of
> > > >> > > > >> > > filters and projections.
> > > >> > > > >> > >
> > > >> > > > >> > > For example, in join table A with lookup table B
> > condition
> > > >> > ‘JOIN …
> > > >> > > > ON
> > > >> > > > >> > > A.id = B.id AND A.age = B.age + 10 WHERE B.salary > 1000’
> > > >> > ‘calc’
> > > >> > > > >> > > function will contain filters A.age = B.age + 10 and
> > > >> B.salary >
> > > >> > > > 1000.
> > > >> > > > >> > >
> > > >> > > > >> > > If we apply this function before storing records in
> > cache,
> > > >> size
> > > >> > of
> > > >> > > > >> > > cache will be significantly reduced: filters = avoid
> > storing
> > > >> > > useless
> > > >> > > > >> > > records in cache, projections = reduce records’ size. So
> > the
> > > >> > > initial
> > > >> > > > >> > > max number of records in cache can be increased by the
> > user.
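[Editor's note: the filter-before-cache idea above can be sketched in plain Java as follows. The class and method names are illustrative, not the actual LookupJoinCachingRunner API. One nuance: only predicates on the lookup table's own columns, like `B.salary > 1000`, can safely be applied before caching; a predicate referencing the probe side, like `A.age = B.age + 10`, must still run per lookup.]

```java
import java.util.*;
import java.util.function.Predicate;

// Sketch only: cache lookup results after applying the probe-independent part
// of the "calc" function, so pruned rows never occupy cache memory.
class CalcFilteringCache {
    // A row is modeled as {age, salary} for the example condition in the thread.
    private final Map<Integer, List<long[]>> cache = new HashMap<>();
    private final Predicate<long[]> filter; // e.g. row -> row[1] > 1000

    CalcFilteringCache(Predicate<long[]> filter) { this.filter = filter; }

    void cacheLookupResult(int key, List<long[]> rowsFromDatabase) {
        List<long[]> kept = new ArrayList<>();
        for (long[] row : rowsFromDatabase) {
            if (filter.test(row)) kept.add(row);
        }
        cache.put(key, kept); // an empty list still records "looked up, no match"
    }

    List<long[]> get(int key) { return cache.get(key); }
}
```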
> > > >> > > > >> > >
> > > >> > > > >> > > What do you think about it?
> > > >> > > > >> > >
> > > >> > > > >> > >
> > > >> > > > >> > > On 2022/04/19 02:47:11 Qingsheng Ren wrote:
> > > >> > > > >> > > > Hi devs,
> > > >> > > > >> > > >
> > > >> > > > >> > > > Yuan and I would like to start a discussion about
> > > >> FLIP-221[1],
> > > >> > > > which
> > > >> > > > >> > > introduces an abstraction of lookup table cache and its
> > > >> standard
> > > >> > > > metrics.
> > > >> > > > >> > > >
> > > >> > > > >> > > > Currently each lookup table source should implement
> > their
> > > >> own
> > > >> > > > cache to
> > > >> > > > >> > > store lookup results, and there isn’t a standard of
> > metrics
> > > >> for
> > > >> > > > users and
> > > >> > > > >> > > developers to tuning their jobs with lookup joins, which
> > is a
> > > >> > > quite
> > > >> > > > common
> > > >> > > > >> > > use case in Flink table / SQL.
> > > >> > > > >> > > >
> > > >> > > > >> > > > Therefore we propose some new APIs including cache,
> > > >> metrics,
> > > >> > > > wrapper
> > > >> > > > >> > > classes of TableFunction and new table options. Please
> > take a
> > > >> > look
> > > >> > > > at the
> > > >> > > > >> > > FLIP page [1] to get more details. Any suggestions and
> > > >> comments
> > > >> > > > would be
> > > >> > > > >> > > appreciated!
> > > >> > > > >> > > >
> > > >> > > > >> > > > [1]
> > > >> > > > >> > >
> > > >> > > >
> > > >> > >
> > > >> >
> > > >>
> > https://cwiki.apache.org/confluence/display/FLINK/FLIP-221+Abstraction+for+lookup+source+cache+and+metric
> > > >> > > > >> > > >
> > > >> > > > >> > > > Best regards,
> > > >> > > > >> > > >
> > > >> > > > >> > > > Qingsheng
> > > >> > > > >> > > >
> > > >> > > > >> > > >
> > > >> > > > >> > >
> > > >> > > > >
> > > >> > > > >
> > > >> > > > >
> > > >> > > > > --
> > > >> > > > > Best Regards,
> > > >> > > > >
> > > >> > > > > Qingsheng Ren
> > > >> > > > >
> > > >> > > > > Real-time Computing Team
> > > >> > > > > Alibaba Cloud
> > > >> > > > >
> > > >> > > > > Email: renqschn@gmail.com
> > > >> > > >
> > > >> > >
> > > >> >
> > > >>
> > > >
> > > >
> > > > --
> > > > Best regards,
> > > > Roman Boyko
> > > > e.: ro.v.boyko@gmail.com
> > > >
> >


Re: [DISCUSS] FLIP-221 Abstraction for lookup source cache and metric

Posted by Martijn Visser <ma...@ververica.com>.
Hi,

With regards to:

> But if there are plans to refactor all connectors to FLIP-27

Yes, FLIP-27 is the target for all connectors. The old interfaces will be
deprecated and connectors will either be refactored to use the new ones or
dropped.

The caching should work for connectors that are using FLIP-27 interfaces;
we should not introduce new features for old interfaces.

Best regards,

Martijn

On Thu, 12 May 2022 at 06:19, Александр Смирнов <sm...@gmail.com>
wrote:

> Hi Jark!
>
> Sorry for the late response. I would like to make some comments and
> clarify my points.
>
> 1) I agree with your first statement. I think we can achieve both
> advantages this way: put the Cache interface in flink-table-common,
> but have implementations of it in flink-table-runtime. Therefore if a
> connector developer wants to use existing cache strategies and their
> implementations, he can just pass lookupConfig to the planner, but if
> he wants to have its own cache implementation in his TableFunction, it
> will be possible for him to use the existing interface for this
> purpose (we can explicitly point this out in the documentation). In
> this way all configs and metrics will be unified. WDYT?
>
> > If a filter can prune 90% of data in the cache, we will have 90% of
> lookup requests that can never be cached
>
> 2) Let me clarify the logic of the filters optimization in the case of an LRU cache.
> It looks like Cache<RowData, Collection<RowData>>. Here we always
> store the response of the dimension table in cache, even after
> applying calc function. I.e. if there are no rows after applying
> filters to the result of the 'eval' method of TableFunction, we store
> an empty list under the lookup keys. Therefore the cache entry will
> still be created, but will require much less memory (in bytes). I.e. we don't
> completely filter out keys whose result was pruned, but we significantly
> reduce the memory required to store this result. If the user knows about
> this behavior, he can increase the 'max-rows' option before the start
> of the job. But actually I came up with the idea that we can do this
> automatically by using the 'maximumWeight' and 'weigher' methods of
> GuavaCache [1]. Weight can be the size of the collection of rows
> (value of cache). Therefore cache can automatically fit much more
> records than before.
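[Editor's note: a minimal sketch of the weight-based sizing idea above, in plain Java instead of Guava, with illustrative names only. Eviction is driven by the total number of cached rows rather than the number of keys, so a key whose filtered result is an empty list costs almost nothing.]

```java
import java.util.*;

// Sketch only: an LRU cache bounded by total weight (here, rows per entry),
// mimicking Guava's maximumWeight/weigher. Not actual Flink or Guava API.
class WeightedLruCache<K, V> {
    interface Weigher<K, V> { int weigh(K key, V value); }

    // access-order LinkedHashMap gives us least-recently-used iteration order
    private final LinkedHashMap<K, V> map = new LinkedHashMap<>(16, 0.75f, true);
    private final long maxWeight;
    private final Weigher<K, V> weigher;
    private long currentWeight = 0;

    WeightedLruCache(long maxWeight, Weigher<K, V> weigher) {
        this.maxWeight = maxWeight;
        this.weigher = weigher;
    }

    void put(K key, V value) {
        V old = map.remove(key);
        if (old != null) currentWeight -= weigher.weigh(key, old);
        map.put(key, value);
        currentWeight += weigher.weigh(key, value);
        // Evict least-recently-used entries until we are under the weight limit.
        Iterator<Map.Entry<K, V>> it = map.entrySet().iterator();
        while (currentWeight > maxWeight && it.hasNext()) {
            Map.Entry<K, V> eldest = it.next();
            currentWeight -= weigher.weigh(eldest.getKey(), eldest.getValue());
            it.remove();
        }
    }

    V get(K key) { return map.get(key); }
    int size() { return map.size(); }
}
```

With a weigher of `(key, rows) -> rows.size()`, a key cached with an empty post-filter result adds zero weight, so the cache can hold far more keys than a plain max-rows bound would allow.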
>
> > Flink SQL has provided a standard way to do filters and projects
> pushdown, i.e., SupportsFilterPushDown and SupportsProjectionPushDown.
> > Jdbc/hive/HBase haven't implemented the interfaces, but that doesn't mean it's hard
> to implement.
>
> It's debatable how difficult it will be to implement filter pushdown.
> But I think the fact that currently there is no database connector
> with filter pushdown at least means that this feature won't be
> supported soon in connectors. Moreover, if we talk about other
> connectors (not in Flink repo), their databases might not support all
> Flink filters (or not support filters at all). I think users are
> interested in supporting the cache filters optimization independently of
> supporting other features and solving more complex (or even unsolvable)
> problems.
>
> 3) I agree with your third statement. Actually in our internal version
> I also tried to unify the logic of scanning and reloading data from
> connectors. But unfortunately, I didn't find a way to unify the logic
> of all ScanRuntimeProviders (InputFormat, SourceFunction, Source,...)
> and reuse it in reloading ALL cache. As a result I settled on using
> InputFormat, because it was used for scanning in all lookup
> connectors. (I didn't know that there are plans to deprecate
> InputFormat in favor of FLIP-27 Source). IMO usage of FLIP-27 source
> in ALL caching is not a good idea, because this source was designed to
> work in distributed environment (SplitEnumerator on JobManager and
> SourceReaders on TaskManagers), not in one operator (lookup join
> operator in our case). There is not even a direct way to pass splits from
> SplitEnumerator to SourceReader (this logic works through
> SplitEnumeratorContext, which requires
> OperatorCoordinator.SubtaskGateway to send AddSplitEvents). Usage of
> InputFormat for ALL cache seems much clearer and easier. But if
> there are plans to refactor all connectors to FLIP-27, I have the
> following ideas: maybe we can abandon the lookup join ALL cache in
> favor of a simple join with multiple scans of the batch source? The point
> is that the only difference between lookup join ALL cache and simple
> join with batch source is that in the first case scanning is performed
> multiple times, in between which state (cache) is cleared (correct me
> if I'm wrong). So what if we extend the functionality of simple join
> to support state reloading + extend the functionality of scanning
> batch source multiple times (this one should be easy with new FLIP-27
> source, that unifies streaming/batch reading - we will need to change
> only SplitEnumerator, which will pass splits again after some TTL).
> WDYT? I must say that this looks like a long-term goal and will make
> the scope of this FLIP even larger than you said. Maybe we can limit
> ourselves to a simpler solution now (InputFormats).
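[Editor's note: the InputFormat-based ALL-cache reload described in this thread could be sketched as below. The names are illustrative; split reading is modeled as a `Callable` returning `key -> rows`, where the real implementation would wrap `InputFormat#open`/`nextRecord` per `InputSplit`. Splits are read in parallel by a bounded pool, and the rebuilt map atomically replaces the old cache.]

```java
import java.util.*;
import java.util.concurrent.*;

// Sketch only: reload an ALL cache by reading "splits" in parallel with a
// bounded thread pool, then atomically swapping in the new snapshot.
class AllCacheReloader {
    private volatile Map<Integer, List<String>> cache = Map.of();

    void reload(List<Callable<Map<Integer, List<String>>>> splitReaders, int parallelism)
            throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(parallelism);
        try {
            Map<Integer, List<String>> snapshot = new ConcurrentHashMap<>();
            List<Future<Map<Integer, List<String>>>> futures = pool.invokeAll(splitReaders);
            for (Future<Map<Integer, List<String>>> f : futures) {
                snapshot.putAll(f.get()); // splits are disjoint, so no merge conflicts
            }
            cache = snapshot; // atomic swap: lookups never see a half-loaded cache
        } finally {
            pool.shutdown();
        }
    }

    List<String> lookup(int key) { return cache.get(key); }
}
```

The pool size plays the role of the "upper limit on threads" mentioned above; in a real operator the `reload` call would be driven by a TTL timer.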
>
> So to sum up, my points are as follows:
> 1) There is a way to make both concise and flexible interfaces for
> caching in lookup join.
> 2) The cache filters optimization is important in both LRU and ALL caches.
> 3) It is unclear when filter pushdown will be supported in Flink
> connectors, some of the connectors might not have the opportunity to
> support filter pushdown + as far as I know, currently filter pushdown works
> only for scanning (not lookup). So cache filters + projections
> optimization should be independent from other features.
> 4) The ALL cache implementation is a complex topic that involves multiple
> aspects of how Flink is developing. Abandoning InputFormat in favor
> of the FLIP-27 Source would make the ALL cache implementation really
> complex and unclear, so maybe instead we can extend the functionality of
> a simple join, or keep InputFormat in the case of the lookup join ALL
> cache?
>
> Best regards,
> Smirnov Alexander
>
>
>
>
>
>
>
> [1]
> https://guava.dev/releases/18.0/api/docs/com/google/common/cache/CacheBuilder.html#weigher(com.google.common.cache.Weigher)
>
> Thu, 5 May 2022 at 20:34, Jark Wu <im...@gmail.com>:
> >
> > It's great to see the active discussion! I want to share my ideas:
> >
> > 1) implement the cache in framework vs. connectors base
> > I don't have a strong opinion on this. Both ways should work (e.g., cache
> > pruning, compatibility).
> > The framework way can provide more concise interfaces.
> > The connector base way can define more flexible cache
> > strategies/implementations.
> > We are still investigating a way to see if we can have both advantages.
> > We should reach a consensus that the way should be a final state, and we
> > are on the path to it.
> >
> > 2) filters and projections pushdown:
> > I agree with Alex that the filter pushdown into cache can benefit a lot
> for
> > ALL cache.
> > However, this is not true for LRU cache. Connectors use cache to reduce
> IO
> > requests to databases for better throughput.
> > If a filter can prune 90% of data in the cache, we will have 90% of
> lookup
> > requests that can never be cached
> > and that hit the databases directly. That means the cache is meaningless in
> > this case.
> >
> > IMO, Flink SQL has provided a standard way to do filters and projects
> > pushdown, i.e., SupportsFilterPushDown and SupportsProjectionPushDown.
> > Jdbc/hive/HBase haven't implemented the interfaces, but that doesn't mean it's hard
> to
> > implement.
> > They should implement the pushdown interfaces to reduce IO and the cache
> > size.
> > That should be a final state that the scan source and lookup source share
> > the exact pushdown implementation.
> > I don't see why we need to duplicate the pushdown logic in caches, which
> > will complex the lookup join design.
> >
> > 3) ALL cache abstraction
> > The ALL cache might be the most challenging part of this FLIP. We have never
> > provided a reload-lookup public interface.
> > Currently, we put the reload logic in the "eval" method of TableFunction.
> > That's hard for some sources (e.g., Hive).
> > Ideally, connector implementation should share the logic of reload and
> > scan, i.e. ScanTableSource with InputFormat/SourceFunction/FLIP-27
> Source.
> > However, InputFormat/SourceFunction are deprecated, and the FLIP-27
> source
> > is deeply coupled with SourceOperator.
> > If we want to invoke the FLIP-27 source in LookupJoin, this may make the
> > scope of this FLIP much larger.
> > We are still investigating how to abstract the ALL cache logic and reuse
> > the existing source interfaces.
> >
> >
> > Best,
> > Jark
> >
> >
> >
> > On Thu, 5 May 2022 at 20:22, Roman Boyko <ro...@gmail.com> wrote:
> >
> > > It's a much more complicated activity and lies outside the scope of this
> > > improvement. Because such pushdowns should be done for all
> ScanTableSource
> > > implementations (not only for Lookup ones).
> > >
> > > On Thu, 5 May 2022 at 19:02, Martijn Visser <ma...@apache.org>
> > > wrote:
> > >
> > >> Hi everyone,
> > >>
> > >> One question regarding "And Alexander correctly mentioned that filter
> > >> pushdown still is not implemented for jdbc/hive/hbase." -> Would an
> > >> alternative solution be to actually implement these filter pushdowns?
> I
> > >> can
> > >> imagine that there are many more benefits to doing that, outside of
> lookup
> > >> caching and metrics.
> > >>
> > >> Best regards,
> > >>
> > >> Martijn Visser
> > >> https://twitter.com/MartijnVisser82
> > >> https://github.com/MartijnVisser
> > >>
> > >>
> > >> On Thu, 5 May 2022 at 13:58, Roman Boyko <ro...@gmail.com>
> wrote:
> > >>
> > >> > Hi everyone!
> > >> >
> > >> > Thanks for driving such a valuable improvement!
> > >> >
> > >> > I do think that a single cache implementation would be a nice
> opportunity
> > >> for
> > >> > users. And it will break the "FOR SYSTEM_TIME AS OF proc_time"
> semantics
> > >> > anyway - doesn't matter how it will be implemented.
> > >> >
> > >> > Putting myself in the user's shoes, I can say that:
> > >> > 1) I would prefer to have the opportunity to cut off the cache size
> by
> > >> > simply filtering unnecessary data. And the most handy way to do it
> is
> > >> apply
> > >> > it inside LookupRunners. It would be a bit harder to pass it
> through the
> > >> > LookupJoin node to TableFunction. And Alexander correctly mentioned
> that
> > >> > filter pushdown still is not implemented for jdbc/hive/hbase.
> > >> > 2) The ability to set the different caching parameters for different
> > >> tables
> > >> > is quite important. So I would prefer to set it through DDL rather
> than
> > >> > have the same ttl, strategy and other options for all lookup tables.
> > >> > 3) Providing the cache into the framework really deprives us of
> > >> > extensibility (users won't be able to implement their own cache).
> But
> > >> most
> > >> > probably it might be solved by creating more different cache
> strategies
> > >> and
> > >> > a wider set of configurations.
> > >> >
> > >> > All these points are much closer to the schema proposed by
> Alexander.
> > >> > Qingsheng Ren, please correct me if I'm wrong and all these
> > >> facilities
> > >> > might be simply implemented in your architecture?
> > >> >
> > >> > Best regards,
> > >> > Roman Boyko
> > >> > e.: ro.v.boyko@gmail.com
> > >> >
> > >> > On Wed, 4 May 2022 at 21:01, Martijn Visser <
> martijnvisser@apache.org>
> > >> > wrote:
> > >> >
> > >> > > Hi everyone,
> > >> > >
> > >> > > I don't have much to chip in, but just wanted to express that I
> really
> > >> > > appreciate the in-depth discussion on this topic and I hope that
> > >> others
> > >> > > will join the conversation.
> > >> > >
> > >> > > Best regards,
> > >> > >
> > >> > > Martijn
> > >> > >
> > >> > > On Tue, 3 May 2022 at 10:15, Александр Смирнов <
> smiralexan@gmail.com>
> > >> > > wrote:
> > >> > >
> > >> > > > Hi Qingsheng, Leonard and Jark,
> > >> > > >
> > >> > > > Thanks for your detailed feedback! However, I have questions
> about
> > >> > > > some of your statements (maybe I didn't get something?).
> > >> > > >
> > >> > > > > Caching actually breaks the semantic of "FOR SYSTEM_TIME AS OF
> > >> > > proc_time”
> > >> > > >
> > >> > > > I agree that the semantics of "FOR SYSTEM_TIME AS OF proc_time"
> is
> > >> not
> > >> > > > fully implemented with caching, but as you said, users go on it
> > >> > > > consciously to achieve better performance (no one proposed to
> enable
> > >> > > > caching by default, etc.). Or by users do you mean other
> developers
> > >> of
> > >> > > > connectors? In this case developers explicitly specify whether
> their
> > >> > > > connector supports caching or not (in the list of supported
> > >> options),
> > >> > > > no one makes them do that if they don't want to. So what
> exactly is
> > >> > > > the difference between implementing caching in modules
> > >> > > > flink-table-runtime and in flink-table-common from the
> considered
> > >> > > > point of view? How does it affect the breaking/non-breaking of the
> > >> > > > semantics of "FOR SYSTEM_TIME AS OF proc_time"?
> > >> > > >
> > >> > > > > confront a situation that allows table options in DDL to
> control
> > >> the
> > >> > > > behavior of the framework, which has never happened previously
> and
> > >> > should
> > >> > > > be cautious
> > >> > > >
> > >> > > > If we talk about main differences of semantics of DDL options
> and
> > >> > > > config options("table.exec.xxx"), isn't it about limiting the
> scope
> > >> of
> > >> > > > the options + importance for the user business logic rather than
> > >> > > > specific location of corresponding logic in the framework? I
> mean
> > >> that
> > >> > > > in my design, for example, putting an option with lookup cache
> > >> > > > strategy in configurations would  be the wrong decision,
> because it
> > >> > > > directly affects the user's business logic (not just performance
> > >> > > > optimization) + touches just several functions of ONE table
> (there
> > >> can
> > >> > > > be multiple tables with different caches). Does it really
> matter for
> > >> > > > the user (or someone else) where the logic is located, which is
> > >> > > > affected by the applied option?
> > >> > > > Also I can remember DDL option 'sink.parallelism', which in
> some way
> > >> > > > "controls the behavior of the framework" and I don't see any
> problem
> > >> > > > here.
> > >> > > >
> > >> > > > > introduce a new interface for this all-caching scenario and
> the
> > >> > design
> > >> > > > would become more complex
> > >> > > >
> > >> > > > This is a subject for a separate discussion, but actually in our
> > >> > > > internal version we solved this problem quite easily - we reused
> > >> > > > InputFormat class (so there is no need for a new API). The
> point is
> > >> > > > that currently all lookup connectors use InputFormat for
> scanning
> > >> the
> > >> > > > data in batch mode: HBase, JDBC and even Hive - it uses class
> > >> > > > PartitionReader, that is actually just a wrapper around
> InputFormat.
> > >> > > > The advantage of this solution is the ability to reload cache
> data
> > >> in
> > >> > > > parallel (number of threads depends on number of InputSplits,
> but
> > >> has
> > >> > > > an upper limit). As a result cache reload time significantly
> reduces
> > >> > > > (as well as time of input stream blocking). I know that usually
> we
> > >> try
> > >> > > > to avoid usage of concurrency in Flink code, but maybe this one
> can
> > >> be
> > >> > > > an exception. BTW I don't say that it's an ideal solution, maybe
> > >> there
> > >> > > > are better ones.
> > >> > > >
> > >> > > > > Providing the cache in the framework might introduce
> compatibility
> > >> > > issues
> > >> > > >
> > >> > > > It's possible only in cases where the developer of the connector
> > >> > > > doesn't
> > >> > > > properly refactor his code and uses the new cache options
> > >> incorrectly
> > >> > > > (i.e. explicitly provide the same options into 2 different code
> > >> > > > places). For correct behavior all he will need to do is to
> redirect
> > >> > > > existing options to the framework's LookupConfig (+ maybe add an
> > >> alias
> > >> > > > for options, if there was different naming), everything will be
> > >> > > > transparent for users. If the developer won't do refactoring at
> all,
> > >> > > > nothing will be changed for the connector because of backward
> > >> > > > compatibility. Also if a developer wants to use his own cache
> logic,
> > >> > > > he just can refuse to pass some of the configs into the
> framework,
> > >> and
> > >> > > > instead make his own implementation with already existing
> configs
> > >> and
> > >> > > > metrics (but actually I think that it's a rare case).
> > >> > > >
> > >> > > > > filters and projections should be pushed all the way down to
> the
> > >> > table
> > >> > > > function, like what we do in the scan source
> > >> > > >
> > >> > > > It's the great purpose. But the truth is that the ONLY connector
> > >> that
> > >> > > > supports filter pushdown is FileSystemTableSource
> > >> > > > (no database connector supports it currently). Also for some
> > >> databases
> > >> > > > it's simply impossible to pushdown such complex filters that we
> have
> > >> > > > in Flink.
> > >> > > >
> > >> > > > >  only applying these optimizations to the cache seems not
> quite
> > >> > useful
> > >> > > >
> > >> > > > Filters can cut off an arbitrarily large amount of data from the
> > >> > > > dimension table. For a simple example, suppose in dimension
> table
> > >> > > > 'users'
> > >> > > > we have column 'age' with values from 20 to 40, and input stream
> > >> > > > 'clicks' that is ~uniformly distributed by age of users. If we
> have
> > >> > > > filter 'age > 30',
> > >> > > > there will be half as much data in the cache. This means the user can
> > >> > > > increase 'lookup.cache.max-rows' by almost 2 times. It will
> gain a
> > >> > > > huge
> > >> > > > performance boost. Moreover, this optimization starts to really
> > >> shine
> > >> > > > in 'ALL' cache, where tables without filters and projections
> can't
> > >> fit
> > >> > > > in memory, but with them - can. This opens up additional
> > >> possibilities
> > >> > > > for users. And this doesn't sound as 'not quite useful'.
> > >> > > >
> > >> > > > It would be great to hear other voices regarding this topic!
> Because
> > >> > > > we have quite a lot of controversial points, and I think with
> the
> > >> help
> > >> > > > of others it will be easier for us to come to a consensus.
> > >> > > >
> > >> > > > Best regards,
> > >> > > > Smirnov Alexander
> > >> > > >
> > >> > > >
> > >> > > > пт, 29 апр. 2022 г. в 22:33, Qingsheng Ren <renqschn@gmail.com
> >:
> > >> > > > >
> > >> > > > > Hi Alexander and Arvid,
> > >> > > > >
> > >> > > > > Thanks for the discussion and sorry for my late response! We
> had
> > >> an
> > >> > > > internal discussion together with Jark and Leonard and I’d like
> to
> > >> > > > summarize our ideas. Instead of implementing the cache logic in
> the
> > >> > table
> > >> > > > runtime layer or wrapping around the user-provided table
> function,
> > >> we
> > >> > > > prefer to introduce some new APIs extending TableFunction with
> these
> > >> > > > concerns:
> > >> > > > >
> > >> > > > > 1. Caching actually breaks the semantic of "FOR SYSTEM_TIME
> AS OF
> > >> > > > proc_time”, because it couldn’t truly reflect the content of the
> > >> lookup
> > >> > > > table at the moment of querying. If users choose to enable
> caching
> > >> on
> > >> > the
> > >> > > > lookup table, they implicitly indicate that this breakage is
> > >> acceptable
> > >> > > in
> > >> > > > exchange for the performance. So we prefer not to provide
> caching on
> > >> > the
> > >> > > > table runtime level.
> > >> > > > >
> > >> > > > > 2. If we make the cache implementation in the framework
> (whether
> > >> in a
> > >> > > > runner or a wrapper around TableFunction), we have to confront a
> > >> > > situation
> > >> > > > that allows table options in DDL to control the behavior of the
> > >> > > framework,
> > >> > > > which has never happened previously and should be cautious.
> Under
> > >> the
> > >> > > > current design the behavior of the framework should only be
> > >> specified
> > >> > by
> > >> > > > configurations (“table.exec.xxx”), and it’s hard to apply these
> > >> general
> > >> > > > configs to a specific table.
> > >> > > > >
> > >> > > > > 3. We have use cases that lookup source loads and refresh all
> > >> records
> > >> > > > periodically into the memory to achieve high lookup performance
> > >> (like
> > >> > > Hive
> > >> > > > connector in the community, and also widely used by our internal
> > >> > > > connectors). Wrapping the cache around the user’s TableFunction
> > >> works
> > >> > > fine
> > >> > > > for LRU caches, but I think we have to introduce a new
> interface for
> > >> > this
> > >> > > > all-caching scenario and the design would become more complex.
> > >> > > > >
> > >> > > > > 4. Providing the cache in the framework might introduce
> > >> compatibility
> > >> > > > issues to existing lookup sources like there might exist two
> caches
> > >> > with
> > >> > > > totally different strategies if the user incorrectly configures
> the
> > >> > table
> > >> > > > (one in the framework and another implemented by the lookup
> source).
> > >> > > > >
> > >> > > > > As for the optimization mentioned by Alexander, I think
> filters
> > >> and
> > >> > > > projections should be pushed all the way down to the table
> function,
> > >> > like
> > >> > > > what we do in the scan source, instead of the runner with the
> cache.
> > >> > The
> > >> > > > goal of using cache is to reduce the network I/O and pressure
> on the
> > >> > > > external system, and only applying these optimizations to the
> cache
> > >> > seems
> > >> > > > not quite useful.
> > >> > > > >
> > >> > > > > I made some updates to the FLIP[1] to reflect our ideas. We
> > >> prefer to
> > >> > > > keep the cache implementation as a part of TableFunction, and we
> > >> could
> > >> > > > provide some helper classes (CachingTableFunction,
> > >> > > AllCachingTableFunction,
> > >> > > > CachingAsyncTableFunction) to developers and regulate metrics
> of the
> > >> > > cache.
> > >> > > > Also, I made a POC[2] for your reference.
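[Editor's note: the "cache as part of the TableFunction" idea with standardized metrics could be sketched as below. The signatures are assumptions for illustration; the real `CachingTableFunction` helpers are defined in the FLIP-221 proposal and POC, not in this form.]

```java
import java.util.*;

// Sketch only: a caching wrapper in the TableFunction itself. The connector
// implements lookupExternal(); cache handling and hit/miss counters (the
// metrics the FLIP wants to standardize) live in the base class.
abstract class CachingTableFunctionSketch<K, R> {
    private final Map<K, List<R>> cache = new HashMap<>();
    private long hitCount = 0, missCount = 0;

    /** Connector-specific lookup against the external system. */
    protected abstract List<R> lookupExternal(K key);

    /** Cache-first lookup; misses are fetched from the external system and cached. */
    public final List<R> eval(K key) {
        List<R> cached = cache.get(key);
        if (cached != null) { hitCount++; return cached; }
        missCount++;
        List<R> fetched = lookupExternal(key);
        cache.put(key, fetched);
        return fetched;
    }

    public long hits() { return hitCount; }
    public long misses() { return missCount; }
}
```

A concrete connector would subclass this and only implement `lookupExternal`, so every lookup source reports the same cache metrics.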
> > >> > > > >
> > >> > > > > Looking forward to your ideas!
> > >> > > > >
> > >> > > > > [1]
> > >> > > >
> > >> > >
> > >> >
> > >>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-221+Abstraction+for+lookup+source+cache+and+metric
> > >> > > > > [2] https://github.com/PatrickRen/flink/tree/FLIP-221
> > >> > > > >
> > >> > > > > Best regards,
> > >> > > > >
> > >> > > > > Qingsheng
> > >> > > > >
> > >> > > > > On Tue, Apr 26, 2022 at 4:45 PM Александр Смирнов <
> > >> > > smiralexan@gmail.com>
> > >> > > > wrote:
> > >> > > > >>
> > >> > > > >> Thanks for the response, Arvid!
> > >> > > > >>
> > >> > > > >> I have a few comments on your message.
> > >> > > > >>
> > >> > > > >> > but could also live with an easier solution as the first
> step:
> > >> > > > >>
> > >> > > > >> I think that these 2 ways are mutually exclusive (originally
> > >> > proposed
> > >> > > > >> by Qingsheng and mine), because conceptually they follow the
> same
> > >> > > > >> goal, but implementation details are different. If we will
> go one
> > >> > way,
> > >> > > > >> moving to another way in the future will mean deleting
> existing
> > >> code
> > >> > > > >> and once again changing the API for connectors. So I think we
> > >> should
> > >> > > > >> reach a consensus with the community about that and then work
> > >> > together
> > >> > > > >> on this FLIP, i.e. divide the work on tasks for different
> parts
> > >> of
> > >> > the
> > >> > > > >> flip (for example, LRU cache unification / introducing
> proposed
> > >> set
> > >> > of
> > >> > > > >> metrics / further work…). WDYT, Qingsheng?
> > >> > > > >>
> > >> > > > >> > as the source will only receive the requests after filter
> > >> > > > >>
> > >> > > > >> Actually if filters are applied to fields of the lookup
> table, we
> > >> > > > >> firstly must do requests, and only after that we can filter
> > >> > responses,
> > >> > > > >> because lookup connectors don't have filter pushdown. So if
> > >> > filtering
> > >> > > > >> is done before caching, there will be much less rows in
> cache.
> > >> > > > >>
> > >> > > > >> > @Alexander unfortunately, your architecture is not shared.
> I
> > >> don't
> > >> > > > know the
> > >> > > > >>
> > >> > > > >> > solution to share images to be honest.
> > >> > > > >>
> > >> > > > >> Sorry for that, I’m a bit new to such kinds of conversations
> :)
> > >> > > > >> I have no write access to the confluence, so I made a Jira
> issue,
> > >> > > > >> where described the proposed changes in more details -
> > >> > > > >> https://issues.apache.org/jira/browse/FLINK-27411.
> > >> > > > >>
> > >> > > > >> Will happy to get more feedback!
> > >> > > > >>
> > >> > > > >> Best,
> > >> > > > >> Smirnov Alexander
> > >> > > > >>
> > >> > > > >> Mon, 25 Apr 2022 at 19:49, Arvid Heise <ar...@apache.org>:
> > >> > > > >> >
> > >> > > > >> > Hi Qingsheng,
> > >> > > > >> >
> > >> > > > >> > Thanks for driving this; the inconsistency was not
> satisfying
> > >> for
> > >> > > me.
> > >> > > > >> >
> > >> > > > >> > I second Alexander's idea though but could also live with
> an
> > >> > easier
> > >> > > > >> > solution as the first step: Instead of making caching an
> > >> > > > implementation
> > >> > > > >> > detail of TableFunction X, rather devise a caching layer
> > >> around X.
> > >> > > So
> > >> > > > the
> > >> > > > >> > proposal would be a CachingTableFunction that delegates to
> X in
> > >> > case
> > >> > > > of
> > >> > > > >> > misses and else manages the cache. Lifting it into the
> operator
> > >> > > model
> > >> > > > as
> > >> > > > >> > proposed would be even better but is probably unnecessary
> in
> > >> the
> > >> > > > first step
> > >> > > > >> > for a lookup source (as the source will only receive the
> > >> requests
> > >> > > > after
> > >> > > > >> > filter; applying projection may be more interesting to save
> > >> > memory).
> > >> > > > >> >
> > >> > > > >> > Another advantage is that all the changes of this FLIP
> would be
> > >> > > > limited to
> > >> > > > >> > options, no need for new public interfaces. Everything else
> > >> > remains
> > >> > > an
> > >> > > > >> > implementation of Table runtime. That means we can easily
> > >> > > incorporate
> > >> > > > the
> > >> > > > >> > optimization potential that Alexander pointed out later.
> > >> > > > >> >
> > >> > > > >> > @Alexander unfortunately, your architecture is not shared.
> I
> > >> don't
> > >> > > > know the
> > >> > > > >> > solution to share images to be honest.
> > >> > > > >> >
> > >> > > > >> > On Fri, Apr 22, 2022 at 5:04 PM Александр Смирнов <
> > >> > > > smiralexan@gmail.com>
> > >> > > > >> > wrote:
> > >> > > > >> >
> > >> > > > >> > > Hi Qingsheng! My name is Alexander, I'm not a committer
> yet,
> > >> but
> > >> > > I'd
> > >> > > > >> > > really like to become one. And this FLIP really
> interested
> > >> me.
> > >> > > > >> > > Actually I have worked on a similar feature in my
> company’s
> > >> > Flink
> > >> > > > >> > > fork, and we would like to share our thoughts on this and
> > >> make
> > >> > > code
> > >> > > > >> > > open source.
> > >> > > > >> > >
> > >> > > > >> > > I think there is a better alternative than introducing an
> > >> > abstract
> > >> > > > >> > > class for TableFunction (CachingTableFunction). As you
> know,
> > >> > > > >> > > TableFunction exists in the flink-table-common module,
> which
> > >> > > > provides
> > >> > > > >> > > only an API for working with tables – it’s very
> convenient
> > >> for
> > >> > > > importing
> > >> > > > >> > > in connectors. In turn, CachingTableFunction contains
> logic
> > >> for
> > >> > > > >> > > runtime execution,  so this class and everything
> connected
> > >> with
> > >> > it
> > >> > > > >> > > should be located in another module, probably in
> > >> > > > flink-table-runtime.
> > >> > > > >> > > But this will require connectors to depend on another
> module,
> > >> > > which
> > >> > > > >> > > contains a lot of runtime logic, which doesn’t sound
> good.
> > >> > > > >> > >
> > >> > > > >> > > I suggest adding a new method ‘getLookupConfig’ to
> > >> > > LookupTableSource
> > >> > > > >> > > or LookupRuntimeProvider to allow connectors to only pass
> > >> > > > >> > > configurations to the planner, therefore they won’t
> depend on
> > >> > > > runtime
> > >> > > > >> > > realization. Based on these configs planner will
> construct a
> > >> > > lookup
> > >> > > > >> > > join operator with corresponding runtime logic
> > >> (ProcessFunctions
> > >> > > in
> > >> > > > >> > > module flink-table-runtime). Architecture looks like in
> the
> > >> > pinned
> > >> > > > >> > > image (LookupConfig class there is actually yours
> > >> CacheConfig).
> > >> > > > >> > >
> > >> > > > >> > > Classes in flink-table-planner, that will be responsible
> for
> > >> > this
> > >> > > –
> > >> > > > >> > > CommonPhysicalLookupJoin and his inheritors.
> > >> > > > >> > > Current classes for lookup join in  flink-table-runtime
> -
> > >> > > > >> > > LookupJoinRunner, AsyncLookupJoinRunner,
> > >> > LookupJoinRunnerWithCalc,
> > >> > > > >> > > AsyncLookupJoinRunnerWithCalc.
> > >> > > > >> > >
> > >> > > > >> > > I suggest adding classes LookupJoinCachingRunner,
> > >> > > > >> > > LookupJoinCachingRunnerWithCalc, etc.
> > >> > > > >> > >
> > >> > > > >> > > And here comes another more powerful advantage of such a
> > >> > solution.
> > >> > > > If
> > >> > > > >> > > we have caching logic on a lower level, we can apply some
> > >> > > > >> > > optimizations to it. LookupJoinRunnerWithCalc was named
> like
> > >> > this
> > >> > > > >> > > because it uses the ‘calc’ function, which actually
> mostly
> > >> > > consists
> > >> > > > of
> > >> > > > >> > > filters and projections.
> > >> > > > >> > >
> > >> > > > >> > > For example, in join table A with lookup table B
> condition
> > >> > ‘JOIN …
> > >> > > > ON
> > >> > > > >> > > A.id = B.id AND A.age = B.age + 10 WHERE B.salary > 1000’
> > >> > ‘calc’
> > >> > > > >> > > function will contain filters A.age = B.age + 10 and
> > >> B.salary >
> > >> > > > 1000.
> > >> > > > >> > >
> > >> > > > >> > > If we apply this function before storing records in
> cache,
> > >> size
> > >> > of
> > >> > > > >> > > cache will be significantly reduced: filters = avoid
> storing
> > >> > > useless
> > >> > > > >> > > records in cache, projections = reduce records’ size. So
> the
> > >> > > initial
> > >> > > > >> > > max number of records in cache can be increased by the
> user.
> > >> > > > >> > >
> > >> > > > >> > > What do you think about it?
> > >> > > > >> > >
> > >> > > > >> > >
> > >> > > > >> > > On 2022/04/19 02:47:11 Qingsheng Ren wrote:
> > >> > > > >> > > > Hi devs,
> > >> > > > >> > > >
> > >> > > > >> > > > Yuan and I would like to start a discussion about
> > >> FLIP-221[1],
> > >> > > > which
> > >> > > > >> > > introduces an abstraction of lookup table cache and its
> > >> standard
> > >> > > > metrics.
> > >> > > > >> > > >
> > >> > > > >> > > > Currently each lookup table source should implement
> their
> > >> own
> > >> > > > cache to
> > >> > > > >> > > store lookup results, and there isn’t a standard of
> metrics
> > >> for
> > >> > > > users and
> > >> > > > >> > > developers to tuning their jobs with lookup joins, which
> is a
> > >> > > quite
> > >> > > > common
> > >> > > > >> > > use case in Flink table / SQL.
> > >> > > > >> > > >
> > >> > > > >> > > > Therefore we propose some new APIs including cache,
> > >> metrics,
> > >> > > > wrapper
> > >> > > > >> > > classes of TableFunction and new table options. Please
> take a
> > >> > look
> > >> > > > at the
> > >> > > > >> > > FLIP page [1] to get more details. Any suggestions and
> > >> comments
> > >> > > > would be
> > >> > > > >> > > appreciated!
> > >> > > > >> > > >
> > >> > > > >> > > > [1]
> > >> > > > >> > >
> > >> > > >
> > >> > >
> > >> >
> > >>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-221+Abstraction+for+lookup+source+cache+and+metric
> > >> > > > >> > > >
> > >> > > > >> > > > Best regards,
> > >> > > > >> > > >
> > >> > > > >> > > > Qingsheng
> > >> > > > >> > > >
> > >> > > > >> > > >
> > >> > > > >> > >
> > >> > > > >
> > >> > > > >
> > >> > > > >
> > >> > > > > --
> > >> > > > > Best Regards,
> > >> > > > >
> > >> > > > > Qingsheng Ren
> > >> > > > >
> > >> > > > > Real-time Computing Team
> > >> > > > > Alibaba Cloud
> > >> > > > >
> > >> > > > > Email: renqschn@gmail.com
> > >> > > >
> > >> > >
> > >> >
> > >>
> > >
> > >
> > > --
> > > Best regards,
> > > Roman Boyko
> > > e.: ro.v.boyko@gmail.com
> > >
>

Re: [DISCUSS] FLIP-221 Abstraction for lookup source cache and metric

Posted by Александр Смирнов <sm...@gmail.com>.
Hi Jark!

Sorry for the late response. I would like to make some comments and
clarify my points.

1) I agree with your first statement. I think we can achieve both
advantages this way: put the Cache interface in flink-table-common,
but have its implementations in flink-table-runtime. Then, if a
connector developer wants to use the existing cache strategies and
their implementations, he can just pass a LookupConfig to the planner;
but if he wants his own cache implementation in his TableFunction, he
can still use the existing interface for this purpose (we can
explicitly point this out in the documentation). In this way all
configs and metrics will be unified. WDYT?
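To make the proposed split concrete, here is a minimal sketch of what
it could look like: the interface alone would live in
flink-table-common, and a default implementation (constructed by the
planner from a LookupConfig) in flink-table-runtime. All names here
are illustrative assumptions, not the FLIP's actual API.

```java
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical cache contract a connector could code against (flink-table-common). */
interface LookupCache<K, V> {
    Collection<V> getIfPresent(K key);

    void put(K key, Collection<V> rows);
}

/** Hypothetical default LRU implementation (flink-table-runtime). */
class LruLookupCache<K, V> implements LookupCache<K, V> {
    private final LinkedHashMap<K, Collection<V>> map;

    LruLookupCache(int maxRows) {
        // An access-ordered LinkedHashMap gives LRU eviction for free:
        // the eldest (least recently accessed) entry is dropped on overflow.
        this.map = new LinkedHashMap<K, Collection<V>>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, Collection<V>> eldest) {
                return size() > maxRows;
            }
        };
    }

    @Override
    public Collection<V> getIfPresent(K key) {
        return map.get(key);
    }

    @Override
    public void put(K key, Collection<V> rows) {
        map.put(key, rows);
    }
}
```

A connector that wants its own caching would simply implement
LookupCache itself, while still reporting the unified metrics.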

> If a filter can prune 90% of data in the cache, we will have 90% of lookup requests that can never be cached

2) Let me clarify the logic of the filter optimization in the case of
an LRU cache. The cache looks like Cache<RowData, Collection<RowData>>.
Here we always store the dimension table's response in the cache, even
after applying the calc function. I.e. if no rows remain after applying
the filters to the result of the 'eval' method of TableFunction, we
store an empty list under the lookup keys. The cache entry is therefore
still created, but it requires much less memory (in bytes). So we don't
completely filter out the keys whose result was pruned, but we
significantly reduce the memory required to store that result. If the
user knows about this behavior, he can increase the 'max-rows' option
before the start of the job. But actually I came up with the idea that
we can do this automatically by using the 'maximumWeight' and 'weigher'
methods of the Guava cache [1]. The weight can be the size of the
collection of rows (the value of a cache entry). The cache can
therefore automatically fit many more records than before.
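The weigher idea can be sketched without pulling in Guava: each entry
weighs as many units as it has rows, so empty results (everything
pruned by the calc function) cost almost nothing, mimicking Guava's
maximumWeight/Weigher pair. Names and the simple FIFO eviction below
are illustrative assumptions only.

```java
import java.util.ArrayDeque;
import java.util.Collection;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

/** Illustrative weight-bounded cache; assumes each key is put at most once. */
class WeightedLookupCache<K, V> {
    private final long maxWeight;
    private long currentWeight = 0;
    private final Map<K, Collection<V>> map = new HashMap<>();
    private final Deque<K> insertionOrder = new ArrayDeque<>(); // FIFO eviction for brevity

    WeightedLookupCache(long maxWeight) {
        this.maxWeight = maxWeight;
    }

    Collection<V> getIfPresent(K key) {
        return map.get(key);
    }

    void put(K key, Collection<V> rows) {
        // Like a Guava Weigher: weight = row count, with a floor of 1 so
        // that even an empty (fully filtered) result occupies one unit.
        long weight = Math.max(1, rows.size());
        while (!insertionOrder.isEmpty() && currentWeight + weight > maxWeight) {
            K eldest = insertionOrder.poll();
            Collection<V> evicted = map.remove(eldest);
            currentWeight -= Math.max(1, evicted.size());
        }
        map.put(key, rows);
        insertionOrder.add(key);
        currentWeight += weight;
    }
}
```

With weight = row count, a table whose filters prune most rows can
hold many more lookup keys within the same memory budget.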

> Flink SQL has provided a standard way to do filters and projects pushdown, i.e., SupportsFilterPushDown and SupportsProjectionPushDown.
> Jdbc/hive/HBase haven't implemented the interfaces, don't mean it's hard to implement.

It's debatable how difficult it would be to implement filter pushdown.
But the fact that currently no database connector supports filter
pushdown at least suggests that this feature won't arrive in
connectors soon. Moreover, if we talk about other connectors (outside
the Flink repo), their databases might not support all Flink filters
(or might not support filters at all). I think users are interested in
the cache filter optimization independently of support for other
features, which are harder (or in some cases impossible) to implement.

3) I agree with your third statement. Actually, in our internal
version I also tried to unify the logic of scanning and reloading data
from connectors. But unfortunately, I didn't find a way to unify the
logic of all ScanRuntimeProviders (InputFormat, SourceFunction,
Source, ...) and reuse it for reloading the ALL cache. As a result I
settled on using InputFormat, because it was used for scanning in all
lookup connectors. (I didn't know that there are plans to deprecate
InputFormat in favor of the FLIP-27 Source.) IMO using the FLIP-27
source for the ALL cache is not a good idea, because this source was
designed to work in a distributed environment (SplitEnumerator on the
JobManager and SourceReaders on the TaskManagers), not inside one
operator (the lookup join operator in our case). There is not even a
direct way to pass splits from the SplitEnumerator to a SourceReader
(this logic works through the SplitEnumeratorContext, which requires
OperatorCoordinator.SubtaskGateway to send AddSplitEvents). Using
InputFormat for the ALL cache seems much clearer and easier. But if
there are plans to refactor all connectors to FLIP-27, I have the
following idea: maybe we can drop the lookup join ALL cache in favor
of a regular join with repeated scanning of the batch source? The
point is that the only difference between a lookup join with an ALL
cache and a regular join with a batch source is that in the first case
scanning is performed multiple times, with the state (cache) cleared
in between (correct me if I'm wrong). So what if we extend regular
joins to support state reloading, and extend batch sources to support
scanning multiple times (the latter should be easy with the new
FLIP-27 source, which unifies streaming/batch reading - we would only
need to change the SplitEnumerator so that it passes the splits again
after some TTL)? WDYT? I must say that this looks like a long-term
goal and would make the scope of this FLIP even larger than you said.
Maybe we can limit ourselves to a simpler solution now (InputFormats).
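The "pass the splits again after some TTL" idea can be reduced to a
small, deterministic trigger that decides, given the current time,
whether a full reload (re-scan of the batch source, or re-assignment
of splits by a SplitEnumerator) is due. This is purely illustrative,
not an actual Flink interface.

```java
/** Illustrative reload trigger for a TTL-driven full re-scan. */
class PeriodicReloadTrigger {
    private final long ttlMillis;
    private long lastReloadMillis = Long.MIN_VALUE;

    PeriodicReloadTrigger(long ttlMillis) {
        this.ttlMillis = ttlMillis;
    }

    /**
     * Returns true (and records the reload) if no reload has happened yet
     * or the TTL has elapsed since the last one.
     */
    boolean fireIfDue(long nowMillis) {
        if (lastReloadMillis == Long.MIN_VALUE
                || nowMillis - lastReloadMillis >= ttlMillis) {
            lastReloadMillis = nowMillis;
            return true;
        }
        return false;
    }
}
```

Whether this trigger lives in a join operator (state reload) or in a
SplitEnumerator (re-assigning splits) is exactly the design question
discussed above.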

So to sum up, my points are these:
1) There is a way to make the interfaces for caching in lookup joins
both concise and flexible.
2) The cache filter optimization is important for both LRU and ALL
caches.
3) It is unclear when filter pushdown will be supported in Flink
connectors, some connectors might not even be able to support it, and
as far as I know filter pushdown currently works only for scan sources
(not lookup sources). So the cache filter + projection optimization
should be independent of other features.
4) The ALL cache realization is a complex topic that touches multiple
aspects of how Flink is developing. Dropping InputFormat in favor of
the FLIP-27 Source would make the ALL cache realization really complex
and unclear, so maybe instead we can extend regular joins, or keep
InputFormat for the lookup join ALL cache?

Best regards,
Smirnov Alexander


[1] https://guava.dev/releases/18.0/api/docs/com/google/common/cache/CacheBuilder.html#weigher(com.google.common.cache.Weigher)

чт, 5 мая 2022 г. в 20:34, Jark Wu <im...@gmail.com>:
>
> It's great to see the active discussion! I want to share my ideas:
>
> 1) implement the cache in framework vs. connectors base
> I don't have a strong opinion on this. Both ways should work (e.g., cache
> pruning, compatibility).
> The framework way can provide more concise interfaces.
> The connector base way can define more flexible cache
> strategies/implementations.
> We are still investigating a way to see if we can have both advantages.
> We should reach a consensus that the way should be a final state, and we
> are on the path to it.
>
> 2) filters and projections pushdown:
> I agree with Alex that the filter pushdown into cache can benefit a lot for
> ALL cache.
> However, this is not true for LRU cache. Connectors use cache to reduce IO
> requests to databases for better throughput.
> If a filter can prune 90% of data in the cache, we will have 90% of lookup
> requests that can never be cached
> and hit directly to the databases. That means the cache is meaningless in
> this case.
>
> IMO, Flink SQL has provided a standard way to do filters and projects
> pushdown, i.e., SupportsFilterPushDown and SupportsProjectionPushDown.
> Jdbc/hive/HBase haven't implemented the interfaces, don't mean it's hard to
> implement.
> They should implement the pushdown interfaces to reduce IO and the cache
> size.
> That should be a final state that the scan source and lookup source share
> the exact pushdown implementation.
> I don't see why we need to duplicate the pushdown logic in caches, which
> will complex the lookup join design.
>
> 3) ALL cache abstraction
> All cache might be the most challenging part of this FLIP. We have never
> provided a reload-lookup public interface.
> Currently, we put the reload logic in the "eval" method of TableFunction.
> That's hard for some sources (e.g., Hive).
> Ideally, connector implementation should share the logic of reload and
> scan, i.e. ScanTableSource with InputFormat/SourceFunction/FLIP-27 Source.
> However, InputFormat/SourceFunction are deprecated, and the FLIP-27 source
> is deeply coupled with SourceOperator.
> If we want to invoke the FLIP-27 source in LookupJoin, this may make the
> scope of this FLIP much larger.
> We are still investigating how to abstract the ALL cache logic and reuse
> the existing source interfaces.
>
>
> Best,
> Jark
>
>
>
> On Thu, 5 May 2022 at 20:22, Roman Boyko <ro...@gmail.com> wrote:
>
> > It's a much more complicated activity and lies out of the scope of this
> > improvement. Because such pushdowns should be done for all ScanTableSource
> > implementations (not only for Lookup ones).
> >
> > On Thu, 5 May 2022 at 19:02, Martijn Visser <ma...@apache.org>
> > wrote:
> >
> >> Hi everyone,
> >>
> >> One question regarding "And Alexander correctly mentioned that filter
> >> pushdown still is not implemented for jdbc/hive/hbase." -> Would an
> >> alternative solution be to actually implement these filter pushdowns? I
> >> can
> >> imagine that there are many more benefits to doing that, outside of lookup
> >> caching and metrics.
> >>
> >> Best regards,
> >>
> >> Martijn Visser
> >> https://twitter.com/MartijnVisser82
> >> https://github.com/MartijnVisser
> >>
> >>
> >> On Thu, 5 May 2022 at 13:58, Roman Boyko <ro...@gmail.com> wrote:
> >>
> >> > Hi everyone!
> >> >
> >> > Thanks for driving such a valuable improvement!
> >> >
> >> > I do think that single cache implementation would be a nice opportunity
> >> for
> >> > users. And it will break the "FOR SYSTEM_TIME AS OF proc_time" semantics
> >> > anyway - doesn't matter how it will be implemented.
> >> >
> >> > Putting myself in the user's shoes, I can say that:
> >> > 1) I would prefer to have the opportunity to cut off the cache size by
> >> > simply filtering unnecessary data. And the most handy way to do it is
> >> apply
> >> > it inside LookupRunners. It would be a bit harder to pass it through the
> >> > LookupJoin node to TableFunction. And Alexander correctly mentioned that
> >> > filter pushdown still is not implemented for jdbc/hive/hbase.
> >> > 2) The ability to set the different caching parameters for different
> >> tables
> >> > is quite important. So I would prefer to set it through DDL rather than
> >> > have similar ttla, strategy and other options for all lookup tables.
> >> > 3) Providing the cache into the framework really deprives us of
> >> > extensibility (users won't be able to implement their own cache). But
> >> most
> >> > probably it might be solved by creating more different cache strategies
> >> and
> >> > a wider set of configurations.
> >> >
> >> > All these points are much closer to the schema proposed by Alexander.
> >> > Qingshen Ren, please correct me if I'm not right and all these
> >> facilities
> >> > might be simply implemented in your architecture?
> >> >
> >> > Best regards,
> >> > Roman Boyko
> >> > e.: ro.v.boyko@gmail.com
> >> >
> >> > On Wed, 4 May 2022 at 21:01, Martijn Visser <ma...@apache.org>
> >> > wrote:
> >> >
> >> > > Hi everyone,
> >> > >
> >> > > I don't have much to chip in, but just wanted to express that I really
> >> > > appreciate the in-depth discussion on this topic and I hope that
> >> others
> >> > > will join the conversation.
> >> > >
> >> > > Best regards,
> >> > >
> >> > > Martijn
> >> > >
> >> > > On Tue, 3 May 2022 at 10:15, Александр Смирнов <sm...@gmail.com>
> >> > > wrote:
> >> > >
> >> > > > Hi Qingsheng, Leonard and Jark,
> >> > > >
> >> > > > Thanks for your detailed feedback! However, I have questions about
> >> > > > some of your statements (maybe I didn't get something?).
> >> > > >
> >> > > > > Caching actually breaks the semantic of "FOR SYSTEM_TIME AS OF
> >> > > proc_time”
> >> > > >
> >> > > > I agree that the semantics of "FOR SYSTEM_TIME AS OF proc_time" is
> >> not
> >> > > > fully implemented with caching, but as you said, users go on it
> >> > > > consciously to achieve better performance (no one proposed to enable
> >> > > > caching by default, etc.). Or by users do you mean other developers
> >> of
> >> > > > connectors? In this case developers explicitly specify whether their
> >> > > > connector supports caching or not (in the list of supported
> >> options),
> >> > > > no one makes them do that if they don't want to. So what exactly is
> >> > > > the difference between implementing caching in modules
> >> > > > flink-table-runtime and in flink-table-common from the considered
> >> > > > point of view? How does it affect on breaking/non-breaking the
> >> > > > semantics of "FOR SYSTEM_TIME AS OF proc_time"?
> >> > > >
> >> > > > > confront a situation that allows table options in DDL to control
> >> the
> >> > > > behavior of the framework, which has never happened previously and
> >> > should
> >> > > > be cautious
> >> > > >
> >> > > > If we talk about main differences of semantics of DDL options and
> >> > > > config options("table.exec.xxx"), isn't it about limiting the scope
> >> of
> >> > > > the options + importance for the user business logic rather than
> >> > > > specific location of corresponding logic in the framework? I mean
> >> that
> >> > > > in my design, for example, putting an option with lookup cache
> >> > > > strategy in configurations would  be the wrong decision, because it
> >> > > > directly affects the user's business logic (not just performance
> >> > > > optimization) + touches just several functions of ONE table (there
> >> can
> >> > > > be multiple tables with different caches). Does it really matter for
> >> > > > the user (or someone else) where the logic is located, which is
> >> > > > affected by the applied option?
> >> > > > Also I can remember DDL option 'sink.parallelism', which in some way
> >> > > > "controls the behavior of the framework" and I don't see any problem
> >> > > > here.
> >> > > >
> >> > > > > introduce a new interface for this all-caching scenario and the
> >> > design
> >> > > > would become more complex
> >> > > >
> >> > > > This is a subject for a separate discussion, but actually in our
> >> > > > internal version we solved this problem quite easily - we reused
> >> > > > InputFormat class (so there is no need for a new API). The point is
> >> > > > that currently all lookup connectors use InputFormat for scanning
> >> the
> >> > > > data in batch mode: HBase, JDBC and even Hive - it uses class
> >> > > > PartitionReader, that is actually just a wrapper around InputFormat.
> >> > > > The advantage of this solution is the ability to reload cache data
> >> in
> >> > > > parallel (number of threads depends on number of InputSplits, but
> >> has
> >> > > > an upper limit). As a result cache reload time significantly reduces
> >> > > > (as well as time of input stream blocking). I know that usually we
> >> try
> >> > > > to avoid usage of concurrency in Flink code, but maybe this one can
> >> be
> >> > > > an exception. BTW I don't say that it's an ideal solution, maybe
> >> there
> >> > > > are better ones.
> >> > > >
> >> > > > > Providing the cache in the framework might introduce compatibility
> >> > > issues
> >> > > >
> >> > > > It's possible only in cases when the developer of the connector
> >> won't
> >> > > > properly refactor his code and will use new cache options
> >> incorrectly
> >> > > > (i.e. explicitly provide the same options into 2 different code
> >> > > > places). For correct behavior all he will need to do is to redirect
> >> > > > existing options to the framework's LookupConfig (+ maybe add an
> >> alias
> >> > > > for options, if there was different naming), everything will be
> >> > > > transparent for users. If the developer won't do refactoring at all,
> >> > > > nothing will be changed for the connector because of backward
> >> > > > compatibility. Also if a developer wants to use his own cache logic,
> >> > > > he just can refuse to pass some of the configs into the framework,
> >> and
> >> > > > instead make his own implementation with already existing configs
> >> and
> >> > > > metrics (but actually I think that it's a rare case).
> >> > > >
> >> > > > > filters and projections should be pushed all the way down to the
> >> > table
> >> > > > function, like what we do in the scan source
> >> > > >
> >> > > > It's the great purpose. But the truth is that the ONLY connector
> >> that
> >> > > > supports filter pushdown is FileSystemTableSource
> >> > > > (no database connector supports it currently). Also for some
> >> databases
> >> > > > it's simply impossible to pushdown such complex filters that we have
> >> > > > in Flink.
> >> > > >
> >> > > > >  only applying these optimizations to the cache seems not quite
> >> > useful
> >> > > >
> >> > > > Filters can cut off an arbitrarily large amount of data from the
> >> > > > dimension table. For a simple example, suppose in dimension table
> >> > > > 'users'
> >> > > > we have column 'age' with values from 20 to 40, and input stream
> >> > > > 'clicks' that is ~uniformly distributed by age of users. If we have
> >> > > > filter 'age > 30',
> >> > > > there will be twice less data in cache. This means the user can
> >> > > > increase 'lookup.cache.max-rows' by almost 2 times. It will gain a
> >> > > > huge
> >> > > > performance boost. Moreover, this optimization starts to really
> >> shine
> >> > > > in 'ALL' cache, where tables without filters and projections can't
> >> fit
> >> > > > in memory, but with them - can. This opens up additional
> >> possibilities
> >> > > > for users. And this doesn't sound as 'not quite useful'.
> >> > > >
> >> > > > It would be great to hear other voices regarding this topic! Because
> >> > > > we have quite a lot of controversial points, and I think with the
> >> help
> >> > > > of others it will be easier for us to come to a consensus.
> >> > > >
> >> > > > Best regards,
> >> > > > Smirnov Alexander
> >> > > >
> >> > > >
> >> > > > пт, 29 апр. 2022 г. в 22:33, Qingsheng Ren <re...@gmail.com>:
> >> > > > >
> >> > > > > Hi Alexander and Arvid,
> >> > > > >
> >> > > > > Thanks for the discussion and sorry for my late response! We had
> >> an
> >> > > > internal discussion together with Jark and Leonard and I’d like to
> >> > > > summarize our ideas. Instead of implementing the cache logic in the
> >> > table
> >> > > > runtime layer or wrapping around the user-provided table function,
> >> we
> >> > > > prefer to introduce some new APIs extending TableFunction with these
> >> > > > concerns:
> >> > > > >
> >> > > > > 1. Caching actually breaks the semantic of "FOR SYSTEM_TIME AS OF
> >> > > > proc_time”, because it couldn’t truly reflect the content of the
> >> lookup
> >> > > > table at the moment of querying. If users choose to enable caching
> >> on
> >> > the
> >> > > > lookup table, they implicitly indicate that this breakage is
> >> acceptable
> >> > > in
> >> > > > exchange for the performance. So we prefer not to provide caching on
> >> > the
> >> > > > table runtime level.
> >> > > > >
> >> > > > > 2. If we make the cache implementation in the framework (whether
> >> in a
> >> > > > runner or a wrapper around TableFunction), we have to confront a
> >> > > situation
> >> > > > that allows table options in DDL to control the behavior of the
> >> > > framework,
> >> > > > which has never happened previously and should be cautious. Under
> >> the
> >> > > > current design the behavior of the framework should only be
> >> specified
> >> > by
> >> > > > configurations (“table.exec.xxx”), and it’s hard to apply these
> >> general
> >> > > > configs to a specific table.
> >> > > > >
> >> > > > > 3. We have use cases that lookup source loads and refresh all
> >> records
> >> > > > periodically into the memory to achieve high lookup performance
> >> (like
> >> > > Hive
> >> > > > connector in the community, and also widely used by our internal
> >> > > > connectors). Wrapping the cache around the user’s TableFunction
> >> works
> >> > > fine
> >> > > > for LRU caches, but I think we have to introduce a new interface for
> >> > this
> >> > > > all-caching scenario and the design would become more complex.
> >> > > > >
> >> > > > > 4. Providing the cache in the framework might introduce
> >> compatibility
> >> > > > issues to existing lookup sources like there might exist two caches
> >> > with
> >> > > > totally different strategies if the user incorrectly configures the
> >> > table
> >> > > > (one in the framework and another implemented by the lookup source).
> >> > > > >
> >> > > > > As for the optimization mentioned by Alexander, I think filters
> >> and
> >> > > > projections should be pushed all the way down to the table function,
> >> > like
> >> > > > what we do in the scan source, instead of the runner with the cache.
> >> > The
> >> > > > goal of using cache is to reduce the network I/O and pressure on the
> >> > > > external system, and only applying these optimizations to the cache
> >> > seems
> >> > > > not quite useful.
> >> > > > >
> >> > > > > I made some updates to the FLIP[1] to reflect our ideas. We
> >> prefer to
> >> > > > keep the cache implementation as a part of TableFunction, and we
> >> could
> >> > > > provide some helper classes (CachingTableFunction,
> >> > > AllCachingTableFunction,
> >> > > > CachingAsyncTableFunction) to developers and regulate metrics of the
> >> > > cache.
> >> > > > Also, I made a POC[2] for your reference.
> >> > > > >
> >> > > > > Looking forward to your ideas!
> >> > > > >
> >> > > > > [1]
> >> > > >
> >> > >
> >> >
> >> https://cwiki.apache.org/confluence/display/FLINK/FLIP-221+Abstraction+for+lookup+source+cache+and+metric
> >> > > > > [2] https://github.com/PatrickRen/flink/tree/FLIP-221
> >> > > > >
> >> > > > > Best regards,
> >> > > > >
> >> > > > > Qingsheng
> >> > > > >
> >> > > > > On Tue, Apr 26, 2022 at 4:45 PM Александр Смирнов <
> >> > > smiralexan@gmail.com>
> >> > > > wrote:
> >> > > > >>
> >> > > > >> Thanks for the response, Arvid!
> >> > > > >>
> >> > > > >> I have few comments on your message.
> >> > > > >>
> >> > > > >> > but could also live with an easier solution as the first step:
> >> > > > >>
> >> > > > >> I think that these 2 ways are mutually exclusive (originally
> >> > proposed
> >> > > > >> by Qingsheng and mine), because conceptually they follow the same
> >> > > > >> goal, but implementation details are different. If we will go one
> >> > way,
> >> > > > >> moving to another way in the future will mean deleting existing
> >> code
> >> > > > >> and once again changing the API for connectors. So I think we
> >> should
> >> > > > >> reach a consensus with the community about that and then work
> >> > together
> >> > > > >> on this FLIP, i.e. divide the work on tasks for different parts
> >> of
> >> > the
> >> > > > >> flip (for example, LRU cache unification / introducing proposed
> >> set
> >> > of
> >> > > > >> metrics / further work…). WDYT, Qingsheng?
> >> > > > >>
> >> > > > >> > as the source will only receive the requests after filter
> >> > > > >>
> >> > > > >> Actually if filters are applied to fields of the lookup table, we
> >> > > > >> firstly must do requests, and only after that we can filter
> >> > responses,
> >> > > > >> because lookup connectors don't have filter pushdown. So if
> >> > filtering
> >> > > > >> is done before caching, there will be much less rows in cache.
> >> > > > >>
> >> > > > >> > @Alexander unfortunately, your architecture is not shared. I
> >> don't
> >> > > > know the
> >> > > > >>
> >> > > > >> > solution to share images to be honest.
> >> > > > >>
> >> > > > >> Sorry for that, I’m a bit new to such kinds of conversations :)
> >> > > > >> I have no write access to the confluence, so I made a Jira issue,
> >> > > > >> where described the proposed changes in more details -
> >> > > > >> https://issues.apache.org/jira/browse/FLINK-27411.
> >> > > > >>
> >> > > > >> Will happy to get more feedback!
> >> > > > >>
> >> > > > >> Best,
> >> > > > >> Smirnov Alexander
> >> > > > >>
> >> > > > >> пн, 25 апр. 2022 г. в 19:49, Arvid Heise <ar...@apache.org>:
> >> > > > >> >
> >> > > > >> > Hi Qingsheng,
> >> > > > >> >
> >> > > > >> > Thanks for driving this; the inconsistency was not satisfying
> >> for
> >> > > me.
> >> > > > >> >
> >> > > > >> > I second Alexander's idea though but could also live with an
> >> > easier
> >> > > > >> > solution as the first step: Instead of making caching an
> >> > > > implementation
> >> > > > >> > detail of TableFunction X, rather devise a caching layer
> >> around X.
> >> > > So
> >> > > > the
> >> > > > >> > proposal would be a CachingTableFunction that delegates to X in
> >> > case
> >> > > > of
> >> > > > >> > misses and else manages the cache. Lifting it into the operator
> >> > > model
> >> > > > as
> >> > > > >> > proposed would be even better but is probably unnecessary in
> >> the
> >> > > > first step
> >> > > > >> > for a lookup source (as the source will only receive the
> >> requests
> >> > > > after
> >> > > > >> > filter; applying projection may be more interesting to save
> >> > memory).
> >> > > > >> >
> >> > > > >> > Another advantage is that all the changes of this FLIP would be
> >> > > > limited to
> >> > > > >> > options, no need for new public interfaces. Everything else
> >> > remains
> >> > > an
> >> > > > >> > implementation of Table runtime. That means we can easily
> >> > > incorporate
> >> > > > the
> >> > > > >> > optimization potential that Alexander pointed out later.
> >> > > > >> >
> >> > > > >> > @Alexander unfortunately, your architecture is not shared. I
> >> don't
> >> > > > know the
> >> > > > >> > solution to share images to be honest.
> >> > > > >> >
> >> > > > >> > On Fri, Apr 22, 2022 at 5:04 PM Александр Смирнов <
> >> > > > smiralexan@gmail.com>
> >> > > > >> > wrote:
> >> > > > >> >
> >> > > > >> > > Hi Qingsheng! My name is Alexander, I'm not a committer yet,
> >> but
> >> > > I'd
> >> > > > >> > > really like to become one. And this FLIP really interested
> >> me.
> >> > > > >> > > Actually I have worked on a similar feature in my company’s
> >> > Flink
> >> > > > >> > > fork, and we would like to share our thoughts on this and
> >> make
> >> > > code
> >> > > > >> > > open source.
> >> > > > >> > >
> >> > > > >> > > I think there is a better alternative than introducing an
> >> > abstract
> >> > > > >> > > class for TableFunction (CachingTableFunction). As you know,
> >> > > > >> > > TableFunction exists in the flink-table-common module, which
> >> > > > provides
> >> > > > >> > > only an API for working with tables – it’s very convenient
> >> for
> >> > > > importing
> >> > > > >> > > in connectors. In turn, CachingTableFunction contains logic
> >> for
> >> > > > >> > > runtime execution,  so this class and everything connected
> >> with
> >> > it
> >> > > > >> > > should be located in another module, probably in
> >> > > > flink-table-runtime.
> >> > > > >> > > But this will require connectors to depend on another module,
> >> > > which
> >> > > > >> > > contains a lot of runtime logic, which doesn’t sound good.
> >> > > > >> > >
> >> > > > >> > > I suggest adding a new method ‘getLookupConfig’ to
> >> > > LookupTableSource
> >> > > > >> > > or LookupRuntimeProvider to allow connectors to only pass
> >> > > > >> > > configurations to the planner, therefore they won’t depend on
> >> > > > runtime
> >> > > > >> > > realization. Based on these configs planner will construct a
> >> > > lookup
> >> > > > >> > > join operator with corresponding runtime logic
> >> (ProcessFunctions
> >> > > in
> >> > > > >> > > module flink-table-runtime). Architecture looks like in the
> >> > pinned
> >> > > > >> > > image (LookupConfig class there is actually yours
> >> CacheConfig).
> >> > > > >> > >
> >> > > > >> > > Classes in flink-table-planner, that will be responsible for
> >> > this
> >> > > –
> >> > > > >> > > CommonPhysicalLookupJoin and its inheritors.
> >> > > > >> > > Current classes for lookup join in  flink-table-runtime  -
> >> > > > >> > > LookupJoinRunner, AsyncLookupJoinRunner,
> >> > LookupJoinRunnerWithCalc,
> >> > > > >> > > AsyncLookupJoinRunnerWithCalc.
> >> > > > >> > >
> >> > > > >> > > I suggest adding classes LookupJoinCachingRunner,
> >> > > > >> > > LookupJoinCachingRunnerWithCalc, etc.
> >> > > > >> > >
> >> > > > >> > > And here comes another more powerful advantage of such a
> >> > solution.
> >> > > > If
> >> > > > >> > > we have caching logic on a lower level, we can apply some
> >> > > > >> > > optimizations to it. LookupJoinRunnerWithCalc was named like
> >> > this
> >> > > > >> > > because it uses the ‘calc’ function, which actually mostly
> >> > > consists
> >> > > > of
> >> > > > >> > > filters and projections.
> >> > > > >> > >
> >> > > > >> > > For example, in join table A with lookup table B condition
> >> > ‘JOIN …
> >> > > > ON
> >> > > > >> > > A.id = B.id AND A.age = B.age + 10 WHERE B.salary > 1000’
> >> > ‘calc’
> >> > > > >> > > function will contain filters A.age = B.age + 10 and
> >> B.salary >
> >> > > > 1000.
> >> > > > >> > >
> >> > > > >> > > If we apply this function before storing records in cache,
> >> size
> >> > of
> >> > > > >> > > cache will be significantly reduced: filters = avoid storing
> >> > > useless
> >> > > > >> > > records in cache, projections = reduce records’ size. So the
> >> > > initial
> >> > > > >> > > max number of records in cache can be increased by the user.
> >> > > > >> > >
> >> > > > >> > > What do you think about it?
> >> > > > >> > >
> >> > > > >> > >
> >> > > > >> > > On 2022/04/19 02:47:11 Qingsheng Ren wrote:
> >> > > > >> > > > Hi devs,
> >> > > > >> > > >
> >> > > > >> > > > Yuan and I would like to start a discussion about
> >> FLIP-221[1],
> >> > > > which
> >> > > > >> > > introduces an abstraction of lookup table cache and its
> >> standard
> >> > > > metrics.
> >> > > > >> > > >
> >> > > > >> > > > Currently each lookup table source should implement their
> >> own
> >> > > > cache to
> >> > > > >> > > store lookup results, and there isn’t a standard of metrics
> >> for
> >> > > > users and
> >> > > > >> > > developers to tuning their jobs with lookup joins, which is a
> >> > > quite
> >> > > > common
> >> > > > >> > > use case in Flink table / SQL.
> >> > > > >> > > >
> >> > > > >> > > > Therefore we propose some new APIs including cache,
> >> metrics,
> >> > > > wrapper
> >> > > > >> > > classes of TableFunction and new table options. Please take a
> >> > look
> >> > > > at the
> >> > > > >> > > FLIP page [1] to get more details. Any suggestions and
> >> comments
> >> > > > would be
> >> > > > >> > > appreciated!
> >> > > > >> > > >
> >> > > > >> > > > [1]
> >> > > > >> > >
> >> > > >
> >> > >
> >> >
> >> https://cwiki.apache.org/confluence/display/FLINK/FLIP-221+Abstraction+for+lookup+source+cache+and+metric
> >> > > > >> > > >
> >> > > > >> > > > Best regards,
> >> > > > >> > > >
> >> > > > >> > > > Qingsheng
> >> > > > >> > > >
> >> > > > >> > > >
> >> > > > >> > >
> >> > > > >
> >> > > > >
> >> > > > >
> >> > > > > --
> >> > > > > Best Regards,
> >> > > > >
> >> > > > > Qingsheng Ren
> >> > > > >
> >> > > > > Real-time Computing Team
> >> > > > > Alibaba Cloud
> >> > > > >
> >> > > > > Email: renqschn@gmail.com
> >> > > >
> >> > >
> >> >
> >>
> >
> >
> > --
> > Best regards,
> > Roman Boyko
> > e.: ro.v.boyko@gmail.com
> >

Re: [DISCUSS] FLIP-221 Abstraction for lookup source cache and metric

Posted by Jark Wu <im...@gmail.com>.
It's great to see the active discussion! I want to share my ideas:

1) implement the cache in framework vs. connectors base
I don't have a strong opinion on this. Both ways should work (e.g., cache
pruning, compatibility).
The framework way can provide more concise interfaces.
The connector base way can define more flexible cache
strategies/implementations.
We are still investigating a way to see if we can have both advantages.
We should reach a consensus that the way should be a final state, and we
are on the path to it.

2) filters and projections pushdown:
I agree with Alex that filter pushdown into the cache can benefit the ALL
cache a lot.
However, this is not true for an LRU cache. Connectors use a cache to
reduce I/O requests to databases for better throughput.
If a filter prunes 90% of the data in the cache, then 90% of lookup
requests can never be served from the cache and will hit the databases
directly. That makes the cache meaningless in this case.

IMO, Flink SQL already provides a standard way to push down filters and
projections, i.e., SupportsFilterPushDown and SupportsProjectionPushDown.
That JDBC/Hive/HBase haven't implemented these interfaces doesn't mean
they are hard to implement.
They should implement the pushdown interfaces to reduce I/O and the cache
size.
The final state should be that the scan source and the lookup source share
the exact same pushdown implementation.
I don't see why we need to duplicate the pushdown logic in caches, which
would complicate the lookup join design.
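For readers following along, the "apply the calc function before caching"
idea being debated here can be sketched in plain Java. All names below are
hypothetical illustrations, not Flink API: the point is only that filtering
looked-up rows with the residual predicate before they enter the cache
means rows that can never match a query are never stored, so a configured
maximum cache size stretches further.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

/** Minimal sketch (hypothetical names, not Flink API) of filtering
 *  lookup results before they are inserted into a lookup cache. */
public class FilterBeforeCacheSketch {

    /** Keeps only the rows passing the residual filter, e.g. B.salary > 1000,
     *  so the cache never stores rows the join would discard anyway. */
    static List<int[]> cacheableRows(List<int[]> lookedUp, Predicate<int[]> calcFilter) {
        List<int[]> kept = new ArrayList<>();
        for (int[] row : lookedUp) {
            if (calcFilter.test(row)) {
                kept.add(row);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // rows encoded as {id, salary}
        List<int[]> fromDb = Arrays.asList(
            new int[] {1, 500}, new int[] {1, 1500}, new int[] {1, 2000});
        Predicate<int[]> salaryFilter = row -> row[1] > 1000;
        List<int[]> cached = cacheableRows(fromDb, salaryFilter);
        // Only 2 of 3 looked-up rows are stored; the freed space can
        // hold entries for additional distinct keys.
        System.out.println(cached.size()); // prints 2
    }
}
```

Note this sketch is neutral on where the filter runs: the same predicate
could equally be pushed into the source via SupportsFilterPushDown, which
is the alternative argued for above.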

3) ALL cache abstraction
The ALL cache might be the most challenging part of this FLIP. We have
never provided a public reload-lookup interface.
Currently, we put the reload logic in the "eval" method of TableFunction.
That's hard for some sources (e.g., Hive).
Ideally, a connector implementation should share the reload and scan
logic, i.e., ScanTableSource with InputFormat/SourceFunction/FLIP-27 Source.
However, InputFormat/SourceFunction are deprecated, and the FLIP-27 source
is deeply coupled with SourceOperator.
If we want to invoke the FLIP-27 source in LookupJoin, this may make the
scope of this FLIP much larger.
We are still investigating how to abstract the ALL cache logic and reuse
the existing source interfaces.
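To make the reload-based ALL cache concrete, here is a stdlib-only sketch
(every name is hypothetical; no Flink types, and scanAll merely stands in
for an InputFormat/scan-source read): the whole table is re-read once a TTL
expires, and lookups are served purely from memory in between.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Supplier;

/** Hypothetical sketch of an ALL cache, not the FLIP's actual API. */
public class AllCacheSketch {
    private final Supplier<List<int[]>> scanAll; // rows encoded as {key, value}
    private final long ttlMillis;
    private Map<Integer, List<int[]>> cache = new HashMap<>();
    private long lastLoad = Long.MIN_VALUE;

    AllCacheSketch(Supplier<List<int[]>> scanAll, long ttlMillis) {
        this.scanAll = scanAll;
        this.ttlMillis = ttlMillis;
    }

    /** Lookup never touches the external system; it only reads the cache,
     *  triggering a full reload when the TTL has expired. */
    List<int[]> lookup(int key, long nowMillis) {
        if (lastLoad == Long.MIN_VALUE || nowMillis - lastLoad >= ttlMillis) {
            reload(nowMillis);
        }
        return cache.getOrDefault(key, Collections.emptyList());
    }

    /** Full reload: rebuild the key index from a complete scan of the table. */
    private void reload(long nowMillis) {
        Map<Integer, List<int[]>> fresh = new HashMap<>();
        for (int[] row : scanAll.get()) {
            fresh.computeIfAbsent(row[0], k -> new ArrayList<>()).add(row);
        }
        cache = fresh;
        lastLoad = nowMillis;
    }

    public static void main(String[] args) {
        AllCacheSketch c = new AllCacheSketch(
            () -> Arrays.asList(new int[] {1, 10}, new int[] {2, 20}), 1000L);
        System.out.println(c.lookup(1, 0L).size());   // 1: loaded on first access
        System.out.println(c.lookup(3, 500L).size()); // 0: key 3 absent, no reload yet
    }
}
```

The open question above is exactly the scanAll seam: whether it should be
backed by InputFormat, SourceFunction, or a FLIP-27 source.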


Best,
Jark



On Thu, 5 May 2022 at 20:22, Roman Boyko <ro...@gmail.com> wrote:

> It's a much more complicated activity and lies out of the scope of this
> improvement. Because such pushdowns should be done for all ScanTableSource
> implementations (not only for Lookup ones).
>
> On Thu, 5 May 2022 at 19:02, Martijn Visser <ma...@apache.org>
> wrote:
>
>> Hi everyone,
>>
>> One question regarding "And Alexander correctly mentioned that filter
>> pushdown still is not implemented for jdbc/hive/hbase." -> Would an
>> alternative solution be to actually implement these filter pushdowns? I
>> can
>> imagine that there are many more benefits to doing that, outside of lookup
>> caching and metrics.
>>
>> Best regards,
>>
>> Martijn Visser
>> https://twitter.com/MartijnVisser82
>> https://github.com/MartijnVisser
>>
>>
>> On Thu, 5 May 2022 at 13:58, Roman Boyko <ro...@gmail.com> wrote:
>>
>> > Hi everyone!
>> >
>> > Thanks for driving such a valuable improvement!
>> >
>> > I do think that single cache implementation would be a nice opportunity
>> for
>> > users. And it will break the "FOR SYSTEM_TIME AS OF proc_time" semantics
>> > anyway - doesn't matter how it will be implemented.
>> >
>> > Putting myself in the user's shoes, I can say that:
>> > 1) I would prefer to have the opportunity to cut off the cache size by
>> > simply filtering unnecessary data. And the most handy way to do it is
>> apply
>> > it inside LookupRunners. It would be a bit harder to pass it through the
>> > LookupJoin node to TableFunction. And Alexander correctly mentioned that
>> > filter pushdown still is not implemented for jdbc/hive/hbase.
>> > 2) The ability to set the different caching parameters for different
>> tables
>> > is quite important. So I would prefer to set it through DDL rather than
>> > have similar TTL, strategy and other options for all lookup tables.
>> > 3) Providing the cache into the framework really deprives us of
>> > extensibility (users won't be able to implement their own cache). But
>> most
>> > probably it might be solved by creating more different cache strategies
>> and
>> > a wider set of configurations.
>> >
>> > All these points are much closer to the schema proposed by Alexander.
>> > Qingshen Ren, please correct me if I'm not right and all these
>> facilities
>> > might be simply implemented in your architecture?
>> >
>> > Best regards,
>> > Roman Boyko
>> > e.: ro.v.boyko@gmail.com
>> >
>> > On Wed, 4 May 2022 at 21:01, Martijn Visser <ma...@apache.org>
>> > wrote:
>> >
>> > > Hi everyone,
>> > >
>> > > I don't have much to chip in, but just wanted to express that I really
>> > > appreciate the in-depth discussion on this topic and I hope that
>> others
>> > > will join the conversation.
>> > >
>> > > Best regards,
>> > >
>> > > Martijn
>> > >
>> > > On Tue, 3 May 2022 at 10:15, Александр Смирнов <sm...@gmail.com>
>> > > wrote:
>> > >
>> > > > Hi Qingsheng, Leonard and Jark,
>> > > >
>> > > > Thanks for your detailed feedback! However, I have questions about
>> > > > some of your statements (maybe I didn't get something?).
>> > > >
>> > > > > Caching actually breaks the semantic of "FOR SYSTEM_TIME AS OF
>> > > proc_time”
>> > > >
>> > > > I agree that the semantics of "FOR SYSTEM_TIME AS OF proc_time" is
>> not
>> > > > fully implemented with caching, but as you said, users go on it
>> > > > consciously to achieve better performance (no one proposed to enable
>> > > > caching by default, etc.). Or by users do you mean other developers
>> of
>> > > > connectors? In this case developers explicitly specify whether their
>> > > > connector supports caching or not (in the list of supported
>> options),
>> > > > no one makes them do that if they don't want to. So what exactly is
>> > > > the difference between implementing caching in modules
>> > > > flink-table-runtime and in flink-table-common from the considered
>> > > > point of view? How does it affect breaking/non-breaking the
>> > > > semantics of "FOR SYSTEM_TIME AS OF proc_time"?
>> > > >
>> > > > > confront a situation that allows table options in DDL to control
>> the
>> > > > behavior of the framework, which has never happened previously and
>> > should
>> > > > be cautious
>> > > >
>> > > > If we talk about main differences of semantics of DDL options and
>> > > > config options("table.exec.xxx"), isn't it about limiting the scope
>> of
>> > > > the options + importance for the user business logic rather than
>> > > > specific location of corresponding logic in the framework? I mean
>> that
>> > > > in my design, for example, putting an option with lookup cache
>> > > > strategy in configurations would  be the wrong decision, because it
>> > > > directly affects the user's business logic (not just performance
>> > > > optimization) + touches just several functions of ONE table (there
>> can
>> > > > be multiple tables with different caches). Does it really matter for
>> > > > the user (or someone else) where the logic is located, which is
>> > > > affected by the applied option?
>> > > > Also I can remember DDL option 'sink.parallelism', which in some way
>> > > > "controls the behavior of the framework" and I don't see any problem
>> > > > here.
>> > > >
>> > > > > introduce a new interface for this all-caching scenario and the
>> > design
>> > > > would become more complex
>> > > >
>> > > > This is a subject for a separate discussion, but actually in our
>> > > > internal version we solved this problem quite easily - we reused
>> > > > InputFormat class (so there is no need for a new API). The point is
>> > > > that currently all lookup connectors use InputFormat for scanning
>> the
>> > > > data in batch mode: HBase, JDBC and even Hive - it uses class
>> > > > PartitionReader, that is actually just a wrapper around InputFormat.
>> > > > The advantage of this solution is the ability to reload cache data
>> in
>> > > > parallel (number of threads depends on number of InputSplits, but
>> has
>> > > > an upper limit). As a result cache reload time significantly reduces
>> > > > (as well as time of input stream blocking). I know that usually we
>> try
>> > > > to avoid usage of concurrency in Flink code, but maybe this one can
>> be
>> > > > an exception. BTW I don't say that it's an ideal solution, maybe
>> there
>> > > > are better ones.
>> > > >
>> > > > > Providing the cache in the framework might introduce compatibility
>> > > issues
>> > > >
>> > > > It's possible only in cases when the developer of the connector
>> won't
>> > > > properly refactor his code and will use new cache options
>> incorrectly
>> > > > (i.e. explicitly provide the same options into 2 different code
>> > > > places). For correct behavior all he will need to do is to redirect
>> > > > existing options to the framework's LookupConfig (+ maybe add an
>> alias
>> > > > for options, if there was different naming), everything will be
>> > > > transparent for users. If the developer won't do refactoring at all,
>> > > > nothing will be changed for the connector because of backward
>> > > > compatibility. Also if a developer wants to use his own cache logic,
>> > > > he just can refuse to pass some of the configs into the framework,
>> and
>> > > > instead make his own implementation with already existing configs
>> and
>> > > > metrics (but actually I think that it's a rare case).
>> > > >
>> > > > > filters and projections should be pushed all the way down to the
>> > table
>> > > > function, like what we do in the scan source
>> > > >
>> > > > It's the great purpose. But the truth is that the ONLY connector
>> that
>> > > > supports filter pushdown is FileSystemTableSource
>> > > > (no database connector supports it currently). Also for some
>> databases
>> > > > it's simply impossible to pushdown such complex filters that we have
>> > > > in Flink.
>> > > >
>> > > > >  only applying these optimizations to the cache seems not quite
>> > useful
>> > > >
>> > > > Filters can cut off an arbitrarily large amount of data from the
>> > > > dimension table. For a simple example, suppose in dimension table
>> > > > 'users'
>> > > > we have column 'age' with values from 20 to 40, and input stream
>> > > > 'clicks' that is ~uniformly distributed by age of users. If we have
>> > > > filter 'age > 30',
>> > > > there will be twice less data in cache. This means the user can
>> > > > increase 'lookup.cache.max-rows' by almost 2 times. It will gain a
>> > > > huge
>> > > > performance boost. Moreover, this optimization starts to really
>> shine
>> > > > in 'ALL' cache, where tables without filters and projections can't
>> fit
>> > > > in memory, but with them - can. This opens up additional
>> possibilities
>> > > > for users. And this doesn't sound as 'not quite useful'.
>> > > >
>> > > > It would be great to hear other voices regarding this topic! Because
>> > > > we have quite a lot of controversial points, and I think with the
>> help
>> > > > of others it will be easier for us to come to a consensus.
>> > > >
>> > > > Best regards,
>> > > > Smirnov Alexander
>> > > >
>> > > >
>> > > > пт, 29 апр. 2022 г. в 22:33, Qingsheng Ren <re...@gmail.com>:
>> > > > >
>> > > > > Hi Alexander and Arvid,
>> > > > >
>> > > > > Thanks for the discussion and sorry for my late response! We had
>> an
>> > > > internal discussion together with Jark and Leonard and I’d like to
>> > > > summarize our ideas. Instead of implementing the cache logic in the
>> > table
>> > > > runtime layer or wrapping around the user-provided table function,
>> we
>> > > > prefer to introduce some new APIs extending TableFunction with these
>> > > > concerns:
>> > > > >
>> > > > > 1. Caching actually breaks the semantic of "FOR SYSTEM_TIME AS OF
>> > > > proc_time”, because it couldn’t truly reflect the content of the
>> lookup
>> > > > table at the moment of querying. If users choose to enable caching
>> on
>> > the
>> > > > lookup table, they implicitly indicate that this breakage is
>> acceptable
>> > > in
>> > > > exchange for the performance. So we prefer not to provide caching on
>> > the
>> > > > table runtime level.
>> > > > >
>> > > > > 2. If we make the cache implementation in the framework (whether
>> in a
>> > > > runner or a wrapper around TableFunction), we have to confront a
>> > > situation
>> > > > that allows table options in DDL to control the behavior of the
>> > > framework,
>> > > > which has never happened previously and should be cautious. Under
>> the
>> > > > current design the behavior of the framework should only be
>> specified
>> > by
>> > > > configurations (“table.exec.xxx”), and it’s hard to apply these
>> general
>> > > > configs to a specific table.
>> > > > >
>> > > > > 3. We have use cases that lookup source loads and refresh all
>> records
>> > > > periodically into the memory to achieve high lookup performance
>> (like
>> > > Hive
>> > > > connector in the community, and also widely used by our internal
>> > > > connectors). Wrapping the cache around the user’s TableFunction
>> works
>> > > fine
>> > > > for LRU caches, but I think we have to introduce a new interface for
>> > this
>> > > > all-caching scenario and the design would become more complex.
>> > > > >
>> > > > > 4. Providing the cache in the framework might introduce
>> compatibility
>> > > > issues to existing lookup sources like there might exist two caches
>> > with
>> > > > totally different strategies if the user incorrectly configures the
>> > table
>> > > > (one in the framework and another implemented by the lookup source).
>> > > > >
>> > > > > As for the optimization mentioned by Alexander, I think filters
>> and
>> > > > projections should be pushed all the way down to the table function,
>> > like
>> > > > what we do in the scan source, instead of the runner with the cache.
>> > The
>> > > > goal of using cache is to reduce the network I/O and pressure on the
>> > > > external system, and only applying these optimizations to the cache
>> > seems
>> > > > not quite useful.
>> > > > >
>> > > > > I made some updates to the FLIP[1] to reflect our ideas. We
>> prefer to
>> > > > keep the cache implementation as a part of TableFunction, and we
>> could
>> > > > provide some helper classes (CachingTableFunction,
>> > > AllCachingTableFunction,
>> > > > CachingAsyncTableFunction) to developers and regulate metrics of the
>> > > cache.
>> > > > Also, I made a POC[2] for your reference.
>> > > > >
>> > > > > Looking forward to your ideas!
>> > > > >
>> > > > > [1]
>> > > >
>> > >
>> >
>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-221+Abstraction+for+lookup+source+cache+and+metric
>> > > > > [2] https://github.com/PatrickRen/flink/tree/FLIP-221
>> > > > >
>> > > > > Best regards,
>> > > > >
>> > > > > Qingsheng
>> > > > >
>> > > > > On Tue, Apr 26, 2022 at 4:45 PM Александр Смирнов <
>> > > smiralexan@gmail.com>
>> > > > wrote:
>> > > > >>
>> > > > >> Thanks for the response, Arvid!
>> > > > >>
>> > > > >> I have few comments on your message.
>> > > > >>
>> > > > >> > but could also live with an easier solution as the first step:
>> > > > >>
>> > > > >> I think that these 2 ways are mutually exclusive (originally
>> > proposed
>> > > > >> by Qingsheng and mine), because conceptually they follow the same
>> > > > >> goal, but implementation details are different. If we will go one
>> > way,
>> > > > >> moving to another way in the future will mean deleting existing
>> code
>> > > > >> and once again changing the API for connectors. So I think we
>> should
>> > > > >> reach a consensus with the community about that and then work
>> > together
>> > > > >> on this FLIP, i.e. divide the work on tasks for different parts
>> of
>> > the
>> > > > >> flip (for example, LRU cache unification / introducing proposed
>> set
>> > of
>> > > > >> metrics / further work…). WDYT, Qingsheng?
>> > > > >>
>> > > > >> > as the source will only receive the requests after filter
>> > > > >>
>> > > > >> Actually if filters are applied to fields of the lookup table, we
>> > > > >> firstly must do requests, and only after that we can filter
>> > responses,
>> > > > >> because lookup connectors don't have filter pushdown. So if
>> > filtering
>> > > > >> is done before caching, there will be much less rows in cache.
>> > > > >>
>> > > > >> > @Alexander unfortunately, your architecture is not shared. I
>> don't
>> > > > know the
>> > > > >>
>> > > > >> > solution to share images to be honest.
>> > > > >>
>> > > > >> Sorry for that, I’m a bit new to such kinds of conversations :)
>> > > > >> I have no write access to the confluence, so I made a Jira issue,
>> > > > >> where described the proposed changes in more details -
>> > > > >> https://issues.apache.org/jira/browse/FLINK-27411.
>> > > > >>
>> > > > >> Will happy to get more feedback!
>> > > > >>
>> > > > >> Best,
>> > > > >> Smirnov Alexander
>> > > > >>
>> > > > >> пн, 25 апр. 2022 г. в 19:49, Arvid Heise <ar...@apache.org>:
>> > > > >> >
>> > > > >> > Hi Qingsheng,
>> > > > >> >
>> > > > >> > Thanks for driving this; the inconsistency was not satisfying
>> for
>> > > me.
>> > > > >> >
>> > > > >> > I second Alexander's idea though but could also live with an
>> > easier
>> > > > >> > solution as the first step: Instead of making caching an
>> > > > implementation
>> > > > >> > detail of TableFunction X, rather devise a caching layer
>> around X.
>> > > So
>> > > > the
>> > > > >> > proposal would be a CachingTableFunction that delegates to X in
>> > case
>> > > > of
>> > > > >> > misses and else manages the cache. Lifting it into the operator
>> > > model
>> > > > as
>> > > > >> > proposed would be even better but is probably unnecessary in
>> the
>> > > > first step
>> > > > >> > for a lookup source (as the source will only receive the
>> requests
>> > > > after
>> > > > >> > filter; applying projection may be more interesting to save
>> > memory).
>> > > > >> >
>> > > > >> > Another advantage is that all the changes of this FLIP would be
>> > > > limited to
>> > > > >> > options, no need for new public interfaces. Everything else
>> > remains
>> > > an
>> > > > >> > implementation of Table runtime. That means we can easily
>> > > incorporate
>> > > > the
>> > > > >> > optimization potential that Alexander pointed out later.
>> > > > >> >
>> > > > >> > @Alexander unfortunately, your architecture is not shared. I
>> don't
>> > > > know the
>> > > > >> > solution to share images to be honest.
>> > > > >> >
>> > > > >> > On Fri, Apr 22, 2022 at 5:04 PM Александр Смирнов <
>> > > > smiralexan@gmail.com>
>> > > > >> > wrote:
>> > > > >> >
>> > > > >> > > Hi Qingsheng! My name is Alexander, I'm not a committer yet,
>> but
>> > > I'd
>> > > > >> > > really like to become one. And this FLIP really interested
>> me.
>> > > > >> > > Actually I have worked on a similar feature in my company’s
>> > Flink
>> > > > >> > > fork, and we would like to share our thoughts on this and
>> make
>> > > code
>> > > > >> > > open source.
>> > > > >> > >
>> > > > >> > > I think there is a better alternative than introducing an
>> > abstract
>> > > > >> > > class for TableFunction (CachingTableFunction). As you know,
>> > > > >> > > TableFunction exists in the flink-table-common module, which
>> > > > provides
>> > > > >> > > only an API for working with tables – it’s very convenient
>> for
>> > > > importing
>> > > > >> > > in connectors. In turn, CachingTableFunction contains logic
>> for
>> > > > >> > > runtime execution,  so this class and everything connected
>> with
>> > it
>> > > > >> > > should be located in another module, probably in
>> > > > flink-table-runtime.
>> > > > >> > > But this will require connectors to depend on another module,
>> > > which
>> > > > >> > > contains a lot of runtime logic, which doesn’t sound good.
>> > > > >> > >
>> > > > >> > > I suggest adding a new method ‘getLookupConfig’ to
>> > > LookupTableSource
>> > > > >> > > or LookupRuntimeProvider to allow connectors to only pass
>> > > > >> > > configurations to the planner, therefore they won’t depend on
>> > > > runtime
>> > > > >> > > realization. Based on these configs planner will construct a
>> > > lookup
>> > > > >> > > join operator with corresponding runtime logic
>> (ProcessFunctions
>> > > in
>> > > > >> > > module flink-table-runtime). Architecture looks like in the
>> > pinned
>> > > > >> > > image (LookupConfig class there is actually yours
>> CacheConfig).
>> > > > >> > >
>> > > > >> > > Classes in flink-table-planner, that will be responsible for
>> > this
>> > > –
>> > > > >> > > CommonPhysicalLookupJoin and its inheritors.
>> > > > >> > > Current classes for lookup join in  flink-table-runtime  -
>> > > > >> > > LookupJoinRunner, AsyncLookupJoinRunner,
>> > LookupJoinRunnerWithCalc,
>> > > > >> > > AsyncLookupJoinRunnerWithCalc.
>> > > > >> > >
>> > > > >> > > I suggest adding classes LookupJoinCachingRunner,
>> > > > >> > > LookupJoinCachingRunnerWithCalc, etc.
>> > > > >> > >
>> > > > >> > > And here comes another more powerful advantage of such a
>> > solution.
>> > > > If
>> > > > >> > > we have caching logic on a lower level, we can apply some
>> > > > >> > > optimizations to it. LookupJoinRunnerWithCalc was named like
>> > this
>> > > > >> > > because it uses the ‘calc’ function, which actually mostly
>> > > consists
>> > > > of
>> > > > >> > > filters and projections.
>> > > > >> > >
>> > > > >> > > For example, in join table A with lookup table B condition
>> > ‘JOIN …
>> > > > ON
>> > > > >> > > A.id = B.id AND A.age = B.age + 10 WHERE B.salary > 1000’
>> > ‘calc’
>> > > > >> > > function will contain filters A.age = B.age + 10 and
>> B.salary >
>> > > > 1000.
>> > > > >> > >
>> > > > >> > > If we apply this function before storing records in cache,
>> size
>> > of
>> > > > >> > > cache will be significantly reduced: filters = avoid storing
>> > > useless
>> > > > >> > > records in cache, projections = reduce records’ size. So the
>> > > initial
>> > > > >> > > max number of records in cache can be increased by the user.
>> > > > >> > >
>> > > > >> > > What do you think about it?
>> > > > >> > >
>> > > > >> > >
>> > > > >> > > On 2022/04/19 02:47:11 Qingsheng Ren wrote:
>> > > > >> > > > Hi devs,
>> > > > >> > > >
>> > > > >> > > > Yuan and I would like to start a discussion about
>> FLIP-221[1],
>> > > > which
>> > > > >> > > introduces an abstraction of lookup table cache and its
>> standard
>> > > > metrics.
>> > > > >> > > >
>> > > > >> > > > Currently each lookup table source should implement their
>> own
>> > > > cache to
>> > > > >> > > store lookup results, and there isn’t a standard of metrics
>> for
>> > > > users and
>> > > > >> > > developers to tuning their jobs with lookup joins, which is a
>> > > quite
>> > > > common
>> > > > >> > > use case in Flink table / SQL.
>> > > > >> > > >
>> > > > >> > > > Therefore we propose some new APIs including cache,
>> metrics,
>> > > > wrapper
>> > > > >> > > classes of TableFunction and new table options. Please take a
>> > look
>> > > > at the
>> > > > >> > > FLIP page [1] to get more details. Any suggestions and
>> comments
>> > > > would be
>> > > > >> > > appreciated!
>> > > > >> > > >
>> > > > >> > > > [1]
>> > > > >> > >
>> > > >
>> > >
>> >
>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-221+Abstraction+for+lookup+source+cache+and+metric
>> > > > >> > > >
>> > > > >> > > > Best regards,
>> > > > >> > > >
>> > > > >> > > > Qingsheng
>> > > > >> > > >
>> > > > >> > > >
>> > > > >> > >
>> > > > >
>> > > > >
>> > > > >
>> > > > > --
>> > > > > Best Regards,
>> > > > >
>> > > > > Qingsheng Ren
>> > > > >
>> > > > > Real-time Computing Team
>> > > > > Alibaba Cloud
>> > > > >
>> > > > > Email: renqschn@gmail.com
>> > > >
>> > >
>> >
>>
>
>
> --
> Best regards,
> Roman Boyko
> e.: ro.v.boyko@gmail.com
>

Re: [DISCUSS] FLIP-221 Abstraction for lookup source cache and metric

Posted by Roman Boyko <ro...@gmail.com>.
It's a much more complicated activity that lies outside the scope of this
improvement, because such pushdowns should be done for all ScanTableSource
implementations (not only the lookup ones).

On Thu, 5 May 2022 at 19:02, Martijn Visser <ma...@apache.org>
wrote:

> Hi everyone,
>
> One question regarding "And Alexander correctly mentioned that filter
> pushdown still is not implemented for jdbc/hive/hbase." -> Would an
> alternative solution be to actually implement these filter pushdowns? I can
> imagine that there are many more benefits to doing that, outside of lookup
> caching and metrics.
>
> Best regards,
>
> Martijn Visser
> https://twitter.com/MartijnVisser82
> https://github.com/MartijnVisser
>
>
> On Thu, 5 May 2022 at 13:58, Roman Boyko <ro...@gmail.com> wrote:
>
> > Hi everyone!
> >
> > Thanks for driving such a valuable improvement!
> >
> > I do think that single cache implementation would be a nice opportunity
> for
> > users. And it will break the "FOR SYSTEM_TIME AS OF proc_time" semantics
> > anyway - doesn't matter how it will be implemented.
> >
> > Putting myself in the user's shoes, I can say that:
> > 1) I would prefer to have the opportunity to cut off the cache size by
> > simply filtering unnecessary data. And the most handy way to do it is
> apply
> > it inside LookupRunners. It would be a bit harder to pass it through the
> > LookupJoin node to TableFunction. And Alexander correctly mentioned that
> > filter pushdown still is not implemented for jdbc/hive/hbase.
> > 2) The ability to set the different caching parameters for different
> tables
> > is quite important. So I would prefer to set it through DDL rather than
> > have similar TTL, strategy and other options for all lookup tables.
> > 3) Providing the cache into the framework really deprives us of
> > extensibility (users won't be able to implement their own cache). But
> most
> > probably it might be solved by creating more different cache strategies
> and
> > a wider set of configurations.
> >
> > All these points are much closer to the schema proposed by Alexander.
> > Qingshen Ren, please correct me if I'm not right and all these facilities
> > might be simply implemented in your architecture?
> >
> > Best regards,
> > Roman Boyko
> > e.: ro.v.boyko@gmail.com
> >
> > On Wed, 4 May 2022 at 21:01, Martijn Visser <ma...@apache.org>
> > wrote:
> >
> > > Hi everyone,
> > >
> > > I don't have much to chip in, but just wanted to express that I really
> > > appreciate the in-depth discussion on this topic and I hope that others
> > > will join the conversation.
> > >
> > > Best regards,
> > >
> > > Martijn
> > >
> > > On Tue, 3 May 2022 at 10:15, Александр Смирнов <sm...@gmail.com>
> > > wrote:
> > >
> > > > Hi Qingsheng, Leonard and Jark,
> > > >
> > > > Thanks for your detailed feedback! However, I have questions about
> > > > some of your statements (maybe I didn't get something?).
> > > >
> > > > > Caching actually breaks the semantic of "FOR SYSTEM_TIME AS OF
> > > proc_time”
> > > >
> > > > I agree that the semantics of "FOR SYSTEM_TIME AS OF proc_time" is not
> > > > fully implemented with caching, but as you said, users consciously
> > > > accept this trade-off to achieve better performance (no one proposed to
> > > > enable caching by default, etc.). Or by "users" do you mean other
> > > > developers of connectors? In that case developers explicitly specify
> > > > whether their connector supports caching or not (in the list of
> > > > supported options); no one forces them to do so if they don't want to.
> > > > So what exactly is the difference between implementing caching in the
> > > > flink-table-runtime module and in flink-table-common from this point of
> > > > view? How does it affect the breaking (or not) of the "FOR SYSTEM_TIME
> > > > AS OF proc_time" semantics?
> > > >
> > > > > confront a situation that allows table options in DDL to control
> the
> > > > behavior of the framework, which has never happened previously and
> > should
> > > > be cautious
> > > >
> > > > If we talk about the main semantic difference between DDL options and
> > > > config options ("table.exec.xxx"), isn't it about the scope of the
> > > > options and their importance to the user's business logic, rather than
> > > > the specific location of the corresponding logic in the framework? In
> > > > my design, for example, putting the lookup cache strategy into
> > > > configurations would be the wrong decision, because it directly affects
> > > > the user's business logic (not just performance optimization) and
> > > > touches just a few functions of ONE table (there can be multiple tables
> > > > with different caches). Does it really matter to the user (or anyone
> > > > else) where the logic affected by the option is located?
> > > > Also, recall the DDL option 'sink.parallelism', which in some way
> > > > "controls the behavior of the framework", and I don't see any problem
> > > > there.
> > > >
> > > > > introduce a new interface for this all-caching scenario and the
> > design
> > > > would become more complex
> > > >
> > > > This is a subject for a separate discussion, but actually in our
> > > > internal version we solved this problem quite easily - we reused the
> > > > InputFormat class (so there is no need for a new API). The point is
> > > > that currently all lookup connectors use InputFormat for scanning the
> > > > data in batch mode: HBase, JDBC and even Hive - it uses the class
> > > > PartitionReader, which is actually just a wrapper around InputFormat.
> > > > The advantage of this solution is the ability to reload cache data in
> > > > parallel (the number of threads depends on the number of InputSplits,
> > > > but has an upper limit). As a result, cache reload time is
> > > > significantly reduced (as is the time the input stream is blocked). I
> > > > know that we usually try to avoid concurrency in Flink code, but maybe
> > > > this one can be an exception. BTW I don't say it's an ideal solution;
> > > > maybe there are better ones.
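[Editor's note: to make the parallel-reload idea above concrete, here is a minimal plain-Java sketch. The class and method names (ParallelCacheReload, SplitReader) are illustrative assumptions, not Flink's InputFormat API: each "split" is read on a bounded thread pool and merged into a new snapshot that the caller can then swap in as the ALL cache.]

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Illustrative sketch: reload an ALL cache by reading "splits" in parallel,
// capping the thread count, and returning the new snapshot for an atomic swap.
public class ParallelCacheReload {

    // Stands in for reading one InputSplit; real code would use an InputFormat.
    interface SplitReader {
        Map<String, String> readSplit(int splitId) throws Exception;
    }

    public static Map<String, String> reload(int numSplits, int maxThreads, SplitReader reader) {
        ExecutorService pool = Executors.newFixedThreadPool(Math.min(numSplits, maxThreads));
        try {
            List<Future<Map<String, String>>> futures = new ArrayList<>();
            for (int i = 0; i < numSplits; i++) {
                final int splitId = i;
                futures.add(pool.submit(() -> reader.readSplit(splitId)));
            }
            Map<String, String> snapshot = new ConcurrentHashMap<>();
            for (Future<Map<String, String>> f : futures) {
                snapshot.putAll(f.get()); // blocks until every split is loaded
            }
            return snapshot; // caller swaps this in as the new cache
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException("cache reload failed", e);
        } finally {
            pool.shutdown();
        }
    }
}
```

The input stream only needs to be blocked for the final swap, which is why the reload time (dominated by reading splits) shrinks roughly with the number of threads.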
> > > >
> > > > > Providing the cache in the framework might introduce compatibility
> > > issues
> > > >
> > > > That is possible only if the developer of the connector doesn't
> > > > properly refactor the code and uses the new cache options incorrectly
> > > > (i.e. explicitly provides the same options to 2 different code
> > > > places). For correct behavior all they need to do is redirect the
> > > > existing options to the framework's LookupConfig (+ maybe add an alias
> > > > for options if the naming differed); everything will be transparent
> > > > for users. If the developer doesn't refactor at all, nothing changes
> > > > for the connector, thanks to backward compatibility. Also, if a
> > > > developer wants their own cache logic, they can simply not pass some
> > > > of the configs to the framework and instead provide their own
> > > > implementation with the already existing configs and metrics (but I
> > > > think that's a rare case).
> > > >
> > > > > filters and projections should be pushed all the way down to the
> > table
> > > > function, like what we do in the scan source
> > > >
> > > > That's a great goal. But the truth is that the ONLY connector that
> > > > supports filter pushdown is FileSystemTableSource
> > > > (no database connector supports it currently). Also, for some
> > > > databases it's simply impossible to push down filters as complex as
> > > > the ones we have in Flink.
> > > >
> > > > >  only applying these optimizations to the cache seems not quite
> > useful
> > > >
> > > > Filters can cut off an arbitrarily large amount of data from the
> > > > dimension table. For a simple example, suppose the dimension table
> > > > 'users' has a column 'age' with values from 20 to 40, and the input
> > > > stream 'clicks' is roughly uniformly distributed by user age. If we
> > > > have the filter 'age > 30', there will be about half as much data in
> > > > the cache. This means the user can increase 'lookup.cache.max-rows'
> > > > by almost 2 times, which gives a huge performance boost. Moreover,
> > > > this optimization really starts to shine with the 'ALL' cache, where
> > > > tables that can't fit in memory without filters and projections can
> > > > fit with them. This opens up additional possibilities for users, and
> > > > it doesn't sound like 'not quite useful'.
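[Editor's note: a back-of-the-envelope check of the example above, with purely illustrative numbers: counting how many of the uniformly spread ages survive the 'age > 30' filter shows that roughly half the dimension rows never need to enter the cache.]

```java
// Illustrative arithmetic for the example above: with one dimension row per
// age value in [minAge, maxAge], only rows passing `age > filterAgeAbove`
// would ever be stored when the filter is applied before the cache.
public class FilteredCacheSize {
    public static int cachedRows(int minAge, int maxAge, int filterAgeAbove) {
        int cached = 0;
        for (int age = minAge; age <= maxAge; age++) {
            if (age > filterAgeAbove) {
                cached++; // only rows passing the filter enter the cache
            }
        }
        return cached;
    }
}
```

For ages 20..40 (21 distinct values) and the filter age > 30, only 10 values remain, i.e. the cache holds about half the rows for the same key space.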
> > > >
> > > > It would be great to hear other voices regarding this topic! Because
> > > > we have quite a lot of controversial points, and I think with the
> help
> > > > of others it will be easier for us to come to a consensus.
> > > >
> > > > Best regards,
> > > > Smirnov Alexander
> > > >
> > > >
> > > > пт, 29 апр. 2022 г. в 22:33, Qingsheng Ren <re...@gmail.com>:
> > > > >
> > > > > Hi Alexander and Arvid,
> > > > >
> > > > > Thanks for the discussion and sorry for my late response! We had an
> > > > internal discussion together with Jark and Leonard and I’d like to
> > > > summarize our ideas. Instead of implementing the cache logic in the
> > table
> > > > runtime layer or wrapping around the user-provided table function, we
> > > > prefer to introduce some new APIs extending TableFunction with these
> > > > concerns:
> > > > >
> > > > > 1. Caching actually breaks the semantic of "FOR SYSTEM_TIME AS OF
> > > > proc_time”, because it couldn’t truly reflect the content of the
> lookup
> > > > table at the moment of querying. If users choose to enable caching on
> > the
> > > > lookup table, they implicitly indicate that this breakage is
> acceptable
> > > in
> > > > exchange for the performance. So we prefer not to provide caching on
> > the
> > > > table runtime level.
> > > > >
> > > > > 2. If we make the cache implementation in the framework (whether
> in a
> > > > runner or a wrapper around TableFunction), we have to confront a
> > > situation
> > > > that allows table options in DDL to control the behavior of the
> > > framework,
> > > > which has never happened previously and should be cautious. Under the
> > > > current design the behavior of the framework should only be specified
> > by
> > > > configurations (“table.exec.xxx”), and it’s hard to apply these
> general
> > > > configs to a specific table.
> > > > >
> > > > > 3. We have use cases where the lookup source loads and refreshes all
> > > > > records into memory periodically to achieve high lookup performance
> > > > > (like the Hive connector in the community, and also widely used by
> > > > > our internal connectors). Wrapping the cache around the user's
> > > > > TableFunction works fine for LRU caches, but I think we have to
> > > > > introduce a new interface for this all-caching scenario and the
> > > > > design would become more complex.
> > > > >
> > > > > 4. Providing the cache in the framework might introduce
> compatibility
> > > > issues to existing lookup sources like there might exist two caches
> > with
> > > > totally different strategies if the user incorrectly configures the
> > table
> > > > (one in the framework and another implemented by the lookup source).
> > > > >
> > > > > As for the optimization mentioned by Alexander, I think filters and
> > > > projections should be pushed all the way down to the table function,
> > like
> > > > what we do in the scan source, instead of the runner with the cache.
> > The
> > > > goal of using cache is to reduce the network I/O and pressure on the
> > > > external system, and only applying these optimizations to the cache
> > seems
> > > > not quite useful.
> > > > >
> > > > > I made some updates to the FLIP[1] to reflect our ideas. We prefer
> to
> > > > keep the cache implementation as a part of TableFunction, and we
> could
> > > > provide some helper classes (CachingTableFunction,
> > > AllCachingTableFunction,
> > > > CachingAsyncTableFunction) to developers and regulate metrics of the
> > > cache.
> > > > Also, I made a POC[2] for your reference.
> > > > >
> > > > > Looking forward to your ideas!
> > > > >
> > > > > [1]
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-221+Abstraction+for+lookup+source+cache+and+metric
> > > > > [2] https://github.com/PatrickRen/flink/tree/FLIP-221
> > > > >
> > > > > Best regards,
> > > > >
> > > > > Qingsheng
> > > > >
> > > > > On Tue, Apr 26, 2022 at 4:45 PM Александр Смирнов <
> > > smiralexan@gmail.com>
> > > > wrote:
> > > > >>
> > > > >> Thanks for the response, Arvid!
> > > > >>
> > > > >> I have a few comments on your message.
> > > > >>
> > > > >> > but could also live with an easier solution as the first step:
> > > > >>
> > > > >> I think these 2 approaches (the one originally proposed by
> > > > >> Qingsheng and mine) are mutually exclusive, because conceptually
> > > > >> they pursue the same goal but differ in implementation details. If
> > > > >> we go one way, moving to the other in the future will mean deleting
> > > > >> existing code and once again changing the API for connectors. So I
> > > > >> think we should reach a consensus with the community first and then
> > > > >> work together on this FLIP, i.e. divide the work into tasks for
> > > > >> different parts of the FLIP (for example, LRU cache unification /
> > > > >> introducing the proposed set of metrics / further work…). WDYT,
> > > > >> Qingsheng?
> > > > >>
> > > > >> > as the source will only receive the requests after filter
> > > > >>
> > > > >> Actually, if filters are applied to fields of the lookup table, we
> > > > >> first have to make the requests, and only after that can we filter
> > > > >> the responses, because lookup connectors don't have filter
> > > > >> pushdown. So if filtering is done before caching, there will be far
> > > > >> fewer rows in the cache.
> > > > >>
> > > > >> > @Alexander unfortunately, your architecture is not shared. I
> don't
> > > > know the
> > > > >>
> > > > >> > solution to share images to be honest.
> > > > >>
> > > > >> Sorry for that, I’m a bit new to such kinds of conversations :)
> > > > >> I have no write access to the confluence, so I made a Jira issue,
> > > > >> where described the proposed changes in more details -
> > > > >> https://issues.apache.org/jira/browse/FLINK-27411.
> > > > >>
> > > > >> Will be happy to get more feedback!
> > > > >>
> > > > >> Best,
> > > > >> Smirnov Alexander
> > > > >>
> > > > >> пн, 25 апр. 2022 г. в 19:49, Arvid Heise <ar...@apache.org>:
> > > > >> >
> > > > >> > Hi Qingsheng,
> > > > >> >
> > > > >> > Thanks for driving this; the inconsistency was not satisfying
> for
> > > me.
> > > > >> >
> > > > >> > I second Alexander's idea though but could also live with an
> > easier
> > > > >> > solution as the first step: Instead of making caching an
> > > > implementation
> > > > >> > detail of TableFunction X, rather devise a caching layer around
> X.
> > > So
> > > > the
> > > > >> > proposal would be a CachingTableFunction that delegates to X in
> > case
> > > > of
> > > > >> > misses and else manages the cache. Lifting it into the operator
> > > model
> > > > as
> > > > >> > proposed would be even better but is probably unnecessary in the
> > > > first step
> > > > >> > for a lookup source (as the source will only receive the
> requests
> > > > after
> > > > >> > filter; applying projection may be more interesting to save
> > memory).
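[Editor's note: the "caching layer around X" idea could look roughly like this minimal plain-Java sketch. Names are illustrative only and the real Flink TableFunction signatures differ: an access-ordered LRU map answers lookups and delegates to the wrapped function only on a miss, while counting hits and misses for metrics.]

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Sketch of a caching wrapper: serve lookups from an LRU map and delegate to
// the underlying lookup function only on a cache miss.
public class CachingLookup<K, V> {
    private final Function<K, V> delegate;
    private final Map<K, V> cache;
    private long hits = 0, misses = 0;

    public CachingLookup(Function<K, V> delegate, int maxRows) {
        this.delegate = delegate;
        // access-order LinkedHashMap evicting the eldest entry = a simple LRU
        this.cache = new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                return size() > maxRows;
            }
        };
    }

    public V lookup(K key) {
        V value = cache.get(key);
        if (value != null) {
            hits++;
            return value;
        }
        misses++;
        value = delegate.apply(key); // delegate to the wrapped lookup on a miss
        cache.put(key, value);
        return value;
    }

    public long hits() { return hits; }
    public long misses() { return misses; }
}
```

The appeal of this shape is that the wrapped function never knows the cache exists, so the same wrapper (and the same hit/miss metrics) works for any lookup source.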
> > > > >> >
> > > > >> > Another advantage is that all the changes of this FLIP would be
> > > > limited to
> > > > >> > options, no need for new public interfaces. Everything else
> > remains
> > > an
> > > > >> > implementation of Table runtime. That means we can easily
> > > incorporate
> > > > the
> > > > >> > optimization potential that Alexander pointed out later.
> > > > >> >
> > > > >> > @Alexander unfortunately, your architecture is not shared. I
> don't
> > > > know the
> > > > >> > solution to share images to be honest.
> > > > >> >
> > > > >> > On Fri, Apr 22, 2022 at 5:04 PM Александр Смирнов <
> > > > smiralexan@gmail.com>
> > > > >> > wrote:
> > > > >> >
> > > > >> > > Hi Qingsheng! My name is Alexander, I'm not a committer yet,
> but
> > > I'd
> > > > >> > > really like to become one. And this FLIP really interested me.
> > > > >> > > Actually I have worked on a similar feature in my company’s
> > Flink
> > > > >> > > fork, and we would like to share our thoughts on this and make
> > > code
> > > > >> > > open source.
> > > > >> > >
> > > > >> > > I think there is a better alternative than introducing an
> > abstract
> > > > >> > > class for TableFunction (CachingTableFunction). As you know,
> > > > >> > > TableFunction exists in the flink-table-common module, which
> > > > provides
> > > > >> > > only an API for working with tables – it’s very convenient for
> > > > importing
> > > > >> > > in connectors. In turn, CachingTableFunction contains logic
> for
> > > > >> > > runtime execution,  so this class and everything connected
> with
> > it
> > > > >> > > should be located in another module, probably in
> > > > flink-table-runtime.
> > > > >> > > But this will require connectors to depend on another module,
> > > which
> > > > >> > > contains a lot of runtime logic, which doesn’t sound good.
> > > > >> > >
> > > > >> > > I suggest adding a new method ‘getLookupConfig’ to
> > > > >> > > LookupTableSource or LookupRuntimeProvider to allow connectors
> > > > >> > > to pass only configurations to the planner, so they won’t depend
> > > > >> > > on the runtime implementation. Based on these configs the
> > > > >> > > planner will construct a lookup join operator with the
> > > > >> > > corresponding runtime logic (ProcessFunctions in the module
> > > > >> > > flink-table-runtime). The architecture looks like the pinned
> > > > >> > > image (the LookupConfig class there is actually your
> > > > >> > > CacheConfig).
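[Editor's note: a hypothetical shape of such a LookupConfig, to make the proposal concrete. Every field and method name below is an assumption for illustration, not the FLIP's actual API: the connector only describes its caching wishes, and the planner builds the matching runner.]

```java
import java.time.Duration;

// Hypothetical immutable config that a connector would hand to the planner
// instead of implementing any caching itself.
public class LookupConfig {
    public enum CacheStrategy { NONE, LRU, ALL }

    private final CacheStrategy strategy;
    private final long maxRows;
    private final Duration ttl;

    private LookupConfig(CacheStrategy strategy, long maxRows, Duration ttl) {
        this.strategy = strategy;
        this.maxRows = maxRows;
        this.ttl = ttl;
    }

    // Factory for an LRU cache bounded by row count and entry TTL.
    public static LookupConfig lru(long maxRows, Duration ttl) {
        return new LookupConfig(CacheStrategy.LRU, maxRows, ttl);
    }

    public CacheStrategy strategy() { return strategy; }
    public long maxRows() { return maxRows; }
    public Duration ttl() { return ttl; }
}
```

With a shape like this, a connector would translate its DDL options (e.g. 'lookup.cache.max-rows') into the config, and the planner picks the runner class, keeping all runtime logic out of flink-table-common.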
> > > > >> > >
> > > > >> > > Classes in flink-table-planner, that will be responsible for
> > this
> > > –
> > > > >> > > CommonPhysicalLookupJoin and his inheritors.
> > > > >> > > Current classes for lookup join in  flink-table-runtime  -
> > > > >> > > LookupJoinRunner, AsyncLookupJoinRunner,
> > LookupJoinRunnerWithCalc,
> > > > >> > > AsyncLookupJoinRunnerWithCalc.
> > > > >> > >
> > > > >> > > I suggest adding classes LookupJoinCachingRunner,
> > > > >> > > LookupJoinCachingRunnerWithCalc, etc.
> > > > >> > >
> > > > >> > > And here comes another more powerful advantage of such a
> > solution.
> > > > If
> > > > >> > > we have caching logic on a lower level, we can apply some
> > > > >> > > optimizations to it. LookupJoinRunnerWithCalc was named like
> > this
> > > > >> > > because it uses the ‘calc’ function, which actually mostly
> > > consists
> > > > of
> > > > >> > > filters and projections.
> > > > >> > >
> > > > >> > > For example, in join table A with lookup table B condition
> > ‘JOIN …
> > > > ON
> > > > >> > > A.id = B.id AND A.age = B.age + 10 WHERE B.salary > 1000’
> > ‘calc’
> > > > >> > > function will contain filters A.age = B.age + 10 and B.salary
> >
> > > > 1000.
> > > > >> > >
> > > > >> > > If we apply this function before storing records in cache, the
> > > > >> > > size of the cache will be significantly reduced: filters avoid
> > > > >> > > storing useless records in the cache, and projections reduce
> > > > >> > > record size. So the initial max number of records in cache can
> > > > >> > > be increased by the user.
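[Editor's note: a minimal sketch of that idea, with illustrative names rather than the actual runner classes: the 'calc' filter and projection are applied before the cache insert, so filtered-out rows are never stored and stored rows are already projected down to the needed columns.]

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;
import java.util.function.Predicate;

// Sketch of applying "calc" (filter + projection) before the cache rather
// than after it: the same max-rows budget then covers more useful entries.
public class CalcBeforeCache<K, R, P> {
    private final Function<K, R> lookup;       // raw lookup into the dimension table
    private final Predicate<R> filter;         // e.g. B.salary > 1000
    private final Function<R, P> projection;   // keep only the needed columns
    private final Map<K, P> cache = new HashMap<>();

    public CalcBeforeCache(Function<K, R> lookup, Predicate<R> filter, Function<R, P> projection) {
        this.lookup = lookup;
        this.filter = filter;
        this.projection = projection;
    }

    // Returns null when the row is filtered out; nothing is cached in that case.
    public P lookupFiltered(K key) {
        P cached = cache.get(key);
        if (cached != null) {
            return cached;
        }
        R raw = lookup.apply(key);
        if (raw == null || !filter.test(raw)) {
            return null; // filtered-out row never occupies a cache slot
        }
        P projected = projection.apply(raw);
        cache.put(key, projected);
        return projected;
    }

    public int cacheSize() { return cache.size(); }
}
```

One simplification here: keys whose rows fail the filter are re-looked-up every time; a production version would likely also cache the "miss" to avoid repeated requests.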
> > > > >> > >
> > > > >> > > What do you think about it?
> > > > >> > >
> > > > >> > >
> > > > >> > > On 2022/04/19 02:47:11 Qingsheng Ren wrote:
> > > > >> > > > Hi devs,
> > > > >> > > >
> > > > >> > > > Yuan and I would like to start a discussion about
> FLIP-221[1],
> > > > which
> > > > >> > > introduces an abstraction of lookup table cache and its
> standard
> > > > metrics.
> > > > >> > > >
> > > > >> > > > Currently each lookup table source should implement their
> own
> > > > cache to
> > > > >> > > store lookup results, and there isn’t a standard of metrics
> for
> > > > users and
> > > > >> > > developers to tuning their jobs with lookup joins, which is a
> > > quite
> > > > common
> > > > >> > > use case in Flink table / SQL.
> > > > >> > > >
> > > > >> > > > Therefore we propose some new APIs including cache, metrics,
> > > > wrapper
> > > > >> > > classes of TableFunction and new table options. Please take a
> > look
> > > > at the
> > > > >> > > FLIP page [1] to get more details. Any suggestions and
> comments
> > > > would be
> > > > >> > > appreciated!
> > > > >> > > >
> > > > >> > > > [1]
> > > > >> > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-221+Abstraction+for+lookup+source+cache+and+metric
> > > > >> > > >
> > > > >> > > > Best regards,
> > > > >> > > >
> > > > >> > > > Qingsheng
> > > > >> > > >
> > > > >> > > >
> > > > >> > >
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Best Regards,
> > > > >
> > > > > Qingsheng Ren
> > > > >
> > > > > Real-time Computing Team
> > > > > Alibaba Cloud
> > > > >
> > > > > Email: renqschn@gmail.com
> > > >
> > >
> >
>


-- 
Best regards,
Roman Boyko
e.: ro.v.boyko@gmail.com

Re: [DISCUSS] FLIP-221 Abstraction for lookup source cache and metric

Posted by Martijn Visser <ma...@apache.org>.
Hi everyone,

One question regarding "And Alexander correctly mentioned that filter
pushdown still is not implemented for jdbc/hive/hbase." -> Would an
alternative solution be to actually implement these filter pushdowns? I can
imagine that there are many more benefits to doing that, outside of lookup
caching and metrics.

Best regards,

Martijn Visser
https://twitter.com/MartijnVisser82
https://github.com/MartijnVisser


On Thu, 5 May 2022 at 13:58, Roman Boyko <ro...@gmail.com> wrote:

> Hi everyone!
>
> Thanks for driving such a valuable improvement!
>
> I do think that a single cache implementation would be a nice opportunity for
> users. And it will break the "FOR SYSTEM_TIME AS OF proc_time" semantics
> anyway - no matter how it is implemented.
>
> Putting myself in the user's shoes, I can say that:
> 1) I would prefer to have the opportunity to cut down the cache size by
> simply filtering out unnecessary data. And the handiest way to do that is to
> apply it inside the LookupRunners. It would be a bit harder to pass it through
> the LookupJoin node to TableFunction. And Alexander correctly mentioned that
> filter pushdown still is not implemented for jdbc/hive/hbase.
> 2) The ability to set different caching parameters for different tables
> is quite important. So I would prefer to set them through DDL rather than
> have the same TTL, strategy and other options for all lookup tables.
> 3) Providing the cache in the framework really deprives us of
> extensibility (users won't be able to implement their own cache). But most
> probably this might be solved by creating more different cache strategies
> and a wider set of configurations.
>
> All these points are much closer to the schema proposed by Alexander.
> Qingsheng Ren, please correct me if I'm not right and all these facilities
> might be simply implemented in your architecture?
>
> Best regards,
> Roman Boyko
> e.: ro.v.boyko@gmail.com
>
> > > >> > > runtime execution,  so this class and everything connected with
> it
> > > >> > > should be located in another module, probably in
> > > flink-table-runtime.
> > > >> > > But this will require connectors to depend on another module,
> > which
> > > >> > > contains a lot of runtime logic, which doesn’t sound good.
> > > >> > >
> > > >> > > I suggest adding a new method ‘getLookupConfig’ to
> > LookupTableSource
> > > >> > > or LookupRuntimeProvider to allow connectors to only pass
> > > >> > > configurations to the planner, therefore they won’t depend on
> > > runtime
> > > >> > > realization. Based on these configs planner will construct a
> > lookup
> > > >> > > join operator with corresponding runtime logic (ProcessFunctions
> > in
> > > >> > > module flink-table-runtime). Architecture looks like in the
> pinned
> > > >> > > image (LookupConfig class there is actually yours CacheConfig).
> > > >> > >
> > > >> > > Classes in flink-table-planner, that will be responsible for
> this
> > –
> > > >> > > CommonPhysicalLookupJoin and his inheritors.
> > > >> > > Current classes for lookup join in  flink-table-runtime  -
> > > >> > > LookupJoinRunner, AsyncLookupJoinRunner,
> LookupJoinRunnerWithCalc,
> > > >> > > AsyncLookupJoinRunnerWithCalc.
> > > >> > >
> > > >> > > I suggest adding classes LookupJoinCachingRunner,
> > > >> > > LookupJoinCachingRunnerWithCalc, etc.
> > > >> > >
> > > >> > > And here comes another more powerful advantage of such a
> solution.
> > > If
> > > >> > > we have caching logic on a lower level, we can apply some
> > > >> > > optimizations to it. LookupJoinRunnerWithCalc was named like
> this
> > > >> > > because it uses the ‘calc’ function, which actually mostly
> > consists
> > > of
> > > >> > > filters and projections.
> > > >> > >
> > > >> > > For example, in join table A with lookup table B condition
> ‘JOIN …
> > > ON
> > > >> > > A.id = B.id AND A.age = B.age + 10 WHERE B.salary > 1000’
> ‘calc’
> > > >> > > function will contain filters A.age = B.age + 10 and B.salary >
> > > 1000.
> > > >> > >
> > > >> > > If we apply this function before storing records in cache, size
> of
> > > >> > > cache will be significantly reduced: filters = avoid storing
> > useless
> > > >> > > records in cache, projections = reduce records’ size. So the
> > initial
> > > >> > > max number of records in cache can be increased by the user.
> > > >> > >
> > > >> > > What do you think about it?
> > > >> > >
> > > >> > >
> > > >> > > On 2022/04/19 02:47:11 Qingsheng Ren wrote:
> > > >> > > > Hi devs,
> > > >> > > >
> > > >> > > > Yuan and I would like to start a discussion about FLIP-221[1],
> > > which
> > > >> > > introduces an abstraction of lookup table cache and its standard
> > > metrics.
> > > >> > > >
> > > >> > > > Currently each lookup table source should implement their own
> > > cache to
> > > >> > > store lookup results, and there isn’t a standard of metrics for
> > > users and
> > > >> > > developers to tuning their jobs with lookup joins, which is a
> > quite
> > > common
> > > >> > > use case in Flink table / SQL.
> > > >> > > >
> > > >> > > > Therefore we propose some new APIs including cache, metrics,
> > > wrapper
> > > >> > > classes of TableFunction and new table options. Please take a
> look
> > > at the
> > > >> > > FLIP page [1] to get more details. Any suggestions and comments
> > > would be
> > > >> > > appreciated!
> > > >> > > >
> > > >> > > > [1]
> > > >> > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-221+Abstraction+for+lookup+source+cache+and+metric
> > > >> > > >
> > > >> > > > Best regards,
> > > >> > > >
> > > >> > > > Qingsheng
> > > >> > > >
> > > >> > > >
> > > >> > >
> > > >
> > > >
> > > >
> > > > --
> > > > Best Regards,
> > > >
> > > > Qingsheng Ren
> > > >
> > > > Real-time Computing Team
> > > > Alibaba Cloud
> > > >
> > > > Email: renqschn@gmail.com
> > >
> >
>

Re: [DISCUSS] FLIP-221 Abstraction for lookup source cache and metric

Posted by Roman Boyko <ro...@gmail.com>.
Hi everyone!

Thanks for driving such a valuable improvement!

I do think that a single cache implementation would be a nice opportunity for
users. And it will break the "FOR SYSTEM_TIME AS OF proc_time" semantics
anyway, no matter how it is implemented.

Putting myself in the user's shoes, I can say that:
1) I would prefer to have the opportunity to reduce the cache size by
simply filtering out unnecessary data. And the handiest way to do that is
to apply the filter inside the LookupRunners. It would be a bit harder to
pass it through the LookupJoin node to the TableFunction. And Alexander
correctly mentioned that filter pushdown is still not implemented for
jdbc/hive/hbase.
2) The ability to set different caching parameters for different tables
is quite important. So I would prefer to set them through DDL rather than
share the same ttl, strategy and other options across all lookup tables.
3) Building the cache into the framework does deprive us of
extensibility (users won't be able to implement their own cache). But most
probably that can be solved by adding more cache strategies and a wider
set of configurations.

All these points are much closer to the schema proposed by Alexander.
Qingsheng Ren, please correct me if I'm wrong and all these facilities
can be implemented just as simply in your architecture?

Best regards,
Roman Boyko
e.: ro.v.boyko@gmail.com

On Wed, 4 May 2022 at 21:01, Martijn Visser <ma...@apache.org>
wrote:

> Hi everyone,
>
> I don't have much to chip in, but just wanted to express that I really
> appreciate the in-depth discussion on this topic and I hope that others
> will join the conversation.
>
> Best regards,
>
> Martijn
>
> On Tue, 3 May 2022 at 10:15, Александр Смирнов <sm...@gmail.com>
> wrote:
>
> > Hi Qingsheng, Leonard and Jark,
> >
> > Thanks for your detailed feedback! However, I have questions about
> > some of your statements (maybe I didn't get something?).
> >
> > > Caching actually breaks the semantic of "FOR SYSTEM_TIME AS OF
> proc_time”
> >
> > I agree that the semantics of "FOR SYSTEM_TIME AS OF proc_time" is not
> > fully implemented with caching, but as you said, users go on it
> > consciously to achieve better performance (no one proposed to enable
> > caching by default, etc.). Or by users do you mean other developers of
> > connectors? In this case developers explicitly specify whether their
> > connector supports caching or not (in the list of supported options),
> > no one makes them do that if they don't want to. So what exactly is
> > the difference between implementing caching in modules
> > flink-table-runtime and in flink-table-common from the considered
> > point of view? How does it affect breaking or preserving the
> > semantics of "FOR SYSTEM_TIME AS OF proc_time"?
> >
> > > confront a situation that allows table options in DDL to control the
> > behavior of the framework, which has never happened previously and should
> > be cautious
> >
> > If we talk about main differences of semantics of DDL options and
> > config options("table.exec.xxx"), isn't it about limiting the scope of
> > the options + importance for the user business logic rather than
> > specific location of corresponding logic in the framework? I mean that
> > in my design, for example, putting an option with the lookup cache
> > strategy in configurations would be the wrong decision, because it
> > directly affects the user's business logic (not just performance
> > optimization) + touches just several functions of ONE table (there can
> > be multiple tables with different caches). Does it really matter for
> > the user (or someone else) where the logic is located, which is
> > affected by the applied option?
> > Also I can remember DDL option 'sink.parallelism', which in some way
> > "controls the behavior of the framework" and I don't see any problem
> > here.
> >
> > > introduce a new interface for this all-caching scenario and the design
> > would become more complex
> >
> > This is a subject for a separate discussion, but actually in our
> > internal version we solved this problem quite easily - we reused
> > InputFormat class (so there is no need for a new API). The point is
> > that currently all lookup connectors use InputFormat for scanning the
> > data in batch mode: HBase, JDBC and even Hive - it uses class
> > PartitionReader, that is actually just a wrapper around InputFormat.
> > The advantage of this solution is the ability to reload cache data in
> > parallel (number of threads depends on number of InputSplits, but has
> > an upper limit). As a result cache reload time significantly reduces
> > (as well as time of input stream blocking). I know that usually we try
> > to avoid usage of concurrency in Flink code, but maybe this one can be
> > an exception. BTW I don't say that it's an ideal solution, maybe there
> > are better ones.
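The parallel ALL-cache reload described above can be sketched in plain Java. This is an illustration only: it uses `Supplier` callbacks in place of Flink's actual `InputFormat`/`InputSplit` API, and all class and method names here are made up for the example:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Supplier;

// Rebuilds an ALL-style lookup cache by reading each "split" of the
// dimension table in parallel with a bounded thread pool, then swapping
// the freshly built map in atomically so lookups never see a
// half-loaded cache.
public class AllCacheReloader<K, V> {
    private volatile Map<K, V> current = Map.of();

    public void reload(List<Supplier<Map<K, V>>> splitReaders, int maxThreads)
            throws InterruptedException, ExecutionException {
        // thread count depends on the number of splits, with an upper limit
        ExecutorService pool = Executors.newFixedThreadPool(
                Math.max(1, Math.min(maxThreads, splitReaders.size())));
        try {
            List<Future<Map<K, V>>> futures = new ArrayList<>();
            for (Supplier<Map<K, V>> reader : splitReaders) {
                futures.add(pool.submit(reader::get));
            }
            Map<K, V> fresh = new ConcurrentHashMap<>();
            for (Future<Map<K, V>> f : futures) {
                fresh.putAll(f.get()); // wait for every split to finish
            }
            current = fresh; // single volatile write swaps the cache
        } finally {
            pool.shutdown();
        }
    }

    public V get(K key) {
        return current.get(key);
    }
}
```

Because the swap is a single volatile write, the input stream is only blocked while the new map is being built, which is what shortens the reload window.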
> >
> > > Providing the cache in the framework might introduce compatibility
> issues
> >
> > It's possible only in cases where the connector developer doesn't
> > properly refactor his code and uses the new cache options incorrectly
> > (i.e. explicitly provide the same options into 2 different code
> > places). For correct behavior all he will need to do is to redirect
> > existing options to the framework's LookupConfig (+ maybe add an alias
> > for options, if there was different naming), everything will be
> > transparent for users. If the developer doesn't refactor at all,
> > nothing will change for the connector because of backward
> > compatibility. Also if a developer wants to use his own cache logic,
> > he just can refuse to pass some of the configs into the framework, and
> > instead make his own implementation with already existing configs and
> > metrics (but actually I think that it's a rare case).
> >
> > > filters and projections should be pushed all the way down to the table
> > function, like what we do in the scan source
> >
> > It's a great goal. But the truth is that the ONLY connector that
> > supports filter pushdown is FileSystemTableSource
> > (no database connector supports it currently). Also for some databases
> > it's simply impossible to push down such complex filters as the ones
> > we have in Flink.
> >
> > >  only applying these optimizations to the cache seems not quite useful
> >
> > Filters can cut off an arbitrarily large amount of data from the
> > dimension table. For a simple example, suppose in dimension table
> > 'users'
> > we have column 'age' with values from 20 to 40, and input stream
> > 'clicks' that is ~uniformly distributed by age of users. If we have
> > filter 'age > 30',
> > there will be half as much data in the cache. This means the user can
> > increase 'lookup.cache.max-rows' by almost 2 times. It will give a
> > huge
> > performance boost. Moreover, this optimization starts to really shine
> > in 'ALL' cache, where tables without filters and projections can't fit
> > in memory, but with them - can. This opens up additional possibilities
> > for users. And that hardly sounds 'not quite useful'.
> >
> > It would be great to hear other voices regarding this topic! Because
> > we have quite a lot of controversial points, and I think with the help
> > of others it will be easier for us to come to a consensus.
> >
> > Best regards,
> > Smirnov Alexander
> >
> >
> > пт, 29 апр. 2022 г. в 22:33, Qingsheng Ren <re...@gmail.com>:
> > >
> > > Hi Alexander and Arvid,
> > >
> > > Thanks for the discussion and sorry for my late response! We had an
> > internal discussion together with Jark and Leonard and I’d like to
> > summarize our ideas. Instead of implementing the cache logic in the table
> > runtime layer or wrapping around the user-provided table function, we
> > prefer to introduce some new APIs extending TableFunction with these
> > concerns:
> > >
> > > 1. Caching actually breaks the semantic of "FOR SYSTEM_TIME AS OF
> > proc_time”, because it couldn’t truly reflect the content of the lookup
> > table at the moment of querying. If users choose to enable caching on the
> > lookup table, they implicitly indicate that this breakage is acceptable
> in
> > exchange for the performance. So we prefer not to provide caching on the
> > table runtime level.
> > >
> > > 2. If we make the cache implementation in the framework (whether in a
> > runner or a wrapper around TableFunction), we have to confront a
> situation
> > that allows table options in DDL to control the behavior of the
> framework,
> > which has never happened previously and should be cautious. Under the
> > current design the behavior of the framework should only be specified by
> > configurations (“table.exec.xxx”), and it’s hard to apply these general
> > configs to a specific table.
> > >
> > > 3. We have use cases where the lookup source loads and refreshes all
> > records periodically into memory to achieve high lookup performance (like
> Hive
> > connector in the community, and also widely used by our internal
> > connectors). Wrapping the cache around the user’s TableFunction works
> fine
> > for LRU caches, but I think we have to introduce a new interface for this
> > all-caching scenario and the design would become more complex.
> > >
> > > 4. Providing the cache in the framework might introduce compatibility
> > issues to existing lookup sources like there might exist two caches with
> > totally different strategies if the user incorrectly configures the table
> > (one in the framework and another implemented by the lookup source).
> > >
> > > As for the optimization mentioned by Alexander, I think filters and
> > projections should be pushed all the way down to the table function, like
> > what we do in the scan source, instead of the runner with the cache. The
> > goal of using cache is to reduce the network I/O and pressure on the
> > external system, and only applying these optimizations to the cache seems
> > not quite useful.
> > >
> > > I made some updates to the FLIP[1] to reflect our ideas. We prefer to
> > keep the cache implementation as a part of TableFunction, and we could
> > provide some helper classes (CachingTableFunction,
> AllCachingTableFunction,
> > CachingAsyncTableFunction) to developers and regulate metrics of the
> cache.
> > Also, I made a POC[2] for your reference.
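The helper-class idea reads naturally as a template method: the base class owns the cache and the standardized metrics, and a connector implements only the remote lookup. A plain-Java sketch under that assumption (names and shape are illustrative, not the FLIP's actual API; a real CachingTableFunction would extend Flink's TableFunction):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative base class: the cache lives here, the connector only
// supplies lookupRemote(). The hit/miss counters stand in for the
// standardized cache metrics the FLIP proposes.
public abstract class CachingLookupFunction<K, V> {
    private final Map<K, List<V>> cache = new HashMap<>();
    private long hits;
    private long misses;

    public final List<V> eval(K key) {
        List<V> rows = cache.get(key);
        if (rows != null) {
            hits++; // would feed a hitCount metric
            return rows;
        }
        misses++; // would feed a missCount metric
        rows = lookupRemote(key);
        cache.put(key, rows);
        return rows;
    }

    /** Connector-specific lookup against the external system. */
    protected abstract List<V> lookupRemote(K key);

    public long hits() { return hits; }
    public long misses() { return misses; }
}
```

The design choice here is that developers never touch the cache directly, so metrics stay consistent across all connectors built on the base class.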
> > >
> > > Looking forward to your ideas!
> > >
> > > [1]
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-221+Abstraction+for+lookup+source+cache+and+metric
> > > [2] https://github.com/PatrickRen/flink/tree/FLIP-221
> > >
> > > Best regards,
> > >
> > > Qingsheng
> > >
> > > On Tue, Apr 26, 2022 at 4:45 PM Александр Смирнов <
> smiralexan@gmail.com>
> > wrote:
> > >>
> > >> Thanks for the response, Arvid!
> > >>
> > >> I have few comments on your message.
> > >>
> > >> > but could also live with an easier solution as the first step:
> > >>
> > >> I think that these 2 ways are mutually exclusive (originally proposed
> > >> by Qingsheng and mine), because conceptually they follow the same
> > >> goal, but implementation details are different. If we go one way,
> > >> moving to another way in the future will mean deleting existing code
> > >> and once again changing the API for connectors. So I think we should
> > >> reach a consensus with the community about that and then work together
> > >> on this FLIP, i.e. divide the work on tasks for different parts of the
> > >> flip (for example, LRU cache unification / introducing proposed set of
> > >> metrics / further work…). WDYT, Qingsheng?
> > >>
> > >> > as the source will only receive the requests after filter
> > >>
> > >> Actually if filters are applied to fields of the lookup table, we
> > >> first must do the requests, and only after that can we filter the
> > >> responses, because lookup connectors don't have filter pushdown. So
> > >> if filtering is done before caching, there will be far fewer rows in
> > >> the cache.
> > >>
> > >> > @Alexander unfortunately, your architecture is not shared. I don't
> > know the
> > >>
> > >> > solution to share images to be honest.
> > >>
> > >> Sorry for that, I’m a bit new to such kinds of conversations :)
> > >> I have no write access to the confluence, so I made a Jira issue,
> > >> where described the proposed changes in more details -
> > >> https://issues.apache.org/jira/browse/FLINK-27411.
> > >>
> > >> Will be happy to get more feedback!
> > >>
> > >> Best,
> > >> Smirnov Alexander
> > >>
> > >> пн, 25 апр. 2022 г. в 19:49, Arvid Heise <ar...@apache.org>:
> > >> >
> > >> > Hi Qingsheng,
> > >> >
> > >> > Thanks for driving this; the inconsistency was not satisfying for
> me.
> > >> >
> > >> > I second Alexander's idea though but could also live with an easier
> > >> > solution as the first step: Instead of making caching an
> > implementation
> > >> > detail of TableFunction X, rather devise a caching layer around X.
> So
> > the
> > >> > proposal would be a CachingTableFunction that delegates to X in case
> > of
> > >> > misses and else manages the cache. Lifting it into the operator
> model
> > as
> > >> > proposed would be even better but is probably unnecessary in the
> > first step
> > >> > for a lookup source (as the source will only receive the requests
> > after
> > >> > filter; applying projection may be more interesting to save memory).
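The "caching layer around X" idea can be sketched with plain JDK types. This is not the FLIP's API; the class name and shape are made up for illustration, with a `Function` standing in for the wrapped TableFunction:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// A caching layer around an arbitrary lookup function: on a miss it
// delegates to the wrapped function and stores the result; on a hit it
// answers from an LRU cache bounded by maxRows.
public class CachingLookup<K, V> {
    private final Function<K, List<V>> delegate;
    private final Map<K, List<V>> cache;

    public CachingLookup(Function<K, List<V>> delegate, int maxRows) {
        this.delegate = delegate;
        // An access-ordered LinkedHashMap evicts the least recently used
        // entry once the cache grows past maxRows.
        this.cache = new LinkedHashMap<K, List<V>>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, List<V>> eldest) {
                return size() > maxRows;
            }
        };
    }

    public List<V> lookup(K key) {
        // computeIfAbsent = "delegates to X in case of misses"
        return cache.computeIfAbsent(key, delegate);
    }
}
```

The point of the wrapper is exactly that the wrapped function needs no knowledge of the cache, so any existing TableFunction could be reused unchanged.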
> > >> >
> > >> > Another advantage is that all the changes of this FLIP would be
> > limited to
> > >> > options, no need for new public interfaces. Everything else remains
> an
> > >> > implementation of Table runtime. That means we can easily
> incorporate
> > the
> > >> > optimization potential that Alexander pointed out later.
> > >> >
> > >> > @Alexander unfortunately, your architecture is not shared. I don't
> > know the
> > >> > solution to share images to be honest.
> > >> >
> > >> > On Fri, Apr 22, 2022 at 5:04 PM Александр Смирнов <
> > smiralexan@gmail.com>
> > >> > wrote:
> > >> >
> > >> > > Hi Qingsheng! My name is Alexander, I'm not a committer yet, but
> I'd
> > >> > > really like to become one. And this FLIP really interested me.
> > >> > > Actually I have worked on a similar feature in my company’s Flink
> > >> > > fork, and we would like to share our thoughts on this and make
> code
> > >> > > open source.
> > >> > >
> > >> > > I think there is a better alternative than introducing an abstract
> > >> > > class for TableFunction (CachingTableFunction). As you know,
> > >> > > TableFunction exists in the flink-table-common module, which
> > provides
> > >> > > only an API for working with tables – it’s very convenient for
> > importing
> > >> > > in connectors. In turn, CachingTableFunction contains logic for
> > >> > > runtime execution, so this class and everything connected with it
> > >> > > should be located in another module, probably in
> > flink-table-runtime.
> > >> > > But this will require connectors to depend on another module,
> which
> > >> > > contains a lot of runtime logic, which doesn’t sound good.
> > >> > >
> > >> > > I suggest adding a new method ‘getLookupConfig’ to
> LookupTableSource
> > >> > > or LookupRuntimeProvider to allow connectors to only pass
> > >> > > configurations to the planner, so they won't depend on the
> > >> > > runtime implementation. Based on these configs the planner will
> > >> > > construct a lookup
> > >> > > join operator with corresponding runtime logic (ProcessFunctions
> in
> > >> > > module flink-table-runtime). Architecture looks like in the pinned
> > >> > > image (LookupConfig class there is actually yours CacheConfig).
> > >> > >
> > >> > > Classes in flink-table-planner that will be responsible for this –
> > >> > > CommonPhysicalLookupJoin and its subclasses.
> > >> > > Current classes for lookup join in  flink-table-runtime  -
> > >> > > LookupJoinRunner, AsyncLookupJoinRunner, LookupJoinRunnerWithCalc,
> > >> > > AsyncLookupJoinRunnerWithCalc.
> > >> > >
> > >> > > I suggest adding classes LookupJoinCachingRunner,
> > >> > > LookupJoinCachingRunnerWithCalc, etc.
> > >> > >
> > >> > > And here comes another more powerful advantage of such a solution.
> > If
> > >> > > we have caching logic on a lower level, we can apply some
> > >> > > optimizations to it. LookupJoinRunnerWithCalc was named like this
> > >> > > because it uses the ‘calc’ function, which actually mostly
> consists
> > of
> > >> > > filters and projections.
> > >> > >
> > >> > > For example, in join table A with lookup table B condition ‘JOIN …
> > ON
> > >> > > A.id = B.id AND A.age = B.age + 10 WHERE B.salary > 1000’  ‘calc’
> > >> > > function will contain filters A.age = B.age + 10 and B.salary >
> > 1000.
> > >> > >
> > >> > > If we apply this function before storing records in the cache, the
> > >> > > cache size will be significantly reduced: filters = avoid storing
> useless
> > >> > > records in cache, projections = reduce records’ size. So the
> initial
> > >> > > max number of records in cache can be increased by the user.
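The calc-before-cache optimization can be shown in plain JDK types (no Flink APIs; all names here are illustrative, with the 'calc' split into a filter predicate and a projection function):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.function.Predicate;

// Applies the join's residual filter and projection to looked-up rows
// *before* they enter the cache, so rows that can never match are not
// stored and stored rows carry only the projected columns.
public class FilteringCache<K, Row, Slim> {
    private final Function<K, List<Row>> lookup;  // remote lookup
    private final Predicate<Row> filter;          // e.g. salary > 1000
    private final Function<Row, Slim> projection; // keep needed columns
    private final Map<K, List<Slim>> cache = new HashMap<>();

    public FilteringCache(Function<K, List<Row>> lookup,
                          Predicate<Row> filter,
                          Function<Row, Slim> projection) {
        this.lookup = lookup;
        this.filter = filter;
        this.projection = projection;
    }

    public List<Slim> get(K key) {
        return cache.computeIfAbsent(key, k -> {
            List<Slim> slim = new ArrayList<>();
            for (Row r : lookup.apply(k)) {
                if (filter.test(r)) {              // drop non-matching rows
                    slim.add(projection.apply(r)); // shrink matching rows
                }
            }
            return slim;
        });
    }
}
```

With the salary example above, a user whose filter drops half the rows effectively doubles how many lookup keys fit under the same 'lookup.cache.max-rows' budget.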
> > >> > >
> > >> > > What do you think about it?
> > >> > >
> > >> > >
> > >> > > On 2022/04/19 02:47:11 Qingsheng Ren wrote:
> > >> > > > Hi devs,
> > >> > > >
> > >> > > > Yuan and I would like to start a discussion about FLIP-221[1],
> > which
> > >> > > introduces an abstraction of lookup table cache and its standard
> > metrics.
> > >> > > >
> > >> > > > Currently each lookup table source should implement their own
> > cache to
> > >> > > store lookup results, and there isn’t a standard of metrics for
> > users and
> > >> > > developers to tuning their jobs with lookup joins, which is a
> quite
> > common
> > >> > > use case in Flink table / SQL.
> > >> > > >
> > >> > > > Therefore we propose some new APIs including cache, metrics,
> > wrapper
> > >> > > classes of TableFunction and new table options. Please take a look
> > at the
> > >> > > FLIP page [1] to get more details. Any suggestions and comments
> > would be
> > >> > > appreciated!
> > >> > > >
> > >> > > > [1]
> > >> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-221+Abstraction+for+lookup+source+cache+and+metric
> > >> > > >
> > >> > > > Best regards,
> > >> > > >
> > >> > > > Qingsheng
> > >> > > >
> > >> > > >
> > >> > >
> > >
> > >
> > >
> > > --
> > > Best Regards,
> > >
> > > Qingsheng Ren
> > >
> > > Real-time Computing Team
> > > Alibaba Cloud
> > >
> > > Email: renqschn@gmail.com
> >
>

Re: [DISCUSS] FLIP-221 Abstraction for lookup source cache and metric

Posted by Martijn Visser <ma...@apache.org>.
Hi everyone,

I don't have much to chip in, but just wanted to express that I really
appreciate the in-depth discussion on this topic and I hope that others
will join the conversation.

Best regards,

Martijn

On Tue, 3 May 2022 at 10:15, Александр Смирнов <sm...@gmail.com> wrote:

> Hi Qingsheng, Leonard and Jark,
>
> Thanks for your detailed feedback! However, I have questions about
> some of your statements (maybe I didn't get something?).
>
> > Caching actually breaks the semantic of "FOR SYSTEM_TIME AS OF proc_time”
>
> I agree that the semantics of "FOR SYSTEM_TIME AS OF proc_time" is not
> fully implemented with caching, but as you said, users go on it
> consciously to achieve better performance (no one proposed to enable
> caching by default, etc.). Or by users do you mean other developers of
> connectors? In this case developers explicitly specify whether their
> connector supports caching or not (in the list of supported options),
> no one makes them do that if they don't want to. So what exactly is
> the difference between implementing caching in modules
> flink-table-runtime and in flink-table-common from the considered
> point of view? How does it affect on breaking/non-breaking the
> semantics of "FOR SYSTEM_TIME AS OF proc_time"?
>
> > confront a situation that allows table options in DDL to control the
> behavior of the framework, which has never happened previously and should
> be cautious
>
> If we talk about main differences of semantics of DDL options and
> config options("table.exec.xxx"), isn't it about limiting the scope of
> the options + importance for the user business logic rather than
> specific location of corresponding logic in the framework? I mean that
> in my design, for example, putting an option with lookup cache
> strategy in configurations would  be the wrong decision, because it
> directly affects the user's business logic (not just performance
> optimization) + touches just several functions of ONE table (there can
> be multiple tables with different caches). Does it really matter for
> the user (or someone else) where the logic is located, which is
> affected by the applied option?
> Also I can remember DDL option 'sink.parallelism', which in some way
> "controls the behavior of the framework" and I don't see any problem
> here.
>
> > introduce a new interface for this all-caching scenario and the design
> would become more complex
>
> This is a subject for a separate discussion, but actually in our
> internal version we solved this problem quite easily - we reused
> InputFormat class (so there is no need for a new API). The point is
> that currently all lookup connectors use InputFormat for scanning the
> data in batch mode: HBase, JDBC and even Hive - it uses class
> PartitionReader, that is actually just a wrapper around InputFormat.
> The advantage of this solution is the ability to reload cache data in
> parallel (number of threads depends on number of InputSplits, but has
> an upper limit). As a result cache reload time significantly reduces
> (as well as time of input stream blocking). I know that usually we try
> to avoid usage of concurrency in Flink code, but maybe this one can be
> an exception. BTW I don't say that it's an ideal solution, maybe there
> are better ones.
>
> > Providing the cache in the framework might introduce compatibility issues
>
> It's possible only in cases when the developer of the connector won't
> properly refactor his code and will use new cache options incorrectly
> (i.e. explicitly provide the same options into 2 different code
> places). For correct behavior all he will need to do is to redirect
> existing options to the framework's LookupConfig (+ maybe add an alias
> for options, if there was different naming), everything will be
> transparent for users. If the developer won't do refactoring at all,
> nothing will be changed for the connector because of backward
> compatibility. Also if a developer wants to use his own cache logic,
> he just can refuse to pass some of the configs into the framework, and
> instead make his own implementation with already existing configs and
> metrics (but actually I think that it's a rare case).
>
> > filters and projections should be pushed all the way down to the table
> function, like what we do in the scan source
>
> It's the great purpose. But the truth is that the ONLY connector that
> supports filter pushdown is FileSystemTableSource
> (no database connector supports it currently). Also for some databases
> it's simply impossible to pushdown such complex filters that we have
> in Flink.
>
> >  only applying these optimizations to the cache seems not quite useful
>
> Filters can cut off an arbitrarily large amount of data from the
> dimension table. For a simple example, suppose in dimension table
> 'users'
> we have column 'age' with values from 20 to 40, and input stream
> 'clicks' that is ~uniformly distributed by age of users. If we have
> filter 'age > 30',
> there will be twice less data in cache. This means the user can
> increase 'lookup.cache.max-rows' by almost 2 times. It will gain a
> huge
> performance boost. Moreover, this optimization starts to really shine
> in 'ALL' cache, where tables without filters and projections can't fit
> in memory, but with them - can. This opens up additional possibilities
> for users. And this doesn't sound as 'not quite useful'.
>
> It would be great to hear other voices regarding this topic! Because
> we have quite a lot of controversial points, and I think with the help
> of others it will be easier for us to come to a consensus.
>
> Best regards,
> Smirnov Alexander
>
>
> пт, 29 апр. 2022 г. в 22:33, Qingsheng Ren <re...@gmail.com>:
> >
> > Hi Alexander and Arvid,
> >
> > Thanks for the discussion and sorry for my late response! We had an
> internal discussion together with Jark and Leonard and I’d like to
> summarize our ideas. Instead of implementing the cache logic in the table
> runtime layer or wrapping around the user-provided table function, we
> prefer to introduce some new APIs extending TableFunction with these
> concerns:
> >
> > 1. Caching actually breaks the semantic of "FOR SYSTEM_TIME AS OF
> proc_time”, because it couldn’t truly reflect the content of the lookup
> table at the moment of querying. If users choose to enable caching on the
> lookup table, they implicitly indicate that this breakage is acceptable in
> exchange for the performance. So we prefer not to provide caching on the
> table runtime level.
> >
> > 2. If we make the cache implementation in the framework (whether in a
> runner or a wrapper around TableFunction), we have to confront a situation
> that allows table options in DDL to control the behavior of the framework,
> which has never happened previously and should be cautious. Under the
> current design the behavior of the framework should only be specified by
> configurations (“table.exec.xxx”), and it’s hard to apply these general
> configs to a specific table.
> >
> > 3. We have use cases that lookup source loads and refresh all records
> periodically into the memory to achieve high lookup performance (like Hive
> connector in the community, and also widely used by our internal
> connectors). Wrapping the cache around the user’s TableFunction works fine
> for LRU caches, but I think we have to introduce a new interface for this
> all-caching scenario and the design would become more complex.
> >
> > > 4. Providing the cache in the framework might introduce compatibility
> > > issues for existing lookup sources: if the user misconfigures the
> > > table, there could be two caches with totally different strategies (one
> > > in the framework and another implemented by the lookup source).
> >
> > > As for the optimization mentioned by Alexander, I think filters and
> > > projections should be pushed all the way down to the table function,
> > > like what we do in the scan source, instead of stopping at the runner
> > > with the cache. The goal of using a cache is to reduce network I/O and
> > > the pressure on the external system, and applying these optimizations
> > > only to the cache seems not quite useful.
> >
> > > I made some updates to the FLIP[1] to reflect our ideas. We prefer to
> > > keep the cache implementation as part of TableFunction, and we could
> > > provide some helper classes (CachingTableFunction, AllCachingTableFunction,
> > > CachingAsyncTableFunction) to developers and standardize the metrics of
> > > the cache. Also, I made a POC[2] for your reference.
> >
> > Looking forward to your ideas!
> >
> > [1]
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-221+Abstraction+for+lookup+source+cache+and+metric
> > [2] https://github.com/PatrickRen/flink/tree/FLIP-221
> >
> > Best regards,
> >
> > Qingsheng
> >
> > On Tue, Apr 26, 2022 at 4:45 PM Александр Смирнов <sm...@gmail.com>
> wrote:
> >>
> >> Thanks for the response, Arvid!
> >>
> >> I have a few comments on your message.
> >>
> >> > but could also live with an easier solution as the first step:
> >>
> >> I think that these 2 ways are mutually exclusive (the one originally
> >> proposed by Qingsheng and mine), because conceptually they follow the
> >> same goal, but the implementation details are different. If we go one
> >> way, moving to the other in the future will mean deleting existing code
> >> and once again changing the API for connectors. So I think we should
> >> reach a consensus with the community about that and then work together
> >> on this FLIP, i.e. divide the work into tasks for different parts of the
> >> FLIP (for example, LRU cache unification / introducing the proposed set
> >> of metrics / further work…). WDYT, Qingsheng?
> >>
> >> > as the source will only receive the requests after filter
> >>
> >> Actually, if filters are applied to fields of the lookup table, we
> >> must first make the requests, and only after that can we filter the
> >> responses, because lookup connectors don’t support filter pushdown. So
> >> if filtering is done before caching, there will be far fewer rows in
> >> the cache.
> >>
> >> > @Alexander unfortunately, your architecture is not shared. I don't
> know the
> >>
> >> > solution to share images to be honest.
> >>
> >> Sorry for that, I’m a bit new to such kinds of conversations :)
> >> I have no write access to the confluence, so I made a Jira issue,
> >> where described the proposed changes in more details -
> >> https://issues.apache.org/jira/browse/FLINK-27411.
> >>
> >> Will be happy to get more feedback!
> >>
> >> Best,
> >> Smirnov Alexander
> >>
> >> пн, 25 апр. 2022 г. в 19:49, Arvid Heise <ar...@apache.org>:
> >> >
> >> > Hi Qingsheng,
> >> >
> >> > Thanks for driving this; the inconsistency was not satisfying for me.
> >> >
> >> > I second Alexander's idea, though I could also live with an easier
> >> > solution as the first step: instead of making caching an
> >> > implementation detail of TableFunction X, devise a caching layer
> >> > around X. The proposal would be a CachingTableFunction that delegates
> >> > to X on cache misses and otherwise serves results from the cache.
> >> > Lifting it into the operator model as proposed would be even better
> >> > but is probably unnecessary as a first step for a lookup source (the
> >> > source will only receive the requests after the filter; applying the
> >> > projection may be more interesting to save memory).
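A minimal sketch of such a delegating cache layer (hypothetical names, not the proposed CachingTableFunction API; a plain access-ordered LinkedHashMap stands in for a production LRU cache) might look like:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

/** Minimal sketch of the "caching layer around X" idea: serve lookups from a
 *  bounded LRU cache and delegate to the wrapped lookup function only on
 *  misses. Names are illustrative, not the proposed Flink API; a real
 *  implementation would also need TTL handling and cache metrics. */
class CachingLookup<K, V> {
    private final Function<K, V> delegate; // the underlying TableFunction-like lookup "X"
    private final Map<K, V> cache;

    CachingLookup(Function<K, V> delegate, int maxEntries) {
        this.delegate = delegate;
        // access-ordered LinkedHashMap acts as a simple LRU cache
        this.cache = new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                return size() > maxEntries;
            }
        };
    }

    V lookup(K key) {
        // cache hit: no call to the external system; miss: delegate, then store
        return cache.computeIfAbsent(key, delegate);
    }
}
```

The point of the pattern is that the wrapped function X never knows a cache exists, so any existing lookup TableFunction could be reused unchanged.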
> >> >
> >> > Another advantage is that all the changes of this FLIP would be
> >> > limited to options; no new public interfaces would be needed.
> >> > Everything else remains an implementation detail of the table runtime.
> >> > That means we can easily incorporate the optimization potential that
> >> > Alexander pointed out later.
> >> >
> >> > @Alexander unfortunately, your architecture is not shared. I don't
> know the
> >> > solution to share images to be honest.
> >> >
> >> > On Fri, Apr 22, 2022 at 5:04 PM Александр Смирнов <
> smiralexan@gmail.com>
> >> > wrote:
> >> >
> >> > > Hi Qingsheng! My name is Alexander, I'm not a committer yet, but I'd
> >> > > really like to become one. And this FLIP really interested me.
> >> > > Actually I have worked on a similar feature in my company’s Flink
> >> > > fork, and we would like to share our thoughts on this and make code
> >> > > open source.
> >> > >
> >> > > I think there is a better alternative to introducing an abstract
> >> > > class for TableFunction (CachingTableFunction). As you know,
> >> > > TableFunction lives in the flink-table-common module, which provides
> >> > > only an API for working with tables – very convenient to import in
> >> > > connectors. In turn, CachingTableFunction contains runtime execution
> >> > > logic, so this class and everything connected with it should be
> >> > > located in another module, probably flink-table-runtime. But that
> >> > > would require connectors to depend on a module containing a lot of
> >> > > runtime logic, which doesn’t sound good.
> >> > >
> >> > > I suggest adding a new method ‘getLookupConfig’ to LookupTableSource
> >> > > or LookupRuntimeProvider to allow connectors to pass only
> >> > > configurations to the planner, so that they won’t depend on the
> >> > > runtime implementation. Based on these configs the planner will
> >> > > construct a lookup join operator with the corresponding runtime
> >> > > logic (ProcessFunctions in the flink-table-runtime module). The
> >> > > architecture looks like the pinned image (the LookupConfig class
> >> > > there is actually your CacheConfig).
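The getLookupConfig idea could be sketched roughly like this (all names, fields, and option keys are illustrative, not a final API): the connector only declares its caching needs, and the planner decides how to build the caching operator.

```java
/** Hypothetical sketch: the connector *describes* its caching needs via a
 *  config object instead of shipping runtime logic. Names (LookupConfig,
 *  getLookupConfig) follow the proposal in this thread but are illustrative. */
final class LookupConfig {
    enum Strategy { NONE, LRU, ALL }

    final Strategy strategy;
    final long maxRows;   // only meaningful for the LRU strategy
    final long ttlMillis; // expire-after-write; 0 means no TTL

    LookupConfig(Strategy strategy, long maxRows, long ttlMillis) {
        this.strategy = strategy;
        this.maxRows = maxRows;
        this.ttlMillis = ttlMillis;
    }
}

/** What a connector might implement: a declaration, no runtime dependency. */
interface LookupTableSourceSketch {
    LookupConfig getLookupConfig();
}

class JdbcLikeSource implements LookupTableSourceSketch {
    @Override
    public LookupConfig getLookupConfig() {
        // in practice these values would come from DDL options of the table
        return new LookupConfig(LookupConfig.Strategy.LRU, 10_000, 60_000);
    }
}
```

The planner would then map Strategy.LRU or Strategy.ALL to the corresponding runner in flink-table-runtime, keeping connectors free of any runtime imports.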
> >> > >
> >> > > The classes in flink-table-planner that will be responsible for
> >> > > this are CommonPhysicalLookupJoin and its subclasses.
> >> > > The current classes for lookup join in flink-table-runtime are
> >> > > LookupJoinRunner, AsyncLookupJoinRunner, LookupJoinRunnerWithCalc,
> >> > > and AsyncLookupJoinRunnerWithCalc.
> >> > >
> >> > > I suggest adding classes like LookupJoinCachingRunner,
> >> > > LookupJoinCachingRunnerWithCalc, etc.
> >> > >
> >> > > And here comes another, more powerful advantage of such a solution:
> >> > > if the caching logic lives at a lower level, we can apply some
> >> > > optimizations to it. LookupJoinRunnerWithCalc is named that way
> >> > > because it uses the ‘calc’ function, which mostly consists of
> >> > > filters and projections.
> >> > >
> >> > > For example, when joining table A with lookup table B under the
> >> > > condition ‘JOIN … ON A.id = B.id AND A.age = B.age + 10 WHERE
> >> > > B.salary > 1000’, the ‘calc’ function will contain the filters
> >> > > A.age = B.age + 10 and B.salary > 1000.
> >> > >
> >> > > If we apply this function before storing records in the cache, the
> >> > > size of the cache will be significantly reduced: filters avoid
> >> > > storing useless records in the cache, and projections reduce the
> >> > > records’ size. So the user can increase the configured maximum
> >> > > number of records in the cache.
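The effect described here can be sketched as follows (illustrative names, not the actual LookupJoinCachingRunnerWithCalc; an unbounded HashMap stands in for the real cache):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.function.Predicate;
import java.util.stream.Collectors;

/** Sketch of the optimization described above: apply the join's residual
 *  filter and projection to looked-up rows before they enter the cache, so
 *  the cache only holds rows (and columns) that can actually join. All names
 *  here are illustrative, not the actual Flink runner classes. */
class FilteringCachingLookup<K, R, P> {
    private final Function<K, List<R>> delegate; // raw lookup against table B
    private final Predicate<R> filter;           // e.g. B.salary > 1000
    private final Function<R, P> projection;     // keep only the needed columns
    private final Map<K, List<P>> cache = new HashMap<>();

    FilteringCachingLookup(Function<K, List<R>> delegate,
                           Predicate<R> filter,
                           Function<R, P> projection) {
        this.delegate = delegate;
        this.filter = filter;
        this.projection = projection;
    }

    List<P> lookup(K key) {
        return cache.computeIfAbsent(key, k ->
                delegate.apply(k).stream()
                        .filter(filter)   // drop rows that can never join
                        .map(projection)  // shrink every cached row
                        .collect(Collectors.toList()));
    }

    int cachedRows() {
        return cache.values().stream().mapToInt(List::size).sum();
    }
}
```

With the filter applied before insertion, a row that fails B.salary > 1000 is fetched once but never occupies a cache slot, which is exactly why the user can afford a larger max-rows setting.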
> >> > >
> >> > > What do you think about it?
> >> > >
> >> > >
> >> > > On 2022/04/19 02:47:11 Qingsheng Ren wrote:
> >> > > > Hi devs,
> >> > > >
> >> > > > Yuan and I would like to start a discussion about FLIP-221[1],
> which
> >> > > introduces an abstraction of lookup table cache and its standard
> metrics.
> >> > > >
> >> > > > Currently each lookup table source has to implement its own cache
> >> > > > to store lookup results, and there isn’t a standard set of metrics
> >> > > > for users and developers to tune their jobs with lookup joins,
> >> > > > which is a quite common use case in Flink table / SQL.
> >> > > >
> >> > > > Therefore we propose some new APIs including a cache, metrics,
> >> > > > wrapper classes for TableFunction, and new table options. Please
> >> > > > take a look at the FLIP page [1] for more details. Any suggestions
> >> > > > and comments would be appreciated!
> >> > > >
> >> > > > [1]
> >> > >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-221+Abstraction+for+lookup+source+cache+and+metric
> >> > > >
> >> > > > Best regards,
> >> > > >
> >> > > > Qingsheng
> >> > > >
> >> > > >
> >> > >
> >
> >
> >
> > --
> > Best Regards,
> >
> > Qingsheng Ren
> >
> > Real-time Computing Team
> > Alibaba Cloud
> >
> > Email: renqschn@gmail.com
>