Posted to dev@flink.apache.org by Becket Qin <be...@gmail.com> on 2018/11/20 13:56:08 UTC

[DISCUSS] Support Interactive Programming in Flink Table API

Hi all,

As a few recent email threads have pointed out, it is a promising
opportunity to enhance Flink Table API in various aspects, including
functionality and ease of use among others. One of the scenarios where we
feel Flink could improve is interactive programming. To explain the issues
and facilitate the discussion on the solution, we put together the
following document with our proposal.

https://docs.google.com/document/d/1d4T2zTyfe7hdncEUAxrlNOYr4e5IMNEZLyqSuuswkA0/edit?usp=sharing

Feedback and comments are very welcome!

Thanks,

Jiangjie (Becket) Qin

Re: [DISCUSS] Support Interactive Programming in Flink Table API

Posted by Becket Qin <be...@gmail.com>.
Hi Fabian,

Thanks for sharing the feedback!

Re: 1)
Good question about the implementation. In fact, Alibaba has modified the
query planning a little bit to add something called a LogicalNodeBlock.
Basically, a given DAG can be divided into a few LogicalNodeBlocks, and
the optimization is done within each LogicalNodeBlock, i.e. a subgraph.
This feature has helped significantly in many cases, including cache().

So when table.cache() is invoked, Flink will add a sink to that
table, and that table will become the last LogicalNode of a block. The
subsequent tables referring to the cached table will be in another block.
You are absolutely right that when looking at a table with the cache flag
set, Flink needs to know whether it should create the cache or read from
the cache. The current idea is to have the TableEnvironment remember that
information. To explain with an example:

Table t1 = ....
t1.cache(); // A flag is set on t1 to indicate that it needs to be cached.
t1.count(); // A job is submitted, and a TableSink is added to t1, with
t1_UUID as the sink table name. When the job returns successfully, a
mapping of t1 -> t1_UUID is remembered by the TableEnvironment.
t1.count(); // The TableEnvironment goes over the DAG, finds the t1 ->
t1_UUID mapping, and replaces the t1 DAG with a table scan of t1_UUID.

Re: 2)
If I understand correctly, the ambiguity comes from the assumption that a
table is mutable, i.e. something like table.insert(). Is there any such
implicit behavior if the table is immutable? If the implicit behavior
comes from mutability, and cache() returns a CachedTable which extends
Table, does that mean one can also insert into the CachedTable? That
sounds pretty confusing, as illustrated below.
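
To illustrate the confusion with a hypothetical example (Table has no
insert() method today; the call below is made up purely to show the
ambiguity):

// Hypothetical, for illustration only: Table has no insert() method;
// this sketch just shows why a mutable CachedTable would be confusing.
Table t1 = tableEnv.scan("source");
CachedTable cached = t1.cache(); // assuming cache() returned a CachedTable
cached.insert(someRow);          // does this write to the cache, to the
                                 // original table, or to both?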

Re: 3)
I explained why I thought cache() and materialize() should be two
different methods in my reply to Piotrek and Jark. Please let me know what
you think.

Thanks again for the feedback.

Jiangjie (Becket) Qin



On Thu, Nov 29, 2018 at 9:16 PM Fabian Hueske <fh...@gmail.com> wrote:

> Hi,
>
> Thanks for the clarification Becket!
>
> I have a few thoughts to share / questions:
>
> 1) I'd like to know how you plan to implement the feature on a plan /
> planner level.
>
> I would imagine the following to happen when Table.cache() is called:
>
> 1) immediately optimize the Table and internally convert it into a
> DataSet/DataStream. This is necessary to prevent operators of later
> queries on top of the Table from being pushed down.
> 2) register the DataSet/DataStream as a DataSet/DataStream-backed Table X
> 3) add a sink to the DataSet/DataStream. This is the materialization of the
> Table X
>
> Based on your proposal the following would happen:
>
> Table t1 = ....
> t1.cache(); // cache() returns void. The logical plan of t1 is replaced by
> a scan of X. There is also a reference to the materialization of X.
>
> t1.count(); // this executes the program, including the DataSet/DataStream
> that backs X and the sink that writes the materialization of X
> t1.count(); // this executes the program, but reads X from the
> materialization.
>
> My question is: how do you determine when the scan of t1 should go
> against the DataSet/DataStream program and when against the
> materialization?
> AFAIK, there is no hook that will tell you that a part of the program was
> executed. Flipping a switch during optimization or plan generation is not
> sufficient as there is no guarantee that the plan is also executed.
>
> Overall, this behavior is somewhat similar to what I proposed in
> FLINK-8950, which does not include persisting the table, but just
> optimizing and re-registering it as a DataSet/DataStream scan.
>
> 2) I think Piotr has a point about the implicit behavior and side effects
> of the cache() method if it does not return anything.
> Consider the following example:
>
> Table t1 = ???
> Table t2 = methodThatAppliesOperators(t1);
> Table t3 = methodThatAppliesOtherOperators(t1);
>
> In this case, the behavior/performance of the plan that results from the
> second method call depends on whether t1 was modified by the first method
> or not.
> This is the classic issue of mutable vs. immutable objects.
> Also, as Piotr pointed out, it might also be good to have the original plan
> of t1, because in some cases it is possible to push filters down such that
> evaluating the query from scratch might be more efficient than accessing
> the cache.
> Moreover, a CachedTable could extend Table and offer a method refresh().
> This sounds quite useful in an interactive session mode.
>
> 3) Regarding the name, I can see both arguments. IMO, materialize() seems
> to be more future proof.
>
> Best, Fabian
>
> On Thu, Nov 29, 2018 at 12:56 PM Shaoxuan Wang <
> wshaoxuan@gmail.com> wrote:
>
> > Hi Piotr,
> >
> > Thanks for sharing your ideas on the method naming. We will think about
> > your suggestions. But I don't understand why we need to change the return
> > type of cache().
> >
> > Cache() is a physical operation, it does not change the logic of
> > the `Table`. On the tableAPI layer, we should not introduce a new table
> > type unless the logic of table has been changed. If we introduce a new
> > table type `CachedTable`, we need to create the same set of methods of
> > `Table` for it. I don't think it is worth doing this. Or can you please elaborate
> > more on what could be the "implicit behaviours/side effects" you are
> > thinking about?
> >
> > Regards,
> > Shaoxuan
> >
> >
> >
> > On Thu, Nov 29, 2018 at 7:05 PM Piotr Nowojski <pi...@data-artisans.com>
> > wrote:
> >
> > > Hi Becket,
> > >
> > > Thanks for the response.
> > >
> > > 1. I wasn’t saying that a materialised view must be mutable or not. The
> > same
> > > thing applies to caches as well. To the contrary, I would expect more
> > > consistency and updates from something that is called “cache” vs
> > something
> > > that’s a “materialised view”. In other words, IMO most caches do not
> > serve
> > > you invalid/outdated data and they handle updates on their own.
> > >
> > > 2. I don’t think that having in the future two very similar concepts of
> > > `materialized` view and `cache` is a good idea. It would be confusing
> for
> > > the users. I think it could be handled by variations/overloading of
> > > materialised view concept. We could start with:
> > >
> > > `MaterializedTable materialize()` - immutable, session life scope
> > > (basically the same semantics as you are proposing).
> > >
> > > And then in the future (if ever) build on top of that/expand it with:
> > >
> > > `MaterializedTable materialize(refreshTime=…)` or `MaterializedTable
> > > materialize(refreshHook=…)`
> > >
> > > Or with cross session support:
> > >
> > > `MaterializedTable materializeInto(connector=…)` or `MaterializedTable
> > > materializeInto(tableFactory=…)`
> > >
> > > I’m not saying that we should implement cross session/refreshing now or
> > > even in the near future. I’m just arguing that naming the current
> > > immutable, session-scoped method `materialize()` is more future proof
> > > and more consistent with SQL (on which, after all, the Table API is
> > > heavily based).
> > >
> > > 3. Even if we agree on naming it `cache()`, I would still insist on
> > > `cache()` returning a `CachedTable` handle to avoid implicit
> > behaviours/side
> > > effects and to give both us & users more flexibility.
> > >
> > > Piotrek
> > >
> > > > On 29 Nov 2018, at 06:20, Becket Qin <be...@gmail.com> wrote:
> > > >
> > > > Just to add a little bit, the materialized view is probably more
> > similar
> > > to
> > > > the persist() brought up earlier in the thread. So it is usually
> > cross
> > > > session and could be used in a larger scope. For example, a
> > materialized
> > > > view created by user A may be visible to user B. It is probably
> > something
> > > > we want to have in the future. I'll put it in the future work
> section.
> > > >
> > > > Thanks,
> > > >
> > > > Jiangjie (Becket) Qin
> > > >
> > > > On Thu, Nov 29, 2018 at 9:47 AM Becket Qin <be...@gmail.com>
> > wrote:
> > > >
> > > >> Hi Piotrek,
> > > >>
> > > >> Thanks for the explanation.
> > > >>
> > > >> Right now we are mostly thinking of the cached table as immutable. I
> > can
> > > >> see that the materialized view would be useful in the future. That said,
> I
> > > think
> > > >> a simple cache mechanism is probably still needed. So to me, cache()
> > > >> and materialize() should be two separate methods, as they address
> > > >> different needs. Materialize() is a higher-level concept usually
> > > >> implying periodic updates, while cache() has much simpler semantics.
> > > >> For example, one may create a materialized view and use the cache()
> > > >> method in the materialized view creation logic, so that during the
> > > >> materialized view update, they do not need to worry about the cached
> > > >> table also being changed. Maybe under the hood, materialize() and
> > > >> cache() could share some mechanism, but I think a simple cache()
> > > >> method would be handy in a lot of cases.
> > > >>
> > > >> Thanks,
> > > >>
> > > >> Jiangjie (Becket) Qin
> > > >>
> > > >> On Mon, Nov 26, 2018 at 9:38 PM Piotr Nowojski <
> > piotr@data-artisans.com
> > > >
> > > >> wrote:
> > > >>
> > > >>> Hi Becket,
> > > >>>
> > > >>>> Is there any extra thing a user can do on a MaterializedTable that
> > they
> > > >>> cannot do on a Table?
> > > >>>
> > > >>> Maybe not in the initial implementation, but various DBs offer
> > > different
> > > >>> ways to “refresh” the materialised view. Hooks, triggers, timers,
> > > manually
> > > >>> etc. Having `MaterializedTable` would help us to handle that in the
> > > future.
> > > >>>
> > > >>>> After users call *table.cache()*, users can just use that table
> and
> > do
> > > >>> anything that is supported on a Table, including SQL.
> > > >>>
> > > >>> This is some implicit behaviour with side effects. Imagine a user
> > > >>> has a long and complicated program that touches table `b` multiple
> > > >>> times, maybe scattered across different methods. If he modifies his
> > > >>> program by inserting in one place
> > > >>>
> > > >>> b.cache()
> > > >>>
> > > >>> This implicitly alters the semantics and behaviour of his code all
> > > >>> over the place, maybe in ways that might cause problems. For example,
> > > >>> what if the underlying data is changing?
> > > >>>
> > > >>> Having invisible side effects is also not very clean, for example
> > think
> > > >>> about something like this (but more complicated):
> > > >>>
> > > >>> Table b = ...;
> > > >>>
> > > >>> if (some_condition) {
> > > >>>  processTable1(b)
> > > >>> }
> > > >>> else {
> > > >>>  processTable2(b)
> > > >>> }
> > > >>>
> > > >>> // do more stuff with b
> > > >>>
> > > >>> And the user adds a `b.cache()` call to only one of the
> > > >>> `processTable1` or `processTable2` methods.
> > > >>>
> > > >>> On the other hand
> > > >>>
> > > >>> Table materialisedB = b.materialize()
> > > >>>
> > > >>> Avoids (at least some of) the side effect issues and forces the user
> > > >>> to explicitly use `materialisedB` where it’s appropriate, and forces
> > > >>> the user to think about what it actually means. And if something
> > > >>> doesn’t work in the end for the user, he will know what he has changed
> > > >>> instead of blaming Flink for some “magic” underneath. In the above
> > > >>> example, after materialising b in only one of the methods, he
> > > >>> should/would realise the issue when handling the `MaterializedTable`
> > > >>> return value of that method.
> > > >>>
> > > >>> I guess it comes down to personal preference whether you like things
> > > >>> to be implicit or not. The more of a power user someone is, the more
> > > >>> likely he is to like/understand implicit behaviour. And we as Table
> > > >>> API designers are the biggest power users out there, so I would
> > > >>> proceed with caution (so that we do not end up in the crazy Perl realm
> > > >>> with its lovely implicit method arguments ;)
> > > >>> <https://stackoverflow.com/a/14922656/8149051>)
> > > >>>
> > > >>>> Table API to also support non-relational processing cases, cache()
> > > >>> might be slightly better.
> > > >>>
> > > >>> I think even such an extended Table API could benefit from sticking
> > > >>> to / being consistent with SQL, where both SQL and the Table API are
> > > >>> basically the same.
> > > >>>
> > > >>> One more thing: `MaterializedTable materialize()` could be more
> > > >>> powerful/flexible, allowing the user to operate on both the
> > > >>> materialised and the non-materialised view at the same time for
> > > >>> whatever reason (underlying data changing / better optimisation
> > > >>> opportunities after pushing down more filters, etc.). For example:
> > > >>>
> > > >>> Table b = …;
> > > >>>
> > > >>> MaterializedTable mb = b.materialize();
> > > >>>
> > > >>> val min = mb.min();
> > > >>> val max = mb.max();
> > > >>>
> > > >>> val user42 = b.filter('userId === 42);
> > > >>>
> > > >>> Could be more efficient compared to `b.cache()` if `filter('userId ===
> > > >>> 42)` allows for much more aggressive optimisations.
> > > >>>
> > > >>> Piotrek
> > > >>>
> > > >>>> On 26 Nov 2018, at 12:14, Fabian Hueske <fh...@gmail.com>
> wrote:
> > > >>>>
> > > >>>> I'm not suggesting adding support for Ignite. This was just an
> > > example.
> > > >>>> Plasma and Arrow sound interesting, too.
> > > >>>> For the sake of this proposal, it would be up to the user to
> > > implement a
> > > >>>> TableFactory and corresponding TableSource / TableSink classes to
> > > >>> persist
> > > >>>> and read the data.
> > > >>>>
> > > >>>> On Mon, Nov 26, 2018 at 12:06 PM Flavio Pompermaier <
> > > >>>> pompermaier@okkam.it> wrote:
> > > >>>>
> > > >>>>> What about also adding Apache Plasma + Arrow as an alternative to
> > > >>>>> Apache Ignite?
> > > >>>>> [1]
> > > >>>>>
> > > >>>
> > >
> https://arrow.apache.org/blog/2017/08/08/plasma-in-memory-object-store/
> > > >>>>>
> > > >>>>> On Mon, Nov 26, 2018 at 11:56 AM Fabian Hueske <
> fhueske@gmail.com>
> > > >>> wrote:
> > > >>>>>
> > > >>>>>> Hi,
> > > >>>>>>
> > > >>>>>> Thanks for the proposal!
> > > >>>>>>
> > > >>>>>> To summarize, you propose a new method Table.cache(): Table that
> > > will
> > > >>>>>> trigger a job and write the result into some temporary storage
> as
> > > >>> defined
> > > >>>>>> by a TableFactory.
> > > >>>>>> The cache() call blocks while the job is running and eventually
> > > >>> returns a
> > > >>>>>> Table object that represents a scan of the temporary table.
> > > >>>>>> When the "session" is closed (closing to be defined?), the
> > temporary
> > > >>>>> tables
> > > >>>>>> are all dropped.
> > > >>>>>>
> > > >>>>>> I think this behavior makes sense and is a good first step
> towards
> > > >>> more
> > > >>>>>> interactive workloads.
> > > >>>>>> However, its performance suffers from writing to and reading
> from
> > > >>>>> external
> > > >>>>>> systems.
> > > >>>>>> I think this is OK for now. Changes that would significantly
> > improve
> > > >>> the
> > > >>>>>> situation (i.e., pinning data in-memory across jobs) would have
> > > large
> > > >>>>>> impacts on many components of Flink.
> > > >>>>>> Users could use in-memory filesystems or storage grids (Apache
> > > >>> Ignite) to
> > > >>>>>> mitigate some of the performance effects.
> > > >>>>>>
> > > >>>>>> Best, Fabian
> > > >>>>>>
> > > >>>>>>
> > > >>>>>>
> > > >>>>>> On Mon, Nov 26, 2018 at 3:38 AM Becket Qin <
> > > >>>>>> becket.qin@gmail.com> wrote:
> > > >>>>>>
> > > >>>>>>> Thanks for the explanation, Piotrek.
> > > >>>>>>>
> > > >>>>>>> Is there any extra thing a user can do on a MaterializedTable
> that
> > > they
> > > >>>>>>> cannot do on a Table? After users call *table.cache()*, users
> can
> > > >>> just
> > > >>>>>> use
> > > >>>>>>> that table and do anything that is supported on a Table,
> > including
> > > >>> SQL.
> > > >>>>>>>
> > > >>>>>>> Naming-wise, either cache() or materialize() sounds fine to me.
> > > >>> cache()
> > > >>>>>> is
> > > >>>>>>> a bit more general than materialize(). Given that we are
> > enhancing
> > > >>> the
> > > >>>>>>> Table API to also support non-relational processing cases,
> > cache()
> > > >>>>> might
> > > >>>>>> be
> > > >>>>>>> slightly better.
> > > >>>>>>>
> > > >>>>>>> Thanks,
> > > >>>>>>>
> > > >>>>>>> Jiangjie (Becket) Qin
> > > >>>>>>>
> > > >>>>>>>
> > > >>>>>>>
> > > >>>>>>> On Fri, Nov 23, 2018 at 11:25 PM Piotr Nowojski <
> > > >>>>> piotr@data-artisans.com
> > > >>>>>>>
> > > >>>>>>> wrote:
> > > >>>>>>>
> > > >>>>>>>> Hi Becket,
> > > >>>>>>>>
> > > >>>>>>>> Oops, sorry, I didn’t notice that you intend to reuse the existing
> > > >>>>>>>> `TableFactory`. I don’t know why, but I assumed that you want
> to
> > > >>>>>> provide
> > > >>>>>>> an
> > > >>>>>>>> alternate way of writing the data.
> > > >>>>>>>>
> > > >>>>>>>> Now that I hopefully understand the proposal, maybe we could
> > > rename
> > > >>>>>>>> `cache()` to
> > > >>>>>>>>
> > > >>>>>>>> void materialize()
> > > >>>>>>>>
> > > >>>>>>>> or going a step further
> > > >>>>>>>>
> > > >>>>>>>> MaterializedTable materialize()
> > > >>>>>>>> MaterializedTable createMaterializedView()
> > > >>>>>>>>
> > > >>>>>>>> ?
> > > >>>>>>>>
> > > >>>>>>>> The second option with returning a handle I think is more
> > flexible
> > > >>>>> and
> > > >>>>>>>> could provide features such as “refresh”/“delete” or generally
> > > >>>>> speaking
> > > >>>>>>>> manage the view. In the future, we could also think about
> > > adding
> > > >>>>>> hooks
> > > >>>>>>>> to automatically refresh the view, etc. It is also more explicit -
> > > >>>>>>>> materialization returning a new table handle will not have the
> > > same
> > > >>>>>>>> implicit side effects as adding a simple line of code like
> > > >>>>> `b.cache()`
> > > >>>>>>>> would have.
> > > >>>>>>>>
> > > >>>>>>>> It would also be more SQL-like, making it more intuitive for
> > > >>>>>>>> users already familiar with SQL.
> > > >>>>>>>>
> > > >>>>>>>> Piotrek
> > > >>>>>>>>
> > > >>>>>>>>> On 23 Nov 2018, at 14:53, Becket Qin <be...@gmail.com>
> > > wrote:
> > > >>>>>>>>>
> > > >>>>>>>>> Hi Piotrek,
> > > >>>>>>>>>
> > > >>>>>>>>> For the cache() method itself, yes, it is equivalent to
> > creating
> > > a
> > > >>>>>>>> BUILT-IN
> > > >>>>>>>>> materialized view with a lifecycle. That functionality is
> > missing
> > > >>>>>>> today,
> > > >>>>>>>>> though. Not sure if I understand your question. Do you mean
> we
> > > >>>>>> already
> > > >>>>>>>> have
> > > >>>>>>>>> the functionality and just need some syntactic sugar?
> > > >>>>>>>>>
> > > >>>>>>>>> What's more interesting in the proposal is: do we want to stop
> > > >>>>>>>>> at creating the materialized view? Or do we want to extend that
> > > >>>>>>>>> in the future to a more useful unified data store distributed
> > > >>>>>>>>> with Flink? And do we want to have a mechanism that allows more
> > > >>>>>>>>> flexible user job patterns with their own user-defined services?
> > > >>>>>>>>> These considerations are much more architectural.
> > > >>>>>>>>>
> > > >>>>>>>>> Thanks,
> > > >>>>>>>>>
> > > >>>>>>>>> Jiangjie (Becket) Qin
> > > >>>>>>>>>
> > > >>>>>>>>> On Fri, Nov 23, 2018 at 6:01 PM Piotr Nowojski <
> > > >>>>>>> piotr@data-artisans.com>
> > > >>>>>>>>> wrote:
> > > >>>>>>>>>
> > > >>>>>>>>>> Hi,
> > > >>>>>>>>>>
> > > >>>>>>>>>> Interesting idea. I’m trying to understand the problem. Isn’t
> > > >>>>>>>>>> the `cache()` call equivalent to writing data to a sink and
> > > >>>>>>>>>> later reading from it, where this sink has a limited live
> > > >>>>>>>>>> scope/lifetime? And the sink could be implemented as an
> > > >>>>>>>>>> in-memory or a file sink?
> > > >>>>>>>>>>
> > > >>>>>>>>>> If so, what’s the problem with creating a materialised view
> > > >>>>>>>>>> from a table “b” (from your document’s example) and reusing
> > > >>>>>>>>>> this materialised view later? Maybe we are lacking mechanisms
> > > >>>>>>>>>> to clean up materialised views (for example when the current
> > > >>>>>>>>>> session finishes)? Maybe we need some syntactic sugar on top
> > > >>>>>>>>>> of it?
> > > >>>>>>>>>>
> > > >>>>>>>>>> Piotrek
> > > >>>>>>>>>>
> > > >>>>>>>>>>> On 23 Nov 2018, at 07:21, Becket Qin <becket.qin@gmail.com
> >
> > > >>>>> wrote:
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> Thanks for the suggestion, Jincheng.
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> Yes, I think it makes sense to have a persist() with
> > > >>>>>>> lifecycle/defined
> > > >>>>>>>>>>> scope. I just added a section in the future work for this.
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> Thanks,
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> Jiangjie (Becket) Qin
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> On Fri, Nov 23, 2018 at 1:55 PM jincheng sun <
> > > >>>>>>> sunjincheng121@gmail.com
> > > >>>>>>>>>
> > > >>>>>>>>>>> wrote:
> > > >>>>>>>>>>>
> > > >>>>>>>>>>>> Hi Jiangjie,
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> Thank you for the explanation about the name of
> `cache()`, I
> > > >>>>>>>> understand
> > > >>>>>>>>>> why
> > > >>>>>>>>>>>> you designed it this way!
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> Another idea is whether we can specify a lifecycle for data
> > > >>>>>>>>>>>> persistence. For example, persist(LifeCycle.SESSION), so that
> > > >>>>>>>>>>>> the user is not worried about data loss and the time range
> > > >>>>>>>>>>>> for keeping the data is clearly specified.
> > > >>>>>>>>>>>> At the same time, if we want to expand, we could also share
> > > >>>>>>>>>>>> within a certain group of sessions, for example
> > > >>>>>>>>>>>> LifeCycle.SESSION_GROUP(...). I am not sure; just an immature
> > > >>>>>>>>>>>> suggestion, for reference only!
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> Bests,
> > > >>>>>>>>>>>> Jincheng
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> On Fri, Nov 23, 2018 at 1:33 PM, Becket Qin
> > > >>>>>>>>>>>> <be...@gmail.com> wrote:
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>>> Re: Jincheng,
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>> Thanks for the feedback. Regarding cache() vs. persist(),
> > > >>>>>>>>>>>>> personally I find cache() more accurately describes the
> > > >>>>>>>>>>>>> behavior, i.e. the Table is cached for the session, but will
> > > >>>>>>>>>>>>> be deleted after the session is closed.
> > > >>>>>>>>>>>>> persist() seems a little misleading as people might think
> > the
> > > >>>>>> table
> > > >>>>>>>>>> will
> > > >>>>>>>>>>>>> still be there even after the session is gone.
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>> Great point about mixing the batch and stream processing
> in
> > > the
> > > >>>>>>> same
> > > >>>>>>>>>> job.
> > > >>>>>>>>>>>>> We should absolutely move towards that goal. I imagine
> that
> > > >>>>> would
> > > >>>>>>> be
> > > >>>>>>>> a
> > > >>>>>>>>>>>> huge
> > > >>>>>>>>>>>>> change across the board, including sources, operators and
> > > >>>>>>>>>> optimizations,
> > > >>>>>>>>>>>> to
> > > >>>>>>>>>>>>> name a few. Likely we will need several separate in-depth
> > > >>>>>>> discussions.
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>> Thanks,
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>> Jiangjie (Becket) Qin
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>> On Fri, Nov 23, 2018 at 5:14 AM Xingcan Cui <
> > > >>>>> xingcanc@gmail.com>
> > > >>>>>>>>>> wrote:
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> Hi all,
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> @Shaoxuan, I think the lifecycle or access domain are
> both
> > > >>>>>>>> orthogonal
> > > >>>>>>>>>>>> to
> > > >>>>>>>>>>>>>> the cache problem. Essentially, this may be the first
> time
> > > we
> > > >>>>>> plan
> > > >>>>>>>> to
> > > >>>>>>>>>>>>>> introduce another storage mechanism other than the
> state.
> > > >>>>> Maybe
> > > >>>>>>> it’s
> > > >>>>>>>>>>>>> better
> > > >>>>>>>>>>>>>> to first draw a big picture and then concentrate on a
> > > specific
> > > >>>>>>> part?
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> @Becket, yes, actually I am more concerned with the
> > > underlying
> > > >>>>>>>>>> service.
> > > >>>>>>>>>>>>>> This seems to be quite a major change to the existing
> > > >>>>> codebase.
> > > >>>>>> As
> > > >>>>>>>> you
> > > >>>>>>>>>>>>>> claimed, the service should be extendible to support
> other
> > > >>>>>>>> components
> > > >>>>>>>>>>>> and
> > > >>>>>>>>>>>>>> we’d better discuss it in another thread.
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> All in all, I am also eager to enjoy a more interactive
> > > >>>>>>>>>>>>>> Table API, given a general and flexible enough service
> > > >>>>>>>>>>>>>> mechanism.
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> Best,
> > > >>>>>>>>>>>>>> Xingcan
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>> On Nov 22, 2018, at 10:16 AM, Xiaowei Jiang <
> > > >>>>>> xiaoweij@gmail.com>
> > > >>>>>>>>>>>>> wrote:
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>> Relying on a callback for the temp table for cleanup is
> > > >>>>>>>>>>>>>>> not very reliable.
> > > >>>>>>>>>>>>>>> There is no guarantee that it will be executed
> > > successfully.
> > > >>>>> We
> > > >>>>>>> may
> > > >>>>>>>>>>>>> risk
> > > >>>>>>>>>>>>>>> leaks when that happens. I think that it's safer to
> have
> > an
> > > >>>>>>>>>>>> association
> > > >>>>>>>>>>>>>>> between temp table and session id. So we can always
> clean
> > > up
> > > >>>>>> temp
> > > >>>>>>>>>>>>> tables
> > > >>>>>>>>>>>>>>> which are no longer associated with any active
> sessions.
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>> Regards,
> > > >>>>>>>>>>>>>>> Xiaowei
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>> On Thu, Nov 22, 2018 at 12:55 PM jincheng sun <
> > > >>>>>>>>>>>>> sunjincheng121@gmail.com>
> > > >>>>>>>>>>>>>>> wrote:
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>> Hi Jiangjie&Shaoxuan,
> > > >>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>> Thanks for initiating this great proposal!
> > > >>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>> Interactive Programming is very useful and user-friendly
> > > >>>>>>>>>>>>>>>> in the case of your examples.
> > > >>>>>>>>>>>>>>>> Moreover, especially when a business process has to be
> > > >>>>>>>>>>>>>>>> executed in several stages with dependencies, such as the
> > > >>>>>>>>>>>>>>>> pipeline of Flink ML, in order to utilize the intermediate
> > > >>>>>>>>>>>>>>>> calculation results we have to submit a job by
> > > >>>>>>>>>>>>>>>> env.execute().
> > > >>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>> About `cache()`, I think it is better to name it
> > > >>>>>>>>>>>>>>>> `persist()`, and let the Flink framework determine whether
> > > >>>>>>>>>>>>>>>> we internally cache in memory or persist to the storage
> > > >>>>>>>>>>>>>>>> system; maybe save the data into a state backend
> > > >>>>>>>>>>>>>>>> (MemoryStateBackend or RocksDBStateBackend, etc.).
> > > >>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>> BTW, from my point of view, in the future, support for
> > > >>>>>>>>>>>>>>>> switching between streaming and batch mode in the same job
> > > >>>>>>>>>>>>>>>> will also benefit "Interactive Programming". I am looking
> > > >>>>>>>>>>>>>>>> forward to your JIRAs and FLIP!
> > > >>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>> Best,
> > > >>>>>>>>>>>>>>>> Jincheng
> > > >>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>> On Tue, Nov 20, 2018 at 9:56 PM, Becket Qin
> > > >>>>>>>>>>>>>>>> <be...@gmail.com> wrote:
> > > >>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>> Hi all,
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>> As a few recent email threads have pointed out, it
> is a
> > > >>>>>>> promising
> > > >>>>>>>>>>>>>>>>> opportunity to enhance Flink Table API in various
> > > aspects,
> > > >>>>>>>>>>>> including
> > > >>>>>>>>>>>>>>>>> functionality and ease of use among others. One of
> the
> > > >>>>>>> scenarios
> > > >>>>>>>>>>>>> where
> > > >>>>>>>>>>>>>> we
> > > >>>>>>>>>>>>>>>>> feel Flink could improve is interactive programming.
> To
> > > >>>>>> explain
> > > >>>>>>>> the
> > > >>>>>>>>>>>>>>>> issues
> > > >>>>>>>>>>>>>>>>> and facilitate the discussion on the solution, we put
> > > >>>>>> together
> > > >>>>>>>> the
> > > >>>>>>>>>>>>>>>>> following document with our proposal.
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>
> > > >>>>>>>>
> > > >>>>>>>
> > > >>>>>>
> > > >>>>>
> > > >>>
> > >
> >
> https://docs.google.com/document/d/1d4T2zTyfe7hdncEUAxrlNOYr4e5IMNEZLyqSuuswkA0/edit?usp=sharing
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>> Feedback and comments are very welcome!
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>> Thanks,
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>> Jiangjie (Becket) Qin
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>
> > > >>>>>>>>>>
> > > >>>>>>>>
> > > >>>>>>>>
> > > >>>>>>>
> > > >>>>>>
> > > >>>>>
> > > >>>
> > > >>>
> > >
> > >
> >
>

Re: [DISCUSS] Support Interactive Programming in Flink Table API

Posted by Shaoxuan Wang <ws...@gmail.com>.
Hi Fabian and Piotr,

Thanks for the feedback. I think I now understand you a little better.

1. "Materialize" and "cache" are two different scenarios IMO. "Materialize"
is a complex feature that allows the user to really create a materialized
view/table, and the materialized table will be updated in a timely manner
either when the source table changes or when a timer is triggered. I can
imagine this feature will need lots of components to be added to Flink,
like a Flink store, a meta system, a job scheduler, etc. This is definitely
something that we want to have, but it has not been planned yet. "Cache"
addresses the performance issue when consecutive jobs need to be executed
and the latter one wants to reuse the result of the previous one as an
input source.

2. In the case of "cache", I did not consider that the method (let us
first assume there is such a method) could modify the input table. To make
sure I understand you correctly, is this what you mean by "refresh"?
Table t1 = ???
Table t2 = t1.cache()
Table t3 = methodThatAppliesOperators(t1) // t1 is modified to -> t1'
// assume t1 can be modified
Table t4 = methodThatAppliesOperators(t2) // t1 is used
t2.refresh() // load t1'
Table t5 = methodThatAppliesOperators(t2) // t1' is used
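
If cache() did return a handle, I suppose it could look roughly like the
following (a sketch only; the method names are assumptions, not a concrete
proposal):

// Sketch only: a hypothetical handle type; names are assumptions.
public interface CachedTable /* would extend or wrap Table */ {
    // Re-run the originating query and replace the cached data
    // (the "refresh" in the example above).
    void refresh();

    // Drop the cached data; later reads fall back to the original plan.
    void release();
}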

I can see the value of having a new return type for cache() in this case.
(Maybe I missed something.) But do we have such methods, or expect to have
any, that can modify the input table? If not, I do not see the need to add
a new return type for cache().

3. I agree we should keep the logical plan of t1 and let the optimizer
decide whether the optimal plan is to scan the cached data or not. This is
useful for both the materialize and cache cases. When we started to think
about this cache proposal, I was even thinking of letting the optimizer
smartly add a cache as needed. But this needs lots of changes in the
optimization framework itself (cross-job optimization), and it does not
help when the user executes table queries interactively (because the
optimizer cannot predict future queries).
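
To make that trade-off concrete, a cost-based choice could look roughly
like the sketch below (the types and planner hooks here are hypothetical,
not existing Flink APIs):

// Sketch only: hypothetical planner types, not Flink's actual optimizer.
interface LogicalPlan {}

interface CostModel {
    double estimate(LogicalPlan plan);
}

class CacheAwarePlanner {
    // Keep whichever plan the cost model estimates to be cheaper, so
    // that, e.g., a filter pushed into the original plan can beat a full
    // scan of the cached data.
    LogicalPlan choose(LogicalPlan original, LogicalPlan cacheScan,
                       CostModel cost) {
        return cost.estimate(original) <= cost.estimate(cacheScan)
                ? original
                : cacheScan;
    }
}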


Regards,
Shaoxuan


On Thu, Nov 29, 2018 at 9:16 PM Fabian Hueske <fh...@gmail.com> wrote:

> Hi,
>
> Thanks for the clarification Becket!
>
> I have a few thoughts to share / questions:
>
> 1) I'd like to know how you plan to implement the feature on a plan /
> planner level.
>
> I would imaging the following to happen when Table.cache() is called:
>
> 1) immediately optimize the Table and internally convert it into a
> DataSet/DataStream. This is necessary, to avoid that operators of later
> queries on top of the Table are pushed down.
> 2) register the DataSet/DataStream as a DataSet/DataStream-backed Table X
> 3) add a sink to the DataSet/DataStream. This is the materialization of the
> Table X
>
> Based on your proposal the following would happen:
>
> Table t1 = ....
> t1.cache(); // cache() returns void. The logical plan of t1 is replaced by
> a scan of X. There is also a reference to the materialization of X.
>
> t1.count(); // this executes the program, including the DataSet/DataStream
> that backs X and the sink that writes the materialization of X
> t1.count(); // this executes the program, but reads X from the
> materialization.
>
> My question is, how do you determine when whether the scan of t1 should go
> against the DataSet/DataStream program and when against the
> materialization?
> AFAIK, there is no hook that will tell you that a part of the program was
> executed. Flipping a switch during optimization or plan generation is not
> sufficient as there is no guarantee that the plan is also executed.
>
> Overall, this behavior is somewhat similar to what I proposed in
> FLINK-8950, which does not include persisting the table, but just
> optimizing and reregistering it as DataSet/DataStream scan.
>
> 2) I think Piotr has a point about the implicit behavior and side effects
> of the cache() method if it does not return anything.
> Consider the following example:
>
> Table t1 = ???
> Table t2 = methodThatAppliesOperators(t1);
> Table t3 = methodThatAppliesOtherOperators(t1);
>
> In this case, the behavior/performance of the plan that results from the
> second method call depends on whether t1 was modified by the first method
> or not.
> This is the classic issue of mutable vs. immutable objects.
> Also, as Piotr pointed out, it might also be good to have the original plan
> of t1, because in some cases it is possible to push filters down such that
> evaluating the query from scratch might be more efficient than accessing
> the cache.
> Moreover, a CachedTable could extend Table() and offer a method refresh().
> This sounds quite useful in an interactive session mode.
>
> 3) Regarding the name, I can see both arguments. IMO, materialize() seems
> to be more future proof.
>
> Best, Fabian
>
> Am Do., 29. Nov. 2018 um 12:56 Uhr schrieb Shaoxuan Wang <
> wshaoxuan@gmail.com>:
>
> > Hi Piotr,
> >
> > Thanks for sharing your ideas on the method naming. We will think about
> > your suggestions. But I don't understand why we need to change the return
> > type of cache().
> >
> > Cache() is a physical operation, it does not change the logic of
> > the `Table`. On the tableAPI layer, we should not introduce a new table
> > type unless the logic of table has been changed. If we introduce a new
> > table type `CachedTable`, we need create the same set of methods of
> `Table`
> > for it. I don't think it is worth doing this. Or can you please elaborate
> > more on what could be the "implicit behaviours/side effects" you are
> > thinking about?
> >
> > Regards,
> > Shaoxuan
> >
> >
> >
> > On Thu, Nov 29, 2018 at 7:05 PM Piotr Nowojski <pi...@data-artisans.com>
> > wrote:
> >
> > > Hi Becket,
> > >
> > > Thanks for the response.
> > >
> > > 1. I wasn’t saying that materialised view must be mutable or not. The
> > same
> > > thing applies to caches as well. To the contrary, I would expect more
> > > consistency and updates from something that is called “cache” vs
> > something
> > > that’s a “materialised view”. In other words, IMO most caches do not
> > serve
> > > you invalid/outdated data and they handle updates on their own.
> > >
> > > 2. I don’t think that having in the future two very similar concepts of
> > > `materialized` view and `cache` is a good idea. It would be confusing
> for
> > > the users. I think it could be handled by variations/overloading of
> > > materialised view concept. We could start with:
> > >
> > > `MaterializedTable materialize()` - immutable, session life scope
> > > (basically the same semantic as you are proposing
> > >
> > > And then in the future (if ever) build on top of that/expand it with:
> > >
> > > `MaterializedTable materialize(refreshTime=…)` or `MaterializedTable
> > > materialize(refreshHook=…)`
> > >
> > > Or with cross session support:
> > >
> > > `MaterializedTable materializeInto(connector=…)` or `MaterializedTable
> > > materializeInto(tableFactory=…)`
> > >
> > > I’m not saying that we should implement cross session/refreshing now or
> > > even in the near future. I’m just arguing that naming current immutable
> > > session life scope method `materialize()` is more future proof and more
> > > consistent with SQL (on which after all table-api is heavily basing
> on).
> > >
> > > 3. Even if we agree on naming it `cache()`, I would still insist on
> > > `cache()` returning `CachedTable` handle to avoid implicit
> > behaviours/side
> > > effects and to give both us & users more flexibility.
> > >
> > > Piotrek
> > >
> > > > On 29 Nov 2018, at 06:20, Becket Qin <be...@gmail.com> wrote:
> > > >
> > > > Just to add a little bit, the materialized view is probably more
> > similar
> > > to
> > > > the persistent() brought up earlier in the thread. So it is usually
> > cross
> > > > session and could be used in a larger scope. For example, a
> > materialized
> > > > view created by user A may be visible to user B. It is probably
> > something
> > > > we want to have in the future. I'll put it in the future work
> section.
> > > >
> > > > Thanks,
> > > >
> > > > Jiangjie (Becket) Qin
> > > >
> > > > On Thu, Nov 29, 2018 at 9:47 AM Becket Qin <be...@gmail.com>
> > wrote:
> > > >
> > > >> Hi Piotrek,
> > > >>
> > > >> Thanks for the explanation.
> > > >>
> > > >> Right now we are mostly thinking of the cached table as immutable. I
> > can
> > > >> see the Materialized view would be useful in the future. That said,
> I
> > > think
> > > >> a simple cache mechanism is probably still needed. So to me, cache()
> > and
> > > >> materialize() should be two separate method as they address
> different
> > > >> needs. Materialize() is a higher level concept usually implying
> > > periodical
> > > >> update, while cache() has much simpler semantic. For example, one
> may
> > > >> create a materialized view and use cache() method in the
> materialized
> > > view
> > > >> creation logic. So that during the materialized view update, they do
> > not
> > > >> need to worry about the case that the cached table is also changed.
> > > Maybe
> > > >> under the hood, materialized() and cache() could share some
> mechanism,
> > > but
> > > >> I think a simple cache() method would be handy in a lot of cases.
> > > >>
> > > >> Thanks,
> > > >>
> > > >> Jiangjie (Becket) Qin
> > > >>
> > > >> On Mon, Nov 26, 2018 at 9:38 PM Piotr Nowojski <
> > piotr@data-artisans.com
> > > >
> > > >> wrote:
> > > >>
> > > >>> Hi Becket,
> > > >>>
> > > >>>> Is there any extra thing user can do on a MaterializedTable that
> > they
> > > >>> cannot do on a Table?
> > > >>>
> > > >>> Maybe not in the initial implementation, but various DBs offer
> > > different
> > > >>> ways to “refresh” the materialised view. Hooks, triggers, timers,
> > > manually
> > > >>> etc. Having `MaterializedTable` would help us to handle that in the
> > > future.
> > > >>>
> > > >>>> After users call *table.cache(), *users can just use that table
> and
> > do
> > > >>> anything that is supported on a Table, including SQL.
> > > >>>
> > > >>> This is some implicit behaviour with side effects. Imagine if user
> > has
> > > a
> > > >>> long and complicated program, that touches table `b` multiple
> times,
> > > maybe
> > > >>> scattered around different methods. If he modifies his program by
> > > inserting
> > > >>> in one place
> > > >>>
> > > >>> b.cache()
> > > >>>
> > > >>> This implicitly alters the semantic and behaviour of his code all
> > over
> > > >>> the place, maybe in a ways that might cause problems. For example
> > what
> > > if
> > > >>> underlying data is changing?
> > > >>>
> > > >>> Having invisible side effects is also not very clean, for example
> > think
> > > >>> about something like this (but more complicated):
> > > >>>
> > > >>> Table b = ...;
> > > >>>
> > > >>> If (some_condition) {
> > > >>>  processTable1(b)
> > > >>> }
> > > >>> else {
> > > >>>  processTable2(b)
> > > >>> }
> > > >>>
> > > >>> // do more stuff with b
> > > >>>
> > > >>> And user adds `b.cache()` call to only one of the `processTable1`
> or
> > > >>> `processTable2` methods.
> > > >>>
> > > >>> On the other hand
> > > >>>
> > > >>> Table materialisedB = b.materialize()
> > > >>>
> > > >>> Avoids (at least some of) the side effect issues and forces user to
> > > >>> explicitly use `materialisedB` where it’s appropriate and forces
> user
> > > to
> > > >>> think what does it actually mean. And if something doesn’t work in
> > the
> > > end
> > > >>> for the user, he will know what has he changed instead of blaming
> > > Flink for
> > > >>> some “magic” underneath. In the above example, after materialising
> b
> > in
> > > >>> only one of the methods, he should/would realise about the issue
> when
> > > >>> handling the return value `MaterializedTable` of that method.
> > > >>>
> > > >>> I guess it comes down to personal preferences if you like things to
> > be
> > > >>> implicit or not. The more power is the user, probably the more
> likely
> > > he is
> > > >>> to like/understand implicit behaviour. And we as Table API
> designers
> > > are
> > > >>> the most power users out there, so I would proceed with caution (so
> > > that we
> > > >>> do not end up in the crazy perl realm with it’s lovely implicit
> > method
> > > >>> arguments ;)  <https://stackoverflow.com/a/14922656/8149051>)
> > > >>>
> > > >>>> Table API to also support non-relational processing cases, cache()
> > > >>> might be slightly better.
> > > >>>
> > > >>> I think even such extended Table API could benefit from sticking
> > > to/being
> > > >>> consistent with SQL where both SQL and Table API are basically the
> > > same.
> > > >>>
> > > >>> One more thing. `MaterializedTable materialize()` could be more
> > > >>> powerful/flexible allowing the user to operate both on materialised
> > > and not
> > > >>> materialised view at the same time for whatever reasons (underlying
> > > data
> > > >>> changing/better optimisation opportunities after pushing down more
> > > filters
> > > >>> etc). For example:
> > > >>>
> > > >>> Table b = …;
> > > >>>
> > > >>> MaterlizedTable mb = b.materialize();
> > > >>>
> > > >>> Val min = mb.min();
> > > >>> Val max = mb.max();
> > > >>>
> > > >>> Val user42 = b.filter(‘userId = 42);
> > > >>>
> > > >>> Could be more efficient compared to `b.cache()` if `filter(‘userId
> =
> > > >>> 42);` allows for much more aggressive optimisations.
> > > >>>
> > > >>> Piotrek
> > > >>>
> > > >>>> On 26 Nov 2018, at 12:14, Fabian Hueske <fh...@gmail.com>
> wrote:
> > > >>>>
> > > >>>> I'm not suggesting to add support for Ignite. This was just an
> > > example.
> > > >>>> Plasma and Arrow sound interesting, too.
> > > >>>> For the sake of this proposal, it would be up to the user to
> > > implement a
> > > >>>> TableFactory and corresponding TableSource / TableSink classes to
> > > >>> persist
> > > >>>> and read the data.
> > > >>>>
> > > >>>> Am Mo., 26. Nov. 2018 um 12:06 Uhr schrieb Flavio Pompermaier <
> > > >>>> pompermaier@okkam.it>:
> > > >>>>
> > > >>>>> What about to add also Apache Plasma + Arrow as an alternative to
> > > >>> Apache
> > > >>>>> Ignite?
> > > >>>>> [1]
> > > >>>>>
> > > >>>
> > >
> https://arrow.apache.org/blog/2017/08/08/plasma-in-memory-object-store/
> > > >>>>>
> > > >>>>> On Mon, Nov 26, 2018 at 11:56 AM Fabian Hueske <
> fhueske@gmail.com>
> > > >>> wrote:
> > > >>>>>
> > > >>>>>> Hi,
> > > >>>>>>
> > > >>>>>> Thanks for the proposal!
> > > >>>>>>
> > > >>>>>> To summarize, you propose a new method Table.cache(): Table that
> > > will
> > > >>>>>> trigger a job and write the result into some temporary storage
> as
> > > >>> defined
> > > >>>>>> by a TableFactory.
> > > >>>>>> The cache() call blocks while the job is running and eventually
> > > >>> returns a
> > > >>>>>> Table object that represents a scan of the temporary table.
> > > >>>>>> When the "session" is closed (closing to be defined?), the
> > temporary
> > > >>>>> tables
> > > >>>>>> are all dropped.
> > > >>>>>>
> > > >>>>>> I think this behavior makes sense and is a good first step
> towards
> > > >>> more
> > > >>>>>> interactive workloads.
> > > >>>>>> However, its performance suffers from writing to and reading
> from
> > > >>>>> external
> > > >>>>>> systems.
> > > >>>>>> I think this is OK for now. Changes that would significantly
> > improve
> > > >>> the
> > > >>>>>> situation (i.e., pinning data in-memory across jobs) would have
> > > large
> > > >>>>>> impacts on many components of Flink.
> > > >>>>>> Users could use in-memory filesystems or storage grids (Apache
> > > >>> Ignite) to
> > > >>>>>> mitigate some of the performance effects.
> > > >>>>>>
> > > >>>>>> Best, Fabian
> > > >>>>>>
> > > >>>>>>
> > > >>>>>>
> > > >>>>>> Am Mo., 26. Nov. 2018 um 03:38 Uhr schrieb Becket Qin <
> > > >>>>>> becket.qin@gmail.com
> > > >>>>>>> :
> > > >>>>>>
> > > >>>>>>> Thanks for the explanation, Piotrek.
> > > >>>>>>>
> > > >>>>>>> Is there any extra thing user can do on a MaterializedTable
> that
> > > they
> > > >>>>>>> cannot do on a Table? After users call *table.cache(), *users
> can
> > > >>> just
> > > >>>>>> use
> > > >>>>>>> that table and do anything that is supported on a Table,
> > including
> > > >>> SQL.
> > > >>>>>>>
> > > >>>>>>> Naming wise, either cache() or materialize() sounds fine to me.
> > > >>> cache()
> > > >>>>>> is
> > > >>>>>>> a bit more general than materialize(). Given that we are
> > enhancing
> > > >>> the
> > > >>>>>>> Table API to also support non-relational processing cases,
> > cache()
> > > >>>>> might
> > > >>>>>> be
> > > >>>>>>> slightly better.
> > > >>>>>>>
> > > >>>>>>> Thanks,
> > > >>>>>>>
> > > >>>>>>> Jiangjie (Becket) Qin
> > > >>>>>>>
> > > >>>>>>>
> > > >>>>>>>
> > > >>>>>>> On Fri, Nov 23, 2018 at 11:25 PM Piotr Nowojski <
> > > >>>>> piotr@data-artisans.com
> > > >>>>>>>
> > > >>>>>>> wrote:
> > > >>>>>>>
> > > >>>>>>>> Hi Becket,
> > > >>>>>>>>
> > > >>>>>>>> Ops, sorry I didn’t notice that you intend to reuse existing
> > > >>>>>>>> `TableFactory`. I don’t know why, but I assumed that you want
> to
> > > >>>>>> provide
> > > >>>>>>> an
> > > >>>>>>>> alternate way of writing the data.
> > > >>>>>>>>
> > > >>>>>>>> Now that I hopefully understand the proposal, maybe we could
> > > rename
> > > >>>>>>>> `cache()` to
> > > >>>>>>>>
> > > >>>>>>>> void materialize()
> > > >>>>>>>>
> > > >>>>>>>> or going step further
> > > >>>>>>>>
> > > >>>>>>>> MaterializedTable materialize()
> > > >>>>>>>> MaterializedTable createMaterializedView()
> > > >>>>>>>>
> > > >>>>>>>> ?
> > > >>>>>>>>
> > > >>>>>>>> The second option with returning a handle I think is more
> > flexible
> > > >>>>> and
> > > >>>>>>>> could provide features such as “refresh”/“delete” or generally
> > > >>>>> speaking
> > > >>>>>>>> manage the the view. In the future we could also think about
> > > adding
> > > >>>>>> hooks
> > > >>>>>>>> to automatically refresh view etc. It is also more explicit -
> > > >>>>>>>> materialization returning a new table handle will not have the
> > > same
> > > >>>>>>>> implicit side effects as adding a simple line of code like
> > > >>>>> `b.cache()`
> > > >>>>>>>> would have.
> > > >>>>>>>>
> > > >>>>>>>> It would also be more SQL like, making it more intuitive for
> > users
> > > >>>>>>> already
> > > >>>>>>>> familiar with the SQL.
> > > >>>>>>>>
> > > >>>>>>>> Piotrek
> > > >>>>>>>>
> > > >>>>>>>>> On 23 Nov 2018, at 14:53, Becket Qin <be...@gmail.com>
> > > wrote:
> > > >>>>>>>>>
> > > >>>>>>>>> Hi Piotrek,
> > > >>>>>>>>>
> > > >>>>>>>>> For the cache() method itself, yes, it is equivalent to
> > creating
> > > a
> > > >>>>>>>> BUILT-IN
> > > >>>>>>>>> materialized view with a lifecycle. That functionality is
> > missing
> > > >>>>>>> today,
> > > >>>>>>>>> though. Not sure if I understand your question. Do you mean
> we
> > > >>>>>> already
> > > >>>>>>>> have
> > > >>>>>>>>> the functionality and just need a syntax sugar?
> > > >>>>>>>>>
> > > >>>>>>>>> What's more interesting in the proposal is do we want to stop
> > at
> > > >>>>>>> creating
> > > >>>>>>>>> the materialized view? Or do we want to extend that in the
> > future
> > > >>>>> to
> > > >>>>>> a
> > > >>>>>>>> more
> > > >>>>>>>>> useful unified data store distributed with Flink? And do we
> > want
> > > to
> > > >>>>>>> have
> > > >>>>>>>> a
> > > >>>>>>>>> mechanism allow more flexible user job pattern with their own
> > > user
> > > >>>>>>>> defined
> > > >>>>>>>>> services. These considerations are much more architectural.
> > > >>>>>>>>>
> > > >>>>>>>>> Thanks,
> > > >>>>>>>>>
> > > >>>>>>>>> Jiangjie (Becket) Qin
> > > >>>>>>>>>
> > > >>>>>>>>> On Fri, Nov 23, 2018 at 6:01 PM Piotr Nowojski <
> > > >>>>>>> piotr@data-artisans.com>
> > > >>>>>>>>> wrote:
> > > >>>>>>>>>
> > > >>>>>>>>>> Hi,
> > > >>>>>>>>>>
> > > >>>>>>>>>> Interesting idea. I’m trying to understand the problem.
> Isn’t
> > > the
> > > >>>>>>>>>> `cache()` call an equivalent of writing data to a sink and
> > later
> > > >>>>>>> reading
> > > >>>>>>>>>> from it? Where this sink has a limited live scope/live time?
> > And
> > > >>>>> the
> > > >>>>>>>> sink
> > > >>>>>>>>>> could be implemented as in memory or a file sink?
> > > >>>>>>>>>>
> > > >>>>>>>>>> If so, what’s the problem with creating a materialised view
> > > from a
> > > >>>>>>> table
> > > >>>>>>>>>> “b” (from your document’s example) and reusing this
> > materialised
> > > >>>>>> view
> > > >>>>>>>>>> later? Maybe we are lacking mechanisms to clean up
> > materialised
> > > >>>>>> views
> > > >>>>>>>> (for
> > > >>>>>>>>>> example when current session finishes)? Maybe we need some
> > > >>>>> syntactic
> > > >>>>>>>> sugar
> > > >>>>>>>>>> on top of it?
> > > >>>>>>>>>>
> > > >>>>>>>>>> Piotrek
> > > >>>>>>>>>>
> > > >>>>>>>>>>> On 23 Nov 2018, at 07:21, Becket Qin <becket.qin@gmail.com
> >
> > > >>>>> wrote:
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> Thanks for the suggestion, Jincheng.
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> Yes, I think it makes sense to have a persist() with
> > > >>>>>>> lifecycle/defined
> > > >>>>>>>>>>> scope. I just added a section in the future work for this.
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> Thanks,
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> Jiangjie (Becket) Qin
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> On Fri, Nov 23, 2018 at 1:55 PM jincheng sun <
> > > >>>>>>> sunjincheng121@gmail.com
> > > >>>>>>>>>
> > > >>>>>>>>>>> wrote:
> > > >>>>>>>>>>>
> > > >>>>>>>>>>>> Hi Jiangjie,
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> Thank you for the explanation about the name of
> `cache()`, I
> > > >>>>>>>> understand
> > > >>>>>>>>>> why
> > > >>>>>>>>>>>> you designed this way!
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> Another idea is whether we can specify a lifecycle for
> data
> > > >>>>>>>> persistence?
> > > >>>>>>>>>>>> For example, persist (LifeCycle.SESSION), so that the user
> > is
> > > >>>>> not
> > > >>>>>>>>>> worried
> > > >>>>>>>>>>>> about data loss, and will clearly specify the time range
> for
> > > >>>>>> keeping
> > > >>>>>>>>>> time.
> > > >>>>>>>>>>>> At the same time, if we want to expand, we can also share
> > in a
> > > >>>>>>> certain
> > > >>>>>>>>>>>> group of session, for example:
> > LifeCycle.SESSION_GROUP(...), I
> > > >>>>> am
> > > >>>>>>> not
> > > >>>>>>>>>> sure,
> > > >>>>>>>>>>>> just an immature suggestion, for reference only!
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> Bests,
> > > >>>>>>>>>>>> Jincheng
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> Becket Qin <be...@gmail.com> 于2018年11月23日周五
> 下午1:33写道:
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>>> Re: Jincheng,
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>> Thanks for the feedback. Regarding cache() v.s.
> persist(),
> > > >>>>>>>> personally I
> > > >>>>>>>>>>>>> find cache() to be more accurately describing the
> behavior,
> > > >>>>> i.e.
> > > >>>>>>> the
> > > >>>>>>>>>>>> Table
> > > >>>>>>>>>>>>> is cached for the session, but will be deleted after the
> > > >>>>> session
> > > >>>>>> is
> > > >>>>>>>>>>>> closed.
> > > >>>>>>>>>>>>> persist() seems a little misleading as people might think
> > the
> > > >>>>>> table
> > > >>>>>>>>>> will
> > > >>>>>>>>>>>>> still be there even after the session is gone.
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>> Great point about mixing the batch and stream processing
> in
> > > the
> > > >>>>>>> same
> > > >>>>>>>>>> job.
> > > >>>>>>>>>>>>> We should absolutely move towards that goal. I imagine
> that
> > > >>>>> would
> > > >>>>>>> be
> > > >>>>>>>> a
> > > >>>>>>>>>>>> huge
> > > >>>>>>>>>>>>> change across the board, including sources, operators and
> > > >>>>>>>>>> optimizations,
> > > >>>>>>>>>>>> to
> > > >>>>>>>>>>>>> name some. Likely we will need several separate in-depth
> > > >>>>>>> discussions.
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>> Thanks,
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>> Jiangjie (Becket) Qin
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>> On Fri, Nov 23, 2018 at 5:14 AM Xingcan Cui <
> > > >>>>> xingcanc@gmail.com>
> > > >>>>>>>>>> wrote:
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> Hi all,
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> @Shaoxuan, I think the lifecycle or access domain are
> both
> > > >>>>>>>> orthogonal
> > > >>>>>>>>>>>> to
> > > >>>>>>>>>>>>>> the cache problem. Essentially, this may be the first
> time
> > > we
> > > >>>>>> plan
> > > >>>>>>>> to
> > > >>>>>>>>>>>>>> introduce another storage mechanism other than the
> state.
> > > >>>>> Maybe
> > > >>>>>>> it’s
> > > >>>>>>>>>>>>> better
> > > >>>>>>>>>>>>>> to first draw a big picture and then concentrate on a
> > > specific
> > > >>>>>>> part?
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> @Becket, yes, actually I am more concerned with the
> > > underlying
> > > >>>>>>>>>> service.
> > > >>>>>>>>>>>>>> This seems to be quite a major change to the existing
> > > >>>>> codebase.
> > > >>>>>> As
> > > >>>>>>>> you
> > > >>>>>>>>>>>>>> claimed, the service should be extendible to support
> other
> > > >>>>>>>> components
> > > >>>>>>>>>>>> and
> > > >>>>>>>>>>>>>> we’d better discussed it in another thread.
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> All in all, I also eager to enjoy the more interactive
> > Table
> > > >>>>>> API,
> > > >>>>>>> in
> > > >>>>>>>>>>>> case
> > > >>>>>>>>>>>>>> of a general and flexible enough service mechanism.
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> Best,
> > > >>>>>>>>>>>>>> Xingcan
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>> On Nov 22, 2018, at 10:16 AM, Xiaowei Jiang <
> > > >>>>>> xiaoweij@gmail.com>
> > > >>>>>>>>>>>>> wrote:
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>> Relying on a callback for the temp table for clean up
> is
> > > not
> > > >>>>>> very
> > > >>>>>>>>>>>>>> reliable.
> > > >>>>>>>>>>>>>>> There is no guarantee that it will be executed
> > > successfully.
> > > >>>>> We
> > > >>>>>>> may
> > > >>>>>>>>>>>>> risk
> > > >>>>>>>>>>>>>>> leaks when that happens. I think that it's safer to
> have
> > an
> > > >>>>>>>>>>>> association
> > > >>>>>>>>>>>>>>> between temp table and session id. So we can always
> clean
> > > up
> > > >>>>>> temp
> > > >>>>>>>>>>>>> tables
> > > >>>>>>>>>>>>>>> which are no longer associated with any active
> sessions.
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>> Regards,
> > > >>>>>>>>>>>>>>> Xiaowei
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>> On Thu, Nov 22, 2018 at 12:55 PM jincheng sun <
> > > >>>>>>>>>>>>> sunjincheng121@gmail.com>
> > > >>>>>>>>>>>>>>> wrote:
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>> Hi Jiangjie&Shaoxuan,
> > > >>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>> Thanks for initiating this great proposal!
> > > >>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>> Interactive Programming is very useful and user friendly,
> > > >>>>>>>>>>>>>>>> as your examples show.
> > > >>>>>>>>>>>>>>>> Moreover, when a job has to be executed in several stages
> > > >>>>>>>>>>>>>>>> with dependencies, such as a Flink ML pipeline, we
> > > >>>>>>>>>>>>>>>> currently have to submit a separate job via env.execute()
> > > >>>>>>>>>>>>>>>> in order to utilize the intermediate calculation results.
> > > >>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>> About `cache()`, I think it is better named `persist()`,
> > > >>>>>>>>>>>>>>>> and the Flink framework should determine whether we
> > > >>>>>>>>>>>>>>>> internally cache in memory or persist to a storage system,
> > > >>>>>>>>>>>>>>>> e.g. saving the data into a state backend
> > > >>>>>>>>>>>>>>>> (MemoryStateBackend, RocksDBStateBackend, etc.).
> > > >>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>> BTW, from my point of view, supporting switching between
> > > >>>>>>>>>>>>>>>> streaming and batch mode in the same job will also benefit
> > > >>>>>>>>>>>>>>>> "Interactive Programming" in the future. I am looking
> > > >>>>>>>>>>>>>>>> forward to your JIRAs and FLIP!
> > > >>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>> Best,
> > > >>>>>>>>>>>>>>>> Jincheng
> > > >>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>> Becket Qin <be...@gmail.com> wrote on Tue, Nov 20, 2018
> > > >>>>>>>>>>>>>>>> at 9:56 PM:
> > > >>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>> Hi all,
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>> As a few recent email threads have pointed out, it
> is a
> > > >>>>>>> promising
> > > >>>>>>>>>>>>>>>>> opportunity to enhance Flink Table API in various
> > > aspects,
> > > >>>>>>>>>>>> including
> > > >>>>>>>>>>>>>>>>> functionality and ease of use among others. One of
> the
> > > >>>>>>> scenarios
> > > >>>>>>>>>>>>> where
> > > >>>>>>>>>>>>>> we
> > > >>>>>>>>>>>>>>>>> feel Flink could improve is interactive programming.
> To
> > > >>>>>> explain
> > > >>>>>>>> the
> > > >>>>>>>>>>>>>>>> issues
> > > >>>>>>>>>>>>>>>>> and facilitate the discussion on the solution, we put
> > > >>>>>> together
> > > >>>>>>>> the
> > > >>>>>>>>>>>>>>>>> following document with our proposal.
> > > >>>>>>>>>>>>>>>>>
> https://docs.google.com/document/d/1d4T2zTyfe7hdncEUAxrlNOYr4e5IMNEZLyqSuuswkA0/edit?usp=sharing
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>> Feedback and comments are very welcome!
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>> Thanks,
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>> Jiangjie (Becket) Qin
> > > >>>>>>>>>>>>>>>>>

Re: [DISCUSS] Support Interactive Programming in Flink Table API

Posted by Becket Qin <be...@gmail.com>.
Hi Piotr,

1. `env.getCacheService().releaseCacheFor(cachedT);` vs
`cachedT.releaseCache();`
It doesn't matter which signature we provide. To those who write the
function, "releasing the cache" is not a "side effect"; it is exactly what
they wanted. Even if they know that they may be releasing someone else's
cache at the same time, there is nothing they can do about it.

2. re: option 3.
I don't think `.cache()` mutates the original table object at all. This
is exactly the same as `void t.writeToSink()`; we could even name it
`writeToCache()` if you think that would make it less misleading.

3. ref count or not.
I tend to agree that the "side effect" of releasing a cache is probably not
a big problem. So I think option 4 (as below) is acceptable.

Table cache() - create cache of a table, returning table with a hint.
void uncache() - drop the cache of the table if there is any.
Table.hint("ignoreCache").foo() - absolutely ignore cache even if it exists.

This will eventually converge to a consistent state once we have automatic
caching enabled, i.e. after `b = a.cache()`, `a.foo()` and `b.foo()` will
behave exactly the same, as the sketch below illustrates.
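
As a minimal usage sketch (again hypothetical; `foo`/`bar` and the source
table are placeholders):

Table a = ...                   // produced by earlier operations
Table b = a.cache();            // hint to cache `a`; `a` itself is unchanged

b.foo();                        // may be served from the cache
a.foo();                        // today: original DAG; with automatic
                                // caching enabled: identical to b.foo()
a.hint("ignoreCache").bar();    // always recomputes from the original DAG

b.uncache();                    // drop the cache if one was materialized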

Thanks,

Jiangjie (Becket) Qin


On Wed, Jan 9, 2019 at 8:31 PM Piotr Nowojski <pi...@da-platform.com> wrote:

> Hi,
>
> I know that it still can have side effects and that’s why I wrote:
>
> > Something like this might be a better (not perfect, but just a bit
> better):
>
> My point was that this:
>
> void foo(Table t) {
>  val cachedT = t.cache();
>  ...
>  env.getCacheService().releaseCacheFor(cachedT);
> }
>
> Should communicate the potential side effects to the user in a better way
> compared to:
>
> void foo(Table t) {
>  val cachedT = t.cache();
>  …
>  cachedT.releaseCache();
> }
>
> Your option 3. has the problem of API class being mutable on `.cache()`
> calls.
>
> As I wrote before, we could use reference counting on `Table` or
> `CachedTable` returned from Option 4., but:
>
> > I think that introducing ref counting could be confusing and it will be
> > error prone, since Flink-table’s users are not used to closing/releasing
> > resources.
>
> I have a feeling that the inconvenience for the users in all of the use
> cases where they do not care about releasing the cache manually (which I
> would expect to be the vast majority), would overshadow potential benefits
> of using ref counting. And it’s not like ref counting can not cause
> problems on it’s own, with users wondering “why my cache wasn’t released?"
> (Because of dangling/not closed reference).
>
> Piotrek
>
> > On 8 Jan 2019, at 14:06, Becket Qin <be...@gmail.com> wrote:
> >
> > Just to clarify, when I say foo() like below, I assume that foo() must
> have
> > a way to release its own cache, so it must have access to table env.
> >
> > void foo(Table t) {
> >  ...
> >  t.cache(); // create cache for t
> >  ...
> >  env.getCacheService().releaseCacheFor(t); // release cache for t
> > }
> >
> > Thanks,
> >
> > Jiangjie (Becket) Qin
> >
> > On Tue, Jan 8, 2019 at 9:04 PM Becket Qin <be...@gmail.com> wrote:
> >
> >> Hi Piotr,
> >>
> >> I don't think it is feasible to ask every third party library to have
> >> method signature with CacheService as an argument.
> >>
> >> And even that signature does not really solve the problem. Imagine
> >> function foo() looks like the following:
> >>
> >> void foo(Table t) {
> >>  ...
> >>  t.cache(); // create cache for t
> >>  ...
> >>  env.getCacheService().releaseCacheFor(t); // release cache for t
> >> }
> >>
> >> From function foo()'s perspective, it created a cache and released it.
> >> However, if someone invokes foo like this:
> >> {
> >>  Table src = ...
> >>  Table t = src.select(...).cache()
> >>  foo(t)
> >>  // t is uncached by foo() already.
> >> }
> >>
> >> So the "side effect" still exists.
> >>
> >> I think the only safe way to ensure there is no side effect while
> sharing
> >> the cache is to use ref count.
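> >>
> >> As a rough sketch of the ref counting idea (purely illustrative; the
> >> class and method names below are made up, not proposed API):
> >>
> >> class CacheService {
> >>   private final java.util.Map<Table, Integer> refCounts =
> >>       new java.util.HashMap<>();
> >>
> >>   void addRef(Table t) {             // called by t.cache()
> >>     refCounts.merge(t, 1, Integer::sum);
> >>   }
> >>
> >>   void release(Table t) {            // called by t.uncache()
> >>     Integer left = refCounts.merge(t, -1, Integer::sum);
> >>     if (left != null && left <= 0) {
> >>       refCounts.remove(t);
> >>       dropPhysicalCache(t);          // only now free the resources
> >>     }
> >>   }
> >> }
> >>
> >> With this, foo() releasing "its" cache merely decrements the count, so
> >> the caller's cache survives until the caller releases it as well.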
> >>
> >> BTW, the discussion we are having here is exactly the reason that I
> prefer
> >> option 3. From technical perspective option 3 solves all the concerns.
> >>
> >> Thanks,
> >>
> >> Jiangjie (Becket) Qin
> >>
> >>
> >> On Tue, Jan 8, 2019 at 8:41 PM Piotr Nowojski <pi...@da-platform.com>
> >> wrote:
> >>
> >>> Hi,
> >>>
> >>> I think that introducing ref counting could be confusing and it will be
> >>> error prone, since Flink-table’s users are not used to
> closing/releasing
> >>> resources. I was more objecting placing the
> >>> `uncache()`/`dropCache()`/`releaseCache()` (releaseCache sounds best
> to me)
> >>> as a method in the “Table”. It might be not obvious that it will drop
> the
> >>> cache for all of the usages of the given table. For example:
> >>>
> >>> public void foo(Table t) {
> >>> // …
> >>> t.releaseCache();
> >>> }
> >>>
> >>> public void bar(Table t) {
> >>>  // ...
> >>> }
> >>>
> >>> Table a = …
> >>> val cachedA = a.cache()
> >>>
> >>> foo(cachedA)
> >>> bar(cachedA)
> >>>
> >>>
> >>> My problem with above example is that `t.releaseCache()` call is not
> >>> doing the best possible job in communicating to the user that it will
> have
> >>> a side effects for other places, like `bar(cachedA)` call. Something
> like
> >>> this might be a better (not perfect, but just a bit better):
> >>>
> >>> public void foo(Table t, CacheService cacheService) {
> >>> // …
> >>> cacheService.releaseCacheFor(t);
> >>> }
> >>>
> >>> Table a = …
> >>> val cachedA = a.cache()
> >>>
> >>> foo(cachedA, env.getCacheService())
> >>> bar(cachedA)
> >>>
> >>>
> >>> Also from another perspective, maybe placing `releaseCache()` method in
> >>> Table might not be the best separation of concerns - `releaseCache()`
> >>> method seams significantly different compared to other existing
> methods.
> >>>
> >>> Piotrek
> >>>
> >>>> On 8 Jan 2019, at 12:28, Becket Qin <be...@gmail.com> wrote:
> >>>>
> >>>> Hi Piotr,
> >>>>
> >>>> You are right. There might be two intuitive meanings when users call
> >>>> 'a.uncache()', namely:
> >>>> 1. release the resource
> >>>> 2. Do not use cache for the next operation.
> >>>>
> >>>> Case (1) would likely be the dominant use case. So I would suggest we
> >>>> dedicate uncache() method to case (1), i.e. for resource release, but
> >>> not
> >>>> for ignoring cache.
> >>>>
> >>>> For case 2, i.e. explicitly ignoring cache (which is rare), users may
> >>> use
> >>>> something like 'hint("ignoreCache")'. I think this is better as it is
> a
> >>>> little weird for users to call `a.uncache()` while they may not even
> >>> know
> >>>> if the table is cached at all.
> >>>>
> >>>> Assuming we let `uncache()` to only release resource, one possibility
> is
> >>>> using ref count to mitigate the side effect. That means a ref count is
> >>>> incremented on `cache()` and decremented on `uncache()`. That means
> >>>> `uncache()` does not physically release the resource immediately, but
> >>> just
> >>>> means the cache could be released.
> >>>> That being said, I am not sure if this is really a better solution, as
> >>>> it seems a little counterintuitive. Maybe calling it releaseCache()
> >>>> would help a little bit?
> >>>>
> >>>> Thanks,
> >>>>
> >>>> Jiangjie (Becket) Qin
> >>>>
> >>>>
> >>>>
> >>>>
> >>>> On Tue, Jan 8, 2019 at 5:36 PM Piotr Nowojski <pi...@da-platform.com>
> >>> wrote:
> >>>>
> >>>>> Hi Becket,
> >>>>>
> >>>>> With `uncache` there are probably two features that we can think
> about:
> >>>>>
> >>>>> a)
> >>>>>
> >>>>> Physically dropping the cached table from the storage, freeing up the
> >>>>> resources
> >>>>>
> >>>>> b)
> >>>>>
> >>>>> Hinting the optimizer to not cache the reads for the next query/table
> >>>>>
> >>>>> a) Has the issue, as I wrote before, that it seems to be an operation
> >>>>> inherently “flawed” by having side effects.
> >>>>>
> >>>>> I’m not sure how it would be best to express. We could make it work:
> >>>>>
> >>>>> 1. via a method on a Table as you proposed:
> >>>>>
> >>>>> void Table#dropCache()
> >>>>> void Table#uncache()
> >>>>>
> >>>>> 2. Operation on the environment
> >>>>>
> >>>>> env.dropCacheFor(table) // or some other argument that allows user to
> >>>>> identify the desired cache
> >>>>>
> >>>>> 3. Extending (from your original design doc) `setTableService` method
> >>> to
> >>>>> return some control handle like:
> >>>>>
> >>>>> TableServiceControl setTableService(TableFactory tf,
> >>>>>                    TableProperties properties,
> >>>>>                    TempTableCleanUpCallback cleanUpCallback);
> >>>>>
> >>>>> (TableServiceControl? TableService? TableServiceHandle?
> CacheService?)
> >>>>>
> >>>>> And having the drop cache method there:
> >>>>>
> >>>>> TableServiceControl#dropCache(table)
> >>>>>
> >>>>> Out of those options, option 1 might have a disadvantage of kind of
> not
> >>>>> making the user aware, that this is a global operation with side
> >>> effects.
> >>>>> Like the old example of:
> >>>>>
> >>>>> public void foo(Table t) {
> >>>>> // …
> >>>>> t.dropCache();
> >>>>> }
> >>>>>
> >>>>> It might not be immediately obvious that `t.dropCache()` is some kind
> >>> of
> >>>>> global operation, with side effects visible outside of the `foo`
> >>> function.
> >>>>>
> >>>>> On the other hand, both option 2 and 3, might have greater chance of
> >>>>> catching user’s attention:
> >>>>>
> >>>>> public void foo(Table t, CacheService cacheService) {
> >>>>> // …
> >>>>> cacheService.dropCache(t);
> >>>>> }
> >>>>>
> >>>>> b) could be achieved quite easily:
> >>>>>
> >>>>> Table a = …
> >>>>> val notCached1 = a.doNotCache()
> >>>>> val cachedA = a.cache()
> >>>>> val notCached2 = cachedA.doNotCache() // equivalent of notCached1
> >>>>>
> >>>>> `doNotCache()` would behave similarly to `cache()` - return a copy of
> >>> the
> >>>>> table with removed “cache” hint and/or added “never cache” hint.
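> >>>>>
> >>>>> A possible sketch of that copy-with-hint semantics (illustrative
> >>>>> pseudocode only; `hints` and `Hint` are made-up names):
> >>>>>
> >>>>> Table cache() {
> >>>>>   // never mutate `this`; return a copy whose hint set additionally
> >>>>>   // contains CACHE (doNotCache() would instead remove CACHE and/or
> >>>>>   // add NEVER_CACHE)
> >>>>>   return new Table(this.logicalPlan, this.hints.plus(Hint.CACHE));
> >>>>> }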
> >>>>>
> >>>>> Piotrek
> >>>>>
> >>>>>
> >>>>>> On 8 Jan 2019, at 03:17, Becket Qin <be...@gmail.com> wrote:
> >>>>>>
> >>>>>> Hi Piotr,
> >>>>>>
> >>>>>> Thanks for the proposal and detailed explanation. I like the idea of
> >>>>>> returning a new hinted Table without modifying the original table.
> >>> This
> >>>>>> also leave the room for users to benefit from future implicit
> caching.
> >>>>>>
> >>>>>> Just to make sure I get the full picture. In your proposal, there
> will
> >>>>> also
> >>>>>> be a 'void Table#uncache()' method to release the cache, right?
> >>>>>>
> >>>>>> Thanks,
> >>>>>>
> >>>>>> Jiangjie (Becket) Qin
> >>>>>>
> >>>>>> On Mon, Jan 7, 2019 at 11:50 PM Piotr Nowojski <
> piotr@da-platform.com
> >>>>
> >>>>>> wrote:
> >>>>>>
> >>>>>>> Hi Becket!
> >>>>>>>
> >>>>>>> After further thinking I tend to agree that my previous proposal
> >>>>>>> (*Option 2*) indeed might not be ideal if we were to introduce
> >>>>>>> automatic caching in the future.
> >>>>>>> However I would like to propose a slightly modified version of it:
> >>>>>>>
> >>>>>>> *Option 4*
> >>>>>>>
> >>>>>>> Adding `cache()` method with following signature:
> >>>>>>>
> >>>>>>> Table Table#cache();
> >>>>>>>
> >>>>>>> Without side-effects, and `cache()` call do not modify/change
> >>> original
> >>>>>>> Table in any way.
> >>>>>>> It would return a copy of original table, with added hint for the
> >>>>>>> optimizer to cache the table, so that the future accesses to the
> >>>>> returned
> >>>>>>> table might be cached or not.
> >>>>>>>
> >>>>>>> Assuming that we are talking about a setup, where we do not have
> >>>>> automatic
> >>>>>>> caching enabled (possible future extension).
> >>>>>>>
> >>>>>>> Example #1:
> >>>>>>>
> >>>>>>> ```
> >>>>>>> Table a = …
> >>>>>>> a.foo() // not cached
> >>>>>>>
> >>>>>>> val cachedTable = a.cache();
> >>>>>>>
> >>>>>>> cachedA.bar() // maybe cached
> >>>>>>> a.foo() // same as before - effectively not cached
> >>>>>>> ```
> >>>>>>>
> >>>>>>> Both the first and the second `a.foo()` operations would behave in
> >>>>>>> exactly the same way. Again, the `a.cache()` call doesn’t affect `a`
> >>>>>>> itself. If `a` was not hinted for caching before `a.cache()`, then
> >>>>>>> neither `a.foo()` call would use the cache.
> >>>>>>>
> >>>>>>> Returned `cachedA` would be hinted with “cache” hint, so probably
> >>>>>>> `cachedA.bar()` would go through cache (unless optimiser decides
> the
> >>>>>>> opposite)
> >>>>>>>
> >>>>>>> Example #2
> >>>>>>>
> >>>>>>> ```
> >>>>>>> Table a = …
> >>>>>>>
> >>>>>>> a.foo() // not cached
> >>>>>>>
> >>>>>>> val b = a.cache();
> >>>>>>>
> >>>>>>> a.foo() // same as before - effectively not cached
> >>>>>>> b.foo() // maybe cached
> >>>>>>>
> >>>>>>> val c = b.cache();
> >>>>>>>
> >>>>>>> a.foo() // same as before - effectively not cached
> >>>>>>> b.foo() // same as before - effectively maybe cached
> >>>>>>> c.foo() // maybe cached
> >>>>>>> ```
> >>>>>>>
> >>>>>>> Now, assuming that we have some future “automatic caching
> >>> optimisation”:
> >>>>>>>
> >>>>>>> Example #3
> >>>>>>>
> >>>>>>> ```
> >>>>>>> env.enableAutomaticCaching()
> >>>>>>> Table a = …
> >>>>>>>
> >>>>>>> a.foo() // might be cached, depending if `a` was selected to
> >>> automatic
> >>>>>>> caching
> >>>>>>>
> >>>>>>> val b = a.cache();
> >>>>>>>
> >>>>>>> a.foo() // same as before - might be cached, if `a` was selected to
> >>>>>>> automatic caching
> >>>>>>> b.foo() // maybe cached
> >>>>>>> ```
> >>>>>>>
> >>>>>>>
> >>>>>>> More or less this is the same behaviour as:
> >>>>>>>
> >>>>>>> Table a = ...
> >>>>>>> val b = a.filter(x > 20)
> >>>>>>>
> >>>>>>> calling `filter` hasn’t changed or altered `a` in anyway. If `a`
> was
> >>>>>>> previously filtered:
> >>>>>>>
> >>>>>>> Table src = …
> >>>>>>> val a = src.filter(x > 20)
> >>>>>>> val b = a.filter(x > 20)
> >>>>>>>
> >>>>>>> then yes, `a` and `b` will be the same. But the point is that
> neither
> >>>>>>> `filter` nor `cache` changes the original `a` table.
> >>>>>>>
> >>>>>>> One thing is that indeed, physically dropping cache operation, will
> >>> have
> >>>>>>> side effects and it will in a way mutate the cached table
> references.
> >>>>> But
> >>>>>>> this is I think unavoidable in any solution - the same issue as
> >>> calling
> >>>>>>> `.close()`, or calling destructor in C++.
> >>>>>>>
> >>>>>>> Piotrek
> >>>>>>>
> >>>>>>>> On 7 Jan 2019, at 10:41, Becket Qin <be...@gmail.com> wrote:
> >>>>>>>>
> >>>>>>>> Happy New Year, everybody!
> >>>>>>>>
> >>>>>>>> I would like to resume this discussion thread. At this point, We
> >>> have
> >>>>>>>> agreed on the first step goal of interactive programming. The open
> >>>>>>>> discussion is the exact API. More specifically, what should
> >>> *cache()*
> >>>>>>>> method return and what is the semantic. There are three options:
> >>>>>>>>
> >>>>>>>> *Option 1*
> >>>>>>>> *void cache()* OR *Table cache()* which returns the original table
> >>> for
> >>>>>>>> chained calls.
> >>>>>>>> *void uncache() *releases the cache.
> >>>>>>>> *Table.hint(ignoreCache).foo()* to ignore cache for operation
> foo().
> >>>>>>>>
> >>>>>>>> - Semantic: a.cache() hints that table 'a' should be cached.
> >>> Optimizer
> >>>>>>>> decides whether the cache will be used or not.
> >>>>>>>> - pros: simple and no confusion between CachedTable and original
> >>> table
> >>>>>>>> - cons: A table may be cached / uncached in a method invocation,
> >>> while
> >>>>>>> the
> >>>>>>>> caller does not know about this.
> >>>>>>>>
> >>>>>>>> *Option 2*
> >>>>>>>> *CachedTable cache()*
> >>>>>>>> *CachedTable *extends *Table *with an additional *uncache()*
> method
> >>>>>>>>
> >>>>>>>> - Semantic: After *val cachedA = a.cache()*, *cachedA.foo()* will
> >>>>> always
> >>>>>>>> use cache. *a.bar() *will always use original DAG.
> >>>>>>>> - pros: No potential side effects in method invocation.
> >>>>>>>> - cons: Optimizer has no chance to kick in. Future optimization
> will
> >>>>>>> become
> >>>>>>>> a behavior change and need users to change the code.
> >>>>>>>>
> >>>>>>>> *Option 3*
> >>>>>>>> *CacheHandle cache()*
> >>>>>>>> *CacheHandle.release() *to release a cache handle on the table. If
> >>> all
> >>>>>>>> cache handles are released, the cache could be removed.
> >>>>>>>> *Table.hint(ignoreCache).foo()* to ignore cache for operation
> foo().
> >>>>>>>>
> >>>>>>>> - Semantic: *a.cache() *hints that 'a' should be cached. Optimizer
> >>>>>>> decides
> >>>>>>>> whether the cache will be used or not. Cache is released either no
> >>>>> handle
> >>>>>>>> is on it, or the user program exits.
> >>>>>>>> - pros: No potential side effect in method invocation. No
> confusion
> >>>>>>> between
> >>>>>>>> cached table v.s original table.
> >>>>>>>> - cons: An additional CacheHandle exposed to the users.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> Personally I prefer option 3 for the following reasons:
> >>>>>>>> 1. It is simple. Vast majority of the users would just call
> >>>>>>>> *a.cache()* followed
> >>>>>>>> by *a.foo(),* *a.bar(), etc. *
> >>>>>>>> 2. There is no semantic ambiguity and semantic change if we decide
> >>> to
> >>>>> add
> >>>>>>>> implicit cache in the future.
> >>>>>>>> 3. There is no side effect in the method calls.
> >>>>>>>> 4. Admittedly we need to expose one more CacheHandle class to the
> >>>>> users.
> >>>>>>>> But it is not that difficult to understand given similar well
> known
> >>>>>>> concept
> >>>>>>>> like ref count (we can name it CacheReference if that is easier to
> >>>>>>>> understand). So I think it is fine.
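> >>>>>>>>
> >>>>>>>> For illustration, option 3 in user code would look roughly like this
> >>>>>>>> (a sketch against the proposed signatures, nothing here is final):
> >>>>>>>>
> >>>>>>>> Table a = ...
> >>>>>>>> CacheHandle h = a.cache();    // hint to cache + obtain a handle
> >>>>>>>> a.foo();                      // optimizer may or may not use the cache
> >>>>>>>> a.hint("ignoreCache").bar();  // explicitly bypass the cache
> >>>>>>>> h.release();                  // cache is dropped once all handles
> >>>>>>>>                               // have been released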
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> Thanks,
> >>>>>>>>
> >>>>>>>> Jiangjie (Becket) Qin
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On Thu, Dec 13, 2018 at 11:23 AM Becket Qin <becket.qin@gmail.com
> >
> >>>>>>> wrote:
> >>>>>>>>
> >>>>>>>>> Hi Piotrek,
> >>>>>>>>>
> >>>>>>>>> 1. Regarding optimization.
> >>>>>>>>> Sure there are many cases that the decision is hard to make. But
> >>> that
> >>>>>>> does
> >>>>>>>>> not make it any easier for the users to make those decisions. I
> >>>>> imagine
> >>>>>>> 99%
> >>>>>>>>> of the users would just naively use cache. I am not saying we can
> >>>>>>> optimize
> >>>>>>>>> in all the cases. But as long as we agree that at least in
> certain
> >>>>>>> cases (I
> >>>>>>>>> would argue most cases), optimizer can do a little better than an
> >>>>>>> average
> >>>>>>>>> user who likely knows little about Flink internals, we should not
> >>> push
> >>>>>>> the
> >>>>>>>>> burden of optimization to users.
> >>>>>>>>>
> >>>>>>>>> BTW, it seems some of your concerns are related to the
> >>>>> implementation. I
> >>>>>>>>> did not mention the implementation of the caching service because
> >>> that
> >>>>>>>>> should not affect the API semantic. Not sure if this helps, but
> >>>>> imagine
> >>>>>>> the
> >>>>>>>>> default implementation has one StorageNode service colocating
> with
> >>>>> each
> >>>>>>> TM.
> >>>>>>>>> It could be running within the TM process or in a standalone
> >>> process,
> >>>>>>>>> depending on configuration.
> >>>>>>>>>
> >>>>>>>>> The StorageNode uses memory + spill-to-disk mechanism. The cached
> >>> data
> >>>>>>>>> will just be written to the local StorageNode service. If the
> >>>>>>> StorageNode
> >>>>>>>>> is running within the TM process, the in-memory cache could just
> be
> >>>>>>> objects
> >>>>>>>>> so we save some serde cost. A later job referring to the cached
> >>> Table
> >>>>>>> will
> >>>>>>>>> be scheduled in a locality aware manner, i.e. run in the TM whose
> >>> peer
> >>>>>>>>> StorageNode hosts the data.
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> 2. Semantic
> >>>>>>>>> I am not sure why introducing a new hintCache() or
> >>>>>>>>> env.enableAutomaticCaching() method would avoid the consequence
> of
> >>>>>>> semantic
> >>>>>>>>> change.
> >>>>>>>>>
> >>>>>>>>> If the auto optimization is not enabled by default, users still
> >>> need
> >>>>> to
> >>>>>>>>> make code change to all existing programs in order to get the
> >>> benefit.
> >>>>>>>>> If the auto optimization is enabled by default, advanced users
> who
> >>>>> know
> >>>>>>>>> that they really want to use cache will suddenly lose the
> >>> opportunity
> >>>>>>> to do
> >>>>>>>>> so, unless they change the code to disable auto optimization.
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> 3. side effect
> >>>>>>>>> The CacheHandle is not only for where to put uncache(). It is to
> >>> solve
> >>>>>>> the
> >>>>>>>>> implicit performance impact by moving the uncache() to the
> >>>>> CacheHandle.
> >>>>>>>>>
> >>>>>>>>> - If users wants to leverage cache, they can call a.cache().
> After
> >>>>>>>>> that, unless user explicitly release that CacheHandle, a.foo()
> will
> >>>>>>> always
> >>>>>>>>> leverage cache if needed (optimizer may choose to ignore cache if
> >>>>> that
> >>>>>>>>> helps accelerate the process). Any function call will not be able
> >>> to
> >>>>>>>>> release the cache because they do not have that CacheHandle.
> >>>>>>>>> - If some advanced users do not want to use cache at all, they
> will
> >>>>>>>>> call a.hint(ignoreCache).foo(). This will for sure ignore cache
> and
> >>>>>>> use the
> >>>>>>>>> original DAG to process.
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>> In vast majority of the cases, users wouldn't really care
> whether
> >>> the
> >>>>>>>>>> cache is used or not.
> >>>>>>>>>> I wouldn’t agree with that, because “caching” (if not purely in
> >>>>> memory
> >>>>>>>>>> caching) would add additional IO costs. It’s similar as saying
> >>> that
> >>>>>>> users
> >>>>>>>>>> would not see a difference between Spark/Flink and MapReduce
> >>>>> (MapReduce
> >>>>>>>>>> writes data to disks after every map/reduce stage).
> >>>>>>>>>
> >>>>>>>>> What I wanted to say is that in most cases, after users call
> >>> cache(),
> >>>>>>> they
> >>>>>>>>> don't really care about whether auto optimization has decided to
> >>>>> ignore
> >>>>>>> the
> >>>>>>>>> cache or not, as long as the program runs faster.
> >>>>>>>>>
> >>>>>>>>> Thanks,
> >>>>>>>>>
> >>>>>>>>> Jiangjie (Becket) Qin
> >>>>>>>>>
> >>>>>>>>> On Wed, Dec 12, 2018 at 10:50 PM Piotr Nowojski <
> >>>>>>> piotr@data-artisans.com>
> >>>>>>>>> wrote:
> >>>>>>>>>
> >>>>>>>>>> Hi,
> >>>>>>>>>>
> >>>>>>>>>> Thanks for the quick answer :)
> >>>>>>>>>>
> >>>>>>>>>> Re 1.
> >>>>>>>>>>
> >>>>>>>>>> I generally agree with you, however couple of points:
> >>>>>>>>>>
> >>>>>>>>>> a) the problem with using automatic caching is bigger, because
> you
> >>>>> will
> >>>>>>>>>> have to decide, how do you compare IO vs CPU costs and if you
> pick
> >>>>>>> wrong,
> >>>>>>>>>> additional IO costs might be enormous or even can crash your
> >>> system.
> >>>>>>> This
> >>>>>>>>>> is more difficult problem compared to let say join reordering,
> >>> where
> >>>>>>> the
> >>>>>>>>>> only issue is to have good statistics that can capture
> >>> correlations
> >>>>>>> between
> >>>>>>>>>> columns (when you reorder joins number of IO operations do not
> >>>>> change)
> >>>>>>>>>> c) your example is completely independent of caching.
> >>>>>>>>>>
> >>>>>>>>>> Query like this:
> >>>>>>>>>>
> >>>>>>>>>> src1.filter('f1 > 10).join(src2.filter('f2 < 30), 'f1 === 'f2)
> >>>>>>>>>>   .as('f3, …).filter('f3 > 30)
> >>>>>>>>>>
> >>>>>>>>>> Should/could be optimised to empty result immediately, without
> the
> >>>>> need
> >>>>>>>>>> for any cache/materialisation and that should work even without
> >>> any
> >>>>>>>>>> statistics provided by the connector.
> >>>>>>>>>>
> >>>>>>>>>> For me prerequisite to any serious cost-based optimisations
> would
> >>> be
> >>>>>>> some
> >>>>>>>>>> reasonable benchmark coverage of the code (tpch?). Otherwise that
> >>>>>>>>>> would be the equivalent of adding untested code, since we wouldn’t
> >>>>>>>>>> be able to verify our assumptions, like how the writing of 10 000
> >>>>>>>>>> records to a cache/RocksDB/Kafka/CSV file compares to
> >>>>>>>>>> joining/filtering/processing of, let's say, 1 000 000 rows.
> >>>>>>>>>>
> >>>>>>>>>> Re 2.
> >>>>>>>>>>
> >>>>>>>>>> I wasn’t proposing to change the semantic later. I was proposing
> >>> that
> >>>>>>> we
> >>>>>>>>>> start now:
> >>>>>>>>>>
> >>>>>>>>>> CachedTable cachedA = a.cache()
> >>>>>>>>>> cachedA.foo() // Cache is used
> >>>>>>>>>> a.bar() // Original DAG is used
> >>>>>>>>>>
> >>>>>>>>>> And then later we can think about adding for example
> >>>>>>>>>>
> >>>>>>>>>> CachedTable cachedA = a.hintCache()
> >>>>>>>>>> cachedA.foo() // Cache might be used
> >>>>>>>>>> a.bar() // Original DAG is used
> >>>>>>>>>>
> >>>>>>>>>> Or
> >>>>>>>>>>
> >>>>>>>>>> env.enableAutomaticCaching()
> >>>>>>>>>> a.foo() // Cache might be used
> >>>>>>>>>> a.bar() // Cache might be used
> >>>>>>>>>>
> >>>>>>>>>> Or (I would still not like this option):
> >>>>>>>>>>
> >>>>>>>>>> a.hintCache()
> >>>>>>>>>> a.foo() // Cache might be used
> >>>>>>>>>> a.bar() // Cache might be used
> >>>>>>>>>>
> >>>>>>>>>> Or whatever else that will come to our mind. Even if we add some
> >>>>>>>>>> automatic caching in the future, keeping implicit (`CachedTable
> >>>>>>> cache()`)
> >>>>>>>>>> caching will still be useful, at least in some cases.
> >>>>>>>>>>
> >>>>>>>>>> Re 3.
> >>>>>>>>>>
> >>>>>>>>>>> 2. The source tables are immutable during one run of batch
> >>>>> processing
> >>>>>>>>>> logic.
> >>>>>>>>>>> 3. The cache is immutable during one run of batch processing
> >>> logic.
> >>>>>>>>>>
> >>>>>>>>>>> I think assumption 2 and 3 are by definition what batch
> >>> processing
> >>>>>>>>>> means,
> >>>>>>>>>>> i.e the data must be complete before it is processed and should
> >>> not
> >>>>>>>>>> change
> >>>>>>>>>>> when the processing is running.
> >>>>>>>>>>
> >>>>>>>>>> I agree that this is how batch systems SHOULD be working.
> However
> >>> I
> >>>>>>> know
> >>>>>>>>>> from my previous experience that it’s not always the case.
> >>> Sometimes
> >>>>>>> users
> >>>>>>>>>> are just working on some non transactional storage, which can be
> >>>>>>> (either
> >>>>>>>>>> constantly or occasionally) being modified by some other
> processes
> >>>>> for
> >>>>>>>>>> whatever the reasons (fixing the data, updating, adding new data
> >>>>> etc).
> >>>>>>>>>>
> >>>>>>>>>> But even if we ignore this point (data immutability),
> performance
> >>>>> side
> >>>>>>>>>> effect issue of your proposal remains. If user calls `void
> >>> a.cache()`
> >>>>>>> deep
> >>>>>>>>>> inside some private method, it will have implicit side effects
> on
> >>>>> other
> >>>>>>>>>> parts of his program that might not be obvious.
> >>>>>>>>>>
> >>>>>>>>>> Re `CacheHandle`.
> >>>>>>>>>>
> >>>>>>>>>> If I understand it correctly, it only addresses the issue where
> to
> >>>>>>> place
> >>>>>>>>>> method `uncache`/`dropCache`.
> >>>>>>>>>>
> >>>>>>>>>> Btw,
> >>>>>>>>>>
> >>>>>>>>>>> In vast majority of the cases, users wouldn't really care
> whether
> >>>>> the
> >>>>>>>>>> cache is used or not.
> >>>>>>>>>>
> >>>>>>>>>> I wouldn’t agree with that, because “caching” (if not purely in
> >>>>> memory
> >>>>>>>>>> caching) would add additional IO costs. It’s similar as saying
> >>> that
> >>>>>>> users
> >>>>>>>>>> would not see a difference between Spark/Flink and MapReduce
> >>>>> (MapReduce
> >>>>>>>>>> writes data to disks after every map/reduce stage).
> >>>>>>>>>>
> >>>>>>>>>> Piotrek
> >>>>>>>>>>
> >>>>>>>>>>> On 12 Dec 2018, at 14:28, Becket Qin <be...@gmail.com>
> >>> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>> Hi Piotrek,
> >>>>>>>>>>>
> >>>>>>>>>>> Not sure if you noticed, in my last email, I was proposing
> >>>>>>> `CacheHandle
> >>>>>>>>>>> cache()` to avoid the potential side effect due to function
> >>> calls.
> >>>>>>>>>>>
> >>>>>>>>>>> Let's look at the disagreement in your reply one by one.
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> 1. Optimization chances
> >>>>>>>>>>>
> >>>>>>>>>>> Optimization is never a trivial work. This is exactly why we
> >>> should
> >>>>>>> not
> >>>>>>>>>> let
> >>>>>>>>>>> user manually do that. Databases have done huge amount of work
> in
> >>>>> this
> >>>>>>>>>>> area. At Alibaba, we rely heavily on many optimization rules to
> >>>>> boost
> >>>>>>>>>> the
> >>>>>>>>>>> SQL query performance.
> >>>>>>>>>>>
> >>>>>>>>>>> In your example, if I filling the filter conditions in a
> certain
> >>>>> way,
> >>>>>>>>>> the
> >>>>>>>>>>> optimization would become obvious.
> >>>>>>>>>>>
> >>>>>>>>>>> Table src1 = … // read from connector 1
> >>>>>>>>>>> Table src2 = … // read from connector 2
> >>>>>>>>>>>
> >>>>>>>>>>> Table a = src1.filter('f1 > 10).join(src2.filter('f2 < 30),
> >>>>>>>>>>>   'f1 === 'f2).as('f3, ...)
> >>>>>>>>>>> a.cache() // write cache to connector 3, when writing the
> >>> records,
> >>>>>>>>>> remember
> >>>>>>>>>>> min and max of `f1
> >>>>>>>>>>>
> >>>>>>>>>>> a.filter('f3 > 30) // There is no need to read from any
> connector
> >>>>>>>>>> because
> >>>>>>>>>>> `a` does not contain any record whose 'f3 is greater than 30.
> >>>>>>>>>>> env.execute()
> >>>>>>>>>>> a.select(…)
> >>>>>>>>>>>
> >>>>>>>>>>> BTW, it seems to me that adding some basic statistics is fairly
> >>>>>>>>>>> straightforward and the cost is pretty marginal if not
> >>> ignorable. In
> >>>>>>>>>> fact
> >>>>>>>>>>> it is not only needed for optimization, but also for cases such
> >>> as
> >>>>> ML,
> >>>>>>>>>>> where some algorithms may need to decide their parameter based
> on
> >>>>> the
> >>>>>>>>>>> statistics of the data.
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> 2. Same API, one semantic now, another semantic later.
> >>>>>>>>>>>
> >>>>>>>>>>> I am trying to understand what is the semantic of `CachedTable
> >>>>>>> cache()`
> >>>>>>>>>> you
> >>>>>>>>>>> are proposing. IMO, we should avoid designing an API whose
> >>> semantic
> >>>>>>>>>> will be
> >>>>>>>>>>> changed later. If we have a "CachedTable cache()" method, then
> >>> the
> >>>>>>>>>> semantic
> >>>>>>>>>>> should be very clearly defined upfront and do not change later.
> >>> It
> >>>>>>>>>> should
> >>>>>>>>>>> never be "right now let's go with semantic 1, later we can
> >>> silently
> >>>>>>>>>> change
> >>>>>>>>>>> it to semantic 2 or 3". Such change could result in bad
> >>> consequence.
> >>>>>>> For
> >>>>>>>>>>> example, let's say we decide go with semantic 1:
> >>>>>>>>>>>
> >>>>>>>>>>> CachedTable cachedA = a.cache()
> >>>>>>>>>>> cachedA.foo() // Cache is used
> >>>>>>>>>>> a.bar() // Original DAG is used.
> >>>>>>>>>>>
> >>>>>>>>>>> Now majority of the users would be using cachedA.foo() in their
> >>>>> code.
> >>>>>>>>>> And
> >>>>>>>>>>> some advanced users will use a.bar() to explicitly skip the
> >>> cache.
> >>>>>>> Later
> >>>>>>>>>>> on, we added smart optimization and change the semantic to
> >>> semantic
> >>>>> 2:
> >>>>>>>>>>>
> >>>>>>>>>>> CachedTable cachedA = a.cache()
> >>>>>>>>>>> cachedA.foo() // Cache is used
> >>>>>>>>>>> a.bar() // Cache MIGHT be used, and Flink may decide to skip
> >>> cache
> >>>>> if
> >>>>>>>>>> it is
> >>>>>>>>>>> faster.
> >>>>>>>>>>>
> >>>>>>>>>>> Now most of the users who were writing cachedA.foo() will not
> >>>>> benefit
> >>>>>>>>>> from
> >>>>>>>>>>> this optimization at all, unless they change their code to use
> >>>>> a.foo()
> >>>>>>>>>>> instead. And those advanced users suddenly lose the option to
> >>>>>>> explicitly
> >>>>>>>>>>> ignore cache unless they change their code (assuming we care
> >>> enough
> >>>>> to
> >>>>>>>>>>> provide something like hint(useCache)). If we don't define the
> >>>>>>> semantic
> >>>>>>>>>>> carefully, our users will have to change their code again and
> >>> again
> >>>>>>>>>> while
> >>>>>>>>>>> they shouldn't have to.
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> 3. side effect.
> >>>>>>>>>>>
> >>>>>>>>>>> Before we talk about side effect, we have to agree on the
> >>>>> assumptions.
> >>>>>>>>>> The
> >>>>>>>>>>> assumptions I have are following:
> >>>>>>>>>>> 1. We are talking about batch processing.
> >>>>>>>>>>> 2. The source tables are immutable during one run of batch
> >>>>> processing
> >>>>>>>>>> logic.
> >>>>>>>>>>> 3. The cache is immutable during one run of batch processing
> >>> logic.
> >>>>>>>>>>>
> >>>>>>>>>>> I think assumption 2 and 3 are by definition what batch
> >>> processing
> >>>>>>>>>> means,
> >>>>>>>>>>> i.e the data must be complete before it is processed and should
> >>> not
> >>>>>>>>>> change
> >>>>>>>>>>> when the processing is running.
> >>>>>>>>>>>
> >>>>>>>>>>> As far as I am aware of, I don't know any batch processing
> system
> >>>>>>>>>> breaking
> >>>>>>>>>>> those assumptions. Even for relational database tables, where
> >>>>> queries
> >>>>>>>>>> can
> >>>>>>>>>>> run with concurrent modifications, necessary locking are still
> >>>>>>> required
> >>>>>>>>>> to
> >>>>>>>>>>> ensure the integrity of the query result.
> >>>>>>>>>>>
> >>>>>>>>>>> Please let me know if you disagree with the above assumptions.
> If
> >>>>> you
> >>>>>>>>>> agree
> >>>>>>>>>>> with these assumptions, with the `CacheHandle cache()` API in
> my
> >>>>> last
> >>>>>>>>>>> email, do you still see side effects?
> >>>>>>>>>>>
> >>>>>>>>>>> Thanks,
> >>>>>>>>>>>
> >>>>>>>>>>> Jiangjie (Becket) Qin
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> On Wed, Dec 12, 2018 at 7:11 PM Piotr Nowojski <
> >>>>>>> piotr@data-artisans.com
> >>>>>>>>>>>
> >>>>>>>>>>> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>> Hi Becket,
> >>>>>>>>>>>>
> >>>>>>>>>>>>> Regarding the chance of optimization, it might not be that
> >>> rare.
> >>>>>>> Some
> >>>>>>>>>>>> very
> >>>>>>>>>>>>> simple statistics could already help in many cases. For
> >>> example,
> >>>>>>>>>> simply
> >>>>>>>>>>>>> maintaining max and min of each fields can already eliminate
> >>> some
> >>>>>>>>>>>>> unnecessary table scan (potentially scanning the cached
> table)
> >>> if
> >>>>>>> the
> >>>>>>>>>>>>> result is doomed to be empty. A histogram would give even
> >>> further
> >>>>>>>>>>>>> information. The optimizer could be very careful and only
> >>> ignores
> >>>>>>>>>> cache
> >>>>>>>>>>>>> when it is 100% sure doing that is cheaper. e.g. only when a
> >>>>> filter
> >>>>>>> on
> >>>>>>>>>>>> the
> >>>>>>>>>>>>> cache will absolutely return nothing.
> >>>>>>>>>>>>
> >>>>>>>>>>>> I do not see how this might be easy to achieve. It would
> require
> >>>>> tons
> >>>>>>>>>> of
> >>>>>>>>>>>> effort to make it work and in the end you would still have a
> >>>>> problem
> >>>>>>> of
> >>>>>>>>>>>> comparing/trading CPU cycles vs IO. For example:
> >>>>>>>>>>>>
> >>>>>>>>>>>> Table src1 = … // read from connector 1
> >>>>>>>>>>>> Table src2 = … // read from connector 2
> >>>>>>>>>>>>
> >>>>>>>>>>>> Table a = src1.filter(…).join(src2.filter(…), …)
> >>>>>>>>>>>> a.cache() // write cache to connector 3
> >>>>>>>>>>>>
> >>>>>>>>>>>> a.filter(…)
> >>>>>>>>>>>> env.execute()
> >>>>>>>>>>>> a.select(…)
> >>>>>>>>>>>>
> >>>>>>>>>>>> Decision whether it’s better to:
> >>>>>>>>>>>> A) read from connector1/connector2, filter/map and join them
> >>> twice
> >>>>>>>>>>>> B) read from connector1/connector2, filter/map and join them
> >>> once,
> >>>>>>> pay
> >>>>>>>>>> the
> >>>>>>>>>>>> price of writing to connector 3 and then reading from it
> >>>>>>>>>>>>
> >>>>>>>>>>>> Is very far from trivial. `a` can end up much larger than
> `src1`
> >>>>> and
> >>>>>>>>>>>> `src2`, writes to connector 3 might be extremely slow, reads
> >>> from
> >>>>>>>>>> connector
> >>>>>>>>>>>> 3 can be slower compared to reads from connector 1 & 2, … .
> You
> >>>>>>> really
> >>>>>>>>>> need
> >>>>>>>>>>>> to have extremely good statistics to correctly assess the size
> >>>>>>>>>>>> of the
> >>>>>>>>>> output and
> >>>>>>>>>>>> it would still be failing many times (correlations etc). And
> >>> keep
> >>>>> in
> >>>>>>>>>> mind
> >>>>>>>>>>>> that at the moment we do not have ANY statistics at all. More
> >>> than
> >>>>>>>>>> that, it
> >>>>>>>>>>>> would require significantly more testing and setting up some
> >>>>>>>>>> benchmarks to
> >>>>>>>>>>>> make sure that we do not break it with some regressions.
> >>>>>>>>>>>>
> >>>>>>>>>>>> That’s why I’m strongly opposing this idea - at least let’s not
> >>>>>>>>>>>> start
> >>>>>>>>>>>> with this. If we first start with completely manual/explicit
> >>>>> caching,
> >>>>>>>>>>>> without any magic, it would be a significant improvement for
> the
> >>>>>>> users
> >>>>>>>>>> for
> >>>>>>>>>>>> a fraction of the development cost. After implementing that,
> >>> when
> >>>>> we
> >>>>>>>>>>>> already have all of the working pieces, we can start working
> on
> >>>>> some
> >>>>>>>>>>>> optimisations rules. As I wrote before, if we start with
> >>>>>>>>>>>>
> >>>>>>>>>>>> `CachedTable cache()`
> >>>>>>>>>>>>
> >>>>>>>>>>>> We can later work on follow up stories to make it automatic.
> >>>>> Despite
> >>>>>>>>>> that
> >>>>>>>>>>>> I don’t like this implicit/side effect approach with `void`
> >>> method,
> >>>>>>>>>> having
> >>>>>>>>>>>> explicit `CachedTable cache()` wouldn’t even prevent us from
> >>> later
> >>>>>>>>>> adding
> >>>>>>>>>>>> `void hintCache()` method, with the exact semantic that you
> >>> want.
> >>>>>>>>>>>>
> >>>>>>>>>>>> On top of that I re-raise again that having implicit `void
> >>>>>>>>>>>> cache()/hintCache()` has other side effects and problems with
> >>> non
> >>>>>>>>>> immutable
> >>>>>>>>>>>> data, and being annoying when used secretly inside methods.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Explicit `CachedTable cache()` just looks like much less
> >>>>>>> controversial
> >>>>>>>>>> MVP
> >>>>>>>>>>>> and if we decide to go further with this topic, it’s not a
> >>> wasted
> >>>>>>>>>> effort,
> >>>>>>>>>>>> but just lies on a straight path to more advanced/complicated
> >>>>>>> solutions
> >>>>>>>>>> in
> >>>>>>>>>>>> the future. Are there any drawbacks of starting with
> >>> `CachedTable
> >>>>>>>>>> cache()`
> >>>>>>>>>>>> that I’m missing?
> >>>>>>>>>>>>
> >>>>>>>>>>>> Piotrek
> >>>>>>>>>>>>
> >>>>>>>>>>>>> On 12 Dec 2018, at 09:30, Jeff Zhang <zj...@gmail.com>
> wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Hi Becket,
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Introducing CacheHandle seems too complicated. That means
> users
> >>>>> have
> >>>>>>>>>> to
> >>>>>>>>>>>>> maintain Handler properly.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> And since cache is just a hint for optimizer, why not just
> >>> return
> >>>>>>>>>> Table
> >>>>>>>>>>>>> itself for cache method. This hint info should be kept in
> >>> Table I
> >>>>>>>>>>>> believe.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> So how about adding method cache and uncache for Table, and
> >>> both
> >>>>>>>>>> return
> >>>>>>>>>>>>> Table. Because what cache and uncache did is just adding some
> >>> hint
> >>>>>>>>>> info
> >>>>>>>>>>>>> into Table.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Becket Qin <be...@gmail.com> wrote on Wed, Dec 12, 2018 at 11:25 AM:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> Hi Till and Piotrek,
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Thanks for the clarification. That solves quite a few
> >>> confusion.
> >>>>> My
> >>>>>>>>>>>>>> understanding of how cache works is same as what Till
> >>> describe.
> >>>>>>> i.e.
> >>>>>>>>>>>>>> cache() is a hint to Flink, but it is not guaranteed that
> >>> cache
> >>>>>>>>>> always
> >>>>>>>>>>>>>> exist and it might be recomputed from its lineage.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Is this the core of our disagreement here? That you would
> like
> >>>>> this
> >>>>>>>>>>>>>>> “cache()” to be mostly hint for the optimiser?
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Semantic wise, yes. That's also why I think materialize()
> has
> >>> a
> >>>>>>> much
> >>>>>>>>>>>> larger
> >>>>>>>>>>>>>> scope than cache(), thus it should be a different method.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Regarding the chance of optimization, it might not be that
> >>> rare.
> >>>>>>> Some
> >>>>>>>>>>>> very
> >>>>>>>>>>>>>> simple statistics could already help in many cases. For
> >>> example,
> >>>>>>>>>> simply
> >>>>>>>>>>>>>> maintaining max and min of each fields can already eliminate
> >>> some
> >>>>>>>>>>>>>> unnecessary table scan (potentially scanning the cached
> >>> table) if
> >>>>>>> the
> >>>>>>>>>>>>>> result is doomed to be empty. A histogram would give even
> >>> further
> >>>>>>>>>>>>>> information. The optimizer could be very careful and only
> >>> ignores
> >>>>>>>>>> cache
> >>>>>>>>>>>>>> when it is 100% sure doing that is cheaper. e.g. only when a
> >>>>> filter
> >>>>>>>>>> on
> >>>>>>>>>>>> the
> >>>>>>>>>>>>>> cache will absolutely return nothing.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Given the above clarification on cache, I would like to
> >>> revisit
> >>>>> the
> >>>>>>>>>>>>>> original "void cache()" proposal and see if we can improve
> on
> >>> top
> >>>>>>> of
> >>>>>>>>>>>> that.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> What do you think about the following modified interface?
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Table {
> >>>>>>>>>>>>>> /**
> >>>>>>>>>>>>>> * This call hints Flink to maintain a cache of this table
> and
> >>>>>>>>>> leverage
> >>>>>>>>>>>>>> it for performance optimization if needed.
> >>>>>>>>>>>>>> * Note that Flink may still decide to not use the cache if
> it
> >>> is
> >>>>>>>>>>>> cheaper
> >>>>>>>>>>>>>> by doing so.
> >>>>>>>>>>>>>> *
> >>>>>>>>>>>>>> * A CacheHandle will be returned to allow user release the
> >>> cache
> >>>>>>>>>>>>>> actively. The cache will be deleted if there
> >>>>>>>>>>>>>> * is no unreleased cache handlers to it. When the
> >>>>> TableEnvironment
> >>>>>>>>>> is
> >>>>>>>>>>>>>> closed. The cache will also be deleted
> >>>>>>>>>>>>>> * and all the cache handlers will be released.
> >>>>>>>>>>>>>> *
> >>>>>>>>>>>>>> * @return a CacheHandle referring to the cache of this
> table.
> >>>>>>>>>>>>>> */
> >>>>>>>>>>>>>> CacheHandle cache();
> >>>>>>>>>>>>>> }
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> CacheHandle {
> >>>>>>>>>>>>>> /**
> >>>>>>>>>>>>>> * Close the cache handle. This method does not necessarily
> >>>>> deletes
> >>>>>>>>>> the
> >>>>>>>>>>>>>> cache. Instead, it simply decrements the reference counter
> to
> >>> the
> >>>>>>>>>> cache.
> >>>>>>>>>>>>>> * When the there is no handle referring to a cache. The
> cache
> >>>>> will
> >>>>>>>>>> be
> >>>>>>>>>>>>>> deleted.
> >>>>>>>>>>>>>> *
> >>>>>>>>>>>>>> * @return the number of open handles to the cache after this
> >>>>> handle
> >>>>>>>>>>>> has
> >>>>>>>>>>>>>> been released.
> >>>>>>>>>>>>>> */
> >>>>>>>>>>>>>> int release()
> >>>>>>>>>>>>>> }
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> The rationale behind this interface is following:
> >>>>>>>>>>>>>> In vast majority of the cases, users wouldn't really care
> >>> whether
> >>>>>>> the
> >>>>>>>>>>>> cache
> >>>>>>>>>>>>>> is used or not. So I think the most intuitive way is letting
> >>>>>>> cache()
> >>>>>>>>>>>> return
> >>>>>>>>>>>>>> nothing. So nobody needs to worry about the difference
> between
> >>>>>>>>>>>> operations
> >>>>>>>>>>>>>> on CacheTables and those on the "original" tables. This will
> >>> make
> >>>>>>>>>> maybe
> >>>>>>>>>>>>>> 99.9% of the users happy. There were two concerns raised for
> >>> this
> >>>>>>>>>>>> approach:
> >>>>>>>>>>>>>> 1. In some rare cases, users may want to ignore cache,
> >>>>>>>>>>>>>> 2. A table might be cached/uncached in a third party
> function
> >>>>> while
> >>>>>>>>>> the
> >>>>>>>>>>>>>> caller does not know.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> For the first issue, users can use hint("ignoreCache") to
> >>>>>>> explicitly
> >>>>>>>>>>>> ignore
> >>>>>>>>>>>>>> cache.
> >>>>>>>>>>>>>> For the second issue, the above proposal lets cache()
> return a
> >>>>>>>>>>>> CacheHandle,
> >>>>>>>>>>>>>> the only method in it is release(). Different CacheHandles
> >>> will
> >>>>>>>>>> refer to
> >>>>>>>>>>>>>> the same cache, if a cache no longer has any cache handle,
> it
> >>>>> will
> >>>>>>> be
> >>>>>>>>>>>>>> deleted. This will address the following case:
> >>>>>>>>>>>>>> {
> >>>>>>>>>>>>>> val handle1 = a.cache()
> >>>>>>>>>>>>>> process(a)
> >>>>>>>>>>>>>> a.select(...) // cache is still available, handle1 has not
> >>> been
> >>>>>>>>>>>> released.
> >>>>>>>>>>>>>> }
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> void process(Table t) {
> >>>>>>>>>>>>>> val handle2 = t.cache() // new handle to cache
> >>>>>>>>>>>>>> t.select(...) // optimizer decides cache usage
> >>>>>>>>>>>>>> t.hint("ignoreCache").select(...) // cache is ignored
> >>>>>>>>>>>>>> handle2.release() // release the handle, but the cache may
> >>> still
> >>>>> be
> >>>>>>>>>>>>>> available if there are other handles
> >>>>>>>>>>>>>> ...
> >>>>>>>>>>>>>> }
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Does the above modified approach look reasonable to you?
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Cheers,
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Jiangjie (Becket) Qin
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> On Tue, Dec 11, 2018 at 6:44 PM Till Rohrmann <
> >>>>>>> trohrmann@apache.org>
> >>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Hi Becket,
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> I was aiming at semantics similar to 1. I actually thought
> >>> that
> >>>>>>>>>>>> `cache()`
> >>>>>>>>>>>>>>> would tell the system to materialize the intermediate
> result
> >>> so
> >>>>>>> that
> >>>>>>>>>>>>>>> subsequent queries don't need to reprocess it. This means
> >>> that
> >>>>> the
> >>>>>>>>>>>> usage
> >>>>>>>>>>>>>> of
> >>>>>>>>>>>>>>> the cached table in this example
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> {
> >>>>>>>>>>>>>>> val cachedTable = a.cache()
> >>>>>>>>>>>>>>> val b1 = cachedTable.select(…)
> >>>>>>>>>>>>>>> val b2 = cachedTable.foo().select(…)
> >>>>>>>>>>>>>>> val b3 = cachedTable.bar().select(...)
> >>>>>>>>>>>>>>> val c1 = a.select(…)
> >>>>>>>>>>>>>>> val c2 = a.foo().select(…)
> >>>>>>>>>>>>>>> val c3 = a.bar().select(...)
> >>>>>>>>>>>>>>> }
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> strongly depends on interleaved calls which trigger the
> >>>>> execution
> >>>>>>> of
> >>>>>>>>>>>> sub
> >>>>>>>>>>>>>>> queries. So for example, if there is only a single
> >>> env.execute
> >>>>>>> call
> >>>>>>>>>> at
> >>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>> end of  block, then b1, b2, b3, c1, c2 and c3 would all be
> >>>>>>> computed
> >>>>>>>>>> by
> >>>>>>>>>>>>>>> reading directly from the sources (given that there is
> only a
> >>>>>>> single
> >>>>>>>>>>>>>>> JobGraph). It just happens that the result of `a` will be
> >>> cached
> >>>>>>>>>> such
> >>>>>>>>>>>>>> that
> >>>>>>>>>>>>>>> we skip the processing of `a` when there are subsequent
> >>> queries
> >>>>>>>>>> reading
> >>>>>>>>>>>>>>> from `cachedTable`. If for some reason the system cannot
> >>>>>>> materialize
> >>>>>>>>>>>> the
> >>>>>>>>>>>>>>> table (e.g. running out of disk space, ttl expired), then
> it
> >>>>> could
> >>>>>>>>>> also
> >>>>>>>>>>>>>>> happen that we need to reprocess `a`. In that sense
> >>>>> `cachedTable`
> >>>>>>>>>>>> simply
> >>>>>>>>>>>>>> is
> >>>>>>>>>>>>>>> an identifier for the materialized result of `a` with the
> >>>>> lineage
> >>>>>>>>>> how
> >>>>>>>>>>>> to
> >>>>>>>>>>>>>>> reprocess it.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Cheers,
> >>>>>>>>>>>>>>> Till
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> On Tue, Dec 11, 2018 at 11:01 AM Piotr Nowojski <
> >>>>>>>>>>>> piotr@data-artisans.com
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Hi Becket,
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> {
> >>>>>>>>>>>>>>>>> val cachedTable = a.cache()
> >>>>>>>>>>>>>>>>> val b = cachedTable.select(...)
> >>>>>>>>>>>>>>>>> val c = a.select(...)
> >>>>>>>>>>>>>>>>> }
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Semantic 1. b uses cachedTable as user demanded so. c
> uses
> >>>>>>>>>> original
> >>>>>>>>>>>>>> DAG
> >>>>>>>>>>>>>>>> as
> >>>>>>>>>>>>>>>>> user demanded so. In this case, the optimizer has no
> >>> chance to
> >>>>>>>>>>>>>>> optimize.
> >>>>>>>>>>>>>>>>> Semantic 2. b uses cachedTable as user demanded so. c
> >>> leaves
> >>>>> the
> >>>>>>>>>>>>>>>> optimizer
> >>>>>>>>>>>>>>>>> to choose whether the cache or DAG should be used. In
> this
> >>>>> case,
> >>>>>>>>>> user
> >>>>>>>>>>>>>>>> lose
> >>>>>>>>>>>>>>>>> the option to NOT use cache.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> As you can see, neither of the options seem perfect.
> >>> However,
> >>>>> I
> >>>>>>>>>> guess
> >>>>>>>>>>>>>>> you
> >>>>>>>>>>>>>>>>> and Till are proposing the third option:
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Semantic 3. b leaves the optimizer to choose whether
> cache
> >>> or
> >>>>>>> DAG
> >>>>>>>>>>>>>>> should
> >>>>>>>>>>>>>>>> be
> >>>>>>>>>>>>>>>>> used. c always use the DAG.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> I am pretty sure that me, Till, Fabian and others were all
> >>>>>>>>>> proposing
> >>>>>>>>>>>>>> and
> >>>>>>>>>>>>>>>> advocating in favour of semantic “1”. No cost based
> >>> optimiser
> >>>>>>>>>>>> decisions
> >>>>>>>>>>>>>>> at
> >>>>>>>>>>>>>>>> all.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> {
> >>>>>>>>>>>>>>>> val cachedTable = a.cache()
> >>>>>>>>>>>>>>>> val b1 = cachedTable.select(…)
> >>>>>>>>>>>>>>>> val b2 = cachedTable.foo().select(…)
> >>>>>>>>>>>>>>>> val b3 = cachedTable.bar().select(...)
> >>>>>>>>>>>>>>>> val c1 = a.select(…)
> >>>>>>>>>>>>>>>> val c2 = a.foo().select(…)
> >>>>>>>>>>>>>>>> val c3 = a.bar().select(...)
> >>>>>>>>>>>>>>>> }
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> All b1, b2 and b3 are reading from cache, while c1, c2 and
> >>> c3
> >>>>> are
> >>>>>>>>>>>>>>>> re-executing whole plan for “a”.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> In the future we could discuss going one step further,
> >>>>>>> introducing
> >>>>>>>>>>>> some
> >>>>>>>>>>>>>>>> global optimisation (that can be manually
> enabled/disabled):
> >>>>>>>>>>>>>> deduplicate
> >>>>>>>>>>>>>>>> plan nodes/deduplicate sub queries/re-use sub queries
> >>>>> results/or
> >>>>>>>>>>>>>> whatever
> >>>>>>>>>>>>>>>> we could call it. It could do two things:
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> 1. Automatically try to deduplicate fragments of the plan
> >>> and
> >>>>>>> share
> >>>>>>>>>>>> the
> >>>>>>>>>>>>>>>> result using CachedTable - in other words automatically
> >>> insert
> >>>>>>>>>>>>>>> `CachedTable
> >>>>>>>>>>>>>>>> cache()` calls.
> >>>>>>>>>>>>>>>> 2. Automatically make decision to bypass explicit
> >>> `CachedTable`
> >>>>>>>>>> access
> >>>>>>>>>>>>>>>> (this would be the equivalent of what you described as
> >>>>> “semantic
> >>>>>>>>>> 3”).
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> However as I wrote previously, I have big doubts if such
> >>>>>>> cost-based
> >>>>>>>>>>>>>>>> optimisation would work (this applies also to “Semantic
> >>> 2”). I
> >>>>>>>>>> would
> >>>>>>>>>>>>>>> expect
> >>>>>>>>>>>>>>>> it to do more harm than good in so many cases, that it
> >>> wouldn’t
> >>>>>>>>>> make
> >>>>>>>>>>>>>>> sense.
> >>>>>>>>>>>>>>>> Even assuming that we calculate statistics perfectly (this
> >>>>> ain’t
> >>>>>>>>>> gonna
> >>>>>>>>>>>>>>>> happen), it’s virtually impossible to correctly estimate
> >>>>> correct
> >>>>>>>>>>>>>> exchange
> >>>>>>>>>>>>>>>> rate of CPU cycles vs IO operations as it is changing so
> >>> much
> >>>>>>> from
> >>>>>>>>>>>>>>>> deployment to deployment.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Is this the core of our disagreement here? That you would
> >>> like
> >>>>>>> this
> >>>>>>>>>>>>>>>> “cache()” to be mostly hint for the optimiser?
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Piotrek
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> On 11 Dec 2018, at 06:00, Becket Qin <
> becket.qin@gmail.com
> >>>>
> >>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Another potential concern for semantic 3 is that. In the
> >>>>> future,
> >>>>>>>>>> we
> >>>>>>>>>>>>>> may
> >>>>>>>>>>>>>>>> add
> >>>>>>>>>>>>>>>>> automatic caching to Flink. e.g. cache the intermediate
> >>>>> results
> >>>>>>> at
> >>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>> shuffle boundary. If our semantic is that reference to
> the
> >>>>>>>>>> original
> >>>>>>>>>>>>>>> table
> >>>>>>>>>>>>>>>>> means skipping cache, those users may not be able to
> >>> benefit
> >>>>>>> from
> >>>>>>>>>> the
> >>>>>>>>>>>>>>>>> implicit cache.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> On Tue, Dec 11, 2018 at 12:10 PM Becket Qin <
> >>>>>>> becket.qin@gmail.com
> >>>>>>>>>>>
> >>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Hi Piotrek,
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Thanks for the reply. Thought about it again, I might
> have
> >>>>>>>>>>>>>>> misunderstood
> >>>>>>>>>>>>>>>>>> your proposal in earlier emails. Returning a CachedTable
> >>>>> might
> >>>>>>>>>> not
> >>>>>>>>>>>>>> be
> >>>>>>>>>>>>>>> a
> >>>>>>>>>>>>>>>> bad
> >>>>>>>>>>>>>>>>>> idea.
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> I was more concerned about the semantic and its
> >>> intuitiveness
> >>>>>>>>>> when a
> >>>>>>>>>>>>>>>>>> CachedTable is returned. i.e., if cache() returns
> >>>>> CachedTable.
> >>>>>>>>>> What
> >>>>>>>>>>>>>>> are
> >>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>> semantics in the following code:
> >>>>>>>>>>>>>>>>>> {
> >>>>>>>>>>>>>>>>>> val cachedTable = a.cache()
> >>>>>>>>>>>>>>>>>> val b = cachedTable.select(...)
> >>>>>>>>>>>>>>>>>> val c = a.select(...)
> >>>>>>>>>>>>>>>>>> }
> >>>>>>>>>>>>>>>>>> What is the difference between b and c? At the first
> >>> glance,
> >>>>> I
> >>>>>>>>>> see
> >>>>>>>>>>>>>> two
> >>>>>>>>>>>>>>>>>> options:
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Semantic 1. b uses cachedTable as user demanded so. c
> uses
> >>>>>>>>>> original
> >>>>>>>>>>>>>>> DAG
> >>>>>>>>>>>>>>>> as
> >>>>>>>>>>>>>>>>>> user demanded so. In this case, the optimizer has no
> >>> chance
> >>>>> to
> >>>>>>>>>>>>>>> optimize.
> >>>>>>>>>>>>>>>>>> Semantic 2. b uses cachedTable as user demanded so. c
> >>> leaves
> >>>>>>> the
> >>>>>>>>>>>>>>>> optimizer
> >>>>>>>>>>>>>>>>>> to choose whether the cache or DAG should be used. In
> this
> >>>>>>> case,
> >>>>>>>>>>>>>> user
> >>>>>>>>>>>>>>>> lose
> >>>>>>>>>>>>>>>>>> the option to NOT use cache.
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> As you can see, neither of the options seem perfect.
> >>>>> However, I
> >>>>>>>>>>>>>> guess
> >>>>>>>>>>>>>>>> you
> >>>>>>>>>>>>>>>>>> and Till are proposing the third option:
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Semantic 3. b leaves the optimizer to choose whether
> >>> cache or
> >>>>>>> DAG
> >>>>>>>>>>>>>>> should
> >>>>>>>>>>>>>>>>>> be used. c always use the DAG.
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> This does address all the concerns. It is just that from
> >>>>>>>>>>>>>> intuitiveness
> >>>>>>>>>>>>>>>>>> perspective, I found that asking user to explicitly use
> a
> >>>>>>>>>>>>>> CachedTable
> >>>>>>>>>>>>>>>> while
> >>>>>>>>>>>>>>>>>> the optimizer might choose to ignore is a little weird.
> >>> That
> >>>>>>> was
> >>>>>>>>>>>>>> why I
> >>>>>>>>>>>>>>>> did
> >>>>>>>>>>>>>>>>>> not think about that semantic. But given there is
> material
> >>>>>>>>>> benefit,
> >>>>>>>>>>>>>> I
> >>>>>>>>>>>>>>>> think
> >>>>>>>>>>>>>>>>>> this semantic is acceptable.
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> 1. If we want to let optimiser make decisions whether to
> >>> use
> >>>>>>>>>> cache
> >>>>>>>>>>>>>> or
> >>>>>>>>>>>>>>>> not,
> >>>>>>>>>>>>>>>>>>> then why do we need “void cache()” method at all? Would
> >>> It
> >>>>>>>>>>>>>>> “increase”
> >>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>> chance of using the cache? That’s sounds strange. What
> >>> would
> >>>>>>> be
> >>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>> mechanism of deciding whether to use the cache or not?
> >>> If we
> >>>>>>>>>> want
> >>>>>>>>>>>>>> to
> >>>>>>>>>>>>>>>>>>> introduce such kind  automated optimisations of “plan
> >>> nodes
> >>>>>>>>>>>>>>>> deduplication”
> >>>>>>>>>>>>>>>>>>> I would turn it on globally, not per table, and let the
> >>>>>>>>>> optimiser
> >>>>>>>>>>>>>> do
> >>>>>>>>>>>>>>>> all of
> >>>>>>>>>>>>>>>>>>> the work.
> >>>>>>>>>>>>>>>>>>> 2. We do not have statistics at the moment for any
> >>> use/not
> >>>>> use
> >>>>>>>>>>>>>> cache
> >>>>>>>>>>>>>>>>>>> decision.
> >>>>>>>>>>>>>>>>>>> 3. Even if we had, I would be veeerryy sceptical
> whether
> >>>>> such
> >>>>>>>>>> cost
> >>>>>>>>>>>>>>>> based
> >>>>>>>>>>>>>>>>>>> optimisations would work properly and I would still
> >>> insist
> >>>>>>>>>> first on
> >>>>>>>>>>>>>>>>>>> providing explicit caching mechanism (`CachedTable
> >>> cache()`)
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> We are absolutely on the same page here. An explicit
> >>> cache()
> >>>>>>>>>> method
> >>>>>>>>>>>>>> is
> >>>>>>>>>>>>>>>>>> necessary not only because optimizer may not be able to
> >>> make
> >>>>>>> the
> >>>>>>>>>>>>>> right
> >>>>>>>>>>>>>>>>>> decision, but also because of the nature of interactive
> >>>>>>>>>> programming.
> >>>>>>>>>>>>>>> For
> >>>>>>>>>>>>>>>>>> example, if users write the following code in Scala
> shell:
> >>>>>>>>>>>>>>>>>> val b = a.select(...)
> >>>>>>>>>>>>>>>>>> val c = b.select(...)
> >>>>>>>>>>>>>>>>>> val d = c.select(...).writeToSink(...)
> >>>>>>>>>>>>>>>>>> tEnv.execute()
> >>>>>>>>>>>>>>>>>> There is no way optimizer will know whether b or c will
> be
> >>>>> used
> >>>>>>>>>> in
> >>>>>>>>>>>>>>> later
> >>>>>>>>>>>>>>>>>> code, unless users hint explicitly.
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> At the same time I’m not sure if you have responded to
> our
> >>>>>>>>>>>>>> objections
> >>>>>>>>>>>>>>> of
> >>>>>>>>>>>>>>>>>>> `void cache()` being implicit/having side effects,
> which
> >>> me,
> >>>>>>>>>> Jark,
> >>>>>>>>>>>>>>>> Fabian,
> >>>>>>>>>>>>>>>>>>> Till and I think also Shaoxuan are supporting.
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Is there any other side effects if we use semantic 3
> >>>>> mentioned
> >>>>>>>>>>>>>> above?
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> JIangjie (Becket) Qin
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> On Mon, Dec 10, 2018 at 7:54 PM Piotr Nowojski <
> >>>>>>>>>>>>>>> piotr@data-artisans.com
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Hi Becket,
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Sorry for not responding long time.
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Regarding case1.
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> There wouldn’t be no “a.unCache()” method, but I would
> >>>>> expect
> >>>>>>>>>> only
> >>>>>>>>>>>>>>>>>>> `cachedTableA1.dropCache()`. Dropping `cachedTableA1`
> >>>>> wouldn’t
> >>>>>>>>>>>>>> affect
> >>>>>>>>>>>>>>>>>>> `cachedTableA2`. Just as in any other database dropping
> >>>>>>>>>> modifying
> >>>>>>>>>>>>>> one
> >>>>>>>>>>>>>>>>>>> independent table/materialised view does not affect
> >>> others.
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> What I meant is that assuming there is already a
> cached
> >>>>>>> table,
> >>>>>>>>>>>>>>> ideally
> >>>>>>>>>>>>>>>>>>> users need
> >>>>>>>>>>>>>>>>>>>> not to specify whether the next query should read from
> >>> the
> >>>>>>>>>> cache
> >>>>>>>>>>>>>> or
> >>>>>>>>>>>>>>>> use
> >>>>>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>>> original DAG. This should be decided by the optimizer.
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> 1. If we want to let optimiser make decisions whether
> to
> >>> use
> >>>>>>>>>> cache
> >>>>>>>>>>>>>> or
> >>>>>>>>>>>>>>>>>>> not, then why do we need “void cache()” method at all?
> >>> Would
> >>>>>>> It
> >>>>>>>>>>>>>>>> “increase”
> >>>>>>>>>>>>>>>>>>> the chance of using the cache? That’s sounds strange.
> >>> What
> >>>>>>>>>> would be
> >>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>> mechanism of deciding whether to use the cache or not?
> >>> If we
> >>>>>>>>>> want
> >>>>>>>>>>>>>> to
> >>>>>>>>>>>>>>>>>>> introduce such kind  automated optimisations of “plan
> >>> nodes
> >>>>>>>>>>>>>>>> deduplication”
> >>>>>>>>>>>>>>>>>>> I would turn it on globally, not per table, and let the
> >>>>>>>>>> optimiser
> >>>>>>>>>>>>>> do
> >>>>>>>>>>>>>>>> all of
> >>>>>>>>>>>>>>>>>>> the work.
> >>>>>>>>>>>>>>>>>>> 2. We do not have statistics at the moment for any
> >>> use/not
> >>>>> use
> >>>>>>>>>>>>>> cache
> >>>>>>>>>>>>>>>>>>> decision.
> >>>>>>>>>>>>>>>>>>> 3. Even if we had, I would be veeerryy sceptical
> whether
> >>>>> such
> >>>>>>>>>> cost
> >>>>>>>>>>>>>>>> based
> >>>>>>>>>>>>>>>>>>> optimisations would work properly and I would still
> >>> insist
> >>>>>>>>>> first on
> >>>>>>>>>>>>>>>>>>> providing explicit caching mechanism (`CachedTable
> >>> cache()`)
> >>>>>>>>>>>>>>>>>>> 4. As Till wrote, having explicit `CachedTable cache()`
> >>>>>>> doesn’t
> >>>>>>>>>>>>>>>>>>> contradict future work on automated cost based caching.
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> At the same time I’m not sure if you have responded to
> >>> our
> >>>>>>>>>>>>>> objections
> >>>>>>>>>>>>>>>> of
> >>>>>>>>>>>>>>>>>>> `void cache()` being implicit/having side effects,
> which
> >>> me,
> >>>>>>>>>> Jark,
> >>>>>>>>>>>>>>>> Fabian,
> >>>>>>>>>>>>>>>>>>> Till and I think also Shaoxuan are supporting.
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Piotrek
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> On 5 Dec 2018, at 12:42, Becket Qin <
> >>> becket.qin@gmail.com>
> >>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> Hi Till,
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> It is true that after the first job submission, there
> >>> will
> >>>>> be
> >>>>>>>>>> no
> >>>>>>>>>>>>>>>>>>> ambiguity
> >>>>>>>>>>>>>>>>>>>> in terms of whether a cached table is used or not.
> That
> >>> is
> >>>>>>> the
> >>>>>>>>>>>>>> same
> >>>>>>>>>>>>>>>> for
> >>>>>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>>> cache() without returning a CachedTable.
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> Conceptually one could think of cache() as
> introducing a
> >>>>>>>>>> caching
> >>>>>>>>>>>>>>>>>>> operator
> >>>>>>>>>>>>>>>>>>>>> from which you need to consume from if you want to
> >>> benefit
> >>>>>>>>>> from
> >>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>> caching
> >>>>>>>>>>>>>>>>>>>>> functionality.
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> I am thinking a little differently. I think it is a
> hint
> >>>>> (as
> >>>>>>>>>> you
> >>>>>>>>>>>>>>>>>>> mentioned
> >>>>>>>>>>>>>>>>>>>> later) instead of a new operator. I'd like to be
> careful
> >>>>>>> about
> >>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>> semantic
> >>>>>>>>>>>>>>>>>>>> of the API. A hint is a property set on an existing
> >>>>> operator,
> >>>>>>>>>> but
> >>>>>>>>>>>>>> is
> >>>>>>>>>>>>>>>> not
> >>>>>>>>>>>>>>>>>>>> itself an operator as it does not really manipulate
> the
> >>>>> data.
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> I agree, ideally the optimizer makes this kind of
> >>> decision
> >>>>>>>>>> which
> >>>>>>>>>>>>>>>>>>>>> intermediate result should be cached. But especially
> >>> when
> >>>>>>>>>>>>>> executing
> >>>>>>>>>>>>>>>>>>> ad-hoc
> >>>>>>>>>>>>>>>>>>>>> queries the user might better know which results need
> >>> to
> >>>>> be
> >>>>>>>>>>>>>> cached
> >>>>>>>>>>>>>>>>>>> because
> >>>>>>>>>>>>>>>>>>>>> Flink might not see the full DAG. In that sense, I
> >>> would
> >>>>>>>>>> consider
> >>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>>>> cache() method as a hint for the optimizer. Of
> course,
> >>> in
> >>>>>>> the
> >>>>>>>>>>>>>>> future
> >>>>>>>>>>>>>>>> we
> >>>>>>>>>>>>>>>>>>>>> might add functionality which tries to automatically
> >>> cache
> >>>>>>>>>>>>>> results
> >>>>>>>>>>>>>>>>>>> (e.g.
> >>>>>>>>>>>>>>>>>>>>> caching the latest intermediate results until so and
> so
> >>>>> much
> >>>>>>>>>>>>>> space
> >>>>>>>>>>>>>>> is
> >>>>>>>>>>>>>>>>>>>>> used). But this should hopefully not contradict with
> >>>>>>>>>> `CachedTable
> >>>>>>>>>>>>>>>>>>> cache()`.
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> I agree that cache() method is needed for exactly the
> >>>>> reason
> >>>>>>>>>> you
> >>>>>>>>>>>>>>>>>>> mentioned,
> >>>>>>>>>>>>>>>>>>>> i.e. Flink cannot predict what users are going to
> write
> >>>>>>> later,
> >>>>>>>>>> so
> >>>>>>>>>>>>>>>> users
> >>>>>>>>>>>>>>>>>>>> need to tell Flink explicitly that this table will be
> >>> used
> >>>>>>>>>> later.
> >>>>>>>>>>>>>>>> What I
> >>>>>>>>>>>>>>>>>>>> meant is that assuming there is already a cached
> table,
> >>>>>>> ideally
> >>>>>>>>>>>>>>> users
> >>>>>>>>>>>>>>>>>>> need
> >>>>>>>>>>>>>>>>>>>> not to specify whether the next query should read from
> >>> the
> >>>>>>>>>> cache
> >>>>>>>>>>>>>> or
> >>>>>>>>>>>>>>>> use
> >>>>>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>>> original DAG. This should be decided by the optimizer.
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> To explain the difference between returning / not
> >>>>> returning a
> >>>>>>>>>>>>>>>>>>> CachedTable,
> >>>>>>>>>>>>>>>>>>>> I want compare the following two case:
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> *Case 1:  returning a CachedTable*
> >>>>>>>>>>>>>>>>>>>> b = a.map(...)
> >>>>>>>>>>>>>>>>>>>> val cachedTableA1 = a.cache()
> >>>>>>>>>>>>>>>>>>>> val cachedTableA2 = a.cache()
> >>>>>>>>>>>>>>>>>>>> b.print() // Just to make sure a is cached.
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> c = a.filter(...) // User specify that the original
> DAG
> >>> is
> >>>>>>>>>> used?
> >>>>>>>>>>>>>> Or
> >>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>>> optimizer decides whether DAG or cache should be used?
> >>>>>>>>>>>>>>>>>>>> d = cachedTableA1.filter() // User specify that the
> >>> cached
> >>>>>>>>>> table
> >>>>>>>>>>>>>> is
> >>>>>>>>>>>>>>>>>>> used.
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> a.unCache() // Can cachedTableA still be used
> >>> afterwards?
> >>>>>>>>>>>>>>>>>>>> cachedTableA1.uncache() // Can cachedTableA2 still be
> >>> used?
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> *Case 2: not returning a CachedTable*
> >>>>>>>>>>>>>>>>>>>> b = a.map()
> >>>>>>>>>>>>>>>>>>>> a.cache()
> >>>>>>>>>>>>>>>>>>>> a.cache() // no-op
> >>>>>>>>>>>>>>>>>>>> b.print() // Just to make sure a is cached
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> c = a.filter(...) // Optimizer decides whether the
> >>> cache or
> >>>>>>> DAG
> >>>>>>>>>>>>>>> should
> >>>>>>>>>>>>>>>>>>> be
> >>>>>>>>>>>>>>>>>>>> used
> >>>>>>>>>>>>>>>>>>>> d = a.filter(...) // Optimizer decides whether the
> >>> cache or
> >>>>>>> DAG
> >>>>>>>>>>>>>>> should
> >>>>>>>>>>>>>>>>>>> be
> >>>>>>>>>>>>>>>>>>>> used
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> a.unCache()
> >>>>>>>>>>>>>>>>>>>> a.unCache() // no-op
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> In case 1, semantic wise, optimizer lose the option to
> >>>>> choose
> >>>>>>>>>>>>>>> between
> >>>>>>>>>>>>>>>>>>> DAG
> >>>>>>>>>>>>>>>>>>>> and cache. And the unCache() call becomes tricky.
> >>>>>>>>>>>>>>>>>>>> In case 2, users do not need to worry about whether
> >>> cache
> >>>>> or
> >>>>>>>>>> DAG
> >>>>>>>>>>>>>> is
> >>>>>>>>>>>>>>>>>>> used.
> >>>>>>>>>>>>>>>>>>>> And the unCache() semantic is clear. However, the
> >>> caveat is
> >>>>>>>>>> that
> >>>>>>>>>>>>>>> users
> >>>>>>>>>>>>>>>>>>>> cannot explicitly ignore the cache.
> >>>>>>>>>>>>>>>>>>
> >>
> >>
>
>

Re: [DISCUSS] Support Interactive Programming in Flink Table API

Posted by Piotr Nowojski <pi...@da-platform.com>.
Hi,

I know that it still can have side effects and that’s why I wrote:

> Something like this might be a better (not perfect, but just a bit better):

My point was that this:

void foo(Table t) {
 val cachedT = t.cache();
 ...
 env.getCacheService().releaseCacheFor(cachedT);
}

Should communicate the potential side effects to the user in a better way compared to:

void foo(Table t) {
 val cachedT = t.cache();
 …
 cachedT.releaseCache();
}

Your option 3 has the problem of the API class being mutable on `.cache()` calls.

As I wrote before, we could use reference counting on `Table` or `CachedTable` returned from Option 4., but:

> I think that introducing ref counting could be confusing and it will be
> error prone, since Flink-table’s users are not used to closing/releasing
> resources.

I have a feeling that the inconvenience for the users in all of the use cases where they do not care about releasing the cache manually (which I would expect to be the vast majority) would overshadow the potential benefits of using ref counting. And it’s not like ref counting cannot cause problems of its own, with users wondering “why wasn’t my cache released?” (because of a dangling/not closed reference).
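
For concreteness, here is a minimal sketch of the ref counting under discussion (all names - CacheService, CacheHandle, cacheId - are illustrative only, not an agreed API):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

class CacheService {
    private final Map<String, AtomicInteger> refCounts = new ConcurrentHashMap<>();

    // Would be called by Table#cache(): increments the count for the cached result.
    CacheHandle acquire(String cacheId) {
        refCounts.computeIfAbsent(cacheId, k -> new AtomicInteger()).incrementAndGet();
        return new CacheHandle(this, cacheId);
    }

    // Would be called by CacheHandle#release(): the cache is physically dropped
    // only when the last handle is released.
    int release(String cacheId) {
        int remaining = refCounts.get(cacheId).decrementAndGet();
        if (remaining == 0) {
            refCounts.remove(cacheId);
            dropPhysicalStorage(cacheId);
        }
        return remaining;
    }

    private void dropPhysicalStorage(String cacheId) { /* free disk/memory backing the cache */ }
}

class CacheHandle {
    private final CacheService service;
    private final String cacheId;

    CacheHandle(CacheService service, String cacheId) {
        this.service = service;
        this.cacheId = cacheId;
    }

    int release() { return service.release(cacheId); }
}

Note how a single dangling (never released) handle in this sketch keeps the cache alive indefinitely, which is exactly the failure mode described above.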

Piotrek

> On 8 Jan 2019, at 14:06, Becket Qin <be...@gmail.com> wrote:
> 
> Just to clarify, when I say foo() like below, I assume that foo() must have
> a way to release its own cache, so it must have access to table env.
> 
> void foo(Table t) {
>  ...
>  t.cache(); // create cache for t
>  ...
>  env.getCacheService().releaseCacheFor(t); // release cache for t
> }
> 
> Thanks,
> 
> Jiangjie (Becket) Qin
> 
> On Tue, Jan 8, 2019 at 9:04 PM Becket Qin <be...@gmail.com> wrote:
> 
>> Hi Piotr,
>> 
>> I don't think it is feasible to ask every third-party library to have a
>> method signature with CacheService as an argument.
>> 
>> And even that signature does not really solve the problem. Imagine
>> function foo() looks like following:
>> 
>> void foo(Table t) {
>>  ...
>>  t.cache(); // create cache for t
>>  ...
>>  env.getCacheService().releaseCacheFor(t); // release cache for t
>> }
>> 
>> From function foo()'s perspective, it created a cache and released it.
>> However, if someone invokes foo like this:
>> {
>>  Table src = ...
>>  Table t = src.select(...).cache()
>>  foo(t)
>>  // t is uncached by foo() already.
>> }
>> 
>> So the "side effect" still exists.
>> 
>> I think the only safe way to ensure there is no side effect while sharing
>> the cache is to use ref count.
>> 
>> BTW, the discussion we are having here is exactly the reason that I prefer
>> option 3. From a technical perspective, option 3 solves all the concerns.
>> 
>> Thanks,
>> 
>> Jiangjie (Becket) Qin
>> 
>> 
>> On Tue, Jan 8, 2019 at 8:41 PM Piotr Nowojski <pi...@da-platform.com>
>> wrote:
>> 
>>> Hi,
>>> 
>>> I think that introducing ref counting could be confusing and it will be
>>> error prone, since Flink-table’s users are not used to closing/releasing
>>> resources. I was more objecting to placing the
>>> `uncache()`/`dropCache()`/`releaseCache()` (releaseCache sounds best to me)
>>> as a method on the “Table”. It might not be obvious that it will drop the
>>> cache for all of the usages of the given table. For example:
>>> 
>>> public void foo(Table t) {
>>> // …
>>> t.releaseCache();
>>> }
>>> 
>>> public void bar(Table t) {
>>>  // ...
>>> }
>>> 
>>> Table a = …
>>> val cachedA = a.cache()
>>> 
>>> foo(cachedA)
>>> bar(cachedA)
>>> 
>>> 
>>> My problem with the above example is that the `t.releaseCache()` call is
>>> not doing the best possible job of communicating to the user that it will
>>> have side effects on other places, like the `bar(cachedA)` call. Something
>>> like this might be better (not perfect, but just a bit better):
>>> 
>>> public void foo(Table t, CacheService cacheService) {
>>> // …
>>> cacheService.releaseCacheFor(t);
>>> }
>>> 
>>> Table a = …
>>> val cachedA = a.cache()
>>> 
>>> foo(cachedA, env.getCacheService())
>>> bar(cachedA)
>>> 
>>> 
>>> Also, from another perspective, maybe placing the `releaseCache()` method
>>> in Table might not be the best separation of concerns - the `releaseCache()`
>>> method seems significantly different compared to the other existing methods.
>>> 
>>> Piotrek
>>> 
>>>> On 8 Jan 2019, at 12:28, Becket Qin <be...@gmail.com> wrote:
>>>> 
>>>> Hi Piotr,
>>>> 
>>>> You are right. There might be two intuitive meanings when users call
>>>> 'a.uncache()', namely:
>>>> 1. Release the resource.
>>>> 2. Do not use the cache for the next operation.
>>>> 
>>>> Case (1) would likely be the dominant use case. So I would suggest we
>>>> dedicate the uncache() method to case (1), i.e. resource release, but not
>>>> ignoring the cache.
>>>> 
>>>> For case 2, i.e. explicitly ignoring the cache (which is rare), users may
>>>> use something like 'hint("ignoreCache")'. I think this is better, as it is
>>>> a little weird for users to call `a.uncache()` while they may not even
>>>> know if the table is cached at all.
>>>> 
>>>> Assuming we let `uncache()` only release the resource, one possibility is
>>>> using a ref count to mitigate the side effect: a ref count is incremented
>>>> on `cache()` and decremented on `uncache()`. That means `uncache()` does
>>>> not physically release the resource immediately, but just means the cache
>>>> could be released.
>>>> That being said, I am not sure if this is really a better solution, as it
>>>> seems a little counterintuitive. Maybe calling it releaseCache() helps a
>>>> little bit?
>>>> 
>>>> Thanks,
>>>> 
>>>> Jiangjie (Becket) Qin
>>>> 
>>>> 
>>>> 
>>>> 
>>>> On Tue, Jan 8, 2019 at 5:36 PM Piotr Nowojski <pi...@da-platform.com> wrote:
>>>> 
>>>>> Hi Becket,
>>>>> 
>>>>> With `uncache` there are probably two features that we can think about:
>>>>> 
>>>>> a)
>>>>> 
>>>>> Physically dropping the cached table from the storage, freeing up the
>>>>> resources
>>>>> 
>>>>> b)
>>>>> 
>>>>> Hinting the optimizer to not cache the reads for the next query/table
>>>>> 
>>>>> a) has the issue, as I wrote before, of seeming to be an operation
>>>>> inherently “flawed” by having side effects.
>>>>> 
>>>>> I’m not sure how it would be best to express it. We could make it work:
>>>>> 
>>>>> 1. via a method on a Table as you proposed:
>>>>> 
>>>>> void Table#dropCache()
>>>>> void Table#uncache()
>>>>> 
>>>>> 2. Operation on the environment
>>>>> 
>>>>> env.dropCacheFor(table) // or some other argument that allows the user
>>>>> to identify the desired cache
>>>>> 
>>>>> 3. Extending (from your original design doc) the `setTableService` method
>>>>> to return some control handle like:
>>>>> 
>>>>> TableServiceControl setTableService(TableFactory tf,
>>>>>                    TableProperties properties,
>>>>>                    TempTableCleanUpCallback cleanUpCallback);
>>>>> 
>>>>> (TableServiceControl? TableService? TableServiceHandle? CacheService?)
>>>>> 
>>>>> And having the drop cache method there:
>>>>> 
>>>>> TableServiceControl#dropCache(table)
>>>>> 
>>>>> Out of those options, option 1 might have the disadvantage of not making
>>>>> the user aware that this is a global operation with side effects.
>>>>> Like the old example of:
>>>>> 
>>>>> public void foo(Table t) {
>>>>> // …
>>>>> t.dropCache();
>>>>> }
>>>>> 
>>>>> It might not be immediately obvious that `t.dropCache()` is some kind of
>>>>> global operation, with side effects visible outside of the `foo`
>>>>> function.
>>>>> 
>>>>> On the other hand, both options 2 and 3 might have a greater chance of
>>>>> catching the user’s attention:
>>>>> 
>>>>> public void foo(Table t, CacheService cacheService) {
>>>>> // …
>>>>> cacheService.dropCache(t);
>>>>> }
>>>>> 
>>>>> b) could be achieved quite easily:
>>>>> 
>>>>> Table a = …
>>>>> val notCached1 = a.doNotCache()
>>>>> val cachedA = a.cache()
>>>>> val notCached2 = cachedA.doNotCache() // equivalent of notCached1
>>>>> 
>>>>> `doNotCache()` would behave similarly to `cache()` - return a copy of the
>>>>> table with the “cache” hint removed and/or a “never cache” hint added.
>>>>> 
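>>>>> To illustrate the copy-with-hint idea above, a rough sketch (class and
>>>>> field names here are made up for illustration, not a proposed API):
>>>>> 
>>>>> import java.util.HashSet;
>>>>> import java.util.Set;
>>>>> 
>>>>> // Hints live in an immutable copy; the original Table object never changes.
>>>>> class Table {
>>>>>     private final Object plan;        // stand-in for the logical plan
>>>>>     private final Set<String> hints;  // e.g. "cache", "neverCache"
>>>>> 
>>>>>     Table(Object plan, Set<String> hints) {
>>>>>         this.plan = plan;
>>>>>         this.hints = hints;
>>>>>     }
>>>>> 
>>>>>     Table cache()      { return withHint("cache", "neverCache"); }
>>>>>     Table doNotCache() { return withHint("neverCache", "cache"); }
>>>>> 
>>>>>     private Table withHint(String toAdd, String toRemove) {
>>>>>         Set<String> newHints = new HashSet<>(hints);
>>>>>         newHints.remove(toRemove);
>>>>>         newHints.add(toAdd);
>>>>>         return new Table(plan, newHints); // fresh copy, no side effects
>>>>>     }
>>>>> }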
>>>>> Piotrek
>>>>> 
>>>>> 
>>>>>> On 8 Jan 2019, at 03:17, Becket Qin <be...@gmail.com> wrote:
>>>>>> 
>>>>>> Hi Piotr,
>>>>>> 
>>>>>> Thanks for the proposal and detailed explanation. I like the idea of
>>>>>> returning a new hinted Table without modifying the original table. This
>>>>>> also leaves room for users to benefit from future implicit caching.
>>>>>> 
>>>>>> Just to make sure I get the full picture: in your proposal, there will
>>>>>> also be a 'void Table#uncache()' method to release the cache, right?
>>>>>> 
>>>>>> Thanks,
>>>>>> 
>>>>>> Jiangjie (Becket) Qin
>>>>>> 
>>>>>> On Mon, Jan 7, 2019 at 11:50 PM Piotr Nowojski <piotr@da-platform.com> wrote:
>>>>>> 
>>>>>>> Hi Becket!
>>>>>>> 
>>>>>>> After further thinking I tend to agree that my previous proposal
>>>>>>> (*Option 2*) indeed might not be ideal if we would in the future
>>>>>>> introduce automatic caching. However I would like to propose a
>>>>>>> slightly modified version of it:
>>>>>>> 
>>>>>>> *Option 4*
>>>>>>> 
>>>>>>> Adding a `cache()` method with the following signature:
>>>>>>> 
>>>>>>> Table Table#cache();
>>>>>>> 
>>>>>>> Without side effects: the `cache()` call does not modify/change the
>>>>>>> original Table in any way. It would return a copy of the original
>>>>>>> table, with an added hint for the optimizer to cache the table, so
>>>>>>> that future accesses to the returned table might be cached or not.
>>>>>>> 
>>>>>>> Assuming that we are talking about a setup where we do not have
>>>>>>> automatic caching enabled (a possible future extension).
>>>>>>> 
>>>>>>> Example #1:
>>>>>>> 
>>>>>>> ```
>>>>>>> Table a = …
>>>>>>> a.foo() // not cached
>>>>>>> 
>>>>>>> val cachedA = a.cache();
>>>>>>> 
>>>>>>> cachedA.bar() // maybe cached
>>>>>>> a.foo() // same as before - effectively not cached
>>>>>>> ```
>>>>>>> 
>>>>>>> Both the first and the second `a.foo()` operations would behave in
>>>>>>> exactly the same way. Again, the `a.cache()` call doesn’t affect `a`
>>>>>>> itself. If `a` was not hinted for caching before `a.cache();`, then
>>>>>>> both `a.foo()` calls wouldn’t use the cache.
>>>>>>> 
>>>>>>> The returned `cachedA` would be hinted with the “cache” hint, so
>>>>>>> probably `cachedA.bar()` would go through the cache (unless the
>>>>>>> optimiser decides the opposite).
>>>>>>> 
>>>>>>> Example #2
>>>>>>> 
>>>>>>> ```
>>>>>>> Table a = …
>>>>>>> 
>>>>>>> a.foo() // not cached
>>>>>>> 
>>>>>>> val b = a.cache();
>>>>>>> 
>>>>>>> a.foo() // same as before - effectively not cached
>>>>>>> b.foo() // maybe cached
>>>>>>> 
>>>>>>> val c = b.cache();
>>>>>>> 
>>>>>>> a.foo() // same as before - effectively not cached
>>>>>>> b.foo() // same as before - effectively maybe cached
>>>>>>> c.foo() // maybe cached
>>>>>>> ```
>>>>>>> 
>>>>>>> Now, assuming that we have some future “automatic caching
>>>>>>> optimisation”:
>>>>>>> 
>>>>>>> Example #3
>>>>>>> 
>>>>>>> ```
>>>>>>> env.enableAutomaticCaching()
>>>>>>> Table a = …
>>>>>>> 
>>>>>>> a.foo() // might be cached, depending on whether `a` was selected
>>>>>>> for automatic caching
>>>>>>> 
>>>>>>> val b = a.cache();
>>>>>>> 
>>>>>>> a.foo() // same as before - might be cached, if `a` was selected
>>>>>>> for automatic caching
>>>>>>> b.foo() // maybe cached
>>>>>>> ```
>>>>>>> 
>>>>>>> 
>>>>>>> More or less this is the same behaviour as:
>>>>>>> 
>>>>>>> Table a = ...
>>>>>>> val b = a.filter(x > 20)
>>>>>>> 
>>>>>>> Calling `filter` hasn’t changed or altered `a` in any way. If `a`
>>>>>>> was previously filtered:
>>>>>>> 
>>>>>>> Table src = …
>>>>>>> val a = src.filter(x > 20)
>>>>>>> val b = a.filter(x > 20)
>>>>>>> 
>>>>>>> then yes, `a` and `b` will be the same. But the point is that
>>>>>>> neither `filter` nor `cache` changes the original `a` table.
>>>>>>> 
>>>>>>> One thing is that, indeed, the physical drop-cache operation will
>>>>>>> have side effects and will in a way mutate the cached table
>>>>>>> references. But this is I think unavoidable in any solution - the
>>>>>>> same issue as calling `.close()`, or calling a destructor in C++.
>>>>>>> 
>>>>>>> Piotrek
>>>>>>> 
>>>>>>>> On 7 Jan 2019, at 10:41, Becket Qin <be...@gmail.com> wrote:
>>>>>>>> 
>>>>>>>> Happy New Year, everybody!
>>>>>>>> 
>>>>>>>> I would like to resume this discussion thread. At this point, we have
>>>>>>>> agreed on the first step goal of interactive programming. The open
>>>>>>>> discussion is the exact API. More specifically, what the *cache()*
>>>>>>>> method should return and what its semantics are. There are three
>>>>>>>> options:
>>>>>>>> 
>>>>>>>> *Option 1*
>>>>>>>> *void cache()* OR *Table cache()* which returns the original table
>>>>>>>> for chained calls.
>>>>>>>> *void uncache()* releases the cache.
>>>>>>>> *Table.hint(ignoreCache).foo()* to ignore cache for operation foo().
>>>>>>>> 
>>>>>>>> - Semantic: a.cache() hints that table 'a' should be cached. The
>>>>>>>> optimizer decides whether the cache will be used or not.
>>>>>>>> - pros: simple, and no confusion between CachedTable and the
>>>>>>>> original table
>>>>>>>> - cons: a table may be cached / uncached in a method invocation,
>>>>>>>> while the caller does not know about this.
>>>>>>>> 
>>>>>>>> *Option 2*
>>>>>>>> *CachedTable cache()*
>>>>>>>> *CachedTable* extends *Table* with an additional *uncache()* method
>>>>>>>> 
>>>>>>>> - Semantic: After *val cachedA = a.cache()*, *cachedA.foo()* will
>>>>>>>> always use the cache. *a.bar()* will always use the original DAG.
>>>>>>>> - pros: no potential side effects in method invocation.
>>>>>>>> - cons: the optimizer has no chance to kick in. Future optimization
>>>>>>>> will become a behavior change and will need users to change the code.
>>>>>>>> 
>>>>>>>> *Option 3*
>>>>>>>> *CacheHandle cache()*
>>>>>>>> *CacheHandle.release()* to release a cache handle on the table. If
>>>>>>>> all cache handles are released, the cache could be removed.
>>>>>>>> *Table.hint(ignoreCache).foo()* to ignore cache for operation foo().
>>>>>>>> 
>>>>>>>> - Semantic: *a.cache()* hints that 'a' should be cached. The
>>>>>>>> optimizer decides whether the cache will be used or not. The cache
>>>>>>>> is released either when no handle is on it, or when the user program
>>>>>>>> exits.
>>>>>>>> - pros: no potential side effect in method invocation. No confusion
>>>>>>>> between the cached table vs. the original table.
>>>>>>>> - cons: an additional CacheHandle exposed to the users.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Personally I prefer option 3 for the following reasons (a rough usage
>>>>>>>> sketch follows below):
>>>>>>>> 1. It is simple. The vast majority of users would just call
>>>>>>>> *a.cache()* followed by *a.foo()*, *a.bar()*, etc.
>>>>>>>> 2. There is no semantic ambiguity and no semantic change if we decide
>>>>>>>> to add implicit cache in the future.
>>>>>>>> 3. There is no side effect in the method calls.
>>>>>>>> 4. Admittedly we need to expose one more CacheHandle class to the
>>>>>>>> users. But it is not that difficult to understand given similar
>>>>>>>> well-known concepts like ref counting (we can name it CacheReference
>>>>>>>> if that is easier to understand). So I think it is fine.
>>>>>>>> 
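>>>>>>>> In code, the common path under Option 3 would look roughly like this
>>>>>>>> (an illustrative sketch using the signatures above; nothing here is
>>>>>>>> settled API):
>>>>>>>> 
>>>>>>>> CacheHandle handle = a.cache();  // hint: 'a' should be cached
>>>>>>>> a.foo();                         // optimizer decides: cache or DAG
>>>>>>>> a.hint(ignoreCache).bar();       // explicitly bypass the cache
>>>>>>>> handle.release();                // cache removable once all handles are released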
>>>>>>>> 
>>>>>>>> Thanks,
>>>>>>>> 
>>>>>>>> Jiangjie (Becket) Qin
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Thu, Dec 13, 2018 at 11:23 AM Becket Qin <be...@gmail.com> wrote:
>>>>>>>> 
>>>>>>>>> Hi Piotrek,
>>>>>>>>> 
>>>>>>>>> 1. Regarding optimization.
>>>>>>>>> Sure, there are many cases where the decision is hard to make. But
>>>>>>>>> that does not make it any easier for the users to make those
>>>>>>>>> decisions. I imagine 99% of the users would just naively use cache.
>>>>>>>>> I am not saying we can optimize in all the cases. But as long as we
>>>>>>>>> agree that at least in certain cases (I would argue most cases) the
>>>>>>>>> optimizer can do a little better than an average user who likely
>>>>>>>>> knows little about Flink internals, we should not push the burden of
>>>>>>>>> optimization to users.
>>>>>>>>> 
>>>>>>>>> BTW, it seems some of your concerns are related to the
>>>>>>>>> implementation. I did not mention the implementation of the caching
>>>>>>>>> service because that should not affect the API semantics. Not sure
>>>>>>>>> if this helps, but imagine the default implementation has one
>>>>>>>>> StorageNode service colocating with each TM. It could be running
>>>>>>>>> within the TM process or in a standalone process, depending on
>>>>>>>>> configuration.
>>>>>>>>> 
>>>>>>>>> The StorageNode uses a memory + spill-to-disk mechanism. The cached
>>>>>>>>> data will just be written to the local StorageNode service. If the
>>>>>>>>> StorageNode is running within the TM process, the in-memory cache
>>>>>>>>> could just be objects so we save some serde cost. A later job
>>>>>>>>> referring to the cached Table will be scheduled in a locality-aware
>>>>>>>>> manner, i.e. run in the TM whose peer StorageNode hosts the data.
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 2. Semantic
>>>>>>>>> I am not sure why introducing a new hintCache() or
>>>>>>>>> env.enableAutomaticCaching() method would avoid the consequence of a
>>>>>>>>> semantic change.
>>>>>>>>> 
>>>>>>>>> If the auto optimization is not enabled by default, users still need
>>>>>>>>> to make code changes to all existing programs in order to get the
>>>>>>>>> benefit.
>>>>>>>>> If the auto optimization is enabled by default, advanced users who
>>>>>>>>> know that they really want to use the cache will suddenly lose the
>>>>>>>>> opportunity to do so, unless they change the code to disable auto
>>>>>>>>> optimization.
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 3. Side effect
>>>>>>>>> The CacheHandle is not only about where to put uncache(). It is to
>>>>>>>>> solve the implicit performance impact by moving the uncache() to the
>>>>>>>>> CacheHandle.
>>>>>>>>> 
>>>>>>>>> - If users want to leverage the cache, they can call a.cache().
>>>>>>>>> After that, unless the user explicitly releases that CacheHandle,
>>>>>>>>> a.foo() will always leverage the cache if needed (the optimizer may
>>>>>>>>> choose to ignore the cache if that helps accelerate the process).
>>>>>>>>> Any function call will not be able to release the cache because it
>>>>>>>>> does not have that CacheHandle.
>>>>>>>>> - If some advanced users do not want to use the cache at all, they
>>>>>>>>> will call a.hint(ignoreCache).foo(). This will for sure ignore the
>>>>>>>>> cache and use the original DAG to process.
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>> In vast majority of the cases, users wouldn't really care whether
>>>>>>>>>> the cache is used or not.
>>>>>>>>>> I wouldn’t agree with that, because “caching” (if not purely
>>>>>>>>>> in-memory caching) would add additional IO costs. It’s similar to
>>>>>>>>>> saying that users would not see a difference between Spark/Flink
>>>>>>>>>> and MapReduce (MapReduce writes data to disks after every
>>>>>>>>>> map/reduce stage).
>>>>>>>>> 
>>>>>>>>> What I wanted to say is that in most cases, after users call
>>>>>>>>> cache(), they don't really care about whether auto optimization has
>>>>>>>>> decided to ignore the cache or not, as long as the program runs
>>>>>>>>> faster.
>>>>>>>>> 
>>>>>>>>> Thanks,
>>>>>>>>> 
>>>>>>>>> Jiangjie (Becket) Qin
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On Wed, Dec 12, 2018 at 10:50 PM Piotr Nowojski <piotr@data-artisans.com> wrote:
>>>>>>>>> 
>>>>>>>>>> Hi,
>>>>>>>>>> 
>>>>>>>>>> Thanks for the quick answer :)
>>>>>>>>>> 
>>>>>>>>>> Re 1.
>>>>>>>>>> 
>>>>>>>>>> I generally agree with you, however a couple of points:
>>>>>>>>>> 
>>>>>>>>>> a) the problem with using automatic caching is bigger, because you
>>>>>>>>>> will have to decide how you compare IO vs CPU costs, and if you
>>>>>>>>>> pick wrong, the additional IO costs might be enormous or can even
>>>>>>>>>> crash your system. This is a more difficult problem compared to,
>>>>>>>>>> let's say, join reordering, where the only issue is to have good
>>>>>>>>>> statistics that can capture correlations between columns (when you
>>>>>>>>>> reorder joins, the number of IO operations does not change)
>>>>>>>>>> b) your example is completely independent of caching.
>>>>>>>>>> 
>>>>>>>>>> A query like this:
>>>>>>>>>> 
>>>>>>>>>> src1.filter('f1 > 10).join(src2.filter('f2 < 30), `f1 === `f2).as('f3,
>>>>>>>>>> …).filter(‘f3 > 30)
>>>>>>>>>> 
>>>>>>>>>> should/could be optimised to an empty result immediately, without
>>>>>>>>>> the need for any cache/materialisation, and that should work even
>>>>>>>>>> without any statistics provided by the connector.
>>>>>>>>>> 
>>>>>>>>>> For me, a prerequisite to any serious cost-based optimisations
>>>>>>>>>> would be some reasonable benchmark coverage of the code (tpch?).
>>>>>>>>>> Otherwise that would be equivalent to adding untested code, since
>>>>>>>>>> we wouldn’t be able to verify our assumptions, like how the writing
>>>>>>>>>> of 10 000 records to a cache/RocksDB/Kafka/CSV file compares to
>>>>>>>>>> joining/filtering/processing of, let's say, 1 000 000 rows.
>>>>>>>>>> 
>>>>>>>>>> Re 2.
>>>>>>>>>> 
>>>>>>>>>> I wasn’t proposing to change the semantic later. I was proposing
>>>>>>>>>> that we start now:
>>>>>>>>>> 
>>>>>>>>>> CachedTable cachedA = a.cache()
>>>>>>>>>> cachedA.foo() // Cache is used
>>>>>>>>>> a.bar() // Original DAG is used
>>>>>>>>>> 
>>>>>>>>>> And then later we can think about adding for example
>>>>>>>>>> 
>>>>>>>>>> CachedTable cachedA = a.hintCache()
>>>>>>>>>> cachedA.foo() // Cache might be used
>>>>>>>>>> a.bar() // Original DAG is used
>>>>>>>>>> 
>>>>>>>>>> Or
>>>>>>>>>> 
>>>>>>>>>> env.enableAutomaticCaching()
>>>>>>>>>> a.foo() // Cache might be used
>>>>>>>>>> a.bar() // Cache might be used
>>>>>>>>>> 
>>>>>>>>>> Or (I would still not like this option):
>>>>>>>>>> 
>>>>>>>>>> a.hintCache()
>>>>>>>>>> a.foo() // Cache might be used
>>>>>>>>>> a.bar() // Cache might be used
>>>>>>>>>> 
>>>>>>>>>> Or whatever else will come to our mind. Even if we add some
>>>>>>>>>> automatic caching in the future, keeping explicit (`CachedTable
>>>>>>>>>> cache()`) caching will still be useful, at least in some cases.
>>>>>>>>>> 
>>>>>>>>>> Re 3.
>>>>>>>>>> 
>>>>>>>>>>> 2. The source tables are immutable during one run of batch
>>>>>>>>>>> processing logic.
>>>>>>>>>>> 3. The cache is immutable during one run of batch processing
>>>>>>>>>>> logic.
>>>>>>>>>> 
>>>>>>>>>>> I think assumptions 2 and 3 are by definition what batch
>>>>>>>>>>> processing means, i.e. the data must be complete before it is
>>>>>>>>>>> processed and should not change when the processing is running.
>>>>>>>>>> 
>>>>>>>>>> I agree that this is how batch systems SHOULD be working. However I
>>>>>>>>>> know from my previous experience that it’s not always the case.
>>>>>>>>>> Sometimes users are just working on some non-transactional storage,
>>>>>>>>>> which can be (either constantly or occasionally) modified by some
>>>>>>>>>> other processes for whatever reason (fixing the data, updating,
>>>>>>>>>> adding new data etc).
>>>>>>>>>> 
>>>>>>>>>> But even if we ignore this point (data immutability), the
>>>>>>>>>> performance side-effect issue of your proposal remains. If a user
>>>>>>>>>> calls `void a.cache()` deep inside some private method, it will
>>>>>>>>>> have implicit side effects on other parts of his program that might
>>>>>>>>>> not be obvious.
>>>>>>>>>> 
>>>>>>>>>> Re `CacheHandle`.
>>>>>>>>>> 
>>>>>>>>>> If I understand it correctly, it only addresses the issue of where
>>>>>>>>>> to place the `uncache`/`dropCache` method.
>>>>>>>>>> 
>>>>>>>>>> Btw,
>>>>>>>>>> 
>>>>>>>>>>> In vast majority of the cases, users wouldn't really care whether
>>>>>>>>>>> the cache is used or not.
>>>>>>>>>> 
>>>>>>>>>> I wouldn’t agree with that, because “caching” (if not purely
>>>>>>>>>> in-memory caching) would add additional IO costs. It’s similar to
>>>>>>>>>> saying that users would not see a difference between Spark/Flink
>>>>>>>>>> and MapReduce (MapReduce writes data to disks after every
>>>>>>>>>> map/reduce stage).
>>>>>>>>>> 
>>>>>>>>>> Piotrek
>>>>>>>>>> 
>>>>>>>>>>> On 12 Dec 2018, at 14:28, Becket Qin <be...@gmail.com> wrote:
>>>>>>>>>>> 
>>>>>>>>>>> Hi Piotrek,
>>>>>>>>>>> 
>>>>>>>>>>> Not sure if you noticed, but in my last email I was proposing
>>>>>>>>>>> `CacheHandle cache()` to avoid the potential side effects of
>>>>>>>>>>> function calls.
>>>>>>>>>>> 
>>>>>>>>>>> Let's look at the disagreements in your reply one by one.
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 1. Optimization chances
>>>>>>>>>>> 
>>>>>>>>>>> Optimization is never trivial work. This is exactly why we should
>>>>>>>>>>> not let users do it manually. Databases have done a huge amount of
>>>>>>>>>>> work in this area. At Alibaba, we rely heavily on many
>>>>>>>>>>> optimization rules to boost SQL query performance.
>>>>>>>>>>> 
>>>>>>>>>>> In your example, if I fill in the filter conditions in a certain
>>>>>>>>>>> way, the optimization becomes obvious.
>>>>>>>>>>> 
>>>>>>>>>>> Table src1 = … // read from connector 1
>>>>>>>>>>> Table src2 = … // read from connector 2
>>>>>>>>>>> 
>>>>>>>>>>> Table a = src1.filter('f1 > 10).join(src2.filter('f2 < 30), `f1 ===
>>>>>>>>>>> `f2).as('f3, ...)
>>>>>>>>>>> a.cache() // write cache to connector 3; when writing the records,
>>>>>>>>>>> remember min and max of `f1
>>>>>>>>>>> 
>>>>>>>>>>> a.filter('f3 > 30) // There is no need to read from any connector
>>>>>>>>>>> because `a` does not contain any record whose 'f3 is greater than
>>>>>>>>>>> 30.
>>>>>>>>>>> env.execute()
>>>>>>>>>>> a.select(…)
>>>>>>>>>>> 
>>>>>>>>>>> BTW, it seems to me that adding some basic statistics is fairly
>>>>>>>>>>> straightforward and the cost is pretty marginal, if not
>>>>>>>>>>> negligible. In fact it is not only needed for optimization, but
>>>>>>>>>>> also for cases such as ML, where some algorithms may need to
>>>>>>>>>>> decide their parameters based on the statistics of the data.
>>>>>>>>>>> 
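>>>>>>>>>>> A sketch of the kind of pruning check described above (purely
>>>>>>>>>>> illustrative; it assumes the max of 'f3 was recorded while writing
>>>>>>>>>>> the cache):
>>>>>>>>>>> 
>>>>>>>>>>> // Returns true when the cached table provably contains no row
>>>>>>>>>>> // matching the predicate 'f3 > lowerBound, so the scan (and any
>>>>>>>>>>> // downstream work) can be skipped entirely.
>>>>>>>>>>> static boolean canSkipScan(long cachedMaxF3, long lowerBound) {
>>>>>>>>>>>     // even the largest cached value fails 'f3 > lowerBound
>>>>>>>>>>>     return cachedMaxF3 <= lowerBound;
>>>>>>>>>>> }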
>>>>>>>>>>> 
>>>>>>>>>>> 2. Same API, one semantic now, another semantic later.
>>>>>>>>>>> 
>>>>>>>>>>> I am trying to understand what the semantics of the `CachedTable
>>>>>>>>>>> cache()` you are proposing are. IMO, we should avoid designing an
>>>>>>>>>>> API whose semantics will be changed later. If we have a
>>>>>>>>>>> "CachedTable cache()" method, then the semantics should be very
>>>>>>>>>>> clearly defined upfront and not change later. It should never be
>>>>>>>>>>> "right now let's go with semantic 1, later we can silently change
>>>>>>>>>>> it to semantic 2 or 3". Such a change could have bad consequences.
>>>>>>>>>>> For example, let's say we decide to go with semantic 1:
>>>>>>>>>>> 
>>>>>>>>>>> CachedTable cachedA = a.cache()
>>>>>>>>>>> cachedA.foo() // Cache is used
>>>>>>>>>>> a.bar() // Original DAG is used.
>>>>>>>>>>> 
>>>>>>>>>>> Now the majority of the users would be using cachedA.foo() in
>>>>>>>>>>> their code. And some advanced users will use a.bar() to explicitly
>>>>>>>>>>> skip the cache. Later on, we add smart optimization and change the
>>>>>>>>>>> semantic to semantic 2:
>>>>>>>>>>> 
>>>>>>>>>>> CachedTable cachedA = a.cache()
>>>>>>>>>>> cachedA.foo() // Cache is used
>>>>>>>>>>> a.bar() // Cache MIGHT be used, and Flink may decide to skip the
>>>>>>>>>>> cache if it is faster.
>>>>>>>>>>> 
>>>>>>>>>>> Now most of the users who were writing cachedA.foo() will not
>>>>>>>>>>> benefit from this optimization at all, unless they change their
>>>>>>>>>>> code to use a.foo() instead. And those advanced users suddenly
>>>>>>>>>>> lose the option to explicitly ignore the cache unless they change
>>>>>>>>>>> their code (assuming we care enough to provide something like
>>>>>>>>>>> hint(useCache)). If we don't define the semantics carefully, our
>>>>>>>>>>> users will have to change their code again and again, while they
>>>>>>>>>>> shouldn't have to.
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 3. Side effect.
>>>>>>>>>>> 
>>>>>>>>>>> Before we talk about side effects, we have to agree on the
>>>>>>>>>>> assumptions. The assumptions I have are the following:
>>>>>>>>>>> 1. We are talking about batch processing.
>>>>>>>>>>> 2. The source tables are immutable during one run of batch
>>>>>>>>>>> processing logic.
>>>>>>>>>>> 3. The cache is immutable during one run of batch processing
>>>>>>>>>>> logic.
>>>>>>>>>>> 
>>>>>>>>>>> I think assumptions 2 and 3 are by definition what batch
>>>>>>>>>>> processing means, i.e. the data must be complete before it is
>>>>>>>>>>> processed and should not change while the processing is running.
>>>>>>>>>>> 
>>>>>>>>>>> As far as I am aware, I don't know of any batch processing system
>>>>>>>>>>> breaking those assumptions. Even for relational database tables,
>>>>>>>>>>> where queries can run with concurrent modifications, necessary
>>>>>>>>>>> locking is still required to ensure the integrity of the query
>>>>>>>>>>> result.
>>>>>>>>>>> 
>>>>>>>>>>> Please let me know if you disagree with the above assumptions. If
>>>>>>>>>>> you agree with these assumptions, with the `CacheHandle cache()`
>>>>>>>>>>> API in my last email, do you still see side effects?
>>>>>>>>>>> 
>>>>>>>>>>> Thanks,
>>>>>>>>>>> 
>>>>>>>>>>> Jiangjie (Becket) Qin
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> On Wed, Dec 12, 2018 at 7:11 PM Piotr Nowojski <piotr@data-artisans.com> wrote:
>>>>>>>>>>> 
>>>>>>>>>>>> Hi Becket,
>>>>>>>>>>>> 
>>>>>>>>>>>>> Regarding the chance of optimization, it might not be that rare.
>>>>>>>>>>>>> Some very simple statistics could already help in many cases.
>>>>>>>>>>>>> For example, simply maintaining max and min of each field can
>>>>>>>>>>>>> already eliminate some unnecessary table scans (potentially
>>>>>>>>>>>>> scanning the cached table) if the result is doomed to be empty.
>>>>>>>>>>>>> A histogram would give even further information. The optimizer
>>>>>>>>>>>>> could be very careful and only ignore the cache when it is 100%
>>>>>>>>>>>>> sure doing that is cheaper, e.g. only when a filter on the cache
>>>>>>>>>>>>> will absolutely return nothing.
>>>>>>>>>>>> 
>>>>>>>>>>>> I do not see how this might be easy to achieve. It would require
>>>>>>>>>>>> tons of effort to make it work, and in the end you would still
>>>>>>>>>>>> have the problem of comparing/trading CPU cycles vs IO. For
>>>>>>>>>>>> example:
>>>>>>>>>>>> 
>>>>>>>>>>>> Table src1 = … // read from connector 1
>>>>>>>>>>>> Table src2 = … // read from connector 2
>>>>>>>>>>>> 
>>>>>>>>>>>> Table a = src1.filter(…).join(src2.filter(…), …)
>>>>>>>>>>>> a.cache() // write cache to connector 3
>>>>>>>>>>>> 
>>>>>>>>>>>> a.filter(…)
>>>>>>>>>>>> env.execute()
>>>>>>>>>>>> a.select(…)
>>>>>>>>>>>> 
>>>>>>>>>>>> The decision whether it’s better to:
>>>>>>>>>>>> A) read from connector1/connector2, filter/map and join them
>>>>>>>>>>>> twice
>>>>>>>>>>>> B) read from connector1/connector2, filter/map and join them
>>>>>>>>>>>> once, pay the price of writing to connector 3 and then reading
>>>>>>>>>>>> from it
>>>>>>>>>>>> 
>>>>>>>>>>>> is very far from trivial. `a` can end up much larger than `src1`
>>>>>>>>>>>> and `src2`, writes to connector 3 might be extremely slow, reads
>>>>>>>>>>>> from connector 3 can be slower compared to reads from connector 1
>>>>>>>>>>>> & 2, … . You really need to have extremely good statistics to
>>>>>>>>>>>> correctly assess the size of the output, and it would still be
>>>>>>>>>>>> failing many times (correlations etc). And keep in mind that at
>>>>>>>>>>>> the moment we do not have ANY statistics at all. More than that,
>>>>>>>>>>>> it would require significantly more testing and setting up some
>>>>>>>>>>>> benchmarks to make sure that we do not break it with some
>>>>>>>>>>>> regressions.
>>>>>>>>>>>> 
>>>>>>>>>>>> That’s why I’m strongly opposing this idea - at least let’s not
>>>>>>>>>>>> start with this. If we first start with completely
>>>>>>>>>>>> manual/explicit caching, without any magic, it would be a
>>>>>>>>>>>> significant improvement for the users for a fraction of the
>>>>>>>>>>>> development cost. After implementing that, when we already have
>>>>>>>>>>>> all of the working pieces, we can start working on some
>>>>>>>>>>>> optimisation rules. As I wrote before, if we start with
>>>>>>>>>>>> 
>>>>>>>>>>>> `CachedTable cache()`
>>>>>>>>>>>> 
>>>>>>>>>>>> we can later work on follow-up stories to make it automatic.
>>>>>>>>>>>> Despite that I don’t like the implicit/side-effect approach with
>>>>>>>>>>>> the `void` method, having explicit `CachedTable cache()` wouldn’t
>>>>>>>>>>>> even prevent us from later adding a `void hintCache()` method,
>>>>>>>>>>>> with the exact semantics that you want.
>>>>>>>>>>>> 
>>>>>>>>>>>> On top of that, I raise again that having implicit `void
>>>>>>>>>>>> cache()/hintCache()` has other side effects and problems with
>>>>>>>>>>>> non-immutable data, and is annoying when used secretly inside
>>>>>>>>>>>> methods.
>>>>>>>>>>>> 
>>>>>>>>>>>> Explicit `CachedTable cache()` just looks like a much less
>>>>>>>>>>>> controversial MVP, and if we decide to go further with this
>>>>>>>>>>>> topic, it’s not a wasted effort, but just lies on a straight path
>>>>>>>>>>>> to more advanced/complicated solutions in the future. Are there
>>>>>>>>>>>> any drawbacks of starting with `CachedTable cache()` that I’m
>>>>>>>>>>>> missing?
>>>>>>>>>>>> 
>>>>>>>>>>>> Piotrek
>>>>>>>>>>>> 
>>>>>>>>>>>>> On 12 Dec 2018, at 09:30, Jeff Zhang <zj...@gmail.com> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Hi Becket,
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Introducing CacheHandle seems too complicated. It means users
>>>>>>>>>>>>> have to maintain the handle properly.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> And since cache is just a hint for the optimizer, why not just
>>>>>>>>>>>>> return the Table itself from the cache method? This hint info
>>>>>>>>>>>>> should be kept in the Table, I believe.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> So how about adding cache and uncache methods to Table, both
>>>>>>>>>>>>> returning Table? Because what cache and uncache do is just add
>>>>>>>>>>>>> some hint info into the Table.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On Wed, Dec 12, 2018 at 11:25 AM, Becket Qin <be...@gmail.com> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Hi Till and Piotrek,
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Thanks for the clarification. That resolves quite a few
>>>>>>>>>>>>>> confusions. My understanding of how cache works is the same as
>>>>>>>>>>>>>> what Till describes, i.e. cache() is a hint to Flink, but it is
>>>>>>>>>>>>>> not guaranteed that the cache always exists, and it might be
>>>>>>>>>>>>>> recomputed from its lineage.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Is this the core of our disagreement here? That you would like
>>>>>>>>>>>>>>> this “cache()” to be mostly a hint for the optimiser?
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Semantics-wise, yes. That's also why I think materialize() has
>>>>>>>>>>>>>> a much larger scope than cache(), thus it should be a different
>>>>>>>>>>>>>> method.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Regarding the chance of optimization, it might not be that
>>>>>>>>>>>>>> rare. Some very simple statistics could already help in many
>>>>>>>>>>>>>> cases. For example, simply maintaining max and min of each
>>>>>>>>>>>>>> field can already eliminate some unnecessary table scans
>>>>>>>>>>>>>> (potentially scanning the cached table) if the result is doomed
>>>>>>>>>>>>>> to be empty. A histogram would give even further information.
>>>>>>>>>>>>>> The optimizer could be very careful and only ignore the cache
>>>>>>>>>>>>>> when it is 100% sure doing that is cheaper, e.g. only when a
>>>>>>>>>>>>>> filter on the cache will absolutely return nothing.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Given the above clarification on cache, I would like to revisit
>>>>>>>>>>>>>> the original "void cache()" proposal and see if we can improve
>>>>>>>>>>>>>> on top of that.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> What do you think about the following modified interface?
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Table {
>>>>>>>>>>>>>> /**
>>>>>>>>>>>>>> * This call hints Flink to maintain a cache of this table and
>>>>>>>>>>>>>> leverage it for performance optimization if needed.
>>>>>>>>>>>>>> * Note that Flink may still decide not to use the cache if that
>>>>>>>>>>>>>> is cheaper.
>>>>>>>>>>>>>> *
>>>>>>>>>>>>>> * A CacheHandle will be returned to allow the user to release
>>>>>>>>>>>>>> the cache actively. The cache will be deleted if there
>>>>>>>>>>>>>> * are no unreleased cache handles to it. When the
>>>>>>>>>>>>>> TableEnvironment is closed, the cache will also be deleted
>>>>>>>>>>>>>> * and all the cache handles will be released.
>>>>>>>>>>>>>> *
>>>>>>>>>>>>>> * @return a CacheHandle referring to the cache of this table.
>>>>>>>>>>>>>> */
>>>>>>>>>>>>>> CacheHandle cache();
>>>>>>>>>>>>>> }
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> CacheHandle {
>>>>>>>>>>>>>> /**
>>>>>>>>>>>>>> * Close the cache handle. This method does not necessarily
>>>>>>>>>>>>>> delete the cache. Instead, it simply decrements the reference
>>>>>>>>>>>>>> count of the cache.
>>>>>>>>>>>>>> * When there is no handle referring to a cache, the cache will
>>>>>>>>>>>>>> be deleted.
>>>>>>>>>>>>>> *
>>>>>>>>>>>>>> * @return the number of open handles to the cache after this
>>>>>>>>>>>>>> handle has been released.
>>>>>>>>>>>>>> */
>>>>>>>>>>>>>> int release()
>>>>>>>>>>>>>> }
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> The rationale behind this interface is the following:
>>>>>>>>>>>>>> In the vast majority of cases, users wouldn't really care
>>>>>>>>>>>>>> whether the cache is used or not. So I think the most intuitive
>>>>>>>>>>>>>> way is letting cache() return nothing, so nobody needs to worry
>>>>>>>>>>>>>> about the difference between operations on CachedTables and
>>>>>>>>>>>>>> those on the "original" tables. This will make maybe 99.9% of
>>>>>>>>>>>>>> the users happy. There were two concerns raised for this
>>>>>>>>>>>>>> approach:
>>>>>>>>>>>>>> 1. In some rare cases, users may want to ignore the cache,
>>>>>>>>>>>>>> 2. A table might be cached/uncached in a third-party function
>>>>>>>>>>>>>> while the caller does not know.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> For the first issue, users can use hint("ignoreCache") to
>>>>>>>>>>>>>> explicitly ignore the cache.
>>>>>>>>>>>>>> For the second issue, the above proposal lets cache() return a
>>>>>>>>>>>>>> CacheHandle whose only method is release(). Different
>>>>>>>>>>>>>> CacheHandles will refer to the same cache; if a cache no longer
>>>>>>>>>>>>>> has any cache handle, it will be deleted. This will address the
>>>>>>>>>>>>>> following case:
>>>>>>>>>>>>>> {
>>>>>>>>>>>>>> val handle1 = a.cache()
>>>>>>>>>>>>>> process(a)
>>>>>>>>>>>>>> a.select(...) // cache is still available, handle1 has not been
>>>>>>>>>>>>>> released.
>>>>>>>>>>>>>> }
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> void process(Table t) {
>>>>>>>>>>>>>> val handle2 = t.cache() // new handle to the cache
>>>>>>>>>>>>>> t.select(...) // optimizer decides cache usage
>>>>>>>>>>>>>> t.hint("ignoreCache").select(...) // cache is ignored
>>>>>>>>>>>>>> handle2.release() // release the handle, but the cache may
>>>>>>>>>>>>>> still be available if there are other handles
>>>>>>>>>>>>>> ...
>>>>>>>>>>>>>> }
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Does the above modified approach look reasonable to you?
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Jiangjie (Becket) Qin
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On Tue, Dec 11, 2018 at 6:44 PM Till Rohrmann <trohrmann@apache.org> wrote:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Hi Becket,
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> I was aiming at semantics similar to 1. I actually thought
>>>>>>>>>>>>>>> that `cache()` would tell the system to materialize the
>>>>>>>>>>>>>>> intermediate result so that subsequent queries don't need to
>>>>>>>>>>>>>>> reprocess it. This means that the usage of the cached table in
>>>>>>>>>>>>>>> this example
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> {
>>>>>>>>>>>>>>> val cachedTable = a.cache()
>>>>>>>>>>>>>>> val b1 = cachedTable.select(…)
>>>>>>>>>>>>>>> val b2 = cachedTable.foo().select(…)
>>>>>>>>>>>>>>> val b3 = cachedTable.bar().select(...)
>>>>>>>>>>>>>>> val c1 = a.select(…)
>>>>>>>>>>>>>>> val c2 = a.foo().select(…)
>>>>>>>>>>>>>>> val c3 = a.bar().select(...)
>>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> strongly depends on interleaved calls which trigger the
>>>>>>>>>>>>>>> execution of sub-queries. So for example, if there is only a
>>>>>>>>>>>>>>> single env.execute call at the end of the block, then b1, b2,
>>>>>>>>>>>>>>> b3, c1, c2 and c3 would all be computed by reading directly
>>>>>>>>>>>>>>> from the sources (given that there is only a single JobGraph).
>>>>>>>>>>>>>>> It just happens that the result of `a` will be cached such
>>>>>>>>>>>>>>> that we skip the processing of `a` when there are subsequent
>>>>>>>>>>>>>>> queries reading from `cachedTable`. If for some reason the
>>>>>>>>>>>>>>> system cannot materialize the table (e.g. running out of disk
>>>>>>>>>>>>>>> space, ttl expired), then it could also happen that we need to
>>>>>>>>>>>>>>> reprocess `a`. In that sense `cachedTable` simply is an
>>>>>>>>>>>>>>> identifier for the materialized result of `a`, together with
>>>>>>>>>>>>>>> the lineage of how to reprocess it.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>> Till
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> On Tue, Dec 11, 2018 at 11:01 AM Piotr Nowojski <
>>>>>>>>>>>> piotr@data-artisans.com
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Hi Becket,
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> {
>>>>>>>>>>>>>>>>> val cachedTable = a.cache()
>>>>>>>>>>>>>>>>> val b = cachedTable.select(...)
>>>>>>>>>>>>>>>>> val c = a.select(...)
>>>>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Semantic 1. b uses cachedTable as user demanded so. c uses
>>>>>>>>>> original
>>>>>>>>>>>>>> DAG
>>>>>>>>>>>>>>>> as
>>>>>>>>>>>>>>>>> user demanded so. In this case, the optimizer has no
>>> chance to
>>>>>>>>>>>>>>> optimize.
>>>>>>>>>>>>>>>>> Semantic 2. b uses cachedTable as user demanded so. c
>>> leaves
>>>>> the
>>>>>>>>>>>>>>>> optimizer
>>>>>>>>>>>>>>>>> to choose whether the cache or DAG should be used. In this
>>>>> case,
>>>>>>>>>> users
>>>>>>>>>>>>>>>> lose
>>>>>>>>>>>>>>>>> the option to NOT use cache.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> As you can see, neither of the options seem perfect.
>>> However,
>>>>> I
>>>>>>>>>> guess
>>>>>>>>>>>>>>> you
>>>>>>>>>>>>>>>>> and Till are proposing the third option:
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Semantic 3. b leaves the optimizer to choose whether cache
>>> or
>>>>>>> DAG
>>>>>>>>>>>>>>> should
>>>>>>>>>>>>>>>> be
>>>>>>>>>>>>>>>>> used. c always use the DAG.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> I am pretty sure that me, Till, Fabian and others were all
>>>>>>>>>> proposing
>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>> advocating in favour of semantic “1”. No cost based
>>> optimiser
>>>>>>>>>>>> decisions
>>>>>>>>>>>>>>> at
>>>>>>>>>>>>>>>> all.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> {
>>>>>>>>>>>>>>>> val cachedTable = a.cache()
>>>>>>>>>>>>>>>> val b1 = cachedTable.select(…)
>>>>>>>>>>>>>>>> val b2 = cachedTable.foo().select(…)
>>>>>>>>>>>>>>>> val b3 = cachedTable.bar().select(...)
>>>>>>>>>>>>>>>> val c1 = a.select(…)
>>>>>>>>>>>>>>>> val c2 = a.foo().select(…)
>>>>>>>>>>>>>>>> val c3 = a.bar().select(...)
>>>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> All b1, b2 and b3 are reading from cache, while c1, c2 and
>>> c3
>>>>> are
>>>>>>>>>>>>>>>> re-executing whole plan for “a”.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> In the future we could discuss going one step further,
>>>>>>> introducing
>>>>>>>>>>>> some
>>>>>>>>>>>>>>>> global optimisation (that can be manually enabled/disabled):
>>>>>>>>>>>>>> deduplicate
>>>>>>>>>>>>>>>> plan nodes/deduplicate sub queries/re-use sub queries
>>>>> results/or
>>>>>>>>>>>>>> whatever
>>>>>>>>>>>>>>>> we could call it. It could do two things:
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 1. Automatically try to deduplicate fragments of the plan
>>> and
>>>>>>> share
>>>>>>>>>>>> the
>>>>>>>>>>>>>>>> result using CachedTable - in other words automatically
>>> insert
>>>>>>>>>>>>>>> `CachedTable
>>>>>>>>>>>>>>>> cache()` calls.
>>>>>>>>>>>>>>>> 2. Automatically make decision to bypass explicit
>>> `CachedTable`
>>>>>>>>>> access
>>>>>>>>>>>>>>>> (this would be the equivalent of what you described as
>>>>> “semantic
>>>>>>>>>> 3”).
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> However as I wrote previously, I have big doubts if such
>>>>>>> cost-based
>>>>>>>>>>>>>>>> optimisation would work (this applies also to “Semantic
>>> 2”). I
>>>>>>>>>> would
>>>>>>>>>>>>>>> expect
>>>>>>>>>>>>>>>> it to do more harm than good in so many cases, that it
>>> wouldn’t
>>>>>>>>>> make
>>>>>>>>>>>>>>> sense.
>>>>>>>>>>>>>>>> Even assuming that we calculate statistics perfectly (this
>>>>> ain’t
>>>>>>>>>> gonna
>>>>>>>>>>>>>>>> happen), it’s virtually impossible to correctly estimate
>>>>> correct
>>>>>>>>>>>>>> exchange
>>>>>>>>>>>>>>>> rate of CPU cycles vs IO operations as it is changing so
>>> much
>>>>>>> from
>>>>>>>>>>>>>>>> deployment to deployment.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Is this the core of our disagreement here? That you would
>>> like
>>>>>>> this
>>>>>>>>>>>>>>>> “cache()” to be mostly hint for the optimiser?
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Piotrek
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> On 11 Dec 2018, at 06:00, Becket Qin <becket.qin@gmail.com
>>>> 
>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Another potential concern for semantic 3 is that. In the
>>>>> future,
>>>>>>>>>> we
>>>>>>>>>>>>>> may
>>>>>>>>>>>>>>>> add
>>>>>>>>>>>>>>>>> automatic caching to Flink. e.g. cache the intermediate
>>>>> results
>>>>>>> at
>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>> shuffle boundary. If our semantic is that reference to the
>>>>>>>>>> original
>>>>>>>>>>>>>>> table
>>>>>>>>>>>>>>>>> means skipping cache, those users may not be able to
>>> benefit
>>>>>>> from
>>>>>>>>>> the
>>>>>>>>>>>>>>>>> implicit cache.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> On Tue, Dec 11, 2018 at 12:10 PM Becket Qin <
>>>>>>> becket.qin@gmail.com
>>>>>>>>>>> 
>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Hi Piotrek,
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Thanks for the reply. Thought about it again, I might have
>>>>>>>>>>>>>>> misunderstood
>>>>>>>>>>>>>>>>>> your proposal in earlier emails. Returning a CachedTable
>>>>> might
>>>>>>>>>> not
>>>>>>>>>>>>>> be
>>>>>>>>>>>>>>> a
>>>>>>>>>>>>>>>> bad
>>>>>>>>>>>>>>>>>> idea.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> I was more concerned about the semantic and its
>>> intuitiveness
>>>>>>>>>> when a
>>>>>>>>>>>>>>>>>> CachedTable is returned. i..e, if cache() returns
>>>>> CachedTable.
>>>>>>>>>> What
>>>>>>>>>>>>>>> are
>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>> semantic in the following code:
>>>>>>>>>>>>>>>>>> {
>>>>>>>>>>>>>>>>>> val cachedTable = a.cache()
>>>>>>>>>>>>>>>>>> val b = cachedTable.select(...)
>>>>>>>>>>>>>>>>>> val c = a.select(...)
>>>>>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>>>>> What is the difference between b and c? At the first
>>> glance,
>>>>> I
>>>>>>>>>> see
>>>>>>>>>>>>>> two
>>>>>>>>>>>>>>>>>> options:
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Semantic 1. b uses cachedTable as user demanded so. c uses
>>>>>>>>>> original
>>>>>>>>>>>>>>> DAG
>>>>>>>>>>>>>>>> as
>>>>>>>>>>>>>>>>>> user demanded so. In this case, the optimizer has no
>>> chance
>>>>> to
>>>>>>>>>>>>>>> optimize.
>>>>>>>>>>>>>>>>>> Semantic 2. b uses cachedTable as user demanded so. c
>>> leaves
>>>>>>> the
>>>>>>>>>>>>>>>> optimizer
>>>>>>>>>>>>>>>>>> to choose whether the cache or DAG should be used. In this
>>>>>>> case,
>>>>>>>>>>>>>> user
>>>>>>>>>>>>>>>> lose
>>>>>>>>>>>>>>>>>> the option to NOT use cache.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> As you can see, neither of the options seem perfect.
>>>>> However, I
>>>>>>>>>>>>>> guess
>>>>>>>>>>>>>>>> you
>>>>>>>>>>>>>>>>>> and Till are proposing the third option:
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Semantic 3. b leaves the optimizer to choose whether
>>> cache or
>>>>>>> DAG
>>>>>>>>>>>>>>> should
>>>>>>>>>>>>>>>>>> be used. c always use the DAG.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> This does address all the concerns. It is just that from
>>>>>>>>>>>>>> intuitiveness
>>>>>>>>>>>>>>>>>> perspective, I found that asking user to explicitly use a
>>>>>>>>>>>>>> CachedTable
>>>>>>>>>>>>>>>> while
>>>>>>>>>>>>>>>>>> the optimizer might choose to ignore is a little weird.
>>> That
>>>>>>> was
>>>>>>>>>>>>>> why I
>>>>>>>>>>>>>>>> did
>>>>>>>>>>>>>>>>>> not think about that semantic. But given there is material
>>>>>>>>>> benefit,
>>>>>>>>>>>>>> I
>>>>>>>>>>>>>>>> think
>>>>>>>>>>>>>>>>>> this semantic is acceptable.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 1. If we want to let optimiser make decisions whether to
>>> use
>>>>>>>>>> cache
>>>>>>>>>>>>>> or
>>>>>>>>>>>>>>>> not,
>>>>>>>>>>>>>>>>>>> then why do we need “void cache()” method at all? Would
>>> It
>>>>>>>>>>>>>>> “increase”
>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>> chance of using the cache? That’s sounds strange. What
>>> would
>>>>>>> be
>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>> mechanism of deciding whether to use the cache or not?
>>> If we
>>>>>>>>>> want
>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>> introduce such kind  automated optimisations of “plan
>>> nodes
>>>>>>>>>>>>>>>> deduplication”
>>>>>>>>>>>>>>>>>>> I would turn it on globally, not per table, and let the
>>>>>>>>>> optimiser
>>>>>>>>>>>>>> do
>>>>>>>>>>>>>>>> all of
>>>>>>>>>>>>>>>>>>> the work.
>>>>>>>>>>>>>>>>>>> 2. We do not have statistics at the moment for any
>>> use/not
>>>>> use
>>>>>>>>>>>>>> cache
>>>>>>>>>>>>>>>>>>> decision.
>>>>>>>>>>>>>>>>>>> 3. Even if we had, I would be veeerryy sceptical whether
>>>>> such
>>>>>>>>>> cost
>>>>>>>>>>>>>>>> based
>>>>>>>>>>>>>>>>>>> optimisations would work properly and I would still
>>> insist
>>>>>>>>>> first on
>>>>>>>>>>>>>>>>>>> providing explicit caching mechanism (`CachedTable
>>> cache()`)
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> We are absolutely on the same page here. An explicit
>>> cache()
>>>>>>>>>> method
>>>>>>>>>>>>>> is
>>>>>>>>>>>>>>>>>> necessary not only because optimizer may not be able to
>>> make
>>>>>>> the
>>>>>>>>>>>>>> right
>>>>>>>>>>>>>>>>>> decision, but also because of the nature of interactive
>>>>>>>>>> programming.
>>>>>>>>>>>>>>> For
>>>>>>>>>>>>>>>>>> example, if users write the following code in Scala shell:
>>>>>>>>>>>>>>>>>> val b = a.select(...)
>>>>>>>>>>>>>>>>>> val c = b.select(...)
>>>>>>>>>>>>>>>>>> val d = c.select(...).writeToSink(...)
>>>>>>>>>>>>>>>>>> tEnv.execute()
>>>>>>>>>>>>>>>>>> There is no way optimizer will know whether b or c will be
>>>>> used
>>>>>>>>>> in
>>>>>>>>>>>>>>> later
>>>>>>>>>>>>>>>>>> code, unless users hint explicitly.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> At the same time I’m not sure if you have responded to our
>>>>>>>>>>>>>> objections
>>>>>>>>>>>>>>> of
>>>>>>>>>>>>>>>>>>> `void cache()` being implicit/having side effects, which
>>> me,
>>>>>>>>>> Jark,
>>>>>>>>>>>>>>>> Fabian,
>>>>>>>>>>>>>>>>>>> Till and I think also Shaoxuan are supporting.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Is there any other side effects if we use semantic 3
>>>>> mentioned
>>>>>>>>>>>>>> above?
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Jiangjie (Becket) Qin
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> On Mon, Dec 10, 2018 at 7:54 PM Piotr Nowojski <
>>>>>>>>>>>>>>> piotr@data-artisans.com
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Hi Becket,
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Sorry for not responding long time.
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Regarding case1.
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> There wouldn’t be an “a.unCache()” method, but I would
>>>>> expect
>>>>>>>>>> only
>>>>>>>>>>>>>>>>>>> `cachedTableA1.dropCache()`. Dropping `cachedTableA1`
>>>>> wouldn’t
>>>>>>>>>>>>>> affect
>>>>>>>>>>>>>>>>>>> `cachedTableA2`. Just as in any other database dropping
>>>>>>>>>> or modifying
>>>>>>>>>>>>>> one
>>>>>>>>>>>>>>>>>>> independent table/materialised view does not affect
>>> others.
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> What I meant is that assuming there is already a cached
>>>>>>> table,
>>>>>>>>>>>>>>> ideally
>>>>>>>>>>>>>>>>>>> users need
>>>>>>>>>>>>>>>>>>>> not to specify whether the next query should read from
>>> the
>>>>>>>>>> cache
>>>>>>>>>>>>>> or
>>>>>>>>>>>>>>>> use
>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>> original DAG. This should be decided by the optimizer.
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> 1. If we want to let optimiser make decisions whether to
>>> use
>>>>>>>>>> cache
>>>>>>>>>>>>>> or
>>>>>>>>>>>>>>>>>>> not, then why do we need “void cache()” method at all?
>>> Would
>>>>>>> It
>>>>>>>>>>>>>>>> “increase”
>>>>>>>>>>>>>>>>>>> the chance of using the cache? That’s sounds strange.
>>> What
>>>>>>>>>> would be
>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>> mechanism of deciding whether to use the cache or not?
>>> If we
>>>>>>>>>> want
>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>> introduce such kind  automated optimisations of “plan
>>> nodes
>>>>>>>>>>>>>>>> deduplication”
>>>>>>>>>>>>>>>>>>> I would turn it on globally, not per table, and let the
>>>>>>>>>> optimiser
>>>>>>>>>>>>>> do
>>>>>>>>>>>>>>>> all of
>>>>>>>>>>>>>>>>>>> the work.
>>>>>>>>>>>>>>>>>>> 2. We do not have statistics at the moment for any
>>> use/not
>>>>> use
>>>>>>>>>>>>>> cache
>>>>>>>>>>>>>>>>>>> decision.
>>>>>>>>>>>>>>>>>>> 3. Even if we had, I would be veeerryy sceptical whether
>>>>> such
>>>>>>>>>> cost
>>>>>>>>>>>>>>>> based
>>>>>>>>>>>>>>>>>>> optimisations would work properly and I would still
>>> insist
>>>>>>>>>> first on
>>>>>>>>>>>>>>>>>>> providing explicit caching mechanism (`CachedTable
>>> cache()`)
>>>>>>>>>>>>>>>>>>> 4. As Till wrote, having explicit `CachedTable cache()`
>>>>>>> doesn’t
>>>>>>>>>>>>>>>>>>> contradict future work on automated cost based caching.
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> At the same time I’m not sure if you have responded to
>>> our
>>>>>>>>>>>>>> objections
>>>>>>>>>>>>>>>> of
>>>>>>>>>>>>>>>>>>> `void cache()` being implicit/having side effects, which
>>> me,
>>>>>>>>>> Jark,
>>>>>>>>>>>>>>>> Fabian,
>>>>>>>>>>>>>>>>>>> Till and I think also Shaoxuan are supporting.
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Piotrek
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> On 5 Dec 2018, at 12:42, Becket Qin <
>>> becket.qin@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> Hi Till,
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> It is true that after the first job submission, there
>>> will
>>>>> be
>>>>>>>>>> no
>>>>>>>>>>>>>>>>>>> ambiguity
>>>>>>>>>>>>>>>>>>>> in terms of whether a cached table is used or not. That
>>> is
>>>>>>> the
>>>>>>>>>>>>>> same
>>>>>>>>>>>>>>>> for
>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>> cache() without returning a CachedTable.
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> Conceptually one could think of cache() as introducing a
>>>>>>>>>> caching
>>>>>>>>>>>>>>>>>>> operator
>>>>>>>>>>>>>>>>>>>>> from which you need to consume from if you want to
>>> benefit
>>>>>>>>>> from
>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>> caching
>>>>>>>>>>>>>>>>>>>>> functionality.
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> I am thinking a little differently. I think it is a hint
>>>>> (as
>>>>>>>>>> you
>>>>>>>>>>>>>>>>>>> mentioned
>>>>>>>>>>>>>>>>>>>> later) instead of a new operator. I'd like to be careful
>>>>>>> about
>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>> semantic
>>>>>>>>>>>>>>>>>>>> of the API. A hint is a property set on an existing
>>>>> operator,
>>>>>>>>>> but
>>>>>>>>>>>>>> is
>>>>>>>>>>>>>>>> not
>>>>>>>>>>>>>>>>>>>> itself an operator as it does not really manipulate the
>>>>> data.
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> I agree, ideally the optimizer makes this kind of
>>> decision
>>>>>>>>>> which
>>>>>>>>>>>>>>>>>>>>> intermediate result should be cached. But especially
>>> when
>>>>>>>>>>>>>> executing
>>>>>>>>>>>>>>>>>>> ad-hoc
>>>>>>>>>>>>>>>>>>>>> queries the user might better know which results need
>>> to
>>>>> be
>>>>>>>>>>>>>> cached
>>>>>>>>>>>>>>>>>>> because
>>>>>>>>>>>>>>>>>>>>> Flink might not see the full DAG. In that sense, I
>>> would
>>>>>>>>>> consider
>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>> cache() method as a hint for the optimizer. Of course,
>>> in
>>>>>>> the
>>>>>>>>>>>>>>> future
>>>>>>>>>>>>>>>> we
>>>>>>>>>>>>>>>>>>>>> might add functionality which tries to automatically
>>> cache
>>>>>>>>>>>>>> results
>>>>>>>>>>>>>>>>>>> (e.g.
>>>>>>>>>>>>>>>>>>>>> caching the latest intermediate results until so and so
>>>>> much
>>>>>>>>>>>>>> space
>>>>>>>>>>>>>>> is
>>>>>>>>>>>>>>>>>>>>> used). But this should hopefully not contradict with
>>>>>>>>>> `CachedTable
>>>>>>>>>>>>>>>>>>> cache()`.
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> I agree that cache() method is needed for exactly the
>>>>> reason
>>>>>>>>>> you
>>>>>>>>>>>>>>>>>>> mentioned,
>>>>>>>>>>>>>>>>>>>> i.e. Flink cannot predict what users are going to write
>>>>>>> later,
>>>>>>>>>> so
>>>>>>>>>>>>>>>> users
>>>>>>>>>>>>>>>>>>>> need to tell Flink explicitly that this table will be
>>> used
>>>>>>>>>> later.
>>>>>>>>>>>>>>>> What I
>>>>>>>>>>>>>>>>>>>> meant is that assuming there is already a cached table,
>>>>>>> ideally
>>>>>>>>>>>>>>> users
>>>>>>>>>>>>>>>>>>> need
>>>>>>>>>>>>>>>>>>>> not to specify whether the next query should read from
>>> the
>>>>>>>>>> cache
>>>>>>>>>>>>>> or
>>>>>>>>>>>>>>>> use
>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>> original DAG. This should be decided by the optimizer.
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> To explain the difference between returning / not
>>>>> returning a
>>>>>>>>>>>>>>>>>>> CachedTable,
>>>>>>>>>>>>>>>>>>>> I want to compare the following two cases:
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> *Case 1:  returning a CachedTable*
>>>>>>>>>>>>>>>>>>>> b = a.map(...)
>>>>>>>>>>>>>>>>>>>> val cachedTableA1 = a.cache()
>>>>>>>>>>>>>>>>>>>> val cachedTableA2 = a.cache()
>>>>>>>>>>>>>>>>>>>> b.print() // Just to make sure a is cached.
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> c = a.filter(...) // User specify that the original DAG
>>> is
>>>>>>>>>> used?
>>>>>>>>>>>>>> Or
>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>> optimizer decides whether DAG or cache should be used?
>>>>>>>>>>>>>>>>>>>> d = cachedTableA1.filter() // User specify that the
>>> cached
>>>>>>>>>> table
>>>>>>>>>>>>>> is
>>>>>>>>>>>>>>>>>>> used.
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> a.unCache() // Can cachedTableA still be used
>>> afterwards?
>>>>>>>>>>>>>>>>>>>> cachedTableA1.uncache() // Can cachedTableA2 still be
>>> used?
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> *Case 2: not returning a CachedTable*
>>>>>>>>>>>>>>>>>>>> b = a.map()
>>>>>>>>>>>>>>>>>>>> a.cache()
>>>>>>>>>>>>>>>>>>>> a.cache() // no-op
>>>>>>>>>>>>>>>>>>>> b.print() // Just to make sure a is cached
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> c = a.filter(...) // Optimizer decides whether the
>>> cache or
>>>>>>> DAG
>>>>>>>>>>>>>>> should
>>>>>>>>>>>>>>>>>>> be
>>>>>>>>>>>>>>>>>>>> used
>>>>>>>>>>>>>>>>>>>> d = a.filter(...) // Optimizer decides whether the
>>> cache or
>>>>>>> DAG
>>>>>>>>>>>>>>> should
>>>>>>>>>>>>>>>>>>> be
>>>>>>>>>>>>>>>>>>>> used
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> a.unCache()
>>>>>>>>>>>>>>>>>>>> a.unCache() // no-op
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> In case 1, semantic wise, optimizer lose the option to
>>>>> choose
>>>>>>>>>>>>>>> between
>>>>>>>>>>>>>>>>>>> DAG
>>>>>>>>>>>>>>>>>>>> and cache. And the unCache() call becomes tricky.
>>>>>>>>>>>>>>>>>>>> In case 2, users do not need to worry about whether
>>> cache
>>>>> or
>>>>>>>>>> DAG
>>>>>>>>>>>>>> is
>>>>>>>>>>>>>>>>>>> used.
>>>>>>>>>>>>>>>>>>>> And the unCache() semantic is clear. However, the
>>> caveat is
>>>>>>>>>> that
>>>>>>>>>>>>>>> users
>>>>>>>>>>>>>>>>>>>> cannot explicitly ignore the cache.
>>>>>>>>>>>>>>>>>> 
>> 
>> 


Re: [DISCUSS] Support Interactive Programming in Flink Table API

Posted by Becket Qin <be...@gmail.com>.
Just to clarify, when I say foo() like below, I assume that foo() must have
a way to release its own cache, so it must have access to the table env.

void foo(Table t) {
  ...
  t.cache(); // create cache for t
  ...
  env.getCacheService().releaseCacheFor(t); // release cache for t
}

Thanks,

Jiangjie (Becket) Qin

On Tue, Jan 8, 2019 at 9:04 PM Becket Qin <be...@gmail.com> wrote:

> Hi Piotr,
>
> I don't think it is feasible to ask every third party library to have
> a method signature with CacheService as an argument.
>
> And even that signature does not really solve the problem. Imagine
> function foo() looks like following:
>
> void foo(Table t) {
>   ...
>   t.cache(); // create cache for t
>   ...
>   env.getCacheService().releaseCacheFor(t); // release cache for t
> }
>
> From function foo()'s perspective, it created a cache and released it.
> However, if someone invokes foo like this:
> {
>   Table src = ...
>   Table t = src.select(...).cache()
>   foo(t)
>   // t is uncached by foo() already.
> }
>
> So the "side effect" still exists.
>
> I think the only safe way to ensure there is no side effect while sharing
> the cache is to use ref count.
>
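> A minimal sketch of what I mean by ref count (the class body below is
> purely illustrative - only the releaseCacheFor() name comes from the
> example above):
>
> import java.util.HashMap;
> import java.util.Map;
>
> public class CacheService {
>   // Number of unreleased cache handles per cached table.
>   private final Map<Table, Integer> refCounts = new HashMap<>();
>
>   // Called by t.cache(): increment the ref count for t.
>   public synchronized void registerHandle(Table t) {
>     refCounts.merge(t, 1, Integer::sum);
>   }
>
>   // Decrement the ref count and drop the materialized result only when
>   // no unreleased handles are left.
>   public synchronized void releaseCacheFor(Table t) {
>     Integer remaining = refCounts.computeIfPresent(t, (k, v) -> v - 1);
>     if (remaining != null && remaining == 0) {
>       refCounts.remove(t);
>       dropPhysicalCache(t);
>     }
>   }
>
>   private void dropPhysicalCache(Table t) {
>     // Delete the materialized intermediate result of t.
>   }
> }
>
> With this, foo() only releases the handle it took itself, so a cache
> created by the caller survives until the caller releases it as well.
>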
> BTW, the discussion we are having here is exactly the reason that I prefer
> option 3. From technical perspective option 3 solves all the concerns.
>
> Thanks,
>
> Jiangjie (Becket) Qin
>
>
> On Tue, Jan 8, 2019 at 8:41 PM Piotr Nowojski <pi...@da-platform.com>
> wrote:
>
>> Hi,
>>
>> I think that introducing ref counting could be confusing and it will be
>> error prone, since Flink-table’s users are not used to closing/releasing
>> resources. I was more objecting placing the
>> `uncache()`/`dropCache()`/`releaseCache()` (releaseCache sounds best to me)
>> as a method in the “Table”. It might be not obvious that it will drop the
>> cache for all of the usages of the given table. For example:
>>
>> public void foo(Table t) {
>>  // …
>>  t.releaseCache();
>> }
>>
>> public void bar(Table t) {
>>   // ...
>> }
>>
>> Table a = …
>> val cachedA = a.cache()
>>
>> foo(cachedA)
>> bar(cachedA)
>>
>>
>> My problem with the above example is that the `t.releaseCache()` call is not
>> doing the best possible job of communicating to the user that it will have
>> side effects on other places, like the `bar(cachedA)` call. Something like
>> this might be a better (not perfect, but just a bit better):
>>
>> public void foo(Table t, CacheService cacheService) {
>>  // …
>>  cacheService.releaseCacheFor(t);
>> }
>>
>> Table a = …
>> val cachedA = a.cache()
>>
>> foo(cachedA, env.getCacheService())
>> bar(cachedA)
>>
>>
>> Also from another perspective, maybe placing `releaseCache()` method in
>> Table might not be the best separation of concerns - `releaseCache()`
>> method seems significantly different compared to other existing methods.
>>
>> Piotrek
>>
>> > On 8 Jan 2019, at 12:28, Becket Qin <be...@gmail.com> wrote:
>> >
>> > Hi Piotr,
>> >
>> > You are right. There might be two intuitive meanings when users call
>> > 'a.uncache()', namely:
>> > 1. release the resource
>> > 2. Do not use cache for the next operation.
>> >
>> > Case (1) would likely be the dominant use case. So I would suggest we
>> > dedicate the uncache() method to case (1), i.e. for resource release, but
>> not
>> > for ignoring cache.
>> >
>> > For case 2, i.e. explicitly ignoring cache (which is rare), users may
>> use
>> > something like 'hint("ignoreCache")'. I think this is better as it is a
>> > little weird for users to call `a.uncache()` while they may not even
>> know
>> > if the table is cached at all.
>> >
>> > Assuming we let `uncache()` only release the resource, one possibility is
>> > using ref count to mitigate the side effect. That means a ref count is
>> > incremented on `cache()` and decremented on `uncache()`, so that
>> > `uncache()` does not physically release the resource immediately, but
>> just
>> > means the cache could be released.
>> > That being said, I am not sure if this is really a better solution as it
>> > seems a little counterintuitive. Maybe calling it releaseCache() helps a
>> > little bit?
>> >
>> > Thanks,
>> >
>> > Jiangjie (Becket) Qin
>> >
>> >
>> >
>> >
>> > On Tue, Jan 8, 2019 at 5:36 PM Piotr Nowojski <pi...@da-platform.com>
>> wrote:
>> >
>> >> Hi Becket,
>> >>
>> >> With `uncache` there are probably two features that we can think about:
>> >>
>> >> a)
>> >>
>> >> Physically dropping the cached table from the storage, freeing up the
>> >> resources
>> >>
>> >> b)
>> >>
>> >> Hinting the optimizer to not cache the reads for the next query/table
>> >>
>> >> a) Has the issue as I wrote before, that it seemed to be an operation
>> >> inherently “flawed" with having side effects.
>> >>
>> >> I’m not sure how it would be best to express. We could make it work:
>> >>
>> >> 1. via a method on a Table as you proposed:
>> >>
>> >> void Table#dropCache()
>> >> void Table#uncache()
>> >>
>> >> 2. Operation on the environment
>> >>
>> >> env.dropCacheFor(table) // or some other argument that allows user to
>> >> identify the desired cache
>> >>
>> >> 3. Extending (from your original design doc) `setTableService` method
>> to
>> >> return some control handle like:
>> >>
>> >> TableServiceControl setTableService(TableFactory tf,
>> >>                     TableProperties properties,
>> >>                     TempTableCleanUpCallback cleanUpCallback);
>> >>
>> >> (TableServiceControl? TableService? TableServiceHandle? CacheService?)
>> >>
>> >> And having the drop cache method there:
>> >>
>> >> TableServiceControl#dropCache(table)
>> >>
>> >> Out of those options, option 1 might have a disadvantage of kind of not
>> >> making the user aware, that this is a global operation with side
>> effects.
>> >> Like the old example of:
>> >>
>> >> public void foo(Table t) {
>> >>  // …
>> >>  t.dropCache();
>> >> }
>> >>
>> >> It might not be immediately obvious that `t.dropCache()` is some kind
>> of
>> >> global operation, with side effects visible outside of the `foo`
>> function.
>> >>
>> >> On the other hand, both option 2 and 3, might have greater chance of
>> >> catching user’s attention:
>> >>
>> >> public void foo(Table t, CacheService cacheService) {
>> >>  // …
>> >>  cacheService.dropCache(t);
>> >> }
>> >>
>> >> b) could be achieved quite easily:
>> >>
>> >> Table a = …
>> >> val notCached1 = a.doNotCache()
>> >> val cachedA = a.cache()
>> >> val notCached2 = cachedA.doNotCache() // equivalent of notCached1
>> >>
>> >> `doNotCache()` would behave similarly to `cache()` - return a copy of
>> the
>> >> table with removed “cache” hint and/or added “never cache” hint.
>> >>
>> >> Piotrek
>> >>
>> >>
>> >>> On 8 Jan 2019, at 03:17, Becket Qin <be...@gmail.com> wrote:
>> >>>
>> >>> Hi Piotr,
>> >>>
>> >>> Thanks for the proposal and detailed explanation. I like the idea of
>> >>> returning a new hinted Table without modifying the original table.
>> This
>> >>> also leave the room for users to benefit from future implicit caching.
>> >>>
>> >>> Just to make sure I get the full picture. In your proposal, there will
>> >> also
>> >>> be a 'void Table#uncache()' method to release the cache, right?
>> >>>
>> >>> Thanks,
>> >>>
>> >>> Jiangjie (Becket) Qin
>> >>>
>> >>> On Mon, Jan 7, 2019 at 11:50 PM Piotr Nowojski <piotr@da-platform.com
>> >
>> >>> wrote:
>> >>>
>> >>>> Hi Becket!
>> >>>>
>> >>>> After further thinking I tend to agree that my previous proposal
>> >> (*Option
>> >>>> 2*) indeed might not be ideal if we in the future introduce automatic
>> >> caching.
>> >>>> However I would like to propose a slightly modified version of it:
>> >>>>
>> >>>> *Option 4*
>> >>>>
>> >>>> Adding a `cache()` method with the following signature:
>> >>>>
>> >>>> Table Table#cache();
>> >>>>
>> >>>> Without side-effects, and the `cache()` call does not modify/change
>> original
>> >>>> Table in any way.
>> >>>> It would return a copy of the original table, with an added hint for the
>> >>>> optimizer to cache the table, so that the future accesses to the
>> >> returned
>> >>>> table might be cached or not.
>> >>>>
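>> >>>> To illustrate the "copy with a hint" semantic, a rough sketch (the
>> >>>> class internals below are purely illustrative, not actual Flink code):
>> >>>>
>> >>>> class Table {
>> >>>>   private final LogicalPlan plan;
>> >>>>   private final boolean cacheHint;
>> >>>>
>> >>>>   Table(LogicalPlan plan, boolean cacheHint) {
>> >>>>     this.plan = plan;
>> >>>>     this.cacheHint = cacheHint;
>> >>>>   }
>> >>>>
>> >>>>   // No side effects: return a copy with the hint set and leave
>> >>>>   // `this` untouched.
>> >>>>   Table cache() {
>> >>>>     return new Table(plan, true);
>> >>>>   }
>> >>>> }
>> >>>>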
>> >>>> Assuming that we are talking about a setup, where we do not have
>> >> automatic
>> >>>> caching enabled (possible future extension).
>> >>>>
>> >>>> Example #1:
>> >>>>
>> >>>> ```
>> >>>> Table a = …
>> >>>> a.foo() // not cached
>> >>>>
>> >>>> val cachedTable = a.cache();
>> >>>>
>> >>>> cachedA.bar() // maybe cached
>> >>>> a.foo() // same as before - effectively not cached
>> >>>> ```
>> >>>>
>> >>>> Both the first and the second `a.foo()` operations would behave in
>> the
>> >>>> exactly same way. Again, `a.cache()` call doesn’t affect `a` itself.
>> If
>> >> `a`
>> >>>> was not hinted for caching before `a.cache();`, then both `a.foo()`
>> >> calls
>> >>>> wouldn’t use cache.
>> >>>>
>> >>>> Returned `cachedA` would be hinted with “cache” hint, so probably
>> >>>> `cachedA.bar()` would go through cache (unless optimiser decides the
>> >>>> opposite)
>> >>>>
>> >>>> Example #2
>> >>>>
>> >>>> ```
>> >>>> Table a = …
>> >>>>
>> >>>> a.foo() // not cached
>> >>>>
>> >>>> val b = a.cache();
>> >>>>
>> >>>> a.foo() // same as before - effectively not cached
>> >>>> b.foo() // maybe cached
>> >>>>
>> >>>> val c = b.cache();
>> >>>>
>> >>>> a.foo() // same as before - effectively not cached
>> >>>> b.foo() // same as before - effectively maybe cached
>> >>>> c.foo() // maybe cached
>> >>>> ```
>> >>>>
>> >>>> Now, assuming that we have some future “automatic caching
>> optimisation”:
>> >>>>
>> >>>> Example #3
>> >>>>
>> >>>> ```
>> >>>> env.enableAutomaticCaching()
>> >>>> Table a = …
>> >>>>
>> >>>> a.foo() // might be cached, depending if `a` was selected to
>> automatic
>> >>>> caching
>> >>>>
>> >>>> val b = a.cache();
>> >>>>
>> >>>> a.foo() // same as before - might be cached, if `a` was selected to
>> >>>> automatic caching
>> >>>> b.foo() // maybe cached
>> >>>> ```
>> >>>>
>> >>>>
>> >>>> More or less this is the same behaviour as:
>> >>>>
>> >>>> Table a = ...
>> >>>> val b = a.filter(x > 20)
>> >>>>
>> >>>> calling `filter` hasn’t changed or altered `a` in anyway. If `a` was
>> >>>> previously filtered:
>> >>>>
>> >>>> Table src = …
>> >>>> val a = src.filter(x > 20)
>> >>>> val b = a.filter(x > 20)
>> >>>>
>> >>>> then yes, `a` and `b` will be the same. But the point is that neither
>> >>>> `filter` nor `cache` changes the original `a` table.
>> >>>>
>> >>>> One thing is that indeed, physically dropping cache operation, will
>> have
>> >>>> side effects and it will in a way mutate the cached table references.
>> >> But
>> >>>> this is I think unavoidable in any solution - the same issue as
>> calling
>> >>>> `.close()`, or calling destructor in C++.
>> >>>>
>> >>>> Piotrek
>> >>>>
>> >>>>> On 7 Jan 2019, at 10:41, Becket Qin <be...@gmail.com> wrote:
>> >>>>>
>> >>>>> Happy New Year, everybody!
>> >>>>>
>> >>>>> I would like to resume this discussion thread. At this point, We
>> have
>> >>>>> agreed on the first step goal of interactive programming. The open
>> >>>>> discussion is the exact API. More specifically, what should
>> *cache()*
>> >>>>> method return and what is the semantic. There are three options:
>> >>>>>
>> >>>>> *Option 1*
>> >>>>> *void cache()* OR *Table cache()* which returns the original table
>> for
>> >>>>> chained calls.
>> >>>>> *void uncache() *releases the cache.
>> >>>>> *Table.hint(ignoreCache).foo()* to ignore cache for operation foo().
>> >>>>>
>> >>>>> - Semantic: a.cache() hints that table 'a' should be cached.
>> Optimizer
>> >>>>> decides whether the cache will be used or not.
>> >>>>> - pros: simple and no confusion between CachedTable and original
>> table
>> >>>>> - cons: A table may be cached / uncached in a method invocation,
>> while
>> >>>> the
>> >>>>> caller does not know about this.
>> >>>>>
>> >>>>> *Option 2*
>> >>>>> *CachedTable cache()*
>> >>>>> *CachedTable *extends *Table *with an additional *uncache()* method
>> >>>>>
>> >>>>> - Semantic: After *val cachedA = a.cache()*, *cachedA.foo()* will
>> >> always
>> >>>>> use cache. *a.bar() *will always use original DAG.
>> >>>>> - pros: No potential side effects in method invocation.
>> >>>>> - cons: Optimizer has no chance to kick in. Future optimization will
>> >>>> become
>> >>>>> a behavior change and require users to change their code.
>> >>>>>
>> >>>>> *Option 3*
>> >>>>> *CacheHandle cache()*
>> >>>>> *CacheHandle.release() *to release a cache handle on the table. If
>> all
>> >>>>> cache handles are released, the cache could be removed.
>> >>>>> *Table.hint(ignoreCache).foo()* to ignore cache for operation foo().
>> >>>>>
>> >>>>> - Semantic: *a.cache() *hints that 'a' should be cached. Optimizer
>> >>>> decides
>> >>>>> whether the cache will be used or not. Cache is released either no
>> >> handle
>> >>>>> is on it, or the user program exits.
>> >>>>> - pros: No potential side effect in method invocation. No confusion
>> >>>> between
>> >>>>> cached table and the original table.
>> >>>>> - cons: An additional CacheHandle exposed to the users.
>> >>>>>
>> >>>>>
>> >>>>> Personally I prefer option 3 for the following reasons:
>> >>>>> 1. It is simple. Vast majority of the users would just call
>> >>>>> *a.cache()* followed
>> >>>>> by *a.foo(),* *a.bar(), etc. *
>> >>>>> 2. There is no semantic ambiguity and semantic change if we decide
>> to
>> >> add
>> >>>>> implicit cache in the future.
>> >>>>> 3. There is no side effect in the method calls.
>> >>>>> 4. Admittedly we need to expose one more CacheHandle class to the
>> >> users.
>> >>>>> But it is not that difficult to understand given similar well known
>> >>>> concept
>> >>>>> like ref count (we can name it CacheReference if that is easier to
>> >>>>> understand). So I think it is fine.
>> >>>>>
>> >>>>>
>> >>>>> Thanks,
>> >>>>>
>> >>>>> Jiangjie (Becket) Qin
>> >>>>>
>> >>>>>
>> >>>>> On Thu, Dec 13, 2018 at 11:23 AM Becket Qin <be...@gmail.com>
>> >>>> wrote:
>> >>>>>
>> >>>>>> Hi Piotrek,
>> >>>>>>
>> >>>>>> 1. Regarding optimization.
>> >>>>>> Sure there are many cases that the decision is hard to make. But
>> that
>> >>>> does
>> >>>>>> not make it any easier for the users to make those decisions. I
>> >> imagine
>> >>>> 99%
>> >>>>>> of the users would just naively use cache. I am not saying we can
>> >>>> optimize
>> >>>>>> in all the cases. But as long as we agree that at least in certain
>> >>>> cases (I
>> >>>>>> would argue most cases), optimizer can do a little better than an
>> >>>> average
>> >>>>>> user who likely knows little about Flink internals, we should not
>> push
>> >>>> the
>> >>>>>> burden of optimization to users.
>> >>>>>>
>> >>>>>> BTW, it seems some of your concerns are related to the
>> >> implementation. I
>> >>>>>> did not mention the implementation of the caching service because
>> that
>> >>>>>> should not affect the API semantic. Not sure if this helps, but
>> >> imagine
>> >>>> the
>> >>>>>> default implementation has one StorageNode service colocating with
>> >> each
>> >>>> TM.
>> >>>>>> It could be running within the TM process or in a standalone
>> process,
>> >>>>>> depending on configuration.
>> >>>>>>
>> >>>>>> The StorageNode uses memory + spill-to-disk mechanism. The cached
>> data
>> >>>>>> will just be written to the local StorageNode service. If the
>> >>>> StorageNode
>> >>>>>> is running within the TM process, the in-memory cache could just be
>> >>>> objects
>> >>>>>> so we save some serde cost. A later job referring to the cached
>> Table
>> >>>> will
>> >>>>>> be scheduled in a locality aware manner, i.e. run in the TM whose
>> peer
>> >>>>>> StorageNode hosts the data.
>> >>>>>>
>> >>>>>>
>> >>>>>> 2. Semantic
>> >>>>>> I am not sure why introducing a new hintCache() or
>> >>>>>> env.enableAutomaticCaching() method would avoid the consequence of
>> >>>> semantic
>> >>>>>> change.
>> >>>>>>
>> >>>>>> If the auto optimization is not enabled by default, users still
>> need
>> >> to
>> >>>>>> make code change to all existing programs in order to get the
>> benefit.
>> >>>>>> If the auto optimization is enabled by default, advanced users who
>> >> know
>> >>>>>> that they really want to use cache will suddenly lose the
>> opportunity
>> >>>> to do
>> >>>>>> so, unless they change the code to disable auto optimization.
>> >>>>>>
>> >>>>>>
>> >>>>>> 3. side effect
>> >>>>>> The CacheHandle is not only for where to put uncache(). It is to
>> solve
>> >>>> the
>> >>>>>> implicit performance impact by moving the uncache() to the
>> >> CacheHandle.
>> >>>>>>
>> >>>>>> - If users wants to leverage cache, they can call a.cache(). After
>> >>>>>> that, unless user explicitly release that CacheHandle, a.foo() will
>> >>>> always
>> >>>>>> leverage cache if needed (optimizer may choose to ignore cache if
>> >> that
>> >>>>>> helps accelerate the process). Any function call will not be able
>> to
>> >>>>>> release the cache because they do not have that CacheHandle.
>> >>>>>> - If some advanced users do not want to use cache at all, they will
>> >>>>>> call a.hint(ignoreCache).foo(). This will for sure ignore cache and
>> >>>> use the
>> >>>>>> original DAG to process.
>> >>>>>>
>> >>>>>>
>> >>>>>>> In vast majority of the cases, users wouldn't really care whether
>> the
>> >>>>>>> cache is used or not.
>> >>>>>>> I wouldn’t agree with that, because “caching” (if not purely in
>> >> memory
>> >>>>>>> caching) would add additional IO costs. It’s similar as saying
>> that
>> >>>> users
>> >>>>>>> would not see a difference between Spark/Flink and MapReduce
>> >> (MapReduce
>> >>>>>>> writes data to disks after every map/reduce stage).
>> >>>>>>
>> >>>>>> What I wanted to say is that in most cases, after users call
>> cache(),
>> >>>> they
>> >>>>>> don't really care about whether auto optimization has decided to
>> >> ignore
>> >>>> the
>> >>>>>> cache or not, as long as the program runs faster.
>> >>>>>>
>> >>>>>> Thanks,
>> >>>>>>
>> >>>>>> Jiangjie (Becket) Qin
>> >>>>>>
>> >>>>>>
>> >>>>>>
>> >>>>>>
>> >>>>>>
>> >>>>>>
>> >>>>>>
>> >>>>>>
>> >>>>>> On Wed, Dec 12, 2018 at 10:50 PM Piotr Nowojski <
>> >>>> piotr@data-artisans.com>
>> >>>>>> wrote:
>> >>>>>>
>> >>>>>>> Hi,
>> >>>>>>>
>> >>>>>>> Thanks for the quick answer :)
>> >>>>>>>
>> >>>>>>> Re 1.
>> >>>>>>>
>> >>>>>>> I generally agree with you, however couple of points:
>> >>>>>>>
>> >>>>>>> a) the problem with using automatic caching is bigger, because you
>> >> will
>> >>>>>>> have to decide, how do you compare IO vs CPU costs and if you pick
>> >>>> wrong,
>> >>>>>>> additional IO costs might be enormous or even can crash your
>> system.
>> >>>> This
>> >>>>>>> is more difficult problem compared to let say join reordering,
>> where
>> >>>> the
>> >>>>>>> only issue is to have good statistics that can capture
>> correlations
>> >>>> between
>> >>>>>>> columns (when you reorder joins number of IO operations do not
>> >> change)
>> >>>>>>> c) your example is completely independent of caching.
>> >>>>>>>
>> >>>>>>> Query like this:
>> >>>>>>>
>> >>>>>>> src1.filter('f1 > 10).join(src2.filter('f2 < 30), `f1 ===`f2).as('f3,
>> ===`f2).as('f3,
>> >>>>>>> …).filter(‘f3 > 30)
>> >>>>>>>
>> >>>>>>> Should/could be optimised to empty result immediately, without the
>> >> need
>> >>>>>>> for any cache/materialisation and that should work even without
>> any
>> >>>>>>> statistics provided by the connector.
>> >>>>>>>
>> >>>>>>> For me, a prerequisite to any serious cost-based optimisations would
>> be
>> >>>> some
>> >>>>>>> reasonable benchmark coverage of the code (tpch?). Otherwise that
>> >>>> would be
>> >>>>>>> the equivalent of adding untested code, since we wouldn’t be able to
>> >>>> verify
>> >>>>>>> our assumptions, like how does the writing of 10 000 records to
>> >>>>>>> cache/RocksDB/Kafka/CSV file compare to
>> joining/filtering/processing
>> >> of
>> >>>>>>> lets say 1000 000 rows.
>> >>>>>>>
>> >>>>>>> Re 2.
>> >>>>>>>
>> >>>>>>> I wasn’t proposing to change the semantic later. I was proposing
>> that
>> >>>> we
>> >>>>>>> start now:
>> >>>>>>>
>> >>>>>>> CachedTable cachedA = a.cache()
>> >>>>>>> cachedA.foo() // Cache is used
>> >>>>>>> a.bar() // Original DAG is used
>> >>>>>>>
>> >>>>>>> And then later we can think about adding for example
>> >>>>>>>
>> >>>>>>> CachedTable cachedA = a.hintCache()
>> >>>>>>> cachedA.foo() // Cache might be used
>> >>>>>>> a.bar() // Original DAG is used
>> >>>>>>>
>> >>>>>>> Or
>> >>>>>>>
>> >>>>>>> env.enableAutomaticCaching()
>> >>>>>>> a.foo() // Cache might be used
>> >>>>>>> a.bar() // Cache might be used
>> >>>>>>>
>> >>>>>>> Or (I would still not like this option):
>> >>>>>>>
>> >>>>>>> a.hintCache()
>> >>>>>>> a.foo() // Cache might be used
>> >>>>>>> a.bar() // Cache might be used
>> >>>>>>>
>> >>>>>>> Or whatever else that will come to our mind. Even if we add some
>> >>>>>>> automatic caching in the future, keeping implicit (`CachedTable
>> >>>> cache()`)
>> >>>>>>> caching will still be useful, at least in some cases.
>> >>>>>>>
>> >>>>>>> Re 3.
>> >>>>>>>
>> >>>>>>>> 2. The source tables are immutable during one run of batch
>> >> processing
>> >>>>>>> logic.
>> >>>>>>>> 3. The cache is immutable during one run of batch processing
>> logic.
>> >>>>>>>
>> >>>>>>>> I think assumption 2 and 3 are by definition what batch
>> processing
>> >>>>>>> means,
>> >>>>>>>> i.e the data must be complete before it is processed and should
>> not
>> >>>>>>> change
>> >>>>>>>> when the processing is running.
>> >>>>>>>
>> >>>>>>> I agree that this is how batch systems SHOULD be working. However
>> I
>> >>>> know
>> >>>>>>> from my previous experience that it’s not always the case.
>> Sometimes
>> >>>> users
>> >>>>>>> are just working on some non transactional storage, which can be
>> >>>> (either
>> >>>>>>> constantly or occasionally) being modified by some other processes
>> >> for
>> >>>>>>> whatever the reasons (fixing the data, updating, adding new data
>> >> etc).
>> >>>>>>>
>> >>>>>>> But even if we ignore this point (data immutability), performance
>> >> side
>> >>>>>>> effect issue of your proposal remains. If user calls `void
>> a.cache()`
>> >>>> deep
>> >>>>>>> inside some private method, it will have implicit side effects on
>> >> other
>> >>>>>>> parts of his program that might not be obvious.
>> >>>>>>>
>> >>>>>>> Re `CacheHandle`.
>> >>>>>>>
>> >>>>>>> If I understand it correctly, it only addresses the issue where to
>> >>>> place
>> >>>>>>> method `uncache`/`dropCache`.
>> >>>>>>>
>> >>>>>>> Btw,
>> >>>>>>>
>> >>>>>>>> In vast majority of the cases, users wouldn't really care whether
>> >> the
>> >>>>>>> cache is used or not.
>> >>>>>>>
>> >>>>>>> I wouldn’t agree with that, because “caching” (if not purely in
>> >> memory
>> >>>>>>> caching) would add additional IO costs. It’s similar as saying
>> that
>> >>>> users
>> >>>>>>> would not see a difference between Spark/Flink and MapReduce
>> >> (MapReduce
>> >>>>>>> writes data to disks after every map/reduce stage).
>> >>>>>>>
>> >>>>>>> Piotrek
>> >>>>>>>
>> >>>>>>>> On 12 Dec 2018, at 14:28, Becket Qin <be...@gmail.com>
>> wrote:
>> >>>>>>>>
>> >>>>>>>> Hi Piotrek,
>> >>>>>>>>
>> >>>>>>>> Not sure if you noticed, in my last email, I was proposing
>> >>>> `CacheHandle
>> >>>>>>>> cache()` to avoid the potential side effect due to function
>> calls.
>> >>>>>>>>
>> >>>>>>>> Let's look at the disagreement in your reply one by one.
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>> 1. Optimization chances
>> >>>>>>>>
>> >>>>>>>> Optimization is never a trivial work. This is exactly why we
>> should
>> >>>> not
>> >>>>>>> let
>> >>>>>>>> user manually do that. Databases have done huge amount of work in
>> >> this
>> >>>>>>>> area. At Alibaba, we rely heavily on many optimization rules to
>> >> boost
>> >>>>>>> the
>> >>>>>>>> SQL query performance.
>> >>>>>>>>
>> >>>>>>>> In your example, if I fill in the filter conditions in a certain
>> >> way,
>> >>>>>>> the
>> >>>>>>>> optimization would become obvious.
>> >>>>>>>>
>> >>>>>>>> Table src1 = … // read from connector 1
>> >>>>>>>> Table src2 = … // read from connector 2
>> >>>>>>>>
>> >>>>>>>> Table a = src1.filter('f1 > 10).join(src2.filter('f2 < 30), `f1 ===
>> ===
>> >>>>>>>> `f2).as('f3, ...)
>> >>>>>>>> a.cache() // write cache to connector 3, when writing the
>> records,
>> >>>>>>> remember
>> >>>>>>>> min and max of `f1
>> >>>>>>>>
>> >>>>>>>> a.filter('f3 > 30) // There is no need to read from any connector
>> >>>>>>> because
>> >>>>>>>> `a` does not contain any record whose 'f3 is greater than 30.
>> >>>>>>>> env.execute()
>> >>>>>>>> a.select(…)
>> >>>>>>>>
>> >>>>>>>> BTW, it seems to me that adding some basic statistics is fairly
>> >>>>>>>> straightforward and the cost is pretty marginal if not
>> negligible. In
>> >>>>>>> fact
>> >>>>>>>> it is not only needed for optimization, but also for cases such
>> as
>> >> ML,
>> >>>>>>>> where some algorithms may need to decide their parameter based on
>> >> the
>> >>>>>>>> statistics of the data.
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>> 2. Same API, one semantic now, another semantic later.
>> >>>>>>>>
>> >>>>>>>> I am trying to understand what is the semantic of `CachedTable
>> >>>> cache()`
>> >>>>>>> you
>> >>>>>>>> are proposing. IMO, we should avoid designing an API whose
>> semantic
>> >>>>>>> will be
>> >>>>>>>> changed later. If we have a "CachedTable cache()" method, then
>> the
>> >>>>>>> semantic
>> >>>>>>>> should be very clearly defined upfront and do not change later.
>> It
>> >>>>>>> should
>> >>>>>>>> never be "right now let's go with semantic 1, later we can
>> silently
>> >>>>>>> change
>> >>>>>>>> it to semantic 2 or 3". Such change could result in bad
>> consequence.
>> >>>> For
>> >>>>>>>> example, let's say we decide go with semantic 1:
>> >>>>>>>>
>> >>>>>>>> CachedTable cachedA = a.cache()
>> >>>>>>>> cachedA.foo() // Cache is used
>> >>>>>>>> a.bar() // Original DAG is used.
>> >>>>>>>>
>> >>>>>>>> Now majority of the users would be using cachedA.foo() in their
>> >> code.
>> >>>>>>> And
>> >>>>>>>> some advanced users will use a.bar() to explicitly skip the
>> cache.
>> >>>> Later
>> >>>>>>>> on, we added smart optimization and change the semantic to
>> semantic
>> >> 2:
>> >>>>>>>>
>> >>>>>>>> CachedTable cachedA = a.cache()
>> >>>>>>>> cachedA.foo() // Cache is used
>> >>>>>>>> a.bar() // Cache MIGHT be used, and Flink may decide to skip
>> cache
>> >> if
>> >>>>>>> it is
>> >>>>>>>> faster.
>> >>>>>>>>
>> >>>>>>>> Now most of the users who were writing cachedA.foo() will not
>> >> benefit
>> >>>>>>> from
>> >>>>>>>> this optimization at all, unless they change their code to use
>> >> a.foo()
>> >>>>>>>> instead. And those advanced users suddenly lose the option to
>> >>>> explicitly
>> >>>>>>>> ignore cache unless they change their code (assuming we care
>> enough
>> >> to
>> >>>>>>>> provide something like hint(useCache)). If we don't define the
>> >>>> semantic
>> >>>>>>>> carefully, our users will have to change their code again and
>> again
>> >>>>>>> while
>> >>>>>>>> they shouldn't have to.
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>> 3. side effect.
>> >>>>>>>>
>> >>>>>>>> Before we talk about side effect, we have to agree on the
>> >> assumptions.
>> >>>>>>> The
>> >>>>>>>> assumptions I have are following:
>> >>>>>>>> 1. We are talking about batch processing.
>> >>>>>>>> 2. The source tables are immutable during one run of batch
>> >> processing
>> >>>>>>> logic.
>> >>>>>>>> 3. The cache is immutable during one run of batch processing
>> logic.
>> >>>>>>>>
>> >>>>>>>> I think assumption 2 and 3 are by definition what batch
>> processing
>> >>>>>>> means,
>> >>>>>>>> i.e the data must be complete before it is processed and should
>> not
>> >>>>>>> change
>> >>>>>>>> when the processing is running.
>> >>>>>>>>
>> >>>>>>>> As far as I am aware of, I don't know any batch processing system
>> >>>>>>> breaking
>> >>>>>>>> those assumptions. Even for relational database tables, where
>> >> queries
>> >>>>>>> can
>> >>>>>>>> run with concurrent modifications, necessary locking are still
>> >>>> required
>> >>>>>>> to
>> >>>>>>>> ensure the integrity of the query result.
>> >>>>>>>>
>> >>>>>>>> Please let me know if you disagree with the above assumptions. If
>> >> you
>> >>>>>>> agree
>> >>>>>>>> with these assumptions, with the `CacheHandle cache()` API in my
>> >> last
>> >>>>>>>> email, do you still see side effects?
>> >>>>>>>>
>> >>>>>>>> Thanks,
>> >>>>>>>>
>> >>>>>>>> Jiangjie (Becket) Qin
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>> On Wed, Dec 12, 2018 at 7:11 PM Piotr Nowojski <
>> >>>> piotr@data-artisans.com
>> >>>>>>>>
>> >>>>>>>> wrote:
>> >>>>>>>>
>> >>>>>>>>> Hi Becket,
>> >>>>>>>>>
>> >>>>>>>>>> Regarding the chance of optimization, it might not be that
>> rare.
>> >>>> Some
>> >>>>>>>>> very
>> >>>>>>>>>> simple statistics could already help in many cases. For
>> example,
>> >>>>>>> simply
>> >>>>>>>>>> maintaining max and min of each fields can already eliminate
>> some
>> >>>>>>>>>> unnecessary table scan (potentially scanning the cached table)
>> if
>> >>>> the
>> >>>>>>>>>> result is doomed to be empty. A histogram would give even
>> further
>> >>>>>>>>>> information. The optimizer could be very careful and only
>> ignores
>> >>>>>>> cache
>> >>>>>>>>>> when it is 100% sure doing that is cheaper. e.g. only when a
>> >> filter
>> >>>> on
>> >>>>>>>>> the
>> >>>>>>>>>> cache will absolutely return nothing.
>> >>>>>>>>>
>> >>>>>>>>> I do not see how this might be easy to achieve. It would require
>> >> tons
>> >>>>>>> of
>> >>>>>>>>> effort to make it work and in the end you would still have a
>> >> problem
>> >>>> of
>> >>>>>>>>> comparing/trading CPU cycles vs IO. For example:
>> >>>>>>>>>
>> >>>>>>>>> Table src1 = … // read from connector 1
>> >>>>>>>>> Table src2 = … // read from connector 2
>> >>>>>>>>>
>> >>>>>>>>> Table a = src1.filter(…).join(src2.filter(…), …)
>> >>>>>>>>> a.cache() // write cache to connector 3
>> >>>>>>>>>
>> >>>>>>>>> a.filter(…)
>> >>>>>>>>> env.execute()
>> >>>>>>>>> a.select(…)
>> >>>>>>>>>
>> >>>>>>>>> Decision whether it’s better to:
>> >>>>>>>>> A) read from connector1/connector2, filter/map and join them
>> twice
>> >>>>>>>>> B) read from connector1/connector2, filter/map and join them
>> once,
>> >>>> pay
>> >>>>>>> the
>> >>>>>>>>> price of writing to connector 3 and then reading from it
>> >>>>>>>>>
>> >>>>>>>>> Is very far from trivial. `a` can end up much larger than `src1`
>> >> and
>> >>>>>>>>> `src2`, writes to connector 3 might be extremely slow, reads
>> from
>> >>>>>>> connector
>> >>>>>>>>> 3 can be slower compared to reads from connector 1 & 2, … . You
>> >>>> really
>> >>>>>>> need
>> >>>>>>>>> to have extremely good statistics to correctly assess the size of the
>> >>>>>>> output and
>> >>>>>>>>> it would still be failing many times (correlations etc). And
>> keep
>> >> in
>> >>>>>>> mind
>> >>>>>>>>> that at the moment we do not have ANY statistics at all. More
>> than
>> >>>>>>> that, it
>> >>>>>>>>> would require significantly more testing and setting up some
>> >>>>>>> benchmarks to
>> >>>>>>>>> make sure that we do not break it with some regressions.
>> >>>>>>>>>
>> >>>>>>>>> That’s why I’m strongly opposing this idea - at least let’s not
>> >>>> start
>> >>>>>>>>> with this. If we first start with completely manual/explicit
>> >> caching,
>> >>>>>>>>> without any magic, it would be a significant improvement for the
>> >>>> users
>> >>>>>>> for
>> >>>>>>>>> a fraction of the development cost. After implementing that,
>> when
>> >> we
>> >>>>>>>>> already have all of the working pieces, we can start working on
>> >> some
>> >>>>>>>>> optimisations rules. As I wrote before, if we start with
>> >>>>>>>>>
>> >>>>>>>>> `CachedTable cache()`
>> >>>>>>>>>
>> >>>>>>>>> We can later work on follow up stories to make it automatic.
>> >> Despite
>> >>>>>>> that
>> >>>>>>>>> I don’t like this implicit/side effect approach with `void`
>> method,
>> >>>>>>> having
>> >>>>>>>>> explicit `CachedTable cache()` wouldn’t even prevent as from
>> later
>> >>>>>>> adding
>> >>>>>>>>> `void hintCache()` method, with the exact semantic that you
>> want.
>> >>>>>>>>>
>> >>>>>>>>> On top of that I re-rise again that having implicit `void
>> >>>>>>>>> cache()/hintCache()` has other side effects and problems with
>> non
>> >>>>>>> immutable
>> >>>>>>>>> data, and being annoying when used secretly inside methods.
>> >>>>>>>>>
>> >>>>>>>>> Explicit `CachedTable cache()` just looks like much less
>> >>>> controversial
>> >>>>>>> MVP
>> >>>>>>>>> and if we decide to go further with this topic, it’s not a
>> wasted
>> >>>>>>> effort,
>> >>>>>>>>> but just lies on a stright path to more advanced/complicated
>> >>>> solutions
>> >>>>>>> in
>> >>>>>>>>> the future. Are there any drawbacks of starting with
>> `CachedTable
>> >>>>>>> cache()`
>> >>>>>>>>> that I’m missing?
>> >>>>>>>>>
>> >>>>>>>>> Piotrek
>> >>>>>>>>>
>> >>>>>>>>>> On 12 Dec 2018, at 09:30, Jeff Zhang <zj...@gmail.com> wrote:
>> >>>>>>>>>>
>> >>>>>>>>>> Hi Becket,
>> >>>>>>>>>>
>> >>>>>>>>>> Introducing CacheHandle seems too complicated. That means users
>> >> have
>> >>>>>>> to
>> >>>>>>>>>> maintain Handler properly.
>> >>>>>>>>>>
>> >>>>>>>>>> And since cache is just a hint for optimizer, why not just
>> return
>> >>>>>>> Table
>> >>>>>>>>>> itself for cache method. This hint info should be kept in
>> Table I
>> >>>>>>>>> believe.
>> >>>>>>>>>>
>> >>>>>>>>>> So how about adding method cache and uncache for Table, and
>> both
>> >>>>>>> return
>> >>>>>>>>>> Table. Because what cache and uncache did is just adding some
>> hint
>> >>>>>>> info
>> >>>>>>>>>> into Table.
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>> Becket Qin <be...@gmail.com> 于2018年12月12日周三 上午11:25写道:
>> >>>>>>>>>>
>> >>>>>>>>>>> Hi Till and Piotrek,
>> >>>>>>>>>>>
>> >>>>>>>>>>> Thanks for the clarification. That solves quite a few
>> confusion.
>> >> My
>> >>>>>>>>>>> understanding of how cache works is same as what Till
>> describe.
>> >>>> i.e.
>> >>>>>>>>>>> cache() is a hint to Flink, but it is not guaranteed that
>> cache
>> >>>>>>> always
>> >>>>>>>>>>> exist and it might be recomputed from its lineage.
>> >>>>>>>>>>>
>> >>>>>>>>>>> Is this the core of our disagreement here? That you would like
>> >> this
>> >>>>>>>>>>>> “cache()” to be mostly hint for the optimiser?
>> >>>>>>>>>>>
>> >>>>>>>>>>> Semantic wise, yes. That's also why I think materialize() has
>> a
>> >>>> much
>> >>>>>>>>> larger
>> >>>>>>>>>>> scope than cache(), thus it should be a different method.
>> >>>>>>>>>>>
>> >>>>>>>>>>> Regarding the chance of optimization, it might not be that
>> rare.
>> >>>> Some
>> >>>>>>>>> very
>> >>>>>>>>>>> simple statistics could already help in many cases. For
>> example,
>> >>>>>>> simply
>> >>>>>>>>>>> maintaining max and min of each fields can already eliminate
>> some
>> >>>>>>>>>>> unnecessary table scan (potentially scanning the cached
>> table) if
>> >>>> the
>> >>>>>>>>>>> result is doomed to be empty. A histogram would give even
>> further
>> >>>>>>>>>>> information. The optimizer could be very careful and only
>> ignores
>> >>>>>>> cache
>> >>>>>>>>>>> when it is 100% sure doing that is cheaper. e.g. only when a
>> >> filter
>> >>>>>>> on
>> >>>>>>>>> the
>> >>>>>>>>>>> cache will absolutely return nothing.
>> >>>>>>>>>>>
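>> >>>>>>>>>>>
>> >>>>>>>>>>> As a toy illustration of that min/max check (not real optimizer
>> >>>>>>>>>>> code), take a filter like 'f3 > 30 on a cached table:
>> >>>>>>>>>>>
>> >>>>>>>>>>> // The result is guaranteed to be empty iff max('f3) <= 30, so the
>> >>>>>>>>>>> // optimizer can answer without scanning the cache at all.
>> >>>>>>>>>>> boolean isProvablyEmpty(long cachedMaxF3, long threshold) {
>> >>>>>>>>>>>   return cachedMaxF3 <= threshold;
>> >>>>>>>>>>> }
>> >>>>>>>>>>>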
>> >>>>>>>>>>> Given the above clarification on cache, I would like to
>> revisit
>> >> the
>> >>>>>>>>>>> original "void cache()" proposal and see if we can improve on
>> top
>> >>>> of
>> >>>>>>>>> that.
>> >>>>>>>>>>>
>> >>>>>>>>>>> What do you think about the following modified interface?
>> >>>>>>>>>>>
>> >>>>>>>>>>> Table {
>> >>>>>>>>>>> /**
>> >>>>>>>>>>> * This call hints Flink to maintain a cache of this table and
>> >>>>>>> leverage
>> >>>>>>>>>>> it for performance optimization if needed.
>> >>>>>>>>>>> * Note that Flink may still decide to not use the cache if it
>> is
>> >>>>>>>>> cheaper
>> >>>>>>>>>>> by doing so.
>> >>>>>>>>>>> *
>> >>>>>>>>>>> * A CacheHandle will be returned to allow user release the
>> cache
>> >>>>>>>>>>> actively. The cache will be deleted if there
>> >>>>>>>>>>> * is no unreleased cache handlers to it. When the
>> >> TableEnvironment
>> >>>>>>> is
>> >>>>>>>>>>> closed. The cache will also be deleted
>> >>>>>>>>>>> * and all the cache handlers will be released.
>> >>>>>>>>>>> *
>> >>>>>>>>>>> * @return a CacheHandle referring to the cache of this table.
>> >>>>>>>>>>> */
>> >>>>>>>>>>> CacheHandle cache();
>> >>>>>>>>>>> }
>> >>>>>>>>>>>
>> >>>>>>>>>>> CacheHandle {
>> >>>>>>>>>>> /**
>> >>>>>>>>>>> * Close the cache handle. This method does not necessarily
>> >> deletes
>> >>>>>>> the
>> >>>>>>>>>>> cache. Instead, it simply decrements the reference counter to
>> the
>> >>>>>>> cache.
>> >>>>>>>>>>> * When the there is no handle referring to a cache. The cache
>> >> will
>> >>>>>>> be
>> >>>>>>>>>>> deleted.
>> >>>>>>>>>>> *
>> >>>>>>>>>>> * @return the number of open handles to the cache after this
>> >> handle
>> >>>>>>>>> has
>> >>>>>>>>>>> been released.
>> >>>>>>>>>>> */
>> >>>>>>>>>>> int release()
>> >>>>>>>>>>> }
>> >>>>>>>>>>>
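>> >>>>>>>>>>>
>> >>>>>>>>>>> For illustration, the expected usage would be:
>> >>>>>>>>>>>
>> >>>>>>>>>>> Table a = ...
>> >>>>>>>>>>> CacheHandle handle = a.cache(); // hint that `a` should be cached
>> >>>>>>>>>>> a.select(...); // optimizer decides whether the cache is used
>> >>>>>>>>>>> int remaining = handle.release(); // cache deleted once this hits 0
>> >>>>>>>>>>>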
>> >>>>>>>>>>> The rationale behind this interface is the following:
>> >>>>>>>>>>> In vast majority of the cases, users wouldn't really care
>> whether
>> >>>> the
>> >>>>>>>>> cache
>> >>>>>>>>>>> is used or not. So I think the most intuitive way is letting
>> >>>> cache()
>> >>>>>>>>> return
>> >>>>>>>>>>> nothing. So nobody needs to worry about the difference between
>> >>>>>>>>> operations
>> >>>>>>>>>>> on CacheTables and those on the "original" tables. This will
>> make
>> >>>>>>> maybe
>> >>>>>>>>>>> 99.9% of the users happy. There were two concerns raised for
>> this
>> >>>>>>>>> approach:
>> >>>>>>>>>>> 1. In some rare cases, users may want to ignore cache,
>> >>>>>>>>>>> 2. A table might be cached/uncached in a third party function
>> >> while
>> >>>>>>> the
>> >>>>>>>>>>> caller does not know.
>> >>>>>>>>>>>
>> >>>>>>>>>>> For the first issue, users can use hint("ignoreCache") to
>> >>>>>>>>>>> explicitly ignore the cache.
>> >>>>>>>>>>> For the second issue, the above proposal lets cache() return a
>> >>>>>>>>>>> CacheHandle, whose only method is release(). Different
>> >>>>>>>>>>> CacheHandles will refer to the same cache; if a cache no longer
>> >>>>>>>>>>> has any cache handle, it will be deleted. This will address the
>> >>>>>>>>>>> following case:
>> >>>>>>>>>>> {
>> >>>>>>>>>>> val handle1 = a.cache()
>> >>>>>>>>>>> process(a)
>> >>>>>>>>>>> a.select(...) // cache is still available, handle1 has not been
>> >>>>>>>>>>> released.
>> >>>>>>>>>>> }
>> >>>>>>>>>>>
>> >>>>>>>>>>> void process(Table t) {
>> >>>>>>>>>>> val handle2 = t.cache() // new handle to the cache
>> >>>>>>>>>>> t.select(...) // optimizer decides cache usage
>> >>>>>>>>>>> t.hint("ignoreCache").select(...) // cache is ignored
>> >>>>>>>>>>> handle2.release() // release the handle, but the cache may still
>> >>>>>>>>>>> be available if there are other handles
>> >>>>>>>>>>> ...
>> >>>>>>>>>>> }
>> >>>>>>>>>>>
>> >>>>>>>>>>> Does the above modified approach look reasonable to you?
>> >>>>>>>>>>>
>> >>>>>>>>>>> Cheers,
>> >>>>>>>>>>>
>> >>>>>>>>>>> Jiangjie (Becket) Qin
>> >>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>>> On Tue, Dec 11, 2018 at 6:44 PM Till Rohrmann <trohrmann@apache.org> wrote:
>> >>>>>>>>>>>
>> >>>>>>>>>>>> Hi Becket,
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> I was aiming at semantics similar to 1. I actually thought that
>> >>>>>>>>>>>> `cache()` would tell the system to materialize the intermediate
>> >>>>>>>>>>>> result so that subsequent queries don't need to reprocess it.
>> >>>>>>>>>>>> This means that the usage of the cached table in this example
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> {
>> >>>>>>>>>>>> val cachedTable = a.cache()
>> >>>>>>>>>>>> val b1 = cachedTable.select(…)
>> >>>>>>>>>>>> val b2 = cachedTable.foo().select(…)
>> >>>>>>>>>>>> val b3 = cachedTable.bar().select(...)
>> >>>>>>>>>>>> val c1 = a.select(…)
>> >>>>>>>>>>>> val c2 = a.foo().select(…)
>> >>>>>>>>>>>> val c3 = a.bar().select(...)
>> >>>>>>>>>>>> }
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> strongly depends on interleaved calls which trigger the
>> >>>>>>>>>>>> execution of sub queries. So for example, if there is only a
>> >>>>>>>>>>>> single env.execute call at the end of the block, then b1, b2,
>> >>>>>>>>>>>> b3, c1, c2 and c3 would all be computed by reading directly
>> >>>>>>>>>>>> from the sources (given that there is only a single JobGraph).
>> >>>>>>>>>>>> It just happens that the result of `a` will be cached such that
>> >>>>>>>>>>>> we skip the processing of `a` when there are subsequent queries
>> >>>>>>>>>>>> reading from `cachedTable`. If for some reason the system
>> >>>>>>>>>>>> cannot materialize the table (e.g. running out of disk space,
>> >>>>>>>>>>>> ttl expired), then it could also happen that we need to
>> >>>>>>>>>>>> reprocess `a`. In that sense `cachedTable` simply is an
>> >>>>>>>>>>>> identifier for the materialized result of `a`, with the lineage
>> >>>>>>>>>>>> of how to reprocess it.
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> Cheers,
>> >>>>>>>>>>>> Till
>> >>>>>>>>>>>>
>> >>>>>>>>>>>>
>> >>>>>>>>>>>>
>> >>>>>>>>>>>>
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> On Tue, Dec 11, 2018 at 11:01 AM Piotr Nowojski <piotr@data-artisans.com> wrote:
>> >>>>>>>>>>>>
>> >>>>>>>>>>>>> Hi Becket,
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>>> {
>> >>>>>>>>>>>>>> val cachedTable = a.cache()
>> >>>>>>>>>>>>>> val b = cachedTable.select(...)
>> >>>>>>>>>>>>>> val c = a.select(...)
>> >>>>>>>>>>>>>> }
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>> Semantic 1. b uses cachedTable as the user demanded. c uses
>> >>>>>>>>>>>>>> the original DAG as the user demanded. In this case, the
>> >>>>>>>>>>>>>> optimizer has no chance to optimize.
>> >>>>>>>>>>>>>> Semantic 2. b uses cachedTable as the user demanded. c leaves
>> >>>>>>>>>>>>>> the optimizer to choose whether the cache or the DAG should
>> >>>>>>>>>>>>>> be used. In this case, the user loses the option to NOT use
>> >>>>>>>>>>>>>> the cache.
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>> As you can see, neither of the options seems perfect.
>> >>>>>>>>>>>>>> However, I guess you and Till are proposing the third option:
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>> Semantic 3. b leaves the optimizer to choose whether the
>> >>>>>>>>>>>>>> cache or the DAG should be used. c always uses the DAG.
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>> I am pretty sure that me, Till, Fabian and others were all
>> >>>>>>>>>>>>> proposing and advocating in favour of semantic “1”. No
>> >>>>>>>>>>>>> cost-based optimiser decisions at all.
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>> {
>> >>>>>>>>>>>>> val cachedTable = a.cache()
>> >>>>>>>>>>>>> val b1 = cachedTable.select(…)
>> >>>>>>>>>>>>> val b2 = cachedTable.foo().select(…)
>> >>>>>>>>>>>>> val b3 = cachedTable.bar().select(...)
>> >>>>>>>>>>>>> val c1 = a.select(…)
>> >>>>>>>>>>>>> val c2 = a.foo().select(…)
>> >>>>>>>>>>>>> val c3 = a.bar().select(...)
>> >>>>>>>>>>>>> }
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>> All b1, b2 and b3 are reading from the cache, while c1, c2 and
>> >>>>>>>>>>>>> c3 are re-executing the whole plan for “a”.
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>> In the future we could discuss going one step further,
>> >>>>>>>>>>>>> introducing some global optimisation (that can be manually
>> >>>>>>>>>>>>> enabled/disabled): deduplicate plan nodes/deduplicate sub
>> >>>>>>>>>>>>> queries/re-use sub query results/or whatever we could call it.
>> >>>>>>>>>>>>> It could do two things:
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>> 1. Automatically try to deduplicate fragments of the plan and
>> >>>>>>>>>>>>> share the result using CachedTable - in other words,
>> >>>>>>>>>>>>> automatically insert `CachedTable cache()` calls.
>> >>>>>>>>>>>>> 2. Automatically make the decision to bypass explicit
>> >>>>>>>>>>>>> `CachedTable` access (this would be the equivalent of what you
>> >>>>>>>>>>>>> described as “semantic 3”).
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>> However as I wrote previously, I have big doubts if such
>> >>>>>>>>>>>>> cost-based optimisation would work (this applies also to
>> >>>>>>>>>>>>> “Semantic 2”). I would expect it to do more harm than good in
>> >>>>>>>>>>>>> so many cases that it wouldn’t make sense. Even assuming that
>> >>>>>>>>>>>>> we calculate statistics perfectly (this ain’t gonna happen),
>> >>>>>>>>>>>>> it’s virtually impossible to correctly estimate the exchange
>> >>>>>>>>>>>>> rate of CPU cycles vs IO operations, as it changes so much
>> >>>>>>>>>>>>> from deployment to deployment.
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>> Is this the core of our disagreement here? That you would like
>> >>>>>>>>>>>>> this “cache()” to be mostly a hint for the optimiser?
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>> Piotrek
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>>> On 11 Dec 2018, at 06:00, Becket Qin <becket.qin@gmail.com> wrote:
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>> Another potential concern for semantic 3 is that, in the
>> >>>>>>>>>>>>>> future, we may add automatic caching to Flink, e.g. cache the
>> >>>>>>>>>>>>>> intermediate results at the shuffle boundary. If our semantic
>> >>>>>>>>>>>>>> is that a reference to the original table means skipping the
>> >>>>>>>>>>>>>> cache, those users may not be able to benefit from the
>> >>>>>>>>>>>>>> implicit cache.
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>> On Tue, Dec 11, 2018 at 12:10 PM Becket Qin <becket.qin@gmail.com> wrote:
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> Hi Piotrek,
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> Thanks for the reply. Thinking about it again, I might have
>> >>>>>>>>>>>>>>> misunderstood your proposal in earlier emails. Returning a
>> >>>>>>>>>>>>>>> CachedTable might not be a bad idea.
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> I was more concerned about the semantic and its
>> >>>>>>>>>>>>>>> intuitiveness when a CachedTable is returned, i.e. if
>> >>>>>>>>>>>>>>> cache() returns a CachedTable, what is the semantic in the
>> >>>>>>>>>>>>>>> following code:
>> >>>>>>>>>>>>>>> {
>> >>>>>>>>>>>>>>> val cachedTable = a.cache()
>> >>>>>>>>>>>>>>> val b = cachedTable.select(...)
>> >>>>>>>>>>>>>>> val c = a.select(...)
>> >>>>>>>>>>>>>>> }
>> >>>>>>>>>>>>>>> What is the difference between b and c? At first glance, I
>> >>>>>>>>>>>>>>> see two options:
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> Semantic 1. b uses cachedTable as the user demanded. c uses
>> >>>>>>>>>>>>>>> the original DAG as the user demanded. In this case, the
>> >>>>>>>>>>>>>>> optimizer has no chance to optimize.
>> >>>>>>>>>>>>>>> Semantic 2. b uses cachedTable as the user demanded. c
>> >>>>>>>>>>>>>>> leaves the optimizer to choose whether the cache or the DAG
>> >>>>>>>>>>>>>>> should be used. In this case, the user loses the option to
>> >>>>>>>>>>>>>>> NOT use the cache.
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> As you can see, neither of the options seems perfect.
>> >>>>>>>>>>>>>>> However, I guess you and Till are proposing the third
>> >>>>>>>>>>>>>>> option:
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> Semantic 3. b leaves the optimizer to choose whether the
>> >>>>>>>>>>>>>>> cache or the DAG should be used. c always uses the DAG.
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> This does address all the concerns. It is just that from an
>> >>>>>>>>>>>>>>> intuitiveness perspective, I found it a little weird to ask
>> >>>>>>>>>>>>>>> users to explicitly use a CachedTable that the optimizer
>> >>>>>>>>>>>>>>> might choose to ignore. That was why I did not think about
>> >>>>>>>>>>>>>>> that semantic. But given there is a material benefit, I
>> >>>>>>>>>>>>>>> think this semantic is acceptable.
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>> 1. If we want to let the optimiser make decisions whether
>> >>>>>>>>>>>>>>>> to use the cache or not, then why do we need a “void
>> >>>>>>>>>>>>>>>> cache()” method at all? Would it “increase” the chance of
>> >>>>>>>>>>>>>>>> using the cache? That sounds strange. What would be the
>> >>>>>>>>>>>>>>>> mechanism of deciding whether to use the cache or not? If
>> >>>>>>>>>>>>>>>> we want to introduce such kind of automated optimisations
>> >>>>>>>>>>>>>>>> of “plan nodes deduplication” I would turn it on globally,
>> >>>>>>>>>>>>>>>> not per table, and let the optimiser do all of the work.
>> >>>>>>>>>>>>>>>> 2. We do not have statistics at the moment for any use/not
>> >>>>>>>>>>>>>>>> use cache decision.
>> >>>>>>>>>>>>>>>> 3. Even if we had, I would be veeerryy sceptical whether
>> >>>>>>>>>>>>>>>> such cost-based optimisations would work properly and I
>> >>>>>>>>>>>>>>>> would still insist first on providing an explicit caching
>> >>>>>>>>>>>>>>>> mechanism (`CachedTable cache()`)
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> We are absolutely on the same page here. An explicit
>> >>>>>>>>>>>>>>> cache() method is necessary not only because the optimizer
>> >>>>>>>>>>>>>>> may not be able to make the right decision, but also
>> >>>>>>>>>>>>>>> because of the nature of interactive programming. For
>> >>>>>>>>>>>>>>> example, if users write the following code in the Scala
>> >>>>>>>>>>>>>>> shell:
>> >>>>>>>>>>>>>>> val b = a.select(...)
>> >>>>>>>>>>>>>>> val c = b.select(...)
>> >>>>>>>>>>>>>>> val d = c.select(...).writeToSink(...)
>> >>>>>>>>>>>>>>> tEnv.execute()
>> >>>>>>>>>>>>>>> There is no way the optimizer will know whether b or c will
>> >>>>>>>>>>>>>>> be used in later code, unless users hint explicitly.
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>> At the same time I’m not sure if you have responded to our
>> >>>>>>>>>>>>>>>> objections of `void cache()` being implicit/having side
>> >>>>>>>>>>>>>>>> effects, which me, Jark, Fabian, Till and I think also
>> >>>>>>>>>>>>>>>> Shaoxuan are supporting.
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> Are there any other side effects if we use semantic 3
>> >>>>>>>>>>>>>>> mentioned above?
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> Thanks,
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> Jiangjie (Becket) Qin
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> On Mon, Dec 10, 2018 at 7:54 PM Piotr Nowojski <piotr@data-artisans.com> wrote:
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>> Hi Becket,
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>> Sorry for not responding for a long time.
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>> Regarding case 1.
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>> There wouldn’t be an “a.unCache()” method, but I would
>> >>>>>>>>>>>>>>>> expect only `cachedTableA1.dropCache()`. Dropping
>> >>>>>>>>>>>>>>>> `cachedTableA1` wouldn’t affect `cachedTableA2`. Just as in
>> >>>>>>>>>>>>>>>> any other database, dropping/modifying one independent
>> >>>>>>>>>>>>>>>> table/materialised view does not affect others.
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>> What I meant is that assuming there is already a cached
>> >>>>>>>>>>>>>>>>> table, ideally users need not specify whether the next
>> >>>>>>>>>>>>>>>>> query should read from the cache or use the original DAG.
>> >>>>>>>>>>>>>>>>> This should be decided by the optimizer.
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>> 1. If we want to let the optimiser make decisions whether
>> >>>>>>>>>>>>>>>> to use the cache or not, then why do we need a “void
>> >>>>>>>>>>>>>>>> cache()” method at all? Would it “increase” the chance of
>> >>>>>>>>>>>>>>>> using the cache? That sounds strange. What would be the
>> >>>>>>>>>>>>>>>> mechanism of deciding whether to use the cache or not? If
>> >>>>>>>>>>>>>>>> we want to introduce such kind of automated optimisations
>> >>>>>>>>>>>>>>>> of “plan nodes deduplication” I would turn it on globally,
>> >>>>>>>>>>>>>>>> not per table, and let the optimiser do all of the work.
>> >>>>>>>>>>>>>>>> 2. We do not have statistics at the moment for any use/not
>> >>>>>>>>>>>>>>>> use cache decision.
>> >>>>>>>>>>>>>>>> 3. Even if we had, I would be veeerryy sceptical whether
>> >>>>>>>>>>>>>>>> such cost-based optimisations would work properly and I
>> >>>>>>>>>>>>>>>> would still insist first on providing an explicit caching
>> >>>>>>>>>>>>>>>> mechanism (`CachedTable cache()`)
>> >>>>>>>>>>>>>>>> 4. As Till wrote, having an explicit `CachedTable cache()`
>> >>>>>>>>>>>>>>>> doesn’t contradict future work on automated cost-based
>> >>>>>>>>>>>>>>>> caching.
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>> At the same time I’m not sure if you have responded to our
>> >>>>>>>>>>>>>>>> objections of `void cache()` being implicit/having side
>> >>>>>>>>>>>>>>>> effects, which me, Jark, Fabian, Till and I think also
>> >>>>>>>>>>>>>>>> Shaoxuan are supporting.
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>> Piotrek
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>> On 5 Dec 2018, at 12:42, Becket Qin <becket.qin@gmail.com> wrote:
>> >>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>> Hi Till,
>> >>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>> It is true that after the first job submission, there
>> >>>>>>>>>>>>>>>>> will be no ambiguity in terms of whether a cached table is
>> >>>>>>>>>>>>>>>>> used or not. That is the same for the cache() without
>> >>>>>>>>>>>>>>>>> returning a CachedTable.
>> >>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>> Conceptually one could think of cache() as introducing a
>> >>>>>>>>>>>>>>>>>> caching operator from which you need to consume if you
>> >>>>>>>>>>>>>>>>>> want to benefit from the caching functionality.
>> >>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>> I am thinking a little differently. I think it is a hint
>> >>>>>>>>>>>>>>>>> (as you mentioned later) instead of a new operator. I'd
>> >>>>>>>>>>>>>>>>> like to be careful about the semantic of the API. A hint
>> >>>>>>>>>>>>>>>>> is a property set on an existing operator, but it is not
>> >>>>>>>>>>>>>>>>> itself an operator, as it does not really manipulate the
>> >>>>>>>>>>>>>>>>> data.
>> >>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>> I agree, ideally the optimizer makes this kind of
>> >>>>>>>>>>>>>>>>>> decision about which intermediate result should be
>> >>>>>>>>>>>>>>>>>> cached. But especially when executing ad-hoc queries the
>> >>>>>>>>>>>>>>>>>> user might better know which results need to be cached
>> >>>>>>>>>>>>>>>>>> because Flink might not see the full DAG. In that sense,
>> >>>>>>>>>>>>>>>>>> I would consider the cache() method as a hint for the
>> >>>>>>>>>>>>>>>>>> optimizer. Of course, in the future we might add
>> >>>>>>>>>>>>>>>>>> functionality which tries to automatically cache results
>> >>>>>>>>>>>>>>>>>> (e.g. caching the latest intermediate results until so
>> >>>>>>>>>>>>>>>>>> and so much space is used). But this should hopefully not
>> >>>>>>>>>>>>>>>>>> contradict `CachedTable cache()`.
>> >>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>> I agree that the cache() method is needed for exactly the
>> >>>>>>>>>>>>>>>>> reason you mentioned, i.e. Flink cannot predict what users
>> >>>>>>>>>>>>>>>>> are going to write later, so users need to tell Flink
>> >>>>>>>>>>>>>>>>> explicitly that this table will be used later. What I
>> >>>>>>>>>>>>>>>>> meant is that assuming there is already a cached table,
>> >>>>>>>>>>>>>>>>> ideally users need not specify whether the next query
>> >>>>>>>>>>>>>>>>> should read from the cache or use the original DAG. This
>> >>>>>>>>>>>>>>>>> should be decided by the optimizer.
>> >>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>> To explain the difference between returning / not
>> >>>>>>>>>>>>>>>>> returning a CachedTable, I want to compare the following
>> >>>>>>>>>>>>>>>>> two cases:
>> >>>>>>>>>>>>>>>>> *Case 1: returning a CachedTable*
>> >>>>>>>>>>>>>>>>> b = a.map(...)
>> >>>>>>>>>>>>>>>>> val cachedTableA1 = a.cache()
>> >>>>>>>>>>>>>>>>> val cachedTableA2 = a.cache()
>> >>>>>>>>>>>>>>>>> b.print() // Just to make sure a is cached.
>> >>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>> c = a.filter(...) // User specifies that the original DAG
>> >>>>>>>>>>>>>>>>> is used? Or the optimizer decides whether the DAG or the
>> >>>>>>>>>>>>>>>>> cache should be used?
>> >>>>>>>>>>>>>>>>> d = cachedTableA1.filter() // User specifies that the
>> >>>>>>>>>>>>>>>>> cached table is used.
>> >>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>> a.unCache() // Can cachedTableA still be used afterwards?
>> >>>>>>>>>>>>>>>>> cachedTableA1.uncache() // Can cachedTableA2 still be used?
>> used?
>> >>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>> *Case 2: not returning a CachedTable*
>> >>>>>>>>>>>>>>>>> b = a.map()
>> >>>>>>>>>>>>>>>>> a.cache()
>> >>>>>>>>>>>>>>>>> a.cache() // no-op
>> >>>>>>>>>>>>>>>>> b.print() // Just to make sure a is cached
>> >>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>> c = a.filter(...) // Optimizer decides whether the cache
>> >>>>>>>>>>>>>>>>> or the DAG should be used
>> >>>>>>>>>>>>>>>>> d = a.filter(...) // Optimizer decides whether the cache
>> >>>>>>>>>>>>>>>>> or the DAG should be used
>> >>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>> a.unCache()
>> >>>>>>>>>>>>>>>>> a.unCache() // no-op
>> >>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>> In case 1, semantic wise, the optimizer loses the option
>> >>>>>>>>>>>>>>>>> to choose between the DAG and the cache. And the
>> >>>>>>>>>>>>>>>>> unCache() call becomes tricky.
>> >>>>>>>>>>>>>>>>> In case 2, users do not need to worry about whether the
>> >>>>>>>>>>>>>>>>> cache or the DAG is used. And the unCache() semantic is
>> >>>>>>>>>>>>>>>>> clear. However, the caveat is that users cannot explicitly
>> >>>>>>>>>>>>>>>>> ignore the cache.
>> >>>>>>>>>>>>>>>
>
>

Re: [DISCUSS] Support Interactive Programming in Flink Table API

Posted by Becket Qin <be...@gmail.com>.
Hi Piotr,

I don't think it is feasible to ask every third party library to have a
method signature with CacheService as an argument.

And even that signature does not really solve the problem. Imagine function
foo() looks like the following:

void foo(Table t) {
  ...
  t.cache(); // create cache for t
  ...
  env.getCacheService().releaseCacheFor(t); // release cache for t
}

From function foo()'s perspective, it created a cache and released it.
However, if someone invokes foo() like this:
{
  Table src = ...
  Table t = src.select(...).cache()
  foo(t)
  // t is uncached by foo() already.
}

So the "side effect" still exists.

I think the only safe way to ensure there is no side effect while sharing
the cache is to use ref counting.
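
To make this concrete, here is a minimal sketch of the ref counting idea.
Note that CacheService, CacheHandle and the storage call below are
hypothetical names used only for illustration, not the actual proposed API:

import java.util.HashMap;
import java.util.Map;

class CacheService {
  private final Map<String, Integer> refCounts = new HashMap<>();

  // Called by Table.cache(): increments the ref count and hands out a handle.
  synchronized CacheHandle acquire(String cacheId) {
    refCounts.merge(cacheId, 1, Integer::sum);
    return new CacheHandle(this, cacheId);
  }

  // Called by CacheHandle.release(): decrements the ref count and only
  // physically drops the cache when no handle refers to it anymore.
  synchronized int release(String cacheId) {
    int remaining = refCounts.merge(cacheId, -1, Integer::sum);
    if (remaining <= 0) {
      refCounts.remove(cacheId);
      dropPhysicalCache(cacheId); // hypothetical storage cleanup
      return 0;
    }
    return remaining;
  }

  private void dropPhysicalCache(String cacheId) {
    // Placeholder: delete the materialized intermediate result.
  }
}

class CacheHandle {
  private final CacheService service;
  private final String cacheId;
  private boolean released;

  CacheHandle(CacheService service, String cacheId) {
    this.service = service;
    this.cacheId = cacheId;
  }

  // Releases this handle only; other handles keep the cache alive.
  int release() {
    if (released) {
      throw new IllegalStateException("Handle already released");
    }
    released = true;
    return service.release(cacheId);
  }
}

With this bookkeeping, handle2.release() inside foo() only drops foo()'s own
reference; the caller's handle keeps the cache alive, so the "side effect"
above disappears.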

BTW, the discussion we are having here is exactly the reason that I prefer
option 3. From a technical perspective, option 3 solves all the concerns.
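
For example, with option 3 the interaction from the beginning of this email
becomes safe (a sketch reusing the hypothetical CacheHandle from above):

{
  Table src = ...
  Table t = src.select(...);
  CacheHandle handle1 = t.cache();
  foo(t);            // foo() acquires and releases its own handle internally
  t.select(...);     // cache is still available; handle1 is not released yet
  handle1.release(); // last handle gone, now the cache may be dropped
}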

Thanks,

Jiangjie (Becket) Qin


On Tue, Jan 8, 2019 at 8:41 PM Piotr Nowojski <pi...@da-platform.com> wrote:

> Hi,
>
> I think that introducing ref counting could be confusing and it will be
> error prone, since Flink-table’s users are not used to closing/releasing
> resources. I was more objecting to placing the
> `uncache()`/`dropCache()`/`releaseCache()` (releaseCache sounds best to me)
> as a method in the “Table”. It might not be obvious that it will drop the
> cache for all of the usages of the given table. For example:
>
> public void foo(Table t) {
>  // …
>  t.releaseCache();
> }
>
> public void bar(Table t) {
>   // ...
> }
>
> Table a = …
> val cachedA = a.cache()
>
> foo(cachedA)
> bar(cachedA)
>
>
> My problem with the above example is that the `t.releaseCache()` call is
> not doing the best possible job of communicating to the user that it will
> have side effects on other places, like the `bar(cachedA)` call. Something
> like this might be better (not perfect, but just a bit better):
>
> public void foo(Table t, CacheService cacheService) {
>  // …
>  cacheService.releaseCacheFor(t);
> }
>
> Table a = …
> val cachedA = a.cache()
>
> foo(cachedA, env.getCacheService())
> bar(cachedA)
>
>
> Also from another perspective, maybe placing the `releaseCache()` method in
> Table might not be the best separation of concerns - the `releaseCache()`
> method seems significantly different compared to other existing methods.
>
> Piotrek
>
> > On 8 Jan 2019, at 12:28, Becket Qin <be...@gmail.com> wrote:
> >
> > Hi Piotr,
> >
> > You are right. There might be two intuitive meanings when users call
> > 'a.uncache()', namely:
> > 1. release the resource
> > 2. Do not use cache for the next operation.
> >
> > Case (1) would likely be the dominant use case. So I would suggest we
> > dedicate the uncache() method to case (1), i.e. for resource release, but
> > not for ignoring the cache.
> >
> > For case 2, i.e. explicitly ignoring cache (which is rare), users may use
> > something like 'hint("ignoreCache")'. I think this is better as it is a
> > little weird for users to call `a.uncache()` while they may not even know
> > if the table is cached at all.
> >
> > Assuming we let `uncache()` only release the resource, one possibility is
> > using a ref count to mitigate the side effect. That means a ref count is
> > incremented on `cache()` and decremented on `uncache()`, so `uncache()`
> > does not physically release the resource immediately, but just means the
> > cache could be released.
> > That being said, I am not sure if this is really a better solution as it
> > seems a little counter-intuitive. Maybe calling it releaseCache() helps a
> > little bit?
> >
> > Thanks,
> >
> > Jiangjie (Becket) Qin
> >
> >
> >
> >
> > On Tue, Jan 8, 2019 at 5:36 PM Piotr Nowojski <pi...@da-platform.com>
> wrote:
> >
> >> Hi Becket,
> >>
> >> With `uncache` there are probably two features that we can think about:
> >>
> >> a)
> >>
> >> Physically dropping the cached table from the storage, freeing up the
> >> resources
> >>
> >> b)
> >>
> >> Hinting the optimizer to not cache the reads for the next query/table
> >>
> >> a) Has the issue, as I wrote before, that it seems to be an operation
> >> inherently “flawed” by having side effects.
> >>
> >> I’m not sure how it would be best to express. We could make it work:
> >>
> >> 1. via a method on a Table as you proposed:
> >>
> >> void Table#dropCache()
> >> void Table#uncache()
> >>
> >> 2. Operation on the environment
> >>
> >> env.dropCacheFor(table) // or some other argument that allows user to
> >> identify the desired cache
> >>
> >> 3. Extending (from your original design doc) `setTableService` method to
> >> return some control handle like:
> >>
> >> TableServiceControl setTableService(TableFactory tf,
> >>                     TableProperties properties,
> >>                     TempTableCleanUpCallback cleanUpCallback);
> >>
> >> (TableServiceControl? TableService? TableServiceHandle? CacheService?)
> >>
> >> And having the drop cache method there:
> >>
> >> TableServiceControl#dropCache(table)
> >>
> >> Out of those options, option 1 might have the disadvantage of not making
> >> the user aware that this is a global operation with side effects. Like
> >> the old example of:
> >>
> >> public void foo(Table t) {
> >>  // …
> >>  t.dropCache();
> >> }
> >>
> >> It might not be immediately obvious that `t.dropCache()` is some kind of
> >> global operation, with side effects visible outside of the `foo`
> function.
> >>
> >> On the other hand, both options 2 and 3 might have a greater chance of
> >> catching the user’s attention:
> >>
> >> public void foo(Table t, CacheService cacheService) {
> >>  // …
> >>  cacheService.dropCache(t);
> >> }
> >>
> >> b) could be achieved quite easily:
> >>
> >> Table a = …
> >> val notCached1 = a.doNotCache()
> >> val cachedA = a.cache()
> >> val notCached2 = cachedA.doNotCache() // equivalent of notCached1
> >>
> >> `doNotCache()` would behave similarly to `cache()` - return a copy of
> the
> >> table with removed “cache” hint and/or added “never cache” hint.
> >>
> >> Piotrek
> >>
> >>
> >>> On 8 Jan 2019, at 03:17, Becket Qin <be...@gmail.com> wrote:
> >>>
> >>> Hi Piotr,
> >>>
> >>> Thanks for the proposal and detailed explanation. I like the idea of
> >>> returning a new hinted Table without modifying the original table. This
> >>> also leaves room for users to benefit from future implicit caching.
> >>>
> >>> Just to make sure I get the full picture. In your proposal, there will
> >> also
> >>> be a 'void Table#uncache()' method to release the cache, right?
> >>>
> >>> Thanks,
> >>>
> >>> Jiangjie (Becket) Qin
> >>>
> >>> On Mon, Jan 7, 2019 at 11:50 PM Piotr Nowojski <pi...@da-platform.com>
> >>> wrote:
> >>>
> >>>> Hi Becket!
> >>>>
> >>>> After further thinking I tend to agree that my previous proposal
> >>>> (*Option 2*) indeed might not be ideal if we would in the future
> >>>> introduce automatic caching.
> >>>> However I would like to propose a slightly modified version of it:
> >>>>
> >>>> *Option 4*
> >>>>
> >>>> Adding `cache()` method with following signature:
> >>>>
> >>>> Table Table#cache();
> >>>>
> >>>> Without side effects: the `cache()` call does not modify/change the
> >>>> original Table in any way. It would return a copy of the original
> >>>> table, with an added hint for the optimizer to cache the table, so that
> >>>> future accesses to the returned table might be cached or not.
> >>>>
> >>>> Assuming that we are talking about a setup where we do not have
> >>>> automatic caching enabled (a possible future extension):
> >>>>
> >>>> Example #1:
> >>>>
> >>>> ```
> >>>> Table a = …
> >>>> a.foo() // not cached
> >>>>
> >>>> val cachedTable = a.cache();
> >>>>
> >>>> cachedA.bar() // maybe cached
> >>>> a.foo() // same as before - effectively not cached
> >>>> ```
> >>>>
> >>>> Both the first and the second `a.foo()` operations would behave in
> >>>> exactly the same way. Again, the `a.cache()` call doesn’t affect `a`
> >>>> itself. If `a` was not hinted for caching before `a.cache();`, then
> >>>> both `a.foo()` calls wouldn’t use the cache.
> >>>>
> >>>> The returned `cachedA` would be hinted with the “cache” hint, so
> >>>> `cachedA.bar()` would probably go through the cache (unless the
> >>>> optimiser decides the opposite)
> >>>>
> >>>> Example #2
> >>>>
> >>>> ```
> >>>> Table a = …
> >>>>
> >>>> a.foo() // not cached
> >>>>
> >>>> val b = a.cache();
> >>>>
> >>>> a.foo() // same as before - effectively not cached
> >>>> b.foo() // maybe cached
> >>>>
> >>>> val c = b.cache();
> >>>>
> >>>> a.foo() // same as before - effectively not cached
> >>>> b.foo() // same as before - effectively maybe cached
> >>>> c.foo() // maybe cached
> >>>> ```
> >>>>
> >>>> Now, assuming that we have some future “automatic caching
> >>>> optimisation”:
> >>>>
> >>>> Example #3
> >>>>
> >>>> ```
> >>>> env.enableAutomaticCaching()
> >>>> Table a = …
> >>>>
> >>>> a.foo() // might be cached, depending if `a` was selected to automatic
> >>>> caching
> >>>>
> >>>> val b = a.cache();
> >>>>
> >>>> a.foo() // same as before - might be cached, if `a` was selected to
> >>>> automatic caching
> >>>> b.foo() // maybe cached
> >>>> ```
> >>>>
> >>>>
> >>>> More or less this is the same behaviour as:
> >>>>
> >>>> Table a = ...
> >>>> val b = a.filter(x > 20)
> >>>>
> >>>> calling `filter` hasn’t changed or altered `a` in anyway. If `a` was
> >>>> previously filtered:
> >>>>
> >>>> Table src = …
> >>>> val a = src.filter(x > 20)
> >>>> val b = a.filter(x > 20)
> >>>>
> >>>> then yes, `a` and `b` will be the same. But the point is that neither
> >>>> `filter` nor `cache` changes the original `a` table.
> >>>>
> >>>> One thing is that indeed, the physical drop-cache operation will have
> >>>> side effects and it will in a way mutate the cached table references.
> >>>> But this is, I think, unavoidable in any solution - the same issue as
> >>>> calling `.close()`, or calling a destructor in C++.
> >>>>
> >>>> Piotrek
> >>>>
> >>>>> On 7 Jan 2019, at 10:41, Becket Qin <be...@gmail.com> wrote:
> >>>>>
> >>>>> Happy New Year, everybody!
> >>>>>
> >>>>> I would like to resume this discussion thread. At this point, we have
> >>>>> agreed on the first-step goal of interactive programming. The open
> >>>>> discussion is the exact API. More specifically, what should the
> >>>>> *cache()* method return and what is the semantic. There are three
> >>>>> options:
> >>>>>
> >>>>> *Option 1*
> >>>>> *void cache()* OR *Table cache()*, which returns the original table
> >>>>> for chained calls.
> >>>>> *void uncache() *releases the cache.
> >>>>> *Table.hint(ignoreCache).foo()* to ignore cache for operation foo().
> >>>>>
> >>>>> - Semantic: a.cache() hints that table 'a' should be cached. The
> >>>>> optimizer decides whether the cache will be used or not.
> >>>>> - pros: simple, and no confusion between the CachedTable and the
> >>>>> original table
> >>>>> - cons: A table may be cached / uncached in a method invocation, while
> >>>>> the caller does not know about this.
> >>>>>
> >>>>> *Option 2*
> >>>>> *CachedTable cache()*
> >>>>> *CachedTable *extends *Table *with an additional *uncache()* method
> >>>>>
> >>>>> - Semantic: After *val cachedA = a.cache()*, *cachedA.foo()* will
> >>>>> always use the cache. *a.bar()* will always use the original DAG.
> >>>>> - pros: No potential side effects in method invocation.
> >>>>> - cons: Optimizer has no chance to kick in. Future optimization will
> >>>>> become a behavior change and will require users to change the code.
> >>>>>
> >>>>> *Option 3*
> >>>>> *CacheHandle cache()*
> >>>>> *CacheHandle.release()* to release a cache handle on the table. If
> >>>>> all cache handles are released, the cache could be removed.
> >>>>> *Table.hint(ignoreCache).foo()* to ignore cache for operation foo().
> >>>>>
> >>>>> - Semantic: *a.cache()* hints that 'a' should be cached. The
> >>>>> optimizer decides whether the cache will be used or not. The cache is
> >>>>> released either when no handle is on it anymore, or when the user
> >>>>> program exits.
> >>>>> - pros: No potential side effect in method invocation. No confusion
> >>>>> between the cached table vs. the original table.
> >>>>> - cons: An additional CacheHandle exposed to the users.
> >>>>>
> >>>>>
> >>>>> Personally I prefer option 3 for the following reasons:
> >>>>> 1. It is simple. The vast majority of users would just call
> >>>>> *a.cache()* followed by *a.foo()*, *a.bar()*, etc.
> >>>>> 2. There is no semantic ambiguity or semantic change if we decide to
> >>>>> add implicit caching in the future.
> >>>>> 3. There is no side effect in the method calls.
> >>>>> 4. Admittedly we need to expose one more CacheHandle class to the
> >>>>> users. But it is not that difficult to understand given the similar
> >>>>> well-known concept of a ref count (we can name it CacheReference if
> >>>>> that is easier to understand). So I think it is fine.
> >>>>>
> >>>>>
> >>>>> Thanks,
> >>>>>
> >>>>> Jiangjie (Becket) Qin
> >>>>>
> >>>>>
> >>>>> On Thu, Dec 13, 2018 at 11:23 AM Becket Qin <be...@gmail.com>
> >>>> wrote:
> >>>>>
> >>>>>> Hi Piotrek,
> >>>>>>
> >>>>>> 1. Regarding optimization.
> >>>>>> Sure there are many cases where the decision is hard to make. But
> >>>>>> that does not make it any easier for the users to make those
> >>>>>> decisions. I imagine 99% of the users would just naively use the
> >>>>>> cache. I am not saying we can optimize in all the cases. But as long
> >>>>>> as we agree that at least in certain cases (I would argue most
> >>>>>> cases) the optimizer can do a little better than an average user who
> >>>>>> likely knows little about Flink internals, we should not push the
> >>>>>> burden of optimization to users.
> >>>>>>
> >>>>>> BTW, it seems some of your concerns are related to the
> >>>>>> implementation. I did not mention the implementation of the caching
> >>>>>> service because that should not affect the API semantic. Not sure if
> >>>>>> this helps, but imagine the default implementation has one
> >>>>>> StorageNode service colocated with each TM. It could be running
> >>>>>> within the TM process or in a standalone process, depending on the
> >>>>>> configuration.
> >>>>>>
> >>>>>> The StorageNode uses a memory + spill-to-disk mechanism. The cached
> >>>>>> data will just be written to the local StorageNode service. If the
> >>>>>> StorageNode is running within the TM process, the in-memory cache
> >>>>>> could just be objects so we save some serde cost. A later job
> >>>>>> referring to the cached Table will be scheduled in a locality-aware
> >>>>>> manner, i.e. run in the TM whose peer StorageNode hosts the data.
> >>>>>>
> >>>>>>
> >>>>>> 2. Semantic
> >>>>>> I am not sure why introducing a new hintCache() or
> >>>>>> env.enableAutomaticCaching() method would avoid the consequence of a
> >>>>>> semantic change.
> >>>>>>
> >>>>>> If the auto optimization is not enabled by default, users still need
> >>>>>> to make code changes to all existing programs in order to get the
> >>>>>> benefit.
> >>>>>> If the auto optimization is enabled by default, advanced users who
> >>>>>> know that they really want to use the cache will suddenly lose the
> >>>>>> opportunity to do so, unless they change the code to disable auto
> >>>>>> optimization.
> >>>>>>
> >>>>>>
> >>>>>> 3. Side effects
> >>>>>> The CacheHandle is not only about where to put uncache(). It is to
> >>>>>> solve the implicit performance impact by moving the uncache() to the
> >>>>>> CacheHandle.
> >>>>>>
> >>>>>> - If users want to leverage the cache, they can call a.cache(). After
> >>>>>> that, unless the user explicitly releases that CacheHandle, a.foo()
> >>>>>> will always leverage the cache if needed (the optimizer may choose to
> >>>>>> ignore the cache if that helps accelerate the process). Any function
> >>>>>> call will not be able to release the cache because it does not have
> >>>>>> that CacheHandle.
> >>>>>> - If some advanced users do not want to use the cache at all, they
> >>>>>> will call a.hint(ignoreCache).foo(). This will for sure ignore the
> >>>>>> cache and use the original DAG to process.
> >>>>>>
> >>>>>>
> >>>>>>>> In the vast majority of the cases, users wouldn't really care
> >>>>>>>> whether the cache is used or not.
> >>>>>>>
> >>>>>>> I wouldn’t agree with that, because “caching” (if not purely
> >>>>>>> in-memory caching) would add additional IO costs. It’s similar to
> >>>>>>> saying that users would not see a difference between Spark/Flink
> >>>>>>> and MapReduce (MapReduce writes data to disks after every
> >>>>>>> map/reduce stage).
> >>>>>>
> >>>>>> What I wanted to say is that in most cases, after users call
> >>>>>> cache(), they don't really care about whether auto optimization has
> >>>>>> decided to ignore the cache or not, as long as the program runs
> >>>>>> faster.
> >>>>>>
> >>>>>> Thanks,
> >>>>>>
> >>>>>> Jiangjie (Becket) Qin
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> On Wed, Dec 12, 2018 at 10:50 PM Piotr Nowojski <piotr@data-artisans.com> wrote:
> >>>>>>
> >>>>>>> Hi,
> >>>>>>>
> >>>>>>> Thanks for the quick answer :)
> >>>>>>>
> >>>>>>> Re 1.
> >>>>>>>
> >>>>>>> I generally agree with you, however a couple of points:
> >>>>>>>
> >>>>>>> a) the problem with using automatic caching is bigger, because you
> >>>>>>> will have to decide how to compare IO vs CPU costs, and if you pick
> >>>>>>> wrong, additional IO costs might be enormous or can even crash your
> >>>>>>> system. This is a more difficult problem compared to, let's say,
> >>>>>>> join reordering, where the only issue is to have good statistics
> >>>>>>> that can capture correlations between columns (when you reorder
> >>>>>>> joins, the number of IO operations does not change)
> >>>>>>> c) your example is completely independent of caching.
> >>>>>>>
> >>>>>>> A query like this:
> >>>>>>>
> >>>>>>> src1.filter('f1 > 10).join(src2.filter('f2 < 30), `f1 ===`f2).as('f3,
> >>>>>>> …).filter(‘f3 > 30)
> >>>>>>>
> >>>>>>> Should/could be optimised to an empty result immediately, without
> >>>>>>> the need for any cache/materialisation, and that should work even
> >>>>>>> without any statistics provided by the connector.
> >>>>>>>
> >>>>>>> For me a prerequisite to any serious cost-based optimisations would
> >>>>>>> be some reasonable benchmark coverage of the code (tpch?). Otherwise
> >>>>>>> that would be the equivalent of adding untested code, since we
> >>>>>>> wouldn’t be able to verify our assumptions, like how the writing of
> >>>>>>> 10 000 records to a cache/RocksDB/Kafka/CSV file compares to the
> >>>>>>> joining/filtering/processing of, let's say, 1 000 000 rows.
> >>>>>>>
> >>>>>>> Re 2.
> >>>>>>>
> >>>>>>> I wasn’t proposing to change the semantic later. I was proposing
> >>>>>>> that we start now:
> >>>>>>>
> >>>>>>> CachedTable cachedA = a.cache()
> >>>>>>> cachedA.foo() // Cache is used
> >>>>>>> a.bar() // Original DAG is used
> >>>>>>>
> >>>>>>> And then later we can think about adding for example
> >>>>>>>
> >>>>>>> CachedTable cachedA = a.hintCache()
> >>>>>>> cachedA.foo() // Cache might be used
> >>>>>>> a.bar() // Original DAG is used
> >>>>>>>
> >>>>>>> Or
> >>>>>>>
> >>>>>>> env.enableAutomaticCaching()
> >>>>>>> a.foo() // Cache might be used
> >>>>>>> a.bar() // Cache might be used
> >>>>>>>
> >>>>>>> Or (I would still not like this option):
> >>>>>>>
> >>>>>>> a.hintCache()
> >>>>>>> a.foo() // Cache might be used
> >>>>>>> a.bar() // Cache might be used
> >>>>>>>
> >>>>>>> Or whatever else comes to our mind. Even if we add some automatic
> >>>>>>> caching in the future, keeping explicit (`CachedTable cache()`)
> >>>>>>> caching will still be useful, at least in some cases.
> >>>>>>>
> >>>>>>> Re 3.
> >>>>>>>
> >>>>>>>> 2. The source tables are immutable during one run of batch
> >>>>>>>> processing logic.
> >>>>>>>> 3. The cache is immutable during one run of batch processing logic.
> >>>>>>>
> >>>>>>>> I think assumptions 2 and 3 are by definition what batch processing
> >>>>>>>> means, i.e. the data must be complete before it is processed and
> >>>>>>>> should not change while the processing is running.
> >>>>>>>
> >>>>>>> I agree that this is how batch systems SHOULD be working. However I
> >>>>>>> know from my previous experience that it’s not always the case.
> >>>>>>> Sometimes users are just working on some non-transactional storage,
> >>>>>>> which can be (either constantly or occasionally) modified by some
> >>>>>>> other processes for whatever reason (fixing the data, updating,
> >>>>>>> adding new data, etc).
> >>>>>>>
> >>>>>>> But even if we ignore this point (data immutability), the
> >>>>>>> performance side-effect issue of your proposal remains. If a user
> >>>>>>> calls `void a.cache()` deep inside some private method, it will have
> >>>>>>> implicit side effects on other parts of their program that might not
> >>>>>>> be obvious.
> >>>>>>>
> >>>>>>> Re `CacheHandle`.
> >>>>>>>
> >>>>>>> If I understand it correctly, it only addresses the issue of where
> >>>>>>> to place the `uncache`/`dropCache` method.
> >>>>>>>
> >>>>>>> Btw,
> >>>>>>>
> >>>>>>>> In the vast majority of the cases, users wouldn't really care
> >>>>>>>> whether the cache is used or not.
> >>>>>>>
> >>>>>>> I wouldn’t agree with that, because “caching” (if not purely
> >>>>>>> in-memory caching) would add additional IO costs. It’s similar to
> >>>>>>> saying that users would not see a difference between Spark/Flink
> >>>>>>> and MapReduce (MapReduce writes data to disks after every
> >>>>>>> map/reduce stage).
> >>>>>>>
> >>>>>>> Piotrek
> >>>>>>>
> >>>>>>>> On 12 Dec 2018, at 14:28, Becket Qin <be...@gmail.com>
> wrote:
> >>>>>>>>
> >>>>>>>> Hi Piotrek,
> >>>>>>>>
> >>>>>>>> Not sure if you noticed, but in my last email, I was proposing
> >>>>>>>> `CacheHandle cache()` to avoid the potential side effect due to
> >>>>>>>> function calls.
> >>>>>>>>
> >>>>>>>> Let's look at the disagreement in your reply one by one.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> 1. Optimization chances
> >>>>>>>>
> >>>>>>>> Optimization is never trivial work. This is exactly why we should
> >>>>>>>> not let users do that manually. Databases have done a huge amount
> >>>>>>>> of work in this area. At Alibaba, we rely heavily on many
> >>>>>>>> optimization rules to boost the SQL query performance.
> >>>>>>>>
> >>>>>>>> In your example, if I fill in the filter conditions in a certain
> >>>>>>>> way, the optimization becomes obvious.
> >>>>>>>>
> >>>>>>>> Table src1 = … // read from connector 1
> >>>>>>>> Table src2 = … // read from connector 2
> >>>>>>>>
> >>>>>>>> Table a = src1.filter('f1 > 10).join(src2.filter('f2 < 30), `f1 ===
> >>>>>>>> `f2).as('f3, ...)
> >>>>>>>> a.cache() // write cache to connector 3; when writing the records,
> >>>>>>>> remember the min and max of `f1
> >>>>>>>>
> >>>>>>>> a.filter('f3 > 30) // There is no need to read from any connector,
> >>>>>>>> because `a` does not contain any record whose 'f3 is greater than
> >>>>>>>> 30.
> >>>>>>>> env.execute()
> >>>>>>>> a.select(…)
> >>>>>>>>
> >>>>>>>> BTW, it seems to me that adding some basic statistics is fairly
> >>>>>>>> straightforward and the cost is pretty marginal, if not ignorable.
> >>>>>>>> In fact it is not only needed for optimization, but also for cases
> >>>>>>>> such as ML, where some algorithms may need to decide their
> >>>>>>>> parameters based on the statistics of the data.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> 2. Same API, one semantic now, another semantic later.
> >>>>>>>>
> >>>>>>>> I am trying to understand what is the semantic of the `CachedTable
> >>>>>>>> cache()` you are proposing. IMO, we should avoid designing an API
> >>>>>>>> whose semantic will be changed later. If we have a "CachedTable
> >>>>>>>> cache()" method, then the semantic should be very clearly defined
> >>>>>>>> upfront and not change later. It should never be "right now let's
> >>>>>>>> go with semantic 1, later we can silently change it to semantic 2
> >>>>>>>> or 3". Such a change could result in bad consequences. For example,
> >>>>>>>> let's say we decide to go with semantic 1:
> >>>>>>>>
> >>>>>>>> CachedTable cachedA = a.cache()
> >>>>>>>> cachedA.foo() // Cache is used
> >>>>>>>> a.bar() // Original DAG is used.
> >>>>>>>>
> >>>>>>>> Now the majority of the users would be using cachedA.foo() in
> >>>>>>>> their code. And some advanced users will use a.bar() to explicitly
> >>>>>>>> skip the cache. Later on, we add smart optimization and change the
> >>>>>>>> semantic to semantic 2:
> >>>>>>>>
> >>>>>>>> CachedTable cachedA = a.cache()
> >>>>>>>> cachedA.foo() // Cache is used
> >>>>>>>> a.bar() // Cache MIGHT be used, and Flink may decide to skip the
> >>>>>>>> cache if that is faster.
> >>>>>>>>
> >>>>>>>> Now most of the users who were writing cachedA.foo() will not
> >>>>>>>> benefit from this optimization at all, unless they change their
> >>>>>>>> code to use a.foo() instead. And those advanced users suddenly lose
> >>>>>>>> the option to explicitly ignore the cache unless they change their
> >>>>>>>> code (assuming we care enough to provide something like
> >>>>>>>> hint(useCache)). If we don't define the semantic carefully, our
> >>>>>>>> users will have to change their code again and again, while they
> >>>>>>>> shouldn't have to.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> 3. Side effects.
> >>>>>>>>
> >>>>>>>> Before we talk about side effects, we have to agree on the
> >>>>>>>> assumptions. The assumptions I have are the following:
> >>>>>>>> 1. We are talking about batch processing.
> >>>>>>>> 2. The source tables are immutable during one run of batch
> >>>>>>>> processing logic.
> >>>>>>>> 3. The cache is immutable during one run of batch processing logic.
> >>>>>>>>
> >>>>>>>> I think assumptions 2 and 3 are by definition what batch
> >>>>>>>> processing means, i.e. the data must be complete before it is
> >>>>>>>> processed and should not change while the processing is running.
> >>>>>>>>
> >>>>>>>> As far as I am aware, I don't know of any batch processing system
> >>>>>>>> breaking those assumptions. Even for relational database tables,
> >>>>>>>> where queries can run with concurrent modifications, the necessary
> >>>>>>>> locking is still required to ensure the integrity of the query
> >>>>>>>> result.
> >>>>>>>>
> >>>>>>>> Please let me know if you disagree with the above assumptions. If
> >>>>>>>> you agree with these assumptions, then with the `CacheHandle
> >>>>>>>> cache()` API in my last email, do you still see side effects?
> >>>>>>>>
> >>>>>>>> Thanks,
> >>>>>>>>
> >>>>>>>> Jiangjie (Becket) Qin
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On Wed, Dec 12, 2018 at 7:11 PM Piotr Nowojski <piotr@data-artisans.com> wrote:
> >>>>>>>>
> >>>>>>>>> Hi Becket,
> >>>>>>>>>
> >>>>>>>>>> Regarding the chance of optimization, it might not be that rare.
> >>>>>>>>>> Some very simple statistics could already help in many cases. For
> >>>>>>>>>> example, simply maintaining the max and min of each field can
> >>>>>>>>>> already eliminate some unnecessary table scans (potentially
> >>>>>>>>>> scanning the cached table) if the result is doomed to be empty. A
> >>>>>>>>>> histogram would give even further information. The optimizer
> >>>>>>>>>> could be very careful and only ignore the cache when it is 100%
> >>>>>>>>>> sure doing that is cheaper, e.g. only when a filter on the cache
> >>>>>>>>>> will absolutely return nothing.
> >>>>>>>>>
> >>>>>>>>> I do not see how this might be easy to achieve. It would require
> >>>>>>>>> tons of effort to make it work, and in the end you would still
> >>>>>>>>> have the problem of comparing/trading CPU cycles vs IO. For
> >>>>>>>>> example:
> >>>>>>>>>
> >>>>>>>>> Table src1 = … // read from connector 1
> >>>>>>>>> Table src2 = … // read from connector 2
> >>>>>>>>>
> >>>>>>>>> Table a = src1.filter(…).join(src2.filter(…), …)
> >>>>>>>>> a.cache() // write cache to connector 3
> >>>>>>>>>
> >>>>>>>>> a.filter(…)
> >>>>>>>>> env.execute()
> >>>>>>>>> a.select(…)
> >>>>>>>>>
> >>>>>>>>> The decision whether it’s better to:
> >>>>>>>>> A) read from connector1/connector2, filter/map and join them twice
> >>>>>>>>> B) read from connector1/connector2, filter/map and join them once,
> >>>>>>>>> paying the price of writing to connector 3 and then reading from
> >>>>>>>>> it
> >>>>>>>>>
> >>>>>>>>> is very far from trivial. `a` can end up much larger than `src1`
> >>>>>>>>> and `src2`, writes to connector 3 might be extremely slow, reads
> >>>>>>>>> from connector 3 can be slower compared to reads from connectors
> >>>>>>>>> 1 & 2, … . You really need to have extremely good statistics to
> >>>>>>>>> correctly assess the size of the output, and it would still fail
> >>>>>>>>> many times (correlations etc). And keep in mind that at the moment
> >>>>>>>>> we do not have ANY statistics at all. More than that, it would
> >>>>>>>>> require significantly more testing and setting up some benchmarks
> >>>>>>>>> to make sure that we do not break it with some regressions.
> >>>>>>>>>
> >>>>>>>>> That’s why I’m strongly opposing this idea - at least let’s not
> >>>>>>>>> start with this. If we first start with completely manual/explicit
> >>>>>>>>> caching, without any magic, it would be a significant improvement
> >>>>>>>>> for the users for a fraction of the development cost. After
> >>>>>>>>> implementing that, when we already have all of the working pieces,
> >>>>>>>>> we can start working on some optimisation rules. As I wrote
> >>>>>>>>> before, if we start with
> >>>>>>>>>
> >>>>>>>>> `CachedTable cache()`
> >>>>>>>>>
> >>>>>>>>> we can later work on follow-up stories to make it automatic.
> >>>>>>>>> Despite the fact that I don’t like this implicit/side-effect
> >>>>>>>>> approach with a `void` method, having an explicit `CachedTable
> >>>>>>>>> cache()` wouldn’t even prevent us from later adding a `void
> >>>>>>>>> hintCache()` method, with the exact semantic that you want.
> >>>>>>>>>
> >>>>>>>>> On top of that, I raise again that having an implicit `void
> >>>>>>>>> cache()/hintCache()` has other side effects and problems with
> >>>>>>>>> non-immutable data, and is annoying when used secretly inside
> >>>>>>>>> methods.
> >>>>>>>>>
> >>>>>>>>> An explicit `CachedTable cache()` just looks like a much less
> >>>>>>>>> controversial MVP, and if we decide to go further with this topic,
> >>>>>>>>> it’s not a wasted effort, but just lies on a straight path to more
> >>>>>>>>> advanced/complicated solutions in the future. Are there any
> >>>>>>>>> drawbacks of starting with `CachedTable cache()` that I’m missing?
> >>>>>>>>>
> >>>>>>>>> Piotrek
> >>>>>>>>>
> >>>>>>>>>> On 12 Dec 2018, at 09:30, Jeff Zhang <zj...@gmail.com> wrote:
> >>>>>>>>>>
> >>>>>>>>>> Hi Becket,
> >>>>>>>>>>
> >>>>>>>>>> Introducing CacheHandle seems too complicated. That means users
> >>>>>>>>>> have to maintain the handle properly.
> >>>>>>>>>>
> >>>>>>>>>> And since cache is just a hint for the optimizer, why not just
> >>>>>>>>>> return the Table itself from the cache method. This hint info
> >>>>>>>>>> should be kept in the Table, I believe.
> >>>>>>>>>>
> >>>>>>>>>> So how about adding the methods cache and uncache to Table, both
> >>>>>>>>>> returning Table. Because what cache and uncache do is just add
> >>>>>>>>>> some hint info into the Table.
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> On Wed, Dec 12, 2018 at 11:25 AM, Becket Qin <be...@gmail.com> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> Hi Till and Piotrek,
> >>>>>>>>>>>
> >>>>>>>>>>> Thanks for the clarification. That clears up quite a bit of
> >>>>>>>>>>> confusion. My understanding of how the cache works is the same
> >>>>>>>>>>> as what Till describes, i.e. cache() is a hint to Flink, but it
> >>>>>>>>>>> is not guaranteed that the cache always exists, and it might be
> >>>>>>>>>>> recomputed from its lineage.
> >>>>>>>>>>>
> >>>>>>>>>>>> Is this the core of our disagreement here? That you would like
> >>>>>>>>>>>> this “cache()” to be mostly a hint for the optimiser?
> >>>>>>>>>>>
> >>>>>>>>>>> Semantic wise, yes. That's also why I think materialize() has a
> >>>>>>>>>>> much larger scope than cache(), thus it should be a different
> >>>>>>>>>>> method.
> >>>>>>>>>>>
> >>>>>>>>>>> Regarding the chance of optimization, it might not be that
> >>>>>>>>>>> rare. Some very simple statistics could already help in many
> >>>>>>>>>>> cases. For example, simply maintaining the max and min of each
> >>>>>>>>>>> field can already eliminate some unnecessary table scans
> >>>>>>>>>>> (potentially scanning the cached table) if the result is doomed
> >>>>>>>>>>> to be empty. A histogram would give even further information.
> >>>>>>>>>>> The optimizer could be very careful and only ignore the cache
> >>>>>>>>>>> when it is 100% sure doing that is cheaper, e.g. only when a
> >>>>>>>>>>> filter on the cache will absolutely return nothing.
> >>>>>>>>>>>
> >>>>>>>>>>> Given the above clarification on cache, I would like to revisit
> >>>>>>>>>>> the original "void cache()" proposal and see if we can improve
> >>>>>>>>>>> on top of that.
> >>>>>>>>>>>
> >>>>>>>>>>> What do you think about the following modified interface?
> >>>>>>>>>>>
> >>>>>>>>>>> Table {
> >>>>>>>>>>>   /**
> >>>>>>>>>>>    * This call hints Flink to maintain a cache of this table and
> >>>>>>>>>>>    * leverage it for performance optimization if needed.
> >>>>>>>>>>>    * Note that Flink may still decide not to use the cache if it
> >>>>>>>>>>>    * is cheaper to do so.
> >>>>>>>>>>>    *
> >>>>>>>>>>>    * A CacheHandle will be returned to allow the user to release
> >>>>>>>>>>>    * the cache actively. The cache will be deleted if there are
> >>>>>>>>>>>    * no unreleased cache handles to it. When the
> >>>>>>>>>>>    * TableEnvironment is closed, the cache will also be deleted
> >>>>>>>>>>>    * and all the cache handles will be released.
> >>>>>>>>>>>    *
> >>>>>>>>>>>    * @return a CacheHandle referring to the cache of this table.
> >>>>>>>>>>>    */
> >>>>>>>>>>>   CacheHandle cache();
> >>>>>>>>>>> }
> >>>>>>>>>>>
> >>>>>>>>>>> CacheHandle {
> >>>>>>>>>>>   /**
> >>>>>>>>>>>    * Close the cache handle. This method does not necessarily
> >>>>>>>>>>>    * delete the cache. Instead, it simply decrements the
> >>>>>>>>>>>    * reference counter of the cache. When there is no handle
> >>>>>>>>>>>    * referring to a cache, the cache will be deleted.
> >>>>>>>>>>>    *
> >>>>>>>>>>>    * @return the number of open handles to the cache after this
> >>>>>>>>>>>    *         handle has been released.
> >>>>>>>>>>>    */
> >>>>>>>>>>>   int release();
> >>>>>>>>>>> }
> >>>>>>>>>>>
> >>>>>>>>>>> The rationale behind this interface is the following:
> >>>>>>>>>>> In the vast majority of cases, users wouldn't really care whether
> >>>>>>>>>>> the cache is used or not. So I think the most intuitive way is
> >>>>>>>>>>> letting cache() return nothing, so that nobody needs to worry about
> >>>>>>>>>>> the difference between operations on CachedTables and those on the
> >>>>>>>>>>> "original" tables. This will make maybe 99.9% of the users happy.
> >>>>>>>>>>> There were two concerns raised for this approach:
> >>>>>>>>>>> 1. In some rare cases, users may want to ignore the cache.
> >>>>>>>>>>> 2. A table might be cached/uncached in a third party function while
> >>>>>>>>>>> the caller does not know.
> >>>>>>>>>>>
> >>>>>>>>>>> For the first issue, users can use hint("ignoreCache") to
> >>>>>>>>>>> explicitly ignore the cache.
> >>>>>>>>>>> For the second issue, the above proposal lets cache() return a
> >>>>>>>>>>> CacheHandle whose only method is release(). Different CacheHandles
> >>>>>>>>>>> will refer to the same cache; if a cache no longer has any cache
> >>>>>>>>>>> handle, it will be deleted. This will address the following case:
> >>>>>>>>>>> {
> >>>>>>>>>>>   val handle1 = a.cache()
> >>>>>>>>>>>   process(a)
> >>>>>>>>>>>   a.select(...) // cache is still available; handle1 not released
> >>>>>>>>>>> }
> >>>>>>>>>>>
> >>>>>>>>>>> void process(Table t) {
> >>>>>>>>>>>   val handle2 = t.cache() // new handle to the same cache
> >>>>>>>>>>>   t.select(...) // optimizer decides cache usage
> >>>>>>>>>>>   t.hint("ignoreCache").select(...) // cache is ignored
> >>>>>>>>>>>   handle2.release() // release the handle; the cache may still be
> >>>>>>>>>>>                     // available if there are other open handles
> >>>>>>>>>>>   ...
> >>>>>>>>>>> }
> >>>>>>>>>>>
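> >>>>>>>>>>> Internally, the reference counting could be as simple as the
> >>>>>>>>>>> following rough sketch (hypothetical code, just to illustrate the
> >>>>>>>>>>> intended semantics, not an actual implementation):
> >>>>>>>>>>>
> >>>>>>>>>>> class Cache(val cachedTableId: String) {
> >>>>>>>>>>>   private val openHandles = new java.util.concurrent.atomic.AtomicInteger(0)
> >>>>>>>>>>>
> >>>>>>>>>>>   def newHandle(): CacheHandle = {
> >>>>>>>>>>>     openHandles.incrementAndGet()
> >>>>>>>>>>>     new CacheHandle(this)
> >>>>>>>>>>>   }
> >>>>>>>>>>>
> >>>>>>>>>>>   def release(): Int = {
> >>>>>>>>>>>     val remaining = openHandles.decrementAndGet()
> >>>>>>>>>>>     if (remaining == 0) deletePhysicalCache() // last handle is gone
> >>>>>>>>>>>     remaining
> >>>>>>>>>>>   }
> >>>>>>>>>>>
> >>>>>>>>>>>   private def deletePhysicalCache(): Unit = {
> >>>>>>>>>>>     // drop the materialized intermediate result
> >>>>>>>>>>>   }
> >>>>>>>>>>> }
> >>>>>>>>>>>
> >>>>>>>>>>> class CacheHandle(cache: Cache) {
> >>>>>>>>>>>   def release(): Int = cache.release()
> >>>>>>>>>>> }
> >>>>>>>>>>>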
> >>>>>>>>>>> Does the above modified approach look reasonable to you?
> >>>>>>>>>>>
> >>>>>>>>>>> Cheers,
> >>>>>>>>>>>
> >>>>>>>>>>> Jiangjie (Becket) Qin
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> On Tue, Dec 11, 2018 at 6:44 PM Till Rohrmann <
> >>>> trohrmann@apache.org>
> >>>>>>>>>>> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>> Hi Becket,
> >>>>>>>>>>>>
> >>>>>>>>>>>> I was aiming at semantics similar to 1. I actually thought
> that
> >>>>>>>>> `cache()`
> >>>>>>>>>>>> would tell the system to materialize the intermediate result
> so
> >>>> that
> >>>>>>>>>>>> subsequent queries don't need to reprocess it. This means that
> >> the
> >>>>>>>>> usage
> >>>>>>>>>>> of
> >>>>>>>>>>>> the cached table in this example
> >>>>>>>>>>>>
> >>>>>>>>>>>> {
> >>>>>>>>>>>> val cachedTable = a.cache()
> >>>>>>>>>>>> val b1 = cachedTable.select(…)
> >>>>>>>>>>>> val b2 = cachedTable.foo().select(…)
> >>>>>>>>>>>> val b3 = cachedTable.bar().select(...)
> >>>>>>>>>>>> val c1 = a.select(…)
> >>>>>>>>>>>> val c2 = a.foo().select(…)
> >>>>>>>>>>>> val c3 = a.bar().select(...)
> >>>>>>>>>>>> }
> >>>>>>>>>>>>
> >>>>>>>>>>>> strongly depends on interleaved calls which trigger the execution
> >>>>>>>>>>>> of sub queries. So for example, if there is only a single
> >>>>>>>>>>>> env.execute call at the end of the block, then b1, b2, b3, c1, c2
> >>>>>>>>>>>> and c3 would all be computed by reading directly from the sources
> >>>>>>>>>>>> (given that there is only a single JobGraph). It just happens that
> >>>>>>>>>>>> the result of `a` will be cached such that we skip the processing
> >>>>>>>>>>>> of `a` when there are subsequent queries reading from
> >>>>>>>>>>>> `cachedTable`. If for some reason the system cannot materialize
> >>>>>>>>>>>> the table (e.g. running out of disk space, TTL expired), then it
> >>>>>>>>>>>> could also happen that we need to reprocess `a`. In that sense
> >>>>>>>>>>>> `cachedTable` simply is an identifier for the materialized result
> >>>>>>>>>>>> of `a`, together with the lineage describing how to reprocess it.
> >>>>>>>>>>>>
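> >>>>>>>>>>>> In pseudo code, the read path I have in mind would roughly be
> >>>>>>>>>>>> (just a sketch, all names are made up):
> >>>>>>>>>>>>
> >>>>>>>>>>>> def planScan(cachedTable: CachedTable): PlanNode =
> >>>>>>>>>>>>   if (cacheStorage.isAvailable(cachedTable.cacheId))
> >>>>>>>>>>>>     readIntermediateResult(cachedTable.cacheId) // consume cache
> >>>>>>>>>>>>   else
> >>>>>>>>>>>>     replan(cachedTable.lineage) // cache lost -> recompute `a`
> >>>>>>>>>>>>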
> >>>>>>>>>>>> Cheers,
> >>>>>>>>>>>> Till
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Tue, Dec 11, 2018 at 11:01 AM Piotr Nowojski <
> >>>>>>>>> piotr@data-artisans.com
> >>>>>>>>>>>>
> >>>>>>>>>>>> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>>> Hi Becket,
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> {
> >>>>>>>>>>>>>> val cachedTable = a.cache()
> >>>>>>>>>>>>>> val b = cachedTable.select(...)
> >>>>>>>>>>>>>> val c = a.select(...)
> >>>>>>>>>>>>>> }
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Semantic 1. b uses cachedTable as the user demanded; c uses the
> >>>>>>>>>>>>>> original DAG as the user demanded. In this case, the optimizer
> >>>>>>>>>>>>>> has no chance to optimize.
> >>>>>>>>>>>>>> Semantic 2. b uses cachedTable as the user demanded; c leaves
> >>>>>>>>>>>>>> the optimizer to choose whether the cache or the DAG should be
> >>>>>>>>>>>>>> used. In this case, users lose the option to NOT use the cache.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> As you can see, neither of the options seems perfect. However, I
> >>>>>>>>>>>>>> guess you and Till are proposing the third option:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Semantic 3. b leaves the optimizer to choose whether the cache
> >>>>>>>>>>>>>> or the DAG should be used; c always uses the DAG.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> I am pretty sure that me, Till, Fabian and others were all
> >>>>>>> proposing
> >>>>>>>>>>> and
> >>>>>>>>>>>>> advocating in favour of semantic “1”. No cost based optimiser
> >>>>>>>>> decisions
> >>>>>>>>>>>> at
> >>>>>>>>>>>>> all.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> {
> >>>>>>>>>>>>> val cachedTable = a.cache()
> >>>>>>>>>>>>> val b1 = cachedTable.select(…)
> >>>>>>>>>>>>> val b2 = cachedTable.foo().select(…)
> >>>>>>>>>>>>> val b3 = cachedTable.bar().select(...)
> >>>>>>>>>>>>> val c1 = a.select(…)
> >>>>>>>>>>>>> val c2 = a.foo().select(…)
> >>>>>>>>>>>>> val c3 = a.bar().select(...)
> >>>>>>>>>>>>> }
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> All b1, b2 and b3 are reading from the cache, while c1, c2 and
> >>>>>>>>>>>>> c3 are re-executing the whole plan for “a”.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> In the future we could discuss going one step further,
> >>>>>>>>>>>>> introducing some global optimisation (that can be manually
> >>>>>>>>>>>>> enabled/disabled): deduplicate plan nodes/deduplicate sub
> >>>>>>>>>>>>> queries/re-use sub query results/or whatever we could call it. It
> >>>>>>>>>>>>> could do two things:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> 1. Automatically try to deduplicate fragments of the plan and
> >>>>>>>>>>>>> share the result using CachedTable - in other words,
> >>>>>>>>>>>>> automatically insert `CachedTable cache()` calls.
> >>>>>>>>>>>>> 2. Automatically make the decision to bypass explicit
> >>>>>>>>>>>>> `CachedTable` access (this would be the equivalent of what you
> >>>>>>>>>>>>> described as “semantic 3”).
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> However, as I wrote previously, I have big doubts whether such
> >>>>>>>>>>>>> cost-based optimisation would work (this applies also to
> >>>>>>>>>>>>> “Semantic 2”). I would expect it to do more harm than good in so
> >>>>>>>>>>>>> many cases that it wouldn’t make sense. Even assuming that we
> >>>>>>>>>>>>> calculate statistics perfectly (this ain’t gonna happen), it’s
> >>>>>>>>>>>>> virtually impossible to correctly estimate the exchange rate of
> >>>>>>>>>>>>> CPU cycles vs IO operations, as it changes so much from
> >>>>>>>>>>>>> deployment to deployment.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Is this the core of our disagreement here? That you would like
> >>>>>>>>>>>>> this “cache()” to be mostly a hint for the optimiser?
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Piotrek
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> On 11 Dec 2018, at 06:00, Becket Qin <be...@gmail.com>
> >>>>>>> wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Another potential concern for semantic 3 is that, in the
> >>>>>>>>>>>>>> future, we may add automatic caching to Flink, e.g. caching the
> >>>>>>>>>>>>>> intermediate results at the shuffle boundary. If our semantic is
> >>>>>>>>>>>>>> that referencing the original table means skipping the cache,
> >>>>>>>>>>>>>> those users may not be able to benefit from the implicit cache.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> On Tue, Dec 11, 2018 at 12:10 PM Becket Qin <
> >>>> becket.qin@gmail.com
> >>>>>>>>
> >>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Hi Piotrek,
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Thanks for the reply. Thinking about it again, I might have
> >>>>>>>>>>>>>>> misunderstood your proposal in earlier emails. Returning a
> >>>>>>>>>>>>>>> CachedTable might not be a bad idea.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> I was more concerned about the semantics and their
> >>>>>>>>>>>>>>> intuitiveness when a CachedTable is returned, i.e. if cache()
> >>>>>>>>>>>>>>> returns a CachedTable, what are the semantics in the following
> >>>>>>>>>>>>>>> code:
> >>>>>>>>>>>>>>> {
> >>>>>>>>>>>>>>> val cachedTable = a.cache()
> >>>>>>>>>>>>>>> val b = cachedTable.select(...)
> >>>>>>>>>>>>>>> val c = a.select(...)
> >>>>>>>>>>>>>>> }
> >>>>>>>>>>>>>>> What is the difference between b and c? At first glance, I see
> >>>>>>>>>>>>>>> two options:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Semantic 1. b uses cachedTable as the user demanded; c uses the
> >>>>>>>>>>>>>>> original DAG as the user demanded. In this case, the optimizer
> >>>>>>>>>>>>>>> has no chance to optimize.
> >>>>>>>>>>>>>>> Semantic 2. b uses cachedTable as the user demanded; c leaves
> >>>>>>>>>>>>>>> the optimizer to choose whether the cache or the DAG should be
> >>>>>>>>>>>>>>> used. In this case, users lose the option to NOT use the cache.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> As you can see, neither of the options seems perfect. However,
> >>>>>>>>>>>>>>> I guess you and Till are proposing the third option:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Semantic 3. b leaves the optimizer to choose whether the cache
> >>>>>>>>>>>>>>> or the DAG should be used; c always uses the DAG.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> This does address all the concerns. It is just that, from an
> >>>>>>>>>>>>>>> intuitiveness perspective, I found it a little weird to ask
> >>>>>>>>>>>>>>> users to explicitly use a CachedTable while the optimizer might
> >>>>>>>>>>>>>>> choose to ignore it. That was why I did not think about that
> >>>>>>>>>>>>>>> semantic. But given there is material benefit, I think this
> >>>>>>>>>>>>>>> semantic is acceptable.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> 1. If we want to let the optimiser make the decision whether
> >>>>>>>>>>>>>>>> to use the cache or not, then why do we need the “void
> >>>>>>>>>>>>>>>> cache()” method at all? Would it “increase” the chance of
> >>>>>>>>>>>>>>>> using the cache? That sounds strange. What would be the
> >>>>>>>>>>>>>>>> mechanism for deciding whether to use the cache or not? If we
> >>>>>>>>>>>>>>>> want to introduce such automated optimisations of “plan node
> >>>>>>>>>>>>>>>> deduplication”, I would turn it on globally, not per table,
> >>>>>>>>>>>>>>>> and let the optimiser do all of the work.
> >>>>>>>>>>>>>>>> 2. We do not have statistics at the moment for any use/not-use
> >>>>>>>>>>>>>>>> cache decision.
> >>>>>>>>>>>>>>>> 3. Even if we had, I would be veeerryy sceptical whether such
> >>>>>>>>>>>>>>>> cost based optimisations would work properly and I would still
> >>>>>>>>>>>>>>>> insist first on providing an explicit caching mechanism
> >>>>>>>>>>>>>>>> (`CachedTable cache()`)
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> We are absolutely on the same page here. An explicit cache()
> >>>>>>>>>>>>>>> method is necessary not only because the optimizer may not be
> >>>>>>>>>>>>>>> able to make the right decision, but also because of the nature
> >>>>>>>>>>>>>>> of interactive programming. For example, if users write the
> >>>>>>>>>>>>>>> following code in the Scala shell:
> >>>>>>>>>>>>>>> val b = a.select(...)
> >>>>>>>>>>>>>>> val c = b.select(...)
> >>>>>>>>>>>>>>> val d = c.select(...).writeToSink(...)
> >>>>>>>>>>>>>>> tEnv.execute()
> >>>>>>>>>>>>>>> There is no way the optimizer will know whether b or c will be
> >>>>>>>>>>>>>>> used in later code, unless users hint explicitly.
> >>>>>>>>>>>>>>>
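> >>>>>>>>>>>>>>> With an explicit hint the same session could look like the
> >>>>>>>>>>>>>>> following (just a sketch of the intended usage):
> >>>>>>>>>>>>>>> val b = a.select(...)
> >>>>>>>>>>>>>>> b.cache() // explicit hint: b will be referenced again later
> >>>>>>>>>>>>>>> val c = b.select(...)
> >>>>>>>>>>>>>>> val d = c.select(...).writeToSink(...)
> >>>>>>>>>>>>>>> tEnv.execute() // the first job may also create the cache of b
> >>>>>>>>>>>>>>> val e = b.filter(...) // later queries may read from the cache
> >>>>>>>>>>>>>>>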
> >>>>>>>>>>>>>>>> At the same time, I’m not sure if you have responded to our
> >>>>>>>>>>>>>>>> objections to `void cache()` being implicit/having side
> >>>>>>>>>>>>>>>> effects, which Jark, Fabian, Till, I, and I think also
> >>>>>>>>>>>>>>>> Shaoxuan are supporting.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Are there any other side effects if we use semantic 3
> >>>>>>>>>>>>>>> mentioned above?
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> JIangjie (Becket) Qin
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> On Mon, Dec 10, 2018 at 7:54 PM Piotr Nowojski <
> >>>>>>>>>>>> piotr@data-artisans.com
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Hi Becket,
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Sorry for not responding long time.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Regarding case 1:
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> There wouldn’t be an “a.unCache()” method; I would expect only
> >>>>>>>>>>>>>>>> `cachedTableA1.dropCache()`. Dropping `cachedTableA1` wouldn’t
> >>>>>>>>>>>>>>>> affect `cachedTableA2`. Just as in any other database,
> >>>>>>>>>>>>>>>> dropping/modifying one independent table/materialised view
> >>>>>>>>>>>>>>>> does not affect others.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> What I meant is that assuming there is already a cached
> >>>> table,
> >>>>>>>>>>>> ideally
> >>>>>>>>>>>>>>>> users need
> >>>>>>>>>>>>>>>>> not to specify whether the next query should read from
> the
> >>>>>>> cache
> >>>>>>>>>>> or
> >>>>>>>>>>>>> use
> >>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>> original DAG. This should be decided by the optimizer.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> 1. If we want to let the optimiser make the decision whether
> >>>>>>>>>>>>>>>> to use the cache or not, then why do we need the “void
> >>>>>>>>>>>>>>>> cache()” method at all? Would it “increase” the chance of
> >>>>>>>>>>>>>>>> using the cache? That sounds strange. What would be the
> >>>>>>>>>>>>>>>> mechanism for deciding whether to use the cache or not? If we
> >>>>>>>>>>>>>>>> want to introduce such automated optimisations of “plan node
> >>>>>>>>>>>>>>>> deduplication”, I would turn it on globally, not per table,
> >>>>>>>>>>>>>>>> and let the optimiser do all of the work.
> >>>>>>>>>>>>>>>> 2. We do not have statistics at the moment for any use/not-use
> >>>>>>>>>>>>>>>> cache decision.
> >>>>>>>>>>>>>>>> 3. Even if we had, I would be veeerryy sceptical whether such
> >>>>>>>>>>>>>>>> cost based optimisations would work properly and I would still
> >>>>>>>>>>>>>>>> insist first on providing an explicit caching mechanism
> >>>>>>>>>>>>>>>> (`CachedTable cache()`)
> >>>>>>>>>>>>>>>> 4. As Till wrote, having an explicit `CachedTable cache()`
> >>>>>>>>>>>>>>>> doesn’t contradict future work on automated cost based
> >>>>>>>>>>>>>>>> caching.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> At the same time, I’m not sure if you have responded to our
> >>>>>>>>>>>>>>>> objections to `void cache()` being implicit/having side
> >>>>>>>>>>>>>>>> effects, which Jark, Fabian, Till, I, and I think also
> >>>>>>>>>>>>>>>> Shaoxuan are supporting.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Piotrek
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> On 5 Dec 2018, at 12:42, Becket Qin <
> becket.qin@gmail.com>
> >>>>>>> wrote:
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Hi Till,
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> It is true that after the first job submission, there
> will
> >> be
> >>>>>>> no
> >>>>>>>>>>>>>>>> ambiguity
> >>>>>>>>>>>>>>>>> in terms of whether a cached table is used or not. That
> is
> >>>> the
> >>>>>>>>>>> same
> >>>>>>>>>>>>> for
> >>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>> cache() without returning a CachedTable.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Conceptually one could think of cache() as introducing a
> >>>>>>> caching
> >>>>>>>>>>>>>>>> operator
> >>>>>>>>>>>>>>>>>> from which you need to consume from if you want to
> benefit
> >>>>>>> from
> >>>>>>>>>>> the
> >>>>>>>>>>>>>>>> caching
> >>>>>>>>>>>>>>>>>> functionality.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> I am thinking a little differently. I think it is a hint (as
> >>>>>>>>>>>>>>>>> you mentioned later) instead of a new operator. I'd like to
> >>>>>>>>>>>>>>>>> be careful about the semantics of the API. A hint is a
> >>>>>>>>>>>>>>>>> property set on an existing operator, but is not itself an
> >>>>>>>>>>>>>>>>> operator, as it does not really manipulate the data.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> I agree, ideally the optimizer makes this kind of
> decision
> >>>>>>> which
> >>>>>>>>>>>>>>>>>> intermediate result should be cached. But especially
> when
> >>>>>>>>>>> executing
> >>>>>>>>>>>>>>>> ad-hoc
> >>>>>>>>>>>>>>>>>> queries the user might better know which results need to
> >> be
> >>>>>>>>>>> cached
> >>>>>>>>>>>>>>>> because
> >>>>>>>>>>>>>>>>>> Flink might not see the full DAG. In that sense, I would
> >>>>>>> consider
> >>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>> cache() method as a hint for the optimizer. Of course,
> in
> >>>> the
> >>>>>>>>>>>> future
> >>>>>>>>>>>>> we
> >>>>>>>>>>>>>>>>>> might add functionality which tries to automatically
> cache
> >>>>>>>>>>> results
> >>>>>>>>>>>>>>>> (e.g.
> >>>>>>>>>>>>>>>>>> caching the latest intermediate results until so and so
> >> much
> >>>>>>>>>>> space
> >>>>>>>>>>>> is
> >>>>>>>>>>>>>>>>>> used). But this should hopefully not contradict with
> >>>>>>> `CachedTable
> >>>>>>>>>>>>>>>> cache()`.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> I agree that the cache() method is needed for exactly the
> >>>>>>>>>>>>>>>>> reason you mentioned, i.e. Flink cannot predict what users
> >>>>>>>>>>>>>>>>> are going to write later, so users need to tell Flink
> >>>>>>>>>>>>>>>>> explicitly that this table will be used later. What I meant
> >>>>>>>>>>>>>>>>> is that, assuming there is already a cached table, ideally
> >>>>>>>>>>>>>>>>> users need not specify whether the next query should read
> >>>>>>>>>>>>>>>>> from the cache or use the original DAG. This should be
> >>>>>>>>>>>>>>>>> decided by the optimizer.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> To explain the difference between returning / not
> >> returning a
> >>>>>>>>>>>>>>>> CachedTable,
> >>>>>>>>>>>>>>>>> I want compare the following two case:
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> *Case 1:  returning a CachedTable*
> >>>>>>>>>>>>>>>>> b = a.map(...)
> >>>>>>>>>>>>>>>>> val cachedTableA1 = a.cache()
> >>>>>>>>>>>>>>>>> val cachedTableA2 = a.cache()
> >>>>>>>>>>>>>>>>> b.print() // Just to make sure a is cached.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> c = a.filter(...) // User specify that the original DAG
> is
> >>>>>>> used?
> >>>>>>>>>>> Or
> >>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>> optimizer decides whether DAG or cache should be used?
> >>>>>>>>>>>>>>>>> d = cachedTableA1.filter() // User specify that the
> cached
> >>>>>>> table
> >>>>>>>>>>> is
> >>>>>>>>>>>>>>>> used.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> a.unCache() // Can cachedTableA still be used afterwards?
> >>>>>>>>>>>>>>>>> cachedTableA1.uncache() // Can cachedTableA2 still be
> used?
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> *Case 2: not returning a CachedTable*
> >>>>>>>>>>>>>>>>> b = a.map()
> >>>>>>>>>>>>>>>>> a.cache()
> >>>>>>>>>>>>>>>>> a.cache() // no-op
> >>>>>>>>>>>>>>>>> b.print() // Just to make sure a is cached
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> c = a.filter(...) // Optimizer decides whether the cache
> or
> >>>> DAG
> >>>>>>>>>>>> should
> >>>>>>>>>>>>>>>> be
> >>>>>>>>>>>>>>>>> used
> >>>>>>>>>>>>>>>>> d = a.filter(...) // Optimizer decides whether the cache
> or
> >>>> DAG
> >>>>>>>>>>>> should
> >>>>>>>>>>>>>>>> be
> >>>>>>>>>>>>>>>>> used
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> a.unCache()
> >>>>>>>>>>>>>>>>> a.unCache() // no-op
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> In case 1, semantics-wise, the optimizer loses the option to
> >>>>>>>>>>>>>>>>> choose between the DAG and the cache. And the unCache() call
> >>>>>>>>>>>>>>>>> becomes tricky.
> >>>>>>>>>>>>>>>>> In case 2, users do not need to worry about whether the cache
> >>>>>>>>>>>>>>>>> or the DAG is used, and the unCache() semantics are clear.
> >>>>>>>>>>>>>>>>> However, the caveat is that users cannot explicitly ignore
> >>>>>>>>>>>>>>>>> the cache.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> In order to address the issues mentioned in case 2, and
> >>>>>>>>>>>>>>>>> inspired by the discussion so far, I am thinking about using
> >>>>>>>>>>>>>>>>> a hint to allow users to explicitly ignore the cache.
> >>>>>>>>>>>>>>>>> Although we do not have hints yet, we probably should have
> >>>>>>>>>>>>>>>>> them. So the code becomes:
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> *Case 3: returning this table*
> >>>>>>>>>>>>>>>>> b = a.map()
> >>>>>>>>>>>>>>>>> a.cache()
> >>>>>>>>>>>>>>>>> a.cache() // no-op
> >>>>>>>>>>>>>>>>> b.print() // Just to make sure a is cached
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> c = a.filter(...) // Optimizer decides whether the cache
> or
> >>>> DAG
> >>>>>>>>>>>> should
> >>>>>>>>>>>>>>>> be
> >>>>>>>>>>>>>>>>> used
> >>>>>>>>>>>>>>>>> d = a.hint("ignoreCache").filter(...) // DAG will be used
> >>>>>>> instead
> >>>>>>>>>>> of
> >>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>> cache.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> a.unCache()
> >>>>>>>>>>>>>>>>> a.unCache() // no-op
> >>>>>>>>>>>>>>>>>
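> >>>>>>>>>>>>>>>>> Under the hood, the hint could be tracked roughly like this
> >>>>>>>>>>>>>>>>> (a hypothetical sketch; hint() is not an existing API):
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> class Table(private val hints: Set[String] = Set.empty) {
> >>>>>>>>>>>>>>>>>   def hint(h: String): Table = new Table(hints + h)
> >>>>>>>>>>>>>>>>>   def ignoresCache: Boolean = hints.contains("ignoreCache")
> >>>>>>>>>>>>>>>>>   // select/filter/... would propagate the hints as needed
> >>>>>>>>>>>>>>>>> }
> >>>>>>>>>>>>>>>>> // The planner would check ignoresCache when deciding whether
> >>>>>>>>>>>>>>>>> // to substitute the cached result for the original sub-plan.
> >>>>>>>>>>>>>>>>>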
> >>>>>>>>>>>>>>>>> We could also let cache() return this table to allow
> >> chained
> >>>>>>>>>>> method
> >>>>>>>>>>>>>>>> calls.
> >>>>>>>>>>>>>>>>> Do you think this API addresses the concerns?
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Jiangjie (Becket) Qin
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> On Wed, Dec 5, 2018 at 10:55 AM Jark Wu <
> imjark@gmail.com>
> >>>>>>> wrote:
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Hi,
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> All the recent discussions are focused on whether there is a
> >>>>>>>>>>>>>>>>>> problem if cache() does not return a Table. It seems that
> >>>>>>>>>>>>>>>>>> returning a Table explicitly is clearer (and safer?).
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> So are there any problems if cache() returns a Table?
> >>>>>>>>>>>>>>>>>> @Becket
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Best,
> >>>>>>>>>>>>>>>>>> Jark
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> On Tue, 4 Dec 2018 at 22:27, Till Rohrmann <
> >>>>>>> trohrmann@apache.org
> >>>>>>>>>>>>
> >>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> It's true that b, c, d and e will all read from the
> >>>> original
> >>>>>>> DAG
> >>>>>>>>>>>>> that
> >>>>>>>>>>>>>>>>>>> generates a. But all subsequent operators (when running
> >>>>>>> multiple
> >>>>>>>>>>>>>>>> queries)
> >>>>>>>>>>>>>>>>>>> which reference cachedTableA should not need to
> reproduce
> >>>> `a`
> >>>>>>>>>>> but
> >>>>>>>>>>>>>>>>>> directly
> >>>>>>>>>>>>>>>>>>> consume the intermediate result.
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Conceptually one could think of cache() as introducing
> a
> >>>>>>> caching
> >>>>>>>>>>>>>>>> operator
> >>>>>>>>>>>>>>>>>>> from which you need to consume from if you want to
> >> benefit
> >>>>>>> from
> >>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>> caching
> >>>>>>>>>>>>>>>>>>> functionality.
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> I agree, ideally the optimizer makes this kind of
> >> decision
> >>>>>>> which
> >>>>>>>>>>>>>>>>>>> intermediate result should be cached. But especially
> when
> >>>>>>>>>>>> executing
> >>>>>>>>>>>>>>>>>> ad-hoc
> >>>>>>>>>>>>>>>>>>> queries the user might better know which results need
> to
> >> be
> >>>>>>>>>>> cached
> >>>>>>>>>>>>>>>>>> because
> >>>>>>>>>>>>>>>>>>> Flink might not see the full DAG. In that sense, I
> would
> >>>>>>>>>>> consider
> >>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>> cache() method as a hint for the optimizer. Of course,
> in
> >>>> the
> >>>>>>>>>>>> future
> >>>>>>>>>>>>>>>> we
> >>>>>>>>>>>>>>>>>>> might add functionality which tries to automatically
> >> cache
> >>>>>>>>>>> results
> >>>>>>>>>>>>>>>> (e.g.
> >>>>>>>>>>>>>>>>>>> caching the latest intermediate results until so and so
> >>>> much
> >>>>>>>>>>> space
> >>>>>>>>>>>>> is
> >>>>>>>>>>>>>>>>>>> used). But this should hopefully not contradict with
> >>>>>>>>>>> `CachedTable
> >>>>>>>>>>>>>>>>>> cache()`.
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Cheers,
> >>>>>>>>>>>>>>>>>>> Till
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> On Tue, Dec 4, 2018 at 2:33 PM Becket Qin <
> >>>>>>> becket.qin@gmail.com
> >>>>>>>>>>>>
> >>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> Hi Till,
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> Thanks for the clarification. I am still a little
> >>>> confused.
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> If cache() returns a CachedTable, the example might
> >>>> become:
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> b = a.map(...)
> >>>>>>>>>>>>>>>>>>>> c = a.map(...)
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> cachedTableA = a.cache()
> >>>>>>>>>>>>>>>>>>>> d = cachedTableA.map(...)
> >>>>>>>>>>>>>>>>>>>> e = a.map()
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> In the above case, if cache() is lazily evaluated, b, c, d
> >>>>>>>>>>>>>>>>>>>> and e are all going to be reading from the original DAG
> >>>>>>>>>>>>>>>>>>>> that generates a. But with a naive expectation, d should
> >>>>>>>>>>>>>>>>>>>> be reading from the cache. This does not seem to solve the
> >>>>>>>>>>>>>>>>>>>> potential confusion you raised, right?
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> Just to be clear, my understanding is all based on the
> >>>>>>>>>>>>>>>>>>>> assumption that the tables are immutable. Therefore, after
> >>>>>>>>>>>>>>>>>>>> a.cache(), the *cachedTableA* and the original table *a*
> >>>>>>>>>>>>>>>>>>>> should be completely interchangeable.
> >>>>>>>>>>>>>>>>>>>> That said, I think a valid argument is optimization. There
> >>>>>>>>>>>>>>>>>>>> are indeed cases where reading from the original DAG could
> >>>>>>>>>>>>>>>>>>>> be faster than reading from the cache, as in the following
> >>>>>>>>>>>>>>>>>>>> example:
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> a.filter(f1 > 100)
> >>>>>>>>>>>>>>>>>>>> a.cache()
> >>>>>>>>>>>>>>>>>>>> b = a.filter(f1 < 100)
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> Ideally the optimizer should be intelligent enough to
> >>>>>>>>>>>>>>>>>>>> decide which way is faster, without user intervention. In
> >>>>>>>>>>>>>>>>>>>> this case, it will identify that b would just be an empty
> >>>>>>>>>>>>>>>>>>>> table, and thus skip reading from the cache completely.
> >>>>>>>>>>>>>>>>>>>> But I agree that returning a CachedTable would give users
> >>>>>>>>>>>>>>>>>>>> control over when to use the cache, even though I still
> >>>>>>>>>>>>>>>>>>>> feel that letting the optimizer handle this is a better
> >>>>>>>>>>>>>>>>>>>> option in the long run.
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> Jiangjie (Becket) Qin
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> On Tue, Dec 4, 2018 at 6:51 PM Till Rohrmann <
> >>>>>>>>>>>> trohrmann@apache.org
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> Yes you are right Becket that it still depends on the
> >>>>>>> actual
> >>>>>>>>>>>>>>>>>> execution
> >>>>>>>>>>>>>>>>>>> of
> >>>>>>>>>>>>>>>>>>>>> the job whether a consumer reads from a cached result
> >> or
> >>>>>>> not.
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> My point was actually about the properties of a
> (cached
> >>>> vs.
> >>>>>>>>>>>>>>>>>> non-cached)
> >>>>>>>>>>>>>>>>>>>> and
> >>>>>>>>>>>>>>>>>>>>> not about the execution. I would not make cache
> trigger
> >>>> the
> >>>>>>>>>>>>>>>> execution
> >>>>>>>>>>>>>>>>>>> of
> >>>>>>>>>>>>>>>>>>>>> the job because one loses some flexibility by eagerly
> >>>>>>>>>>> triggering
> >>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>>>> execution.
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> I tried to argue for an explicit CachedTable which is
> >>>>>>> returned
> >>>>>>>>>>>> by
> >>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>>>> cache() method like Piotr did in order to make the
> API
> >>>> more
> >>>>>>>>>>>>>>>> explicit.
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> Cheers,
> >>>>>>>>>>>>>>>>>>>>> Till
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> On Mon, Dec 3, 2018 at 4:23 PM Becket Qin <
> >>>>>>>>>>> becket.qin@gmail.com
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> Hi Till,
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> That is a good example. Just a minor correction: in this
> >>>>>>>>>>>>>>>>>>>>>> case, b, c and d will all consume from a non-cached a.
> >>>>>>>>>>>>>>>>>>>>>> This is because the cache will only be created on the
> >>>>>>>>>>>>>>>>>>>>>> very first job submission that generates the table to be
> >>>>>>>>>>>>>>>>>>>>>> cached.
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> If I understand correctly, this example is about whether
> >>>>>>>>>>>>>>>>>>>>>> the .cache() method should be eagerly evaluated or
> >>>>>>>>>>>>>>>>>>>>>> lazily evaluated. In other words, if the cache() method
> >>>>>>>>>>>>>>>>>>>>>> actually triggered a job that creates the cache, there
> >>>>>>>>>>>>>>>>>>>>>> would be no such confusion. Is that right?
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> In the example, although d will not consume from the
> >>>>>>>>>>>>>>>>>>>>>> cached Table while it looks like it is supposed to, from
> >>>>>>>>>>>>>>>>>>>>>> a correctness perspective the code will still return the
> >>>>>>>>>>>>>>>>>>>>>> correct result, assuming that tables are immutable.
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> Personally I feel it is OK, because users probably won't
> >>>>>>>>>>>>>>>>>>>>>> really worry about whether the table is cached or not.
> >>>>>>>>>>>>>>>>>>>>>> And a lazy cache could avoid some unnecessary caching if
> >>>>>>>>>>>>>>>>>>>>>> a cached table is never created in the user application.
> >>>>>>>>>>>>>>>>>>>>>> But I am not opposed to eager evaluation of the cache.
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> Jiangjie (Becket) Qin
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> On Mon, Dec 3, 2018 at 10:01 PM Till Rohrmann <
> >>>>>>>>>>>>>>>>>> trohrmann@apache.org>
> >>>>>>>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> Another argument for Piotr's point is that lazily
> >>>>>>>>>>>>>>>>>>>>>>> changing the properties of a node affects all
> >>>>>>>>>>>>>>>>>>>>>>> downstream consumers, but does not necessarily have to
> >>>>>>>>>>>>>>>>>>>>>>> happen before these consumers are defined. From a
> >>>>>>>>>>>>>>>>>>>>>>> user's perspective this can be quite confusing:
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> b = a.map(...)
> >>>>>>>>>>>>>>>>>>>>>>> c = a.map(...)
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> a.cache()
> >>>>>>>>>>>>>>>>>>>>>>> d = a.map(...)
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> now b, c and d will consume from a cached operator. In
> >>>>>>>>>>>>>>>>>>>>>>> this case, the user would most likely expect that only
> >>>>>>>>>>>>>>>>>>>>>>> d reads from a cached result.
> >>>>>>>>>>>>>>>>>>>>>>>
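> >>>>>>>>>>>>>>>>>>>>>>> With an explicit handle this ambiguity would go away
> >>>>>>>>>>>>>>>>>>>>>>> (just a sketch of the alternative):
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> b = a.map(...)
> >>>>>>>>>>>>>>>>>>>>>>> c = a.map(...)
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> cachedA = a.cache()
> >>>>>>>>>>>>>>>>>>>>>>> d = cachedA.map(...) // only d reads the cached result
> >>>>>>>>>>>>>>>>>>>>>>>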
> >>>>>>>>>>>>>>>>>>>>>>> Cheers,
> >>>>>>>>>>>>>>>>>>>>>>> Till
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> On Mon, Dec 3, 2018 at 11:32 AM Piotr Nowojski <
> >>>>>>>>>>>>>>>>>>>>> piotr@data-artisans.com>
> >>>>>>>>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> Hey Shaoxuan and Becket,
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>> Can you explain a bit more one what are the side
> >>>>>>> effects?
> >>>>>>>>>>> So
> >>>>>>>>>>>>>>>>>>> far
> >>>>>>>>>>>>>>>>>>>> my
> >>>>>>>>>>>>>>>>>>>>>>>>> understanding is that such side effects only
> exist
> >>>> if a
> >>>>>>>>>>>> table
> >>>>>>>>>>>>>>>>>>> is
> >>>>>>>>>>>>>>>>>>>>>>> mutable.
> >>>>>>>>>>>>>>>>>>>>>>>>> Is that the case?
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> Not only that. There are also performance implications,
> >>>>>>>>>>>>>>>>>>>>>>>> and those are another implicit side effect of using
> >>>>>>>>>>>>>>>>>>>>>>>> `void cache()`. As I wrote before, reading from the
> >>>>>>>>>>>>>>>>>>>>>>>> cache might not always be desirable, thus it can cause
> >>>>>>>>>>>>>>>>>>>>>>>> performance degradation, and I’m fine with that - the
> >>>>>>>>>>>>>>>>>>>>>>>> user's or optimiser’s choice. What I do not like is
> >>>>>>>>>>>>>>>>>>>>>>>> that this implicit side effect can manifest in a
> >>>>>>>>>>>>>>>>>>>>>>>> completely different part of the code that wasn’t
> >>>>>>>>>>>>>>>>>>>>>>>> touched by the user while he was adding the `void
> >>>>>>>>>>>>>>>>>>>>>>>> cache()` call somewhere else. And even if caching
> >>>>>>>>>>>>>>>>>>>>>>>> improves performance, it’s still a side effect of
> >>>>>>>>>>>>>>>>>>>>>>>> `void cache()`. Almost by definition, `void` methods
> >>>>>>>>>>>>>>>>>>>>>>>> have only side effects. As I wrote before, there are a
> >>>>>>>>>>>>>>>>>>>>>>>> couple of scenarios where this might be undesirable
> >>>>>>>>>>>>>>>>>>>>>>>> and/or unexpected, for example:
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> 1.
> >>>>>>>>>>>>>>>>>>>>>>>> Table b = …;
> >>>>>>>>>>>>>>>>>>>>>>>> b.cache()
> >>>>>>>>>>>>>>>>>>>>>>>> x = b.join(…)
> >>>>>>>>>>>>>>>>>>>>>>>> y = b.count()
> >>>>>>>>>>>>>>>>>>>>>>>> // ...
> >>>>>>>>>>>>>>>>>>>>>>>> // 100
> >>>>>>>>>>>>>>>>>>>>>>>> // hundred
> >>>>>>>>>>>>>>>>>>>>>>>> // lines
> >>>>>>>>>>>>>>>>>>>>>>>> // of
> >>>>>>>>>>>>>>>>>>>>>>>> // code
> >>>>>>>>>>>>>>>>>>>>>>>> // later
> >>>>>>>>>>>>>>>>>>>>>>>> z = b.filter(…).groupBy(…) // this might be even
> >>>> hidden
> >>>>>>> in
> >>>>>>>>>>> a
> >>>>>>>>>>>>>>>>>>>>> different
> >>>>>>>>>>>>>>>>>>>>>>>> method/file/package/dependency
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> 2.
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> Table b = ...
> >>>>>>>>>>>>>>>>>>>>>>>> If (some_condition) {
> >>>>>>>>>>>>>>>>>>>>>>>> foo(b)
> >>>>>>>>>>>>>>>>>>>>>>>> }
> >>>>>>>>>>>>>>>>>>>>>>>> Else {
> >>>>>>>>>>>>>>>>>>>>>>>> bar(b)
> >>>>>>>>>>>>>>>>>>>>>>>> }
> >>>>>>>>>>>>>>>>>>>>>>>> z = b.filter(…).groupBy(…)
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> Void foo(Table b) {
> >>>>>>>>>>>>>>>>>>>>>>>> b.cache()
> >>>>>>>>>>>>>>>>>>>>>>>> // do something with b
> >>>>>>>>>>>>>>>>>>>>>>>> }
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> In both above examples, `b.cache()` will implicitly
> >>>>>>>>>>>>>>>>>>>>>>>> affect (both the semantics of the program, in case of
> >>>>>>>>>>>>>>>>>>>>>>>> mutable sources, and its performance)
> >>>>>>>>>>>>>>>>>>>>>>>> `z = b.filter(…).groupBy(…)`, which might be far from
> >>>>>>>>>>>>>>>>>>>>>>>> obvious.
> >>>>>>>>>>>>>>>>>>>>>>>>
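> >>>>>>>>>>>>>>>>>>>>>>>> With an explicit handle, the effect would stay local
> >>>>>>>>>>>>>>>>>>>>>>>> (sketch):
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> Void foo(Table b) {
> >>>>>>>>>>>>>>>>>>>>>>>>   CachedTable cached = b.cache()
> >>>>>>>>>>>>>>>>>>>>>>>>   // work with `cached` here; the caller’s `b`, and
> >>>>>>>>>>>>>>>>>>>>>>>>   // thus `z = b.filter(…).groupBy(…)`, stay untouched
> >>>>>>>>>>>>>>>>>>>>>>>> }
> >>>>>>>>>>>>>>>>>>>>>>>>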
> >>>>>>>>>>>>>>>>>>>>>>>> On top of that, there is still this argument of mine
> >>>>>>>>>>>>>>>>>>>>>>>> that having a `MaterializedTable` or `CachedTable`
> >>>>>>>>>>>>>>>>>>>>>>>> handle is more flexible for us for the future and for
> >>>>>>>>>>>>>>>>>>>>>>>> the user (as a manual option to bypass cache reads).
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>> But Jiangjie is correct, the source table in batching
> >>>>>>>>>>>>>>>>>>>>>>>>> should be immutable. It is the user’s responsibility
> >>>>>>>>>>>>>>>>>>>>>>>>> to ensure it; otherwise even a regular failover may
> >>>>>>>>>>>>>>>>>>>>>>>>> lead to inconsistent results.
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> Yes, I agree that’s what a perfect world/good
> >>>>>>>>>>>>>>>>>>>>>>>> deployment should be. But it often isn’t, and while
> >>>>>>>>>>>>>>>>>>>>>>>> I’m not trying to fix this (since the proper fix is to
> >>>>>>>>>>>>>>>>>>>>>>>> support transactions), I’m just trying to minimise
> >>>>>>>>>>>>>>>>>>>>>>>> confusion for the users that are not fully aware of
> >>>>>>>>>>>>>>>>>>>>>>>> what’s going on and operate in a less than perfect
> >>>>>>>>>>>>>>>>>>>>>>>> setup. And if something bites them after adding a
> >>>>>>>>>>>>>>>>>>>>>>>> `b.cache()` call, I want to make sure that they at
> >>>>>>>>>>>>>>>>>>>>>>>> least know all of the places that adding this line can
> >>>>>>>>>>>>>>>>>>>>>>>> affect.
> >>>>>>>>>>>>>>>>>>>>>>>> Thanks, Piotrek
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>> On 1 Dec 2018, at 15:39, Becket Qin <
> >>>>>>> becket.qin@gmail.com
> >>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>> Hi Piotrek,
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>> Thanks again for the clarification. Some more
> >> replies
> >>>>>>> are
> >>>>>>>>>>>>>>>>>>>>> following.
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>> But keep in mind that `.cache()` will/might not
> >> only
> >>>> be
> >>>>>>>>>>> used
> >>>>>>>>>>>>>>>>>> in
> >>>>>>>>>>>>>>>>>>>>>>>> interactive
> >>>>>>>>>>>>>>>>>>>>>>>>>> programming and not only in batching.
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>> It is true. Actually, in stream processing, cache()
> >>>>>>>>>>>>>>>>>>>>>>>>> has the same semantics as in batch processing. The
> >>>>>>>>>>>>>>>>>>>>>>>>> semantics are the following:
> >>>>>>>>>>>>>>>>>>>>>>>>> For a table created via a series of computations,
> >>>>>>>>>>>>>>>>>>>>>>>>> save that table for later reference to avoid running
> >>>>>>>>>>>>>>>>>>>>>>>>> the computation logic to regenerate the table. Once
> >>>>>>>>>>>>>>>>>>>>>>>>> the application exits, drop all the caches.
> >>>>>>>>>>>>>>>>>>>>>>>>> These semantics are the same for both batch and
> >>>>>>>>>>>>>>>>>>>>>>>>> stream processing. The difference is that stream
> >>>>>>>>>>>>>>>>>>>>>>>>> applications will only run once, as they are long
> >>>>>>>>>>>>>>>>>>>>>>>>> running, while batch applications may be run multiple
> >>>>>>>>>>>>>>>>>>>>>>>>> times, hence the cache may be created and dropped
> >>>>>>>>>>>>>>>>>>>>>>>>> each time the application runs.
> >>>>>>>>>>>>>>>>>>>>>>>>> Admittedly, there will probably be some resource
> >>>>>>>>>>>>>>>>>>>>>>>>> management requirements for the streaming cached
> >>>>>>>>>>>>>>>>>>>>>>>>> table, such as time based / size based retention, to
> >>>>>>>>>>>>>>>>>>>>>>>>> address the infinite data issue. But such
> >>>>>>>>>>>>>>>>>>>>>>>>> requirements do not change the semantics.
> >>>>>>>>>>>>>>>>>>>>>>>>> You are right that interactive programming is just
> >>>>>>>>>>>>>>>>>>>>>>>>> one use case of cache(). It is not the only use case.
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> For me the more important issue is of not having the
> >>>>>>>>>>>>>>>>>>>>>>>>>> `void cache()` with side effects.
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>> This is indeed the key point. The argument around
> >>>>>>>>>>>>>>>>>>>>>>>>> whether cache() should return something already
> >>>>>>>>>>>>>>>>>>>>>>>>> indicates that cache() and materialize() address
> >>>>>>>>>>>>>>>>>>>>>>>>> different issues.
> >>>>>>>>>>>>>>>>>>>>>>>>> Can you explain a bit more on what the side effects
> >>>>>>>>>>>>>>>>>>>>>>>>> are? So far my understanding is that such side
> >>>>>>>>>>>>>>>>>>>>>>>>> effects only exist if a table is mutable. Is that the
> >>>>>>>>>>>>>>>>>>>>>>>>> case?
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> I don’t know, probably initially we should make
> >>>>>>>>>>>>>>>>>>>>>>>>>> CachedTable read-only. I don’t find it more
> >>>>>>>>>>>>>>>>>>>>>>>>>> confusing than the fact that users cannot write to
> >>>>>>>>>>>>>>>>>>>>>>>>>> views or materialised views in SQL, or that users
> >>>>>>>>>>>>>>>>>>>>>>>>>> currently cannot write to a Table.
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>> I don't think anyone should insert something into a
> >>>>>>>>>>>>>>>>>>>>>>>>> cache. By definition, the cache should only be
> >>>>>>>>>>>>>>>>>>>>>>>>> updated when the corresponding original table is
> >>>>>>>>>>>>>>>>>>>>>>>>> updated. What I am wondering is, given the following
> >>>>>>>>>>>>>>>>>>>>>>>>> two facts:
> >>>>>>>>>>>>>>>>>>>>>>>>> 1. If and only if a table is mutable (with something
> >>>>>>>>>>>>>>>>>>>>>>>>> like insert()), a CachedTable may have implicit
> >>>>>>>>>>>>>>>>>>>>>>>>> behavior.
> >>>>>>>>>>>>>>>>>>>>>>>>> 2. A CachedTable extends a Table.
> >>>>>>>>>>>>>>>>>>>>>>>>> We can come to the conclusion that a CachedTable is
> >>>>>>>>>>>>>>>>>>>>>>>>> mutable and users can insert into the CachedTable
> >>>>>>>>>>>>>>>>>>>>>>>>> directly. This is what I thought was confusing.
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>> Jiangjie (Becket) Qin
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>> On Sat, Dec 1, 2018 at 2:45 AM Piotr Nowojski <
> >>>>>>>>>>>>>>>>>>>>>> piotr@data-artisans.com
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> Hi all,
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> Regarding naming `cache()` vs `materialize()`. One
> >>>>>>>>>>>>>>>>>>>>>>>>>> more explanation of why `materialize()` is more
> >>>>>>>>>>>>>>>>>>>>>>>>>> natural to me is that I think of all “Table”s in the
> >>>>>>>>>>>>>>>>>>>>>>>>>> Table API as views. They behave the same way as SQL
> >>>>>>>>>>>>>>>>>>>>>>>>>> views; the only difference for me is that their life
> >>>>>>>>>>>>>>>>>>>>>>>>>> scope is short - the current session, which is
> >>>>>>>>>>>>>>>>>>>>>>>>>> limited by the different execution model. That’s why
> >>>>>>>>>>>>>>>>>>>>>>>>>> “caching” a view for me is just materialising it.
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> However, I see and understand your point of view.
> >>>>>>>>>>>>>>>>>>>>>>>>>> Coming from DataSet/DataStream and, generally
> >>>>>>>>>>>>>>>>>>>>>>>>>> speaking, the non-SQL world, `cache()` is more
> >>>>>>>>>>>>>>>>>>>>>>>>>> natural. But keep in mind that `.cache()` will/might
> >>>>>>>>>>>>>>>>>>>>>>>>>> not only be used in interactive programming and not
> >>>>>>>>>>>>>>>>>>>>>>>>>> only in batching. But naming is one issue, and not
> >>>>>>>>>>>>>>>>>>>>>>>>>> that critical to me. Especially since, once we
> >>>>>>>>>>>>>>>>>>>>>>>>>> implement proper materialised views, we can always
> >>>>>>>>>>>>>>>>>>>>>>>>>> deprecate/rename `cache()` if we deem it so.
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> For me, the more important issue is not having the
> >>>>>>>>>>>>>>>>>>>>>>>>>> `void cache()` with side effects, exactly for the
> >>>>>>>>>>>>>>>>>>>>>>>>>> reasons that you have mentioned. True: results might
> >>>>>>>>>>>>>>>>>>>>>>>>>> be non-deterministic if the underlying source tables
> >>>>>>>>>>>>>>>>>>>>>>>>>> are changing. The problem is that `void cache()`
> >>>>>>>>>>>>>>>>>>>>>>>>>> implicitly changes the semantics of subsequent uses
> >>>>>>>>>>>>>>>>>>>>>>>>>> of the cached/materialized Table. It can cause a
> >>>>>>>>>>>>>>>>>>>>>>>>>> “wtf” moment for a user if he inserts a “b.cache()”
> >>>>>>>>>>>>>>>>>>>>>>>>>> call in some place in his code and suddenly some
> >>>>>>>>>>>>>>>>>>>>>>>>>> other random places are behaving differently. If
> >>>>>>>>>>>>>>>>>>>>>>>>>> `materialize()` or `cache()` returns a Table handle,
> >>>>>>>>>>>>>>>>>>>>>>>>>> we force the user to explicitly use the cache, which
> >>>>>>>>>>>>>>>>>>>>>>>>>> removes the “random” part from the "suddenly some
> >>>>>>>>>>>>>>>>>>>>>>>>>> other random places are behaving differently”.
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> This argument and others that I’ve raised (greater
> >>>>>>>>>>>>>>>>>>>>>>>>>> flexibility/allowing the user to explicitly bypass
> >>>>>>>>>>>>>>>>>>>>>>>>>> the cache) are independent of the `cache()` vs
> >>>>>>>>>>>>>>>>>>>>>>>>>> `materialize()` discussion.
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>> Does that mean one can also insert into the
> >>>>>>>>>>>>>>>>>>>>>>>>>>> CachedTable? This sounds pretty confusing.
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> I don’t know, probably initially we should make
> >>>>>>>>>>>>>>>>>>>>>>>>>> CachedTable read-only. I don’t find it more
> >>>>>>>>>>>>>>>>>>>>>>>>>> confusing than the fact that users cannot write to
> >>>>>>>>>>>>>>>>>>>>>>>>>> views or materialised views in SQL, or that users
> >>>>>>>>>>>>>>>>>>>>>>>>>> currently cannot write to a Table.
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> Piotrek
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>> On 30 Nov 2018, at 17:38, Xingcan Cui <
> >>>>>>>>>>> xingcanc@gmail.com
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>> Hi all,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>> I agree with @Becket that `cache()` and
> >>>>>>>>>>>>>>>>>>>>>>>>>>> `materialize()` should be considered as two
> >>>>>>>>>>>>>>>>>>>>>>>>>>> different methods, where the latter is more
> >>>>>>>>>>>>>>>>>>>>>>>>>>> sophisticated.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>> According to my understanding, the initial idea is
> >>>>>>>>>>>>>>>>>>>>>>>>>>> just to introduce a simple cache or persist
> >>>>>>>>>>>>>>>>>>>>>>>>>>> mechanism, but as the Table API is a high-level
> >>>>>>>>>>>>>>>>>>>>>>>>>>> API, it’s natural for us to think in a SQL way.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>> Maybe we can add the `cache()` method to the
> >>>>>>>>>>>>>>>>>>>>>>>>>>> DataSet API and force users to translate a Table to
> >>>>>>>>>>>>>>>>>>>>>>>>>>> a DataSet before caching it. Then the users should
> >>>>>>>>>>>>>>>>>>>>>>>>>>> manually register the cached dataset as a table
> >>>>>>>>>>>>>>>>>>>>>>>>>>> again (we may need some table replacement
> >>>>>>>>>>>>>>>>>>>>>>>>>>> mechanisms for datasets with an identical schema
> >>>>>>>>>>>>>>>>>>>>>>>>>>> but different contents here). After all, it’s the
> >>>>>>>>>>>>>>>>>>>>>>>>>>> dataset rather than the dynamic table that needs to
> >>>>>>>>>>>>>>>>>>>>>>>>>>> be cached, right?
> >>>>>>>>>>>>>>>>>>>>>>>>>>>
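> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> The workflow I have in mind is roughly the
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> following sketch (DataSet#cache() being the
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> hypothetical new method proposed above):
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> val ds: DataSet[Row] = tEnv.toDataSet[Row](table)
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> val cachedDs = ds.cache() // proposed addition
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> tEnv.registerDataSet("cachedTable", cachedDs)
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> val reused = tEnv.scan("cachedTable")
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>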
> >>>>>>>>>>>>>>>>>>>>>>>>>>> Best,
> >>>>>>>>>>>>>>>>>>>>>>>>>>> Xingcan
> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> On Nov 30, 2018, at 10:57 AM, Becket Qin <
> >>>>>>>>>>>>>>>>>>>> becket.qin@gmail.com>
> >>>>>>>>>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi Piotrek and Jark,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for the feedback and explanation. Those are
> >>>>>>>>>>>>>>>>>>>>>>>>>> good arguments, but I think they are mostly about
> >>>>>>>>>>>>>>>>>>>>>>>>>> materialized views. Let me try to explain why I
> >>>>>>>>>>>>>>>>>>>>>>>>>> believe cache() and materialize() are different.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> I think cache() and materialize() have quite
> >>>>>>>>>>>>>>>>>>>>>>>>>> different implications. An analogy I can think of is
> >>>>>>>>>>>>>>>>>>>>>>>>>> save()/publish(). When users call cache(), it is
> >>>>>>>>>>>>>>>>>>>>>>>>>> just like they are saving an intermediate result as
> >>>>>>>>>>>>>>>>>>>>>>>>>> a draft of their work; this intermediate result may
> >>>>>>>>>>>>>>>>>>>>>>>>>> not have any realistic meaning. Calling cache() does
> >>>>>>>>>>>>>>>>>>>>>>>>>> not mean users want to publish the cached table in
> >>>>>>>>>>>>>>>>>>>>>>>>>> any manner. But when users call materialize(), that
> >>>>>>>>>>>>>>>>>>>>>>>>>> means "I have something meaningful to be reused by
> >>>>>>>>>>>>>>>>>>>>>>>>>> others"; now users need to think about the
> >>>>>>>>>>>>>>>>>>>>>>>>>> validation, update & versioning, lifecycle of the
> >>>>>>>>>>>>>>>>>>>>>>>>>> result, etc.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> Piotrek's suggestions on variations of the
> >>>>>>>>>>> materialize()
> >>>>>>>>>>>>>>>>>>>> methods
> >>>>>>>>>>>>>>>>>>>>>> are
> >>>>>>>>>>>>>>>>>>>>>>>>>> very
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> useful. It would be great if Flink have them.
> >> The
> >>>>>>>>>>> concept
> >>>>>>>>>>>>>>>>>> of
> >>>>>>>>>>>>>>>>>>>>>>>>>> materialized
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> view is actually a pretty big feature, not to
> >> say
> >>>>>>> the
> >>>>>>>>>>>>>>>>>>> related
> >>>>>>>>>>>>>>>>>>>>>> stuff
> >>>>>>>>>>>>>>>>>>>>>>>> like
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> triggers/hooks you mentioned earlier. I think
> >> the
> >>>>>>>>>>>>>>>>>>> materialized
> >>>>>>>>>>>>>>>>>>>>>> view
> >>>>>>>>>>>>>>>>>>>>>>>>>> itself
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> should be discussed in a more thorough and
> >>>>>>> systematic
> >>>>>>>>>>>>>>>>>>> manner.
> >>>>>>>>>>>>>>>>>>>>> And
> >>>>>>>>>>>>>>>>>>>>>> I
> >>>>>>>>>>>>>>>>>>>>>>>>>> found
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> that discussion is kind of orthogonal and way
> >>>> beyond
> >>>>>>>>>>>>>>>>>>>> interactive
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> programming experience.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> The example you gave was interesting. I still
> >> have
> >>>>>>> some
> >>>>>>>>>>>>>>>>>>>>> questions,
> >>>>>>>>>>>>>>>>>>>>>>>>>> though.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> Table source = … // some source that scans
> files
> >>>>>>> from a
> >>>>>>>>>>>>>>>>>>>>> directory
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> “/foo/bar/“
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> Table t1 =
> source.groupBy(…).select(…).where(…)
> >>>> ….;
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> Table t2 = t1.materialize() // (or `cache()`)
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> t2.count() // initialise cache (if it’s lazily
> >>>>>>>>>>>>>>>>>> initialised)
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> int a1 = t1.count()
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> int b1 = t2.count()
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> // something in the background (or we trigger
> >> it)
> >>>>>>>>>>> writes
> >>>>>>>>>>>>>>>>>>> new
> >>>>>>>>>>>>>>>>>>>>>> files
> >>>>>>>>>>>>>>>>>>>>>>> to
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> /foo/bar
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> int a2 = t1.count()
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> int b2 = t2.count()
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> t2.refresh() // possible future extension,
> not
> >> to
> >>>>>>> be
> >>>>>>>>>>>>>>>>>>>>> implemented
> >>>>>>>>>>>>>>>>>>>>>> in
> >>>>>>>>>>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> initial version
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> what if someone else added some more files to
> >>>>>>> /foo/bar
> >>>>>>>>>>> at
> >>>>>>>>>>>>>>>>>>> this
> >>>>>>>>>>>>>>>>>>>>>>> point?
> >>>>>>>>>>>>>>>>>>>>>>>> In
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> that case, a3 won't equals to b3, and the
> result
> >>>>>>> become
> >>>>>>>>>>>>>>>>>>>>>>>>>> non-deterministic,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> right?
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> int a3 = t1.count()
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> int b3 = t2.count()
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> t2.drop() // another possible future
> extension,
> >>>>>>> manual
> >>>>>>>>>>>>>>>>>>>> “cache”
> >>>>>>>>>>>>>>>>>>>>>>>> dropping
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> When we talk about interactive programming, in
> >>>> most
> >>>>>>>>>>>> cases,
> >>>>>>>>>>>>>>>>>>> we
> >>>>>>>>>>>>>>>>>>>>> are
> >>>>>>>>>>>>>>>>>>>>>>>>>> talking
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> about batch applications. A fundamental
> >> assumption
> >>>>>>> of
> >>>>>>>>>>>> such
> >>>>>>>>>>>>>>>>>>>> case
> >>>>>>>>>>>>>>>>>>>>> is
> >>>>>>>>>>>>>>>>>>>>>>>> that
> >>>>>>>>>>>>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> source data is complete before the data
> >> processing
> >>>>>>>>>>>> begins,
> >>>>>>>>>>>>>>>>>>> and
> >>>>>>>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>>>>>>> data
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> will not change during the data processing.
> IMO,
> >>>> if
> >>>>>>>>>>>>>>>>>>> additional
> >>>>>>>>>>>>>>>>>>>>>> rows
> >>>>>>>>>>>>>>>>>>>>>>>>>> needs
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> to be added to some source during the
> >> processing,
> >>>> it
> >>>>>>>>>>>>>>>>>> should
> >>>>>>>>>>>>>>>>>>> be
> >>>>>>>>>>>>>>>>>>>>>> done
> >>>>>>>>>>>>>>>>>>>>>>> in
> >>>>>>>>>>>>>>>>>>>>>>>>>> ways
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> like union the source with another table
> >>>> containing
> >>>>>>> the
> >>>>>>>>>>>>>>>>>> rows
> >>>>>>>>>>>>>>>>>>>> to
> >>>>>>>>>>>>>>>>>>>>> be
> >>>>>>>>>>>>>>>>>>>>>>>>>> added.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> There are a few cases that computations are
> >>>> executed
> >>>>>>>>>>>>>>>>>>>> repeatedly
> >>>>>>>>>>>>>>>>>>>>> on
> >>>>>>>>>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> changing data source.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> For example, people may run a ML training job
> >>>> every
> >>>>>>>>>>> hour
> >>>>>>>>>>>>>>>>>>> with
> >>>>>>>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>>>>>>>>> samples
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> newly added in the past hour. In that case,
> the
> >>>>>>> source
> >>>>>>>>>>>>>>>>>> data
> >>>>>>>>>>>>>>>>>>>>>> between
> >>>>>>>>>>>>>>>>>>>>>>>> will
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> indeed change. But still, the data remain
> >>>> unchanged
> >>>>>>>>>>>> within
> >>>>>>>>>>>>>>>>>>> one
> >>>>>>>>>>>>>>>>>>>>>> run.
> >>>>>>>>>>>>>>>>>>>>>>>> And
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> usually in that case, the result will need
> >>>>>>> versioning,
> >>>>>>>>>>>>>>>>>> i.e.
> >>>>>>>>>>>>>>>>>>>> for
> >>>>>>>>>>>>>>>>>>>>> a
> >>>>>>>>>>>>>>>>>>>>>>>> given
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> result, it tells that the result is a result
> >> from
> >>>>>>> the
> >>>>>>>>>>>>>>>>>> source
> >>>>>>>>>>>>>>>>>>>>> data
> >>>>>>>>>>>>>>>>>>>>>>> by a
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> certain timestamp.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> Another example is something like data
> >> warehouse.
> >>>> In
> >>>>>>>>>>> this
> >>>>>>>>>>>>>>>>>>>> case,
> >>>>>>>>>>>>>>>>>>>>>>> there
> >>>>>>>>>>>>>>>>>>>>>>>>>> are a
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> few source of original/raw data. On top of
> those
> >>>>>>>>>>> sources,
> >>>>>>>>>>>>>>>>>>> many
> >>>>>>>>>>>>>>>>>>>>>>>>>> materialized
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> view / queries / reports / dashboards can be
> >>>>>>> created to
> >>>>>>>>>>>>>>>>>>>> generate
> >>>>>>>>>>>>>>>>>>>>>>>> derived
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> data. Those derived data needs to be updated
> >> when
> >>>>>>> the
> >>>>>>>>>>>>>>>>>>>> underlying
> >>>>>>>>>>>>>>>>>>>>>>>>>> original
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> data changes. In that case, the processing
> logic
> >>>>>>> that
> >>>>>>>>>>>>>>>>>>> derives
> >>>>>>>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>>>>>>>>> original
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> data needs to be executed repeatedly to update
> >>>> those
> >>>>>>>>>>>>>>>>>>>>>> reports/views.
> >>>>>>>>>>>>>>>>>>>>>>>>>> Again,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> all those derived data also need to ha
> >>>>>>>
> >>>>>>>
> >>>>
> >>>>
> >>>>
> >>
> >>
>
>

Re: [DISCUSS] Support Interactive Programming in Flink Table API

Posted by Piotr Nowojski <pi...@da-platform.com>.
Hi,

I think that introducing ref counting could be confusing and it will be error prone, since Flink-table’s users are not used to closing/releasing resources. I was more objecting to placing the `uncache()`/`dropCache()`/`releaseCache()` (releaseCache sounds best to me) as a method in the “Table”. It might not be obvious that it will drop the cache for all of the usages of the given table. For example:

public void foo(Table t) {
 // …
 t.releaseCache();
}

public void bar(Table t) {
  // ...
}

Table a = …
val cachedA = a.cache()

foo(cachedA)
bar(cachedA)


My problem with the above example is that the `t.releaseCache()` call is not doing the best possible job of communicating to the user that it will have side effects on other places, like the `bar(cachedA)` call. Something like this might be better (not perfect, but just a bit better):

public void foo(Table t, CacheService cacheService) {
 // …
 cacheService.releaseCacheFor(t);
}

Table a = …
val cachedA = a.cache()

foo(cachedA, env.getCacheService())
bar(cachedA)


Also, from another perspective, placing the `releaseCache()` method in Table might not be the best separation of concerns - the `releaseCache()` method seems significantly different compared to the other existing methods.

Piotrek

> On 8 Jan 2019, at 12:28, Becket Qin <be...@gmail.com> wrote:
> 
> Hi Piotr,
> 
> You are right. There might be two intuitive meanings when users call
> 'a.uncache()', namely:
> 1. release the resource
> 2. Do not use cache for the next operation.
> 
> Case (1) would likely be the dominant use case. So I would suggest we
> dedicate the uncache() method to case (1), i.e. for resource release, but
> not for ignoring cache.
> 
> For case 2, i.e. explicitly ignoring cache (which is rare), users may use
> something like 'hint("ignoreCache")'. I think this is better as it is a
> little weird for users to call `a.uncache()` while they may not even know
> if the table is cached at all.
> 
> Assuming we let `uncache()` only release the resource, one possibility is
> using a ref count to mitigate the side effect. That means a ref count is
> incremented on `cache()` and decremented on `uncache()`, so `uncache()`
> does not physically release the resource immediately, but just signals
> that the cache could be released.
> That being said, I am not sure if this is really a better solution as it
> seems a little counterintuitive. Maybe calling it releaseCache() helps a
> little bit?
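> 
> To make the ref counting idea concrete, here is a rough Scala sketch
> (class and method names are illustrative placeholders, not a proposed
> Flink API):
> 
> import java.util.concurrent.ConcurrentHashMap
> import java.util.concurrent.atomic.AtomicInteger
> 
> // Hypothetical bookkeeping: cache() would increment the count for a
> // cache, releaseCache() would decrement it; the physical cache is only
> // dropped once the count reaches zero.
> class CacheRegistry {
>   private val refCounts = new ConcurrentHashMap[String, AtomicInteger]
> 
>   def acquire(cacheId: String): Unit =
>     refCounts.computeIfAbsent(cacheId, _ => new AtomicInteger).incrementAndGet()
> 
>   def release(cacheId: String): Unit = {
>     val count = refCounts.get(cacheId)
>     if (count != null && count.decrementAndGet() == 0) {
>       refCounts.remove(cacheId)
>       println(s"dropping cached data for $cacheId") // free the storage
>     }
>   }
> }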
> 
> Thanks,
> 
> Jiangjie (Becket) Qin
> 
> 
> 
> 
> On Tue, Jan 8, 2019 at 5:36 PM Piotr Nowojski <pi...@da-platform.com> wrote:
> 
>> Hi Becket,
>> 
>> With `uncache` there are probably two features that we can think about:
>> 
>> a)
>> 
>> Physically dropping the cached table from the storage, freeing up the
>> resources
>> 
>> b)
>> 
>> Hinting the optimizer to not cache the reads for the next query/table
>> 
>> a) Has the issue, as I wrote before, of seeming to be an operation
>> inherently “flawed” by having side effects.
>> 
>> I’m not sure how it would be best to express this. We could make it work:
>> 
>> 1. via a method on a Table as you proposed:
>> 
>> void Table#dropCache()
>> void Table#uncache()
>> 
>> 2. Operation on the environment
>> 
>> env.dropCacheFor(table) // or some other argument that allows user to
>> identify the desired cache
>> 
>> 3. Extending (from your original design doc) `setTableService` method to
>> return some control handle like:
>> 
>> TableServiceControl setTableService(TableFactory tf,
>>                     TableProperties properties,
>>                     TempTableCleanUpCallback cleanUpCallback);
>> 
>> (TableServiceControl? TableService? TableServiceHandle? CacheService?)
>> 
>> And having the drop cache method there:
>> 
>> TableServiceControl#dropCache(table)
>> 
>> Out of those options, option 1 might have the disadvantage of not making
>> the user aware that this is a global operation with side effects.
>> Like the old example of:
>> 
>> public void foo(Table t) {
>>  // …
>>  t.dropCache();
>> }
>> 
>> It might not be immediately obvious that `t.dropCache()` is some kind of
>> global operation, with side effects visible outside of the `foo` function.
>> 
>> On the other hand, both options 2 and 3 might have a greater chance of
>> catching the user’s attention:
>> 
>> public void foo(Table t, CacheService cacheService) {
>>  // …
>>  cacheService.dropCache(t);
>> }
>> 
>> b) could be achieved quite easily:
>> 
>> Table a = …
>> val notCached1 = a.doNotCache()
>> val cachedA = a.cache()
>> val notCached2 = cachedA.doNotCache() // equivalent of notCached1
>> 
>> `doNotCache()` would behave similarly to `cache()` - return a copy of the
>> table with removed “cache” hint and/or added “never cache” hint.
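>> 
>> A tiny Scala sketch of these copy-with-hint semantics (the types below
>> are placeholders to illustrate the behaviour, not the real Table API):
>> 
>> sealed trait CacheHint
>> case object UseCache extends CacheHint
>> case object NeverCache extends CacheHint
>> 
>> // Both methods leave `this` untouched and return a hinted copy.
>> case class Table(plan: String, hint: Option[CacheHint] = None) {
>>   def cache(): Table = copy(hint = Some(UseCache))
>>   def doNotCache(): Table = copy(hint = Some(NeverCache))
>> }
>> 
>> val a = Table("scan(src)")
>> val cachedA = a.cache()              // hinted for caching
>> val notCached = cachedA.doNotCache() // “never cache” hint; `a` and
>>                                      // `cachedA` are unchanged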
>> 
>> Piotrek
>> 
>> 
>>> On 8 Jan 2019, at 03:17, Becket Qin <be...@gmail.com> wrote:
>>> 
>>> Hi Piotr,
>>> 
>>> Thanks for the proposal and detailed explanation. I like the idea of
>>> returning a new hinted Table without modifying the original table. This
>>> also leaves room for users to benefit from future implicit caching.
>>> 
>>> Just to make sure I get the full picture: in your proposal, there will
>>> also be a 'void Table#uncache()' method to release the cache, right?
>>> 
>>> Thanks,
>>> 
>>> Jiangjie (Becket) Qin
>>> 
>>> On Mon, Jan 7, 2019 at 11:50 PM Piotr Nowojski <pi...@da-platform.com>
>>> wrote:
>>> 
>>>> Hi Becket!
>>>> 
>>>> After further thinking I tend to agree that my previous proposal
>>>> (*Option 2*) indeed might not be ideal if we would in the future
>>>> introduce automatic caching. However I would like to propose a
>>>> slightly modified version of it:
>>>> 
>>>> *Option 4*
>>>> 
>>>> Adding a `cache()` method with the following signature:
>>>> 
>>>> Table Table#cache();
>>>> 
>>>> Without side effects: the `cache()` call does not modify/change the
>>>> original Table in any way. It would return a copy of the original
>>>> table, with an added hint for the optimizer to cache the table, so
>>>> that future accesses to the returned table might be cached or not.
>>>> 
>>>> Assuming that we are talking about a setup where we do not have
>>>> automatic caching enabled (possible future extension).
>>>> 
>>>> Example #1:
>>>> 
>>>> ```
>>>> Table a = …
>>>> a.foo() // not cached
>>>> 
>>>> val cachedA = a.cache();
>>>> 
>>>> cachedA.bar() // maybe cached
>>>> a.foo() // same as before - effectively not cached
>>>> ```
>>>> 
>>>> Both the first and the second `a.foo()` operations would behave in
>>>> exactly the same way. Again, the `a.cache()` call doesn’t affect `a`
>>>> itself. If `a` was not hinted for caching before `a.cache();`, then
>>>> both `a.foo()` calls wouldn’t use the cache.
>>>> 
>>>> The returned `cachedA` would carry the “cache” hint, so probably
>>>> `cachedA.bar()` would go through the cache (unless the optimiser
>>>> decides the opposite)
>>>> 
>>>> Example #2
>>>> 
>>>> ```
>>>> Table a = …
>>>> 
>>>> a.foo() // not cached
>>>> 
>>>> val b = a.cache();
>>>> 
>>>> a.foo() // same as before - effectively not cached
>>>> b.foo() // maybe cached
>>>> 
>>>> val c = b.cache();
>>>> 
>>>> a.foo() // same as before - effectively not cached
>>>> b.foo() // same as before - effectively maybe cached
>>>> c.foo() // maybe cached
>>>> ```
>>>> 
>>>> Now, assuming that we have some future “automatic caching optimisation”:
>>>> 
>>>> Example #3
>>>> 
>>>> ```
>>>> env.enableAutomaticCaching()
>>>> Table a = …
>>>> 
>>>> a.foo() // might be cached, depending on whether `a` was selected for
>>>> automatic caching
>>>> 
>>>> val b = a.cache();
>>>> 
>>>> a.foo() // same as before - might be cached, if `a` was selected for
>>>> automatic caching
>>>> b.foo() // maybe cached
>>>> ```
>>>> 
>>>> 
>>>> More or less this is the same behaviour as:
>>>> 
>>>> Table a = ...
>>>> val b = a.filter(x > 20)
>>>> 
>>>> calling `filter` hasn’t changed or altered `a` in any way. If `a` was
>>>> previously filtered:
>>>> 
>>>> Table src = …
>>>> val a = src.filter(x > 20)
>>>> val b = a.filter(x > 20)
>>>> 
>>>> then yes, `a` and `b` will be the same. But the point is that neither
>>>> `filter` nor `cache` changes the original `a` table.
>>>> 
>>>> One thing is that indeed, the physical cache-dropping operation will
>>>> have side effects and it will in a way mutate the cached table
>>>> references. But this is, I think, unavoidable in any solution - the
>>>> same issue as calling `.close()`, or calling a destructor in C++.
>>>> 
>>>> Piotrek
>>>> 
>>>>> On 7 Jan 2019, at 10:41, Becket Qin <be...@gmail.com> wrote:
>>>>> 
>>>>> Happy New Year, everybody!
>>>>> 
>>>>> I would like to resume this discussion thread. At this point, we have
>>>>> agreed on the first-step goal of interactive programming. The open
>>>>> discussion is the exact API. More specifically, what should the
>>>>> *cache()* method return and what is its semantic. There are three
>>>>> options:
>>>>> *Option 1*
>>>>> *void cache()* OR *Table cache()* which returns the original table for
>>>>> chained calls.
>>>>> *void uncache() *releases the cache.
>>>>> *Table.hint(ignoreCache).foo()* to ignore cache for operation foo().
>>>>> 
>>>>> - Semantic: a.cache() hints that table 'a' should be cached. Optimizer
>>>>> decides whether the cache will be used or not.
>>>>> - pros: simple and no confusion between CachedTable and original table
>>>>> - cons: A table may be cached / uncached in a method invocation,
>>>>> while the caller does not know about this.
>>>>> 
>>>>> *Option 2*
>>>>> *CachedTable cache()*
>>>>> *CachedTable *extends *Table *with an additional *uncache()* method
>>>>> 
>>>>> - Semantic: After *val cachedA = a.cache()*, *cachedA.foo()* will
>>>>> always use the cache. *a.bar()* will always use the original DAG.
>>>>> - pros: No potential side effects in method invocation.
>>>>> - cons: Optimizer has no chance to kick in. Future optimization will
>>>>> become a behavior change and need users to change the code.
>>>>> 
>>>>> *Option 3*
>>>>> *CacheHandle cache()*
>>>>> *CacheHandle.release() *to release a cache handle on the table. If all
>>>>> cache handles are released, the cache could be removed.
>>>>> *Table.hint(ignoreCache).foo()* to ignore cache for operation foo().
>>>>> 
>>>>> - Semantic: *a.cache()* hints that 'a' should be cached. Optimizer
>>>>> decides whether the cache will be used or not. The cache is released
>>>>> either when no handle is on it, or when the user program exits.
>>>>> - pros: No potential side effect in method invocation. No confusion
>>>>> between cached table vs. original table.
>>>>> - cons: An additional CacheHandle exposed to the users.
>>>>> 
>>>>> 
>>>>> Personally I prefer option 3 for the following reasons:
>>>>> 1. It is simple. Vast majority of the users would just call
>>>>> *a.cache()* followed by *a.foo()*, *a.bar()*, etc.
>>>>> 2. There is no semantic ambiguity or semantic change if we decide to
>>>>> add implicit cache in the future.
>>>>> 3. There is no side effect in the method calls.
>>>>> 4. Admittedly we need to expose one more CacheHandle class to the
>>>>> users. But it is not that difficult to understand given a similar
>>>>> well-known concept like ref counting (we can name it CacheReference
>>>>> if that is easier to understand). So I think it is fine.
>>>>> 
>>>>> 
>>>>> Thanks,
>>>>> 
>>>>> Jiangjie (Becket) Qin
>>>>> 
>>>>> 
>>>>> On Thu, Dec 13, 2018 at 11:23 AM Becket Qin <be...@gmail.com> wrote:
>>>>> 
>>>>>> Hi Piotrek,
>>>>>> 
>>>>>> 1. Regarding optimization.
>>>>>> Sure there are many cases where the decision is hard to make. But
>>>>>> that does not make it any easier for the users to make those
>>>>>> decisions. I imagine 99% of the users would just naively use cache.
>>>>>> I am not saying we can optimize in all the cases. But as long as we
>>>>>> agree that at least in certain cases (I would argue most cases), the
>>>>>> optimizer can do a little better than an average user who likely
>>>>>> knows little about Flink internals, we should not push the burden of
>>>>>> optimization to users.
>>>>>> 
>>>>>> BTW, it seems some of your concerns are related to the
>>>>>> implementation. I did not mention the implementation of the caching
>>>>>> service because that should not affect the API semantics. Not sure
>>>>>> if this helps, but imagine the default implementation has one
>>>>>> StorageNode service colocated with each TM. It could be running
>>>>>> within the TM process or in a standalone process, depending on
>>>>>> configuration.
>>>>>> 
>>>>>> The StorageNode uses a memory + spill-to-disk mechanism. The cached
>>>>>> data will just be written to the local StorageNode service. If the
>>>>>> StorageNode is running within the TM process, the in-memory cache
>>>>>> could just be objects, so we save some serde cost. A later job
>>>>>> referring to the cached Table will be scheduled in a locality-aware
>>>>>> manner, i.e. run in the TM whose peer StorageNode hosts the data.
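>>>>>> 
>>>>>> To make the locality idea concrete, a rough Scala sketch of the
>>>>>> placement logic could look like this (all names are illustrative
>>>>>> placeholders, not an actual Flink API):
>>>>>> 
>>>>>> // Prefer the TM whose peer StorageNode hosts the cached partition;
>>>>>> // otherwise fall back to a remote read.
>>>>>> case class TaskManager(host: String)
>>>>>> case class CachedPartition(id: Int, host: String)
>>>>>> 
>>>>>> def placeTask(p: CachedPartition,
>>>>>>               tms: Seq[TaskManager]): TaskManager =
>>>>>>   tms.find(_.host == p.host) // local read from the peer StorageNode
>>>>>>     .getOrElse(tms.head)     // remote read as a fallback
>>>>>> 
>>>>>> val tms = Seq(TaskManager("tm-1"), TaskManager("tm-2"))
>>>>>> placeTask(CachedPartition(0, "tm-2"), tms) // picks tm-2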
>>>>>> 
>>>>>> 
>>>>>> 2. Semantic
>>>>>> I am not sure why introducing a new hintCache() or
>>>>>> env.enableAutomaticCaching() method would avoid the consequence of
>>>>>> semantic change.
>>>>>> 
>>>>>> If the auto optimization is not enabled by default, users still need
>>>>>> to make code changes to all existing programs in order to get the
>>>>>> benefit. If the auto optimization is enabled by default, advanced
>>>>>> users who know that they really want to use cache will suddenly lose
>>>>>> the opportunity to do so, unless they change the code to disable
>>>>>> auto optimization.
>>>>>> 
>>>>>> 
>>>>>> 3. Side effect
>>>>>> The CacheHandle is not only about where to put uncache(). It is to
>>>>>> solve the implicit performance impact by moving the uncache() to the
>>>>>> CacheHandle.
>>>>>> 
>>>>>> - If users want to leverage cache, they can call a.cache(). After
>>>>>> that, unless users explicitly release that CacheHandle, a.foo() will
>>>>>> always leverage the cache if needed (the optimizer may choose to
>>>>>> ignore the cache if that helps accelerate the process). Any function
>>>>>> call will not be able to release the cache because it does not have
>>>>>> that CacheHandle.
>>>>>> - If some advanced users do not want to use cache at all, they will
>>>>>> call a.hint(ignoreCache).foo(). This will for sure ignore the cache
>>>>>> and use the original DAG to process.
>>>>>> 
>>>>>> 
>>>>>>> In vast majority of the cases, users wouldn't really care whether
>>>>>>> the cache is used or not.
>>>>>>> I wouldn’t agree with that, because “caching” (if not purely
>>>>>>> in-memory caching) would add additional IO costs. It’s similar to
>>>>>>> saying that users would not see a difference between Spark/Flink
>>>>>>> and MapReduce (MapReduce writes data to disks after every
>>>>>>> map/reduce stage).
>>>>>> 
>>>>>> What I wanted to say is that in most cases, after users call
>>>>>> cache(), they don't really care about whether auto optimization has
>>>>>> decided to ignore the cache or not, as long as the program runs
>>>>>> faster.
>>>>>> 
>>>>>> Thanks,
>>>>>> 
>>>>>> Jiangjie (Becket) Qin
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> On Wed, Dec 12, 2018 at 10:50 PM Piotr Nowojski
>>>>>> <piotr@data-artisans.com> wrote:
>>>>>> 
>>>>>>> Hi,
>>>>>>> 
>>>>>>> Thanks for the quick answer :)
>>>>>>> 
>>>>>>> Re 1.
>>>>>>> 
>>>>>>> I generally agree with you, however a couple of points:
>>>>>>> 
>>>>>>> a) the problem with using automatic caching is bigger, because you
>>>>>>> will have to decide how to compare IO vs CPU costs, and if you pick
>>>>>>> wrong, the additional IO costs might be enormous or can even crash
>>>>>>> your system. This is a more difficult problem compared to, let's
>>>>>>> say, join reordering, where the only issue is to have good
>>>>>>> statistics that can capture correlations between columns (when you
>>>>>>> reorder joins, the number of IO operations does not change)
>>>>>>> b) your example is completely independent of caching.
>>>>>>> 
>>>>>>> Query like this:
>>>>>>> 
>>>>>>> src1.filter('f1 > 10).join(src2.filter('f2 < 30),
>>>>>>>   'f1 === 'f2).as('f3, …).filter('f3 > 30)
>>>>>>> 
>>>>>>> Should/could be optimised to an empty result immediately, without
>>>>>>> the need for any cache/materialisation, and that should work even
>>>>>>> without any statistics provided by the connector.
>>>>>>> 
>>>>>>> For me a prerequisite to any serious cost-based optimisations would
>>>>>>> be some reasonable benchmark coverage of the code (tpch?).
>>>>>>> Otherwise that would be equivalent to adding untested code, since
>>>>>>> we wouldn’t be able to verify our assumptions, like how the writing
>>>>>>> of 10 000 records to a cache/RocksDB/Kafka/CSV file compares to the
>>>>>>> joining/filtering/processing of, let's say, 1 000 000 rows.
>>>>>>> 
>>>>>>> Re 2.
>>>>>>> 
>>>>>>> I wasn’t proposing to change the semantic later. I was proposing
>>>>>>> that we start now:
>>>>>>> 
>>>>>>> CachedTable cachedA = a.cache()
>>>>>>> cachedA.foo() // Cache is used
>>>>>>> a.bar() // Original DAG is used
>>>>>>> 
>>>>>>> And then later we can think about adding for example
>>>>>>> 
>>>>>>> CachedTable cachedA = a.hintCache()
>>>>>>> cachedA.foo() // Cache might be used
>>>>>>> a.bar() // Original DAG is used
>>>>>>> 
>>>>>>> Or
>>>>>>> 
>>>>>>> env.enableAutomaticCaching()
>>>>>>> a.foo() // Cache might be used
>>>>>>> a.bar() // Cache might be used
>>>>>>> 
>>>>>>> Or (I would still not like this option):
>>>>>>> 
>>>>>>> a.hintCache()
>>>>>>> a.foo() // Cache might be used
>>>>>>> a.bar() // Cache might be used
>>>>>>> 
>>>>>>> Or whatever else that will come to our mind. Even if we add some
>>>>>>> automatic caching in the future, keeping explicit (`CachedTable
>>>>>>> cache()`) caching will still be useful, at least in some cases.
>>>>>>> 
>>>>>>> Re 3.
>>>>>>> 
>>>>>>>> 2. The source tables are immutable during one run of batch
>>>>>>>> processing logic.
>>>>>>>> 3. The cache is immutable during one run of batch processing
>>>>>>>> logic.
>>>>>>> 
>>>>>>>> I think assumption 2 and 3 are by definition what batch processing
>>>>>>>> means, i.e. the data must be complete before it is processed and
>>>>>>>> should not change when the processing is running.
>>>>>>> 
>>>>>>> I agree that this is how batch systems SHOULD be working. However
>>>>>>> I know from my previous experience that it’s not always the case.
>>>>>>> Sometimes users are just working on some non-transactional storage,
>>>>>>> which can be (either constantly or occasionally) modified by some
>>>>>>> other processes for whatever reasons (fixing the data, updating,
>>>>>>> adding new data, etc).
>>>>>>> 
>>>>>>> But even if we ignore this point (data immutability), the
>>>>>>> performance side effect issue of your proposal remains. If a user
>>>>>>> calls `void a.cache()` deep inside some private method, it will
>>>>>>> have implicit side effects on other parts of his program that might
>>>>>>> not be obvious.
>>>>>>> 
>>>>>>> Re `CacheHandle`.
>>>>>>> 
>>>>>>> If I understand it correctly, it only addresses the issue of where
>>>>>>> to place the `uncache`/`dropCache` method.
>>>>>>> 
>>>>>>> Btw,
>>>>>>> 
>>>>>>>> In vast majority of the cases, users wouldn't really care whether
>>>>>>>> the cache is used or not.
>>>>>>> 
>>>>>>> I wouldn’t agree with that, because “caching” (if not purely
>>>>>>> in-memory caching) would add additional IO costs. It’s similar to
>>>>>>> saying that users would not see a difference between Spark/Flink
>>>>>>> and MapReduce (MapReduce writes data to disks after every
>>>>>>> map/reduce stage).
>>>>>>> 
>>>>>>> Piotrek
>>>>>>> 
>>>>>>>> On 12 Dec 2018, at 14:28, Becket Qin <be...@gmail.com> wrote:
>>>>>>>> 
>>>>>>>> Hi Piotrek,
>>>>>>>> 
>>>>>>>> Not sure if you noticed, but in my last email I was proposing
>>>>>>>> `CacheHandle cache()` to avoid the potential side effect due to
>>>>>>>> function calls.
>>>>>>>> 
>>>>>>>> Let's look at the disagreement in your reply one by one.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 1. Optimization chances
>>>>>>>> 
>>>>>>>> Optimization is never trivial work. This is exactly why we should
>>>>>>>> not let users do it manually. Databases have done a huge amount of
>>>>>>>> work in this area. At Alibaba, we rely heavily on many
>>>>>>>> optimization rules to boost SQL query performance.
>>>>>>>> 
>>>>>>>> In your example, if I fill in the filter conditions in a certain
>>>>>>>> way, the optimization becomes obvious.
>>>>>>>> 
>>>>>>>> Table src1 = … // read from connector 1
>>>>>>>> Table src2 = … // read from connector 2
>>>>>>>> 
>>>>>>>> Table a = src1.filter('f1 > 10).join(src2.filter('f2 < 30),
>>>>>>>>     'f1 === 'f2).as('f3, ...)
>>>>>>>> a.cache() // write cache to connector 3; when writing the
>>>>>>>> // records, remember the min and max of 'f1
>>>>>>>> 
>>>>>>>> a.filter('f3 > 30) // No need to read from any connector, because
>>>>>>>> // `a` does not contain any record whose 'f3 is greater than 30.
>>>>>>>> env.execute()
>>>>>>>> a.select(…)
>>>>>>>> 
>>>>>>>> BTW, it seems to me that adding some basic statistics is fairly
>>>>>>>> straightforward and the cost is pretty marginal, if not
>>>>>>>> negligible. In fact it is not only needed for optimization, but
>>>>>>>> also for cases such as ML, where some algorithms may need to
>>>>>>>> decide their parameters based on the statistics of the data.
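>>>>>>>> 
>>>>>>>> To illustrate the min/max idea, a rough Scala sketch (placeholder
>>>>>>>> types, not an actual Flink API) of the pruning check could be:
>>>>>>>> 
>>>>>>>> // Skip scanning a cached table when the predicate `col > t`
>>>>>>>> // cannot match the remembered value range of the column.
>>>>>>>> case class ColumnStats(min: Long, max: Long)
>>>>>>>> 
>>>>>>>> def canSkipScan(stats: ColumnStats, t: Long): Boolean =
>>>>>>>>   stats.max <= t // no row can satisfy col > t
>>>>>>>> 
>>>>>>>> // remembered while writing the cache in the example above
>>>>>>>> val f3Stats = ColumnStats(min = 11, max = 29)
>>>>>>>> canSkipScan(f3Stats, 30) // true: a.filter('f3 > 30) reads nothing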
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 2. Same API, one semantic now, another semantic later.
>>>>>>>> 
>>>>>>>> I am trying to understand what is the semantic of the
>>>>>>>> `CachedTable cache()` you are proposing. IMO, we should avoid
>>>>>>>> designing an API whose semantic will be changed later. If we have
>>>>>>>> a "CachedTable cache()" method, then the semantic should be very
>>>>>>>> clearly defined upfront and not change later. It should never be
>>>>>>>> "right now let's go with semantic 1, later we can silently change
>>>>>>>> it to semantic 2 or 3". Such a change could have bad consequences.
>>>>>>>> For example, let's say we decide to go with semantic 1:
>>>>>>>> 
>>>>>>>> CachedTable cachedA = a.cache()
>>>>>>>> cachedA.foo() // Cache is used
>>>>>>>> a.bar() // Original DAG is used.
>>>>>>>> 
>>>>>>>> Now the majority of the users would be using cachedA.foo() in
>>>>>>>> their code. And some advanced users will use a.bar() to explicitly
>>>>>>>> skip the cache. Later on, we add smart optimization and change the
>>>>>>>> semantic to semantic 2:
>>>>>>>> 
>>>>>>>> CachedTable cachedA = a.cache()
>>>>>>>> cachedA.foo() // Cache is used
>>>>>>>> a.bar() // Cache MIGHT be used, and Flink may decide to skip the
>>>>>>>> // cache if it is faster.
>>>>>>>> 
>>>>>>>> Now most of the users who were writing cachedA.foo() will not
>>>>>>>> benefit from this optimization at all, unless they change their
>>>>>>>> code to use a.foo() instead. And those advanced users suddenly
>>>>>>>> lose the option to explicitly ignore the cache unless they change
>>>>>>>> their code (assuming we care enough to provide something like
>>>>>>>> hint(useCache)). If we don't define the semantic carefully, our
>>>>>>>> users will have to change their code again and again, while they
>>>>>>>> shouldn't have to.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 3. Side effect.
>>>>>>>> 
>>>>>>>> Before we talk about side effects, we have to agree on the
>>>>>>>> assumptions. The assumptions I have are the following:
>>>>>>>> 1. We are talking about batch processing.
>>>>>>>> 2. The source tables are immutable during one run of batch
>>>>>>>> processing logic.
>>>>>>>> 3. The cache is immutable during one run of batch processing
>>>>>>>> logic.
>>>>>>>> 
>>>>>>>> I think assumptions 2 and 3 are by definition what batch
>>>>>>>> processing means, i.e. the data must be complete before it is
>>>>>>>> processed and should not change while the processing is running.
>>>>>>>> 
>>>>>>>> As far as I am aware, I don't know of any batch processing system
>>>>>>>> breaking those assumptions. Even for relational database tables,
>>>>>>>> where queries can run with concurrent modifications, necessary
>>>>>>>> locking is still required to ensure the integrity of the query
>>>>>>>> result.
>>>>>>>> 
>>>>>>>> Please let me know if you disagree with the above assumptions. If
>>>>>>>> you agree with these assumptions, with the `CacheHandle cache()`
>>>>>>>> API in my last email, do you still see side effects?
>>>>>>>> 
>>>>>>>> Thanks,
>>>>>>>> 
>>>>>>>> Jiangjie (Becket) Qin
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Wed, Dec 12, 2018 at 7:11 PM Piotr Nowojski
>>>>>>>> <piotr@data-artisans.com> wrote:
>>>>>>>> 
>>>>>>>>> Hi Becket,
>>>>>>>>> 
>>>>>>>>>> Regarding the chance of optimization, it might not be that
>>>>>>>>>> rare. Some very simple statistics could already help in many
>>>>>>>>>> cases. For example, simply maintaining the max and min of each
>>>>>>>>>> field can already eliminate some unnecessary table scans
>>>>>>>>>> (potentially scanning the cached table) if the result is doomed
>>>>>>>>>> to be empty. A histogram would give even further information.
>>>>>>>>>> The optimizer could be very careful and only ignore the cache
>>>>>>>>>> when it is 100% sure doing that is cheaper, e.g. only when a
>>>>>>>>>> filter on the cache will absolutely return nothing.
>>>>>>>>> 
>>>>>>>>> I do not see how this might be easy to achieve. It would require
>>>>>>>>> tons of effort to make it work, and in the end you would still
>>>>>>>>> have the problem of comparing/trading CPU cycles vs IO. For
>>>>>>>>> example:
>>>>>>>>> 
>>>>>>>>> Table src1 = … // read from connector 1
>>>>>>>>> Table src2 = … // read from connector 2
>>>>>>>>> 
>>>>>>>>> Table a = src1.filter(…).join(src2.filter(…), …)
>>>>>>>>> a.cache() // write cache to connector 3
>>>>>>>>> 
>>>>>>>>> a.filter(…)
>>>>>>>>> env.execute()
>>>>>>>>> a.select(…)
>>>>>>>>> 
>>>>>>>>> The decision whether it’s better to:
>>>>>>>>> A) read from connector1/connector2, filter/map and join them
>>>>>>>>> twice, or
>>>>>>>>> B) read from connector1/connector2, filter/map and join them
>>>>>>>>> once, pay the price of writing to connector 3 and then reading
>>>>>>>>> from it
>>>>>>>>> 
>>>>>>>>> is very far from trivial. `a` can end up much larger than `src1`
>>>>>>>>> and `src2`, writes to connector 3 might be extremely slow, reads
>>>>>>>>> from connector 3 can be slower compared to reads from connectors
>>>>>>>>> 1 & 2, … . You really need to have extremely good statistics to
>>>>>>>>> correctly assess the size of the output, and it would still fail
>>>>>>>>> many times (correlations etc). And keep in mind that at the
>>>>>>>>> moment we do not have ANY statistics at all. More than that, it
>>>>>>>>> would require significantly more testing and setting up some
>>>>>>>>> benchmarks to make sure that we do not break it with regressions.
>>>>>>>>> 
>>>>>>>>> That’s why I’m strongly opposing this idea - at least let’s not
>>>>>>>>> start with this. If we first start with completely
>>>>>>>>> manual/explicit caching, without any magic, it would be a
>>>>>>>>> significant improvement for the users for a fraction of the
>>>>>>>>> development cost. After implementing that, when we already have
>>>>>>>>> all of the working pieces, we can start working on some
>>>>>>>>> optimisation rules. As I wrote before, if we start with
>>>>>>>>> 
>>>>>>>>> `CachedTable cache()`
>>>>>>>>> 
>>>>>>>>> we can later work on follow-up stories to make it automatic.
>>>>>>>>> Even though I don’t like this implicit/side-effect approach with
>>>>>>>>> a `void` method, having an explicit `CachedTable cache()`
>>>>>>>>> wouldn’t even prevent us from later adding a `void hintCache()`
>>>>>>>>> method, with the exact semantic that you want.
>>>>>>>>> 
>>>>>>>>> On top of that, I raise again that having an implicit `void
>>>>>>>>> cache()/hintCache()` has other side effects and problems with
>>>>>>>>> non-immutable data, and is annoying when used secretly inside
>>>>>>>>> methods.
>>>>>>>>> 
>>>>>>>>> An explicit `CachedTable cache()` just looks like a much less
>>>>>>>>> controversial MVP, and if we decide to go further with this
>>>>>>>>> topic, it’s not a wasted effort, but just lies on a straight
>>>>>>>>> path to more advanced/complicated solutions in the future. Are
>>>>>>>>> there any drawbacks of starting with `CachedTable cache()` that
>>>>>>>>> I’m missing?
>>>>>>>>> 
>>>>>>>>> Piotrek
>>>>>>>>> 
>>>>>>>>>> On 12 Dec 2018, at 09:30, Jeff Zhang <zj...@gmail.com> wrote:
>>>>>>>>>> 
>>>>>>>>>> Hi Becket,
>>>>>>>>>> 
>>>>>>>>>> Introducing CacheHandle seems too complicated. That means users
>>>>>>>>>> have to maintain the handle properly.
>>>>>>>>>> 
>>>>>>>>>> And since cache is just a hint for the optimizer, why not just
>>>>>>>>>> return the Table itself from the cache method? This hint info
>>>>>>>>>> should be kept in the Table, I believe.
>>>>>>>>>> 
>>>>>>>>>> So how about adding the methods cache and uncache to Table, both
>>>>>>>>>> returning Table? Because what cache and uncache do is just add
>>>>>>>>>> some hint info into the Table.
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> On Wed, Dec 12, 2018 at 11:25 AM, Becket Qin <be...@gmail.com> wrote:
>>>>>>>>>> 
>>>>>>>>>>> Hi Till and Piotrek,
>>>>>>>>>>> 
>>>>>>>>>>> Thanks for the clarification. That clears up quite a bit of
>>>>>>>>>>> confusion. My understanding of how cache works is the same as
>>>>>>>>>>> what Till describes, i.e. cache() is a hint to Flink, but it
>>>>>>>>>>> is not guaranteed that the cache always exists and it might be
>>>>>>>>>>> recomputed from its lineage.
>>>>>>>>>>> 
>>>>>>>>>>>> Is this the core of our disagreement here? That you would
>>>>>>>>>>>> like this “cache()” to be mostly a hint for the optimiser?
>>>>>>>>>>> 
>>>>>>>>>>> Semantic-wise, yes. That's also why I think materialize() has
>>>>>>>>>>> a much larger scope than cache(), and thus it should be a
>>>>>>>>>>> different method.
>>>>>>>>>>> 
>>>>>>>>>>> Regarding the chance of optimization, it might not be that
>>>>>>>>>>> rare. Some very simple statistics could already help in many
>>>>>>>>>>> cases. For example, simply maintaining the max and min of each
>>>>>>>>>>> field can already eliminate some unnecessary table scans
>>>>>>>>>>> (potentially scanning the cached table) if the result is
>>>>>>>>>>> doomed to be empty. A histogram would give even further
>>>>>>>>>>> information. The optimizer could be very careful and only
>>>>>>>>>>> ignore the cache when it is 100% sure doing that is cheaper,
>>>>>>>>>>> e.g. only when a filter on the cache will absolutely return
>>>>>>>>>>> nothing.
>>>>>>>>>>> 
>>>>>>>>>>> Given the above clarification on cache, I would like to
>>>>>>>>>>> revisit the original "void cache()" proposal and see if we can
>>>>>>>>>>> improve on top of that.
>>>>>>>>>>> 
>>>>>>>>>>> What do you think about the following modified interface?
>>>>>>>>>>> 
>>>>>>>>>>> Table {
>>>>>>>>>>>   /**
>>>>>>>>>>>    * This call hints Flink to maintain a cache of this table
>>>>>>>>>>>    * and leverage it for performance optimization if needed.
>>>>>>>>>>>    * Note that Flink may still decide not to use the cache if
>>>>>>>>>>>    * it is cheaper to do so.
>>>>>>>>>>>    *
>>>>>>>>>>>    * A CacheHandle will be returned to allow the user to
>>>>>>>>>>>    * release the cache actively. The cache will be deleted if
>>>>>>>>>>>    * there are no unreleased cache handles to it. When the
>>>>>>>>>>>    * TableEnvironment is closed, the cache will also be
>>>>>>>>>>>    * deleted and all the cache handles will be released.
>>>>>>>>>>>    *
>>>>>>>>>>>    * @return a CacheHandle referring to the cache of this
>>>>>>>>>>>    *         table.
>>>>>>>>>>>    */
>>>>>>>>>>>   CacheHandle cache();
>>>>>>>>>>> }
>>>>>>>>>>> 
>>>>>>>>>>> CacheHandle {
>>>>>>>>>>>   /**
>>>>>>>>>>>    * Close the cache handle. This method does not necessarily
>>>>>>>>>>>    * delete the cache. Instead, it simply decrements the
>>>>>>>>>>>    * reference counter of the cache. When there is no handle
>>>>>>>>>>>    * referring to a cache, the cache will be deleted.
>>>>>>>>>>>    *
>>>>>>>>>>>    * @return the number of open handles to the cache after
>>>>>>>>>>>    *         this handle has been released.
>>>>>>>>>>>    */
>>>>>>>>>>>   int release()
>>>>>>>>>>> }
>>>>>>>>>>> 
>>>>>>>>>>> The rationale behind this interface is the following:
>>>>>>>>>>> In the vast majority of cases, users wouldn't really care
>>>>>>>>>>> whether the cache is used or not. So I think the most intuitive
>>>>>>>>>>> way is letting cache() return nothing, so nobody needs to worry
>>>>>>>>>>> about the difference between operations on CachedTables and
>>>>>>>>>>> those on the "original" tables. This will make maybe 99.9% of
>>>>>>>>>>> the users happy. There were two concerns raised for this
>>>>>>>>>>> approach:
>>>>>>>>>>> 1. In some rare cases, users may want to ignore the cache,
>>>>>>>>>>> 2. A table might be cached/uncached in a third-party function
>>>>>>>>>>> while the caller does not know.
>>>>>>>>>>> 
>>>>>>>>>>> For the first issue, users can use hint("ignoreCache") to
>>>>>>>>>>> explicitly ignore the cache.
>>>>>>>>>>> For the second issue, the above proposal lets cache() return a
>>>>>>>>>>> CacheHandle whose only method is release(). Different
>>>>>>>>>>> CacheHandles will refer to the same cache; if a cache no longer
>>>>>>>>>>> has any cache handle, it will be deleted. This will address the
>>>>>>>>>>> following case:
>>>>>>>>>>> {
>>>>>>>>>>>   val handle1 = a.cache()
>>>>>>>>>>>   process(a)
>>>>>>>>>>>   a.select(...) // cache is still available, handle1 has not
>>>>>>>>>>>                 // been released.
>>>>>>>>>>> }
>>>>>>>>>>> 
>>>>>>>>>>> void process(Table t) {
>>>>>>>>>>>   val handle2 = t.cache() // new handle to the cache
>>>>>>>>>>>   t.select(...) // optimizer decides cache usage
>>>>>>>>>>>   t.hint("ignoreCache").select(...) // cache is ignored
>>>>>>>>>>>   handle2.release() // release the handle; the cache may still
>>>>>>>>>>>                     // be available if there are other handles
>>>>>>>>>>>   ...
>>>>>>>>>>> }
>>>>>>>>>>> 
>>>>>>>>>>> Does the above modified approach look reasonable to you?
>>>>>>>>>>> 
>>>>>>>>>>> Cheers,
>>>>>>>>>>> 
>>>>>>>>>>> Jiangjie (Becket) Qin
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> On Tue, Dec 11, 2018 at 6:44 PM Till Rohrmann
>>>>>>>>>>> <trohrmann@apache.org> wrote:
>>>>>>>>>>> 
>>>>>>>>>>>> Hi Becket,
>>>>>>>>>>>> 
>>>>>>>>>>>> I was aiming at semantics similar to 1. I actually thought
>>>>>>>>>>>> that `cache()` would tell the system to materialize the
>>>>>>>>>>>> intermediate result so that subsequent queries don't need to
>>>>>>>>>>>> reprocess it. This means that the usage of the cached table
>>>>>>>>>>>> in this example
>>>>>>>>>>>> 
>>>>>>>>>>>> {
>>>>>>>>>>>> val cachedTable = a.cache()
>>>>>>>>>>>> val b1 = cachedTable.select(…)
>>>>>>>>>>>> val b2 = cachedTable.foo().select(…)
>>>>>>>>>>>> val b3 = cachedTable.bar().select(...)
>>>>>>>>>>>> val c1 = a.select(…)
>>>>>>>>>>>> val c2 = a.foo().select(…)
>>>>>>>>>>>> val c3 = a.bar().select(...)
>>>>>>>>>>>> }
>>>>>>>>>>>> 
>>>>>>>>>>>> strongly depends on interleaved calls which trigger the
>>>>>>>>>>>> execution of sub-queries. So for example, if there is only a
>>>>>>>>>>>> single env.execute call at the end of the block, then b1, b2,
>>>>>>>>>>>> b3, c1, c2 and c3 would all be computed by reading directly
>>>>>>>>>>>> from the sources (given that there is only a single
>>>>>>>>>>>> JobGraph). It just happens that the result of `a` will be
>>>>>>>>>>>> cached such that we skip the processing of `a` when there are
>>>>>>>>>>>> subsequent queries reading from `cachedTable`. If for some
>>>>>>>>>>>> reason the system cannot materialize the table (e.g. running
>>>>>>>>>>>> out of disk space, ttl expired), then it could also happen
>>>>>>>>>>>> that we need to reprocess `a`. In that sense `cachedTable`
>>>>>>>>>>>> simply is an identifier for the materialized result of `a`,
>>>>>>>>>>>> with the lineage of how to reprocess it.
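>>>>>>>>>>>> 
>>>>>>>>>>>> A rough Scala sketch of this "materialized result plus
>>>>>>>>>>>> lineage" behaviour (purely illustrative names, not an actual
>>>>>>>>>>>> Flink API):
>>>>>>>>>>>> 
>>>>>>>>>>>> // Serve from the materialization when available, otherwise
>>>>>>>>>>>> // recompute from the lineage and re-materialize.
>>>>>>>>>>>> class CachedTable[T](lineage: () => T) {
>>>>>>>>>>>>   private var materialized: Option[T] = None
>>>>>>>>>>>>   def get(): T = materialized.getOrElse {
>>>>>>>>>>>>     val r = lineage() // reprocess from the lineage
>>>>>>>>>>>>     materialized = Some(r)
>>>>>>>>>>>>     r
>>>>>>>>>>>>   }
>>>>>>>>>>>>   // e.g. ttl expired or disk space ran out
>>>>>>>>>>>>   def invalidate(): Unit = materialized = None
>>>>>>>>>>>> }
>>>>>>>>>>>> 
>>>>>>>>>>>> val cachedA = new CachedTable(() => { println("reprocess"); 42 })
>>>>>>>>>>>> cachedA.get() // reprocesses `a`
>>>>>>>>>>>> cachedA.get() // served from the materialization
>>>>>>>>>>>> cachedA.invalidate()
>>>>>>>>>>>> cachedA.get() // reprocesses again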
>>>>>>>>>>>> 
>>>>>>>>>>>> Cheers,
>>>>>>>>>>>> Till
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> On Tue, Dec 11, 2018 at 11:01 AM Piotr Nowojski
>>>>>>>>>>>> <piotr@data-artisans.com> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>>> Hi Becket,
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> {
>>>>>>>>>>>>>>   val cachedTable = a.cache()
>>>>>>>>>>>>>>   val b = cachedTable.select(...)
>>>>>>>>>>>>>>   val c = a.select(...)
>>>>>>>>>>>>>> }
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Semantic 1. b uses cachedTable as the user demanded. c uses
>>>>>>>>>>>>>> the original DAG as the user demanded. In this case, the
>>>>>>>>>>>>>> optimizer has no chance to optimize.
>>>>>>>>>>>>>> Semantic 2. b uses cachedTable as the user demanded. c
>>>>>>>>>>>>>> leaves it to the optimizer to choose whether the cache or
>>>>>>>>>>>>>> the DAG should be used. In this case, the user loses the
>>>>>>>>>>>>>> option to NOT use the cache.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> As you can see, neither of the options seems perfect.
>>>>>>>>>>>>>> However, I guess you and Till are proposing the third
>>>>>>>>>>>>>> option:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Semantic 3. b leaves it to the optimizer to choose whether
>>>>>>>>>>>>>> the cache or the DAG should be used. c always uses the DAG.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> I am pretty sure that me, Till, Fabian and others were all
>>>>>>>>>>>>> proposing and advocating in favour of semantic “1”. No
>>>>>>>>>>>>> cost-based optimiser decisions at all.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> {
>>>>>>>>>>>>> val cachedTable = a.cache()
>>>>>>>>>>>>> val b1 = cachedTable.select(…)
>>>>>>>>>>>>> val b2 = cachedTable.foo().select(…)
>>>>>>>>>>>>> val b3 = cachedTable.bar().select(...)
>>>>>>>>>>>>> val c1 = a.select(…)
>>>>>>>>>>>>> val c2 = a.foo().select(…)
>>>>>>>>>>>>> val c3 = a.bar().select(...)
>>>>>>>>>>>>> }
>>>>>>>>>>>>> 
>>>>>>>>>>>>> All of b1, b2 and b3 are reading from the cache, while c1,
>>>>>>>>>>>>> c2 and c3 are re-executing the whole plan for “a”.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> In the future we could discuss going one step further,
>>>>>>>>>>>>> introducing some global optimisation (that can be manually
>>>>>>>>>>>>> enabled/disabled): deduplicate plan nodes / deduplicate
>>>>>>>>>>>>> sub-queries / re-use sub-query results, or whatever we could
>>>>>>>>>>>>> call it. It could do two things:
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 1. Automatically try to deduplicate fragments of the plan
>>>>>>>>>>>>> and share the result using CachedTable - in other words,
>>>>>>>>>>>>> automatically insert `CachedTable cache()` calls.
>>>>>>>>>>>>> 2. Automatically make the decision to bypass explicit
>>>>>>>>>>>>> `CachedTable` access (this would be the equivalent of what
>>>>>>>>>>>>> you described as “semantic 3”).
>>>>>>>>>>>>> 
>>>>>>>>>>>>> However as I wrote previously, I have big doubts whether
>>>>>>>>>>>>> such cost-based optimisation would work (this applies also
>>>>>>>>>>>>> to “Semantic 2”). I would expect it to do more harm than
>>>>>>>>>>>>> good in so many cases that it wouldn’t make sense. Even
>>>>>>>>>>>>> assuming that we calculate statistics perfectly (this ain’t
>>>>>>>>>>>>> gonna happen), it’s virtually impossible to correctly
>>>>>>>>>>>>> estimate the exchange rate of CPU cycles vs IO operations,
>>>>>>>>>>>>> as it changes so much from deployment to deployment.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Is this the core of our disagreement here? That you would
>>>>>>>>>>>>> like this “cache()” to be mostly a hint for the optimiser?
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Piotrek
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On 11 Dec 2018, at 06:00, Becket Qin <be...@gmail.com>
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Another potential concern for semantic 3 is that, in the
>>>>>>>>>>>>>> future, we may add automatic caching to Flink, e.g. cache
>>>>>>>>>>>>>> the intermediate results at the shuffle boundary. If our
>>>>>>>>>>>>>> semantic is that a reference to the original table means
>>>>>>>>>>>>>> skipping the cache, those users may not be able to benefit
>>>>>>>>>>>>>> from the implicit cache.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On Tue, Dec 11, 2018 at 12:10 PM Becket Qin
>>>>>>>>>>>>>> <becket.qin@gmail.com> wrote:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Hi Piotrek,
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Thanks for the reply. Having thought about it again, I
>>>>>>>>>>>>>>> might have misunderstood your proposal in earlier emails.
>>>>>>>>>>>>>>> Returning a CachedTable might not be a bad idea.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> I was more concerned about the semantic and its
>>>>>>>>>>>>>>> intuitiveness when a CachedTable is returned, i.e., if
>>>>>>>>>>>>>>> cache() returns a CachedTable, what are the semantics in
>>>>>>>>>>>>>>> the following code:
>>>>>>>>>>>>>>> {
>>>>>>>>>>>>>>>   val cachedTable = a.cache()
>>>>>>>>>>>>>>>   val b = cachedTable.select(...)
>>>>>>>>>>>>>>>   val c = a.select(...)
>>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>> What is the difference between b and c? At first glance, I
>>>>>>>>>>>>>>> see two options:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Semantic 1. b uses cachedTable as the user demanded. c
>>>>>>>>>>>>>>> uses the original DAG as the user demanded. In this case,
>>>>>>>>>>>>>>> the optimizer has no chance to optimize.
>>>>>>>>>>>>>>> Semantic 2. b uses cachedTable as the user demanded. c
>>>>>>>>>>>>>>> leaves it to the optimizer to choose whether the cache or
>>>>>>>>>>>>>>> the DAG should be used. In this case, the user loses the
>>>>>>>>>>>>>>> option to NOT use the cache.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> As you can see, neither of the options seems perfect.
>>>>>>>>>>>>>>> However, I guess you and Till are proposing the third
>>>>>>>>>>>>>>> option:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Semantic 3. b leaves it to the optimizer to choose whether
>>>>>>>>>>>>>>> the cache or the DAG should be used. c always uses the
>>>>>>>>>>>>>>> DAG.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> This does address all the concerns. It is just that from
>>>>>>>>>>>>>>> an intuitiveness perspective, I find asking the user to
>>>>>>>>>>>>>>> explicitly use a CachedTable while the optimizer might
>>>>>>>>>>>>>>> choose to ignore it a little weird. That was why I did not
>>>>>>>>>>>>>>> think about that semantic. But given there is material
>>>>>>>>>>>>>>> benefit, I think this semantic is acceptable.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 1. If we want to let the optimiser make decisions whether
>>>>>>>>>>>>>>>> to use the cache or not, then why do we need a “void
>>>>>>>>>>>>>>>> cache()” method at all? Would it “increase” the chance of
>>>>>>>>>>>>>>>> using the cache? That sounds strange. What would be the
>>>>>>>>>>>>>>>> mechanism for deciding whether to use the cache or not?
>>>>>>>>>>>>>>>> If we want to introduce such kind of automated
>>>>>>>>>>>>>>>> optimisations of “plan nodes deduplication” I would turn
>>>>>>>>>>>>>>>> it on globally, not per table, and let the optimiser do
>>>>>>>>>>>>>>>> all of the work.
>>>>>>>>>>>>>>>> 2. We do not have statistics at the moment for any
>>>>>>>>>>>>>>>> use/not-use cache decision.
>>>>>>>>>>>>>>>> 3. Even if we had, I would be veeerryy sceptical whether
>>>>>>>>>>>>>>>> such cost-based optimisations would work properly and I
>>>>>>>>>>>>>>>> would still insist first on providing an explicit caching
>>>>>>>>>>>>>>>> mechanism (`CachedTable cache()`)
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> We are absolutely on the same page here. An explicit
>>>>>>>>>>>>>>> cache() method is necessary not only because the optimizer
>>>>>>>>>>>>>>> may not be able to make the right decision, but also
>>>>>>>>>>>>>>> because of the nature of interactive programming. For
>>>>>>>>>>>>>>> example, if users write the following code in the Scala
>>>>>>>>>>>>>>> shell:
>>>>>>>>>>>>>>> val b = a.select(...)
>>>>>>>>>>>>>>> val c = b.select(...)
>>>>>>>>>>>>>>> val d = c.select(...).writeToSink(...)
>>>>>>>>>>>>>>> tEnv.execute()
>>>>>>>>>>>>>>> There is no way the optimizer will know whether b or c
>>>>>>>>>>>>>>> will be used in later code, unless users hint explicitly.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> At the same time I’m not sure if you have responded to
>>>>>>>>>>>>>>>> our objections of `void cache()` being implicit/having
>>>>>>>>>>>>>>>> side effects, which me, Jark, Fabian, Till and I think
>>>>>>>>>>>>>>>> also Shaoxuan are supporting.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Are there any other side effects if we use semantic 3
>>>>>>>>>>>>>>> mentioned above?
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Jiangjie (Becket) Qin
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> On Mon, Dec 10, 2018 at 7:54 PM Piotr Nowojski
>>>>>>>>>>>>>>> <piotr@data-artisans.com> wrote:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Hi Becket,
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Sorry for not responding for a long time.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Regarding case 1.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> There wouldn’t be an “a.unCache()” method; I would expect
>>>>>>>>>>>>>>>> only `cachedTableA1.dropCache()`. Dropping
>>>>>>>>>>>>>>>> `cachedTableA1` wouldn’t affect `cachedTableA2`. Just as
>>>>>>>>>>>>>>>> in any other database, dropping/modifying one independent
>>>>>>>>>>>>>>>> table/materialised view does not affect others.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> What I meant is that assuming there is already a cached
>>>>>>>>>>>>>>>>> table, ideally users need not specify whether the next
>>>>>>>>>>>>>>>>> query should read from the cache or use the original
>>>>>>>>>>>>>>>>> DAG. This should be decided by the optimizer.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 1. If we want to let the optimiser make decisions whether
>>>>>>>>>>>>>>>> to use the cache or not, then why do we need a “void
>>>>>>>>>>>>>>>> cache()” method at all? Would it “increase” the chance of
>>>>>>>>>>>>>>>> using the cache? That sounds strange. What would be the
>>>>>>>>>>>>>>>> mechanism for deciding whether to use the cache or not?
>>>>>>>>>>>>>>>> If we want to introduce such kind of automated
>>>>>>>>>>>>>>>> optimisations of “plan nodes deduplication” I would turn
>>>>>>>>>>>>>>>> it on globally, not per table, and let the optimiser do
>>>>>>>>>>>>>>>> all of the work.
>>>>>>>>>>>>>>>> 2. We do not have statistics at the moment for any
>>>>>>>>>>>>>>>> use/not-use cache decision.
>>>>>>>>>>>>>>>> 3. Even if we had, I would be veeerryy sceptical whether
>>>>>>>>>>>>>>>> such cost-based optimisations would work properly and I
>>>>>>>>>>>>>>>> would still insist first on providing an explicit caching
>>>>>>>>>>>>>>>> mechanism (`CachedTable cache()`)
>>>>>>>>>>>>>>>> 4. As Till wrote, having an explicit `CachedTable
>>>>>>>>>>>>>>>> cache()` doesn’t contradict future work on automated
>>>>>>>>>>>>>>>> cost-based caching.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> At the same time I’m not sure if you have responded to
>>>>>>>>>>>>>>>> our objections of `void cache()` being implicit/having
>>>>>>>>>>>>>>>> side effects, which me, Jark, Fabian, Till and I think
>>>>>>>>>>>>>>>> also Shaoxuan are supporting.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Piotrek
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> On 5 Dec 2018, at 12:42, Becket Qin <be...@gmail.com>
>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Hi Till,
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> It is true that after the first job submission, there
>>>>>>>>>>>>>>>>> will be no ambiguity in terms of whether a cached table
>>>>>>>>>>>>>>>>> is used or not. That is the same for the cache() without
>>>>>>>>>>>>>>>>> returning a CachedTable.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Conceptually one could think of cache() as introducing a
>>>>>>> caching
>>>>>>>>>>>>>>>> operator
>>>>>>>>>>>>>>>>>> from which you need to consume from if you want to benefit
>>>>>>> from
>>>>>>>>>>> the
>>>>>>>>>>>>>>>> caching
>>>>>>>>>>>>>>>>>> functionality.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> I am thinking a little differently. I think it is a hint
>> (as
>>>>>>> you
>>>>>>>>>>>>>>>> mentioned
>>>>>>>>>>>>>>>>> later) instead of a new operator. I'd like to be careful
>>>> about
>>>>>>> the
>>>>>>>>>>>>>>>> semantic
>>>>>>>>>>>>>>>>> of the API. A hint is a property set on an existing
>> operator,
>>>>>>> but
>>>>>>>>>>> is
>>>>>>>>>>>>> not
>>>>>>>>>>>>>>>>> itself an operator as it does not really manipulate the
>> data.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> I agree, ideally the optimizer makes this kind of decision
>>>>>>> which
>>>>>>>>>>>>>>>>>> intermediate result should be cached. But especially when
>>>>>>>>>>> executing
>>>>>>>>>>>>>>>> ad-hoc
>>>>>>>>>>>>>>>>>> queries the user might better know which results need to
>> be
>>>>>>>>>>> cached
>>>>>>>>>>>>>>>> because
>>>>>>>>>>>>>>>>>> Flink might not see the full DAG. In that sense, I would
>>>>>>> consider
>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>> cache() method as a hint for the optimizer. Of course, in
>>>> the
>>>>>>>>>>>> future
>>>>>>>>>>>>> we
>>>>>>>>>>>>>>>>>> might add functionality which tries to automatically cache
>>>>>>>>>>> results
>>>>>>>>>>>>>>>> (e.g.
>>>>>>>>>>>>>>>>>> caching the latest intermediate results until so and so
>> much
>>>>>>>>>>> space
>>>>>>>>>>>> is
>>>>>>>>>>>>>>>>>> used). But this should hopefully not contradict with
>>>>>>> `CachedTable
>>>>>>>>>>>>>>>> cache()`.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> I agree that cache() method is needed for exactly the
>> reason
>>>>>>> you
>>>>>>>>>>>>>>>> mentioned,
>>>>>>>>>>>>>>>>> i.e. Flink cannot predict what users are going to write
>>>> later,
>>>>>>> so
>>>>>>>>>>>>> users
>>>>>>>>>>>>>>>>> need to tell Flink explicitly that this table will be used
>>>>>>> later.
>>>>>>>>>>>>> What I
>>>>>>>>>>>>>>>>> meant is that assuming there is already a cached table,
>>>> ideally
>>>>>>>>>>>> users
>>>>>>>>>>>>>>>> need
>>>>>>>>>>>>>>>>> not to specify whether the next query should read from the
>>>>>>> cache
>>>>>>>>>>> or
>>>>>>>>>>>>> use
>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>> original DAG. This should be decided by the optimizer.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> To explain the difference between returning / not
>> returning a
>>>>>>>>>>>>>>>> CachedTable,
>>>>>>>>>>>>>>>>> I want to compare the following two cases:
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> *Case 1:  returning a CachedTable*
>>>>>>>>>>>>>>>>> b = a.map(...)
>>>>>>>>>>>>>>>>> val cachedTableA1 = a.cache()
>>>>>>>>>>>>>>>>> val cachedTableA2 = a.cache()
>>>>>>>>>>>>>>>>> b.print() // Just to make sure a is cached.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> c = a.filter(...) // User specify that the original DAG is
>>>>>>> used?
>>>>>>>>>>> Or
>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>> optimizer decides whether DAG or cache should be used?
>>>>>>>>>>>>>>>>> d = cachedTableA1.filter() // User specify that the cached
>>>>>>> table
>>>>>>>>>>> is
>>>>>>>>>>>>>>>> used.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> a.unCache() // Can cachedTableA still be used afterwards?
>>>>>>>>>>>>>>>>> cachedTableA1.uncache() // Can cachedTableA2 still be used?
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> *Case 2: not returning a CachedTable*
>>>>>>>>>>>>>>>>> b = a.map()
>>>>>>>>>>>>>>>>> a.cache()
>>>>>>>>>>>>>>>>> a.cache() // no-op
>>>>>>>>>>>>>>>>> b.print() // Just to make sure a is cached
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> c = a.filter(...) // Optimizer decides whether the cache or
>>>> DAG
>>>>>>>>>>>> should
>>>>>>>>>>>>>>>> be
>>>>>>>>>>>>>>>>> used
>>>>>>>>>>>>>>>>> d = a.filter(...) // Optimizer decides whether the cache or
>>>> DAG
>>>>>>>>>>>> should
>>>>>>>>>>>>>>>> be
>>>>>>>>>>>>>>>>> used
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> a.unCache()
>>>>>>>>>>>>>>>>> a.unCache() // no-op
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> In case 1, semantics-wise, the optimizer loses the option to
>> choose
>>>>>>>>>>>> between
>>>>>>>>>>>>>>>> DAG
>>>>>>>>>>>>>>>>> and cache. And the unCache() call becomes tricky.
>>>>>>>>>>>>>>>>> In case 2, users do not need to worry about whether cache
>> or
>>>>>>> DAG
>>>>>>>>>>> is
>>>>>>>>>>>>>>>> used.
>>>>>>>>>>>>>>>>> And the unCache() semantic is clear. However, the caveat is
>>>>>>> that
>>>>>>>>>>>> users
>>>>>>>>>>>>>>>>> cannot explicitly ignore the cache.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> In order to address the issues mentioned in case 2 and
>>>>>>> inspired by
>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>> discussion so far, I am thinking about using hint to allow
>>>> user
>>>>>>>>>>>>>>>> explicitly
>>>>>>>>>>>>>>>>> ignore cache. Although we do not have hints yet, we
>>>> probably
>>>>>>>>>>>> should
>>>>>>>>>>>>>>>> have
>>>>>>>>>>>>>>>>> one. So the code becomes:
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> *Case 3: returning this table*
>>>>>>>>>>>>>>>>> b = a.map()
>>>>>>>>>>>>>>>>> a.cache()
>>>>>>>>>>>>>>>>> a.cache() // no-op
>>>>>>>>>>>>>>>>> b.print() // Just to make sure a is cached
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> c = a.filter(...) // Optimizer decides whether the cache or
>>>> DAG
>>>>>>>>>>>> should
>>>>>>>>>>>>>>>> be
>>>>>>>>>>>>>>>>> used
>>>>>>>>>>>>>>>>> d = a.hint("ignoreCache").filter(...) // DAG will be used
>>>>>>> instead
>>>>>>>>>>> of
>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>> cache.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> a.unCache()
>>>>>>>>>>>>>>>>> a.unCache() // no-op
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> We could also let cache() return this table to allow
>> chained
>>>>>>>>>>> method
>>>>>>>>>>>>>>>> calls.
>>>>>>>>>>>>>>>>> Do you think this API addresses the concerns?
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Jiangjie (Becket) Qin
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> On Wed, Dec 5, 2018 at 10:55 AM Jark Wu <im...@gmail.com>
>>>>>>> wrote:
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> All the recent discussions are focused on whether there
>> is a
>>>>>>>>>>>> problem
>>>>>>>>>>>>> if
>>>>>>>>>>>>>>>>>> cache() not return a Table.
>>>>>>>>>>>>>>>>>> It seems that returning a Table explicitly is more clear
>>>> (and
>>>>>>>>>>>> safe?).
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> So whether there are any problems if cache() returns a
>>>> Table?
>>>>>>>>>>>>> @Becket
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>>>> Jark
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> On Tue, 4 Dec 2018 at 22:27, Till Rohrmann <
>>>>>>> trohrmann@apache.org
>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> It's true that b, c, d and e will all read from the
>>>> original
>>>>>>> DAG
>>>>>>>>>>>>> that
>>>>>>>>>>>>>>>>>>> generates a. But all subsequent operators (when running
>>>>>>> multiple
>>>>>>>>>>>>>>>> queries)
>>>>>>>>>>>>>>>>>>> which reference cachedTableA should not need to reproduce
>>>> `a`
>>>>>>>>>>> but
>>>>>>>>>>>>>>>>>> directly
>>>>>>>>>>>>>>>>>>> consume the intermediate result.
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Conceptually one could think of cache() as introducing a
>>>>>>> caching
>>>>>>>>>>>>>>>> operator
>>>>>>>>>>>>>>>>>>> from which you need to consume from if you want to
>> benefit
>>>>>>> from
>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>> caching
>>>>>>>>>>>>>>>>>>> functionality.
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> I agree, ideally the optimizer makes this kind of
>> decision
>>>>>>> which
>>>>>>>>>>>>>>>>>>> intermediate result should be cached. But especially when
>>>>>>>>>>>> executing
>>>>>>>>>>>>>>>>>> ad-hoc
>>>>>>>>>>>>>>>>>>> queries the user might better know which results need to
>> be
>>>>>>>>>>> cached
>>>>>>>>>>>>>>>>>> because
>>>>>>>>>>>>>>>>>>> Flink might not see the full DAG. In that sense, I would
>>>>>>>>>>> consider
>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>> cache() method as a hint for the optimizer. Of course, in
>>>> the
>>>>>>>>>>>> future
>>>>>>>>>>>>>>>> we
>>>>>>>>>>>>>>>>>>> might add functionality which tries to automatically
>> cache
>>>>>>>>>>> results
>>>>>>>>>>>>>>>> (e.g.
>>>>>>>>>>>>>>>>>>> caching the latest intermediate results until so and so
>>>> much
>>>>>>>>>>> space
>>>>>>>>>>>>> is
>>>>>>>>>>>>>>>>>>> used). But this should hopefully not contradict with
>>>>>>>>>>> `CachedTable
>>>>>>>>>>>>>>>>>> cache()`.
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>>>>>> Till
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> On Tue, Dec 4, 2018 at 2:33 PM Becket Qin <
>>>>>>> becket.qin@gmail.com
>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> Hi Till,
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> Thanks for the clarification. I am still a little
>>>> confused.
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> If cache() returns a CachedTable, the example might
>>>> become:
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> b = a.map(...)
>>>>>>>>>>>>>>>>>>>> c = a.map(...)
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> cachedTableA = a.cache()
>>>>>>>>>>>>>>>>>>>> d = cachedTableA.map(...)
>>>>>>>>>>>>>>>>>>>> e = a.map()
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> In the above case, if cache() is lazily evaluated, b,
>> c, d
>>>>>>> and
>>>>>>>>>>> e
>>>>>>>>>>>>> are
>>>>>>>>>>>>>>>>>> all
>>>>>>>>>>>>>>>>>>>> going to be reading from the original DAG that generates
>>>> a.
>>>>>>> But
>>>>>>>>>>>>> with
>>>>>>>>>>>>>>>> a
>>>>>>>>>>>>>>>>>>>> naive expectation, d should be reading from the cache.
>>>> This
>>>>>>>>>>> seems
>>>>>>>>>>>>> not
>>>>>>>>>>>>>>>>>>>> solving the potential confusion you raised, right?
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> Just to be clear, my understanding are all based on the
>>>>>>>>>>>> assumption
>>>>>>>>>>>>>>>> that
>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>> tables are immutable. Therefore, after a.cache(), a the
>>>>>>>>>>>>>>>> c*achedTableA*
>>>>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>>>> original table *a * should be completely
>> interchangeable.
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> That said, I think a valid argument is optimization.
>> There
>>>>>>> are
>>>>>>>>>>>>> indeed
>>>>>>>>>>>>>>>>>>> cases
>>>>>>>>>>>>>>>>>>>> that reading from the original DAG could be faster than
>>>>>>> reading
>>>>>>>>>>>>> from
>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>> cache. For example, in the following example:
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> a.filter('f1 > 100)
>>>>>>>>>>>>>>>>>>>> a.cache()
>>>>>>>>>>>>>>>>>>>> b = a.filter('f1 < 100)
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> Ideally the optimizer should be intelligent enough to
>>>> decide
>>>>>>>>>>>> which
>>>>>>>>>>>>>>>> way
>>>>>>>>>>>>>>>>>> is
>>>>>>>>>>>>>>>>>>>> faster, without user intervention. In this case, it will
>>>>>>>>>>> identify
>>>>>>>>>>>>>>>> that
>>>>>>>>>>>>>>>>>> b
>>>>>>>>>>>>>>>>>>>> would just be an empty table, thus skip reading from the
>>>>>>> cache
>>>>>>>>>>>>>>>>>>> completely.
>>>>>>>>>>>>>>>>>>>> But I agree that returning a CachedTable would give user
>>>> the
>>>>>>>>>>>>> control
>>>>>>>>>>>>>>>> of
>>>>>>>>>>>>>>>>>>>> when to use cache, even though I still feel that letting
>>>> the
>>>>>>>>>>>>>>>> optimizer
>>>>>>>>>>>>>>>>>>>> handle this is a better option in long run.
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> Jiangjie (Becket) Qin
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> On Tue, Dec 4, 2018 at 6:51 PM Till Rohrmann <
>>>>>>>>>>>> trohrmann@apache.org
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> Yes you are right Becket that it still depends on the
>>>>>>> actual
>>>>>>>>>>>>>>>>>> execution
>>>>>>>>>>>>>>>>>>> of
>>>>>>>>>>>>>>>>>>>>> the job whether a consumer reads from a cached result
>> or
>>>>>>> not.
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> My point was actually about the properties of a (cached
>>>> vs.
>>>>>>>>>>>>>>>>>> non-cached)
>>>>>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>>>>> not about the execution. I would not make cache trigger
>>>> the
>>>>>>>>>>>>>>>> execution
>>>>>>>>>>>>>>>>>>> of
>>>>>>>>>>>>>>>>>>>>> the job because one loses some flexibility by eagerly
>>>>>>>>>>> triggering
>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>> execution.
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> I tried to argue for an explicit CachedTable which is
>>>>>>> returned
>>>>>>>>>>>> by
>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>> cache() method like Piotr did in order to make the API
>>>> more
>>>>>>>>>>>>>>>> explicit.
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>>>>>>>> Till
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> On Mon, Dec 3, 2018 at 4:23 PM Becket Qin <
>>>>>>>>>>> becket.qin@gmail.com
>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> Hi Till,
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> That is a good example. Just a minor correction, in
>> this
>>>>>>>>>>> case,
>>>>>>>>>>>>> b, c
>>>>>>>>>>>>>>>>>>>> and d
>>>>>>>>>>>>>>>>>>>>>> will all consume from a non-cached a. This is because
>>>>>>> cache
>>>>>>>>>>>> will
>>>>>>>>>>>>>>>>>> only
>>>>>>>>>>>>>>>>>>>> be
>>>>>>>>>>>>>>>>>>>>>> created on the very first job submission that
>> generates
>>>>>>> the
>>>>>>>>>>>> table
>>>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>> be
>>>>>>>>>>>>>>>>>>>>>> cached.
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> If I understand correctly, this example is about
>>>>>>> whether
>>>>>>>>>>>>>>>>>> .cache()
>>>>>>>>>>>>>>>>>>>>> method
>>>>>>>>>>>>>>>>>>>>>> should be eagerly evaluated or lazily evaluated. In
>>>>>> other
>>>>>>>>>>>> words,
>>>>>>>>>>>>>>>>>> if
>>>>>>>>>>>>>>>>>>>>>> cache() method actually triggers a job that creates
>> the
>>>>>>>>>>> cache,
>>>>>>>>>>>>>>>>>> there
>>>>>>>>>>>>>>>>>>>> will
>>>>>>>>>>>>>>>>>>>>>> be no such confusion. Is that right?
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> In the example, although d will not consume from the
>>>>>>> cached
>>>>>>>>>>>> Table
>>>>>>>>>>>>>>>>>>> while
>>>>>>>>>>>>>>>>>>>>> it
>>>>>>>>>>>>>>>>>>>>>> looks supposed to, from correctness perspective the
>> code
>>>>>>> will
>>>>>>>>>>>>> still
>>>>>>>>>>>>>>>>>>>>> return
>>>>>>>>>>>>>>>>>>>>>> correct result, assuming that tables are immutable.
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> Personally I feel it is OK because users probably
>> won't
>>>>>>>>>>> really
>>>>>>>>>>>>>>>>>> worry
>>>>>>>>>>>>>>>>>>>>> about
>>>>>>>>>>>>>>>>>>>>>> whether the table is cached or not. And lazy cache
>> could
>>>>>>>>>>> avoid
>>>>>>>>>>>>> some
>>>>>>>>>>>>>>>>>>>>>> unnecessary caching if a cached table is never created
>>>> in
>>>>>>> the
>>>>>>>>>>>>> user
>>>>>>>>>>>>>>>>>>>>>> application. But I am not opposed to do eager
>> evaluation
>>>>>>> of
>>>>>>>>>>>>> cache.
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> Jiangjie (Becket) Qin
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> On Mon, Dec 3, 2018 at 10:01 PM Till Rohrmann <
>>>>>>>>>>>>>>>>>> trohrmann@apache.org>
>>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> Another argument for Piotr's point is that lazily
>>>>>>> changing
>>>>>>>>>>>>>>>>>>> properties
>>>>>>>>>>>>>>>>>>>>> of
>>>>>>>>>>>>>>>>>>>>>> a
>>>>>>>>>>>>>>>>>>>>>>> node affects all down stream consumers but does not
>>>>>>>>>>>> necessarily
>>>>>>>>>>>>>>>>>>> have
>>>>>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>>>>>> happen before these consumers are defined. From a
>>>> user's
>>>>>>>>>>>>>>>>>>> perspective
>>>>>>>>>>>>>>>>>>>>> this
>>>>>>>>>>>>>>>>>>>>>>> can be quite confusing:
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> b = a.map(...)
>>>>>>>>>>>>>>>>>>>>>>> c = a.map(...)
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> a.cache()
>>>>>>>>>>>>>>>>>>>>>>> d = a.map(...)
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> now b, c and d will consume from a cached operator.
>> In
>>>>>>> this
>>>>>>>>>>>>> case,
>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>> user
>>>>>>>>>>>>>>>>>>>>>>> would most likely expect that only d reads from a
>>>> cached
>>>>>>>>>>>> result.
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>>>>>>>>>> Till
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> On Mon, Dec 3, 2018 at 11:32 AM Piotr Nowojski <
>>>>>>>>>>>>>>>>>>>>> piotr@data-artisans.com>
>>>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> Hey Shaoxuan and Becket,
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> Can you explain a bit more on what are the side
>>>>>>> effects?
>>>>>>>>>>> So
>>>>>>>>>>>>>>>>>>> far
>>>>>>>>>>>>>>>>>>>> my
>>>>>>>>>>>>>>>>>>>>>>>>> understanding is that such side effects only exist
>>>> if a
>>>>>>>>>>>> table
>>>>>>>>>>>>>>>>>>> is
>>>>>>>>>>>>>>>>>>>>>>> mutable.
>>>>>>>>>>>>>>>>>>>>>>>>> Is that the case?
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> Not only that. There are also performance
>> implications
>>>>>>> and
>>>>>>>>>>>>>>>>>> those
>>>>>>>>>>>>>>>>>>>> are
>>>>>>>>>>>>>>>>>>>>>>>> another implicit side effects of using `void
>> cache()`.
>>>>>>> As I
>>>>>>>>>>>>>>>>>> wrote
>>>>>>>>>>>>>>>>>>>>>> before,
>>>>>>>>>>>>>>>>>>>>>>>> reading from cache might not always be desirable,
>> thus
>>>>>>> it
>>>>>>>>>>> can
>>>>>>>>>>>>>>>>>>> cause
>>>>>>>>>>>>>>>>>>>>>>>> performance degradation and I’m fine with that -
>>>> user's
>>>>>>> or
>>>>>>>>>>>>>>>>>>>>> optimiser’s
>>>>>>>>>>>>>>>>>>>>>>>> choice. What I do not like is that this implicit
>> side
>>>>>>>>>>> effect
>>>>>>>>>>>>>>>>>> can
>>>>>>>>>>>>>>>>>>>>>> manifest
>>>>>>>>>>>>>>>>>>>>>>>> in completely different part of code, that wasn’t
>>>>>>> touched
>>>>>>>>>>> by
>>>>>>>>>>>> a
>>>>>>>>>>>>>>>>>>> user
>>>>>>>>>>>>>>>>>>>>>> while
>>>>>>>>>>>>>>>>>>>>>>>> he was adding `void cache()` call somewhere else.
>> And
>>>>>>> even
>>>>>>>>>>> if
>>>>>>>>>>>>>>>>>>>> caching
>>>>>>>>>>>>>>>>>>>>>>>> improves performance, it’s still a side effect of
>>>> `void
>>>>>>>>>>>>>>>>>> cache()`.
>>>>>>>>>>>>>>>>>>>>>> Almost
>>>>>>>>>>>>>>>>>>>>>>>> from the definition `void` methods have only side
>>>>>>> effects.
>>>>>>>>>>>> As I
>>>>>>>>>>>>>>>>>>>> wrote
>>>>>>>>>>>>>>>>>>>>>>>> before, there are couple of scenarios where this
>> might
>>>>>>> be
>>>>>>>>>>>>>>>>>>>> undesirable
>>>>>>>>>>>>>>>>>>>>>>>> and/or unexpected, for example:
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> 1.
>>>>>>>>>>>>>>>>>>>>>>>> Table b = …;
>>>>>>>>>>>>>>>>>>>>>>>> b.cache()
>>>>>>>>>>>>>>>>>>>>>>>> x = b.join(…)
>>>>>>>>>>>>>>>>>>>>>>>> y = b.count()
>>>>>>>>>>>>>>>>>>>>>>>> // ...
>>>>>>>>>>>>>>>>>>>>>>>> // 100
>>>>>>>>>>>>>>>>>>>>>>>> // hundred
>>>>>>>>>>>>>>>>>>>>>>>> // lines
>>>>>>>>>>>>>>>>>>>>>>>> // of
>>>>>>>>>>>>>>>>>>>>>>>> // code
>>>>>>>>>>>>>>>>>>>>>>>> // later
>>>>>>>>>>>>>>>>>>>>>>>> z = b.filter(…).groupBy(…) // this might be even
>>>> hidden
>>>>>>> in
>>>>>>>>>>> a
>>>>>>>>>>>>>>>>>>>>> different
>>>>>>>>>>>>>>>>>>>>>>>> method/file/package/dependency
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> 2.
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> Table b = ...
>>>>>>>>>>>>>>>>>>>>>>>> If (some_condition) {
>>>>>>>>>>>>>>>>>>>>>>>> foo(b)
>>>>>>>>>>>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>>>>>>>>>>> Else {
>>>>>>>>>>>>>>>>>>>>>>>> bar(b)
>>>>>>>>>>>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>>>>>>>>>>> z = b.filter(…).groupBy(…)
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> Void foo(Table b) {
>>>>>>>>>>>>>>>>>>>>>>>> b.cache()
>>>>>>>>>>>>>>>>>>>>>>>> // do something with b
>>>>>>>>>>>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> In both above examples, `b.cache()` will implicitly
>>>>>>> affect
>>>>>>>>>>>>>>>>>>>> (semantic
>>>>>>>>>>>>>>>>>>>>>> of a
>>>>>>>>>>>>>>>>>>>>>>>> program in case of sources being mutable and
>>>>>>> performance)
>>>>>>>>>>> `z
>>>>>>>>>>>> =
>>>>>>>>>>>>>>>>>>>>>>>> b.filter(…).groupBy(…)` which might be far from
>>>> obvious.
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> On top of that, there is still this argument of mine
>>>>>>> that
>>>>>>>>>>>>>>>>>> having
>>>>>>>>>>>>>>>>>>> a
>>>>>>>>>>>>>>>>>>>>>>>> `MaterializedTable` or `CachedTable` handle is more
>>>>>>>>>>> flexible
>>>>>>>>>>>>>>>>>> for
>>>>>>>>>>>>>>>>>>> us
>>>>>>>>>>>>>>>>>>>>> for
>>>>>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>>> future and for the user (as a manual option to
>> bypass
>>>>>>> cache
>>>>>>>>>>>>>>>>>>> reads).
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> But Jiangjie is correct,
>>>>>>>>>>>>>>>>>>>>>>>>> the source table in batching should be immutable.
>> It
>>>> is
>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>> user’s
>>>>>>>>>>>>>>>>>>>>>>>>> responsibility to ensure it, otherwise even a
>> regular
>>>>>>>>>>>>>>>>>> failover
>>>>>>>>>>>>>>>>>>>> may
>>>>>>>>>>>>>>>>>>>>>> lead
>>>>>>>>>>>>>>>>>>>>>>>>> to inconsistent results.
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> Yes, I agree that’s what perfect world/good
>> deployment
>>>>>>>>>>> should
>>>>>>>>>>>>>>>>>> be.
>>>>>>>>>>>>>>>>>>>> But
>>>>>>>>>>>>>>>>>>>>>> its
>>>>>>>>>>>>>>>>>>>>>>>> often isn’t and while I’m not trying to fix this
>>>> (since
>>>>>>> the
>>>>>>>>>>>>>>>>>>> proper
>>>>>>>>>>>>>>>>>>>>> fix
>>>>>>>>>>>>>>>>>>>>>> is
>>>>>>>>>>>>>>>>>>>>>>>> to support transactions), I’m just trying to
>> minimise
>>>>>>>>>>>> confusion
>>>>>>>>>>>>>>>>>>> for
>>>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>>> users that are not fully aware what’s going on and
>>>>>>> operate
>>>>>>>>>>> in
>>>>>>>>>>>>>>>>>>> less
>>>>>>>>>>>>>>>>>>>>> then
>>>>>>>>>>>>>>>>>>>>>>>> perfect setup. And if something bites them after
>>>> adding
>>>>>>>>>>>>>>>>>>> `b.cache()`
>>>>>>>>>>>>>>>>>>>>>> call,
>>>>>>>>>>>>>>>>>>>>>>>> to make sure that they at least know all of the
>> places
>>>>>>> that
>>>>>>>>>>>>>>>>>>> adding
>>>>>>>>>>>>>>>>>>>>> this
>>>>>>>>>>>>>>>>>>>>>>>> line can affect.
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> Thanks, Piotrek
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> On 1 Dec 2018, at 15:39, Becket Qin <
>>>>>>> becket.qin@gmail.com
>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> Hi Piotrek,
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> Thanks again for the clarification. Some more
>> replies
>>>>>>> are
>>>>>>>>>>>>>>>>>>>>> following.
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> But keep in mind that `.cache()` will/might not
>> only
>>>> be
>>>>>>>>>>> used
>>>>>>>>>>>>>>>>>> in
>>>>>>>>>>>>>>>>>>>>>>>> interactive
>>>>>>>>>>>>>>>>>>>>>>>>>> programming and not only in batching.
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> It is true. Actually in stream processing, cache()
>>>> has
>>>>>>> the
>>>>>>>>>>>>>>>>>> same
>>>>>>>>>>>>>>>>>>>>>>> semantic
>>>>>>>>>>>>>>>>>>>>>>>> as
>>>>>>>>>>>>>>>>>>>>>>>>> batch processing. The semantic is following:
>>>>>>>>>>>>>>>>>>>>>>>>> For a table created via a series of computation,
>> save
>>>>>>> that
>>>>>>>>>>>>>>>>>>> table
>>>>>>>>>>>>>>>>>>>>> for
>>>>>>>>>>>>>>>>>>>>>>>> later
>>>>>>>>>>>>>>>>>>>>>>>>> reference to avoid running the computation logic to
>>>>>>>>>>>>>>>>>> regenerate
>>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>> table.
>>>>>>>>>>>>>>>>>>>>>>>>> Once the application exits, drop all the cache.
>>>>>>>>>>>>>>>>>>>>>>>>> This semantic is same for both batch and stream
>>>>>>>>>>> processing.
>>>>>>>>>>>>>>>>>> The
>>>>>>>>>>>>>>>>>>>>>>>> difference
>>>>>>>>>>>>>>>>>>>>>>>>> is that stream applications will only run once as
>>>> they
>>>>>>> are
>>>>>>>>>>>>>>>>>> long
>>>>>>>>>>>>>>>>>>>>>>> running.
>>>>>>>>>>>>>>>>>>>>>>>>> And the batch applications may be run multiple
>> times,
>>>>>>>>>>> hence
>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>> cache
>>>>>>>>>>>>>>>>>>>>>>> may
>>>>>>>>>>>>>>>>>>>>>>>>> be created and dropped each time the application
>>>> runs.
>>>>>>>>>>>>>>>>>>>>>>>>> Admittedly, there will probably be some resource
>>>>>>>>>>> management
>>>>>>>>>>>>>>>>>>>>>>> requirements
>>>>>>>>>>>>>>>>>>>>>>>>> for the streaming cached table, such as time based
>> /
>>>>>>> size
>>>>>>>>>>>>>>>>>> based
>>>>>>>>>>>>>>>>>>>>>>>> retention,
>>>>>>>>>>>>>>>>>>>>>>>>> to address the infinite data issue. But such
>>>>>>> requirement
>>>>>>>>>>>> does
>>>>>>>>>>>>>>>>>>> not
>>>>>>>>>>>>>>>>>>>>>>> change
>>>>>>>>>>>>>>>>>>>>>>>>> the semantic.
>>>>>>>>>>>>>>>>>>>>>>>>> You are right that interactive programming is just
>>>> one
>>>>>>> use
>>>>>>>>>>>>>>>>>> case
>>>>>>>>>>>>>>>>>>>> of
>>>>>>>>>>>>>>>>>>>>>>>> cache().
>>>>>>>>>>>>>>>>>>>>>>>>> It is not the only use case.
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> For me the more important issue is of not having
>> the
>>>>>>> `void
>>>>>>>>>>>>>>>>>>>> cache()`
>>>>>>>>>>>>>>>>>>>>>>> with
>>>>>>>>>>>>>>>>>>>>>>>>>> side effects.
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> This is indeed the key point. The argument around
>>>>>>> whether
>>>>>>>>>>>>>>>>>>> cache()
>>>>>>>>>>>>>>>>>>>>>>> should
>>>>>>>>>>>>>>>>>>>>>>>>> return something already indicates that cache() and
>>>>>>>>>>>>>>>>>>> materialize()
>>>>>>>>>>>>>>>>>>>>>>> address
>>>>>>>>>>>>>>>>>>>>>>>>> different issues.
>>>>>>>>>>>>>>>>>>>>>>>>> Can you explain a bit more on what are the side
>>>>>>> effects?
>>>>>>>>>>> So
>>>>>>>>>>>>>>>>>>> far
>>>>>>>>>>>>>>>>>>>> my
>>>>>>>>>>>>>>>>>>>>>>>>> understanding is that such side effects only exist
>>>> if a
>>>>>>>>>>>> table
>>>>>>>>>>>>>>>>>>> is
>>>>>>>>>>>>>>>>>>>>>>> mutable.
>>>>>>>>>>>>>>>>>>>>>>>>> Is that the case?
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> I don’t know, probably initially we should make
>>>>>>>>>>> CachedTable
>>>>>>>>>>>>>>>>>>>>>> read-only.
>>>>>>>>>>>>>>>>>>>>>>> I
>>>>>>>>>>>>>>>>>>>>>>>>>> don’t find it more confusing than the fact that
>> user
>>>>>>> can
>>>>>>>>>>>> not
>>>>>>>>>>>>>>>>>>>> write
>>>>>>>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>>>>>>> views
>>>>>>>>>>>>>>>>>>>>>>>>>> or materialised views in SQL or that user
>> currently
>>>>>>> can
>>>>>>>>>>> not
>>>>>>>>>>>>>>>>>>>> write
>>>>>>>>>>>>>>>>>>>>>> to a
>>>>>>>>>>>>>>>>>>>>>>>>>> Table.
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> I don't think anyone should insert something to a
>>>>>>> cache.
>>>>>>>>>>> By
>>>>>>>>>>>>>>>>>>>>>> definition
>>>>>>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>>>> cache should only be updated when the corresponding
>>>>>>>>>>> original
>>>>>>>>>>>>>>>>>>>> table
>>>>>>>>>>>>>>>>>>>>> is
>>>>>>>>>>>>>>>>>>>>>>>>> updated. What I am wondering is that given the
>>>>>>> following
>>>>>>>>>>> two
>>>>>>>>>>>>>>>>>>>> facts:
>>>>>>>>>>>>>>>>>>>>>>>>> 1. If and only if a table is mutable (with
>> something
>>>>>>> like
>>>>>>>>>>>>>>>>>>>>> insert()),
>>>>>>>>>>>>>>>>>>>>>> a
>>>>>>>>>>>>>>>>>>>>>>>>> CachedTable may have implicit behavior.
>>>>>>>>>>>>>>>>>>>>>>>>> 2. A CachedTable extends a Table.
>>>>>>>>>>>>>>>>>>>>>>>>> We can come to the conclusion that a CachedTable is
>>>>>>>>>>> mutable
>>>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>>>>> users
>>>>>>>>>>>>>>>>>>>>>>> can
>>>>>>>>>>>>>>>>>>>>>>>>> insert into the CachedTable directly. This is
>> where I
>>>>>>>>>>>> thought
>>>>>>>>>>>>>>>>>>>>>>> confusing.
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> Jiangjie (Becket) Qin
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> On Sat, Dec 1, 2018 at 2:45 AM Piotr Nowojski <
>>>>>>>>>>>>>>>>>>>>>> piotr@data-artisans.com
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> Regarding naming `cache()` vs `materialize()`. One
>>>>>>> more
>>>>>>>>>>>>>>>>>>>>> explanation
>>>>>>>>>>>>>>>>>>>>>>> why
>>>>>>>>>>>>>>>>>>>>>>>> I
>>>>>>>>>>>>>>>>>>>>>>>>>> think `materialize()` is more natural to me is
>> that
>>>> I
>>>>>>>>>>> think
>>>>>>>>>>>>>>>>>> of
>>>>>>>>>>>>>>>>>>>> all
>>>>>>>>>>>>>>>>>>>>>>>> “Table”s
>>>>>>>>>>>>>>>>>>>>>>>>>> in Table-API as views. They behave the same way as
>>>> SQL
>>>>>>>>>>>>>>>>>> views,
>>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>> only
>>>>>>>>>>>>>>>>>>>>>>>>>> difference for me is that their live scope is
>> short
>>>> -
>>>>>>>>>>>>>>>>>> current
>>>>>>>>>>>>>>>>>>>>>> session
>>>>>>>>>>>>>>>>>>>>>>>> which
>>>>>>>>>>>>>>>>>>>>>>>>>> is limited by different execution model. That’s
>> why
>>>>>>>>>>>>>>>>>> “cashing”
>>>>>>>>>>>>>>>>>>> a
>>>>>>>>>>>>>>>>>>>>> view
>>>>>>>>>>>>>>>>>>>>>>>> for me
>>>>>>>>>>>>>>>>>>>>>>>>>> is just materialising it.
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> However I see and I understand your point of view.
>>>>>>> Coming
>>>>>>>>>>>>>>>>>> from
>>>>>>>>>>>>>>>>>>>>>>>>>> DataSet/DataStream and generally speaking non-SQL
>>>>>>> world,
>>>>>>>>>>>>>>>>>>>> `cache()`
>>>>>>>>>>>>>>>>>>>>>> is
>>>>>>>>>>>>>>>>>>>>>>>> more
>>>>>>>>>>>>>>>>>>>>>>>>>> natural. But keep in mind that `.cache()`
>> will/might
>>>>>>> not
>>>>>>>>>>>>>>>>>> only
>>>>>>>>>>>>>>>>>>> be
>>>>>>>>>>>>>>>>>>>>>> used
>>>>>>>>>>>>>>>>>>>>>>> in
>>>>>>>>>>>>>>>>>>>>>>>>>> interactive programming and not only in batching.
>>>> But
>>>>>>>>>>>> naming
>>>>>>>>>>>>>>>>>>> is
>>>>>>>>>>>>>>>>>>>>> one
>>>>>>>>>>>>>>>>>>>>>>>> issue,
>>>>>>>>>>>>>>>>>>>>>>>>>> and not that critical to me. Especially that once
>> we
>>>>>>>>>>>>>>>>>> implement
>>>>>>>>>>>>>>>>>>>>>> proper
>>>>>>>>>>>>>>>>>>>>>>>>>> materialised views, we can always deprecate/rename
>>>>>>>>>>>> `cache()`
>>>>>>>>>>>>>>>>>>> if
>>>>>>>>>>>>>>>>>>>> we
>>>>>>>>>>>>>>>>>>>>>>> deem
>>>>>>>>>>>>>>>>>>>>>>>> so.
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> For me the more important issue is of not having
>> the
>>>>>>>>>>> `void
>>>>>>>>>>>>>>>>>>>>> cache()`
>>>>>>>>>>>>>>>>>>>>>>> with
>>>>>>>>>>>>>>>>>>>>>>>>>> side effects. Exactly for the reasons that you
>> have
>>>>>>>>>>>>>>>>>> mentioned.
>>>>>>>>>>>>>>>>>>>>> True:
>>>>>>>>>>>>>>>>>>>>>>>>>> results might be non deterministic if underlying
>>>>>>> source
>>>>>>>>>>>>>>>>>> table
>>>>>>>>>>>>>>>>>>>> are
>>>>>>>>>>>>>>>>>>>>>>>> changing.
>>>>>>>>>>>>>>>>>>>>>>>>>> Problem is that `void cache()` implicitly changes
>>>> the
>>>>>>>>>>>>>>>>>> semantic
>>>>>>>>>>>>>>>>>>>> of
>>>>>>>>>>>>>>>>>>>>>>>>>> subsequent uses of the cached/materialized Table.
>> It
>>>>>>> can
>>>>>>>>>>>>>>>>>> cause
>>>>>>>>>>>>>>>>>>>>> “wtf”
>>>>>>>>>>>>>>>>>>>>>>>> moment
>>>>>>>>>>>>>>>>>>>>>>>>>> for a user if he inserts “b.cache()” call in some
>>>>>>> place
>>>>>>>>>>> in
>>>>>>>>>>>>>>>>>> his
>>>>>>>>>>>>>>>>>>>>> code
>>>>>>>>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>>>>>>>>>> suddenly some other random places are behaving
>>>>>>>>>>> differently.
>>>>>>>>>>>>>>>>>> If
>>>>>>>>>>>>>>>>>>>>>>>>>> `materialize()` or `cache()` returns a Table
>> handle,
>>>>>>> we
>>>>>>>>>>>>>>>>>> force
>>>>>>>>>>>>>>>>>>>> user
>>>>>>>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>>>>>>>>> explicitly use the cache which removes the
>> “random”
>>>>>>> part
>>>>>>>>>>>>>>>>>> from
>>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>>> "suddenly
>>>>>>>>>>>>>>>>>>>>>>>>>> some other random places are behaving
>> differently”.
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> This argument and others that I’ve raised (greater
>>>>>>>>>>>>>>>>>>>>>>> flexibility/allowing
>>>>>>>>>>>>>>>>>>>>>>>>>> user to explicitly bypass the cache) are
>> independent
>>>>>>> of
>>>>>>>>>>>>>>>>>>>> `cache()`
>>>>>>>>>>>>>>>>>>>>> vs
>>>>>>>>>>>>>>>>>>>>>>>>>> `materialize()` discussion.
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>> Does that mean one can also insert into the
>>>>>>> CachedTable?
>>>>>>>>>>>>>>>>>> This
>>>>>>>>>>>>>>>>>>>>>> sounds
>>>>>>>>>>>>>>>>>>>>>>>>>> pretty confusing.
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> I don’t know, probably initially we should make
>>>>>>>>>>> CachedTable
>>>>>>>>>>>>>>>>>>>>>>> read-only. I
>>>>>>>>>>>>>>>>>>>>>>>>>> don’t find it more confusing than the fact that
>> user
>>>>>>> can
>>>>>>>>>>>> not
>>>>>>>>>>>>>>>>>>>> write
>>>>>>>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>>>>>>> views
>>>>>>>>>>>>>>>>>>>>>>>>>> or materialised views in SQL or that user
>> currently
>>>>>>> can
>>>>>>>>>>> not
>>>>>>>>>>>>>>>>>>>> write
>>>>>>>>>>>>>>>>>>>>>> to a
>>>>>>>>>>>>>>>>>>>>>>>>>> Table.
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> Piotrek
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>> On 30 Nov 2018, at 17:38, Xingcan Cui <
>>>>>>>>>>> xingcanc@gmail.com
>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>> I agree with @Becket that `cache()` and
>>>>>>> `materialize()`
>>>>>>>>>>>>>>>>>>> should
>>>>>>>>>>>>>>>>>>>> be
>>>>>>>>>>>>>>>>>>>>>>>>>> considered as two different methods where the
>> later
>>>>>>> one
>>>>>>>>>>> is
>>>>>>>>>>>>>>>>>>> more
>>>>>>>>>>>>>>>>>>>>>>>>>> sophisticated.
>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>> According to my understanding, the initial idea
>> is
>>>>>>> just
>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>>>>> introduce
>>>>>>>>>>>>>>>>>>>>>>> a
>>>>>>>>>>>>>>>>>>>>>>>>>> simple cache or persist mechanism, but as the
>>>> TableAPI
>>>>>>>>>>> is a
>>>>>>>>>>>>>>>>>>>>>> high-level
>>>>>>>>>>>>>>>>>>>>>>>> API,
>>>>>>>>>>>>>>>>>>>>>>>>>> it’s naturally for as to think in a SQL way.
>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>> Maybe we can add the `cache()` method to the
>>>> DataSet
>>>>>>> API
>>>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>>>>> force
>>>>>>>>>>>>>>>>>>>>>>>> users
>>>>>>>>>>>>>>>>>>>>>>>>>> to translate a Table to a Dataset before caching
>> it.
>>>>>>> Then
>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>> users
>>>>>>>>>>>>>>>>>>>>>>>> should
>>>>>>>>>>>>>>>>>>>>>>>>>> manually register the cached dataset to a table
>>>> again
>>>>>>> (we
>>>>>>>>>>>>>>>>>> may
>>>>>>>>>>>>>>>>>>>> need
>>>>>>>>>>>>>>>>>>>>>>> some
>>>>>>>>>>>>>>>>>>>>>>>>>> table replacement mechanisms for datasets with an
>>>>>>>>>>> identical
>>>>>>>>>>>>>>>>>>>> schema
>>>>>>>>>>>>>>>>>>>>>> but
>>>>>>>>>>>>>>>>>>>>>>>>>> different contents here). After all, it’s the
>>>> dataset
>>>>>>>>>>>> rather
>>>>>>>>>>>>>>>>>>>> than
>>>>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>>>>> dynamic table that need to be cached, right?
>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>>>>>>>>>>>>> Xingcan
>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Nov 30, 2018, at 10:57 AM, Becket Qin <
>>>>>>>>>>>>>>>>>>>> becket.qin@gmail.com>
>>>>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi Piotrek and Jark,
>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for the feedback and explanation. Those
>> are
>>>>>>> good
>>>>>>>>>>>>>>>>>>>>> arguments.
>>>>>>>>>>>>>>>>>>>>>>>> But I
>>>>>>>>>>>>>>>>>>>>>>>>>>>> think those arguments are mostly about
>>>> materialized
>>>>>>>>>>> view.
>>>>>>>>>>>>>>>>>>> Let
>>>>>>>>>>>>>>>>>>>> me
>>>>>>>>>>>>>>>>>>>>>> try
>>>>>>>>>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>>>>>>>>>>> explain the reason I believe cache() and
>>>>>>> materialize()
>>>>>>>>>>>> are
>>>>>>>>>>>>>>>>>>>>>>> different.
>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> I think cache() and materialize() have quite
>>>>>>> different
>>>>>>>>>>>>>>>>>>>>>> implications.
>>>>>>>>>>>>>>>>>>>>>>>> An
>>>>>>>>>>>>>>>>>>>>>>>>>>>> analogy I can think of is save()/publish(). When
>>>>>>> users
>>>>>>>>>>>>>>>>>> call
>>>>>>>>>>>>>>>>>>>>>> cache(),
>>>>>>>>>>>>>>>>>>>>>>>> it
>>>>>>>>>>>>>>>>>>>>>>>>>> is
>>>>>>>>>>>>>>>>>>>>>>>>>>>> just like they are saving an intermediate result
>>>> as
>>>>>>> a
>>>>>>>>>>>>>>>>>> draft
>>>>>>>>>>>>>>>>>>> of
>>>>>>>>>>>>>>>>>>>>>> their
>>>>>>>>>>>>>>>>>>>>>>>>>> work,
>>>>>>>>>>>>>>>>>>>>>>>>>>>> this intermediate result may not have any
>>>> realistic
>>>>>>>>>>>>>>>>>> meaning.
>>>>>>>>>>>>>>>>>>>>>> Calling
>>>>>>>>>>>>>>>>>>>>>>>>>>>> cache() does not mean users want to publish the
>>>>>>> cached
>>>>>>>>>>>>>>>>>> table
>>>>>>>>>>>>>>>>>>>> in
>>>>>>>>>>>>>>>>>>>>>> any
>>>>>>>>>>>>>>>>>>>>>>>>>> manner.
>>>>>>>>>>>>>>>>>>>>>>>>>>>> But when users call materialize(), that means "I
>>>>>>> have
>>>>>>>>>>>>>>>>>>>> something
>>>>>>>>>>>>>>>>>>>>>>>>>> meaningful
>>>>>>>>>>>>>>>>>>>>>>>>>>>> to be reused by others", now users need to think
>>>>>>> about
>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>> validation,
>>>>>>>>>>>>>>>>>>>>>>>>>>>> update & versioning, lifecycle of the result,
>> etc.
>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Piotrek's suggestions on variations of the
>>>>>>>>>>> materialize()
>>>>>>>>>>>>>>>>>>>> methods
>>>>>>>>>>>>>>>>>>>>>> are
>>>>>>>>>>>>>>>>>>>>>>>>>> very
>>>>>>>>>>>>>>>>>>>>>>>>>>>> useful. It would be great if Flink have them.
>> The
>>>>>>>>>>> concept
>>>>>>>>>>>>>>>>>> of
>>>>>>>>>>>>>>>>>>>>>>>>>> materialized
>>>>>>>>>>>>>>>>>>>>>>>>>>>> view is actually a pretty big feature, not to
>> say
>>>>>>> the
>>>>>>>>>>>>>>>>>>> related
>>>>>>>>>>>>>>>>>>>>>> stuff
>>>>>>>>>>>>>>>>>>>>>>>> like
>>>>>>>>>>>>>>>>>>>>>>>>>>>> triggers/hooks you mentioned earlier. I think
>> the
>>>>>>>>>>>>>>>>>>> materialized
>>>>>>>>>>>>>>>>>>>>>> view
>>>>>>>>>>>>>>>>>>>>>>>>>> itself
>>>>>>>>>>>>>>>>>>>>>>>>>>>> should be discussed in a more thorough and
>>>>>>> systematic
>>>>>>>>>>>>>>>>>>> manner.
>>>>>>>>>>>>>>>>>>>>> And
>>>>>>>>>>>>>>>>>>>>>> I
>>>>>>>>>>>>>>>>>>>>>>>>>> found
>>>>>>>>>>>>>>>>>>>>>>>>>>>> that discussion is kind of orthogonal and way
>>>> beyond
>>>>>>>>>>>>>>>>>>>> interactive
>>>>>>>>>>>>>>>>>>>>>>>>>>>> programming experience.
>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> The example you gave was interesting. I still
>> have
>>>>>>> some
>>>>>>>>>>>>>>>>>>>>> questions,
>>>>>>>>>>>>>>>>>>>>>>>>>> though.
>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Table source = … // some source that scans files
>>>>>>> from a
>>>>>>>>>>>>>>>>>>>>> directory
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> “/foo/bar/“
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Table t1 = source.groupBy(…).select(…).where(…)
>>>> ….;
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Table t2 = t1.materialize() // (or `cache()`)
>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> t2.count() // initialise cache (if it’s lazily
>>>>>>>>>>>>>>>>>> initialised)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> int a1 = t1.count()
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> int b1 = t2.count()
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> // something in the background (or we trigger
>> it)
>>>>>>>>>>> writes
>>>>>>>>>>>>>>>>>>> new
>>>>>>>>>>>>>>>>>>>>>> files
>>>>>>>>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> /foo/bar
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> int a2 = t1.count()
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> int b2 = t2.count()
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> t2.refresh() // possible future extension, not
>> to
>>>>>>> be
>>>>>>>>>>>>>>>>>>>>> implemented
>>>>>>>>>>>>>>>>>>>>>> in
>>>>>>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> initial version
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> what if someone else added some more files to
>>>>>>> /foo/bar
>>>>>>>>>>> at
>>>>>>>>>>>>>>>>>>> this
>>>>>>>>>>>>>>>>>>>>>>> point?
>>>>>>>>>>>>>>>>>>>>>>>> In
>>>>>>>>>>>>>>>>>>>>>>>>>>>> that case, a3 won't equals to b3, and the result
>>>>>>> become
>>>>>>>>>>>>>>>>>>>>>>>>>> non-deterministic,
>>>>>>>>>>>>>>>>>>>>>>>>>>>> right?
>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> int a3 = t1.count()
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> int b3 = t2.count()
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> t2.drop() // another possible future extension,
>>>>>>> manual
>>>>>>>>>>>>>>>>>>>> “cache”
>>>>>>>>>>>>>>>>>>>>>>>> dropping
>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> When we talk about interactive programming, in
>>>> most
>>>>>>>>>>>> cases,
>>>>>>>>>>>>>>>>>>> we
>>>>>>>>>>>>>>>>>>>>> are
>>>>>>>>>>>>>>>>>>>>>>>>>> talking
>>>>>>>>>>>>>>>>>>>>>>>>>>>> about batch applications. A fundamental
>> assumption
>>>>>>> of
>>>>>>>>>>>> such
>>>>>>>>>>>>>>>>>>>> case
>>>>>>>>>>>>>>>>>>>>> is
>>>>>>>>>>>>>>>>>>>>>>>> that
>>>>>>>>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>>>>>>> source data is complete before the data
>> processing
>>>>>>>>>>>> begins,
>>>>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>>> data
>>>>>>>>>>>>>>>>>>>>>>>>>>>> will not change during the data processing. IMO,
>>>> if
>>>>>>>>>>>>>>>>>>> additional
>>>>>>>>>>>>>>>>>>>>>> rows
>>>>>>>>>>>>>>>>>>>>>>>>>> needs
>>>>>>>>>>>>>>>>>>>>>>>>>>>> to be added to some source during the
>> processing,
>>>> it
>>>>>>>>>>>>>>>>>> should
>>>>>>>>>>>>>>>>>>> be
>>>>>>>>>>>>>>>>>>>>>> done
>>>>>>>>>>>>>>>>>>>>>>> in
>>>>>>>>>>>>>>>>>>>>>>>>>> ways
>>>>>>>>>>>>>>>>>>>>>>>>>>>> like union the source with another table
>>>> containing
>>>>>>> the
>>>>>>>>>>>>>>>>>> rows
>>>>>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>>>> be
>>>>>>>>>>>>>>>>>>>>>>>>>> added.
>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> There are a few cases that computations are
>>>> executed
>>>>>>>>>>>>>>>>>>>> repeatedly
>>>>>>>>>>>>>>>>>>>>> on
>>>>>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>>>>>>> changing data source.
>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> For example, people may run a ML training job
>>>> every
>>>>>>>>>>> hour
>>>>>>>>>>>>>>>>>>> with
>>>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>>>>> samples
>>>>>>>>>>>>>>>>>>>>>>>>>>>> newly added in the past hour. In that case, the
>>>>>>> source
>>>>>>>>>>>>>>>>>> data
>>>>>>>>>>>>>>>>>>>>>> between
>>>>>>>>>>>>>>>>>>>>>>>> will
>>>>>>>>>>>>>>>>>>>>>>>>>>>> indeed change. But still, the data remain
>>>> unchanged
>>>>>>>>>>>> within
>>>>>>>>>>>>>>>>>>> one
>>>>>>>>>>>>>>>>>>>>>> run.
>>>>>>>>>>>>>>>>>>>>>>>> And
>>>>>>>>>>>>>>>>>>>>>>>>>>>> usually in that case, the result will need
>>>>>>> versioning,
>>>>>>>>>>>>>>>>>> i.e.
>>>>>>>>>>>>>>>>>>>> for
>>>>>>>>>>>>>>>>>>>>> a
>>>>>>>>>>>>>>>>>>>>>>>> given
>>>>>>>>>>>>>>>>>>>>>>>>>>>> result, it tells that the result is a result
>> from
>>>>>>> the
>>>>>>>>>>>>>>>>>> source
>>>>>>>>>>>>>>>>>>>>> data
>>>>>>>>>>>>>>>>>>>>>>> by a
>>>>>>>>>>>>>>>>>>>>>>>>>>>> certain timestamp.
>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Another example is something like data
>> warehouse.
>>>> In
>>>>>>>>>>> this
>>>>>>>>>>>>>>>>>>>> case,
>>>>>>>>>>>>>>>>>>>>>>> there
>>>>>>>>>>>>>>>>>>>>>>>>>> are a
>>>>>>>>>>>>>>>>>>>>>>>>>>>> few source of original/raw data. On top of those
>>>>>>>>>>> sources,
>>>>>>>>>>>>>>>>>>> many
>>>>>>>>>>>>>>>>>>>>>>>>>> materialized
>>>>>>>>>>>>>>>>>>>>>>>>>>>> view / queries / reports / dashboards can be
>>>>>>> created to
>>>>>>>>>>>>>>>>>>>> generate
>>>>>>>>>>>>>>>>>>>>>>>> derived
>>>>>>>>>>>>>>>>>>>>>>>>>>>> data. Those derived data needs to be updated
>> when
>>>>>>> the
>>>>>>>>>>>>>>>>>>>> underlying
>>>>>>>>>>>>>>>>>>>>>>>>>> original
>>>>>>>>>>>>>>>>>>>>>>>>>>>> data changes. In that case, the processing logic
>>>>>>> that
>>>>>>>>>>>>>>>>>>> derives
>>>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>>>>> original
>>>>>>>>>>>>>>>>>>>>>>>>>>>> data needs to be executed repeatedly to update
>>>> those
>>>>>>>>>>>>>>>>>>>>>> reports/views.
>>>>>>>>>>>>>>>>>>>>>>>>>> Again,
>>>>>>>>>>>>>>>>>>>>>>>>>>>> all those derived data also need to ha
>>>>>>> 
>>>>>>> 
>>>> 
>>>> 
>>>> 
>> 
>> 


Re: [DISCUSS] Support Interactive Programming in Flink Table API

Posted by Becket Qin <be...@gmail.com>.
Hi Piotr,

You are right. There might be two intuitive meanings when users call
'a.uncache()', namely:
1. Release the resource.
2. Do not use the cache for the next operation.

Case (1) would likely be the dominant use case. So I would suggest we
dedicate the uncache() method to case (1), i.e. resource release, and not
to ignoring the cache.

For case (2), i.e. explicitly ignoring the cache (which is rare), users may
use something like 'hint("ignoreCache")'. I think this is better, as it is a
little weird for users to call `a.uncache()` while they may not even know
whether the table is cached at all.
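
To illustrate the intended usage (the hint() method is hypothetical at this
point; the snippet only sketches the idea):

Table a = ...
a.cache();
Table b = a.filter(...);                     // optimizer may read from the cache
Table c = a.hint("ignoreCache").filter(...); // always uses the original DAG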

Assuming we let `uncache()` only release the resource, one possibility is
using a ref count to mitigate the side effect: the ref count is incremented
on `cache()` and decremented on `uncache()`, so `uncache()` does not
physically release the resource immediately, but just indicates that the
cache could be released.
That being said, I am not sure if this is really a better solution, as it
seems a little counter-intuitive. Maybe calling it releaseCache() helps a
little bit?
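
To make the counting concrete, here is a rough sketch of the bookkeeping such
a ref-counted release could do (the class and method names are made up purely
for illustration; this is a sketch, not a proposed API):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical bookkeeping inside the cache service: one counter per cache.
public class CacheRefCounter {

    private final Map<String, AtomicInteger> refCounts = new ConcurrentHashMap<>();

    // Invoked on table.cache(): one more reference to this cache.
    public void retain(String cachedTableId) {
        refCounts.computeIfAbsent(cachedTableId, id -> new AtomicInteger(0))
                 .incrementAndGet();
    }

    // Invoked on table.uncache() / releaseCache(): the physical cache is
    // dropped only when the last reference goes away (races ignored for
    // brevity).
    public void release(String cachedTableId) {
        AtomicInteger count = refCounts.get(cachedTableId);
        if (count != null && count.decrementAndGet() <= 0) {
            refCounts.remove(cachedTableId);
            dropPhysicalCache(cachedTableId);
        }
    }

    private void dropPhysicalCache(String cachedTableId) {
        // free the memory / disk held by the cached intermediate result
    }
}
```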

Thanks,

Jiangjie (Becket) Qin




On Tue, Jan 8, 2019 at 5:36 PM Piotr Nowojski <pi...@da-platform.com> wrote:

> Hi Becket,
>
> With `uncache` there are probably two features that we can think about:
>
> a)
>
> Physically dropping the cached table from the storage, freeing up the
> resources
>
> b)
>
> Hinting the optimizer to not cache the reads for the next query/table
>
> a) Has the issue, as I wrote before, that it seemed to be an operation
> inherently “flawed” by having side effects.
>
> I’m not sure how it would best be expressed. We could make it work:
>
> 1. via a method on a Table as you proposed:
>
> void Table#dropCache()
> void Table#uncache()
>
> 2. Operation on the environment
>
> env.dropCacheFor(table) // or some other argument that allows user to
> identify the desired cache
>
> 3. Extending (from your original design doc) `setTableService` method to
> return some control handle like:
>
> TableServiceControl setTableService(TableFactory tf,
>                      TableProperties properties,
>                      TempTableCleanUpCallback cleanUpCallback);
>
> (TableServiceControl? TableService? TableServiceHandle? CacheService?)
>
> And having the drop cache method there:
>
> TableServiceControl#dropCache(table)
>
> Out of those options, option 1 might have the disadvantage of not making
> the user aware that this is a global operation with side effects.
> Like the old example of:
>
> public void foo(Table t) {
>   // …
>   t.dropCache();
> }
>
> It might not be immediately obvious that `t.dropCache()` is some kind of
> global operation, with side effects visible outside of the `foo` function.
>
> On the other hand, both options 2 and 3 might have a greater chance of
> catching the user’s attention:
>
> public void foo(Table t, CacheService cacheService) {
>   // …
>   cacheService.dropCache(t);
> }
>
> b) could be achieved quite easily:
>
> Table a = …
> val notCached1 = a.doNotCache()
> val cachedA = a.cache()
> val notCached2 = cachedA.doNotCache() // equivalent of notCached1
>
> `doNotCache()` would behave similarly to `cache()` - return a copy of the
> table with removed “cache” hint and/or added “never cache” hint.
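>
> To illustrate the copy semantics (a rough sketch only, assuming the hints
> are kept as a simple immutable set on the table - not necessarily how it
> would actually be implemented):
>
> ```
> import java.util.HashSet;
> import java.util.Set;
>
> // Sketch: an immutable table handle that carries optimizer hints.
> final class HintedTable {
>     private final Set<String> hints;
>
>     HintedTable(Set<String> hints) {
>         this.hints = new HashSet<>(hints);
>     }
>
>     // cache(): copy with the "cache" hint added; `this` stays untouched.
>     HintedTable cache() {
>         Set<String> h = new HashSet<>(hints);
>         h.remove("neverCache");
>         h.add("cache");
>         return new HintedTable(h);
>     }
>
>     // doNotCache(): copy with "cache" removed and "neverCache" added.
>     HintedTable doNotCache() {
>         Set<String> h = new HashSet<>(hints);
>         h.remove("cache");
>         h.add("neverCache");
>         return new HintedTable(h);
>     }
> }
> ```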
>
> Piotrek
>
>
> > On 8 Jan 2019, at 03:17, Becket Qin <be...@gmail.com> wrote:
> >
> > Hi Piotr,
> >
> > Thanks for the proposal and detailed explanation. I like the idea of
> > returning a new hinted Table without modifying the original table. This
> > also leaves room for users to benefit from future implicit caching.
> >
> > Just to make sure I get the full picture. In your proposal, there will
> also
> > be a 'void Table#uncache()' method to release the cache, right?
> >
> > Thanks,
> >
> > Jiangjie (Becket) Qin
> >
> > On Mon, Jan 7, 2019 at 11:50 PM Piotr Nowojski <pi...@da-platform.com>
> > wrote:
> >
> >> Hi Becket!
> >>
> >> After further thinking I tend to agree that my previous proposal (*Option
> >> 2*) indeed might not be ideal if we would in the future introduce automatic
> >> caching.
> >> However I would like to propose a slightly modified version of it:
> >>
> >> *Option 4*
> >>
> >> Adding a `cache()` method with the following signature:
> >>
> >> Table Table#cache();
> >>
> >> Without side effects: the `cache()` call does not modify/change the
> >> original Table in any way.
> >> It would return a copy of the original table, with an added hint for the
> >> optimizer to cache the table, so that the future accesses to the
> returned
> >> table might be cached or not.
> >>
> >> Assuming that we are talking about a setup where we do not have
> automatic
> >> caching enabled (possible future extension).
> >>
> >> Example #1:
> >>
> >> ```
> >> Table a = …
> >> a.foo() // not cached
> >>
> >> val cachedTable = a.cache();
> >>
> >> cachedA.bar() // maybe cached
> >> a.foo() // same as before - effectively not cached
> >> ```
> >>
> >> Both the first and the second `a.foo()` operations would behave in exactly
> >> the same way. Again, the `a.cache()` call doesn’t affect `a` itself. If
> `a`
> >> was not hinted for caching before `a.cache();`, then both `a.foo()`
> calls
> >> wouldn’t use cache.
> >>
> >> The returned `cachedA` would carry the “cache” hint, so probably
> >> `cachedA.bar()` would go through the cache (unless the optimiser decides
> >> otherwise)
> >>
> >> Example #2
> >>
> >> ```
> >> Table a = …
> >>
> >> a.foo() // not cached
> >>
> >> val b = a.cache();
> >>
> >> a.foo() // same as before - effectively not cached
> >> b.foo() // maybe cached
> >>
> >> val c = b.cache();
> >>
> >> a.foo() // same as before - effectively not cached
> >> b.foo() // same as before - effectively maybe cached
> >> c.foo() // maybe cached
> >> ```
> >>
> >> Now, assuming that we have some future “automatic caching optimisation”:
> >>
> >> Example #3
> >>
> >> ```
> >> env.enableAutomaticCaching()
> >> Table a = …
> >>
> >> a.foo() // might be cached, depending on whether `a` was selected for automatic
> >> caching
> >>
> >> val b = a.cache();
> >>
> >> a.foo() // same as before - might be cached, if `a` was selected for
> >> automatic caching
> >> b.foo() // maybe cached
> >> ```
> >>
> >>
> >> More or less this is the same behaviour as:
> >>
> >> Table a = ...
> >> val b = a.filter(x > 20)
> >>
> >> calling `filter` hasn’t changed or altered `a` in any way. If `a` was
> >> previously filtered:
> >>
> >> Table src = …
> >> val a = src.filter(x > 20)
> >> val b = a.filter(x > 20)
> >>
> >> then yes, `a` and `b` will be the same. But the point is that neither
> >> `filter` nor `cache` changes the original `a` table.
> >>
> >> One thing is that indeed, the physical cache-dropping operation will have
> >> side effects and will in a way mutate the cached table references. But
> >> this is, I think, unavoidable in any solution - the same issue as calling
> >> `.close()`, or calling a destructor in C++.
> >>
> >> Piotrek
> >>
> >>> On 7 Jan 2019, at 10:41, Becket Qin <be...@gmail.com> wrote:
> >>>
> >>> Happy New Year, everybody!
> >>>
> >>> I would like to resume this discussion thread. At this point, We have
> >>> agreed on the first step goal of interactive programming. The open
> >>> discussion is the exact API. More specifically, what should *cache()*
> >>> method return and what is the semantic. There are three options:
> >>>
> >>> *Option 1*
> >>> *void cache()* OR *Table cache()* which returns the original table for
> >>> chained calls.
> >>> *void uncache() *releases the cache.
> >>> *Table.hint(ignoreCache).foo()* to ignore cache for operation foo().
> >>>
> >>> - Semantic: a.cache() hints that table 'a' should be cached. Optimizer
> >>> decides whether the cache will be used or not.
> >>> - pros: simple and no confusion between CachedTable and original table
> >>> - cons: A table may be cached / uncached in a method invocation, while
> >> the
> >>> caller does not know about this.
> >>>
> >>> *Option 2*
> >>> *CachedTable cache()*
> >>> *CachedTable *extends *Table *with an additional *uncache()* method
> >>>
> >>> - Semantic: After *val cachedA = a.cache()*, *cachedA.foo()* will
> always
> >>> use cache. *a.bar() *will always use original DAG.
> >>> - pros: No potential side effects in method invocation.
> >>> - cons: Optimizer has no chance to kick in. Future optimization will
> >> become
> >>> a behavior change and will require users to change the code.
> >>>
> >>> *Option 3*
> >>> *CacheHandle cache()*
> >>> *CacheHandle.release() *to release a cache handle on the table. If all
> >>> cache handles are released, the cache could be removed.
> >>> *Table.hint(ignoreCache).foo()* to ignore cache for operation foo().
> >>>
> >>> - Semantic: *a.cache() *hints that 'a' should be cached. Optimizer
> >> decides
> >>> whether the cache will be used or not. The cache is released either when
> >>> no handle is on it, or when the user program exits.
> >>> - pros: No potential side effect in method invocation. No confusion
> >> between
> >>> the cached table and the original table.
> >>> - cons: An additional CacheHandle exposed to the users.
> >>>
> >>>
> >>> Personally I prefer option 3 for the following reasons:
> >>> 1. It is simple. The vast majority of users would just call
> >>> *a.cache()* followed
> >>> by *a.foo(),* *a.bar(), etc. *
> >>> 2. There is no semantic ambiguity or semantic change if we decide to
> add
> >>> implicit cache in the future.
> >>> 3. There is no side effect in the method calls.
> >>> 4. Admittedly we need to expose one more CacheHandle class to the
> users.
> >>> But it is not that difficult to understand given the similar well-known
> >>> concept of a ref count (we can name it CacheReference if that is easier to
> >>> understand). So I think it is fine.
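> >>>
> >>> To make option 3 concrete, the API surface could look roughly like this
> >>> (names are illustrative only):
> >>>
> >>> Table a = ...
> >>> CacheHandle handle = a.cache(); // hint: 'a' should be cached
> >>> a.foo();                        // optimizer decides cache vs. original DAG
> >>> a.hint(ignoreCache).bar();      // explicitly bypass the cache
> >>> handle.release();               // cache dropped once all handles are released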
> >>>
> >>>
> >>> Thanks,
> >>>
> >>> Jiangjie (Becket) Qin
> >>>
> >>>
> >>> On Thu, Dec 13, 2018 at 11:23 AM Becket Qin <be...@gmail.com>
> >> wrote:
> >>>
> >>>> Hi Piotrek,
> >>>>
> >>>> 1. Regarding optimization.
> >>>> Sure there are many cases where the decision is hard to make. But that
> >> does
> >>>> not make it any easier for the users to make those decisions. I
> imagine
> >> 99%
> >>>> of the users would just naively use cache. I am not saying we can
> >> optimize
> >>>> in all the cases. But as long as we agree that at least in certain
> >> cases (I
> >>>> would argue most cases), the optimizer can do a little better than an
> >> average
> >>>> user who likely knows little about Flink internals, we should not push
> >> the
> >>>> burden of optimization to users.
> >>>>
> >>>> BTW, it seems some of your concerns are related to the
> implementation. I
> >>>> did not mention the implementation of the caching service because that
> >>>> should not affect the API semantic. Not sure if this helps, but
> imagine
> >> the
> >>>> default implementation has one StorageNode service colocating with
> each
> >> TM.
> >>>> It could be running within the TM process or in a standalone process,
> >>>> depending on configuration.
> >>>>
> >>>> The StorageNode uses a memory + spill-to-disk mechanism. The cached data
> >>>> will just be written to the local StorageNode service. If the
> >> StorageNode
> >>>> is running within the TM process, the in-memory cache could just be
> >> objects
> >>>> so we save some serde cost. A later job referring to the cached Table
> >> will
> >>>> be scheduled in a locality-aware manner, i.e. run in the TM whose peer
> >>>> StorageNode hosts the data.
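> >>>>
> >>>> (Just to sketch the idea, a hypothetical interface of such a service -
> >>>> not a concrete design; Row stands for a generic record type here:)
> >>>>
> >>>> // One StorageNode per TM, backed by memory with spill-to-disk.
> >>>> interface StorageNode {
> >>>>   // Write the intermediate result under the cached table's id.
> >>>>   void put(String cachedTableId, Iterable<Row> rows);
> >>>>   // Serve a later job scheduled on this TM for locality.
> >>>>   Iterable<Row> get(String cachedTableId);
> >>>>   // Drop the data once the cache is released.
> >>>>   void drop(String cachedTableId);
> >>>> }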
> >>>>
> >>>>
> >>>> 2. Semantic
> >>>> I am not sure why introducing a new hintCache() or
> >>>> env.enableAutomaticCaching() method would avoid the consequence of
> >> semantic
> >>>> change.
> >>>>
> >>>> If the auto optimization is not enabled by default, users still need
> to
> >>>> make code changes to all existing programs in order to get the benefit.
> >>>> If the auto optimization is enabled by default, advanced users who
> know
> >>>> that they really want to use cache will suddenly lose the opportunity
> >> to do
> >>>> so, unless they change the code to disable auto optimization.
> >>>>
> >>>>
> >>>> 3. side effect
> >>>> The CacheHandle is not only about where to put uncache(). It is to solve
> >>>> the implicit performance impact by moving the uncache() to the
> >>>> CacheHandle.
> >>>>
> >>>>  - If users want to leverage the cache, they can call a.cache(). After
> >>>>  that, unless the user explicitly releases that CacheHandle, a.foo() will
> >> always
> >>>>  leverage cache if needed (optimizer may choose to ignore cache if
> that
> >>>>  helps accelerate the process). Any function call will not be able to
> >>>>  release the cache because they do not have that CacheHandle.
> >>>>  - If some advanced users do not want to use cache at all, they will
> >>>>  call a.hint(ignoreCache).foo(). This will for sure ignore cache and
> >> use the
> >>>>  original DAG to process.
> >>>>
> >>>>
> >>>>> In vast majority of the cases, users wouldn't really care whether the
> >>>>> cache is used or not.
> >>>>> I wouldn’t agree with that, because “caching” (if not purely in
> memory
> >>>>> caching) would add additional IO costs. It’s similar as saying that
> >> users
> >>>>> would not see a difference between Spark/Flink and MapReduce
> (MapReduce
> >>>>> writes data to disks after every map/reduce stage).
> >>>>
> >>>> What I wanted to say is that in most cases, after users call cache(),
> >> they
> >>>> don't really care about whether auto optimization has decided to
> ignore
> >> the
> >>>> cache or not, as long as the program runs faster.
> >>>>
> >>>> Thanks,
> >>>>
> >>>> Jiangjie (Becket) Qin
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>
> >>>> On Wed, Dec 12, 2018 at 10:50 PM Piotr Nowojski <
> >> piotr@data-artisans.com>
> >>>> wrote:
> >>>>
> >>>>> Hi,
> >>>>>
> >>>>> Thanks for the quick answer :)
> >>>>>
> >>>>> Re 1.
> >>>>>
> >>>>> I generally agree with you, however a couple of points:
> >>>>>
> >>>>> a) the problem with using automatic caching is bigger, because you will
> >>>>> have to decide how to compare IO vs CPU costs, and if you pick wrong,
> >>>>> additional IO costs might be enormous or can even crash your system.
> >>>>> This is a more difficult problem compared to, let’s say, join
> >>>>> reordering, where the only issue is to have good statistics that can
> >>>>> capture correlations between columns (when you reorder joins, the number
> >>>>> of IO operations does not change)
> >>>>> b) your example is completely independent of caching.
> >>>>>
> >>>>> Query like this:
> >>>>>
> >>>>> src1.filter('f1 > 10).join(src2.filter('f2 < 30), 'f1 === 'f2).as('f3,
> >>>>> …).filter('f3 > 30)
> >>>>>
> >>>>> Should/could be optimised to empty result immediately, without the
> need
> >>>>> for any cache/materialisation and that should work even without any
> >>>>> statistics provided by the connector.
> >>>>>
> >>>>> For me, a prerequisite to any serious cost-based optimisations would be
> >>>>> some reasonable benchmark coverage of the code (TPC-H?). Otherwise that
> >>>>> would be the equivalent of adding untested code, since we wouldn’t be
> >>>>> able to verify our assumptions, like how the writing of 10 000 records
> >>>>> to a cache/RocksDB/Kafka/CSV file compares to the
> >>>>> joining/filtering/processing of, let’s say, 1 000 000 rows.
> >>>>>
> >>>>> Re 2.
> >>>>>
> >>>>> I wasn’t proposing to change the semantic later. I was proposing that
> >> we
> >>>>> start now:
> >>>>>
> >>>>> CachedTable cachedA = a.cache()
> >>>>> cachedA.foo() // Cache is used
> >>>>> a.bar() // Original DAG is used
> >>>>>
> >>>>> And then later we can think about adding for example
> >>>>>
> >>>>> CachedTable cachedA = a.hintCache()
> >>>>> cachedA.foo() // Cache might be used
> >>>>> a.bar() // Original DAG is used
> >>>>>
> >>>>> Or
> >>>>>
> >>>>> env.enableAutomaticCaching()
> >>>>> a.foo() // Cache might be used
> >>>>> a.bar() // Cache might be used
> >>>>>
> >>>>> Or (I would still not like this option):
> >>>>>
> >>>>> a.hintCache()
> >>>>> a.foo() // Cache might be used
> >>>>> a.bar() // Cache might be used
> >>>>>
> >>>>> Or whatever else that will come to our mind. Even if we add some
> >>>>> automatic caching in the future, keeping implicit (`CachedTable
> >> cache()`)
> >>>>> caching will still be useful, at least in some cases.
> >>>>>
> >>>>> Re 3.
> >>>>>
> >>>>>> 2. The source tables are immutable during one run of batch
> processing
> >>>>> logic.
> >>>>>> 3. The cache is immutable during one run of batch processing logic.
> >>>>>
> >>>>>> I think assumption 2 and 3 are by definition what batch processing
> >>>>> means,
> >>>>>> i.e the data must be complete before it is processed and should not
> >>>>> change
> >>>>>> when the processing is running.
> >>>>>
> >>>>> I agree that this is how batch systems SHOULD be working. However I know
> >>>>> from my previous experience that it’s not always the case. Sometimes
> >>>>> users are just working on some non-transactional storage, which can be
> >>>>> (either constantly or occasionally) modified by some other processes for
> >>>>> whatever reasons (fixing the data, updating, adding new data, etc.).
> >>>>>
> >>>>> But even if we ignore this point (data immutability), the performance
> >>>>> side-effect issue of your proposal remains. If a user calls `void
> >>>>> a.cache()` deep inside some private method, it will have implicit side
> >>>>> effects on other parts of their program that might not be obvious.
> >>>>>
> >>>>> Re `CacheHandle`.
> >>>>>
> >>>>> If I understand it correctly, it only addresses the issue where to
> >> place
> >>>>> method `uncache`/`dropCache`.
> >>>>>
> >>>>> Btw,
> >>>>>
> >>>>>> In vast majority of the cases, users wouldn't really care whether
> the
> >>>>> cache is used or not.
> >>>>>
> >>>>> I wouldn’t agree with that, because “caching” (if not purely in
> memory
> >>>>> caching) would add additional IO costs. It’s similar as saying that
> >> users
> >>>>> would not see a difference between Spark/Flink and MapReduce
> (MapReduce
> >>>>> writes data to disks after every map/reduce stage).
> >>>>>
> >>>>> Piotrek
> >>>>>
> >>>>>> On 12 Dec 2018, at 14:28, Becket Qin <be...@gmail.com> wrote:
> >>>>>>
> >>>>>> Hi Piotrek,
> >>>>>>
> >>>>>> Not sure if you noticed, in my last email, I was proposing
> >> `CacheHandle
> >>>>>> cache()` to avoid the potential side effect due to function calls.
> >>>>>>
> >>>>>> Let's look at the disagreement in your reply one by one.
> >>>>>>
> >>>>>>
> >>>>>> 1. Optimization chances
> >>>>>>
> >>>>>> Optimization is never trivial work. This is exactly why we should not
> >>>>>> let users do that manually. Databases have done a huge amount of work
> >>>>>> in this area. At Alibaba, we rely heavily on many optimization rules to
> >>>>>> boost the SQL query performance.
> >>>>>>
> >>>>>> In your example, if I fill in the filter conditions in a certain way,
> >>>>>> the optimization would become obvious.
> >>>>>>
> >>>>>> Table src1 = … // read from connector 1
> >>>>>> Table src2 = … // read from connector 2
> >>>>>>
> >>>>>> Table a = src1.filter('f1 > 10).join(src2.filter('f2 < 30), 'f1 ===
> >>>>>> 'f2).as('f3, ...)
> >>>>>> a.cache() // write cache to connector 3; when writing the records,
> >>>>>> remember the min and max of 'f1
> >>>>>>
> >>>>>> a.filter('f3 > 30) // There is no need to read from any connector
> >>>>> because
> >>>>>> `a` does not contain any record whose 'f3 is greater than 30.
> >>>>>> env.execute()
> >>>>>> a.select(…)
> >>>>>>
> >>>>>> BTW, it seems to me that adding some basic statistics is fairly
> >>>>>> straightforward and the cost is pretty marginal, if not negligible. In
> >>>>>> fact it is not only needed for optimization, but also for cases such as
> >>>>>> ML, where some algorithms may need to decide their parameters based on
> >>>>>> the statistics of the data.
> >>>>>>
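> >>>>>> As a hedged illustration of the kind of pruning such min/max statistics
> >>>>>> enable (not Flink code; all names are made up):
> >>>>>>
> >>>>>> final class MinMaxPruning {
> >>>>>>   // Per-column statistics remembered while writing the cache.
> >>>>>>   static final class ColumnStats {
> >>>>>>     final long min;
> >>>>>>     final long max;
> >>>>>>     ColumnStats(long min, long max) { this.min = min; this.max = max; }
> >>>>>>   }
> >>>>>>
> >>>>>>   // True if the predicate "column > threshold" cannot match any cached
> >>>>>>   // row, so the scan of the cached table can be skipped entirely.
> >>>>>>   static boolean canPruneGreaterThan(ColumnStats stats, long threshold) {
> >>>>>>     return stats.max <= threshold;
> >>>>>>   }
> >>>>>> }
> >>>>>>
> >>>>>> In the example above, every record written to the cache satisfies the
> >>>>>> join filters, so the remembered max makes the later filter provably
> >>>>>> empty without reading from any connector.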
> >>>>>>
> >>>>>> 2. Same API, one semantic now, another semantic later.
> >>>>>>
> >>>>>> I am trying to understand what is the semantic of `CachedTable
> >> cache()`
> >>>>> you
> >>>>>> are proposing. IMO, we should avoid designing an API whose semantic
> >>>>> will be
> >>>>>> changed later. If we have a "CachedTable cache()" method, then the
> >>>>> semantic
> >>>>>> should be very clearly defined upfront and not change later. It should
> >>>>>> never be "right now let's go with semantic 1, later we can silently
> >>>>>> change it to semantic 2 or 3". Such a change could result in bad
> >>>>>> consequences. For example, let's say we decide to go with semantic 1:
> >>>>>>
> >>>>>> CachedTable cachedA = a.cache()
> >>>>>> cachedA.foo() // Cache is used
> >>>>>> a.bar() // Original DAG is used.
> >>>>>>
> >>>>>> Now majority of the users would be using cachedA.foo() in their
> code.
> >>>>> And
> >>>>>> some advanced users will use a.bar() to explicitly skip the cache.
> >> Later
> >>>>>> on, we added smart optimization and change the semantic to semantic
> 2:
> >>>>>>
> >>>>>> CachedTable cachedA = a.cache()
> >>>>>> cachedA.foo() // Cache is used
> >>>>>> a.bar() // Cache MIGHT be used, and Flink may decide to skip cache
> if
> >>>>> it is
> >>>>>> faster.
> >>>>>>
> >>>>>> Now most of the users who were writing cachedA.foo() will not
> benefit
> >>>>> from
> >>>>>> this optimization at all, unless they change their code to use
> a.foo()
> >>>>>> instead. And those advanced users suddenly lose the option to
> >> explicitly
> >>>>>> ignore cache unless they change their code (assuming we care enough
> to
> >>>>>> provide something like hint(useCache)). If we don't define the
> >> semantic
> >>>>>> carefully, our users will have to change their code again and again
> >>>>> while
> >>>>>> they shouldn't have to.
> >>>>>>
> >>>>>>
> >>>>>> 3. side effect.
> >>>>>>
> >>>>>> Before we talk about side effect, we have to agree on the
> assumptions.
> >>>>> The
> >>>>>> assumptions I have are following:
> >>>>>> 1. We are talking about batch processing.
> >>>>>> 2. The source tables are immutable during one run of batch
> processing
> >>>>> logic.
> >>>>>> 3. The cache is immutable during one run of batch processing logic.
> >>>>>>
> >>>>>> I think assumption 2 and 3 are by definition what batch processing
> >>>>> means,
> >>>>>> i.e the data must be complete before it is processed and should not
> >>>>> change
> >>>>>> when the processing is running.
> >>>>>>
> >>>>>> As far as I am aware, I don't know of any batch processing system
> >>>>>> breaking those assumptions. Even for relational database tables, where
> >>>>>> queries can run with concurrent modifications, the necessary locking is
> >>>>>> still required to ensure the integrity of the query result.
> >>>>>>
> >>>>>> Please let me know if you disagree with the above assumptions. If
> you
> >>>>> agree
> >>>>>> with these assumptions, with the `CacheHandle cache()` API in my
> last
> >>>>>> email, do you still see side effects?
> >>>>>>
> >>>>>> Thanks,
> >>>>>>
> >>>>>> Jiangjie (Becket) Qin
> >>>>>>
> >>>>>>
> >>>>>> On Wed, Dec 12, 2018 at 7:11 PM Piotr Nowojski <
> >> piotr@data-artisans.com
> >>>>>>
> >>>>>> wrote:
> >>>>>>
> >>>>>>> Hi Becket,
> >>>>>>>
> >>>>>>>> Regarding the chance of optimization, it might not be that rare.
> >> Some
> >>>>>>> very
> >>>>>>>> simple statistics could already help in many cases. For example,
> >>>>> simply
> >>>>>>>> maintaining max and min of each fields can already eliminate some
> >>>>>>>> unnecessary table scan (potentially scanning the cached table) if
> >> the
> >>>>>>>> result is doomed to be empty. A histogram would give even further
> >>>>>>>> information. The optimizer could be very careful and only ignores
> >>>>> cache
> >>>>>>>> when it is 100% sure doing that is cheaper. e.g. only when a
> filter
> >> on
> >>>>>>> the
> >>>>>>>> cache will absolutely return nothing.
> >>>>>>>
> >>>>>>> I do not see how this might be easy to achieve. It would require
> tons
> >>>>> of
> >>>>>>> effort to make it work and in the end you would still have a
> problem
> >> of
> >>>>>>> comparing/trading CPU cycles vs IO. For example:
> >>>>>>>
> >>>>>>> Table src1 = … // read from connector 1
> >>>>>>> Table src2 = … // read from connector 2
> >>>>>>>
> >>>>>>> Table a = src1.filter(…).join(src2.filter(…), …)
> >>>>>>> a.cache() // write cache to connector 3
> >>>>>>>
> >>>>>>> a.filter(…)
> >>>>>>> env.execute()
> >>>>>>> a.select(…)
> >>>>>>>
> >>>>>>> Decision whether it’s better to:
> >>>>>>> A) read from connector1/connector2, filter/map and join them twice
> >>>>>>> B) read from connector1/connector2, filter/map and join them once,
> >> pay
> >>>>> the
> >>>>>>> price of writing to connector 3 and then reading from it
> >>>>>>>
> >>>>>>> Is very far from trivial. `a` can end up much larger than `src1`
> and
> >>>>>>> `src2`, writes to connector 3 might be extremely slow, reads from
> >>>>> connector
> >>>>>>> 3 can be slower compared to reads from connectors 1 & 2, … . You
> >>>>>>> really need to have extremely good statistics to correctly assess the
> >>>>>>> size of the output, and it would still fail many times (correlations
> >>>>>>> etc.). And keep in mind that at the moment we do not have ANY
> >>>>>>> statistics at all. More than that, it would require significantly more
> >>>>>>> testing and setting up some benchmarks to make sure that we do not
> >>>>>>> break it with some regressions.
> >>>>>>>
> >>>>>>> That’s why I’m strongly opposing this idea - at least let’s not start
> >>>>>>> with this. If we first start with completely manual/explicit
> caching,
> >>>>>>> without any magic, it would be a significant improvement for the
> >> users
> >>>>> for
> >>>>>>> a fraction of the development cost. After implementing that, when
> we
> >>>>>>> already have all of the working pieces, we can start working on
> some
> >>>>>>> optimisations rules. As I wrote before, if we start with
> >>>>>>>
> >>>>>>> `CachedTable cache()`
> >>>>>>>
> >>>>>>> We can later work on follow-up stories to make it automatic. Although
> >>>>>>> I don’t like this implicit/side-effect approach with a `void` method,
> >>>>>>> having an explicit `CachedTable cache()` wouldn’t even prevent us from
> >>>>>>> later adding a `void hintCache()` method, with the exact semantic that
> >>>>>>> you want.
> >>>>>>>
> >>>>>>> On top of that, I re-raise the point that having an implicit `void
> >>>>>>> cache()/hintCache()` has other side effects and problems with
> >>>>>>> non-immutable data, and is annoying when used secretly inside methods.
> >>>>>>>
> >>>>>>> An explicit `CachedTable cache()` just looks like a much less
> >>>>>>> controversial MVP, and if we decide to go further with this topic,
> >>>>>>> it’s not a wasted effort, but just lies on a straight path to more
> >>>>>>> advanced/complicated solutions in the future. Are there any drawbacks
> >>>>>>> of starting with `CachedTable cache()` that I’m missing?
> >>>>>>>
> >>>>>>> Piotrek
> >>>>>>>
> >>>>>>>> On 12 Dec 2018, at 09:30, Jeff Zhang <zj...@gmail.com> wrote:
> >>>>>>>>
> >>>>>>>> Hi Becket,
> >>>>>>>>
> >>>>>>>> Introducing CacheHandle seems too complicated. That means users have
> >>>>>>>> to maintain the handle properly.
> >>>>>>>>
> >>>>>>>> And since cache is just a hint for the optimizer, why not just return
> >>>>>>>> the Table itself from the cache method? This hint info should be kept
> >>>>>>>> in the Table, I believe.
> >>>>>>>>
> >>>>>>>> So how about adding the methods cache and uncache to Table, both
> >>>>>>>> returning Table? What cache and uncache do is just add some hint info
> >>>>>>>> to the Table.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> Becket Qin <be...@gmail.com> wrote on Wed, Dec 12, 2018 at 11:25 AM:
> >>>>>>>>
> >>>>>>>>> Hi Till and Piotrek,
> >>>>>>>>>
> >>>>>>>>> Thanks for the clarification. That solves quite a few confusion.
> My
> >>>>>>>>> understanding of how cache works is same as what Till describe.
> >> i.e.
> >>>>>>>>> cache() is a hint to Flink, but it is not guaranteed that cache
> >>>>> always
> >>>>>>>>> exist and it might be recomputed from its lineage.
> >>>>>>>>>
> >>>>>>>>> Is this the core of our disagreement here? That you would like
> this
> >>>>>>>>>> “cache()” to be mostly hint for the optimiser?
> >>>>>>>>>
> >>>>>>>>> Semantic wise, yes. That's also why I think materialize() has a
> >> much
> >>>>>>> larger
> >>>>>>>>> scope than cache(), thus it should be a different method.
> >>>>>>>>>
> >>>>>>>>> Regarding the chance of optimization, it might not be that rare.
> >> Some
> >>>>>>> very
> >>>>>>>>> simple statistics could already help in many cases. For example,
> >>>>> simply
> >>>>>>>>> maintaining max and min of each fields can already eliminate some
> >>>>>>>>> unnecessary table scan (potentially scanning the cached table) if
> >> the
> >>>>>>>>> result is doomed to be empty. A histogram would give even further
> >>>>>>>>> information. The optimizer could be very careful and only ignores
> >>>>> cache
> >>>>>>>>> when it is 100% sure doing that is cheaper. e.g. only when a
> filter
> >>>>> on
> >>>>>>> the
> >>>>>>>>> cache will absolutely return nothing.
> >>>>>>>>>
> >>>>>>>>> Given the above clarification on cache, I would like to revisit
> the
> >>>>>>>>> original "void cache()" proposal and see if we can improve on top
> >> of
> >>>>>>> that.
> >>>>>>>>>
> >>>>>>>>> What do you think about the following modified interface?
> >>>>>>>>>
> >>>>>>>>> Table {
> >>>>>>>>>   /**
> >>>>>>>>>    * This call hints Flink to maintain a cache of this table and
> >>>>>>>>>    * leverage it for performance optimization if needed. Note that
> >>>>>>>>>    * Flink may still decide not to use the cache if that is cheaper.
> >>>>>>>>>    *
> >>>>>>>>>    * A CacheHandle will be returned to allow the user to release the
> >>>>>>>>>    * cache actively. The cache will be deleted if there are no
> >>>>>>>>>    * unreleased cache handles to it. When the TableEnvironment is
> >>>>>>>>>    * closed, the cache will also be deleted and all the cache handles
> >>>>>>>>>    * will be released.
> >>>>>>>>>    *
> >>>>>>>>>    * @return a CacheHandle referring to the cache of this table.
> >>>>>>>>>    */
> >>>>>>>>>   CacheHandle cache();
> >>>>>>>>> }
> >>>>>>>>>
> >>>>>>>>> CacheHandle {
> >>>>>>>>>   /**
> >>>>>>>>>    * Close the cache handle. This method does not necessarily delete
> >>>>>>>>>    * the cache. Instead, it simply decrements the reference counter
> >>>>>>>>>    * to the cache. When there is no handle referring to a cache, the
> >>>>>>>>>    * cache will be deleted.
> >>>>>>>>>    *
> >>>>>>>>>    * @return the number of open handles to the cache after this
> >>>>>>>>>    * handle has been released.
> >>>>>>>>>    */
> >>>>>>>>>   int release();
> >>>>>>>>> }
> >>>>>>>>>
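> >>>>>>>>> For illustration only, the reference counting behind release() could
> >>>>>>>>> look roughly like this (a sketch; class and method names are made
> >>>>>>>>> up):
> >>>>>>>>>
> >>>>>>>>> import java.util.concurrent.atomic.AtomicInteger;
> >>>>>>>>>
> >>>>>>>>> // Sketch: all handles of one cached table share a single counter;
> >>>>>>>>> // the cached data is dropped when the last handle is released.
> >>>>>>>>> class RefCountedCache {
> >>>>>>>>>   private final AtomicInteger openHandles = new AtomicInteger(0);
> >>>>>>>>>
> >>>>>>>>>   class Handle {
> >>>>>>>>>     int release() {
> >>>>>>>>>       int remaining = openHandles.decrementAndGet();
> >>>>>>>>>       if (remaining == 0) {
> >>>>>>>>>         deleteCachedData(); // physically drop the cached table
> >>>>>>>>>       }
> >>>>>>>>>       return remaining;
> >>>>>>>>>     }
> >>>>>>>>>   }
> >>>>>>>>>
> >>>>>>>>>   Handle newHandle() {
> >>>>>>>>>     openHandles.incrementAndGet();
> >>>>>>>>>     return new Handle();
> >>>>>>>>>   }
> >>>>>>>>>
> >>>>>>>>>   private void deleteCachedData() { /* drop data in the cache service */ }
> >>>>>>>>> }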
> >>>>>>>>> The rationale behind this interface is the following:
> >>>>>>>>> In the vast majority of cases, users wouldn't really care whether
> >>>>>>>>> the cache is used or not. So I think the most intuitive way is
> >>>>>>>>> letting cache() return nothing, so nobody needs to worry about the
> >>>>>>>>> difference between operations on CachedTables and those on the
> >>>>>>>>> "original" tables. This will make maybe 99.9% of the users happy.
> >>>>>>>>> There were two concerns raised for this approach:
> >>>>>>>>> 1. In some rare cases, users may want to ignore the cache.
> >>>>>>>>> 2. A table might be cached/uncached in a third-party function while
> >>>>>>>>> the caller does not know.
> >>>>>>>>>
> >>>>>>>>> For the first issue, users can use hint("ignoreCache") to explicitly
> >>>>>>>>> ignore the cache.
> >>>>>>>>> For the second issue, the above proposal lets cache() return a
> >>>>>>>>> CacheHandle whose only method is release(). Different CacheHandles
> >>>>>>>>> will refer to the same cache; if a cache no longer has any cache
> >>>>>>>>> handle, it will be deleted. This will address the following case:
> >>>>>>>>> {
> >>>>>>>>> val handle1 = a.cache()
> >>>>>>>>> process(a)
> >>>>>>>>> a.select(...) // cache is still available, handle1 has not been
> >>>>>>> released.
> >>>>>>>>> }
> >>>>>>>>>
> >>>>>>>>> void process(Table t) {
> >>>>>>>>> val handle2 = t.cache() // new handle to cache
> >>>>>>>>> t.select(...) // optimizer decides cache usage
> >>>>>>>>> t.hint("ignoreCache").select(...) // cache is ignored
> >>>>>>>>> handle2.release() // release the handle, but the cache may still
> be
> >>>>>>>>> available if there are other handles
> >>>>>>>>> ...
> >>>>>>>>> }
> >>>>>>>>>
> >>>>>>>>> Does the above modified approach look reasonable to you?
> >>>>>>>>>
> >>>>>>>>> Cheers,
> >>>>>>>>>
> >>>>>>>>> Jiangjie (Becket) Qin
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> On Tue, Dec 11, 2018 at 6:44 PM Till Rohrmann <
> >> trohrmann@apache.org>
> >>>>>>>>> wrote:
> >>>>>>>>>
> >>>>>>>>>> Hi Becket,
> >>>>>>>>>>
> >>>>>>>>>> I was aiming at semantics similar to 1. I actually thought that
> >>>>>>> `cache()`
> >>>>>>>>>> would tell the system to materialize the intermediate result so
> >> that
> >>>>>>>>>> subsequent queries don't need to reprocess it. This means that
> the
> >>>>>>> usage
> >>>>>>>>> of
> >>>>>>>>>> the cached table in this example
> >>>>>>>>>>
> >>>>>>>>>> {
> >>>>>>>>>> val cachedTable = a.cache()
> >>>>>>>>>> val b1 = cachedTable.select(…)
> >>>>>>>>>> val b2 = cachedTable.foo().select(…)
> >>>>>>>>>> val b3 = cachedTable.bar().select(...)
> >>>>>>>>>> val c1 = a.select(…)
> >>>>>>>>>> val c2 = a.foo().select(…)
> >>>>>>>>>> val c3 = a.bar().select(...)
> >>>>>>>>>> }
> >>>>>>>>>>
> >>>>>>>>>> strongly depends on interleaved calls which trigger the
> execution
> >> of
> >>>>>>> sub
> >>>>>>>>>> queries. So for example, if there is only a single env.execute
> >> call
> >>>>> at
> >>>>>>>>> the
> >>>>>>>>>> end of  block, then b1, b2, b3, c1, c2 and c3 would all be
> >> computed
> >>>>> by
> >>>>>>>>>> reading directly from the sources (given that there is only a
> >> single
> >>>>>>>>>> JobGraph). It just happens that the result of `a` will be cached
> >>>>> such
> >>>>>>>>> that
> >>>>>>>>>> we skip the processing of `a` when there are subsequent queries
> >>>>> reading
> >>>>>>>>>> from `cachedTable`. If for some reason the system cannot materialize
> >>>>>>>>>> the table (e.g. running out of disk space, TTL expired), then it
> >>>>>>>>>> could also happen that we need to reprocess `a`. In that sense
> >>>>>>>>>> `cachedTable` simply is an identifier for the materialized result of
> >>>>>>>>>> `a`, together with the lineage of how to reprocess it, as the sketch
> >>>>>>>>>> below illustrates.
> >>>>>>>>>>
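> >>>>>>>>>> A rough sketch of that fallback logic (illustrative only; the types
> >>>>>>>>>> and method names are made up):
> >>>>>>>>>>
> >>>>>>>>>> // Reading a cached table serves the materialized result if present,
> >>>>>>>>>> // and otherwise recomputes it from the recorded lineage.
> >>>>>>>>>> Table read(CachedTable cachedTable) {
> >>>>>>>>>>   if (storage.contains(cachedTable.id())) {
> >>>>>>>>>>     return storage.scan(cachedTable.id());    // serve from cache
> >>>>>>>>>>   } else {
> >>>>>>>>>>     return cachedTable.lineage().recompute(); // re-run the sub-plan
> >>>>>>>>>>   }
> >>>>>>>>>> }
> >>>>>>>>>>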
> >>>>>>>>>> Cheers,
> >>>>>>>>>> Till
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> On Tue, Dec 11, 2018 at 11:01 AM Piotr Nowojski <
> >>>>>>> piotr@data-artisans.com
> >>>>>>>>>>
> >>>>>>>>>> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> Hi Becket,
> >>>>>>>>>>>
> >>>>>>>>>>>> {
> >>>>>>>>>>>> val cachedTable = a.cache()
> >>>>>>>>>>>> val b = cachedTable.select(...)
> >>>>>>>>>>>> val c = a.select(...)
> >>>>>>>>>>>> }
> >>>>>>>>>>>>
> >>>>>>>>>>>> Semantic 1. b uses cachedTable as user demanded so. c uses
> >>>>> original
> >>>>>>>>> DAG
> >>>>>>>>>>> as
> >>>>>>>>>>>> user demanded so. In this case, the optimizer has no chance to
> >>>>>>>>>> optimize.
> >>>>>>>>>>>> Semantic 2. b uses cachedTable as user demanded so. c leaves
> the
> >>>>>>>>>>> optimizer
> >>>>>>>>>>>> to choose whether the cache or DAG should be used. In this
> case,
> >>>>> user
> >>>>>>>>>>> lose
> >>>>>>>>>>>> the option to NOT use cache.
> >>>>>>>>>>>>
> >>>>>>>>>>>> As you can see, neither of the options seem perfect. However,
> I
> >>>>> guess
> >>>>>>>>>> you
> >>>>>>>>>>>> and Till are proposing the third option:
> >>>>>>>>>>>>
> >>>>>>>>>>>> Semantic 3. b leaves the optimizer to choose whether cache or
> >> DAG
> >>>>>>>>>> should
> >>>>>>>>>>> be
> >>>>>>>>>>>> used. c always use the DAG.
> >>>>>>>>>>>
> >>>>>>>>>>> I am pretty sure that me, Till, Fabian and others were all
> >>>>> proposing
> >>>>>>>>> and
> >>>>>>>>>>> advocating in favour of semantic “1”. No cost based optimiser
> >>>>>>> decisions
> >>>>>>>>>> at
> >>>>>>>>>>> all.
> >>>>>>>>>>>
> >>>>>>>>>>> {
> >>>>>>>>>>> val cachedTable = a.cache()
> >>>>>>>>>>> val b1 = cachedTable.select(…)
> >>>>>>>>>>> val b2 = cachedTable.foo().select(…)
> >>>>>>>>>>> val b3 = cachedTable.bar().select(...)
> >>>>>>>>>>> val c1 = a.select(…)
> >>>>>>>>>>> val c2 = a.foo().select(…)
> >>>>>>>>>>> val c3 = a.bar().select(...)
> >>>>>>>>>>> }
> >>>>>>>>>>>
> >>>>>>>>>>> All b1, b2 and b3 are reading from cache, while c1, c2 and c3
> are
> >>>>>>>>>>> re-executing whole plan for “a”.
> >>>>>>>>>>>
> >>>>>>>>>>> In the future we could discuss going one step further,
> >> introducing
> >>>>>>> some
> >>>>>>>>>>> global optimisation (that can be manually enabled/disabled):
> >>>>>>>>> deduplicate
> >>>>>>>>>>> plan nodes/deduplicate sub queries/re-use sub queries
> results/or
> >>>>>>>>> whatever
> >>>>>>>>>>> we could call it. It could do two things:
> >>>>>>>>>>>
> >>>>>>>>>>> 1. Automatically try to deduplicate fragments of the plan and
> >> share
> >>>>>>> the
> >>>>>>>>>>> result using CachedTable - in other words automatically insert
> >>>>>>>>>> `CachedTable
> >>>>>>>>>>> cache()` calls.
> >>>>>>>>>>> 2. Automatically make decision to bypass explicit `CachedTable`
> >>>>> access
> >>>>>>>>>>> (this would be the equivalent of what you described as
> “semantic
> >>>>> 3”).
> >>>>>>>>>>>
> >>>>>>>>>>> However as I wrote previously, I have big doubts if such
> >> cost-based
> >>>>>>>>>>> optimisation would work (this applies also to “Semantic 2”). I
> >>>>> would
> >>>>>>>>>> expect
> >>>>>>>>>>> it to do more harm than good in so many cases, that it wouldn’t
> >>>>> make
> >>>>>>>>>> sense.
> >>>>>>>>>>> Even assuming that we calculate statistics perfectly (this
> ain’t
> >>>>> gonna
> >>>>>>>>>>> happen), it’s virtually impossible to correctly estimate
> correct
> >>>>>>>>> exchange
> >>>>>>>>>>> rate of CPU cycles vs IO operations as it is changing so much
> >> from
> >>>>>>>>>>> deployment to deployment.
> >>>>>>>>>>>
> >>>>>>>>>>> Is this the core of our disagreement here? That you would like
> >> this
> >>>>>>>>>>> “cache()” to be mostly hint for the optimiser?
> >>>>>>>>>>>
> >>>>>>>>>>> Piotrek
> >>>>>>>>>>>
> >>>>>>>>>>>> On 11 Dec 2018, at 06:00, Becket Qin <be...@gmail.com>
> >>>>> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>> Another potential concern for semantic 3 is that. In the
> future,
> >>>>> we
> >>>>>>>>> may
> >>>>>>>>>>> add
> >>>>>>>>>>>> automatic caching to Flink. e.g. cache the intermediate
> results
> >> at
> >>>>>>>>> the
> >>>>>>>>>>>> shuffle boundary. If our semantic is that reference to the
> >>>>> original
> >>>>>>>>>> table
> >>>>>>>>>>>> means skipping cache, those users may not be able to benefit
> >> from
> >>>>> the
> >>>>>>>>>>>> implicit cache.
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Tue, Dec 11, 2018 at 12:10 PM Becket Qin <
> >> becket.qin@gmail.com
> >>>>>>
> >>>>>>>>>>> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>>> Hi Piotrek,
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Thanks for the reply. Thought about it again, I might have
> >>>>>>>>>> misunderstood
> >>>>>>>>>>>>> your proposal in earlier emails. Returning a CachedTable
> might
> >>>>> not
> >>>>>>>>> be
> >>>>>>>>>> a
> >>>>>>>>>>> bad
> >>>>>>>>>>>>> idea.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> I was more concerned about the semantics and their intuitiveness
> >>>>>>>>>>>>> when a CachedTable is returned, i.e., if cache() returns a
> >>>>>>>>>>>>> CachedTable, what are the semantics of the following code:
> >>>>>>>>>>>>> {
> >>>>>>>>>>>>> val cachedTable = a.cache()
> >>>>>>>>>>>>> val b = cachedTable.select(...)
> >>>>>>>>>>>>> val c = a.select(...)
> >>>>>>>>>>>>> }
> >>>>>>>>>>>>> What is the difference between b and c? At the first glance,
> I
> >>>>> see
> >>>>>>>>> two
> >>>>>>>>>>>>> options:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Semantic 1. b uses cachedTable as user demanded so. c uses
> >>>>> original
> >>>>>>>>>> DAG
> >>>>>>>>>>> as
> >>>>>>>>>>>>> user demanded so. In this case, the optimizer has no chance
> to
> >>>>>>>>>> optimize.
> >>>>>>>>>>>>> Semantic 2. b uses cachedTable as user demanded so. c leaves
> >> the
> >>>>>>>>>>> optimizer
> >>>>>>>>>>>>> to choose whether the cache or DAG should be used. In this
> >> case,
> >>>>>>>>> user
> >>>>>>>>>>> lose
> >>>>>>>>>>>>> the option to NOT use cache.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> As you can see, neither of the options seem perfect.
> However, I
> >>>>>>>>> guess
> >>>>>>>>>>> you
> >>>>>>>>>>>>> and Till are proposing the third option:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Semantic 3. b leaves the optimizer to choose whether cache or
> >> DAG
> >>>>>>>>>> should
> >>>>>>>>>>>>> be used. c always use the DAG.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> This does address all the concerns. It is just that, from an
> >>>>>>>>>>>>> intuitiveness perspective, I find it a little weird to ask users
> >>>>>>>>>>>>> to explicitly use a CachedTable that the optimizer might choose
> >>>>>>>>>>>>> to ignore. That was why I did not think about that semantic. But
> >>>>>>>>>>>>> given there is material benefit, I think this semantic is
> >>>>>>>>>>>>> acceptable.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> 1. If we want to let optimiser make decisions whether to use
> >>>>> cache
> >>>>>>>>> or
> >>>>>>>>>>> not,
> >>>>>>>>>>>>>> then why do we need “void cache()” method at all? Would It
> >>>>>>>>>> “increase”
> >>>>>>>>>>> the
> >>>>>>>>>>>>>> chance of using the cache? That’s sounds strange. What would
> >> be
> >>>>> the
> >>>>>>>>>>>>>> mechanism of deciding whether to use the cache or not? If we
> >>>>> want
> >>>>>>>>> to
> >>>>>>>>>>>>>> introduce such kind  automated optimisations of “plan nodes
> >>>>>>>>>>> deduplication”
> >>>>>>>>>>>>>> I would turn it on globally, not per table, and let the
> >>>>> optimiser
> >>>>>>>>> do
> >>>>>>>>>>> all of
> >>>>>>>>>>>>>> the work.
> >>>>>>>>>>>>>> 2. We do not have statistics at the moment for any use/not
> use
> >>>>>>>>> cache
> >>>>>>>>>>>>>> decision.
> >>>>>>>>>>>>>> 3. Even if we had, I would be veeerryy sceptical whether
> such
> >>>>> cost
> >>>>>>>>>>> based
> >>>>>>>>>>>>>> optimisations would work properly and I would still insist
> >>>>> first on
> >>>>>>>>>>>>>> providing explicit caching mechanism (`CachedTable cache()`)
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>> We are absolutely on the same page here. An explicit cache()
> >>>>> method
> >>>>>>>>> is
> >>>>>>>>>>>>> necessary not only because optimizer may not be able to make
> >> the
> >>>>>>>>> right
> >>>>>>>>>>>>> decision, but also because of the nature of interactive
> >>>>> programming.
> >>>>>>>>>> For
> >>>>>>>>>>>>> example, if users write the following code in Scala shell:
> >>>>>>>>>>>>> val b = a.select(...)
> >>>>>>>>>>>>> val c = b.select(...)
> >>>>>>>>>>>>> val d = c.select(...).writeToSink(...)
> >>>>>>>>>>>>> tEnv.execute()
> >>>>>>>>>>>>> There is no way the optimizer will know whether b or c will be
> >>>>>>>>>>>>> used in later code, unless users hint explicitly.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> At the same time I’m not sure if you have responded to our
> >>>>>>>>> objections
> >>>>>>>>>> of
> >>>>>>>>>>>>>> `void cache()` being implicit/having side effects, which me,
> >>>>> Jark,
> >>>>>>>>>>> Fabian,
> >>>>>>>>>>>>>> Till and I think also Shaoxuan are supporting.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Is there any other side effects if we use semantic 3
> mentioned
> >>>>>>>>> above?
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> JIangjie (Becket) Qin
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> On Mon, Dec 10, 2018 at 7:54 PM Piotr Nowojski <
> >>>>>>>>>> piotr@data-artisans.com
> >>>>>>>>>>>>
> >>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> Hi Becket,
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Sorry for not responding long time.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Regarding case1.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> There wouldn’t be an “a.unCache()” method, but I would expect
> >>>>>>>>>>>>>> only `cachedTableA1.dropCache()`. Dropping `cachedTableA1`
> >>>>>>>>>>>>>> wouldn’t affect `cachedTableA2`. Just as in any other database,
> >>>>>>>>>>>>>> dropping or modifying one independent table/materialised view
> >>>>>>>>>>>>>> does not affect others.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> What I meant is that assuming there is already a cached
> >> table,
> >>>>>>>>>> ideally
> >>>>>>>>>>>>>> users need
> >>>>>>>>>>>>>>> not to specify whether the next query should read from the
> >>>>> cache
> >>>>>>>>> or
> >>>>>>>>>>> use
> >>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>> original DAG. This should be decided by the optimizer.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> 1. If we want to let optimiser make decisions whether to use
> >>>>> cache
> >>>>>>>>> or
> >>>>>>>>>>>>>> not, then why do we need “void cache()” method at all? Would
> >> It
> >>>>>>>>>>> “increase”
> >>>>>>>>>>>>>> the chance of using the cache? That’s sounds strange. What
> >>>>> would be
> >>>>>>>>>> the
> >>>>>>>>>>>>>> mechanism of deciding whether to use the cache or not? If we
> >>>>> want
> >>>>>>>>> to
> >>>>>>>>>>>>>> introduce such kind  automated optimisations of “plan nodes
> >>>>>>>>>>> deduplication”
> >>>>>>>>>>>>>> I would turn it on globally, not per table, and let the
> >>>>> optimiser
> >>>>>>>>> do
> >>>>>>>>>>> all of
> >>>>>>>>>>>>>> the work.
> >>>>>>>>>>>>>> 2. We do not have statistics at the moment for any use/not
> use
> >>>>>>>>> cache
> >>>>>>>>>>>>>> decision.
> >>>>>>>>>>>>>> 3. Even if we had, I would be veeerryy sceptical whether
> such
> >>>>> cost
> >>>>>>>>>>> based
> >>>>>>>>>>>>>> optimisations would work properly and I would still insist
> >>>>> first on
> >>>>>>>>>>>>>> providing explicit caching mechanism (`CachedTable cache()`)
> >>>>>>>>>>>>>> 4. As Till wrote, having explicit `CachedTable cache()`
> >> doesn’t
> >>>>>>>>>>>>>> contradict future work on automated cost based caching.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> At the same time I’m not sure if you have responded to our
> >>>>>>>>> objections
> >>>>>>>>>>> of
> >>>>>>>>>>>>>> `void cache()` being implicit/having side effects, which me,
> >>>>> Jark,
> >>>>>>>>>>> Fabian,
> >>>>>>>>>>>>>> Till and I think also Shaoxuan are supporting.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Piotrek
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> On 5 Dec 2018, at 12:42, Becket Qin <be...@gmail.com>
> >>>>> wrote:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Hi Till,
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> It is true that after the first job submission, there will
> be
> >>>>> no
> >>>>>>>>>>>>>> ambiguity
> >>>>>>>>>>>>>>> in terms of whether a cached table is used or not. That is
> >> the
> >>>>>>>>> same
> >>>>>>>>>>> for
> >>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>> cache() without returning a CachedTable.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Conceptually one could think of cache() as introducing a
> >>>>> caching
> >>>>>>>>>>>>>> operator
> >>>>>>>>>>>>>>>> from which you need to consume from if you want to benefit
> >>>>> from
> >>>>>>>>> the
> >>>>>>>>>>>>>> caching
> >>>>>>>>>>>>>>>> functionality.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> I am thinking a little differently. I think it is a hint
> (as
> >>>>> you
> >>>>>>>>>>>>>> mentioned
> >>>>>>>>>>>>>>> later) instead of a new operator. I'd like to be careful
> >> about
> >>>>> the
> >>>>>>>>>>>>>> semantic
> >>>>>>>>>>>>>>> of the API. A hint is a property set on an existing
> operator,
> >>>>> but
> >>>>>>>>> is
> >>>>>>>>>>> not
> >>>>>>>>>>>>>>> itself an operator as it does not really manipulate the
> data.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> I agree, ideally the optimizer makes this kind of decision
> >>>>> which
> >>>>>>>>>>>>>>>> intermediate result should be cached. But especially when
> >>>>>>>>> executing
> >>>>>>>>>>>>>> ad-hoc
> >>>>>>>>>>>>>>>> queries the user might better know which results need to
> be
> >>>>>>>>> cached
> >>>>>>>>>>>>>> because
> >>>>>>>>>>>>>>>> Flink might not see the full DAG. In that sense, I would
> >>>>> consider
> >>>>>>>>>> the
> >>>>>>>>>>>>>>>> cache() method as a hint for the optimizer. Of course, in
> >> the
> >>>>>>>>>> future
> >>>>>>>>>>> we
> >>>>>>>>>>>>>>>> might add functionality which tries to automatically cache
> >>>>>>>>> results
> >>>>>>>>>>>>>> (e.g.
> >>>>>>>>>>>>>>>> caching the latest intermediate results until so and so
> much
> >>>>>>>>> space
> >>>>>>>>>> is
> >>>>>>>>>>>>>>>> used). But this should hopefully not contradict with
> >>>>> `CachedTable
> >>>>>>>>>>>>>> cache()`.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> I agree that cache() method is needed for exactly the
> reason
> >>>>> you
> >>>>>>>>>>>>>> mentioned,
> >>>>>>>>>>>>>>> i.e. Flink cannot predict what users are going to write
> >> later,
> >>>>> so
> >>>>>>>>>>> users
> >>>>>>>>>>>>>>> need to tell Flink explicitly that this table will be used
> >>>>> later.
> >>>>>>>>>>> What I
> >>>>>>>>>>>>>>> meant is that assuming there is already a cached table,
> >> ideally
> >>>>>>>>>> users
> >>>>>>>>>>>>>> need
> >>>>>>>>>>>>>>> not to specify whether the next query should read from the
> >>>>> cache
> >>>>>>>>> or
> >>>>>>>>>>> use
> >>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>> original DAG. This should be decided by the optimizer.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> To explain the difference between returning / not
> returning a
> >>>>>>>>>>>>>> CachedTable,
> >>>>>>>>>>>>>>> I want compare the following two case:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> *Case 1:  returning a CachedTable*
> >>>>>>>>>>>>>>> b = a.map(...)
> >>>>>>>>>>>>>>> val cachedTableA1 = a.cache()
> >>>>>>>>>>>>>>> val cachedTableA2 = a.cache()
> >>>>>>>>>>>>>>> b.print() // Just to make sure a is cached.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> c = a.filter(...) // User specify that the original DAG is
> >>>>> used?
> >>>>>>>>> Or
> >>>>>>>>>>> the
> >>>>>>>>>>>>>>> optimizer decides whether DAG or cache should be used?
> >>>>>>>>>>>>>>> d = cachedTableA1.filter() // User specify that the cached
> >>>>> table
> >>>>>>>>> is
> >>>>>>>>>>>>>> used.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> a.unCache() // Can cachedTableA still be used afterwards?
> >>>>>>>>>>>>>>> cachedTableA1.uncache() // Can cachedTableA2 still be used?
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> *Case 2: not returning a CachedTable*
> >>>>>>>>>>>>>>> b = a.map()
> >>>>>>>>>>>>>>> a.cache()
> >>>>>>>>>>>>>>> a.cache() // no-op
> >>>>>>>>>>>>>>> b.print() // Just to make sure a is cached
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> c = a.filter(...) // Optimizer decides whether the cache or
> >> DAG
> >>>>>>>>>> should
> >>>>>>>>>>>>>> be
> >>>>>>>>>>>>>>> used
> >>>>>>>>>>>>>>> d = a.filter(...) // Optimizer decides whether the cache or
> >> DAG
> >>>>>>>>>> should
> >>>>>>>>>>>>>> be
> >>>>>>>>>>>>>>> used
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> a.unCache()
> >>>>>>>>>>>>>>> a.unCache() // no-op
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> In case 1, semantics-wise, the optimizer loses the option to
> >>>>>>>>>>>>>>> choose between the DAG and the cache. And the unCache() call
> >>>>>>>>>>>>>>> becomes tricky.
> >>>>>>>>>>>>>>> In case 2, users do not need to worry about whether the cache
> >>>>>>>>>>>>>>> or the DAG is used. And the unCache() semantic is clear.
> >>>>>>>>>>>>>>> However, the caveat is that users cannot explicitly ignore the
> >>>>>>>>>>>>>>> cache.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> In order to address the issues mentioned in case 2, and
> >>>>>>>>>>>>>>> inspired by the discussion so far, I am thinking about using a
> >>>>>>>>>>>>>>> hint to allow users to explicitly ignore the cache. Although we
> >>>>>>>>>>>>>>> do not have hints yet, we probably should have one. So the code
> >>>>>>>>>>>>>>> becomes:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> *Case 3: returning this table*
> >>>>>>>>>>>>>>> b = a.map()
> >>>>>>>>>>>>>>> a.cache()
> >>>>>>>>>>>>>>> a.cache() // no-op
> >>>>>>>>>>>>>>> b.print() // Just to make sure a is cached
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> c = a.filter(...) // Optimizer decides whether the cache or
> >> DAG
> >>>>>>>>>> should
> >>>>>>>>>>>>>> be
> >>>>>>>>>>>>>>> used
> >>>>>>>>>>>>>>> d = a.hint("ignoreCache").filter(...) // DAG will be used
> >>>>> instead
> >>>>>>>>> of
> >>>>>>>>>>> the
> >>>>>>>>>>>>>>> cache.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> a.unCache()
> >>>>>>>>>>>>>>> a.unCache() // no-op
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> We could also let cache() return this table to allow
> chained
> >>>>>>>>> method
> >>>>>>>>>>>>>> calls.
> >>>>>>>>>>>>>>> Do you think this API addresses the concerns?
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Jiangjie (Becket) Qin
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> On Wed, Dec 5, 2018 at 10:55 AM Jark Wu <im...@gmail.com>
> >>>>> wrote:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Hi,
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> All the recent discussions are focused on whether there
> is a
> >>>>>>>>>> problem
> >>>>>>>>>>> if
> >>>>>>>>>>>>>>>> cache() not return a Table.
> >>>>>>>>>>>>>>>> It seems that returning a Table explicitly is more clear
> >> (and
> >>>>>>>>>> safe?).
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> So whether there are any problems if cache() returns a
> >> Table?
> >>>>>>>>>>> @Becket
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Best,
> >>>>>>>>>>>>>>>> Jark
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> On Tue, 4 Dec 2018 at 22:27, Till Rohrmann <
> >>>>> trohrmann@apache.org
> >>>>>>>>>>
> >>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> It's true that b, c, d and e will all read from the
> >> original
> >>>>> DAG
> >>>>>>>>>>> that
> >>>>>>>>>>>>>>>>> generates a. But all subsequent operators (when running
> >>>>> multiple
> >>>>>>>>>>>>>> queries)
> >>>>>>>>>>>>>>>>> which reference cachedTableA should not need to reproduce
> >> `a`
> >>>>>>>>> but
> >>>>>>>>>>>>>>>> directly
> >>>>>>>>>>>>>>>>> consume the intermediate result.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Conceptually one could think of cache() as introducing a
> >>>>> caching
> >>>>>>>>>>>>>> operator
> >>>>>>>>>>>>>>>>> from which you need to consume from if you want to
> benefit
> >>>>> from
> >>>>>>>>>> the
> >>>>>>>>>>>>>>>> caching
> >>>>>>>>>>>>>>>>> functionality.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> I agree, ideally the optimizer makes this kind of
> decision
> >>>>> which
> >>>>>>>>>>>>>>>>> intermediate result should be cached. But especially when
> >>>>>>>>>> executing
> >>>>>>>>>>>>>>>> ad-hoc
> >>>>>>>>>>>>>>>>> queries the user might better know which results need to
> be
> >>>>>>>>> cached
> >>>>>>>>>>>>>>>> because
> >>>>>>>>>>>>>>>>> Flink might not see the full DAG. In that sense, I would
> >>>>>>>>> consider
> >>>>>>>>>>> the
> >>>>>>>>>>>>>>>>> cache() method as a hint for the optimizer. Of course, in
> >> the
> >>>>>>>>>> future
> >>>>>>>>>>>>>> we
> >>>>>>>>>>>>>>>>> might add functionality which tries to automatically
> cache
> >>>>>>>>> results
> >>>>>>>>>>>>>> (e.g.
> >>>>>>>>>>>>>>>>> caching the latest intermediate results until so and so
> >> much
> >>>>>>>>> space
> >>>>>>>>>>> is
> >>>>>>>>>>>>>>>>> used). But this should hopefully not contradict with
> >>>>>>>>> `CachedTable
> >>>>>>>>>>>>>>>> cache()`.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Cheers,
> >>>>>>>>>>>>>>>>> Till
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> On Tue, Dec 4, 2018 at 2:33 PM Becket Qin <
> >>>>> becket.qin@gmail.com
> >>>>>>>>>>
> >>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Hi Till,
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Thanks for the clarification. I am still a little
> >> confused.
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> If cache() returns a CachedTable, the example might
> >> become:
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> b = a.map(...)
> >>>>>>>>>>>>>>>>>> c = a.map(...)
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> cachedTableA = a.cache()
> >>>>>>>>>>>>>>>>>> d = cachedTableA.map(...)
> >>>>>>>>>>>>>>>>>> e = a.map()
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> In the above case, if cache() is lazily evaluated, b,
> c, d
> >>>>> and
> >>>>>>>>> e
> >>>>>>>>>>> are
> >>>>>>>>>>>>>>>> all
> >>>>>>>>>>>>>>>>>> going to be reading from the original DAG that generates
> >> a.
> >>>>> But
> >>>>>>>>>>> with
> >>>>>>>>>>>>>> a
> >>>>>>>>>>>>>>>>>> naive expectation, d should be reading from the cache.
> >> This
> >>>>>>>>> seems
> >>>>>>>>>>> not
> >>>>>>>>>>>>>>>>>> solving the potential confusion you raised, right?
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Just to be clear, my understanding is all based on the
> >>>>>>>>>>>>>>>>>> assumption that the tables are immutable. Therefore, after
> >>>>>>>>>>>>>>>>>> a.cache(), the *cachedTableA* and the original table *a*
> >>>>>>>>>>>>>>>>>> should be completely interchangeable.
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> That said, I think a valid argument is optimization.
> There
> >>>>> are
> >>>>>>>>>>> indeed
> >>>>>>>>>>>>>>>>> cases
> >>>>>>>>>>>>>>>>>> that reading from the original DAG could be faster than
> >>>>> reading
> >>>>>>>>>>> from
> >>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>> cache. For example, in the following example:
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> a.filter(f1' > 100)
> >>>>>>>>>>>>>>>>>> a.cache()
> >>>>>>>>>>>>>>>>>> b = a.filter(f1' < 100)
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Ideally the optimizer should be intelligent enough to
> >> decide
> >>>>>>>>>> which
> >>>>>>>>>>>>>> way
> >>>>>>>>>>>>>>>> is
> >>>>>>>>>>>>>>>>>> faster, without user intervention. In this case, it will
> >>>>>>>>> identify
> >>>>>>>>>>>>>> that
> >>>>>>>>>>>>>>>> b
> >>>>>>>>>>>>>>>>>> would just be an empty table, thus skip reading from the
> >>>>> cache
> >>>>>>>>>>>>>>>>> completely.
> >>>>>>>>>>>>>>>>>> But I agree that returning a CachedTable would give user
> >> the
> >>>>>>>>>>> control
> >>>>>>>>>>>>>> of
> >>>>>>>>>>>>>>>>>> when to use cache, even though I still feel that letting
> >> the
> >>>>>>>>>>>>>> optimizer
> >>>>>>>>>>>>>>>>>> handle this is a better option in long run.
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Jiangjie (Becket) Qin
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> On Tue, Dec 4, 2018 at 6:51 PM Till Rohrmann <
> >>>>>>>>>> trohrmann@apache.org
> >>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Yes you are right Becket that it still depends on the
> >>>>> actual
> >>>>>>>>>>>>>>>> execution
> >>>>>>>>>>>>>>>>> of
> >>>>>>>>>>>>>>>>>>> the job whether a consumer reads from a cached result
> or
> >>>>> not.
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> My point was actually about the properties of a (cached
> >> vs.
> >>>>>>>>>>>>>>>> non-cached)
> >>>>>>>>>>>>>>>>>> and
> >>>>>>>>>>>>>>>>>>> not about the execution. I would not make cache trigger
> >> the
> >>>>>>>>>>>>>> execution
> >>>>>>>>>>>>>>>>> of
> >>>>>>>>>>>>>>>>>>> the job because one loses some flexibility by eagerly
> >>>>>>>>> triggering
> >>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>> execution.
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> I tried to argue for an explicit CachedTable which is
> >>>>> returned
> >>>>>>>>>> by
> >>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>> cache() method like Piotr did in order to make the API
> >> more
> >>>>>>>>>>>>>> explicit.
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Cheers,
> >>>>>>>>>>>>>>>>>>> Till
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> On Mon, Dec 3, 2018 at 4:23 PM Becket Qin <
> >>>>>>>>> becket.qin@gmail.com
> >>>>>>>>>>>
> >>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> Hi Till,
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> That is a good example. Just a minor correction, in
> this
> >>>>>>>>> case,
> >>>>>>>>>>> b, c
> >>>>>>>>>>>>>>>>>> and d
> >>>>>>>>>>>>>>>>>>>> will all consume from a non-cached a. This is because
> >>>>> cache
> >>>>>>>>>> will
> >>>>>>>>>>>>>>>> only
> >>>>>>>>>>>>>>>>>> be
> >>>>>>>>>>>>>>>>>>>> created on the very first job submission that
> generates
> >>>>> the
> >>>>>>>>>> table
> >>>>>>>>>>>>>>>> to
> >>>>>>>>>>>>>>>>> be
> >>>>>>>>>>>>>>>>>>>> cached.
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> If I understand correctly, this example is about whether
> >>>>>>>>>>>>>>>>>>>> the .cache() method should be eagerly evaluated or lazily
> >>>>>>>>>>>>>>>>>>>> evaluated. In other words, if the cache() method actually
> >>>>>>>>>>>>>>>>>>>> triggers a job that creates the cache, there will be no
> >>>>>>>>>>>>>>>>>>>> such confusion. Is that right?
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> In the example, although d will not consume from the
> >>>>> cached
> >>>>>>>>>> Table
> >>>>>>>>>>>>>>>>> while
> >>>>>>>>>>>>>>>>>>> it
> >>>>>>>>>>>>>>>>>>>> looks supposed to, from correctness perspective the
> code
> >>>>> will
> >>>>>>>>>>> still
> >>>>>>>>>>>>>>>>>>> return
> >>>>>>>>>>>>>>>>>>>> correct result, assuming that tables are immutable.
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> Personally I feel it is OK because users probably
> won't
> >>>>>>>>> really
> >>>>>>>>>>>>>>>> worry
> >>>>>>>>>>>>>>>>>>> about
> >>>>>>>>>>>>>>>>>>>> whether the table is cached or not. And lazy cache
> could
> >>>>>>>>> avoid
> >>>>>>>>>>> some
> >>>>>>>>>>>>>>>>>>>> unnecessary caching if a cached table is never created
> >> in
> >>>>> the
> >>>>>>>>>>> user
> >>>>>>>>>>>>>>>>>>>> application. But I am not opposed to do eager
> evaluation
> >>>>> of
> >>>>>>>>>>> cache.
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> Jiangjie (Becket) Qin
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> On Mon, Dec 3, 2018 at 10:01 PM Till Rohrmann <
> >>>>>>>>>>>>>>>> trohrmann@apache.org>
> >>>>>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> Another argument for Piotr's point is that lazily
> >>>>> changing
> >>>>>>>>>>>>>>>>> properties
> >>>>>>>>>>>>>>>>>>> of
> >>>>>>>>>>>>>>>>>>>> a
> >>>>>>>>>>>>>>>>>>>>> node affects all down stream consumers but does not
> >>>>>>>>>> necessarily
> >>>>>>>>>>>>>>>>> have
> >>>>>>>>>>>>>>>>>> to
> >>>>>>>>>>>>>>>>>>>>> happen before these consumers are defined. From a
> >> user's
> >>>>>>>>>>>>>>>>> perspective
> >>>>>>>>>>>>>>>>>>> this
> >>>>>>>>>>>>>>>>>>>>> can be quite confusing:
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> b = a.map(...)
> >>>>>>>>>>>>>>>>>>>>> c = a.map(...)
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> a.cache()
> >>>>>>>>>>>>>>>>>>>>> d = a.map(...)
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> now b, c and d will consume from a cached operator.
> In
> >>>>> this
> >>>>>>>>>>> case,
> >>>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>>> user
> >>>>>>>>>>>>>>>>>>>>> would most likely expect that only d reads from a
> >> cached
> >>>>>>>>>> result.
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> Cheers,
> >>>>>>>>>>>>>>>>>>>>> Till
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> On Mon, Dec 3, 2018 at 11:32 AM Piotr Nowojski <
> >>>>>>>>>>>>>>>>>>> piotr@data-artisans.com>
> >>>>>>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> Hey Shaoxuan and Becket,
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> Can you explain a bit more one what are the side
> >>>>> effects?
> >>>>>>>>> So
> >>>>>>>>>>>>>>>>> far
> >>>>>>>>>>>>>>>>>> my
> >>>>>>>>>>>>>>>>>>>>>>> understanding is that such side effects only exist
> >> if a
> >>>>>>>>>> table
> >>>>>>>>>>>>>>>>> is
> >>>>>>>>>>>>>>>>>>>>> mutable.
> >>>>>>>>>>>>>>>>>>>>>>> Is that the case?
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> Not only that. There are also performance
> implications
> >>>>> and
> >>>>>>>>>>>>>>>> those
> >>>>>>>>>>>>>>>>>> are
> >>>>>>>>>>>>>>>>>>>>>> another implicit side effects of using `void
> cache()`.
> >>>>> As I
> >>>>>>>>>>>>>>>> wrote
> >>>>>>>>>>>>>>>>>>>> before,
> >>>>>>>>>>>>>>>>>>>>>> reading from cache might not always be desirable,
> thus
> >>>>> it
> >>>>>>>>> can
> >>>>>>>>>>>>>>>>> cause
> >>>>>>>>>>>>>>>>>>>>>> performance degradation and I’m fine with that -
> >> user's
> >>>>> or
> >>>>>>>>>>>>>>>>>>> optimiser’s
> >>>>>>>>>>>>>>>>>>>>>> choice. What I do not like is that this implicit
> side
> >>>>>>>>> effect
> >>>>>>>>>>>>>>>> can
> >>>>>>>>>>>>>>>>>>>> manifest
> >>>>>>>>>>>>>>>>>>>>>> in completely different part of code, that wasn’t
> >>>>> touched
> >>>>>>>>> by
> >>>>>>>>>> a
> >>>>>>>>>>>>>>>>> user
> >>>>>>>>>>>>>>>>>>>> while
> >>>>>>>>>>>>>>>>>>>>>> he was adding `void cache()` call somewhere else.
> And
> >>>>> even
> >>>>>>>>> if
> >>>>>>>>>>>>>>>>>> caching
> >>>>>>>>>>>>>>>>>>>>>> improves performance, it’s still a side effect of
> >> `void
> >>>>>>>>>>>>>>>> cache()`.
> >>>>>>>>>>>>>>>>>>>> Almost
> >>>>>>>>>>>>>>>>>>>>>> from the definition `void` methods have only side
> >>>>> effects.
> >>>>>>>>>> As I
> >>>>>>>>>>>>>>>>>> wrote
> >>>>>>>>>>>>>>>>>>>>>> before, there are couple of scenarios where this
> might
> >>>>> be
> >>>>>>>>>>>>>>>>>> undesirable
> >>>>>>>>>>>>>>>>>>>>>> and/or unexpected, for example:
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> 1.
> >>>>>>>>>>>>>>>>>>>>>> Table b = …;
> >>>>>>>>>>>>>>>>>>>>>> b.cache()
> >>>>>>>>>>>>>>>>>>>>>> x = b.join(…)
> >>>>>>>>>>>>>>>>>>>>>> y = b.count()
> >>>>>>>>>>>>>>>>>>>>>> // ...
> >>>>>>>>>>>>>>>>>>>>>> // 100
> >>>>>>>>>>>>>>>>>>>>>> // hundred
> >>>>>>>>>>>>>>>>>>>>>> // lines
> >>>>>>>>>>>>>>>>>>>>>> // of
> >>>>>>>>>>>>>>>>>>>>>> // code
> >>>>>>>>>>>>>>>>>>>>>> // later
> >>>>>>>>>>>>>>>>>>>>>> z = b.filter(…).groupBy(…) // this might be even
> >> hidden
> >>>>> in
> >>>>>>>>> a
> >>>>>>>>>>>>>>>>>>> different
> >>>>>>>>>>>>>>>>>>>>>> method/file/package/dependency
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> 2.
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> Table b = ...
> >>>>>>>>>>>>>>>>>>>>>> If (some_condition) {
> >>>>>>>>>>>>>>>>>>>>>> foo(b)
> >>>>>>>>>>>>>>>>>>>>>> }
> >>>>>>>>>>>>>>>>>>>>>> Else {
> >>>>>>>>>>>>>>>>>>>>>> bar(b)
> >>>>>>>>>>>>>>>>>>>>>> }
> >>>>>>>>>>>>>>>>>>>>>> z = b.filter(…).groupBy(…)
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> Void foo(Table b) {
> >>>>>>>>>>>>>>>>>>>>>> b.cache()
> >>>>>>>>>>>>>>>>>>>>>> // do something with b
> >>>>>>>>>>>>>>>>>>>>>> }
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> In both above examples, `b.cache()` will implicitly
> >>>>> affect
> >>>>>>>>>>>>>>>>>> (semantic
> >>>>>>>>>>>>>>>>>>>> of a
> >>>>>>>>>>>>>>>>>>>>>> program in case of sources being mutable and
> >>>>> performance)
> >>>>>>>>> `z
> >>>>>>>>>> =
> >>>>>>>>>>>>>>>>>>>>>> b.filter(…).groupBy(…)` which might be far from
> >> obvious.
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> On top of that, there is still this argument of mine
> >>>>> that
> >>>>>>>>>>>>>>>> having
> >>>>>>>>>>>>>>>>> a
> >>>>>>>>>>>>>>>>>>>>>> `MaterializedTable` or `CachedTable` handle is more
> >>>>>>>>> flexible
> >>>>>>>>>>>>>>>> for
> >>>>>>>>>>>>>>>>> us
> >>>>>>>>>>>>>>>>>>> for
> >>>>>>>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>>>>> future and for the user (as a manual option to
> bypass
> >>>>> cache
> >>>>>>>>>>>>>>>>> reads).
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> But Jiangjie is correct,
> >>>>>>>>>>>>>>>>>>>>>>> the source table in batching should be immutable.
> It
> >> is
> >>>>>>>>> the
> >>>>>>>>>>>>>>>>>> user’s
> >>>>>>>>>>>>>>>>>>>>>>> responsibility to ensure it, otherwise even a
> regular
> >>>>>>>>>>>>>>>> failover
> >>>>>>>>>>>>>>>>>> may
> >>>>>>>>>>>>>>>>>>>> lead
> >>>>>>>>>>>>>>>>>>>>>>> to inconsistent results.
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> Yes, I agree that’s what a perfect world/good deployment
> >>>>>>>>>>>>>>>>>>>>>> should be. But it often isn’t, and while I’m not trying
> >>>>>>>>>>>>>>>>>>>>>> to fix this (since the proper fix is to support
> >>>>>>>>>>>>>>>>>>>>>> transactions), I’m just trying to minimise confusion for
> >>>>>>>>>>>>>>>>>>>>>> the users that are not fully aware of what’s going on and
> >>>>>>>>>>>>>>>>>>>>>> operate in a less than perfect setup. And if something
> >>>>>>>>>>>>>>>>>>>>>> bites them after adding a `b.cache()` call, I want to
> >>>>>>>>>>>>>>>>>>>>>> make sure that they at least know all of the places that
> >>>>>>>>>>>>>>>>>>>>>> adding this line can affect.
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> Thanks, Piotrek
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> On 1 Dec 2018, at 15:39, Becket Qin <
> >>>>> becket.qin@gmail.com
> >>>>>>>>>>
> >>>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> Hi Piotrek,
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> Thanks again for the clarification. Some more
> replies
> >>>>> are
> >>>>>>>>>>>>>>>>>>> following.
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> But keep in mind that `.cache()` will/might not
> only
> >> be
> >>>>>>>>> used
> >>>>>>>>>>>>>>>> in
> >>>>>>>>>>>>>>>>>>>>>> interactive
> >>>>>>>>>>>>>>>>>>>>>>>> programming and not only in batching.
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> It is true. Actually in stream processing, cache()
> >> has
> >>>>> the
> >>>>>>>>>>>>>>>> same
> >>>>>>>>>>>>>>>>>>>>> semantic
> >>>>>>>>>>>>>>>>>>>>>> as
> >>>>>>>>>>>>>>>>>>>>>>> batch processing. The semantic is following:
> >>>>>>>>>>>>>>>>>>>>>>> For a table created via a series of computation,
> save
> >>>>> that
> >>>>>>>>>>>>>>>>> table
> >>>>>>>>>>>>>>>>>>> for
> >>>>>>>>>>>>>>>>>>>>>> later
> >>>>>>>>>>>>>>>>>>>>>>> reference to avoid running the computation logic to
> >>>>>>>>>>>>>>>> regenerate
> >>>>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>>>> table.
> >>>>>>>>>>>>>>>>>>>>>>> Once the application exits, drop all the cache.
> >>>>>>>>>>>>>>>>>>>>>>> This semantic is same for both batch and stream
> >>>>>>>>> processing.
> >>>>>>>>>>>>>>>> The
> >>>>>>>>>>>>>>>>>>>>>> difference
> >>>>>>>>>>>>>>>>>>>>>>> is that stream applications will only run once as
> >> they
> >>>>> are
> >>>>>>>>>>>>>>>> long
> >>>>>>>>>>>>>>>>>>>>> running.
> >>>>>>>>>>>>>>>>>>>>>>> And the batch applications may be run multiple
> times,
> >>>>>>>>> hence
> >>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>> cache
> >>>>>>>>>>>>>>>>>>>>> may
> >>>>>>>>>>>>>>>>>>>>>>> be created and dropped each time the application
> >> runs.
> >>>>>>>>>>>>>>>>>>>>>>> Admittedly, there will probably be some resource
> >>>>>>>>> management
> >>>>>>>>>>>>>>>>>>>>> requirements
> >>>>>>>>>>>>>>>>>>>>>>> for the streaming cached table, such as time based
> /
> >>>>> size
> >>>>>>>>>>>>>>>> based
> >>>>>>>>>>>>>>>>>>>>>> retention,
> >>>>>>>>>>>>>>>>>>>>>>> to address the infinite data issue. But such
> >>>>> requirement
> >>>>>>>>>> does
> >>>>>>>>>>>>>>>>> not
> >>>>>>>>>>>>>>>>>>>>> change
> >>>>>>>>>>>>>>>>>>>>>>> the semantic.
> >>>>>>>>>>>>>>>>>>>>>>> You are right that interactive programming is just
> >> one
> >>>>> use
> >>>>>>>>>>>>>>>> case
> >>>>>>>>>>>>>>>>>> of
> >>>>>>>>>>>>>>>>>>>>>> cache().
> >>>>>>>>>>>>>>>>>>>>>>> It is not the only use case.
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> For me the more important issue is of not having
> the
> >>>>> `void
> >>>>>>>>>>>>>>>>>> cache()`
> >>>>>>>>>>>>>>>>>>>>> with
> >>>>>>>>>>>>>>>>>>>>>>>> side effects.
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> This is indeed the key point. The argument around
> >>>>> whether
> >>>>>>>>>>>>>>>>> cache()
> >>>>>>>>>>>>>>>>>>>>> should
> >>>>>>>>>>>>>>>>>>>>>>> return something already indicates that cache() and
> >>>>>>>>>>>>>>>>> materialize()
> >>>>>>>>>>>>>>>>>>>>> address
> >>>>>>>>>>>>>>>>>>>>>>> different issues.
> >>>>>>>>>>>>>>>>>>>>>>> Can you explain a bit more one what are the side
> >>>>> effects?
> >>>>>>>>> So
> >>>>>>>>>>>>>>>>> far
> >>>>>>>>>>>>>>>>>> my
> >>>>>>>>>>>>>>>>>>>>>>> understanding is that such side effects only exist
> >> if a
> >>>>>>>>>> table
> >>>>>>>>>>>>>>>>> is
> >>>>>>>>>>>>>>>>>>>>> mutable.
> >>>>>>>>>>>>>>>>>>>>>>> Is that the case?
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> I don’t know, probably initially we should make
> >>>>>>>>> CachedTable
> >>>>>>>>>>>>>>>>>>>> read-only.
> >>>>>>>>>>>>>>>>>>>>> I
> >>>>>>>>>>>>>>>>>>>>>>>> don’t find it more confusing than the fact that
> user
> >>>>> can
> >>>>>>>>>> not
> >>>>>>>>>>>>>>>>>> write
> >>>>>>>>>>>>>>>>>>>> to
> >>>>>>>>>>>>>>>>>>>>>> views
> >>>>>>>>>>>>>>>>>>>>>>>> or materialised views in SQL or that user
> currently
> >>>>> can
> >>>>>>>>> not
> >>>>>>>>>>>>>>>>>> write
> >>>>>>>>>>>>>>>>>>>> to a
> >>>>>>>>>>>>>>>>>>>>>>>> Table.
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> I don't think anyone should insert something to a
> >>>>> cache.
> >>>>>>>>> By
> >>>>>>>>>>>>>>>>>>>> definition
> >>>>>>>>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>>>>>> cache should only be updated when the corresponding
> >>>>>>>>> original
> >>>>>>>>>>>>>>>>>> table
> >>>>>>>>>>>>>>>>>>> is
> >>>>>>>>>>>>>>>>>>>>>>> updated. What I am wondering is that given the
> >>>>> following
> >>>>>>>>> two
> >>>>>>>>>>>>>>>>>> facts:
> >>>>>>>>>>>>>>>>>>>>>>> 1. If and only if a table is mutable (with
> something
> >>>>> like
> >>>>>>>>>>>>>>>>>>> insert()),
> >>>>>>>>>>>>>>>>>>>> a
> >>>>>>>>>>>>>>>>>>>>>>> CachedTable may have implicit behavior.
> >>>>>>>>>>>>>>>>>>>>>>> 2. A CachedTable extends a Table.
> >>>>>>>>>>>>>>>>>>>>>>> We can come to the conclusion that a CachedTable is
> >>>>>>>>> mutable
> >>>>>>>>>>>>>>>> and
> >>>>>>>>>>>>>>>>>>> users
> >>>>>>>>>>>>>>>>>>>>> can
> >>>>>>>>>>>>>>>>>>>>>>> insert into the CachedTable directly. This is
> where I
> >>>>>>>>>> thought
> >>>>>>>>>>>>>>>>>>>>> confusing.
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> Jiangjie (Becket) Qin
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> On Sat, Dec 1, 2018 at 2:45 AM Piotr Nowojski <
> >>>>>>>>>>>>>>>>>>>> piotr@data-artisans.com
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> Hi all,
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> Regarding naming `cache()` vs `materialize()`. One
> >>>>> more
> >>>>>>>>>>>>>>>>>>> explanation
> >>>>>>>>>>>>>>>>>>>>> why
> >>>>>>>>>>>>>>>>>>>>>> I
> >>>>>>>>>>>>>>>>>>>>>>>> think `materialize()` is more natural to me is
> that
> >> I
> >>>>>>>>> think
> >>>>>>>>>>>>>>>> of
> >>>>>>>>>>>>>>>>>> all
> >>>>>>>>>>>>>>>>>>>>>> “Table”s
> >>>>>>>>>>>>>>>>>>>>>>>> in Table-API as views. They behave the same way as
> >> SQL
> >>>>>>>>>>>>>>>> views,
> >>>>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>>> only
> >>>>>>>>>>>>>>>>>>>>>>>> difference for me is that their live scope is
> short
> >> -
> >>>>>>>>>>>>>>>> current
> >>>>>>>>>>>>>>>>>>>> session
> >>>>>>>>>>>>>>>>>>>>>> which
> >>>>>>>>>>>>>>>>>>>>>>>> is limited by different execution model. That’s
> why
> >>>>>>>>>>>>>>>> “cashing”
> >>>>>>>>>>>>>>>>> a
> >>>>>>>>>>>>>>>>>>> view
> >>>>>>>>>>>>>>>>>>>>>> for me
> >>>>>>>>>>>>>>>>>>>>>>>> is just materialising it.
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> However I see and I understand your point of view.
> >>>>> Coming
> >>>>>>>>>>>>>>>> from
> >>>>>>>>>>>>>>>>>>>>>>>> DataSet/DataStream and generally speaking non-SQL
> >>>>> world,
> >>>>>>>>>>>>>>>>>> `cache()`
> >>>>>>>>>>>>>>>>>>>> is
> >>>>>>>>>>>>>>>>>>>>>> more
> >>>>>>>>>>>>>>>>>>>>>>>> natural. But keep in mind that `.cache()`
> will/might
> >>>>> not
> >>>>>>>>>>>>>>>> only
> >>>>>>>>>>>>>>>>> be
> >>>>>>>>>>>>>>>>>>>> used
> >>>>>>>>>>>>>>>>>>>>> in
> >>>>>>>>>>>>>>>>>>>>>>>> interactive programming and not only in batching.
> >> But
> >>>>>>>>>> naming
> >>>>>>>>>>>>>>>>> is
> >>>>>>>>>>>>>>>>>>> one
> >>>>>>>>>>>>>>>>>>>>>> issue,
> >>>>>>>>>>>>>>>>>>>>>>>> and not that critical to me. Especially that once
> we
> >>>>>>>>>>>>>>>> implement
> >>>>>>>>>>>>>>>>>>>> proper
> >>>>>>>>>>>>>>>>>>>>>>>> materialised views, we can always deprecate/rename
> >>>>>>>>>> `cache()`
> >>>>>>>>>>>>>>>>> if
> >>>>>>>>>>>>>>>>>> we
> >>>>>>>>>>>>>>>>>>>>> deem
> >>>>>>>>>>>>>>>>>>>>>> so.
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> For me the more important issue is of not having
> the
> >>>>>>>>> `void
> >>>>>>>>>>>>>>>>>>> cache()`
> >>>>>>>>>>>>>>>>>>>>> with
> >>>>>>>>>>>>>>>>>>>>>>>> side effects. Exactly for the reasons that you
> have
> >>>>>>>>>>>>>>>> mentioned.
> >>>>>>>>>>>>>>>>>>> True:
> >>>>>>>>>>>>>>>>>>>>>>>> results might be non deterministic if underlying
> >>>>> source
> >>>>>>>>>>>>>>>> table
> >>>>>>>>>>>>>>>>>> are
> >>>>>>>>>>>>>>>>>>>>>> changing.
> >>>>>>>>>>>>>>>>>>>>>>>> Problem is that `void cache()` implicitly changes
> >> the
> >>>>>>>>>>>>>>>> semantic
> >>>>>>>>>>>>>>>>>> of
> >>>>>>>>>>>>>>>>>>>>>>>> subsequent uses of the cached/materialized Table.
> It
> >>>>> can
> >>>>>>>>>>>>>>>> cause
> >>>>>>>>>>>>>>>>>>> “wtf”
> >>>>>>>>>>>>>>>>>>>>>> moment
> >>>>>>>>>>>>>>>>>>>>>>>> for a user if he inserts “b.cache()” call in some
> >>>>> place
> >>>>>>>>> in
> >>>>>>>>>>>>>>>> his
> >>>>>>>>>>>>>>>>>>> code
> >>>>>>>>>>>>>>>>>>>>> and
> >>>>>>>>>>>>>>>>>>>>>>>> suddenly some other random places are behaving
> >>>>>>>>> differently.
> >>>>>>>>>>>>>>>> If
> >>>>>>>>>>>>>>>>>>>>>>>> `materialize()` or `cache()` returns a Table
> handle,
> >>>>> we
> >>>>>>>>>>>>>>>> force
> >>>>>>>>>>>>>>>>>> user
> >>>>>>>>>>>>>>>>>>>> to
> >>>>>>>>>>>>>>>>>>>>>>>> explicitly use the cache which removes the
> “random”
> >>>>> part
> >>>>>>>>>>>>>>>> from
> >>>>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>>>>> "suddenly
> >>>>>>>>>>>>>>>>>>>>>>>> some other random places are behaving
> differently”.
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> This argument and others that I’ve raised (greater
> >>>>>>>>>>>>>>>>>>>>> flexibility/allowing
> >>>>>>>>>>>>>>>>>>>>>>>> user to explicitly bypass the cache) are
> independent
> >>>>> of
> >>>>>>>>>>>>>>>>>> `cache()`
> >>>>>>>>>>>>>>>>>>> vs
> >>>>>>>>>>>>>>>>>>>>>>>> `materialize()` discussion.
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>> Does that mean one can also insert into the
> >>>>> CachedTable?
> >>>>>>>>>>>>>>>> This
> >>>>>>>>>>>>>>>>>>>> sounds
> >>>>>>>>>>>>>>>>>>>>>>>> pretty confusing.
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> I don’t know, probably initially we should make
> >>>>>>>>> CachedTable
> >>>>>>>>>>>>>>>>>>>>> read-only. I
> >>>>>>>>>>>>>>>>>>>>>>>> don’t find it more confusing than the fact that
> user
> >>>>> can
> >>>>>>>>>> not
> >>>>>>>>>>>>>>>>>> write
> >>>>>>>>>>>>>>>>>>>> to
> >>>>>>>>>>>>>>>>>>>>>> views
> >>>>>>>>>>>>>>>>>>>>>>>> or materialised views in SQL or that user
> currently
> >>>>> can
> >>>>>>>>> not
> >>>>>>>>>>>>>>>>>> write
> >>>>>>>>>>>>>>>>>>>> to a
> >>>>>>>>>>>>>>>>>>>>>>>> Table.
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> Piotrek
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>> On 30 Nov 2018, at 17:38, Xingcan Cui <
> >>>>>>>>> xingcanc@gmail.com
> >>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>> Hi all,
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>> I agree with @Becket that `cache()` and
> >>>>> `materialize()`
> >>>>>>>>>>>>>>>>> should
> >>>>>>>>>>>>>>>>>> be
> >>>>>>>>>>>>>>>>>>>>>>>> considered as two different methods where the
> later
> >>>>> one
> >>>>>>>>> is
> >>>>>>>>>>>>>>>>> more
> >>>>>>>>>>>>>>>>>>>>>>>> sophisticated.
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>> According to my understanding, the initial idea
> is
> >>>>> just
> >>>>>>>>> to
> >>>>>>>>>>>>>>>>>>>> introduce
> >>>>>>>>>>>>>>>>>>>>> a
> >>>>>>>>>>>>>>>>>>>>>>>> simple cache or persist mechanism, but as the
> >> TableAPI
> >>>>>>>>> is a
> >>>>>>>>>>>>>>>>>>>> high-level
> >>>>>>>>>>>>>>>>>>>>>> API,
> >>>>>>>>>>>>>>>>>>>>>>>> it’s naturally for as to think in a SQL way.
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>> Maybe we can add the `cache()` method to the
> >> DataSet
> >>>>> API
> >>>>>>>>>>>>>>>> and
> >>>>>>>>>>>>>>>>>>> force
> >>>>>>>>>>>>>>>>>>>>>> users
> >>>>>>>>>>>>>>>>>>>>>>>> to translate a Table to a Dataset before caching
> it.
> >>>>> Then
> >>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>> users
> >>>>>>>>>>>>>>>>>>>>>> should
> >>>>>>>>>>>>>>>>>>>>>>>> manually register the cached dataset to a table
> >> again
> >>>>> (we
> >>>>>>>>>>>>>>>> may
> >>>>>>>>>>>>>>>>>> need
> >>>>>>>>>>>>>>>>>>>>> some
> >>>>>>>>>>>>>>>>>>>>>>>> table replacement mechanisms for datasets with an
> >>>>>>>>> identical
> >>>>>>>>>>>>>>>>>> schema
> >>>>>>>>>>>>>>>>>>>> but
> >>>>>>>>>>>>>>>>>>>>>>>> different contents here). After all, it’s the
> >> dataset
> >>>>>>>>>> rather
> >>>>>>>>>>>>>>>>>> than
> >>>>>>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>>>>>>> dynamic table that need to be cached, right?
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>> Best,
> >>>>>>>>>>>>>>>>>>>>>>>>> Xingcan
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> On Nov 30, 2018, at 10:57 AM, Becket Qin <
> >>>>>>>>>>>>>>>>>> becket.qin@gmail.com>
> >>>>>>>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> Hi Piotrek and Jark,
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for the feedback and explanation. Those
> are
> >>>>> good
> >>>>>>>>>>>>>>>>>>> arguments.
> >>>>>>>>>>>>>>>>>>>>>> But I
> >>>>>>>>>>>>>>>>>>>>>>>>>> think those arguments are mostly about
> >> materialized
> >>>>>>>>> view.
> >>>>>>>>>>>>>>>>> Let
> >>>>>>>>>>>>>>>>>> me
> >>>>>>>>>>>>>>>>>>>> try
> >>>>>>>>>>>>>>>>>>>>>> to
> >>>>>>>>>>>>>>>>>>>>>>>>>> explain the reason I believe cache() and
> >>>>> materialize()
> >>>>>>>>>> are
> >>>>>>>>>>>>>>>>>>>>> different.
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> I think cache() and materialize() have quite
> >>>>> different
> >>>>>>>>>>>>>>>>>>>> implications.
> >>>>>>>>>>>>>>>>>>>>>> An
> >>>>>>>>>>>>>>>>>>>>>>>>>> analogy I can think of is save()/publish(). When
> >>>>> users
> >>>>>>>>>>>>>>>> call
> >>>>>>>>>>>>>>>>>>>> cache(),
> >>>>>>>>>>>>>>>>>>>>>> it
> >>>>>>>>>>>>>>>>>>>>>>>> is
> >>>>>>>>>>>>>>>>>>>>>>>>>> just like they are saving an intermediate result
> >> as
> >>>>> a
> >>>>>>>>>>>>>>>> draft
> >>>>>>>>>>>>>>>>> of
> >>>>>>>>>>>>>>>>>>>> their
> >>>>>>>>>>>>>>>>>>>>>>>> work,
> >>>>>>>>>>>>>>>>>>>>>>>>>> this intermediate result may not have any
> >> realistic
> >>>>>>>>>>>>>>>> meaning.
> >>>>>>>>>>>>>>>>>>>> Calling
> >>>>>>>>>>>>>>>>>>>>>>>>>> cache() does not mean users want to publish the
> >>>>> cached
> >>>>>>>>>>>>>>>> table
> >>>>>>>>>>>>>>>>>> in
> >>>>>>>>>>>>>>>>>>>> any
> >>>>>>>>>>>>>>>>>>>>>>>> manner.
> >>>>>>>>>>>>>>>>>>>>>>>>>> But when users call materialize(), that means "I
> >>>>> have
> >>>>>>>>>>>>>>>>>> something
> >>>>>>>>>>>>>>>>>>>>>>>> meaningful
> >>>>>>>>>>>>>>>>>>>>>>>>>> to be reused by others", now users need to think
> >>>>> about
> >>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>>>> validation,
> >>>>>>>>>>>>>>>>>>>>>>>>>> update & versioning, lifecycle of the result,
> etc.
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> Piotrek's suggestions on variations of the
> >>>>>>>>> materialize()
> >>>>>>>>>>>>>>>>>> methods
> >>>>>>>>>>>>>>>>>>>> are
> >>>>>>>>>>>>>>>>>>>>>>>> very
> >>>>>>>>>>>>>>>>>>>>>>>>>> useful. It would be great if Flink have them.
> The
> >>>>>>>>> concept
> >>>>>>>>>>>>>>>> of
> >>>>>>>>>>>>>>>>>>>>>>>> materialized
> >>>>>>>>>>>>>>>>>>>>>>>>>> view is actually a pretty big feature, not to
> say
> >>>>> the
> >>>>>>>>>>>>>>>>> related
> >>>>>>>>>>>>>>>>>>>> stuff
> >>>>>>>>>>>>>>>>>>>>>> like
> >>>>>>>>>>>>>>>>>>>>>>>>>> triggers/hooks you mentioned earlier. I think
> the
> >>>>>>>>>>>>>>>>> materialized
> >>>>>>>>>>>>>>>>>>>> view
> >>>>>>>>>>>>>>>>>>>>>>>> itself
> >>>>>>>>>>>>>>>>>>>>>>>>>> should be discussed in a more thorough and
> >>>>> systematic
> >>>>>>>>>>>>>>>>> manner.
> >>>>>>>>>>>>>>>>>>> And
> >>>>>>>>>>>>>>>>>>>> I
> >>>>>>>>>>>>>>>>>>>>>>>> found
> >>>>>>>>>>>>>>>>>>>>>>>>>> that discussion is kind of orthogonal and way
> >> beyond
> >>>>>>>>>>>>>>>>>> interactive
> >>>>>>>>>>>>>>>>>>>>>>>>>> programming experience.
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> The example you gave was interesting. I still
> have
> >>>>> some
> >>>>>>>>>>>>>>>>>>> questions,
> >>>>>>>>>>>>>>>>>>>>>>>> though.
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> Table source = … // some source that scans files
> >>>>> from a
> >>>>>>>>>>>>>>>>>>> directory
> >>>>>>>>>>>>>>>>>>>>>>>>>>> “/foo/bar/“
> >>>>>>>>>>>>>>>>>>>>>>>>>>> Table t1 = source.groupBy(…).select(…).where(…)
> >> ….;
> >>>>>>>>>>>>>>>>>>>>>>>>>>> Table t2 = t1.materialize() // (or `cache()`)
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> t2.count() // initialise cache (if it’s lazily
> >>>>>>>>>>>>>>>> initialised)
> >>>>>>>>>>>>>>>>>>>>>>>>>>> int a1 = t1.count()
> >>>>>>>>>>>>>>>>>>>>>>>>>>> int b1 = t2.count()
> >>>>>>>>>>>>>>>>>>>>>>>>>>> // something in the background (or we trigger
> it)
> >>>>>>>>> writes
> >>>>>>>>>>>>>>>>> new
> >>>>>>>>>>>>>>>>>>>> files
> >>>>>>>>>>>>>>>>>>>>> to
> >>>>>>>>>>>>>>>>>>>>>>>>>>> /foo/bar
> >>>>>>>>>>>>>>>>>>>>>>>>>>> int a2 = t1.count()
> >>>>>>>>>>>>>>>>>>>>>>>>>>> int b2 = t2.count()
> >>>>>>>>>>>>>>>>>>>>>>>>>>> t2.refresh() // possible future extension, not
> to
> >>>>> be
> >>>>>>>>>>>>>>>>>>> implemented
> >>>>>>>>>>>>>>>>>>>> in
> >>>>>>>>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>>>>>>>>>> initial version
> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> what if someone else added some more files to
> >>>>> /foo/bar
> >>>>>>>>> at
> >>>>>>>>>>>>>>>>> this
> >>>>>>>>>>>>>>>>>>>>> point?
> >>>>>>>>>>>>>>>>>>>>>> In
> >>>>>>>>>>>>>>>>>>>>>>>>>> that case, a3 won't equals to b3, and the result
> >>>>> become
> >>>>>>>>>>>>>>>>>>>>>>>> non-deterministic,
> >>>>>>>>>>>>>>>>>>>>>>>>>> right?
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> int a3 = t1.count()
> >>>>>>>>>>>>>>>>>>>>>>>>>>> int b3 = t2.count()
> >>>>>>>>>>>>>>>>>>>>>>>>>>> t2.drop() // another possible future extension,
> >>>>> manual
> >>>>>>>>>>>>>>>>>> “cache”
> >>>>>>>>>>>>>>>>>>>>>> dropping
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> When we talk about interactive programming, in
> >> most
> >>>>>>>>>> cases,
> >>>>>>>>>>>>>>>>> we
> >>>>>>>>>>>>>>>>>>> are
> >>>>>>>>>>>>>>>>>>>>>>>> talking
> >>>>>>>>>>>>>>>>>>>>>>>>>> about batch applications. A fundamental
> assumption
> >>>>> of
> >>>>>>>>>> such
> >>>>>>>>>>>>>>>>>> case
> >>>>>>>>>>>>>>>>>>> is
> >>>>>>>>>>>>>>>>>>>>>> that
> >>>>>>>>>>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>>>>>>>>> source data is complete before the data
> processing
> >>>>>>>>>> begins,
> >>>>>>>>>>>>>>>>> and
> >>>>>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>>>>> data
> >>>>>>>>>>>>>>>>>>>>>>>>>> will not change during the data processing. IMO,
> >> if
> >>>>>>>>>>>>>>>>> additional
> >>>>>>>>>>>>>>>>>>>> rows
> >>>>>>>>>>>>>>>>>>>>>>>> needs
> >>>>>>>>>>>>>>>>>>>>>>>>>> to be added to some source during the
> processing,
> >> it
> >>>>>>>>>>>>>>>> should
> >>>>>>>>>>>>>>>>> be
> >>>>>>>>>>>>>>>>>>>> done
> >>>>>>>>>>>>>>>>>>>>> in
> >>>>>>>>>>>>>>>>>>>>>>>> ways
> >>>>>>>>>>>>>>>>>>>>>>>>>> like union the source with another table
> >> containing
> >>>>> the
> >>>>>>>>>>>>>>>> rows
> >>>>>>>>>>>>>>>>>> to
> >>>>>>>>>>>>>>>>>>> be
> >>>>>>>>>>>>>>>>>>>>>>>> added.
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> There are a few cases that computations are
> >> executed
> >>>>>>>>>>>>>>>>>> repeatedly
> >>>>>>>>>>>>>>>>>>> on
> >>>>>>>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>>>>>>>>> changing data source.
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> For example, people may run a ML training job
> >> every
> >>>>>>>>> hour
> >>>>>>>>>>>>>>>>> with
> >>>>>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>>>>>>> samples
> >>>>>>>>>>>>>>>>>>>>>>>>>> newly added in the past hour. In that case, the
> >>>>> source
> >>>>>>>>>>>>>>>> data
> >>>>>>>>>>>>>>>>>>>> between
> >>>>>>>>>>>>>>>>>>>>>> will
> >>>>>>>>>>>>>>>>>>>>>>>>>> indeed change. But still, the data remain
> >> unchanged
> >>>>>>>>>> within
> >>>>>>>>>>>>>>>>> one
> >>>>>>>>>>>>>>>>>>>> run.
> >>>>>>>>>>>>>>>>>>>>>> And
> >>>>>>>>>>>>>>>>>>>>>>>>>> usually in that case, the result will need
> >>>>> versioning,
> >>>>>>>>>>>>>>>> i.e.
> >>>>>>>>>>>>>>>>>> for
> >>>>>>>>>>>>>>>>>>> a
> >>>>>>>>>>>>>>>>>>>>>> given
> >>>>>>>>>>>>>>>>>>>>>>>>>> result, it tells that the result is a result
> from
> >>>>> the
> >>>>>>>>>>>>>>>> source
> >>>>>>>>>>>>>>>>>>> data
> >>>>>>>>>>>>>>>>>>>>> by a
> >>>>>>>>>>>>>>>>>>>>>>>>>> certain timestamp.
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> Another example is something like data
> warehouse.
> >> In
> >>>>>>>>> this
> >>>>>>>>>>>>>>>>>> case,
> >>>>>>>>>>>>>>>>>>>>> there
> >>>>>>>>>>>>>>>>>>>>>>>> are a
> >>>>>>>>>>>>>>>>>>>>>>>>>> few source of original/raw data. On top of those
> >>>>>>>>> sources,
> >>>>>>>>>>>>>>>>> many
> >>>>>>>>>>>>>>>>>>>>>>>> materialized
> >>>>>>>>>>>>>>>>>>>>>>>>>> view / queries / reports / dashboards can be
> >>>>> created to
> >>>>>>>>>>>>>>>>>> generate
> >>>>>>>>>>>>>>>>>>>>>> derived
> >>>>>>>>>>>>>>>>>>>>>>>>>> data. Those derived data needs to be updated
> when
> >>>>> the
> >>>>>>>>>>>>>>>>>> underlying
> >>>>>>>>>>>>>>>>>>>>>>>> original
> >>>>>>>>>>>>>>>>>>>>>>>>>> data changes. In that case, the processing logic
> >>>>> that
> >>>>>>>>>>>>>>>>> derives
> >>>>>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>>>>>>> original
> >>>>>>>>>>>>>>>>>>>>>>>>>> data needs to be executed repeatedly to update
> >> those
> >>>>>>>>>>>>>>>>>>>> reports/views.
> >>>>>>>>>>>>>>>>>>>>>>>> Again,
> >>>>>>>>>>>>>>>>>>>>>>>>>> all those derived data also need to ha
> >>>>>
> >>>>>
> >>
> >>
> >>
>
>

Re: [DISCUSS] Support Interactive Programming in Flink Table API

Posted by Piotr Nowojski <pi...@da-platform.com>.
Hi Becket,

With `uncache` there are probably two features that we can think about:

a)

Physically dropping the cached table from the storage, freeing up the resources

b)

Hinting the optimizer to not cache the reads for the next query/table

a) Has the issue, as I wrote before, that it seems to be an operation inherently “flawed” by having side effects.

I’m not sure how it would best be expressed. We could make it work:

1. via a method on a Table as you proposed:

void Table#dropCache()
void Table#uncache()

2. Operation on the environment

env.dropCacheFor(table) // or some other argument that allows the user to identify the desired cache

3. Extending (from your original design doc) `setTableService` method to return some control handle like:

TableServiceControl setTableService(TableFactory tf, 
                     TableProperties properties, 
                     TempTableCleanUpCallback cleanUpCallback);

(TableServiceControl? TableService? TableServiceHandle? CacheService?)

And having the drop cache method there:

TableServiceControl#dropCache(table)

Out of those options, option 1 has the disadvantage of not making the user aware that this is a global operation with side effects. Take the earlier example:

public void foo(Table t) {
  // …
  t.dropCache();
}

It might not be immediately obvious that `t.dropCache()` is some kind of global operation, with side effects visible outside of the `foo` function.

On the other hand, both options 2 and 3 have a greater chance of catching the user’s attention:

public void foo(Table t, CacheService cacheService) {
  // …
  cacheService.dropCache(t);
}

b) could be achieved quite easily:

Table a = …
val notCached1 = a.doNotCache()
val cachedA = a.cache()
val notCached2 = cachedA.doNotCache() // equivalent of notCached1

`doNotCache()` would behave similarly to `cache()` - return a copy of the table with removed “cache” hint and/or added “never cache” hint.
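
A minimal sketch of how such hint-copying could behave (the class shape and hint names below are my illustration, not actual Flink API):

import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

class HintedTable {
  private final Set<String> hints;

  HintedTable(Set<String> hints) {
    this.hints = Collections.unmodifiableSet(new HashSet<>(hints));
  }

  // Returns a copy carrying the "cache" hint; `this` is left untouched.
  HintedTable cache() {
    return withHint("cache", "neverCache");
  }

  // Returns a copy carrying the "neverCache" hint; `this` is left untouched.
  HintedTable doNotCache() {
    return withHint("neverCache", "cache");
  }

  private HintedTable withHint(String toAdd, String toRemove) {
    Set<String> newHints = new HashSet<>(hints);
    newHints.remove(toRemove);
    newHints.add(toAdd);
    return new HintedTable(newHints);
  }
}

With that shape, calling `a.cache()` or `a.doNotCache()` can never change how an existing reference to `a` behaves, which is exactly the property discussed above.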

Piotrek


> On 8 Jan 2019, at 03:17, Becket Qin <be...@gmail.com> wrote:
> 
> Hi Piotr,
> 
> Thanks for the proposal and detailed explanation. I like the idea of
> returning a new hinted Table without modifying the original table. This
> also leaves room for users to benefit from future implicit caching.
> 
> Just to make sure I get the full picture. In your proposal, there will also
> be a 'void Table#uncache()' method to release the cache, right?
> 
> Thanks,
> 
> Jiangjie (Becket) Qin
> 
> On Mon, Jan 7, 2019 at 11:50 PM Piotr Nowojski <pi...@da-platform.com>
> wrote:
> 
>> Hi Becket!
>> 
>> After further thinking I tend to agree that my previous proposal (*Option
>> 2*) indeed might not be ideal if we would in the future introduce automatic caching.
>> However I would like to propose a slightly modified version of it:
>> 
>> *Option 4*
>> 
>> Adding `cache()` method with following signature:
>> 
>> Table Table#cache();
>> 
>> Without side effects: the `cache()` call does not modify/change the
>> original Table in any way.
>> It would return a copy of the original table, with an added hint for the
>> optimizer to cache the table, so that future accesses to the returned
>> table might be cached or not.
>> 
>> Assuming that we are talking about a setup, where we do not have automatic
>> caching enabled (possible future extension).
>> 
>> Example #1:
>> 
>> ```
>> Table a = …
>> a.foo() // not cached
>> 
>> val cachedA = a.cache();
>> 
>> cachedA.bar() // maybe cached
>> a.foo() // same as before - effectively not cached
>> ```
>> 
>> Both the first and the second `a.foo()` operations would behave in
>> exactly the same way. Again, the `a.cache()` call doesn’t affect `a`
>> itself. If `a` was not hinted for caching before `a.cache();`, then both
>> `a.foo()` calls wouldn’t use the cache.
>> 
>> The returned `cachedA` would be hinted with the “cache” hint, so probably
>> `cachedA.bar()` would go through the cache (unless the optimiser decides
>> the opposite).
>> 
>> Example #2
>> 
>> ```
>> Table a = …
>> 
>> a.foo() // not cached
>> 
>> val b = a.cache();
>> 
>> a.foo() // same as before - effectively not cached
>> b.foo() // maybe cached
>> 
>> val c = b.cache();
>> 
>> a.foo() // same as before - effectively not cached
>> b.foo() // same as before - effectively maybe cached
>> c.foo() // maybe cached
>> ```
>> 
>> Now, assuming that we have some future “automatic caching optimisation”:
>> 
>> Example #3
>> 
>> ```
>> env.enableAutomaticCaching()
>> Table a = …
>> 
>> a.foo() // might be cached, depending if `a` was selected to automatic
>> caching
>> 
>> val b = a.cache();
>> 
>> a.foo() // same as before - might be cached, if `a` was selected to
>> automatic caching
>> b.foo() // maybe cached
>> ```
>> 
>> 
>> More or less this is the same behaviour as:
>> 
>> Table a = ...
>> val b = a.filter(x > 20)
>> 
>> calling `filter` hasn’t changed or altered `a` in any way. If `a` was
>> previously filtered:
>> 
>> Table src = …
>> val a = src.filter(x > 20)
>> val b = a.filter(x > 20)
>> 
>> then yes, `a` and `b` will be the same. But the point is that neither
>> `filter` nor `cache` changes the original `a` table.
>> 
>> One thing is that, indeed, the physical cache-dropping operation will have
>> side effects and will in a way mutate the cached table references. But
>> this is, I think, unavoidable in any solution - the same issue as calling
>> `.close()`, or calling a destructor in C++.
>> 
>> Piotrek
>> 
>>> On 7 Jan 2019, at 10:41, Becket Qin <be...@gmail.com> wrote:
>>> 
>>> Happy New Year, everybody!
>>> 
>>> I would like to resume this discussion thread. At this point, we have
>>> agreed on the first step goal of interactive programming. The open
>>> discussion is the exact API. More specifically, what should the *cache()*
>>> method return and what is the semantic. There are three options:
>>> 
>>> *Option 1*
>>> *void cache()* OR *Table cache()* which returns the original table for
>>> chained calls.
>>> *void uncache()* releases the cache.
>>> *Table.hint(ignoreCache).foo()* to ignore cache for operation foo().
>>> 
>>> - Semantic: a.cache() hints that table 'a' should be cached. Optimizer
>>> decides whether the cache will be used or not.
>>> - pros: simple and no confusion between CachedTable and original table
>>> - cons: A table may be cached / uncached in a method invocation, while
>>> the caller does not know about this.
>>> 
>>> *Option 2*
>>> *CachedTable cache()*
>>> *CachedTable* extends *Table* with an additional *uncache()* method
>>> 
>>> - Semantic: After *val cachedA = a.cache()*, *cachedA.foo()* will always
>>> use the cache. *a.bar()* will always use the original DAG.
>>> - pros: No potential side effects in method invocation.
>>> - cons: Optimizer has no chance to kick in. Future optimization will
>>> become a behavior change and need users to change the code.
>>> 
>>> *Option 3*
>>> *CacheHandle cache()*
>>> *CacheHandle.release()* to release a cache handle on the table. If all
>>> cache handles are released, the cache could be removed.
>>> *Table.hint(ignoreCache).foo()* to ignore cache for operation foo().
>>> 
>>> - Semantic: *a.cache()* hints that 'a' should be cached. Optimizer
>>> decides whether the cache will be used or not. The cache is released
>>> either when no handle is on it, or when the user program exits.
>>> - pros: No potential side effect in method invocation. No confusion
>>> between the cached table vs. the original table.
>>> - cons: An additional CacheHandle exposed to the users.
>>> 
>>> 
>>> Personally I prefer option 3 for the following reasons:
>>> 1. It is simple. The vast majority of the users would just call
>>> *a.cache()* followed by *a.foo(),* *a.bar(), etc.*
>>> 2. There is no semantic ambiguity and no semantic change if we decide to
>>> add implicit cache in the future.
>>> 3. There is no side effect in the method calls.
>>> 4. Admittedly we need to expose one more CacheHandle class to the users.
>>> But it is not that difficult to understand given the similar well-known
>>> concept of a ref count (we can name it CacheReference if that is easier
>>> to understand). So I think it is fine.
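>>> 
>>> A rough sketch of what such a reference-counted handle could look like
>>> (names and details here are just an illustration, not a settled design):
>>> 
>>> import java.util.concurrent.atomic.AtomicInteger;
>>> 
>>> class CacheHandle {
>>>   // Shared across all handles of the same cached table; creating a
>>>   // handle (i.e. calling cache()) increments it.
>>>   private final AtomicInteger refCount;
>>>   private boolean released = false;
>>> 
>>>   CacheHandle(AtomicInteger refCount) {
>>>     this.refCount = refCount;
>>>     refCount.incrementAndGet();
>>>   }
>>> 
>>>   int release() {
>>>     if (!released) {
>>>       released = true;
>>>       int remaining = refCount.decrementAndGet();
>>>       if (remaining == 0) {
>>>         // Last handle released: the physical cache may now be dropped.
>>>       }
>>>       return remaining;
>>>     }
>>>     return refCount.get();
>>>   }
>>> }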
>>> 
>>> 
>>> Thanks,
>>> 
>>> Jiangjie (Becket) Qin
>>> 
>>> 
>>> On Thu, Dec 13, 2018 at 11:23 AM Becket Qin <be...@gmail.com> wrote:
>>> 
>>>> Hi Piotrek,
>>>> 
>>>> 1. Regarding optimization.
>>>> Sure, there are many cases where the decision is hard to make. But that
>>>> does not make it any easier for the users to make those decisions. I
>>>> imagine 99% of the users would just naively use cache. I am not saying
>>>> we can optimize in all the cases. But as long as we agree that at least
>>>> in certain cases (I would argue most cases) the optimizer can do a
>>>> little better than an average user who likely knows little about Flink
>>>> internals, we should not push the burden of optimization to users.
>>>> 
>>>> BTW, it seems some of your concerns are related to the implementation. I
>>>> did not mention the implementation of the caching service because that
>>>> should not affect the API semantic. Not sure if this helps, but imagine
>>>> the default implementation has one StorageNode service colocating with
>>>> each TM. It could be running within the TM process or in a standalone
>>>> process, depending on configuration.
>>>> 
>>>> The StorageNode uses a memory + spill-to-disk mechanism. The cached data
>>>> will just be written to the local StorageNode service. If the
>>>> StorageNode is running within the TM process, the in-memory cache could
>>>> just be objects so we save some serde cost. A later job referring to the
>>>> cached Table will be scheduled in a locality-aware manner, i.e. run in
>>>> the TM whose peer StorageNode hosts the data.
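>>>> 
>>>> As a toy sketch of that read path, under the assumptions above (all
>>>> class and method names here are hypothetical):
>>>> 
>>>> import java.io.IOException;
>>>> import java.nio.file.Path;
>>>> import java.util.HashMap;
>>>> import java.util.Iterator;
>>>> import java.util.List;
>>>> import java.util.Map;
>>>> 
>>>> class StorageNode {
>>>>   // In-memory entries keep deserialized objects, saving serde cost
>>>>   // when the StorageNode runs inside the TM process.
>>>>   private final Map<String, List<Object>> inMemory = new HashMap<>();
>>>>   // Spill directory holds serialized tables evicted from memory.
>>>>   private final Path spillDir;
>>>> 
>>>>   StorageNode(Path spillDir) {
>>>>     this.spillDir = spillDir;
>>>>   }
>>>> 
>>>>   Iterator<Object> read(String cachedTableId) throws IOException {
>>>>     List<Object> rows = inMemory.get(cachedTableId);
>>>>     if (rows != null) {
>>>>       return rows.iterator();
>>>>     }
>>>>     // Fall back to the spilled file; a locality-aware scheduler would
>>>>     // place this read on the TM colocated with this StorageNode.
>>>>     return readSpilled(spillDir.resolve(cachedTableId));
>>>>   }
>>>> 
>>>>   private Iterator<Object> readSpilled(Path file) throws IOException {
>>>>     // Deserialization elided in this sketch.
>>>>     throw new UnsupportedOperationException("sketch only");
>>>>   }
>>>> }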
>>>> 
>>>> 
>>>> 2. Semantic
>>>> I am not sure why introducing a new hintCache() or
>>>> env.enableAutomaticCaching() method would avoid the consequence of
>>>> semantic change.
>>>> 
>>>> If the auto optimization is not enabled by default, users still need to
>>>> make code changes to all existing programs in order to get the benefit.
>>>> If the auto optimization is enabled by default, advanced users who know
>>>> that they really want to use cache will suddenly lose the opportunity to
>>>> do so, unless they change the code to disable auto optimization.
>>>> 
>>>> 
>>>> 3. Side effect
>>>> The CacheHandle is not only about where to put uncache(). It solves the
>>>> implicit performance impact by moving the uncache() to the CacheHandle.
>>>> 
>>>>   - If users want to leverage cache, they can call a.cache(). After
>>>>   that, unless the user explicitly releases that CacheHandle, a.foo()
>>>>   will always leverage the cache if needed (the optimizer may choose to
>>>>   ignore the cache if that helps accelerate the process). Any function
>>>>   call will not be able to release the cache because it does not have
>>>>   that CacheHandle.
>>>>   - If some advanced users do not want to use cache at all, they will
>>>>   call a.hint(ignoreCache).foo(). This will for sure ignore the cache
>>>>   and use the original DAG to process.
>>>> 
>>>> 
>>>>> In vast majority of the cases, users wouldn't really care whether the
>>>>> cache is used or not.
>>>>> I wouldn’t agree with that, because “caching” (if not purely in-memory
>>>>> caching) would add additional IO costs. It’s similar to saying that
>>>>> users would not see a difference between Spark/Flink and MapReduce
>>>>> (MapReduce writes data to disks after every map/reduce stage).
>>>> 
>>>> What I wanted to say is that in most cases, after users call cache(),
>>>> they don't really care about whether auto optimization has decided to
>>>> ignore the cache or not, as long as the program runs faster.
>>>> 
>>>> Thanks,
>>>> 
>>>> Jiangjie (Becket) Qin
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> On Wed, Dec 12, 2018 at 10:50 PM Piotr Nowojski <piotr@data-artisans.com> wrote:
>>>> 
>>>>> Hi,
>>>>> 
>>>>> Thanks for the quick answer :)
>>>>> 
>>>>> Re 1.
>>>>> 
>>>>> I generally agree with you, however a couple of points:
>>>>> 
>>>>> a) the problem with using automatic caching is bigger, because you will
>>>>> have to decide how you compare IO vs CPU costs, and if you pick wrong,
>>>>> additional IO costs might be enormous or can even crash your system.
>>>>> This is a more difficult problem compared to, let's say, join
>>>>> reordering, where the only issue is to have good statistics that can
>>>>> capture correlations between columns (when you reorder joins, the
>>>>> number of IO operations does not change)
>>>>> c) your example is completely independent of caching.
>>>>> 
>>>>> A query like this:
>>>>> 
>>>>> src1.filter('f1 > 10).join(src2.filter('f2 < 30), 'f1 === 'f2).as('f3,
>>>>> …).filter('f3 > 30)
>>>>> 
>>>>> Should/could be optimised to an empty result immediately, without the
>>>>> need for any cache/materialisation, and that should work even without
>>>>> any statistics provided by the connector.
>>>>> 
>>>>> For me a prerequisite to any serious cost-based optimisations would be
>>>>> some reasonable benchmark coverage of the code (tpch?). Otherwise that
>>>>> would be the equivalent of adding untested code, since we wouldn’t be
>>>>> able to verify our assumptions, like how the writing of 10 000 records
>>>>> to a cache/RocksDB/Kafka/CSV file compares to the
>>>>> joining/filtering/processing of, let's say, 1 000 000 rows.
>>>>> 
>>>>> Re 2.
>>>>> 
>>>>> I wasn’t proposing to change the semantic later. I was proposing that
>>>>> we start now:
>>>>> 
>>>>> CachedTable cachedA = a.cache()
>>>>> cachedA.foo() // Cache is used
>>>>> a.bar() // Original DAG is used
>>>>> 
>>>>> And then later we can think about adding, for example,
>>>>> 
>>>>> CachedTable cachedA = a.hintCache()
>>>>> cachedA.foo() // Cache might be used
>>>>> a.bar() // Original DAG is used
>>>>> 
>>>>> Or
>>>>> 
>>>>> env.enableAutomaticCaching()
>>>>> a.foo() // Cache might be used
>>>>> a.bar() // Cache might be used
>>>>> 
>>>>> Or (I would still not like this option):
>>>>> 
>>>>> a.hintCache()
>>>>> a.foo() // Cache might be used
>>>>> a.bar() // Cache might be used
>>>>> 
>>>>> Or whatever else comes to our mind. Even if we add some automatic
>>>>> caching in the future, keeping explicit (`CachedTable cache()`) caching
>>>>> will still be useful, at least in some cases.
>>>>> 
>>>>> Re 3.
>>>>> 
>>>>>> 2. The source tables are immutable during one run of batch processing
>>>>>> logic.
>>>>>> 3. The cache is immutable during one run of batch processing logic.
>>>>> 
>>>>>> I think assumptions 2 and 3 are by definition what batch processing
>>>>>> means, i.e. the data must be complete before it is processed and
>>>>>> should not change when the processing is running.
>>>>> 
>>>>> I agree that this is how batch systems SHOULD be working. However I
>>>>> know from my previous experience that it’s not always the case.
>>>>> Sometimes users are just working on some non-transactional storage,
>>>>> which can be (either constantly or occasionally) modified by some other
>>>>> processes for whatever reasons (fixing the data, updating, adding new
>>>>> data etc).
>>>>> 
>>>>> But even if we ignore this point (data immutability), the performance
>>>>> side effect issue of your proposal remains. If a user calls `void
>>>>> a.cache()` deep inside some private method, it will have implicit side
>>>>> effects on other parts of his program that might not be obvious.
>>>>> 
>>>>> Re `CacheHandle`.
>>>>> 
>>>>> If I understand it correctly, it only addresses the issue of where to
>>>>> place the `uncache`/`dropCache` method.
>>>>> 
>>>>> Btw,
>>>>> 
>>>>>> In vast majority of the cases, users wouldn't really care whether the
>>>>>> cache is used or not.
>>>>> 
>>>>> I wouldn’t agree with that, because “caching” (if not purely in-memory
>>>>> caching) would add additional IO costs. It’s similar to saying that
>>>>> users would not see a difference between Spark/Flink and MapReduce
>>>>> (MapReduce writes data to disks after every map/reduce stage).
>>>>> 
>>>>> Piotrek
>>>>> 
>>>>>> On 12 Dec 2018, at 14:28, Becket Qin <be...@gmail.com> wrote:
>>>>>> 
>>>>>> Hi Piotrek,
>>>>>> 
>>>>>> Not sure if you noticed, but in my last email I was proposing
>>>>>> `CacheHandle cache()` to avoid the potential side effect due to
>>>>>> function calls.
>>>>>> 
>>>>>> Let's look at the disagreements in your reply one by one.
>>>>>> 
>>>>>> 
>>>>>> 1. Optimization chances
>>>>>> 
>>>>>> Optimization is never trivial work. This is exactly why we should not
>>>>>> let users do it manually. Databases have done a huge amount of work in
>>>>>> this area. At Alibaba, we rely heavily on many optimization rules to
>>>>>> boost the SQL query performance.
>>>>>> 
>>>>>> In your example, if I fill in the filter conditions in a certain way,
>>>>>> the optimization would become obvious.
>>>>>> 
>>>>>> Table src1 = … // read from connector 1
>>>>>> Table src2 = … // read from connector 2
>>>>>> 
>>>>>> Table a = src1.filter('f1 > 10).join(src2.filter('f2 < 30),
>>>>>> 'f1 === 'f2).as('f3, ...)
>>>>>> a.cache() // write cache to connector 3; when writing the records,
>>>>>> // remember min and max of 'f1
>>>>>> 
>>>>>> a.filter('f3 > 30) // There is no need to read from any connector
>>>>>> // because `a` does not contain any record whose 'f3 is greater
>>>>>> // than 30.
>>>>>> env.execute()
>>>>>> a.select(…)
>>>>>> 
>>>>>> BTW, it seems to me that adding some basic statistics is fairly
>>>>>> straightforward and the cost is pretty marginal, if not negligible. In
>>>>>> fact it is not only needed for optimization, but also for cases such
>>>>>> as ML, where some algorithms may need to decide their parameters based
>>>>>> on the statistics of the data.
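>>>>>> 
>>>>>> As a toy illustration of the kind of pruning such min/max statistics
>>>>>> enable (purely a sketch; names are made up):
>>>>>> 
>>>>>> // Per-column stats recorded while the cache is written.
>>>>>> class ColumnStats {
>>>>>>   final long min;
>>>>>>   final long max;
>>>>>>   ColumnStats(long min, long max) { this.min = min; this.max = max; }
>>>>>> }
>>>>>> 
>>>>>> // A filter `'f3 > bound` on the cached table is provably empty when
>>>>>> // the recorded max of 'f3 does not exceed the bound, so the scan can
>>>>>> // be skipped entirely.
>>>>>> static boolean filterIsProvablyEmpty(long bound, ColumnStats f3Stats) {
>>>>>>   return f3Stats.max <= bound;
>>>>>> }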
>>>>>> 
>>>>>> 
>>>>>> 2. Same API, one semantic now, another semantic later.
>>>>>> 
>>>>>> I am trying to understand what the semantic of the `CachedTable
>>>>>> cache()` you are proposing is. IMO, we should avoid designing an API
>>>>>> whose semantic will be changed later. If we have a "CachedTable
>>>>>> cache()" method, then the semantic should be very clearly defined
>>>>>> upfront and not change later. It should never be "right now let's go
>>>>>> with semantic 1, later we can silently change it to semantic 2 or 3".
>>>>>> Such a change could result in bad consequences. For example, let's say
>>>>>> we decide to go with semantic 1:
>>>>>> 
>>>>>> CachedTable cachedA = a.cache()
>>>>>> cachedA.foo() // Cache is used
>>>>>> a.bar() // Original DAG is used.
>>>>>> 
>>>>>> Now the majority of the users would be using cachedA.foo() in their
>>>>>> code. And some advanced users will use a.bar() to explicitly skip the
>>>>>> cache. Later on, we add smart optimization and change the semantic to
>>>>>> semantic 2:
>>>>>> 
>>>>>> CachedTable cachedA = a.cache()
>>>>>> cachedA.foo() // Cache is used
>>>>>> a.bar() // Cache MIGHT be used, and Flink may decide to skip cache if
>>>>>> // it is faster.
>>>>>> 
>>>>>> Now most of the users who were writing cachedA.foo() will not benefit
>>>>>> from this optimization at all, unless they change their code to use
>>>>>> a.foo() instead. And those advanced users suddenly lose the option to
>>>>>> explicitly ignore cache unless they change their code (assuming we
>>>>>> care enough to provide something like hint(useCache)). If we don't
>>>>>> define the semantic carefully, our users will have to change their
>>>>>> code again and again while they shouldn't have to.
>>>>>> 
>>>>>> 
>>>>>> 3. Side effect.
>>>>>> 
>>>>>> Before we talk about side effects, we have to agree on the
>>>>>> assumptions. The assumptions I have are the following:
>>>>>> 1. We are talking about batch processing.
>>>>>> 2. The source tables are immutable during one run of batch processing
>>>>>> logic.
>>>>>> 3. The cache is immutable during one run of batch processing logic.
>>>>>> 
>>>>>> I think assumptions 2 and 3 are by definition what batch processing
>>>>>> means, i.e. the data must be complete before it is processed and
>>>>>> should not change when the processing is running.
>>>>>> 
>>>>>> As far as I am aware, I don't know of any batch processing system
>>>>>> breaking those assumptions. Even for relational database tables, where
>>>>>> queries can run with concurrent modifications, the necessary locking
>>>>>> is still required to ensure the integrity of the query result.
>>>>>> 
>>>>>> Please let me know if you disagree with the above assumptions. If you
>>>>>> agree with these assumptions, with the `CacheHandle cache()` API in my
>>>>>> last email, do you still see side effects?
>>>>>> 
>>>>>> Thanks,
>>>>>> 
>>>>>> Jiangjie (Becket) Qin
>>>>>> 
>>>>>> 
>>>>>> On Wed, Dec 12, 2018 at 7:11 PM Piotr Nowojski <piotr@data-artisans.com> wrote:
>>>>>> 
>>>>>>> Hi Becket,
>>>>>>> 
>>>>>>>> Regarding the chance of optimization, it might not be that rare.
>>>>>>>> Some very simple statistics could already help in many cases. For
>>>>>>>> example, simply maintaining the max and min of each field can
>>>>>>>> already eliminate some unnecessary table scans (potentially
>>>>>>>> scanning the cached table) if the result is doomed to be empty. A
>>>>>>>> histogram would give even further information. The optimizer could
>>>>>>>> be very careful and only ignore the cache when it is 100% sure
>>>>>>>> doing that is cheaper, e.g. only when a filter on the cache will
>>>>>>>> absolutely return nothing.
>>>>>>> 
>>>>>>> I do not see how this might be easy to achieve. It would require tons
>>>>>>> of effort to make it work, and in the end you would still have a
>>>>>>> problem of comparing/trading CPU cycles vs IO. For example:
>>>>>>> 
>>>>>>> Table src1 = … // read from connector 1
>>>>>>> Table src2 = … // read from connector 2
>>>>>>> 
>>>>>>> Table a = src1.filter(…).join(src2.filter(…), …)
>>>>>>> a.cache() // write cache to connector 3
>>>>>>> 
>>>>>>> a.filter(…)
>>>>>>> env.execute()
>>>>>>> a.select(…)
>>>>>>> 
>>>>>>> Deciding whether it’s better to:
>>>>>>> A) read from connector1/connector2, filter/map and join them twice
>>>>>>> B) read from connector1/connector2, filter/map and join them once,
>>>>>>> pay the price of writing to connector 3 and then reading from it
>>>>>>> 
>>>>>>> is very far from trivial. `a` can end up much larger than `src1` and
>>>>>>> `src2`, writes to connector 3 might be extremely slow, reads from
>>>>>>> connector 3 can be slower compared to reads from connectors 1 & 2,
>>>>>>> … . You really need to have extremely good statistics to correctly
>>>>>>> assess the size of the output, and it would still be failing many
>>>>>>> times (correlations etc). And keep in mind that at the moment we do
>>>>>>> not have ANY statistics at all. More than that, it would require
>>>>>>> significantly more testing and setting up some benchmarks to make
>>>>>>> sure that we do not break it with some regressions.
>>>>>>> 
>>>>>>> That’s why I’m strongly opposing this idea - at least let’s not start
>>>>>>> with this. If we first start with completely manual/explicit caching,
>>>>>>> without any magic, it would be a significant improvement for the
>>>>>>> users for a fraction of the development cost. After implementing
>>>>>>> that, when we already have all of the working pieces, we can start
>>>>>>> working on some optimisation rules. As I wrote before, if we start
>>>>>>> with
>>>>>>> 
>>>>>>> `CachedTable cache()`
>>>>>>> 
>>>>>>> we can later work on follow-up stories to make it automatic. Despite
>>>>>>> that I don’t like this implicit/side effect approach with the `void`
>>>>>>> method, having an explicit `CachedTable cache()` wouldn’t even
>>>>>>> prevent us from later adding a `void hintCache()` method, with the
>>>>>>> exact semantic that you want.
>>>>>>> 
>>>>>>> On top of that I re-raise again that having an implicit `void
>>>>>>> cache()/hintCache()` has other side effects and problems with
>>>>>>> non-immutable data, and is annoying when used secretly inside
>>>>>>> methods.
>>>>>>> 
>>>>>>> An explicit `CachedTable cache()` just looks like a much less
>>>>>>> controversial MVP, and if we decide to go further with this topic,
>>>>>>> it’s not a wasted effort, but just lies on a straight path to more
>>>>>>> advanced/complicated solutions in the future. Are there any drawbacks
>>>>>>> of starting with `CachedTable cache()` that I’m missing?
>>>>>>> 
>>>>>>> Piotrek
>>>>>>> 
>>>>>>>> On 12 Dec 2018, at 09:30, Jeff Zhang <zj...@gmail.com> wrote:
>>>>>>>> 
>>>>>>>> Hi Becket,
>>>>>>>> 
>>>>>>>> Introducing CacheHandle seems too complicated. That means users have
>>>>>>>> to maintain the handle properly.
>>>>>>>> 
>>>>>>>> And since cache is just a hint for the optimizer, why not just
>>>>>>>> return the Table itself for the cache method. This hint info should
>>>>>>>> be kept in the Table, I believe.
>>>>>>>> 
>>>>>>>> So how about adding methods cache and uncache for Table, with both
>>>>>>>> returning Table. Because what cache and uncache do is just add some
>>>>>>>> hint info into the Table.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Becket Qin <be...@gmail.com> wrote on Wed, Dec 12, 2018 at
>>>>>>>> 11:25 AM:
>>>>>>>> 
>>>>>>>>> Hi Till and Piotrek,
>>>>>>>>> 
>>>>>>>>> Thanks for the clarification. That resolves quite a few
>>>>>>>>> confusions. My understanding of how cache works is the same as what
>>>>>>>>> Till described, i.e. cache() is a hint to Flink, but it is not
>>>>>>>>> guaranteed that the cache always exists and it might be recomputed
>>>>>>>>> from its lineage.
>>>>>>>>> 
>>>>>>>>>> Is this the core of our disagreement here? That you would like
>>>>>>>>>> this “cache()” to be mostly a hint for the optimiser?
>>>>>>>>> 
>>>>>>>>> Semantic wise, yes. That's also why I think materialize() has a
>>>>>>>>> much larger scope than cache(), thus it should be a different
>>>>>>>>> method.
>>>>>>>>> 
>>>>>>>>> Regarding the chance of optimization, it might not be that rare.
>>>>>>>>> Some very simple statistics could already help in many cases. For
>>>>>>>>> example, simply maintaining the max and min of each field can
>>>>>>>>> already eliminate some unnecessary table scans (potentially
>>>>>>>>> scanning the cached table) if the result is doomed to be empty. A
>>>>>>>>> histogram would give even further information. The optimizer could
>>>>>>>>> be very careful and only ignore the cache when it is 100% sure
>>>>>>>>> doing that is cheaper, e.g. only when a filter on the cache will
>>>>>>>>> absolutely return nothing.
>>>>>>>>> 
>>>>>>>>> Given the above clarification on cache, I would like to revisit the
>>>>>>>>> original "void cache()" proposal and see if we can improve on top
>>>>>>>>> of that.
>>>>>>>>> 
>>>>>>>>> What do you think about the following modified interface?
>>>>>>>>> 
>>>>>>>>> Table {
>>>>>>>>> /**
>>>>>>>>> * This call hints Flink to maintain a cache of this table and
>>>>>>>>> leverage it for performance optimization if needed.
>>>>>>>>> * Note that Flink may still decide to not use the cache if it is
>>>>>>>>> cheaper to do so.
>>>>>>>>> *
>>>>>>>>> * A CacheHandle will be returned to allow the user to release the
>>>>>>>>> cache actively. The cache will be deleted if there
>>>>>>>>> * are no unreleased cache handles to it. When the TableEnvironment
>>>>>>>>> is closed, the cache will also be deleted
>>>>>>>>> * and all the cache handles will be released.
>>>>>>>>> *
>>>>>>>>> * @return a CacheHandle referring to the cache of this table.
>>>>>>>>> */
>>>>>>>>> CacheHandle cache();
>>>>>>>>> }
>>>>>>>>> 
>>>>>>>>> CacheHandle {
>>>>>>>>> /**
>>>>>>>>> * Close the cache handle. This method does not necessarily delete
>>>>>>>>> the cache. Instead, it simply decrements the reference counter to
>>>>>>>>> the cache.
>>>>>>>>> * When there is no handle referring to a cache, the cache will be
>>>>>>>>> deleted.
>>>>>>>>> *
>>>>>>>>> * @return the number of open handles to the cache after this handle
>>>>>>>>> has been released.
>>>>>>>>> */
>>>>>>>>> int release()
>>>>>>>>> }
>>>>>>>>> 
>>>>>>>>> The rationale behind this interface is the following:
>>>>>>>>> In the vast majority of the cases, users wouldn't really care
>>>>>>>>> whether the cache is used or not. So I think the most intuitive way
>>>>>>>>> is letting cache() return nothing, so nobody needs to worry about
>>>>>>>>> the difference between operations on CachedTables and those on the
>>>>>>>>> "original" tables. This will make maybe 99.9% of the users happy.
>>>>>>>>> There were two concerns raised for this approach:
>>>>>>>>> 1. In some rare cases, users may want to ignore the cache,
>>>>>>>>> 2. A table might be cached/uncached in a third party function while
>>>>>>>>> the caller does not know about it.
>>>>>>>>> 
>>>>>>>>> For the first issue, users can use hint("ignoreCache") to
>>>>>>>>> explicitly ignore the cache.
>>>>>>>>> For the second issue, the above proposal lets cache() return a
>>>>>>>>> CacheHandle; the only method in it is release(). Different
>>>>>>>>> CacheHandles will refer to the same cache, and if a cache no longer
>>>>>>>>> has any cache handle, it will be deleted. This will address the
>>>>>>>>> following case:
>>>>>>>>> {
>>>>>>>>> val handle1 = a.cache()
>>>>>>>>> process(a)
>>>>>>>>> a.select(...) // cache is still available, handle1 has not been
>>>>>>>>> // released.
>>>>>>>>> }
>>>>>>>>> 
>>>>>>>>> void process(Table t) {
>>>>>>>>> val handle2 = t.cache() // new handle to cache
>>>>>>>>> t.select(...) // optimizer decides cache usage
>>>>>>>>> t.hint("ignoreCache").select(...) // cache is ignored
>>>>>>>>> handle2.release() // release the handle, but the cache may still be
>>>>>>>>> // available if there are other handles
>>>>>>>>> ...
>>>>>>>>> }
>>>>>>>>> 
>>>>>>>>> Does the above modified approach look reasonable to you?
>>>>>>>>> 
>>>>>>>>> Cheers,
>>>>>>>>> 
>>>>>>>>> Jiangjie (Becket) Qin
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>> On Tue, Dec 11, 2018 at 6:44 PM Till Rohrmann <trohrmann@apache.org> wrote:
>>>>>>>>> 
>>>>>>>>>> Hi Becket,
>>>>>>>>>> 
>>>>>>>>>> I was aiming at semantics similar to 1. I actually thought that
>>>>>>>>>> `cache()` would tell the system to materialize the intermediate
>>>>>>>>>> result so that subsequent queries don't need to reprocess it. This
>>>>>>>>>> means that the usage of the cached table in this example
>>>>>>>>>> 
>>>>>>>>>> {
>>>>>>>>>> val cachedTable = a.cache()
>>>>>>>>>> val b1 = cachedTable.select(…)
>>>>>>>>>> val b2 = cachedTable.foo().select(…)
>>>>>>>>>> val b3 = cachedTable.bar().select(...)
>>>>>>>>>> val c1 = a.select(…)
>>>>>>>>>> val c2 = a.foo().select(…)
>>>>>>>>>> val c3 = a.bar().select(...)
>>>>>>>>>> }
>>>>>>>>>> 
>>>>>>>>>> strongly depends on interleaved calls which trigger the execution
>>>>>>>>>> of sub queries. So for example, if there is only a single
>>>>>>>>>> env.execute call at the end of the block, then b1, b2, b3, c1, c2
>>>>>>>>>> and c3 would all be computed by reading directly from the sources
>>>>>>>>>> (given that there is only a single JobGraph). It just happens that
>>>>>>>>>> the result of `a` will be cached such that we skip the processing
>>>>>>>>>> of `a` when there are subsequent queries reading from
>>>>>>>>>> `cachedTable`. If for some reason the system cannot materialize
>>>>>>>>>> the table (e.g. running out of disk space, ttl expired), then it
>>>>>>>>>> could also happen that we need to reprocess `a`. In that sense
>>>>>>>>>> `cachedTable` simply is an identifier for the materialized result
>>>>>>>>>> of `a`, with the lineage of how to reprocess it.
>>>>>>>>>> 
>>>>>>>>>> Cheers,
>>>>>>>>>> Till
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>>> On Tue, Dec 11, 2018 at 11:01 AM Piotr Nowojski <piotr@data-artisans.com> wrote:
>>>>>>>>>> 
>>>>>>>>>>> Hi Becket,
>>>>>>>>>>> 
>>>>>>>>>>>> {
>>>>>>>>>>>> val cachedTable = a.cache()
>>>>>>>>>>>> val b = cachedTable.select(...)
>>>>>>>>>>>> val c = a.select(...)
>>>>>>>>>>>> }
>>>>>>>>>>>> 
>>>>>>>>>>>> Semantic 1. b uses cachedTable as the user demanded. c uses the
>>>>>>>>>>>> original DAG as the user demanded. In this case, the optimizer
>>>>>>>>>>>> has no chance to optimize.
>>>>>>>>>>>> Semantic 2. b uses cachedTable as the user demanded. c leaves
>>>>>>>>>>>> the optimizer to choose whether the cache or DAG should be used.
>>>>>>>>>>>> In this case, the user loses the option to NOT use the cache.
>>>>>>>>>>>> 
>>>>>>>>>>>> As you can see, neither of the options seems perfect. However, I
>>>>>>>>>>>> guess you and Till are proposing the third option:
>>>>>>>>>>>> 
>>>>>>>>>>>> Semantic 3. b leaves the optimizer to choose whether cache or
>>>>>>>>>>>> DAG should be used. c always uses the DAG.
>>>>>>>>>>> 
>>>>>>>>>>> I am pretty sure that me, Till, Fabian and others were all
>>>>>>>>>>> proposing and advocating in favour of semantic “1”. No cost-based
>>>>>>>>>>> optimiser decisions at all.
>>>>>>>>>>> 
>>>>>>>>>>> {
>>>>>>>>>>> val cachedTable = a.cache()
>>>>>>>>>>> val b1 = cachedTable.select(…)
>>>>>>>>>>> val b2 = cachedTable.foo().select(…)
>>>>>>>>>>> val b3 = cachedTable.bar().select(...)
>>>>>>>>>>> val c1 = a.select(…)
>>>>>>>>>>> val c2 = a.foo().select(…)
>>>>>>>>>>> val c3 = a.bar().select(...)
>>>>>>>>>>> }
>>>>>>>>>>> 
>>>>>>>>>>> All b1, b2 and b3 are reading from the cache, while c1, c2 and c3
>>>>>>>>>>> are re-executing the whole plan for “a”.
>>>>>>>>>>> 
>>>>>>>>>>> In the future we could discuss going one step further,
>>>>>>>>>>> introducing some global optimisation (that can be manually
>>>>>>>>>>> enabled/disabled): deduplicate plan nodes/deduplicate sub
>>>>>>>>>>> queries/re-use sub query results/or whatever we could call it. It
>>>>>>>>>>> could do two things:
>>>>>>>>>>> 
>>>>>>>>>>> 1. Automatically try to deduplicate fragments of the plan and
>>>>>>>>>>> share the result using CachedTable - in other words,
>>>>>>>>>>> automatically insert `CachedTable cache()` calls.
>>>>>>>>>>> 2. Automatically make the decision to bypass explicit
>>>>>>>>>>> `CachedTable` access (this would be the equivalent of what you
>>>>>>>>>>> described as “semantic 3”).
>>>>>>>>>>> 
>>>>>>>>>>> However as I wrote previously, I have big doubts whether such
>>>>>>>>>>> cost-based optimisation would work (this applies also to
>>>>>>>>>>> “Semantic 2”). I would expect it to do more harm than good in so
>>>>>>>>>>> many cases that it wouldn’t make sense. Even assuming that we
>>>>>>>>>>> calculate statistics perfectly (this ain’t gonna happen), it’s
>>>>>>>>>>> virtually impossible to correctly estimate the exchange rate of
>>>>>>>>>>> CPU cycles vs IO operations, as it changes so much from
>>>>>>>>>>> deployment to deployment.
>>>>>>>>>>> 
>>>>>>>>>>> Is this the core of our disagreement here? That you would like
>>>>>>>>>>> this “cache()” to be mostly a hint for the optimiser?
>>>>>>>>>>> 
>>>>>>>>>>> Piotrek
>>>>>>>>>>> 
>>>>>>>>>>>> On 11 Dec 2018, at 06:00, Becket Qin <be...@gmail.com>
>>>>> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>> Another potential concern for semantic 3 is that, in the future,
>>>>>>>>>>>> we may add automatic caching to Flink, e.g. cache the
>>>>>>>>>>>> intermediate results at the shuffle boundary. If our semantic is
>>>>>>>>>>>> that a reference to the original table means skipping the cache,
>>>>>>>>>>>> those users may not be able to benefit from the implicit cache.
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> On Tue, Dec 11, 2018 at 12:10 PM Becket Qin <becket.qin@gmail.com>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>>> Hi Piotrek,
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Thanks for the reply. Having thought about it again, I might
>>>>>>>>>>>>> have misunderstood your proposal in earlier emails. Returning a
>>>>>>>>>>>>> CachedTable might not be a bad idea.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> I was more concerned about the semantics and their intuitiveness
>>>>>>>>>>>>> when a CachedTable is returned, i.e., if cache() returns a
>>>>>>>>>>>>> CachedTable, what are the semantics in the following code:
>>>>>>>>>>>>> {
>>>>>>>>>>>>> val cachedTable = a.cache()
>>>>>>>>>>>>> val b = cachedTable.select(...)
>>>>>>>>>>>>> val c = a.select(...)
>>>>>>>>>>>>> }
>>>>>>>>>>>>> What is the difference between b and c? At first glance, I see
>>>>>>>>>>>>> two options:
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Semantic 1. b uses cachedTable as user demanded so. c uses the
>>>>>>>>>>>>> original DAG as user demanded so. In this case, the optimizer
>>>>>>>>>>>>> has no chance to optimize.
>>>>>>>>>>>>> Semantic 2. b uses cachedTable as user demanded so. c leaves the
>>>>>>>>>>>>> optimizer to choose whether the cache or DAG should be used. In
>>>>>>>>>>>>> this case, users lose the option to NOT use cache.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> As you can see, neither of the options seems perfect. However,
>>>>>>>>>>>>> I guess you and Till are proposing the third option:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Semantic 3. b leaves the optimizer to choose whether cache or
>>>>>>>>>>>>> DAG should be used. c always uses the DAG.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> This does address all the concerns. It is just that from an
>>>>>>>>>>>>> intuitiveness perspective, I found that asking users to
>>>>>>>>>>>>> explicitly use a CachedTable which the optimizer might choose to
>>>>>>>>>>>>> ignore is a little weird. That was why I did not think about
>>>>>>>>>>>>> that semantic. But given there is material benefit, I think this
>>>>>>>>>>>>> semantic is acceptable.
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 1. If we want to let the optimiser make decisions whether to
>>>>>>>>>>>>>> use the cache or not, then why do we need a “void cache()”
>>>>>>>>>>>>>> method at all? Would it “increase” the chance of using the
>>>>>>>>>>>>>> cache? That sounds strange. What would be the mechanism of
>>>>>>>>>>>>>> deciding whether to use the cache or not? If we want to
>>>>>>>>>>>>>> introduce such kind of automated optimisations of “plan nodes
>>>>>>>>>>>>>> deduplication” I would turn it on globally, not per table, and
>>>>>>>>>>>>>> let the optimiser do all of the work.
>>>>>>>>>>>>>> 2. We do not have statistics at the moment for any use/not use
>>>>>>>>>>>>>> cache decision.
>>>>>>>>>>>>>> 3. Even if we had, I would be veeerryy sceptical whether such
>>>>>>>>>>>>>> cost based optimisations would work properly and I would still
>>>>>>>>>>>>>> insist first on providing an explicit caching mechanism
>>>>>>>>>>>>>> (`CachedTable cache()`)
>>>>>>>>>>>>>> 
>>>>>>>>>>>>> We are absolutely on the same page here. An explicit cache()
>>>>>>>>>>>>> method is necessary not only because the optimizer may not be
>>>>>>>>>>>>> able to make the right decision, but also because of the nature
>>>>>>>>>>>>> of interactive programming. For example, if users write the
>>>>>>>>>>>>> following code in the Scala shell:
>>>>>>>>>>>>> val b = a.select(...)
>>>>>>>>>>>>> val c = b.select(...)
>>>>>>>>>>>>> val d = c.select(...).writeToSink(...)
>>>>>>>>>>>>> tEnv.execute()
>>>>>>>>>>>>> There is no way the optimizer will know whether b or c will be
>>>>>>>>>>>>> used in later code, unless users hint explicitly.
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> At the same time I’m not sure if you have responded to our
>>>>>>>>>>>>>> objections of `void cache()` being implicit/having side
>>>>>>>>>>>>>> effects, which me, Jark, Fabian, Till and I think also Shaoxuan
>>>>>>>>>>>>>> are supporting.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Are there any other side effects if we use semantic 3 mentioned
>>>>>>>>>>>>> above?
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Jiangjie (Becket) Qin
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On Mon, Dec 10, 2018 at 7:54 PM Piotr Nowojski <piotr@data-artisans.com>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Hi Becket,
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Sorry for not responding for a long time.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Regarding case1.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> There wouldn’t be an “a.unCache()” method, but I would expect
>>>>>>>>>>>>>> only `cachedTableA1.dropCache()`. Dropping `cachedTableA1`
>>>>>>>>>>>>>> wouldn’t affect `cachedTableA2`. Just as in any other database,
>>>>>>>>>>>>>> dropping/modifying one independent table/materialised view does
>>>>>>>>>>>>>> not affect others.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> What I meant is that assuming there is already a cached table,
>>>>>>>>>>>>>>> ideally users need not specify whether the next query should
>>>>>>>>>>>>>>> read from the cache or use the original DAG. This should be
>>>>>>>>>>>>>>> decided by the optimizer.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 1. If we want to let the optimiser make decisions whether to
>>>>>>>>>>>>>> use the cache or not, then why do we need a “void cache()”
>>>>>>>>>>>>>> method at all? Would it “increase” the chance of using the
>>>>>>>>>>>>>> cache? That sounds strange. What would be the mechanism of
>>>>>>>>>>>>>> deciding whether to use the cache or not? If we want to
>>>>>>>>>>>>>> introduce such kind of automated optimisations of “plan nodes
>>>>>>>>>>>>>> deduplication” I would turn it on globally, not per table, and
>>>>>>>>>>>>>> let the optimiser do all of the work.
>>>>>>>>>>>>>> 2. We do not have statistics at the moment for any use/not use
>>>>>>>>>>>>>> cache decision.
>>>>>>>>>>>>>> 3. Even if we had, I would be veeerryy sceptical whether such
>>>>>>>>>>>>>> cost based optimisations would work properly and I would still
>>>>>>>>>>>>>> insist first on providing an explicit caching mechanism
>>>>>>>>>>>>>> (`CachedTable cache()`)
>>>>>>>>>>>>>> 4. As Till wrote, having an explicit `CachedTable cache()`
>>>>>>>>>>>>>> doesn’t contradict future work on automated cost based caching.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> At the same time I’m not sure if you have responded to our
>>>>>>>>>>>>>> objections of `void cache()` being implicit/having side
>>>>>>>>>>>>>> effects, which me, Jark, Fabian, Till and I think also Shaoxuan
>>>>>>>>>>>>>> are supporting.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Piotrek
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On 5 Dec 2018, at 12:42, Becket Qin <be...@gmail.com> wrote:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Hi Till,
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> It is true that after the first job submission, there will be
>>>>>>>>>>>>>>> no ambiguity in terms of whether a cached table is used or
>>>>>>>>>>>>>>> not. That is the same for the cache() without returning a
>>>>>>>>>>>>>>> CachedTable.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Conceptually one could think of cache() as introducing a
>>>>>>>>>>>>>>>> caching operator from which you need to consume if you want
>>>>>>>>>>>>>>>> to benefit from the caching functionality.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> I am thinking a little differently. I think it is a hint (as
>>>>>>>>>>>>>>> you mentioned later) instead of a new operator. I'd like to be
>>>>>>>>>>>>>>> careful about the semantics of the API. A hint is a property
>>>>>>>>>>>>>>> set on an existing operator, but it is not itself an operator
>>>>>>>>>>>>>>> as it does not really manipulate the data.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> I agree, ideally the optimizer makes this kind of decision
>>>>>>>>>>>>>>>> which intermediate result should be cached. But especially
>>>>>>>>>>>>>>>> when executing ad-hoc queries the user might better know
>>>>>>>>>>>>>>>> which results need to be cached because Flink might not see
>>>>>>>>>>>>>>>> the full DAG. In that sense, I would consider the cache()
>>>>>>>>>>>>>>>> method as a hint for the optimizer. Of course, in the future
>>>>>>>>>>>>>>>> we might add functionality which tries to automatically cache
>>>>>>>>>>>>>>>> results (e.g. caching the latest intermediate results until
>>>>>>>>>>>>>>>> so and so much space is used). But this should hopefully not
>>>>>>>>>>>>>>>> contradict with `CachedTable cache()`.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> I agree that the cache() method is needed for exactly the
>>>>>>>>>>>>>>> reason you mentioned, i.e. Flink cannot predict what users are
>>>>>>>>>>>>>>> going to write later, so users need to tell Flink explicitly
>>>>>>>>>>>>>>> that this table will be used later. What I meant is that
>>>>>>>>>>>>>>> assuming there is already a cached table, ideally users need
>>>>>>>>>>>>>>> not specify whether the next query should read from the cache
>>>>>>>>>>>>>>> or use the original DAG. This should be decided by the
>>>>>>>>>>>>>>> optimizer.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> To explain the difference between returning / not returning a
>>>>>>>>>>>>>>> CachedTable, I want to compare the following two cases:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> *Case 1:  returning a CachedTable*
>>>>>>>>>>>>>>> b = a.map(...)
>>>>>>>>>>>>>>> val cachedTableA1 = a.cache()
>>>>>>>>>>>>>>> val cachedTableA2 = a.cache()
>>>>>>>>>>>>>>> b.print() // Just to make sure a is cached.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> c = a.filter(...) // User specifies that the original DAG is
>>>>>>>>>>>>>>> used? Or the optimizer decides whether DAG or cache should be
>>>>>>>>>>>>>>> used?
>>>>>>>>>>>>>>> d = cachedTableA1.filter() // User specifies that the cached
>>>>>>>>>>>>>>> table is used.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> a.unCache() // Can cachedTableA still be used afterwards?
>>>>>>>>>>>>>>> cachedTableA1.uncache() // Can cachedTableA2 still be used?
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> *Case 2: not returning a CachedTable*
>>>>>>>>>>>>>>> b = a.map()
>>>>>>>>>>>>>>> a.cache()
>>>>>>>>>>>>>>> a.cache() // no-op
>>>>>>>>>>>>>>> b.print() // Just to make sure a is cached
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> c = a.filter(...) // Optimizer decides whether the cache or
>>>>>>>>>>>>>>> DAG should be used
>>>>>>>>>>>>>>> d = a.filter(...) // Optimizer decides whether the cache or
>>>>>>>>>>>>>>> DAG should be used
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> a.unCache()
>>>>>>>>>>>>>>> a.unCache() // no-op
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> In case 1, semantics-wise, the optimizer loses the option to
>>>>>>>>>>>>>>> choose between DAG and cache. And the unCache() call becomes
>>>>>>>>>>>>>>> tricky.
>>>>>>>>>>>>>>> In case 2, users do not need to worry about whether cache or
>>>>>>>>>>>>>>> DAG is used. And the unCache() semantics are clear. However,
>>>>>>>>>>>>>>> the caveat is that users cannot explicitly ignore the cache.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> In order to address the issues mentioned in case 2, and
>>>>>>>>>>>>>>> inspired by the discussion so far, I am thinking about using a
>>>>>>>>>>>>>>> hint to allow users to explicitly ignore the cache. Although
>>>>>>>>>>>>>>> we do not have hints yet, we probably should have them. So the
>>>>>>>>>>>>>>> code becomes:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> *Case 3: returning this table*
>>>>>>>>>>>>>>> b = a.map()
>>>>>>>>>>>>>>> a.cache()
>>>>>>>>>>>>>>> a.cache() // no-op
>>>>>>>>>>>>>>> b.print() // Just to make sure a is cached
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> c = a.filter(...) // Optimizer decides whether the cache or
>>>>>>>>>>>>>>> DAG should be used
>>>>>>>>>>>>>>> d = a.hint("ignoreCache").filter(...) // DAG will be used
>>>>>>>>>>>>>>> instead of the cache.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> a.unCache()
>>>>>>>>>>>>>>> a.unCache() // no-op
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> We could also let cache() return this table to allow chained
>>>>>>>>>>>>>>> method calls.
>>>>>>>>>>>>>>> Do you think this API addresses the concerns?
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Jiangjie (Becket) Qin
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> On Wed, Dec 5, 2018 at 10:55 AM Jark Wu <im...@gmail.com>
>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> All the recent discussions are focused on whether there is a
>>>>>>>>>> problem
>>>>>>>>>>> if
>>>>>>>>>>>>>>>> cache() not return a Table.
>>>>>>>>>>>>>>>> It seems that returning a Table explicitly is more clear
>> (and
>>>>>>>>>> safe?).
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> So whether there are any problems if cache() returns a
>> Table?
>>>>>>>>>>> @Becket
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>> Jark
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> On Tue, 4 Dec 2018 at 22:27, Till Rohrmann <
>>>>> trohrmann@apache.org
>>>>>>>>>> 
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> It's true that b, c, d and e will all read from the
>> original
>>>>> DAG
>>>>>>>>>>> that
>>>>>>>>>>>>>>>>> generates a. But all subsequent operators (when running
>>>>> multiple
>>>>>>>>>>>>>> queries)
>>>>>>>>>>>>>>>>> which reference cachedTableA should not need to reproduce
>> `a`
>>>>>>>>> but
>>>>>>>>>>>>>>>> directly
>>>>>>>>>>>>>>>>> consume the intermediate result.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Conceptually one could think of cache() as introducing a
>>>>> caching
>>>>>>>>>>>>>> operator
>>>>>>>>>>>>>>>>> from which you need to consume from if you want to benefit
>>>>> from
>>>>>>>>>> the
>>>>>>>>>>>>>>>> caching
>>>>>>>>>>>>>>>>> functionality.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> I agree, ideally the optimizer makes this kind of decision
>>>>> which
>>>>>>>>>>>>>>>>> intermediate result should be cached. But especially when
>>>>>>>>>> executing
>>>>>>>>>>>>>>>> ad-hoc
>>>>>>>>>>>>>>>>> queries the user might better know which results need to be
>>>>>>>>> cached
>>>>>>>>>>>>>>>> because
>>>>>>>>>>>>>>>>> Flink might not see the full DAG. In that sense, I would
>>>>>>>>> consider
>>>>>>>>>>> the
>>>>>>>>>>>>>>>>> cache() method as a hint for the optimizer. Of course, in
>> the
>>>>>>>>>> future
>>>>>>>>>>>>>> we
>>>>>>>>>>>>>>>>> might add functionality which tries to automatically cache
>>>>>>>>> results
>>>>>>>>>>>>>> (e.g.
>>>>>>>>>>>>>>>>> caching the latest intermediate results until so and so
>> much
>>>>>>>>> space
>>>>>>>>>>> is
>>>>>>>>>>>>>>>>> used). But this should hopefully not contradict with
>>>>>>>>> `CachedTable
>>>>>>>>>>>>>>>> cache()`.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>>>> Till
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> On Tue, Dec 4, 2018 at 2:33 PM Becket Qin <
>>>>> becket.qin@gmail.com
>>>>>>>>>> 
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Hi Till,
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Thanks for the clarification. I am still a little
>> confused.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> If cache() returns a CachedTable, the example might
>> become:
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> b = a.map(...)
>>>>>>>>>>>>>>>>>> c = a.map(...)
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> cachedTableA = a.cache()
>>>>>>>>>>>>>>>>>> d = cachedTableA.map(...)
>>>>>>>>>>>>>>>>>> e = a.map()
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> In the above case, if cache() is lazily evaluated, b, c, d
>>>>>>>>>>>>>>>>>> and e are all going to be reading from the original DAG
>>>>>>>>>>>>>>>>>> that generates a. But with a naive expectation, d should be
>>>>>>>>>>>>>>>>>> reading from the cache. This does not seem to solve the
>>>>>>>>>>>>>>>>>> potential confusion you raised, right?
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Just to be clear, my understanding is all based on the
>>>>>>>>>>>>>>>>>> assumption that the tables are immutable. Therefore, after
>>>>>>>>>>>>>>>>>> a.cache(), the *cachedTableA* and the original table *a*
>>>>>>>>>>>>>>>>>> should be completely interchangeable.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> That said, I think a valid argument is optimization. There
>>>>>>>>>>>>>>>>>> are indeed cases where reading from the original DAG could
>>>>>>>>>>>>>>>>>> be faster than reading from the cache. For example:
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> a.filter('f1 > 100)
>>>>>>>>>>>>>>>>>> a.cache()
>>>>>>>>>>>>>>>>>> b = a.filter('f1 < 100)
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Ideally the optimizer should be intelligent enough to
>>>>>>>>>>>>>>>>>> decide which way is faster, without user intervention. In
>>>>>>>>>>>>>>>>>> this case, it will identify that b would just be an empty
>>>>>>>>>>>>>>>>>> table, and thus skip reading from the cache completely. But
>>>>>>>>>>>>>>>>>> I agree that returning a CachedTable would give the user
>>>>>>>>>>>>>>>>>> control of when to use the cache, even though I still feel
>>>>>>>>>>>>>>>>>> that letting the optimizer handle this is a better option
>>>>>>>>>>>>>>>>>> in the long run.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Jiangjie (Becket) Qin
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> On Tue, Dec 4, 2018 at 6:51 PM Till Rohrmann <
>>>>>>>>>> trohrmann@apache.org
>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Yes you are right Becket that it still depends on the
>>>>> actual
>>>>>>>>>>>>>>>> execution
>>>>>>>>>>>>>>>>> of
>>>>>>>>>>>>>>>>>>> the job whether a consumer reads from a cached result or
>>>>> not.
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> My point was actually about the properties of a (cached
>> vs.
>>>>>>>>>>>>>>>> non-cached)
>>>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>>> not about the execution. I would not make cache trigger
>> the
>>>>>>>>>>>>>> execution
>>>>>>>>>>>>>>>>> of
>>>>>>>>>>>>>>>>>>> the job because one loses some flexibility by eagerly
>>>>>>>>> triggering
>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>> execution.
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> I tried to argue for an explicit CachedTable which is
>>>>> returned
>>>>>>>>>> by
>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>> cache() method like Piotr did in order to make the API
>> more
>>>>>>>>>>>>>> explicit.
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>>>>>> Till
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> On Mon, Dec 3, 2018 at 4:23 PM Becket Qin <
>>>>>>>>> becket.qin@gmail.com
>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> Hi Till,
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> That is a good example. Just a minor correction, in this
>>>>>>>>> case,
>>>>>>>>>>> b, c
>>>>>>>>>>>>>>>>>> and d
>>>>>>>>>>>>>>>>>>>> will all consume from a non-cached a. This is because
>>>>> cache
>>>>>>>>>> will
>>>>>>>>>>>>>>>> only
>>>>>>>>>>>>>>>>>> be
>>>>>>>>>>>>>>>>>>>> created on the very first job submission that generates
>>>>> the
>>>>>>>>>> table
>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>> be
>>>>>>>>>>>>>>>>>>>> cached.
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> If I understand correctly, this is example is about
>>>>> whether
>>>>>>>>>>>>>>>> .cache()
>>>>>>>>>>>>>>>>>>> method
>>>>>>>>>>>>>>>>>>>> should be eagerly evaluated or lazily evaluated. In
>>>>> another
>>>>>>>>>> word,
>>>>>>>>>>>>>>>> if
>>>>>>>>>>>>>>>>>>>> cache() method actually triggers a job that creates the
>>>>>>>>> cache,
>>>>>>>>>>>>>>>> there
>>>>>>>>>>>>>>>>>> will
>>>>>>>>>>>>>>>>>>>> be no such confusion. Is that right?
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> In the example, although d will not consume from the
>>>>> cached
>>>>>>>>>> Table
>>>>>>>>>>>>>>>>> while
>>>>>>>>>>>>>>>>>>> it
>>>>>>>>>>>>>>>>>>>> looks supposed to, from correctness perspective the code
>>>>> will
>>>>>>>>>>> still
>>>>>>>>>>>>>>>>>>> return
>>>>>>>>>>>>>>>>>>>> correct result, assuming that tables are immutable.
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> Personally I feel it is OK because users probably won't
>>>>>>>>> really
>>>>>>>>>>>>>>>> worry
>>>>>>>>>>>>>>>>>>> about
>>>>>>>>>>>>>>>>>>>> whether the table is cached or not. And lazy cache could
>>>>>>>>> avoid
>>>>>>>>>>> some
>>>>>>>>>>>>>>>>>>>> unnecessary caching if a cached table is never created
>> in
>>>>> the
>>>>>>>>>>> user
>>>>>>>>>>>>>>>>>>>> application. But I am not opposed to do eager evaluation
>>>>> of
>>>>>>>>>>> cache.
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> Jiangjie (Becket) Qin
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> On Mon, Dec 3, 2018 at 10:01 PM Till Rohrmann <
>>>>>>>>>>>>>>>> trohrmann@apache.org>
>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> Another argument for Piotr's point is that lazily
>>>>> changing
>>>>>>>>>>>>>>>>> properties
>>>>>>>>>>>>>>>>>>> of
>>>>>>>>>>>>>>>>>>>> a
>>>>>>>>>>>>>>>>>>>>> node affects all down stream consumers but does not
>>>>>>>>>> necessarily
>>>>>>>>>>>>>>>>> have
>>>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>>>> happen before these consumers are defined. From a
>> user's
>>>>>>>>>>>>>>>>> perspective
>>>>>>>>>>>>>>>>>>> this
>>>>>>>>>>>>>>>>>>>>> can be quite confusing:
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> b = a.map(...)
>>>>>>>>>>>>>>>>>>>>> c = a.map(...)
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> a.cache()
>>>>>>>>>>>>>>>>>>>>> d = a.map(...)
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> now b, c and d will consume from a cached operator. In
>>>>> this
>>>>>>>>>>> case,
>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>> user
>>>>>>>>>>>>>>>>>>>>> would most likely expect that only d reads from a
>> cached
>>>>>>>>>> result.
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>>>>>>>> Till
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> On Mon, Dec 3, 2018 at 11:32 AM Piotr Nowojski <
>>>>>>>>>>>>>>>>>>> piotr@data-artisans.com>
>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> Hey Shaoxuan and Becket,
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> Can you explain a bit more one what are the side
>>>>> effects?
>>>>>>>>> So
>>>>>>>>>>>>>>>>> far
>>>>>>>>>>>>>>>>>> my
>>>>>>>>>>>>>>>>>>>>>>> understanding is that such side effects only exist
>> if a
>>>>>>>>>> table
>>>>>>>>>>>>>>>>> is
>>>>>>>>>>>>>>>>>>>>> mutable.
>>>>>>>>>>>>>>>>>>>>>>> Is that the case?
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> Not only that. There are also performance implications
>>>>> and
>>>>>>>>>>>>>>>> those
>>>>>>>>>>>>>>>>>> are
>>>>>>>>>>>>>>>>>>>>>> another implicit side effects of using `void cache()`.
>>>>> As I
>>>>>>>>>>>>>>>> wrote
>>>>>>>>>>>>>>>>>>>> before,
>>>>>>>>>>>>>>>>>>>>>> reading from cache might not always be desirable, thus
>>>>> it
>>>>>>>>> can
>>>>>>>>>>>>>>>>> cause
>>>>>>>>>>>>>>>>>>>>>> performance degradation and I’m fine with that -
>> user's
>>>>> or
>>>>>>>>>>>>>>>>>>> optimiser’s
>>>>>>>>>>>>>>>>>>>>>> choice. What I do not like is that this implicit side
>>>>>>>>> effect
>>>>>>>>>>>>>>>> can
>>>>>>>>>>>>>>>>>>>> manifest
>>>>>>>>>>>>>>>>>>>>>> in completely different part of code, that wasn’t
>>>>> touched
>>>>>>>>> by
>>>>>>>>>> a
>>>>>>>>>>>>>>>>> user
>>>>>>>>>>>>>>>>>>>> while
>>>>>>>>>>>>>>>>>>>>>> he was adding `void cache()` call somewhere else. And
>>>>> even
>>>>>>>>> if
>>>>>>>>>>>>>>>>>> caching
>>>>>>>>>>>>>>>>>>>>>> improves performance, it’s still a side effect of
>> `void
>>>>>>>>>>>>>>>> cache()`.
>>>>>>>>>>>>>>>>>>>> Almost
>>>>>>>>>>>>>>>>>>>>>> from the definition `void` methods have only side
>>>>> effects.
>>>>>>>>>> As I
>>>>>>>>>>>>>>>>>> wrote
>>>>>>>>>>>>>>>>>>>>>> before, there are couple of scenarios where this might
>>>>> be
>>>>>>>>>>>>>>>>>> undesirable
>>>>>>>>>>>>>>>>>>>>>> and/or unexpected, for example:
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> 1.
>>>>>>>>>>>>>>>>>>>>>> Table b = …;
>>>>>>>>>>>>>>>>>>>>>> b.cache()
>>>>>>>>>>>>>>>>>>>>>> x = b.join(…)
>>>>>>>>>>>>>>>>>>>>>> y = b.count()
>>>>>>>>>>>>>>>>>>>>>> // ...
>>>>>>>>>>>>>>>>>>>>>> // 100
>>>>>>>>>>>>>>>>>>>>>> // hundred
>>>>>>>>>>>>>>>>>>>>>> // lines
>>>>>>>>>>>>>>>>>>>>>> // of
>>>>>>>>>>>>>>>>>>>>>> // code
>>>>>>>>>>>>>>>>>>>>>> // later
>>>>>>>>>>>>>>>>>>>>>> z = b.filter(…).groupBy(…) // this might be even
>> hidden
>>>>> in
>>>>>>>>> a
>>>>>>>>>>>>>>>>>>> different
>>>>>>>>>>>>>>>>>>>>>> method/file/package/dependency
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> 2.
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> Table b = ...
>>>>>>>>>>>>>>>>>>>>>> If (some_condition) {
>>>>>>>>>>>>>>>>>>>>>> foo(b)
>>>>>>>>>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>>>>>>>>> Else {
>>>>>>>>>>>>>>>>>>>>>> bar(b)
>>>>>>>>>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>>>>>>>>> z = b.filter(…).groupBy(…)
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> Void foo(Table b) {
>>>>>>>>>>>>>>>>>>>>>> b.cache()
>>>>>>>>>>>>>>>>>>>>>> // do something with b
>>>>>>>>>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> In both above examples, `b.cache()` will implicitly
>>>>> affect
>>>>>>>>>>>>>>>>>> (semantic
>>>>>>>>>>>>>>>>>>>> of a
>>>>>>>>>>>>>>>>>>>>>> program in case of sources being mutable and
>>>>> performance)
>>>>>>>>> `z
>>>>>>>>>> =
>>>>>>>>>>>>>>>>>>>>>> b.filter(…).groupBy(…)` which might be far from
>> obvious.
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> On top of that, there is still this argument of mine
>>>>> that
>>>>>>>>>>>>>>>> having
>>>>>>>>>>>>>>>>> a
>>>>>>>>>>>>>>>>>>>>>> `MaterializedTable` or `CachedTable` handle is more
>>>>>>>>> flexible
>>>>>>>>>>>>>>>> for
>>>>>>>>>>>>>>>>> us
>>>>>>>>>>>>>>>>>>> for
>>>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>> future and for the user (as a manual option to bypass
>>>>> cache
>>>>>>>>>>>>>>>>> reads).
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> But Jiangjie is correct,
>>>>>>>>>>>>>>>>>>>>>>> the source table in batching should be immutable. It
>> is
>>>>>>>>> the
>>>>>>>>>>>>>>>>>> user’s
>>>>>>>>>>>>>>>>>>>>>>> responsibility to ensure it, otherwise even a regular
>>>>>>>>>>>>>>>> failover
>>>>>>>>>>>>>>>>>> may
>>>>>>>>>>>>>>>>>>>> lead
>>>>>>>>>>>>>>>>>>>>>>> to inconsistent results.
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> Yes, I agree that’s what perfect world/good deployment
>>>>>>>>> should
>>>>>>>>>>>>>>>> be.
>>>>>>>>>>>>>>>>>> But
>>>>>>>>>>>>>>>>>>>> its
>>>>>>>>>>>>>>>>>>>>>> often isn’t and while I’m not trying to fix this
>> (since
>>>>> the
>>>>>>>>>>>>>>>>> proper
>>>>>>>>>>>>>>>>>>> fix
>>>>>>>>>>>>>>>>>>>> is
>>>>>>>>>>>>>>>>>>>>>> to support transactions), I’m just trying to minimise
>>>>>>>>>> confusion
>>>>>>>>>>>>>>>>> for
>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>> users that are not fully aware what’s going on and
>>>>> operate
>>>>>>>>> in
>>>>>>>>>>>>>>>>> less
>>>>>>>>>>>>>>>>>>> then
>>>>>>>>>>>>>>>>>>>>>> perfect setup. And if something bites them after
>> adding
>>>>>>>>>>>>>>>>> `b.cache()`
>>>>>>>>>>>>>>>>>>>> call,
>>>>>>>>>>>>>>>>>>>>>> to make sure that they at least know all of the places
>>>>> that
>>>>>>>>>>>>>>>>> adding
>>>>>>>>>>>>>>>>>>> this
>>>>>>>>>>>>>>>>>>>>>> line can affect.
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> Thanks, Piotrek
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> On 1 Dec 2018, at 15:39, Becket Qin <
>>>>> becket.qin@gmail.com
>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> Hi Piotrek,
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> Thanks again for the clarification. Some more replies
>>>>> are
>>>>>>>>>>>>>>>>>>> following.
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> But keep in mind that `.cache()` will/might not only
>> be
>>>>>>>>> used
>>>>>>>>>>>>>>>> in
>>>>>>>>>>>>>>>>>>>>>> interactive
>>>>>>>>>>>>>>>>>>>>>>>> programming and not only in batching.
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> It is true. Actually in stream processing, cache()
>> has
>>>>> the
>>>>>>>>>>>>>>>> same
>>>>>>>>>>>>>>>>>>>>> semantic
>>>>>>>>>>>>>>>>>>>>>> as
>>>>>>>>>>>>>>>>>>>>>>> batch processing. The semantic is following:
>>>>>>>>>>>>>>>>>>>>>>> For a table created via a series of computation, save
>>>>> that
>>>>>>>>>>>>>>>>> table
>>>>>>>>>>>>>>>>>>> for
>>>>>>>>>>>>>>>>>>>>>> later
>>>>>>>>>>>>>>>>>>>>>>> reference to avoid running the computation logic to
>>>>>>>>>>>>>>>> regenerate
>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>> table.
>>>>>>>>>>>>>>>>>>>>>>> Once the application exits, drop all the cache.
>>>>>>>>>>>>>>>>>>>>>>> This semantic is same for both batch and stream
>>>>>>>>> processing.
>>>>>>>>>>>>>>>> The
>>>>>>>>>>>>>>>>>>>>>> difference
>>>>>>>>>>>>>>>>>>>>>>> is that stream applications will only run once as
>> they
>>>>> are
>>>>>>>>>>>>>>>> long
>>>>>>>>>>>>>>>>>>>>> running.
>>>>>>>>>>>>>>>>>>>>>>> And the batch applications may be run multiple times,
>>>>>>>>> hence
>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>> cache
>>>>>>>>>>>>>>>>>>>>> may
>>>>>>>>>>>>>>>>>>>>>>> be created and dropped each time the application
>> runs.
>>>>>>>>>>>>>>>>>>>>>>> Admittedly, there will probably be some resource
>>>>>>>>> management
>>>>>>>>>>>>>>>>>>>>> requirements
>>>>>>>>>>>>>>>>>>>>>>> for the streaming cached table, such as time based /
>>>>> size
>>>>>>>>>>>>>>>> based
>>>>>>>>>>>>>>>>>>>>>> retention,
>>>>>>>>>>>>>>>>>>>>>>> to address the infinite data issue. But such
>>>>> requirement
>>>>>>>>>> does
>>>>>>>>>>>>>>>>> not
>>>>>>>>>>>>>>>>>>>>> change
>>>>>>>>>>>>>>>>>>>>>>> the semantic.
>>>>>>>>>>>>>>>>>>>>>>> You are right that interactive programming is just
>> one
>>>>> use
>>>>>>>>>>>>>>>> case
>>>>>>>>>>>>>>>>>> of
>>>>>>>>>>>>>>>>>>>>>> cache().
>>>>>>>>>>>>>>>>>>>>>>> It is not the only use case.
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> For me the more important issue is of not having the
>>>>> `void
>>>>>>>>>>>>>>>>>> cache()`
>>>>>>>>>>>>>>>>>>>>> with
>>>>>>>>>>>>>>>>>>>>>>>> side effects.
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> This is indeed the key point. The argument around
>>>>> whether
>>>>>>>>>>>>>>>>> cache()
>>>>>>>>>>>>>>>>>>>>> should
>>>>>>>>>>>>>>>>>>>>>>> return something already indicates that cache() and
>>>>>>>>>>>>>>>>> materialize()
>>>>>>>>>>>>>>>>>>>>> address
>>>>>>>>>>>>>>>>>>>>>>> different issues.
>>>>>>>>>>>>>>>>>>>>>>> Can you explain a bit more one what are the side
>>>>> effects?
>>>>>>>>> So
>>>>>>>>>>>>>>>>> far
>>>>>>>>>>>>>>>>>> my
>>>>>>>>>>>>>>>>>>>>>>> understanding is that such side effects only exist
>> if a
>>>>>>>>>> table
>>>>>>>>>>>>>>>>> is
>>>>>>>>>>>>>>>>>>>>> mutable.
>>>>>>>>>>>>>>>>>>>>>>> Is that the case?
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> I don’t know, probably initially we should make
>>>>>>>>> CachedTable
>>>>>>>>>>>>>>>>>>>> read-only.
>>>>>>>>>>>>>>>>>>>>> I
>>>>>>>>>>>>>>>>>>>>>>>> don’t find it more confusing than the fact that user
>>>>> can
>>>>>>>>>> not
>>>>>>>>>>>>>>>>>> write
>>>>>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>>>>> views
>>>>>>>>>>>>>>>>>>>>>>>> or materialised views in SQL or that user currently
>>>>> can
>>>>>>>>> not
>>>>>>>>>>>>>>>>>> write
>>>>>>>>>>>>>>>>>>>> to a
>>>>>>>>>>>>>>>>>>>>>>>> Table.
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> I don't think anyone should insert something to a
>>>>> cache.
>>>>>>>>> By
>>>>>>>>>>>>>>>>>>>> definition
>>>>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>> cache should only be updated when the corresponding
>>>>>>>>> original
>>>>>>>>>>>>>>>>>> table
>>>>>>>>>>>>>>>>>>> is
>>>>>>>>>>>>>>>>>>>>>>> updated. What I am wondering is that given the
>>>>> following
>>>>>>>>> two
>>>>>>>>>>>>>>>>>> facts:
>>>>>>>>>>>>>>>>>>>>>>> 1. If and only if a table is mutable (with something
>>>>> like
>>>>>>>>>>>>>>>>>>> insert()),
>>>>>>>>>>>>>>>>>>>> a
>>>>>>>>>>>>>>>>>>>>>>> CachedTable may have implicit behavior.
>>>>>>>>>>>>>>>>>>>>>>> 2. A CachedTable extends a Table.
>>>>>>>>>>>>>>>>>>>>>>> We can come to the conclusion that a CachedTable is
>>>>>>>>> mutable
>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>>> users
>>>>>>>>>>>>>>>>>>>>> can
>>>>>>>>>>>>>>>>>>>>>>> insert into the CachedTable directly. This is where I
>>>>>>>>>> thought
>>>>>>>>>>>>>>>>>>>>> confusing.
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> Jiangjie (Becket) Qin
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> On Sat, Dec 1, 2018 at 2:45 AM Piotr Nowojski <
>>>>>>>>>>>>>>>>>>>> piotr@data-artisans.com
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> Regarding naming `cache()` vs `materialize()`. One
>>>>> more
>>>>>>>>>>>>>>>>>>> explanation
>>>>>>>>>>>>>>>>>>>>> why
>>>>>>>>>>>>>>>>>>>>>> I
>>>>>>>>>>>>>>>>>>>>>>>> think `materialize()` is more natural to me is that
>> I
>>>>>>>>> think
>>>>>>>>>>>>>>>> of
>>>>>>>>>>>>>>>>>> all
>>>>>>>>>>>>>>>>>>>>>> “Table”s
>>>>>>>>>>>>>>>>>>>>>>>> in Table-API as views. They behave the same way as
>> SQL
>>>>>>>>>>>>>>>> views,
>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>> only
>>>>>>>>>>>>>>>>>>>>>>>> difference for me is that their live scope is short
>> -
>>>>>>>>>>>>>>>> current
>>>>>>>>>>>>>>>>>>>> session
>>>>>>>>>>>>>>>>>>>>>> which
>>>>>>>>>>>>>>>>>>>>>>>> is limited by different execution model. That’s why
>>>>>>>>>>>>>>>> “cashing”
>>>>>>>>>>>>>>>>> a
>>>>>>>>>>>>>>>>>>> view
>>>>>>>>>>>>>>>>>>>>>> for me
>>>>>>>>>>>>>>>>>>>>>>>> is just materialising it.
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> However I see and I understand your point of view.
>>>>> Coming
>>>>>>>>>>>>>>>> from
>>>>>>>>>>>>>>>>>>>>>>>> DataSet/DataStream and generally speaking non-SQL
>>>>> world,
>>>>>>>>>>>>>>>>>> `cache()`
>>>>>>>>>>>>>>>>>>>> is
>>>>>>>>>>>>>>>>>>>>>> more
>>>>>>>>>>>>>>>>>>>>>>>> natural. But keep in mind that `.cache()` will/might
>>>>> not
>>>>>>>>>>>>>>>> only
>>>>>>>>>>>>>>>>> be
>>>>>>>>>>>>>>>>>>>> used
>>>>>>>>>>>>>>>>>>>>> in
>>>>>>>>>>>>>>>>>>>>>>>> interactive programming and not only in batching.
>> But
>>>>>>>>>> naming
>>>>>>>>>>>>>>>>> is
>>>>>>>>>>>>>>>>>>> one
>>>>>>>>>>>>>>>>>>>>>> issue,
>>>>>>>>>>>>>>>>>>>>>>>> and not that critical to me. Especially that once we
>>>>>>>>>>>>>>>> implement
>>>>>>>>>>>>>>>>>>>> proper
>>>>>>>>>>>>>>>>>>>>>>>> materialised views, we can always deprecate/rename
>>>>>>>>>> `cache()`
>>>>>>>>>>>>>>>>> if
>>>>>>>>>>>>>>>>>> we
>>>>>>>>>>>>>>>>>>>>> deem
>>>>>>>>>>>>>>>>>>>>>> so.
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> For me the more important issue is of not having the
>>>>>>>>> `void
>>>>>>>>>>>>>>>>>>> cache()`
>>>>>>>>>>>>>>>>>>>>> with
>>>>>>>>>>>>>>>>>>>>>>>> side effects. Exactly for the reasons that you have
>>>>>>>>>>>>>>>> mentioned.
>>>>>>>>>>>>>>>>>>> True:
>>>>>>>>>>>>>>>>>>>>>>>> results might be non deterministic if underlying
>>>>> source
>>>>>>>>>>>>>>>> table
>>>>>>>>>>>>>>>>>> are
>>>>>>>>>>>>>>>>>>>>>> changing.
>>>>>>>>>>>>>>>>>>>>>>>> Problem is that `void cache()` implicitly changes
>> the
>>>>>>>>>>>>>>>> semantic
>>>>>>>>>>>>>>>>>> of
>>>>>>>>>>>>>>>>>>>>>>>> subsequent uses of the cached/materialized Table. It
>>>>> can
>>>>>>>>>>>>>>>> cause
>>>>>>>>>>>>>>>>>>> “wtf”
>>>>>>>>>>>>>>>>>>>>>> moment
>>>>>>>>>>>>>>>>>>>>>>>> for a user if he inserts “b.cache()” call in some
>>>>> place
>>>>>>>>> in
>>>>>>>>>>>>>>>> his
>>>>>>>>>>>>>>>>>>> code
>>>>>>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>>>>>>>> suddenly some other random places are behaving
>>>>>>>>> differently.
>>>>>>>>>>>>>>>> If
>>>>>>>>>>>>>>>>>>>>>>>> `materialize()` or `cache()` returns a Table handle,
>>>>> we
>>>>>>>>>>>>>>>> force
>>>>>>>>>>>>>>>>>> user
>>>>>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>>>>>>> explicitly use the cache which removes the “random”
>>>>> part
>>>>>>>>>>>>>>>> from
>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>> "suddenly
>>>>>>>>>>>>>>>>>>>>>>>> some other random places are behaving differently”.
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> This argument and others that I’ve raised (greater
>>>>>>>>>>>>>>>>>>>>> flexibility/allowing
>>>>>>>>>>>>>>>>>>>>>>>> user to explicitly bypass the cache) are independent
>>>>> of
>>>>>>>>>>>>>>>>>> `cache()`
>>>>>>>>>>>>>>>>>>> vs
>>>>>>>>>>>>>>>>>>>>>>>> `materialize()` discussion.
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> Does that mean one can also insert into the
>>>>> CachedTable?
>>>>>>>>>>>>>>>> This
>>>>>>>>>>>>>>>>>>>> sounds
>>>>>>>>>>>>>>>>>>>>>>>> pretty confusing.
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> I don’t know, probably initially we should make
>>>>>>>>> CachedTable
>>>>>>>>>>>>>>>>>>>>> read-only. I
>>>>>>>>>>>>>>>>>>>>>>>> don’t find it more confusing than the fact that user
>>>>> can
>>>>>>>>>> not
>>>>>>>>>>>>>>>>>> write
>>>>>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>>>>> views
>>>>>>>>>>>>>>>>>>>>>>>> or materialised views in SQL or that user currently
>>>>> can
>>>>>>>>> not
>>>>>>>>>>>>>>>>>> write
>>>>>>>>>>>>>>>>>>>> to a
>>>>>>>>>>>>>>>>>>>>>>>> Table.
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> Piotrek
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> On 30 Nov 2018, at 17:38, Xingcan Cui <
>>>>>>>>> xingcanc@gmail.com
>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> I agree with @Becket that `cache()` and
>>>>> `materialize()`
>>>>>>>>>>>>>>>>> should
>>>>>>>>>>>>>>>>>> be
>>>>>>>>>>>>>>>>>>>>>>>> considered as two different methods where the later
>>>>> one
>>>>>>>>> is
>>>>>>>>>>>>>>>>> more
>>>>>>>>>>>>>>>>>>>>>>>> sophisticated.
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> According to my understanding, the initial idea is
>>>>> just
>>>>>>>>> to
>>>>>>>>>>>>>>>>>>>> introduce
>>>>>>>>>>>>>>>>>>>>> a
>>>>>>>>>>>>>>>>>>>>>>>> simple cache or persist mechanism, but as the
>> TableAPI
>>>>>>>>> is a
>>>>>>>>>>>>>>>>>>>> high-level
>>>>>>>>>>>>>>>>>>>>>> API,
>>>>>>>>>>>>>>>>>>>>>>>> it’s naturally for as to think in a SQL way.
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> Maybe we can add the `cache()` method to the
>> DataSet
>>>>> API
>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>>> force
>>>>>>>>>>>>>>>>>>>>>> users
>>>>>>>>>>>>>>>>>>>>>>>> to translate a Table to a Dataset before caching it.
>>>>> Then
>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>> users
>>>>>>>>>>>>>>>>>>>>>> should
>>>>>>>>>>>>>>>>>>>>>>>> manually register the cached dataset to a table
>> again
>>>>> (we
>>>>>>>>>>>>>>>> may
>>>>>>>>>>>>>>>>>> need
>>>>>>>>>>>>>>>>>>>>> some
>>>>>>>>>>>>>>>>>>>>>>>> table replacement mechanisms for datasets with an
>>>>>>>>> identical
>>>>>>>>>>>>>>>>>> schema
>>>>>>>>>>>>>>>>>>>> but
>>>>>>>>>>>>>>>>>>>>>>>> different contents here). After all, it’s the
>> dataset
>>>>>>>>>> rather
>>>>>>>>>>>>>>>>>> than
>>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>>> dynamic table that need to be cached, right?
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>>>>>>>>>>> Xingcan
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> On Nov 30, 2018, at 10:57 AM, Becket Qin <
>>>>>>>>>>>>>>>>>> becket.qin@gmail.com>
>>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> Hi Piotrek and Jark,
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for the feedback and explanation. Those are
>>>>> good
>>>>>>>>>>>>>>>>>>> arguments.
>>>>>>>>>>>>>>>>>>>>>> But I
>>>>>>>>>>>>>>>>>>>>>>>>>> think those arguments are mostly about
>> materialized
>>>>>>>>> view.
>>>>>>>>>>>>>>>>> Let
>>>>>>>>>>>>>>>>>> me
>>>>>>>>>>>>>>>>>>>> try
>>>>>>>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>>>>>>>>> explain the reason I believe cache() and
>>>>> materialize()
>>>>>>>>>> are
>>>>>>>>>>>>>>>>>>>>> different.
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> I think cache() and materialize() have quite
>>>>> different
>>>>>>>>>>>>>>>>>>>> implications.
>>>>>>>>>>>>>>>>>>>>>> An
>>>>>>>>>>>>>>>>>>>>>>>>>> analogy I can think of is save()/publish(). When
>>>>> users
>>>>>>>>>>>>>>>> call
>>>>>>>>>>>>>>>>>>>> cache(),
>>>>>>>>>>>>>>>>>>>>>> it
>>>>>>>>>>>>>>>>>>>>>>>> is
>>>>>>>>>>>>>>>>>>>>>>>>>> just like they are saving an intermediate result
>> as
>>>>> a
>>>>>>>>>>>>>>>> draft
>>>>>>>>>>>>>>>>> of
>>>>>>>>>>>>>>>>>>>> their
>>>>>>>>>>>>>>>>>>>>>>>> work,
>>>>>>>>>>>>>>>>>>>>>>>>>> this intermediate result may not have any
>> realistic
>>>>>>>>>>>>>>>> meaning.
>>>>>>>>>>>>>>>>>>>> Calling
>>>>>>>>>>>>>>>>>>>>>>>>>> cache() does not mean users want to publish the
>>>>> cached
>>>>>>>>>>>>>>>> table
>>>>>>>>>>>>>>>>>> in
>>>>>>>>>>>>>>>>>>>> any
>>>>>>>>>>>>>>>>>>>>>>>> manner.
>>>>>>>>>>>>>>>>>>>>>>>>>> But when users call materialize(), that means "I
>>>>> have
>>>>>>>>>>>>>>>>>> something
>>>>>>>>>>>>>>>>>>>>>>>> meaningful
>>>>>>>>>>>>>>>>>>>>>>>>>> to be reused by others", now users need to think
>>>>> about
>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>> validation,
>>>>>>>>>>>>>>>>>>>>>>>>>> update & versioning, lifecycle of the result, etc.
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> Piotrek's suggestions on variations of the
>>>>>>>>> materialize()
>>>>>>>>>>>>>>>>>> methods
>>>>>>>>>>>>>>>>>>>> are
>>>>>>>>>>>>>>>>>>>>>>>> very
>>>>>>>>>>>>>>>>>>>>>>>>>> useful. It would be great if Flink have them. The
>>>>>>>>> concept
>>>>>>>>>>>>>>>> of
>>>>>>>>>>>>>>>>>>>>>>>> materialized
>>>>>>>>>>>>>>>>>>>>>>>>>> view is actually a pretty big feature, not to say
>>>>> the
>>>>>>>>>>>>>>>>> related
>>>>>>>>>>>>>>>>>>>> stuff
>>>>>>>>>>>>>>>>>>>>>> like
>>>>>>>>>>>>>>>>>>>>>>>>>> triggers/hooks you mentioned earlier. I think the
>>>>>>>>>>>>>>>>> materialized
>>>>>>>>>>>>>>>>>>>> view
>>>>>>>>>>>>>>>>>>>>>>>> itself
>>>>>>>>>>>>>>>>>>>>>>>>>> should be discussed in a more thorough and
>>>>> systematic
>>>>>>>>>>>>>>>>> manner.
>>>>>>>>>>>>>>>>>>> And
>>>>>>>>>>>>>>>>>>>> I
>>>>>>>>>>>>>>>>>>>>>>>> found
>>>>>>>>>>>>>>>>>>>>>>>>>> that discussion is kind of orthogonal and way
>> beyond
>>>>>>>>>>>>>>>>>> interactive
>>>>>>>>>>>>>>>>>>>>>>>>>> programming experience.
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> The example you gave was interesting. I still have
>>>>> some
>>>>>>>>>>>>>>>>>>> questions,
>>>>>>>>>>>>>>>>>>>>>>>> though.
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> Table source = … // some source that scans files
>>>>> from a
>>>>>>>>>>>>>>>>>>> directory
>>>>>>>>>>>>>>>>>>>>>>>>>>> “/foo/bar/“
>>>>>>>>>>>>>>>>>>>>>>>>>>> Table t1 = source.groupBy(…).select(…).where(…)
>> ….;
>>>>>>>>>>>>>>>>>>>>>>>>>>> Table t2 = t1.materialize() // (or `cache()`)
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> t2.count() // initialise cache (if it’s lazily
>>>>>>>>>>>>>>>> initialised)
>>>>>>>>>>>>>>>>>>>>>>>>>>> int a1 = t1.count()
>>>>>>>>>>>>>>>>>>>>>>>>>>> int b1 = t2.count()
>>>>>>>>>>>>>>>>>>>>>>>>>>> // something in the background (or we trigger it)
>>>>>>>>> writes
>>>>>>>>>>>>>>>>> new
>>>>>>>>>>>>>>>>>>>> files
>>>>>>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>>>>>>>>>> /foo/bar
>>>>>>>>>>>>>>>>>>>>>>>>>>> int a2 = t1.count()
>>>>>>>>>>>>>>>>>>>>>>>>>>> int b2 = t2.count()
>>>>>>>>>>>>>>>>>>>>>>>>>>> t2.refresh() // possible future extension, not to
>>>>> be
>>>>>>>>>>>>>>>>>>> implemented
>>>>>>>>>>>>>>>>>>>> in
>>>>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>>>>>> initial version
>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> what if someone else added some more files to
>>>>> /foo/bar
>>>>>>>>> at
>>>>>>>>>>>>>>>>> this
>>>>>>>>>>>>>>>>>>>>> point?
>>>>>>>>>>>>>>>>>>>>>> In
>>>>>>>>>>>>>>>>>>>>>>>>>> that case, a3 won't equals to b3, and the result
>>>>> become
>>>>>>>>>>>>>>>>>>>>>>>> non-deterministic,
>>>>>>>>>>>>>>>>>>>>>>>>>> right?
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> int a3 = t1.count()
>>>>>>>>>>>>>>>>>>>>>>>>>>> int b3 = t2.count()
>>>>>>>>>>>>>>>>>>>>>>>>>>> t2.drop() // another possible future extension,
>>>>> manual
>>>>>>>>>>>>>>>>>> “cache”
>>>>>>>>>>>>>>>>>>>>>> dropping
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> When we talk about interactive programming, in
>> most
>>>>>>>>>> cases,
>>>>>>>>>>>>>>>>> we
>>>>>>>>>>>>>>>>>>> are
>>>>>>>>>>>>>>>>>>>>>>>> talking
>>>>>>>>>>>>>>>>>>>>>>>>>> about batch applications. A fundamental assumption
>>>>> of
>>>>>>>>>> such
>>>>>>>>>>>>>>>>>> case
>>>>>>>>>>>>>>>>>>> is
>>>>>>>>>>>>>>>>>>>>>> that
>>>>>>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>>>>> source data is complete before the data processing
>>>>>>>>>> begins,
>>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>> data
>>>>>>>>>>>>>>>>>>>>>>>>>> will not change during the data processing. IMO,
>> if
>>>>>>>>>>>>>>>>> additional
>>>>>>>>>>>>>>>>>>>> rows
>>>>>>>>>>>>>>>>>>>>>>>> needs
>>>>>>>>>>>>>>>>>>>>>>>>>> to be added to some source during the processing,
>> it
>>>>>>>>>>>>>>>> should
>>>>>>>>>>>>>>>>> be
>>>>>>>>>>>>>>>>>>>> done
>>>>>>>>>>>>>>>>>>>>> in
>>>>>>>>>>>>>>>>>>>>>>>> ways
>>>>>>>>>>>>>>>>>>>>>>>>>> like union the source with another table
>> containing
>>>>> the
>>>>>>>>>>>>>>>> rows
>>>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>> be
>>>>>>>>>>>>>>>>>>>>>>>> added.
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> There are a few cases that computations are
>> executed
>>>>>>>>>>>>>>>>>> repeatedly
>>>>>>>>>>>>>>>>>>> on
>>>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>>>>> changing data source.
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> For example, people may run a ML training job
>> every
>>>>>>>>> hour
>>>>>>>>>>>>>>>>> with
>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>>> samples
>>>>>>>>>>>>>>>>>>>>>>>>>> newly added in the past hour. In that case, the
>>>>> source
>>>>>>>>>>>>>>>> data
>>>>>>>>>>>>>>>>>>>> between
>>>>>>>>>>>>>>>>>>>>>> will
>>>>>>>>>>>>>>>>>>>>>>>>>> indeed change. But still, the data remain
>> unchanged
>>>>>>>>>> within
>>>>>>>>>>>>>>>>> one
>>>>>>>>>>>>>>>>>>>> run.
>>>>>>>>>>>>>>>>>>>>>> And
>>>>>>>>>>>>>>>>>>>>>>>>>> usually in that case, the result will need
>>>>> versioning,
>>>>>>>>>>>>>>>> i.e.
>>>>>>>>>>>>>>>>>> for
>>>>>>>>>>>>>>>>>>> a
>>>>>>>>>>>>>>>>>>>>>> given
>>>>>>>>>>>>>>>>>>>>>>>>>> result, it tells that the result is a result from
>>>>> the
>>>>>>>>>>>>>>>> source
>>>>>>>>>>>>>>>>>>> data
>>>>>>>>>>>>>>>>>>>>> by a
>>>>>>>>>>>>>>>>>>>>>>>>>> certain timestamp.
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> Another example is something like data warehouse.
>> In
>>>>>>>>> this
>>>>>>>>>>>>>>>>>> case,
>>>>>>>>>>>>>>>>>>>>> there
>>>>>>>>>>>>>>>>>>>>>>>> are a
>>>>>>>>>>>>>>>>>>>>>>>>>> few source of original/raw data. On top of those
>>>>>>>>> sources,
>>>>>>>>>>>>>>>>> many
>>>>>>>>>>>>>>>>>>>>>>>> materialized
>>>>>>>>>>>>>>>>>>>>>>>>>> view / queries / reports / dashboards can be
>>>>> created to
>>>>>>>>>>>>>>>>>> generate
>>>>>>>>>>>>>>>>>>>>>> derived
>>>>>>>>>>>>>>>>>>>>>>>>>> data. Those derived data needs to be updated when
>>>>> the
>>>>>>>>>>>>>>>>>> underlying
>>>>>>>>>>>>>>>>>>>>>>>> original
>>>>>>>>>>>>>>>>>>>>>>>>>> data changes. In that case, the processing logic
>>>>> that
>>>>>>>>>>>>>>>>> derives
>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>>> original
>>>>>>>>>>>>>>>>>>>>>>>>>> data needs to be executed repeatedly to update
>> those
>>>>>>>>>>>>>>>>>>>> reports/views.
>>>>>>>>>>>>>>>>>>>>>>>> Again,
>>>>>>>>>>>>>>>>>>>>>>>>>> all those derived data also need to ha
>>>>> 
>>>>> 
>> 
>> 
>> 


Re: [DISCUSS] Support Interactive Programming in Flink Table API

Posted by Becket Qin <be...@gmail.com>.
Hi Piotr,

Thanks for the proposal and detailed explanation. I like the idea of
returning a new hinted Table without modifying the original table. This
also leaves room for users to benefit from future implicit caching.

Just to make sure I get the full picture. In your proposal, there will also
be a 'void Table#uncache()' method to release the cache, right?
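
(To make sure we are talking about the same API surface, here is a minimal
sketch of how I read the proposal; the names below are placeholders for this
discussion only, not an actual Flink API.)

```
public interface Table {
    // Returns a copy of this table carrying a "please cache me" hint.
    // The original table is left untouched.
    Table cache();

    // Releases the physical cache backing this table, if any.
    void uncache();
}
```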

Thanks,

Jiangjie (Becket) Qin

On Mon, Jan 7, 2019 at 11:50 PM Piotr Nowojski <pi...@da-platform.com>
wrote:

> Hi Becket!
>
> After further thinking I tend to agree that my previous proposal (*Option
> 2*) indeed might not be ideal if we would in the future introduce automatic
> caching. However I would like to propose a slightly modified version of it:
>
> *Option 4*
>
> Adding a `cache()` method with the following signature:
>
> Table Table#cache();
>
> Without side-effects: the `cache()` call does not modify/change the
> original Table in any way.
> It would return a copy of the original table, with an added hint for the
> optimizer to cache the table, so that future accesses to the returned
> table might be cached or not.
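>
> (A minimal sketch of what I mean, with made-up class names, just for
> illustration and assuming immutable plan/hint objects:)
>
> ```
> import java.util.HashSet;
> import java.util.Set;
>
> // All names here are made up for this discussion; this is not Flink API.
> class TableImpl {
>     enum Hint { CACHE }
>
>     private final Object plan;      // stand-in for the logical plan
>     private final Set<Hint> hints;
>
>     TableImpl(Object plan, Set<Hint> hints) {
>         this.plan = plan;
>         this.hints = hints;
>     }
>
>     // cache() does not mutate `this`; it returns a copy of the table
>     // that carries one extra hint for the optimizer.
>     TableImpl cache() {
>         Set<Hint> newHints = new HashSet<>(hints);
>         newHints.add(Hint.CACHE);
>         return new TableImpl(plan, newHints);
>     }
> }
> ```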
>
> Assuming that we are talking about a setup, where we do not have automatic
> caching enabled (possible future extension).
>
> Example #1:
>
> ```
> Table a = …
> a.foo() // not cached
>
> val cachedA = a.cache();
>
> cachedA.bar() // maybe cached
> a.foo() // same as before - effectively not cached
> ```
>
> Both the first and the second `a.foo()` operations would behave in exactly
> the same way. Again, the `a.cache()` call doesn’t affect `a` itself. If `a`
> was not hinted for caching before `a.cache();`, then both `a.foo()` calls
> wouldn’t use cache.
>
> The returned `cachedA` would be hinted with the “cache” hint, so probably
> `cachedA.bar()` would go through cache (unless optimiser decides the
> opposite)
>
> Example #2
>
> ```
> Table a = …
>
> a.foo() // not cached
>
> val b = a.cache();
>
> a.foo() // same as before - effectively not cached
> b.foo() // maybe cached
>
> val c = b.cache();
>
> a.foo() // same as before - effectively not cached
> b.foo() // same as before - effectively maybe cached
> c.foo() // maybe cached
> ```
>
> Now, assuming that we have some future “automatic caching optimisation”:
>
> Example #3
>
> ```
> env.enableAutomaticCaching()
> Table a = …
>
> a.foo() // might be cached, depending on whether `a` was selected for
> automatic caching
>
> val b = a.cache();
>
> a.foo() // same as before - might be cached, if `a` was selected for
> automatic caching
> b.foo() // maybe cached
> ```
>
>
> More or less this is the same behaviour as:
>
> Table a = ...
> val b = a.filter(x > 20)
>
> calling `filter` hasn’t changed or altered `a` in anyway. If `a` was
> previously filtered:
>
> Table src = …
> val a = src.filter(x > 20)
> val b = a.filter(x > 20)
>
> then yes, `a` and `b` will be the same. But the point is that neither
> `filter` nor `cache` changes the original `a` table.
>
> One thing is that indeed, the physical drop-cache operation will have
> side effects and it will in a way mutate the cached table references. But
> this is I think unavoidable in any solution - the same issue as calling
> `.close()`, or calling a destructor in C++.
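>
> (One way to at least make that mutation point syntactically visible,
> sketched with made-up names: tie the physical drop to a handle that
> follows the usual AutoCloseable convention:)
>
> ```
> // Made-up names, sketch only. The point is that the physical drop is an
> // explicit, visible call site, like close() on any other resource.
> class PhysicalCache implements AutoCloseable {
>     private volatile boolean dropped = false;
>
>     boolean isDropped() {
>         return dropped;
>     }
>
>     @Override
>     public void close() {
>         // Physically frees the cached intermediate result. Afterwards,
>         // tables hinted to use this cache fall back to the original plan.
>         dropped = true;
>     }
> }
> ```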
>
> Piotrek
>
> > On 7 Jan 2019, at 10:41, Becket Qin <be...@gmail.com> wrote:
> >
> > Happy New Year, everybody!
> >
> > I would like to resume this discussion thread. At this point, We have
> > agreed on the first step goal of interactive programming. The open
> > discussion is the exact API. More specifically, what should *cache()*
> > method return and what is the semantic. There are three options:
> >
> > *Option 1*
> > *void cache()* OR *Table cache()* which returns the original table for
> > chained calls.
> > *void uncache() *releases the cache.
> > *Table.hint(ignoreCache).foo()* to ignore cache for operation foo().
> >
> > - Semantic: a.cache() hints that table 'a' should be cached. Optimizer
> > decides whether the cache will be used or not.
> > - pros: simple and no confusion between CachedTable and original table
> > - cons: A table may be cached / uncached in a method invocation, while
> > the caller does not know about this.
> >
> > *Option 2*
> > *CachedTable cache()*
> > *CachedTable* extends *Table* with an additional *uncache()* method
> >
> > - Semantic: After *val cachedA = a.cache()*, *cachedA.foo()* will always
> > use the cache. *a.bar()* will always use the original DAG.
> > - pros: No potential side effects in method invocation.
> > - cons: Optimizer has no chance to kick in. Future optimization will
> > become a behavior change and will require users to change their code.
> >
> > *Option 3*
> > *CacheHandle cache()*
> > *CacheHandle.release()* to release a cache handle on the table. If all
> > cache handles are released, the cache could be removed.
> > *Table.hint(ignoreCache).foo()* to ignore cache for operation foo().
> >
> > - Semantic: *a.cache()* hints that 'a' should be cached. Optimizer
> > decides whether the cache will be used or not. The cache is released
> > either when no handle remains on it, or when the user program exits.
> > - pros: No potential side effects in method invocation. No confusion
> > between the cached table vs. the original table.
> > - cons: An additional CacheHandle exposed to the users.
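> >
> > (For clarity, the three options as signatures side by side; a sketch for
> > this discussion, not actual Flink API:)
> >
> > ```
> > // Option 1: hint on the table itself, released via uncache().
> > interface TableOption1 {
> >     void cache();        // hint: this table should be cached
> >     void uncache();      // release the cache
> > }
> >
> > // Option 2: an explicit CachedTable sub-type that always reads the cache.
> > interface TableOption2 {
> >     CachedTable cache();
> > }
> > interface CachedTable {
> >     void uncache();
> > }
> >
> > // Option 3: a separate handle that only controls the cache's lifecycle.
> > interface TableOption3 {
> >     CacheHandle cache(); // hint + handle; original table unchanged
> > }
> > interface CacheHandle {
> >     void release();      // cache dropped once all handles are released
> > }
> > ```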
> >
> >
> > Personally I prefer option 3 for the following reasons:
> > 1. It is simple. The vast majority of users would just call *a.cache()*
> > followed by *a.foo()*, *a.bar()*, etc.
> > 2. There is no semantic ambiguity or semantic change if we decide to add
> > implicit caching in the future.
> > 3. There is no side effect in the method calls.
> > 4. Admittedly we need to expose one more CacheHandle class to the users.
> > But it is not that difficult to understand given a similar well-known
> > concept like ref counting (we can name it CacheReference if that is
> > easier to understand). So I think it is fine.
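> >
> > (A minimal sketch of the ref-count idea, with made-up names and a plain
> > synchronized counter, just to illustrate:)
> >
> > ```
> > // Each cache() call hands out one handle; the physical cache is dropped
> > // when the last outstanding handle has been released.
> > class RefCountedCache {
> >     private int refCount = 0;
> >
> >     synchronized CacheHandle acquire() {
> >         refCount++;
> >         return new CacheHandle(this);
> >     }
> >
> >     synchronized void releaseOne() {
> >         if (--refCount == 0) {
> >             // physically delete the cached intermediate result here
> >         }
> >     }
> > }
> >
> > class CacheHandle {
> >     private final RefCountedCache cache;
> >     private boolean released = false;
> >
> >     CacheHandle(RefCountedCache cache) {
> >         this.cache = cache;
> >     }
> >
> >     synchronized void release() { // idempotent per handle
> >         if (!released) {
> >             released = true;
> >             cache.releaseOne();
> >         }
> >     }
> > }
> > ```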
> >
> >
> > Thanks,
> >
> > Jiangjie (Becket) Qin
> >
> >
> > On Thu, Dec 13, 2018 at 11:23 AM Becket Qin <be...@gmail.com>
> > wrote:
> >
> >> Hi Piotrek,
> >>
> >> 1. Regarding optimization.
> >> Sure, there are many cases where the decision is hard to make. But that
> >> does not make it any easier for the users to make those decisions. I
> >> imagine 99% of the users would just naively use cache. I am not saying we
> >> can optimize in all the cases. But as long as we agree that at least in
> >> certain cases (I would argue most cases) the optimizer can do a little
> >> better than an average user who likely knows little about Flink
> >> internals, we should not push the burden of optimization to users.
> >>
> >> BTW, it seems some of your concerns are related to the implementation. I
> >> did not mention the implementation of the caching service because that
> >> should not affect the API semantics. Not sure if this helps, but imagine
> >> the default implementation has one StorageNode service colocated with
> >> each TM. It could be running within the TM process or in a standalone
> >> process, depending on configuration.
> >>
> >> The StorageNode uses a memory + spill-to-disk mechanism. The cached data
> >> will just be written to the local StorageNode service. If the StorageNode
> >> is running within the TM process, the in-memory cache could just be
> >> objects, so we save some serde cost. A later job referring to the cached
> >> Table will be scheduled in a locality-aware manner, i.e. run in the TM
> >> whose peer StorageNode hosts the data.
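> >>
> >> (To make this concrete, a rough sketch; StorageNode is a made-up name and
> >> none of this is a design commitment:)
> >>
> >> ```
> >> // One StorageNode per TM; cached data is written locally, and later
> >> // reads prefer the TM whose peer StorageNode hosts the data.
> >> interface StorageNode {
> >>     void write(String cachedTableId, Iterable<byte[]> serializedRows);
> >>     Iterable<byte[]> read(String cachedTableId);
> >>     boolean hosts(String cachedTableId); // for locality-aware scheduling
> >> }
> >> ```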
> >>
> >>
> >> 2. Semantic
> >> I am not sure why introducing a new hintCache() or
> >> env.enableAutomaticCaching() method would avoid the consequence of a
> >> semantic change.
> >>
> >> If the auto optimization is not enabled by default, users still need to
> >> make code changes to all existing programs in order to get the benefit.
> >> If the auto optimization is enabled by default, advanced users who know
> >> that they really want to use the cache will suddenly lose the opportunity
> >> to do so, unless they change the code to disable auto optimization.
> >>
> >>
> >> 3. side effect
> >> The CacheHandle is not only about where to put uncache(). It also solves
> >> the implicit performance impact, by moving the uncache() to the CacheHandle.
> >>
> >>   - If users want to leverage the cache, they can call a.cache(). After
> >>   that, unless the user explicitly releases that CacheHandle, a.foo() will
> >>   always leverage the cache if needed (the optimizer may choose to ignore
> >>   the cache if that helps accelerate the process). Other function calls
> >>   will not be able to release the cache because they do not have that
> >>   CacheHandle.
> >>   - If some advanced users do not want to use the cache at all, they will
> >>   call a.hint(ignoreCache).foo(). This will for sure ignore the cache and
> >>   use the original DAG to process.
> >>
> >>
> >>> In vast majority of the cases, users wouldn't really care whether the
> >>> cache is used or not.
> >>> I wouldn’t agree with that, because “caching” (if not purely in memory
> >>> caching) would add additional IO costs. It’s similar as saying that
> users
> >>> would not see a difference between Spark/Flink and MapReduce (MapReduce
> >>> writes data to disks after every map/reduce stage).
> >>
> >> What I wanted to say is that in most cases, after users call cache(),
> they
> >> don't really care about whether auto optimization has decided to ignore
> the
> >> cache or not, as long as the program runs faster.
> >>
> >> Thanks,
> >>
> >> Jiangjie (Becket) Qin
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >> On Wed, Dec 12, 2018 at 10:50 PM Piotr Nowojski <
> piotr@data-artisans.com>
> >> wrote:
> >>
> >>> Hi,
> >>>
> >>> Thanks for the quick answer :)
> >>>
> >>> Re 1.
> >>>
> >>> I generally agree with you, however couple of points:
> >>>
> >>> a) the problem with using automatic caching is bigger, because you will
> >>> have to decide how to compare IO vs CPU costs, and if you pick wrong, the
> >>> additional IO costs might be enormous or can even crash your system. This
> >>> is a more difficult problem compared to, let's say, join reordering, where
> >>> the only issue is to have good statistics that can capture correlations
> >>> between columns (when you reorder joins, the number of IO operations does
> >>> not change)
> >>> b) your example is completely independent of caching.
> >>>
> >>> Query like this:
> >>>
> >>> src1.filter('f1 > 10).join(src2.filter('f2 < 30), 'f1 === 'f2).as('f3,
> >>> …).filter('f3 > 30)
> >>>
> >>> Should/could be optimised to an empty result immediately, without the
> >>> need for any cache/materialisation, and that should work even without any
> >>> statistics provided by the connector.
> >>>
> >>> For me, a prerequisite to any serious cost-based optimisations would be
> >>> some reasonable benchmark coverage of the code (tpch?). Otherwise that
> >>> would be equivalent to adding untested code, since we wouldn’t be able to
> >>> verify our assumptions, like how the writing of 10 000 records to a
> >>> cache/RocksDB/Kafka/CSV file compares to joining/filtering/processing of,
> >>> let's say, 1 000 000 rows.
> >>>
> >>> Re 2.
> >>>
> >>> I wasn’t proposing to change the semantic later. I was proposing that
> we
> >>> start now:
> >>>
> >>> CachedTable cachedA = a.cache()
> >>> cachedA.foo() // Cache is used
> >>> a.bar() // Original DAG is used
> >>>
> >>> And then later we can think about adding for example
> >>>
> >>> CachedTable cachedA = a.hintCache()
> >>> cachedA.foo() // Cache might be used
> >>> a.bar() // Original DAG is used
> >>>
> >>> Or
> >>>
> >>> env.enableAutomaticCaching()
> >>> a.foo() // Cache might be used
> >>> a.bar() // Cache might be used
> >>>
> >>> Or (I would still not like this option):
> >>>
> >>> a.hintCache()
> >>> a.foo() // Cache might be used
> >>> a.bar() // Cache might be used
> >>>
> >>> Or whatever else that will come to our mind. Even if we add some
> >>> automatic caching in the future, keeping implicit (`CachedTable
> cache()`)
> >>> caching will still be useful, at least in some cases.
> >>>
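> >>> For reference, the explicit variant needs nothing more than the
> >>> following signatures (a sketch, not a finalized API):
> >>>
> >>> public interface Table {
> >>>     CachedTable cache(); // materialize and return an explicit handle
> >>> }
> >>>
> >>> public interface CachedTable extends Table {
> >>>     void dropCache();    // explicitly release the materialized result
> >>> }
> >>>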
> >>> Re 3.
> >>>
> >>>> 2. The source tables are immutable during one run of batch processing
> >>> logic.
> >>>> 3. The cache is immutable during one run of batch processing logic.
> >>>
> >>>> I think assumption 2 and 3 are by definition what batch processing
> >>> means,
> >>>> i.e the data must be complete before it is processed and should not
> >>> change
> >>>> when the processing is running.
> >>>
> >>> I agree that this is how batch systems SHOULD be working. However I know
> >>> from my previous experience that it’s not always the case. Sometimes
> >>> users are just working on some non-transactional storage, which can be
> >>> (either constantly or occasionally) modified by some other processes for
> >>> whatever reason (fixing the data, updating, adding new data etc).
> >>>
> >>> But even if we ignore this point (data immutability), performance side
> >>> effect issue of your proposal remains. If user calls `void a.cache()`
> deep
> >>> inside some private method, it will have implicit side effects on other
> >>> parts of his program that might not be obvious.
> >>>
> >>> Re `CacheHandle`.
> >>>
> >>> If I understand it correctly, it only addresses the issue of where to
> >>> place the method `uncache`/`dropCache`.
> >>>
> >>> Btw,
> >>>
> >>>> In vast majority of the cases, users wouldn't really care whether the
> >>> cache is used or not.
> >>>
> >>> I wouldn’t agree with that, because “caching” (if not purely in memory
> >>> caching) would add additional IO costs. It’s similar as saying that
> users
> >>> would not see a difference between Spark/Flink and MapReduce (MapReduce
> >>> writes data to disks after every map/reduce stage).
> >>>
> >>> Piotrek
> >>>
> >>>> On 12 Dec 2018, at 14:28, Becket Qin <be...@gmail.com> wrote:
> >>>>
> >>>> Hi Piotrek,
> >>>>
> >>>> Not sure if you noticed, in my last email, I was proposing
> `CacheHandle
> >>>> cache()` to avoid the potential side effect due to function calls.
> >>>>
> >>>> Let's look at the disagreement in your reply one by one.
> >>>>
> >>>>
> >>>> 1. Optimization chances
> >>>>
> >>>> Optimization is never trivial work. This is exactly why we should not
> >>>> let users manually do that. Databases have done a huge amount of work in
> >>>> this area. At Alibaba, we rely heavily on many optimization rules to
> >>>> boost the SQL query performance.
> >>>>
> >>>> In your example, if I fill in the filter conditions in a certain way,
> >>>> the optimization opportunity becomes obvious.
> >>>>
> >>>> Table src1 = … // read from connector 1
> >>>> Table src2 = … // read from connector 2
> >>>>
> >>>> Table a = src1.filter('f1 > 10).join(src2.filter('f2 < 30), 'f1 ===
> >>>> 'f2).as('f3, ...)
> >>>> a.cache() // write cache to connector 3, when writing the records,
> >>> remember
> >>>> min and max of `f1
> >>>>
> >>>> a.filter('f3 > 30) // There is no need to read from any connector
> >>> because
> >>>> `a` does not contain any record whose 'f3 is greater than 30.
> >>>> env.execute()
> >>>> a.select(…)
> >>>>
> >>>> BTW, it seems to me that adding some basic statistics is fairly
> >>>> straightforward and the cost is pretty marginal, if not negligible. In
> >>>> fact it is not only needed for optimization, but also for cases such as
> >>>> ML, where some algorithms may need to decide their parameters based on
> >>>> the statistics of the data.
> >>>>
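> >>>> To make the min/max idea concrete, here is a sketch of the pruning
> >>>> check (pseudocode; `stats` is a hypothetical per-field summary collected
> >>>> while writing the cache):
> >>>>
> >>>> // planning a filter 'f3 > 30 over the cached table 'a'
> >>>> if (stats.max("f3") <= 30) {
> >>>>     return emptyTable(); // provably empty, no need to scan the cache
> >>>> }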
> >>>>
> >>>> 2. Same API, one semantic now, another semantic later.
> >>>>
> >>>> I am trying to understand what is the semantic of `CachedTable
> cache()`
> >>> you
> >>>> are proposing. IMO, we should avoid designing an API whose semantic
> >>> will be
> >>>> changed later. If we have a "CachedTable cache()" method, then the
> >>> semantic
> >>>> should be very clearly defined upfront and do not change later. It
> >>> should
> >>>> never be "right now let's go with semantic 1, later we can silently
> >>> change
> >>>> it to semantic 2 or 3". Such change could result in bad consequence.
> For
> >>>> example, let's say we decide go with semantic 1:
> >>>>
> >>>> CachedTable cachedA = a.cache()
> >>>> cachedA.foo() // Cache is used
> >>>> a.bar() // Original DAG is used.
> >>>>
> >>>> Now majority of the users would be using cachedA.foo() in their code.
> >>> And
> >>>> some advanced users will use a.bar() to explicitly skip the cache.
> Later
> >>>> on, we added smart optimization and change the semantic to semantic 2:
> >>>>
> >>>> CachedTable cachedA = a.cache()
> >>>> cachedA.foo() // Cache is used
> >>>> a.bar() // Cache MIGHT be used, and Flink may decide to skip cache if
> >>> it is
> >>>> faster.
> >>>>
> >>>> Now most of the users who were writing cachedA.foo() will not benefit
> >>> from
> >>>> this optimization at all, unless they change their code to use a.foo()
> >>>> instead. And those advanced users suddenly lose the option to
> explicitly
> >>>> ignore cache unless they change their code (assuming we care enough to
> >>>> provide something like hint(useCache)). If we don't define the
> semantic
> >>>> carefully, our users will have to change their code again and again
> >>> while
> >>>> they shouldn't have to.
> >>>>
> >>>>
> >>>> 3. side effect.
> >>>>
> >>>> Before we talk about side effect, we have to agree on the assumptions.
> >>> The
> >>>> assumptions I have are following:
> >>>> 1. We are talking about batch processing.
> >>>> 2. The source tables are immutable during one run of batch processing
> >>> logic.
> >>>> 3. The cache is immutable during one run of batch processing logic.
> >>>>
> >>>> I think assumption 2 and 3 are by definition what batch processing
> >>> means,
> >>>> i.e the data must be complete before it is processed and should not
> >>> change
> >>>> when the processing is running.
> >>>>
> >>>> As far as I am aware of, I don't know any batch processing system
> >>> breaking
> >>>> those assumptions. Even for relational database tables, where queries
> >>> can
> >>>> run with concurrent modifications, the necessary locking is still
> >>>> required to ensure the integrity of the query result.
> >>>>
> >>>> Please let me know if you disagree with the above assumptions. If you
> >>> agree
> >>>> with these assumptions, with the `CacheHandle cache()` API in my last
> >>>> email, do you still see side effects?
> >>>>
> >>>> Thanks,
> >>>>
> >>>> Jiangjie (Becket) Qin
> >>>>
> >>>>
> >>>> On Wed, Dec 12, 2018 at 7:11 PM Piotr Nowojski <
> piotr@data-artisans.com
> >>>>
> >>>> wrote:
> >>>>
> >>>>> Hi Becket,
> >>>>>
> >>>>>> Regarding the chance of optimization, it might not be that rare.
> Some
> >>>>> very
> >>>>>> simple statistics could already help in many cases. For example,
> >>> simply
> >>>>>> maintaining max and min of each fields can already eliminate some
> >>>>>> unnecessary table scan (potentially scanning the cached table) if
> the
> >>>>>> result is doomed to be empty. A histogram would give even further
> >>>>>> information. The optimizer could be very careful and only ignores
> >>> cache
> >>>>>> when it is 100% sure doing that is cheaper. e.g. only when a filter
> on
> >>>>> the
> >>>>>> cache will absolutely return nothing.
> >>>>>
> >>>>> I do not see how this might be easy to achieve. It would require tons
> >>> of
> >>>>> effort to make it work and in the end you would still have a problem
> of
> >>>>> comparing/trading CPU cycles vs IO. For example:
> >>>>>
> >>>>> Table src1 = … // read from connector 1
> >>>>> Table src2 = … // read from connector 2
> >>>>>
> >>>>> Table a = src1.filter(…).join(src2.filter(…), …)
> >>>>> a.cache() // write cache to connector 3
> >>>>>
> >>>>> a.filter(…)
> >>>>> env.execute()
> >>>>> a.select(…)
> >>>>>
> >>>>> Decision whether it’s better to:
> >>>>> A) read from connector1/connector2, filter/map and join them twice
> >>>>> B) read from connector1/connector2, filter/map and join them once,
> pay
> >>> the
> >>>>> price of writing to connector 3 and then reading from it
> >>>>>
> >>>>> Is very far from trivial. `a` can end up much larger than `src1` and
> >>>>> `src2`, writes to connector 3 might be extremely slow, reads from
> >>>>> connector 3 can be slower compared to reads from connectors 1 & 2, … .
> >>>>> You really need to have extremely good statistics to correctly assess
> >>>>> the size of the output, and it would still fail many times
> >>>>> (correlations etc). And keep in mind that at the moment we do not have
> >>>>> ANY statistics at all. More than that, it would require significantly
> >>>>> more testing and setting up some benchmarks to make sure that we do not
> >>>>> break it with some regressions.
> >>>>>
> >>>>> That’s why I’m strongly opposing this idea - at least let’s not start
> >>>>> with this. If we first start with completely manual/explicit caching,
> >>>>> without any magic, it would be a significant improvement for the users
> >>>>> for a fraction of the development cost. After implementing that, when we
> >>>>> already have all of the working pieces, we can start working on some
> >>>>> optimisation rules. As I wrote before, if we start with
> >>>>>
> >>>>> `CachedTable cache()`
> >>>>>
> >>>>> We can later work on follow-up stories to make it automatic. Even
> >>>>> though I don’t like this implicit/side-effect approach with a `void`
> >>>>> method, having an explicit `CachedTable cache()` wouldn’t even prevent
> >>>>> us from later adding a `void hintCache()` method, with the exact
> >>>>> semantic that you want.
> >>>>>
> >>>>> On top of that, I raise again that having an implicit `void
> >>>>> cache()/hintCache()` has other side effects and problems with
> >>>>> non-immutable data, and is annoying when used secretly inside methods.
> >>>>>
> >>>>> Explicit `CachedTable cache()` just looks like a much less controversial
> >>>>> MVP, and if we decide to go further with this topic, it’s not a wasted
> >>>>> effort, but just lies on a straight path to more advanced/complicated
> >>>>> solutions in the future. Are there any drawbacks of starting with
> >>>>> `CachedTable cache()` that I’m missing?
> >>>>>
> >>>>> Piotrek
> >>>>>
> >>>>>> On 12 Dec 2018, at 09:30, Jeff Zhang <zj...@gmail.com> wrote:
> >>>>>>
> >>>>>> Hi Becket,
> >>>>>>
> >>>>>> Introducing CacheHandle seems too complicated. That means users have
> >>>>>> to maintain the handle properly.
> >>>>>>
> >>>>>> And since cache is just a hint for optimizer, why not just return
> >>> Table
> >>>>>> itself for cache method. This hint info should be kept in Table I
> >>>>> believe.
> >>>>>>
> >>>>>> So how about adding methods cache and uncache to Table, with both
> >>>>>> returning Table? Because what cache and uncache do is just add some
> >>>>>> hint info to the Table.
> >>>>>>
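> >>>>>> In other words, something as simple as (a sketch, not a final
> >>>>>> signature):
> >>>>>>
> >>>>>> Table cache();   // attach the cache hint to this Table, return this
> >>>>>> Table uncache(); // remove the hint, return this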
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> Becket Qin <be...@gmail.com> wrote on Wed, Dec 12, 2018 at 11:25 AM:
> >>>>>>
> >>>>>>> Hi Till and Piotrek,
> >>>>>>>
> >>>>>>> Thanks for the clarification. That clears up quite a bit of
> >>>>>>> confusion. My understanding of how cache works is the same as what
> >>>>>>> Till described, i.e. cache() is a hint to Flink, but it is not
> >>>>>>> guaranteed that the cache always exists, and it might be recomputed
> >>>>>>> from its lineage.
> >>>>>>>
> >>>>>>> Is this the core of our disagreement here? That you would like this
> >>>>>>>> “cache()” to be mostly hint for the optimiser?
> >>>>>>>
> >>>>>>> Semantic wise, yes. That's also why I think materialize() has a
> much
> >>>>> larger
> >>>>>>> scope than cache(), thus it should be a different method.
> >>>>>>>
> >>>>>>> Regarding the chance of optimization, it might not be that rare.
> Some
> >>>>> very
> >>>>>>> simple statistics could already help in many cases. For example,
> >>> simply
> >>>>>>> maintaining max and min of each fields can already eliminate some
> >>>>>>> unnecessary table scan (potentially scanning the cached table) if
> the
> >>>>>>> result is doomed to be empty. A histogram would give even further
> >>>>>>> information. The optimizer could be very careful and only ignores
> >>> cache
> >>>>>>> when it is 100% sure doing that is cheaper. e.g. only when a filter
> >>> on
> >>>>> the
> >>>>>>> cache will absolutely return nothing.
> >>>>>>>
> >>>>>>> Given the above clarification on cache, I would like to revisit the
> >>>>>>> original "void cache()" proposal and see if we can improve on top
> of
> >>>>> that.
> >>>>>>>
> >>>>>>> What do you think about the following modified interface?
> >>>>>>>
> >>>>>>> Table {
> >>>>>>> /**
> >>>>>>> * This call hints Flink to maintain a cache of this table and
> >>> leverage
> >>>>>>> it for performance optimization if needed.
> >>>>>>> * Note that Flink may still decide to not use the cache if it is
> >>>>> cheaper
> >>>>>>> by doing so.
> >>>>>>> *
> >>>>>>> * A CacheHandle will be returned to allow the user to release the
> >>>>>>> cache actively. The cache will be deleted if there
> >>>>>>> * are no unreleased cache handles to it. When the TableEnvironment is
> >>>>>>> closed, the cache will also be deleted
> >>>>>>> * and all the cache handles will be released.
> >>>>>>> *
> >>>>>>> * @return a CacheHandle referring to the cache of this table.
> >>>>>>> */
> >>>>>>> CacheHandle cache();
> >>>>>>> }
> >>>>>>>
> >>>>>>> CacheHandle {
> >>>>>>> /**
> >>>>>>> * Close the cache handle. This method does not necessarily delete the
> >>>>>>> cache. Instead, it simply decrements the reference counter to the
> >>>>>>> cache.
> >>>>>>> * When there is no handle referring to a cache, the cache will be
> >>>>>>> deleted.
> >>>>>>> *
> >>>>>>> * @return the number of open handles to the cache after this handle
> >>>>> has
> >>>>>>> been released.
> >>>>>>> */
> >>>>>>> int release()
> >>>>>>> }
> >>>>>>>
> >>>>>>> The rationale behind this interface is as follows:
> >>>>>>> In the vast majority of cases, users wouldn't really care whether the
> >>>>>>> cache is used or not. So I think the most intuitive way is letting
> >>>>>>> cache() return nothing, so nobody needs to worry about the difference
> >>>>>>> between operations on CachedTables and those on the "original" tables.
> >>>>>>> This will make maybe 99.9% of the users happy. There were two concerns
> >>>>>>> raised for this approach:
> >>>>>>> 1. In some rare cases, users may want to ignore the cache,
> >>>>>>> 2. A table might be cached/uncached in a third party function while
> >>> the
> >>>>>>> caller does not know.
> >>>>>>>
> >>>>>>> For the first issue, users can use hint("ignoreCache") to
> explicitly
> >>>>> ignore
> >>>>>>> cache.
> >>>>>>> For the second issue, the above proposal lets cache() return a
> >>>>>>> CacheHandle, whose only method is release(). Different CacheHandles
> >>>>>>> will refer to the same cache; if a cache no longer has any cache
> >>>>>>> handle, it will be deleted. This will address the following case:
> >>>>>>> {
> >>>>>>> val handle1 = a.cache()
> >>>>>>> process(a)
> >>>>>>> a.select(...) // cache is still available, handle1 has not been
> >>>>> released.
> >>>>>>> }
> >>>>>>>
> >>>>>>> void process(Table t) {
> >>>>>>> val handle2 = t.cache() // new handle to cache
> >>>>>>> t.select(...) // optimizer decides cache usage
> >>>>>>> t.hint("ignoreCache").select(...) // cache is ignored
> >>>>>>> handle2.release() // release the handle, but the cache may still be
> >>>>>>> available if there are other handles
> >>>>>>> ...
> >>>>>>> }
> >>>>>>>
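> >>>>>>> As a sketch of the reference counting behind release() (illustrative
> >>>>>>> pseudocode only; the field and method names are assumptions):
> >>>>>>>
> >>>>>>> // kept by the TableEnvironment, one counter per physical cache
> >>>>>>> synchronized int release(String cacheId) {
> >>>>>>>   int remaining = refCounts.get(cacheId) - 1; // registered by cache()
> >>>>>>>   refCounts.put(cacheId, remaining);
> >>>>>>>   if (remaining == 0) {
> >>>>>>>     deletePhysicalCache(cacheId); // last handle gone, drop the cache
> >>>>>>>   }
> >>>>>>>   return remaining;
> >>>>>>> }
> >>>>>>>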
> >>>>>>> Does the above modified approach look reasonable to you?
> >>>>>>>
> >>>>>>> Cheers,
> >>>>>>>
> >>>>>>> Jiangjie (Becket) Qin
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> On Tue, Dec 11, 2018 at 6:44 PM Till Rohrmann <
> trohrmann@apache.org>
> >>>>>>> wrote:
> >>>>>>>
> >>>>>>>> Hi Becket,
> >>>>>>>>
> >>>>>>>> I was aiming at semantics similar to 1. I actually thought that
> >>>>> `cache()`
> >>>>>>>> would tell the system to materialize the intermediate result so
> that
> >>>>>>>> subsequent queries don't need to reprocess it. This means that the
> >>>>> usage
> >>>>>>> of
> >>>>>>>> the cached table in this example
> >>>>>>>>
> >>>>>>>> {
> >>>>>>>> val cachedTable = a.cache()
> >>>>>>>> val b1 = cachedTable.select(…)
> >>>>>>>> val b2 = cachedTable.foo().select(…)
> >>>>>>>> val b3 = cachedTable.bar().select(...)
> >>>>>>>> val c1 = a.select(…)
> >>>>>>>> val c2 = a.foo().select(…)
> >>>>>>>> val c3 = a.bar().select(...)
> >>>>>>>> }
> >>>>>>>>
> >>>>>>>> strongly depends on interleaved calls which trigger the execution
> of
> >>>>> sub
> >>>>>>>> queries. So for example, if there is only a single env.execute
> call
> >>> at
> >>>>>>> the
> >>>>>>>> end of the block, then b1, b2, b3, c1, c2 and c3 would all be
> computed
> >>> by
> >>>>>>>> reading directly from the sources (given that there is only a
> single
> >>>>>>>> JobGraph). It just happens that the result of `a` will be cached
> >>> such
> >>>>>>> that
> >>>>>>>> we skip the processing of `a` when there are subsequent queries
> >>> reading
> >>>>>>>> from `cachedTable`. If for some reason the system cannot
> materialize
> >>>>> the
> >>>>>>>> table (e.g. running out of disk space, ttl expired), then it could
> >>> also
> >>>>>>>> happen that we need to reprocess `a`. In that sense `cachedTable`
> >>>>> simply
> >>>>>>> is
> >>>>>>>> an identifier for the materialized result of `a`, together with the
> >>>>>>>> lineage describing how to reprocess it.
> >>>>>>>>
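> >>>>>>>> In pseudocode, a read of `cachedTable` would then behave like the
> >>>>>>>> following sketch (the names are illustrative, not an actual API):
> >>>>>>>>
> >>>>>>>> if (cacheStorage.isAvailable(cachedTable.id())) {
> >>>>>>>>   scan(cacheStorage.get(cachedTable.id())); // skip reprocessing `a`
> >>>>>>>> } else {
> >>>>>>>>   execute(cachedTable.lineage()); // e.g. ttl expired: recompute `a`
> >>>>>>>> }
> >>>>>>>>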
> >>>>>>>> Cheers,
> >>>>>>>> Till
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On Tue, Dec 11, 2018 at 11:01 AM Piotr Nowojski <
> >>>>> piotr@data-artisans.com
> >>>>>>>>
> >>>>>>>> wrote:
> >>>>>>>>
> >>>>>>>>> Hi Becket,
> >>>>>>>>>
> >>>>>>>>>> {
> >>>>>>>>>> val cachedTable = a.cache()
> >>>>>>>>>> val b = cachedTable.select(...)
> >>>>>>>>>> val c = a.select(...)
> >>>>>>>>>> }
> >>>>>>>>>>
> >>>>>>>>>> Semantic 1. b uses cachedTable as user demanded so. c uses
> >>> original
> >>>>>>> DAG
> >>>>>>>>> as
> >>>>>>>>>> user demanded so. In this case, the optimizer has no chance to
> >>>>>>>> optimize.
> >>>>>>>>>> Semantic 2. b uses cachedTable as user demanded so. c leaves the
> >>>>>>>>> optimizer
> >>>>>>>>>> to choose whether the cache or DAG should be used. In this case,
> >>> user
> >>>>>>>>> lose
> >>>>>>>>>> the option to NOT use cache.
> >>>>>>>>>>
> >>>>>>>>>> As you can see, neither of the options seem perfect. However, I
> >>> guess
> >>>>>>>> you
> >>>>>>>>>> and Till are proposing the third option:
> >>>>>>>>>>
> >>>>>>>>>> Semantic 3. b leaves the optimizer to choose whether cache or
> DAG
> >>>>>>>> should
> >>>>>>>>> be
> >>>>>>>>>> used. c always use the DAG.
> >>>>>>>>>
> >>>>>>>>> I am pretty sure that me, Till, Fabian and others were all proposing
> >>>>>>>>> and advocating in favour of semantic “1”. No cost-based optimiser
> >>>>>>>>> decisions at all.
> >>>>>>>>>
> >>>>>>>>> {
> >>>>>>>>> val cachedTable = a.cache()
> >>>>>>>>> val b1 = cachedTable.select(…)
> >>>>>>>>> val b2 = cachedTable.foo().select(…)
> >>>>>>>>> val b3 = cachedTable.bar().select(...)
> >>>>>>>>> val c1 = a.select(…)
> >>>>>>>>> val c2 = a.foo().select(…)
> >>>>>>>>> val c3 = a.bar().select(...)
> >>>>>>>>> }
> >>>>>>>>>
> >>>>>>>>> All b1, b2 and b3 are reading from cache, while c1, c2 and c3 are
> >>>>>>>>> re-executing the whole plan for “a”.
> >>>>>>>>>
> >>>>>>>>> In the future we could discuss going one step further,
> introducing
> >>>>> some
> >>>>>>>>> global optimisation (that can be manually enabled/disabled):
> >>>>>>> deduplicate
> >>>>>>>>> plan nodes/deduplicate sub queries/re-use sub queries results/or
> >>>>>>> whatever
> >>>>>>>>> we could call it. It could do two things:
> >>>>>>>>>
> >>>>>>>>> 1. Automatically try to deduplicate fragments of the plan and
> share
> >>>>> the
> >>>>>>>>> result using CachedTable - in other words automatically insert
> >>>>>>>> `CachedTable
> >>>>>>>>> cache()` calls.
> >>>>>>>>> 2. Automatically make decision to bypass explicit `CachedTable`
> >>> access
> >>>>>>>>> (this would be the equivalent of what you described as “semantic
> >>> 3”).
> >>>>>>>>>
> >>>>>>>>> However as I wrote previously, I have big doubts if such
> cost-based
> >>>>>>>>> optimisation would work (this applies also to “Semantic 2”). I
> >>> would
> >>>>>>>> expect
> >>>>>>>>> it to do more harm than good in so many cases, that it wouldn’t
> >>> make
> >>>>>>>> sense.
> >>>>>>>>> Even assuming that we calculate statistics perfectly (this ain’t
> >>>>>>>>> gonna happen), it’s virtually impossible to correctly estimate the
> >>>>>>>>> exchange rate of CPU cycles vs IO operations, as it changes so much
> >>>>>>>>> from deployment to deployment.
> >>>>>>>>>
> >>>>>>>>> Is this the core of our disagreement here? That you would like
> this
> >>>>>>>>> “cache()” to be mostly hint for the optimiser?
> >>>>>>>>>
> >>>>>>>>> Piotrek
> >>>>>>>>>
> >>>>>>>>>> On 11 Dec 2018, at 06:00, Becket Qin <be...@gmail.com>
> >>> wrote:
> >>>>>>>>>>
> >>>>>>>>>> Another potential concern for semantic 3 is that. In the future,
> >>> we
> >>>>>>> may
> >>>>>>>>> add
> >>>>>>>>>> automatic caching to Flink. e.g. cache the intermediate results
> at
> >>>>>>> the
> >>>>>>>>>> shuffle boundary. If our semantic is that reference to the
> >>> original
> >>>>>>>> table
> >>>>>>>>>> means skipping cache, those users may not be able to benefit
> from
> >>> the
> >>>>>>>>>> implicit cache.
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> On Tue, Dec 11, 2018 at 12:10 PM Becket Qin <
> becket.qin@gmail.com
> >>>>
> >>>>>>>>> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> Hi Piotrek,
> >>>>>>>>>>>
> >>>>>>>>>>> Thanks for the reply. Thought about it again, I might have
> >>>>>>>> misunderstood
> >>>>>>>>>>> your proposal in earlier emails. Returning a CachedTable might
> >>> not
> >>>>>>> be
> >>>>>>>> a
> >>>>>>>>> bad
> >>>>>>>>>>> idea.
> >>>>>>>>>>>
> >>>>>>>>>>> I was more concerned about the semantics and their intuitiveness
> >>>>>>>>>>> when a CachedTable is returned, i.e., if cache() returns a
> >>>>>>>>>>> CachedTable, what are the semantics in the following code:
> >>>>>>>>>>> {
> >>>>>>>>>>> val cachedTable = a.cache()
> >>>>>>>>>>> val b = cachedTable.select(...)
> >>>>>>>>>>> val c = a.select(...)
> >>>>>>>>>>> }
> >>>>>>>>>>> What is the difference between b and c? At the first glance, I
> >>> see
> >>>>>>> two
> >>>>>>>>>>> options:
> >>>>>>>>>>>
> >>>>>>>>>>> Semantic 1. b uses cachedTable as user demanded so. c uses
> >>> original
> >>>>>>>> DAG
> >>>>>>>>> as
> >>>>>>>>>>> user demanded so. In this case, the optimizer has no chance to
> >>>>>>>> optimize.
> >>>>>>>>>>> Semantic 2. b uses cachedTable as user demanded so. c leaves
> the
> >>>>>>>>> optimizer
> >>>>>>>>>>> to choose whether the cache or DAG should be used. In this
> case,
> >>>>>>> user
> >>>>>>>>> lose
> >>>>>>>>>>> the option to NOT use cache.
> >>>>>>>>>>>
> >>>>>>>>>>> As you can see, neither of the options seem perfect. However, I
> >>>>>>> guess
> >>>>>>>>> you
> >>>>>>>>>>> and Till are proposing the third option:
> >>>>>>>>>>>
> >>>>>>>>>>> Semantic 3. b leaves the optimizer to choose whether cache or
> DAG
> >>>>>>>> should
> >>>>>>>>>>> be used. c always use the DAG.
> >>>>>>>>>>>
> >>>>>>>>>>> This does address all the concerns. It is just that from an
> >>>>>>>>>>> intuitiveness perspective, I found that asking users to explicitly
> >>>>>>>>>>> use a CachedTable which the optimizer might choose to ignore is a
> >>>>>>>>>>> little weird. That was why I did not think about that semantic. But
> >>>>>>>>>>> given there is material benefit, I think this semantic is
> >>>>>>>>>>> acceptable.
> >>>>>>>>>>>
> >>>>>>>>>>> 1. If we want to let optimiser make decisions whether to use
> >>> cache
> >>>>>>> or
> >>>>>>>>> not,
> >>>>>>>>>>>> then why do we need “void cache()” method at all? Would It
> >>>>>>>> “increase”
> >>>>>>>>> the
> >>>>>>>>>>>> chance of using the cache? That’s sounds strange. What would
> be
> >>> the
> >>>>>>>>>>>> mechanism of deciding whether to use the cache or not? If we
> >>> want
> >>>>>>> to
> >>>>>>>>>>>> introduce such kind  automated optimisations of “plan nodes
> >>>>>>>>> deduplication”
> >>>>>>>>>>>> I would turn it on globally, not per table, and let the
> >>> optimiser
> >>>>>>> do
> >>>>>>>>> all of
> >>>>>>>>>>>> the work.
> >>>>>>>>>>>> 2. We do not have statistics at the moment for any use/not use
> >>>>>>> cache
> >>>>>>>>>>>> decision.
> >>>>>>>>>>>> 3. Even if we had, I would be veeerryy sceptical whether such
> >>> cost
> >>>>>>>>> based
> >>>>>>>>>>>> optimisations would work properly and I would still insist
> >>> first on
> >>>>>>>>>>>> providing explicit caching mechanism (`CachedTable cache()`)
> >>>>>>>>>>>>
> >>>>>>>>>>> We are absolutely on the same page here. An explicit cache()
> >>> method
> >>>>>>> is
> >>>>>>>>>>> necessary not only because optimizer may not be able to make
> the
> >>>>>>> right
> >>>>>>>>>>> decision, but also because of the nature of interactive
> >>> programming.
> >>>>>>>> For
> >>>>>>>>>>> example, if users write the following code in Scala shell:
> >>>>>>>>>>> val b = a.select(...)
> >>>>>>>>>>> val c = b.select(...)
> >>>>>>>>>>> val d = c.select(...).writeToSink(...)
> >>>>>>>>>>> tEnv.execute()
> >>>>>>>>>>> There is no way optimizer will know whether b or c will be used
> >>> in
> >>>>>>>> later
> >>>>>>>>>>> code, unless users hint explicitly.
> >>>>>>>>>>>
> >>>>>>>>>>> At the same time I’m not sure if you have responded to our
> >>>>>>> objections
> >>>>>>>> of
> >>>>>>>>>>>> `void cache()` being implicit/having side effects, which me,
> >>> Jark,
> >>>>>>>>> Fabian,
> >>>>>>>>>>>> Till and I think also Shaoxuan are supporting.
> >>>>>>>>>>>
> >>>>>>>>>>> Is there any other side effects if we use semantic 3 mentioned
> >>>>>>> above?
> >>>>>>>>>>>
> >>>>>>>>>>> Thanks,
> >>>>>>>>>>>
> >>>>>>>>>>> Jiangjie (Becket) Qin
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> On Mon, Dec 10, 2018 at 7:54 PM Piotr Nowojski <
> >>>>>>>> piotr@data-artisans.com
> >>>>>>>>>>
> >>>>>>>>>>> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>> Hi Becket,
> >>>>>>>>>>>>
> >>>>>>>>>>>> Sorry for not responding long time.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Regarding case1.
> >>>>>>>>>>>>
> >>>>>>>>>>>> There wouldn’t be an “a.unCache()” method; I would expect only
> >>>>>>>>>>>> `cachedTableA1.dropCache()`. Dropping `cachedTableA1` wouldn’t
> >>>>>>>>>>>> affect `cachedTableA2`. Just as in any other database, dropping or
> >>>>>>>>>>>> modifying one independent table/materialised view does not affect
> >>>>>>>>>>>> others.
> >>>>>>>>>>>>
> >>>>>>>>>>>>> What I meant is that assuming there is already a cached
> table,
> >>>>>>>> ideally
> >>>>>>>>>>>> users need
> >>>>>>>>>>>>> not to specify whether the next query should read from the
> >>> cache
> >>>>>>> or
> >>>>>>>>> use
> >>>>>>>>>>>> the
> >>>>>>>>>>>>> original DAG. This should be decided by the optimizer.
> >>>>>>>>>>>>
> >>>>>>>>>>>> 1. If we want to let optimiser make decisions whether to use
> >>> cache
> >>>>>>> or
> >>>>>>>>>>>> not, then why do we need “void cache()” method at all? Would
> It
> >>>>>>>>> “increase”
> >>>>>>>>>>>> the chance of using the cache? That’s sounds strange. What
> >>> would be
> >>>>>>>> the
> >>>>>>>>>>>> mechanism of deciding whether to use the cache or not? If we
> >>> want
> >>>>>>> to
> >>>>>>>>>>>> introduce such kind  automated optimisations of “plan nodes
> >>>>>>>>> deduplication”
> >>>>>>>>>>>> I would turn it on globally, not per table, and let the
> >>> optimiser
> >>>>>>> do
> >>>>>>>>> all of
> >>>>>>>>>>>> the work.
> >>>>>>>>>>>> 2. We do not have statistics at the moment for any use/not use
> >>>>>>> cache
> >>>>>>>>>>>> decision.
> >>>>>>>>>>>> 3. Even if we had, I would be veeerryy sceptical whether such
> >>> cost
> >>>>>>>>> based
> >>>>>>>>>>>> optimisations would work properly and I would still insist
> >>> first on
> >>>>>>>>>>>> providing explicit caching mechanism (`CachedTable cache()`)
> >>>>>>>>>>>> 4. As Till wrote, having explicit `CachedTable cache()`
> doesn’t
> >>>>>>>>>>>> contradict future work on automated cost based caching.
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> At the same time I’m not sure if you have responded to our
> >>>>>>> objections
> >>>>>>>>> of
> >>>>>>>>>>>> `void cache()` being implicit/having side effects, which me,
> >>> Jark,
> >>>>>>>>> Fabian,
> >>>>>>>>>>>> Till and I think also Shaoxuan are supporting.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Piotrek
> >>>>>>>>>>>>
> >>>>>>>>>>>>> On 5 Dec 2018, at 12:42, Becket Qin <be...@gmail.com>
> >>> wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Hi Till,
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> It is true that after the first job submission, there will be
> >>> no
> >>>>>>>>>>>> ambiguity
> >>>>>>>>>>>>> in terms of whether a cached table is used or not. That is
> the
> >>>>>>> same
> >>>>>>>>> for
> >>>>>>>>>>>> the
> >>>>>>>>>>>>> cache() without returning a CachedTable.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Conceptually one could think of cache() as introducing a
> >>> caching
> >>>>>>>>>>>> operator
> >>>>>>>>>>>>>> from which you need to consume from if you want to benefit
> >>> from
> >>>>>>> the
> >>>>>>>>>>>> caching
> >>>>>>>>>>>>>> functionality.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> I am thinking a little differently. I think it is a hint (as
> >>> you
> >>>>>>>>>>>> mentioned
> >>>>>>>>>>>>> later) instead of a new operator. I'd like to be careful
> about
> >>> the
> >>>>>>>>>>>> semantic
> >>>>>>>>>>>>> of the API. A hint is a property set on an existing operator,
> >>> but
> >>>>>>> is
> >>>>>>>>> not
> >>>>>>>>>>>>> itself an operator as it does not really manipulate the data.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> I agree, ideally the optimizer makes this kind of decision
> >>> which
> >>>>>>>>>>>>>> intermediate result should be cached. But especially when
> >>>>>>> executing
> >>>>>>>>>>>> ad-hoc
> >>>>>>>>>>>>>> queries the user might better know which results need to be
> >>>>>>> cached
> >>>>>>>>>>>> because
> >>>>>>>>>>>>>> Flink might not see the full DAG. In that sense, I would
> >>> consider
> >>>>>>>> the
> >>>>>>>>>>>>>> cache() method as a hint for the optimizer. Of course, in
> the
> >>>>>>>> future
> >>>>>>>>> we
> >>>>>>>>>>>>>> might add functionality which tries to automatically cache
> >>>>>>> results
> >>>>>>>>>>>> (e.g.
> >>>>>>>>>>>>>> caching the latest intermediate results until so and so much
> >>>>>>> space
> >>>>>>>> is
> >>>>>>>>>>>>>> used). But this should hopefully not contradict with
> >>> `CachedTable
> >>>>>>>>>>>> cache()`.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> I agree that cache() method is needed for exactly the reason
> >>> you
> >>>>>>>>>>>> mentioned,
> >>>>>>>>>>>>> i.e. Flink cannot predict what users are going to write
> later,
> >>> so
> >>>>>>>>> users
> >>>>>>>>>>>>> need to tell Flink explicitly that this table will be used
> >>> later.
> >>>>>>>>> What I
> >>>>>>>>>>>>> meant is that assuming there is already a cached table,
> ideally
> >>>>>>>> users
> >>>>>>>>>>>> need
> >>>>>>>>>>>>> not to specify whether the next query should read from the
> >>> cache
> >>>>>>> or
> >>>>>>>>> use
> >>>>>>>>>>>> the
> >>>>>>>>>>>>> original DAG. This should be decided by the optimizer.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> To explain the difference between returning / not returning a
> >>>>>>>>>>>> CachedTable,
> >>>>>>>>>>>>> I want to compare the following two cases:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> *Case 1:  returning a CachedTable*
> >>>>>>>>>>>>> b = a.map(...)
> >>>>>>>>>>>>> val cachedTableA1 = a.cache()
> >>>>>>>>>>>>> val cachedTableA2 = a.cache()
> >>>>>>>>>>>>> b.print() // Just to make sure a is cached.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> c = a.filter(...) // User specify that the original DAG is
> >>> used?
> >>>>>>> Or
> >>>>>>>>> the
> >>>>>>>>>>>>> optimizer decides whether DAG or cache should be used?
> >>>>>>>>>>>>> d = cachedTableA1.filter() // User specify that the cached
> >>> table
> >>>>>>> is
> >>>>>>>>>>>> used.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> a.unCache() // Can cachedTableA still be used afterwards?
> >>>>>>>>>>>>> cachedTableA1.uncache() // Can cachedTableA2 still be used?
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> *Case 2: not returning a CachedTable*
> >>>>>>>>>>>>> b = a.map()
> >>>>>>>>>>>>> a.cache()
> >>>>>>>>>>>>> a.cache() // no-op
> >>>>>>>>>>>>> b.print() // Just to make sure a is cached
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> c = a.filter(...) // Optimizer decides whether the cache or
> DAG
> >>>>>>>> should
> >>>>>>>>>>>> be
> >>>>>>>>>>>>> used
> >>>>>>>>>>>>> d = a.filter(...) // Optimizer decides whether the cache or
> DAG
> >>>>>>>> should
> >>>>>>>>>>>> be
> >>>>>>>>>>>>> used
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> a.unCache()
> >>>>>>>>>>>>> a.unCache() // no-op
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> In case 1, semantics-wise, the optimizer loses the option to
> >>>>>>>>>>>>> choose between DAG and cache. And the unCache() call becomes
> >>>>>>>>>>>>> tricky.
> >>>>>>>>>>>>> In case 2, users do not need to worry about whether the cache or
> >>>>>>>>>>>>> the DAG is used. And the unCache() semantic is clear. However,
> >>>>>>>>>>>>> the caveat is that users cannot explicitly ignore the cache.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> In order to address the issues mentioned in case 2, and inspired
> >>>>>>>>>>>>> by the discussion so far, I am thinking about using a hint to
> >>>>>>>>>>>>> allow users to explicitly ignore the cache. Although we do not
> >>>>>>>>>>>>> have hints yet, we probably should have one. So the code becomes:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> *Case 3: returning this table*
> >>>>>>>>>>>>> b = a.map()
> >>>>>>>>>>>>> a.cache()
> >>>>>>>>>>>>> a.cache() // no-op
> >>>>>>>>>>>>> b.print() // Just to make sure a is cached
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> c = a.filter(...) // Optimizer decides whether the cache or
> DAG
> >>>>>>>> should
> >>>>>>>>>>>> be
> >>>>>>>>>>>>> used
> >>>>>>>>>>>>> d = a.hint("ignoreCache").filter(...) // DAG will be used
> >>> instead
> >>>>>>> of
> >>>>>>>>> the
> >>>>>>>>>>>>> cache.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> a.unCache()
> >>>>>>>>>>>>> a.unCache() // no-op
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> We could also let cache() return this table to allow chained
> >>>>>>> method
> >>>>>>>>>>>> calls.
> >>>>>>>>>>>>> Do you think this API addresses the concerns?
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Jiangjie (Becket) Qin
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> On Wed, Dec 5, 2018 at 10:55 AM Jark Wu <im...@gmail.com>
> >>> wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> Hi,
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> All the recent discussions are focused on whether there is a
> >>>>>>>>>>>>>> problem if cache() does not return a Table.
> >>>>>>>>>>>>>> It seems that returning a Table explicitly is more clear (and
> >>>>>>>>>>>>>> safer?).
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> So are there any problems if cache() returns a Table? @Becket
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Best,
> >>>>>>>>>>>>>> Jark
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> On Tue, 4 Dec 2018 at 22:27, Till Rohrmann <
> >>> trohrmann@apache.org
> >>>>>>>>
> >>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> It's true that b, c, d and e will all read from the
> original
> >>> DAG
> >>>>>>>>> that
> >>>>>>>>>>>>>>> generates a. But all subsequent operators (when running
> >>> multiple
> >>>>>>>>>>>> queries)
> >>>>>>>>>>>>>>> which reference cachedTableA should not need to reproduce
> `a`
> >>>>>>> but
> >>>>>>>>>>>>>> directly
> >>>>>>>>>>>>>>> consume the intermediate result.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Conceptually one could think of cache() as introducing a
> >>> caching
> >>>>>>>>>>>> operator
> >>>>>>>>>>>>>>> from which you need to consume from if you want to benefit
> >>> from
> >>>>>>>> the
> >>>>>>>>>>>>>> caching
> >>>>>>>>>>>>>>> functionality.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> I agree, ideally the optimizer makes this kind of decision
> >>> which
> >>>>>>>>>>>>>>> intermediate result should be cached. But especially when
> >>>>>>>> executing
> >>>>>>>>>>>>>> ad-hoc
> >>>>>>>>>>>>>>> queries the user might better know which results need to be
> >>>>>>> cached
> >>>>>>>>>>>>>> because
> >>>>>>>>>>>>>>> Flink might not see the full DAG. In that sense, I would
> >>>>>>> consider
> >>>>>>>>> the
> >>>>>>>>>>>>>>> cache() method as a hint for the optimizer. Of course, in
> the
> >>>>>>>> future
> >>>>>>>>>>>> we
> >>>>>>>>>>>>>>> might add functionality which tries to automatically cache
> >>>>>>> results
> >>>>>>>>>>>> (e.g.
> >>>>>>>>>>>>>>> caching the latest intermediate results until so and so
> much
> >>>>>>> space
> >>>>>>>>> is
> >>>>>>>>>>>>>>> used). But this should hopefully not contradict with
> >>>>>>> `CachedTable
> >>>>>>>>>>>>>> cache()`.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Cheers,
> >>>>>>>>>>>>>>> Till
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> On Tue, Dec 4, 2018 at 2:33 PM Becket Qin <
> >>> becket.qin@gmail.com
> >>>>>>>>
> >>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Hi Till,
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Thanks for the clarification. I am still a little
> confused.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> If cache() returns a CachedTable, the example might
> become:
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> b = a.map(...)
> >>>>>>>>>>>>>>>> c = a.map(...)
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> cachedTableA = a.cache()
> >>>>>>>>>>>>>>>> d = cachedTableA.map(...)
> >>>>>>>>>>>>>>>> e = a.map()
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> In the above case, if cache() is lazily evaluated, b, c, d and
> >>>>>>>>>>>>>>>> e are all going to be reading from the original DAG that
> >>>>>>>>>>>>>>>> generates a. But with a naive expectation, d should be reading
> >>>>>>>>>>>>>>>> from the cache. This does not seem to solve the potential
> >>>>>>>>>>>>>>>> confusion you raised, right?
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Just to be clear, my understanding is all based on the
> >>>>>>>>>>>>>>>> assumption that the tables are immutable. Therefore, after
> >>>>>>>>>>>>>>>> a.cache(), the *cachedTableA* and the original table *a* should
> >>>>>>>>>>>>>>>> be completely interchangeable.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> That said, I think a valid argument is optimization. There
> >>> are
> >>>>>>>>> indeed
> >>>>>>>>>>>>>>> cases
> >>>>>>>>>>>>>>>> that reading from the original DAG could be faster than
> >>> reading
> >>>>>>>>> from
> >>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>> cache. For example, in the following example:
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> a.filter('f1 > 100)
> >>>>>>>>>>>>>>>> a.cache()
> >>>>>>>>>>>>>>>> b = a.filter('f1 < 100)
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Ideally the optimizer should be intelligent enough to
> decide
> >>>>>>>> which
> >>>>>>>>>>>> way
> >>>>>>>>>>>>>> is
> >>>>>>>>>>>>>>>> faster, without user intervention. In this case, it will
> >>>>>>> identify
> >>>>>>>>>>>> that
> >>>>>>>>>>>>>> b
> >>>>>>>>>>>>>>>> would just be an empty table, thus skip reading from the
> >>> cache
> >>>>>>>>>>>>>>> completely.
> >>>>>>>>>>>>>>>> But I agree that returning a CachedTable would give user
> the
> >>>>>>>>> control
> >>>>>>>>>>>> of
> >>>>>>>>>>>>>>>> when to use cache, even though I still feel that letting
> the
> >>>>>>>>>>>> optimizer
> >>>>>>>>>>>>>>>> handle this is a better option in long run.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Jiangjie (Becket) Qin
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> On Tue, Dec 4, 2018 at 6:51 PM Till Rohrmann <
> >>>>>>>> trohrmann@apache.org
> >>>>>>>>>>
> >>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Yes you are right Becket that it still depends on the
> >>> actual
> >>>>>>>>>>>>>> execution
> >>>>>>>>>>>>>>> of
> >>>>>>>>>>>>>>>>> the job whether a consumer reads from a cached result or
> >>> not.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> My point was actually about the properties of a (cached
> vs.
> >>>>>>>>>>>>>> non-cached)
> >>>>>>>>>>>>>>>> and
> >>>>>>>>>>>>>>>>> not about the execution. I would not make cache trigger
> the
> >>>>>>>>>>>> execution
> >>>>>>>>>>>>>>> of
> >>>>>>>>>>>>>>>>> the job because one loses some flexibility by eagerly
> >>>>>>> triggering
> >>>>>>>>> the
> >>>>>>>>>>>>>>>>> execution.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> I tried to argue for an explicit CachedTable which is
> >>> returned
> >>>>>>>> by
> >>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>> cache() method like Piotr did in order to make the API
> more
> >>>>>>>>>>>> explicit.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Cheers,
> >>>>>>>>>>>>>>>>> Till
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> On Mon, Dec 3, 2018 at 4:23 PM Becket Qin <
> >>>>>>> becket.qin@gmail.com
> >>>>>>>>>
> >>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Hi Till,
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> That is a good example. Just a minor correction, in this
> >>>>>>> case,
> >>>>>>>>> b, c
> >>>>>>>>>>>>>>>> and d
> >>>>>>>>>>>>>>>>>> will all consume from a non-cached a. This is because
> >>> cache
> >>>>>>>> will
> >>>>>>>>>>>>>> only
> >>>>>>>>>>>>>>>> be
> >>>>>>>>>>>>>>>>>> created on the very first job submission that generates
> >>> the
> >>>>>>>> table
> >>>>>>>>>>>>>> to
> >>>>>>>>>>>>>>> be
> >>>>>>>>>>>>>>>>>> cached.
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> If I understand correctly, this example is about whether the
> >>>>>>>>>>>>>>>>>> .cache() method should be eagerly evaluated or lazily
> >>>>>>>>>>>>>>>>>> evaluated. In other words, if the cache() method actually
> >>>>>>>>>>>>>>>>>> triggers a job that creates the cache, there will be no such
> >>>>>>>>>>>>>>>>>> confusion. Is that right?
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> In the example, although d will not consume from the cached
> >>>>>>>>>>>>>>>>>> Table while it looks like it is supposed to, from a
> >>>>>>>>>>>>>>>>>> correctness perspective the code will still return a correct
> >>>>>>>>>>>>>>>>>> result, assuming that tables are immutable.
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Personally I feel it is OK because users probably won't
> >>>>>>> really
> >>>>>>>>>>>>>> worry
> >>>>>>>>>>>>>>>>> about
> >>>>>>>>>>>>>>>>>> whether the table is cached or not. And lazy cache could
> >>>>>>> avoid
> >>>>>>>>> some
> >>>>>>>>>>>>>>>>>> unnecessary caching if a cached table is never created
> in
> >>> the
> >>>>>>>>> user
> >>>>>>>>>>>>>>>>>> application. But I am not opposed to do eager evaluation
> >>> of
> >>>>>>>>> cache.
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Jiangjie (Becket) Qin
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> On Mon, Dec 3, 2018 at 10:01 PM Till Rohrmann <
> >>>>>>>>>>>>>> trohrmann@apache.org>
> >>>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Another argument for Piotr's point is that lazily
> >>> changing
> >>>>>>>>>>>>>>> properties
> >>>>>>>>>>>>>>>>> of
> >>>>>>>>>>>>>>>>>> a
> >>>>>>>>>>>>>>>>>>> node affects all down stream consumers but does not
> >>>>>>>> necessarily
> >>>>>>>>>>>>>>> have
> >>>>>>>>>>>>>>>> to
> >>>>>>>>>>>>>>>>>>> happen before these consumers are defined. From a
> user's
> >>>>>>>>>>>>>>> perspective
> >>>>>>>>>>>>>>>>> this
> >>>>>>>>>>>>>>>>>>> can be quite confusing:
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> b = a.map(...)
> >>>>>>>>>>>>>>>>>>> c = a.map(...)
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> a.cache()
> >>>>>>>>>>>>>>>>>>> d = a.map(...)
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> now b, c and d will consume from a cached operator. In
> >>> this
> >>>>>>>>> case,
> >>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>> user
> >>>>>>>>>>>>>>>>>>> would most likely expect that only d reads from a
> cached
> >>>>>>>> result.
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Cheers,
> >>>>>>>>>>>>>>>>>>> Till
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> On Mon, Dec 3, 2018 at 11:32 AM Piotr Nowojski <
> >>>>>>>>>>>>>>>>> piotr@data-artisans.com>
> >>>>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> Hey Shaoxuan and Becket,
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> Can you explain a bit more one what are the side
> >>> effects?
> >>>>>>> So
> >>>>>>>>>>>>>>> far
> >>>>>>>>>>>>>>>> my
> >>>>>>>>>>>>>>>>>>>>> understanding is that such side effects only exist
> if a
> >>>>>>>> table
> >>>>>>>>>>>>>>> is
> >>>>>>>>>>>>>>>>>>> mutable.
> >>>>>>>>>>>>>>>>>>>>> Is that the case?
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> Not only that. There are also performance implications
> >>> and
> >>>>>>>>>>>>>> those
> >>>>>>>>>>>>>>>> are
> >>>>>>>>>>>>>>>>>>>> another implicit side effects of using `void cache()`.
> >>> As I
> >>>>>>>>>>>>>> wrote
> >>>>>>>>>>>>>>>>>> before,
> >>>>>>>>>>>>>>>>>>>> reading from cache might not always be desirable, thus
> >>> it
> >>>>>>> can
> >>>>>>>>>>>>>>> cause
> >>>>>>>>>>>>>>>>>>>> performance degradation and I’m fine with that -
> user's
> >>> or
> >>>>>>>>>>>>>>>>> optimiser’s
> >>>>>>>>>>>>>>>>>>>> choice. What I do not like is that this implicit side
> >>>>>>> effect
> >>>>>>>>>>>>>> can
> >>>>>>>>>>>>>>>>>> manifest
> >>>>>>>>>>>>>>>>>>>> in completely different part of code, that wasn’t
> >>> touched
> >>>>>>> by
> >>>>>>>> a
> >>>>>>>>>>>>>>> user
> >>>>>>>>>>>>>>>>>> while
> >>>>>>>>>>>>>>>>>>>> he was adding `void cache()` call somewhere else. And
> >>> even
> >>>>>>> if
> >>>>>>>>>>>>>>>> caching
> >>>>>>>>>>>>>>>>>>>> improves performance, it’s still a side effect of
> `void
> >>>>>>>>>>>>>> cache()`.
> >>>>>>>>>>>>>>>>>> Almost
> >>>>>>>>>>>>>>>>>>>> from the definition `void` methods have only side
> >>> effects.
> >>>>>>>> As I
> >>>>>>>>>>>>>>>> wrote
> >>>>>>>>>>>>>>>>>>>> before, there are couple of scenarios where this might
> >>> be
> >>>>>>>>>>>>>>>> undesirable
> >>>>>>>>>>>>>>>>>>>> and/or unexpected, for example:
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> 1.
> >>>>>>>>>>>>>>>>>>>> Table b = …;
> >>>>>>>>>>>>>>>>>>>> b.cache()
> >>>>>>>>>>>>>>>>>>>> x = b.join(…)
> >>>>>>>>>>>>>>>>>>>> y = b.count()
> >>>>>>>>>>>>>>>>>>>> // ...
> >>>>>>>>>>>>>>>>>>>> // 100
> >>>>>>>>>>>>>>>>>>>> // hundred
> >>>>>>>>>>>>>>>>>>>> // lines
> >>>>>>>>>>>>>>>>>>>> // of
> >>>>>>>>>>>>>>>>>>>> // code
> >>>>>>>>>>>>>>>>>>>> // later
> >>>>>>>>>>>>>>>>>>>> z = b.filter(…).groupBy(…) // this might be even
> hidden
> >>> in
> >>>>>>> a
> >>>>>>>>>>>>>>>>> different
> >>>>>>>>>>>>>>>>>>>> method/file/package/dependency
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> 2.
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> Table b = ...
> >>>>>>>>>>>>>>>>>>>> if (some_condition) {
> >>>>>>>>>>>>>>>>>>>> foo(b)
> >>>>>>>>>>>>>>>>>>>> } else {
> >>>>>>>>>>>>>>>>>>>> bar(b)
> >>>>>>>>>>>>>>>>>>>> }
> >>>>>>>>>>>>>>>>>>>> z = b.filter(…).groupBy(…)
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> void foo(Table b) {
> >>>>>>>>>>>>>>>>>>>> b.cache()
> >>>>>>>>>>>>>>>>>>>> // do something with b
> >>>>>>>>>>>>>>>>>>>> }
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> In both above examples, `b.cache()` will implicitly
> >>> affect
> >>>>>>>>>>>>>>>> (semantic
> >>>>>>>>>>>>>>>>>> of a
> >>>>>>>>>>>>>>>>>>>> program in case of sources being mutable and
> >>> performance)
> >>>>>>> `z
> >>>>>>>> =
> >>>>>>>>>>>>>>>>>>>> b.filter(…).groupBy(…)` which might be far from
> obvious.
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> On top of that, there is still this argument of mine
> >>> that
> >>>>>>>>>>>>>> having
> >>>>>>>>>>>>>>> a
> >>>>>>>>>>>>>>>>>>>> `MaterializedTable` or `CachedTable` handle is more
> >>>>>>> flexible
> >>>>>>>>>>>>>> for
> >>>>>>>>>>>>>>> us
> >>>>>>>>>>>>>>>>> for
> >>>>>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>>> future and for the user (as a manual option to bypass
> >>> cache
> >>>>>>>>>>>>>>> reads).
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> But Jiangjie is correct,
> >>>>>>>>>>>>>>>>>>>>> the source table in batching should be immutable. It
> is
> >>>>>>> the
> >>>>>>>>>>>>>>>> user’s
> >>>>>>>>>>>>>>>>>>>>> responsibility to ensure it, otherwise even a regular
> >>>>>>>>>>>>>> failover
> >>>>>>>>>>>>>>>> may
> >>>>>>>>>>>>>>>>>> lead
> >>>>>>>>>>>>>>>>>>>>> to inconsistent results.
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> Yes, I agree that’s what perfect world/good deployment
> >>>>>>> should
> >>>>>>>>>>>>>> be.
> >>>>>>>>>>>>>>>> But
> >>>>>>>>>>>>>>>>>> its
> >>>>>>>>>>>>>>>>>>>> often isn’t and while I’m not trying to fix this
> (since
> >>> the
> >>>>>>>>>>>>>>> proper
> >>>>>>>>>>>>>>>>> fix
> >>>>>>>>>>>>>>>>>> is
> >>>>>>>>>>>>>>>>>>>> to support transactions), I’m just trying to minimise
> >>>>>>>> confusion
> >>>>>>>>>>>>>>> for
> >>>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>>> users that are not fully aware what’s going on and
> >>> operate
> in a less than perfect setup. And if something bites them after adding the
> `b.cache()` call, to make sure that they at least know all of the places
> that adding this line can affect.
> 
> Thanks, Piotrek
> 
>> On 1 Dec 2018, at 15:39, Becket Qin <becket.qin@gmail.com> wrote:
>> 
>> Hi Piotrek,
>> 
>> Thanks again for the clarification. Some more replies are following.
>> 
>>> But keep in mind that `.cache()` will/might not only be used in
>>> interactive programming and not only in batching.
>> 
>> It is true. Actually, in stream processing, cache() has the same semantic
>> as in batch processing. The semantic is the following:
>> For a table created via a series of computations, save that table for
>> later reference to avoid running the computation logic again to
>> regenerate the table. Once the application exits, drop all the caches.
>> This semantic is the same for both batch and stream processing. The
>> difference is that stream applications will only run once, as they are
>> long running. And batch applications may be run multiple times, hence
>> the cache may be created and dropped each time the application runs.
>> Admittedly, there will probably be some resource management requirements
>> for the streaming cached table, such as time-based / size-based
>> retention, to address the infinite data issue. But such a requirement
>> does not change the semantic.
>> You are right that interactive programming is just one use case of
>> cache(). It is not the only use case.
>> 
>>> For me the more important issue is of not having the `void cache()`
>>> with side effects.
>> 
>> This is indeed the key point. The argument around whether cache() should
>> return something already indicates that cache() and materialize()
>> address different issues.
>> Can you explain a bit more on what the side effects are? So far my
>> understanding is that such side effects only exist if a table is
>> mutable. Is that the case?
>> 
>>> I don’t know, probably initially we should make CachedTable read-only.
>>> I don’t find it more confusing than the fact that a user can not write
>>> to views or materialised views in SQL, or that a user currently can not
>>> write to a Table.
>> 
>> I don't think anyone should insert something into a cache. By definition,
>> the cache should only be updated when the corresponding original table
>> is updated. What I am wondering about is that, given the following two
>> facts:
>> 1. If and only if a table is mutable (with something like insert()), a
>> CachedTable may have implicit behavior.
>> 2. A CachedTable extends a Table.
>> We can come to the conclusion that a CachedTable is mutable and users
>> can insert into the CachedTable directly. This is where I thought it
>> was confusing.
>> 
>> Thanks,
>> 
>> Jiangjie (Becket) Qin
>> 
>> On Sat, Dec 1, 2018 at 2:45 AM Piotr Nowojski <piotr@data-artisans.com>
>> wrote:
>> 
>>> Hi all,
>>> 
>>> Regarding naming `cache()` vs `materialize()`. One more explanation of
>>> why `materialize()` is more natural to me is that I think of all
>>> “Table”s in the Table-API as views. They behave the same way as SQL
>>> views; the only difference for me is that their live scope is short -
>>> the current session, which is limited by the different execution model.
>>> That’s why “caching” a view for me is just materialising it.
>>> 
>>> However, I see and I understand your point of view. Coming from
>>> DataSet/DataStream and, generally speaking, the non-SQL world,
>>> `cache()` is more natural. But keep in mind that `.cache()` will/might
>>> not only be used in interactive programming and not only in batching.
>>> But naming is one issue, and not that critical to me. Especially since,
>>> once we implement proper materialised views, we can always
>>> deprecate/rename `cache()` if we deem so.
>>> 
>>> For me the more important issue is of not having the `void cache()`
>>> with side effects. Exactly for the reasons that you have mentioned.
>>> True: results might be non-deterministic if the underlying source
>>> tables are changing. The problem is that `void cache()` implicitly
>>> changes the semantic of subsequent uses of the cached/materialized
>>> Table. It can cause a “wtf” moment for a user if he inserts a
>>> “b.cache()” call in some place in his code and suddenly some other
>>> random places are behaving differently. If `materialize()` or `cache()`
>>> returns a Table handle, we force the user to explicitly use the cache,
>>> which removes the “random” part from the “suddenly some other random
>>> places are behaving differently”.
>>> 
>>> This argument and others that I’ve raised (greater flexibility/allowing
>>> the user to explicitly bypass the cache) are independent of the
>>> `cache()` vs `materialize()` discussion.
>>> 
>>>> Does that mean one can also insert into the CachedTable? This sounds
>>>> pretty confusing.
>>> 
>>> I don’t know, probably initially we should make CachedTable read-only.
>>> I don’t find it more confusing than the fact that a user can not write
>>> to views or materialised views in SQL, or that a user currently can not
>>> write to a Table.
>>> 
>>> Piotrek
>>> 
>>>> On 30 Nov 2018, at 17:38, Xingcan Cui <xingcanc@gmail.com> wrote:
>>>> 
>>>> Hi all,
>>>> 
>>>> I agree with @Becket that `cache()` and `materialize()` should be
>>>> considered as two different methods, where the latter one is more
>>>> sophisticated.
>>>> 
>>>> According to my understanding, the initial idea is just to introduce
>>>> a simple cache or persist mechanism, but as the TableAPI is a
>>>> high-level API, it’s natural for us to think in a SQL way.
>>>> 
>>>> Maybe we can add the `cache()` method to the DataSet API and force
>>>> users to translate a Table to a Dataset before caching it. Then the
>>>> users should manually register the cached dataset to a table again
>>>> (we may need some table replacement mechanisms for datasets with an
>>>> identical schema but different contents here). After all, it’s the
>>>> dataset rather than the dynamic table that needs to be cached, right?
>>>> 
>>>> Best,
>>>> Xingcan
>>>> 
>>>>> On Nov 30, 2018, at 10:57 AM, Becket Qin <becket.qin@gmail.com>
>>>>> wrote:
>>>>> 
>>>>> Hi Piotrek and Jark,
>>>>> 
>>>>> Thanks for the feedback and explanation. Those are good arguments.
>>>>> But I think those arguments are mostly about materialized views. Let
>>>>> me try to explain the reason I believe cache() and materialize() are
>>>>> different.
>>>>> 
>>>>> I think cache() and materialize() have quite different implications.
>>>>> An analogy I can think of is save()/publish(). When users call
>>>>> cache(), it is just like they are saving an intermediate result as a
>>>>> draft of their work; this intermediate result may not have any
>>>>> realistic meaning. Calling cache() does not mean users want to
>>>>> publish the cached table in any manner. But when users call
>>>>> materialize(), that means "I have something meaningful to be reused
>>>>> by others"; now users need to think about the validation, update &
>>>>> versioning, lifecycle of the result, etc.
>>>>> 
>>>>> Piotrek's suggestions on variations of the materialize() methods are
>>>>> very useful. It would be great if Flink had them. The concept of
>>>>> materialized views is actually a pretty big feature, not to mention
>>>>> the related stuff like the triggers/hooks you mentioned earlier. I
>>>>> think the materialized view itself should be discussed in a more
>>>>> thorough and systematic manner. And I found that discussion to be
>>>>> kind of orthogonal to, and way beyond, the interactive programming
>>>>> experience.
>>>>> 
>>>>> The example you gave was interesting. I still have some questions,
>>>>> though.
>>>>> 
>>>>>> Table source = … // some source that scans files from a directory
>>>>>> “/foo/bar/“
>>>>>> Table t1 = source.groupBy(…).select(…).where(…) ….;
>>>>>> Table t2 = t1.materialize() // (or `cache()`)
>>>>>> t2.count() // initialise cache (if it’s lazily initialised)
>>>>>> int a1 = t1.count()
>>>>>> int b1 = t2.count()
>>>>>> // something in the background (or we trigger it) writes new files
>>>>>> // to /foo/bar
>>>>>> int a2 = t1.count()
>>>>>> int b2 = t2.count()
>>>>>> t2.refresh() // possible future extension, not to be implemented
>>>>>> // in the initial version
>>>>> 
>>>>> What if someone else added some more files to /foo/bar at this
>>>>> point? In that case, a3 won't equal b3, and the result becomes
>>>>> non-deterministic, right?
>>>>> 
>>>>>> int a3 = t1.count()
>>>>>> int b3 = t2.count()
>>>>>> t2.drop() // another possible future extension, manual “cache”
>>>>>> // dropping
>>>>> 
>>>>> When we talk about interactive programming, in most cases we are
>>>>> talking about batch applications. A fundamental assumption of such a
>>>>> case is that the source data is complete before the data processing
>>>>> begins, and the data will not change during the data processing.
>>>>> IMO, if additional rows need to be added to some source during the
>>>>> processing, it should be done in ways like unioning the source with
>>>>> another table containing the rows to be added.
>>>>> 
>>>>> There are a few cases where computations are executed repeatedly on
>>>>> a changing data source.
>>>>> 
>>>>> For example, people may run an ML training job every hour with the
>>>>> samples newly added in the past hour. In that case, the source data
>>>>> between runs will indeed change. But still, the data remain
>>>>> unchanged within one run. And usually in that case, the result will
>>>>> need versioning, i.e. for a given result, it tells that the result
>>>>> is a result from the source data as of a certain timestamp.
>>>>> 
>>>>> Another example is something like a data warehouse. In this case,
>>>>> there are a few sources of original/raw data. On top of those
>>>>> sources, many materialized views / queries / reports / dashboards
>>>>> can be created to generate derived data. Those derived data need to
>>>>> be updated when the underlying original data changes. In that case,
>>>>> the processing logic that derives the data from the original sources
>>>>> needs to be executed repeatedly to update those reports/views.
>>>>> Again, all those derived data also need to ha

Re: [DISCUSS] Support Interactive Programming in Flink Table API

Posted by Piotr Nowojski <pi...@da-platform.com>.
Hi Becket!

After further thinking I tend to agree that my previous proposal (*Option 2*) indeed might not be ideal if we would in the future introduce automatic caching. However I would like to propose a slightly modified version of it:

*Option 4*

Adding a `cache()` method with the following signature:

Table Table#cache();  

Without side effects: the `cache()` call does not modify/change the original Table in any way.
It would return a copy of the original table, with an added hint for the optimizer to cache the table, so that future accesses to the returned table may (or may not) use the cache.
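
To make the “copy with an added hint” idea concrete, here is a minimal sketch of how such a side-effect-free `cache()` could be implemented (all class names below are made up for illustration; this is not actual Flink code):

```
// Minimal sketch: cache() wraps the current logical plan in a hint node
// and returns it as a NEW Table; the receiver is left untouched.
interface LogicalNode {}

final class CacheHintNode implements LogicalNode {
    final LogicalNode input;               // the plan fragment to cache
    CacheHintNode(LogicalNode input) { this.input = input; }
}

class Table {
    private final LogicalNode logicalPlan; // hypothetical plan handle

    Table(LogicalNode logicalPlan) { this.logicalPlan = logicalPlan; }

    Table cache() {
        // No mutation of `this`; only the returned copy carries the hint,
        // which the optimizer is free to honour or ignore.
        return new Table(new CacheHintNode(logicalPlan));
    }
}
```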

Assuming that we are talking about a setup where we do not have automatic caching enabled (a possible future extension).

Example #1:

```
Table a = …
a.foo() // not cached

val cachedA = a.cache();

cachedA.bar() // maybe cached
a.foo() // same as before - effectively not cached
```

Both the first and the second `a.foo()` operations would behave in exactly the same way. Again, the `a.cache()` call doesn’t affect `a` itself. Since `a` was not hinted for caching before the `a.cache()` call, both `a.foo()` calls wouldn’t use the cache.

The returned `cachedA` would carry the “cache” hint, so `cachedA.bar()` would probably go through the cache (unless the optimiser decides otherwise).

Example #2

```
Table a = …

a.foo() // not cached

val b = a.cache();

a.foo() // same as before - effectively not cached
b.foo() // maybe cached

val c = b.cache();

a.foo() // same as before - effectively not cached
b.foo() // same as before - effectively maybe cached
c.foo() // maybe cached
```

Now, assuming that we have some future “automatic caching optimisation”:

Example #3

```
env.enableAutomaticCaching()
Table a = …

a.foo() // might be cached, depending if `a` was selected to automatic caching

val b = a.cache();

a.foo() // same as before - might be cached, if `a` was selected to automatic caching
b.foo() // maybe cached
```


More or less this is the same behaviour as:

Table a = ...
val b = a.filter(x > 20)

calling `filter` hasn’t changed or altered `a` in any way. If `a` was previously filtered:

Table src = …
val a = src.filter(x > 20)
val b = a.filter(x > 20)

then yes, `a` and `b` will be the same. But the point is that neither `filter` nor `cache` changes the original `a` table.

One thing is that, indeed, the physical drop-cache operation will have side effects and will in a way mutate the cached table references. But this is, I think, unavoidable in any solution - it is the same issue as calling `.close()`, or calling a destructor in C++.
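
To illustrate that last point: if the handle returned by `cache()` additionally implemented AutoCloseable (a purely hypothetical variant, not part of this proposal), the side-effecting drop could at least be scoped explicitly, like any other resource:

```
// Hypothetical sketch only - assumes a Closeable CachedTable subtype.
try (CachedTable cachedA = a.cache()) {
    cachedA.bar();  // may be served from the cache
    a.foo();        // still uses the original DAG
}   // physical cache dropped here; `a` itself remains fully usable
```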

Piotrek

> On 7 Jan 2019, at 10:41, Becket Qin <be...@gmail.com> wrote:
> 
> Happy New Year, everybody!
> 
> I would like to resume this discussion thread. At this point, we have
> agreed on the first step goal of interactive programming. The open
> discussion is the exact API: more specifically, what the *cache()* method
> should return and what its semantics are. There are three options:
> 
> *Option 1*
> *void cache()* OR *Table cache()* which returns the original table for
> chained calls.
> *void uncache() *releases the cache.
> *Table.hint(ignoreCache).foo()* to ignore cache for operation foo().
> 
> - Semantic: a.cache() hints that table 'a' should be cached. Optimizer
> decides whether the cache will be used or not.
> - pros: simple and no confusion between CachedTable and original table
> - cons: A table may be cached / uncached in a method invocation, while the
> caller does not know about this.
> 
> *Option 2*
> *CachedTable cache()*
> *CachedTable *extends *Table *with an additional *uncache()* method
> 
> - Semantic: After *val cachedA = a.cache()*, *cachedA.foo()* will always
> use cache. *a.bar() *will always use original DAG.
> - pros: No potential side effects in method invocation.
> - cons: Optimizer has no chance to kick in. Future optimization will become
> a behavior change and will require users to change their code.
> 
> *Option 3*
> *CacheHandle cache()*
> *CacheHandle.release() *to release a cache handle on the table. If all
> cache handles are released, the cache could be removed.
> *Table.hint(ignoreCache).foo()* to ignore cache for operation foo().
> 
> - Semantic: *a.cache() *hints that 'a' should be cached. Optimizer decides
> whether the cache will be used or not. The cache is released either when no
> handle is on it, or when the user program exits.
> - pros: No potential side effect in method invocation. No confusion between
> the cached table vs. the original table.
> - cons: An additional CacheHandle exposed to the users.
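> 
> For illustration, the intended usage of option 3 would look roughly like
> this (method names as proposed above; the rest is hypothetical):
> 
> CacheHandle handle = a.cache();   // hint that 'a' should be cached
> a.foo();                          // optimizer may serve this from the cache
> a.hint(ignoreCache).bar();        // explicitly bypasses the cache
> handle.release();                 // cache is deleted once no handles remain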
> 
> 
> Personally I prefer option 3 for the following reasons:
> 1. It is simple. The vast majority of users would just call *a.cache()*
> followed by *a.foo()*, *a.bar()*, etc.
> 2. There is no semantic ambiguity and semantic change if we decide to add
> implicit cache in the future.
> 3. There is no side effect in the method calls.
> 4. Admittedly we need to expose one more CacheHandle class to the users.
> But it is not that difficult to understand, given similar well-known
> concepts like ref counting (we can name it CacheReference if that is easier
> to understand). So I think it is fine.
> 
> 
> Thanks,
> 
> Jiangjie (Becket) Qin
> 
> 
> On Thu, Dec 13, 2018 at 11:23 AM Becket Qin <be...@gmail.com> wrote:
> 
>> Hi Piotrek,
>> 
>> 1. Regarding optimization.
>> Sure there are many cases where the decision is hard to make. But that does
>> not make it any easier for the users to make those decisions. I imagine 99%
>> of the users would just naively use cache. I am not saying we can optimize
>> in all the cases. But as long as we agree that at least in certain cases (I
>> would argue most cases), the optimizer can do a little better than an
>> average user who likely knows little about Flink internals, we should not
>> push the burden of optimization to users.
>> 
>> BTW, it seems some of your concerns are related to the implementation. I
>> did not mention the implementation of the caching service because that
>> should not affect the API semantics. Not sure if this helps, but imagine
>> the default implementation has one StorageNode service colocated with each
>> TM. It could be running within the TM process or in a standalone process,
>> depending on configuration.
>> 
>> The StorageNode uses a memory + spill-to-disk mechanism. The cached data
>> will just be written to the local StorageNode service. If the StorageNode
>> is running within the TM process, the in-memory cache could just be
>> objects, so we save some serde cost. A later job referring to the cached
>> Table will be scheduled in a locality-aware manner, i.e. run in the TM
>> whose peer StorageNode hosts the data.
>> 
>> 
>> 2. Semantic
>> I am not sure why introducing a new hintCache() or
>> env.enableAutomaticCaching() method would avoid the consequence of semantic
>> change.
>> 
>> If the auto optimization is not enabled by default, users still need to
>> make code change to all existing programs in order to get the benefit.
>> If the auto optimization is enabled by default, advanced users who know
>> that they really want to use cache will suddenly lose the opportunity to do
>> so, unless they change the code to disable auto optimization.
>> 
>> 
>> 3. side effect
>> The CacheHandle is not only for where to put uncache(). It is to solve the
>> implicit performance impact by moving the uncache() to the CacheHandle.
>> 
>>   - If users want to leverage the cache, they can call a.cache(). After
>>   that, unless the user explicitly releases that CacheHandle, a.foo() will
>>   always leverage the cache if needed (the optimizer may choose to ignore
>>   the cache if that helps accelerate the process). Any function call will
>>   not be able to release the cache because it does not have that
>>   CacheHandle.
>>   - If some advanced users do not want to use the cache at all, they will
>>   call a.hint(ignoreCache).foo(). This will for sure ignore the cache and
>>   use the original DAG to process.
>> 
>> 
>>> In vast majority of the cases, users wouldn't really care whether the
>>> cache is used or not.
>>> I wouldn’t agree with that, because “caching” (if not purely in memory
>>> caching) would add additional IO costs. It’s similar to saying that users
>>> would not see a difference between Spark/Flink and MapReduce (MapReduce
>>> writes data to disks after every map/reduce stage).
>> 
>> What I wanted to say is that in most cases, after users call cache(), they
>> don't really care about whether auto optimization has decided to ignore the
>> cache or not, as long as the program runs faster.
>> 
>> Thanks,
>> 
>> Jiangjie (Becket) Qin
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> On Wed, Dec 12, 2018 at 10:50 PM Piotr Nowojski <pi...@data-artisans.com>
>> wrote:
>> 
>>> Hi,
>>> 
>>> Thanks for the quick answer :)
>>> 
>>> Re 1.
>>> 
>>> I generally agree with you, however couple of points:
>>> 
>>> a) the problem with using automatic caching is bigger, because you will
>>> have to decide how to compare IO vs CPU costs, and if you pick wrong, the
>>> additional IO costs might be enormous or can even crash your system. This
>>> is a more difficult problem compared to, let's say, join reordering, where
>>> the only issue is to have good statistics that can capture correlations
>>> between columns (when you reorder joins, the number of IO operations does
>>> not change)
>>> c) your example is completely independent of caching.
>>> 
>>> Query like this:
>>> 
>>> src1.filter('f1 > 10).join(src2.filter('f2 < 30), `f1 === `f2).as('f3,
>>> …).filter(‘f3 > 30)
>>> 
>>> Should/could be optimised to empty result immediately, without the need
>>> for any cache/materialisation and that should work even without any
>>> statistics provided by the connector.
>>> 
>>> For me a prerequisite to any serious cost-based optimisations would be some
>>> reasonable benchmark coverage of the code (TPC-H?). Otherwise that would be
>>> the equivalent of adding untested code, since we wouldn’t be able to verify
>>> our assumptions, like how does the writing of 10 000 records to a
>>> cache/RocksDB/Kafka/CSV file compare to joining/filtering/processing of
>>> let's say 1 000 000 rows.
>>> 
>>> Re 2.
>>> 
>>> I wasn’t proposing to change the semantic later. I was proposing that we
>>> start now:
>>> 
>>> CachedTable cachedA = a.cache()
>>> cachedA.foo() // Cache is used
>>> a.bar() // Original DAG is used
>>> 
>>> And then later we can think about adding for example
>>> 
>>> CachedTable cachedA = a.hintCache()
>>> cachedA.foo() // Cache might be used
>>> a.bar() // Original DAG is used
>>> 
>>> Or
>>> 
>>> env.enableAutomaticCaching()
>>> a.foo() // Cache might be used
>>> a.bar() // Cache might be used
>>> 
>>> Or (I would still not like this option):
>>> 
>>> a.hintCache()
>>> a.foo() // Cache might be used
>>> a.bar() // Cache might be used
>>> 
>>> Or whatever else will come to our mind. Even if we add some automatic
>>> caching in the future, keeping explicit (`CachedTable cache()`) caching
>>> will still be useful, at least in some cases.
>>> 
>>> Re 3.
>>> 
>>>> 2. The source tables are immutable during one run of batch processing
>>> logic.
>>>> 3. The cache is immutable during one run of batch processing logic.
>>> 
>>>> I think assumption 2 and 3 are by definition what batch processing
>>> means,
>>>> i.e the data must be complete before it is processed and should not
>>> change
>>>> when the processing is running.
>>> 
>>> I agree that this is how batch systems SHOULD work. However, I know
>>> from my previous experience that it’s not always the case. Sometimes users
>>> are just working on some non-transactional storage, which can be (either
>>> constantly or occasionally) modified by some other processes for
>>> whatever reason (fixing the data, updating, adding new data, etc.).
>>> 
>>> But even if we ignore this point (data immutability), the performance
>>> side-effect issue of your proposal remains. If a user calls `void
>>> a.cache()` deep inside some private method, it will have implicit side
>>> effects on other parts of his program that might not be obvious.
>>> 
>>> Re `CacheHandle`.
>>> 
>>> If I understand it correctly, it only addresses the issue of where to
>>> place the `uncache`/`dropCache` method.
>>> 
>>> Btw,
>>> 
>>>> In vast majority of the cases, users wouldn't really care whether the
>>> cache is used or not.
>>> 
>>> I wouldn’t agree with that, because “caching” (if not purely in memory
>>> caching) would add additional IO costs. It’s similar to saying that users
>>> would not see a difference between Spark/Flink and MapReduce (MapReduce
>>> writes data to disks after every map/reduce stage).
>>> 
>>> Piotrek
>>> 
>>>> On 12 Dec 2018, at 14:28, Becket Qin <be...@gmail.com> wrote:
>>>> 
>>>> Hi Piotrek,
>>>> 
>>>> Not sure if you noticed, in my last email, I was proposing `CacheHandle
>>>> cache()` to avoid the potential side effect due to function calls.
>>>> 
>>>> Let's look at the disagreement in your reply one by one.
>>>> 
>>>> 
>>>> 1. Optimization chances
>>>> 
>>>> Optimization is never trivial work. This is exactly why we should not
>>>> let users manually do that. Databases have done a huge amount of work in
>>>> this area. At Alibaba, we rely heavily on many optimization rules to
>>>> boost the SQL query performance.
>>>> SQL query performance.
>>>> 
>>>> In your example, if I fill in the filter conditions in a certain way,
>>>> the optimization becomes obvious.
>>>> 
>>>> Table src1 = … // read from connector 1
>>>> Table src2 = … // read from connector 2
>>>> 
>>>> Table a = src1.filter('f1 > 10).join(src2.filter('f2 < 30), `f1 ===
>>>> `f2).as('f3, ...)
>>>> a.cache() // write cache to connector 3; when writing the records,
>>>> remember min and max of `f1
>>>> 
>>>> a.filter('f3 > 30) // There is no need to read from any connector,
>>>> because `a` does not contain any record whose 'f3 is greater than 30.
>>>> env.execute()
>>>> a.select(…)
>>>> 
>>>> BTW, it seems to me that adding some basic statistics is fairly
>>>> straightforward and the cost is pretty marginal, if not ignorable. In
>>>> fact it is not only needed for optimization, but also for cases such as
>>>> ML, where some algorithms may need to decide their parameters based on
>>>> the statistics of the data.
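>>>> 
>>>> To make the min/max idea concrete, a small sketch (the types here are
>>>> made up, not Flink classes): record min/max per field while writing the
>>>> cache, and a later filter can sometimes be answered as empty without any
>>>> scan.
>>>> 
>>>> final class ColumnStats {
>>>>     final long min, max;   // recorded while the cache is written
>>>>     ColumnStats(long min, long max) { this.min = min; this.max = max; }
>>>> }
>>>> 
>>>> // In the example above 'f3 stems from 'f1 > 10 joined with 'f2 < 30, so
>>>> // the recorded stats give max('f3) < 30, and a filter 'f3 > 30 can be
>>>> // short-circuited to an empty result without reading the cache:
>>>> static boolean scanCanBeSkipped(ColumnStats f3Stats, long lowerBound) {
>>>>     return f3Stats.max <= lowerBound; // no row can satisfy 'f3 > bound
>>>> }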
>>>> 
>>>> 
>>>> 2. Same API, one semantic now, another semantic later.
>>>> 
>>>> I am trying to understand what the semantics of the `CachedTable
>>>> cache()` you are proposing are. IMO, we should avoid designing an API
>>>> whose semantics will be changed later. If we have a "CachedTable
>>>> cache()" method, then the semantics should be very clearly defined
>>>> upfront and not change later. It should never be "right now let's go
>>>> with semantic 1, later we can silently change it to semantic 2 or 3".
>>>> Such a change could result in bad consequences. For example, let's say
>>>> we decide to go with semantic 1:
>>>> 
>>>> CachedTable cachedA = a.cache()
>>>> cachedA.foo() // Cache is used
>>>> a.bar() // Original DAG is used.
>>>> 
>>>> Now the majority of the users would be using cachedA.foo() in their
>>>> code. And some advanced users will use a.bar() to explicitly skip the
>>>> cache. Later on, we add smart optimization and change the semantic to
>>>> semantic 2:
>>>> 
>>>> CachedTable cachedA = a.cache()
>>>> cachedA.foo() // Cache is used
>>>> a.bar() // Cache MIGHT be used, and Flink may decide to skip the cache
>>>> if it is faster.
>>>> 
>>>> Now most of the users who were writing cachedA.foo() will not benefit
>>>> from this optimization at all, unless they change their code to use
>>>> a.foo() instead. And those advanced users suddenly lose the option to
>>>> explicitly ignore the cache unless they change their code (assuming we
>>>> care enough to provide something like hint(useCache)). If we don't
>>>> define the semantics carefully, our users will have to change their code
>>>> again and again when they shouldn't have to.
>>>> 
>>>> 
>>>> 3. side effect.
>>>> 
>>>> Before we talk about side effects, we have to agree on the assumptions.
>>>> The assumptions I have are the following:
>>>> 1. We are talking about batch processing.
>>>> 2. The source tables are immutable during one run of the batch
>>>> processing logic.
>>>> 3. The cache is immutable during one run of the batch processing logic.
>>>> 
>>>> I think assumptions 2 and 3 are by definition what batch processing
>>>> means, i.e. the data must be complete before it is processed and should
>>>> not change while the processing is running.
>>>> 
>>>> As far as I am aware, I don't know of any batch processing system that
>>>> breaks those assumptions. Even for relational database tables, where
>>>> queries can run with concurrent modifications, the necessary locking is
>>>> still required to ensure the integrity of the query result.
>>>> 
>>>> Please let me know if you disagree with the above assumptions. If you
>>>> agree with these assumptions, then with the `CacheHandle cache()` API in
>>>> my last email, do you still see side effects?
>>>> 
>>>> Thanks,
>>>> 
>>>> Jiangjie (Becket) Qin
>>>> 
>>>> 
>>>> On Wed, Dec 12, 2018 at 7:11 PM Piotr Nowojski <piotr@data-artisans.com>
>>>> wrote:
>>>> 
>>>>> Hi Becket,
>>>>> 
>>>>>> Regarding the chance of optimization, it might not be that rare. Some
>>>>>> very simple statistics could already help in many cases. For example,
>>>>>> simply maintaining the max and min of each field can already eliminate
>>>>>> some unnecessary table scans (potentially scanning the cached table) if
>>>>>> the result is doomed to be empty. A histogram would give even further
>>>>>> information. The optimizer could be very careful and only ignore the
>>>>>> cache when it is 100% sure doing that is cheaper, e.g. only when a
>>>>>> filter on the cache will absolutely return nothing.
>>>>> 
>>>>> I do not see how this might be easy to achieve. It would require tons
>>>>> of effort to make it work, and in the end you would still have the
>>>>> problem of comparing/trading CPU cycles vs IO. For example:
>>>>> 
>>>>> Table src1 = … // read from connector 1
>>>>> Table src2 = … // read from connector 2
>>>>> 
>>>>> Table a = src1.filter(…).join(src2.filter(…), …)
>>>>> a.cache() // write cache to connector 3
>>>>> 
>>>>> a.filter(…)
>>>>> env.execute()
>>>>> a.select(…)
>>>>> 
>>>>> Deciding whether it’s better to:
>>>>> A) read from connector1/connector2, filter/map and join them twice
>>>>> B) read from connector1/connector2, filter/map and join them once, pay
>>>>> the price of writing to connector 3 and then reading from it
>>>>> 
>>>>> is very far from trivial. `a` can end up much larger than `src1` and
>>>>> `src2`, writes to connector 3 might be extremely slow, reads from
>>>>> connector 3 can be slower compared to reads from connectors 1 & 2, … .
>>>>> You really need to have extremely good statistics to correctly assess
>>>>> the size of the output, and it would still fail many times
>>>>> (correlations etc). And keep in mind that at the moment we do not have
>>>>> ANY statistics at all. More than that, it would require significantly
>>>>> more testing and setting up some benchmarks to make sure that we do
>>>>> not break it with some regressions.
>>>>> 
>>>>> That’s why I’m strongly opposing this idea - at least let’s not start
>>>>> with this. If we first start with completely manual/explicit caching,
>>>>> without any magic, it would be a significant improvement for the users
>>>>> at a fraction of the development cost. After implementing that, when we
>>>>> already have all of the working pieces, we can start working on some
>>>>> optimisation rules. As I wrote before, if we start with
>>>>> 
>>>>> `CachedTable cache()`
>>>>> 
>>>>> we can later work on follow-up stories to make it automatic. Despite
>>>>> the fact that I don’t like this implicit/side-effect approach with the
>>>>> `void` method, having an explicit `CachedTable cache()` wouldn’t even
>>>>> prevent us from later adding a `void hintCache()` method, with the
>>>>> exact semantics that you want.
>>>>> 
>>>>> On top of that, I raise again that having an implicit `void
>>>>> cache()/hintCache()` has other side effects and problems with
>>>>> non-immutable data, and is annoying when used secretly inside methods.
>>>>> 
>>>>> An explicit `CachedTable cache()` just looks like a much less
>>>>> controversial MVP, and if we decide to go further with this topic, it’s
>>>>> not a wasted effort, but just lies on a straight path to more
>>>>> advanced/complicated solutions in the future. Are there any drawbacks
>>>>> of starting with `CachedTable cache()` that I’m missing?
>>>>> 
>>>>> Piotrek
>>>>> 
>>>>>> On 12 Dec 2018, at 09:30, Jeff Zhang <zj...@gmail.com> wrote:
>>>>>> 
>>>>>> Hi Becket,
>>>>>> 
>>>>>> Introducing CacheHandle seems too complicated. That means users have
>>>>>> to maintain the handle properly.
>>>>>> 
>>>>>> And since cache is just a hint for the optimizer, why not just return
>>>>>> the Table itself from the cache method? This hint info should be kept
>>>>>> in the Table, I believe.
>>>>>> 
>>>>>> So how about adding the methods cache and uncache to Table, both
>>>>>> returning a Table? Because what cache and uncache do is just add some
>>>>>> hint info into the Table.
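>>>>>> 
>>>>>> For illustration (a hypothetical sketch, not an actual API):
>>>>>> 
>>>>>> Table cached = a.cache();        // same data, cache hint attached
>>>>>> Table plain  = cached.uncache(); // copy with the cache hint removed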
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> Becket Qin <be...@gmail.com> wrote on Wed, Dec 12, 2018 at 11:25 AM:
>>>>>> 
>>>>>>> Hi Till and Piotrek,
>>>>>>> 
>>>>>>> Thanks for the clarification. That resolves quite a few confusions.
>>>>>>> My understanding of how cache works is the same as what Till
>>>>>>> describes, i.e. cache() is a hint to Flink, but it is not guaranteed
>>>>>>> that the cache always exists and it might be recomputed from its
>>>>>>> lineage.
>>>>>>> 
>>>>>>> Is this the core of our disagreement here? That you would like this
>>>>>>>> “cache()” to be mostly a hint for the optimiser?
>>>>>>> 
>>>>>>> Semantics-wise, yes. That's also why I think materialize() has a
>>>>>>> much larger scope than cache(), and thus it should be a different
>>>>>>> method.
>>>>>>> 
>>>>>>> Regarding the chance of optimization, it might not be that rare.
>>>>>>> Some very simple statistics could already help in many cases. For
>>>>>>> example, simply maintaining the max and min of each field can already
>>>>>>> eliminate some unnecessary table scans (potentially scanning the
>>>>>>> cached table) if the result is doomed to be empty. A histogram would
>>>>>>> give even further information. The optimizer could be very careful
>>>>>>> and only ignore the cache when it is 100% sure doing that is cheaper,
>>>>>>> e.g. only when a filter on the cache will absolutely return nothing.
>>>>>>> 
>>>>>>> Given the above clarification on cache, I would like to revisit the
>>>>>>> original "void cache()" proposal and see if we can improve on top of
>>>>>>> that.
>>>>>>> 
>>>>>>> What do you think about the following modified interface?
>>>>>>> 
>>>>>>> Table {
>>>>>>>   /**
>>>>>>>    * This call hints Flink to maintain a cache of this table and
>>>>>>>    * leverage it for performance optimization if needed. Note that
>>>>>>>    * Flink may still decide to not use the cache if it is cheaper by
>>>>>>>    * doing so.
>>>>>>>    *
>>>>>>>    * A CacheHandle will be returned to allow the user to release the
>>>>>>>    * cache actively. The cache will be deleted if there are no
>>>>>>>    * unreleased cache handles to it. When the TableEnvironment is
>>>>>>>    * closed, the cache will also be deleted and all the cache handles
>>>>>>>    * will be released.
>>>>>>>    *
>>>>>>>    * @return a CacheHandle referring to the cache of this table.
>>>>>>>    */
>>>>>>>   CacheHandle cache();
>>>>>>> }
>>>>>>> 
>>>>>>> CacheHandle {
>>>>>>>   /**
>>>>>>>    * Close the cache handle. This method does not necessarily delete
>>>>>>>    * the cache. Instead, it simply decrements the reference counter to
>>>>>>>    * the cache. When there is no handle referring to a cache, the
>>>>>>>    * cache will be deleted.
>>>>>>>    *
>>>>>>>    * @return the number of open handles to the cache after this
>>>>>>>    * handle has been released.
>>>>>>>    */
>>>>>>>   int release()
>>>>>>> }
>>>>>>> 
>>>>>>> The rationale behind this interface is the following:
>>>>>>> In the vast majority of cases, users wouldn't really care whether the
>>>>>>> cache is used or not. So I think the most intuitive way is letting
>>>>>>> cache() return nothing, so nobody needs to worry about the difference
>>>>>>> between operations on CachedTables and those on the "original"
>>>>>>> tables. This will make maybe 99.9% of the users happy. There were two
>>>>>>> concerns raised for this approach:
>>>>>>> 1. In some rare cases, users may want to ignore the cache.
>>>>>>> 2. A table might be cached/uncached in a third-party function while
>>>>>>> the caller does not know.
>>>>>>> 
>>>>>>> For the first issue, users can use hint("ignoreCache") to explicitly
>>>>>>> ignore the cache.
>>>>>>> For the second issue, the above proposal lets cache() return a
>>>>>>> CacheHandle whose only method is release(). Different CacheHandles
>>>>>>> will refer to the same cache; if a cache no longer has any cache
>>>>>>> handle, it will be deleted. This will address the following case:
>>>>>>> {
>>>>>>> val handle1 = a.cache()
>>>>>>> process(a)
>>>>>>> a.select(...) // cache is still available; handle1 has not been
>>>>>>> released.
>>>>>>> }
>>>>>>> 
>>>>>>> void process(Table t) {
>>>>>>> val handle2 = t.cache() // new handle to cache
>>>>>>> t.select(...) // optimizer decides cache usage
>>>>>>> t.hint("ignoreCache").select(...) // cache is ignored
>>>>>>> handle2.release() // release the handle, but the cache may still be
>>>>>>> available if there are other handles
>>>>>>> ...
>>>>>>> }
>>>>>>> 
>>>>>>> Does the above modified approach look reasonable to you?
>>>>>>> 
>>>>>>> Cheers,
>>>>>>> 
>>>>>>> Jiangjie (Becket) Qin
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> On Tue, Dec 11, 2018 at 6:44 PM Till Rohrmann <tr...@apache.org>
>>>>>>> wrote:
>>>>>>> 
>>>>>>>> Hi Becket,
>>>>>>>> 
>>>>>>>> I was aiming at semantics similar to 1. I actually thought that
>>>>>>>> `cache()` would tell the system to materialize the intermediate
>>>>>>>> result so that subsequent queries don't need to reprocess it. This
>>>>>>>> means that the usage of the cached table in this example
>>>>>>>> 
>>>>>>>> {
>>>>>>>> val cachedTable = a.cache()
>>>>>>>> val b1 = cachedTable.select(…)
>>>>>>>> val b2 = cachedTable.foo().select(…)
>>>>>>>> val b3 = cachedTable.bar().select(...)
>>>>>>>> val c1 = a.select(…)
>>>>>>>> val c2 = a.foo().select(…)
>>>>>>>> val c3 = a.bar().select(...)
>>>>>>>> }
>>>>>>>> 
>>>>>>>> strongly depends on interleaved calls which trigger the execution
>>>>>>>> of sub-queries. So for example, if there is only a single
>>>>>>>> env.execute call at the end of the block, then b1, b2, b3, c1, c2
>>>>>>>> and c3 would all be computed by reading directly from the sources
>>>>>>>> (given that there is only a single JobGraph). It just happens that
>>>>>>>> the result of `a` will be cached such that we skip the processing of
>>>>>>>> `a` when there are subsequent queries reading from `cachedTable`. If
>>>>>>>> for some reason the system cannot materialize the table (e.g.
>>>>>>>> running out of disk space, TTL expired), then it could also happen
>>>>>>>> that we need to reprocess `a`. In that sense `cachedTable` simply is
>>>>>>>> an identifier for the materialized result of `a`, with the lineage
>>>>>>>> of how to reprocess it.
>>>>>>>> 
>>>>>>>> Cheers,
>>>>>>>> Till
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Tue, Dec 11, 2018 at 11:01 AM Piotr Nowojski
>>>>>>>> <piotr@data-artisans.com> wrote:
>>>>>>>> 
>>>>>>>>> Hi Becket,
>>>>>>>>> 
>>>>>>>>>> {
>>>>>>>>>> val cachedTable = a.cache()
>>>>>>>>>> val b = cachedTable.select(...)
>>>>>>>>>> val c = a.select(...)
>>>>>>>>>> }
>>>>>>>>>> 
>>>>>>>>>> Semantic 1. b uses cachedTable as user demanded so. c uses the
>>>>>>>>>> original DAG as user demanded so. In this case, the optimizer has
>>>>>>>>>> no chance to optimize.
>>>>>>>>>> Semantic 2. b uses cachedTable as user demanded so. c leaves the
>>>>>>>>>> optimizer to choose whether the cache or DAG should be used. In
>>>>>>>>>> this case, the user loses the option to NOT use the cache.
>>>>>>>>>> 
>>>>>>>>>> As you can see, neither of the options seems perfect. However, I
>>>>>>>>>> guess you and Till are proposing the third option:
>>>>>>>>>> 
>>>>>>>>>> Semantic 3. b leaves the optimizer to choose whether the cache or
>>>>>>>>>> DAG should be used. c always uses the DAG.
>>>>>>>>> 
>>>>>>>>> I am pretty sure that me, Till, Fabian and others were all
>>>>>>>>> proposing and advocating in favour of semantic “1”. No cost-based
>>>>>>>>> optimiser decisions at all.
>>>>>>>>> 
>>>>>>>>> {
>>>>>>>>> val cachedTable = a.cache()
>>>>>>>>> val b1 = cachedTable.select(…)
>>>>>>>>> val b2 = cachedTable.foo().select(…)
>>>>>>>>> val b3 = cachedTable.bar().select(...)
>>>>>>>>> val c1 = a.select(…)
>>>>>>>>> val c2 = a.foo().select(…)
>>>>>>>>> val c3 = a.bar().select(...)
>>>>>>>>> }
>>>>>>>>> 
>>>>>>>>> All b1, b2 and b3 are reading from the cache, while c1, c2 and c3
>>>>>>>>> are re-executing the whole plan for “a”.
>>>>>>>>> 
>>>>>>>>> In the future we could discuss going one step further, introducing
>>>>>>>>> some global optimisation (that can be manually enabled/disabled):
>>>>>>>>> deduplicate plan nodes/deduplicate sub-queries/re-use sub-query
>>>>>>>>> results/or whatever we could call it. It could do two things:
>>>>>>>>> 
>>>>>>>>> 1. Automatically try to deduplicate fragments of the plan and share
>>>>>>>>> the result using CachedTable - in other words, automatically insert
>>>>>>>>> `CachedTable cache()` calls.
>>>>>>>>> 2. Automatically make the decision to bypass explicit `CachedTable`
>>>>>>>>> access (this would be the equivalent of what you described as
>>>>>>>>> “semantic 3”).
>>>>>>>>> 
>>>>>>>>> However, as I wrote previously, I have big doubts whether such
>>>>>>>>> cost-based optimisation would work (this applies also to “Semantic
>>>>>>>>> 2”). I would expect it to do more harm than good in so many cases
>>>>>>>>> that it wouldn’t make sense. Even assuming that we calculate
>>>>>>>>> statistics perfectly (this ain’t gonna happen), it’s virtually
>>>>>>>>> impossible to correctly estimate the exchange rate of CPU cycles vs
>>>>>>>>> IO operations, as it changes so much from deployment to deployment.
>>>>>>>>> 
>>>>>>>>> Is this the core of our disagreement here? That you would like this
>>>>>>>>> “cache()” to be mostly a hint for the optimiser?
>>>>>>>>> 
>>>>>>>>> Piotrek
>>>>>>>>> 
>>>>>>>>>> On 11 Dec 2018, at 06:00, Becket Qin <be...@gmail.com> wrote:
>>>>>>>>>> 
>>>>>>>>>> Another potential concern for semantic 3 is that, in the future,
>>>>>>>>>> we may add automatic caching to Flink, e.g. cache the intermediate
>>>>>>>>>> results at the shuffle boundary. If our semantic is that a
>>>>>>>>>> reference to the original table means skipping the cache, those
>>>>>>>>>> users may not be able to benefit from the implicit cache.
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> On Tue, Dec 11, 2018 at 12:10 PM Becket Qin <becket.qin@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>> 
>>>>>>>>>>> Hi Piotrek,
>>>>>>>>>>> 
>>>>>>>>>>> Thanks for the reply. Thinking about it again, I might have
>>>>>>>>>>> misunderstood your proposal in earlier emails. Returning a
>>>>>>>>>>> CachedTable might not be a bad idea.
>>>>>>>>>>> 
>>>>>>>>>>> I was more concerned about the semantics and their intuitiveness
>>>>>>>>>>> when a CachedTable is returned, i.e. if cache() returns a
>>>>>>>>>>> CachedTable, what are the semantics of the following code:
>>>>>>>>>>> {
>>>>>>>>>>> val cachedTable = a.cache()
>>>>>>>>>>> val b = cachedTable.select(...)
>>>>>>>>>>> val c = a.select(...)
>>>>>>>>>>> }
>>>>>>>>>>> What is the difference between b and c? At first glance, I see
>>>>>>>>>>> two options:
>>>>>>>>>>> 
>>>>>>>>>>> Semantic 1. b uses cachedTable as user demanded so. c uses the
>>>>>>>>>>> original DAG as user demanded so. In this case, the optimizer has
>>>>>>>>>>> no chance to optimize.
>>>>>>>>>>> Semantic 2. b uses cachedTable as user demanded so. c leaves the
>>>>>>>>>>> optimizer to choose whether the cache or DAG should be used. In
>>>>>>>>>>> this case, the user loses the option to NOT use the cache.
>>>>>>>>>>> 
>>>>>>>>>>> As you can see, neither of the options seems perfect. However, I
>>>>>>>>>>> guess you and Till are proposing the third option:
>>>>>>>>>>> 
>>>>>>>>>>> Semantic 3. b leaves the optimizer to choose whether the cache or
>>>>>>>>>>> DAG should be used. c always uses the DAG.
>>>>>>>>>>> 
>>>>>>>>>>> This does address all the concerns. It is just that, from an
>>>>>>>>>>> intuitiveness perspective, I found that asking the user to
>>>>>>>>>>> explicitly use a CachedTable which the optimizer might choose to
>>>>>>>>>>> ignore is a little weird. That was why I did not think about that
>>>>>>>>>>> semantic. But given there is material benefit, I think this
>>>>>>>>>>> semantic is acceptable.
>>>>>>>>>>> 
>>>>>>>>>>> 1. If we want to let optimiser make decisions whether to use
>>> cache
>>>>>>> or
>>>>>>>>> not,
>>>>>>>>>>>> then why do we need “void cache()” method at all? Would It
>>>>>>>> “increase”
>>>>>>>>> the
>>>>>>>>>>>> chance of using the cache? That’s sounds strange. What would be
>>> the
>>>>>>>>>>>> mechanism of deciding whether to use the cache or not? If we
>>> want
>>>>>>> to
>>>>>>>>>>>> introduce such kind  automated optimisations of “plan nodes
>>>>>>>>> deduplication”
>>>>>>>>>>>> I would turn it on globally, not per table, and let the
>>> optimiser
>>>>>>> do
>>>>>>>>> all of
>>>>>>>>>>>> the work.
>>>>>>>>>>>> 2. We do not have statistics at the moment for any use/not use
>>>>>>> cache
>>>>>>>>>>>> decision.
>>>>>>>>>>>> 3. Even if we had, I would be veeerryy sceptical whether such
>>> cost
>>>>>>>>> based
>>>>>>>>>>>> optimisations would work properly and I would still insist
>>> first on
>>>>>>>>>>>> providing explicit caching mechanism (`CachedTable cache()`)
>>>>>>>>>>>> 
>>>>>>>>>>> We are absolutely on the same page here. An explicit cache()
>>>>>>>>>>> method is necessary not only because the optimizer may not be
>>>>>>>>>>> able to make the right decision, but also because of the nature
>>>>>>>>>>> of interactive programming. For example, if users write the
>>>>>>>>>>> following code in the Scala shell:
>>>>>>>>>>> val b = a.select(...)
>>>>>>>>>>> val c = b.select(...)
>>>>>>>>>>> val d = c.select(...).writeToSink(...)
>>>>>>>>>>> tEnv.execute()
>>>>>>>>>>> There is no way the optimizer will know whether b or c will be
>>>>>>>>>>> used in later code, unless users hint explicitly.
>>>>>>>>>>> 
>>>>>>>>>>>> At the same time I’m not sure if you have responded to our
>>>>>>>>>>>> objections to `void cache()` being implicit/having side effects,
>>>>>>>>>>>> which me, Jark, Fabian, Till and I think also Shaoxuan are
>>>>>>>>>>>> supporting.
>>>>>>>>>>> 
>>>>>>>>>>> Are there any other side effects if we use semantic 3 mentioned
>>>>>>>>>>> above?
>>>>>>>>>>> 
>>>>>>>>>>> Thanks,
>>>>>>>>>>> 
>>>>>>>>>>> JIangjie (Becket) Qin
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> On Mon, Dec 10, 2018 at 7:54 PM Piotr Nowojski
>>>>>>>>>>> <piotr@data-artisans.com> wrote:
>>>>>>>>>>> 
>>>>>>>>>>>> Hi Becket,
>>>>>>>>>>>> 
>>>>>>>>>>>> Sorry for not responding long time.
>>>>>>>>>>>> 
>>>>>>>>>>>> Regarding case1.
>>>>>>>>>>>> 
>>>>>>>>>>>> There wouldn’t be an “a.unCache()” method; I would expect only
>>>>>>>>>>>> `cachedTableA1.dropCache()`. Dropping `cachedTableA1` wouldn’t
>>>>>>>>>>>> affect `cachedTableA2`. Just as in any other database, dropping
>>>>>>>>>>>> or modifying one independent table/materialised view does not
>>>>>>>>>>>> affect others.
>>>>>>>>>>>> 
>>>>>>>>>>>>> What I meant is that assuming there is already a cached table,
>>>>>>>> ideally
>>>>>>>>>>>> users need
>>>>>>>>>>>>> not to specify whether the next query should read from the
>>> cache
>>>>>>> or
>>>>>>>>> use
>>>>>>>>>>>> the
>>>>>>>>>>>>> original DAG. This should be decided by the optimizer.
>>>>>>>>>>>> 
>>>>>>>>>>>> 1. If we want to let the optimiser make decisions whether to
>>>>>>>>>>>> use the cache or not, then why do we need a “void cache()”
>>>>>>>>>>>> method at all? Would it “increase” the chance of using the
>>>>>>>>>>>> cache? That sounds strange. What would be the mechanism for
>>>>>>>>>>>> deciding whether to use the cache or not? If we want to
>>>>>>>>>>>> introduce such automated optimisations of “plan node
>>>>>>>>>>>> deduplication”, I would turn it on globally, not per table, and
>>>>>>>>>>>> let the optimiser do all of the work.
>>>>>>>>>>>> 2. We do not have statistics at the moment for any use/not-use
>>>>>>>>>>>> cache decision.
>>>>>>>>>>>> 3. Even if we had, I would be veeerryy sceptical whether such
>>>>>>>>>>>> cost-based optimisations would work properly, and I would still
>>>>>>>>>>>> insist first on providing an explicit caching mechanism
>>>>>>>>>>>> (`CachedTable cache()`).
>>>>>>>>>>>> 4. As Till wrote, having an explicit `CachedTable cache()`
>>>>>>>>>>>> doesn’t contradict future work on automated cost-based caching.
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> At the same time I’m not sure if you have responded to our
>>>>>>>>>>>> objections to `void cache()` being implicit/having side effects,
>>>>>>>>>>>> which me, Jark, Fabian, Till and I think also Shaoxuan are
>>>>>>>>>>>> supporting.
>>>>>>>>>>>> 
>>>>>>>>>>>> Piotrek
>>>>>>>>>>>> 
>>>>>>>>>>>>> On 5 Dec 2018, at 12:42, Becket Qin <be...@gmail.com> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Hi Till,
>>>>>>>>>>>>> 
>>>>>>>>>>>>> It is true that after the first job submission, there will be
>>>>>>>>>>>>> no ambiguity in terms of whether a cached table is used or not.
>>>>>>>>>>>>> That is the same for the cache() without returning a
>>>>>>>>>>>>> CachedTable.
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Conceptually one could think of cache() as introducing a
>>>>>>>>>>>>>> caching operator from which you need to consume if you want to
>>>>>>>>>>>>>> benefit from the caching functionality.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> I am thinking a little differently. I think it is a hint (as
>>>>>>>>>>>>> you mentioned later) instead of a new operator. I’d like to be
>>>>>>>>>>>>> careful about the semantics of the API. A hint is a property
>>>>>>>>>>>>> set on an existing operator, but is not itself an operator, as
>>>>>>>>>>>>> it does not really manipulate the data.
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> I agree, ideally the optimizer makes this kind of decision
>>>>>>>>>>>>>> about which intermediate result should be cached. But
>>>>>>>>>>>>>> especially when executing ad-hoc queries the user might better
>>>>>>>>>>>>>> know which results need to be cached, because Flink might not
>>>>>>>>>>>>>> see the full DAG. In that sense, I would consider the cache()
>>>>>>>>>>>>>> method as a hint for the optimizer. Of course, in the future
>>>>>>>>>>>>>> we might add functionality which tries to automatically cache
>>>>>>>>>>>>>> results (e.g. caching the latest intermediate results until so
>>>>>>>>>>>>>> and so much space is used). But this should hopefully not
>>>>>>>>>>>>>> contradict `CachedTable cache()`.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> I agree that cache() method is needed for exactly the reason
>>> you
>>>>>>>>>>>> mentioned,
>>>>>>>>>>>>> i.e. Flink cannot predict what users are going to write later,
>>> so
>>>>>>>>> users
>>>>>>>>>>>>> need to tell Flink explicitly that this table will be used
>>> later.
>>>>>>>>> What I
>>>>>>>>>>>>> meant is that assuming there is already a cached table, ideally
>>>>>>>> users
>>>>>>>>>>>> need
>>>>>>>>>>>>> not to specify whether the next query should read from the
>>> cache
>>>>>>> or
>>>>>>>>> use
>>>>>>>>>>>> the
>>>>>>>>>>>>> original DAG. This should be decided by the optimizer.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> To explain the difference between returning / not returning a
>>>>>>>>>>>> CachedTable,
>>>>>>>>>>>>> I want compare the following two case:
>>>>>>>>>>>>> 
>>>>>>>>>>>>> *Case 1:  returning a CachedTable*
>>>>>>>>>>>>> b = a.map(...)
>>>>>>>>>>>>> val cachedTableA1 = a.cache()
>>>>>>>>>>>>> val cachedTableA2 = a.cache()
>>>>>>>>>>>>> b.print() // Just to make sure a is cached.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> c = a.filter(...) // User specify that the original DAG is
>>> used?
>>>>>>> Or
>>>>>>>>> the
>>>>>>>>>>>>> optimizer decides whether DAG or cache should be used?
>>>>>>>>>>>>> d = cachedTableA1.filter() // User specify that the cached
>>> table
>>>>>>> is
>>>>>>>>>>>> used.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> a.unCache() // Can cachedTableA still be used afterwards?
>>>>>>>>>>>>> cachedTableA1.uncache() // Can cachedTableA2 still be used?
>>>>>>>>>>>>> 
>>>>>>>>>>>>> *Case 2: not returning a CachedTable*
>>>>>>>>>>>>> b = a.map()
>>>>>>>>>>>>> a.cache()
>>>>>>>>>>>>> a.cache() // no-op
>>>>>>>>>>>>> b.print() // Just to make sure a is cached
>>>>>>>>>>>>> 
>>>>>>>>>>>>> c = a.filter(...) // Optimizer decides whether the cache or DAG
>>>>>>>> should
>>>>>>>>>>>> be
>>>>>>>>>>>>> used
>>>>>>>>>>>>> d = a.filter(...) // Optimizer decides whether the cache or DAG
>>>>>>>> should
>>>>>>>>>>>> be
>>>>>>>>>>>>> used
>>>>>>>>>>>>> 
>>>>>>>>>>>>> a.unCache()
>>>>>>>>>>>>> a.unCache() // no-op
>>>>>>>>>>>>> 
>>>>>>>>>>>>> In case 1, semantic wise, optimizer lose the option to choose
>>>>>>>> between
>>>>>>>>>>>> DAG
>>>>>>>>>>>>> and cache. And the unCache() call becomes tricky.
>>>>>>>>>>>>> In case 2, users do not need to worry about whether cache or
>>> DAG
>>>>>>> is
>>>>>>>>>>>> used.
>>>>>>>>>>>>> And the unCache() semantic is clear. However, the caveat is
>>> that
>>>>>>>> users
>>>>>>>>>>>>> cannot explicitly ignore the cache.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> In order to address the issues mentioned in case 2 and
>>> inspired by
>>>>>>>> the
>>>>>>>>>>>>> discussion so far, I am thinking about using hint to allow user
>>>>>>>>>>>> explicitly
>>>>>>>>>>>>> ignore cache. Although we do not have hint yet, but we probably
>>>>>>>> should
>>>>>>>>>>>> have
>>>>>>>>>>>>> one. So the code becomes:
>>>>>>>>>>>>> 
>>>>>>>>>>>>> *Case 3: returning this table*
>>>>>>>>>>>>> b = a.map()
>>>>>>>>>>>>> a.cache()
>>>>>>>>>>>>> a.cache() // no-op
>>>>>>>>>>>>> b.print() // Just to make sure a is cached
>>>>>>>>>>>>> 
>>>>>>>>>>>>> c = a.filter(...) // Optimizer decides whether the cache or DAG
>>>>>>>> should
>>>>>>>>>>>> be
>>>>>>>>>>>>> used
>>>>>>>>>>>>> d = a.hint("ignoreCache").filter(...) // DAG will be used
>>> instead
>>>>>>> of
>>>>>>>>> the
>>>>>>>>>>>>> cache.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> a.unCache()
>>>>>>>>>>>>> a.unCache() // no-op
>>>>>>>>>>>>> 
>>>>>>>>>>>>> We could also let cache() return this table to allow chained
>>>>>>> method
>>>>>>>>>>>> calls.
>>>>>>>>>>>>> Do you think this API addresses the concerns?
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Jiangjie (Becket) Qin
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On Wed, Dec 5, 2018 at 10:55 AM Jark Wu <im...@gmail.com> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> All the recent discussions are focused on whether there is a
>>>>>>>>>>>>>> problem if cache() does not return a Table.
>>>>>>>>>>>>>> It seems that returning a Table explicitly is more clear (and
>>>>>>>>>>>>>> safe?).
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> So are there any problems if cache() returns a Table? @Becket
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>> Jark
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On Tue, 4 Dec 2018 at 22:27, Till Rohrmann <trohrmann@apache.org> wrote:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> It's true that b, c, d and e will all read from the original
>>>>>>>>>>>>>>> DAG that generates a. But all subsequent operators (when
>>>>>>>>>>>>>>> running multiple queries) which reference cachedTableA should
>>>>>>>>>>>>>>> not need to reproduce `a` but directly consume the
>>>>>>>>>>>>>>> intermediate result.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Conceptually one could think of cache() as introducing a
>>>>>>>>>>>>>>> caching operator from which you need to consume from if you
>>>>>>>>>>>>>>> want to benefit from the caching functionality.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> I agree, ideally the optimizer makes this kind of decision
>>>>>>>>>>>>>>> which intermediate result should be cached. But especially
>>>>>>>>>>>>>>> when executing ad-hoc queries the user might better know
>>>>>>>>>>>>>>> which results need to be cached because Flink might not see
>>>>>>>>>>>>>>> the full DAG. In that sense, I would consider the cache()
>>>>>>>>>>>>>>> method as a hint for the optimizer. Of course, in the future
>>>>>>>>>>>>>>> we might add functionality which tries to automatically cache
>>>>>>>>>>>>>>> results (e.g. caching the latest intermediate results until
>>>>>>>>>>>>>>> so and so much space is used). But this should hopefully not
>>>>>>>>>>>>>>> contradict with `CachedTable cache()`.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>> Till
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> On Tue, Dec 4, 2018 at 2:33 PM Becket Qin <becket.qin@gmail.com> wrote:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Hi Till,
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Thanks for the clarification. I am still a little confused.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> If cache() returns a CachedTable, the example might become:
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> b = a.map(...)
>>>>>>>>>>>>>>>> c = a.map(...)
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> cachedTableA = a.cache()
>>>>>>>>>>>>>>>> d = cachedTableA.map(...)
>>>>>>>>>>>>>>>> e = a.map()
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> In the above case, if cache() is lazily evaluated, b, c, d
>>>>>>>>>>>>>>>> and e are all going to be reading from the original DAG that
>>>>>>>>>>>>>>>> generates a. But with a naive expectation, d should be
>>>>>>>>>>>>>>>> reading from the cache. This does not seem to solve the
>>>>>>>>>>>>>>>> potential confusion you raised, right?
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Just to be clear, my understanding is all based on the
>>>>>>>>>>>>>>>> assumption that the tables are immutable. Therefore, after
>>>>>>>>>>>>>>>> a.cache(), the cachedTableA and the original table a should
>>>>>>>>>>>>>>>> be completely interchangeable.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> That said, I think a valid argument is optimization. There
>>>>>>>>>>>>>>>> are indeed cases where reading from the original DAG could
>>>>>>>>>>>>>>>> be faster than reading from the cache. For example, in the
>>>>>>>>>>>>>>>> following example:
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> a.filter(f1' > 100)
>>>>>>>>>>>>>>>> a.cache()
>>>>>>>>>>>>>>>> b = a.filter(f1' < 100)
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Ideally the optimizer should be intelligent enough to decide
>>>>>>>>>>>>>>>> which way is faster, without user intervention. In this
>>>>>>>>>>>>>>>> case, it will identify that b would just be an empty table,
>>>>>>>>>>>>>>>> and thus skip reading from the cache completely. But I agree
>>>>>>>>>>>>>>>> that returning a CachedTable would give users control over
>>>>>>>>>>>>>>>> when to use the cache, even though I still feel that letting
>>>>>>>>>>>>>>>> the optimizer handle this is a better option in the long
>>>>>>>>>>>>>>>> run.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Jiangjie (Becket) Qin
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> On Tue, Dec 4, 2018 at 6:51 PM Till Rohrmann <trohrmann@apache.org> wrote:
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Yes you are right Becket that it still depends on the
>>>>>>>>>>>>>>>>> actual execution of the job whether a consumer reads from a
>>>>>>>>>>>>>>>>> cached result or not.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> My point was actually about the properties of a (cached vs.
>>>>>>>>>>>>>>>>> non-cached) and not about the execution. I would not make
>>>>>>>>>>>>>>>>> cache trigger the execution of the job because one loses
>>>>>>>>>>>>>>>>> some flexibility by eagerly triggering the execution.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> I tried to argue for an explicit CachedTable which is
>>>>>>>>>>>>>>>>> returned by the cache() method like Piotr did in order to
>>>>>>>>>>>>>>>>> make the API more explicit.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>>>> Till
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> On Mon, Dec 3, 2018 at 4:23 PM Becket Qin <becket.qin@gmail.com> wrote:
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Hi Till,
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> That is a good example. Just a minor correction, in this
>>>>>>>>>>>>>>>>>> case, b, c and d will all consume from a non-cached a.
>>>>>>>>>>>>>>>>>> This is because the cache will only be created on the very
>>>>>>>>>>>>>>>>>> first job submission that generates the table to be
>>>>>>>>>>>>>>>>>> cached.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> If I understand correctly, this example is about whether
>>>>>>>>>>>>>>>>>> the .cache() method should be eagerly evaluated or lazily
>>>>>>>>>>>>>>>>>> evaluated. In other words, if the cache() method actually
>>>>>>>>>>>>>>>>>> triggers a job that creates the cache, there will be no
>>>>>>>>>>>>>>>>>> such confusion. Is that right?
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> In the example, although d will not consume from the
>>>>>>>>>>>>>>>>>> cached Table while it looks like it is supposed to, from a
>>>>>>>>>>>>>>>>>> correctness perspective the code will still return the
>>>>>>>>>>>>>>>>>> correct result, assuming that tables are immutable.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Personally I feel it is OK because users probably won't
>>>>>>>>>>>>>>>>>> really worry about whether the table is cached or not. And
>>>>>>>>>>>>>>>>>> a lazy cache could avoid some unnecessary caching if a
>>>>>>>>>>>>>>>>>> cached table is never created in the user application. But
>>>>>>>>>>>>>>>>>> I am not opposed to doing eager evaluation of cache.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Jiangjie (Becket) Qin
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> On Mon, Dec 3, 2018 at 10:01 PM Till Rohrmann <trohrmann@apache.org> wrote:
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Another argument for Piotr's point is that lazily
>>>>>>>>>>>>>>>>>>> changing properties of a node affects all downstream
>>>>>>>>>>>>>>>>>>> consumers but does not necessarily have to happen before
>>>>>>>>>>>>>>>>>>> these consumers are defined. From a user's perspective
>>>>>>>>>>>>>>>>>>> this can be quite confusing:
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> b = a.map(...)
>>>>>>>>>>>>>>>>>>> c = a.map(...)
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> a.cache()
>>>>>>>>>>>>>>>>>>> d = a.map(...)
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> now b, c and d will consume from a cached operator. In
>>>>>>>>>>>>>>>>>>> this case, the user would most likely expect that only d
>>>>>>>>>>>>>>>>>>> reads from a cached result.
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>>>>>> Till
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> On Mon, Dec 3, 2018 at 11:32 AM Piotr Nowojski <piotr@data-artisans.com> wrote:
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> Hey Shaoxuan and Becket,
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> Can you explain a bit more on what the side effects
>>>>>>>>>>>>>>>>>>>>> are? So far my understanding is that such side effects
>>>>>>>>>>>>>>>>>>>>> only exist if a table is mutable. Is that the case?
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> Not only that. There are also performance implications
>>>>>>>>>>>>>>>>>>>> and those are another implicit side effect of using
>>>>>>>>>>>>>>>>>>>> `void cache()`. As I wrote before, reading from the
>>>>>>>>>>>>>>>>>>>> cache might not always be desirable, thus it can cause
>>>>>>>>>>>>>>>>>>>> performance degradation and I’m fine with that - user's
>>>>>>>>>>>>>>>>>>>> or optimiser’s choice. What I do not like is that this
>>>>>>>>>>>>>>>>>>>> implicit side effect can manifest in a completely
>>>>>>>>>>>>>>>>>>>> different part of the code that wasn’t touched by a user
>>>>>>>>>>>>>>>>>>>> while he was adding the `void cache()` call somewhere
>>>>>>>>>>>>>>>>>>>> else. And even if caching improves performance, it’s
>>>>>>>>>>>>>>>>>>>> still a side effect of `void cache()`. Almost by
>>>>>>>>>>>>>>>>>>>> definition `void` methods have only side effects. As I
>>>>>>>>>>>>>>>>>>>> wrote before, there are a couple of scenarios where this
>>>>>>>>>>>>>>>>>>>> might be undesirable and/or unexpected, for example:
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> 1.
>>>>>>>>>>>>>>>>>>>> Table b = …;
>>>>>>>>>>>>>>>>>>>> b.cache()
>>>>>>>>>>>>>>>>>>>> x = b.join(…)
>>>>>>>>>>>>>>>>>>>> y = b.count()
>>>>>>>>>>>>>>>>>>>> // ...
>>>>>>>>>>>>>>>>>>>> // 100
>>>>>>>>>>>>>>>>>>>> // hundred
>>>>>>>>>>>>>>>>>>>> // lines
>>>>>>>>>>>>>>>>>>>> // of
>>>>>>>>>>>>>>>>>>>> // code
>>>>>>>>>>>>>>>>>>>> // later
>>>>>>>>>>>>>>>>>>>> z = b.filter(…).groupBy(…) // this might even be hidden
>>>>>>>>>>>>>>>>>>>> in a different method/file/package/dependency
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> 2.
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> Table b = ...
>>>>>>>>>>>>>>>>>>>> if (some_condition) {
>>>>>>>>>>>>>>>>>>>> foo(b)
>>>>>>>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>>>>>>> else {
>>>>>>>>>>>>>>>>>>>> bar(b)
>>>>>>>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>>>>>>> z = b.filter(…).groupBy(…)
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> void foo(Table b) {
>>>>>>>>>>>>>>>>>>>> b.cache()
>>>>>>>>>>>>>>>>>>>> // do something with b
>>>>>>>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> In both examples above, `b.cache()` will implicitly
>>>>>>>>>>>>>>>>>>>> affect (the semantics of the program in case of sources
>>>>>>>>>>>>>>>>>>>> being mutable, and performance) `z =
>>>>>>>>>>>>>>>>>>>> b.filter(…).groupBy(…)`, which might be far from
>>>>>>>>>>>>>>>>>>>> obvious.
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> On top of that, there is still this argument of mine
>>>>>>>>>>>>>>>>>>>> that having a `MaterializedTable` or `CachedTable`
>>>>>>>>>>>>>>>>>>>> handle is more flexible for us for the future and for
>>>>>>>>>>>>>>>>>>>> the user (as a manual option to bypass cache reads).
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> But Jiangjie is correct, the source table in batching
>>>>>>>>>>>>>>>>>>>>> should be immutable. It is the user’s responsibility to
>>>>>>>>>>>>>>>>>>>>> ensure it, otherwise even a regular failover may lead
>>>>>>>>>>>>>>>>>>>>> to inconsistent results.
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> Yes, I agree that’s what a perfect world/good deployment
>>>>>>>>>>>>>>>>>>>> should be. But it often isn’t, and while I’m not trying
>>>>>>>>>>>>>>>>>>>> to fix this (since the proper fix is to support
>>>>>>>>>>>>>>>>>>>> transactions), I’m just trying to minimise confusion for
>>>>>>>>>>>>>>>>>>>> the users that are not fully aware of what’s going on
>>>>>>>>>>>>>>>>>>>> and operate in a less than perfect setup. And if
>>>>>>>>>>>>>>>>>>>> something bites them after adding a `b.cache()` call, I
>>>>>>>>>>>>>>>>>>>> want to make sure that they at least know all of the
>>>>>>>>>>>>>>>>>>>> places that adding this line can affect.
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> Thanks, Piotrek
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> On 1 Dec 2018, at 15:39, Becket Qin <becket.qin@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> Hi Piotrek,
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> Thanks again for the clarification. Some more replies
>>>>>>>>>>>>>>>>>>>>> follow.
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> But keep in mind that `.cache()` will/might not only
>>>>>>>>>>>>>>>>>>>>>> be used in interactive programming and not only in
>>>>>>>>>>>>>>>>>>>>>> batching.
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> It is true. Actually in stream processing, cache() has
>>>>>>>>>>>>>>>>>>>>> the same semantic as in batch processing. The semantic
>>>>>>>>>>>>>>>>>>>>> is the following:
>>>>>>>>>>>>>>>>>>>>> For a table created via a series of computation, save
>>>>>>>>>>>>>>>>>>>>> that table for later reference to avoid running the
>>>>>>>>>>>>>>>>>>>>> computation logic to regenerate the table. Once the
>>>>>>>>>>>>>>>>>>>>> application exits, drop all the cache.
>>>>>>>>>>>>>>>>>>>>> This semantic is the same for both batch and stream
>>>>>>>>>>>>>>>>>>>>> processing. The difference is that stream applications
>>>>>>>>>>>>>>>>>>>>> will only run once as they are long running. And the
>>>>>>>>>>>>>>>>>>>>> batch applications may be run multiple times, hence the
>>>>>>>>>>>>>>>>>>>>> cache may be created and dropped each time the
>>>>>>>>>>>>>>>>>>>>> application runs.
>>>>>>>>>>>>>>>>>>>>> Admittedly, there will probably be some resource
>>>>>>>>>>>>>>>>>>>>> management requirements for the streaming cached table,
>>>>>>>>>>>>>>>>>>>>> such as time based / size based retention, to address
>>>>>>>>>>>>>>>>>>>>> the infinite data issue. But such requirements do not
>>>>>>>>>>>>>>>>>>>>> change the semantic.
>>>>>>>>>>>>>>>>>>>>> You are right that interactive programming is just one
>>>>>>>>>>>>>>>>>>>>> use case of cache(). It is not the only use case.
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> For me the more important issue is of not having the
>>>>>>>>>>>>>>>>>>>>>> `void cache()` with side effects.
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> This is indeed the key point. The argument around
>>>>>>>>>>>>>>>>>>>>> whether cache() should return something already
>>>>>>>>>>>>>>>>>>>>> indicates that cache() and materialize() address
>>>>>>>>>>>>>>>>>>>>> different issues.
>>>>>>>>>>>>>>>>>>>>> Can you explain a bit more on what the side effects
>>>>>>>>>>>>>>>>>>>>> are? So far my understanding is that such side effects
>>>>>>>>>>>>>>>>>>>>> only exist if a table is mutable. Is that the case?
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> I don’t know, probably initially we should make
>>>>>>>>>>>>>>>>>>>>>> CachedTable read-only. I don’t find it more confusing
>>>>>>>>>>>>>>>>>>>>>> than the fact that user can not write to views or
>>>>>>>>>>>>>>>>>>>>>> materialised views in SQL or that user currently can
>>>>>>>>>>>>>>>>>>>>>> not write to a Table.
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> I don't think anyone should insert something into a
>>>>>>>>>>>>>>>>>>>>> cache. By definition the cache should only be updated
>>>>>>>>>>>>>>>>>>>>> when the corresponding original table is updated. What
>>>>>>>>>>>>>>>>>>>>> I am wondering is that given the following two facts:
>>>>>>>>>>>>>>>>>>>>> 1. If and only if a table is mutable (with something
>>>>>>>>>>>>>>>>>>>>> like insert()), a CachedTable may have implicit
>>>>>>>>>>>>>>>>>>>>> behavior.
>>>>>>>>>>>>>>>>>>>>> 2. A CachedTable extends a Table.
>>>>>>>>>>>>>>>>>>>>> We can come to the conclusion that a CachedTable is
>>>>>>>>>>>>>>>>>>>>> mutable and users can insert into the CachedTable
>>>>>>>>>>>>>>>>>>>>> directly. This is what I thought was confusing.
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> Jiangjie (Becket) Qin
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> On Sat, Dec 1, 2018 at 2:45 AM Piotr Nowojski <piotr@data-artisans.com> wrote:
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> Regarding naming `cache()` vs `materialize()`. One
>>>>>>>>>>>>>>>>>>>>>> more explanation why I think `materialize()` is more
>>>>>>>>>>>>>>>>>>>>>> natural to me is that I think of all “Table”s in
>>>>>>>>>>>>>>>>>>>>>> Table-API as views. They behave the same way as SQL
>>>>>>>>>>>>>>>>>>>>>> views, the only difference for me is that their life
>>>>>>>>>>>>>>>>>>>>>> scope is short - the current session, which is limited
>>>>>>>>>>>>>>>>>>>>>> by the different execution model. That’s why “caching”
>>>>>>>>>>>>>>>>>>>>>> a view for me is just materialising it.
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> However I see and I understand your point of view.
>>>>>>>>>>>>>>>>>>>>>> Coming from DataSet/DataStream and generally speaking
>>>>>>>>>>>>>>>>>>>>>> the non-SQL world, `cache()` is more natural. But keep
>>>>>>>>>>>>>>>>>>>>>> in mind that `.cache()` will/might not only be used in
>>>>>>>>>>>>>>>>>>>>>> interactive programming and not only in batching. But
>>>>>>>>>>>>>>>>>>>>>> naming is one issue, and not that critical to me.
>>>>>>>>>>>>>>>>>>>>>> Especially that once we implement proper materialised
>>>>>>>>>>>>>>>>>>>>>> views, we can always deprecate/rename `cache()` if we
>>>>>>>>>>>>>>>>>>>>>> deem so.
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> For me the more important issue is of not having the
>>>>>>>>>>>>>>>>>>>>>> `void cache()` with side effects. Exactly for the
>>>>>>>>>>>>>>>>>>>>>> reasons that you have mentioned. True: results might
>>>>>>>>>>>>>>>>>>>>>> be non deterministic if the underlying source tables
>>>>>>>>>>>>>>>>>>>>>> are changing. The problem is that `void cache()`
>>>>>>>>>>>>>>>>>>>>>> implicitly changes the semantic of subsequent uses of
>>>>>>>>>>>>>>>>>>>>>> the cached/materialized Table. It can cause a “wtf”
>>>>>>>>>>>>>>>>>>>>>> moment for a user if he inserts a “b.cache()” call in
>>>>>>>>>>>>>>>>>>>>>> some place in his code and suddenly some other random
>>>>>>>>>>>>>>>>>>>>>> places are behaving differently. If `materialize()` or
>>>>>>>>>>>>>>>>>>>>>> `cache()` returns a Table handle, we force the user to
>>>>>>>>>>>>>>>>>>>>>> explicitly use the cache, which removes the “random”
>>>>>>>>>>>>>>>>>>>>>> part from the "suddenly some other random places are
>>>>>>>>>>>>>>>>>>>>>> behaving differently”.
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> This argument and others that I’ve raised (greater
>>>>>>>>>>>>>>>>>>>>>> flexibility/allowing the user to explicitly bypass the
>>>>>>>>>>>>>>>>>>>>>> cache) are independent of the `cache()` vs
>>>>>>>>>>>>>>>>>>>>>> `materialize()` discussion.
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> Does that mean one can also insert into the
>>>>>>>>>>>>>>>>>>>>>>> CachedTable? This sounds pretty confusing.
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> I don’t know, probably initially we should make
>>>>>>>>>>>>>>>>>>>>>> CachedTable read-only. I don’t find it more confusing
>>>>>>>>>>>>>>>>>>>>>> than the fact that user can not write to views or
>>>>>>>>>>>>>>>>>>>>>> materialised views in SQL or that user currently can
>>>>>>>>>>>>>>>>>>>>>> not write to a Table.
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> Piotrek
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> On 30 Nov 2018, at 17:38, Xingcan Cui <xingcanc@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> I agree with @Becket that `cache()` and
>>>>>>>>>>>>>>>>>>>>>>> `materialize()` should be considered as two different
>>>>>>>>>>>>>>>>>>>>>>> methods where the latter one is more sophisticated.
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> According to my understanding, the initial idea is
>>>>>>>>>>>>>>>>>>>>>>> just to introduce a simple cache or persist
>>>>>>>>>>>>>>>>>>>>>>> mechanism, but as the TableAPI is a high-level API,
>>>>>>>>>>>>>>>>>>>>>>> it’s natural for us to think in a SQL way.
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> Maybe we can add the `cache()` method to the DataSet
>>>>>>>>>>>>>>>>>>>>>>> API and force users to translate a Table to a Dataset
>>>>>>>>>>>>>>>>>>>>>>> before caching it. Then the users should manually
>>>>>>>>>>>>>>>>>>>>>>> register the cached dataset to a table again (we may
>>>>>>>>>>>>>>>>>>>>>>> need some table replacement mechanisms for datasets
>>>>>>>>>>>>>>>>>>>>>>> with an identical schema but different contents
>>>>>>>>>>>>>>>>>>>>>>> here). After all, it’s the dataset rather than the
>>>>>>>>>>>>>>>>>>>>>>> dynamic table that needs to be cached, right?
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>>>>>>>>> Xingcan
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> On Nov 30, 2018, at 10:57 AM, Becket Qin <becket.qin@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> Hi Piotrek and Jark,
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> Thanks for the feedback and explanation. Those are
>>>>>>>>>>>>>>>>>>>>>>>> good arguments. But I think those arguments are
>>>>>>>>>>>>>>>>>>>>>>>> mostly about materialized views. Let me try to
>>>>>>>>>>>>>>>>>>>>>>>> explain the reason I believe cache() and
>>>>>>>>>>>>>>>>>>>>>>>> materialize() are different.
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> I think cache() and materialize() have quite
>>>>>>>>>>>>>>>>>>>>>>>> different implications. An analogy I can think of is
>>>>>>>>>>>>>>>>>>>>>>>> save()/publish(). When users call cache(), it is
>>>>>>>>>>>>>>>>>>>>>>>> just like they are saving an intermediate result as
>>>>>>>>>>>>>>>>>>>>>>>> a draft of their work; this intermediate result may
>>>>>>>>>>>>>>>>>>>>>>>> not have any realistic meaning. Calling cache() does
>>>>>>>>>>>>>>>>>>>>>>>> not mean users want to publish the cached table in
>>>>>>>>>>>>>>>>>>>>>>>> any manner. But when users call materialize(), that
>>>>>>>>>>>>>>>>>>>>>>>> means "I have something meaningful to be reused by
>>>>>>>>>>>>>>>>>>>>>>>> others", and now users need to think about the
>>>>>>>>>>>>>>>>>>>>>>>> validation, update & versioning, lifecycle of the
>>>>>>>>>>>>>>>>>>>>>>>> result, etc.
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> Piotrek's suggestions on variations of the
>>>>>>>>>>>>>>>>>>>>>>>> materialize() methods are very useful. It would be
>>>>>>>>>>>>>>>>>>>>>>>> great if Flink had them. The concept of materialized
>>>>>>>>>>>>>>>>>>>>>>>> view is actually a pretty big feature, not to
>>>>>>>>>>>>>>>>>>>>>>>> mention the related stuff like triggers/hooks you
>>>>>>>>>>>>>>>>>>>>>>>> mentioned earlier. I think the materialized view
>>>>>>>>>>>>>>>>>>>>>>>> itself should be discussed in a more thorough and
>>>>>>>>>>>>>>>>>>>>>>>> systematic manner. And I found that discussion is
>>>>>>>>>>>>>>>>>>>>>>>> kind of orthogonal to and way beyond the interactive
>>>>>>>>>>>>>>>>>>>>>>>> programming experience.
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> The example you gave was interesting. I still have
>>>>>>>>>>>>>>>>>>>>>>>> some questions, though.
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> Table source = … // some source that scans files
>>>>>>>>>>>>>>>>>>>>>>>>> from a directory “/foo/bar/“
>>>>>>>>>>>>>>>>>>>>>>>>> Table t1 = source.groupBy(…).select(…).where(…) ….;
>>>>>>>>>>>>>>>>>>>>>>>>> Table t2 = t1.materialize() // (or `cache()`)
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> t2.count() // initialise cache (if it’s lazily
>>>>>>>>>>>>>>>>>>>>>>>>> initialised)
>>>>>>>>>>>>>>>>>>>>>>>>> int a1 = t1.count()
>>>>>>>>>>>>>>>>>>>>>>>>> int b1 = t2.count()
>>>>>>>>>>>>>>>>>>>>>>>>> // something in the background (or we trigger it)
>>>>>>>>>>>>>>>>>>>>>>>>> writes new files to /foo/bar
>>>>>>>>>>>>>>>>>>>>>>>>> int a2 = t1.count()
>>>>>>>>>>>>>>>>>>>>>>>>> int b2 = t2.count()
>>>>>>>>>>>>>>>>>>>>>>>>> t2.refresh() // possible future extension, not to
>>>>>>>>>>>>>>>>>>>>>>>>> be implemented in the initial version
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> what if someone else added some more files to
>>>>>>>>>>>>>>>>>>>>>>>> /foo/bar at this point? In that case, a3 won't equal
>>>>>>>>>>>>>>>>>>>>>>>> b3, and the result becomes non-deterministic, right?
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> int a3 = t1.count()
>>>>>>>>>>>>>>>>>>>>>>>>> int b3 = t2.count()
>>>>>>>>>>>>>>>>>>>>>>>>> t2.drop() // another possible future extension,
>>>>>>>>>>>>>>>>>>>>>>>>> manual “cache” dropping
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> When we talk about interactive programming, in most
>>>>>>>>>>>>>>>>>>>>>>>> cases we are talking about batch applications. A
>>>>>>>>>>>>>>>>>>>>>>>> fundamental assumption of such a case is that the
>>>>>>>>>>>>>>>>>>>>>>>> source data is complete before the data processing
>>>>>>>>>>>>>>>>>>>>>>>> begins, and the data will not change during the data
>>>>>>>>>>>>>>>>>>>>>>>> processing. IMO, if additional rows need to be added
>>>>>>>>>>>>>>>>>>>>>>>> to some source during the processing, it should be
>>>>>>>>>>>>>>>>>>>>>>>> done in ways like unioning the source with another
>>>>>>>>>>>>>>>>>>>>>>>> table containing the rows to be added.
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> There are a few cases in which computations are
>>>>>>>>>>>>>>>>>>>>>>>> executed repeatedly on a changing data source.
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> For example, people may run an ML training job every
>>>>>>>>>>>>>>>>>>>>>>>> hour with the samples newly added in the past hour.
>>>>>>>>>>>>>>>>>>>>>>>> In that case, the source data between runs will
>>>>>>>>>>>>>>>>>>>>>>>> indeed change. But still, the data remains unchanged
>>>>>>>>>>>>>>>>>>>>>>>> within one run. And usually in that case, the result
>>>>>>>>>>>>>>>>>>>>>>>> will need versioning, i.e. for a given result, it
>>>>>>>>>>>>>>>>>>>>>>>> tells that the result is a result from the source
>>>>>>>>>>>>>>>>>>>>>>>> data by a certain timestamp.
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> Another example is something like a data warehouse.
>>>>>>>>>>>>>>>>>>>>>>>> In this case, there are a few sources of
>>>>>>>>>>>>>>>>>>>>>>>> original/raw data. On top of those sources, many
>>>>>>>>>>>>>>>>>>>>>>>> materialized views / queries / reports / dashboards
>>>>>>>>>>>>>>>>>>>>>>>> can be created to generate derived data. That
>>>>>>>>>>>>>>>>>>>>>>>> derived data needs to be updated when the underlying
>>>>>>>>>>>>>>>>>>>>>>>> original data changes. In that case, the processing
>>>>>>>>>>>>>>>>>>>>>>>> logic that derives it from the original data needs
>>>>>>>>>>>>>>>>>>>>>>>> to be executed repeatedly to update those
>>>>>>>>>>>>>>>>>>>>>>>> reports/views. Again, all those derived data also
>>>>>>>>>>>>>>>>>>>>>>>> need to ha



Re: [DISCUSS] Support Interactive Programming in Flink Table API

Posted by Becket Qin <be...@gmail.com>.
Happy New Year, everybody!

I would like to resume this discussion thread. At this point, we have
agreed on the first-step goal of interactive programming. The open
discussion is the exact API. More specifically, what the *cache()* method
should return and what its semantics are. There are three options:

*Option 1*
*void cache()* OR *Table cache()* which returns the original table for
chained calls.
*void uncache() *releases the cache.
*Table.hint(ignoreCache).foo()* to ignore cache for operation foo().

- Semantic: a.cache() hints that table 'a' should be cached. Optimizer
decides whether the cache will be used or not.
- pros: simple and no confusion between CachedTable and original table
- cons: A table may be cached / uncached inside a method invocation, without
the caller knowing about it.
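
For illustration, a minimal sketch of what user code could look like under
option 1 (the environment setup and the exact hint name are illustrative
assumptions, not settled API):

Table a = tEnv.scan("Src").select(...);
a.cache();                                    // hint: "a" should be cached
Table c = a.filter(...);                      // optimizer may read the cache
Table d = a.hint("ignoreCache").select(...);  // always uses the original DAG
a.uncache();                                  // release the cache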

*Option 2*
*CachedTable cache()*
*CachedTable *extends *Table *with an additional *uncache()* method

- Semantic: After *val cachedA = a.cache()*, *cachedA.foo()* will always
use the cache. *a.bar()* will always use the original DAG.
- pros: No potential side effects in method invocation.
- cons: Optimizer has no chance to kick in. Future optimization will become
a behavior change and will require users to change their code.
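
The same program under option 2 (again just an illustrative sketch):

Table a = tEnv.scan("Src").select(...);
CachedTable cachedA = a.cache();   // explicit handle to the cached table
Table c = cachedA.filter(...);     // always reads from the cache
Table d = a.filter(...);           // always uses the original DAG
cachedA.uncache();                 // release the cache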

*Option 3*
*CacheHandle cache()*
*CacheHandle.release() *to release a cache handle on the table. If all
cache handles are released, the cache could be removed.
*Table.hint(ignoreCache).foo()* to ignore cache for operation foo().

- Semantic: *a.cache() *hints that 'a' should be cached. Optimizer decides
whether the cache will be used or not. The cache is released either when no
handle refers to it anymore or when the user program exits.
- pros: No potential side effects in method invocation. No confusion between
the cached table vs. the original table.
- cons: An additional CacheHandle class is exposed to the users.
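
And under option 3 (same illustrative assumptions):

Table a = tEnv.scan("Src").select(...);
CacheHandle handle = a.cache();               // hint: "a" should be cached
Table c = a.filter(...);                      // optimizer may read the cache
Table d = a.hint("ignoreCache").select(...);  // always uses the original DAG
handle.release();                             // cache dropped once all handles are released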


Personally I prefer option 3 for the following reasons:
1. It is simple. The vast majority of users would just call *a.cache()*
followed by *a.foo()*, *a.bar()*, etc.
2. There is no semantic ambiguity and semantic change if we decide to add
implicit cache in the future.
3. There is no side effect in the method calls.
4. Admittedly we need to expose one more CacheHandle class to the users.
But it is not that difficult to understand given the similar well-known
concept of ref counting (we can name it CacheReference if that is easier to
understand). So I think it is fine.


Thanks,

Jiangjie (Becket) Qin


On Thu, Dec 13, 2018 at 11:23 AM Becket Qin <be...@gmail.com> wrote:

> Hi Piotrek,
>
> 1. Regarding optimization.
> Sure, there are many cases in which the decision is hard to make. But that
> does not make it any easier for the users to make those decisions. I
> imagine 99% of the users would just naively use cache. I am not saying we
> can optimize in all the cases. But as long as we agree that at least in
> certain cases (I would argue most cases) the optimizer can do a little
> better than an average user who likely knows little about Flink internals,
> we should not push the burden of optimization onto users.
>
> BTW, it seems some of your concerns are related to the implementation. I
> did not mention the implementation of the caching service because that
> should not affect the API semantics. Not sure if this helps, but imagine
> the default implementation has one StorageNode service colocated with each
> TM. It could be running within the TM process or in a standalone process,
> depending on configuration.
>
> The StorageNode uses a memory + spill-to-disk mechanism. The cached data
> will just be written to the local StorageNode service. If the StorageNode
> is running within the TM process, the in-memory cache could just be
> objects, so we save some serde cost. A later job referring to the cached
> Table will be scheduled in a locality-aware manner, i.e. run in the TM
> whose peer StorageNode hosts the data.
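>
> To make the locality-aware scheduling concrete, here is a rough sketch of
> the placement decision (all type and method names below are hypothetical
> illustrations, not existing Flink APIs):
>
> import java.util.List;
>
> record StorageLocation(String host) {}              // where a cached partition lives
> record CachedPartition(StorageLocation location) {} // one partition of a cached table
> record TaskManagerSlot(String host) {}              // candidate slot for the reader task
>
> class LocalityAwarePlacement {
>     // Prefer the slot colocated with the StorageNode hosting the cached
>     // partition; otherwise fall back to any slot (remote read).
>     static TaskManagerSlot pick(CachedPartition p, List<TaskManagerSlot> slots) {
>         return slots.stream()
>                 .filter(s -> s.host().equals(p.location().host()))
>                 .findFirst()
>                 .orElse(slots.get(0));
>     }
> }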
>
>
> 2. Semantic
> I am not sure why introducing a new hintCache() or
> env.enableAutomaticCaching() method would avoid the consequence of semantic
> change.
>
> If the auto optimization is not enabled by default, users still need to
> make code changes to all existing programs in order to get the benefit.
> If the auto optimization is enabled by default, advanced users who know
> that they really want to use the cache will suddenly lose the opportunity
> to do so, unless they change the code to disable auto optimization.
>
>
> 3. side effect
> The CacheHandle is not only about where to put uncache(). It also solves
> the implicit performance impact by moving uncache() to the CacheHandle.
>
>    - If users want to leverage the cache, they can call a.cache(). After
>    that, unless the user explicitly releases that CacheHandle, a.foo() will
>    always leverage the cache if needed (the optimizer may choose to ignore
>    the cache if that helps accelerate the process). No other function call
>    will be able to release the cache, because it does not have that
>    CacheHandle.
>    - If some advanced users do not want to use the cache at all, they will
>    call a.hint(ignoreCache).foo(). This will for sure ignore the cache and
>    use the original DAG to process.
>
>
> > In the vast majority of cases, users wouldn't really care whether the
>> cache is used or not.
>> I wouldn’t agree with that, because “caching” (if not purely in-memory
>> caching) would add additional IO costs. It’s similar to saying that users
>> would not see a difference between Spark/Flink and MapReduce (MapReduce
>> writes data to disks after every map/reduce stage).
>
> What I wanted to say is that in most cases, after users call cache(), they
> don't really care about whether auto optimization has decided to ignore the
> cache or not, as long as the program runs faster.
>
> Thanks,
>
> Jiangjie (Becket) Qin
>
>
>
>
>
>
>
>
> On Wed, Dec 12, 2018 at 10:50 PM Piotr Nowojski <pi...@data-artisans.com>
> wrote:
>
>> Hi,
>>
>> Thanks for the quick answer :)
>>
>> Re 1.
>>
>> I generally agree with you, however a couple of points:
>>
>> a) the problem with using automatic caching is bigger, because you will
>> have to decide how to compare IO vs CPU costs, and if you pick wrong, the
>> additional IO costs might be enormous or can even crash your system. This
>> is a more difficult problem compared to, let's say, join reordering, where
>> the only issue is to have good statistics that can capture correlations
>> between columns (when you reorder joins, the number of IO operations does
>> not change)
>> b) your example is completely independent of caching.
>>
>> Query like this:
>>
>> src1.filter('f1 > 10).join(src2.filter('f2 < 30), `f1 === `f2).as('f3,
>> …).filter(‘f3 > 30)
>>
>> Should/could be optimised to an empty result immediately, without the need
>> for any cache/materialisation, and that should work even without any
>> statistics provided by the connector.
>>
>> For me a prerequisite to any serious cost-based optimisations would be
>> some reasonable benchmark coverage of the code (TPC-H?). Otherwise that
>> would be equivalent to adding untested code, since we wouldn’t be able to
>> verify our assumptions, like how the writing of 10 000 records to a
>> cache/RocksDB/Kafka/CSV file compares to the joining/filtering/processing
>> of, let's say, 1 000 000 rows.
>>
>> Re 2.
>>
>> I wasn’t proposing to change the semantic later. I was proposing that we
>> start now:
>>
>> CachedTable cachedA = a.cache()
>> cachedA.foo() // Cache is used
>> a.bar() // Original DAG is used
>>
>> And then later we can think about adding for example
>>
>> CachedTable cachedA = a.hintCache()
>> cachedA.foo() // Cache might be used
>> a.bar() // Original DAG is used
>>
>> Or
>>
>> env.enableAutomaticCaching()
>> a.foo() // Cache might be used
>> a.bar() // Cache might be used
>>
>> Or (I would still not like this option):
>>
>> a.hintCache()
>> a.foo() // Cache might be used
>> a.bar() // Cache might be used
>>
>> Or whatever else that will come to our mind. Even if we add some
>> automatic caching in the future, keeping explicit (`CachedTable cache()`)
>> caching will still be useful, at least in some cases.
>>
>> Re 3.
>>
>> > 2. The source tables are immutable during one run of the batch
>> > processing logic.
>> > 3. The cache is immutable during one run of the batch processing logic.
>>
>> > I think assumptions 2 and 3 are by definition what batch processing
>> > means, i.e. the data must be complete before it is processed and should
>> > not change while the processing is running.
>>
>> I agree that this is how batch systems SHOULD be working. However I know
>> from my previous experience that it’s not always the case. Sometimes users
>> are just working on some non-transactional storage, which can be (either
>> constantly or occasionally) modified by some other processes for whatever
>> reason (fixing the data, updating, adding new data etc).
>>
>> But even if we ignore this point (data immutability), the performance
>> side effect issue of your proposal remains. If a user calls `void
>> a.cache()` deep inside some private method, it will have implicit side
>> effects on other parts of his program that might not be obvious.
>>
>> Re `CacheHandle`.
>>
>> If I understand it correctly, it only addresses the issue of where to
>> place the `uncache`/`dropCache` method.
>>
>> Btw,
>>
>> > In the vast majority of cases, users wouldn't really care whether the
>> > cache is used or not.
>>
>> I wouldn’t agree with that, because “caching” (if not purely in-memory
>> caching) would add additional IO costs. It’s similar to saying that users
>> would not see a difference between Spark/Flink and MapReduce (MapReduce
>> writes data to disks after every map/reduce stage).
>>
>> Piotrek
>>
>> > On 12 Dec 2018, at 14:28, Becket Qin <be...@gmail.com> wrote:
>> >
>> > Hi Piotrek,
>> >
>> > Not sure if you noticed, in my last email, I was proposing `CacheHandle
>> > cache()` to avoid the potential side effect due to function calls.
>> >
>> > Let's look at the disagreement in your reply one by one.
>> >
>> >
>> > 1. Optimization chances
>> >
>> > Optimization is never trivial work. This is exactly why we should not
>> > let users do it manually. Databases have done a huge amount of work in
>> > this area. At Alibaba, we rely heavily on many optimization rules to
>> > boost the SQL query performance.
>> >
>> > In your example, if I fill in the filter conditions in a certain way,
>> > the optimization would become obvious.
>> >
>> > Table src1 = … // read from connector 1
>> > Table src2 = … // read from connector 2
>> >
>> > Table a = src1.filter('f1 > 10).join(src2.filter('f2 < 30), `f1 ===
>> > `f2).as('f3, ...)
>> > a.cache() // write cache to connector 3; when writing the records,
>> > remember the min and max of `f1
>> >
>> > a.filter('f3 > 30) // There is no need to read from any connector
>> > because `a` does not contain any record whose 'f3 is greater than 30.
>> > env.execute()
>> > a.select(…)
>> >
>> > BTW, it seems to me that adding some basic statistics is fairly
>> > straightforward and the cost is pretty marginal if not negligible. In
>> > fact it is not only needed for optimization, but also for cases such as
>> > ML, where some algorithms may need to decide their parameters based on
>> > the statistics of the data.
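>> >
>> > As a minimal sketch of the min/max idea (hypothetical names, assuming
>> > the stats were recorded when the cache was written):
>> >
>> > record ColumnStats(long min, long max) {}
>> >
>> > class MinMaxPruning {
>> >     // A scan of the cache for the predicate "f > bound" can be skipped
>> >     // entirely when even the largest cached value cannot satisfy it.
>> >     static boolean canSkipScan(ColumnStats stats, long bound) {
>> >         return stats.max() <= bound;
>> >     }
>> > }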
>> >
>> >
>> > 2. Same API, one semantic now, another semantic later.
>> >
>> > I am trying to understand what the semantics of the `CachedTable
>> > cache()` you are proposing are. IMO, we should avoid designing an API
>> > whose semantics will be changed later. If we have a "CachedTable
>> > cache()" method, then the semantics should be very clearly defined
>> > upfront and not change later. It should never be "right now let's go
>> > with semantic 1, later we can silently change it to semantic 2 or 3".
>> > Such a change could result in bad consequences. For example, let's say
>> > we decide to go with semantic 1:
>> > example, let's say we decide go with semantic 1:
>> >
>> > CachedTable cachedA = a.cache()
>> > cachedA.foo() // Cache is used
>> > a.bar() // Original DAG is used.
>> >
>> > Now the majority of users would be using cachedA.foo() in their code.
>> > And some advanced users will use a.bar() to explicitly skip the cache.
>> > Later on, we add smart optimization and change the semantics to
>> > semantic 2:
>> >
>> > CachedTable cachedA = a.cache()
>> > cachedA.foo() // Cache is used
>> > a.bar() // Cache MIGHT be used, and Flink may decide to skip the cache
>> > if that is faster.
>> >
>> > Now most of the users who were writing cachedA.foo() will not benefit
>> > from this optimization at all, unless they change their code to use
>> > a.foo() instead. And those advanced users suddenly lose the option to
>> > explicitly ignore the cache unless they change their code (assuming we
>> > care enough to provide something like hint(useCache)). If we don't
>> > define the semantics carefully, our users will have to change their
>> > code again and again, while they shouldn't have to.
>> >
>> >
>> > 3. side effect.
>> >
>> > Before we talk about side effects, we have to agree on the assumptions.
>> > The assumptions I have are the following:
>> > 1. We are talking about batch processing.
>> > 2. The source tables are immutable during one run of the batch
>> > processing logic.
>> > 3. The cache is immutable during one run of the batch processing logic.
>> >
>> > I think assumptions 2 and 3 are by definition what batch processing
>> > means, i.e. the data must be complete before it is processed and should
>> > not change while the processing is running.
>> >
>> > As far as I am aware, I don't know of any batch processing system that
>> > breaks those assumptions. Even for relational database tables, where
>> > queries can run with concurrent modifications, the necessary locking is
>> > still required to ensure the integrity of the query result.
>> >
>> > Please let me know if you disagree with the above assumptions. If you
>> > agree with these assumptions, with the `CacheHandle cache()` API in my
>> > last email, do you still see side effects?
>> >
>> > Thanks,
>> >
>> > Jiangjie (Becket) Qin
>> >
>> >
>> > On Wed, Dec 12, 2018 at 7:11 PM Piotr Nowojski <piotr@data-artisans.com> wrote:
>> >
>> >> Hi Becket,
>> >>
>> >>> Regarding the chance of optimization, it might not be that rare. Some
>> >>> very simple statistics could already help in many cases. For example,
>> >>> simply maintaining the max and min of each field can already eliminate
>> >>> some unnecessary table scans (potentially scanning the cached table)
>> >>> if the result is doomed to be empty. A histogram would give even
>> >>> further information. The optimizer could be very careful and only
>> >>> ignore the cache when it is 100% sure doing that is cheaper, e.g. only
>> >>> when a filter on the cache will absolutely return nothing.
>> >>
>> >> I do not see how this might be easy to achieve. It would require tons
>> >> of effort to make it work, and in the end you would still have the
>> >> problem of comparing/trading CPU cycles vs IO. For example:
>> >>
>> >> Table src1 = … // read from connector 1
>> >> Table src2 = … // read from connector 2
>> >>
>> >> Table a = src1.filter(…).join(src2.filter(…), …)
>> >> a.cache() // write cache to connector 3
>> >>
>> >> a.filter(…)
>> >> env.execute()
>> >> a.select(…)
>> >>
>> >> Deciding whether it's better to:
>> >> A) read from connector1/connector2, filter/map and join them twice
>> >> B) read from connector1/connector2, filter/map and join them once, pay
>> >> the price of writing to connector 3 and then reading from it
>> >>
>> >> is very far from trivial. `a` can end up much larger than `src1` and
>> >> `src2`, writes to connector 3 might be extremely slow, reads from
>> >> connector 3 can be slower compared to reads from connectors 1 & 2, … .
>> >> You really need to have extremely good statistics to correctly assess
>> >> the size of the output, and it would still fail many times
>> >> (correlations etc). And keep in mind that at the moment we do not have
>> >> ANY statistics at all. More than that, it would require significantly
>> >> more testing and setting up some benchmarks to make sure that we do not
>> >> break it with some regressions.
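>> >>
>> >> In other words, the decision boils down to a comparison like the
>> >> sketch below, where every cost input is a made-up placeholder - and
>> >> getting those numbers right is precisely what would require statistics
>> >> that we do not have today:
>> >>
>> >> class CacheCostModel {
>> >>     // All cost inputs are hypothetical placeholders, for illustration only.
>> >>     static boolean cachingPaysOff(double produceOnce, // scan + filter + join
>> >>                                   double cacheWrite,  // write to connector 3
>> >>                                   double cacheRead,   // one read of the cache
>> >>                                   int consumers) {    // times `a` is consumed
>> >>         double planA = consumers * produceOnce;       // recompute each time
>> >>         double planB = produceOnce + cacheWrite + consumers * cacheRead;
>> >>         return planB < planA;                         // cache only when cheaper
>> >>     }
>> >> }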
>> >>
>> >> That’s why I’m strongly opposing this idea - at least let’s not starts
>> >> with this. If we first start with completely manual/explicit caching,
>> >> without any magic, it would be a significant improvement for the users
>> for
>> >> a fraction of the development cost. After implementing that, when we
>> >> already have all of the working pieces, we can start working on some
>> >> optimisations rules. As I wrote before, if we start with
>> >>
>> >> `CachedTable cache()`
>> >>
>> >> We can later work on follow-up stories to make it automatic. Even
>> >> though I don't like this implicit/side-effect approach with a `void`
>> >> method, having an explicit `CachedTable cache()` wouldn't even prevent
>> >> us from later adding a `void hintCache()` method with the exact
>> >> semantics that you want.
>> >>
>> >> On top of that I raise again that having an implicit `void
>> >> cache()/hintCache()` has other side effects and problems with
>> >> non-immutable data, and is annoying when used secretly inside methods.
>> >>
>> >> An explicit `CachedTable cache()` just looks like a much less
>> >> controversial MVP, and if we decide to go further with this topic, it's
>> >> not a wasted effort but lies on a straight path to more
>> >> advanced/complicated solutions in the future. Are there any drawbacks
>> >> to starting with `CachedTable cache()` that I'm missing?
>> >>
>> >> Piotrek
>> >>
>> >>> On 12 Dec 2018, at 09:30, Jeff Zhang <zj...@gmail.com> wrote:
>> >>>
>> >>> Hi Becket,
>> >>>
>> >>> Introducing CacheHandle seems too complicated. That means users have
>> >>> to maintain the handle properly.
>> >>>
>> >>> And since cache is just a hint for the optimizer, why not just return
>> >>> the Table itself from the cache method? This hint info should be kept
>> >>> in the Table, I believe.
>> >>>
>> >>> So how about adding the methods cache and uncache to Table, both
>> >>> returning Table? What cache and uncache do is just add some hint info
>> >>> into the Table.
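>> >>>
>> >>> As a minimal sketch of that idea (hypothetical, not existing Flink
>> >>> code), the hint would just be a flag carried by the Table:
>> >>>
>> >>> final class Table {
>> >>>     private final boolean cacheHint;
>> >>>     Table(boolean cacheHint) { this.cacheHint = cacheHint; }
>> >>>
>> >>>     Table cache()   { return new Table(true);  } // optimizer may cache
>> >>>     Table uncache() { return new Table(false); } // optimizer won't cache
>> >>> }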
>> >>>
>> >>>
>> >>>
>> >>>
>> >>> Becket Qin <be...@gmail.com> wrote on Wed, Dec 12, 2018 at 11:25 AM:
>> >>>
>> >>>> Hi Till and Piotrek,
>> >>>>
>> >>>> Thanks for the clarification. That resolves quite a bit of confusion.
>> >>>> My understanding of how cache works is the same as what Till
>> >>>> described, i.e. cache() is a hint to Flink, but it is not guaranteed
>> >>>> that the cache always exists and it might be recomputed from its
>> >>>> lineage.
>> >>>>
>> >>>>> Is this the core of our disagreement here? That you would like this
>> >>>>> “cache()” to be mostly a hint for the optimiser?
>> >>>>
>> >>>> Semantics-wise, yes. That's also why I think materialize() has a
>> >>>> much larger scope than cache(), and thus it should be a different
>> >>>> method.
>> >>>>
>> >>>> Regarding the chance of optimization, it might not be that rare.
>> >>>> Some very simple statistics could already help in many cases. For
>> >>>> example, simply maintaining the max and min of each field can already
>> >>>> eliminate some unnecessary table scans (potentially scanning the
>> >>>> cached table) if the result is doomed to be empty. A histogram would
>> >>>> give even further information. The optimizer could be very careful
>> >>>> and only ignore the cache when it is 100% sure doing that is cheaper,
>> >>>> e.g. only when a filter on the cache will absolutely return nothing.
>> >>>>
>> >>>> Given the above clarification on cache, I would like to revisit the
>> >>>> original "void cache()" proposal and see if we can improve on top of
>> >>>> that.
>> >>>>
>> >>>> What do you think about the following modified interface?
>> >>>>
>> >>>> Table {
>> >>>> /**
>> >>>>  * This call hints Flink to maintain a cache of this table and
>> >>>>  * leverage it for performance optimization if needed.
>> >>>>  * Note that Flink may still decide not to use the cache if doing
>> >>>>  * so is cheaper.
>> >>>>  *
>> >>>>  * A CacheHandle will be returned to allow the user to release the
>> >>>>  * cache actively. The cache will be deleted if there are no
>> >>>>  * unreleased cache handles to it. When the TableEnvironment is
>> >>>>  * closed, the cache will also be deleted and all the cache handles
>> >>>>  * will be released.
>> >>>>  *
>> >>>>  * @return a CacheHandle referring to the cache of this table.
>> >>>>  */
>> >>>> CacheHandle cache();
>> >>>> }
>> >>>>
>> >>>> CacheHandle {
>> >>>> /**
>> >>>>  * Close the cache handle. This method does not necessarily delete
>> >>>>  * the cache. Instead, it simply decrements the reference counter to
>> >>>>  * the cache. When there is no handle referring to a cache, the
>> >>>>  * cache will be deleted.
>> >>>>  *
>> >>>>  * @return the number of open handles to the cache after this handle
>> >>>>  * has been released.
>> >>>>  */
>> >>>> int release();
>> >>>> }
>> >>>>
>> >>>> The rationale behind this interface is the following:
>> >>>> In the vast majority of cases, users wouldn't really care whether the
>> >>>> cache is used or not. So I think the most intuitive way is letting
>> >>>> cache() return nothing, so nobody needs to worry about the difference
>> >>>> between operations on CachedTables and those on the "original"
>> >>>> tables. This will make maybe 99.9% of the users happy. There were two
>> >>>> concerns raised for this approach:
>> >>>> 1. In some rare cases, users may want to ignore the cache.
>> >>>> 2. A table might be cached/uncached in a third-party function while
>> >>>> the caller does not know.
>> >>>>
>> >>>> For the first issue, users can use hint("ignoreCache") to explicitly
>> >>>> ignore the cache.
>> >>>> For the second issue, the above proposal lets cache() return a
>> >>>> CacheHandle, whose only method is release(). Different CacheHandles
>> >>>> will refer to the same cache; if a cache no longer has any cache
>> >>>> handle, it will be deleted. This will address the following case:
>> >>>> {
>> >>>> val handle1 = a.cache()
>> >>>> process(a)
>> >>>> a.select(...) // cache is still available, handle1 has not been
>> >> released.
>> >>>> }
>> >>>>
>> >>>> void process(Table t) {
>> >>>> val handle2 = t.cache() // new handle to cache
>> >>>> t.select(...) // optimizer decides cache usage
>> >>>> t.hint("ignoreCache").select(...) // cache is ignored
>> >>>> handle2.release() // release the handle, but the cache may still be
>> >>>> available if there are other handles
>> >>>> ...
>> >>>> }
>> >>>>
>> >>>> Does the above modified approach look reasonable to you?
>> >>>>
>> >>>> Cheers,
>> >>>>
>> >>>> Jiangjie (Becket) Qin
>> >>>>
>> >>>>
>> >>>>
>> >>>>
>> >>>>
>> >>>>
>> >>>>
>> >>>> On Tue, Dec 11, 2018 at 6:44 PM Till Rohrmann <tr...@apache.org>
>> >>>> wrote:
>> >>>>
>> >>>>> Hi Becket,
>> >>>>>
>> >>>>> I was aiming at semantics similar to 1. I actually thought that
>> >> `cache()`
>> >>>>> would tell the system to materialize the intermediate result so that
>> >>>>> subsequent queries don't need to reprocess it. This means that the
>> >> usage
>> >>>> of
>> >>>>> the cached table in this example
>> >>>>>
>> >>>>> {
>> >>>>> val cachedTable = a.cache()
>> >>>>> val b1 = cachedTable.select(…)
>> >>>>> val b2 = cachedTable.foo().select(…)
>> >>>>> val b3 = cachedTable.bar().select(...)
>> >>>>> val c1 = a.select(…)
>> >>>>> val c2 = a.foo().select(…)
>> >>>>> val c3 = a.bar().select(...)
>> >>>>> }
>> >>>>>
>> >>>>> strongly depends on interleaved calls which trigger the execution of
>> >> sub
>> >>>>> queries. So for example, if there is only a single env.execute call
>> at
>> >>>> the
>> >>>>> end of the block, then b1, b2, b3, c1, c2 and c3 would all be computed
>> by
>> >>>>> reading directly from the sources (given that there is only a single
>> >>>>> JobGraph). It just happens that the result of `a` will be cached
>> such
>> >>>> that
>> >>>>> we skip the processing of `a` when there are subsequent queries
>> reading
>> >>>>> from `cachedTable`. If for some reason the system cannot materialize
>> >> the
>> >>>>> table (e.g. running out of disk space, ttl expired), then it could
>> also
>> >>>>> happen that we need to reprocess `a`. In that sense `cachedTable`
>> >> simply
>> >>>> is
>> >>>>> an identifier for the materialized result of `a` with the lineage
>> how
>> >> to
>> >>>>> reprocess it.
>> >>>>>
>> >>>>> Cheers,
>> >>>>> Till
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>> On Tue, Dec 11, 2018 at 11:01 AM Piotr Nowojski <
>> >> piotr@data-artisans.com
>> >>>>>
>> >>>>> wrote:
>> >>>>>
>> >>>>>> Hi Becket,
>> >>>>>>
>> >>>>>>> {
>> >>>>>>> val cachedTable = a.cache()
>> >>>>>>> val b = cachedTable.select(...)
>> >>>>>>> val c = a.select(...)
>> >>>>>>> }
>> >>>>>>>
>> >>>>>>> Semantic 1. b uses cachedTable as user demanded so. c uses
>> original
>> >>>> DAG
>> >>>>>> as
>> >>>>>>> user demanded so. In this case, the optimizer has no chance to
>> >>>>> optimize.
>> >>>>>>> Semantic 2. b uses cachedTable as user demanded so. c leaves the
>> >>>>>> optimizer
>> >>>>>>> to choose whether the cache or DAG should be used. In this case,
>> user
>> >>>>>> lose
>> >>>>>>> the option to NOT use cache.
>> >>>>>>>
>> >>>>>>> As you can see, neither of the options seem perfect. However, I
>> guess
>> >>>>> you
>> >>>>>>> and Till are proposing the third option:
>> >>>>>>>
>> >>>>>>> Semantic 3. b leaves the optimizer to choose whether cache or DAG
>> >>>>> should
>> >>>>>> be
>> >>>>>>> used. c always use the DAG.
>> >>>>>>
>> >>>>>> I am pretty sure that Till, Fabian, others and I were all proposing
>> >>>>>> and advocating in favour of semantic “1”. No cost-based optimiser
>> >>>>>> decisions at all.
>> >>>>>>
>> >>>>>> {
>> >>>>>> val cachedTable = a.cache()
>> >>>>>> val b1 = cachedTable.select(…)
>> >>>>>> val b2 = cachedTable.foo().select(…)
>> >>>>>> val b3 = cachedTable.bar().select(...)
>> >>>>>> val c1 = a.select(…)
>> >>>>>> val c2 = a.foo().select(…)
>> >>>>>> val c3 = a.bar().select(...)
>> >>>>>> }
>> >>>>>>
>> >>>>>> All b1, b2 and b3 are reading from the cache, while c1, c2 and c3
>> >>>>>> are re-executing the whole plan for “a”.
>> >>>>>>
>> >>>>>> In the future we could discuss going one step further, introducing
>> >> some
>> >>>>>> global optimisation (that can be manually enabled/disabled):
>> >>>> deduplicate
>> >>>>>> plan nodes/deduplicate sub queries/re-use sub queries results/or
>> >>>> whatever
>> >>>>>> we could call it. It could do two things:
>> >>>>>>
>> >>>>>> 1. Automatically try to deduplicate fragments of the plan and share
>> >> the
>> >>>>>> result using CachedTable - in other words automatically insert
>> >>>>> `CachedTable
>> >>>>>> cache()` calls.
>> >>>>>> 2. Automatically make decision to bypass explicit `CachedTable`
>> access
>> >>>>>> (this would be the equivalent of what you described as “semantic
>> 3”).
>> >>>>>>
>> >>>>>> However as I wrote previously, I have big doubts if such cost-based
>> >>>>>> optimisation would work (this applies also to “Semantic 2”). I
>> would
>> >>>>> expect
>> >>>>>> it to do more harm than good in so many cases, that it wouldn’t
>> make
>> >>>>> sense.
>> >>>>>> Even assuming that we calculate statistics perfectly (this ain’t
>> >>>>>> gonna happen), it’s virtually impossible to correctly estimate the
>> >>>>>> exchange rate of CPU cycles vs IO operations as it changes so much
>> >>>>>> from deployment to deployment.
>> >>>>>>
>> >>>>>> Is this the core of our disagreement here? That you would like this
>> >>>>>> “cache()” to be mostly hint for the optimiser?
>> >>>>>>
>> >>>>>> Piotrek
>> >>>>>>
>> >>>>>>> On 11 Dec 2018, at 06:00, Becket Qin <be...@gmail.com>
>> wrote:
>> >>>>>>>
>> >>>>>>> Another potential concern for semantic 3 is this: in the future,
>> >>>>>>> we may add automatic caching to Flink, e.g. cache the intermediate
>> >>>>>>> results at the shuffle boundary. If our semantic is that a
>> >>>>>>> reference to the original table means skipping the cache, those
>> >>>>>>> users may not be able to benefit from the implicit cache.
>> >>>>>>>
>> >>>>>>>
>> >>>>>>>
>> >>>>>>> On Tue, Dec 11, 2018 at 12:10 PM Becket Qin <becket.qin@gmail.com
>> >
>> >>>>>> wrote:
>> >>>>>>>
>> >>>>>>>> Hi Piotrek,
>> >>>>>>>>
>> >>>>>>>> Thanks for the reply. Thought about it again, I might have
>> >>>>> misunderstood
>> >>>>>>>> your proposal in earlier emails. Returning a CachedTable might
>> not
>> >>>> be
>> >>>>> a
>> >>>>>> bad
>> >>>>>>>> idea.
>> >>>>>>>>
>> >>>>>>>> I was more concerned about the semantic and its intuitiveness
>> when a
>> >>>>>>>> CachedTable is returned. i..e, if cache() returns CachedTable.
>> What
>> >>>>> are
>> >>>>>> the
>> >>>>>>>> semantic in the following code:
>> >>>>>>>> {
>> >>>>>>>> val cachedTable = a.cache()
>> >>>>>>>> val b = cachedTable.select(...)
>> >>>>>>>> val c = a.select(...)
>> >>>>>>>> }
>> >>>>>>>> What is the difference between b and c? At the first glance, I
>> see
>> >>>> two
>> >>>>>>>> options:
>> >>>>>>>>
>> >>>>>>>> Semantic 1. b uses cachedTable as user demanded so. c uses
>> original
>> >>>>> DAG
>> >>>>>> as
>> >>>>>>>> user demanded so. In this case, the optimizer has no chance to
>> >>>>> optimize.
>> >>>>>>>> Semantic 2. b uses cachedTable as user demanded so. c leaves the
>> >>>>>> optimizer
>> >>>>>>>> to choose whether the cache or DAG should be used. In this case,
>> >>>> user
>> >>>>>> lose
>> >>>>>>>> the option to NOT use cache.
>> >>>>>>>>
>> >>>>>>>> As you can see, neither of the options seem perfect. However, I
>> >>>> guess
>> >>>>>> you
>> >>>>>>>> and Till are proposing the third option:
>> >>>>>>>>
>> >>>>>>>> Semantic 3. b leaves the optimizer to choose whether cache or DAG
>> >>>>> should
>> >>>>>>>> be used. c always use the DAG.
>> >>>>>>>>
>> >>>>>>>> This does address all the concerns. It is just that, from an
>> >>>>>>>> intuitiveness perspective, I found that asking the user to
>> >>>>>>>> explicitly use a CachedTable while the optimizer might choose to
>> >>>>>>>> ignore it is a little weird. That was why I did not think about
>> >>>>>>>> that semantic. But given there is material benefit, I think this
>> >>>>>>>> semantic is acceptable.
>> >>>>>>>>
>> >>>>>>>>> 1. If we want to let the optimiser make decisions whether to
>> >>>>>>>>> use the cache or not, then why do we need a “void cache()”
>> >>>>>>>>> method at all? Would it “increase” the chance of using the
>> >>>>>>>>> cache? That sounds strange. What would be the mechanism of
>> >>>>>>>>> deciding whether to use the cache or not? If we want to
>> >>>>>>>>> introduce such kind of automated optimisations of “plan nodes
>> >>>>>>>>> deduplication” I would turn it on globally, not per table, and
>> >>>>>>>>> let the optimiser do all of the work.
>> >>>>>>>>> 2. We do not have statistics at the moment for any use/not use
>> >>>>>>>>> cache decision.
>> >>>>>>>>> 3. Even if we had, I would be veeerryy sceptical whether such
>> >>>>>>>>> cost-based optimisations would work properly and I would still
>> >>>>>>>>> insist first on providing an explicit caching mechanism
>> >>>>>>>>> (`CachedTable cache()`)
>> >>>>>>>>>
>> >>>>>>>> We are absolutely on the same page here. An explicit cache()
>> >>>>>>>> method is necessary not only because the optimizer may not be
>> >>>>>>>> able to make the right decision, but also because of the nature
>> >>>>>>>> of interactive programming. For example, if users write the
>> >>>>>>>> following code in the Scala shell:
>> >>>>>>>> val b = a.select(...)
>> >>>>>>>> val c = b.select(...)
>> >>>>>>>> val d = c.select(...).writeToSink(...)
>> >>>>>>>> tEnv.execute()
>> >>>>>>>> There is no way the optimizer will know whether b or c will be
>> >>>>>>>> used in later code, unless users hint explicitly.
>> >>>>>>>>
>> >>>>>>>> At the same time I’m not sure if you have responded to our
>> >>>> objections
>> >>>>> of
>> >>>>>>>>> `void cache()` being implicit/having side effects, which me,
>> Jark,
>> >>>>>> Fabian,
>> >>>>>>>>> Till and I think also Shaoxuan are supporting.
>> >>>>>>>>
>> >>>>>>>> Is there any other side effects if we use semantic 3 mentioned
>> >>>> above?
>> >>>>>>>>
>> >>>>>>>> Thanks,
>> >>>>>>>>
>> >>>>>>>> JIangjie (Becket) Qin
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>> On Mon, Dec 10, 2018 at 7:54 PM Piotr Nowojski <
>> >>>>> piotr@data-artisans.com
>> >>>>>>>
>> >>>>>>>> wrote:
>> >>>>>>>>
>> >>>>>>>>> Hi Becket,
>> >>>>>>>>>
>> >>>>>>>>> Sorry for not responding long time.
>> >>>>>>>>>
>> >>>>>>>>> Regarding case1.
>> >>>>>>>>>
>> >>>>>>>>> There wouldn’t be an “a.unCache()” method, but I would expect
>> >>>>>>>>> only `cachedTableA1.dropCache()`. Dropping `cachedTableA1`
>> >>>>>>>>> wouldn’t affect `cachedTableA2`. Just as in any other database,
>> >>>>>>>>> dropping/modifying one independent table/materialised view does
>> >>>>>>>>> not affect others.
>> >>>>>>>>>
>> >>>>>>>>>> What I meant is that assuming there is already a cached table,
>> >>>>>>>>>> ideally users need not specify whether the next query should
>> >>>>>>>>>> read from the cache or use the original DAG. This should be
>> >>>>>>>>>> decided by the optimizer.
>> >>>>>>>>>
>> >>>>>>>>> 1. If we want to let the optimiser make decisions whether to
>> >>>>>>>>> use the cache or not, then why do we need a “void cache()”
>> >>>>>>>>> method at all? Would it “increase” the chance of using the
>> >>>>>>>>> cache? That sounds strange. What would be the mechanism of
>> >>>>>>>>> deciding whether to use the cache or not? If we want to
>> >>>>>>>>> introduce such kind of automated optimisations of “plan nodes
>> >>>>>>>>> deduplication” I would turn it on globally, not per table, and
>> >>>>>>>>> let the optimiser do all of the work.
>> >>>>>>>>> 2. We do not have statistics at the moment for any use/not use
>> >>>>>>>>> cache decision.
>> >>>>>>>>> 3. Even if we had, I would be veeerryy sceptical whether such
>> >>>>>>>>> cost-based optimisations would work properly and I would still
>> >>>>>>>>> insist first on providing an explicit caching mechanism
>> >>>>>>>>> (`CachedTable cache()`)
>> >>>>>>>>> 4. As Till wrote, having explicit `CachedTable cache()` doesn’t
>> >>>>>>>>> contradict future work on automated cost-based caching.
>> >>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>> At the same time I’m not sure if you have responded to our
>> >>>> objections
>> >>>>>> of
>> >>>>>>>>> `void cache()` being implicit/having side effects, which me,
>> Jark,
>> >>>>>> Fabian,
>> >>>>>>>>> Till and I think also Shaoxuan are supporting.
>> >>>>>>>>>
>> >>>>>>>>> Piotrek
>> >>>>>>>>>
>> >>>>>>>>>> On 5 Dec 2018, at 12:42, Becket Qin <be...@gmail.com>
>> wrote:
>> >>>>>>>>>>
>> >>>>>>>>>> Hi Till,
>> >>>>>>>>>>
>> >>>>>>>>>> It is true that after the first job submission, there will be
>> no
>> >>>>>>>>> ambiguity
>> >>>>>>>>>> in terms of whether a cached table is used or not. That is the
>> >>>> same
>> >>>>>> for
>> >>>>>>>>> the
>> >>>>>>>>>> cache() without returning a CachedTable.
>> >>>>>>>>>>
>> >>>>>>>>>>> Conceptually one could think of cache() as introducing a
>> >>>>>>>>>>> caching operator from which you need to consume if you want to
>> >>>>>>>>>>> benefit from the caching functionality.
>> >>>>>>>>>>
>> >>>>>>>>>> I am thinking a little differently. I think it is a hint (as
>> you
>> >>>>>>>>> mentioned
>> >>>>>>>>>> later) instead of a new operator. I'd like to be careful about
>> the
>> >>>>>>>>> semantic
>> >>>>>>>>>> of the API. A hint is a property set on an existing operator,
>> but
>> >>>> is
>> >>>>>> not
>> >>>>>>>>>> itself an operator as it does not really manipulate the data.
>> >>>>>>>>>>
>> >>>>>>>>>> I agree, ideally the optimizer makes this kind of decision
>> which
>> >>>>>>>>>>> intermediate result should be cached. But especially when
>> >>>> executing
>> >>>>>>>>> ad-hoc
>> >>>>>>>>>>> queries the user might better know which results need to be
>> >>>> cached
>> >>>>>>>>> because
>> >>>>>>>>>>> Flink might not see the full DAG. In that sense, I would
>> consider
>> >>>>> the
>> >>>>>>>>>>> cache() method as a hint for the optimizer. Of course, in the
>> >>>>> future
>> >>>>>> we
>> >>>>>>>>>>> might add functionality which tries to automatically cache
>> >>>> results
>> >>>>>>>>> (e.g.
>> >>>>>>>>>>> caching the latest intermediate results until so and so much
>> >>>> space
>> >>>>> is
>> >>>>>>>>>>> used). But this should hopefully not contradict with
>> `CachedTable
>> >>>>>>>>> cache()`.
>> >>>>>>>>>>
>> >>>>>>>>>> I agree that cache() method is needed for exactly the reason
>> you
>> >>>>>>>>> mentioned,
>> >>>>>>>>>> i.e. Flink cannot predict what users are going to write later,
>> so
>> >>>>>> users
>> >>>>>>>>>> need to tell Flink explicitly that this table will be used
>> later.
>> >>>>>> What I
>> >>>>>>>>>> meant is that assuming there is already a cached table, ideally
>> >>>>>>>>>> users need not specify whether the next query should read from
>> >>>>>>>>>> the cache or use the original DAG. This should be decided by the
>> >>>>>>>>>> optimizer.
>> >>>>>>>>>>
>> >>>>>>>>>> To explain the difference between returning / not returning a
>> >>>>>>>>> CachedTable,
>> >>>>>>>>>> I want to compare the following two cases:
>> >>>>>>>>>>
>> >>>>>>>>>> *Case 1:  returning a CachedTable*
>> >>>>>>>>>> b = a.map(...)
>> >>>>>>>>>> val cachedTableA1 = a.cache()
>> >>>>>>>>>> val cachedTableA2 = a.cache()
>> >>>>>>>>>> b.print() // Just to make sure a is cached.
>> >>>>>>>>>>
>> >>>>>>>>>> c = a.filter(...) // User specify that the original DAG is
>> used?
>> >>>> Or
>> >>>>>> the
>> >>>>>>>>>> optimizer decides whether DAG or cache should be used?
>> >>>>>>>>>> d = cachedTableA1.filter() // User specify that the cached
>> table
>> >>>> is
>> >>>>>>>>> used.
>> >>>>>>>>>>
>> >>>>>>>>>> a.unCache() // Can cachedTableA still be used afterwards?
>> >>>>>>>>>> cachedTableA1.uncache() // Can cachedTableA2 still be used?
>> >>>>>>>>>>
>> >>>>>>>>>> *Case 2: not returning a CachedTable*
>> >>>>>>>>>> b = a.map()
>> >>>>>>>>>> a.cache()
>> >>>>>>>>>> a.cache() // no-op
>> >>>>>>>>>> b.print() // Just to make sure a is cached
>> >>>>>>>>>>
>> >>>>>>>>>> c = a.filter(...) // Optimizer decides whether the cache or DAG
>> >>>>> should
>> >>>>>>>>> be
>> >>>>>>>>>> used
>> >>>>>>>>>> d = a.filter(...) // Optimizer decides whether the cache or DAG
>> >>>>> should
>> >>>>>>>>> be
>> >>>>>>>>>> used
>> >>>>>>>>>>
>> >>>>>>>>>> a.unCache()
>> >>>>>>>>>> a.unCache() // no-op
>> >>>>>>>>>>
>> >>>>>>>>>> In case 1, semantic-wise, the optimizer loses the option to
>> >>>>>>>>>> choose between DAG and cache. And the unCache() call becomes
>> >>>>>>>>>> tricky.
>> >>>>>>>>>> In case 2, users do not need to worry about whether cache or
>> DAG
>> >>>> is
>> >>>>>>>>> used.
>> >>>>>>>>>> And the unCache() semantic is clear. However, the caveat is
>> that
>> >>>>> users
>> >>>>>>>>>> cannot explicitly ignore the cache.
>> >>>>>>>>>>
>> >>>>>>>>>> In order to address the issues mentioned in case 2 and
>> inspired by
>> >>>>> the
>> >>>>>>>>>> discussion so far, I am thinking about using hint to allow user
>> >>>>>>>>> explicitly
>> >>>>>>>>>> ignore cache. Although we do not have hint yet, but we probably
>> >>>>> should
>> >>>>>>>>> have
>> >>>>>>>>>> one. So the code becomes:
>> >>>>>>>>>>
>> >>>>>>>>>> *Case 3: returning this table*
>> >>>>>>>>>> b = a.map()
>> >>>>>>>>>> a.cache()
>> >>>>>>>>>> a.cache() // no-op
>> >>>>>>>>>> b.print() // Just to make sure a is cached
>> >>>>>>>>>>
>> >>>>>>>>>> c = a.filter(...) // Optimizer decides whether the cache or DAG
>> >>>>> should
>> >>>>>>>>> be
>> >>>>>>>>>> used
>> >>>>>>>>>> d = a.hint("ignoreCache").filter(...) // DAG will be used
>> instead
>> >>>> of
>> >>>>>> the
>> >>>>>>>>>> cache.
>> >>>>>>>>>>
>> >>>>>>>>>> a.unCache()
>> >>>>>>>>>> a.unCache() // no-op
>> >>>>>>>>>>
>> >>>>>>>>>> We could also let cache() return this table to allow chained
>> >>>> method
>> >>>>>>>>> calls.
>> >>>>>>>>>> Do you think this API addresses the concerns?
>> >>>>>>>>>>
>> >>>>>>>>>> Thanks,
>> >>>>>>>>>>
>> >>>>>>>>>> Jiangjie (Becket) Qin
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>> On Wed, Dec 5, 2018 at 10:55 AM Jark Wu <im...@gmail.com>
>> wrote:
>> >>>>>>>>>>
>> >>>>>>>>>>> Hi,
>> >>>>>>>>>>>
>> >>>>>>>>>>> All the recent discussions are focused on whether there is a
>> >>>>> problem
>> >>>>>> if
>> >>>>>>>>>>> cache() not return a Table.
>> >>>>>>>>>>> It seems that returning a Table explicitly is more clear (and
>> >>>>> safe?).
>> >>>>>>>>>>>
>> >>>>>>>>>>> So whether there are any problems if cache() returns a Table?
>> >>>>>> @Becket
>> >>>>>>>>>>>
>> >>>>>>>>>>> Best,
>> >>>>>>>>>>> Jark
>> >>>>>>>>>>>
>> >>>>>>>>>>> On Tue, 4 Dec 2018 at 22:27, Till Rohrmann <
>> trohrmann@apache.org
>> >>>>>
>> >>>>>>>>> wrote:
>> >>>>>>>>>>>
>> >>>>>>>>>>>> It's true that b, c, d and e will all read from the original
>> DAG
>> >>>>>> that
>> >>>>>>>>>>>> generates a. But all subsequent operators (when running
>> multiple
>> >>>>>>>>> queries)
>> >>>>>>>>>>>> which reference cachedTableA should not need to reproduce `a`
>> >>>> but
>> >>>>>>>>>>> directly
>> >>>>>>>>>>>> consume the intermediate result.
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> Conceptually one could think of cache() as introducing a
>> >>>>>>>>>>>> caching operator from which you need to consume if you want
>> >>>>>>>>>>>> to benefit from the caching functionality.
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> I agree, ideally the optimizer makes this kind of decision
>> which
>> >>>>>>>>>>>> intermediate result should be cached. But especially when
>> >>>>> executing
>> >>>>>>>>>>> ad-hoc
>> >>>>>>>>>>>> queries the user might better know which results need to be
>> >>>> cached
>> >>>>>>>>>>> because
>> >>>>>>>>>>>> Flink might not see the full DAG. In that sense, I would
>> >>>> consider
>> >>>>>> the
>> >>>>>>>>>>>> cache() method as a hint for the optimizer. Of course, in the
>> >>>>> future
>> >>>>>>>>> we
>> >>>>>>>>>>>> might add functionality which tries to automatically cache
>> >>>> results
>> >>>>>>>>> (e.g.
>> >>>>>>>>>>>> caching the latest intermediate results until so and so much
>> >>>> space
>> >>>>>> is
>> >>>>>>>>>>>> used). But this should hopefully not contradict with
>> >>>> `CachedTable
>> >>>>>>>>>>> cache()`.
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> Cheers,
>> >>>>>>>>>>>> Till
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> On Tue, Dec 4, 2018 at 2:33 PM Becket Qin <
>> becket.qin@gmail.com
>> >>>>>
>> >>>>>>>>> wrote:
>> >>>>>>>>>>>>
>> >>>>>>>>>>>>> Hi Till,
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>> Thanks for the clarification. I am still a little confused.
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>> If cache() returns a CachedTable, the example might become:
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>> b = a.map(...)
>> >>>>>>>>>>>>> c = a.map(...)
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>> cachedTableA = a.cache()
>> >>>>>>>>>>>>> d = cachedTableA.map(...)
>> >>>>>>>>>>>>> e = a.map()
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>> In the above case, if cache() is lazily evaluated, b, c, d
>> and
>> >>>> e
>> >>>>>> are
>> >>>>>>>>>>> all
>> >>>>>>>>>>>>> going to be reading from the original DAG that generates a.
>> But
>> >>>>>> with
>> >>>>>>>>> a
>> >>>>>>>>>>>>> naive expectation, d should be reading from the cache. This
>> >>>> seems
>> >>>>>> not
>> >>>>>>>>>>>>> solving the potential confusion you raised, right?
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>> Just to be clear, my understanding are all based on the
>> >>>>> assumption
>> >>>>>>>>> that
>> >>>>>>>>>>>> the
>> >>>>>>>>>>>>> tables are immutable. Therefore, after a.cache(), a the
>> >>>>>>>>> c*achedTableA*
>> >>>>>>>>>>>> and
>> >>>>>>>>>>>>> original table *a * should be completely interchangeable.
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>> That said, I think a valid argument is optimization. There
>> are
>> >>>>>> indeed
>> >>>>>>>>>>>> cases
>> >>>>>>>>>>>>> that reading from the original DAG could be faster than
>> reading
>> >>>>>> from
>> >>>>>>>>>>> the
>> >>>>>>>>>>>>> cache. For example, in the following example:
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>> a.filter('f1 > 100)
>> >>>>>>>>>>>>> a.cache()
>> >>>>>>>>>>>>> b = a.filter('f1 < 100)
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>> Ideally the optimizer should be intelligent enough to decide
>> >>>>> which
>> >>>>>>>>> way
>> >>>>>>>>>>> is
>> >>>>>>>>>>>>> faster, without user intervention. In this case, it will
>> >>>> identify
>> >>>>>>>>> that
>> >>>>>>>>>>> b
>> >>>>>>>>>>>>> would just be an empty table, thus skip reading from the
>> cache
>> >>>>>>>>>>>> completely.
>> >>>>>>>>>>>>> But I agree that returning a CachedTable would give user the
>> >>>>>> control
>> >>>>>>>>> of
>> >>>>>>>>>>>>> when to use cache, even though I still feel that letting the
>> >>>>>>>>> optimizer
>> >>>>>>>>>>>>> handle this is a better option in long run.
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>> Thanks,
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>> Jiangjie (Becket) Qin
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>> On Tue, Dec 4, 2018 at 6:51 PM Till Rohrmann <
>> >>>>> trohrmann@apache.org
>> >>>>>>>
>> >>>>>>>>>>>> wrote:
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>>> Yes you are right Becket that it still depends on the
>> actual
>> >>>>>>>>>>> execution
>> >>>>>>>>>>>> of
>> >>>>>>>>>>>>>> the job whether a consumer reads from a cached result or
>> not.
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>> My point was actually about the properties of a (cached vs.
>> >>>>>>>>>>> non-cached)
>> >>>>>>>>>>>>> and
>> >>>>>>>>>>>>>> not about the execution. I would not make cache trigger the
>> >>>>>>>>> execution
>> >>>>>>>>>>>> of
>> >>>>>>>>>>>>>> the job because one loses some flexibility by eagerly
>> >>>> triggering
>> >>>>>> the
>> >>>>>>>>>>>>>> execution.
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>> I tried to argue for an explicit CachedTable which is
>> returned
>> >>>>> by
>> >>>>>>>>> the
>> >>>>>>>>>>>>>> cache() method like Piotr did in order to make the API more
>> >>>>>>>>> explicit.
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>> Cheers,
>> >>>>>>>>>>>>>> Till
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>> On Mon, Dec 3, 2018 at 4:23 PM Becket Qin <
>> >>>> becket.qin@gmail.com
>> >>>>>>
>> >>>>>>>>>>>> wrote:
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> Hi Till,
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> That is a good example. Just a minor correction, in this
>> >>>> case,
>> >>>>>> b, c
>> >>>>>>>>>>>>> and d
>> >>>>>>>>>>>>>>> will all consume from a non-cached a. This is because
>> cache
>> >>>>> will
>> >>>>>>>>>>> only
>> >>>>>>>>>>>>> be
>> >>>>>>>>>>>>>>> created on the very first job submission that generates
>> the
>> >>>>> table
>> >>>>>>>>>>> to
>> >>>>>>>>>>>> be
>> >>>>>>>>>>>>>>> cached.
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> If I understand correctly, this example is about whether
>> >>>>>>>>>>>>>>> the .cache() method
>> >>>>>>>>>>>>>>> should be eagerly evaluated or lazily evaluated. In
>> another
>> >>>>> word,
>> >>>>>>>>>>> if
>> >>>>>>>>>>>>>>> cache() method actually triggers a job that creates the
>> >>>> cache,
>> >>>>>>>>>>> there
>> >>>>>>>>>>>>> will
>> >>>>>>>>>>>>>>> be no such confusion. Is that right?
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> In the example, although d will not consume from the
>> cached
>> >>>>> Table
>> >>>>>>>>>>>> while
>> >>>>>>>>>>>>>> it
>> >>>>>>>>>>>>>>> looks supposed to, from correctness perspective the code
>> will
>> >>>>>> still
>> >>>>>>>>>>>>>> return
>> >>>>>>>>>>>>>>> correct result, assuming that tables are immutable.
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> Personally I feel it is OK because users probably won't
>> >>>> really
>> >>>>>>>>>>> worry
>> >>>>>>>>>>>>>> about
>> >>>>>>>>>>>>>>> whether the table is cached or not. And lazy cache could
>> >>>> avoid
>> >>>>>> some
>> >>>>>>>>>>>>>>> unnecessary caching if a cached table is never created in
>> the
>> >>>>>> user
>> >>>>>>>>>>>>>>> application. But I am not opposed to do eager evaluation
>> of
>> >>>>>> cache.
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> Thanks,
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> Jiangjie (Becket) Qin
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> On Mon, Dec 3, 2018 at 10:01 PM Till Rohrmann <
>> >>>>>>>>>>> trohrmann@apache.org>
>> >>>>>>>>>>>>>>> wrote:
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>> Another argument for Piotr's point is that lazily
>> changing
>> >>>>>>>>>>>> properties
>> >>>>>>>>>>>>>> of
>> >>>>>>>>>>>>>>> a
>> >>>>>>>>>>>>>>>> node affects all down stream consumers but does not
>> >>>>> necessarily
>> >>>>>>>>>>>> have
>> >>>>>>>>>>>>> to
>> >>>>>>>>>>>>>>>> happen before these consumers are defined. From a user's
>> >>>>>>>>>>>> perspective
>> >>>>>>>>>>>>>> this
>> >>>>>>>>>>>>>>>> can be quite confusing:
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>> b = a.map(...)
>> >>>>>>>>>>>>>>>> c = a.map(...)
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>> a.cache()
>> >>>>>>>>>>>>>>>> d = a.map(...)
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>> now b, c and d will consume from a cached operator. In
>> this
>> >>>>>> case,
>> >>>>>>>>>>>> the
>> >>>>>>>>>>>>>>> user
>> >>>>>>>>>>>>>>>> would most likely expect that only d reads from a cached
>> >>>>> result.
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>> Cheers,
>> >>>>>>>>>>>>>>>> Till
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>> On Mon, Dec 3, 2018 at 11:32 AM Piotr Nowojski <
>> >>>>>>>>>>>>>> piotr@data-artisans.com>
>> >>>>>>>>>>>>>>>> wrote:
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>> Hey Shaoxuan and Becket,
>> >>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>> Can you explain a bit more on what the side effects
>> >>>>>>>>>>>>>>>>>> are? So far my understanding is that such side effects
>> >>>>>>>>>>>>>>>>>> only exist if a table is mutable. Is that the case?
>> >>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>> Not only that. There are also performance implications
>> and
>> >>>>>>>>>>> those
>> >>>>>>>>>>>>> are
>> >>>>>>>>>>>>>>>>> another implicit side effects of using `void cache()`.
>> As I
>> >>>>>>>>>>> wrote
>> >>>>>>>>>>>>>>> before,
>> >>>>>>>>>>>>>>>>> reading from cache might not always be desirable, thus
>> it
>> >>>> can
>> >>>>>>>>>>>> cause
>> >>>>>>>>>>>>>>>>> performance degradation and I’m fine with that - user's
>> or
>> >>>>>>>>>>>>>> optimiser’s
>> >>>>>>>>>>>>>>>>> choice. What I do not like is that this implicit side
>> >>>> effect
>> >>>>>>>>>>> can
>> >>>>>>>>>>>>>>> manifest
>> >>>>>>>>>>>>>>>>> in completely different part of code, that wasn’t
>> touched
>> >>>> by
>> >>>>> a
>> >>>>>>>>>>>> user
>> >>>>>>>>>>>>>>> while
>> >>>>>>>>>>>>>>>>> he was adding `void cache()` call somewhere else. And
>> even
>> >>>> if
>> >>>>>>>>>>>>> caching
>> >>>>>>>>>>>>>>>>> improves performance, it’s still a side effect of `void
>> >>>>>>>>>>> cache()`.
>> >>>>>>>>>>>>>>> Almost
>> >>>>>>>>>>>>>>>>> from the definition `void` methods have only side
>> effects.
>> >>>>> As I
>> >>>>>>>>>>>>> wrote
>> >>>>>>>>>>>>>>>>> before, there are couple of scenarios where this might
>> be
>> >>>>>>>>>>>>> undesirable
>> >>>>>>>>>>>>>>>>> and/or unexpected, for example:
>> >>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>> 1.
>> >>>>>>>>>>>>>>>>> Table b = …;
>> >>>>>>>>>>>>>>>>> b.cache()
>> >>>>>>>>>>>>>>>>> x = b.join(…)
>> >>>>>>>>>>>>>>>>> y = b.count()
>> >>>>>>>>>>>>>>>>> // ...
>> >>>>>>>>>>>>>>>>> // 100
>> >>>>>>>>>>>>>>>>> // hundred
>> >>>>>>>>>>>>>>>>> // lines
>> >>>>>>>>>>>>>>>>> // of
>> >>>>>>>>>>>>>>>>> // code
>> >>>>>>>>>>>>>>>>> // later
>> >>>>>>>>>>>>>>>>> z = b.filter(…).groupBy(…) // this might be even hidden
>> in
>> >>>> a
>> >>>>>>>>>>>>>> different
>> >>>>>>>>>>>>>>>>> method/file/package/dependency
>> >>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>> 2.
>> >>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>> Table b = ...
>> >>>>>>>>>>>>>>>>> If (some_condition) {
>> >>>>>>>>>>>>>>>>> foo(b)
>> >>>>>>>>>>>>>>>>> }
>> >>>>>>>>>>>>>>>>> Else {
>> >>>>>>>>>>>>>>>>> bar(b)
>> >>>>>>>>>>>>>>>>> }
>> >>>>>>>>>>>>>>>>> z = b.filter(…).groupBy(…)
>> >>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>> Void foo(Table b) {
>> >>>>>>>>>>>>>>>>> b.cache()
>> >>>>>>>>>>>>>>>>> // do something with b
>> >>>>>>>>>>>>>>>>> }
>> >>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>> In both above examples, `b.cache()` will implicitly
>> affect
>> >>>>>>>>>>>>> (semantic
>> >>>>>>>>>>>>>>> of a
>> >>>>>>>>>>>>>>>>> program in case of sources being mutable and
>> performance)
>> >>>> `z
>> >>>>> =
>> >>>>>>>>>>>>>>>>> b.filter(…).groupBy(…)` which might be far from obvious.
>> >>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>> On top of that, there is still this argument of mine
>> that
>> >>>>>>>>>>> having
>> >>>>>>>>>>>> a
>> >>>>>>>>>>>>>>>>> `MaterializedTable` or `CachedTable` handle is more
>> >>>> flexible
>> >>>>>>>>>>> for
>> >>>>>>>>>>>> us
>> >>>>>>>>>>>>>> for
>> >>>>>>>>>>>>>>>> the
>> >>>>>>>>>>>>>>>>> future and for the user (as a manual option to bypass
>> cache
>> >>>>>>>>>>>> reads).
>> >>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>> But Jiangjie is correct,
>> >>>>>>>>>>>>>>>>>> the source table in batching should be immutable. It is
>> >>>> the
>> >>>>>>>>>>>>> user’s
>> >>>>>>>>>>>>>>>>>> responsibility to ensure it, otherwise even a regular
>> >>>>>>>>>>> failover
>> >>>>>>>>>>>>> may
>> >>>>>>>>>>>>>>> lead
>> >>>>>>>>>>>>>>>>>> to inconsistent results.
>> >>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>> Yes, I agree that’s what perfect world/good deployment
>> >>>> should
>> >>>>>>>>>>> be.
>> >>>>>>>>>>>>> But
>> >>>>>>>>>>>>>>> its
>> >>>>>>>>>>>>>>>>> often isn’t and while I’m not trying to fix this (since
>> the
>> >>>>>>>>>>>> proper
>> >>>>>>>>>>>>>> fix
>> >>>>>>>>>>>>>>> is
>> >>>>>>>>>>>>>>>>> to support transactions), I’m just trying to minimise
>> >>>>> confusion
>> >>>>>>>>>>>> for
>> >>>>>>>>>>>>>> the
>> >>>>>>>>>>>>>>>>> users that are not fully aware what’s going on and
>> operate
>> >>>> in
>> >>>>>>>>>>>> less
>> >>>>>>>>>>>>>> then
>> >>>>>>>>>>>>>>>>> perfect setup. And if something bites them after adding
>> >>>>>>>>>>>> `b.cache()`
>> >>>>>>>>>>>>>>> call,
>> >>>>>>>>>>>>>>>>> to make sure that they at least know all of the places
>> that
>> >>>>>>>>>>>> adding
>> >>>>>>>>>>>>>> this
>> >>>>>>>>>>>>>>>>> line can affect.
>> >>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>> Thanks, Piotrek
>> >>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>> On 1 Dec 2018, at 15:39, Becket Qin <
>> becket.qin@gmail.com
>> >>>>>
>> >>>>>>>>>>>>> wrote:
>> >>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>> Hi Piotrek,
>> >>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>> Thanks again for the clarification. Some more replies
>> are
>> >>>>>>>>>>>>>> following.
>> >>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>> But keep in mind that `.cache()` will/might not only be
>> >>>> used
>> >>>>>>>>>>> in
>> >>>>>>>>>>>>>>>>> interactive
>> >>>>>>>>>>>>>>>>>>> programming and not only in batching.
>> >>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>> It is true. Actually in stream processing, cache() has
>> the
>> >>>>>>>>>>> same
>> >>>>>>>>>>>>>>>> semantic
>> >>>>>>>>>>>>>>>>> as
>> >>>>>>>>>>>>>>>>>> batch processing. The semantic is following:
>> >>>>>>>>>>>>>>>>>> For a table created via a series of computation, save
>> that
>> >>>>>>>>>>>> table
>> >>>>>>>>>>>>>> for
>> >>>>>>>>>>>>>>>>> later
>> >>>>>>>>>>>>>>>>>> reference to avoid running the computation logic to
>> >>>>>>>>>>> regenerate
>> >>>>>>>>>>>>> the
>> >>>>>>>>>>>>>>>> table.
>> >>>>>>>>>>>>>>>>>> Once the application exits, drop all the cache.
>> >>>>>>>>>>>>>>>>>> This semantic is same for both batch and stream
>> >>>> processing.
>> >>>>>>>>>>> The
>> >>>>>>>>>>>>>>>>> difference
>> >>>>>>>>>>>>>>>>>> is that stream applications will only run once as they
>> are
>> >>>>>>>>>>> long
>> >>>>>>>>>>>>>>>> running.
>> >>>>>>>>>>>>>>>>>> And the batch applications may be run multiple times,
>> >>>> hence
>> >>>>>>>>>>> the
>> >>>>>>>>>>>>>> cache
>> >>>>>>>>>>>>>>>> may
>> >>>>>>>>>>>>>>>>>> be created and dropped each time the application runs.
>> >>>>>>>>>>>>>>>>>> Admittedly, there will probably be some resource
>> >>>> management
>> >>>>>>>>>>>>>>>> requirements
>> >>>>>>>>>>>>>>>>>> for the streaming cached table, such as time based /
>> size
>> >>>>>>>>>>> based
>> >>>>>>>>>>>>>>>>> retention,
>> >>>>>>>>>>>>>>>>>> to address the infinite data issue. But such
>> requirement
>> >>>>> does
>> >>>>>>>>>>>> not
>> >>>>>>>>>>>>>>>> change
>> >>>>>>>>>>>>>>>>>> the semantic.
>> >>>>>>>>>>>>>>>>>> You are right that interactive programming is just one
>> use
>> >>>>>>>>>>> case
>> >>>>>>>>>>>>> of
>> >>>>>>>>>>>>>>>>> cache().
>> >>>>>>>>>>>>>>>>>> It is not the only use case.
>> >>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>> For me the more important issue is of not having the
>> `void
>> >>>>>>>>>>>>> cache()`
>> >>>>>>>>>>>>>>>> with
>> >>>>>>>>>>>>>>>>>>> side effects.
>> >>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>> This is indeed the key point. The argument around
>> whether
>> >>>>>>>>>>>> cache()
>> >>>>>>>>>>>>>>>> should
>> >>>>>>>>>>>>>>>>>> return something already indicates that cache() and
>> >>>>>>>>>>>> materialize()
>> >>>>>>>>>>>>>>>> address
>> >>>>>>>>>>>>>>>>>> different issues.
>> >>>>>>>>>>>>>>>>>> Can you explain a bit more on what the side effects
>> >>>>>>>>>>>>>>>>>> are? So far my understanding is that such side effects
>> >>>>>>>>>>>>>>>>>> only exist if a table is mutable. Is that the case?
>> >>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>> I don’t know, probably initially we should make
>> >>>> CachedTable
>> >>>>>>>>>>>>>>> read-only.
>> >>>>>>>>>>>>>>>> I
>> >>>>>>>>>>>>>>>>>>> don’t find it more confusing than the fact that user
>> can
>> >>>>> not
>> >>>>>>>>>>>>> write
>> >>>>>>>>>>>>>>> to
>> >>>>>>>>>>>>>>>>> views
>> >>>>>>>>>>>>>>>>>>> or materialised views in SQL or that user currently
>> can
>> >>>> not
>> >>>>>>>>>>>>> write
>> >>>>>>>>>>>>>>> to a
>> >>>>>>>>>>>>>>>>>>> Table.
>> >>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>> I don't think anyone should insert something to a
>> cache.
>> >>>> By
>> >>>>>>>>>>>>>>> definition
>> >>>>>>>>>>>>>>>>> the
>> >>>>>>>>>>>>>>>>>> cache should only be updated when the corresponding
>> >>>> original
>> >>>>>>>>>>>>> table
>> >>>>>>>>>>>>>> is
>> >>>>>>>>>>>>>>>>>> updated. What I am wondering is that given the
>> following
>> >>>> two
>> >>>>>>>>>>>>> facts:
>> >>>>>>>>>>>>>>>>>> 1. If and only if a table is mutable (with something
>> like
>> >>>>>>>>>>>>>> insert()),
>> >>>>>>>>>>>>>>> a
>> >>>>>>>>>>>>>>>>>> CachedTable may have implicit behavior.
>> >>>>>>>>>>>>>>>>>> 2. A CachedTable extends a Table.
>> >>>>>>>>>>>>>>>>>> We can come to the conclusion that a CachedTable is
>> >>>> mutable
>> >>>>>>>>>>> and
>> >>>>>>>>>>>>>> users
>> >>>>>>>>>>>>>>>> can
>> >>>>>>>>>>>>>>>>>> insert into the CachedTable directly. This is where I
>> >>>>> thought
>> >>>>>>>>>>>>>>>> confusing.
>> >>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>> Thanks,
>> >>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>> Jiangjie (Becket) Qin
>> >>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>> On Sat, Dec 1, 2018 at 2:45 AM Piotr Nowojski <
>> >>>>>>>>>>>>>>> piotr@data-artisans.com
>> >>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>> wrote:
>> >>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>> Hi all,
>> >>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>> Regarding naming `cache()` vs `materialize()`. One
>> more
>> >>>>>>>>>>>>>> explanation
>> >>>>>>>>>>>>>>>> why
>> >>>>>>>>>>>>>>>>> I
>> >>>>>>>>>>>>>>>>>>> think `materialize()` is more natural to me is that I
>> >>>> think
>> >>>>>>>>>>> of
>> >>>>>>>>>>>>> all
>> >>>>>>>>>>>>>>>>> “Table”s
>> >>>>>>>>>>>>>>>>>>> in Table-API as views. They behave the same way as SQL
>> >>>>>>>>>>> views,
>> >>>>>>>>>>>>> the
>> >>>>>>>>>>>>>>> only
>> >>>>>>>>>>>>>>>>>>> difference for me is that their live scope is short -
>> >>>>>>>>>>> current
>> >>>>>>>>>>>>>>> session
>> >>>>>>>>>>>>>>>>> which
>> >>>>>>>>>>>>>>>>>>> is limited by different execution model. That’s why
>> >>>>>>>>>>> “caching”
>> >>>>>>>>>>>> a
>> >>>>>>>>>>>>>> view
>> >>>>>>>>>>>>>>>>> for me
>> >>>>>>>>>>>>>>>>>>> is just materialising it.
>> >>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>> However I see and I understand your point of view.
>> Coming
>> >>>>>>>>>>> from
>> >>>>>>>>>>>>>>>>>>> DataSet/DataStream and generally speaking non-SQL
>> world,
>> >>>>>>>>>>>>> `cache()`
>> >>>>>>>>>>>>>>> is
>> >>>>>>>>>>>>>>>>> more
>> >>>>>>>>>>>>>>>>>>> natural. But keep in mind that `.cache()` will/might
>> not
>> >>>>>>>>>>> only
>> >>>>>>>>>>>> be
>> >>>>>>>>>>>>>>> used
>> >>>>>>>>>>>>>>>> in
>> >>>>>>>>>>>>>>>>>>> interactive programming and not only in batching. But
>> >>>>> naming
>> >>>>>>>>>>>> is
>> >>>>>>>>>>>>>> one
>> >>>>>>>>>>>>>>>>> issue,
>> >>>>>>>>>>>>>>>>>>> and not that critical to me. Especially that once we
>> >>>>>>>>>>> implement
>> >>>>>>>>>>>>>>> proper
>> >>>>>>>>>>>>>>>>>>> materialised views, we can always deprecate/rename
>> >>>>> `cache()`
>> >>>>>>>>>>>> if
>> >>>>>>>>>>>>> we
>> >>>>>>>>>>>>>>>> deem
>> >>>>>>>>>>>>>>>>> so.
>> >>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>> For me the more important issue is of not having the
>> >>>> `void
>> >>>>>>>>>>>>>> cache()`
>> >>>>>>>>>>>>>>>> with
>> >>>>>>>>>>>>>>>>>>> side effects. Exactly for the reasons that you have
>> >>>>>>>>>>> mentioned.
>> >>>>>>>>>>>>>> True:
>> >>>>>>>>>>>>>>>>>>> results might be non deterministic if underlying
>> source
>> >>>>>>>>>>> table
>> >>>>>>>>>>>>> are
>> >>>>>>>>>>>>>>>>> changing.
>> >>>>>>>>>>>>>>>>>>> Problem is that `void cache()` implicitly changes the
>> >>>>>>>>>>> semantic
>> >>>>>>>>>>>>> of
>> >>>>>>>>>>>>>>>>>>> subsequent uses of the cached/materialized Table. It
>> can
>> >>>>>>>>>>> cause
>> >>>>>>>>>>>>>> “wtf”
>> >>>>>>>>>>>>>>>>> moment
>> >>>>>>>>>>>>>>>>>>> for a user if he inserts “b.cache()” call in some
>> place
>> >>>> in
>> >>>>>>>>>>> his
>> >>>>>>>>>>>>>> code
>> >>>>>>>>>>>>>>>> and
>> >>>>>>>>>>>>>>>>>>> suddenly some other random places are behaving
>> >>>> differently.
>> >>>>>>>>>>> If
>> >>>>>>>>>>>>>>>>>>> `materialize()` or `cache()` returns a Table handle,
>> we
>> >>>>>>>>>>> force
>> >>>>>>>>>>>>> user
>> >>>>>>>>>>>>>>> to
>> >>>>>>>>>>>>>>>>>>> explicitly use the cache which removes the “random”
>> part
>> >>>>>>>>>>> from
>> >>>>>>>>>>>>> the
>> >>>>>>>>>>>>>>>>> "suddenly
>> >>>>>>>>>>>>>>>>>>> some other random places are behaving differently”.
>> >>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>> This argument and others that I’ve raised (greater
>> >>>>>>>>>>>>>>>> flexibility/allowing
>> >>>>>>>>>>>>>>>>>>> user to explicitly bypass the cache) are independent
>> of
>> >>>>>>>>>>>>> `cache()`
>> >>>>>>>>>>>>>> vs
>> >>>>>>>>>>>>>>>>>>> `materialize()` discussion.
>> >>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>> Does that mean one can also insert into the
>> CachedTable?
>> >>>>>>>>>>> This
>> >>>>>>>>>>>>>>> sounds
>> >>>>>>>>>>>>>>>>>>> pretty confusing.
>> >>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>> I don’t know, probably initially we should make
>> >>>> CachedTable
>> >>>>>>>>>>>>>>>> read-only. I
>> >>>>>>>>>>>>>>>>>>> don’t find it more confusing than the fact that user
>> can
>> >>>>> not
>> >>>>>>>>>>>>> write
>> >>>>>>>>>>>>>>> to
>> >>>>>>>>>>>>>>>>> views
>> >>>>>>>>>>>>>>>>>>> or materialised views in SQL or that user currently
>> can
>> >>>> not
>> >>>>>>>>>>>>> write
>> >>>>>>>>>>>>>>> to a
>> >>>>>>>>>>>>>>>>>>> Table.
>> >>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>> Piotrek
>> >>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>> On 30 Nov 2018, at 17:38, Xingcan Cui <
>> >>>> xingcanc@gmail.com
>> >>>>>>
>> >>>>>>>>>>>>>> wrote:
>> >>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>> Hi all,
>> >>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>> I agree with @Becket that `cache()` and
>> `materialize()`
>> >>>>>>>>>>>> should
>> >>>>>>>>>>>>> be
>> >>>>>>>>>>>>>>>>>>> considered as two different methods where the later
>> one
>> >>>> is
>> >>>>>>>>>>>> more
>> >>>>>>>>>>>>>>>>>>> sophisticated.
>> >>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>> According to my understanding, the initial idea is
>> just
>> >>>> to
>> >>>>>>>>>>>>>>> introduce
>> >>>>>>>>>>>>>>>> a
>> >>>>>>>>>>>>>>>>>>> simple cache or persist mechanism, but as the TableAPI
>> >>>> is a
>> >>>>>>>>>>>>>>> high-level
>> >>>>>>>>>>>>>>>>> API,
>> >>>>>>>>>>>>>>>>>>> it’s naturally for as to think in a SQL way.
>> >>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>> Maybe we can add the `cache()` method to the DataSet
>> API
>> >>>>>>>>>>> and
>> >>>>>>>>>>>>>> force
>> >>>>>>>>>>>>>>>>> users
>> >>>>>>>>>>>>>>>>>>> to translate a Table to a Dataset before caching it.
>> Then
>> >>>>>>>>>>> the
>> >>>>>>>>>>>>>> users
>> >>>>>>>>>>>>>>>>> should
>> >>>>>>>>>>>>>>>>>>> manually register the cached dataset to a table again
>> (we
>> >>>>>>>>>>> may
>> >>>>>>>>>>>>> need
>> >>>>>>>>>>>>>>>> some
>> >>>>>>>>>>>>>>>>>>> table replacement mechanisms for datasets with an
>> >>>> identical
>> >>>>>>>>>>>>> schema
>> >>>>>>>>>>>>>>> but
>> >>>>>>>>>>>>>>>>>>> different contents here). After all, it’s the dataset
>> >>>>> rather
>> >>>>>>>>>>>>> than
>> >>>>>>>>>>>>>>> the
>> >>>>>>>>>>>>>>>>>>> dynamic table that need to be cached, right?
>> >>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>> Best,
>> >>>>>>>>>>>>>>>>>>>> Xingcan
>> >>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>> On Nov 30, 2018, at 10:57 AM, Becket Qin <
>> >>>>>>>>>>>>> becket.qin@gmail.com>
>> >>>>>>>>>>>>>>>>> wrote:
>> >>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>> Hi Piotrek and Jark,
>> >>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>> Thanks for the feedback and explanation. Those are
>> good
>> >>>>>>>>>>>>>> arguments.
>> >>>>>>>>>>>>>>>>> But I
>> >>>>>>>>>>>>>>>>>>>>> think those arguments are mostly about materialized
>> >>>> view.
>> >>>>>>>>>>>> Let
>> >>>>>>>>>>>>> me
>> >>>>>>>>>>>>>>> try
>> >>>>>>>>>>>>>>>>> to
>> >>>>>>>>>>>>>>>>>>>>> explain the reason I believe cache() and
>> materialize()
>> >>>>> are
>> >>>>>>>>>>>>>>>> different.
>> >>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>> I think cache() and materialize() have quite
>> different
>> >>>>>>>>>>>>>>> implications.
>> >>>>>>>>>>>>>>>>> An
>> >>>>>>>>>>>>>>>>>>>>> analogy I can think of is save()/publish(). When
>> users
>> >>>>>>>>>>> call
>> >>>>>>>>>>>>>>> cache(),
>> >>>>>>>>>>>>>>>>> it
>> >>>>>>>>>>>>>>>>>>> is
>> >>>>>>>>>>>>>>>>>>>>> just like they are saving an intermediate result as
>> a
>> >>>>>>>>>>> draft
>> >>>>>>>>>>>> of
>> >>>>>>>>>>>>>>> their
>> >>>>>>>>>>>>>>>>>>> work,
>> >>>>>>>>>>>>>>>>>>>>> this intermediate result may not have any realistic
>> >>>>>>>>>>> meaning.
>> >>>>>>>>>>>>>>> Calling
>> >>>>>>>>>>>>>>>>>>>>> cache() does not mean users want to publish the
>> cached
>> >>>>>>>>>>> table
>> >>>>>>>>>>>>> in
>> >>>>>>>>>>>>>>> any
>> >>>>>>>>>>>>>>>>>>> manner.
>> >>>>>>>>>>>>>>>>>>>>> But when users call materialize(), that means "I
>> have
>> >>>>>>>>>>>>> something
>> >>>>>>>>>>>>>>>>>>> meaningful
>> >>>>>>>>>>>>>>>>>>>>> to be reused by others", now users need to think
>> about
>> >>>>> the
>> >>>>>>>>>>>>>>>> validation,
>> >>>>>>>>>>>>>>>>>>>>> update & versioning, lifecycle of the result, etc.
>> >>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>> Piotrek's suggestions on variations of the
>> >>>> materialize()
>> >>>>>>>>>>>>> methods
>> >>>>>>>>>>>>>>> are
>> >>>>>>>>>>>>>>>>>>> very
>> >>>>>>>>>>>>>>>>>>>>> useful. It would be great if Flink have them. The
>> >>>> concept
>> >>>>>>>>>>> of
>> >>>>>>>>>>>>>>>>>>> materialized
>> >>>>>>>>>>>>>>>>>>>>> view is actually a pretty big feature, not to say
>> the
>> >>>>>>>>>>>> related
>> >>>>>>>>>>>>>>> stuff
>> >>>>>>>>>>>>>>>>> like
>> >>>>>>>>>>>>>>>>>>>>> triggers/hooks you mentioned earlier. I think the
>> >>>>>>>>>>>> materialized
>> >>>>>>>>>>>>>>> view
>> >>>>>>>>>>>>>>>>>>> itself
>> >>>>>>>>>>>>>>>>>>>>> should be discussed in a more thorough and
>> systematic
>> >>>>>>>>>>>> manner.
>> >>>>>>>>>>>>>> And
>> >>>>>>>>>>>>>>> I
>> >>>>>>>>>>>>>>>>>>> found
>> >>>>>>>>>>>>>>>>>>>>> that discussion is kind of orthogonal and way beyond
>> >>>>>>>>>>>>> interactive
>> >>>>>>>>>>>>>>>>>>>>> programming experience.
>> >>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>> The example you gave was interesting. I still have
>> some
>> >>>>>>>>>>>>>> questions,
>> >>>>>>>>>>>>>>>>>>> though.
>> >>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>> Table source = … // some source that scans files
>> from a
>> >>>>>>>>>>>>>> directory
>> >>>>>>>>>>>>>>>>>>>>>> “/foo/bar/“
>> >>>>>>>>>>>>>>>>>>>>>> Table t1 = source.groupBy(…).select(…).where(…) ….;
>> >>>>>>>>>>>>>>>>>>>>>> Table t2 = t1.materialize() // (or `cache()`)
>> >>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>> t2.count() // initialise cache (if it’s lazily
>> >>>>>>>>>>> initialised)
>> >>>>>>>>>>>>>>>>>>>>>> int a1 = t1.count()
>> >>>>>>>>>>>>>>>>>>>>>> int b1 = t2.count()
>> >>>>>>>>>>>>>>>>>>>>>> // something in the background (or we trigger it)
>> >>>> writes
>> >>>>>>>>>>>> new
>> >>>>>>>>>>>>>>> files
>> >>>>>>>>>>>>>>>> to
>> >>>>>>>>>>>>>>>>>>>>>> /foo/bar
>> >>>>>>>>>>>>>>>>>>>>>> int a2 = t1.count()
>> >>>>>>>>>>>>>>>>>>>>>> int b2 = t2.count()
>> >>>>>>>>>>>>>>>>>>>>>> t2.refresh() // possible future extension, not to
>> be
>> >>>>>>>>>>>>>> implemented
>> >>>>>>>>>>>>>>> in
>> >>>>>>>>>>>>>>>>> the
>> >>>>>>>>>>>>>>>>>>>>>> initial version
>> >>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>> what if someone else added some more files to
>> /foo/bar
>> >>>> at
>> >>>>>>>>>>>> this
>> >>>>>>>>>>>>>>>> point?
>> >>>>>>>>>>>>>>>>> In
>> >>>>>>>>>>>>>>>>>>>>> that case, a3 won't equals to b3, and the result
>> become
>> >>>>>>>>>>>>>>>>>>> non-deterministic,
>> >>>>>>>>>>>>>>>>>>>>> right?
>> >>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>> int a3 = t1.count()
>> >>>>>>>>>>>>>>>>>>>>>> int b3 = t2.count()
>> >>>>>>>>>>>>>>>>>>>>>> t2.drop() // another possible future extension,
>> manual
>> >>>>>>>>>>>>> “cache”
>> >>>>>>>>>>>>>>>>> dropping
>> >>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>> When we talk about interactive programming, in most
>> >>>>> cases,
>> >>>>>>>>>>>> we
>> >>>>>>>>>>>>>> are
>> >>>>>>>>>>>>>>>>>>> talking
>> >>>>>>>>>>>>>>>>>>>>> about batch applications. A fundamental assumption
>> of
>> >>>>> such
>> >>>>>>>>>>>>> case
>> >>>>>>>>>>>>>> is
>> >>>>>>>>>>>>>>>>> that
>> >>>>>>>>>>>>>>>>>>> the
>> >>>>>>>>>>>>>>>>>>>>> source data is complete before the data processing
>> >>>>> begins,
>> >>>>>>>>>>>> and
>> >>>>>>>>>>>>>> the
>> >>>>>>>>>>>>>>>>> data
>> >>>>>>>>>>>>>>>>>>>>> will not change during the data processing. IMO, if
>> >>>>>>>>>>>> additional
>> >>>>>>>>>>>>>>> rows
>> >>>>>>>>>>>>>>>>>>> needs
>> >>>>>>>>>>>>>>>>>>>>> to be added to some source during the processing, it
>> >>>>>>>>>>> should
>> >>>>>>>>>>>> be
>> >>>>>>>>>>>>>>> done
>> >>>>>>>>>>>>>>>> in
>> >>>>>>>>>>>>>>>>>>> ways
>> >>>>>>>>>>>>>>>>>>>>> like union the source with another table containing
>> the
>> >>>>>>>>>>> rows
>> >>>>>>>>>>>>> to
>> >>>>>>>>>>>>>> be
>> >>>>>>>>>>>>>>>>>>> added.
>> >>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>> There are a few cases that computations are executed
>> >>>>>>>>>>>>> repeatedly
>> >>>>>>>>>>>>>> on
>> >>>>>>>>>>>>>>>> the
>> >>>>>>>>>>>>>>>>>>>>> changing data source.
>> >>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>> For example, people may run a ML training job every
>> >>>> hour
>> >>>>>>>>>>>> with
>> >>>>>>>>>>>>>> the
>> >>>>>>>>>>>>>>>>>>> samples
>> >>>>>>>>>>>>>>>>>>>>> newly added in the past hour. In that case, the
>> source
>> >>>>>>>>>>> data
>> >>>>>>>>>>>>>>> between
>> >>>>>>>>>>>>>>>>> will
>> >>>>>>>>>>>>>>>>>>>>> indeed change. But still, the data remain unchanged
>> >>>>> within
>> >>>>>>>>>>>> one
>> >>>>>>>>>>>>>>> run.
>> >>>>>>>>>>>>>>>>> And
>> >>>>>>>>>>>>>>>>>>>>> usually in that case, the result will need
>> versioning,
>> >>>>>>>>>>> i.e.
>> >>>>>>>>>>>>> for
>> >>>>>>>>>>>>>> a
>> >>>>>>>>>>>>>>>>> given
>> >>>>>>>>>>>>>>>>>>>>> result, it tells that the result is a result from
>> the
>> >>>>>>>>>>> source
>> >>>>>>>>>>>>>> data
>> >>>>>>>>>>>>>>>> by a
>> >>>>>>>>>>>>>>>>>>>>> certain timestamp.
>> >>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>> Another example is something like data warehouse. In
>> >>>> this
>> >>>>>>>>>>>>> case,
>> >>>>>>>>>>>>>>>> there
>> >>>>>>>>>>>>>>>>>>> are a
>> >>>>>>>>>>>>>>>>>>>>> few source of original/raw data. On top of those
>> >>>> sources,
>> >>>>>>>>>>>> many
>> >>>>>>>>>>>>>>>>>>> materialized
>> >>>>>>>>>>>>>>>>>>>>> view / queries / reports / dashboards can be
>> created to
>> >>>>>>>>>>>>> generate
>> >>>>>>>>>>>>>>>>> derived
>> >>>>>>>>>>>>>>>>>>>>> data. Those derived data needs to be updated when
>> the
>> >>>>>>>>>>>>> underlying
>> >>>>>>>>>>>>>>>>>>> original
>> >>>>>>>>>>>>>>>>>>>>> data changes. In that case, the processing logic
>> that
>> >>>>>>>>>>>> derives
>> >>>>>>>>>>>>>> the
>> >>>>>>>>>>>>>>>>>>> original
>> >>>>>>>>>>>>>>>>>>>>> data needs to be executed repeatedly to update those
>> >>>>>>>>>>>>>>> reports/views.
>> >>>>>>>>>>>>>>>>>>> Again,
>> >>>>>>>>>>>>>>>>>>>>> all those derived data also need to ha
>>
>>

Re: [DISCUSS] Support Interactive Programming in Flink Table API

Posted by Becket Qin <be...@gmail.com>.
Hi Piotrek,

1. Regarding optimization.
Sure, there are many cases where the decision is hard to make. But that does
not make it any easier for the users to make those decisions. I imagine 99%
of the users would just naively use cache. I am not saying we can optimize
in all the cases. But as long as we agree that at least in certain cases (I
would argue most cases) the optimizer can do a little better than an average
user who likely knows little about Flink internals, we should not push the
burden of optimization to users.

BTW, it seems some of your concerns are related to the implementation. I
did not mention the implementation of the caching service because that
should not affect the API semantic. Not sure if this helps, but imagine the
default implementation has one StorageNode service colocated with each TM.
It could be running within the TM process or in a standalone process,
depending on configuration.

The StorageNode uses a memory + spill-to-disk mechanism. The cached data will
just be written to the local StorageNode service. If the StorageNode is
running within the TM process, the in-memory cache could just be objects, so
we save some serde cost. A later job referring to the cached Table will be
scheduled in a locality-aware manner, i.e. run in the TM whose peer
StorageNode hosts the data.
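
To illustrate the locality-aware scheduling part, here is a minimal sketch.
All names (StorageNodeRegistry, preferredTm) are hypothetical and only meant
to show the idea, not an actual Flink interface:

// Maps the UUID of a cached table to the TM whose peer StorageNode holds
// the data, so the scheduler can prefer that TM for the consuming tasks.
class StorageNodeRegistry {
  private val locations = scala.collection.mutable.Map[String, String]()

  def register(tableUuid: String, tmId: String): Unit =
    locations.update(tableUuid, tmId)

  // The preferred TM for a job reading this cached table, or None if the
  // cache was lost and the lineage has to be re-run.
  def preferredTm(tableUuid: String): Option[String] =
    locations.get(tableUuid)
}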


2. Semantic
I am not sure why introducing a new hintCache() or
env.enableAutomaticCaching() method would avoid the consequences of a
semantic change.

If the auto optimization is not enabled by default, users still need to
make code changes to all existing programs in order to get the benefit.
If the auto optimization is enabled by default, advanced users who know
that they really want to use cache will suddenly lose the opportunity to do
so, unless they change the code to disable auto optimization.


3. side effect
The CacheHandle is not only about where to put uncache(). It also solves the
implicit performance impact, by moving uncache() to the CacheHandle (see the
sketch after the list below).

   - If users want to leverage the cache, they can call a.cache(). After
   that, unless the user explicitly releases that CacheHandle, a.foo() will
   always leverage the cache if needed (the optimizer may choose to ignore
   the cache if that helps accelerate the process). No other function call
   will be able to release the cache, because it does not hold that
   CacheHandle.
   - If some advanced users do not want to use the cache at all, they will
   call a.hint(ignoreCache).foo(). This will for sure ignore the cache and
   use the original DAG to process.
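
A minimal sketch of how the reference counting behind the CacheHandle could
look, based on the interface proposed earlier in this thread (cache() hands
out a handle, release() decrements a counter). CacheRegistry is a
hypothetical helper, not a proposed public API:

class CacheRegistry {
  private val refCounts = scala.collection.mutable.Map[String, Int]()

  // Called by Table.cache(); every call hands out a new handle.
  def acquire(tableUuid: String): Unit =
    refCounts.update(tableUuid, refCounts.getOrElse(tableUuid, 0) + 1)

  // Called by CacheHandle.release(); returns the number of handles still
  // open, deleting the physical cache when the count reaches zero.
  def release(tableUuid: String): Int = {
    val remaining = refCounts.getOrElse(tableUuid, 0) - 1
    if (remaining <= 0) {
      refCounts.remove(tableUuid)
      // the physical cache would be dropped here
      0
    } else {
      refCounts.update(tableUuid, remaining)
      remaining
    }
  }
}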


> In the vast majority of cases, users wouldn't really care whether the
> cache is used or not.
> I wouldn’t agree with that, because “caching” (if not purely in memory
> caching) would add additional IO costs. It’s similar to saying that users
> would not see a difference between Spark/Flink and MapReduce (MapReduce
> writes data to disks after every map/reduce stage).

What I wanted to say is that in most cases, after users call cache(), they
don't really care whether the auto optimization has decided to ignore the
cache or not, as long as the program runs faster.

Thanks,

Jiangjie (Becket) Qin








On Wed, Dec 12, 2018 at 10:50 PM Piotr Nowojski <pi...@data-artisans.com>
wrote:

> Hi,
>
> Thanks for the quick answer :)
>
> Re 1.
>
> I generally agree with you, however couple of points:
>
> a) the problem with automatic caching is bigger, because you will have to
> decide how to compare IO vs CPU costs, and if you pick wrong, the
> additional IO costs might be enormous or can even crash your system. This
> is a more difficult problem compared to, let's say, join reordering, where
> the only issue is to have good statistics that can capture correlations
> between columns (when you reorder joins, the number of IO operations does
> not change)
> b) your example is completely independent of caching.
>
> Query like this:
>
> src1.filter('f1 > 10).join(src2.filter('f2 < 30), 'f1 === 'f2).as('f3,
> …).filter('f3 > 30)
>
> Should/could be optimised to empty result immediately, without the need
> for any cache/materialisation and that should work even without any
> statistics provided by the connector.
>
> For me, a prerequisite to any serious cost-based optimisations would be
> some reasonable benchmark coverage of the code (TPC-H?). Otherwise that
> would be the equivalent of adding untested code, since we wouldn’t be able
> to verify our assumptions, like how the writing of 10 000 records to a
> cache/RocksDB/Kafka/CSV file compares to the joining/filtering/processing
> of, let's say, 1 000 000 rows.
>
> Re 2.
>
> I wasn’t proposing to change the semantics later. I was proposing that we
> start now with:
>
> CachedTable cachedA = a.cache()
> cachedA.foo() // Cache is used
> a.bar() // Original DAG is used
>
> And then later we can think about adding for example
>
> CachedTable cachedA = a.hintCache()
> cachedA.foo() // Cache might be used
> a.bar() // Original DAG is used
>
> Or
>
> env.enableAutomaticCaching()
> a.foo() // Cache might be used
> a.bar() // Cache might be used
>
> Or (I would still not like this option):
>
> a.hintCache()
> a.foo() // Cache might be used
> a.bar() // Cache might be used
>
> Or whatever else comes to our mind. Even if we add some automatic caching
> in the future, keeping explicit (`CachedTable cache()`) caching will still
> be useful, at least in some cases.
>
> Re 3.
>
> > 2. The source tables are immutable during one run of batch processing
> logic.
> > 3. The cache is immutable during one run of batch processing logic.
>
> > I think assumption 2 and 3 are by definition what batch processing means,
> > i.e the data must be complete before it is processed and should not
> change
> > when the processing is running.
>
> I agree that this is how batch systems SHOULD be working. However, I know
> from my previous experience that it’s not always the case. Sometimes users
> are just working on some non-transactional storage, which can be (either
> constantly or occasionally) modified by some other processes, for whatever
> reason (fixing the data, updating, adding new data etc).
>
> But even if we ignore this point (data immutability), the performance side
> effect issue of your proposal remains. If a user calls `void a.cache()`
> deep inside some private method, it will have implicit side effects on
> other parts of his program that might not be obvious.
>
> Re `CacheHandle`.
>
> If I understand it correctly, it only addresses the issue of where to
> place the `uncache`/`dropCache` method.
>
> Btw,
>
> > In vast majority of the cases, users wouldn't really care whether the
> cache is used or not.
>
> I wouldn’t agree with that, because “caching” (if not purely in memory
> caching) would add additional IO costs. It’s similar as saying that users
> would not see a difference between Spark/Flink and MapReduce (MapReduce
> writes data to disks after every map/reduce stage).
>
> Piotrek
>
> > On 12 Dec 2018, at 14:28, Becket Qin <be...@gmail.com> wrote:
> >
> > Hi Piotrek,
> >
> > Not sure if you noticed, in my last email, I was proposing `CacheHandle
> > cache()` to avoid the potential side effect due to function calls.
> >
> > Let's look at the disagreement in your reply one by one.
> >
> >
> > 1. Optimization chances
> >
> > Optimization is never trivial work. This is exactly why we should not
> > let users manually do it. Databases have done a huge amount of work in
> > this area. At Alibaba, we rely heavily on many optimization rules to
> > boost SQL query performance.
> >
> > In your example, if I fill in the filter conditions in a certain way,
> > the optimization becomes obvious.
> >
> > Table src1 = … // read from connector 1
> > Table src2 = … // read from connector 2
> >
> > Table a = src1.filter('f1 > 10).join(src2.filter('f2 < 30), 'f1 ===
> > 'f2).as('f3, ...)
> > a.cache() // write cache to connector 3; when writing the records,
> > remember min and max of 'f1
> >
> > a.filter('f3 > 30) // There is no need to read from any connector because
> > `a` does not contain any record whose 'f3 is greater than 30.
> > env.execute()
> > a.select(…)
> >
> > BTW, it seems to me that adding some basic statistics is fairly
> > straightforward, and the cost is pretty marginal if not negligible. In
> > fact it is needed not only for optimization, but also for cases such as
> > ML, where some algorithms may need to decide their parameters based on
> > the statistics of the data.
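> >
> > As an illustrative sketch (nothing below exists in Flink today; the
> > names are assumptions), the pruning decision for the example above boils
> > down to a min/max comparison:
> >
> > // Statistics recorded while writing the cache of `a`. Since 'f3 comes
> > // from a join against src2.filter('f2 < 30), we know max('f3) < 30.
> > boolean canSkipScan(long maxF3, long lowerBound) {
> >   // If even the largest 'f3 fails `'f3 > lowerBound`, the scan is empty.
> >   return maxF3 <= lowerBound;
> > }
> >
> > // canSkipScan(29, 30) == true, so a.filter('f3 > 30) can be answered as
> > // an empty table without reading the cache at all.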
> >
> >
> > 2. Same API, one semantic now, another semantic later.
> >
> > I am trying to understand the semantics of the `CachedTable cache()` you
> > are proposing. IMO, we should avoid designing an API whose semantics
> > will be changed later. If we have a "CachedTable cache()" method, then
> > the semantics should be very clearly defined upfront and not change
> > later. It should never be "right now let's go with semantic 1, later we
> > can silently change it to semantic 2 or 3". Such a change could result
> > in bad consequences. For example, let's say we decide to go with
> > semantic 1:
> >
> > CachedTable cachedA = a.cache()
> > cachedA.foo() // Cache is used
> > a.bar() // Original DAG is used.
> >
> > Now the majority of users would be using cachedA.foo() in their code.
> > And some advanced users will use a.bar() to explicitly skip the cache.
> > Later on, we add a smart optimization and change the semantics to
> > semantic 2:
> >
> > CachedTable cachedA = a.cache()
> > cachedA.foo() // Cache is used
> > a.bar() // Cache MIGHT be used, and Flink may decide to skip cache if it
> is
> > faster.
> >
> > Now most of the users who were writing cachedA.foo() will not benefit
> > from this optimization at all, unless they change their code to use
> > a.foo() instead. And those advanced users suddenly lose the option to
> > explicitly ignore the cache unless they change their code (assuming we
> > care enough to provide something like hint(useCache)). If we don't
> > define the semantics carefully, our users will have to change their code
> > again and again when they shouldn't have to.
> >
> >
> > 3. side effect.
> >
> > Before we talk about side effect, we have to agree on the assumptions.
> The
> > assumptions I have are following:
> > 1. We are talking about batch processing.
> > 2. The source tables are immutable during one run of batch processing
> logic.
> > 3. The cache is immutable during one run of batch processing logic.
> >
> > I think assumption 2 and 3 are by definition what batch processing means,
> > i.e the data must be complete before it is processed and should not
> change
> > when the processing is running.
> >
> > As far as I am aware, I don't know of any batch processing system that
> > breaks those assumptions. Even for relational database tables, where
> > queries can run with concurrent modifications, the necessary locking is
> > still required to ensure the integrity of the query results.
> >
> > Please let me know if you disagree with the above assumptions. If you
> > agree with these assumptions, do you still see side effects with the
> > `CacheHandle cache()` API in my last email?
> >
> > Thanks,
> >
> > Jiangjie (Becket) Qin
> >
> >
> > On Wed, Dec 12, 2018 at 7:11 PM Piotr Nowojski <pi...@data-artisans.com>
> > wrote:
> >
> >> Hi Becket,
> >>
> >>> Regarding the chance of optimization, it might not be that rare. Some
> >> very
> >>> simple statistics could already help in many cases. For example, simply
> >>> maintaining max and min of each fields can already eliminate some
> >>> unnecessary table scan (potentially scanning the cached table) if the
> >>> result is doomed to be empty. A histogram would give even further
> >>> information. The optimizer could be very careful and only ignores cache
> >>> when it is 100% sure doing that is cheaper. e.g. only when a filter on
> >> the
> >>> cache will absolutely return nothing.
> >>
> >> I do not see how this would be easy to achieve. It would require tons
> >> of effort to make it work, and in the end you would still have the
> >> problem of comparing/trading CPU cycles vs IO. For example:
> >>
> >> Table src1 = … // read from connector 1
> >> Table src2 = … // read from connector 2
> >>
> >> Table a = src1.filter(…).join(src2.filter(…), …)
> >> a.cache() // write cache to connector 3
> >>
> >> a.filter(…)
> >> env.execute()
> >> a.select(…)
> >>
> >> Decision whether it’s better to:
> >> A) read from connector1/connector2, filter/map and join them twice
> >> B) read from connector1/connector2, filter/map and join them once, pay
> the
> >> price of writing to connector 3 and then reading from it
> >>
> >> Is very far from trivial. `a` can end up much larger than `src1` and
> >> `src2`, writes to connector 3 might be extremely slow, reads from
> >> connector 3 can be slower compared to reads from connectors 1 & 2, … .
> >> You really need extremely good statistics to correctly assess the size
> >> of the output, and it would still fail many times (correlations etc).
> >> And keep in mind that at the moment we do not have ANY statistics at
> >> all. More than that, it would require significantly more testing and
> >> setting up some benchmarks to make sure that we do not break it with
> >> regressions.
> >>
> >> That’s why I’m strongly opposing this idea - at least let’s not start
> >> with this. If we first start with completely manual/explicit caching,
> >> without any magic, it would be a significant improvement for the users
> >> at a fraction of the development cost. After implementing that, when we
> >> already have all of the working pieces, we can start working on some
> >> optimisation rules. As I wrote before, if we start with
> >>
> >> `CachedTable cache()`
> >>
> >> we can later work on follow-up stories to make it automatic. Despite
> >> the fact that I don’t like this implicit/side-effect approach with a
> >> `void` method, having an explicit `CachedTable cache()` wouldn’t even
> >> prevent us from later adding a `void hintCache()` method, with the
> >> exact semantics that you want.
> >>
> >> On top of that, I re-raise the point that an implicit `void
> >> cache()/hintCache()` has other side effects and problems with
> >> non-immutable data, and is annoying when used secretly inside methods.
> >>
> >> An explicit `CachedTable cache()` just looks like a much less
> >> controversial MVP, and if we decide to go further with this topic, it’s
> >> not wasted effort, but lies on a straight path to more
> >> advanced/complicated solutions in the future. Are there any drawbacks
> >> of starting with `CachedTable cache()` that I’m missing?
> >>
> >> Piotrek
> >>
> >>> On 12 Dec 2018, at 09:30, Jeff Zhang <zj...@gmail.com> wrote:
> >>>
> >>> Hi Becket,
> >>>
> >>> Introducing CacheHandle seems too complicated. It means users have to
> >>> maintain the handle properly.
> >>>
> >>> And since cache is just a hint for the optimizer, why not just return
> >>> the Table itself from the cache method? This hint info should be kept
> >>> in the Table, I believe.
> >>>
> >>> So how about adding cache and uncache methods to Table, both returning
> >>> a Table? Because what cache and uncache do is just add some hint info
> >>> into the Table.
> >>>
> >>>
> >>>
> >>>
> >>> Becket Qin <be...@gmail.com> 于2018年12月12日周三 上午11:25写道:
> >>>
> >>>> Hi Till and Piotrek,
> >>>>
> >>>> Thanks for the clarification. That resolves quite a bit of confusion.
> >>>> My understanding of how cache works is the same as what Till
> >>>> described, i.e. cache() is a hint to Flink, but it is not guaranteed
> >>>> that the cache always exists, and it might be recomputed from its
> >>>> lineage.
> >>>>
> >>>> Is this the core of our disagreement here? That you would like this
> >>>>> “cache()” to be mostly hint for the optimiser?
> >>>>
> >>>> Semantics-wise, yes. That's also why I think materialize() has a much
> >>>> larger scope than cache(), and thus should be a different method.
> >>>>
> >>>> Regarding the chance of optimization, it might not be that rare. Some
> >> very
> >>>> simple statistics could already help in many cases. For example,
> simply
> >>>> maintaining max and min of each fields can already eliminate some
> >>>> unnecessary table scan (potentially scanning the cached table) if the
> >>>> result is doomed to be empty. A histogram would give even further
> >>>> information. The optimizer could be very careful and only ignores
> cache
> >>>> when it is 100% sure doing that is cheaper. e.g. only when a filter on
> >> the
> >>>> cache will absolutely return nothing.
> >>>>
> >>>> Given the above clarification on cache, I would like to revisit the
> >>>> original "void cache()" proposal and see if we can improve on top of
> >> that.
> >>>>
> >>>> What do you think about the following modified interface?
> >>>>
> >>>> Table {
> >>>> /**
> >>>>  * This call hints Flink to maintain a cache of this table and
> >>>>  * leverage it for performance optimization if needed.
> >>>>  * Note that Flink may still decide not to use the cache if doing so
> >>>>  * is cheaper.
> >>>>  *
> >>>>  * A CacheHandle will be returned to allow the user to release the
> >>>>  * cache explicitly. The cache will be deleted if there are no
> >>>>  * unreleased cache handles to it. When the TableEnvironment is
> >>>>  * closed, the cache will also be deleted and all the cache handles
> >>>>  * will be released.
> >>>>  *
> >>>>  * @return a CacheHandle referring to the cache of this table.
> >>>>  */
> >>>> CacheHandle cache();
> >>>> }
> >>>>
> >>>> CacheHandle {
> >>>> /**
> >>>>  * Close the cache handle. This method does not necessarily delete
> >>>>  * the cache. Instead, it simply decrements the reference counter of
> >>>>  * the cache. When there is no handle referring to a cache, the cache
> >>>>  * will be deleted.
> >>>>  *
> >>>>  * @return the number of open handles to the cache after this handle
> >>>>  * has been released.
> >>>>  */
> >>>> int release()
> >>>> }
> >>>>
> >>>> The rationale behind this interface is the following:
> >>>> In the vast majority of cases, users wouldn't really care whether the
> >>>> cache is used or not. So I think the most intuitive way is letting
> >>>> cache() return nothing, so nobody needs to worry about the difference
> >>>> between operations on CachedTables and those on the "original" tables.
> >>>> This will make maybe 99.9% of the users happy. There were two concerns
> >>>> raised for this approach:
> >>>> 1. In some rare cases, users may want to ignore the cache.
> >>>> 2. A table might be cached/uncached in a third-party function while
> >>>> the caller does not know.
> >>>>
> >>>> For the first issue, users can use hint("ignoreCache") to explicitly
> >>>> ignore the cache.
> >>>> For the second issue, the above proposal lets cache() return a
> >>>> CacheHandle, whose only method is release(). Different CacheHandles
> >>>> will refer to the same cache; if a cache no longer has any cache
> >>>> handle, it will be deleted. This will address the following case:
> >>>> {
> >>>> val handle1 = a.cache()
> >>>> process(a)
> >>>> a.select(...) // cache is still available, handle1 has not been
> >> released.
> >>>> }
> >>>>
> >>>> void process(Table t) {
> >>>> val handle2 = t.cache() // new handle to cache
> >>>> t.select(...) // optimizer decides cache usage
> >>>> t.hint("ignoreCache").select(...) // cache is ignored
> >>>> handle2.release() // release the handle, but the cache may still be
> >>>> available if there are other handles
> >>>> ...
> >>>> }
> >>>>
> >>>> Does the above modified approach look reasonable to you?
> >>>>
> >>>> Cheers,
> >>>>
> >>>> Jiangjie (Becket) Qin
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>
> >>>> On Tue, Dec 11, 2018 at 6:44 PM Till Rohrmann <tr...@apache.org>
> >>>> wrote:
> >>>>
> >>>>> Hi Becket,
> >>>>>
> >>>>> I was aiming at semantics similar to 1. I actually thought that
> >> `cache()`
> >>>>> would tell the system to materialize the intermediate result so that
> >>>>> subsequent queries don't need to reprocess it. This means that the
> >> usage
> >>>> of
> >>>>> the cached table in this example
> >>>>>
> >>>>> {
> >>>>> val cachedTable = a.cache()
> >>>>> val b1 = cachedTable.select(…)
> >>>>> val b2 = cachedTable.foo().select(…)
> >>>>> val b3 = cachedTable.bar().select(...)
> >>>>> val c1 = a.select(…)
> >>>>> val c2 = a.foo().select(…)
> >>>>> val c3 = a.bar().select(...)
> >>>>> }
> >>>>>
> >>>>> strongly depends on interleaved calls which trigger the execution of
> >>>>> sub-queries. So for example, if there is only a single env.execute
> >>>>> call at the end of the block, then b1, b2, b3, c1, c2 and c3 would
> >>>>> all be computed by reading directly from the sources (given that
> >>>>> there is only a single JobGraph). It just happens that the result of
> >>>>> `a` will be cached, such that we skip the processing of `a` when
> >>>>> there are subsequent queries reading from `cachedTable`. If for some
> >>>>> reason the system cannot materialize the table (e.g. running out of
> >>>>> disk space, TTL expired), then it could also happen that we need to
> >>>>> reprocess `a`. In that sense `cachedTable` is simply an identifier
> >>>>> for the materialized result of `a`, together with the lineage
> >>>>> describing how to reprocess it.
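> >>>>>
> >>>>> To sketch that view in code (the class and field names here are
> >>>>> hypothetical, just to illustrate the idea):
> >>>>>
> >>>>> class CachedTable extends Table {
> >>>>>   String intermediateResultId; // where the materialized result lives
> >>>>>   Table lineage;               // the original plan of `a`, used to
> >>>>>                                // reprocess it if the cache is lost
> >>>>> }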
> >>>>>
> >>>>> Cheers,
> >>>>> Till
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>> On Tue, Dec 11, 2018 at 11:01 AM Piotr Nowojski <
> >> piotr@data-artisans.com
> >>>>>
> >>>>> wrote:
> >>>>>
> >>>>>> Hi Becket,
> >>>>>>
> >>>>>>> {
> >>>>>>> val cachedTable = a.cache()
> >>>>>>> val b = cachedTable.select(...)
> >>>>>>> val c = a.select(...)
> >>>>>>> }
> >>>>>>>
> >>>>>>> Semantic 1. b uses cachedTable as user demanded so. c uses original
> >>>> DAG
> >>>>>> as
> >>>>>>> user demanded so. In this case, the optimizer has no chance to
> >>>>> optimize.
> >>>>>>> Semantic 2. b uses cachedTable as user demanded so. c leaves the
> >>>>>> optimizer
> >>>>>>> to choose whether the cache or DAG should be used. In this case,
> user
> >>>>>> lose
> >>>>>>> the option to NOT use cache.
> >>>>>>>
> >>>>>>> As you can see, neither of the options seem perfect. However, I
> guess
> >>>>> you
> >>>>>>> and Till are proposing the third option:
> >>>>>>>
> >>>>>>> Semantic 3. b leaves the optimizer to choose whether cache or DAG
> >>>>> should
> >>>>>> be
> >>>>>>> used. c always use the DAG.
> >>>>>>
> >>>>>> I am pretty sure that me, Till, Fabian and others were all proposing
> >>>> and
> >>>>>> advocating in favour of semantic “1”. No cost based optimiser
> >> decisions
> >>>>> at
> >>>>>> all.
> >>>>>>
> >>>>>> {
> >>>>>> val cachedTable = a.cache()
> >>>>>> val b1 = cachedTable.select(…)
> >>>>>> val b2 = cachedTable.foo().select(…)
> >>>>>> val b3 = cachedTable.bar().select(...)
> >>>>>> val c1 = a.select(…)
> >>>>>> val c2 = a.foo().select(…)
> >>>>>> val c3 = a.bar().select(...)
> >>>>>> }
> >>>>>>
> >>>>>> All of b1, b2 and b3 read from the cache, while c1, c2 and c3
> >>>>>> re-execute the whole plan for “a”.
> >>>>>>
> >>>>>> In the future we could discuss going one step further, introducing
> >> some
> >>>>>> global optimisation (that can be manually enabled/disabled):
> >>>> deduplicate
> >>>>>> plan nodes/deduplicate sub queries/re-use sub queries results/or
> >>>> whatever
> >>>>>> we could call it. It could do two things:
> >>>>>>
> >>>>>> 1. Automatically try to deduplicate fragments of the plan and share
> >> the
> >>>>>> result using CachedTable - in other words automatically insert
> >>>>> `CachedTable
> >>>>>> cache()` calls.
> >>>>>> 2. Automatically make decision to bypass explicit `CachedTable`
> access
> >>>>>> (this would be the equivalent of what you described as “semantic
> 3”).
> >>>>>>
> >>>>>> However, as I wrote previously, I have big doubts whether such
> >>>>>> cost-based optimisation would work (this applies also to “Semantic
> >>>>>> 2”). I would expect it to do more harm than good in so many cases
> >>>>>> that it wouldn’t make sense. Even assuming that we calculate
> >>>>>> statistics perfectly (this ain’t gonna happen), it’s virtually
> >>>>>> impossible to correctly estimate the exchange rate of CPU cycles vs
> >>>>>> IO operations, as it changes so much from deployment to deployment.
> >>>>>>
> >>>>>> Is this the core of our disagreement here? That you would like this
> >>>>>> “cache()” to be mostly hint for the optimiser?
> >>>>>>
> >>>>>> Piotrek
> >>>>>>
> >>>>>>> On 11 Dec 2018, at 06:00, Becket Qin <be...@gmail.com> wrote:
> >>>>>>>
> >>>>>>> Another potential concern for semantic 3: in the future, we may
> >>>>>>> add automatic caching to Flink, e.g. caching the intermediate
> >>>>>>> results at the shuffle boundary. If our semantics are that a
> >>>>>>> reference to the original table means skipping the cache, those
> >>>>>>> users may not be able to benefit from the implicit cache.
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> On Tue, Dec 11, 2018 at 12:10 PM Becket Qin <be...@gmail.com>
> >>>>>> wrote:
> >>>>>>>
> >>>>>>>> Hi Piotrek,
> >>>>>>>>
> >>>>>>>> Thanks for the reply. Thought about it again, I might have
> >>>>> misunderstood
> >>>>>>>> your proposal in earlier emails. Returning a CachedTable might not
> >>>> be
> >>>>> a
> >>>>>> bad
> >>>>>>>> idea.
> >>>>>>>>
> >>>>>>>> I was more concerned about the semantics and their intuitiveness
> >>>>>>>> when a CachedTable is returned, i.e., if cache() returns a
> >>>>>>>> CachedTable, what are the semantics in the following code:
> >>>>>>>> {
> >>>>>>>> val cachedTable = a.cache()
> >>>>>>>> val b = cachedTable.select(...)
> >>>>>>>> val c = a.select(...)
> >>>>>>>> }
> >>>>>>>> What is the difference between b and c? At the first glance, I see
> >>>> two
> >>>>>>>> options:
> >>>>>>>>
> >>>>>>>> Semantic 1. b uses cachedTable as user demanded so. c uses
> original
> >>>>> DAG
> >>>>>> as
> >>>>>>>> user demanded so. In this case, the optimizer has no chance to
> >>>>> optimize.
> >>>>>>>> Semantic 2. b uses cachedTable as user demanded so. c leaves the
> >>>>>> optimizer
> >>>>>>>> to choose whether the cache or DAG should be used. In this case,
> >>>> user
> >>>>>> lose
> >>>>>>>> the option to NOT use cache.
> >>>>>>>>
> >>>>>>>> As you can see, neither of the options seem perfect. However, I
> >>>> guess
> >>>>>> you
> >>>>>>>> and Till are proposing the third option:
> >>>>>>>>
> >>>>>>>> Semantic 3. b leaves the optimizer to choose whether cache or DAG
> >>>>> should
> >>>>>>>> be used. c always use the DAG.
> >>>>>>>>
> >>>>>>>> This does address all the concerns. It is just that, from an
> >>>>>>>> intuitiveness perspective, I found it a little weird to ask users
> >>>>>>>> to explicitly use a CachedTable that the optimizer might choose to
> >>>>>>>> ignore. That was why I did not think about that semantic. But
> >>>>>>>> given there is material benefit, I think this semantic is
> >>>>>>>> acceptable.
> >>>>>>>>
> >>>>>>>> 1. If we want to let optimiser make decisions whether to use cache
> >>>> or
> >>>>>> not,
> >>>>>>>>> then why do we need “void cache()” method at all? Would It
> >>>>> “increase”
> >>>>>> the
> >>>>>>>>> chance of using the cache? That’s sounds strange. What would be
> the
> >>>>>>>>> mechanism of deciding whether to use the cache or not? If we want
> >>>> to
> >>>>>>>>> introduce such kind  automated optimisations of “plan nodes
> >>>>>> deduplication”
> >>>>>>>>> I would turn it on globally, not per table, and let the optimiser
> >>>> do
> >>>>>> all of
> >>>>>>>>> the work.
> >>>>>>>>> 2. We do not have statistics at the moment for any use/not use
> >>>> cache
> >>>>>>>>> decision.
> >>>>>>>>> 3. Even if we had, I would be veeerryy sceptical whether such
> cost
> >>>>>> based
> >>>>>>>>> optimisations would work properly and I would still insist first
> on
> >>>>>>>>> providing explicit caching mechanism (`CachedTable cache()`)
> >>>>>>>>>
> >>>>>>>> We are absolutely on the same page here. An explicit cache()
> >>>>>>>> method is necessary not only because the optimizer may not be able
> >>>>>>>> to make the right decision, but also because of the nature of
> >>>>>>>> interactive programming. For example, if users write the following
> >>>>>>>> code in a Scala shell:
> >>>>>>>> val b = a.select(...)
> >>>>>>>> val c = b.select(...)
> >>>>>>>> val d = c.select(...).writeToSink(...)
> >>>>>>>> tEnv.execute()
> >>>>>>>> There is no way the optimizer can know whether b or c will be used
> >>>>>>>> in later code, unless users hint explicitly.
> >>>>>>>>
> >>>>>>>> At the same time I’m not sure if you have responded to our
> >>>> objections
> >>>>> of
> >>>>>>>>> `void cache()` being implicit/having side effects, which me,
> Jark,
> >>>>>> Fabian,
> >>>>>>>>> Till and I think also Shaoxuan are supporting.
> >>>>>>>>
> >>>>>>>> Are there any other side effects if we use semantic 3 mentioned
> >>>>>>>> above?
> >>>>>>>>
> >>>>>>>> Thanks,
> >>>>>>>>
> >>>>>>>> JIangjie (Becket) Qin
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On Mon, Dec 10, 2018 at 7:54 PM Piotr Nowojski <
> >>>>> piotr@data-artisans.com
> >>>>>>>
> >>>>>>>> wrote:
> >>>>>>>>
> >>>>>>>>> Hi Becket,
> >>>>>>>>>
> >>>>>>>>> Sorry for not responding long time.
> >>>>>>>>>
> >>>>>>>>> Regarding case1.
> >>>>>>>>>
> >>>>>>>>> There wouldn’t be an “a.unCache()” method; I would expect only
> >>>>>>>>> `cachedTableA1.dropCache()`. Dropping `cachedTableA1` wouldn’t
> >>>>>>>>> affect `cachedTableA2`. Just as in any other database,
> >>>>>>>>> dropping/modifying one independent table/materialised view does
> >>>>>>>>> not affect others.
> >>>>>>>>>
> >>>>>>>>>> What I meant is that assuming there is already a cached table,
> >>>>> ideally
> >>>>>>>>> users need
> >>>>>>>>>> not to specify whether the next query should read from the cache
> >>>> or
> >>>>>> use
> >>>>>>>>> the
> >>>>>>>>>> original DAG. This should be decided by the optimizer.
> >>>>>>>>>
> >>>>>>>>> 1. If we want to let optimiser make decisions whether to use
> cache
> >>>> or
> >>>>>>>>> not, then why do we need “void cache()” method at all? Would It
> >>>>>> “increase”
> >>>>>>>>> the chance of using the cache? That’s sounds strange. What would
> be
> >>>>> the
> >>>>>>>>> mechanism of deciding whether to use the cache or not? If we want
> >>>> to
> >>>>>>>>> introduce such kind  automated optimisations of “plan nodes
> >>>>>> deduplication”
> >>>>>>>>> I would turn it on globally, not per table, and let the optimiser
> >>>> do
> >>>>>> all of
> >>>>>>>>> the work.
> >>>>>>>>> 2. We do not have statistics at the moment for any use/not use
> >>>> cache
> >>>>>>>>> decision.
> >>>>>>>>> 3. Even if we had, I would be veeerryy sceptical whether such
> cost
> >>>>>> based
> >>>>>>>>> optimisations would work properly and I would still insist first
> on
> >>>>>>>>> providing explicit caching mechanism (`CachedTable cache()`)
> >>>>>>>>> 4. As Till wrote, having explicit `CachedTable cache()` doesn’t
> >>>>>>>>> contradict future work on automated cost based caching.
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> At the same time I’m not sure if you have responded to our
> >>>> objections
> >>>>>> of
> >>>>>>>>> `void cache()` being implicit/having side effects, which me,
> Jark,
> >>>>>> Fabian,
> >>>>>>>>> Till and I think also Shaoxuan are supporting.
> >>>>>>>>>
> >>>>>>>>> Piotrek
> >>>>>>>>>
> >>>>>>>>>> On 5 Dec 2018, at 12:42, Becket Qin <be...@gmail.com>
> wrote:
> >>>>>>>>>>
> >>>>>>>>>> Hi Till,
> >>>>>>>>>>
> >>>>>>>>>> It is true that after the first job submission, there will be
> >>>>>>>>>> no ambiguity in terms of whether a cached table is used or not.
> >>>>>>>>>> That is the same for a cache() that does not return a
> >>>>>>>>>> CachedTable.
> >>>>>>>>>>
> >>>>>>>>>> Conceptually one could think of cache() as introducing a caching
> >>>>>>>>> operator
> >>>>>>>>>>> from which you need to consume from if you want to benefit from
> >>>> the
> >>>>>>>>> caching
> >>>>>>>>>>> functionality.
> >>>>>>>>>>
> >>>>>>>>>> I am thinking a little differently. I think it is a hint (as you
> >>>>>>>>> mentioned
> >>>>>>>>>> later) instead of a new operator. I'd like to be careful about
> the
> >>>>>>>>> semantic
> >>>>>>>>>> of the API. A hint is a property set on an existing operator,
> but
> >>>> is
> >>>>>> not
> >>>>>>>>>> itself an operator as it does not really manipulate the data.
> >>>>>>>>>>
> >>>>>>>>>> I agree, ideally the optimizer makes this kind of decision which
> >>>>>>>>>>> intermediate result should be cached. But especially when
> >>>> executing
> >>>>>>>>> ad-hoc
> >>>>>>>>>>> queries the user might better know which results need to be
> >>>> cached
> >>>>>>>>> because
> >>>>>>>>>>> Flink might not see the full DAG. In that sense, I would
> consider
> >>>>> the
> >>>>>>>>>>> cache() method as a hint for the optimizer. Of course, in the
> >>>>> future
> >>>>>> we
> >>>>>>>>>>> might add functionality which tries to automatically cache
> >>>> results
> >>>>>>>>> (e.g.
> >>>>>>>>>>> caching the latest intermediate results until so and so much
> >>>> space
> >>>>> is
> >>>>>>>>>>> used). But this should hopefully not contradict with
> `CachedTable
> >>>>>>>>> cache()`.
> >>>>>>>>>>
> >>>>>>>>>> I agree that the cache() method is needed for exactly the
> >>>>>>>>>> reason you mentioned, i.e. Flink cannot predict what users are
> >>>>>>>>>> going to write later, so users need to tell Flink explicitly
> >>>>>>>>>> that this table will be used later. What I meant is that,
> >>>>>>>>>> assuming there is already a cached table, ideally users need
> >>>>>>>>>> not specify whether the next query should read from the cache
> >>>>>>>>>> or use the original DAG. This should be decided by the
> >>>>>>>>>> optimizer.
> >>>>>>>>>>
> >>>>>>>>>> To explain the difference between returning / not returning a
> >>>>>>>>> CachedTable,
> >>>>>>>>>> I want compare the following two case:
> >>>>>>>>>>
> >>>>>>>>>> *Case 1:  returning a CachedTable*
> >>>>>>>>>> b = a.map(...)
> >>>>>>>>>> val cachedTableA1 = a.cache()
> >>>>>>>>>> val cachedTableA2 = a.cache()
> >>>>>>>>>> b.print() // Just to make sure a is cached.
> >>>>>>>>>>
> >>>>>>>>>> c = a.filter(...) // Does the user specify that the original
> >>>>>>>>>> DAG is used? Or does the optimizer decide whether the DAG or
> >>>>>>>>>> the cache should be used?
> >>>>>>>>>> d = cachedTableA1.filter() // The user specifies that the
> >>>>>>>>>> cached table is used.
> >>>>>>>>>>
> >>>>>>>>>> a.unCache() // Can cachedTableA still be used afterwards?
> >>>>>>>>>> cachedTableA1.uncache() // Can cachedTableA2 still be used?
> >>>>>>>>>>
> >>>>>>>>>> *Case 2: not returning a CachedTable*
> >>>>>>>>>> b = a.map()
> >>>>>>>>>> a.cache()
> >>>>>>>>>> a.cache() // no-op
> >>>>>>>>>> b.print() // Just to make sure a is cached
> >>>>>>>>>>
> >>>>>>>>>> c = a.filter(...) // Optimizer decides whether the cache or DAG
> >>>>> should
> >>>>>>>>> be
> >>>>>>>>>> used
> >>>>>>>>>> d = a.filter(...) // Optimizer decides whether the cache or DAG
> >>>>> should
> >>>>>>>>> be
> >>>>>>>>>> used
> >>>>>>>>>>
> >>>>>>>>>> a.unCache()
> >>>>>>>>>> a.unCache() // no-op
> >>>>>>>>>>
> >>>>>>>>>> In case 1, semantics-wise, the optimizer loses the option to
> >>>>>>>>>> choose between the DAG and the cache. And the unCache() call
> >>>>>>>>>> becomes tricky.
> >>>>>>>>>> In case 2, users do not need to worry about whether the cache
> >>>>>>>>>> or the DAG is used, and the unCache() semantics are clear.
> >>>>>>>>>> However, the caveat is that users cannot explicitly ignore the
> >>>>>>>>>> cache.
> >>>>>>>>>>
> >>>>>>>>>> In order to address the issues mentioned in case 2, and
> >>>>>>>>>> inspired by the discussion so far, I am thinking about using a
> >>>>>>>>>> hint to allow users to explicitly ignore the cache. We do not
> >>>>>>>>>> have hints yet, but we probably should have them. So the code
> >>>>>>>>>> becomes:
> >>>>>>>>>>
> >>>>>>>>>> *Case 3: returning this table*
> >>>>>>>>>> b = a.map()
> >>>>>>>>>> a.cache()
> >>>>>>>>>> a.cache() // no-op
> >>>>>>>>>> b.print() // Just to make sure a is cached
> >>>>>>>>>>
> >>>>>>>>>> c = a.filter(...) // Optimizer decides whether the cache or DAG
> >>>>> should
> >>>>>>>>> be
> >>>>>>>>>> used
> >>>>>>>>>> d = a.hint("ignoreCache").filter(...) // DAG will be used
> instead
> >>>> of
> >>>>>> the
> >>>>>>>>>> cache.
> >>>>>>>>>>
> >>>>>>>>>> a.unCache()
> >>>>>>>>>> a.unCache() // no-op
> >>>>>>>>>>
> >>>>>>>>>> We could also let cache() return this table to allow chained
> >>>>>>>>>> method calls.
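> >>>>>>>>>>
> >>>>>>>>>> For illustration, with cache() returning this table
> >>>>>>>>>> (hypothetical):
> >>>>>>>>>>
> >>>>>>>>>> d = a.cache().filter(...) // cache a, then keep chaining calls
> >>>>>>>>>>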
> >>>>>>>>>> Do you think this API addresses the concerns?
> >>>>>>>>>>
> >>>>>>>>>> Thanks,
> >>>>>>>>>>
> >>>>>>>>>> Jiangjie (Becket) Qin
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> On Wed, Dec 5, 2018 at 10:55 AM Jark Wu <im...@gmail.com>
> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> Hi,
> >>>>>>>>>>>
> >>>>>>>>>>> All the recent discussions have focused on whether there is a
> >>>>>>>>>>> problem if cache() does not return a Table.
> >>>>>>>>>>> It seems that returning a Table explicitly is clearer (and
> >>>>>>>>>>> safer?).
> >>>>>>>>>>>
> >>>>>>>>>>> So are there any problems if cache() returns a Table? @Becket
> >>>>>>>>>>>
> >>>>>>>>>>> Best,
> >>>>>>>>>>> Jark
> >>>>>>>>>>>
> >>>>>>>>>>> On Tue, 4 Dec 2018 at 22:27, Till Rohrmann <
> trohrmann@apache.org
> >>>>>
> >>>>>>>>> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>> It's true that b, c, d and e will all read from the original
> DAG
> >>>>>> that
> >>>>>>>>>>>> generates a. But all subsequent operators (when running
> multiple
> >>>>>>>>> queries)
> >>>>>>>>>>>> which reference cachedTableA should not need to reproduce `a`
> >>>> but
> >>>>>>>>>>> directly
> >>>>>>>>>>>> consume the intermediate result.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Conceptually one could think of cache() as introducing a
> caching
> >>>>>>>>> operator
> >>>>>>>>>>>> from which you need to consume from if you want to benefit
> from
> >>>>> the
> >>>>>>>>>>> caching
> >>>>>>>>>>>> functionality.
> >>>>>>>>>>>>
> >>>>>>>>>>>> I agree, ideally the optimizer makes this kind of decision
> which
> >>>>>>>>>>>> intermediate result should be cached. But especially when
> >>>>> executing
> >>>>>>>>>>> ad-hoc
> >>>>>>>>>>>> queries the user might better know which results need to be
> >>>> cached
> >>>>>>>>>>> because
> >>>>>>>>>>>> Flink might not see the full DAG. In that sense, I would
> >>>> consider
> >>>>>> the
> >>>>>>>>>>>> cache() method as a hint for the optimizer. Of course, in the
> >>>>> future
> >>>>>>>>> we
> >>>>>>>>>>>> might add functionality which tries to automatically cache
> >>>> results
> >>>>>>>>> (e.g.
> >>>>>>>>>>>> caching the latest intermediate results until so and so much
> >>>> space
> >>>>>> is
> >>>>>>>>>>>> used). But this should hopefully not contradict with
> >>>> `CachedTable
> >>>>>>>>>>> cache()`.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Cheers,
> >>>>>>>>>>>> Till
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Tue, Dec 4, 2018 at 2:33 PM Becket Qin <
> becket.qin@gmail.com
> >>>>>
> >>>>>>>>> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>>> Hi Till,
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Thanks for the clarification. I am still a little confused.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> If cache() returns a CachedTable, the example might become:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> b = a.map(...)
> >>>>>>>>>>>>> c = a.map(...)
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> cachedTableA = a.cache()
> >>>>>>>>>>>>> d = cachedTableA.map(...)
> >>>>>>>>>>>>> e = a.map()
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> In the above case, if cache() is lazily evaluated, b, c, d
> >>>>>>>>>>>>> and e are all going to read from the original DAG that
> >>>>>>>>>>>>> generates a. But with a naive expectation, d should be
> >>>>>>>>>>>>> reading from the cache. This does not seem to solve the
> >>>>>>>>>>>>> potential confusion you raised, right?
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Just to be clear, my understanding is all based on the
> >>>>>>>>>>>>> assumption that the tables are immutable. Therefore, after
> >>>>>>>>>>>>> a.cache(), the *cachedTableA* and the original table *a*
> >>>>>>>>>>>>> should be completely interchangeable.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> That said, I think a valid argument is optimization. There
> >>>>>>>>>>>>> are indeed cases where reading from the original DAG could
> >>>>>>>>>>>>> be faster than reading from the cache. For example:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> a.filter('f1 > 100)
> >>>>>>>>>>>>> a.cache()
> >>>>>>>>>>>>> b = a.filter('f1 < 100)
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Ideally the optimizer should be intelligent enough to
> >>>>>>>>>>>>> decide which way is faster, without user intervention. In
> >>>>>>>>>>>>> this case, it will identify that b would just be an empty
> >>>>>>>>>>>>> table, and thus skip reading from the cache completely.
> >>>>>>>>>>>>> But I agree that returning a CachedTable would give users
> >>>>>>>>>>>>> control over when to use the cache, even though I still feel
> >>>>>>>>>>>>> that letting the optimizer handle this is a better option in
> >>>>>>>>>>>>> the long run.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Jiangjie (Becket) Qin
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> On Tue, Dec 4, 2018 at 6:51 PM Till Rohrmann <
> >>>>> trohrmann@apache.org
> >>>>>>>
> >>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> Yes you are right Becket that it still depends on the actual
> >>>>>>>>>>> execution
> >>>>>>>>>>>> of
> >>>>>>>>>>>>>> the job whether a consumer reads from a cached result or
> not.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> My point was actually about the properties of a (cached vs.
> >>>>>>>>>>> non-cached)
> >>>>>>>>>>>>> and
> >>>>>>>>>>>>>> not about the execution. I would not make cache trigger the
> >>>>>>>>> execution
> >>>>>>>>>>>> of
> >>>>>>>>>>>>>> the job because one loses some flexibility by eagerly
> >>>> triggering
> >>>>>> the
> >>>>>>>>>>>>>> execution.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> I tried to argue for an explicit CachedTable which is
> returned
> >>>>> by
> >>>>>>>>> the
> >>>>>>>>>>>>>> cache() method like Piotr did in order to make the API more
> >>>>>>>>> explicit.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Cheers,
> >>>>>>>>>>>>>> Till
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> On Mon, Dec 3, 2018 at 4:23 PM Becket Qin <
> >>>> becket.qin@gmail.com
> >>>>>>
> >>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Hi Till,
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> That is a good example. Just a minor correction: in this
> >>>>>>>>>>>>>>> case, b, c and d will all consume from a non-cached a.
> >>>>>>>>>>>>>>> This is because the cache will only be created on the very
> >>>>>>>>>>>>>>> first job submission that generates the table to be
> >>>>>>>>>>>>>>> cached.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> If I understand correctly, this example is about whether
> >>>>>>>>>>>>>>> the .cache() method should be eagerly or lazily evaluated.
> >>>>>>>>>>>>>>> In other words, if the cache() method actually triggers a
> >>>>>>>>>>>>>>> job that creates the cache, there will be no such
> >>>>>>>>>>>>>>> confusion. Is that right?
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> In the example, although d will not consume from the
> >>>>>>>>>>>>>>> cached Table while it looks like it is supposed to, from a
> >>>>>>>>>>>>>>> correctness perspective the code will still return the
> >>>>>>>>>>>>>>> correct result, assuming that tables are immutable.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Personally I feel it is OK, because users probably won't
> >>>>>>>>>>>>>>> really worry about whether the table is cached or not. And
> >>>>>>>>>>>>>>> a lazy cache could avoid some unnecessary caching if a
> >>>>>>>>>>>>>>> cached table is never actually needed in the user
> >>>>>>>>>>>>>>> application. But I am not opposed to eager evaluation of
> >>>>>>>>>>>>>>> the cache.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Jiangjie (Becket) Qin
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> On Mon, Dec 3, 2018 at 10:01 PM Till Rohrmann <
> >>>>>>>>>>> trohrmann@apache.org>
> >>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Another argument for Piotr's point is that lazily
> >>>>>>>>>>>>>>>> changing the properties of a node affects all downstream
> >>>>>>>>>>>>>>>> consumers, but does not necessarily have to happen before
> >>>>>>>>>>>>>>>> these consumers are defined. From a user's perspective
> >>>>>>>>>>>>>>>> this can be quite confusing:
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> b = a.map(...)
> >>>>>>>>>>>>>>>> c = a.map(...)
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> a.cache()
> >>>>>>>>>>>>>>>> d = a.map(...)
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> now b, c and d will consume from a cached operator. In
> this
> >>>>>> case,
> >>>>>>>>>>>> the
> >>>>>>>>>>>>>>> user
> >>>>>>>>>>>>>>>> would most likely expect that only d reads from a cached
> >>>>> result.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Cheers,
> >>>>>>>>>>>>>>>> Till
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> On Mon, Dec 3, 2018 at 11:32 AM Piotr Nowojski <
> >>>>>>>>>>>>>> piotr@data-artisans.com>
> >>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Hey Shaoxuan and Becket,
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Can you explain a bit more one what are the side
> effects?
> >>>> So
> >>>>>>>>>>>> far
> >>>>>>>>>>>>> my
> >>>>>>>>>>>>>>>>>> understanding is that such side effects only exist if a
> >>>>> table
> >>>>>>>>>>>> is
> >>>>>>>>>>>>>>>> mutable.
> >>>>>>>>>>>>>>>>>> Is that the case?
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Not only that. There are also performance implications,
> >>>>>>>>>>>>>>>>> and those are another implicit side effect of using
> >>>>>>>>>>>>>>>>> `void cache()`. As I wrote before, reading from the
> >>>>>>>>>>>>>>>>> cache might not always be desirable, thus it can cause
> >>>>>>>>>>>>>>>>> performance degradation, and I’m fine with that - it's
> >>>>>>>>>>>>>>>>> the user's or optimiser’s choice. What I do not like is
> >>>>>>>>>>>>>>>>> that this implicit side effect can manifest in a
> >>>>>>>>>>>>>>>>> completely different part of the code that wasn’t
> >>>>>>>>>>>>>>>>> touched by the user while he was adding the `void
> >>>>>>>>>>>>>>>>> cache()` call somewhere else. And even if caching
> >>>>>>>>>>>>>>>>> improves performance, it’s still a side effect of `void
> >>>>>>>>>>>>>>>>> cache()`. Almost by definition, `void` methods have only
> >>>>>>>>>>>>>>>>> side effects. As I wrote before, there are a couple of
> >>>>>>>>>>>>>>>>> scenarios where this might be undesirable and/or
> >>>>>>>>>>>>>>>>> unexpected, for example:
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> 1.
> >>>>>>>>>>>>>>>>> Table b = …;
> >>>>>>>>>>>>>>>>> b.cache()
> >>>>>>>>>>>>>>>>> x = b.join(…)
> >>>>>>>>>>>>>>>>> y = b.count()
> >>>>>>>>>>>>>>>>> // ...
> >>>>>>>>>>>>>>>>> // 100
> >>>>>>>>>>>>>>>>> // hundred
> >>>>>>>>>>>>>>>>> // lines
> >>>>>>>>>>>>>>>>> // of
> >>>>>>>>>>>>>>>>> // code
> >>>>>>>>>>>>>>>>> // later
> >>>>>>>>>>>>>>>>> z = b.filter(…).groupBy(…) // this might be even hidden
> in
> >>>> a
> >>>>>>>>>>>>>> different
> >>>>>>>>>>>>>>>>> method/file/package/dependency
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> 2.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Table b = ...
> >>>>>>>>>>>>>>>>> If (some_condition) {
> >>>>>>>>>>>>>>>>> foo(b)
> >>>>>>>>>>>>>>>>> }
> >>>>>>>>>>>>>>>>> Else {
> >>>>>>>>>>>>>>>>> bar(b)
> >>>>>>>>>>>>>>>>> }
> >>>>>>>>>>>>>>>>> z = b.filter(…).groupBy(…)
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Void foo(Table b) {
> >>>>>>>>>>>>>>>>> b.cache()
> >>>>>>>>>>>>>>>>> // do something with b
> >>>>>>>>>>>>>>>>> }
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> In both examples above, `b.cache()` will implicitly
> >>>>>>>>>>>>>>>>> affect `z = b.filter(…).groupBy(…)` (both the semantics
> >>>>>>>>>>>>>>>>> of the program, in case the sources are mutable, and its
> >>>>>>>>>>>>>>>>> performance), which might be far from obvious.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> On top of that, there is still this argument of mine that
> >>>>>>>>>>> having
> >>>>>>>>>>>> a
> >>>>>>>>>>>>>>>>> `MaterializedTable` or `CachedTable` handle is more
> >>>> flexible
> >>>>>>>>>>> for
> >>>>>>>>>>>> us
> >>>>>>>>>>>>>> for
> >>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>> future and for the user (as a manual option to bypass
> cache
> >>>>>>>>>>>> reads).
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> But Jiangjie is correct,
> >>>>>>>>>>>>>>>>>> the source table in batching should be immutable. It is
> >>>> the
> >>>>>>>>>>>>> user’s
> >>>>>>>>>>>>>>>>>> responsibility to ensure it, otherwise even a regular
> >>>>>>>>>>> failover
> >>>>>>>>>>>>> may
> >>>>>>>>>>>>>>> lead
> >>>>>>>>>>>>>>>>>> to inconsistent results.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Yes, I agree that’s what a perfect world/good deployment
> >>>>>>>>>>>>>>>>> should be. But it often isn’t, and while I’m not trying
> >>>>>>>>>>>>>>>>> to fix this (since the proper fix is to support
> >>>>>>>>>>>>>>>>> transactions), I’m just trying to minimise confusion for
> >>>>>>>>>>>>>>>>> the users that are not fully aware of what’s going on
> >>>>>>>>>>>>>>>>> and operate in a less than perfect setup. And if
> >>>>>>>>>>>>>>>>> something bites them after adding a `b.cache()` call, I
> >>>>>>>>>>>>>>>>> want to make sure that they at least know all of the
> >>>>>>>>>>>>>>>>> places that adding this line can affect.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Thanks, Piotrek
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> On 1 Dec 2018, at 15:39, Becket Qin <
> becket.qin@gmail.com
> >>>>>
> >>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Hi Piotrek,
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Thanks again for the clarification. Some more replies
> are
> >>>>>>>>>>>>>> following.
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> But keep in mind that `.cache()` will/might not only be
> >>>> used
> >>>>>>>>>>> in
> >>>>>>>>>>>>>>>>> interactive
> >>>>>>>>>>>>>>>>>>> programming and not only in batching.
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> It is true. Actually, in stream processing, cache() has
> >>>>>>>>>>>>>>>>>> the same semantics as in batch processing. The
> >>>>>>>>>>>>>>>>>> semantics are the following:
> >>>>>>>>>>>>>>>>>> For a table created via a series of computations, save
> >>>>>>>>>>>>>>>>>> that table for later reference, to avoid re-running the
> >>>>>>>>>>>>>>>>>> computation logic to regenerate the table. Once the
> >>>>>>>>>>>>>>>>>> application exits, drop all the caches.
> >>>>>>>>>>>>>>>>>> These semantics are the same for both batch and stream
> >>>>>>>>>>>>>>>>>> processing. The difference is that stream applications
> >>>>>>>>>>>>>>>>>> will only run once, as they are long running, while
> >>>>>>>>>>>>>>>>>> batch applications may be run multiple times, hence the
> >>>>>>>>>>>>>>>>>> cache may be created and dropped each time the
> >>>>>>>>>>>>>>>>>> application runs.
> >>>>>>>>>>>>>>>>>> Admittedly, there will probably be some resource
> >>>>>>>>>>>>>>>>>> management requirements for the streaming cached table,
> >>>>>>>>>>>>>>>>>> such as time-based / size-based retention, to address
> >>>>>>>>>>>>>>>>>> the infinite data issue. But such requirements do not
> >>>>>>>>>>>>>>>>>> change the semantics.
> >>>>>>>>>>>>>>>>>> You are right that interactive programming is just one
> use
> >>>>>>>>>>> case
> >>>>>>>>>>>>> of
> >>>>>>>>>>>>>>>>> cache().
> >>>>>>>>>>>>>>>>>> It is not the only use case.
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> For me the more important issue is of not having the
> `void
> >>>>>>>>>>>>> cache()`
> >>>>>>>>>>>>>>>> with
> >>>>>>>>>>>>>>>>>>> side effects.
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> This is indeed the key point. The argument around
> whether
> >>>>>>>>>>>> cache()
> >>>>>>>>>>>>>>>> should
> >>>>>>>>>>>>>>>>>> return something already indicates that cache() and
> >>>>>>>>>>>> materialize()
> >>>>>>>>>>>>>>>> address
> >>>>>>>>>>>>>>>>>> different issues.
> >>>>>>>>>>>>>>>>>> Can you explain a bit more on what the side effects
> >>>>>>>>>>>>>>>>>> are? So far my understanding is that such side effects
> >>>>>>>>>>>>>>>>>> only exist if a table is mutable. Is that the case?
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> I don’t know, probably initially we should make
> >>>> CachedTable
> >>>>>>>>>>>>>>> read-only.
> >>>>>>>>>>>>>>>> I
> >>>>>>>>>>>>>>>>>>> don’t find it more confusing than the fact that user
> can
> >>>>> not
> >>>>>>>>>>>>> write
> >>>>>>>>>>>>>>> to
> >>>>>>>>>>>>>>>>> views
> >>>>>>>>>>>>>>>>>>> or materialised views in SQL or that user currently can
> >>>> not
> >>>>>>>>>>>>> write
> >>>>>>>>>>>>>>> to a
> >>>>>>>>>>>>>>>>>>> Table.
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> I don't think anyone should insert something into a
> >>>>>>>>>>>>>>>>>> cache. By definition, the cache should only be updated
> >>>>>>>>>>>>>>>>>> when the corresponding original table is updated. What
> >>>>>>>>>>>>>>>>>> I am wondering is that, given the following two facts:
> >>>>>>>>>>>>>>>>>> 1. If and only if a table is mutable (with something
> >>>>>>>>>>>>>>>>>> like insert()), a CachedTable may have implicit
> >>>>>>>>>>>>>>>>>> behavior.
> >>>>>>>>>>>>>>>>>> 2. A CachedTable extends a Table.
> >>>>>>>>>>>>>>>>>> We can come to the conclusion that a CachedTable is
> >>>>>>>>>>>>>>>>>> mutable and users can insert into the CachedTable
> >>>>>>>>>>>>>>>>>> directly. This is what I found confusing.
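> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> A tiny sketch of that concern (all signatures here are
> >>>>>>>>>>>>>>>>>> hypothetical, just to make the inheritance issue
> >>>>>>>>>>>>>>>>>> concrete):
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> class Row {}
> >>>>>>>>>>>>>>>>>> class Table {
> >>>>>>>>>>>>>>>>>>   void insert(Row row) { /* hypothetical mutating call */ }
> >>>>>>>>>>>>>>>>>> }
> >>>>>>>>>>>>>>>>>> class CachedTable extends Table {
> >>>>>>>>>>>>>>>>>>   // insert() is inherited, so nothing would stop users
> >>>>>>>>>>>>>>>>>>   // from writing into the cache directly
> >>>>>>>>>>>>>>>>>> }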
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Jiangjie (Becket) Qin
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> On Sat, Dec 1, 2018 at 2:45 AM Piotr Nowojski <
> >>>>>>>>>>>>>>> piotr@data-artisans.com
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Hi all,
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Regarding naming `cache()` vs `materialize()`. One
> >>>>>>>>>>>>>>>>>>> more explanation of why `materialize()` is more
> >>>>>>>>>>>>>>>>>>> natural to me is that I think of all “Table”s in the
> >>>>>>>>>>>>>>>>>>> Table API as views. They behave the same way as SQL
> >>>>>>>>>>>>>>>>>>> views; the only difference for me is that their life
> >>>>>>>>>>>>>>>>>>> scope is short - the current session, which is limited
> >>>>>>>>>>>>>>>>>>> by a different execution model. That’s why “caching” a
> >>>>>>>>>>>>>>>>>>> view for me is just materialising it.
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> However I see and I understand your point of view.
> Coming
> >>>>>>>>>>> from
> >>>>>>>>>>>>>>>>>>> DataSet/DataStream and generally speaking non-SQL
> world,
> >>>>>>>>>>>>> `cache()`
> >>>>>>>>>>>>>>> is
> >>>>>>>>>>>>>>>>> more
> >>>>>>>>>>>>>>>>>>> natural. But keep in mind that `.cache()` will/might
> not
> >>>>>>>>>>> only
> >>>>>>>>>>>> be
> >>>>>>>>>>>>>>> used
> >>>>>>>>>>>>>>>> in
> >>>>>>>>>>>>>>>>>>> interactive programming and not only in batching. But
> >>>>> naming
> >>>>>>>>>>>> is
> >>>>>>>>>>>>>> one
> >>>>>>>>>>>>>>>>> issue,
> >>>>>>>>>>>>>>>>>>> and not that critical to me. Especially that once we
> >>>>>>>>>>> implement
> >>>>>>>>>>>>>>> proper
> >>>>>>>>>>>>>>>>>>> materialised views, we can always deprecate/rename
> >>>>> `cache()`
> >>>>>>>>>>>> if
> >>>>>>>>>>>>> we
> >>>>>>>>>>>>>>>> deem
> >>>>>>>>>>>>>>>>> so.
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> For me the more important issue is not having a `void
> >>>>>>>>>>>>>>>>>>> cache()` with side effects, exactly for the reasons
> >>>>>>>>>>>>>>>>>>> that you have mentioned. True: results might be
> >>>>>>>>>>>>>>>>>>> non-deterministic if the underlying source tables are
> >>>>>>>>>>>>>>>>>>> changing. The problem is that `void cache()`
> >>>>>>>>>>>>>>>>>>> implicitly changes the semantics of subsequent uses of
> >>>>>>>>>>>>>>>>>>> the cached/materialized Table. It can cause a “wtf”
> >>>>>>>>>>>>>>>>>>> moment for a user if he inserts a “b.cache()” call in
> >>>>>>>>>>>>>>>>>>> some place in his code and suddenly some other random
> >>>>>>>>>>>>>>>>>>> places behave differently. If `materialize()` or
> >>>>>>>>>>>>>>>>>>> `cache()` returns a Table handle, we force the user to
> >>>>>>>>>>>>>>>>>>> explicitly use the cache, which removes the “random”
> >>>>>>>>>>>>>>>>>>> part from “suddenly some other random places are
> >>>>>>>>>>>>>>>>>>> behaving differently”.
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> This argument and others that I’ve raised (greater
> >>>>>>>>>>>>>>>>>>> flexibility/allowing the user to explicitly bypass the
> >>>>>>>>>>>>>>>>>>> cache) are independent of the `cache()` vs
> >>>>>>>>>>>>>>>>>>> `materialize()` discussion.
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> Does that mean one can also insert into the
> CachedTable?
> >>>>>>>>>>> This
> >>>>>>>>>>>>>>> sounds
> >>>>>>>>>>>>>>>>>>> pretty confusing.
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> I don’t know; probably initially we should make
> >>>>>>>>>>>>>>>>>>> CachedTable read-only. I don’t find it more confusing
> >>>>>>>>>>>>>>>>>>> than the fact that users cannot write to views or
> >>>>>>>>>>>>>>>>>>> materialised views in SQL, or that users currently
> >>>>>>>>>>>>>>>>>>> cannot write to a Table.
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Piotrek
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> On 30 Nov 2018, at 17:38, Xingcan Cui <
> >>>> xingcanc@gmail.com
> >>>>>>
> >>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> Hi all,
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> I agree with @Becket that `cache()` and
> `materialize()`
> >>>>>>>>>>>> should
> >>>>>>>>>>>>> be
> >>>>>>>>>>>>>>>>>>> considered as two different methods where the later one
> >>>> is
> >>>>>>>>>>>> more
> >>>>>>>>>>>>>>>>>>> sophisticated.
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> According to my understanding, the initial idea is
> just
> >>>> to
> >>>>>>>>>>>>>>> introduce
> >>>>>>>>>>>>>>>> a
> >>>>>>>>>>>>>>>>>>> simple cache or persist mechanism, but as the TableAPI
> >>>> is a
> >>>>>>>>>>>>>>> high-level
> >>>>>>>>>>>>>>>>> API,
> >>>>>>>>>>>>>>>>>>> it’s naturally for as to think in a SQL way.
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> Maybe we can add the `cache()` method to the DataSet
> API
> >>>>>>>>>>> and
> >>>>>>>>>>>>>> force
> >>>>>>>>>>>>>>>>> users
> >>>>>>>>>>>>>>>>>>> to translate a Table to a Dataset before caching it.
> Then
> >>>>>>>>>>> the
> >>>>>>>>>>>>>> users
> >>>>>>>>>>>>>>>>> should
> >>>>>>>>>>>>>>>>>>> manually register the cached dataset to a table again
> (we
> >>>>>>>>>>> may
> >>>>>>>>>>>>> need
> >>>>>>>>>>>>>>>> some
> >>>>>>>>>>>>>>>>>>> table replacement mechanisms for datasets with an
> >>>> identical
> >>>>>>>>>>>>> schema
> >>>>>>>>>>>>>>> but
> >>>>>>>>>>>>>>>>>>> different contents here). After all, it’s the dataset
> >>>>> rather
> >>>>>>>>>>>>> than
> >>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>> dynamic table that need to be cached, right?
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> Best,
> >>>>>>>>>>>>>>>>>>>> Xingcan
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> On Nov 30, 2018, at 10:57 AM, Becket Qin <
> >>>>>>>>>>>>> becket.qin@gmail.com>
> >>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> Hi Piotrek and Jark,
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> Thanks for the feedback and explanation. Those are
> good
> >>>>>>>>>>>>>> arguments.
> >>>>>>>>>>>>>>>>> But I
> >>>>>>>>>>>>>>>>>>>>> think those arguments are mostly about materialized
> >>>> view.
> >>>>>>>>>>>> Let
> >>>>>>>>>>>>> me
> >>>>>>>>>>>>>>> try
> >>>>>>>>>>>>>>>>> to
> >>>>>>>>>>>>>>>>>>>>> explain the reason I believe cache() and
> materialize()
> >>>>> are
> >>>>>>>>>>>>>>>> different.
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> I think cache() and materialize() have quite
> different
> >>>>>>>>>>>>>>> implications.
> >>>>>>>>>>>>>>>>> An
> >>>>>>>>>>>>>>>>>>>>> analogy I can think of is save()/publish(). When
> users
> >>>>>>>>>>> call
> >>>>>>>>>>>>>>> cache(),
> >>>>>>>>>>>>>>>>> it
> >>>>>>>>>>>>>>>>>>> is
> >>>>>>>>>>>>>>>>>>>>> just like they are saving an intermediate result as a
> >>>>>>>>>>> draft
> >>>>>>>>>>>> of
> >>>>>>>>>>>>>>> their
> >>>>>>>>>>>>>>>>>>> work,
> >>>>>>>>>>>>>>>>>>>>> this intermediate result may not have any realistic
> >>>>>>>>>>> meaning.
> >>>>>>>>>>>>>>> Calling
> >>>>>>>>>>>>>>>>>>>>> cache() does not mean users want to publish the
> cached
> >>>>>>>>>>> table
> >>>>>>>>>>>>> in
> >>>>>>>>>>>>>>> any
> >>>>>>>>>>>>>>>>>>> manner.
> >>>>>>>>>>>>>>>>>>>>> But when users call materialize(), that means "I have
> >>>>>>>>>>>>> something
> >>>>>>>>>>>>>>>>>>> meaningful
> >>>>>>>>>>>>>>>>>>>>> to be reused by others", now users need to think
> about
> >>>>> the
> >>>>>>>>>>>>>>>> validation,
> >>>>>>>>>>>>>>>>>>>>> update & versioning, lifecycle of the result, etc.
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> Piotrek's suggestions on variations of the
> >>>> materialize()
> >>>>>>>>>>>>> methods
> >>>>>>>>>>>>>>> are
> >>>>>>>>>>>>>>>>>>> very
> >>>>>>>>>>>>>>>>>>>>> useful. It would be great if Flink have them. The
> >>>> concept
> >>>>>>>>>>> of
> >>>>>>>>>>>>>>>>>>> materialized
> >>>>>>>>>>>>>>>>>>>>> view is actually a pretty big feature, not to say the
> >>>>>>>>>>>> related
> >>>>>>>>>>>>>>> stuff
> >>>>>>>>>>>>>>>>> like
> >>>>>>>>>>>>>>>>>>>>> triggers/hooks you mentioned earlier. I think the
> >>>>>>>>>>>> materialized
> >>>>>>>>>>>>>>> view
> >>>>>>>>>>>>>>>>>>> itself
> >>>>>>>>>>>>>>>>>>>>> should be discussed in a more thorough and systematic
> >>>>>>>>>>>> manner.
> >>>>>>>>>>>>>> And
> >>>>>>>>>>>>>>> I
> >>>>>>>>>>>>>>>>>>> found
> >>>>>>>>>>>>>>>>>>>>> that discussion is kind of orthogonal and way beyond
> >>>>>>>>>>>>> interactive
> >>>>>>>>>>>>>>>>>>>>> programming experience.
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> The example you gave was interesting. I still have
> some
> >>>>>>>>>>>>>> questions,
> >>>>>>>>>>>>>>>>>>> though.
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> Table source = … // some source that scans files
> from a
> >>>>>>>>>>>>>> directory
> >>>>>>>>>>>>>>>>>>>>>> “/foo/bar/“
> >>>>>>>>>>>>>>>>>>>>>> Table t1 = source.groupBy(…).select(…).where(…) ….;
> >>>>>>>>>>>>>>>>>>>>>> Table t2 = t1.materialize() // (or `cache()`)
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> t2.count() // initialise cache (if it’s lazily
> >>>>>>>>>>> initialised)
> >>>>>>>>>>>>>>>>>>>>>> int a1 = t1.count()
> >>>>>>>>>>>>>>>>>>>>>> int b1 = t2.count()
> >>>>>>>>>>>>>>>>>>>>>> // something in the background (or we trigger it)
> >>>> writes
> >>>>>>>>>>>> new
> >>>>>>>>>>>>>>> files
> >>>>>>>>>>>>>>>> to
> >>>>>>>>>>>>>>>>>>>>>> /foo/bar
> >>>>>>>>>>>>>>>>>>>>>> int a2 = t1.count()
> >>>>>>>>>>>>>>>>>>>>>> int b2 = t2.count()
> >>>>>>>>>>>>>>>>>>>>>> t2.refresh() // possible future extension, not to be
> >>>>>>>>>>>>>> implemented
> >>>>>>>>>>>>>>> in
> >>>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>>>>> initial version
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> what if someone else added some more files to
> /foo/bar
> >>>> at
> >>>>>>>>>>>> this
> >>>>>>>>>>>>>>>> point?
> >>>>>>>>>>>>>>>>> In
> >>>>>>>>>>>>>>>>>>>>> that case, a3 won't equals to b3, and the result
> become
> >>>>>>>>>>>>>>>>>>> non-deterministic,
> >>>>>>>>>>>>>>>>>>>>> right?
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> int a3 = t1.count()
> >>>>>>>>>>>>>>>>>>>>>> int b3 = t2.count()
> >>>>>>>>>>>>>>>>>>>>>> t2.drop() // another possible future extension,
> manual
> >>>>>>>>>>>>> “cache”
> >>>>>>>>>>>>>>>>> dropping
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> When we talk about interactive programming, in most
> >>>>> cases,
> >>>>>>>>>>>> we
> >>>>>>>>>>>>>> are
> >>>>>>>>>>>>>>>>>>> talking
> >>>>>>>>>>>>>>>>>>>>> about batch applications. A fundamental assumption of
> >>>>> such
> >>>>>>>>>>>>> case
> >>>>>>>>>>>>>> is
> >>>>>>>>>>>>>>>>> that
> >>>>>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>>>> source data is complete before the data processing
> >>>>> begins,
> >>>>>>>>>>>> and
> >>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>> data
> >>>>>>>>>>>>>>>>>>>>> will not change during the data processing. IMO, if
> >>>>>>>>>>>> additional
> >>>>>>>>>>>>>>> rows
> >>>>>>>>>>>>>>>>>>> needs
> >>>>>>>>>>>>>>>>>>>>> to be added to some source during the processing, it
> >>>>>>>>>>> should
> >>>>>>>>>>>> be
> >>>>>>>>>>>>>>> done
> >>>>>>>>>>>>>>>> in
> >>>>>>>>>>>>>>>>>>> ways
> >>>>>>>>>>>>>>>>>>>>> like union the source with another table containing
> the
> >>>>>>>>>>> rows
> >>>>>>>>>>>>> to
> >>>>>>>>>>>>>> be
> >>>>>>>>>>>>>>>>>>> added.
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> There are a few cases that computations are executed
> >>>>>>>>>>>>> repeatedly
> >>>>>>>>>>>>>> on
> >>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>>>> changing data source.
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> For example, people may run a ML training job every
> >>>> hour
> >>>>>>>>>>>> with
> >>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>> samples
> >>>>>>>>>>>>>>>>>>>>> newly added in the past hour. In that case, the
> source
> >>>>>>>>>>> data
> >>>>>>>>>>>>>>> between
> >>>>>>>>>>>>>>>>> will
> >>>>>>>>>>>>>>>>>>>>> indeed change. But still, the data remain unchanged
> >>>>> within
> >>>>>>>>>>>> one
> >>>>>>>>>>>>>>> run.
> >>>>>>>>>>>>>>>>> And
> >>>>>>>>>>>>>>>>>>>>> usually in that case, the result will need
> versioning,
> >>>>>>>>>>> i.e.
> >>>>>>>>>>>>> for
> >>>>>>>>>>>>>> a
> >>>>>>>>>>>>>>>>> given
> >>>>>>>>>>>>>>>>>>>>> result, it tells that the result is a result from the
> >>>>>>>>>>> source
> >>>>>>>>>>>>>> data
> >>>>>>>>>>>>>>>> by a
> >>>>>>>>>>>>>>>>>>>>> certain timestamp.
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> Another example is something like data warehouse. In
> >>>> this
> >>>>>>>>>>>>> case,
> >>>>>>>>>>>>>>>> there
> >>>>>>>>>>>>>>>>>>> are a
> >>>>>>>>>>>>>>>>>>>>> few source of original/raw data. On top of those
> >>>> sources,
> >>>>>>>>>>>> many
> >>>>>>>>>>>>>>>>>>> materialized
> >>>>>>>>>>>>>>>>>>>>> view / queries / reports / dashboards can be created
> to
> >>>>>>>>>>>>> generate
> >>>>>>>>>>>>>>>>> derived
> >>>>>>>>>>>>>>>>>>>>> data. Those derived data needs to be updated when the
> >>>>>>>>>>>>> underlying
> >>>>>>>>>>>>>>>>>>> original
> >>>>>>>>>>>>>>>>>>>>> data changes. In that case, the processing logic that
> >>>>>>>>>>>> derives
> >>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>> original
> >>>>>>>>>>>>>>>>>>>>> data needs to be executed repeatedly to update those
> >>>>>>>>>>>>>>> reports/views.
> >>>>>>>>>>>>>>>>>>> Again,
> >>>>>>>>>>>>>>>>>>>>> all those derived data also need to ha
>
>

Re: [DISCUSS] Support Interactive Programming in Flink Table API

Posted by Piotr Nowojski <pi...@data-artisans.com>.
Hi,

Thanks for the quick answer :)

Re 1.

I generally agree with you, however couple of points:

a) the problem with using automatic caching is bigger, because you will have to decide how to compare IO vs CPU costs, and if you pick wrong, the additional IO costs might be enormous or can even crash your system. This is a more difficult problem compared to, let's say, join reordering, where the only issue is to have good statistics that can capture correlations between columns (when you reorder joins, the number of IO operations does not change)
b) your example is completely independent of caching.

A query like this:

src1.filter('f1 > 10).join(src2.filter('f2 < 30), 'f1 === 'f2).as('f3, …).filter('f3 > 30)

should/could be optimised to an empty result immediately, without the need for any cache/materialisation, and that should work even without any statistics provided by the connector (assuming 'f3 is the renamed join key: 'f1 > 10 and 'f2 < 30 bound it to the open interval (10, 30), so 'f3 > 30 can never match).

For me a prerequisite to any serious cost-based optimisations would be some reasonable benchmark coverage of the code (TPC-H?). Otherwise that would be the equivalent of adding untested code, since we wouldn’t be able to verify our assumptions, like how the writing of 10,000 records to a cache/RocksDB/Kafka/CSV file compares to the joining/filtering/processing of, let’s say, 1,000,000 rows.

Re 2.

I wasn’t proposing to change the semantic later. I was proposing that we start now:

CachedTable cachedA = a.cache()
cachedA.foo() // Cache is used
a.bar() // Original DAG is used

And then later we can think about adding for example 

CachedTable cachedA = a.hintCache()
cachedA.foo() // Cache might be used
a.bar() // Original DAG is used

Or

env.enableAutomaticCaching()
a.foo() // Cache might be used
a.bar() // Cache might be used

Or (I would still not like this option):

a.hintCache()
a.foo() // Cache might be used
a.bar() // Cache might be used

Or whatever else that comes to our mind. Even if we add some automatic caching in the future, keeping explicit (`CachedTable cache()`) caching will still be useful, at least in some cases.

Re 3.

> 2. The source tables are immutable during one run of batch processing logic.
> 3. The cache is immutable during one run of batch processing logic.

> I think assumption 2 and 3 are by definition what batch processing means,
> i.e the data must be complete before it is processed and should not change
> when the processing is running.

I agree that this is how batch systems SHOULD be working. However I know from my previous experience that it’s not always the case. Sometimes users are just working on some non-transactional storage, which can be (either constantly or occasionally) modified by some other processes for whatever reason (fixing the data, updating, adding new data, etc.).

But even if we ignore this point (data immutability), the performance side effect issue of your proposal remains. If a user calls the `void`-returning `a.cache()` deep inside some private method, it will have implicit side effects on other parts of his program that might not be obvious.
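For example (a minimal sketch of that situation; `preprocess` is a made-up helper, not part of any proposal):

def preprocess(t: Table): Unit = {
  // hidden side effect: from now on, every use of `t` anywhere in the
  // program may read from the cache instead of the original DAG
  t.cache()
}

val a = tableEnv.scan("src")
preprocess(a)              // nothing on this line hints that `a` changed
val z = a.filter('f1 > 10) // semantics/performance affected far away from the cache() call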

Re `CacheHandle`.

If I understand it correctly, it only addresses the issue of where to place the `uncache`/`dropCache` method.

Btw,

> In vast majority of the cases, users wouldn't really care whether the cache is used or not.

I wouldn’t agree with that, because “caching” (if not purely in-memory caching) would add additional IO costs. It’s similar to saying that users would not see a difference between Spark/Flink and MapReduce (MapReduce writes data to disks after every map/reduce stage).

Piotrek

> On 12 Dec 2018, at 14:28, Becket Qin <be...@gmail.com> wrote:
> 
> Hi Piotrek,
> 
> Not sure if you noticed, in my last email, I was proposing `CacheHandle
> cache()` to avoid the potential side effect due to function calls.
> 
> Let's look at the disagreement in your reply one by one.
> 
> 
> 1. Optimization chances
> 
> Optimization is never trivial work. This is exactly why we should not let
> the user manually do that. Databases have done a huge amount of work in
> this area. At Alibaba, we rely heavily on many optimization rules to boost
> SQL query performance.
> 
> In your example, if I fill in the filter conditions in a certain way, the
> optimization becomes obvious.
> 
> Table src1 = … // read from connector 1
> Table src2 = … // read from connector 2
> 
> Table a = src1.filter('f1 > 10).join(src2.filter('f2 < 30), 'f1 ===
> 'f2).as('f3, ...)
> a.cache() // write cache to connector 3; when writing the records, remember
> the min and max of 'f1
> 
> a.filter('f3 > 30) // There is no need to read from any connector because
> `a` does not contain any record whose 'f3 is greater than 30.
> env.execute()
> a.select(…)
> 
> BTW, it seems to me that adding some basic statistics is fairly
> straightforward and the cost is pretty marginal, if not negligible. In
> fact it is not only needed for optimization, but also for cases such as
> ML, where some algorithms may need to decide their parameters based on the
> statistics of the data.
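> To sketch the kind of pruning this enables (`ColumnStats` and
> `canSkipScan` below are made-up names, just for illustration):
>
> // collected while the cached table is written
> case class ColumnStats(min: Long, max: Long)
>
> // when planning `cachedA.filter('f3 > 30)`, the optimizer could check:
> def canSkipScan(stats: ColumnStats, lowerBound: Long): Boolean =
>   stats.max <= lowerBound // no cached row can satisfy 'f3 > lowerBound
>
> If canSkipScan returns true, the scan can be replaced by an empty result.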
> 
> 
> 2. Same API, one semantic now, another semantic later.
> 
> I am trying to understand what the semantics of the `CachedTable cache()`
> you are proposing are. IMO, we should avoid designing an API whose
> semantics will be changed later. If we have a “CachedTable cache()”
> method, then the semantics should be very clearly defined upfront and not
> change later. It should never be “right now let’s go with semantic 1,
> later we can silently change it to semantic 2 or 3”. Such a change could
> result in bad consequences. For example, let’s say we decide to go with
> semantic 1:
> 
> CachedTable cachedA = a.cache()
> cachedA.foo() // Cache is used
> a.bar() // Original DAG is used.
> 
> Now the majority of the users would be using cachedA.foo() in their code.
> And some advanced users will use a.bar() to explicitly skip the cache.
> Later on, we add smart optimization and change the semantics to semantic 2:
> 
> CachedTable cachedA = a.cache()
> cachedA.foo() // Cache is used
> a.bar() // Cache MIGHT be used, and Flink may decide to skip cache if it is
> faster.
> 
> Now most of the users who were writing cachedA.foo() will not benefit
> from this optimization at all, unless they change their code to use
> a.foo() instead. And those advanced users suddenly lose the option to
> explicitly ignore the cache unless they change their code (assuming we
> care enough to provide something like hint(useCache)). If we don't define
> the semantics carefully, our users will have to change their code again
> and again when they shouldn't have to.
> 
> 
> 3. Side effects.
> 
> Before we talk about side effects, we have to agree on the assumptions.
> The assumptions I have are the following:
> 1. We are talking about batch processing.
> 2. The source tables are immutable during one run of batch processing logic.
> 3. The cache is immutable during one run of batch processing logic.
> 
> I think assumption 2 and 3 are by definition what batch processing means,
> i.e the data must be complete before it is processed and should not change
> when the processing is running.
> 
> As far as I am aware, no batch processing system breaks those
> assumptions. Even for relational database tables, where queries can run
> with concurrent modifications, the necessary locking is still required to
> ensure the integrity of the query result.
> 
> Please let me know if you disagree with the above assumptions. If you agree
> with these assumptions, with the `CacheHandle cache()` API in my last
> email, do you still see side effects?
> 
> Thanks,
> 
> Jiangjie (Becket) Qin
> 
> 
> On Wed, Dec 12, 2018 at 7:11 PM Piotr Nowojski <pi...@data-artisans.com>
> wrote:
> 
>> Hi Becket,
>> 
>>> Regarding the chance of optimization, it might not be that rare. Some
>>> very simple statistics could already help in many cases. For example,
>>> simply maintaining the max and min of each field can already eliminate
>>> some unnecessary table scans (potentially scanning the cached table) if
>>> the result is doomed to be empty. A histogram would give even further
>>> information. The optimizer could be very careful and only ignore the
>>> cache when it is 100% sure doing that is cheaper, e.g. only when a
>>> filter on the cache will absolutely return nothing.
>> 
>> I do not see how this might be easy to achieve. It would require tons of
>> effort to make it work and in the end you would still have a problem of
>> comparing/trading CPU cycles vs IO. For example:
>> 
>> Table src1 = … // read from connector 1
>> Table src2 = … // read from connector 2
>> 
>> Table a = src1.filter(…).join(src2.filter(…), …)
>> a.cache() // write cache to connector 3
>> 
>> a.filter(…)
>> env.execute()
>> a.select(…)
>> 
>> Decision whether it’s better to:
>> A) read from connector1/connector2, filter/map and join them twice
>> B) read from connector1/connector2, filter/map and join them once, pay the
>> price of writing to connector 3 and then reading from it
>> 
>> is very far from trivial. `a` can end up much larger than `src1` and
>> `src2`, writes to connector 3 might be extremely slow, reads from
>> connector 3 can be slower compared to reads from connectors 1 & 2, … .
>> You really need extremely good statistics to correctly assess the size of
>> the output, and it would still fail many times (correlations etc.). And
>> keep in mind that at the moment we do not have ANY statistics at all.
>> More than that, it would require significantly more testing and setting
>> up some benchmarks to make sure that we do not break it with regressions.
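>> To make the trade-off concrete, the optimiser would have to compare
>> something like the following (a grossly simplified, made-up cost model):
>>
>> // cost(A) = 2 * (read(src1) + read(src2) + cost(join))  // recompute twice
>> // cost(B) = read(src1) + read(src2) + cost(join)
>> //           + write(cache) + read(cache)                // materialise once
>>
>> and today none of these terms is known, since we have no statistics.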
>> 
>> That’s why I’m strongly opposing this idea - at least let’s not start
>> with this. If we first start with completely manual/explicit caching,
>> without any magic, it would be a significant improvement for the users
>> for a fraction of the development cost. After implementing that, when we
>> already have all of the working pieces, we can start working on some
>> optimisation rules. As I wrote before, if we start with
>> 
>> `CachedTable cache()`
>> 
>> We can later work on follow-up stories to make it automatic. Even though
>> I don’t like the implicit/side-effect approach with a `void` method,
>> having an explicit `CachedTable cache()` wouldn’t even prevent us from
>> later adding a `void hintCache()` method, with the exact semantics that
>> you want.
>> 
>> On top of that, I raise again that the implicit `void
>> cache()/hintCache()` has other side effects and problems with
>> non-immutable data, and is annoying when used secretly inside methods.
>> 
>> An explicit `CachedTable cache()` just looks like a much less
>> controversial MVP, and if we decide to go further with this topic, it’s
>> not a wasted effort, but lies on a straight path to more
>> advanced/complicated solutions in the future. Are there any drawbacks of
>> starting with `CachedTable cache()` that I’m missing?
>> 
>> Piotrek
>> 
>>> On 12 Dec 2018, at 09:30, Jeff Zhang <zj...@gmail.com> wrote:
>>> 
>>> Hi Becket,
>>> 
>>> Introducing CacheHandle seems too complicated. That means users have to
>>> maintain the handle properly.
>>> 
>>> And since cache is just a hint for the optimizer, why not just return
>>> the Table itself from the cache method? This hint info should be kept in
>>> the Table, I believe.
>>> 
>>> So how about adding the methods cache and uncache to Table, with both
>>> returning Table? What cache and uncache do is just add some hint info
>>> into the Table.
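>>> A sketch of the suggested signatures (illustrative only, not an actual
>>> Flink API):
>>>
>>> Table {
>>>   Table cache();   // marks this sub-plan as cacheable; the hint travels
>>>                    // with the returned Table
>>>   Table uncache(); // removes the hint again
>>> }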
>>> 
>>> 
>>> 
>>> 
>>> Becket Qin <be...@gmail.com> wrote on Wed, Dec 12, 2018 at 11:25 AM:
>>> 
>>>> Hi Till and Piotrek,
>>>> 
>>>> Thanks for the clarification. That resolves quite a few confusions. My
>>>> understanding of how cache works is the same as what Till describes,
>>>> i.e. cache() is a hint to Flink, but it is not guaranteed that the cache
>>>> always exists, and it might be recomputed from its lineage.
>>>> 
>>>>> Is this the core of our disagreement here? That you would like this
>>>>> “cache()” to be mostly a hint for the optimiser?
>>>> 
>>>> Semantics-wise, yes. That's also why I think materialize() has a much
>>>> larger scope than cache(), and thus it should be a different method.
>>>> 
>>>> Regarding the chance of optimization, it might not be that rare. Some
>>>> very simple statistics could already help in many cases. For example,
>>>> simply maintaining the max and min of each field can already eliminate
>>>> some unnecessary table scans (potentially scanning the cached table) if
>>>> the result is doomed to be empty. A histogram would give even further
>>>> information. The optimizer could be very careful and only ignore the
>>>> cache when it is 100% sure doing that is cheaper, e.g. only when a
>>>> filter on the cache will absolutely return nothing.
>>>> 
>>>> Given the above clarification on cache, I would like to revisit the
>>>> original "void cache()" proposal and see if we can improve on top of
>> that.
>>>> 
>>>> What do you think about the following modified interface?
>>>> 
>>>> Table {
>>>>   /**
>>>>    * This call hints Flink to maintain a cache of this table and
>>>>    * leverage it for performance optimization if needed. Note that
>>>>    * Flink may still decide not to use the cache if it is cheaper
>>>>    * to do so.
>>>>    *
>>>>    * A CacheHandle will be returned to allow the user to release the
>>>>    * cache actively. The cache will be deleted if there is no
>>>>    * unreleased cache handle to it. When the TableEnvironment is
>>>>    * closed, the cache will also be deleted and all the cache handles
>>>>    * will be released.
>>>>    *
>>>>    * @return a CacheHandle referring to the cache of this table.
>>>>    */
>>>>   CacheHandle cache();
>>>> }
>>>>
>>>> CacheHandle {
>>>>   /**
>>>>    * Close the cache handle. This method does not necessarily delete
>>>>    * the cache. Instead, it simply decrements the reference counter to
>>>>    * the cache. When there is no handle referring to a cache, the
>>>>    * cache will be deleted.
>>>>    *
>>>>    * @return the number of open handles to the cache after this handle
>>>>    * has been released.
>>>>    */
>>>>   int release();
>>>> }
>>>> 
>>>> The rationale behind this interface is the following: in the vast
>>>> majority of cases, users wouldn't really care whether the cache is used
>>>> or not. So I think the most intuitive way is letting cache() return
>>>> nothing, so nobody needs to worry about the difference between
>>>> operations on CachedTables and those on the "original" tables. This
>>>> will make maybe 99.9% of the users happy. There were two concerns
>>>> raised for this approach:
>>>> 1. In some rare cases, users may want to ignore the cache.
>>>> 2. A table might be cached/uncached in a third-party function while the
>>>> caller does not know.
>>>> 
>>>> For the first issue, users can use hint("ignoreCache") to explicitly
>>>> ignore the cache.
>>>> For the second issue, the above proposal lets cache() return a
>>>> CacheHandle, whose only method is release(). Different CacheHandles
>>>> will refer to the same cache; if a cache no longer has any cache
>>>> handle, it will be deleted. This will address the following case:
>>>> {
>>>> val handle1 = a.cache()
>>>> process(a)
>>>> a.select(...) // cache is still available, handle1 has not been released.
>>>> }
>>>> 
>>>> void process(Table t) {
>>>> val handle2 = t.cache() // new handle to cache
>>>> t.select(...) // optimizer decides cache usage
>>>> t.hint("ignoreCache").select(...) // cache is ignored
>>>> handle2.release() // release the handle, but the cache may still be
>>>> available if there are other handles
>>>> ...
>>>> }
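>>>> Under the hood, the handle bookkeeping could be as simple as a
>>>> reference counter per cache (a sketch of the idea only; the names below
>>>> are made up):
>>>>
>>>> class CacheRegistry {
>>>>   private val refCounts = scala.collection.mutable.Map[String, Int]()
>>>>
>>>>   def acquire(cacheId: String): Unit =
>>>>     refCounts(cacheId) = refCounts.getOrElse(cacheId, 0) + 1
>>>>
>>>>   def release(cacheId: String): Int = {
>>>>     val remaining = refCounts(cacheId) - 1
>>>>     if (remaining == 0) {
>>>>       refCounts.remove(cacheId)
>>>>       deletePhysicalCache(cacheId) // drop the materialized data
>>>>     } else {
>>>>       refCounts(cacheId) = remaining
>>>>     }
>>>>     remaining
>>>>   }
>>>>
>>>>   // storage-specific cleanup, intentionally left abstract here
>>>>   private def deletePhysicalCache(cacheId: String): Unit = ???
>>>> }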
>>>> 
>>>> Does the above modified approach look reasonable to you?
>>>> 
>>>> Cheers,
>>>> 
>>>> Jiangjie (Becket) Qin
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> On Tue, Dec 11, 2018 at 6:44 PM Till Rohrmann <tr...@apache.org>
>>>> wrote:
>>>> 
>>>>> Hi Becket,
>>>>> 
>>>>> I was aiming at semantics similar to 1. I actually thought that
>>>>> `cache()` would tell the system to materialize the intermediate result
>>>>> so that subsequent queries don't need to reprocess it. This means that
>>>>> the usage of the cached table in this example
>>>>> 
>>>>> {
>>>>> val cachedTable = a.cache()
>>>>> val b1 = cachedTable.select(…)
>>>>> val b2 = cachedTable.foo().select(…)
>>>>> val b3 = cachedTable.bar().select(...)
>>>>> val c1 = a.select(…)
>>>>> val c2 = a.foo().select(…)
>>>>> val c3 = a.bar().select(...)
>>>>> }
>>>>> 
>>>>> strongly depends on interleaved calls which trigger the execution of
>>>>> sub-queries. So for example, if there is only a single env.execute call
>>>>> at the end of the block, then b1, b2, b3, c1, c2 and c3 would all be
>>>>> computed by reading directly from the sources (given that there is
>>>>> only a single JobGraph). It just happens that the result of `a` will
>>>>> be cached such that we skip the processing of `a` when there are
>>>>> subsequent queries reading from `cachedTable`. If for some reason the
>>>>> system cannot materialize the table (e.g. running out of disk space,
>>>>> ttl expired), then it could also happen that we need to reprocess `a`.
>>>>> In that sense `cachedTable` simply is an identifier for the
>>>>> materialized result of `a`, with the lineage of how to reprocess it.
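>>>>> In other words, a read of `cachedTable` could conceptually fall back
>>>>> to the lineage (a sketch of the idea; `cacheStore`, `id` and `lineage`
>>>>> are illustrative names, not proposed API):
>>>>>
>>>>> def read(cached: CachedTable): Table =
>>>>>   if (cacheStore.isAvailable(cached.id))
>>>>>     cacheStore.scan(cached.id) // serve from the materialized result
>>>>>   else
>>>>>     cached.lineage             // recompute from the original plan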
>>>>> 
>>>>> Cheers,
>>>>> Till
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> On Tue, Dec 11, 2018 at 11:01 AM Piotr Nowojski
>>>>> <piotr@data-artisans.com> wrote:
>>>>> 
>>>>>> Hi Becket,
>>>>>> 
>>>>>>> {
>>>>>>> val cachedTable = a.cache()
>>>>>>> val b = cachedTable.select(...)
>>>>>>> val c = a.select(...)
>>>>>>> }
>>>>>>> 
>>>>>>> Semantic 1. b uses cachedTable as the user demanded so. c uses the
>>>>>>> original DAG as the user demanded so. In this case, the optimizer
>>>>>>> has no chance to optimize.
>>>>>>> Semantic 2. b uses cachedTable as the user demanded so. c leaves the
>>>>>>> optimizer to choose whether the cache or DAG should be used. In this
>>>>>>> case, the user loses the option to NOT use the cache.
>>>>>>>
>>>>>>> As you can see, neither of the options seems perfect. However, I
>>>>>>> guess you and Till are proposing the third option:
>>>>>>>
>>>>>>> Semantic 3. b leaves the optimizer to choose whether the cache or
>>>>>>> DAG should be used. c always uses the DAG.
>>>>>> 
>>>>>> I am pretty sure that me, Till, Fabian and others were all proposing
>>>>>> and advocating in favour of semantic “1”. No cost-based optimiser
>>>>>> decisions at all.
>>>>>> 
>>>>>> {
>>>>>> val cachedTable = a.cache()
>>>>>> val b1 = cachedTable.select(…)
>>>>>> val b2 = cachedTable.foo().select(…)
>>>>>> val b3 = cachedTable.bar().select(...)
>>>>>> val c1 = a.select(…)
>>>>>> val c2 = a.foo().select(…)
>>>>>> val c3 = a.bar().select(...)
>>>>>> }
>>>>>> 
>>>>>> All b1, b2 and b3 are reading from cache, while c1, c2 and c3 are
>>>>>> re-executing the whole plan for “a”.
>>>>>> 
>>>>>> In the future we could discuss going one step further, introducing
>>>>>> some global optimisation (that can be manually enabled/disabled):
>>>>>> deduplicate plan nodes / deduplicate sub-queries / re-use sub-query
>>>>>> results / or whatever we could call it. It could do two things:
>>>>>> 
>>>>>> 1. Automatically try to deduplicate fragments of the plan and share
>>>>>> the result using CachedTable - in other words, automatically insert
>>>>>> `CachedTable cache()` calls.
>>>>>> 2. Automatically make the decision to bypass explicit `CachedTable`
>>>>>> access (this would be the equivalent of what you described as
>>>>>> “semantic 3”).
>>>>>> 
>>>>>> However, as I wrote previously, I have big doubts whether such
>>>>>> cost-based optimisation would work (this also applies to “Semantic
>>>>>> 2”). I would expect it to do more harm than good in so many cases
>>>>>> that it wouldn’t make sense. Even assuming that we calculate
>>>>>> statistics perfectly (this ain’t gonna happen), it’s virtually
>>>>>> impossible to correctly estimate the exchange rate of CPU cycles vs
>>>>>> IO operations, as it changes so much from deployment to deployment.
>>>>>> 
>>>>>> Is this the core of our disagreement here? That you would like this
>>>>>> “cache()” to be mostly a hint for the optimiser?
>>>>>> 
>>>>>> Piotrek
>>>>>> 
>>>>>>> On 11 Dec 2018, at 06:00, Becket Qin <be...@gmail.com> wrote:
>>>>>>> 
>>>>>>> Another potential concern for semantic 3 is that, in the future, we
>>>>>>> may add automatic caching to Flink, e.g. caching the intermediate
>>>>>>> results at the shuffle boundary. If our semantics are that a
>>>>>>> reference to the original table means skipping the cache, those
>>>>>>> users may not be able to benefit from the implicit cache.
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> On Tue, Dec 11, 2018 at 12:10 PM Becket Qin <be...@gmail.com>
>>>>>> wrote:
>>>>>>> 
>>>>>>>> Hi Piotrek,
>>>>>>>> 
>>>>>>>> Thanks for the reply. Thinking about it again, I might have
>>>>>>>> misunderstood your proposal in earlier emails. Returning a
>>>>>>>> CachedTable might not be a bad idea.
>>>>>>>>
>>>>>>>> I was more concerned about the semantics and their intuitiveness
>>>>>>>> when a CachedTable is returned, i.e., if cache() returns a
>>>>>>>> CachedTable, what are the semantics of the following code:
>>>>>>>> {
>>>>>>>> val cachedTable = a.cache()
>>>>>>>> val b = cachedTable.select(...)
>>>>>>>> val c = a.select(...)
>>>>>>>> }
>>>>>>>> What is the difference between b and c? At first glance, I see two
>>>>>>>> options:
>>>>>>>>
>>>>>>>> Semantic 1. b uses cachedTable as the user demanded so. c uses the
>>>>>>>> original DAG as the user demanded so. In this case, the optimizer
>>>>>>>> has no chance to optimize.
>>>>>>>> Semantic 2. b uses cachedTable as the user demanded so. c leaves
>>>>>>>> the optimizer to choose whether the cache or DAG should be used.
>>>>>>>> In this case, the user loses the option to NOT use the cache.
>>>>>>>>
>>>>>>>> As you can see, neither of the options seems perfect. However, I
>>>>>>>> guess you and Till are proposing the third option:
>>>>>>>>
>>>>>>>> Semantic 3. b leaves the optimizer to choose whether the cache or
>>>>>>>> DAG should be used. c always uses the DAG.
>>>>>>>>
>>>>>>>> This does address all the concerns. It is just that, from an
>>>>>>>> intuitiveness perspective, I found that asking the user to
>>>>>>>> explicitly use a CachedTable while the optimizer might choose to
>>>>>>>> ignore it is a little weird. That was why I did not think about
>>>>>>>> that semantic. But given there is material benefit, I think this
>>>>>>>> semantic is acceptable.
>>>>>>>>
>>>>>>>>> 1. If we want to let the optimiser make decisions about whether to
>>>>>>>>> use the cache or not, then why do we need the “void cache()”
>>>>>>>>> method at all? Would it “increase” the chance of using the cache?
>>>>>>>>> That sounds strange. What would be the mechanism for deciding
>>>>>>>>> whether to use the cache or not? If we want to introduce such
>>>>>>>>> kind of automated optimisations of “plan node deduplication”, I
>>>>>>>>> would turn it on globally, not per table, and let the optimiser
>>>>>>>>> do all of the work.
>>>>>>>>> 2. We do not have statistics at the moment for any use/not-use
>>>>>>>>> cache decision.
>>>>>>>>> 3. Even if we had, I would be veeerryy sceptical whether such
>>>>>>>>> cost-based optimisations would work properly and I would still
>>>>>>>>> insist first on providing an explicit caching mechanism
>>>>>>>>> (`CachedTable cache()`)
>>>>>>>>>
>>>>>>>> We are absolutely on the same page here. An explicit cache() method
>>>>>>>> is necessary not only because the optimizer may not be able to make
>>>>>>>> the right decision, but also because of the nature of interactive
>>>>>>>> programming. For example, if users write the following code in the
>>>>>>>> Scala shell:
>>>>>>>> val b = a.select(...)
>>>>>>>> val c = b.select(...)
>>>>>>>> val d = c.select(...).writeToSink(...)
>>>>>>>> tEnv.execute()
>>>>>>>> There is no way the optimizer will know whether b or c will be used
>>>>>>>> in later code, unless users hint explicitly.
>>>>>>>>
>>>>>>>>> At the same time I’m not sure if you have responded to our
>>>>>>>>> objections of `void cache()` being implicit/having side effects,
>>>>>>>>> which me, Jark, Fabian, Till and I think also Shaoxuan are
>>>>>>>>> supporting.
>>>>>>>>
>>>>>>>> Are there any other side effects if we use semantic 3 mentioned
>>>>>>>> above?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> Jiangjie (Becket) Qin
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Mon, Dec 10, 2018 at 7:54 PM Piotr Nowojski
>>>>>>>> <piotr@data-artisans.com> wrote:
>>>>>>>> 
>>>>>>>>> Hi Becket,
>>>>>>>>> 
>>>>>>>>> Sorry for not responding for a long time.
>>>>>>>>>
>>>>>>>>> Regarding case 1:
>>>>>>>>>
>>>>>>>>> There wouldn’t be an “a.unCache()” method; I would expect only
>>>>>>>>> `cachedTableA1.dropCache()`. Dropping `cachedTableA1` wouldn’t
>>>>>>>>> affect `cachedTableA2`. Just as in any other database,
>>>>>>>>> dropping/modifying one independent table/materialised view does
>>>>>>>>> not affect others.
>>>>>>>>>
>>>>>>>>>> What I meant is that assuming there is already a cached table,
>>>>>>>>>> ideally users need not specify whether the next query should read
>>>>>>>>>> from the cache or use the original DAG. This should be decided by
>>>>>>>>>> the optimizer.
>>>>>>>>>
>>>>>>>>> 1. If we want to let the optimiser make decisions about whether to
>>>>>>>>> use the cache or not, then why do we need the “void cache()”
>>>>>>>>> method at all? Would it “increase” the chance of using the cache?
>>>>>>>>> That sounds strange. What would be the mechanism for deciding
>>>>>>>>> whether to use the cache or not? If we want to introduce such kind
>>>>>>>>> of automated optimisations of “plan node deduplication”, I would
>>>>>>>>> turn it on globally, not per table, and let the optimiser do all
>>>>>>>>> of the work.
>>>>>>>>> 2. We do not have statistics at the moment for any use/not-use
>>>>>>>>> cache decision.
>>>>>>>>> 3. Even if we had, I would be veeerryy sceptical whether such
>>>>>>>>> cost-based optimisations would work properly and I would still
>>>>>>>>> insist first on providing an explicit caching mechanism
>>>>>>>>> (`CachedTable cache()`)
>>>>>>>>> 4. As Till wrote, having an explicit `CachedTable cache()` doesn’t
>>>>>>>>> contradict future work on automated cost-based caching.
>>>>>>>>>
>>>>>>>>> At the same time I’m not sure if you have responded to our
>>>>>>>>> objections of `void cache()` being implicit/having side effects,
>>>>>>>>> which me, Jark, Fabian, Till and I think also Shaoxuan are
>>>>>>>>> supporting.
>>>>>>>>>
>>>>>>>>> Piotrek
>>>>>>>>> 
>>>>>>>>>> On 5 Dec 2018, at 12:42, Becket Qin <be...@gmail.com> wrote:
>>>>>>>>>> 
>>>>>>>>>> Hi Till,
>>>>>>>>>> 
>>>>>>>>>> It is true that after the first job submission, there will be no
>>>>>>>>>> ambiguity in terms of whether a cached table is used or not. That
>>>>>>>>>> is the same for the cache() without returning a CachedTable.
>>>>>>>>>>
>>>>>>>>>>> Conceptually one could think of cache() as introducing a caching
>>>>>>>>>>> operator which you need to consume from if you want to benefit
>>>>>>>>>>> from the caching functionality.
>>>>>>>>>>
>>>>>>>>>> I am thinking a little differently. I think it is a hint (as you
>>>>>>>>>> mentioned later) instead of a new operator. I’d like to be
>>>>>>>>>> careful about the semantics of the API. A hint is a property set
>>>>>>>>>> on an existing operator, but it is not itself an operator, as it
>>>>>>>>>> does not really manipulate the data.
>>>>>>>>>>
>>>>>>>>>>> I agree, ideally the optimizer makes this kind of decision about
>>>>>>>>>>> which intermediate result should be cached. But especially when
>>>>>>>>>>> executing ad-hoc queries the user might better know which
>>>>>>>>>>> results need to be cached, because Flink might not see the full
>>>>>>>>>>> DAG. In that sense, I would consider the cache() method as a
>>>>>>>>>>> hint for the optimizer. Of course, in the future we might add
>>>>>>>>>>> functionality which tries to automatically cache results (e.g.
>>>>>>>>>>> caching the latest intermediate results until so and so much
>>>>>>>>>>> space is used). But this should hopefully not contradict with
>>>>>>>>>>> `CachedTable cache()`.
>>>>>>>>>>
>>>>>>>>>> I agree that the cache() method is needed for exactly the reason
>>>>>>>>>> you mentioned, i.e. Flink cannot predict what users are going to
>>>>>>>>>> write later, so users need to tell Flink explicitly that this
>>>>>>>>>> table will be used later. What I meant is that assuming there is
>>>>>>>>>> already a cached table, ideally users need not specify whether
>>>>>>>>>> the next query should read from the cache or use the original
>>>>>>>>>> DAG. This should be decided by the optimizer.
>>>>>>>>>>
>>>>>>>>>> To explain the difference between returning / not returning a
>>>>>>>>>> CachedTable, I want to compare the following two cases:
>>>>>>>>>>
>>>>>>>>>> *Case 1: returning a CachedTable*
>>>>>>>>>> b = a.map(...)
>>>>>>>>>> val cachedTableA1 = a.cache()
>>>>>>>>>> val cachedTableA2 = a.cache()
>>>>>>>>>> b.print() // Just to make sure a is cached.
>>>>>>>>>>
>>>>>>>>>> c = a.filter(...) // Does the user specify that the original DAG
>>>>>>>>>> is used? Or does the optimizer decide whether the DAG or the
>>>>>>>>>> cache should be used?
>>>>>>>>>> d = cachedTableA1.filter() // The user specifies that the cached
>>>>>>>>>> table is used.
>>>>>>>>>>
>>>>>>>>>> a.unCache() // Can cachedTableA still be used afterwards?
>>>>>>>>>> cachedTableA1.uncache() // Can cachedTableA2 still be used?
>>>>>>>>>>
>>>>>>>>>> *Case 2: not returning a CachedTable*
>>>>>>>>>> b = a.map()
>>>>>>>>>> a.cache()
>>>>>>>>>> a.cache() // no-op
>>>>>>>>>> b.print() // Just to make sure a is cached
>>>>>>>>>>
>>>>>>>>>> c = a.filter(...) // Optimizer decides whether the cache or DAG
>>>>>>>>>> should be used
>>>>>>>>>> d = a.filter(...) // Optimizer decides whether the cache or DAG
>>>>>>>>>> should be used
>>>>>>>>>>
>>>>>>>>>> a.unCache()
>>>>>>>>>> a.unCache() // no-op
>>>>>>>>>>
>>>>>>>>>> In case 1, semantics-wise, the optimizer loses the option to
>>>>>>>>>> choose between the DAG and the cache, and the unCache() call
>>>>>>>>>> becomes tricky.
>>>>>>>>>> In case 2, users do not need to worry about whether the cache or
>>>>>>>>>> the DAG is used, and the unCache() semantics are clear. However,
>>>>>>>>>> the caveat is that users cannot explicitly ignore the cache.
>>>>>>>>>>
>>>>>>>>>> In order to address the issues mentioned in case 2, and inspired
>>>>>>>>>> by the discussion so far, I am thinking about using a hint to
>>>>>>>>>> allow the user to explicitly ignore the cache. Although we do not
>>>>>>>>>> have hints yet, we probably should have one. So the code becomes:
>>>>>>>>>>
>>>>>>>>>> *Case 3: returning this table*
>>>>>>>>>> b = a.map()
>>>>>>>>>> a.cache()
>>>>>>>>>> a.cache() // no-op
>>>>>>>>>> b.print() // Just to make sure a is cached
>>>>>>>>>>
>>>>>>>>>> c = a.filter(...) // Optimizer decides whether the cache or DAG
>>>>>>>>>> should be used
>>>>>>>>>> d = a.hint("ignoreCache").filter(...) // DAG will be used instead
>>>>>>>>>> of the cache.
>>>>>>>>>>
>>>>>>>>>> a.unCache()
>>>>>>>>>> a.unCache() // no-op
>>>>>>>>>>
>>>>>>>>>> We could also let cache() return this table to allow chained
>>>>>>>>>> method calls.
>>>>>>>>>> Do you think this API addresses the concerns?
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>>
>>>>>>>>>> Jiangjie (Becket) Qin
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> On Wed, Dec 5, 2018 at 10:55 AM Jark Wu <im...@gmail.com> wrote:
>>>>>>>>>> 
>>>>>>>>>>> Hi,
>>>>>>>>>>> 
>>>>>>>>>>> All the recent discussions are focused on whether there is a
>>>>>>>>>>> problem if cache() does not return a Table.
>>>>>>>>>>> It seems that returning a Table explicitly is clearer (and
>>>>>>>>>>> safer?).
>>>>>>>>>>>
>>>>>>>>>>> So are there any problems if cache() returns a Table? @Becket
>>>>>>>>>>> 
>>>>>>>>>>> Best,
>>>>>>>>>>> Jark
>>>>>>>>>>> 
>>>>>>>>>>> On Tue, 4 Dec 2018 at 22:27, Till Rohrmann
>>>>>>>>>>> <trohrmann@apache.org> wrote:
>>>>>>>>>>> 
>>>>>>>>>>>> It's true that b, c, d and e will all read from the original
>>>>>>>>>>>> DAG that generates a. But all subsequent operators (when
>>>>>>>>>>>> running multiple queries) which reference cachedTableA should
>>>>>>>>>>>> not need to reproduce `a` but can directly consume the
>>>>>>>>>>>> intermediate result.
>>>>>>>>>>>>
>>>>>>>>>>>> Conceptually one could think of cache() as introducing a
>>>>>>>>>>>> caching operator which you need to consume from if you want to
>>>>>>>>>>>> benefit from the caching functionality.
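>>>>>>>>>>>> Roughly, the mental model is a plan rewrite like this (a
>>>>>>>>>>>> sketch only; the operator names are illustrative):
>>>>>>>>>>>>
>>>>>>>>>>>> // before: a -> consumers
>>>>>>>>>>>> // after:  a -> CacheWrite(id)        // first execution materializes a
>>>>>>>>>>>> //         CacheRead(id) -> consumers // later queries consume the cache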
>>>>>>>>>>>> 
>>>>>>>>>>>> I agree, ideally the optimizer makes this kind of decision
>>>>>>>>>>>> about which intermediate result should be cached. But
>>>>>>>>>>>> especially when executing ad-hoc queries the user might better
>>>>>>>>>>>> know which results need to be cached, because Flink might not
>>>>>>>>>>>> see the full DAG. In that sense, I would consider the cache()
>>>>>>>>>>>> method as a hint for the optimizer. Of course, in the future we
>>>>>>>>>>>> might add functionality which tries to automatically cache
>>>>>>>>>>>> results (e.g. caching the latest intermediate results until so
>>>>>>>>>>>> and so much space is used). But this should hopefully not
>>>>>>>>>>>> contradict with `CachedTable cache()`.
>>>>>>>>>>>> 
>>>>>>>>>>>> Cheers,
>>>>>>>>>>>> Till
>>>>>>>>>>>> 
>>>>>>>>>>>> On Tue, Dec 4, 2018 at 2:33 PM Becket Qin
>>>>>>>>>>>> <becket.qin@gmail.com> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>>> Hi Till,
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Thanks for the clarification. I am still a little confused.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> If cache() returns a CachedTable, the example might become:
>>>>>>>>>>>>>
>>>>>>>>>>>>> b = a.map(...)
>>>>>>>>>>>>> c = a.map(...)
>>>>>>>>>>>>>
>>>>>>>>>>>>> cachedTableA = a.cache()
>>>>>>>>>>>>> d = cachedTableA.map(...)
>>>>>>>>>>>>> e = a.map()
>>>>>>>>>>>>>
>>>>>>>>>>>>> In the above case, if cache() is lazily evaluated, b, c, d and
>>>>>>>>>>>>> e are all going to be reading from the original DAG that
>>>>>>>>>>>>> generates a. But with a naive expectation, d should be reading
>>>>>>>>>>>>> from the cache. This seems not to solve the potential
>>>>>>>>>>>>> confusion you raised, right?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Just to be clear, my understanding is all based on the
>>>>>>>>>>>>> assumption that the tables are immutable. Therefore, after
>>>>>>>>>>>>> a.cache(), the *cachedTableA* and the original table *a*
>>>>>>>>>>>>> should be completely interchangeable.
>>>>>>>>>>>>>
>>>>>>>>>>>>> That said, I think a valid argument is optimization. There are
>>>>>>>>>>>>> indeed cases where reading from the original DAG could be
>>>>>>>>>>>>> faster than reading from the cache. For example, in the
>>>>>>>>>>>>> following example:
>>>>>>>>>>>>>
>>>>>>>>>>>>> a.filter('f1 > 100)
>>>>>>>>>>>>> a.cache()
>>>>>>>>>>>>> b = a.filter('f1 < 100)
>>>>>>>>>>>>>
>>>>>>>>>>>>> Ideally the optimizer should be intelligent enough to decide
>>>>>>>>>>>>> which way is faster, without user intervention. In this case,
>>>>>>>>>>>>> it will identify that b would just be an empty table, and thus
>>>>>>>>>>>>> skip reading from the cache completely.
>>>>>>>>>>>>> But I agree that returning a CachedTable would give the user
>>>>>>>>>>>>> control of when to use the cache, even though I still feel
>>>>>>>>>>>>> that letting the optimizer handle this is a better option in
>>>>>>>>>>>>> the long run.
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Jiangjie (Becket) Qin
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On Tue, Dec 4, 2018 at 6:51 PM Till Rohrmann
>>>>>>>>>>>>> <trohrmann@apache.org> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Yes you are right Becket that it still depends on the actual
>>>>>>>>>>>>>> execution of the job whether a consumer reads from a cached
>>>>>>>>>>>>>> result or not.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> My point was actually about the properties of a (cached vs.
>>>>>>>>>>>>>> non-cached) and not about the execution. I would not make
>>>>>>>>>>>>>> cache trigger the execution of the job because one loses some
>>>>>>>>>>>>>> flexibility by eagerly triggering the execution.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I tried to argue for an explicit CachedTable which is
>>>>>>>>>>>>>> returned by the cache() method, like Piotr did, in order to
>>>>>>>>>>>>>> make the API more explicit.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>> Till
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On Mon, Dec 3, 2018 at 4:23 PM Becket Qin
>>>>>>>>>>>>>> <becket.qin@gmail.com> wrote:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Hi Till,
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> That is a good example. Just a minor correction: in this
>>>>>>>>>>>>>>> case, b, c and d will all consume from a non-cached a. This
>>>>>>>>>>>>>>> is because the cache will only be created on the very first
>>>>>>>>>>>>>>> job submission that generates the table to be cached.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> If I understand correctly, this example is about whether the
>>>>>>>>>>>>>>> .cache() method should be eagerly evaluated or lazily
>>>>>>>>>>>>>>> evaluated. In other words, if the cache() method actually
>>>>>>>>>>>>>>> triggers a job that creates the cache, there will be no such
>>>>>>>>>>>>>>> confusion. Is that right?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> In the example, although d will not consume from the cached
>>>>>>>>>>>>>>> Table while it looks supposed to, from a correctness
>>>>>>>>>>>>>>> perspective the code will still return the correct result,
>>>>>>>>>>>>>>> assuming that tables are immutable.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Personally I feel it is OK because users probably won't
>>>>>>>>>>>>>>> really worry about whether the table is cached or not. And a
>>>>>>>>>>>>>>> lazy cache could avoid some unnecessary caching if a cached
>>>>>>>>>>>>>>> table is never created in the user application. But I am not
>>>>>>>>>>>>>>> opposed to doing eager evaluation of the cache.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Jiangjie (Becket) Qin
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> On Mon, Dec 3, 2018 at 10:01 PM Till Rohrmann
>>>>>>>>>>>>>>> <trohrmann@apache.org> wrote:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Another argument for Piotr's point is that lazily changing
>>>>>>>>>>>>>>>> the properties of a node affects all downstream consumers
>>>>>>>>>>>>>>>> but does not necessarily have to happen before these
>>>>>>>>>>>>>>>> consumers are defined. From a user's perspective this can
>>>>>>>>>>>>>>>> be quite confusing:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> b = a.map(...)
>>>>>>>>>>>>>>>> c = a.map(...)
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> a.cache()
>>>>>>>>>>>>>>>> d = a.map(...)
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> now b, c and d will consume from a cached operator. In this
>>>>>>>>>>>>>>>> case, the user would most likely expect that only d reads
>>>>>>>>>>>>>>>> from a cached result.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>>> Till
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> On Mon, Dec 3, 2018 at 11:32 AM Piotr Nowojski
>>>>>>>>>>>>>>>> <piotr@data-artisans.com> wrote:
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Hey Shaoxuan and Becket,
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Can you explain a bit more on what the side effects are?
>>>>>>>>>>>>>>>>>> So far my understanding is that such side effects only
>>>>>>>>>>>>>>>>>> exist if a table is mutable. Is that the case?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Not only that. There are also performance implications and
>>>>>>>>>>>>>>>>> those are another implicit side effect of using `void
>>>>>>>>>>>>>>>>> cache()`. As I wrote before, reading from the cache might
>>>>>>>>>>>>>>>>> not always be desirable, thus it can cause performance
>>>>>>>>>>>>>>>>> degradation and I’m fine with that - the user's or
>>>>>>>>>>>>>>>>> optimiser’s choice. What I do not like is that this
>>>>>>>>>>>>>>>>> implicit side effect can manifest in a completely
>>>>>>>>>>>>>>>>> different part of the code that wasn’t touched by a user
>>>>>>>>>>>>>>>>> while he was adding the `void cache()` call somewhere
>>>>>>>>>>>>>>>>> else. And even if caching improves performance, it’s still
>>>>>>>>>>>>>>>>> a side effect of `void cache()`. Almost by definition,
>>>>>>>>>>>>>>>>> `void` methods have only side effects. As I wrote before,
>>>>>>>>>>>>>>>>> there are a couple of scenarios where this might be
>>>>>>>>>>>>>>>>> undesirable and/or unexpected, for example:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> 1.
>>>>>>>>>>>>>>>>> Table b = …;
>>>>>>>>>>>>>>>>> b.cache()
>>>>>>>>>>>>>>>>> x = b.join(…)
>>>>>>>>>>>>>>>>> y = b.count()
>>>>>>>>>>>>>>>>> // ...
>>>>>>>>>>>>>>>>> // 100
>>>>>>>>>>>>>>>>> // hundred
>>>>>>>>>>>>>>>>> // lines
>>>>>>>>>>>>>>>>> // of
>>>>>>>>>>>>>>>>> // code
>>>>>>>>>>>>>>>>> // later
>>>>>>>>>>>>>>>>> z = b.filter(…).groupBy(…) // this might be even hidden in
>>>>>>>>>>>>>>>>> a different method/file/package/dependency
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> 2.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Table b = ...
>>>>>>>>>>>>>>>>> If (some_condition) {
>>>>>>>>>>>>>>>>> foo(b)
>>>>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>>>> Else {
>>>>>>>>>>>>>>>>> bar(b)
>>>>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>>>> z = b.filter(…).groupBy(…)
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Void foo(Table b) {
>>>>>>>>>>>>>>>>> b.cache()
>>>>>>>>>>>>>>>>> // do something with b
>>>>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> In both of the above examples, `b.cache()` will implicitly
>>>>>>>>>>>>>>>>> affect `z = b.filter(…).groupBy(…)` (both the semantics of
>>>>>>>>>>>>>>>>> the program, in case of mutable sources, and its
>>>>>>>>>>>>>>>>> performance), which might be far from obvious.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On top of that, there is still this argument of mine that
>>>>>>>>>>>>>>>>> having a `MaterializedTable` or `CachedTable` handle is
>>>>>>>>>>>>>>>>> more flexible for us in the future and for the user (as a
>>>>>>>>>>>>>>>>> manual option to bypass cache reads).
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> But Jiangjie is correct,
>>>>>>>>>>>>>>>>>> the source table in batching should be immutable. It is
>>>>>>>>>>>>>>>>>> the user’s responsibility to ensure it, otherwise even a
>>>>>>>>>>>>>>>>>> regular failover may lead to inconsistent results.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Yes, I agree that’s what a perfect world/good deployment
>>>>>>>>>>>>>>>>> should be. But it often isn’t, and while I’m not trying to
>>>>>>>>>>>>>>>>> fix this (since the proper fix is to support
>>>>>>>>>>>>>>>>> transactions), I’m just trying to minimise confusion for
>>>>>>>>>>>>>>>>> the users that are not fully aware of what’s going on and
>>>>>>>>>>>>>>>>> operate in a less than perfect setup. And if something
>>>>>>>>>>>>>>>>> bites them after adding a `b.cache()` call, I want to make
>>>>>>>>>>>>>>>>> sure that they at least know all of the places that adding
>>>>>>>>>>>>>>>>> this line can affect.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Thanks, Piotrek
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> On 1 Dec 2018, at 15:39, Becket Qin
>>>>>>>>>>>>>>>>>> <becket.qin@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Hi Piotrek,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Thanks again for the clarification. Some more replies are
>>>>>>>>>>>>>>>>>> following.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> But keep in mind that `.cache()` will/might not only be
>>>>>>>>>>>>>>>>>>> used in interactive programming and not only in
>>>>>>>>>>>>>>>>>>> batching.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> It is true. Actually in stream processing, cache() has
>>>>>>>>>>>>>>>>>> the same semantics as in batch processing. The semantics
>>>>>>>>>>>>>>>>>> are the following:
>>>>>>>>>>>>>>>>>> for a table created via a series of computations, save
>>>>>>>>>>>>>>>>>> that table for later reference, to avoid running the
>>>>>>>>>>>>>>>>>> computation logic to regenerate the table. Once the
>>>>>>>>>>>>>>>>>> application exits, drop all the cache.
>>>>>>>>>>>>>>>>>> This semantic is the same for both batch and stream
>>>>>>>>>>>>>>>>>> processing. The difference is that stream applications
>>>>>>>>>>>>>>>>>> will only run once, as they are long running. And the
>>>>>>>>>>>>>>>>>> batch applications may be run multiple times, hence the
>>>>>>>>>>>>>>>>>> cache may be created and dropped each time the
>>>>>>>>>>>>>>>>>> application runs.
>>>>>>>>>>>>>>>>>> Admittedly, there will probably be some resource
>>>>>>>>>>>>>>>>>> management requirements for the streaming cached table,
>>>>>>>>>>>>>>>>>> such as time-based / size-based retention, to address the
>>>>>>>>>>>>>>>>>> infinite data issue. But such requirements do not change
>>>>>>>>>>>>>>>>>> the semantics.
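>>>>>>>>>>>>>>>>>> Such retention could, for instance, look like this
>>>>>>>>>>>>>>>>>> (purely a sketch; `withRetention` and `withMaxSize` are
>>>>>>>>>>>>>>>>>> made-up methods, not part of this proposal):
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> a.cache() // same semantics as in batch
>>>>>>>>>>>>>>>>>> // hypothetical knobs to bound the cached state in
>>>>>>>>>>>>>>>>>> // streaming:
>>>>>>>>>>>>>>>>>> //   .withRetention(Time.hours(1)) // time-based eviction
>>>>>>>>>>>>>>>>>> //   .withMaxSize(maxBytes)        // size-based eviction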
>>>>>>>>>>>>>>>>>> You are right that interactive programming is just one
>>>>>>>>>>>>>>>>>> use case of cache(). It is not the only use case.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> For me the more important issue is of not having the
>>>>>>>>>>>>>>>>>>> `void cache()` with side effects.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> This is indeed the key point. The argument around whether
>>>>>>>>>>>>>>>>>> cache() should return something already indicates that
>>>>>>>>>>>>>>>>>> cache() and materialize() address different issues.
>>>>>>>>>>>>>>>>>> Can you explain a bit more on what the side effects are?
>>>>>>>>>>>>>>>>>> So far my understanding is that such side effects only
>>>>>>>>>>>>>>>>>> exist if a table is mutable. Is that the case?
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I don’t know, probably initially we should make
>>>>>>>>>>>>>>>>>>> CachedTable read-only. I don’t find it more confusing
>>>>>>>>>>>>>>>>>>> than the fact that user can not write to views or
>>>>>>>>>>>>>>>>>>> materialised views in SQL or that user currently can not
>>>>>>>>>>>>>>>>>>> write to a Table.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I don’t think anyone should insert something into a
>>>>>>>>>>>>>>>>>> cache. By definition the cache should only be updated
>>>>>>>>>>>>>>>>>> when the corresponding original table is updated. What I
>>>>>>>>>>>>>>>>>> am wondering is that given the following two facts:
>>>>>>>>>>>>>>>>>> 1. If and only if a table is mutable (with something like
>>>>>>>>>>>>>>>>>> insert()), a CachedTable may have implicit behavior.
>>>>>>>>>>>>>>>>>> 2. A CachedTable extends a Table.
>>>>>>>>>>>>>>>>>> We can come to the conclusion that a CachedTable is
>>>>>>>>>>>>>>>>>> mutable and users can insert into the CachedTable
>>>>>>>>>>>>>>>>>> directly. This is what I thought was confusing.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Jiangjie (Becket) Qin
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Sat, Dec 1, 2018 at 2:45 AM Piotr Nowojski
>>>>>>>>>>>>>>>>>> <piotr@data-artisans.com> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Regarding naming `cache()` vs `materialize()`. One more
>>>>>>>>>>>>>>>>>>> explanation why I think `materialize()` is more natural
>>>>>>>>>>>>>>>>>>> to me is that I think of all “Table”s in the Table API
>>>>>>>>>>>>>>>>>>> as views. They behave the same way as SQL views; the
>>>>>>>>>>>>>>>>>>> only difference for me is that their live scope is short
>>>>>>>>>>>>>>>>>>> - the current session - which is limited by a different
>>>>>>>>>>>>>>>>>>> execution model. That’s why “caching” a view for me is
>>>>>>>>>>>>>>>>>>> just materialising it.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> However I see and I understand your point of view.
>>>>>>>>>>>>>>>>>>> Coming from DataSet/DataStream and generally speaking
>>>>>>>>>>>>>>>>>>> the non-SQL world, `cache()` is more natural. But keep
>>>>>>>>>>>>>>>>>>> in mind that `.cache()` will/might not only be used in
>>>>>>>>>>>>>>>>>>> interactive programming and not only in batching. But
>>>>>>>>>>>>>>>>>>> naming is one issue, and not that critical to me.
>>>>>>>>>>>>>>>>>>> Especially that once we implement proper materialised
>>>>>>>>>>>>>>>>>>> views, we can always deprecate/rename `cache()` if we
>>>>>>>>>>>>>>>>>>> deem so.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> For me the more important issue is of not having the
>>>>>>>>>>>>>>>>>>> `void cache()` with side effects. Exactly for the
>>>>>>>>>>>>>>>>>>> reasons that you have mentioned. True: results might be
>>>>>>>>>>>>>>>>>>> non-deterministic if the underlying source tables are
>>>>>>>>>>>>>>>>>>> changing. The problem is that `void cache()` implicitly
>>>>>>>>>>>>>>>>>>> changes the semantics of subsequent uses of the
>>>>>>>>>>>>>>>>>>> cached/materialized Table. It can
>>>>>>>>>>> cause
>>>>>>>>>>>>>> “wtf”
>>>>>>>>>>>>>>>>> moment
>>>>>>>>>>>>>>>>>>> for a user if he inserts “b.cache()” call in some place
>>>> in
>>>>>>>>>>> his
>>>>>>>>>>>>>> code
>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>>> suddenly some other random places are behaving
>>>> differently.
>>>>>>>>>>> If
>>>>>>>>>>>>>>>>>>> `materialize()` or `cache()` returns a Table handle, we
>>>>>>>>>>> force
>>>>>>>>>>>>> user
>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>> explicitly use the cache which removes the “random” part
>>>>>>>>>>> from
>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>> "suddenly
>>>>>>>>>>>>>>>>>>> some other random places are behaving differently”.
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> This argument and others that I’ve raised (greater
>>>>>>>>>>>>>>>> flexibility/allowing
>>>>>>>>>>>>>>>>>>> user to explicitly bypass the cache) are independent of
>>>>>>>>>>>>> `cache()`
>>>>>>>>>>>>>> vs
>>>>>>>>>>>>>>>>>>> `materialize()` discussion.
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> Does that mean one can also insert into the CachedTable?
>>>>>>>>>>> This
>>>>>>>>>>>>>>> sounds
>>>>>>>>>>>>>>>>>>> pretty confusing.
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> I don’t know, probably initially we should make
>>>> CachedTable
>>>>>>>>>>>>>>>> read-only. I
>>>>>>>>>>>>>>>>>>> don’t find it more confusing than the fact that user can
>>>>> not
>>>>>>>>>>>>> write
>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>> views
>>>>>>>>>>>>>>>>>>> or materialised views in SQL or that user currently can
>>>> not
>>>>>>>>>>>>> write
>>>>>>>>>>>>>>> to a
>>>>>>>>>>>>>>>>>>> Table.
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Piotrek
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> On 30 Nov 2018, at 17:38, Xingcan Cui <
>>>> xingcanc@gmail.com
>>>>>> 
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> I agree with @Becket that `cache()` and `materialize()`
>>>>>>>>>>>> should
>>>>>>>>>>>>> be
>>>>>>>>>>>>>>>>>>> considered as two different methods where the later one
>>>> is
>>>>>>>>>>>> more
>>>>>>>>>>>>>>>>>>> sophisticated.
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> According to my understanding, the initial idea is just
>>>> to
>>>>>>>>>>>>>>> introduce
>>>>>>>>>>>>>>>> a
>>>>>>>>>>>>>>>>>>> simple cache or persist mechanism, but as the TableAPI
>>>> is a
>>>>>>>>>>>>>>> high-level
>>>>>>>>>>>>>>>>> API,
>>>>>>>>>>>>>>>>>>> it’s naturally for as to think in a SQL way.
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> Maybe we can add the `cache()` method to the DataSet API
>>>>>>>>>>> and
>>>>>>>>>>>>>> force
>>>>>>>>>>>>>>>>> users
>>>>>>>>>>>>>>>>>>> to translate a Table to a Dataset before caching it. Then
>>>>>>>>>>> the
>>>>>>>>>>>>>> users
>>>>>>>>>>>>>>>>> should
>>>>>>>>>>>>>>>>>>> manually register the cached dataset to a table again (we
>>>>>>>>>>> may
>>>>>>>>>>>>> need
>>>>>>>>>>>>>>>> some
>>>>>>>>>>>>>>>>>>> table replacement mechanisms for datasets with an
>>>> identical
>>>>>>>>>>>>> schema
>>>>>>>>>>>>>>> but
>>>>>>>>>>>>>>>>>>> different contents here). After all, it’s the dataset
>>>>> rather
>>>>>>>>>>>>> than
>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>> dynamic table that need to be cached, right?
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>>>>>> Xingcan
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> On Nov 30, 2018, at 10:57 AM, Becket Qin <
>>>>>>>>>>>>> becket.qin@gmail.com>
>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> Hi Piotrek and Jark,
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> Thanks for the feedback and explanation. Those are good
>>>>>>>>>>>>>> arguments.
>>>>>>>>>>>>>>>>> But I
>>>>>>>>>>>>>>>>>>>>> think those arguments are mostly about materialized
>>>> view.
>>>>>>>>>>>> Let
>>>>>>>>>>>>> me
>>>>>>>>>>>>>>> try
>>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>>>> explain the reason I believe cache() and materialize()
>>>>> are
>>>>>>>>>>>>>>>> different.
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> I think cache() and materialize() have quite different
>>>>>>>>>>>>>>> implications.
>>>>>>>>>>>>>>>>> An
>>>>>>>>>>>>>>>>>>>>> analogy I can think of is save()/publish(). When users
>>>>>>>>>>> call
>>>>>>>>>>>>>>> cache(),
>>>>>>>>>>>>>>>>> it
>>>>>>>>>>>>>>>>>>> is
>>>>>>>>>>>>>>>>>>>>> just like they are saving an intermediate result as a
>>>>>>>>>>> draft
>>>>>>>>>>>> of
>>>>>>>>>>>>>>> their
>>>>>>>>>>>>>>>>>>> work,
>>>>>>>>>>>>>>>>>>>>> this intermediate result may not have any realistic
>>>>>>>>>>> meaning.
>>>>>>>>>>>>>>> Calling
>>>>>>>>>>>>>>>>>>>>> cache() does not mean users want to publish the cached
>>>>>>>>>>> table
>>>>>>>>>>>>> in
>>>>>>>>>>>>>>> any
>>>>>>>>>>>>>>>>>>> manner.
>>>>>>>>>>>>>>>>>>>>> But when users call materialize(), that means "I have
>>>>>>>>>>>>> something
>>>>>>>>>>>>>>>>>>> meaningful
>>>>>>>>>>>>>>>>>>>>> to be reused by others", now users need to think about
>>>>> the
>>>>>>>>>>>>>>>> validation,
>>>>>>>>>>>>>>>>>>>>> update & versioning, lifecycle of the result, etc.
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> Piotrek's suggestions on variations of the
>>>> materialize()
>>>>>>>>>>>>> methods
>>>>>>>>>>>>>>> are
>>>>>>>>>>>>>>>>>>> very
>>>>>>>>>>>>>>>>>>>>> useful. It would be great if Flink have them. The
>>>> concept
>>>>>>>>>>> of
>>>>>>>>>>>>>>>>>>> materialized
>>>>>>>>>>>>>>>>>>>>> view is actually a pretty big feature, not to say the
>>>>>>>>>>>> related
>>>>>>>>>>>>>>> stuff
>>>>>>>>>>>>>>>>> like
>>>>>>>>>>>>>>>>>>>>> triggers/hooks you mentioned earlier. I think the
>>>>>>>>>>>> materialized
>>>>>>>>>>>>>>> view
>>>>>>>>>>>>>>>>>>> itself
>>>>>>>>>>>>>>>>>>>>> should be discussed in a more thorough and systematic
>>>>>>>>>>>> manner.
>>>>>>>>>>>>>> And
>>>>>>>>>>>>>>> I
>>>>>>>>>>>>>>>>>>> found
>>>>>>>>>>>>>>>>>>>>> that discussion is kind of orthogonal and way beyond
>>>>>>>>>>>>> interactive
>>>>>>>>>>>>>>>>>>>>> programming experience.
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> The example you gave was interesting. I still have some
>>>>>>>>>>>>>> questions,
>>>>>>>>>>>>>>>>>>> though.
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> Table source = … // some source that scans files from a
>>>>>>>>>>>>>> directory
>>>>>>>>>>>>>>>>>>>>>> “/foo/bar/“
>>>>>>>>>>>>>>>>>>>>>> Table t1 = source.groupBy(…).select(…).where(…) ….;
>>>>>>>>>>>>>>>>>>>>>> Table t2 = t1.materialize() // (or `cache()`)
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> t2.count() // initialise cache (if it’s lazily
>>>>>>>>>>> initialised)
>>>>>>>>>>>>>>>>>>>>>> int a1 = t1.count()
>>>>>>>>>>>>>>>>>>>>>> int b1 = t2.count()
>>>>>>>>>>>>>>>>>>>>>> // something in the background (or we trigger it)
>>>> writes
>>>>>>>>>>>> new
>>>>>>>>>>>>>>> files
>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>>>>> /foo/bar
>>>>>>>>>>>>>>>>>>>>>> int a2 = t1.count()
>>>>>>>>>>>>>>>>>>>>>> int b2 = t2.count()
>>>>>>>>>>>>>>>>>>>>>> t2.refresh() // possible future extension, not to be
>>>>>>>>>>>>>> implemented
>>>>>>>>>>>>>>> in
>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>> initial version
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> what if someone else added some more files to /foo/bar
>>>> at
>>>>>>>>>>>> this
>>>>>>>>>>>>>>>> point?
>>>>>>>>>>>>>>>>> In
>>>>>>>>>>>>>>>>>>>>> that case, a3 won't equals to b3, and the result become
>>>>>>>>>>>>>>>>>>> non-deterministic,
>>>>>>>>>>>>>>>>>>>>> right?
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> int a3 = t1.count()
>>>>>>>>>>>>>>>>>>>>>> int b3 = t2.count()
>>>>>>>>>>>>>>>>>>>>>> t2.drop() // another possible future extension, manual
>>>>>>>>>>>>> “cache”
>>>>>>>>>>>>>>>>> dropping
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> When we talk about interactive programming, in most
>>>>> cases,
>>>>>>>>>>>> we
>>>>>>>>>>>>>> are
>>>>>>>>>>>>>>>>>>> talking
>>>>>>>>>>>>>>>>>>>>> about batch applications. A fundamental assumption of
>>>>> such
>>>>>>>>>>>>> case
>>>>>>>>>>>>>> is
>>>>>>>>>>>>>>>>> that
>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>> source data is complete before the data processing
>>>>> begins,
>>>>>>>>>>>> and
>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>> data
>>>>>>>>>>>>>>>>>>>>> will not change during the data processing. IMO, if
>>>>>>>>>>>> additional
>>>>>>>>>>>>>>> rows
>>>>>>>>>>>>>>>>>>> needs
>>>>>>>>>>>>>>>>>>>>> to be added to some source during the processing, it
>>>>>>>>>>> should
>>>>>>>>>>>> be
>>>>>>>>>>>>>>> done
>>>>>>>>>>>>>>>> in
>>>>>>>>>>>>>>>>>>> ways
>>>>>>>>>>>>>>>>>>>>> like union the source with another table containing the
>>>>>>>>>>> rows
>>>>>>>>>>>>> to
>>>>>>>>>>>>>> be
>>>>>>>>>>>>>>>>>>> added.
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> There are a few cases that computations are executed
>>>>>>>>>>>>> repeatedly
>>>>>>>>>>>>>> on
>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>> changing data source.
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> For example, people may run a ML training job every
>>>> hour
>>>>>>>>>>>> with
>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>> samples
>>>>>>>>>>>>>>>>>>>>> newly added in the past hour. In that case, the source
>>>>>>>>>>> data
>>>>>>>>>>>>>>> between
>>>>>>>>>>>>>>>>> will
>>>>>>>>>>>>>>>>>>>>> indeed change. But still, the data remain unchanged
>>>>> within
>>>>>>>>>>>> one
>>>>>>>>>>>>>>> run.
>>>>>>>>>>>>>>>>> And
>>>>>>>>>>>>>>>>>>>>> usually in that case, the result will need versioning,
>>>>>>>>>>> i.e.
>>>>>>>>>>>>> for
>>>>>>>>>>>>>> a
>>>>>>>>>>>>>>>>> given
>>>>>>>>>>>>>>>>>>>>> result, it tells that the result is a result from the
>>>>>>>>>>> source
>>>>>>>>>>>>>> data
>>>>>>>>>>>>>>>> by a
>>>>>>>>>>>>>>>>>>>>> certain timestamp.
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> Another example is something like data warehouse. In
>>>> this
>>>>>>>>>>>>> case,
>>>>>>>>>>>>>>>> there
>>>>>>>>>>>>>>>>>>> are a
>>>>>>>>>>>>>>>>>>>>> few source of original/raw data. On top of those
>>>> sources,
>>>>>>>>>>>> many
>>>>>>>>>>>>>>>>>>> materialized
>>>>>>>>>>>>>>>>>>>>> view / queries / reports / dashboards can be created to
>>>>>>>>>>>>> generate
>>>>>>>>>>>>>>>>> derived
>>>>>>>>>>>>>>>>>>>>> data. Those derived data needs to be updated when the
>>>>>>>>>>>>> underlying
>>>>>>>>>>>>>>>>>>> original
>>>>>>>>>>>>>>>>>>>>> data changes. In that case, the processing logic that
>>>>>>>>>>>> derives
>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>> original
>>>>>>>>>>>>>>>>>>>>> data needs to be executed repeatedly to update those
>>>>>>>>>>>>>>> reports/views.
>>>>>>>>>>>>>>>>>>> Again,
>>>>>>>>>>>>>>>>>>>>> all those derived data also need to ha


Re: [DISCUSS] Support Interactive Programming in Flink Table API

Posted by Becket Qin <be...@gmail.com>.
Hi Piotrek,

Not sure if you noticed, in my last email, I was proposing `CacheHandle
cache()` to avoid the potential side effect due to function calls.

Let's look at the disagreements in your reply one by one.


1. Optimization chances

Optimization is never trivial work. This is exactly why we should not let
users do it manually. Databases have done a huge amount of work in this
area. At Alibaba, we rely heavily on many optimization rules to boost the
SQL query performance.

In your example, if I fill in the filter conditions in a certain way, the
optimization becomes obvious.

Table src1 = … // read from connector 1
Table src2 = … // read from connector 2

Table a = src1.filter('f1 > 10).join(src2.filter('f2 < 30), 'f1 ===
'f2).as('f3, ...)
a.cache() // write cache to connector 3; when writing the records, remember
// the min and max of 'f1

a.filter('f3 > 30) // There is no need to read from any connector because
`a` does not contain any record whose 'f3 is greater than 30.
env.execute()
a.select(…)

BTW, it seems to me that adding some basic statistics is fairly
straightforward and the cost is pretty marginal, if not negligible. In fact
it is not only needed for optimization, but also for cases such as ML,
where some algorithms may need to decide their parameters based on the
statistics of the data.
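
To make the min/max idea concrete, here is a toy sketch of the pruning
check I have in mind (all names are made up for illustration, nothing here
is an existing Flink API):

// Illustrative only: skip scanning the cached table when min/max column
// statistics prove that a filter like 'f3 > 30 cannot match any record.
case class ColumnStats(min: Long, max: Long)

def filterIsProvablyEmpty(stats: Map[String, ColumnStats],
                          column: String,
                          greaterThan: Long): Boolean =
  stats.get(column).exists(_.max <= greaterThan)

// For `a` above, 10 < 'f3 < 30 holds for every record, so max('f3) <= 29.
val statsOfA = Map("f3" -> ColumnStats(min = 11, max = 29))
filterIsProvablyEmpty(statsOfA, "f3", greaterThan = 30) // true: skip scan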


2. Same API, one semantic now, another semantic later.

I am trying to understand what the semantic of the `CachedTable cache()`
you are proposing would be. IMO, we should avoid designing an API whose
semantic will be changed later. If we have a "CachedTable cache()" method,
then the semantic should be very clearly defined upfront and not change
later. It should never be "right now let's go with semantic 1, later we can
silently change it to semantic 2 or 3". Such a change could result in bad
consequences. For example, let's say we decide to go with semantic 1:

CachedTable cachedA = a.cache()
cachedA.foo() // Cache is used
a.bar() // Original DAG is used.

Now the majority of users would be using cachedA.foo() in their code. And
some advanced users will use a.bar() to explicitly skip the cache. Later
on, we add smart optimization and change the semantic to semantic 2:

CachedTable cachedA = a.cache()
cachedA.foo() // Cache is used
a.bar() // Cache MIGHT be used, and Flink may decide to skip cache if it is
faster.

Now most of the users who were writing cachedA.foo() will not benefit from
this optimization at all, unless they change their code to use a.foo()
instead. And those advanced users suddenly lose the option to explicitly
ignore cache unless they change their code (assuming we care enough to
provide something like hint(useCache)). If we don't define the semantic
carefully, our users will have to change their code again and again while
they shouldn't have to.


3. Side effects.

Before we talk about side effects, we have to agree on the assumptions. The
assumptions I have are the following:
1. We are talking about batch processing.
2. The source tables are immutable during one run of batch processing logic.
3. The cache is immutable during one run of batch processing logic.

I think assumptions 2 and 3 are by definition what batch processing means,
i.e. the data must be complete before it is processed and should not change
while the processing is running.

As far as I am aware, no batch processing system breaks those assumptions.
Even for relational database tables, where queries can run with concurrent
modifications, necessary locking is still required to ensure the integrity
of the query result.

Please let me know if you disagree with the above assumptions. If you agree
with these assumptions, with the `CacheHandle cache()` API in my last
email, do you still see side effects?

Thanks,

Jiangjie (Becket) Qin


On Wed, Dec 12, 2018 at 7:11 PM Piotr Nowojski <pi...@data-artisans.com>
wrote:

> Hi Becket,
>
> > Regarding the chance of optimization, it might not be that rare. Some
> > very simple statistics could already help in many cases. For example,
> > simply maintaining max and min of each fields can already eliminate some
> > unnecessary table scan (potentially scanning the cached table) if the
> > result is doomed to be empty. A histogram would give even further
> > information. The optimizer could be very careful and only ignores cache
> > when it is 100% sure doing that is cheaper. e.g. only when a filter on
> > the cache will absolutely return nothing.
>
> I do not see how this might be easy to achieve. It would require tons of
> effort to make it work and in the end you would still have a problem of
> comparing/trading CPU cycles vs IO. For example:
>
> Table src1 = … // read from connector 1
> Table src2 = … // read from connector 2
>
> Table a = src1.filter(…).join(src2.filter(…), …)
> a.cache() // write cache to connector 3
>
> a.filter(…)
> env.execute()
> a.select(…)
>
> The decision whether it’s better to:
> A) read from connector1/connector2, filter/map and join them twice
> B) read from connector1/connector2, filter/map and join them once, pay the
> price of writing to connector 3 and then reading from it
>
> is very far from trivial. `a` can end up much larger than `src1` and
> `src2`, writes to connector 3 might be extremely slow, reads from connector
> 3 can be slower compared to reads from connectors 1 & 2, … . You really need
> to have extremely good statistics to correctly assess the size of the output
> and it would still be failing many times (correlations etc). And keep in
> mind that at the moment we do not have ANY statistics at all. More than
> that, it would require significantly more testing and setting up some
> benchmarks to make sure that we do not break it with some regressions.
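>
> To see why, consider a toy cost comparison with completely made-up numbers
> (nothing below exists in Flink): the “right” choice flips with the
> deployment’s exchange rate of CPU cycles per IO byte, which we cannot know
> statically.
>
> case class CostEstimate(cpuCycles: Double, ioBytes: Double)
>
> def totalCost(c: CostEstimate, cyclesPerIoByte: Double): Double =
>   c.cpuCycles + c.ioBytes * cyclesPerIoByte
>
> val planA = CostEstimate(cpuCycles = 2e9, ioBytes = 2e8)   // recompute twice
> val planB = CostEstimate(cpuCycles = 1e9, ioBytes = 1.1e9) // write + read cache
> // cyclesPerIoByte = 1:  planA = 2.2e9, planB = 2.1e9  -> B wins
> // cyclesPerIoByte = 10: planA = 4.0e9, planB = 1.2e10 -> A wins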
>
> That’s why I’m strongly opposing this idea - at least let’s not start
> with this. If we first start with completely manual/explicit caching,
> without any magic, it would be a significant improvement for the users for
> a fraction of the development cost. After implementing that, when we
> already have all of the working pieces, we can start working on some
> optimisation rules. As I wrote before, if we start with
>
> `CachedTable cache()`
>
> We can later work on follow up stories to make it automatic. Despite that
> I don’t like this implicit/side effect approach with `void` method, having
> explicit `CachedTable cache()` wouldn’t even prevent as from later adding
> `void hintCache()` method, with the exact semantic that you want.
>
> On top of that I re-raise again that having an implicit `void
> cache()/hintCache()` has other side effects and problems with non-immutable
> data, and is annoying when used secretly inside methods.
>
> Explicit `CachedTable cache()` just looks like a much less controversial
> MVP and if we decide to go further with this topic, it’s not a wasted
> effort, but just lies on a straight path to more advanced/complicated
> solutions in the future. Are there any drawbacks of starting with
> `CachedTable cache()` that I’m missing?
>
> Piotrek
>
> > On 12 Dec 2018, at 09:30, Jeff Zhang <zj...@gmail.com> wrote:
> >
> > Hi Becket,
> >
> > Introducing CacheHandle seems too complicated. That means users have to
> > maintain the handle properly.
> >
> > And since cache is just a hint for the optimizer, why not just return the
> > Table itself from the cache method? This hint info should be kept in the
> > Table, I believe.
> >
> > So how about adding methods cache and uncache to Table, both returning
> > Table? Because what cache and uncache do is just add some hint info into
> > the Table, as sketched below.
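> >
> > For illustration, the signatures this would amount to (just a sketch, not
> > a worked-out design):
> >
> > trait Table {
> >   def cache(): Table    // attach the cache hint, return this table
> >   def uncache(): Table  // remove the hint, return this table
> > }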
> >
> >
> >
> >
> > Becket Qin <be...@gmail.com> wrote on Wed, Dec 12, 2018 at 11:25 AM:
> >
> >> Hi Till and Piotrek,
> >>
> >> Thanks for the clarification. That resolves quite a few confusions. My
> >> understanding of how cache works is the same as what Till described,
> >> i.e. cache() is a hint to Flink, but it is not guaranteed that the cache
> >> always exists and it might be recomputed from its lineage.
> >>
> >>> Is this the core of our disagreement here? That you would like this
> >>> “cache()” to be mostly hint for the optimiser?
> >>
> >> Semantic wise, yes. That's also why I think materialize() has a much
> >> larger scope than cache(), thus it should be a different method.
> >>
> >> Regarding the chance of optimization, it might not be that rare. Some
> >> very simple statistics could already help in many cases. For example,
> >> simply maintaining max and min of each fields can already eliminate some
> >> unnecessary table scan (potentially scanning the cached table) if the
> >> result is doomed to be empty. A histogram would give even further
> >> information. The optimizer could be very careful and only ignores cache
> >> when it is 100% sure doing that is cheaper. e.g. only when a filter on
> >> the cache will absolutely return nothing.
> >>
> >> Given the above clarification on cache, I would like to revisit the
> >> original "void cache()" proposal and see if we can improve on top of
> >> that.
> >>
> >> What do you think about the following modified interface?
> >>
> >> Table {
> >>  /**
> >>   * This call hints Flink to maintain a cache of this table and leverage
> >>   * it for performance optimization if needed.
> >>   * Note that Flink may still decide to not use the cache if doing so is
> >>   * cheaper.
> >>   *
> >>   * A CacheHandle will be returned to allow the user to release the
> >>   * cache actively. The cache will be deleted if there are no unreleased
> >>   * cache handles to it. When the TableEnvironment is closed, the cache
> >>   * will also be deleted and all the cache handles will be released.
> >>   *
> >>   * @return a CacheHandle referring to the cache of this table.
> >>   */
> >>  CacheHandle cache();
> >> }
> >>
> >> CacheHandle {
> >>  /**
> >>   * Release the cache handle. This method does not necessarily delete
> >>   * the cache. Instead, it simply decrements the reference counter to
> >>   * the cache. When there is no handle referring to a cache, the cache
> >>   * will be deleted.
> >>   *
> >>   * @return the number of open handles to the cache after this handle
> >>   * has been released.
> >>   */
> >>  int release()
> >> }
> >>
> >> The rationale behind this interface is the following:
> >> In the vast majority of cases, users wouldn't really care whether the
> >> cache is used or not. So I think the most intuitive way is letting
> >> cache() return nothing, so nobody needs to worry about the difference
> >> between operations on CachedTables and those on the "original" tables.
> >> This will make maybe 99.9% of the users happy. There were two concerns
> >> raised for this approach:
> >> 1. In some rare cases, users may want to ignore the cache,
> >> 2. A table might be cached/uncached in a third party function while the
> >> caller does not know.
> >>
> >> For the first issue, users can use hint("ignoreCache") to explicitly
> >> ignore the cache.
> >> For the second issue, the above proposal lets cache() return a
> >> CacheHandle whose only method is release(). Different CacheHandles will
> >> refer to the same cache; if a cache no longer has any cache handle, it
> >> will be deleted. This will address the following case:
> >> {
> >>  val handle1 = a.cache()
> >>  process(a)
> >>  a.select(...) // cache is still available, handle1 has not been
> >>                // released.
> >> }
> >>
> >> void process(Table t) {
> >>  val handle2 = t.cache() // new handle to the cache
> >>  t.select(...) // optimizer decides cache usage
> >>  t.hint("ignoreCache").select(...) // cache is ignored
> >>  handle2.release() // release the handle, but the cache may still be
> >>                    // available if there are other handles
> >>  ...
> >> }
> >>
> >> Does the above modified approach look reasonable to you?
> >>
> >> Cheers,
> >>
> >> Jiangjie (Becket) Qin
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >> On Tue, Dec 11, 2018 at 6:44 PM Till Rohrmann <tr...@apache.org>
> >> wrote:
> >>
> >>> Hi Becket,
> >>>
> >>> I was aiming at semantics similar to 1. I actually thought that
> >>> `cache()` would tell the system to materialize the intermediate result
> >>> so that subsequent queries don't need to reprocess it. This means that
> >>> the usage of the cached table in this example
> >>>
> >>> {
> >>> val cachedTable = a.cache()
> >>> val b1 = cachedTable.select(…)
> >>> val b2 = cachedTable.foo().select(…)
> >>> val b3 = cachedTable.bar().select(...)
> >>> val c1 = a.select(…)
> >>> val c2 = a.foo().select(…)
> >>> val c3 = a.bar().select(...)
> >>> }
> >>>
> >>> strongly depends on interleaved calls which trigger the execution of
> >>> sub queries. So for example, if there is only a single env.execute call
> >>> at the end of the block, then b1, b2, b3, c1, c2 and c3 would all be
> >>> computed by reading directly from the sources (given that there is only
> >>> a single JobGraph). It just happens that the result of `a` will be
> >>> cached such that we skip the processing of `a` when there are
> >>> subsequent queries reading from `cachedTable`. If for some reason the
> >>> system cannot materialize the table (e.g. running out of disk space,
> >>> ttl expired), then it could also happen that we need to reprocess `a`.
> >>> In that sense `cachedTable` simply is an identifier for the
> >>> materialized result of `a` with the lineage of how to reprocess it.
> >>> Cheers,
> >>> Till
> >>>
> >>>
> >>>
> >>>
> >>>
> >>> On Tue, Dec 11, 2018 at 11:01 AM Piotr Nowojski
> >>> <piotr@data-artisans.com> wrote:
> >>>
> >>>> Hi Becket,
> >>>>
> >>>>> {
> >>>>> val cachedTable = a.cache()
> >>>>> val b = cachedTable.select(...)
> >>>>> val c = a.select(...)
> >>>>> }
> >>>>>
> >>>>> Semantic 1. b uses cachedTable as the user demanded so. c uses the
> >>>>> original DAG as the user demanded so. In this case, the optimizer has
> >>>>> no chance to optimize.
> >>>>> Semantic 2. b uses cachedTable as the user demanded so. c leaves the
> >>>>> optimizer to choose whether the cache or the DAG should be used. In
> >>>>> this case, the user loses the option to NOT use the cache.
> >>>>>
> >>>>> As you can see, neither of the options seems perfect. However, I
> >>>>> guess you and Till are proposing the third option:
> >>>>>
> >>>>> Semantic 3. b leaves the optimizer to choose whether the cache or the
> >>>>> DAG should be used. c always uses the DAG.
> >>>>
> >>>> I am pretty sure that I, Till, Fabian and others were all proposing
> >>>> and advocating in favour of semantic “1”. No cost based optimiser
> >>>> decisions at all.
> >>>>
> >>>> {
> >>>> val cachedTable = a.cache()
> >>>> val b1 = cachedTable.select(…)
> >>>> val b2 = cachedTable.foo().select(…)
> >>>> val b3 = cachedTable.bar().select(...)
> >>>> val c1 = a.select(…)
> >>>> val c2 = a.foo().select(…)
> >>>> val c3 = a.bar().select(...)
> >>>> }
> >>>>
> >>>> All b1, b2 and b3 are reading from the cache, while c1, c2 and c3 are
> >>>> re-executing the whole plan for “a”.
> >>>>
> >>>> In the future we could discuss going one step further, introducing
> >>>> some global optimisation (that can be manually enabled/disabled):
> >>>> deduplicate plan nodes/deduplicate sub queries/re-use sub queries
> >>>> results/or whatever we could call it. It could do two things:
> >>>>
> >>>> 1. Automatically try to deduplicate fragments of the plan and share
> >>>> the result using CachedTable - in other words automatically insert
> >>>> `CachedTable cache()` calls.
> >>>> 2. Automatically make the decision to bypass explicit `CachedTable`
> >>>> access (this would be the equivalent of what you described as
> >>>> “semantic 3”).
> >>>>
> >>>> However as I wrote previously, I have big doubts if such cost-based
> >>>> optimisation would work (this applies also to “Semantic 2”). I would
> >>>> expect it to do more harm than good in so many cases that it wouldn’t
> >>>> make sense. Even assuming that we calculate statistics perfectly (this
> >>>> ain’t gonna happen), it’s virtually impossible to correctly estimate
> >>>> the exchange rate of CPU cycles vs IO operations, as it changes so
> >>>> much from deployment to deployment.
> >>>>
> >>>> Is this the core of our disagreement here? That you would like this
> >>>> “cache()” to be mostly a hint for the optimiser?
> >>>>
> >>>> Piotrek
> >>>>
> >>>>> On 11 Dec 2018, at 06:00, Becket Qin <be...@gmail.com> wrote:
> >>>>>
> >>>>> Another potential concern for semantic 3 is that, in the future, we
> >>>>> may add automatic caching to Flink, e.g. cache the intermediate
> >>>>> results at the shuffle boundary. If our semantic is that reference to
> >>>>> the original table means skipping the cache, those users may not be
> >>>>> able to benefit from the implicit cache.
> >>>>>
> >>>>>
> >>>>>
> >>>>> On Tue, Dec 11, 2018 at 12:10 PM Becket Qin <be...@gmail.com> wrote:
> >>>>>
> >>>>>> Hi Piotrek,
> >>>>>>
> >>>>>> Thanks for the reply. Thought about it again, I might have
> >>>>>> misunderstood your proposal in earlier emails. Returning a
> >>>>>> CachedTable might not be a bad idea.
> >>>>>>
> >>>>>> I was more concerned about the semantic and its intuitiveness when a
> >>>>>> CachedTable is returned, i.e., if cache() returns a CachedTable,
> >>>>>> what are the semantics in the following code:
> >>>>>> {
> >>>>>> val cachedTable = a.cache()
> >>>>>> val b = cachedTable.select(...)
> >>>>>> val c = a.select(...)
> >>>>>> }
> >>>>>> What is the difference between b and c? At first glance, I see two
> >>>>>> options:
> >>>>>>
> >>>>>> Semantic 1. b uses cachedTable as the user demanded so. c uses the
> >>>>>> original DAG as the user demanded so. In this case, the optimizer
> >>>>>> has no chance to optimize.
> >>>>>> Semantic 2. b uses cachedTable as the user demanded so. c leaves the
> >>>>>> optimizer to choose whether the cache or the DAG should be used. In
> >>>>>> this case, the user loses the option to NOT use the cache.
> >>>>>>
> >>>>>> As you can see, neither of the options seems perfect. However, I
> >>>>>> guess you and Till are proposing the third option:
> >>>>>>
> >>>>>> Semantic 3. b leaves the optimizer to choose whether the cache or
> >>>>>> the DAG should be used. c always uses the DAG.
> >>>>>>
> >>>>>> This does address all the concerns. It is just that from an
> >>>>>> intuitiveness perspective, I found that asking users to explicitly
> >>>>>> use a CachedTable that the optimizer might choose to ignore is a
> >>>>>> little weird. That was why I did not think about that semantic. But
> >>>>>> given there is material benefit, I think this semantic is
> >>>>>> acceptable.
> >>>>>>
> >>>>>>> 1. If we want to let the optimiser make decisions whether to use
> >>>>>>> the cache or not, then why do we need the “void cache()” method at
> >>>>>>> all? Would it “increase” the chance of using the cache? That sounds
> >>>>>>> strange. What would be the mechanism of deciding whether to use the
> >>>>>>> cache or not? If we want to introduce such kind of automated
> >>>>>>> optimisations of “plan nodes deduplication” I would turn it on
> >>>>>>> globally, not per table, and let the optimiser do all of the work.
> >>>>>>> 2. We do not have statistics at the moment for any use/not use
> >>>>>>> cache decision.
> >>>>>>> 3. Even if we had, I would be veeerryy sceptical whether such cost
> >>>>>>> based optimisations would work properly and I would still insist
> >>>>>>> first on providing an explicit caching mechanism (`CachedTable
> >>>>>>> cache()`)
> >>>>>>
> >>>>>> We are absolutely on the same page here. An explicit cache() method
> >>>>>> is necessary not only because the optimizer may not be able to make
> >>>>>> the right decision, but also because of the nature of interactive
> >>>>>> programming. For example, if users write the following code in the
> >>>>>> Scala shell:
> >>>>>> val b = a.select(...)
> >>>>>> val c = b.select(...)
> >>>>>> val d = c.select(...).writeToSink(...)
> >>>>>> tEnv.execute()
> >>>>>> There is no way the optimizer will know whether b or c will be used
> >>>>>> in later code, unless users hint explicitly.
> >>>>>>
> >>>>>>> At the same time I’m not sure if you have responded to our
> >>>>>>> objections of `void cache()` being implicit/having side effects,
> >>>>>>> which me, Jark, Fabian, Till and I think also Shaoxuan are
> >>>>>>> supporting.
> >>>>>>
> >>>>>> Are there any other side effects if we use semantic 3 mentioned
> >>>>>> above?
> >>>>>>
> >>>>>> Thanks,
> >>>>>>
> >>>>>> Jiangjie (Becket) Qin
> >>>>>>
> >>>>>>
> >>>>>> On Mon, Dec 10, 2018 at 7:54 PM Piotr Nowojski
> >>>>>> <piotr@data-artisans.com> wrote:
> >>>>>>
> >>>>>>> Hi Becket,
> >>>>>>>
> >>>>>>> Sorry for not responding for a long time.
> >>>>>>>
> >>>>>>> Regarding case 1.
> >>>>>>>
> >>>>>>> There wouldn’t be an “a.unCache()” method, but I would expect only
> >>>>>>> `cachedTableA1.dropCache()`. Dropping `cachedTableA1` wouldn’t
> >>>>>>> affect `cachedTableA2`. Just as in any other database,
> >>>>>>> dropping/modifying one independent table/materialised view does not
> >>>>>>> affect others.
> >>>>>>>
> >>>>>>>> What I meant is that assuming there is already a cached table,
> >>>>>>>> ideally users need not to specify whether the next query should
> >>>>>>>> read from the cache or use the original DAG. This should be
> >>>>>>>> decided by the optimizer.
> >>>>>>>
> >>>>>>> 1. If we want to let the optimiser make decisions whether to use
> >>>>>>> the cache or not, then why do we need the “void cache()” method at
> >>>>>>> all? Would it “increase” the chance of using the cache? That sounds
> >>>>>>> strange. What would be the mechanism of deciding whether to use the
> >>>>>>> cache or not? If we want to introduce such kind of automated
> >>>>>>> optimisations of “plan nodes deduplication” I would turn it on
> >>>>>>> globally, not per table, and let the optimiser do all of the work.
> >>>>>>> 2. We do not have statistics at the moment for any use/not use
> >>>>>>> cache decision.
> >>>>>>> 3. Even if we had, I would be veeerryy sceptical whether such cost
> >>>>>>> based optimisations would work properly and I would still insist
> >>>>>>> first on providing an explicit caching mechanism (`CachedTable
> >>>>>>> cache()`)
> >>>>>>> 4. As Till wrote, having an explicit `CachedTable cache()` doesn’t
> >>>>>>> contradict future work on automated cost based caching.
> >>>>>>>
> >>>>>>>
> >>>>>>> At the same time I’m not sure if you have responded to our
> >>>>>>> objections of `void cache()` being implicit/having side effects,
> >>>>>>> which me, Jark, Fabian, Till and I think also Shaoxuan are
> >>>>>>> supporting.
> >>>>>>>
> >>>>>>> Piotrek
> >>>>>>>
> >>>>>>>> On 5 Dec 2018, at 12:42, Becket Qin <be...@gmail.com> wrote:
> >>>>>>>>
> >>>>>>>> Hi Till,
> >>>>>>>>
> >>>>>>>> It is true that after the first job submission, there will be no
> >>>>>>>> ambiguity in terms of whether a cached table is used or not. That
> >>>>>>>> is the same for the cache() without returning a CachedTable.
> >>>>>>>>
> >>>>>>>>> Conceptually one could think of cache() as introducing a caching
> >>>>>>>>> operator from which you need to consume if you want to benefit
> >>>>>>>>> from the caching functionality.
> >>>>>>>>
> >>>>>>>> I am thinking a little differently. I think it is a hint (as you
> >>>>>>>> mentioned later) instead of a new operator. I'd like to be careful
> >>>>>>>> about the semantic of the API. A hint is a property set on an
> >>>>>>>> existing operator, but it is not itself an operator as it does not
> >>>>>>>> really manipulate the data.
> >>>>>>>>
> >>>>>>>>> I agree, ideally the optimizer makes this kind of decision which
> >>>>>>>>> intermediate result should be cached. But especially when
> >>>>>>>>> executing ad-hoc queries the user might better know which results
> >>>>>>>>> need to be cached because Flink might not see the full DAG. In
> >>>>>>>>> that sense, I would consider the cache() method as a hint for the
> >>>>>>>>> optimizer. Of course, in the future we might add functionality
> >>>>>>>>> which tries to automatically cache results (e.g. caching the
> >>>>>>>>> latest intermediate results until so and so much space is used).
> >>>>>>>>> But this should hopefully not contradict with `CachedTable
> >>>>>>>>> cache()`.
> >>>>>>>>
> >>>>>>>> I agree that the cache() method is needed for exactly the reason
> >>>>>>>> you mentioned, i.e. Flink cannot predict what users are going to
> >>>>>>>> write later, so users need to tell Flink explicitly that this
> >>>>>>>> table will be used later. What I meant is that assuming there is
> >>>>>>>> already a cached table, ideally users need not specify whether the
> >>>>>>>> next query should read from the cache or use the original DAG.
> >>>>>>>> This should be decided by the optimizer.
> >>>>>>>>
> >>>>>>>> To explain the difference between returning / not returning a
> >>>>>>>> CachedTable, I want to compare the following two cases:
> >>>>>>>>
> >>>>>>>> *Case 1: returning a CachedTable*
> >>>>>>>> b = a.map(...)
> >>>>>>>> val cachedTableA1 = a.cache()
> >>>>>>>> val cachedTableA2 = a.cache()
> >>>>>>>> b.print() // Just to make sure a is cached.
> >>>>>>>>
> >>>>>>>> c = a.filter(...) // User specifies that the original DAG is used?
> >>>>>>>> // Or the optimizer decides whether DAG or cache should be used?
> >>>>>>>> d = cachedTableA1.filter() // User specifies that the cached table
> >>>>>>>> // is used.
> >>>>>>>>
> >>>>>>>> a.unCache() // Can cachedTableA still be used afterwards?
> >>>>>>>> cachedTableA1.uncache() // Can cachedTableA2 still be used?
> >>>>>>>>
> >>>>>>>> *Case 2: not returning a CachedTable*
> >>>>>>>> b = a.map()
> >>>>>>>> a.cache()
> >>>>>>>> a.cache() // no-op
> >>>>>>>> b.print() // Just to make sure a is cached
> >>>>>>>>
> >>>>>>>> c = a.filter(...) // Optimizer decides whether the cache or DAG
> >>>>>>>> // should be used
> >>>>>>>> d = a.filter(...) // Optimizer decides whether the cache or DAG
> >>>>>>>> // should be used
> >>>>>>>>
> >>>>>>>> a.unCache()
> >>>>>>>> a.unCache() // no-op
> >>>>>>>>
> >>>>>>>> In case 1, semantic wise, the optimizer loses the option to choose
> >>>>>>>> between the DAG and the cache. And the unCache() call becomes
> >>>>>>>> tricky.
> >>>>>>>> In case 2, users do not need to worry about whether the cache or
> >>>>>>>> the DAG is used. And the unCache() semantic is clear. However, the
> >>>>>>>> caveat is that users cannot explicitly ignore the cache.
> >>>>>>>>
> >>>>>>>> In order to address the issues mentioned in case 2, and inspired
> >>>>>>>> by the discussion so far, I am thinking about using a hint to
> >>>>>>>> allow users to explicitly ignore the cache. Although we do not
> >>>>>>>> have hints yet, we probably should have one. So the code becomes:
> >>>>>>>>
> >>>>>>>> *Case 3: returning this table*
> >>>>>>>> b = a.map()
> >>>>>>>> a.cache()
> >>>>>>>> a.cache() // no-op
> >>>>>>>> b.print() // Just to make sure a is cached
> >>>>>>>>
> >>>>>>>> c = a.filter(...) // Optimizer decides whether the cache or DAG
> >>>>>>>> // should be used
> >>>>>>>> d = a.hint("ignoreCache").filter(...) // DAG will be used instead
> >>>>>>>> // of the cache.
> >>>>>>>>
> >>>>>>>> a.unCache()
> >>>>>>>> a.unCache() // no-op
> >>>>>>>>
> >>>>>>>> We could also let cache() return this table to allow chained
> >>>>>>>> method calls.
> >>>>>>>> Do you think this API addresses the concerns?
> >>>>>>>>
> >>>>>>>> Thanks,
> >>>>>>>>
> >>>>>>>> Jiangjie (Becket) Qin
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On Wed, Dec 5, 2018 at 10:55 AM Jark Wu <im...@gmail.com> wrote:
> >>>>>>>>
> >>>>>>>>> Hi,
> >>>>>>>>>
> >>>>>>>>> All the recent discussions are focused on whether there is a
> >>>>>>>>> problem if cache() does not return a Table.
> >>>>>>>>> It seems that returning a Table explicitly is clearer (and
> >>>>>>>>> safer?).
> >>>>>>>>>
> >>>>>>>>> So are there any problems if cache() returns a Table? @Becket
> >>>>>>>>>
> >>>>>>>>> Best,
> >>>>>>>>> Jark
> >>>>>>>>>
> >>>>>>>>> On Tue, 4 Dec 2018 at 22:27, Till Rohrmann <trohrmann@apache.org>
> >>>>>>>>> wrote:
> >>>>>>>>>
> >>>>>>>>>> It's true that b, c, d and e will all read from the original DAG
> >>>>>>>>>> that generates a. But all subsequent operators (when running
> >>>>>>>>>> multiple queries) which reference cachedTableA should not need
> >>>>>>>>>> to reproduce `a` but directly consume the intermediate result.
> >>>>>>>>>>
> >>>>>>>>>> Conceptually one could think of cache() as introducing a caching
> >>>>>>>>>> operator from which you need to consume if you want to benefit
> >>>>>>>>>> from the caching functionality.
> >>>>>>>>>>
> >>>>>>>>>> I agree, ideally the optimizer makes this kind of decision which
> >>>>>>>>>> intermediate result should be cached. But especially when
> >>>>>>>>>> executing ad-hoc queries the user might better know which
> >>>>>>>>>> results need to be cached because Flink might not see the full
> >>>>>>>>>> DAG. In that sense, I would consider the cache() method as a
> >>>>>>>>>> hint for the optimizer. Of course, in the future we might add
> >>>>>>>>>> functionality which tries to automatically cache results (e.g.
> >>>>>>>>>> caching the latest intermediate results until so and so much
> >>>>>>>>>> space is used). But this should hopefully not contradict with
> >>>>>>>>>> `CachedTable cache()`.
> >>>>>>>>>>
> >>>>>>>>>> Cheers,
> >>>>>>>>>> Till
> >>>>>>>>>>
> >>>>>>>>>> On Tue, Dec 4, 2018 at 2:33 PM Becket Qin
> >>>>>>>>>> <becket.qin@gmail.com> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> Hi Till,
> >>>>>>>>>>>
> >>>>>>>>>>> Thanks for the clarification. I am still a little confused.
> >>>>>>>>>>>
> >>>>>>>>>>> If cache() returns a CachedTable, the example might become:
> >>>>>>>>>>>
> >>>>>>>>>>> b = a.map(...)
> >>>>>>>>>>> c = a.map(...)
> >>>>>>>>>>>
> >>>>>>>>>>> cachedTableA = a.cache()
> >>>>>>>>>>> d = cachedTableA.map(...)
> >>>>>>>>>>> e = a.map()
> >>>>>>>>>>>
> >>>>>>>>>>> In the above case, if cache() is lazily evaluated, b, c, d and
> >>>>>>>>>>> e are all going to be reading from the original DAG that
> >>>>>>>>>>> generates a. But with a naive expectation, d should be reading
> >>>>>>>>>>> from the cache. This seems not to solve the potential confusion
> >>>>>>>>>>> you raised, right?
> >>>>>>>>>>>
> >>>>>>>>>>> Just to be clear, my understanding is all based on the
> >>>>>>>>>>> assumption that the tables are immutable. Therefore, after
> >>>>>>>>>>> a.cache(), the cachedTableA and the original table a should be
> >>>>>>>>>>> completely interchangeable.
> >>>>>>>>>>>
> >>>>>>>>>>> That said, I think a valid argument is optimization. There are
> >>>>>>>>>>> indeed cases where reading from the original DAG could be
> >>>>>>>>>>> faster than reading from the cache. For example, in the
> >>>>>>>>>>> following example:
> >>>>>>>>>>>
> >>>>>>>>>>> a.filter('f1 > 100)
> >>>>>>>>>>> a.cache()
> >>>>>>>>>>> b = a.filter('f1 < 100)
> >>>>>>>>>>>
> >>>>>>>>>>> Ideally the optimizer should be intelligent enough to decide
> >>>>>>>>>>> which way is faster, without user intervention. In this case,
> >>>>>>>>>>> it will identify that b would just be an empty table, and thus
> >>>>>>>>>>> skip reading from the cache completely. But I agree that
> >>>>>>>>>>> returning a CachedTable would give the user control of when to
> >>>>>>>>>>> use the cache, even though I still feel that letting the
> >>>>>>>>>>> optimizer handle this is a better option in the long run.
> >>>>>>>>>>>
> >>>>>>>>>>> Thanks,
> >>>>>>>>>>>
> >>>>>>>>>>> Jiangjie (Becket) Qin
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> On Tue, Dec 4, 2018 at 6:51 PM Till Rohrmann
> >>>>>>>>>>> <trohrmann@apache.org> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>> Yes you are right Becket that it still depends on the actual
> >>>>>>>>>>>> execution of the job whether a consumer reads from a cached
> >>>>>>>>>>>> result or not.
> >>>>>>>>>>>>
> >>>>>>>>>>>> My point was actually about the properties of a (cached vs.
> >>>>>>>>>>>> non-cached) and not about the execution. I would not make
> >>>>>>>>>>>> cache trigger the execution of the job because one loses some
> >>>>>>>>>>>> flexibility by eagerly triggering the execution.
> >>>>>>>>>>>>
> >>>>>>>>>>>> I tried to argue for an explicit CachedTable which is returned
> >>>>>>>>>>>> by the cache() method, like Piotr did, in order to make the
> >>>>>>>>>>>> API more explicit.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Cheers,
> >>>>>>>>>>>> Till
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Mon, Dec 3, 2018 at 4:23 PM Becket Qin
> >>>>>>>>>>>> <becket.qin@gmail.com> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>>> Hi Till,
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> That is a good example. Just a minor correction: in this
> >>>>>>>>>>>>> case, b, c and d will all consume from a non-cached a. This
> >>>>>>>>>>>>> is because the cache will only be created on the very first
> >>>>>>>>>>>>> job submission that generates the table to be cached.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> If I understand correctly, this example is about whether the
> >>>>>>>>>>>>> .cache() method should be eagerly evaluated or lazily
> >>>>>>>>>>>>> evaluated. In other words, if the cache() method actually
> >>>>>>>>>>>>> triggers a job that creates the cache, there will be no such
> >>>>>>>>>>>>> confusion. Is that right?
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> In the example, although d will not consume from the cached
> >>>>>>>>>>>>> Table while it looks supposed to, from a correctness
> >>>>>>>>>>>>> perspective the code will still return a correct result,
> >>>>>>>>>>>>> assuming that tables are immutable.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Personally I feel it is OK because users probably won't
> >>>>>>>>>>>>> really worry about whether the table is cached or not. And a
> >>>>>>>>>>>>> lazy cache could avoid some unnecessary caching if a cached
> >>>>>>>>>>>>> table is never created in the user application. But I am not
> >>>>>>>>>>>>> opposed to doing eager evaluation of cache.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Jiangjie (Becket) Qin
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> On Mon, Dec 3, 2018 at 10:01 PM Till Rohrmann
> >>>>>>>>>>>>> <trohrmann@apache.org> wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> Another argument for Piotr's point is that lazily changing
> >>>>>>>>>>>>>> properties of a node affects all downstream consumers but
> >>>>>>>>>>>>>> does not necessarily have to happen before these consumers
> >>>>>>>>>>>>>> are defined. From a user's perspective this can be quite
> >>>>>>>>>>>>>> confusing:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> b = a.map(...)
> >>>>>>>>>>>>>> c = a.map(...)
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> a.cache()
> >>>>>>>>>>>>>> d = a.map(...)
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> now b, c and d will consume from a cached operator. In this
> >>>>>>>>>>>>>> case, the user would most likely expect that only d reads
> >>>>>>>>>>>>>> from a cached result.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Cheers,
> >>>>>>>>>>>>>> Till
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> On Mon, Dec 3, 2018 at 11:32 AM Piotr Nowojski
> >>>>>>>>>>>>>> <piotr@data-artisans.com> wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Hey Shaoxuan and Becket,
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Can you explain a bit more on what the side effects are?
> >>>>>>>>>>>>>>>> So far my understanding is that such side effects only
> >>>>>>>>>>>>>>>> exist if a table is mutable. Is that the case?
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Not only that. There are also performance implications and
> >>>>>>>>>>>>>>> those are another implicit side effect of using `void
> >>>>>>>>>>>>>>> cache()`. As I wrote before, reading from the cache might
> >>>>>>>>>>>>>>> not always be desirable, thus it can cause performance
> >>>>>>>>>>>>>>> degradation and I’m fine with that - user's or optimiser’s
> >>>>>>>>>>>>>>> choice. What I do not like is that this implicit side
> >>>>>>>>>>>>>>> effect can manifest in a completely different part of the
> >>>>>>>>>>>>>>> code, that wasn’t touched by a user while he was adding the
> >>>>>>>>>>>>>>> `void cache()` call somewhere else. And even if caching
> >>>>>>>>>>>>>>> improves performance, it’s still a side effect of `void
> >>>>>>>>>>>>>>> cache()`. Almost by definition `void` methods have only
> >>>>>>>>>>>>>>> side effects. As I wrote before, there are a couple of
> >>>>>>>>>>>>>>> scenarios where this might be undesirable and/or
> >>>>>>>>>>>>>>> unexpected, for example:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> 1.
> >>>>>>>>>>>>>>> Table b = …;
> >>>>>>>>>>>>>>> b.cache()
> >>>>>>>>>>>>>>> x = b.join(…)
> >>>>>>>>>>>>>>> y = b.count()
> >>>>>>>>>>>>>>> // ...
> >>>>>>>>>>>>>>> // one
> >>>>>>>>>>>>>>> // hundred
> >>>>>>>>>>>>>>> // lines
> >>>>>>>>>>>>>>> // of
> >>>>>>>>>>>>>>> // code
> >>>>>>>>>>>>>>> // later
> >>>>>>>>>>>>>>> z = b.filter(…).groupBy(…) // this might be even hidden in
> >>>>>>>>>>>>>>> // a different method/file/package/dependency
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> 2.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Table b = ...
> >>>>>>>>>>>>>>> if (some_condition) {
> >>>>>>>>>>>>>>>   foo(b)
> >>>>>>>>>>>>>>> } else {
> >>>>>>>>>>>>>>>   bar(b)
> >>>>>>>>>>>>>>> }
> >>>>>>>>>>>>>>> z = b.filter(…).groupBy(…)
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> void foo(Table b) {
> >>>>>>>>>>>>>>>   b.cache()
> >>>>>>>>>>>>>>>   // do something with b
> >>>>>>>>>>>>>>> }
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> In both above examples, `b.cache()` will implicitly affect
> >>>>>>>>>>>>>>> (the semantics of a program in case of sources being
> >>>>>>>>>>>>>>> mutable, and performance) `z = b.filter(…).groupBy(…)`,
> >>>>>>>>>>>>>>> which might be far from obvious.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> On top of that, there is still this argument of mine that
> >>>>>>>>>>>>>>> having a `MaterializedTable` or `CachedTable` handle is
> >>>>>>>>>>>>>>> more flexible for us in the future and for the user (as a
> >>>>>>>>>>>>>>> manual option to bypass cache reads).
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> But Jiangjie is correct, the source table in batching
> >>>>>>>>>>>>>>>> should be immutable. It is the user’s responsibility to
> >>>>>>>>>>>>>>>> ensure it, otherwise even a regular failover may lead to
> >>>>>>>>>>>>>>>> inconsistent results.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Yes, I agree that’s what a perfect world/good deployment
> >>>>>>>>>>>>>>> should be. But it often isn’t, and while I’m not trying to
> >>>>>>>>>>>>>>> fix this (since the proper fix is to support transactions),
> >>>>>>>>>>>>>>> I’m just trying to minimise confusion for the users that
> >>>>>>>>>>>>>>> are not fully aware of what’s going on and operate in a
> >>>>>>>>>>>>>>> less then perfect setup. And if something bites them after
> >>>>>>>>>>>>>>> adding a `b.cache()` call, to make sure that they at least
> >>>>>>>>>>>>>>> know all of the places that adding this line can affect.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Thanks, Piotrek
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> On 1 Dec 2018, at 15:39, Becket Qin <becket.qin@gmail.com
> >>>
> >>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Hi Piotrek,
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Thanks again for the clarification. Some more replies are
> >>>>>>>>>>>> following.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> But keep in mind that `.cache()` will/might not only be
> >> used
> >>>>>>>>> in
> >>>>>>>>>>>>>>> interactive
> >>>>>>>>>>>>>>>>> programming and not only in batching.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> It is true. Actually in stream processing, cache() has the
> >>>>>>>>> same
> >>>>>>>>>>>>>> semantic
> >>>>>>>>>>>>>>> as
> >>>>>>>>>>>>>>>> batch processing. The semantic is following:
> >>>>>>>>>>>>>>>> For a table created via a series of computation, save that
> >>>>>>>>>> table
> >>>>>>>>>>>> for
> >>>>>>>>>>>>>>> later
> >>>>>>>>>>>>>>>> reference to avoid running the computation logic to
> >>>>>>>>> regenerate
> >>>>>>>>>>> the
> >>>>>>>>>>>>>> table.
> >>>>>>>>>>>>>>>> Once the application exits, drop all the cache.
> >>>>>>>>>>>>>>>> This semantic is same for both batch and stream
> >> processing.
> >>>>>>>>> The
> >>>>>>>>>>>>>>> difference
> >>>>>>>>>>>>>>>> is that stream applications will only run once as they are
> >>>>>>>>> long
> >>>>>>>>>>>>>> running.
> >>>>>>>>>>>>>>>> And the batch applications may be run multiple times,
> >> hence
> >>>>>>>>> the
> >>>>>>>>>>>> cache
> >>>>>>>>>>>>>> may
> >>>>>>>>>>>>>>>> be created and dropped each time the application runs.
> >>>>>>>>>>>>>>>> Admittedly, there will probably be some resource
> >> management
> >>>>>>>>>>>>>> requirements
> >>>>>>>>>>>>>>>> for the streaming cached table, such as time based / size
> >>>>>>>>> based
> >>>>>>>>>>>>>>> retention,
> >>>>>>>>>>>>>>>> to address the infinite data issue. But such requirement
> >>> does
> >>>>>>>>>> not
> >>>>>>>>>>>>>> change
> >>>>>>>>>>>>>>>> the semantic.
> >>>>>>>>>>>>>>>> You are right that interactive programming is just one use
> >>>>>>>>> case
> >>>>>>>>>>> of
> >>>>>>>>>>>>>>> cache().
> >>>>>>>>>>>>>>>> It is not the only use case.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> For me the more important issue is of not having the `void
> >>>>>>>>>>> cache()`
> >>>>>>>>>>>>>> with
> >>>>>>>>>>>>>>>>> side effects.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> This is indeed the key point. The argument around whether
> >>>>>>>>>> cache()
> >>>>>>>>>>>>>> should
> >>>>>>>>>>>>>>>> return something already indicates that cache() and
> >>>>>>>>>> materialize()
> >>>>>>>>>>>>>> address
> >>>>>>>>>>>>>>>> different issues.
> >>>>>>>>>>>>>>>> Can you explain a bit more on what are the side effects?
> >> So
> >>>>>>>>>> far
> >>>>>>>>>>> my
> >>>>>>>>>>>>>>>> understanding is that such side effects only exist if a
> >>> table
> >>>>>>>>>> is
> >>>>>>>>>>>>>> mutable.
> >>>>>>>>>>>>>>>> Is that the case?
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> I don’t know, probably initially we should make
> >> CachedTable
> >>>>>>>>>>>>> read-only.
> >>>>>>>>>>>>>> I
> >>>>>>>>>>>>>>>>> don’t find it more confusing than the fact that user can
> >>> not
> >>>>>>>>>>> write
> >>>>>>>>>>>>> to
> >>>>>>>>>>>>>>> views
> >>>>>>>>>>>>>>>>> or materialised views in SQL or that user currently can
> >> not
> >>>>>>>>>>> write
> >>>>>>>>>>>>> to a
> >>>>>>>>>>>>>>>>> Table.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> I don't think anyone should insert something into a cache.
> >> By
> >>>>>>>>>>>>> definition
> >>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>> cache should only be updated when the corresponding
> >> original
> >>>>>>>>>>> table
> >>>>>>>>>>>> is
> >>>>>>>>>>>>>>>> updated. What I am wondering is that given the following
> >> two
> >>>>>>>>>>> facts:
> >>>>>>>>>>>>>>>> 1. If and only if a table is mutable (with something like
> >>>>>>>>>>>> insert()),
> >>>>>>>>>>>>> a
> >>>>>>>>>>>>>>>> CachedTable may have implicit behavior.
> >>>>>>>>>>>>>>>> 2. A CachedTable extends a Table.
> >>>>>>>>>>>>>>>> We can come to the conclusion that a CachedTable is
> >> mutable
> >>>>>>>>> and
> >>>>>>>>>>>> users
> >>>>>>>>>>>>>> can
> >>>>>>>>>>>>>>>> insert into the CachedTable directly. This is where I
> >>> thought
> >>>>>>>>>>>>>> it was confusing.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Jiangjie (Becket) Qin
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> On Sat, Dec 1, 2018 at 2:45 AM Piotr Nowojski <
> >>>>>>>>>>>>> piotr@data-artisans.com
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Hi all,
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Regarding naming `cache()` vs `materialize()`. One more
> >>>>>>>>>>>> explanation
> >>>>>>>>>>>>>> why
> >>>>>>>>>>>>>>> I
> >>>>>>>>>>>>>>>>> think `materialize()` is more natural to me is that I
> >> think
> >>>>>>>>> of
> >>>>>>>>>>> all
> >>>>>>>>>>>>>>> “Table”s
> >>>>>>>>>>>>>>>>> in Table-API as views. They behave the same way as SQL
> >>>>>>>>> views,
> >>>>>>>>>>> the
> >>>>>>>>>>>>> only
> >>>>>>>>>>>>>>>>> difference for me is that their lifespan is short -
> >>>>>>>>> current
> >>>>>>>>>>>>> session
> >>>>>>>>>>>>>>> which
> >>>>>>>>>>>>>>>>> is limited by a different execution model. That’s why
> >>>>>>>>> “caching”
> >>>>>>>>>> a
> >>>>>>>>>>>> view
> >>>>>>>>>>>>>>> for me
> >>>>>>>>>>>>>>>>> is just materialising it.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> However I see and I understand your point of view. Coming
> >>>>>>>>> from
> >>>>>>>>>>>>>>>>> DataSet/DataStream and generally speaking non-SQL world,
> >>>>>>>>>>> `cache()`
> >>>>>>>>>>>>> is
> >>>>>>>>>>>>>>> more
> >>>>>>>>>>>>>>>>> natural. But keep in mind that `.cache()` will/might not
> >>>>>>>>> only
> >>>>>>>>>> be
> >>>>>>>>>>>>> used
> >>>>>>>>>>>>>> in
> >>>>>>>>>>>>>>>>> interactive programming and not only in batching. But
> >>> naming
> >>>>>>>>>> is
> >>>>>>>>>>>> one
> >>>>>>>>>>>>>>> issue,
> >>>>>>>>>>>>>>>>> and not that critical to me. Especially that once we
> >>>>>>>>> implement
> >>>>>>>>>>>>> proper
> >>>>>>>>>>>>>>>>> materialised views, we can always deprecate/rename
> >>> `cache()`
> >>>>>>>>>> if
> >>>>>>>>>>> we
> >>>>>>>>>>>>>> deem
> >>>>>>>>>>>>>>> so.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> For me the more important issue is of not having the
> >> `void
> >>>>>>>>>>>> cache()`
> >>>>>>>>>>>>>> with
> >>>>>>>>>>>>>>>>> side effects. Exactly for the reasons that you have
> >>>>>>>>> mentioned.
> >>>>>>>>>>>> True:
> >>>>>>>>>>>>>>>>> results might be non deterministic if underlying source
> >>>>>>>>> table
> >>>>>>>>>>> are
> >>>>>>>>>>>>>>> changing.
> >>>>>>>>>>>>>>>>> Problem is that `void cache()` implicitly changes the
> >>>>>>>>> semantic
> >>>>>>>>>>> of
> >>>>>>>>>>>>>>>>> subsequent uses of the cached/materialized Table. It can
> >>>>>>>>> cause
> >>>>>>>>>>>> “wtf”
> >>>>>>>>>>>> a “wtf”
> >>>>>>>>>>>>>>>>> for a user if he inserts “b.cache()” call in some place
> >> in
> >>>>>>>>> his
> >>>>>>>>>>>> code
> >>>>>>>>>>>>>> and
> >>>>>>>>>>>>>>>>> suddenly some other random places are behaving
> >> differently.
> >>>>>>>>> If
> >>>>>>>>>>>>>>>>> `materialize()` or `cache()` returns a Table handle, we
> >>>>>>>>> force
> >>>>>>>>>>> user
> >>>>>>>>>>>>> to
> >>>>>>>>>>>>>>>>> explicitly use the cache which removes the “random” part
> >>>>>>>>> from
> >>>>>>>>>>> the
> >>>>>>>>>>>>>>> "suddenly
> >>>>>>>>>>>>>>>>> some other random places are behaving differently”.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> This argument and others that I’ve raised (greater
> >>>>>>>>>>>>>> flexibility/allowing
> >>>>>>>>>>>>>>>>> user to explicitly bypass the cache) are independent of
> >>>>>>>>>>> `cache()`
> >>>>>>>>>>>> vs
> >>>>>>>>>>>>>>>>> `materialize()` discussion.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Does that mean one can also insert into the CachedTable?
> >>>>>>>>> This
> >>>>>>>>>>>>> sounds
> >>>>>>>>>>>>>>>>> pretty confusing.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> I don’t know, probably initially we should make
> >> CachedTable
> >>>>>>>>>>>>>> read-only. I
> >>>>>>>>>>>>>>>>> don’t find it more confusing than the fact that user can
> >>> not
> >>>>>>>>>>> write
> >>>>>>>>>>>>> to
> >>>>>>>>>>>>>>> views
> >>>>>>>>>>>>>>>>> or materialised views in SQL or that user currently can
> >> not
> >>>>>>>>>>> write
> >>>>>>>>>>>>> to a
> >>>>>>>>>>>>>>>>> Table.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Piotrek
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> On 30 Nov 2018, at 17:38, Xingcan Cui <
> >> xingcanc@gmail.com
> >>>>
> >>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Hi all,
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> I agree with @Becket that `cache()` and `materialize()`
> >>>>>>>>>> should
> >>>>>>>>>>> be
> >>>>>>>>>>>>>>>>> considered as two different methods where the later one
> >> is
> >>>>>>>>>> more
> >>>>>>>>>>>>>>>>> sophisticated.
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> According to my understanding, the initial idea is just
> >> to
> >>>>>>>>>>>>> introduce
> >>>>>>>>>>>>>> a
> >>>>>>>>>>>>>>>>> simple cache or persist mechanism, but as the TableAPI
> >> is a
> >>>>>>>>>>>>> high-level
> >>>>>>>>>>>>>>> API,
> >>>>>>>>>>>>>>>> it’s natural for us to think in a SQL way.
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Maybe we can add the `cache()` method to the DataSet API
> >>>>>>>>> and
> >>>>>>>>>>>> force
> >>>>>>>>>>>>>>> users
> >>>>>>>>>>>>>>>>> to translate a Table to a Dataset before caching it. Then
> >>>>>>>>> the
> >>>>>>>>>>>> users
> >>>>>>>>>>>>>>> should
> >>>>>>>>>>>>>>>>> manually register the cached dataset to a table again (we
> >>>>>>>>> may
> >>>>>>>>>>> need
> >>>>>>>>>>>>>> some
> >>>>>>>>>>>>>>>>> table replacement mechanisms for datasets with an
> >> identical
> >>>>>>>>>>> schema
> >>>>>>>>>>>>> but
> >>>>>>>>>>>>>>>>> different contents here). After all, it’s the dataset
> >>> rather
> >>>>>>>>>>> than
> >>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>> dynamic table that needs to be cached, right?
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Best,
> >>>>>>>>>>>>>>>>>> Xingcan
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> On Nov 30, 2018, at 10:57 AM, Becket Qin <
> >>>>>>>>>>> becket.qin@gmail.com>
> >>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Hi Piotrek and Jark,
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Thanks for the feedback and explanation. Those are good
> >>>>>>>>>>>> arguments.
> >>>>>>>>>>>>>>> But I
> >>>>>>>>>>>>>>>>>>> think those arguments are mostly about materialized
> >> view.
> >>>>>>>>>> Let
> >>>>>>>>>>> me
> >>>>>>>>>>>>> try
> >>>>>>>>>>>>>>> to
> >>>>>>>>>>>>>>>>>>> explain the reason I believe cache() and materialize()
> >>> are
> >>>>>>>>>>>>>> different.
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> I think cache() and materialize() have quite different
> >>>>>>>>>>>>> implications.
> >>>>>>>>>>>>>>> An
> >>>>>>>>>>>>>>>>>>> analogy I can think of is save()/publish(). When users
> >>>>>>>>> call
> >>>>>>>>>>>>> cache(),
> >>>>>>>>>>>>>>> it
> >>>>>>>>>>>>>>>>> is
> >>>>>>>>>>>>>>>>>>> just like they are saving an intermediate result as a
> >>>>>>>>> draft
> >>>>>>>>>> of
> >>>>>>>>>>>>> their
> >>>>>>>>>>>>>>>>> work,
> >>>>>>>>>>>>>>>>> work;
> >>>>>>>>> meaning.
> >>>>>>>>>>>>> Calling
> >>>>>>>>>>>>>>>>>>> cache() does not mean users want to publish the cached
> >>>>>>>>> table
> >>>>>>>>>>> in
> >>>>>>>>>>>>> any
> >>>>>>>>>>>>>>>>> manner.
> >>>>>>>>>>>>>>>>>>> But when users call materialize(), that means "I have
> >>>>>>>>>>> something
> >>>>>>>>>>>>>>>>> meaningful
> >>>>>>>>>>>>>>>>>>> to be reused by others", now users need to think about
> >>> the
> >>>>>>>>>>>>>> validation,
> >>>>>>>>>>>>>>>>>>> update & versioning, lifecycle of the result, etc.
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Piotrek's suggestions on variations of the
> >> materialize()
> >>>>>>>>>>> methods
> >>>>>>>>>>>>> are
> >>>>>>>>>>>>>>>>> very
> >>>>>>>>>>>>>>>>>>> useful. It would be great if Flink have them. The
> >> concept
> >>>>>>>>> of
> >>>>>>>>>>>>>>>>> materialized
> >>>>>>>>>>>>>>>>>> view is actually a pretty big feature, not to mention the
> >>>>>>>>>> related
> >>>>>>>>>>>>> stuff
> >>>>>>>>>>>>>>> like
> >>>>>>>>>>>>>>>>>>> triggers/hooks you mentioned earlier. I think the
> >>>>>>>>>> materialized
> >>>>>>>>>>>>> view
> >>>>>>>>>>>>>>>>> itself
> >>>>>>>>>>>>>>>>>>> should be discussed in a more thorough and systematic
> >>>>>>>>>> manner.
> >>>>>>>>>>>> And
> >>>>>>>>>>>>> I
> >>>>>>>>>>>>>>>>> found
> >>>>>>>>>>>>>>>>>> that discussion is kind of orthogonal and way beyond the
> >>>>>>>>>>> interactive
> >>>>>>>>>>>>>>>>>>> programming experience.
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> The example you gave was interesting. I still have some
> >>>>>>>>>>>> questions,
> >>>>>>>>>>>>>>>>> though.
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Table source = … // some source that scans files from a
> >>>>>>>>>>>> directory
> >>>>>>>>>>>>>>>>>>>> “/foo/bar/“
> >>>>>>>>>>>>>>>>>>>> Table t1 = source.groupBy(…).select(…).where(…) ….;
> >>>>>>>>>>>>>>>>>>>> Table t2 = t1.materialize() // (or `cache()`)
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> t2.count() // initialise cache (if it’s lazily
> >>>>>>>>> initialised)
> >>>>>>>>>>>>>>>>>>>> int a1 = t1.count()
> >>>>>>>>>>>>>>>>>>>> int b1 = t2.count()
> >>>>>>>>>>>>>>>>>>>> // something in the background (or we trigger it)
> >> writes
> >>>>>>>>>> new
> >>>>>>>>>>>>> files
> >>>>>>>>>>>>>> to
> >>>>>>>>>>>>>>>>>>>> /foo/bar
> >>>>>>>>>>>>>>>>>>>> int a2 = t1.count()
> >>>>>>>>>>>>>>>>>>>> int b2 = t2.count()
> >>>>>>>>>>>>>>>>>>>> t2.refresh() // possible future extension, not to be
> >>>>>>>>>>>> implemented
> >>>>>>>>>>>>> in
> >>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>>> initial version
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> what if someone else added some more files to /foo/bar
> >> at
> >>>>>>>>>> this
> >>>>>>>>>>>>>> point?
> >>>>>>>>>>>>>>> In
> >>>>>>>>>>>>>>>>>> that case, a3 won't equal b3, and the result becomes
> >>>>>>>>>>>>>>>>> non-deterministic,
> >>>>>>>>>>>>>>>>>>> right?
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> int a3 = t1.count()
> >>>>>>>>>>>>>>>>>>>> int b3 = t2.count()
> >>>>>>>>>>>>>>>>>>>> t2.drop() // another possible future extension, manual
> >>>>>>>>>>> “cache”
> >>>>>>>>>>>>>>> dropping
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> When we talk about interactive programming, in most
> >>> cases,
> >>>>>>>>>> we
> >>>>>>>>>>>> are
> >>>>>>>>>>>>>>>>> talking
> >>>>>>>>>>>>>>>>>>> about batch applications. A fundamental assumption of
> >>> such
> >>>>>>>>>>> cases
> >>>>>>>>>>>> is
> >>>>>>>>>>>>>>> that
> >>>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>> source data is complete before the data processing
> >>> begins,
> >>>>>>>>>> and
> >>>>>>>>>>>> the
> >>>>>>>>>>>>>>> data
> >>>>>>>>>>>>>>>>>>> will not change during the data processing. IMO, if
> >>>>>>>>>> additional
> >>>>>>>>>>>>> rows
> >>>>>>>>>>>>>>> need
> >>>>>>>>>>>>>>>>>>> to be added to some source during the processing, it
> >>>>>>>>> should
> >>>>>>>>>> be
> >>>>>>>>>>>>> done
> >>>>>>>>>>>>>> in
> >>>>>>>>>>>>>>>>> ways
> >>>>>>>>>>>>>>>>>> like unioning the source with another table containing the
> >>>>>>>>> rows
> >>>>>>>>>>> to
> >>>>>>>>>>>> be
> >>>>>>>>>>>>>>>>> added.
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> There are a few cases where computations are executed
> >>>>>>>>>>> repeatedly
> >>>>>>>>>>>> on
> >>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>> changing data source.
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> For example, people may run a ML training job every
> >> hour
> >>>>>>>>>> with
> >>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>> samples
> >>>>>>>>>>>>>>>>>>> newly added in the past hour. In that case, the source
> >>>>>>>>> data
> >>>>>>>>>>>>> between runs
> >>>>>>>>>>>>>>> will
> >>>>>>>>>>>>>>>>>>> indeed change. But still, the data remain unchanged
> >>> within
> >>>>>>>>>> one
> >>>>>>>>>>>>> run.
> >>>>>>>>>>>>>>> And
> >>>>>>>>>>>>>>>>>>> usually in that case, the result will need versioning,
> >>>>>>>>> i.e.
> >>>>>>>>>>> for
> >>>>>>>>>>>> a
> >>>>>>>>>>>>>>> given
> >>>>>>>>>>>>>>>>>> result, it tells that the result is derived from the
> >>>>>>>>> source
> >>>>>>>>>>>> data
> >>>>>>>>>>>>>> as of a
> >>>>>>>>>>>>>>>>>>> certain timestamp.
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Another example is something like a data warehouse. In
> >> this
> >>>>>>>>>>> case,
> >>>>>>>>>>>>>> there
> >>>>>>>>>>>>>>>>>> are a few sources of original/raw data. On top of those
> >>>>>>>>>>>>>>>>>>> few source of original/raw data. On top of those
> >> sources,
> >>>>>>>>>> many
> >>>>>>>>>>>>>>>>>> views / queries / reports / dashboards can be created to
> >>>>>>>>>>>>>>>>>>> view / queries / reports / dashboards can be created to
> >>>>>>>>>>> generate
> >>>>>>>>>>>>>>> derived
> >>>>>>>>>>>>>>>>>>> data. Those derived data needs to be updated when the
> >>>>>>>>>>>>>>>>>> data. Those derived data need to be updated when the
> >>>>>>>>>>>>>>>>> original
> >>>>>>>>>>>>>>>>>>> data changes. In that case, the processing logic that
> >>>>>>>>>> derives
> >>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>> original
> >>>>>>>>>>>>>>>>>>> data needs to be executed repeatedly to update those
> >>>>>>>>>>>>> reports/views.
> >>>>>>>>>>>>>>>>> Again,
> >>>>>>>>>>>>>>>>>>> all those derived data also need to ha

Re: [DISCUSS] Support Interactive Programming in Flink Table API

Posted by Piotr Nowojski <pi...@data-artisans.com>.
Hi Becket,

> Regarding the chance of optimization, it might not be that rare. Some very
> simple statistics could already help in many cases. For example, simply
> maintaining max and min of each fields can already eliminate some
> unnecessary table scan (potentially scanning the cached table) if the
> result is doomed to be empty. A histogram would give even further
> information. The optimizer could be very careful and only ignore the cache
> when it is 100% sure doing that is cheaper, e.g. only when a filter on the
> cache will absolutely return nothing.

I do not see how this might be easy to achieve. It would require tons of effort to make it work and in the end you would still have a problem of comparing/trading CPU cycles vs IO. For example:

Table src1 = … // read from connector 1
Table src2 = … // read from connector 2

Table a = src1.filter(…).join(src2.filter(…), …)
a.cache() // write cache to connector 3

a.filter(…)
env.execute()
a.select(…)

Decision whether it’s better to:
A) read from connector1/connector2, filter/map and join them twice
B) read from connector1/connector2, filter/map and join them once, pay the price of writing to connector 3 and then reading from it

Is very far from trivial. `a` can end up much larger than `src1` and `src2`, writes to connector 3 might be extremely slow, reads from connector 3 can be slower compared to reads from connectors 1 & 2, … . You really need to have extremely good statistics to correctly assess the size of the output, and it would still fail many times (correlations etc.). And keep in mind that at the moment we do not have ANY statistics at all. More than that, it would require significantly more testing and setting up some benchmarks to make sure that we do not break it with some regressions.
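
To spell out what such a rule would actually have to compute, here is a rough sketch (purely illustrative: none of these statistics or cost functions exist in Flink today):

// Hypothetical cost model for the example above - nothing like this exists yet.
double costRecompute = 2 * (readCost(src1) + readCost(src2) + cpuCost(planOfA));
double costCached = readCost(src1) + readCost(src2) + cpuCost(planOfA)
    + writeCost(connector3, estimatedSize(a))   // materialise `a` once
    + readCost(connector3, estimatedSize(a));   // read it back for the second use
boolean useCache = costCached < costRecompute;
// estimatedSize(a), the connector throughputs and the CPU-vs-IO exchange rate
// are all unknown, which is exactly why this decision is so fragile.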

That’s why I’m strongly opposing this idea - at least let’s not start with this. If we first start with completely manual/explicit caching, without any magic, it would be a significant improvement for the users for a fraction of the development cost. After implementing that, when we already have all of the working pieces, we can start working on some optimisation rules. As I wrote before, if we start with

`CachedTable cache()`

We can later work on follow-up stories to make it automatic. Even though I don’t like this implicit/side-effect approach with a `void` method, having an explicit `CachedTable cache()` wouldn’t even prevent us from later adding a `void hintCache()` method, with the exact semantics that you want.
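
To make that concrete, here is a minimal sketch of how the two could coexist (CachedTable, dropCache() and hintCache() are proposals from this thread, not existing Flink API):

Table a = tEnv.scan("src").groupBy(...).select(...); // some expensive sub-query
CachedTable cachedA = a.cache(); // explicit handle to the cached result
Table b = cachedA.select(...);   // reads from the cache, by the user's choice
Table c = a.select(...);         // re-executes a's plan, by the user's choice
cachedA.dropCache();             // explicit, unambiguous cache disposal
// possible follow-up story, purely an optimiser hint, no new handle:
// a.hintCache();

With the explicit variant the cache can never change the behaviour of code that does not reference `cachedA`, which is the whole point of the MVP.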

On top of that, I raise again that an implicit `void cache()/hintCache()` has other side effects and problems with non-immutable data, and is annoying when used secretly inside methods.

An explicit `CachedTable cache()` just looks like a much less controversial MVP, and if we decide to go further with this topic, it’s not a wasted effort but lies on a straight path to more advanced/complicated solutions in the future. Are there any drawbacks of starting with `CachedTable cache()` that I’m missing?

Piotrek

> On 12 Dec 2018, at 09:30, Jeff Zhang <zj...@gmail.com> wrote:
> 
> Hi Becket,
> 
> Introducing CacheHandle seems too complicated. That means users have to
> maintain the handle properly.
> 
> And since cache is just a hint for the optimizer, why not just return the
> Table itself from the cache method? This hint info should be kept in the
> Table, I believe.
> 
> So how about adding cache and uncache methods to Table, both returning
> Table? What cache and uncache do is just add some hint info into the
> Table.
> 
> 
> 
> 
> Becket Qin <be...@gmail.com> 于2018年12月12日周三 上午11:25写道:
> 
>> Hi Till and Piotrek,
>> 
>> Thanks for the clarification. That clears up quite a few confusions. My
>> understanding of how cache works is the same as what Till described, i.e.
>> cache() is a hint to Flink, but it is not guaranteed that the cache always
>> exists, and it might be recomputed from its lineage.
>> 
>> Is this the core of our disagreement here? That you would like this
>>> “cache()” to be mostly a hint for the optimiser?
>> 
>> Semantics-wise, yes. That's also why I think materialize() has a much larger
>> scope than cache(), thus it should be a different method.
>> 
>> Regarding the chance of optimization, it might not be that rare. Some very
>> simple statistics could already help in many cases. For example, simply
>> maintaining max and min of each field can already eliminate some
>> unnecessary table scan (potentially scanning the cached table) if the
>> result is doomed to be empty. A histogram would give even further
>> information. The optimizer could be very careful and only ignore the cache
>> when it is 100% sure doing that is cheaper, e.g. only when a filter on the
>> cache will absolutely return nothing.
>> 
>> Given the above clarification on cache, I would like to revisit the
>> original "void cache()" proposal and see if we can improve on top of that.
>> 
>> What do you think about the following modified interface?
>> 
>> Table {
>>  /**
>>   * This call hints Flink to maintain a cache of this table and leverage
>> it for performance optimization if needed.
>>   * Note that Flink may still decide not to use the cache if it is cheaper
>> to do so.
>>   *
>>   * A CacheHandle will be returned to allow users to release the cache
>> actively. The cache will be deleted if there
>>   * are no unreleased cache handles to it. When the TableEnvironment is
>> closed, the cache will also be deleted
>>   * and all the cache handles will be released.
>>   *
>>   * @return a CacheHandle referring to the cache of this table.
>>   */
>>  CacheHandle cache();
>> }
>> 
>> CacheHandle {
>>  /**
>>   * Close the cache handle. This method does not necessarily delete the
>> cache. Instead, it simply decrements the reference counter of the cache.
>>   * When there is no handle referring to a cache, the cache will be
>> deleted.
>>   *
>>   * @return the number of open handles to the cache after this handle has
>> been released.
>>   */
>>  int release()
>> }
>> 
>> The rationale behind this interface is following:
>> In the vast majority of cases, users wouldn't really care whether the cache
>> is used or not. So I think the most intuitive way is letting cache() return
>> nothing, so nobody needs to worry about the difference between operations
>> on CachedTables and those on the "original" tables. This will make maybe
>> 99.9% of the users happy. There were two concerns raised for this approach:
>> 1. In some rare cases, users may want to ignore cache,
>> 2. A table might be cached/uncached in a third-party function without the
>> caller knowing.
>> 
>> For the first issue, users can use hint("ignoreCache") to explicitly ignore
>> cache.
>> For the second issue, the above proposal lets cache() return a CacheHandle,
>> whose only method is release(). Different CacheHandles will refer to
>> the same cache; if a cache no longer has any cache handle, it will be
>> deleted. This will address the following case:
>> {
>>  val handle1 = a.cache()
>>  process(a)
>>  a.select(...) // cache is still available, handle1 has not been released.
>> }
>> 
>> void process(Table t) {
>>  val handle2 = t.cache() // new handle to cache
>>  t.select(...) // optimizer decides cache usage
>>  t.hint("ignoreCache").select(...) // cache is ignored
>>  handle2.release() // release the handle, but the cache may still be
>> available if there are other handles
>>  ...
>> }
>> 
>> Does the above modified approach look reasonable to you?
>> 
>> Cheers,
>> 
>> Jiangjie (Becket) Qin
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> On Tue, Dec 11, 2018 at 6:44 PM Till Rohrmann <tr...@apache.org>
>> wrote:
>> 
>>> Hi Becket,
>>> 
>>> I was aiming at semantics similar to 1. I actually thought that `cache()`
>>> would tell the system to materialize the intermediate result so that
>>> subsequent queries don't need to reprocess it. This means that the usage
>> of
>>> the cached table in this example
>>> 
>>> {
>>> val cachedTable = a.cache()
>>> val b1 = cachedTable.select(…)
>>> val b2 = cachedTable.foo().select(…)
>>> val b3 = cachedTable.bar().select(...)
>>> val c1 = a.select(…)
>>> val c2 = a.foo().select(…)
>>> val c3 = a.bar().select(...)
>>> }
>>> 
>>> strongly depends on interleaved calls which trigger the execution of sub
>>> queries. So for example, if there is only a single env.execute call at
>> the
>>> end of the block, then b1, b2, b3, c1, c2 and c3 would all be computed by
>>> reading directly from the sources (given that there is only a single
>>> JobGraph). It just happens that the result of `a` will be cached such
>> that
>>> we skip the processing of `a` when there are subsequent queries reading
>>> from `cachedTable`. If for some reason the system cannot materialize the
>>> table (e.g. running out of disk space, ttl expired), then it could also
>>> happen that we need to reprocess `a`. In that sense `cachedTable` simply
>> is
>>> an identifier for the materialized result of `a` with the lineage of how to
>>> reprocess it.
>>> 
>>> Cheers,
>>> Till
>>> 
>>> 
>>> 
>>> 
>>> 
>>> On Tue, Dec 11, 2018 at 11:01 AM Piotr Nowojski <piotr@data-artisans.com
>>> 
>>> wrote:
>>> 
>>>> Hi Becket,
>>>> 
>>>>> {
>>>>> val cachedTable = a.cache()
>>>>> val b = cachedTable.select(...)
>>>>> val c = a.select(...)
>>>>> }
>>>>> 
>>>>> Semantic 1. b uses cachedTable as user demanded so. c uses original
>> DAG
>>>> as
>>>>> user demanded so. In this case, the optimizer has no chance to
>>> optimize.
>>>>> Semantic 2. b uses cachedTable as user demanded so. c leaves the
>>>> optimizer
>>>>> to choose whether the cache or DAG should be used. In this case, user
>>>> lose
>>>>> the option to NOT use cache.
>>>>> 
>>>>> As you can see, neither of the options seem perfect. However, I guess
>>> you
>>>>> and Till are proposing the third option:
>>>>> 
>>>>> Semantic 3. b leaves the optimizer to choose whether cache or DAG
>>> should
>>>> be
>>>>> used. c always use the DAG.
>>>> 
>>>> I am pretty sure that me, Till, Fabian and others were all proposing
>> and
>>>> advocating in favour of semantic “1”. No cost based optimiser decisions
>>> at
>>>> all.
>>>> 
>>>> {
>>>> val cachedTable = a.cache()
>>>> val b1 = cachedTable.select(…)
>>>> val b2 = cachedTable.foo().select(…)
>>>> val b3 = cachedTable.bar().select(...)
>>>> val c1 = a.select(…)
>>>> val c2 = a.foo().select(…)
>>>> val c3 = a.bar().select(...)
>>>> }
>>>> 
>>>> All b1, b2 and b3 are reading from cache, while c1, c2 and c3 are
>>>> re-executing whole plan for “a”.
>>>> 
>>>> In the future we could discuss going one step further, introducing some
>>>> global optimisation (that can be manually enabled/disabled):
>> deduplicate
>>>> plan nodes/deduplicate sub queries/re-use sub queries results/or
>> whatever
>>>> we could call it. It could do two things:
>>>> 
>>>> 1. Automatically try to deduplicate fragments of the plan and share the
>>>> result using CachedTable - in other words automatically insert
>>> `CachedTable
>>>> cache()` calls.
>>>> 2. Automatically make decision to bypass explicit `CachedTable` access
>>>> (this would be the equivalent of what you described as “semantic 3”).
>>>> 
>>>> However as I wrote previously, I have big doubts if such cost-based
>>>> optimisation would work (this applies also to “Semantic 2”). I would
>>> expect
>>>> it to do more harm than good in so many cases, that it wouldn’t make
>>> sense.
>>>> Even assuming that we calculate statistics perfectly (this ain’t gonna
>>>> happen), it’s virtually impossible to correctly estimate the
>> exchange
>>>> rate of CPU cycles vs IO operations, as it changes so much from
>>>> deployment to deployment.
>>>> 
>>>> Is this the core of our disagreement here? That you would like this
>>>> “cache()” to be mostly a hint for the optimiser?
>>>> 
>>>> Piotrek
>>>> 
>>>>> On 11 Dec 2018, at 06:00, Becket Qin <be...@gmail.com> wrote:
>>>>> 
>>>>> Another potential concern for semantic 3 is that, in the future, we
>> may
>>>> add
>>>>> automatic caching to Flink, e.g. caching the intermediate results at
>> the
>>>>> shuffle boundary. If our semantic is that a reference to the original
>>> table
>>>>> means skipping the cache, those users may not be able to benefit from the
>>>>> implicit cache.
>>>>> 
>>>>> 
>>>>> 
>>>>> On Tue, Dec 11, 2018 at 12:10 PM Becket Qin <be...@gmail.com>
>>>> wrote:
>>>>> 
>>>>>> Hi Piotrek,
>>>>>> 
>>>>>> Thanks for the reply. Thought about it again, I might have
>>> misunderstood
>>>>>> your proposal in earlier emails. Returning a CachedTable might not
>> be
>>> a
>>>> bad
>>>>>> idea.
>>>>>> 
>>>>>> I was more concerned about the semantic and its intuitiveness when a
>>>>>> CachedTable is returned, i.e., if cache() returns a CachedTable. What
>>> are
>>>> the
>>>>>> semantic in the following code:
>>>>>> {
>>>>>> val cachedTable = a.cache()
>>>>>> val b = cachedTable.select(...)
>>>>>> val c = a.select(...)
>>>>>> }
>>>>>> What is the difference between b and c? At first glance, I see
>> two
>>>>>> options:
>>>>>> 
>>>>>> Semantic 1. b uses cachedTable as user demanded so. c uses original
>>> DAG
>>>> as
>>>>>> user demanded so. In this case, the optimizer has no chance to
>>> optimize.
>>>>>> Semantic 2. b uses cachedTable as user demanded so. c leaves the
>>>> optimizer
>>>>>> to choose whether the cache or DAG should be used. In this case,
>> user
>>>> lose
>>>>>> the option to NOT use cache.
>>>>>> 
>>>>>> As you can see, neither of the options seem perfect. However, I
>> guess
>>>> you
>>>>>> and Till are proposing the third option:
>>>>>> 
>>>>>> Semantic 3. b leaves the optimizer to choose whether cache or DAG
>>> should
>>>>>> be used. c always use the DAG.
>>>>>> 
>>>>>> This does address all the concerns. It is just that from an
>> intuitiveness
>>>>>> perspective, I found that asking users to explicitly use a
>> CachedTable
>>>> while
>>>>>> the optimizer might choose to ignore it is a little weird. That was
>> why I
>>>> did
>>>>>> not think about that semantic. But given there is material benefit,
>> I
>>>> think
>>>>>> this semantic is acceptable.
>>>>>> 
>>>>>> 1. If we want to let optimiser make decisions whether to use cache
>> or
>>>> not,
>>>>>>> then why do we need a “void cache()” method at all? Would it
>>> “increase”
>>>> the
>>>>>>> chance of using the cache? That sounds strange. What would be the
>>>>>>> mechanism of deciding whether to use the cache or not? If we want
>> to
>>>>>>> introduce such kind of automated optimisations of “plan nodes
>>>> deduplication”
>>>>>>> I would turn it on globally, not per table, and let the optimiser
>> do
>>>> all of
>>>>>>> the work.
>>>>>>> 2. We do not have statistics at the moment for any use/not use
>> cache
>>>>>>> decision.
>>>>>>> 3. Even if we had, I would be veeerryy sceptical whether such cost
>>>> based
>>>>>>> optimisations would work properly and I would still insist first on
>>>>>>> providing explicit caching mechanism (`CachedTable cache()`)
>>>>>>> 
>>>>>> We are absolutely on the same page here. An explicit cache() method
>> is
>>>>>> necessary not only because optimizer may not be able to make the
>> right
>>>>>> decision, but also because of the nature of interactive programming.
>>> For
>>>>>> example, if users write the following code in Scala shell:
>>>>>> val b = a.select(...)
>>>>>> val c = b.select(...)
>>>>>> val d = c.select(...).writeToSink(...)
>>>>>> tEnv.execute()
>>>>>> There is no way the optimizer will know whether b or c will be used in
>>> later
>>>>>> code, unless users hint explicitly.
>>>>>> 
>>>>>> At the same time I’m not sure if you have responded to our
>> objections
>>> of
>>>>>>> `void cache()` being implicit/having side effects, which I, Jark,
>>>> Fabian,
>>>>>>> Till and I think also Shaoxuan are supporting.
>>>>>> 
>>>>>> Is there any other side effects if we use semantic 3 mentioned
>> above?
>>>>>> 
>>>>>> Thanks,
>>>>>> 
>>>>>> Jiangjie (Becket) Qin
>>>>>> 
>>>>>> 
>>>>>> On Mon, Dec 10, 2018 at 7:54 PM Piotr Nowojski <
>>> piotr@data-artisans.com
>>>>> 
>>>>>> wrote:
>>>>>> 
>>>>>>> Hi Becket,
>>>>>>> 
>>>>>>> Sorry for not responding for a long time.
>>>>>>> 
>>>>>>> Regarding case1.
>>>>>>> 
>>>>>>> There wouldn’t be an “a.unCache()” method, but I would expect only
>>>>>>> `cachedTableA1.dropCache()`. Dropping `cachedTableA1` wouldn’t
>> affect
>>>>>>> `cachedTableA2`. Just as in any other database, dropping or modifying
>> one
>>>>>>> independent table/materialised view does not affect others.
>>>>>>> 
>>>>>>>> What I meant is that assuming there is already a cached table,
>>> ideally
>>>>>>> users need
>>>>>>>> not specify whether the next query should read from the cache
>> or
>>>> use
>>>>>>> the
>>>>>>>> original DAG. This should be decided by the optimizer.
>>>>>>> 
>>>>>>> 1. If we want to let optimiser make decisions whether to use cache
>> or
>>>>>>> not, then why do we need a “void cache()” method at all? Would it
>>>> “increase”
>>>>>>> the chance of using the cache? That sounds strange. What would be
>>> the
>>>>>>> mechanism of deciding whether to use the cache or not? If we want
>> to
>>>>>>> introduce such kind of automated optimisations of “plan nodes
>>>> deduplication”
>>>>>>> I would turn it on globally, not per table, and let the optimiser
>> do
>>>> all of
>>>>>>> the work.
>>>>>>> 2. We do not have statistics at the moment for any use/not use
>> cache
>>>>>>> decision.
>>>>>>> 3. Even if we had, I would be veeerryy sceptical whether such cost
>>>> based
>>>>>>> optimisations would work properly and I would still insist first on
>>>>>>> providing explicit caching mechanism (`CachedTable cache()`)
>>>>>>> 4. As Till wrote, having explicit `CachedTable cache()` doesn’t
>>>>>>> contradict future work on automated cost based caching.
>>>>>>> 
>>>>>>> 
>>>>>>> At the same time I’m not sure if you have responded to our
>> objections
>>>> of
>>>>>>> `void cache()` being implicit/having side effects, which I, Jark,
>>>> Fabian,
>>>>>>> Till and I think also Shaoxuan are supporting.
>>>>>>> 
>>>>>>> Piotrek
>>>>>>> 
>>>>>>>> On 5 Dec 2018, at 12:42, Becket Qin <be...@gmail.com> wrote:
>>>>>>>> 
>>>>>>>> Hi Till,
>>>>>>>> 
>>>>>>>> It is true that after the first job submission, there will be no
>>>>>>> ambiguity
>>>>>>>> in terms of whether a cached table is used or not. That is the
>> same
>>>> for
>>>>>>> the
>>>>>>>> cache() without returning a CachedTable.
>>>>>>>> 
>>>>>>>> Conceptually one could think of cache() as introducing a caching
>>>>>>> operator
>>>>>>>>> from which you need to consume from if you want to benefit from
>> the
>>>>>>> caching
>>>>>>>>> functionality.
>>>>>>>> 
>>>>>>>> I am thinking a little differently. I think it is a hint (as you
>>>>>>> mentioned
>>>>>>>> later) instead of a new operator. I'd like to be careful about the
>>>>>>> semantic
>>>>>>>> of the API. A hint is a property set on an existing operator, but
>> is
>>>> not
>>>>>>>> itself an operator as it does not really manipulate the data.
>>>>>>>> 
>>>>>>>> I agree, ideally the optimizer makes this kind of decision which
>>>>>>>>> intermediate result should be cached. But especially when
>> executing
>>>>>>> ad-hoc
>>>>>>>>> queries the user might better know which results need to be
>> cached
>>>>>>> because
>>>>>>>>> Flink might not see the full DAG. In that sense, I would consider
>>> the
>>>>>>>>> cache() method as a hint for the optimizer. Of course, in the
>>> future
>>>> we
>>>>>>>>> might add functionality which tries to automatically cache
>> results
>>>>>>> (e.g.
>>>>>>>>> caching the latest intermediate results until so and so much
>> space
>>> is
>>>>>>>>> used). But this should hopefully not contradict with `CachedTable
>>>>>>> cache()`.
>>>>>>>> 
>>>>>>>> I agree that the cache() method is needed for exactly the reason you
>>>>>>> mentioned,
>>>>>>>> i.e. Flink cannot predict what users are going to write later, so
>>>> users
>>>>>>>> need to tell Flink explicitly that this table will be used later.
>>>> What I
>>>>>>>> meant is that assuming there is already a cached table, ideally
>>> users
>>>>>>> need
>>>>>>>> not specify whether the next query should read from the cache
>> or
>>>> use
>>>>>>> the
>>>>>>>> original DAG. This should be decided by the optimizer.
>>>>>>>> 
>>>>>>>> To explain the difference between returning / not returning a
>>>>>>> CachedTable,
>>>>>>>> I want to compare the following two cases:
>>>>>>>> 
>>>>>>>> *Case 1:  returning a CachedTable*
>>>>>>>> b = a.map(...)
>>>>>>>> val cachedTableA1 = a.cache()
>>>>>>>> val cachedTableA2 = a.cache()
>>>>>>>> b.print() // Just to make sure a is cached.
>>>>>>>> 
>>>>>>>> c = a.filter(...) // User specify that the original DAG is used?
>> Or
>>>> the
>>>>>>>> optimizer decides whether DAG or cache should be used?
>>>>>>>> d = cachedTableA1.filter() // User specify that the cached table
>> is
>>>>>>> used.
>>>>>>>> 
>>>>>>>> a.unCache() // Can cachedTableA still be used afterwards?
>>>>>>>> cachedTableA1.uncache() // Can cachedTableA2 still be used?
>>>>>>>> 
>>>>>>>> *Case 2: not returning a CachedTable*
>>>>>>>> b = a.map()
>>>>>>>> a.cache()
>>>>>>>> a.cache() // no-op
>>>>>>>> b.print() // Just to make sure a is cached
>>>>>>>> 
>>>>>>>> c = a.filter(...) // Optimizer decides whether the cache or DAG
>>> should
>>>>>>> be
>>>>>>>> used
>>>>>>>> d = a.filter(...) // Optimizer decides whether the cache or DAG
>>> should
>>>>>>> be
>>>>>>>> used
>>>>>>>> 
>>>>>>>> a.unCache()
>>>>>>>> a.unCache() // no-op
>>>>>>>> 
>>>>>>>> In case 1, semantics-wise, the optimizer loses the option to choose
>>> between
>>>>>>> DAG
>>>>>>>> and cache. And the unCache() call becomes tricky.
>>>>>>>> In case 2, users do not need to worry about whether cache or DAG
>> is
>>>>>>> used.
>>>>>>>> And the unCache() semantic is clear. However, the caveat is that
>>> users
>>>>>>>> cannot explicitly ignore the cache.
>>>>>>>> 
>>>>>>>> In order to address the issues mentioned in case 2 and inspired by
>>> the
>>>>>>>> discussion so far, I am thinking about using a hint to allow users to
>>>>>>> explicitly
>>>>>>>> ignore the cache. Although we do not have hints yet, we probably
>>> should
>>>>>>> have
>>>>>>>> one. So the code becomes:
>>>>>>>> 
>>>>>>>> *Case 3: returning this table*
>>>>>>>> b = a.map()
>>>>>>>> a.cache()
>>>>>>>> a.cache() // no-op
>>>>>>>> b.print() // Just to make sure a is cached
>>>>>>>> 
>>>>>>>> c = a.filter(...) // Optimizer decides whether the cache or DAG
>>> should
>>>>>>> be
>>>>>>>> used
>>>>>>>> d = a.hint("ignoreCache").filter(...) // DAG will be used instead
>> of
>>>> the
>>>>>>>> cache.
>>>>>>>> 
>>>>>>>> a.unCache()
>>>>>>>> a.unCache() // no-op
>>>>>>>> 
>>>>>>>> We could also let cache() return this table to allow chained
>> method
>>>>>>> calls.
>>>>>>>> Do you think this API addresses the concerns?
>>>>>>>> 
>>>>>>>> Thanks,
>>>>>>>> 
>>>>>>>> Jiangjie (Becket) Qin
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Wed, Dec 5, 2018 at 10:55 AM Jark Wu <im...@gmail.com> wrote:
>>>>>>>> 
>>>>>>>>> Hi,
>>>>>>>>> 
>>>>>>>>> All the recent discussions are focused on whether there is a
>>> problem
>>>> if
>>>>>>>>> cache() does not return a Table.
>>>>>>>>> It seems that returning a Table explicitly is more clear (and
>>> safe?).
>>>>>>>>> 
>>>>>>>>> So are there any problems if cache() returns a Table?
>>>> @Becket
>>>>>>>>> 
>>>>>>>>> Best,
>>>>>>>>> Jark
>>>>>>>>> 
>>>>>>>>> On Tue, 4 Dec 2018 at 22:27, Till Rohrmann <trohrmann@apache.org
>>> 
>>>>>>> wrote:
>>>>>>>>> 
>>>>>>>>>> It's true that b, c, d and e will all read from the original DAG
>>>> that
>>>>>>>>>> generates a. But all subsequent operators (when running multiple
>>>>>>> queries)
>>>>>>>>>> which reference cachedTableA should not need to reproduce `a`
>> but
>>>>>>>>> directly
>>>>>>>>>> consume the intermediate result.
>>>>>>>>>> 
>>>>>>>>>> Conceptually one could think of cache() as introducing a caching
>>>>>>> operator
>>>>>>>>>> from which you need to consume from if you want to benefit from
>>> the
>>>>>>>>> caching
>>>>>>>>>> functionality.
>>>>>>>>>> 
>>>>>>>>>> I agree, ideally the optimizer makes this kind of decision which
>>>>>>>>>> intermediate result should be cached. But especially when
>>> executing
>>>>>>>>> ad-hoc
>>>>>>>>>> queries the user might better know which results need to be
>> cached
>>>>>>>>> because
>>>>>>>>>> Flink might not see the full DAG. In that sense, I would
>> consider
>>>> the
>>>>>>>>>> cache() method as a hint for the optimizer. Of course, in the
>>> future
>>>>>>> we
>>>>>>>>>> might add functionality which tries to automatically cache
>> results
>>>>>>> (e.g.
>>>>>>>>>> caching the latest intermediate results until so and so much
>> space
>>>> is
>>>>>>>>>> used). But this should hopefully not contradict with
>> `CachedTable
>>>>>>>>> cache()`.
>>>>>>>>>> 
>>>>>>>>>> Cheers,
>>>>>>>>>> Till
>>>>>>>>>> 
>>>>>>>>>> On Tue, Dec 4, 2018 at 2:33 PM Becket Qin <becket.qin@gmail.com
>>> 
>>>>>>> wrote:
>>>>>>>>>> 
>>>>>>>>>>> Hi Till,
>>>>>>>>>>> 
>>>>>>>>>>> Thanks for the clarification. I am still a little confused.
>>>>>>>>>>> 
>>>>>>>>>>> If cache() returns a CachedTable, the example might become:
>>>>>>>>>>> 
>>>>>>>>>>> b = a.map(...)
>>>>>>>>>>> c = a.map(...)
>>>>>>>>>>> 
>>>>>>>>>>> cachedTableA = a.cache()
>>>>>>>>>>> d = cachedTableA.map(...)
>>>>>>>>>>> e = a.map()
>>>>>>>>>>> 
>>>>>>>>>>> In the above case, if cache() is lazily evaluated, b, c, d and
>> e
>>>> are
>>>>>>>>> all
>>>>>>>>>>> going to be reading from the original DAG that generates a. But
>>>> with
>>>>>>> a
>>>>>>>>>>> naive expectation, d should be reading from the cache. This
>> seems
>>>> not
>>>>>>>>>>> solving the potential confusion you raised, right?
>>>>>>>>>>> 
>>>>>>>>>>> Just to be clear, my understanding is all based on the
>>> assumption
>>>>>>> that
>>>>>>>>>> the
>>>>>>>>>>> tables are immutable. Therefore, after a.cache(), the
>>>>>>> *cachedTableA*
>>>>>>>>>> and
>>>>>>>>>>> original table *a* should be completely interchangeable.
>>>>>>>>>>> 
>>>>>>>>>>> That said, I think a valid argument is optimization. There are
>>>> indeed
>>>>>>>>>> cases
>>>>>>>>>>> where reading from the original DAG could be faster than reading
>>>> from
>>>>>>>>> the
>>>>>>>>>>> cache. For example, in the following case:
>>>>>>>>>>> 
>>>>>>>>>>> a.filter(f1' > 100)
>>>>>>>>>>> a.cache()
>>>>>>>>>>> b = a.filter(f1' < 100)
>>>>>>>>>>> 
>>>>>>>>>>> Ideally the optimizer should be intelligent enough to decide
>>> which
>>>>>>> way
>>>>>>>>> is
>>>>>>>>>>> faster, without user intervention. In this case, it will
>> identify
>>>>>>> that
>>>>>>>>> b
>>>>>>>>>>> would just be an empty table, thus skip reading from the cache
>>>>>>>>>> completely.
>>>>>>>>>>> But I agree that returning a CachedTable would give the user the
>>>> control
>>>>>>> of
>>>>>>>>>>> when to use the cache, even though I still feel that letting the
>>>>>>> optimizer
>>>>>>>>>>> handle this is a better option in the long run.
>>>>>>>>>>> 
>>>>>>>>>>> Thanks,
>>>>>>>>>>> 
>>>>>>>>>>> Jiangjie (Becket) Qin
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> On Tue, Dec 4, 2018 at 6:51 PM Till Rohrmann <
>>> trohrmann@apache.org
>>>>> 
>>>>>>>>>> wrote:
>>>>>>>>>>> 
>>>>>>>>>>>> Yes you are right Becket that it still depends on the actual
>>>>>>>>> execution
>>>>>>>>>> of
>>>>>>>>>>>> the job whether a consumer reads from a cached result or not.
>>>>>>>>>>>> 
>>>>>>>>>>>> My point was actually about the properties of a (cached vs.
>>>>>>>>> non-cached)
>>>>>>>>>>> and
>>>>>>>>>>>> not about the execution. I would not make cache trigger the
>>>>>>> execution
>>>>>>>>>> of
>>>>>>>>>>>> the job because one loses some flexibility by eagerly
>> triggering
>>>> the
>>>>>>>>>>>> execution.
>>>>>>>>>>>> 
>>>>>>>>>>>> I tried to argue for an explicit CachedTable which is returned
>>> by
>>>>>>> the
>>>>>>>>>>>> cache() method like Piotr did in order to make the API more
>>>>>>> explicit.
>>>>>>>>>>>> 
>>>>>>>>>>>> Cheers,
>>>>>>>>>>>> Till
>>>>>>>>>>>> 
>>>>>>>>>>>> On Mon, Dec 3, 2018 at 4:23 PM Becket Qin <
>> becket.qin@gmail.com
>>>> 
>>>>>>>>>> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>>> Hi Till,
>>>>>>>>>>>>> 
>>>>>>>>>>>>> That is a good example. Just a minor correction, in this
>> case,
>>>> b, c
>>>>>>>>>>> and d
>>>>>>>>>>>>> will all consume from a non-cached a. This is because cache
>>> will
>>>>>>>>> only
>>>>>>>>>>> be
>>>>>>>>>>>>> created on the very first job submission that generates the
>>> table
>>>>>>>>> to
>>>>>>>>>> be
>>>>>>>>>>>>> cached.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> If I understand correctly, this example is about whether the
>>>>>>>>> .cache()
>>>>>>>>>>>> method
>>>>>>>>>>>>> should be eagerly evaluated or lazily evaluated. In other
>>> words,
>>>>>>>>> if
>>>>>>>>>>>>> cache() method actually triggers a job that creates the
>> cache,
>>>>>>>>> there
>>>>>>>>>>> will
>>>>>>>>>>>>> be no such confusion. Is that right?
>>>>>>>>>>>>> 
>>>>>>>>>>>>> In the example, although d will not consume from the cached
>>> Table
>>>>>>>>>> while
>>>>>>>>>>>> it
>>>>>>>>>>>>> looks supposed to, from a correctness perspective the code will
>>>> still
>>>>>>>>>>>> return
>>>>>>>>>>>>> the correct result, assuming that tables are immutable.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Personally I feel it is OK because users probably won't
>> really
>>>>>>>>> worry
>>>>>>>>>>>> about
>>>>>>>>>>>>> whether the table is cached or not. And lazy cache could
>> avoid
>>>> some
>>>>>>>>>>>>> unnecessary caching if a cached table is never created in the
>>>> user
>>>>>>>>>>>>> application. But I am not opposed to eager evaluation of the
>>>>>>>>>>>> cache.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Jiangjie (Becket) Qin
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On Mon, Dec 3, 2018 at 10:01 PM Till Rohrmann <
>>>>>>>>> trohrmann@apache.org>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Another argument for Piotr's point is that lazily changing
>>>>>>>>>> properties
>>>>>>>>>>>> of
>>>>>>>>>>>>> a
>>>>>>>>>>>>>> node affects all downstream consumers but does not
>>> necessarily
>>>>>>>>>> have
>>>>>>>>>>> to
>>>>>>>>>>>>>> happen before these consumers are defined. From a user's
>>>>>>>>>> perspective
>>>>>>>>>>>> this
>>>>>>>>>>>>>> can be quite confusing:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> b = a.map(...)
>>>>>>>>>>>>>> c = a.map(...)
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> a.cache()
>>>>>>>>>>>>>> d = a.map(...)
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> now b, c and d will consume from a cached operator. In this
>>>> case,
>>>>>>>>>> the
>>>>>>>>>>>>> user
>>>>>>>>>>>>>> would most likely expect that only d reads from a cached
>>> result.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>> Till
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On Mon, Dec 3, 2018 at 11:32 AM Piotr Nowojski <
>>>>>>>>>>>> piotr@data-artisans.com>
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Hey Shaoxuan and Becket,
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Can you explain a bit more on what are the side effects?
>> So
>>>>>>>>>> far
>>>>>>>>>>> my
>>>>>>>>>>>>>>>> understanding is that such side effects only exist if a
>>> table
>>>>>>>>>> is
>>>>>>>>>>>>>> mutable.
>>>>>>>>>>>>>>>> Is that the case?
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Not only that. There are also performance implications and
>>>>>>>>> those
>>>>>>>>>>> are
>>>>>>>>>>>>>>> other implicit side effects of using `void cache()`. As I
>>>>>>>>> wrote
>>>>>>>>>>>>> before,
>>>>>>>>>>>>>>> reading from cache might not always be desirable, thus it
>> can
>>>>>>>>>> cause
>>>>>>>>>>>>>>> performance degradation and I’m fine with that - user's or
>>>>>>>>>>>> optimiser’s
>>>>>>>>>>>>>>> choice. What I do not like is that this implicit side
>> effect
>>>>>>>>> can
>>>>>>>>>>>>> manifest
>>>>>>>>>>>>>>> in completely different part of code, that wasn’t touched
>> by
>>> a
>>>>>>>>>> user
>>>>>>>>>>>>> while
>>>>>>>>>>>>>>> he was adding `void cache()` call somewhere else. And even
>> if
>>>>>>>>>>> caching
>>>>>>>>>>>>>>> improves performance, it’s still a side effect of `void
>>>>>>>>> cache()`.
>>>>>>>>>>>>> Almost
>>>>>>>>>>>>>>> from the definition `void` methods have only side effects.
>>> As I
>>>>>>>>>>> wrote
>>>>>>>>>>>>>>> before, there are couple of scenarios where this might be
>>>>>>>>>>> undesirable
>>>>>>>>>>>>>>> and/or unexpected, for example:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 1.
>>>>>>>>>>>>>>> Table b = …;
>>>>>>>>>>>>>>> b.cache()
>>>>>>>>>>>>>>> x = b.join(…)
>>>>>>>>>>>>>>> y = b.count()
>>>>>>>>>>>>>>> // ...
>>>>>>>>>>>>>>> // 100
>>>>>>>>>>>>>>> // hundred
>>>>>>>>>>>>>>> // lines
>>>>>>>>>>>>>>> // of
>>>>>>>>>>>>>>> // code
>>>>>>>>>>>>>>> // later
>>>>>>>>>>>>>>> z = b.filter(…).groupBy(…) // this might be even hidden in
>> a
>>>>>>>>>>>> different
>>>>>>>>>>>>>>> method/file/package/dependency
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 2.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Table b = ...
>>>>>>>>>>>>>>> If (some_condition) {
>>>>>>>>>>>>>>> foo(b)
>>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>> Else {
>>>>>>>>>>>>>>> bar(b)
>>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>> z = b.filter(…).groupBy(…)
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Void foo(Table b) {
>>>>>>>>>>>>>>> b.cache()
>>>>>>>>>>>>>>> // do something with b
>>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> In both above examples, `b.cache()` will implicitly affect
>>>>>>>>>>> (semantic
>>>>>>>>>>>>> of a
>>>>>>>>>>>>>>> program in case of sources being mutable and performance)
>> `z
>>> =
>>>>>>>>>>>>>>> b.filter(…).groupBy(…)` which might be far from obvious.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> On top of that, there is still this argument of mine that
>>>>>>>>> having
>>>>>>>>>> a
>>>>>>>>>>>>>>> `MaterializedTable` or `CachedTable` handle is more
>> flexible
>>>>>>>>> for
>>>>>>>>>> us
>>>>>>>>>>>> for
>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>> future and for the user (as a manual option to bypass cache
>>>>>>>>>> reads).
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> But Jiangjie is correct,
>>>>>>>>>>>>>>>> the source table in batching should be immutable. It is
>> the
>>>>>>>>>>> user’s
>>>>>>>>>>>>>>>> responsibility to ensure it, otherwise even a regular
>>>>>>>>> failover
>>>>>>>>>>> may
>>>>>>>>>>>>> lead
>>>>>>>>>>>>>>>> to inconsistent results.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Yes, I agree that’s what perfect world/good deployment
>> should
>>>>>>>>> be.
>>>>>>>>>>> But
>>>>>>>>>>>>> it
>>>>>>>>>>>>>>> often isn’t and while I’m not trying to fix this (since the
>>>>>>>>>> proper
>>>>>>>>>>>> fix
>>>>>>>>>>>>> is
>>>>>>>>>>>>>>> to support transactions), I’m just trying to minimise
>>> confusion
>>>>>>>>>> for
>>>>>>>>>>>> the
>>>>>>>>>>>>>>> users that are not fully aware what’s going on and operate
>> in
>>>>>>>>>> a less
>>>>>>>>>>>> than
>>>>>>>>>>>>>>> perfect setup. And if something bites them after adding
>>>>>>>>>> `b.cache()`
>>>>>>>>>>>>> call,
>>>>>>>>>>>>>>> to make sure that they at least know all of the places that
>>>>>>>>>> adding
>>>>>>>>>>>> this
>>>>>>>>>>>>>>> line can affect.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Thanks, Piotrek
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> On 1 Dec 2018, at 15:39, Becket Qin <becket.qin@gmail.com
>>> 
>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Hi Piotrek,
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Thanks again for the clarification. Some more replies are
>>>>>>>>>>>> following.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> But keep in mind that `.cache()` will/might not only be
>> used
>>>>>>>>> in
>>>>>>>>>>>>>>> interactive
>>>>>>>>>>>>>>>>> programming and not only in batching.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> It is true. Actually in stream processing, cache() has the
>>>>>>>>> same
>>>>>>>>>>>>>> semantic
>>>>>>>>>>>>>>> as
>>>>>>>>>>>>>>>> batch processing. The semantic is following:
>>>>>>>>>>>>>>>> For a table created via a series of computation, save that
>>>>>>>>>> table
>>>>>>>>>>>> for
>>>>>>>>>>>>>>> later
>>>>>>>>>>>>>>>> reference to avoid running the computation logic to
>>>>>>>>> regenerate
>>>>>>>>>>> the
>>>>>>>>>>>>>> table.
>>>>>>>>>>>>>>>> Once the application exits, drop all the cache.
>>>>>>>>>>>>>>>> This semantic is the same for both batch and stream
>> processing.
>>>>>>>>> The
>>>>>>>>>>>>>>> difference
>>>>>>>>>>>>>>>> is that stream applications will only run once as they are
>>>>>>>>> long
>>>>>>>>>>>>>> running.
>>>>>>>>>>>>>>>> And the batch applications may be run multiple times,
>> hence
>>>>>>>>> the
>>>>>>>>>>>> cache
>>>>>>>>>>>>>> may
>>>>>>>>>>>>>>>> be created and dropped each time the application runs.
>>>>>>>>>>>>>>>> Admittedly, there will probably be some resource
>> management
>>>>>>>>>>>>>> requirements
>>>>>>>>>>>>>>>> for the streaming cached table, such as time based / size
>>>>>>>>> based
>>>>>>>>>>>>>>> retention,
>>>>>>>>>>>>>>>> to address the infinite data issue. But such requirement
>>> does
>>>>>>>>>> not
>>>>>>>>>>>>>> change
>>>>>>>>>>>>>>>> the semantic.
>>>>>>>>>>>>>>>> You are right that interactive programming is just one use
>>>>>>>>> case
>>>>>>>>>>> of
>>>>>>>>>>>>>>> cache().
>>>>>>>>>>>>>>>> It is not the only use case.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> For me the more important issue is of not having the `void
>>>>>>>>>>> cache()`
>>>>>>>>>>>>>> with
>>>>>>>>>>>>>>>>> side effects.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> This is indeed the key point. The argument around whether
>>>>>>>>>> cache()
>>>>>>>>>>>>>> should
>>>>>>>>>>>>>>>> return something already indicates that cache() and
>>>>>>>>>> materialize()
>>>>>>>>>>>>>> address
>>>>>>>>>>>>>>>> different issues.
>>>>>>>>>>>>>>>> Can you explain a bit more on what are the side effects?
>> So
>>>>>>>>>> far
>>>>>>>>>>> my
>>>>>>>>>>>>>>>> understanding is that such side effects only exist if a
>>> table
>>>>>>>>>> is
>>>>>>>>>>>>>> mutable.
>>>>>>>>>>>>>>>> Is that the case?
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> I don’t know, probably initially we should make
>> CachedTable
>>>>>>>>>>>>> read-only.
>>>>>>>>>>>>>> I
>>>>>>>>>>>>>>>>> don’t find it more confusing than the fact that user can
>>> not
>>>>>>>>>>> write
>>>>>>>>>>>>> to
>>>>>>>>>>>>>>> views
>>>>>>>>>>>>>>>>> or materialised views in SQL or that user currently can
>> not
>>>>>>>>>>> write
>>>>>>>>>>>>> to a
>>>>>>>>>>>>>>>>> Table.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> I don't think anyone should insert something into a cache.
>> By
>>>>>>>>>>>>> definition
>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>> cache should only be updated when the corresponding
>> original
>>>>>>>>>>> table
>>>>>>>>>>>> is
>>>>>>>>>>>>>>>> updated. What I am wondering is that given the following
>> two
>>>>>>>>>>> facts:
>>>>>>>>>>>>>>>> 1. If and only if a table is mutable (with something like
>>>>>>>>>>>> insert()),
>>>>>>>>>>>>> a
>>>>>>>>>>>>>>>> CachedTable may have implicit behavior.
>>>>>>>>>>>>>>>> 2. A CachedTable extends a Table.
>>>>>>>>>>>>>>>> We can come to the conclusion that a CachedTable is
>> mutable
>>>>>>>>> and
>>>>>>>>>>>> users
>>>>>>>>>>>>>> can
>>>>>>>>>>>>>>>> insert into the CachedTable directly. This is where I
>>> thought
>>>>>>>>>>>>>> it was confusing.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Jiangjie (Becket) Qin
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> On Sat, Dec 1, 2018 at 2:45 AM Piotr Nowojski <
>>>>>>>>>>>>> piotr@data-artisans.com
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Regarding naming `cache()` vs `materialize()`. One more
>>>>>>>>>>>> explanation
>>>>>>>>>>>>>> why
>>>>>>>>>>>>>>> I
>>>>>>>>>>>>>>>>> think `materialize()` is more natural to me is that I
>> think
>>>>>>>>> of
>>>>>>>>>>> all
>>>>>>>>>>>>>>> “Table”s
>>>>>>>>>>>>>>>>> in Table-API as views. They behave the same way as SQL
>>>>>>>>> views,
>>>>>>>>>>> the
>>>>>>>>>>>>> only
>>>>>>>>>>>>>>>>> difference for me is that their lifespan is short -
>>>>>>>>> current
>>>>>>>>>>>>> session
>>>>>>>>>>>>>>> which
>>>>>>>>>>>>>>>>> is limited by a different execution model. That’s why
>>>>>>>>> “caching”
>>>>>>>>>> a
>>>>>>>>>>>> view
>>>>>>>>>>>>>>> for me
>>>>>>>>>>>>>>>>> is just materialising it.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> However I see and I understand your point of view. Coming
>>>>>>>>> from
>>>>>>>>>>>>>>>>> DataSet/DataStream and generally speaking non-SQL world,
>>>>>>>>>>> `cache()`
>>>>>>>>>>>>> is
>>>>>>>>>>>>>>> more
>>>>>>>>>>>>>>>>> natural. But keep in mind that `.cache()` will/might not
>>>>>>>>> only
>>>>>>>>>> be
>>>>>>>>>>>>> used
>>>>>>>>>>>>>> in
>>>>>>>>>>>>>>>>> interactive programming and not only in batching. But
>>> naming
>>>>>>>>>> is
>>>>>>>>>>>> one
>>>>>>>>>>>>>>> issue,
>>>>>>>>>>>>>>>>> and not that critical to me. Especially that once we
>>>>>>>>> implement
>>>>>>>>>>>>> proper
>>>>>>>>>>>>>>>>> materialised views, we can always deprecate/rename
>>> `cache()`
>>>>>>>>>> if
>>>>>>>>>>> we
>>>>>>>>>>>>>> deem
>>>>>>>>>>>>>>> so.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> For me the more important issue is of not having the
>> `void
>>>>>>>>>>>> cache()`
>>>>>>>>>>>>>> with
>>>>>>>>>>>>>>>>> side effects. Exactly for the reasons that you have
>>>>>>>>> mentioned.
>>>>>>>>>>>> True:
>>>>>>>>>>>>>>>>> results might be non deterministic if underlying source
>>>>>>>>> table
>>>>>>>>>>> are
>>>>>>>>>>>>>>> changing.
>>>>>>>>>>>>>>>>> Problem is that `void cache()` implicitly changes the
>>>>>>>>> semantic
>>>>>>>>>>> of
>>>>>>>>>>>>>>>>> subsequent uses of the cached/materialized Table. It can
>>>>>>>>> cause
>>>>>>>>>>>> “wtf”
>>>>>>>>>>>> a “wtf”
>>>>>>>>>>>>>>>>> for a user if he inserts “b.cache()” call in some place
>> in
>>>>>>>>> his
>>>>>>>>>>>> code
>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>> suddenly some other random places are behaving
>> differently.
>>>>>>>>> If
>>>>>>>>>>>>>>>>> `materialize()` or `cache()` returns a Table handle, we
>>>>>>>>> force
>>>>>>>>>>> user
>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>> explicitly use the cache which removes the “random” part
>>>>>>>>> from
>>>>>>>>>>> the
>>>>>>>>>>>>>>> "suddenly
>>>>>>>>>>>>>>>>> some other random places are behaving differently”.
>>>>>>>>>>>>>>>>> 
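>>>>>>>>>>>>>>>>> To illustrate the difference with a small sketch (`CachedTable`
>>>>>>>>>>>>>>>>> here is the proposed handle type, nothing that exists yet):
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> // (a) void cache(): every existing reference to b silently
>>>>>>>>>>>>>>>>> // starts reading from the cache
>>>>>>>>>>>>>>>>> b.cache();
>>>>>>>>>>>>>>>>> long x = b.count(); // served from the cache, intended or not
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> // (b) cache() returning a handle: the caller opts in explicitly
>>>>>>>>>>>>>>>>> CachedTable cb = b.cache();
>>>>>>>>>>>>>>>>> long fresh = b.count();   // bypasses the cache, re-executes
>>>>>>>>>>>>>>>>> long cached = cb.count(); // explicitly reads the cache
>>>>>>>>>>>>>>>>> 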
>>>>>>>>>>>>>>>>> This argument and others that I’ve raised (greater
>>>>>>>>>>>>>> flexibility/allowing
>>>>>>>>>>>>>>>>> user to explicitly bypass the cache) are independent of
>>>>>>>>>>> `cache()`
>>>>>>>>>>>> vs
>>>>>>>>>>>>>>>>> `materialize()` discussion.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Does that mean one can also insert into the CachedTable?
>>>>>>>>> This
>>>>>>>>>>>>> sounds
>>>>>>>>>>>>>>>>> pretty confusing.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> I don’t know, probably initially we should make
>> CachedTable
>>>>>>>>>>>>>> read-only. I
>>>>>>>>>>>>>>>>> don’t find it more confusing than the fact that user can
>>> not
>>>>>>>>>>> write
>>>>>>>>>>>>> to
>>>>>>>>>>>>>>> views
>>>>>>>>>>>>>>>>> or materialised views in SQL or that user currently can
>> not
>>>>>>>>>>> write
>>>>>>>>>>>>> to a
>>>>>>>>>>>>>>>>> Table.
>>>>>>>>>>>>>>>>> 
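>>>>>>>>>>>>>>>>> For illustration, one way a read-only handle could be expressed
>>>>>>>>>>>>>>>>> (purely a sketch, not an existing class):
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> class CachedTable extends Table {
>>>>>>>>>>>>>>>>>   @Override
>>>>>>>>>>>>>>>>>   public void insertInto(String targetTable) {
>>>>>>>>>>>>>>>>>     // reject writes: the cache only mirrors its source table
>>>>>>>>>>>>>>>>>     throw new UnsupportedOperationException("read-only cache");
>>>>>>>>>>>>>>>>>   }
>>>>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>>>> 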
>>>>>>>>>>>>>>>>> Piotrek
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> On 30 Nov 2018, at 17:38, Xingcan Cui <
>> xingcanc@gmail.com
>>>> 
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> I agree with @Becket that `cache()` and `materialize()`
>>>>>>>>>> should
>>>>>>>>>>> be
>>>>>>>>>>>>>>>>> considered as two different methods where the later one
>> is
>>>>>>>>>> more
>>>>>>>>>>>>>>>>> sophisticated.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> According to my understanding, the initial idea is just
>> to
>>>>>>>>>>>>> introduce
>>>>>>>>>>>>>> a
>>>>>>>>>>>>>>>>>> simple cache or persist mechanism, but as the TableAPI is a
>>>>>>>>>>>>>>>>>> high-level API, it’s natural for us to think in a SQL way.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Maybe we can add the `cache()` method to the DataSet API
>>>>>>>>> and
>>>>>>>>>>>> force
>>>>>>>>>>>>>>> users
>>>>>>>>>>>>>>>>> to translate a Table to a Dataset before caching it. Then
>>>>>>>>> the
>>>>>>>>>>>> users
>>>>>>>>>>>>>>> should
>>>>>>>>>>>>>>>>> manually register the cached dataset to a table again (we
>>>>>>>>> may
>>>>>>>>>>> need
>>>>>>>>>>>>>> some
>>>>>>>>>>>>>>>>> table replacement mechanisms for datasets with an
>> identical
>>>>>>>>>>> schema
>>>>>>>>>>>>> but
>>>>>>>>>>>>>>>>>> different contents here). After all, it’s the dataset rather
>>>>>>>>>>>>>>>>>> than the dynamic table that needs to be cached, right?
>>>>>>>>>>>>>>>>>> 
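>>>>>>>>>>>>>>>>>> For illustration only, a minimal sketch of this alternative
>>>>>>>>>>>>>>>>>> (`DataSet.cache()` does not exist today and is hypothetical):
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> // translate the Table to a DataSet, cache it, register it back
>>>>>>>>>>>>>>>>>> DataSet<Row> ds = tEnv.toDataSet(t1, Row.class).cache(); // hypothetical
>>>>>>>>>>>>>>>>>> tEnv.registerDataSet("t1_cached", ds); // re-register for Table API use
>>>>>>>>>>>>>>>>>> Table t1Cached = tEnv.scan("t1_cached");
>>>>>>>>>>>>>>>>>> 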
>>>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>>>> Xingcan
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> On Nov 30, 2018, at 10:57 AM, Becket Qin <
>>>>>>>>>>> becket.qin@gmail.com>
>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Hi Piotrek and Jark,
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Thanks for the feedback and explanation. Those are good
>>>>>>>>>>>> arguments.
>>>>>>>>>>>>>>> But I
>>>>>>>>>>>>>>>>>>> think those arguments are mostly about materialized
>> view.
>>>>>>>>>> Let
>>>>>>>>>>> me
>>>>>>>>>>>>> try
>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>> explain the reason I believe cache() and materialize()
>>> are
>>>>>>>>>>>>>> different.
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> I think cache() and materialize() have quite different
>>>>>>>>>>>>> implications.
>>>>>>>>>>>>>>> An
>>>>>>>>>>>>>>>>>>> analogy I can think of is save()/publish(). When users
>>>>>>>>> call
>>>>>>>>>>>>> cache(),
>>>>>>>>>>>>>>> it
>>>>>>>>>>>>>>>>> is
>>>>>>>>>>>>>>>>>>> just like they are saving an intermediate result as a
>>>>>>>>> draft
>>>>>>>>>> of
>>>>>>>>>>>>> their
>>>>>>>>>>>>>>>>> work,
>>>>>>>>>>>>>>>>>>> this intermediate result may not have any realistic
>>>>>>>>> meaning.
>>>>>>>>>>>>> Calling
>>>>>>>>>>>>>>>>>>> cache() does not mean users want to publish the cached
>>>>>>>>> table
>>>>>>>>>>> in
>>>>>>>>>>>>> any
>>>>>>>>>>>>>>>>> manner.
>>>>>>>>>>>>>>>>>>> But when users call materialize(), that means "I have
>>>>>>>>>>> something
>>>>>>>>>>>>>>>>> meaningful
>>>>>>>>>>>>>>>>>>> to be reused by others", now users need to think about
>>> the
>>>>>>>>>>>>>> validation,
>>>>>>>>>>>>>>>>>>> update & versioning, lifecycle of the result, etc.
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Piotrek's suggestions on variations of the
>> materialize()
>>>>>>>>>>> methods
>>>>>>>>>>>>> are
>>>>>>>>>>>>>>>>> very
>>>>>>>>>>>>>>>>>>> useful. It would be great if Flink have them. The
>> concept
>>>>>>>>> of
>>>>>>>>>>>>>>>>> materialized
>>>>>>>>>>>>>>>>>>> view is actually a pretty big feature, not to say the
>>>>>>>>>> related
>>>>>>>>>>>>> stuff
>>>>>>>>>>>>>>> like
>>>>>>>>>>>>>>>>>>> triggers/hooks you mentioned earlier. I think the
>>>>>>>>>> materialized
>>>>>>>>>>>>> view
>>>>>>>>>>>>>>>>> itself
>>>>>>>>>>>>>>>>>>> should be discussed in a more thorough and systematic
>>>>>>>>>> manner.
>>>>>>>>>>>> And
>>>>>>>>>>>>> I
>>>>>>>>>>>>>>>>> found
>>>>>>>>>>>>>>>>>>> that discussion is kind of orthogonal and way beyond
>>>>>>>>>>> interactive
>>>>>>>>>>>>>>>>>>> programming experience.
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> The example you gave was interesting. I still have some
>>>>>>>>>>>> questions,
>>>>>>>>>>>>>>>>> though.
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Table source = … // some source that scans files from a
>>>>>>>>>>>> directory
>>>>>>>>>>>>>>>>>>>> “/foo/bar/“
>>>>>>>>>>>>>>>>>>>> Table t1 = source.groupBy(…).select(…).where(…) ….;
>>>>>>>>>>>>>>>>>>>> Table t2 = t1.materialize() // (or `cache()`)
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> t2.count() // initialise cache (if it’s lazily
>>>>>>>>> initialised)
>>>>>>>>>>>>>>>>>>>> int a1 = t1.count()
>>>>>>>>>>>>>>>>>>>> int b1 = t2.count()
>>>>>>>>>>>>>>>>>>>> // something in the background (or we trigger it)
>> writes
>>>>>>>>>> new
>>>>>>>>>>>>> files
>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>>> /foo/bar
>>>>>>>>>>>>>>>>>>>> int a2 = t1.count()
>>>>>>>>>>>>>>>>>>>> int b2 = t2.count()
>>>>>>>>>>>>>>>>>>>> t2.refresh() // possible future extension, not to be
>>>>>>>>>>>> implemented
>>>>>>>>>>>>> in
>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>> initial version
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> what if someone else added some more files to /foo/bar
>> at
>>>>>>>>>> this
>>>>>>>>>>>>>> point?
>>>>>>>>>>>>>>>>>>> In that case, a3 won't equal b3, and the result becomes
>>>>>>>>>>>>>>>>>>> non-deterministic, right?
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> int a3 = t1.count()
>>>>>>>>>>>>>>>>>>>> int b3 = t2.count()
>>>>>>>>>>>>>>>>>>>> t2.drop() // another possible future extension, manual
>>>>>>>>>>> “cache”
>>>>>>>>>>>>>>> dropping
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> When we talk about interactive programming, in most
>>> cases,
>>>>>>>>>> we
>>>>>>>>>>>> are
>>>>>>>>>>>>>>>>> talking
>>>>>>>>>>>>>>>>>>> about batch applications. A fundamental assumption of
>>> such
>>>>>>>>>>> case
>>>>>>>>>>>> is
>>>>>>>>>>>>>>> that
>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>> source data is complete before the data processing
>>> begins,
>>>>>>>>>> and
>>>>>>>>>>>> the
>>>>>>>>>>>>>>> data
>>>>>>>>>>>>>>>>>>> will not change during the data processing. IMO, if
>>>>>>>>>> additional
>>>>>>>>>>>>> rows
>>>>>>>>>>>>>>>>> needs
>>>>>>>>>>>>>>>>>>> to be added to some source during the processing, it
>>>>>>>>> should
>>>>>>>>>> be
>>>>>>>>>>>>> done
>>>>>>>>>>>>>> in
>>>>>>>>>>>>>>>>> ways
>>>>>>>>>>>>>>>>>>> like unioning the source with another table containing the
>>>>>>>>>>>>>>>>>>> rows to be added.
>>>>>>>>>>>>>>>>>>> 
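>>>>>>>>>>>>>>>>>>> As a one-line sketch (`lateRows` is an illustrative table):
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Table all = source.union(lateRows); // explicit late additions
>>>>>>>>>>>>>>>>>>> 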
>>>>>>>>>>>>>>>>>>> There are a few cases that computations are executed
>>>>>>>>>>> repeatedly
>>>>>>>>>>>> on
>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>> changing data source.
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> For example, people may run a ML training job every
>> hour
>>>>>>>>>> with
>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>> samples
>>>>>>>>>>>>>>>>>>> newly added in the past hour. In that case, the source
>>>>>>>>>>>>>>>>>>> newly added in the past hour. In that case, the source data
>>>>>>>>>>>>>>>>>>> between runs will indeed change. But still, the data remains
>>>>>>>>>>>>>>>>>>> unchanged within one run.
>>>>>>>>>>>>>>>>>>> usually in that case, the result will need versioning,
>>>>>>>>> i.e.
>>>>>>>>>>> for
>>>>>>>>>>>> a
>>>>>>>>>>>>>>> given
>>>>>>>>>>>>>>>>>>> result, it tells that the result is a result from the
>>>>>>>>> source
>>>>>>>>>>>> data
>>>>>>>>>>>>>> by a
>>>>>>>>>>>>>>>>>>> certain timestamp.
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Another example is something like data warehouse. In
>> this
>>>>>>>>>>> case,
>>>>>>>>>>>>>> there
>>>>>>>>>>>>>>>>> are a
>>>>>>>>>>>>>>>>>>> few source of original/raw data. On top of those
>> sources,
>>>>>>>>>> many
>>>>>>>>>>>>>>>>> materialized
>>>>>>>>>>>>>>>>>>> view / queries / reports / dashboards can be created to
>>>>>>>>>>> generate
>>>>>>>>>>>>>>> derived
>>>>>>>>>>>>>>>>>>> data. Those derived data needs to be updated when the
>>>>>>>>>>> underlying
>>>>>>>>>>>>>>>>> original
>>>>>>>>>>>>>>>>>>> data changes. In that case, the processing logic that
>>>>>>>>>> derives
>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>> original
>>>>>>>>>>>>>>>>>>> data needs to be executed repeatedly to update those
>>>>>>>>>>>>> reports/views.
>>>>>>>>>>>>>>>>> Again,
>>>>>>>>>>>>>>>>>>> all those derived data also need to have version
>>>>>>>>> management,
>>>>>>>>>>>> such
>>>>>>>>>>>>> as
>>>>>>>>>>>>>>>>>>> timestamp.
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> In any of the above two cases, during a single run of
>> the
>>>>>>>>>>>>> processing
>>>>>>>>>>>>>>>>> logic,
>>>>>>>>>>>>>>>>>>> the data cannot change. Otherwise the behavior of the
>>>>>>>>>>> processing
>>>>>>>>>>>>>> logic
>>>>>>>>>>>>>>>>> may
>>>>>>>>>>>>>>>>>>> be undefined. In the above two examples, when writing
>> the
>>>>>>>>>>>>> processing
>>>>>>>>>>>>>>>>> logic,
>>>>>>>>>>>>>>>>>>> Users can use .cache() to hint Flink that those results
>>>>>>>>>> should
>>>>>>>>>>>> be
>>>>>>>>>>>>>>> saved
>>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>> avoid repeated computation. And then for the result of
>> my
>>>>>>>>>>>>>> application
>>>>>>>>>>>>>>>>>>> logic, I'll call materialize(), so that these results
>>>>>>>>> could
>>>>>>>>>> be
>>>>>>>>>>>>>> managed
>>>>>>>>>>>>>>>>> by
>>>>>>>>>>>>>>>>>>> the system with versioning, metadata management,
>>> lifecycle
>>>>>>>>>>>>>> management,
>>>>>>>>>>>>>>>>>>> ACLs, etc.
>>>>>>>>>>>>>>>>>>> 
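>>>>>>>>>>>>>>>>>>> As a sketch of that split (the names are made up, and
>>>>>>>>>>>>>>>>>>> materialize() is the future API discussed here):
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Table cleaned = rawSamples.filter("valid").cache(); // a draft,
>>>>>>>>>>>>>>>>>>> // reused below without recomputation within this run
>>>>>>>>>>>>>>>>>>> Table result = cleaned.groupBy("user").select("user, f.avg");
>>>>>>>>>>>>>>>>>>> result.materialize(); // published: versioned and managed
>>>>>>>>>>>>>>>>>>> 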
>>>>>>>>>>>>>>>>>>> It is true we can use materialize() to do the cache()
>>> job,
>>>>>>>>>>> but I
>>>>>>>>>>>>> am
>>>>>>>>>>>>>>>>> really
>>>>>>>>>>>>>>>>>>> reluctant to shoehorn cache() into materialize() and
>>> force
>>>>>>>>>>> users
>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>> worry
>>>>>>>>>>>>>>>>>>> about a bunch of implications that they needn't have
>> to.
>>> I
>>>>>>>>>> am
>>>>>>>>>>>>>>>>> absolutely on
>>>>>>>>>>>>>>>>>>> your side that redundant API is bad. But it is equally
>>>>>>>>>>>>> frustrating,
>>>>>>>>>>>>>> if
>>>>>>>>>>>>>>>>> not
>>>>>>>>>>>>>>>>>>> more, that the same API does different things.
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Jiangjie (Becket) Qin
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> On Fri, Nov 30, 2018 at 10:34 PM Shaoxuan Wang <
>>>>>>>>>>>>> wshaoxuan@gmail.com
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> Thanks Piotrek,
>>>>>>>>>>>>>>>>>>>> You provided a very good example, it explains all the
>>>>>>>>>>>> confusions
>>>>>>>>>>>>> I
>>>>>>>>>>>>>>>>> have.
>>>>>>>>>>>>>>>>>>>> It is clear that there is something we have not
>>>>>>>>> considered
>>>>>>>>>> in
>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>> initial
>>>>>>>>>>>>>>>>>>>> proposal. We intend to force the user to reuse the
>>>>>>>>>>>>>>> cached/materialized
>>>>>>>>>>>>>>>>>>>> table, if its cache() method is executed. We did not
>>>>>>>>>>>>>>>>>>>> expect that users may want to re-execute the plan from the
>>>>>>>>>>>>>>>>>>>> source table. Let me re-think about it and get back to you
>>>>>>>>>>>>>>>>>>>> later.
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> In the meantime, this example/observation also implies
>>>>>>>>>>>>>>>>>>>> that we cannot fully
>>>>>>>>>>>>>>>>>>>> involve the optimizer to decide the plan if a
>>>>>>>>>>> cache/materialize
>>>>>>>>>>>>> is
>>>>>>>>>>>>>>>>>>>> explicitly used, because whether to reuse the cached data
>>>>>>>>>>>>>>>>>>>> or re-execute the query from the source data may lead to
>>>>>>>>>>>>>>>>>>>> different results. (But I guess the optimizer can still
>>>>>>>>>>>>>>>>>>>> help in some cases ---- as long as it does not re-execute
>>>>>>>>>>>>>>>>>>>> from the varied source, we should be safe.)
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>>> Shaoxuan
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> On Fri, Nov 30, 2018 at 9:13 PM Piotr Nowojski <
>>>>>>>>>>>>>>>>> piotr@data-artisans.com>
>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> Hi Shaoxuan,
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> Re 2:
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> Table t3 = methodThatAppliesOperators(t1) // t1 is
>>>>>>>>>> modified
>>>>>>>>>>>>> to->
>>>>>>>>>>>>>>> t1’
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> What do you mean that “ t1 is modified to-> t1’ ” ?
>>> That
>>>>>>>>>>>>>>>>>>>>> `methodThatAppliesOperators()` method has changed
>> it’s
>>>>>>>>>> plan?
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> I was thinking more about something like this:
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> Table source = … // some source that scans files
>> from a
>>>>>>>>>>>>> directory
>>>>>>>>>>>>>>>>>>>>> “/foo/bar/“
>>>>>>>>>>>>>>>>>>>>> Table t1 = source.groupBy(…).select(…).where(…) ….;
>>>>>>>>>>>>>>>>>>>>> Table t2 = t1.materialize() // (or `cache()`)
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> t2.count() // initialise cache (if it’s lazily
>>>>>>>>>> initialised)
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> int a1 = t1.count()
>>>>>>>>>>>>>>>>>>>>> int b1 = t2.count()
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> // something in the background (or we trigger it)
>>> writes
>>>>>>>>>> new
>>>>>>>>>>>>> files
>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>>>> /foo/bar
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> int a2 = t1.count()
>>>>>>>>>>>>>>>>>>>>> int b2 = t2.count()
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> t2.refresh() // possible future extension, not to be
>>>>>>>>>>>> implemented
>>>>>>>>>>>>>> in
>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>> initial version
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> int a3 = t1.count()
>>>>>>>>>>>>>>>>>>>>> int b3 = t2.count()
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> t2.drop() // another possible future extension,
>> manual
>>>>>>>>>>> “cache”
>>>>>>>>>>>>>>>>> dropping
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> assertTrue(a1 == b1) // same results, but b1 comes
>> from
>>>>>>>>>> the
>>>>>>>>>>>>>> “cache"
>>>>>>>>>>>>>>>>>>>>> assertTrue(b1 == b2) // both values come from the
>> same
>>>>>>>>>> cache
>>>>>>>>>>>>>>>>>>>>> assertTrue(a2 > b2) // b2 comes from cache, a2
>>>>>>>>> re-executed
>>>>>>>>>>>> full
>>>>>>>>>>>>>>> table
>>>>>>>>>>>>>>>>>>>> scan
>>>>>>>>>>>>>>>>>>>>> and has more data
>>>>>>>>>>>>>>>>>>>>> assertTrue(b3 > b2) // b3 comes from refreshed cache
>>>>>>>>>>>>>>>>>>>>> assertTrue(b3 == a2 == a3)
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> Piotrek
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> On 30 Nov 2018, at 10:22, Jark Wu <imjark@gmail.com
>>> 
>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> It is an very interesting and useful design!
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> Here I want to share some of my thoughts:
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> 1. Agree that the cache() method should return some
>>>>>>>>>>>>>>>>>>>>>> Table to avoid unexpected problems because of the mutable
>>>>>>>>>>>>>>>>>>>>>> object.
>>>>>>>>>>>>>>>>>>>>>> All the existing methods of Table are returning a
>> new
>>>>>>>>>> Table
>>>>>>>>>>>>>>> instance.
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> 2. I think materialize() would be more consistent
>> with
>>>>>>>>>> SQL,
>>>>>>>>>>>>> this
>>>>>>>>>>>>>>>>> makes
>>>>>>>>>>>>>>>>>>>> it
>>>>>>>>>>>>>>>>>>>>>> possible to support the same feature for SQL
>>>>>>>>>>>>>>>>>>>>>> (materialized view)
>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>>>> keep
>>>>>>>>>>>>>>>>>>>>>> the same API for users in the future.
>>>>>>>>>>>>>>>>>>>>>> But I'm also fine if we choose cache().
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> 3. In the proposal, a TableService (or
>> FlinkService?)
>>>>>>>>> is
>>>>>>>>>>> used
>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>> cache
>>>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>> result of the (intermediate) table.
>>>>>>>>>>>>>>>>>>>>>> But the name of TableService may be a bit too general
>>>>>>>>>>>>>>>>>>>>>> and could be misunderstood at first glance (a metastore
>>>>>>>>>>>>>>>>>>>>>> for tables?).
>>>>>>>>>>>>>>>>>>>>>> Maybe a more specific name would be better, such as
>>>>>>>>>>>>>>>>>>>>>> TableCacheService or TableMaterializeService or something
>>>>>>>>>>>>>>>>>>>>>> else.
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>>>>>>>> Jark
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> On Thu, 29 Nov 2018 at 21:16, Fabian Hueske <
>>>>>>>>>>>> fhueske@gmail.com
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> Thanks for the clarification Becket!
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> I have a few thoughts to share / questions:
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> 1) I'd like to know how you plan to implement the
>>>>>>>>>> feature
>>>>>>>>>>>> on a
>>>>>>>>>>>>>>> plan
>>>>>>>>>>>>>>>>> /
>>>>>>>>>>>>>>>>>>>>>>> planner level.
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> I would imagine the following to happen when
>>>>>>>>>>>>>>>>>>>>>>> Table.cache() is called:
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> 1) immediately optimize the Table and internally
>>>>>>>>> convert
>>>>>>>>>>> it
>>>>>>>>>>>>>> into a
>>>>>>>>>>>>>>>>>>>>>>> DataSet/DataStream. This is necessary, to avoid
>> that
>>>>>>>>>>>> operators
>>>>>>>>>>>>>> of
>>>>>>>>>>>>>>>>>>>> later
>>>>>>>>>>>>>>>>>>>>>>> queries on top of the Table are pushed down.
>>>>>>>>>>>>>>>>>>>>>>> 2) register the DataSet/DataStream as a
>>>>>>>>>>>>>> DataSet/DataStream-backed
>>>>>>>>>>>>>>>>>>>> Table
>>>>>>>>>>>>>>>>>>>>> X
>>>>>>>>>>>>>>>>>>>>>>> 3) add a sink to the DataSet/DataStream. This is
>> the
>>>>>>>>>>>>>>> materialization
>>>>>>>>>>>>>>>>>>>> of
>>>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>> Table X
>>>>>>>>>>>>>>>>>>>>>>> 
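>>>>>>>>>>>>>>>>>>>>>>> As pseudo-code, these three steps could look roughly
>>>>>>>>>>>>>>>>>>>>>>> like this (the helper and sink names are made up, this
>>>>>>>>>>>>>>>>>>>>>>> is not actual planner code):
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> Table cache(Table t, BatchTableEnvironment env) {
>>>>>>>>>>>>>>>>>>>>>>>   // 1) optimize now so later operators are not pushed down
>>>>>>>>>>>>>>>>>>>>>>>   DataSet<Row> ds = env.toDataSet(t, Row.class);
>>>>>>>>>>>>>>>>>>>>>>>   // 2) register the DataSet as a DataSet-backed table X
>>>>>>>>>>>>>>>>>>>>>>>   env.registerDataSet("X", ds);
>>>>>>>>>>>>>>>>>>>>>>>   // 3) add a sink that persists X (the materialization)
>>>>>>>>>>>>>>>>>>>>>>>   ds.output(new TemporaryTableOutputFormat("X")); // made up
>>>>>>>>>>>>>>>>>>>>>>>   return env.scan("X");
>>>>>>>>>>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>>>>>>>>>> 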
>>>>>>>>>>>>>>>>>>>>>>> Based on your proposal the following would happen:
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> Table t1 = ....
>>>>>>>>>>>>>>>>>>>>>>> t1.cache(); // cache() returns void. The logical
>> plan
>>>>>>>>> of
>>>>>>>>>>> t1
>>>>>>>>>>>> is
>>>>>>>>>>>>>>>>>>>> replaced
>>>>>>>>>>>>>>>>>>>>> by
>>>>>>>>>>>>>>>>>>>>>>> a scan of X. There is also a reference to the
>>>>>>>>>>>> materialization
>>>>>>>>>>>>> of
>>>>>>>>>>>>>>> X.
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> t1.count(); // this executes the program, including
>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>> DataSet/DataStream
>>>>>>>>>>>>>>>>>>>>>>> that backs X and the sink that writes the
>>>>>>>>>> materialization
>>>>>>>>>>>> of X
>>>>>>>>>>>>>>>>>>>>>>> t1.count(); // this executes the program, but
>> reads X
>>>>>>>>>> from
>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>> materialization.
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> My question is, how do you determine when the scan of
>>>>>>>>>>>>>>>>>>>>>>> t1 should go against the DataSet/DataStream program and
>>>>>>>>>>>>>>>>>>>>>>> when against the materialization?
>>>>>>>>>>>>>>>>>>>>>>> AFAIK, there is no hook that will tell you that a
>>> part
>>>>>>>>>> of
>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>> program
>>>>>>>>>>>>>>>>>>>>> was
>>>>>>>>>>>>>>>>>>>>>>> executed. Flipping a switch during optimization or
>>>>>>>>> plan
>>>>>>>>>>>>>> generation
>>>>>>>>>>>>>>>>> is
>>>>>>>>>>>>>>>>>>>>> not
>>>>>>>>>>>>>>>>>>>>>>> sufficient as there is no guarantee that the plan
>> is
>>>>>>>>>> also
>>>>>>>>>>>>>>> executed.
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> Overall, this behavior is somewhat similar to what
>> I
>>>>>>>>>>>> proposed
>>>>>>>>>>>>> in
>>>>>>>>>>>>>>>>>>>>>>> FLINK-8950, which does not include persisting the
>>>>>>>>> table,
>>>>>>>>>>> but
>>>>>>>>>>>>>> just
>>>>>>>>>>>>>>>>>>>>>>> optimizing and reregistering it as
>> DataSet/DataStream
>>>>>>>>>>> scan.
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> 2) I think Piotr has a point about the implicit
>>>>>>>>> behavior
>>>>>>>>>>> and
>>>>>>>>>>>>>> side
>>>>>>>>>>>>>>>>>>>>> effects
>>>>>>>>>>>>>>>>>>>>>>> of the cache() method if it does not return
>> anything.
>>>>>>>>>>>>>>>>>>>>>>> Consider the following example:
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> Table t1 = ???
>>>>>>>>>>>>>>>>>>>>>>> Table t2 = methodThatAppliesOperators(t1);
>>>>>>>>>>>>>>>>>>>>>>> Table t3 = methodThatAppliesOtherOperators(t1);
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> In this case, the behavior/performance of the plan
>>>>>>>>> that
>>>>>>>>>>>>> results
>>>>>>>>>>>>>>> from
>>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>> second method call depends on whether t1 was
>> modified
>>>>>>>>> by
>>>>>>>>>>> the
>>>>>>>>>>>>>> first
>>>>>>>>>>>>>>>>>>>>> method
>>>>>>>>>>>>>>>>>>>>>>> or not.
>>>>>>>>>>>>>>>>>>>>>>> This is the classic issue of mutable vs. immutable
>>>>>>>>>>> objects.
>>>>>>>>>>>>>>>>>>>>>>> Also, as Piotr pointed out, it might also be good
>> to
>>>>>>>>>> have
>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>> original
>>>>>>>>>>>>>>>>>>>>> plan
>>>>>>>>>>>>>>>>>>>>>>> of t1, because in some cases it is possible to push
>>>>>>>>>>> filters
>>>>>>>>>>>>> down
>>>>>>>>>>>>>>>>> such
>>>>>>>>>>>>>>>>>>>>> that
>>>>>>>>>>>>>>>>>>>>>>> evaluating the query from scratch might be more
>>>>>>>>>> efficient
>>>>>>>>>>>> than
>>>>>>>>>>>>>>>>>>>> accessing
>>>>>>>>>>>>>>>>>>>>>>> the cache.
>>>>>>>>>>>>>>>>>>>>>>> Moreover, a CachedTable could extend Table and offer a
>>>>>>>>>>>>>>>>>>>>>>> method refresh().
>>>>>>>>>>>>>>>>>>>>>>> This sounds quite useful in an interactive session
>>>>>>>>> mode.
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> 3) Regarding the name, I can see both arguments.
>> IMO,
>>>>>>>>>>>>>>> materialize()
>>>>>>>>>>>>>>>>>>>>> seems
>>>>>>>>>>>>>>>>>>>>>>> to be more future proof.
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> Best, Fabian
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> Am Do., 29. Nov. 2018 um 12:56 Uhr schrieb Shaoxuan
>>>>>>>>>> Wang <
>>>>>>>>>>>>>>>>>>>>>>> wshaoxuan@gmail.com>:
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> Hi Piotr,
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> Thanks for sharing your ideas on the method
>> naming.
>>>>>>>>> We
>>>>>>>>>>> will
>>>>>>>>>>>>>> think
>>>>>>>>>>>>>>>>>>>> about
>>>>>>>>>>>>>>>>>>>>>>>> your suggestions. But I don't understand why we
>> need
>>>>>>>>> to
>>>>>>>>>>>>> change
>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>> return
>>>>>>>>>>>>>>>>>>>>>>>> type of cache().
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> Cache() is a physical operation, it does not
>> change
>>>>>>>>> the
>>>>>>>>>>>> logic
>>>>>>>>>>>>>> of
>>>>>>>>>>>>>>>>>>>>>>>> the `Table`. On the tableAPI layer, we should not
>>>>>>>>>>>> introduce a
>>>>>>>>>>>>>> new
>>>>>>>>>>>>>>>>>>>> table
>>>>>>>>>>>>>>>>>>>>>>>> type unless the logic of table has been changed.
>> If
>>>>>>>>> we
>>>>>>>>>>>>>> introduce
>>>>>>>>>>>>>>> a
>>>>>>>>>>>>>>>>>>>> new
>>>>>>>>>>>>>>>>>>>>>>>> table type `CachedTable`, we need create the same
>>> set
>>>>>>>>>> of
>>>>>>>>>>>>>> methods
>>>>>>>>>>>>>>> of
>>>>>>>>>>>>>>>>>>>>>>> `Table`
>>>>>>>>>>>>>>>>>>>>>>>> for it. I don't think it is worth doing this. Or
>> can
>>>>>>>>>> you
>>>>>>>>>>>>> please
>>>>>>>>>>>>>>>>>>>>> elaborate
>>>>>>>>>>>>>>>>>>>>>>>> more on what could be the "implicit
>> behaviours/side
>>>>>>>>>>>> effects"
>>>>>>>>>>>>>> you
>>>>>>>>>>>>>>>>> are
>>>>>>>>>>>>>>>>>>>>>>>> thinking about?
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>>>>>>> Shaoxuan
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> On Thu, Nov 29, 2018 at 7:05 PM Piotr Nowojski <
>>>>>>>>>>>>>>>>>>>>> piotr@data-artisans.com>
>>>>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> Hi Becket,
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for the response.
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> 1. I wasn’t saying that materialised view must be
>>>>>>>>>>> mutable
>>>>>>>>>>>> or
>>>>>>>>>>>>>>> not.
>>>>>>>>>>>>>>>>>>>> The
>>>>>>>>>>>>>>>>>>>>>>>> same
>>>>>>>>>>>>>>>>>>>>>>>>> thing applies to caches as well. To the
>> contrary, I
>>>>>>>>>>> would
>>>>>>>>>>>>>> expect
>>>>>>>>>>>>>>>>>>>> more
>>>>>>>>>>>>>>>>>>>>>>>>> consistency and updates from something that is
>>>>>>>>> called
>>>>>>>>>>>>> “cache”
>>>>>>>>>>>>>> vs
>>>>>>>>>>>>>>>>>>>>>>>> something
>>>>>>>>>>>>>>>>>>>>>>>>> that’s a “materialised view”. In other words, IMO
>>>>>>>>> most
>>>>>>>>>>>>> caches
>>>>>>>>>>>>>> do
>>>>>>>>>>>>>>>>> not
>>>>>>>>>>>>>>>>>>>>>>>> serve
>>>>>>>>>>>>>>>>>>>>>>>>> you invalid/outdated data and they handle updates
>>> on
>>>>>>>>>>> their
>>>>>>>>>>>>>> own.
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> 2. I don’t think that having in the future two
>> very
>>>>>>>>>>>> similar
>>>>>>>>>>>>>>>>> concepts
>>>>>>>>>>>>>>>>>>>>> of
>>>>>>>>>>>>>>>>>>>>>>>>> `materialized` view and `cache` is a good idea.
>> It
>>>>>>>>>> would
>>>>>>>>>>>> be
>>>>>>>>>>>>>>>>>>>> confusing
>>>>>>>>>>>>>>>>>>>>>>> for
>>>>>>>>>>>>>>>>>>>>>>>>> the users. I think it could be handled by
>>>>>>>>>>>>>> variations/overloading
>>>>>>>>>>>>>>>>> of
>>>>>>>>>>>>>>>>>>>>>>>>> materialised view concept. We could start with:
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> `MaterializedTable materialize()` - immutable,
>>>>>>>>>>>>>>>>>>>>>>>>> session life scope (basically the same semantics as
>>>>>>>>>>>>>>>>>>>>>>>>> you are proposing)
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> And then in the future (if ever) build on top of
>>>>>>>>>>>> that/expand
>>>>>>>>>>>>>> it
>>>>>>>>>>>>>>>>>>>> with:
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> `MaterializedTable materialize(refreshTime=…)` or
>>>>>>>>>>>>>>>>> `MaterializedTable
>>>>>>>>>>>>>>>>>>>>>>>>> materialize(refreshHook=…)`
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> Or with cross session support:
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> `MaterializedTable materializeInto(connector=…)`
>> or
>>>>>>>>>>>>>>>>>>>> `MaterializedTable
>>>>>>>>>>>>>>>>>>>>>>>>> materializeInto(tableFactory=…)`
>>>>>>>>>>>>>>>>>>>>>>>>> 
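>>>>>>>>>>>>>>>>>>>>>>>>> As a rough sketch of that API surface (every name here
>>>>>>>>>>>>>>>>>>>>>>>>> is illustrative, nothing of this exists yet):
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> interface Table {
>>>>>>>>>>>>>>>>>>>>>>>>>   // immutable snapshot, dropped when the session ends
>>>>>>>>>>>>>>>>>>>>>>>>>   MaterializedTable materialize();
>>>>>>>>>>>>>>>>>>>>>>>>>   // possible future variants:
>>>>>>>>>>>>>>>>>>>>>>>>>   MaterializedTable materialize(Duration refreshTime);
>>>>>>>>>>>>>>>>>>>>>>>>>   MaterializedTable materializeInto(TableSink sink);
>>>>>>>>>>>>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> interface MaterializedTable extends Table {
>>>>>>>>>>>>>>>>>>>>>>>>>   void refresh(); // re-run the plan, update the data
>>>>>>>>>>>>>>>>>>>>>>>>>   void drop();    // release the materialization
>>>>>>>>>>>>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>>>>>>>>>>>> 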
>>>>>>>>>>>>>>>>>>>>>>>>> I’m not saying that we should implement cross
>>>>>>>>>>>>>> session/refreshing
>>>>>>>>>>>>>>>>> now
>>>>>>>>>>>>>>>>>>>>> or
>>>>>>>>>>>>>>>>>>>>>>>>> even in the near future. I’m just arguing that
>>>>>>>>> naming
>>>>>>>>>>>>> current
>>>>>>>>>>>>>>>>>>>>> immutable
>>>>>>>>>>>>>>>>>>>>>>>>> session life scope method `materialize()` is more
>>>>>>>>>> future
>>>>>>>>>>>>> proof
>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>>>>> more
>>>>>>>>>>>>>>>>>>>>>>>>> consistent with SQL (on which after all table-api
>>> is
>>>>>>>>>>>> heavily
>>>>>>>>>>>>>>>>> basing
>>>>>>>>>>>>>>>>>>>>>>> on).
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> 3. Even if we agree on naming it `cache()`, I
>> would
>>>>>>>>>>> still
>>>>>>>>>>>>>> insist
>>>>>>>>>>>>>>>>> on
>>>>>>>>>>>>>>>>>>>>>>>>> `cache()` returning `CachedTable` handle to avoid
>>>>>>>>>>> implicit
>>>>>>>>>>>>>>>>>>>>>>>> behaviours/side
>>>>>>>>>>>>>>>>>>>>>>>>> effects and to give both us & users more
>>>>>>>>> flexibility.
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> Piotrek
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> On 29 Nov 2018, at 06:20, Becket Qin <
>>>>>>>>>>>> becket.qin@gmail.com
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> Just to add a little bit, the materialized view
>> is
>>>>>>>>>>>> probably
>>>>>>>>>>>>>>> more
>>>>>>>>>>>>>>>>>>>>>>>> similar
>>>>>>>>>>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>>>>>>>>> the persistent() brought up earlier in the
>> thread.
>>>>>>>>> So
>>>>>>>>>>> it
>>>>>>>>>>>> is
>>>>>>>>>>>>>>>>> usually
>>>>>>>>>>>>>>>>>>>>>>>> cross
>>>>>>>>>>>>>>>>>>>>>>>>>> session and could be used in a larger scope. For
>>>>>>>>>>>> example, a
>>>>>>>>>>>>>>>>>>>>>>>> materialized
>>>>>>>>>>>>>>>>>>>>>>>>>> view created by user A may be visible to user B.
>>> It
>>>>>>>>>> is
>>>>>>>>>>>>>> probably
>>>>>>>>>>>>>>>>>>>>>>>> something
>>>>>>>>>>>>>>>>>>>>>>>>>> we want to have in the future. I'll put it in
>> the
>>>>>>>>>>> future
>>>>>>>>>>>>> work
>>>>>>>>>>>>>>>>>>>>>>> section.
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> Jiangjie (Becket) Qin
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> On Thu, Nov 29, 2018 at 9:47 AM Becket Qin <
>>>>>>>>>>>>>>> becket.qin@gmail.com
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi Piotrek,
>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for the explanation.
>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>> Right now we are mostly thinking of the cached
>>>>>>>>> table
>>>>>>>>>>> as
>>>>>>>>>>>>>>>>>>>> immutable. I
>>>>>>>>>>>>>>>>>>>>>>>> can
>>>>>>>>>>>>>>>>>>>>>>>>>>> see the Materialized view would be useful in
>> the
>>>>>>>>>>> future.
>>>>>>>>>>>>>> That
>>>>>>>>>>>>>>>>>>>> said,
>>>>>>>>>>>>>>>>>>>>>>> I
>>>>>>>>>>>>>>>>>>>>>>>>> think
>>>>>>>>>>>>>>>>>>>>>>>>>>> a simple cache mechanism is probably still
>>> needed.
>>>>>>>>>> So
>>>>>>>>>>> to
>>>>>>>>>>>>> me,
>>>>>>>>>>>>>>>>>>>> cache()
>>>>>>>>>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>>>>>>>>>>> materialize() should be two separate method as
>>>>>>>>> they
>>>>>>>>>>>>> address
>>>>>>>>>>>>>>>>>>>>>>> different
>>>>>>>>>>>>>>>>>>>>>>>>>>> needs. Materialize() is a higher level concept
>>>>>>>>>> usually
>>>>>>>>>>>>>>> implying
>>>>>>>>>>>>>>>>>>>>>>>>> periodical
>>>>>>>>>>>>>>>>>>>>>>>>>>> update, while cache() has much simpler
>> semantic.
>>>>>>>>> For
>>>>>>>>>>>>>> example,
>>>>>>>>>>>>>>>>> one
>>>>>>>>>>>>>>>>>>>>>>> may
>>>>>>>>>>>>>>>>>>>>>>>>>>> create a materialized view and use cache()
>> method
>>>>>>>>> in
>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>> materialized
>>>>>>>>>>>>>>>>>>>>>>>>> view
>>>>>>>>>>>>>>>>>>>>>>>>>>> creation logic. So that during the materialized
>>>>>>>>> view
>>>>>>>>>>>>> update,
>>>>>>>>>>>>>>>>> they
>>>>>>>>>>>>>>>>>>>> do
>>>>>>>>>>>>>>>>>>>>>>>> not
>>>>>>>>>>>>>>>>>>>>>>>>>>> need to worry about the case that the cached
>>> table
>>>>>>>>>> is
>>>>>>>>>>>> also
>>>>>>>>>>>>>>>>>>>> changed.
>>>>>>>>>>>>>>>>>>>>>>>>> Maybe
>>>>>>>>>>>>>>>>>>>>>>>>>>> under the hood, materialized() and cache()
>> could
>>>>>>>>>> share
>>>>>>>>>>>>> some
>>>>>>>>>>>>>>>>>>>>>>> mechanism,
>>>>>>>>>>>>>>>>>>>>>>>>> but
>>>>>>>>>>>>>>>>>>>>>>>>>>> I think a simple cache() method would be handy
>> in
>>>>>>>>> a
>>>>>>>>>>> lot
>>>>>>>>>>>> of
>>>>>>>>>>>>>>>>> cases.
>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>> Jiangjie (Becket) Qin
>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>> On Mon, Nov 26, 2018 at 9:38 PM Piotr Nowojski
>> <
>>>>>>>>>>>>>>>>>>>>>>>> piotr@data-artisans.com
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi Becket,
>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Is there any extra thing user can do on a
>>>>>>>>>>>>>> MaterializedTable
>>>>>>>>>>>>>>>>> that
>>>>>>>>>>>>>>>>>>>>>>>> they
>>>>>>>>>>>>>>>>>>>>>>>>>>>> cannot do on a Table?
>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Maybe not in the initial implementation, but
>>>>>>>>>> various
>>>>>>>>>>>> DBs
>>>>>>>>>>>>>>> offer
>>>>>>>>>>>>>>>>>>>>>>>>> different
>>>>>>>>>>>>>>>>>>>>>>>>>>>> ways to “refresh” the materialised view.
>> Hooks,
>>>>>>>>>>>> triggers,
>>>>>>>>>>>>>>>>> timers,
>>>>>>>>>>>>>>>>>>>>>>>>> manually
>>>>>>>>>>>>>>>>>>>>>>>>>>>> etc. Having `MaterializedTable` would help us
>> to
>>>>>>>>>>> handle
>>>>>>>>>>>>>> that
>>>>>>>>>>>>>>> in
>>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>>>> future.
>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> After users call *table.cache(), *users can
>>> just
>>>>>>>>>> use
>>>>>>>>>>>>> that
>>>>>>>>>>>>>>>>> table
>>>>>>>>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>>>>>>>> do
>>>>>>>>>>>>>>>>>>>>>>>>>>>> anything that is supported on a Table,
>> including
>>>>>>>>>> SQL.
>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> This is some implicit behaviour with side
>>>>>>>>> effects.
>>>>>>>>>>>>> Imagine
>>>>>>>>>>>>>> if
>>>>>>>>>>>>>>>>>>>> user
>>>>>>>>>>>>>>>>>>>>>>>> has
>>>>>>>>>>>>>>>>>>>>>>>>> a
>>>>>>>>>>>>>>>>>>>>>>>>>>>> long and complicated program, that touches
>> table
>>>>>>>>>> `b`
>>>>>>>>>>>>>> multiple
>>>>>>>>>>>>>>>>>>>>>>> times,
>>>>>>>>>>>>>>>>>>>>>>>>> maybe
>>>>>>>>>>>>>>>>>>>>>>>>>>>> scattered around different methods. If he
>>>>>>>>> modifies
>>>>>>>>>>> his
>>>>>>>>>>>>>>> program
>>>>>>>>>>>>>>>>> by
>>>>>>>>>>>>>>>>>>>>>>>>> inserting
>>>>>>>>>>>>>>>>>>>>>>>>>>>> in one place
>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> b.cache()
>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> This implicitly alters the semantic and
>>> behaviour
>>>>>>>>>> of
>>>>>>>>>>>> his
>>>>>>>>>>>>>> code
>>>>>>>>>>>>>>>>> all
>>>>>>>>>>>>>>>>>>>>>>>> over
>>>>>>>>>>>>>>>>>>>>>>>>>>>> the place, maybe in a ways that might cause
>>>>>>>>>> problems.
>>>>>>>>>>>> For
>>>>>>>>>>>>>>>>> example
>>>>>>>>>>>>>>>>>>>>>>>> what
>>>>>>>>>>>>>>>>>>>>>>>>> if
>>>>>>>>>>>>>>>>>>>>>>>>>>>> underlying data is changing?
>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Having invisible side effects is also not very
>>>>>>>>>> clean,
>>>>>>>>>>>> for
>>>>>>>>>>>>>>>>> example
>>>>>>>>>>>>>>>>>>>>>>>> think
>>>>>>>>>>>>>>>>>>>>>>>>>>>> about something like this (but more
>>> complicated):
>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Table b = ...;
>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> if (some_condition) {
>>>>>>>>>>>>>>>>>>>>>>>>>>>>   processTable1(b);
>>>>>>>>>>>>>>>>>>>>>>>>>>>> } else {
>>>>>>>>>>>>>>>>>>>>>>>>>>>>   processTable2(b);
>>>>>>>>>>>>>>>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> // do more stuff with b
>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> And user adds `b.cache()` call to only one of
>>> the
>>>>>>>>>>>>>>>>> `processTable1`
>>>>>>>>>>>>>>>>>>>>>>> or
>>>>>>>>>>>>>>>>>>>>>>>>>>>> `processTable2` methods.
>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> On the other hand
>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Table materialisedB = b.materialize()
>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Avoids (at least some of) the side effect
>> issues
>>>>>>>>>> and
>>>>>>>>>>>>> forces
>>>>>>>>>>>>>>>>> user
>>>>>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>>>>>>>>>>> explicitly use `materialisedB` where it’s
>>>>>>>>>> appropriate
>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>> forces
>>>>>>>>>>>>>>>>>>>>>>> user
>>>>>>>>>>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>>>>>>>>>>> think what does it actually mean. And if
>>>>>>>>> something
>>>>>>>>>>>>> doesn’t
>>>>>>>>>>>>>>> work
>>>>>>>>>>>>>>>>>>>> in
>>>>>>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>>>> end
>>>>>>>>>>>>>>>>>>>>>>>>>>>> for the user, he will know what has he changed
>>>>>>>>>>> instead
>>>>>>>>>>>> of
>>>>>>>>>>>>>>>>> blaming
>>>>>>>>>>>>>>>>>>>>>>>>> Flink for
>>>>>>>>>>>>>>>>>>>>>>>>>>>> some “magic” underneath. In the above example,
>>>>>>>>>> after
>>>>>>>>>>>>>>>>>>>> materialising
>>>>>>>>>>>>>>>>>>>>>>> b
>>>>>>>>>>>>>>>>>>>>>>>> in
>>>>>>>>>>>>>>>>>>>>>>>>>>>> only one of the methods, he should/would
>> realise
>>>>>>>>>>> about
>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>> issue
>>>>>>>>>>>>>>>>>>>>>>> when
>>>>>>>>>>>>>>>>>>>>>>>>>>>> handling the return value `MaterializedTable`
>> of
>>>>>>>>>> that
>>>>>>>>>>>>>> method.
>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> I guess it comes down to personal preferences
>> if
>>>>>>>>>> you
>>>>>>>>>>>> like
>>>>>>>>>>>>>>>>> things
>>>>>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>>>>>>> be
>>>>>>>>>>>>>>>>>>>>>>>>>>>> implicit or not. The more power is the user,
>>>>>>>>>> probably
>>>>>>>>>>>> the
>>>>>>>>>>>>>>> more
>>>>>>>>>>>>>>>>>>>>>>> likely
>>>>>>>>>>>>>>>>>>>>>>>>> he is
>>>>>>>>>>>>>>>>>>>>>>>>>>>> to like/understand implicit behaviour. And we
>> as
>>>>>>>>>>> Table
>>>>>>>>>>>>> API
>>>>>>>>>>>>>>>>>>>>>>> designers
>>>>>>>>>>>>>>>>>>>>>>>>> are
>>>>>>>>>>>>>>>>>>>>>>>>>>>> the most power users out there, so I would
>>>>>>>>> proceed
>>>>>>>>>>> with
>>>>>>>>>>>>>>> caution
>>>>>>>>>>>>>>>>>>>> (so
>>>>>>>>>>>>>>>>>>>>>>>>> that we
>>>>>>>>>>>>>>>>>>>>>>>>>>>> do not end up in the crazy perl realm with
>> it’s
>>>>>>>>>>> lovely
>>>>>>>>>>>>>>> implicit
>>>>>>>>>>>>>>>>>>>>>>>> method
>>>>>>>>>>>>>>>>>>>>>>>>>>>> arguments ;)  <
>>>>>>>>>>>>>> https://stackoverflow.com/a/14922656/8149051
>>>>>>>>>>>>>>>> )
>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Table API to also support non-relational
>>>>>>>>>> processing
>>>>>>>>>>>>> cases,
>>>>>>>>>>>>>>>>>>>> cache()
>>>>>>>>>>>>>>>>>>>>>>>>>>>> might be slightly better.
>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> I think even such extended Table API could
>>>>>>>>> benefit
>>>>>>>>>>> from
>>>>>>>>>>>>>>>>> sticking
>>>>>>>>>>>>>>>>>>>>>>>>> to/being
>>>>>>>>>>>>>>>>>>>>>>>>>>>> consistent with SQL where both SQL and Table
>> API
>>>>>>>>>> are
>>>>>>>>>>>>>>> basically
>>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>>>> same.
>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> One more thing. `MaterializedTable
>>> materialize()`
>>>>>>>>>>> could
>>>>>>>>>>>>> be
>>>>>>>>>>>>>>> more
>>>>>>>>>>>>>>>>>>>>>>>>>>>> powerful/flexible allowing the user to operate
>>>>>>>>> both
>>>>>>>>>>> on
>>>>>>>>>>>>>>>>>>>> materialised
>>>>>>>>>>>>>>>>>>>>>>>>> and not
>>>>>>>>>>>>>>>>>>>>>>>>>>>> materialised view at the same time for
>> whatever
>>>>>>>>>>> reasons
>>>>>>>>>>>>>>>>>>>> (underlying
>>>>>>>>>>>>>>>>>>>>>>>>> data
>>>>>>>>>>>>>>>>>>>>>>>>>>>> changing/better optimisation opportunities
>> after
>>>>>>>>>>>> pushing
>>>>>>>>>>>>>> down
>>>>>>>>>>>>>>>>>>>> more
>>>>>>>>>>>>>>>>>>>>>>>>> filters
>>>>>>>>>>>>>>>>>>>>>>>>>>>> etc). For example:
>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Table b = …;
>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> MaterializedTable mb = b.materialize();
>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Val min = mb.min();
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Val max = mb.max();
>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Val user42 = b.filter(‘userId = 42);
>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Could be more efficient compared to
>> `b.cache()`
>>>>>>>>> if
>>>>>>>>>>>>>>>>>>>> `filter(‘userId
>>>>>>>>>>>>>>>>>>>>>>> =
>>>>>>>>>>>>>>>>>>>>>>>>>>>> 42);` allows for much more aggressive
>>>>>>>>>> optimisations.
>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Piotrek
>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On 26 Nov 2018, at 12:14, Fabian Hueske <
>>>>>>>>>>>>>> fhueske@gmail.com>
>>>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I'm not suggesting to add support for Ignite.
>>>>>>>>> This
>>>>>>>>>>> was
>>>>>>>>>>>>>> just
>>>>>>>>>>>>>>> an
>>>>>>>>>>>>>>>>>>>>>>>>> example.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Plasma and Arrow sound interesting, too.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> For the sake of this proposal, it would be up
>>> to
>>>>>>>>>> the
>>>>>>>>>>>>> user
>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>>>>>>>> implement a
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> TableFactory and corresponding TableSource /
>>>>>>>>>>> TableSink
>>>>>>>>>>>>>>> classes
>>>>>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>>>>>>>>>>> persist
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> and read the data.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Am Mo., 26. Nov. 2018 um 12:06 Uhr schrieb
>>>>>>>>> Flavio
>>>>>>>>>>>>>>> Pompermaier
>>>>>>>>>>>>>>>>> <
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> pompermaier@okkam.it>:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> What about to add also Apache Plasma + Arrow
>>> as
>>>>>>>>>> an
>>>>>>>>>>>>>>>>> alternative
>>>>>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Apache
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Ignite?
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> [1]
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>> 
>>>> 
>> https://arrow.apache.org/blog/2017/08/08/plasma-in-memory-object-store/
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Mon, Nov 26, 2018 at 11:56 AM Fabian
>> Hueske
>>>>>>>>> <
>>>>>>>>>>>>>>>>>>>>>>> fhueske@gmail.com>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for the proposal!
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> To summarize, you propose a new method
>>>>>>>>>>>> Table.cache():
>>>>>>>>>>>>>>> Table
>>>>>>>>>>>>>>>>>>>> that
>>>>>>>>>>>>>>>>>>>>>>>>> will
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> trigger a job and write the result into
>> some
>>>>>>>>>>>> temporary
>>>>>>>>>>>>>>>>> storage
>>>>>>>>>>>>>>>>>>>>>>> as
>>>>>>>>>>>>>>>>>>>>>>>>>>>> defined
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> by a TableFactory.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> The cache() call blocks while the job is
>>>>>>>>> running
>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>>>> eventually
>>>>>>>>>>>>>>>>>>>>>>>>>>>> returns a
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Table object that represents a scan of the
>>>>>>>>>>> temporary
>>>>>>>>>>>>>>> table.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> When the "session" is closed (closing to be
>>>>>>>>>>>> defined?),
>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>>> temporary
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> tables
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> are all dropped.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
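>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Restated as a short sketch (illustrative names only):
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Table t = env.scan("orders")
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>     .groupBy("user").select("user, amount.sum as total");
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Table c = t.cache(); // blocks: runs a job and writes to
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>     // the temporary storage defined by the TableFactory
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> c.filter("total > 100"); // later queries scan the temp table
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> // all temporary tables are dropped when the session closes
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 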
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I think this behavior makes sense and is a
>>>>>>>>> good
>>>>>>>>>>>> first
>>>>>>>>>>>>>> step
>>>>>>>>>>>>>>>>>>>>>>> towards
>>>>>>>>>>>>>>>>>>>>>>>>>>>> more
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> interactive workloads.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> However, its performance suffers from
>> writing
>>>>>>>>> to
>>>>>>>>>>> and
>>>>>>>>>>>>>>> reading
>>>>>>>>>>>>>>>>>>>>>>> from
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> external
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> systems.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I think this is OK for now. Changes that
>>> would
>>>>>>>>>>>>>>> significantly
>>>>>>>>>>>>>>>>>>>>>>>> improve
>>>>>>>>>>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> situation (i.e., pinning data in-memory
>>> across
>>>>>>>>>>> jobs)
>>>>>>>>>>>>>> would
>>>>>>>>>>>>>>>>>>>> have
>>>>>>>>>>>>>>>>>>>>>>>>> large
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> impacts on many components of Flink.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Users could use in-memory filesystems or
>>>>>>>>> storage
>>>>>>>>>>>> grids
>>>>>>>>>>>>>>>>> (Apache
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Ignite) to
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> mitigate some of the performance effects.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Best, Fabian
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Am Mo., 26. Nov. 2018 um 03:38 Uhr schrieb
>>>>>>>>>> Becket
>>>>>>>>>>>> Qin
>>>>>>>>>>>>> <
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> becket.qin@gmail.com
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> :
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for the explanation, Piotrek.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Is there any extra thing user can do on a
>>>>>>>>>>>>>>> MaterializedTable
>>>>>>>>>>>>>>>>>>>>>>> that
>>>>>>>>>>>>>>>>>>>>>>>>> they
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> cannot do on a Table? After users call
>>>>>>>>>>>>> *table.cache(),
>>>>>>>>>>>>>>>>> *users
>>>>>>>>>>>>>>>>>>>>>>> can
>>>>>>>>>>>>>>>>>>>>>>>>>>>> just
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> use
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> that table and do anything that is
>> supported
>>>>>>>>>> on a
>>>>>>>>>>>>>> Table,
>>>>>>>>>>>>>>>>>>>>>>>> including
>>>>>>>>>>>>>>>>>>>>>>>>>>>> SQL.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Naming wise, either cache() or
>> materialize()
>>>>>>>>>>> sounds
>>>>>>>>>>>>>> fine
>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>>> me.
>>>>>>>>>>>>>>>>>>>>>>>>>>>> cache()
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> is
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> a bit more general than materialize().
>> Given
>>>>>>>>>> that
>>>>>>>>>>>> we
>>>>>>>>>>>>>> are
>>>>>>>>>>>>>>>>>>>>>>>> enhancing
>>>>>>>>>>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Table API to also support non-relational
>>>>>>>>>>> processing
>>>>>>>>>>>>>>> cases,
>>>>>>>>>>>>>>>>>>>>>>>> cache()
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> might
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> be
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> slightly better.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Jiangjie (Becket) Qin
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Fri, Nov 23, 2018 at 11:25 PM Piotr
>>>>>>>>>> Nowojski <
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> piotr@data-artisans.com
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi Becket,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Ops, sorry I didn’t notice that you
>> intend
>>>>>>>>> to
>>>>>>>>>>>> reuse
>>>>>>>>>>>>>>>>> existing
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> `TableFactory`. I don’t know why, but I
>>>>>>>>>> assumed
>>>>>>>>>>>> that
>>>>>>>>>>>>>> you
>>>>>>>>>>>>>>>>>>>> want
>>>>>>>>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> provide
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> an
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> alternate way of writing the data.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Now that I hopefully understand the
>>>>>>>>> proposal,
>>>>>>>>>>>> maybe
>>>>>>>>>>>>> we
>>>>>>>>>>>>>>>>> could
>>>>>>>>>>>>>>>>>>>>>>>>> rename
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> `cache()` to
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> void materialize()
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> or going step further
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> MaterializedTable materialize()
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> MaterializedTable
>> createMaterializedView()
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> ?
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> The second option with returning a
>> handle I
>>>>>>>>>>> think
>>>>>>>>>>>> is
>>>>>>>>>>>>>>> more
>>>>>>>>>>>>>>>>>>>>>>>> flexible
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> could provide features such as
>>>>>>>>>>> “refresh”/“delete”
>>>>>>>>>>>> or
>>>>>>>>>>>>>>>>>>>> generally
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> speaking
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> manage the the view. In the future we
>> could
>>>>>>>>>> also
>>>>>>>>>>>>> think
>>>>>>>>>>>>>>>>> about
>>>>>>>>>>>>>>>>>>>>>>>>> adding
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> hooks
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> to automatically refresh view etc. It is
>>>>>>>>> also
>>>>>>>>>>> more
>>>>>>>>>>>>>>>>> explicit
>>>>>>>>>>>>>>>>>>>> -
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> materialization returning a new table
>>> handle
>>>>>>>>>>> will
>>>>>>>>>>>>> not
>>>>>>>>>>>>>>> have
>>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>>>> same
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> implicit side effects as adding a simple
>>>>>>>>> line
>>>>>>>>>> of
>>>>>>>>>>>>> code
>>>>>>>>>>>>>>> like
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> `b.cache()`
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> would have.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> It would also be more SQL like, making it
>>>>>>>>> more
>>>>>>>>>>>>>> intuitive
>>>>>>>>>>>>>>>>> for
>>>>>>>>>>>>>>>>>>>>>>>> users
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> already
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> familiar with the SQL.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Piotrek
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On 23 Nov 2018, at 14:53, Becket Qin <
>>>>>>>>>>>>>>>>> becket.qin@gmail.com
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi Piotrek,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> For the cache() method itself, yes, it
>> is
>>>>>>>>>>>>> equivalent
>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>>>>>>> creating
>>>>>>>>>>>>>>>>>>>>>>>>> a
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> BUILT-IN
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> materialized view with a lifecycle. That
>>>>>>>>>>>>>> functionality
>>>>>>>>>>>>>>> is
>>>>>>>>>>>>>>>>>>>>>>>> missing
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> today,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> though. Not sure if I understand your
>>>>>>>>>> question.
>>>>>>>>>>>> Do
>>>>>>>>>>>>>> you
>>>>>>>>>>>>>>>>> mean
>>>>>>>>>>>>>>>>>>>>>>> we
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> already
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> have
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the functionality and just need a syntax
>>>>>>>>>> sugar?
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> What's more interesting in the proposal
>> is
>>>>>>>>> do
>>>>>>>>>>> we
>>>>>>>>>>>>> want
>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>>> stop
>>>>>>>>>>>>>>>>>>>>>>>> at
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> creating
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the materialized view? Or do we want to
>>>>>>>>>> extend
>>>>>>>>>>>> that
>>>>>>>>>>>>>> in
>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>>> future
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> a
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> more
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> useful unified data store distributed
>> with
>>>>>>>>>>> Flink?
>>>>>>>>>>>>> And
>>>>>>>>>>>>>>> do
>>>>>>>>>>>>>>>>> we
>>>>>>>>>>>>>>>>>>>>>>>> want
>>>>>>>>>>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> have
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> a
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> mechanism allow more flexible user job
>>>>>>>>>> pattern
>>>>>>>>>>>> with
>>>>>>>>>>>>>>> their
>>>>>>>>>>>>>>>>>>>> own
>>>>>>>>>>>>>>>>>>>>>>>>> user
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> defined
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> services. These considerations are much
>>>>>>>>> more
>>>>>>>>>>>>>>>>> architectural.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Jiangjie (Becket) Qin
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Fri, Nov 23, 2018 at 6:01 PM Piotr
>>>>>>>>>> Nowojski
>>>>>>>>>>> <
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> piotr@data-artisans.com>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Interesting idea. I’m trying to
>>> understand
>>>>>>>>>> the
>>>>>>>>>>>>>>> problem.
>>>>>>>>>>>>>>>>>>>>>>> Isn’t
>>>>>>>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> `cache()` call an equivalent of writing
>>>>>>>>> data
>>>>>>>>>>> to
>>>>>>>>>>>> a
>>>>>>>>>>>>>> sink
>>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>>>>>>>> later
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> reading
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> from it? Where this sink has a limited
>>>>>>>>> live
>>>>>>>>>>>>>> scope/live
>>>>>>>>>>>>>>>>>>>> time?
>>> And the sink could be implemented as in memory or a file sink?
>>>
>>> If so, what’s the problem with creating a materialised view from a table
>>> “b” (from your document’s example) and reusing this materialised view
>>> later? Maybe we are lacking mechanisms to clean up materialised views
>>> (for example when the current session finishes)? Maybe we need some
>>> syntactic sugar on top of it?
>>>
>>> Piotrek
>>>
>>>> On 23 Nov 2018, at 07:21, Becket Qin <becket.qin@gmail.com> wrote:
>>>>
>>>> Thanks for the suggestion, Jincheng.
>>>>
>>>> Yes, I think it makes sense to have a persist() with a lifecycle/defined
>>>> scope. I just added a section in the future work for this.
>>>>
>>>> Thanks,
>>>>
>>>> Jiangjie (Becket) Qin
>>>>
>>>> On Fri, Nov 23, 2018 at 1:55 PM jincheng sun <sunjincheng121@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi Jiangjie,
>>>>>
>>>>> Thank you for the explanation about the name of `cache()`, I understand
>>>>> why you designed it this way!
>>>>>
>>>>> Another idea is whether we can specify a lifecycle for data
>>>>> persistence? For example, persist(LifeCycle.SESSION), so that the user
>>>>> is not worried about data loss and clearly specifies the time range for
>>>>> keeping the data. At the same time, if we want to expand, we can also
>>>>> share within a certain group of sessions, for example:
>>>>> LifeCycle.SESSION_GROUP(...). I am not sure, just an immature
>>>>> suggestion, for reference only!
>>>>>
>>>>> Bests,
>>>>> Jincheng
>>>>>
>>>>> Becket Qin <becket.qin@gmail.com> wrote on Fri, Nov 23, 2018 at 1:33 PM:
>>>>>
>>>>>> Re: Jincheng,
>>>>>>
>>>>>> Thanks for the feedback. Regarding cache() v.s. persist(), personally
>>>>>> I find cache() to be more accurately describing the behavior, i.e. the
>>>>>> Table is cached for the session, but will be deleted after the session
>>>>>> is closed. persist() seems a little misleading, as people might think
>>>>>> the table will still be there even after the session is gone.
>>>>>>
>>>>>> Great point about mixing batch and stream processing in the same job.
>>>>>> We should absolutely move towards that goal. I imagine that would be a
>>>>>> huge change across the board, including sources, operators and
>>>>>> optimizations, to name some. Likely we will need several separate
>>>>>> in-depth discussions.
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Jiangjie (Becket) Qin
>>>>>>
>>>>>> On Fri, Nov 23, 2018 at 5:14 AM Xingcan Cui <xingcanc@gmail.com> wrote:
>>>>>>
>>>>>>> Hi all,
>>>>>>>
>>>>>>> @Shaoxuan, I think the lifecycle or access domain are both orthogonal
>>>>>>> to the cache problem. Essentially, this may be the first time we plan
>>>>>>> to introduce another storage mechanism other than the state. Maybe
>>>>>>> it’s better to first draw a big picture and then concentrate on a
>>>>>>> specific part?
>>>>>>>
>>>>>>> @Becket, yes, actually I am more concerned with the underlying
>>>>>>> service. This seems to be quite a major change to the existing
>>>>>>> codebase. As you claimed, the service should be extendible to support
>>>>>>> other components, and we’d better discuss it in another thread.
>>>>>>>
>>>>>>> All in all, I am also eager to enjoy a more interactive Table API,
>>>>>>> given a general and flexible enough service mechanism.
>>>>>>>
>>>>>>> Best,
>>>>>>> Xingcan
>>>>>>>
>>>>>>>> On Nov 22, 2018, at 10:16 AM, Xiaowei Jiang <xiaoweij@gmail.com> wrote:
>>>>>>>>
>>>>>>>> Relying on a callback for the temp table for clean up is not very
>>>>>>>> reliable. There is no guarantee that it will be executed
>>>>>>>> successfully. We may risk leaks when that happens. I think that it's
>>>>>>>> safer to have an association between temp table and session id. So
>>>>>>>> we can always clean up temp tables which are no longer associated
>>>>>>>> with any active sessions.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Xiaowei
>>>>>>>>
>>>>>>>> On Thu, Nov 22, 2018 at 12:55 PM jincheng sun
>>>>>>>> <sunjincheng121@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi Jiangjie & Shaoxuan,
>>>>>>>>>
>>>>>>>>> Thanks for initiating this great proposal!
>>>>>>>>>
>>>>>>>>> Interactive Programming is very useful and user friendly in the
>>>>>>>>> case of your examples. Moreover, especially when a business has to
>>>>>>>>> be executed in several stages with dependencies, such as the
>>>>>>>>> pipeline of Flink ML, in order to utilize the intermediate
>>>>>>>>> calculation results we have to submit a job by env.execute().
>>>>>>>>>
>>>>>>>>> About `cache()`, I think it is better named `persist()`, and the
>>>>>>>>> Flink framework determines whether we internally cache in memory or
>>>>>>>>> persist to the storage system; maybe save the data into a state
>>>>>>>>> backend (MemoryStateBackend or RocksDBStateBackend etc.)
>>>>>>>>>
>>>>>>>>> BTW, from my point of view, in the future, support for streaming
>>>>>>>>> and batch mode switching in the same job will also benefit
>>>>>>>>> "Interactive Programming". I am looking forward to your JIRAs and
>>>>>>>>> FLIP!
>>>>>>>>>
>>>>>>>>> Best,
>>>>>>>>> Jincheng
>>>>>>>>>
>>>>>>>>> Becket Qin <becket.qin@gmail.com> wrote on Tue, Nov 20, 2018 at 9:56 PM:
>>>>>>>>>
>>>>>>>>>> Hi all,
>>>>>>>>>>
>>>>>>>>>> [original proposal snipped - see the start of this thread]
>>>>>>>>>>
>>>>>>>>>> https://docs.google.com/document/d/1d4T2zTyfe7hdncEUAxrlNOYr4e5IMNEZLyqSuuswkA0/edit?usp=sharing
>
>
> --
> Best Regards
>
> Jeff Zhang


Re: [DISCUSS] Support Interactive Programming in Flink Table API

Posted by Jeff Zhang <zj...@gmail.com>.
Hi Becket,

Introducing CacheHandle seems too complicated, because it means users have
to maintain the handle properly.

And since cache is just a hint for the optimizer, why not just return the
Table itself from the cache method? This hint info should be kept in the
Table, I believe.

So how about adding the methods cache and uncache to Table, with both
returning Table? After all, what cache and uncache do is just add some hint
info into the Table.
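
A minimal sketch of what that could look like (these signatures are
hypothetical, just to illustrate the suggestion, not an actual API):

class Table(private var cacheHint: Boolean = false) {
  def cache(): Table = { cacheHint = true; this }    // record the hint, return this
  def uncache(): Table = { cacheHint = false; this } // drop the hint, return this
  def select(fields: String): Table = this           // stand-in for the real operator
}

// Usage: the hint chains naturally with other calls.
val a = new Table()
val b = a.cache().select("f1, f2")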




Becket Qin <be...@gmail.com> wrote on Wed, Dec 12, 2018 at 11:25 AM:

> Hi Till and Piotrek,
>
> Thanks for the clarification. That clears up quite a bit of confusion. My
> understanding of how cache works is the same as what Till described, i.e.
> cache() is a hint to Flink, but it is not guaranteed that the cache always
> exists, and it might be recomputed from its lineage.
>
> > Is this the core of our disagreement here? That you would like this
> > “cache()” to be mostly a hint for the optimiser?
>
> Semantics-wise, yes. That's also why I think materialize() has a much larger
> scope than cache(), and thus it should be a different method.
>
> Regarding the chance of optimization, it might not be that rare. Some very
> simple statistics could already help in many cases. For example, simply
> maintaining the max and min of each field can already eliminate some
> unnecessary table scans (potentially scanning the cached table) if the
> result is doomed to be empty. A histogram would give even further
> information. The optimizer could be very careful and only ignore the cache
> when it is 100% sure doing so is cheaper, e.g. only when a filter on the
> cache will absolutely return nothing.
>
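> To make that concrete, here is a tiny sketch of such a min/max pruning
> check (all names below are made up for illustration; this is not part of
> the proposal):
>
> // True if a predicate of the form `field < bound` cannot match any row,
> // given per-field (min, max) statistics of the cached table.
> def filterIsEmpty(stats: Map[String, (Double, Double)],
>                   field: String,
>                   upperBound: Double): Boolean =
>   stats.get(field).exists { case (min, _) => min >= upperBound }
>
> // With stats = Map("f1" -> (100.0, 500.0)), a filter `f1 < 100` is
> // provably empty, so the optimizer may skip scanning the cache entirely.
>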
> Given the above clarification on cache, I would like to revisit the
> original "void cache()" proposal and see if we can improve on top of that.
>
> What do you think about the following modified interface?
>
> Table {
>   /**
>    * This call hints Flink to maintain a cache of this table and leverage
>    * it for performance optimization if needed. Note that Flink may still
>    * decide not to use the cache if doing so is cheaper.
>    *
>    * A CacheHandle will be returned to allow the user to actively release
>    * the cache. The cache will be deleted once there are no unreleased
>    * cache handles to it. When the TableEnvironment is closed, the cache
>    * will also be deleted and all the cache handles will be released.
>    *
>    * @return a CacheHandle referring to the cache of this table.
>    */
>   CacheHandle cache();
> }
>
> CacheHandle {
>   /**
>    * Close the cache handle. This method does not necessarily delete the
>    * cache. Instead, it simply decrements the reference counter of the
>    * cache. When there is no handle referring to a cache, the cache will
>    * be deleted.
>    *
>    * @return the number of open handles to the cache after this handle has
>    * been released.
>    */
>   int release();
> }
>
> The rationale behind this interface is the following:
> In the vast majority of cases, users wouldn't really care whether the cache
> is used or not. So I think the most intuitive way is letting cache() return
> nothing, so nobody needs to worry about the difference between operations
> on CachedTables and those on the "original" tables. This will make maybe
> 99.9% of the users happy. There were two concerns raised for this approach:
> 1. In some rare cases, users may want to ignore the cache.
> 2. A table might be cached/uncached in a third-party function while the
> caller does not know.
>
> For the first issue, users can use hint("ignoreCache") to explicitly ignore
> the cache.
> For the second issue, the above proposal lets cache() return a CacheHandle
> whose only method is release(). Different CacheHandles will refer to the
> same cache; if a cache no longer has any cache handles, it will be deleted.
> This will address the following case:
> {
>   val handle1 = a.cache()
>   process(a)
>   a.select(...) // cache is still available, handle1 has not been released.
> }
>
> def process(t: Table): Unit = {
>   val handle2 = t.cache() // new handle to the same cache
>   t.select(...) // optimizer decides cache usage
>   t.hint("ignoreCache").select(...) // cache is ignored
>   handle2.release() // release this handle; the cache may still be
>   // available if there are other unreleased handles
>   ...
> }
>
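> For clarity, a minimal sketch of the reference counting described above
> (purely illustrative; none of these class names are part of the proposal):
>
> import java.util.concurrent.atomic.AtomicInteger
>
> // One shared counter per cached table; every cache() call wraps it in a
> // new CacheHandle. The physical cache is dropped when the count hits zero.
> class RefCount { val openHandles = new AtomicInteger(0) }
>
> class CacheHandle(refCount: RefCount, dropCache: () => Unit) {
>   refCount.openHandles.incrementAndGet()
>   def release(): Int = {
>     val remaining = refCount.openHandles.decrementAndGet()
>     if (remaining == 0) dropCache() // last handle released -> delete cache
>     remaining
>   }
> }
>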
> Does the above modified approach look reasonable to you?
>
> Cheers,
>
> Jiangjie (Becket) Qin
>
>
>
>
>
>
>
> On Tue, Dec 11, 2018 at 6:44 PM Till Rohrmann <tr...@apache.org>
> wrote:
>
> > Hi Becket,
> >
> > I was aiming at semantics similar to 1. I actually thought that `cache()`
> > would tell the system to materialize the intermediate result so that
> > subsequent queries don't need to reprocess it. This means that the usage
> > of the cached table in this example
> >
> > {
> >  val cachedTable = a.cache()
> >  val b1 = cachedTable.select(…)
> >  val b2 = cachedTable.foo().select(…)
> >  val b3 = cachedTable.bar().select(...)
> >  val c1 = a.select(…)
> >  val c2 = a.foo().select(…)
> >  val c3 = a.bar().select(...)
> > }
> >
> > strongly depends on interleaved calls which trigger the execution of sub
> > queries. So for example, if there is only a single env.execute call at
> > the end of the block, then b1, b2, b3, c1, c2 and c3 would all be
> > computed by reading directly from the sources (given that there is only
> > a single JobGraph). It just happens that the result of `a` will be
> > cached such that we skip the processing of `a` when there are subsequent
> > queries reading from `cachedTable`. If for some reason the system cannot
> > materialize the table (e.g. running out of disk space, ttl expired),
> > then it could also happen that we need to reprocess `a`. In that sense
> > `cachedTable` simply is an identifier for the materialized result of `a`
> > with the lineage for how to reprocess it.
> >
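> > A rough sketch of that lookup logic, just to illustrate the idea (all of
> > these names are invented):
> >
> > sealed trait PhysicalPlan
> > case class ScanPlan(resultId: String) extends PhysicalPlan      // read the cache
> > case class RecomputePlan(lineage: String) extends PhysicalPlan  // re-run the sub-DAG
> >
> > def plan(tableId: String, lineage: String,
> >          cacheRegistry: Map[String, String]): PhysicalPlan =
> >   cacheRegistry.get(tableId) match {
> >     case Some(resultId) => ScanPlan(resultId)     // cache hit: skip reprocessing `a`
> >     case None           => RecomputePlan(lineage) // cache gone: recompute from lineage
> >   }
> >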
> > Cheers,
> > Till
> >
> >
> >
> >
> >
> > On Tue, Dec 11, 2018 at 11:01 AM Piotr Nowojski <piotr@data-artisans.com>
> > wrote:
> >
> > > Hi Becket,
> > >
> > > > {
> > > >  val cachedTable = a.cache()
> > > >  val b = cachedTable.select(...)
> > > >  val c = a.select(...)
> > > > }
> > > >
> > > > Semantic 1. b uses cachedTable as the user demanded. c uses the
> > > > original DAG as the user demanded. In this case, the optimizer has no
> > > > chance to optimize.
> > > > Semantic 2. b uses cachedTable as the user demanded. c leaves the
> > > > optimizer to choose whether the cache or DAG should be used. In this
> > > > case, users lose the option to NOT use the cache.
> > > >
> > > > As you can see, neither of the options seems perfect. However, I
> > > > guess you and Till are proposing the third option:
> > > >
> > > > Semantic 3. b leaves the optimizer to choose whether the cache or DAG
> > > > should be used. c always uses the DAG.
> > >
> > > I am pretty sure that me, Till, Fabian and others were all proposing
> > > and advocating in favour of semantic “1”. No cost based optimiser
> > > decisions at all.
> > >
> > > {
> > >  val cachedTable = a.cache()
> > >  val b1 = cachedTable.select(…)
> > >  val b2 = cachedTable.foo().select(…)
> > >  val b3 = cachedTable.bar().select(...)
> > >  val c1 = a.select(…)
> > >  val c2 = a.foo().select(…)
> > >  val c3 = a.bar().select(...)
> > > }
> > >
> > > All b1, b2 and b3 are reading from the cache, while c1, c2 and c3 are
> > > re-executing the whole plan for “a”.
> > >
> > > In the future we could discuss going one step further, introducing
> > > some global optimisation (that can be manually enabled/disabled):
> > > deduplicate plan nodes / deduplicate sub queries / re-use sub query
> > > results, or whatever we could call it. It could do two things (a rough
> > > sketch of the first one follows after the list):
> > >
> > > 1. Automatically try to deduplicate fragments of the plan and share
> > > the result using CachedTable - in other words, automatically insert
> > > `CachedTable cache()` calls.
> > > 2. Automatically make the decision to bypass explicit `CachedTable`
> > > access (this would be the equivalent of what you described as
> > > “semantic 3”).
> > >
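> > > Just to make idea 1 concrete, a rough sketch (every name here is
> > > invented purely for illustration):
> > >
> > > // Minimal model: a plan node with a structural digest and children.
> > > case class PlanNode(digest: String, children: Seq[PlanNode] = Nil) {
> > >   def subPlans: Seq[PlanNode] = this +: children.flatMap(_.subPlans)
> > > }
> > >
> > > // Digests that appear under more than one query are candidates for an
> > > // automatically inserted cache() call.
> > > def sharedFragments(queries: Seq[PlanNode]): Set[String] =
> > >   queries.flatMap(_.subPlans.map(_.digest).distinct)
> > >     .groupBy(identity)
> > >     .collect { case (digest, occ) if occ.size > 1 => digest }
> > >     .toSet
> > >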
> > > However, as I wrote previously, I have big doubts whether such
> > > cost-based optimisation would work (this applies also to “Semantic
> > > 2”). I would expect it to do more harm than good in so many cases that
> > > it wouldn’t make sense. Even assuming that we calculate statistics
> > > perfectly (this ain’t gonna happen), it’s virtually impossible to
> > > correctly estimate the exchange rate of CPU cycles vs IO operations,
> > > as it changes so much from deployment to deployment.
> > >
> > > Is this the core of our disagreement here? That you would like this
> > > “cache()” to be mostly a hint for the optimiser?
> > >
> > > Piotrek
> > >
> > > > On 11 Dec 2018, at 06:00, Becket Qin <be...@gmail.com> wrote:
> > > >
> > > > Another potential concern for semantic 3 is that, in the future, we
> > > > may add automatic caching to Flink, e.g. caching the intermediate
> > > > results at the shuffle boundary. If our semantic is that a reference
> > > > to the original table means skipping the cache, those users may not
> > > > be able to benefit from the implicit cache.
> > > >
> > > >
> > > >
> > > > On Tue, Dec 11, 2018 at 12:10 PM Becket Qin <be...@gmail.com>
> > > > wrote:
> > > >
> > > >> Hi Piotrek,
> > > >>
> > > >> Thanks for the reply. Having thought about it again, I might have
> > > >> misunderstood your proposal in earlier emails. Returning a
> > > >> CachedTable might not be a bad idea.
> > > >>
> > > >> I was more concerned about the semantics and their intuitiveness
> > > >> when a CachedTable is returned, i.e. if cache() returns a
> > > >> CachedTable, what are the semantics of the following code:
> > > >> {
> > > >>  val cachedTable = a.cache()
> > > >>  val b = cachedTable.select(...)
> > > >>  val c = a.select(...)
> > > >> }
> > > >> What is the difference between b and c? At first glance, I see two
> > > >> options:
> > > >>
> > > >> Semantic 1. b uses cachedTable as the user demanded. c uses the
> > > >> original DAG as the user demanded. In this case, the optimizer has
> > > >> no chance to optimize.
> > > >> Semantic 2. b uses cachedTable as the user demanded. c leaves the
> > > >> optimizer to choose whether the cache or DAG should be used. In this
> > > >> case, users lose the option to NOT use the cache.
> > > >>
> > > >> As you can see, neither of the options seems perfect. However, I
> > > >> guess you and Till are proposing the third option:
> > > >>
> > > >> Semantic 3. b leaves the optimizer to choose whether the cache or
> > > >> DAG should be used. c always uses the DAG.
> > > >>
> > > >> This does address all the concerns. It is just that, from an
> > > >> intuitiveness perspective, I found that asking the user to
> > > >> explicitly use a CachedTable while the optimizer might choose to
> > > >> ignore it is a little weird. That was why I did not think about that
> > > >> semantic. But given there is material benefit, I think this semantic
> > > >> is acceptable.
> > > >>
> > > >>> 1. If we want to let the optimiser make decisions whether to use
> > > >>> the cache or not, then why do we need a “void cache()” method at
> > > >>> all? Would it “increase” the chance of using the cache? That sounds
> > > >>> strange. What would be the mechanism of deciding whether to use the
> > > >>> cache or not? If we want to introduce such kind of automated
> > > >>> optimisations of “plan node deduplication” I would turn it on
> > > >>> globally, not per table, and let the optimiser do all of the work.
> > > >>> 2. We do not have statistics at the moment for any use/not use
> > > >>> cache decision.
> > > >>> 3. Even if we had, I would be veeerryy sceptical whether such cost
> > > >>> based optimisations would work properly and I would still insist
> > > >>> first on providing an explicit caching mechanism (`CachedTable
> > > >>> cache()`)
> > > >>>
> > > >> We are absolutely on the same page here. An explicit cache() method
> > > >> is necessary not only because the optimizer may not be able to make
> > > >> the right decision, but also because of the nature of interactive
> > > >> programming. For example, if users write the following code in the
> > > >> Scala shell:
> > > >>  val b = a.select(...)
> > > >>  val c = b.select(...)
> > > >>  val d = c.select(...).writeToSink(...)
> > > >>  tEnv.execute()
> > > >> There is no way the optimizer can know whether b or c will be used
> > > >> in later code, unless users hint explicitly.
> > > >>
> > > >>> At the same time I’m not sure if you have responded to our
> > > >>> objections of `void cache()` being implicit/having side effects,
> > > >>> which me, Jark, Fabian, Till and I think also Shaoxuan are
> > > >>> supporting.
> > > >>
> > > >> Are there any other side effects if we use semantic 3 mentioned
> > > >> above?
> > > >>
> > > >> Thanks,
> > > >>
> > > >> Jiangjie (Becket) Qin
> > > >>
> > > >>
> > > >> On Mon, Dec 10, 2018 at 7:54 PM Piotr Nowojski <piotr@data-artisans.com>
> > > >> wrote:
> > > >>
> > > >>> Hi Becket,
> > > >>>
> > > >>> Sorry for not responding for a long time.
> > > >>>
> > > >>> Regarding case 1:
> > > >>>
> > > >>> There wouldn’t be an “a.unCache()” method, but I would expect only
> > > >>> `cachedTableA1.dropCache()`. Dropping `cachedTableA1` wouldn’t
> > > >>> affect `cachedTableA2`. Just as in any other database, dropping or
> > > >>> modifying one independent table/materialised view does not affect
> > > >>> others.
> > > >>>
> > > >>>> What I meant is that assuming there is already a cached table,
> > > >>>> ideally users need not specify whether the next query should read
> > > >>>> from the cache or use the original DAG. This should be decided by
> > > >>>> the optimizer.
> > > >>>
> > > >>> 1. If we want to let the optimiser make decisions whether to use
> > > >>> the cache or not, then why do we need a “void cache()” method at
> > > >>> all? Would it “increase” the chance of using the cache? That sounds
> > > >>> strange. What would be the mechanism of deciding whether to use the
> > > >>> cache or not? If we want to introduce such kind of automated
> > > >>> optimisations of “plan node deduplication” I would turn it on
> > > >>> globally, not per table, and let the optimiser do all of the work.
> > > >>> 2. We do not have statistics at the moment for any use/not use
> > > >>> cache decision.
> > > >>> 3. Even if we had, I would be veeerryy sceptical whether such cost
> > > >>> based optimisations would work properly and I would still insist
> > > >>> first on providing an explicit caching mechanism (`CachedTable
> > > >>> cache()`)
> > > >>> 4. As Till wrote, having an explicit `CachedTable cache()` doesn’t
> > > >>> contradict future work on automated cost based caching.
> > > >>>
> > > >>>
> > > >>> At the same time I’m not sure if you have responded to our
> > > >>> objections of `void cache()` being implicit/having side effects,
> > > >>> which me, Jark, Fabian, Till and I think also Shaoxuan are
> > > >>> supporting.
> > > >>>
> > > >>> Piotrek
> > > >>>
> > > >>>> On 5 Dec 2018, at 12:42, Becket Qin <be...@gmail.com> wrote:
> > > >>>>
> > > >>>> Hi Till,
> > > >>>>
> > > >>>> It is true that after the first job submission, there will be no
> > > >>>> ambiguity in terms of whether a cached table is used or not. That
> > > >>>> is the same for the cache() without returning a CachedTable.
> > > >>>>
> > > >>>>> Conceptually one could think of cache() as introducing a caching
> > > >>>>> operator from which you need to consume if you want to benefit
> > > >>>>> from the caching functionality.
> > > >>>>
> > > >>>> I am thinking a little differently. I think it is a hint (as you
> > > >>>> mentioned later) instead of a new operator. I'd like to be careful
> > > >>>> about the semantics of the API. A hint is a property set on an
> > > >>>> existing operator, but it is not itself an operator, as it does
> > > >>>> not really manipulate the data.
> > > >>>>
> > > >>>>> I agree, ideally the optimizer makes this kind of decision which
> > > >>>>> intermediate result should be cached. But especially when
> > > >>>>> executing ad-hoc queries the user might better know which results
> > > >>>>> need to be cached because Flink might not see the full DAG. In
> > > >>>>> that sense, I would consider the cache() method as a hint for the
> > > >>>>> optimizer. Of course, in the future we might add functionality
> > > >>>>> which tries to automatically cache results (e.g. caching the
> > > >>>>> latest intermediate results until so and so much space is used).
> > > >>>>> But this should hopefully not contradict with `CachedTable
> > > >>>>> cache()`.
> > > >>>>
> > > >>>> I agree that the cache() method is needed for exactly the reason
> > > >>>> you mentioned, i.e. Flink cannot predict what users are going to
> > > >>>> write later, so users need to tell Flink explicitly that this
> > > >>>> table will be used later. What I meant is that assuming there is
> > > >>>> already a cached table, ideally users need not specify whether the
> > > >>>> next query should read from the cache or use the original DAG.
> > > >>>> This should be decided by the optimizer.
> > > >>>>
> > > >>>> To explain the difference between returning / not returning a
> > > >>>> CachedTable, I want to compare the following two cases:
> > > >>>>
> > > >>>> *Case 1: returning a CachedTable*
> > > >>>> b = a.map(...)
> > > >>>> val cachedTableA1 = a.cache()
> > > >>>> val cachedTableA2 = a.cache()
> > > >>>> b.print() // Just to make sure a is cached.
> > > >>>>
> > > >>>> c = a.filter(...) // Does the user specify that the original DAG
> > > >>>> is used? Or does the optimizer decide whether the DAG or cache
> > > >>>> should be used?
> > > >>>> d = cachedTableA1.filter() // The user specifies that the cached
> > > >>>> table is used.
> > > >>>>
> > > >>>> a.unCache() // Can cachedTableA1/cachedTableA2 still be used
> > > >>>> afterwards?
> > > >>>> cachedTableA1.uncache() // Can cachedTableA2 still be used?
> > > >>>>
> > > >>>> *Case 2: not returning a CachedTable*
> > > >>>> b = a.map()
> > > >>>> a.cache()
> > > >>>> a.cache() // no-op
> > > >>>> b.print() // Just to make sure a is cached
> > > >>>>
> > > >>>> c = a.filter(...) // Optimizer decides whether the cache or DAG
> > > >>>> should be used
> > > >>>> d = a.filter(...) // Optimizer decides whether the cache or DAG
> > > >>>> should be used
> > > >>>>
> > > >>>> a.unCache()
> > > >>>> a.unCache() // no-op
> > > >>>>
> > > >>>> In case 1, semantics-wise, the optimizer loses the option to
> > > >>>> choose between the DAG and the cache. And the unCache() call
> > > >>>> becomes tricky.
> > > >>>> In case 2, users do not need to worry about whether the cache or
> > > >>>> DAG is used. And the unCache() semantic is clear. However, the
> > > >>>> caveat is that users cannot explicitly ignore the cache.
> > > >>>>
> > > >>>> In order to address the issues mentioned in case 2, and inspired
> > > >>>> by the discussion so far, I am thinking about using a hint to
> > > >>>> allow users to explicitly ignore the cache. Although we do not
> > > >>>> have hints yet, we probably should have them. So the code becomes:
> > > >>>>
> > > >>>> *Case 3: returning this table*
> > > >>>> b = a.map()
> > > >>>> a.cache()
> > > >>>> a.cache() // no-op
> > > >>>> b.print() // Just to make sure a is cached
> > > >>>>
> > > >>>> c = a.filter(...) // Optimizer decides whether the cache or DAG
> > > >>>> should be used
> > > >>>> d = a.hint("ignoreCache").filter(...) // DAG will be used instead
> > > >>>> of the cache.
> > > >>>>
> > > >>>> a.unCache()
> > > >>>> a.unCache() // no-op
> > > >>>>
> > > >>>> We could also let cache() return this table to allow chained
> > > >>>> method calls.
> > > >>>> Do you think this API addresses the concerns?
> > > >>>>
> > > >>>> Thanks,
> > > >>>>
> > > >>>> Jiangjie (Becket) Qin
> > > >>>>
> > > >>>>
> > > >>>> On Wed, Dec 5, 2018 at 10:55 AM Jark Wu <im...@gmail.com> wrote:
> > > >>>>
> > > >>>>> Hi,
> > > >>>>>
> > > >>>>> All the recent discussions are focused on whether there is a
> > > >>>>> problem if cache() does not return a Table.
> > > >>>>> It seems that returning a Table explicitly is more clear (and
> > > >>>>> safe?).
> > > >>>>>
> > > >>>>> So are there any problems if cache() returns a Table? @Becket
> > > >>>>>
> > > >>>>> Best,
> > > >>>>> Jark
> > > >>>>>
> > > >>>>> On Tue, 4 Dec 2018 at 22:27, Till Rohrmann <trohrmann@apache.org>
> > > >>>>> wrote:
> > > >>>>>
> > > >>>>>> It's true that b, c, d and e will all read from the original DAG
> > > >>>>>> that generates a. But all subsequent operators (when running
> > > >>>>>> multiple queries) which reference cachedTableA should not need
> > > >>>>>> to reproduce `a` but directly consume the intermediate result.
> > > >>>>>>
> > > >>>>>> Conceptually one could think of cache() as introducing a caching
> > > >>>>>> operator from which you need to consume if you want to benefit
> > > >>>>>> from the caching functionality.
> > > >>>>>>
> > > >>>>>> I agree, ideally the optimizer makes this kind of decision which
> > > >>>>>> intermediate result should be cached. But especially when
> > > >>>>>> executing ad-hoc queries the user might better know which
> > > >>>>>> results need to be cached because Flink might not see the full
> > > >>>>>> DAG. In that sense, I would consider the cache() method as a
> > > >>>>>> hint for the optimizer. Of course, in the future we might add
> > > >>>>>> functionality which tries to automatically cache results (e.g.
> > > >>>>>> caching the latest intermediate results until so and so much
> > > >>>>>> space is used). But this should hopefully not contradict with
> > > >>>>>> `CachedTable cache()`.
> > > >>>>>>
> > > >>>>>> Cheers,
> > > >>>>>> Till
> > > >>>>>>
> > > >>>>>> On Tue, Dec 4, 2018 at 2:33 PM Becket Qin <becket.qin@gmail.com>
> > > >>>>>> wrote:
> > > >>>>>>
> > > >>>>>>> Hi Till,
> > > >>>>>>>
> > > >>>>>>> Thanks for the clarification. I am still a little confused.
> > > >>>>>>>
> > > >>>>>>> If cache() returns a CachedTable, the example might become:
> > > >>>>>>>
> > > >>>>>>> b = a.map(...)
> > > >>>>>>> c = a.map(...)
> > > >>>>>>>
> > > >>>>>>> cachedTableA = a.cache()
> > > >>>>>>> d = cachedTableA.map(...)
> > > >>>>>>> e = a.map()
> > > >>>>>>>
> > > >>>>>>> In the above case, if cache() is lazily evaluated, b, c, d and
> > > >>>>>>> e are all going to be reading from the original DAG that
> > > >>>>>>> generates a. But with a naive expectation, d should be reading
> > > >>>>>>> from the cache. This does not seem to solve the potential
> > > >>>>>>> confusion you raised, right?
> > > >>>>>>>
> > > >>>>>>> Just to be clear, my understanding is all based on the
> > > >>>>>>> assumption that the tables are immutable. Therefore, after
> > > >>>>>>> a.cache(), the *cachedTableA* and the original table *a* should
> > > >>>>>>> be completely interchangeable.
> > > >>>>>>>
> > > >>>>>>> That said, I think a valid argument is optimization. There are
> > > >>>>>>> indeed cases where reading from the original DAG could be
> > > >>>>>>> faster than reading from the cache. For example:
> > > >>>>>>>
> > > >>>>>>> a = t.filter('f1 > 100)
> > > >>>>>>> a.cache()
> > > >>>>>>> b = a.filter('f1 < 100)
> > > >>>>>>>
> > > >>>>>>> Ideally the optimizer should be intelligent enough to decide
> > > >>>>>>> which way is faster, without user intervention. In this case,
> > > >>>>>>> it will identify that b would just be an empty table, and thus
> > > >>>>>>> skip reading from the cache completely.
> > > >>>>>>> But I agree that returning a CachedTable would give users
> > > >>>>>>> control over when to use the cache, even though I still feel
> > > >>>>>>> that letting the optimizer handle this is a better option in
> > > >>>>>>> the long run.
> > > >>>>>>>
> > > >>>>>>> Thanks,
> > > >>>>>>>
> > > >>>>>>> Jiangjie (Becket) Qin
> > > >>>>>>>
> > > >>>>>>>
> > > >>>>>>>
> > > >>>>>>>
> > > >>>>>>> On Tue, Dec 4, 2018 at 6:51 PM Till Rohrmann <trohrmann@apache.org>
> > > >>>>>>> wrote:
> > > >>>>>>>
> > > >>>>>>>> Yes, you are right Becket that it still depends on the actual
> > > >>>>>>>> execution of the job whether a consumer reads from a cached
> > > >>>>>>>> result or not.
> > > >>>>>>>>
> > > >>>>>>>> My point was actually about the properties of a (cached vs.
> > > >>>>>>>> non-cached) and not about the execution. I would not make
> > > >>>>>>>> cache trigger the execution of the job because one loses some
> > > >>>>>>>> flexibility by eagerly triggering the execution.
> > > >>>>>>>>
> > > >>>>>>>> I tried to argue for an explicit CachedTable which is returned
> > > >>>>>>>> by the cache() method like Piotr did in order to make the API
> > > >>>>>>>> more explicit.
> > > >>>>>>>>
> > > >>>>>>>> Cheers,
> > > >>>>>>>> Till
> > > >>>>>>>>
> > > >>>>>>>> On Mon, Dec 3, 2018 at 4:23 PM Becket Qin <becket.qin@gmail.com>
> > > >>>>>>>> wrote:
> > > >>>>>>>>
> > > >>>>>>>>> Hi Till,
> > > >>>>>>>>>
> > > >>>>>>>>> That is a good example. Just a minor correction: in this
> > > >>>>>>>>> case, b, c and d will all consume from a non-cached a. This
> > > >>>>>>>>> is because the cache will only be created on the very first
> > > >>>>>>>>> job submission that generates the table to be cached.
> > > >>>>>>>>>
> > > >>>>>>>>> If I understand correctly, this example is about whether the
> > > >>>>>>>>> .cache() method should be eagerly evaluated or lazily
> > > >>>>>>>>> evaluated. In other words, if the cache() method actually
> > > >>>>>>>>> triggers a job that creates the cache, there will be no such
> > > >>>>>>>>> confusion. Is that right?
> > > >>>>>>>>>
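> > > >>>>>>>>> To illustrate, an eagerly evaluated cache() might look
> > > >>>>>>>>> roughly like the following sketch (the helper name is made
> > > >>>>>>>>> up; this is not a concrete proposal):
> > > >>>>>>>>>
> > > >>>>>>>>> def cache(): Unit = {
> > > >>>>>>>>>   // add an internal sink that materializes this table
> > > >>>>>>>>>   registerIntermediateSink(this) // hypothetical helper
> > > >>>>>>>>>   tEnv.execute() // the cache exists once this call returns
> > > >>>>>>>>> }
> > > >>>>>>>>>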
> > > >>>>>>>>> In the example, although d will not consume from the cached
> > > >>>>>>>>> Table while it looks like it is supposed to, from a
> > > >>>>>>>>> correctness perspective the code will still return the
> > > >>>>>>>>> correct result, assuming that tables are immutable.
> > > >>>>>>>>>
> > > >>>>>>>>> Personally I feel it is OK because users probably won't
> > > >>>>>>>>> really worry about whether the table is cached or not. And a
> > > >>>>>>>>> lazy cache could avoid some unnecessary caching if a cached
> > > >>>>>>>>> table is never created in the user application. But I am not
> > > >>>>>>>>> opposed to eager evaluation of the cache.
> > > >>>>>>>>>
> > > >>>>>>>>> Thanks,
> > > >>>>>>>>>
> > > >>>>>>>>> Jiangjie (Becket) Qin
> > > >>>>>>>>>
> > > >>>>>>>>>
> > > >>>>>>>>>
> > > >>>>>>>>> On Mon, Dec 3, 2018 at 10:01 PM Till Rohrmann <trohrmann@apache.org>
> > > >>>>>>>>> wrote:
> > > >>>>>>>>>
> > > >>>>>>>>>> Another argument for Piotr's point is that lazily changing
> > > >>>>>>>>>> properties of a node affects all downstream consumers but
> > > >>>>>>>>>> does not necessarily have to happen before these consumers
> > > >>>>>>>>>> are defined. From a user's perspective this can be quite
> > > >>>>>>>>>> confusing:
> > > >>>>>>>>>>
> > > >>>>>>>>>> b = a.map(...)
> > > >>>>>>>>>> c = a.map(...)
> > > >>>>>>>>>>
> > > >>>>>>>>>> a.cache()
> > > >>>>>>>>>> d = a.map(...)
> > > >>>>>>>>>>
> > > >>>>>>>>>> now b, c and d will consume from a cached operator. In this
> > > >>>>>>>>>> case, the user would most likely expect that only d reads
> > > >>>>>>>>>> from a cached result.
> > > >>>>>>>>>>
> > > >>>>>>>>>> Cheers,
> > > >>>>>>>>>> Till
> > > >>>>>>>>>>
> > > >>>>>>>>>> On Mon, Dec 3, 2018 at 11:32 AM Piotr Nowojski <piotr@data-artisans.com>
> > > >>>>>>>>>> wrote:
> > > >>>>>>>>>>
> > > >>>>>>>>>>> Hey Shaoxuan and Becket,
> > > >>>>>>>>>>>
> > > >>>>>>>>>>>> Can you explain a bit more on what the side effects are?
> > > >>>>>>>>>>>> So far my understanding is that such side effects only
> > > >>>>>>>>>>>> exist if a table is mutable. Is that the case?
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> Not only that. There are also performance implications, and
> > > >>>>>>>>>>> those are another implicit side effect of using `void
> > > >>>>>>>>>>> cache()`. As I wrote before, reading from the cache might
> > > >>>>>>>>>>> not always be desirable, thus it can cause performance
> > > >>>>>>>>>>> degradation, and I’m fine with that - user's or optimiser’s
> > > >>>>>>>>>>> choice. What I do not like is that this implicit side
> > > >>>>>>>>>>> effect can manifest in a completely different part of the
> > > >>>>>>>>>>> code, one that wasn’t touched by a user while he was adding
> > > >>>>>>>>>>> a `void cache()` call somewhere else. And even if caching
> > > >>>>>>>>>>> improves performance, it’s still a side effect of `void
> > > >>>>>>>>>>> cache()`. Almost by definition `void` methods have only
> > > >>>>>>>>>>> side effects. As I wrote before, there are a couple of
> > > >>>>>>>>>>> scenarios where this might be undesirable and/or
> > > >>>>>>>>>>> unexpected, for example:
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> 1.
> > > >>>>>>>>>>> Table b = …;
> > > >>>>>>>>>>> b.cache()
> > > >>>>>>>>>>> x = b.join(…)
> > > >>>>>>>>>>> y = b.count()
> > > >>>>>>>>>>> // ... a hundred lines of code later ...
> > > >>>>>>>>>>> z = b.filter(…).groupBy(…) // this might even be hidden in
> > > >>>>>>>>>>> a different method/file/package/dependency
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> 2.
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> Table b = ...
> > > >>>>>>>>>>> if (some_condition) {
> > > >>>>>>>>>>>   foo(b)
> > > >>>>>>>>>>> } else {
> > > >>>>>>>>>>>   bar(b)
> > > >>>>>>>>>>> }
> > > >>>>>>>>>>> z = b.filter(…).groupBy(…)
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> void foo(Table b) {
> > > >>>>>>>>>>>   b.cache()
> > > >>>>>>>>>>>   // do something with b
> > > >>>>>>>>>>> }
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> In both of the above examples, `b.cache()` will implicitly
> > > >>>>>>>>>>> affect `z = b.filter(…).groupBy(…)` (the semantics of the
> > > >>>>>>>>>>> program in case of mutable sources, and its performance),
> > > >>>>>>>>>>> which might be far from obvious.
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> On top of that, there is still this argument of mine that
> > > >>>>>>>>>>> having a `MaterializedTable` or `CachedTable` handle is
> > > >>>>>>>>>>> more flexible for us in the future and for the user (as a
> > > >>>>>>>>>>> manual option to bypass cache reads).
> > > >>>>>> reads).
> > > >>>>>>>>>>>
> > > >>>>>>>>>>>> But Jiangjie is correct, the source table in batching
> > > >>>>>>>>>>>> should be immutable. It is the user’s responsibility to
> > > >>>>>>>>>>>> ensure it, otherwise even a regular failover may lead to
> > > >>>>>>>>>>>> inconsistent results.
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> Yes, I agree that’s what a perfect world/good deployment
> > > >>>>>>>>>>> should be. But it often isn’t, and while I’m not trying to
> > > >>>>>>>>>>> fix this (since the proper fix is to support transactions),
> > > >>>>>>>>>>> I’m just trying to minimise confusion for the users that
> > > >>>>>>>>>>> are not fully aware of what’s going on and operate in a
> > > >>>>>>>>>>> less than perfect setup. And if something bites them after
> > > >>>>>>>>>>> adding a `b.cache()` call, to make sure that they at least
> > > >>>>>>>>>>> know all of the places that adding this line can affect.
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> Thanks, Piotrek
> > > >>>>>>>>>>>
> > > >>>>>>>>>>>> On 1 Dec 2018, at 15:39, Becket Qin <becket.qin@gmail.com> wrote:
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> Hi Piotrek,
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> Thanks again for the clarification. Some more replies
> > > >>>>>>>>>>>> follow.
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>>> But keep in mind that `.cache()` will/might not only be
> > > >>>>>>>>>>>>> used in interactive programming and not only in batching.
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> It is true. Actually, in stream processing, cache() has
> > > >>>>>>>>>>>> the same semantics as in batch processing. The semantics
> > > >>>>>>>>>>>> are the following:
> > > >>>>>>>>>>>> For a table created via a series of computations, save
> > > >>>>>>>>>>>> that table for later reference to avoid running the
> > > >>>>>>>>>>>> computation logic to regenerate the table. Once the
> > > >>>>>>>>>>>> application exits, drop all the caches.
> > > >>>>>>>>>>>> This semantic is the same for both batch and stream
> > > >>>>>>>>>>>> processing. The difference is that stream applications
> > > >>>>>>>>>>>> will only run once, as they are long running. And batch
> > > >>>>>>>>>>>> applications may be run multiple times, hence the cache
> > > >>>>>>>>>>>> may be created and dropped each time the application runs.
> > > >>>>>>>>>>>> Admittedly, there will probably be some resource
> > > >>>>>>>>>>>> management requirements for the streaming cached table,
> > > >>>>>>>>>>>> such as time based / size based retention, to address the
> > > >>>>>>>>>>>> infinite data issue. But such requirements do not change
> > > >>>>>>>>>>>> the semantics.
> > > >>>>>>>>>>>> You are right that interactive programming is just one use
> > > >>>>>>>>>>>> case of cache(). It is not the only use case.
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>>> For me the more important issue is of not having the
> > > >>>>>>>>>>>>> `void cache()` with side effects.
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> This is indeed the key point. The argument around whether
> > > >>>>>>>>>>>> cache() should return something already indicates that
> > > >>>>>>>>>>>> cache() and materialize() address different issues.
> > > >>>>>>>>>>>> Can you explain a bit more on what the side effects are?
> > > >>>>>>>>>>>> So far my understanding is that such side effects only
> > > >>>>>>>>>>>> exist if a table is mutable. Is that the case?
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>>> I don’t know, probably initially we should make
> > > >>>>>>>>>>>>> CachedTable read-only. I don’t find it more confusing
> > > >>>>>>>>>>>>> than the fact that the user can not write to views or
> > > >>>>>>>>>>>>> materialised views in SQL or that the user currently can
> > > >>>>>>>>>>>>> not write to a Table.
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> I don't think anyone should insert something into a cache.
> > > >>>>>>>>>>>> By definition the cache should only be updated when the
> > > >>>>>>>>>>>> corresponding original table is updated. What I am
> > > >>>>>>>>>>>> wondering is that, given the following two facts:
> > > >>>>>>>>>>>> 1. If and only if a table is mutable (with something like
> > > >>>>>>>>>>>> insert()), a CachedTable may have implicit behavior.
> > > >>>>>>>>>>>> 2. A CachedTable extends a Table.
> > > >>>>>>>>>>>> We can come to the conclusion that a CachedTable is
> > > >>>>>>>>>>>> mutable and users can insert into the CachedTable
> > > >>>>>>>>>>>> directly. This is what I found confusing.
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> Thanks,
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> Jiangjie (Becket) Qin
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> On Sat, Dec 1, 2018 at 2:45 AM Piotr Nowojski <piotr@data-artisans.com>
> > > >>>>>>>>>>>> wrote:
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>>> Hi all,
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>> Regarding naming `cache()` vs `materialize()`. One more
> > > >>>>>>>>>>>>> explanation why `materialize()` is more natural to me is
> > > >>>>>>>>>>>>> that I think of all “Table”s in the Table API as views.
> > > >>>>>>>>>>>>> They behave the same way as SQL views; the only
> > > >>>>>>>>>>>>> difference for me is that their life scope is short - the
> > > >>>>>>>>>>>>> current session, which is limited by a different
> > > >>>>>>>>>>>>> execution model. That’s why “caching” a view for me is
> > > >>>>>>>>>>>>> just materialising it.
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>> However, I see and understand your point of view. Coming
> > > >>>>>>>>>>>>> from DataSet/DataStream and, generally speaking, the
> > > >>>>>>>>>>>>> non-SQL world, `cache()` is more natural. But keep in
> > > >>>>>>>>>>>>> mind that `.cache()` will/might not only be used in
> > > >>>>>>>>>>>>> interactive programming and not only in batching. Naming
> > > >>>>>>>>>>>>> is one issue, though, and not that critical to me.
> > > >>>>>>>>>>>>> Especially since once we implement proper materialised
> > > >>>>>>>>>>>>> views, we can always deprecate/rename `cache()` if we
> > > >>>>>>>>>>>>> deem so.
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>> For me the more important issue is that of not having the
> > > >>>>>>>>>>>>> `void cache()` with side effects, exactly for the reasons
> > > >>>>>>>>>>>>> that you have mentioned. True: results might be
> > > >>>>>>>>>>>>> non-deterministic if the underlying source tables are
> > > >>>>>>>>>>>>> changing. The problem is that `void cache()` implicitly
> > > >>>>>>>>>>>>> changes the semantics of subsequent uses of the
> > > >>>>>>>>>>>>> cached/materialized Table. It can cause a “wtf” moment
> > > >>>>>>>>>>>>> for a user if he inserts a “b.cache()” call in some place
> > > >>>>>>>>>>>>> in his code and suddenly some other random places are
> > > >>>>>>>>>>>>> behaving differently. If `materialize()` or `cache()`
> > > >>>>>>>>>>>>> returns a Table handle, we force the user to explicitly
> > > >>>>>>>>>>>>> use the cache, which removes the “random” part from the
> > > >>>>>>>>>>>>> "suddenly some other random places are behaving
> > > >>>>>>>>>>>>> differently”.
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>> This argument and the others that I’ve raised (greater
> > > >>>>>>>>>>>>> flexibility/allowing the user to explicitly bypass the
> > > >>>>>>>>>>>>> cache) are independent of the `cache()` vs `materialize()`
> > > >>>>>>>>>>>>> discussion.
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> Does that mean one can also insert into the CachedTable?
> > > >>>>>>>>>>>>>> This sounds pretty confusing.
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>> I don’t know, probably initially we should make
> > > >>>>>>>>>>>>> CachedTable read-only. I don’t find it more confusing
> > > >>>>>>>>>>>>> than the fact that the user can not write to views or
> > > >>>>>>>>>>>>> materialised views in SQL or that the user currently can
> > > >>>>>>>>>>>>> not write to a Table.
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>> Piotrek
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> On 30 Nov 2018, at 17:38, Xingcan Cui <xingcanc@gmail.com> wrote:
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> Hi all,
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> I agree with @Becket that `cache()` and `materialize()`
> > > >>>>>> should
> > > >>>>>>> be
> > > >>>>>>>>>>>>> considered as two different methods where the later one
> is
> > > >>>>>> more
> > > >>>>>>>>>>>>> sophisticated.
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> According to my understanding, the initial idea is just
> to
> > > >>>>>>>>> introduce
> > > >>>>>>>>>> a
> > > >>>>>>>>>>>>> simple cache or persist mechanism, but as the TableAPI
> is a
> > > >>>>>>>>> high-level
> > > >>>>>>>>>>> API,
> > > >>>>>>>>>>> it’s natural for us to think in a SQL way.
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> Maybe we can add the `cache()` method to the DataSet API
> > > >>>>> and
> > > >>>>>>>> force
> > > >>>>>>>>>>> users
> > > >>>>>>>>>>>>> to translate a Table to a Dataset before caching it. Then
> > > >>>>> the
> > > >>>>>>>> users
> > > >>>>>>>>>>> should
> > > >>>>>>>>>>>>> manually register the cached dataset to a table again (we
> > > >>>>> may
> > > >>>>>>> need
> > > >>>>>>>>>> some
> > > >>>>>>>>>>>>> table replacement mechanisms for datasets with an
> identical
> > > >>>>>>> schema
> > > >>>>>>>>> but
> > > >>>>>>>>>>>>> different contents here). After all, it’s the dataset
> > rather
> > > >>>>>>> than
> > > >>>>>>>>> the
> > > >>>>>>>>>>>>> dynamic table that needs to be cached, right?
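> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> Roughly, with the current batch API the flow could look like the
> > > >>>>>>>>>>>>>> sketch below, where the DataSet#cache() call is the hypothetical
> > > >>>>>>>>>>>>>> part of this suggestion:
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> DataSet<Row> ds = tEnv.toDataSet(t, Row.class); // Table -> DataSet
> > > >>>>>>>>>>>>>> DataSet<Row> cached = ds.cache();               // proposed method
> > > >>>>>>>>>>>>>> tEnv.registerDataSet("cachedT", cached);        // register again
> > > >>>>>>>>>>>>>> Table t2 = tEnv.scan("cachedT");                // reuse as a Table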
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> Best,
> > > >>>>>>>>>>>>>> Xingcan
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>> On Nov 30, 2018, at 10:57 AM, Becket Qin <
> > > >>>>>>> becket.qin@gmail.com>
> > > >>>>>>>>>>> wrote:
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>> Hi Piotrek and Jark,
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>> Thanks for the feedback and explanation. Those are good
> > > >>>>>>>> arguments.
> > > >>>>>>>>>>> But I
> > > >>>>>>>>>>>>>>> think those arguments are mostly about materialized
> view.
> > > >>>>>> Let
> > > >>>>>>> me
> > > >>>>>>>>> try
> > > >>>>>>>>>>> to
> > > >>>>>>>>>>>>>>> explain the reason I believe cache() and materialize()
> > are
> > > >>>>>>>>>> different.
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>> I think cache() and materialize() have quite different
> > > >>>>>>>>> implications.
> > > >>>>>>>>>>> An
> > > >>>>>>>>>>>>>>> analogy I can think of is save()/publish(). When users
> > > >>>>> call
> > > >>>>>>>>> cache(),
> > > >>>>>>>>>>> it
> > > >>>>>>>>>>>>> is
> > > >>>>>>>>>>>>>>> just like they are saving an intermediate result as a
> > > >>>>> draft
> > > >>>>>> of
> > > >>>>>>>>> their
> > > >>>>>>>>>>>>> work,
> > > >>>>>>>>>>>>>>> this intermediate result may not have any realistic
> > > >>>>> meaning.
> > > >>>>>>>>> Calling
> > > >>>>>>>>>>>>>>> cache() does not mean users want to publish the cached
> > > >>>>> table
> > > >>>>>>> in
> > > >>>>>>>>> any
> > > >>>>>>>>>>>>> manner.
> > > >>>>>>>>>>>>>>> But when users call materialize(), that means "I have
> > > >>>>>>> something
> > > >>>>>>>>>>>>> meaningful
> > > >>>>>>>>>>>>>>> to be reused by others", now users need to think about
> > the
> > > >>>>>>>>>> validation,
> > > >>>>>>>>>>>>>>> update & versioning, lifecycle of the result, etc.
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>> Piotrek's suggestions on variations of the
> materialize()
> > > >>>>>>> methods
> > > >>>>>>>>> are
> > > >>>>>>>>>>>>> very
> > > >>>>>>>>>>>>>>> useful. It would be great if Flink had them. The
> concept
> > > >>>>> of
> > > >>>>>>>>>>>>> materialized
> > > >>>>>>>>>>>>>>> view is actually a pretty big feature, not to mention the
> > > >>>>>> related
> > > >>>>>>>>> stuff
> > > >>>>>>>>>>> like
> > > >>>>>>>>>>>>>>> triggers/hooks you mentioned earlier. I think the
> > > >>>>>> materialized
> > > >>>>>>>>> view
> > > >>>>>>>>>>>>> itself
> > > >>>>>>>>>>>>>>> should be discussed in a more thorough and systematic
> > > >>>>>> manner.
> > > >>>>>>>> And
> > > >>>>>>>>> I
> > > >>>>>>>>>>>>> found
> > > >>>>>>>>>>>>>>> that discussion is kind of orthogonal and way beyond
> > > >>>>>>> interactive
> > > >>>>>>>>>>>>>>> programming experience.
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>> The example you gave was interesting. I still have some
> > > >>>>>>>> questions,
> > > >>>>>>>>>>>>> though.
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>> Table source = … // some source that scans files from a
> > > >>>>>>>> directory
> > > >>>>>>>>>>>>>>>> “/foo/bar/“
> > > >>>>>>>>>>>>>>>> Table t1 = source.groupBy(…).select(…).where(…) ….;
> > > >>>>>>>>>>>>>>>> Table t2 = t1.materialize() // (or `cache()`)
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>> t2.count() // initialise cache (if it’s lazily
> > > >>>>> initialised)
> > > >>>>>>>>>>>>>>>> int a1 = t1.count()
> > > >>>>>>>>>>>>>>>> int b1 = t2.count()
> > > >>>>>>>>>>>>>>>> // something in the background (or we trigger it)
> writes
> > > >>>>>> new
> > > >>>>>>>>> files
> > > >>>>>>>>>> to
> > > >>>>>>>>>>>>>>>> /foo/bar
> > > >>>>>>>>>>>>>>>> int a2 = t1.count()
> > > >>>>>>>>>>>>>>>> int b2 = t2.count()
> > > >>>>>>>>>>>>>>>> t2.refresh() // possible future extension, not to be
> > > >>>>>>>> implemented
> > > >>>>>>>>> in
> > > >>>>>>>>>>> the
> > > >>>>>>>>>>>>>>>> initial version
> > > >>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>> what if someone else added some more files to /foo/bar
> at
> > > >>>>>> this
> > > >>>>>>>>>> point?
> > > >>>>>>>>>>> In
> > > >>>>>>>>>>>>>>> that case, a3 won't equal b3, and the result becomes
> > > >>>>>>>>>>>>> non-deterministic,
> > > >>>>>>>>>>>>>>> right?
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>> int a3 = t1.count()
> > > >>>>>>>>>>>>>>>> int b3 = t2.count()
> > > >>>>>>>>>>>>>>>> t2.drop() // another possible future extension, manual
> > > >>>>>>> “cache”
> > > >>>>>>>>>>> dropping
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>> When we talk about interactive programming, in most
> > cases,
> > > >>>>>> we
> > > >>>>>>>> are
> > > >>>>>>>>>>>>> talking
> > > >>>>>>>>>>>>>>> about batch applications. A fundamental assumption of
> > such
> > > >>>>>>> case
> > > >>>>>>>> is
> > > >>>>>>>>>>> that
> > > >>>>>>>>>>>>> the
> > > >>>>>>>>>>>>>>> source data is complete before the data processing
> > begins,
> > > >>>>>> and
> > > >>>>>>>> the
> > > >>>>>>>>>>> data
> > > >>>>>>>>>>>>>>> will not change during the data processing. IMO, if
> > > >>>>>> additional
> > > >>>>>>>>> rows
> > > >>>>>>>>>>>>> needs
> > > >>>>>>>>>>>>>>> to be added to some source during the processing, it
> > > >>>>> should
> > > >>>>>> be
> > > >>>>>>>>> done
> > > >>>>>>>>>> in
> > > >>>>>>>>>>>>> ways
> > > >>>>>>>>>>>>>>> like union the source with another table containing the
> > > >>>>> rows
> > > >>>>>>> to
> > > >>>>>>>> be
> > > >>>>>>>>>>>>> added.
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>> There are a few cases that computations are executed
> > > >>>>>>> repeatedly
> > > >>>>>>>> on
> > > >>>>>>>>>> the
> > > >>>>>>>>>>>>>>> changing data source.
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>> For example, people may run a ML training job every
> hour
> > > >>>>>> with
> > > >>>>>>>> the
> > > >>>>>>>>>>>>> samples
> > > >>>>>>>>>>>>>>> newly added in the past hour. In that case, the source
> > > >>>>> data
> > > >>>>>>>>> between
> > > >>>>>>>>>>> runs will
> > > >>>>>>>>>>>>>>> indeed change. But still, the data remains unchanged
> > within
> > > >>>>>> one
> > > >>>>>>>>> run.
> > > >>>>>>>>>>> And
> > > >>>>>>>>>>>>>>> usually in that case, the result will need versioning,
> > > >>>>> i.e.
> > > >>>>>>> for
> > > >>>>>>>> a
> > > >>>>>>>>>>> given
> > > >>>>>>>>>>>>>>> result, it tells that the result is a result from the
> > > >>>>> source
> > > >>>>>>>> data
> > > >>>>>>>>>> by a
> > > >>>>>>>>>>>>>>> certain timestamp.
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>> Another example is something like data warehouse. In
> this
> > > >>>>>>> case,
> > > >>>>>>>>>> there
> > > >>>>>>>>>>>>> are a
> > > >>>>>>>>>>>>>>> few source of original/raw data. On top of those
> sources,
> > > >>>>>> many
> > > >>>>>>>>>>>>> materialized
> > > >>>>>>>>>>>>>>> view / queries / reports / dashboards can be created to
> > > >>>>>>> generate
> > > >>>>>>>>>>> derived
> > > >>>>>>>>>>>>>>> data. Those derived data needs to be updated when the
> > > >>>>>>> underlying
> > > >>>>>>>>>>>>> original
> > > >>>>>>>>>>>>>>> data changes. In that case, the processing logic that
> > > >>>>>> derives
> > > >>>>>>>> the
> > > >>>>>>>>>>>>> original
> > > >>>>>>>>>>>>>>> data needs to be executed repeatedly to update those
> > > >>>>>>>>> reports/views.
> > > >>>>>>>>>>>>> Again,
> > > >>>>>>>>>>>>>>> all those derived data also need to have version
> > > >>>>> management,
> > > >>>>>>>> such
> > > >>>>>>>>> as
> > > >>>>>>>>>>>>>>> timestamp.
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>> In any of the above two cases, during a single run of
> the
> > > >>>>>>>>> processing
> > > >>>>>>>>>>>>> logic,
> > > >>>>>>>>>>>>>>> the data cannot change. Otherwise the behavior of the
> > > >>>>>>> processing
> > > >>>>>>>>>> logic
> > > >>>>>>>>>>>>> may
> > > >>>>>>>>>>>>>>> be undefined. In the above two examples, when writing
> the
> > > >>>>>>>>> processing
> > > >>>>>>>>>>>>> logic,
> > > >>>>>>>>>>>>>>> users can use .cache() to hint Flink that those results
> > > >>>>>> should
> > > >>>>>>>> be
> > > >>>>>>>>>>> saved
> > > >>>>>>>>>>>>> to
> > > >>>>>>>>>>>>>>> avoid repeated computation. And then for the result of
> my
> > > >>>>>>>>>> application
> > > >>>>>>>>>>>>>>> logic, I'll call materialize(), so that these results
> > > >>>>> could
> > > >>>>>> be
> > > >>>>>>>>>> managed
> > > >>>>>>>>>>>>> by
> > > >>>>>>>>>>>>>>> the system with versioning, metadata management,
> > lifecycle
> > > >>>>>>>>>> management,
> > > >>>>>>>>>>>>>>> ACLs, etc.
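> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>> In code, the two would sit side by side, roughly like this (both
> > > >>>>>>>>>>>>>>> methods are proposals and the names are not final):
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>> Table cleaned = rawEvents.filter(…).select(…);
> > > >>>>>>>>>>>>>>> cleaned.cache();        // intermediate draft, session scoped
> > > >>>>>>>>>>>>>>> Table report = cleaned.groupBy(…).select(…);
> > > >>>>>>>>>>>>>>> report.materialize();   // published result, versioned and managed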
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>> It is true we can use materialize() to do the cache()
> > job,
> > > >>>>>>> but I
> > > >>>>>>>>> am
> > > >>>>>>>>>>>>> really
> > > >>>>>>>>>>>>>>> reluctant to shoehorn cache() into materialize() and
> > force
> > > >>>>>>> users
> > > >>>>>>>>> to
> > > >>>>>>>>>>>>> worry
> > > >>>>>>>>>>>>>>> about a bunch of implications that they needn't have
> to.
> > I
> > > >>>>>> am
> > > >>>>>>>>>>>>> absolutely on
> > > >>>>>>>>>>>>>>> your side that redundant API is bad. But it is equally
> > > >>>>>>>>> frustrating,
> > > >>>>>>>>>> if
> > > >>>>>>>>>>>>> not
> > > >>>>>>>>>>>>>>> more, that the same API does different things.
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>> Thanks,
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>> Jiangjie (Becket) Qin
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>> On Fri, Nov 30, 2018 at 10:34 PM Shaoxuan Wang <
> > > >>>>>>>>> wshaoxuan@gmail.com
> > > >>>>>>>>>>>
> > > >>>>>>>>>>>>> wrote:
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>> Thanks Piotrek,
> > > >>>>>>>>>>>>>>>> You provided a very good example, it explains all the
> > > >>>>>>>> confusions
> > > >>>>>>>>> I
> > > >>>>>>>>>>>>> have.
> > > >>>>>>>>>>>>>>>> It is clear that there is something we have not
> > > >>>>> considered
> > > >>>>>> in
> > > >>>>>>>> the
> > > >>>>>>>>>>>>> initial
> > > >>>>>>>>>>>>>>>> proposal. We intend to force the user to reuse the
> > > >>>>>>>>>>> cached/materialized
> > > >>>>>>>>>>>>>>>> table, if its cache() method is executed. We did not
> > > >>>>> expect
> > > >>>>>>>> that
> > > >>>>>>>>>> user
> > > >>>>>>>>>>>>> may
> > > >>>>>>>>>>>>>>>> want to re-execute the plan from the source table.
> Let
> > > >>>>> me
> > > >>>>>>>>> re-think
> > > >>>>>>>>>>>>> about
> > > >>>>>>>>>>>>>>>> it and get back to you later.
> > > >>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>> In the meanwhile, this example/observation also infers
> > > >>>>> that
> > > >>>>>>> we
> > > >>>>>>>>>> cannot
> > > >>>>>>>>>>>>> fully
> > > >>>>>>>>>>>>>>>> involve the optimizer to decide the plan if a
> > > >>>>>>> cache/materialize
> > > >>>>>>>>> is
> > > >>>>>>>>>>>>>>>> explicitly used, because whether to reuse the cached
> data
> > > >>>>> or
> > > >>>>>>>>>>> re-execute
> > > >>>>>>>>>>>>> the
> > > >>>>>>>>>>>>>>>> query from source data may lead to different results.
> > > >>>>> (But
> > > >>>>>> I
> > > >>>>>>>>> guess
> > > >>>>>>>>>>>>>>>> optimizer can still help in some cases ---- as long as
> > it
> > > >>>>>>> does
> > > >>>>>>>>> not
> > > >>>>>>>>>>>>>>>> re-execute from the varied source, we should be safe).
> > > >>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>> Regards,
> > > >>>>>>>>>>>>>>>> Shaoxuan
> > > >>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>> On Fri, Nov 30, 2018 at 9:13 PM Piotr Nowojski <
> > > >>>>>>>>>>>>> piotr@data-artisans.com>
> > > >>>>>>>>>>>>>>>> wrote:
> > > >>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>> Hi Shaoxuan,
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>> Re 2:
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>> Table t3 = methodThatAppliesOperators(t1) // t1 is
> > > >>>>>> modified
> > > >>>>>>>>> to->
> > > >>>>>>>>>>> t1’
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>> What do you mean that “ t1 is modified to-> t1’ ” ?
> > That
> > > >>>>>>>>>>>>>>>>> `methodThatAppliesOperators()` method has changed
> it’s
> > > >>>>>> plan?
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>> I was thinking more about something like this:
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>> Table source = … // some source that scans files
> from a
> > > >>>>>>>>> directory
> > > >>>>>>>>>>>>>>>>> “/foo/bar/“
> > > >>>>>>>>>>>>>>>>> Table t1 = source.groupBy(…).select(…).where(…) ….;
> > > >>>>>>>>>>>>>>>>> Table t2 = t1.materialize() // (or `cache()`)
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>> t2.count() // initialise cache (if it’s lazily
> > > >>>>>> initialised)
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>> int a1 = t1.count()
> > > >>>>>>>>>>>>>>>>> int b1 = t2.count()
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>> // something in the background (or we trigger it)
> > writes
> > > >>>>>> new
> > > >>>>>>>>> files
> > > >>>>>>>>>>> to
> > > >>>>>>>>>>>>>>>>> /foo/bar
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>> int a2 = t1.count()
> > > >>>>>>>>>>>>>>>>> int b2 = t2.count()
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>> t2.refresh() // possible future extension, not to be
> > > >>>>>>>> implemented
> > > >>>>>>>>>> in
> > > >>>>>>>>>>>>> the
> > > >>>>>>>>>>>>>>>>> initial version
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>> int a3 = t1.count()
> > > >>>>>>>>>>>>>>>>> int b3 = t2.count()
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>> t2.drop() // another possible future extension,
> manual
> > > >>>>>>> “cache”
> > > >>>>>>>>>>>>> dropping
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>> assertTrue(a1 == b1) // same results, but b1 comes
> from
> > > >>>>>> the
> > > >>>>>>>>>> “cache"
> > > >>>>>>>>>>>>>>>>> assertTrue(b1 == b2) // both values come from the
> same
> > > >>>>>> cache
> > > >>>>>>>>>>>>>>>>> assertTrue(a2 > b2) // b2 comes from cache, a2
> > > >>>>> re-executed
> > > >>>>>>>> full
> > > >>>>>>>>>>> table
> > > >>>>>>>>>>>>>>>> scan
> > > >>>>>>>>>>>>>>>>> and has more data
> > > >>>>>>>>>>>>>>>>> assertTrue(b3 > b2) // b3 comes from refreshed cache
> > > >>>>>>>>>>>>>>>>> assertTrue(b3 == a2 && a2 == a3)
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>> Piotrek
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>> On 30 Nov 2018, at 10:22, Jark Wu <imjark@gmail.com
> >
> > > >>>>>>> wrote:
> > > >>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>> Hi,
> > > >>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>> It is an very interesting and useful design!
> > > >>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>> Here I want to share some of my thoughts:
> > > >>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>> 1. Agree that the cache() method should return some
> > > >>>>>> Table
> > > >>>>>>> to
> > > >>>>>>>>>> avoid
> > > >>>>>>>>>>>>>>>> some
> > > >>>>>>>>>>>>>>>>>> unexpected problems because of the mutable object.
> > > >>>>>>>>>>>>>>>>>> All the existing methods of Table are returning a
> new
> > > >>>>>> Table
> > > >>>>>>>>>>> instance.
> > > >>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>> 2. I think materialize() would be more consistent
> with
> > > >>>>>> SQL,
> > > >>>>>>>>> this
> > > >>>>>>>>>>>>> makes
> > > >>>>>>>>>>>>>>>> it
> > > >>>>>>>>>>>>>>>>>> possible to support the same feature for SQL
> > > >>>>> (materialized
> > > >>>>>>>> view)
> > > >>>>>>>>>> and
> > > >>>>>>>>>>>>>>>> keep
> > > >>>>>>>>>>>>>>>>>> the same API for users in the future.
> > > >>>>>>>>>>>>>>>>>> But I'm also fine if we choose cache().
> > > >>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>> 3. In the proposal, a TableService (or
> FlinkService?)
> > > >>>>> is
> > > >>>>>>> used
> > > >>>>>>>>> to
> > > >>>>>>>>>>>>> cache
> > > >>>>>>>>>>>>>>>>> the
> > > >>>>>>>>>>>>>>>>>> result of the (intermediate) table.
> > > >>>>>>>>>>>>>>>>>> But the name of TableService may be a bit general
> > which
> > > >>>>>> is
> > > >>>>>>>> not
> > > >>>>>>>>>>> quite
> > > >>>>>>>>>>>>>>>>>> clear at first glance (a
> > > >>>>> metastore
> > > >>>>>>> for
> > > >>>>>>>>>>>>> tables?).
> > > >>>>>>>>>>>>>>>>>> Maybe a more specific name would be better, such as
> > > >>>>>>>>>>> TableCacheService
> > > >>>>>>>>>>>>>>>> or
> > > >>>>>>>>>>>>>>>>>> TableMaterializeService or something else.
> > > >>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>> Best,
> > > >>>>>>>>>>>>>>>>>> Jark
> > > >>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>> On Thu, 29 Nov 2018 at 21:16, Fabian Hueske <
> > > >>>>>>>> fhueske@gmail.com
> > > >>>>>>>>>>
> > > >>>>>>>>>>>>> wrote:
> > > >>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>> Hi,
> > > >>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>> Thanks for the clarification Becket!
> > > >>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>> I have a few thoughts to share / questions:
> > > >>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>> 1) I'd like to know how you plan to implement the
> > > >>>>>> feature
> > > >>>>>>>> on a
> > > >>>>>>>>>>> plan
> > > >>>>>>>>>>>>> /
> > > >>>>>>>>>>>>>>>>>>> planner level.
> > > >>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>> I would imagine the following to happen when
> > > >>>>>> Table.cache()
> > > >>>>>>>> is
> > > >>>>>>>>>>>>> called (see the sketch after the list):
> > > >>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>> 1) immediately optimize the Table and internally
> > > >>>>> convert
> > > >>>>>>> it
> > > >>>>>>>>>> into a
> > > >>>>>>>>>>>>>>>>>>> DataSet/DataStream. This is necessary, to avoid
> that
> > > >>>>>>>> operators
> > > >>>>>>>>>> of
> > > >>>>>>>>>>>>>>>> later
> > > >>>>>>>>>>>>>>>>>>> queries on top of the Table are pushed down.
> > > >>>>>>>>>>>>>>>>>>> 2) register the DataSet/DataStream as a
> > > >>>>>>>>>> DataSet/DataStream-backed
> > > >>>>>>>>>>>>>>>> Table
> > > >>>>>>>>>>>>>>>>> X
> > > >>>>>>>>>>>>>>>>>>> 3) add a sink to the DataSet/DataStream. This is
> the
> > > >>>>>>>>>>> materialization
> > > >>>>>>>>>>>>>>>> of
> > > >>>>>>>>>>>>>>>>> the
> > > >>>>>>>>>>>>>>>>>>> Table X
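> > > >>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>> As a sketch, the three steps against the current batch API
> > > >>>>>>>>>>>>>>>>>>> (BatchTableEnvironment methods; the output format is just a
> > > >>>>>>>>>>>>>>>>>>> placeholder):
> > > >>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>> DataSet<Row> ds = tEnv.toDataSet(t1, Row.class); // 1) optimize + convert
> > > >>>>>>>>>>>>>>>>>>> tEnv.registerDataSet("X", ds);                   // 2) DataSet-backed Table X
> > > >>>>>>>>>>>>>>>>>>> ds.output(someOutputFormat);                     // 3) materialization of X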
> > > >>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>> Based on your proposal the following would happen:
> > > >>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>> Table t1 = ....
> > > >>>>>>>>>>>>>>>>>>> t1.cache(); // cache() returns void. The logical
> plan
> > > >>>>> of
> > > >>>>>>> t1
> > > >>>>>>>> is
> > > >>>>>>>>>>>>>>>> replaced
> > > >>>>>>>>>>>>>>>>> by
> > > >>>>>>>>>>>>>>>>>>> a scan of X. There is also a reference to the
> > > >>>>>>>> materialization
> > > >>>>>>>>> of
> > > >>>>>>>>>>> X.
> > > >>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>> t1.count(); // this executes the program, including
> > > >>>>> the
> > > >>>>>>>>>>>>>>>>> DataSet/DataStream
> > > >>>>>>>>>>>>>>>>>>> that backs X and the sink that writes the
> > > >>>>>> materialization
> > > >>>>>>>> of X
> > > >>>>>>>>>>>>>>>>>>> t1.count(); // this executes the program, but
> reads X
> > > >>>>>> from
> > > >>>>>>>> the
> > > >>>>>>>>>>>>>>>>>>> materialization.
> > > >>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>> My question is, how do you determine whether
> the
> > > >>>>>> scan
> > > >>>>>>>> of
> > > >>>>>>>>> t1
> > > >>>>>>>>>>>>>>>> should
> > > >>>>>>>>>>>>>>>>> go
> > > >>>>>>>>>>>>>>>>>>> against the DataSet/DataStream program and when
> > > >>>>> against
> > > >>>>>>> the
> > > >>>>>>>>>>>>>>>>>>> materialization?
> > > >>>>>>>>>>>>>>>>>>> AFAIK, there is no hook that will tell you that a
> > part
> > > >>>>>> of
> > > >>>>>>>> the
> > > >>>>>>>>>>>>> program
> > > >>>>>>>>>>>>>>>>> was
> > > >>>>>>>>>>>>>>>>>>> executed. Flipping a switch during optimization or
> > > >>>>> plan
> > > >>>>>>>>>> generation
> > > >>>>>>>>>>>>> is
> > > >>>>>>>>>>>>>>>>> not
> > > >>>>>>>>>>>>>>>>>>> sufficient as there is no guarantee that the plan
> is
> > > >>>>>> also
> > > >>>>>>>>>>> executed.
> > > >>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>> Overall, this behavior is somewhat similar to what
> I
> > > >>>>>>>> proposed
> > > >>>>>>>>> in
> > > >>>>>>>>>>>>>>>>>>> FLINK-8950, which does not include persisting the
> > > >>>>> table,
> > > >>>>>>> but
> > > >>>>>>>>>> just
> > > >>>>>>>>>>>>>>>>>>> optimizing and reregistering it as
> DataSet/DataStream
> > > >>>>>>> scan.
> > > >>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>> 2) I think Piotr has a point about the implicit
> > > >>>>> behavior
> > > >>>>>>> and
> > > >>>>>>>>>> side
> > > >>>>>>>>>>>>>>>>> effects
> > > >>>>>>>>>>>>>>>>>>> of the cache() method if it does not return
> anything.
> > > >>>>>>>>>>>>>>>>>>> Consider the following example:
> > > >>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>> Table t1 = ???
> > > >>>>>>>>>>>>>>>>>>> Table t2 = methodThatAppliesOperators(t1);
> > > >>>>>>>>>>>>>>>>>>> Table t3 = methodThatAppliesOtherOperators(t1);
> > > >>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>> In this case, the behavior/performance of the plan
> > > >>>>> that
> > > >>>>>>>>> results
> > > >>>>>>>>>>> from
> > > >>>>>>>>>>>>>>>> the
> > > >>>>>>>>>>>>>>>>>>> second method call depends on whether t1 was
> modified
> > > >>>>> by
> > > >>>>>>> the
> > > >>>>>>>>>> first
> > > >>>>>>>>>>>>>>>>> method
> > > >>>>>>>>>>>>>>>>>>> or not.
> > > >>>>>>>>>>>>>>>>>>> This is the classic issue of mutable vs. immutable
> > > >>>>>>> objects.
> > > >>>>>>>>>>>>>>>>>>> Also, as Piotr pointed out, it might also be good
> to
> > > >>>>>> have
> > > >>>>>>>> the
> > > >>>>>>>>>>>>> original
> > > >>>>>>>>>>>>>>>>> plan
> > > >>>>>>>>>>>>>>>>>>> of t1, because in some cases it is possible to push
> > > >>>>>>> filters
> > > >>>>>>>>> down
> > > >>>>>>>>>>>>> such
> > > >>>>>>>>>>>>>>>>> that
> > > >>>>>>>>>>>>>>>>>>> evaluating the query from scratch might be more
> > > >>>>>> efficient
> > > >>>>>>>> than
> > > >>>>>>>>>>>>>>>> accessing
> > > >>>>>>>>>>>>>>>>>>> the cache.
> > > >>>>>>>>>>>>>>>>>>> Moreover, a CachedTable could extend Table and
> > > >>>>> offer a
> > > >>>>>>>>> method
> > > >>>>>>>>>>>>>>>>> refresh().
> > > >>>>>>>>>>>>>>>>>>> This sounds quite useful in an interactive session
> > > >>>>> mode.
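> > > >>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>> For instance, something like this hypothetical type (a sketch
> > > >>>>>>>>>>>>>>>>>>> only, ignoring that Table is not an interface today):
> > > >>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>> interface CachedTable extends Table {
> > > >>>>>>>>>>>>>>>>>>>   void refresh(); // re-run the original plan, replace the data
> > > >>>>>>>>>>>>>>>>>>>   void drop();    // invalidate; reads fall back to the plan
> > > >>>>>>>>>>>>>>>>>>> }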
> > > >>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>> 3) Regarding the name, I can see both arguments.
> IMO,
> > > >>>>>>>>>>> materialize()
> > > >>>>>>>>>>>>>>>>> seems
> > > >>>>>>>>>>>>>>>>>>> to be more future proof.
> > > >>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>> Best, Fabian
> > > >>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>> Am Do., 29. Nov. 2018 um 12:56 Uhr schrieb Shaoxuan
> > > >>>>>> Wang <
> > > >>>>>>>>>>>>>>>>>>> wshaoxuan@gmail.com>:
> > > >>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>> Hi Piotr,
> > > >>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>> Thanks for sharing your ideas on the method
> naming.
> > > >>>>> We
> > > >>>>>>> will
> > > >>>>>>>>>> think
> > > >>>>>>>>>>>>>>>> about
> > > >>>>>>>>>>>>>>>>>>>> your suggestions. But I don't understand why we
> need
> > > >>>>> to
> > > >>>>>>>>> change
> > > >>>>>>>>>>> the
> > > >>>>>>>>>>>>>>>>> return
> > > >>>>>>>>>>>>>>>>>>>> type of cache().
> > > >>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>> Cache() is a physical operation, it does not
> change
> > > >>>>> the
> > > >>>>>>>> logic
> > > >>>>>>>>>> of
> > > >>>>>>>>>>>>>>>>>>>> the `Table`. On the tableAPI layer, we should not
> > > >>>>>>>> introduce a
> > > >>>>>>>>>> new
> > > >>>>>>>>>>>>>>>> table
> > > >>>>>>>>>>>>>>>>>>>> type unless the logic of table has been changed.
> If
> > > >>>>> we
> > > >>>>>>>>>> introduce
> > > >>>>>>>>>>> a
> > > >>>>>>>>>>>>>>>> new
> > > >>>>>>>>>>>>>>>>>>>> table type `CachedTable`, we need to create the same
> > set
> > > >>>>>> of
> > > >>>>>>>>>> methods
> > > >>>>>>>>>>> of
> > > >>>>>>>>>>>>>>>>>>> `Table`
> > > >>>>>>>>>>>>>>>>>>>> for it. I don't think it is worth doing this. Or
> can
> > > >>>>>> you
> > > >>>>>>>>> please
> > > >>>>>>>>>>>>>>>>> elaborate
> > > >>>>>>>>>>>>>>>>>>>> more on what could be the "implicit
> behaviours/side
> > > >>>>>>>> effects"
> > > >>>>>>>>>> you
> > > >>>>>>>>>>>>> are
> > > >>>>>>>>>>>>>>>>>>>> thinking about?
> > > >>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>> Regards,
> > > >>>>>>>>>>>>>>>>>>>> Shaoxuan
> > > >>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>> On Thu, Nov 29, 2018 at 7:05 PM Piotr Nowojski <
> > > >>>>>>>>>>>>>>>>> piotr@data-artisans.com>
> > > >>>>>>>>>>>>>>>>>>>> wrote:
> > > >>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>> Hi Becket,
> > > >>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>> Thanks for the response.
> > > >>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>> 1. I wasn’t saying that materialised view must be
> > > >>>>>>> mutable
> > > >>>>>>>> or
> > > >>>>>>>>>>> not.
> > > >>>>>>>>>>>>>>>> The
> > > >>>>>>>>>>>>>>>>>>>> same
> > > >>>>>>>>>>>>>>>>>>>>> thing applies to caches as well. To the
> contrary, I
> > > >>>>>>> would
> > > >>>>>>>>>> expect
> > > >>>>>>>>>>>>>>>> more
> > > >>>>>>>>>>>>>>>>>>>>> consistency and updates from something that is
> > > >>>>> called
> > > >>>>>>>>> “cache”
> > > >>>>>>>>>> vs
> > > >>>>>>>>>>>>>>>>>>>> something
> > > >>>>>>>>>>>>>>>>>>>>> that’s a “materialised view”. In other words, IMO
> > > >>>>> most
> > > >>>>>>>>> caches
> > > >>>>>>>>>> do
> > > >>>>>>>>>>>>> not
> > > >>>>>>>>>>>>>>>>>>>> serve
> > > >>>>>>>>>>>>>>>>>>>>> you invalid/outdated data and they handle updates
> > on
> > > >>>>>>> their
> > > >>>>>>>>>> own.
> > > >>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>> 2. I don’t think that having in the future two
> very
> > > >>>>>>>> similar
> > > >>>>>>>>>>>>> concepts
> > > >>>>>>>>>>>>>>>>> of
> > > >>>>>>>>>>>>>>>>>>>>> `materialized` view and `cache` is a good idea.
> It
> > > >>>>>> would
> > > >>>>>>>> be
> > > >>>>>>>>>>>>>>>> confusing
> > > >>>>>>>>>>>>>>>>>>> for
> > > >>>>>>>>>>>>>>>>>>>>> the users. I think it could be handled by
> > > >>>>>>>>>> variations/overloading
> > > >>>>>>>>>>>>> of
> > > >>>>>>>>>>>>>>>>>>>>> materialised view concept. We could start with:
> > > >>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>> `MaterializedTable materialize()` - immutable,
> > > >>>>> session
> > > >>>>>>>> life
> > > >>>>>>>>>>> scope
> > > >>>>>>>>>>>>>>>>>>>>> (basically the same semantics as you are proposing).
> > > >>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>> And then in the future (if ever) build on top of
> > > >>>>>>>> that/expand
> > > >>>>>>>>>> it
> > > >>>>>>>>>>>>>>>> with:
> > > >>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>> `MaterializedTable materialize(refreshTime=…)` or
> > > >>>>>>>>>>>>> `MaterializedTable
> > > >>>>>>>>>>>>>>>>>>>>> materialize(refreshHook=…)`
> > > >>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>> Or with cross session support:
> > > >>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>> `MaterializedTable materializeInto(connector=…)`
> or
> > > >>>>>>>>>>>>>>>> `MaterializedTable
> > > >>>>>>>>>>>>>>>>>>>>> materializeInto(tableFactory=…)`
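> > > >>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>> In Java terms that family could look roughly like the
> > > >>>>>>>>>>>>>>>>>>>>> following (all names and parameter types are made up here
> > > >>>>>>>>>>>>>>>>>>>>> for illustration):
> > > >>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>> MaterializedTable materialize();                      // immutable, session scope
> > > >>>>>>>>>>>>>>>>>>>>> MaterializedTable materialize(Duration refreshTime);  // periodic refresh
> > > >>>>>>>>>>>>>>>>>>>>> MaterializedTable materialize(RefreshHook hook);      // event-driven refresh
> > > >>>>>>>>>>>>>>>>>>>>> MaterializedTable materializeInto(TableSink<?> sink); // cross-session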
> > > >>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>> I’m not saying that we should implement cross
> > > >>>>>>>>>> session/refreshing
> > > >>>>>>>>>>>>> now
> > > >>>>>>>>>>>>>>>>> or
> > > >>>>>>>>>>>>>>>>>>>>> even in the near future. I’m just arguing that
> > > >>>>> naming
> > > >>>>>>>>> current
> > > >>>>>>>>>>>>>>>>> immutable
> > > >>>>>>>>>>>>>>>>>>>>> session life scope method `materialize()` is more
> > > >>>>>> future
> > > >>>>>>>>> proof
> > > >>>>>>>>>>> and
> > > >>>>>>>>>>>>>>>>> more
> > > >>>>>>>>>>>>>>>>>>>>> consistent with SQL (on which after all table-api
> > is
> > > >>>>>>>> heavily
> > > >>>>>>>>>>>>> basing
> > > >>>>>>>>>>>>>>>>>>> on).
> > > >>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>> 3. Even if we agree on naming it `cache()`, I
> would
> > > >>>>>>> still
> > > >>>>>>>>>> insist
> > > >>>>>>>>>>>>> on
> > > >>>>>>>>>>>>>>>>>>>>> `cache()` returning `CachedTable` handle to avoid
> > > >>>>>>> implicit
> > > >>>>>>>>>>>>>>>>>>>> behaviours/side
> > > >>>>>>>>>>>>>>>>>>>>> effects and to give both us & users more
> > > >>>>> flexibility.
> > > >>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>> Piotrek
> > > >>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>> On 29 Nov 2018, at 06:20, Becket Qin <
> > > >>>>>>>> becket.qin@gmail.com
> > > >>>>>>>>>>
> > > >>>>>>>>>>>>> wrote:
> > > >>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>> Just to add a little bit, the materialized view
> is
> > > >>>>>>>> probably
> > > >>>>>>>>>>> more
> > > >>>>>>>>>>>>>>>>>>>> similar
> > > >>>>>>>>>>>>>>>>>>>>> to
> > > >>>>>>>>>>>>>>>>>>>>>> the persistent() brought up earlier in the
> thread.
> > > >>>>> So
> > > >>>>>>> it
> > > >>>>>>>> is
> > > >>>>>>>>>>>>> usually
> > > >>>>>>>>>>>>>>>>>>>> cross
> > > >>>>>>>>>>>>>>>>>>>>>> session and could be used in a larger scope. For
> > > >>>>>>>> example, a
> > > >>>>>>>>>>>>>>>>>>>> materialized
> > > >>>>>>>>>>>>>>>>>>>>>> view created by user A may be visible to user B.
> > It
> > > >>>>>> is
> > > >>>>>>>>>> probably
> > > >>>>>>>>>>>>>>>>>>>> something
> > > >>>>>>>>>>>>>>>>>>>>>> we want to have in the future. I'll put it in
> the
> > > >>>>>>> future
> > > >>>>>>>>> work
> > > >>>>>>>>>>>>>>>>>>> section.
> > > >>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>> Thanks,
> > > >>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>> Jiangjie (Becket) Qin
> > > >>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>> On Thu, Nov 29, 2018 at 9:47 AM Becket Qin <
> > > >>>>>>>>>>> becket.qin@gmail.com
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>> wrote:
> > > >>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>> Hi Piotrek,
> > > >>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>> Thanks for the explanation.
> > > >>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>> Right now we are mostly thinking of the cached
> > > >>>>> table
> > > >>>>>>> as
> > > >>>>>>>>>>>>>>>> immutable. I
> > > >>>>>>>>>>>>>>>>>>>> can
> > > >>>>>>>>>>>>>>>>>>>>>>> see the Materialized view would be useful in
> the
> > > >>>>>>> future.
> > > >>>>>>>>>> That
> > > >>>>>>>>>>>>>>>> said,
> > > >>>>>>>>>>>>>>>>>>> I
> > > >>>>>>>>>>>>>>>>>>>>> think
> > > >>>>>>>>>>>>>>>>>>>>>>> a simple cache mechanism is probably still
> > needed.
> > > >>>>>> So
> > > >>>>>>> to
> > > >>>>>>>>> me,
> > > >>>>>>>>>>>>>>>> cache()
> > > >>>>>>>>>>>>>>>>>>>> and
> > > >>>>>>>>>>>>>>>>>>>>>>> materialize() should be two separate method as
> > > >>>>> they
> > > >>>>>>>>> address
> > > >>>>>>>>>>>>>>>>>>> different
> > > >>>>>>>>>>>>>>>>>>>>>>> needs. Materialize() is a higher level concept
> > > >>>>>> usually
> > > >>>>>>>>>>> implying
> > > >>>>>>>>>>>>>>>>>>>>> periodical
> > > >>>>>>>>>>>>>>>>>>>>>>> update, while cache() has much simpler
> semantic.
> > > >>>>> For
> > > >>>>>>>>>> example,
> > > >>>>>>>>>>>>> one
> > > >>>>>>>>>>>>>>>>>>> may
> > > >>>>>>>>>>>>>>>>>>>>>>> create a materialized view and use cache()
> method
> > > >>>>> in
> > > >>>>>>> the
> > > >>>>>>>>>>>>>>>>>>> materialized
> > > >>>>>>>>>>>>>>>>>>>>> view
> > > >>>>>>>>>>>>>>>>>>>>>>> creation logic. So that during the materialized
> > > >>>>> view
> > > >>>>>>>>> update,
> > > >>>>>>>>>>>>> they
> > > >>>>>>>>>>>>>>>> do
> > > >>>>>>>>>>>>>>>>>>>> not
> > > >>>>>>>>>>>>>>>>>>>>>>> need to worry about the case that the cached
> > table
> > > >>>>>> is
> > > >>>>>>>> also
> > > >>>>>>>>>>>>>>>> changed.
> > > >>>>>>>>>>>>>>>>>>>>> Maybe
> > > >>>>>>>>>>>>>>>>>>>>>>> under the hood, materialized() and cache()
> could
> > > >>>>>> share
> > > >>>>>>>>> some
> > > >>>>>>>>>>>>>>>>>>> mechanism,
> > > >>>>>>>>>>>>>>>>>>>>> but
> > > >>>>>>>>>>>>>>>>>>>>>>> I think a simple cache() method would be handy
> in
> > > >>>>> a
> > > >>>>>>> lot
> > > >>>>>>>> of
> > > >>>>>>>>>>>>> cases.
> > > >>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>> Thanks,
> > > >>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>> Jiangjie (Becket) Qin
> > > >>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>> On Mon, Nov 26, 2018 at 9:38 PM Piotr Nowojski
> <
> > > >>>>>>>>>>>>>>>>>>>> piotr@data-artisans.com
> > > >>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>> wrote:
> > > >>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>> Hi Becket,
> > > >>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>> Is there any extra thing user can do on a
> > > >>>>>>>>>> MaterializedTable
> > > >>>>>>>>>>>>> that
> > > >>>>>>>>>>>>>>>>>>>> they
> > > >>>>>>>>>>>>>>>>>>>>>>>> cannot do on a Table?
> > > >>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>> Maybe not in the initial implementation, but
> > > >>>>>> various
> > > >>>>>>>> DBs
> > > >>>>>>>>>>> offer
> > > >>>>>>>>>>>>>>>>>>>>> different
> > > >>>>>>>>>>>>>>>>>>>>>>>> ways to “refresh” the materialised view.
> Hooks,
> > > >>>>>>>> triggers,
> > > >>>>>>>>>>>>> timers,
> > > >>>>>>>>>>>>>>>>>>>>> manually
> > > >>>>>>>>>>>>>>>>>>>>>>>> etc. Having `MaterializedTable` would help us
> to
> > > >>>>>>> handle
> > > >>>>>>>>>> that
> > > >>>>>>>>>>> in
> > > >>>>>>>>>>>>>>>> the
> > > >>>>>>>>>>>>>>>>>>>>> future.
> > > >>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>> After users call *table.cache(), *users can
> > just
> > > >>>>>> use
> > > >>>>>>>>> that
> > > >>>>>>>>>>>>> table
> > > >>>>>>>>>>>>>>>>>>> and
> > > >>>>>>>>>>>>>>>>>>>> do
> > > >>>>>>>>>>>>>>>>>>>>>>>> anything that is supported on a Table,
> including
> > > >>>>>> SQL.
> > > >>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>> This is some implicit behaviour with side
> > > >>>>> effects.
> > > >>>>>>>>> Imagine
> > > >>>>>>>>>> if
> > > >>>>>>>>>>>>>>>> user
> > > >>>>>>>>>>>>>>>>>>>> has
> > > >>>>>>>>>>>>>>>>>>>>> a
> > > >>>>>>>>>>>>>>>>>>>>>>>> long and complicated program, that touches
> table
> > > >>>>>> `b`
> > > >>>>>>>>>> multiple
> > > >>>>>>>>>>>>>>>>>>> times,
> > > >>>>>>>>>>>>>>>>>>>>> maybe
> > > >>>>>>>>>>>>>>>>>>>>>>>> scattered around different methods. If he
> > > >>>>> modifies
> > > >>>>>>> his
> > > >>>>>>>>>>> program
> > > >>>>>>>>>>>>> by
> > > >>>>>>>>>>>>>>>>>>>>> inserting
> > > >>>>>>>>>>>>>>>>>>>>>>>> in one place
> > > >>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>> b.cache()
> > > >>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>> This implicitly alters the semantic and
> > behaviour
> > > >>>>>> of
> > > >>>>>>>> his
> > > >>>>>>>>>> code
> > > >>>>>>>>>>>>> all
> > > >>>>>>>>>>>>>>>>>>>> over
> > > >>>>>>>>>>>>>>>>>>>>>>>> the place, maybe in a ways that might cause
> > > >>>>>> problems.
> > > >>>>>>>> For
> > > >>>>>>>>>>>>> example
> > > >>>>>>>>>>>>>>>>>>>> what
> > > >>>>>>>>>>>>>>>>>>>>> if
> > > >>>>>>>>>>>>>>>>>>>>>>>> underlying data is changing?
> > > >>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>> Having invisible side effects is also not very
> > > >>>>>> clean,
> > > >>>>>>>> for
> > > >>>>>>>>>>>>> example
> > > >>>>>>>>>>>>>>>>>>>> think
> > > >>>>>>>>>>>>>>>>>>>>>>>> about something like this (but more
> > complicated):
> > > >>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>> Table b = ...;
> > > >>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>> if (some_condition) {
> > > >>>>>>>>>>>>>>>>>>>>>>>> processTable1(b)
> > > >>>>>>>>>>>>>>>>>>>>>>>> }
> > > >>>>>>>>>>>>>>>>>>>>>>>> else {
> > > >>>>>>>>>>>>>>>>>>>>>>>> processTable2(b)
> > > >>>>>>>>>>>>>>>>>>>>>>>> }
> > > >>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>> // do more stuff with b
> > > >>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>> And user adds `b.cache()` call to only one of
> > the
> > > >>>>>>>>>>>>> `processTable1`
> > > >>>>>>>>>>>>>>>>>>> or
> > > >>>>>>>>>>>>>>>>>>>>>>>> `processTable2` methods.
> > > >>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>> On the other hand
> > > >>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>> Table materialisedB = b.materialize()
> > > >>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>> Avoids (at least some of) the side effect
> issues
> > > >>>>>> and
> > > >>>>>>>>> forces
> > > >>>>>>>>>>>>> user
> > > >>>>>>>>>>>>>>>> to
> > > >>>>>>>>>>>>>>>>>>>>>>>> explicitly use `materialisedB` where it’s
> > > >>>>>> appropriate
> > > >>>>>>>> and
> > > >>>>>>>>>>>>> forces
> > > >>>>>>>>>>>>>>>>>>> user
> > > >>>>>>>>>>>>>>>>>>>>> to
> > > >>>>>>>>>>>>>>>>>>>>>>>> think what does it actually mean. And if
> > > >>>>> something
> > > >>>>>>>>> doesn’t
> > > >>>>>>>>>>> work
> > > >>>>>>>>>>>>>>>> in
> > > >>>>>>>>>>>>>>>>>>>> the
> > > >>>>>>>>>>>>>>>>>>>>> end
> > > >>>>>>>>>>>>>>>>>>>>>>>> for the user, he will know what has he changed
> > > >>>>>>> instead
> > > >>>>>>>> of
> > > >>>>>>>>>>>>> blaming
> > > >>>>>>>>>>>>>>>>>>>>> Flink for
> > > >>>>>>>>>>>>>>>>>>>>>>>> some “magic” underneath. In the above example,
> > > >>>>>> after
> > > >>>>>>>>>>>>>>>> materialising
> > > >>>>>>>>>>>>>>>>>>> b
> > > >>>>>>>>>>>>>>>>>>>> in
> > > >>>>>>>>>>>>>>>>>>>>>>>> only one of the methods, he should/would
> realise
> > > >>>>>>> about
> > > >>>>>>>>> the
> > > >>>>>>>>>>>>> issue
> > > >>>>>>>>>>>>>>>>>>> when
> > > >>>>>>>>>>>>>>>>>>>>>>>> handling the return value `MaterializedTable`
> of
> > > >>>>>> that
> > > >>>>>>>>>> method.
> > > >>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>> I guess it comes down to personal preferences
> if
> > > >>>>>> you
> > > >>>>>>>> like
> > > >>>>>>>>>>>>> things
> > > >>>>>>>>>>>>>>>> to
> > > >>>>>>>>>>>>>>>>>>>> be
> > > >>>>>>>>>>>>>>>>>>>>>>>> implicit or not. The more power is the user,
> > > >>>>>> probably
> > > >>>>>>>> the
> > > >>>>>>>>>>> more
> > > >>>>>>>>>>>>>>>>>>> likely
> > > >>>>>>>>>>>>>>>>>>>>> he is
> > > >>>>>>>>>>>>>>>>>>>>>>>> to like/understand implicit behaviour. And we
> as
> > > >>>>>>> Table
> > > >>>>>>>>> API
> > > >>>>>>>>>>>>>>>>>>> designers
> > > >>>>>>>>>>>>>>>>>>>>> are
> > > >>>>>>>>>>>>>>>>>>>>>>>> the most power users out there, so I would
> > > >>>>> proceed
> > > >>>>>>> with
> > > >>>>>>>>>>> caution
> > > >>>>>>>>>>>>>>>> (so
> > > >>>>>>>>>>>>>>>>>>>>> that we
> > > >>>>>>>>>>>>>>>>>>>>>>>> do not end up in the crazy perl realm with
> it’s
> > > >>>>>>> lovely
> > > >>>>>>>>>>> implicit
> > > >>>>>>>>>>>>>>>>>>>> method
> > > >>>>>>>>>>>>>>>>>>>>>>>> arguments ;)  <
> > > >>>>>>>>>> https://stackoverflow.com/a/14922656/8149051
> > > >>>>>>>>>>>> )
> > > >>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>> Table API to also support non-relational
> > > >>>>>> processing
> > > >>>>>>>>> cases,
> > > >>>>>>>>>>>>>>>> cache()
> > > >>>>>>>>>>>>>>>>>>>>>>>> might be slightly better.
> > > >>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>> I think even such extended Table API could
> > > >>>>> benefit
> > > >>>>>>> from
> > > >>>>>>>>>>>>> sticking
> > > >>>>>>>>>>>>>>>>>>>>> to/being
> > > >>>>>>>>>>>>>>>>>>>>>>>> consistent with SQL where both SQL and Table
> API
> > > >>>>>> are
> > > >>>>>>>>>>> basically
> > > >>>>>>>>>>>>>>>> the
> > > >>>>>>>>>>>>>>>>>>>>> same.
> > > >>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>> One more thing. `MaterializedTable
> > materialize()`
> > > >>>>>>> could
> > > >>>>>>>>> be
> > > >>>>>>>>>>> more
> > > >>>>>>>>>>>>>>>>>>>>>>>> powerful/flexible allowing the user to operate
> > > >>>>> both
> > > >>>>>>> on
> > > >>>>>>>>>>>>>>>> materialised
> > > >>>>>>>>>>>>>>>>>>>>> and not
> > > >>>>>>>>>>>>>>>>>>>>>>>> materialised view at the same time for
> whatever
> > > >>>>>>> reasons
> > > >>>>>>>>>>>>>>>> (underlying
> > > >>>>>>>>>>>>>>>>>>>>> data
> > > >>>>>>>>>>>>>>>>>>>>>>>> changing/better optimisation opportunities
> after
> > > >>>>>>>> pushing
> > > >>>>>>>>>> down
> > > >>>>>>>>>>>>>>>> more
> > > >>>>>>>>>>>>>>>>>>>>> filters
> > > >>>>>>>>>>>>>>>>>>>>>>>> etc). For example:
> > > >>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>> Table b = …;
> > > >>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>> MaterializedTable mb = b.materialize();
> > > >>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>> val min = mb.min();
> > > >>>>>>>>>>>>>>>>>>>>>>>> val max = mb.max();
> > > >>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>> val user42 = b.filter(‘userId = 42);
> > > >>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>> Could be more efficient compared to
> `b.cache()`
> > > >>>>> if
> > > >>>>>>>>>>>>>>>> `filter(‘userId
> > > >>>>>>>>>>>>>>>>>>> =
> > > >>>>>>>>>>>>>>>>>>>>>>>> 42);` allows for much more aggressive
> > > >>>>>> optimisations.
> > > >>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>> Piotrek
> > > >>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>> On 26 Nov 2018, at 12:14, Fabian Hueske <
> > > >>>>>>>>>> fhueske@gmail.com>
> > > >>>>>>>>>>>>>>>>>>> wrote:
> > > >>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>> I'm not suggesting to add support for Ignite.
> > > >>>>> This
> > > >>>>>>> was
> > > >>>>>>>>>> just
> > > >>>>>>>>>>> an
> > > >>>>>>>>>>>>>>>>>>>>> example.
> > > >>>>>>>>>>>>>>>>>>>>>>>>> Plasma and Arrow sound interesting, too.
> > > >>>>>>>>>>>>>>>>>>>>>>>>> For the sake of this proposal, it would be up
> > to
> > > >>>>>> the
> > > >>>>>>>>> user
> > > >>>>>>>>>> to
> > > >>>>>>>>>>>>>>>>>>>>> implement a
> > > >>>>>>>>>>>>>>>>>>>>>>>>> TableFactory and corresponding TableSource /
> > > >>>>>>> TableSink
> > > >>>>>>>>>>> classes
> > > >>>>>>>>>>>>>>>> to
> > > >>>>>>>>>>>>>>>>>>>>>>>> persist
> > > >>>>>>>>>>>>>>>>>>>>>>>>> and read the data.
> > > >>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>> Am Mo., 26. Nov. 2018 um 12:06 Uhr schrieb
> > > >>>>> Flavio
> > > >>>>>>>>>>> Pompermaier
> > > >>>>>>>>>>>>> <
> > > >>>>>>>>>>>>>>>>>>>>>>>>> pompermaier@okkam.it>:
> > > >>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>> What about to add also Apache Plasma + Arrow
> > as
> > > >>>>>> an
> > > >>>>>>>>>>>>> alternative
> > > >>>>>>>>>>>>>>>> to
> > > >>>>>>>>>>>>>>>>>>>>>>>> Apache
> > > >>>>>>>>>>>>>>>>>>>>>>>>>> Ignite?
> > > >>>>>>>>>>>>>>>>>>>>>>>>>> [1]
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>
> > > >>>>>>>>
> > > >>>>>>
> > > >>>
> > >
> https://arrow.apache.org/blog/2017/08/08/plasma-in-memory-object-store/
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>> On Mon, Nov 26, 2018 at 11:56 AM Fabian
> Hueske
> > > >>>>> <
> > > >>>>>>>>>>>>>>>>>>> fhueske@gmail.com>
> > > >>>>>>>>>>>>>>>>>>>>>>>> wrote:
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>> Hi,
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for the proposal!
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>> To summarize, you propose a new method
> > > >>>>>>>> Table.cache():
> > > >>>>>>>>>>> Table
> > > >>>>>>>>>>>>>>>> that
> > > >>>>>>>>>>>>>>>>>>>>> will
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>> trigger a job and write the result into
> some
> > > >>>>>>>> temporary
> > > >>>>>>>>>>>>> storage
> > > >>>>>>>>>>>>>>>>>>> as
> > > >>>>>>>>>>>>>>>>>>>>>>>> defined
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>> by a TableFactory.
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>> The cache() call blocks while the job is
> > > >>>>> running
> > > >>>>>>> and
> > > >>>>>>>>>>>>>>>> eventually
> > > >>>>>>>>>>>>>>>>>>>>>>>> returns a
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>> Table object that represents a scan of the
> > > >>>>>>> temporary
> > > >>>>>>>>>>> table.
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>> When the "session" is closed (closing to be
> > > >>>>>>>> defined?),
> > > >>>>>>>>>> the
> > > >>>>>>>>>>>>>>>>>>>> temporary
> > > >>>>>>>>>>>>>>>>>>>>>>>>>> tables
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>> are all dropped.
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>> I think this behavior makes sense and is a
> > > >>>>> good
> > > >>>>>>>> first
> > > >>>>>>>>>> step
> > > >>>>>>>>>>>>>>>>>>> towards
> > > >>>>>>>>>>>>>>>>>>>>>>>> more
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>> interactive workloads.
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>> However, its performance suffers from
> writing
> > > >>>>> to
> > > >>>>>>> and
> > > >>>>>>>>>>> reading
> > > >>>>>>>>>>>>>>>>>>> from
> > > >>>>>>>>>>>>>>>>>>>>>>>>>> external
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>> systems.
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>> I think this is OK for now. Changes that
> > would
> > > >>>>>>>>>>> significantly
> > > >>>>>>>>>>>>>>>>>>>> improve
> > > >>>>>>>>>>>>>>>>>>>>>>>> the
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>> situation (i.e., pinning data in-memory
> > across
> > > >>>>>>> jobs)
> > > >>>>>>>>>> would
> > > >>>>>>>>>>>>>>>> have
> > > >>>>>>>>>>>>>>>>>>>>> large
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>> impacts on many components of Flink.
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>> Users could use in-memory filesystems or
> > > >>>>> storage
> > > >>>>>>>> grids
> > > >>>>>>>>>>>>> (Apache
> > > >>>>>>>>>>>>>>>>>>>>>>>> Ignite) to
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>> mitigate some of the performance effects.
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>> Best, Fabian
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>> Am Mo., 26. Nov. 2018 um 03:38 Uhr schrieb
> > > >>>>>> Becket
> > > >>>>>>>> Qin
> > > >>>>>>>>> <
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>> becket.qin@gmail.com
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>> :
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for the explanation, Piotrek.
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>> Is there any extra thing user can do on a
> > > >>>>>>>>>>> MaterializedTable
> > > >>>>>>>>>>>>>>>>>>> that
> > > >>>>>>>>>>>>>>>>>>>>> they
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>> cannot do on a Table? After users call
> > > >>>>>>>>> *table.cache(),
> > > >>>>>>>>>>>>> *users
> > > >>>>>>>>>>>>>>>>>>> can
> > > >>>>>>>>>>>>>>>>>>>>>>>> just
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>> use
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>> that table and do anything that is
> supported
> > > >>>>>> on a
> > > >>>>>>>>>> Table,
> > > >>>>>>>>>>>>>>>>>>>> including
> > > >>>>>>>>>>>>>>>>>>>>>>>> SQL.
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>> Naming wise, either cache() or
> materialize()
> > > >>>>>>> sounds
> > > >>>>>>>>>> fine
> > > >>>>>>>>>>> to
> > > >>>>>>>>>>>>>>>> me.
> > > >>>>>>>>>>>>>>>>>>>>>>>> cache()
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>> is
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>> a bit more general than materialize().
> Given
> > > >>>>>> that
> > > >>>>>>>> we
> > > >>>>>>>>>> are
> > > >>>>>>>>>>>>>>>>>>>> enhancing
> > > >>>>>>>>>>>>>>>>>>>>>>>> the
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>> Table API to also support non-relational
> > > >>>>>>> processing
> > > >>>>>>>>>>> cases,
> > > >>>>>>>>>>>>>>>>>>>> cache()
> > > >>>>>>>>>>>>>>>>>>>>>>>>>> might
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>> be
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>> slightly better.
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks,
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>> Jiangjie (Becket) Qin
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>> On Fri, Nov 23, 2018 at 11:25 PM Piotr
> > > >>>>>> Nowojski <
> > > >>>>>>>>>>>>>>>>>>>>>>>>>> piotr@data-artisans.com
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>> wrote:
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi Becket,
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> Ops, sorry I didn’t notice that you
> intend
> > > >>>>> to
> > > >>>>>>>> reuse
> > > >>>>>>>>>>>>> existing
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> `TableFactory`. I don’t know why, but I
> > > >>>>>> assumed
> > > >>>>>>>> that
> > > >>>>>>>>>> you
> > > >>>>>>>>>>>>>>>> want
> > > >>>>>>>>>>>>>>>>>>> to
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>> provide
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>> an
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> alternate way of writing the data.
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> Now that I hopefully understand the
> > > >>>>> proposal,
> > > >>>>>>>> maybe
> > > >>>>>>>>> we
> > > >>>>>>>>>>>>> could
> > > >>>>>>>>>>>>>>>>>>>>> rename
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> `cache()` to
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> void materialize()
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> or going step further
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> MaterializedTable materialize()
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> MaterializedTable
> createMaterializedView()
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> ?
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> The second option with returning a
> handle I
> > > >>>>>>> think
> > > >>>>>>>> is
> > > >>>>>>>>>>> more
> > > >>>>>>>>>>>>>>>>>>>> flexible
> > > >>>>>>>>>>>>>>>>>>>>>>>>>> and
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> could provide features such as
> > > >>>>>>> “refresh”/“delete”
> > > >>>>>>>> or
> > > >>>>>>>>>>>>>>>> generally
> > > >>>>>>>>>>>>>>>>>>>>>>>>>> speaking
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> manage the the view. In the future we
> could
> > > >>>>>> also
> > > >>>>>>>>> think
> > > >>>>>>>>>>>>> about
> > > >>>>>>>>>>>>>>>>>>>>> adding
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>> hooks
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> to automatically refresh view etc. It is
> > > >>>>> also
> > > >>>>>>> more
> > > >>>>>>>>>>>>> explicit
> > > >>>>>>>>>>>>>>>> -
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> materialization returning a new table
> > handle
> > > >>>>>>> will
> > > >>>>>>>>> not
> > > >>>>>>>>>>> have
> > > >>>>>>>>>>>>>>>> the
> > > >>>>>>>>>>>>>>>>>>>>> same
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> implicit side effects as adding a simple
> > > >>>>> line
> > > >>>>>> of
> > > >>>>>>>>> code
> > > >>>>>>>>>>> like
> > > >>>>>>>>>>>>>>>>>>>>>>>>>> `b.cache()`
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> would have.
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> It would also be more SQL like, making it
> > > >>>>> more
> > > >>>>>>>>>> intuitive
> > > >>>>>>>>>>>>> for
> > > >>>>>>>>>>>>>>>>>>>> users
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>> already
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> familiar with the SQL.
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> Piotrek
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On 23 Nov 2018, at 14:53, Becket Qin <
> > > >>>>>>>>>>>>> becket.qin@gmail.com
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>> wrote:
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi Piotrek,
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> For the cache() method itself, yes, it
> is
> > > >>>>>>>>> equivalent
> > > >>>>>>>>>> to
> > > >>>>>>>>>>>>>>>>>>>> creating
> > > >>>>>>>>>>>>>>>>>>>>> a
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> BUILT-IN
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> materialized view with a lifecycle. That
> > > >>>>>>>>>> functionality
> > > >>>>>>>>>>> is
> > > >>>>>>>>>>>>>>>>>>>> missing
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>> today,
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> though. Not sure if I understand your
> > > >>>>>> question.
> > > >>>>>>>> Do
> > > >>>>>>>>>> you
> > > >>>>>>>>>>>>> mean
> > > >>>>>>>>>>>>>>>>>>> we
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>> already
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> have
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the functionality and just need a syntax
> > > >>>>>> sugar?
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> What's more interesting in the proposal
> is
> > > >>>>> do
> > > >>>>>>> we
> > > >>>>>>>>> want
> > > >>>>>>>>>>> to
> > > >>>>>>>>>>>>>>>> stop
> > > >>>>>>>>>>>>>>>>>>>> at
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>> creating
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the materialized view? Or do we want to
> > > >>>>>> extend
> > > >>>>>>>> that
> > > >>>>>>>>>> in
> > > >>>>>>>>>>>>> the
> > > >>>>>>>>>>>>>>>>>>>> future
> > > >>>>>>>>>>>>>>>>>>>>>>>>>> to
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>> a
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> more
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> useful unified data store distributed
> with
> > > >>>>>>> Flink?
> > > >>>>>>>>> And
> > > >>>>>>>>>>> do
> > > >>>>>>>>>>>>> we
> > > >>>>>>>>>>>>>>>>>>>> want
> > > >>>>>>>>>>>>>>>>>>>>> to
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>> have
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> a
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> mechanism allow more flexible user job
> > > >>>>>> pattern
> > > >>>>>>>> with
> > > >>>>>>>>>>> their
> > > >>>>>>>>>>>>>>>> own
> > > >>>>>>>>>>>>>>>>>>>>> user
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> defined
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> services. These considerations are much
> > > >>>>> more
> > > >>>>>>>>>>>>> architectural.
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks,


-- 
Best Regards

Jeff Zhang

Re: [DISCUSS] Support Interactive Programming in Flink Table API

Posted by Becket Qin <be...@gmail.com>.
Hi Till and Piotrek,

Thanks for the clarification. That clears up quite a bit of confusion. My
understanding of how cache works is the same as what Till described, i.e.
cache() is a hint to Flink, but it is not guaranteed that the cache always
exists, and it might be recomputed from its lineage.

> Is this the core of our disagreement here? That you would like this
> “cache()” to be mostly a hint for the optimiser?

Semantics-wise, yes. That's also why I think materialize() has a much larger
scope than cache(), and thus should be a different method.

Regarding the chance of optimization, it might not be that rare. Some very
simple statistics could already help in many cases. For example, simply
maintaining the max and min of each field can already eliminate some
unnecessary table scans (potentially scans of the cached table) if the
result is doomed to be empty. A histogram would give even more information.
The optimizer could be very careful and only ignore the cache when it is
100% sure that doing so is cheaper, e.g. only when a filter on the cache
will absolutely return nothing.

Given the above clarification on cache, I would like to revisit the
original "void cache()" proposal and see if we can improve on top of that.

What do you think about the following modified interface?

Table {
  /**
   * This call hints Flink to maintain a cache of this table and leverage
   * it for performance optimization if needed.
   * Note that Flink may still decide not to use the cache if doing so is
   * cheaper.
   *
   * A CacheHandle will be returned to allow the user to actively release
   * the cache. The cache will be deleted once there are no unreleased
   * cache handles to it. When the TableEnvironment is closed, the cache
   * will also be deleted and all the cache handles will be released.
   *
   * @return a CacheHandle referring to the cache of this table.
   */
  CacheHandle cache();
}

CacheHandle {
  /**
   * Close the cache handle. This method does not necessarily delete the
   * cache. Instead, it simply decrements the reference counter to the
   * cache. When there is no handle referring to a cache, the cache will
   * be deleted.
   *
   * @return the number of open handles to the cache after this handle has
   * been released.
   */
  int release();
}

The rationale behind this interface is the following:
In the vast majority of cases, users wouldn't really care whether the cache
is used or not. So I think the most intuitive way is letting cache() return
nothing, so that nobody needs to worry about the difference between
operations on CachedTables and those on the "original" tables. This will
make maybe 99.9% of the users happy. There were two concerns raised for
this approach:
1. In some rare cases, users may want to ignore the cache.
2. A table might be cached/uncached in a third-party function without the
caller knowing.

For the first issue, users can use hint("ignoreCache") to explicitly ignore
the cache.
For the second issue, the above proposal lets cache() return a CacheHandle
whose only method is release(). Different CacheHandles will refer to the
same cache; if a cache no longer has any cache handle, it will be deleted.
This addresses the following case:
{
  val handle1 = a.cache()
  process(a)
  a.select(...) // cache is still available; handle1 has not been released
}

void process(Table t) {
  val handle2 = t.cache() // a new handle to the same cache
  t.select(...) // optimizer decides cache usage
  t.hint("ignoreCache").select(...) // cache is explicitly ignored
  handle2.release() // release this handle; the cache may still be
                    // available if there are other open handles
  ...
}
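
The reference counting implied by release() is straightforward. A minimal
sketch of what could back it (class and field names here are illustrative,
not part of the proposal):

import java.util.concurrent.atomic.AtomicInteger

// Illustrative only: one shared counter per cached table. Every cache()
// call increments the counter and hands out a handle; every release()
// decrements it; the physical cache is dropped once the count reaches zero.
class CacheRef {
  private val openHandles = new AtomicInteger(0)

  def newHandle(): CacheHandle = {
    openHandles.incrementAndGet()
    new CacheHandle(this)
  }

  def release(): Int = {
    val remaining = openHandles.decrementAndGet()
    if (remaining == 0) dropPhysicalCache()
    remaining
  }

  private def dropPhysicalCache(): Unit = {
    // delete the cached intermediate result
  }
}

class CacheHandle(ref: CacheRef) {
  def release(): Int = ref.release()
}

Closing the TableEnvironment would then just release all outstanding
handles, which matches the lifecycle described in the javadoc above.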

Does the above modified approach look reasonable to you?

Cheers,

Jiangjie (Becket) Qin







On Tue, Dec 11, 2018 at 6:44 PM Till Rohrmann <tr...@apache.org> wrote:

> Hi Becket,
>
> I was aiming at semantics similar to 1. I actually thought that `cache()`
> would tell the system to materialize the intermediate result so that
> subsequent queries don't need to reprocess it. This means that the usage of
> the cached table in this example
>
> {
>  val cachedTable = a.cache()
>  val b1 = cachedTable.select(…)
>  val b2 = cachedTable.foo().select(…)
>  val b3 = cachedTable.bar().select(...)
>  val c1 = a.select(…)
>  val c2 = a.foo().select(…)
>  val c3 = a.bar().select(...)
> }
>
> strongly depends on interleaved calls which trigger the execution of sub
> queries. So for example, if there is only a single env.execute call at the
> end of the block, then b1, b2, b3, c1, c2 and c3 would all be computed by
> reading directly from the sources (given that there is only a single
> JobGraph). It just happens that the result of `a` will be cached such that
> we skip the processing of `a` when there are subsequent queries reading
> from `cachedTable`. If for some reason the system cannot materialize the
> table (e.g. running out of disk space, ttl expired), then it could also
> happen that we need to reprocess `a`. In that sense `cachedTable` simply is
> an identifier for the materialized result of `a`, along with the lineage
> for how to reprocess it.
>
> Cheers,
> Till
>
>
>
>
>
> On Tue, Dec 11, 2018 at 11:01 AM Piotr Nowojski <pi...@data-artisans.com>
> wrote:
>
> > Hi Becket,
> >
> > > {
> > >  val cachedTable = a.cache()
> > >  val b = cachedTable.select(...)
> > >  val c = a.select(...)
> > > }
> > >
> > > Semantic 1. b uses cachedTable as the user demanded. c uses the original
> > > DAG as the user demanded. In this case, the optimizer has no chance to
> > > optimize.
> > > Semantic 2. b uses cachedTable as the user demanded. c leaves the
> > > optimizer to choose whether the cache or the DAG should be used. In
> > > this case, the user loses the option to NOT use the cache.
> > >
> > > As you can see, neither of the options seems perfect. However, I guess
> > > you and Till are proposing the third option:
> > >
> > > Semantic 3. b leaves the optimizer to choose whether the cache or the
> > > DAG should be used. c always uses the DAG.
> >
> > I am pretty sure that Till, Fabian, others and I were all proposing and
> > advocating in favour of semantic “1”. No cost-based optimiser decisions
> > at all.
> >
> > {
> >  val cachedTable = a.cache()
> >  val b1 = cachedTable.select(…)
> >  val b2 = cachedTable.foo().select(…)
> >  val b3 = cachedTable.bar().select(...)
> >  val c1 = a.select(…)
> >  val c2 = a.foo().select(…)
> >  val c3 = a.bar().select(...)
> > }
> >
> > All b1, b2 and b3 are reading from the cache, while c1, c2 and c3 are
> > re-executing the whole plan for “a”.
> >
> > In the future we could discuss going one step further, introducing some
> > global optimisation (that can be manually enabled/disabled): deduplicate
> > plan nodes/deduplicate sub-queries/re-use sub-query results/or whatever
> > we could call it. It could do two things:
> >
> > 1. Automatically try to deduplicate fragments of the plan and share the
> > result using CachedTable - in other words, automatically insert
> > `CachedTable cache()` calls.
> > 2. Automatically make the decision to bypass explicit `CachedTable`
> > access (this would be the equivalent of what you described as “semantic
> > 3”).
> >
> > However, as I wrote previously, I have big doubts whether such
> > cost-based optimisation would work (this applies also to “Semantic 2”).
> > I would expect it to do more harm than good in so many cases that it
> > wouldn’t make sense. Even assuming that we calculate statistics
> > perfectly (this ain’t gonna happen), it’s virtually impossible to
> > correctly estimate the exchange rate of CPU cycles vs IO operations, as
> > it changes so much from deployment to deployment.
> >
> > Is this the core of our disagreement here? That you would like this
> > “cache()” to be mostly a hint for the optimiser?
> >
> > Piotrek
> >
> > > On 11 Dec 2018, at 06:00, Becket Qin <be...@gmail.com> wrote:
> > >
> > > Another potential concern for semantic 3 is that, in the future, we
> > > may add automatic caching to Flink, e.g. caching the intermediate
> > > results at the shuffle boundary. If our semantic is that a reference
> > > to the original table means skipping the cache, those users may not be
> > > able to benefit from the implicit cache.
> > >
> > >
> > >
> > > On Tue, Dec 11, 2018 at 12:10 PM Becket Qin <be...@gmail.com>
> > wrote:
> > >
> > >> Hi Piotrek,
> > >>
> > >> Thanks for the reply. Having thought about it again, I might have
> > >> misunderstood your proposal in earlier emails. Returning a CachedTable
> > >> might not be a bad idea.
> > >>
> > >> I was more concerned about the semantics and their intuitiveness when
> > >> a CachedTable is returned, i.e., if cache() returns a CachedTable,
> > >> what are the semantics in the following code:
> > >> {
> > >>  val cachedTable = a.cache()
> > >>  val b = cachedTable.select(...)
> > >>  val c = a.select(...)
> > >> }
> > >> What is the difference between b and c? At first glance, I see two
> > >> options:
> > >>
> > >> Semantic 1. b uses cachedTable as the user demanded. c uses the
> > >> original DAG as the user demanded. In this case, the optimizer has no
> > >> chance to optimize.
> > >> Semantic 2. b uses cachedTable as the user demanded. c leaves the
> > >> optimizer to choose whether the cache or the DAG should be used. In
> > >> this case, the user loses the option to NOT use the cache.
> > >>
> > >> As you can see, neither of the options seems perfect. However, I guess
> > >> you and Till are proposing the third option:
> > >>
> > >> Semantic 3. b leaves the optimizer to choose whether the cache or the
> > >> DAG should be used. c always uses the DAG.
> > >>
> > >> This does address all the concerns. It is just that, from an
> > >> intuitiveness perspective, I found that asking the user to explicitly
> > >> use a CachedTable while the optimizer might choose to ignore it is a
> > >> little weird. That was why I did not think about that semantic. But
> > >> given that there is material benefit, I think this semantic is
> > >> acceptable.
> > >>
> > >>> 1. If we want to let the optimiser make decisions whether to use the
> > >>> cache or not, then why do we need a “void cache()” method at all?
> > >>> Would it “increase” the chance of using the cache? That sounds
> > >>> strange. What would be the mechanism of deciding whether to use the
> > >>> cache or not? If we want to introduce such kind of automated
> > >>> optimisations of “plan nodes deduplication” I would turn it on
> > >>> globally, not per table, and let the optimiser do all of the work.
> > >>> 2. We do not have statistics at the moment for any use/not use cache
> > >>> decision.
> > >>> 3. Even if we had, I would be veeerryy sceptical whether such cost
> > >>> based optimisations would work properly and I would still insist
> > >>> first on providing an explicit caching mechanism (`CachedTable
> > >>> cache()`)
> > >>
> > >> We are absolutely on the same page here. An explicit cache() method is
> > >> necessary not only because the optimizer may not be able to make the
> > >> right decision, but also because of the nature of interactive
> > >> programming. For example, if users write the following code in the
> > >> Scala shell:
> > >>  val b = a.select(...)
> > >>  val c = b.select(...)
> > >>  val d = c.select(...).writeToSink(...)
> > >>  tEnv.execute()
> > >> There is no way the optimizer will know whether b or c will be used in
> > >> later code, unless users hint explicitly.
> > >>
> > >>> At the same time I’m not sure if you have responded to our objections
> > >>> of `void cache()` being implicit/having side effects, which me, Jark,
> > >>> Fabian, Till and I think also Shaoxuan are supporting.
> > >>
> > >> Are there any other side effects if we use semantic 3 mentioned above?
> > >>
> > >> Thanks,
> > >>
> > >> Jiangjie (Becket) Qin
> > >>
> > >>
> > >> On Mon, Dec 10, 2018 at 7:54 PM Piotr Nowojski <
> piotr@data-artisans.com
> > >
> > >> wrote:
> > >>
> > >>> Hi Becket,
> > >>>
> > >>> Sorry for not responding for a long time.
> > >>>
> > >>> Regarding case 1:
> > >>>
> > >>> There wouldn’t be an “a.unCache()” method; I would expect only
> > >>> `cachedTableA1.dropCache()`. Dropping `cachedTableA1` wouldn’t affect
> > >>> `cachedTableA2`. Just as in any other database, dropping or modifying
> > >>> one independent table/materialised view does not affect others.
> > >>>
> > >>>> What I meant is that assuming there is already a cached table,
> > >>>> ideally users need not specify whether the next query should read
> > >>>> from the cache or use the original DAG. This should be decided by
> > >>>> the optimizer.
> > >>>
> > >>> 1. If we want to let the optimiser make decisions whether to use the
> > >>> cache or not, then why do we need a “void cache()” method at all?
> > >>> Would it “increase” the chance of using the cache? That sounds
> > >>> strange. What would be the mechanism of deciding whether to use the
> > >>> cache or not? If we want to introduce such kind of automated
> > >>> optimisations of “plan nodes deduplication” I would turn it on
> > >>> globally, not per table, and let the optimiser do all of the work.
> > >>> 2. We do not have statistics at the moment for any use/not use cache
> > >>> decision.
> > >>> 3. Even if we had, I would be veeerryy sceptical whether such cost
> > >>> based optimisations would work properly and I would still insist
> > >>> first on providing an explicit caching mechanism (`CachedTable
> > >>> cache()`)
> > >>> 4. As Till wrote, having an explicit `CachedTable cache()` doesn’t
> > >>> contradict future work on automated cost based caching.
> > >>>
> > >>>
> > >>> At the same time I’m not sure if you have responded to our objections
> > of
> > >>> `void cache()` being implicit/having side effects, which me, Jark,
> > Fabian,
> > >>> Till and I think also Shaoxuan are supporting.
> > >>>
> > >>> Piotrek
> > >>>
> > >>>> On 5 Dec 2018, at 12:42, Becket Qin <be...@gmail.com> wrote:
> > >>>>
> > >>>> Hi Till,
> > >>>>
> > >>>> It is true that after the first job submission, there will be no
> > >>>> ambiguity in terms of whether a cached table is used or not. That is
> > >>>> the same for a cache() that does not return a CachedTable.
> > >>>>
> > >>>>> Conceptually one could think of cache() as introducing a caching
> > >>>>> operator from which you need to consume if you want to benefit from
> > >>>>> the caching functionality.
> > >>>>
> > >>>> I am thinking a little differently. I think it is a hint (as you
> > >>>> mentioned later) instead of a new operator. I'd like to be careful
> > >>>> about the semantics of the API. A hint is a property set on an
> > >>>> existing operator, but it is not itself an operator, as it does not
> > >>>> really manipulate the data.
> > >>>>
> > >>>>> I agree, ideally the optimizer makes this kind of decision which
> > >>>>> intermediate result should be cached. But especially when executing
> > >>>>> ad-hoc queries the user might better know which results need to be
> > >>>>> cached because Flink might not see the full DAG. In that sense, I
> > >>>>> would consider the cache() method as a hint for the optimizer. Of
> > >>>>> course, in the future we might add functionality which tries to
> > >>>>> automatically cache results (e.g. caching the latest intermediate
> > >>>>> results until so and so much space is used). But this should
> > >>>>> hopefully not contradict with `CachedTable cache()`.
> > >>>>
> > >>>> I agree that the cache() method is needed for exactly the reason you
> > >>>> mentioned, i.e. Flink cannot predict what users are going to write
> > >>>> later, so users need to tell Flink explicitly that this table will
> > >>>> be used later. What I meant is that assuming there is already a
> > >>>> cached table, ideally users need not specify whether the next query
> > >>>> should read from the cache or use the original DAG. This should be
> > >>>> decided by the optimizer.
> > >>>>
> > >>>> To explain the difference between returning / not returning a
> > >>>> CachedTable, I want to compare the following two cases:
> > >>>>
> > >>>> *Case 1: returning a CachedTable*
> > >>>> b = a.map(...)
> > >>>> val cachedTableA1 = a.cache()
> > >>>> val cachedTableA2 = a.cache()
> > >>>> b.print() // Just to make sure a is cached.
> > >>>>
> > >>>> c = a.filter(...) // Does the user specify that the original DAG is
> > >>>> used? Or does the optimizer decide whether the DAG or the cache
> > >>>> should be used?
> > >>>> d = cachedTableA1.filter() // The user specifies that the cached
> > >>>> table is used.
> > >>>>
> > >>>> a.unCache() // Can cachedTableA still be used afterwards?
> > >>>> cachedTableA1.uncache() // Can cachedTableA2 still be used?
> > >>>>
> > >>>> *Case 2: not returning a CachedTable*
> > >>>> b = a.map()
> > >>>> a.cache()
> > >>>> a.cache() // no-op
> > >>>> b.print() // Just to make sure a is cached
> > >>>>
> > >>>> c = a.filter(...) // Optimizer decides whether the cache or the DAG
> > >>>> should be used
> > >>>> d = a.filter(...) // Optimizer decides whether the cache or the DAG
> > >>>> should be used
> > >>>>
> > >>>> a.unCache()
> > >>>> a.unCache() // no-op
> > >>>>
> > >>>> In case 1, semantics-wise, the optimizer loses the option to choose
> > >>>> between the DAG and the cache. And the unCache() call becomes
> > >>>> tricky.
> > >>>> In case 2, users do not need to worry about whether the cache or the
> > >>>> DAG is used. And the unCache() semantics are clear. However, the
> > >>>> caveat is that users cannot explicitly ignore the cache.
> > >>>>
> > >>>> In order to address the issues mentioned in case 2, and inspired by
> > >>>> the discussion so far, I am thinking about using a hint to allow
> > >>>> users to explicitly ignore the cache. Although we do not have hints
> > >>>> yet, we probably should. So the code becomes:
> > >>>>
> > >>>> *Case 3: returning this table*
> > >>>> b = a.map()
> > >>>> a.cache()
> > >>>> a.cache() // no-op
> > >>>> b.print() // Just to make sure a is cached
> > >>>>
> > >>>> c = a.filter(...) // Optimizer decides whether the cache or the DAG
> > >>>> should be used
> > >>>> d = a.hint("ignoreCache").filter(...) // DAG will be used instead of
> > >>>> the cache.
> > >>>>
> > >>>> a.unCache()
> > >>>> a.unCache() // no-op
> > >>>>
> > >>>> We could also let cache() return this table to allow chained method
> > >>>> calls. Do you think this API addresses the concerns?
> > >>>>
> > >>>> Thanks,
> > >>>>
> > >>>> Jiangjie (Becket) Qin
> > >>>>
> > >>>>
> > >>>> On Wed, Dec 5, 2018 at 10:55 AM Jark Wu <im...@gmail.com> wrote:
> > >>>>
> > >>>>> Hi,
> > >>>>>
> > >>>>> All the recent discussions have focused on whether there is a
> > >>>>> problem if cache() does not return a Table.
> > >>>>> It seems that returning a Table explicitly is clearer (and safer?).
> > >>>>>
> > >>>>> So are there any problems if cache() returns a Table? @Becket
> > >>>>> So whether there are any problems if cache() returns a Table?
> > @Becket
> > >>>>>
> > >>>>> Best,
> > >>>>> Jark
> > >>>>>
> > >>>>> On Tue, 4 Dec 2018 at 22:27, Till Rohrmann <tr...@apache.org>
> > >>> wrote:
> > >>>>>
> > >>>>>> It's true that b, c, d and e will all read from the original DAG
> > that
> > >>>>>> generates a. But all subsequent operators (when running multiple
> > >>> queries)
> > >>>>>> which reference cachedTableA should not need to reproduce `a` but
> > >>>>> directly
> > >>>>>> consume the intermediate result.
> > >>>>>>
> > >>>>>> Conceptually one could think of cache() as introducing a caching
> > >>> operator
> > >>>>>> from which you need to consume from if you want to benefit from
> the
> > >>>>> caching
> > >>>>>> functionality.
> > >>>>>>
> > >>>>>> I agree, ideally the optimizer makes this kind of decision which
> > >>>>>> intermediate result should be cached. But especially when
> executing
> > >>>>> ad-hoc
> > >>>>>> queries the user might better know which results need to be cached
> > >>>>> because
> > >>>>>> Flink might not see the full DAG. In that sense, I would consider
> > the
> > >>>>>> cache() method as a hint for the optimizer. Of course, in the
> future
> > >>> we
> > >>>>>> might add functionality which tries to automatically cache results
> > >>> (e.g.
> > >>>>>> caching the latest intermediate results until so and so much space
> > is
> > >>>>>> used). But this should hopefully not contradict with `CachedTable
> > >>>>> cache()`.
> > >>>>>>
> > >>>>>> Cheers,
> > >>>>>> Till
> > >>>>>>
> > >>>>>> On Tue, Dec 4, 2018 at 2:33 PM Becket Qin <be...@gmail.com>
> > >>> wrote:
> > >>>>>>
> > >>>>>>> Hi Till,
> > >>>>>>>
> > >>>>>>> Thanks for the clarification. I am still a little confused.
> > >>>>>>>
> > >>>>>>> If cache() returns a CachedTable, the example might become:
> > >>>>>>>
> > >>>>>>> b = a.map(...)
> > >>>>>>> c = a.map(...)
> > >>>>>>>
> > >>>>>>> cachedTableA = a.cache()
> > >>>>>>> d = cachedTableA.map(...)
> > >>>>>>> e = a.map()
> > >>>>>>>
> > >>>>>>> In the above case, if cache() is lazily evaluated, b, c, d and e
> > >>>>>>> are all going to be reading from the original DAG that generates
> > >>>>>>> a. But with a naive expectation, d should be reading from the
> > >>>>>>> cache. This does not seem to solve the potential confusion you
> > >>>>>>> raised, right?
> > >>>>>>>
> > >>>>>>> Just to be clear, my understanding is all based on the assumption
> > >>>>>>> that the tables are immutable. Therefore, after a.cache(), the
> > >>>>>>> *cachedTableA* and the original table *a* should be completely
> > >>>>>>> interchangeable.
> > >>>>>>>
> > >>>>>>> That said, I think a valid argument is optimization. There are
> > >>>>>>> indeed cases where reading from the original DAG could be faster
> > >>>>>>> than reading from the cache, for example:
> > >>>>>>>
> > >>>>>>> a.filter('f1 > 100)
> > >>>>>>> a.cache()
> > >>>>>>> b = a.filter('f1 < 100)
> > >>>>>>>
> > >>>>>>> Ideally the optimizer should be intelligent enough to decide
> > >>>>>>> which way is faster, without user intervention. In this case, it
> > >>>>>>> will identify that b would just be an empty table, and thus skip
> > >>>>>>> reading from the cache completely. But I agree that returning a
> > >>>>>>> CachedTable would give the user control over when to use the
> > >>>>>>> cache, even though I still feel that letting the optimizer handle
> > >>>>>>> this is a better option in the long run.
> > >>>>>>>
> > >>>>>>> Thanks,
> > >>>>>>>
> > >>>>>>> Jiangjie (Becket) Qin
> > >>>>>>>
> > >>>>>>>
> > >>>>>>>
> > >>>>>>>
> > >>>>>>> On Tue, Dec 4, 2018 at 6:51 PM Till Rohrmann <
> trohrmann@apache.org
> > >
> > >>>>>> wrote:
> > >>>>>>>
> > >>>>>>>> Yes you are right Becket that it still depends on the actual
> > >>>>> execution
> > >>>>>> of
> > >>>>>>>> the job whether a consumer reads from a cached result or not.
> > >>>>>>>>
> > >>>>>>>> My point was actually about the properties of a (cached vs.
> > >>>>> non-cached)
> > >>>>>>> and
> > >>>>>>>> not about the execution. I would not make cache trigger the
> > >>> execution
> > >>>>>> of
> > >>>>>>>> the job because one loses some flexibility by eagerly triggering
> > the
> > >>>>>>>> execution.
> > >>>>>>>>
> > >>>>>>>> I tried to argue for an explicit CachedTable which is returned
> by
> > >>> the
> > >>>>>>>> cache() method like Piotr did in order to make the API more
> > >>> explicit.
> > >>>>>>>>
> > >>>>>>>> Cheers,
> > >>>>>>>> Till
> > >>>>>>>>
> > >>>>>>>> On Mon, Dec 3, 2018 at 4:23 PM Becket Qin <becket.qin@gmail.com
> >
> > >>>>>> wrote:
> > >>>>>>>>
> > >>>>>>>>> Hi Till,
> > >>>>>>>>>
> > >>>>>>>>> That is a good example. Just a minor correction: in this case,
> > >>>>>>>>> b, c and d will all consume from a non-cached a. This is
> > >>>>>>>>> because the cache will only be created on the very first job
> > >>>>>>>>> submission that generates the table to be cached.
> > >>>>>>>>>
> > >>>>>>>>> If I understand correctly, this example is about whether the
> > >>>>>>>>> .cache() method should be eagerly evaluated or lazily
> > >>>>>>>>> evaluated. In other words, if the cache() method actually
> > >>>>>>>>> triggers a job that creates the cache, there will be no such
> > >>>>>>>>> confusion. Is that right?
> > >>>>>>>>>
> > >>>>>>>>> In the example, although d will not consume from the cached
> > >>>>>>>>> Table while it looks like it is supposed to, from a correctness
> > >>>>>>>>> perspective the code will still return the correct result,
> > >>>>>>>>> assuming that tables are immutable.
> > >>>>>>>>>
> > >>>>>>>>> Personally I feel it is OK, because users probably won't really
> > >>>>>>>>> worry about whether the table is cached or not. And a lazy
> > >>>>>>>>> cache could avoid some unnecessary caching if a cached table is
> > >>>>>>>>> never created in the user application. But I am not opposed to
> > >>>>>>>>> eager evaluation of the cache.
> > >>>>>>>>>
> > >>>>>>>>> Thanks,
> > >>>>>>>>>
> > >>>>>>>>> Jiangjie (Becket) Qin
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>> On Mon, Dec 3, 2018 at 10:01 PM Till Rohrmann <
> > >>>>> trohrmann@apache.org>
> > >>>>>>>>> wrote:
> > >>>>>>>>>
> > >>>>>>>>>> Another argument for Piotr's point is that lazily changing
> > >>>>>>>>>> properties of a node affects all downstream consumers but does
> > >>>>>>>>>> not necessarily have to happen before these consumers are
> > >>>>>>>>>> defined. From a user's perspective this can be quite confusing:
> > >>>>>>>>>>
> > >>>>>>>>>> b = a.map(...)
> > >>>>>>>>>> c = a.map(...)
> > >>>>>>>>>>
> > >>>>>>>>>> a.cache()
> > >>>>>>>>>> d = a.map(...)
> > >>>>>>>>>>
> > >>>>>>>>>> now b, c and d will consume from a cached operator. In this
> > >>>>>>>>>> case, the user would most likely expect that only d reads from
> > >>>>>>>>>> a cached result.
> > >>>>>>>>>>
> > >>>>>>>>>> Cheers,
> > >>>>>>>>>> Till
> > >>>>>>>>>>
> > >>>>>>>>>> On Mon, Dec 3, 2018 at 11:32 AM Piotr Nowojski <
> > >>>>>>>> piotr@data-artisans.com>
> > >>>>>>>>>> wrote:
> > >>>>>>>>>>
> > >>>>>>>>>>> Hey Shaoxuan and Becket,
> > >>>>>>>>>>>
> > >>>>>>>>>>>> Can you explain a bit more one what are the side effects? So
> > >>>>>> far
> > >>>>>>> my
> > >>>>>>>>>>>> understanding is that such side effects only exist if a
> table
> > >>>>>> is
> > >>>>>>>>>> mutable.
> > >>>>>>>>>>>> Is that the case?
> > >>>>>>>>>>>
> > >>>>>>>>>>> Not only that. There are also performance implications and
> > >>>>> those
> > >>>>>>> are
> > >>>>>>>>>>> another implicit side effects of using `void cache()`. As I
> > >>>>> wrote
> > >>>>>>>>> before,
> > >>>>>>>>>>> reading from cache might not always be desirable, thus it can
> > >>>>>> cause
> > >>>>>>>>>>> performance degradation and I’m fine with that - user's or
> > >>>>>>>> optimiser’s
> > >>>>>>>>>>> choice. What I do not like is that this implicit side effect
> > >>>>> can
> > >>>>>>>>> manifest
> > >>>>>>>>>>> in completely different part of code, that wasn’t touched by
> a
> > >>>>>> user
> > >>>>>>>>> while
> > >>>>>>>>>>> he was adding `void cache()` call somewhere else. And even if
> > >>>>>>> caching
> > >>>>>>>>>>> improves performance, it’s still a side effect of `void
> > >>>>> cache()`.
> > >>>>>>>>> Almost
> > >>>>>>>>>>> from the definition `void` methods have only side effects.
> As I
> > >>>>>>> wrote
> > >>>>>>>>>>> before, there are couple of scenarios where this might be
> > >>>>>>> undesirable
> > >>>>>>>>>>> and/or unexpected, for example:
> > >>>>>>>>>>>
> > >>>>>>>>>>> 1.
> > >>>>>>>>>>> Table b = …;
> > >>>>>>>>>>> b.cache()
> > >>>>>>>>>>> x = b.join(…)
> > >>>>>>>>>>> y = b.count()
> > >>>>>>>>>>> // ...
> > >>>>>>>>>>> // 100
> > >>>>>>>>>>> // hundred
> > >>>>>>>>>>> // lines
> > >>>>>>>>>>> // of
> > >>>>>>>>>>> // code
> > >>>>>>>>>>> // later
> > >>>>>>>>>>> z = b.filter(…).groupBy(…) // this might be even hidden in a
> > >>>>>>>> different
> > >>>>>>>>>>> method/file/package/dependency
> > >>>>>>>>>>>
> > >>>>>>>>>>> 2.
> > >>>>>>>>>>>
> > >>>>>>>>>>> Table b = ...
> > >>>>>>>>>>> If (some_condition) {
> > >>>>>>>>>>> foo(b)
> > >>>>>>>>>>> }
> > >>>>>>>>>>> Else {
> > >>>>>>>>>>> bar(b)
> > >>>>>>>>>>> }
> > >>>>>>>>>>> z = b.filter(…).groupBy(…)
> > >>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>> Void foo(Table b) {
> > >>>>>>>>>>> b.cache()
> > >>>>>>>>>>> // do something with b
> > >>>>>>>>>>> }
> > >>>>>>>>>>>
> > >>>>>>>>>>> In both above examples, `b.cache()` will implicitly affect
> > >>>>>>> (semantic
> > >>>>>>>>> of a
> > >>>>>>>>>>> program in case of sources being mutable and performance) `z
> =
> > >>>>>>>>>>> b.filter(…).groupBy(…)` which might be far from obvious.
> > >>>>>>>>>>>
> > >>>>>>>>>>> On top of that, there is still this argument of mine that
> > >>>>> having
> > >>>>>> a
> > >>>>>>>>>>> `MaterializedTable` or `CachedTable` handle is more flexible
> > >>>>> for
> > >>>>>> us
> > >>>>>>>> for
> > >>>>>>>>>> the
> > >>>>>>>>>>> future and for the user (as a manual option to bypass cache
> > >>>>>> reads).
> > >>>>>>>>>>>
> > >>>>>>>>>>>> But Jiangjie is correct,
> > >>>>>>>>>>>> the source table in batching should be immutable. It is the
> > >>>>>>> user’s
> > >>>>>>>>>>>> responsibility to ensure it, otherwise even a regular
> > >>>>> failover
> > >>>>>>> may
> > >>>>>>>>> lead
> > >>>>>>>>>>>> to inconsistent results.
> > >>>>>>>>>>>
> > >>>>>>>>>>> Yes, I agree that’s what perfect world/good deployment should
> > >>>>> be.
> > >>>>>>> But
> > >>>>>>>>> its
> > >>>>>>>>>>> often isn’t and while I’m not trying to fix this (since the
> > >>>>>> proper
> > >>>>>>>> fix
> > >>>>>>>>> is
> > >>>>>>>>>>> to support transactions), I’m just trying to minimise
> confusion
> > >>>>>> for
> > >>>>>>>> the
> > >>>>>>>>>>> users that are not fully aware what’s going on and operate in
> > >>>>>> less
> > >>>>>>>> then
> > >>>>>>>>>>> perfect setup. And if something bites them after adding
> > >>>>>> `b.cache()`
> > >>>>>>>>> call,
> > >>>>>>>>>>> to make sure that they at least know all of the places that
> > >>>>>> adding
> > >>>>>>>> this
> > >>>>>>>>>>> line can affect.
> > >>>>>>>>>>>
> > >>>>>>>>>>> Thanks, Piotrek
> > >>>>>>>>>>>
> > >>>>>>>>>>>> On 1 Dec 2018, at 15:39, Becket Qin <be...@gmail.com>
> > >>>>>>> wrote:
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Hi Piotrek,
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Thanks again for the clarification. Some more replies are
> > >>>>>>>> following.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> But keep in mind that `.cache()` will/might not only be used
> > >>>>> in
> > >>>>>>>>>>> interactive
> > >>>>>>>>>>>>> programming and not only in batching.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> It is true. Actually in stream processing, cache() has the
> > >>>>> same
> > >>>>>>>>>> semantic
> > >>>>>>>>>>> as
> > >>>>>>>>>>>> batch processing. The semantic is following:
> > >>>>>>>>>>>> For a table created via a series of computation, save that
> > >>>>>> table
> > >>>>>>>> for
> > >>>>>>>>>>> later
> > >>>>>>>>>>>> reference to avoid running the computation logic to
> > >>>>> regenerate
> > >>>>>>> the
> > >>>>>>>>>> table.
> > >>>>>>>>>>>> Once the application exits, drop all the cache.
> > >>>>>>>>>>>> This semantic is same for both batch and stream processing.
> > >>>>> The
> > >>>>>>>>>>> difference
> > >>>>>>>>>>>> is that stream applications will only run once as they are
> > >>>>> long
> > >>>>>>>>>> running.
> > >>>>>>>>>>>> And the batch applications may be run multiple times, hence
> > >>>>> the
> > >>>>>>>> cache
> > >>>>>>>>>> may
> > >>>>>>>>>>>> be created and dropped each time the application runs.
> > >>>>>>>>>>>> Admittedly, there will probably be some resource management
> > >>>>>>>>>> requirements
> > >>>>>>>>>>>> for the streaming cached table, such as time based / size
> > >>>>> based
> > >>>>>>>>>>> retention,
> > >>>>>>>>>>>> to address the infinite data issue. But such requirement
> does
> > >>>>>> not
> > >>>>>>>>>> change
> > >>>>>>>>>>>> the semantic.
> > >>>>>>>>>>>> You are right that interactive programming is just one use
> > >>>>> case
> > >>>>>>> of
> > >>>>>>>>>>> cache().
> > >>>>>>>>>>>> It is not the only use case.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> For me the more important issue is of not having the `void
> > >>>>>>> cache()`
> > >>>>>>>>>> with
> > >>>>>>>>>>>>> side effects.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> This is indeed the key point. The argument around whether
> > >>>>>> cache()
> > >>>>>>>>>> should
> > >>>>>>>>>>>> return something already indicates that cache() and
> > >>>>>> materialize()
> > >>>>>>>>>> address
> > >>>>>>>>>>>> different issues.
> > >>>>>>>>>>>> Can you explain a bit more one what are the side effects? So
> > >>>>>> far
> > >>>>>>> my
> > >>>>>>>>>>>> understanding is that such side effects only exist if a
> table
> > >>>>>> is
> > >>>>>>>>>> mutable.
> > >>>>>>>>>>>> Is that the case?
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> I don’t know, probably initially we should make CachedTable
> > >>>>>>>>> read-only.
> > >>>>>>>>>> I
> > >>>>>>>>>>>>> don’t find it more confusing than the fact that user can
> not
> > >>>>>>> write
> > >>>>>>>>> to
> > >>>>>>>>>>> views
> > >>>>>>>>>>>>> or materialised views in SQL or that user currently can not
> > >>>>>>> write
> > >>>>>>>>> to a
> > >>>>>>>>>>>>> Table.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> I don't think anyone should insert something to a cache. By
> > >>>>>>>>> definition
> > >>>>>>>>>>> the
> > >>>>>>>>>>>> cache should only be updated when the corresponding original
> > >>>>>>> table
> > >>>>>>>> is
> > >>>>>>>>>>>> updated. What I am wondering is that given the following two
> > >>>>>>> facts:
> > >>>>>>>>>>>> 1. If and only if a table is mutable (with something like
> > >>>>>>>> insert()),
> > >>>>>>>>> a
> > >>>>>>>>>>>> CachedTable may have implicit behavior.
> > >>>>>>>>>>>> 2. A CachedTable extends a Table.
> > >>>>>>>>>>>> We can come to the conclusion that a CachedTable is mutable
> > >>>>> and
> > >>>>>>>> users
> > >>>>>>>>>> can
> > >>>>>>>>>>>> insert into the CachedTable directly. This is where I
> thought
> > >>>>>>>>>> confusing.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Thanks,
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Jiangjie (Becket) Qin
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> On Sat, Dec 1, 2018 at 2:45 AM Piotr Nowojski <
> > >>>>>>>>> piotr@data-artisans.com
> > >>>>>>>>>>>
> > >>>>>>>>>>>> wrote:
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>> Hi all,
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> Regarding naming `cache()` vs `materialize()`. One more
> > >>>>>>>> explanation
> > >>>>>>>>>> why
> > >>>>>>>>>>> I
> > >>>>>>>>>>>>> think `materialize()` is more natural to me is that I think
> > >>>>> of
> > >>>>>>> all
> > >>>>>>>>>>> “Table”s
> > >>>>>>>>>>>>> in Table-API as views. They behave the same way as SQL
> > >>>>> views,
> > >>>>>>> the
> > >>>>>>>>> only
> > >>>>>>>>>>>>> difference for me is that their live scope is short -
> > >>>>> current
> > >>>>>>>>> session
> > >>>>>>>>>>> which
> > >>>>>>>>>>>>> is limited by different execution model. That’s why
> > >>>>> “cashing”
> > >>>>>> a
> > >>>>>>>> view
> > >>>>>>>>>>> for me
> > >>>>>>>>>>>>> is just materialising it.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> However I see and I understand your point of view. Coming
> > >>>>> from
> > >>>>>>>>>>>>> DataSet/DataStream and generally speaking non-SQL world,
> > >>>>>>> `cache()`
> > >>>>>>>>> is
> > >>>>>>>>>>> more
> > >>>>>>>>>>>>> natural. But keep in mind that `.cache()` will/might not
> > >>>>> only
> > >>>>>> be
> > >>>>>>>>> used
> > >>>>>>>>>> in
> > >>>>>>>>>>>>> interactive programming and not only in batching. But
> naming
> > >>>>>> is
> > >>>>>>>> one
> > >>>>>>>>>>> issue,
> > >>>>>>>>>>>>> and not that critical to me. Especially that once we
> > >>>>> implement
> > >>>>>>>>> proper
> > >>>>>>>>>>>>> materialised views, we can always deprecate/rename
> `cache()`
> > >>>>>> if
> > >>>>>>> we
> > >>>>>>>>>> deem
> > >>>>>>>>>>> so.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> For me the more important issue is of not having the `void
> > >>>>>>>> cache()`
> > >>>>>>>>>> with
> > >>>>>>>>>>>>> side effects. Exactly for the reasons that you have
> > >>>>> mentioned.
> > >>>>>>>> True:
> > >>>>>>>>>>>>> results might be non deterministic if underlying source
> > >>>>> table
> > >>>>>>> are
> > >>>>>>>>>>> changing.
> > >>>>>>>>>>>>> Problem is that `void cache()` implicitly changes the
> > >>>>> semantic
> > >>>>>>> of
> > >>>>>>>>>>>>> subsequent uses of the cached/materialized Table. It can
> > >>>>> cause
> > >>>>>>>> “wtf”
> > >>>>>>>>>>> moment
> > >>>>>>>>>>>>> for a user if he inserts “b.cache()” call in some place in
> > >>>>> his
> > >>>>>>>> code
> > >>>>>>>>>> and
> > >>>>>>>>>>>>> suddenly some other random places are behaving differently.
> > >>>>> If
> > >>>>>>>>>>>>> `materialize()` or `cache()` returns a Table handle, we
> > >>>>> force
> > >>>>>>> user
> > >>>>>>>>> to
> > >>>>>>>>>>>>> explicitly use the cache which removes the “random” part
> > >>>>> from
> > >>>>>>> the
> > >>>>>>>>>>> "suddenly
> > >>>>>>>>>>>>> some other random places are behaving differently”.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> This argument and others that I’ve raised (greater
> > >>>>>>>>>> flexibility/allowing
> > >>>>>>>>>>>>> user to explicitly bypass the cache) are independent of
> > >>>>>>> `cache()`
> > >>>>>>>> vs
> > >>>>>>>>>>>>> `materialize()` discussion.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>> Does that mean one can also insert into the CachedTable?
> > >>>>> This
> > >>>>>>>>> sounds
> > >>>>>>>>>>>>> pretty confusing.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> I don’t know, probably initially we should make CachedTable
> > >>>>>>>>>> read-only. I
> > >>>>>>>>>>>>> don’t find it more confusing than the fact that user can
> not
> > >>>>>>> write
> > >>>>>>>>> to
> > >>>>>>>>>>> views
> > >>>>>>>>>>>>> or materialised views in SQL or that user currently can not
> > >>>>>>> write
> > >>>>>>>>> to a
> > >>>>>>>>>>>>> Table.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> Piotrek

On 30 Nov 2018, at 17:38, Xingcan Cui <xingcanc@gmail.com> wrote:

Hi all,

I agree with @Becket that `cache()` and `materialize()` should be considered two different methods, where the latter one is more sophisticated.

According to my understanding, the initial idea is just to introduce a simple cache or persist mechanism, but as the Table API is a high-level API, it's natural for us to think in a SQL way.

Maybe we can add the `cache()` method to the DataSet API and force users to translate a Table to a DataSet before caching it. Then the users should manually register the cached dataset as a table again (we may need some table replacement mechanisms for datasets with an identical schema but different contents here). After all, it's the dataset rather than the dynamic table that needs to be cached, right?

Best,
Xingcan
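
For illustration, the round trip that suggestion implies could look roughly like this, written against the Flink 1.7-era batch Table API. The cache() step itself is the hypothetical missing piece (DataSet has no such method today), and "Orders" is an assumed, already-registered table.

import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.TableEnvironment;
import org.apache.flink.table.api.java.BatchTableEnvironment;
import org.apache.flink.types.Row;

public class DataSetCacheSketch {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        BatchTableEnvironment tEnv = TableEnvironment.getTableEnvironment(env);

        Table agg = tEnv.scan("Orders")
            .groupBy("user")
            .select("user, amount.sum as total");

        // 1. Translate the Table into a DataSet.
        DataSet<Row> rows = tEnv.toDataSet(agg, Row.class);

        // 2. Hypothetical: pin the DataSet so later jobs reuse its result.
        // rows = rows.cache();

        // 3. Register the (cached) DataSet as a table again; queries on
        //    "OrdersAgg" would then read the pinned rows instead of
        //    re-running the aggregation.
        tEnv.registerDataSet("OrdersAgg", rows);
        Table reused = tEnv.scan("OrdersAgg").filter("total > 100");
    }
}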

On Nov 30, 2018, at 10:57 AM, Becket Qin <becket.qin@gmail.com> wrote:

Hi Piotrek and Jark,

Thanks for the feedback and explanation. Those are good arguments, but I think they are mostly about materialized views. Let me try to explain why I believe cache() and materialize() are different.

I think cache() and materialize() have quite different implications. An analogy I can think of is save()/publish(). When users call cache(), it is just like they are saving an intermediate result as a draft of their work; this intermediate result may not have any realistic meaning. Calling cache() does not mean users want to publish the cached table in any manner. But when users call materialize(), that means "I have something meaningful to be reused by others", and now users need to think about validation, update & versioning, the lifecycle of the result, etc.

Piotrek's suggestions on variations of the materialize() methods are very useful. It would be great if Flink had them. The concept of a materialized view is actually a pretty big feature, not to mention the related stuff like the triggers/hooks you mentioned earlier. I think the materialized view itself should be discussed in a more thorough and systematic manner. And I find that discussion to be kind of orthogonal to, and way beyond, the interactive programming experience.

The example you gave was interesting. I still have some questions, though.

> Table source = … // some source that scans files from a directory "/foo/bar/"
> Table t1 = source.groupBy(…).select(…).where(…) ….;
> Table t2 = t1.materialize() // (or `cache()`)
> t2.count() // initialise cache (if it's lazily initialised)
> int a1 = t1.count()
> int b1 = t2.count()
> // something in the background (or we trigger it) writes new files to /foo/bar
> int a2 = t1.count()
> int b2 = t2.count()
> t2.refresh() // possible future extension, not to be implemented in the initial version

What if someone else added some more files to /foo/bar at this point? In that case, a3 won't equal b3, and the result becomes non-deterministic, right?

> int a3 = t1.count()
> int b3 = t2.count()
> t2.drop() // another possible future extension, manual "cache" dropping

When we talk about interactive programming, in most cases we are talking about batch applications. A fundamental assumption of such a case is that the source data is complete before the data processing begins, and the data will not change during the processing. IMO, if additional rows need to be added to some source during the processing, it should be done in ways like unioning the source with another table containing the rows to be added.

There are a few cases where computations are executed repeatedly on a changing data source.

For example, people may run an ML training job every hour with the samples newly added in the past hour. In that case, the source data between runs will indeed change. But still, the data remains unchanged within one run. And usually in that case the result will need versioning, i.e. for a given result, it tells that the result is derived from the source data as of a certain timestamp.

Another example is something like a data warehouse. In this case, there are a few sources of original/raw data. On top of those sources, many materialized views / queries / reports / dashboards can be created to generate derived data. That derived data needs to be updated when the underlying original data changes. In that case, the processing logic that derives data from the original sources needs to be executed repeatedly to update those reports/views. Again, all that derived data also needs version management, such as a timestamp.

In either of the above two cases, the data cannot change during a single run of the processing logic. Otherwise the behavior of the processing logic may be undefined. In the above two examples, when writing the processing logic, users can use cache() to hint to Flink that those results should be saved to avoid repeated computation. And then for the result of my application logic, I'll call materialize(), so that these results can be managed by the system with versioning, metadata management, lifecycle management, ACLs, etc.

It is true we can use materialize() to do the cache() job, but I am really reluctant to shoehorn cache() into materialize() and force users to worry about a bunch of implications that they needn't have to. I am absolutely on your side that redundant API is bad. But it is equally frustrating, if not more, when the same API does different things.

Thanks,

Jiangjie (Becket) Qin
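
The hourly-training case can be sketched as follows; the split between an unnamed, session-scoped cache() and a named, versioned materialization is the point. All of the APIs in the sketch (Table, cache(), materializeAs()) are hypothetical illustrations of the proposal, not existing Flink methods.

// Hypothetical APIs: cache() for session-scoped drafts,
// materializeAs() for published, versioned results.
interface Table {
    Table filter(String predicate);
    Table cache();                   // unnamed draft, reused within one run
    void materializeAs(String name); // published, versioned, managed result
}

class HourlyTrainingJob {
    // One run over a fixed snapshot: within the run the data must not change.
    void runOnce(Table samples, long snapshotTs) {
        // cache(): an expensive intermediate reused several times this run.
        Table features = samples.filter("eventTime <= " + snapshotTs).cache();

        Table model = train(features);
        Table report = evaluate(features, model);

        // materialize: the results others will consume, versioned by the
        // snapshot timestamp they were derived from.
        model.materializeAs("model_" + snapshotTs);
        report.materializeAs("report_" + snapshotTs);
    }

    Table train(Table features) { return features; } // placeholder
    Table evaluate(Table f, Table m) { return f; }   // placeholder
}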

On Fri, Nov 30, 2018 at 10:34 PM Shaoxuan Wang <wshaoxuan@gmail.com> wrote:

Thanks Piotrek,
You provided a very good example; it explains all the confusion I had. It is clear that there is something we have not considered in the initial proposal. We intended to force the user to reuse the cached/materialized table once its cache() method is executed. We did not expect that the user may want to re-execute the plan from the source table. Let me re-think about it and get back to you later.

In the meanwhile, this example/observation also implies that we cannot fully involve the optimizer in deciding the plan if a cache/materialize is explicitly used, because whether to reuse the cached data or re-execute the query from the source data may lead to different results. (But I guess the optimizer can still help in some cases: as long as it does not re-execute from the varied source, we should be safe.)

Regards,
Shaoxuan
On Fri, Nov 30, 2018 at 9:13 PM Piotr Nowojski <piotr@data-artisans.com> wrote:

Hi Shaoxuan,

Re 2:

> Table t3 = methodThatAppliesOperators(t1) // t1 is modified to-> t1'

What do you mean by "t1 is modified to -> t1'"? That the `methodThatAppliesOperators()` method has changed its plan?

I was thinking more about something like this:

Table source = … // some source that scans files from a directory "/foo/bar/"
Table t1 = source.groupBy(…).select(…).where(…) ….;
Table t2 = t1.materialize() // (or `cache()`)

t2.count() // initialise cache (if it's lazily initialised)

int a1 = t1.count()
int b1 = t2.count()

// something in the background (or we trigger it) writes new files to /foo/bar

int a2 = t1.count()
int b2 = t2.count()

t2.refresh() // possible future extension, not to be implemented in the initial version

int a3 = t1.count()
int b3 = t2.count()

t2.drop() // another possible future extension, manual "cache" dropping

assertTrue(a1 == b1) // same results, but b1 comes from the "cache"
assertTrue(b1 == b2) // both values come from the same cache
assertTrue(a2 > b2)  // b2 comes from the cache, a2 re-executed a full table scan and has more data
assertTrue(b3 > b2)  // b3 comes from the refreshed cache
assertTrue(b3 == a2 == a3)

Piotrek

On 30 Nov 2018, at 10:22, Jark Wu <im...@gmail.com> wrote:

Hi,

It is a very interesting and useful design!

Here I want to share some of my thoughts:

1. Agree that the cache() method should return some Table, to avoid unexpected problems caused by a mutable object. All the existing methods of Table return a new Table instance.

2. I think materialize() would be more consistent with SQL; this makes it possible to support the same feature for SQL (materialized view) and keep the same API for users in the future. But I'm also fine if we choose cache().

3. In the proposal, a TableService (or FlinkService?) is used to cache the result of the (intermediate) table. But the name TableService may be a bit too general and is not quite understood correctly at first glance (a metastore for tables?). Maybe a more specific name would be better, such as TableCacheService or TableMaterializeService or something else.

Best,
Jark
On Thu, 29 Nov 2018 at 21:16, Fabian Hueske <fhueske@gmail.com> wrote:

Hi,

Thanks for the clarification Becket!

I have a few thoughts to share / questions:

1) I'd like to know how you plan to implement the feature on a plan / planner level.

I would imagine the following to happen when Table.cache() is called:

1) immediately optimize the Table and internally convert it into a DataSet/DataStream. This is necessary to avoid that operators of later queries on top of the Table are pushed down.
2) register the DataSet/DataStream as a DataSet/DataStream-backed Table X
3) add a sink to the DataSet/DataStream. This is the materialization of the Table X

Based on your proposal the following would happen:

Table t1 = ....
t1.cache(); // cache() returns void. The logical plan of t1 is replaced by a scan of X. There is also a reference to the materialization of X.

t1.count(); // this executes the program, including the DataSet/DataStream that backs X and the sink that writes the materialization of X
t1.count(); // this executes the program, but reads X from the materialization.

My question is, how do you determine whether the scan of t1 should go against the DataSet/DataStream program or against the materialization? AFAIK, there is no hook that will tell you that a part of the program was executed. Flipping a switch during optimization or plan generation is not sufficient, as there is no guarantee that the plan is also executed.

Overall, this behavior is somewhat similar to what I proposed in FLINK-8950, which does not include persisting the table, but just optimizing and reregistering it as a DataSet/DataStream scan.

2) I think Piotr has a point about the implicit behavior and side effects of the cache() method if it does not return anything. Consider the following example:

Table t1 = ???
Table t2 = methodThatAppliesOperators(t1);
Table t3 = methodThatAppliesOtherOperators(t1);

In this case, the behavior/performance of the plan that results from the second method call depends on whether t1 was modified by the first method or not. This is the classic issue of mutable vs. immutable objects. Also, as Piotr pointed out, it might be good to keep the original plan of t1, because in some cases it is possible to push filters down such that evaluating the query from scratch is more efficient than accessing the cache. Moreover, a CachedTable could extend Table and offer a method refresh(). This sounds quite useful in an interactive session mode.

3) Regarding the name, I can see both arguments. IMO, materialize() seems to be more future proof.

Best, Fabian
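
The open question in 1) can be made concrete with a pseudo-Java sketch. Every name in it (the internal registry, wasWritten(), the translate/register/sink helpers) is illustrative; none of this is existing Flink planner code.

// Pseudo-Java sketch of the three cache() steps and the open question
// about when to flip from the program to the materialization.
class CachingTableEnvironmentSketch {
    private final java.util.Map<Object, Materialization> caches = new java.util.HashMap<>();

    void cache(Object table) {
        // 1) Optimize the Table now and freeze it as a physical program,
        //    so later operators cannot be pushed below the cache boundary.
        Object program = optimizeAndTranslate(table);
        // 2) Register the program as a new internal table X.
        String x = "X_" + java.util.UUID.randomUUID();
        register(x, program);
        // 3) Attach a sink to X: this is the materialization of X.
        caches.put(table, addSink(x, program));
    }

    Object scanFor(Object table) {
        Materialization m = caches.get(table);
        // The open question: there is no reliable hook telling us the
        // sink actually ran, so "wasWritten" is exactly the missing piece.
        return (m != null && m.wasWritten()) ? m.scan() : table;
    }

    interface Materialization { boolean wasWritten(); Object scan(); }
    Object optimizeAndTranslate(Object t) { return t; }                    // placeholder
    void register(String name, Object program) { }                        // placeholder
    Materialization addSink(String name, Object program) { return null; } // placeholder
}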

On Thu, 29 Nov 2018 at 12:56, Shaoxuan Wang <wshaoxuan@gmail.com> wrote:

Hi Piotr,

Thanks for sharing your ideas on the method naming. We will think about your suggestions. But I don't understand why we need to change the return type of cache().

cache() is a physical operation; it does not change the logic of the Table. On the Table API layer, we should not introduce a new table type unless the logic of the table has been changed. If we introduce a new table type CachedTable, we need to create the same set of Table methods for it. I don't think it is worth doing this. Or can you please elaborate more on what the "implicit behaviours/side effects" you are thinking about could be?

Regards,
Shaoxuan

On Thu, Nov 29, 2018 at 7:05 PM Piotr Nowojski <piotr@data-artisans.com> wrote:

Hi Becket,

Thanks for the response.

1. I wasn't saying that a materialised view must be mutable or not. The same thing applies to caches as well. To the contrary, I would expect more consistency and updates from something that is called a "cache" vs something that's a "materialised view". In other words, IMO most caches do not serve you invalid/outdated data, and they handle updates on their own.

2. I don't think that having two very similar concepts, `materialized` view and `cache`, in the future is a good idea. It would be confusing for the users. I think it could be handled by variations/overloading of the materialised view concept. We could start with:

`MaterializedTable materialize()` - immutable, session life scope (basically the same semantics as you are proposing)

And then in the future (if ever) build on top of that/expand it with:

`MaterializedTable materialize(refreshTime=…)` or `MaterializedTable materialize(refreshHook=…)`

Or with cross session support:

`MaterializedTable materializeInto(connector=…)` or `MaterializedTable materializeInto(tableFactory=…)`

I'm not saying that we should implement cross session/refreshing now or even in the near future. I'm just arguing that naming the current immutable, session-life-scope method `materialize()` is more future proof and more consistent with SQL (on which, after all, the Table API is heavily based).

3. Even if we agree on naming it `cache()`, I would still insist on `cache()` returning a `CachedTable` handle, to avoid implicit behaviours/side effects and to give both us & users more flexibility.

Piotrek
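
As a sketch, that staged API could look as follows; all types and methods are hypothetical API design for discussion, not existing Flink code.

import java.time.Duration;

interface Table {
    // Initial version: immutable snapshot, session life scope.
    MaterializedTable materialize();

    // Possible future extensions (not for the initial version):
    MaterializedTable materialize(Duration refreshInterval);
    MaterializedTable materialize(Runnable refreshHook);

    // Cross-session support, materializing into external storage:
    MaterializedTable materializeInto(String connector);
}

interface MaterializedTable extends Table {
    void refresh(); // re-run the plan and replace the snapshot
    void drop();    // manual clean-up
}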

On 29 Nov 2018, at 06:20, Becket Qin <becket.qin@gmail.com> wrote:

Just to add a little bit, the materialized view is probably more similar to the persist() brought up earlier in the thread. So it is usually cross session and could be used in a larger scope. For example, a materialized view created by user A may be visible to user B. It is probably something we want to have in the future. I'll put it in the future work section.

Thanks,

Jiangjie (Becket) Qin
On Thu, Nov 29, 2018 at 9:47 AM Becket Qin <becket.qin@gmail.com> wrote:

Hi Piotrek,

Thanks for the explanation.

Right now we are mostly thinking of the cached table as immutable. I can see that the materialized view would be useful in the future. That said, I think a simple cache mechanism is probably still needed. So to me, cache() and materialize() should be two separate methods, as they address different needs. materialize() is a higher level concept, usually implying periodical updates, while cache() has much simpler semantics. For example, one may create a materialized view and use the cache() method in the materialized view creation logic, so that during the materialized view update there is no need to worry about the cached table changing as well. Maybe under the hood materialize() and cache() could share some mechanism, but I think a simple cache() method would be handy in a lot of cases.

Thanks,

Jiangjie (Becket) Qin
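
The pattern described here (a plain cache() inside the update logic of a materialized view) could look like this sketch; the Table/TableSink types and all methods are the proposed, hypothetical APIs.

interface TableSink { }

interface Table {
    Table filter(String predicate);
    Table cache();                 // snapshot for the duration of this run
    void writeTo(TableSink sink);
}

class ViewUpdateJob {
    // Runs once per refresh of the materialized view.
    void update(Table rawEvents, TableSink dailyReport, TableSink topUsers) {
        // The expensive shared intermediate result is computed once and
        // cached; within this run it cannot change underneath the reports.
        Table cleaned = rawEvents.filter("valid = true").cache();

        // Both derived outputs reuse the same cached snapshot.
        cleaned.filter("amount > 1000").writeTo(dailyReport);
        cleaned.filter("rank <= 10").writeTo(topUsers);
    }
}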

On Mon, Nov 26, 2018 at 9:38 PM Piotr Nowojski <piotr@data-artisans.com> wrote:

Hi Becket,

> Is there any extra thing user can do on a MaterializedTable that they cannot do on a Table?

Maybe not in the initial implementation, but various DBs offer different ways to "refresh" the materialised view: hooks, triggers, timers, manually, etc. Having `MaterializedTable` would help us to handle that in the future.

> After users call *table.cache(), *users can just use that table and do anything that is supported on a Table, including SQL.

This is some implicit behaviour with side effects. Imagine a user with a long and complicated program that touches table `b` multiple times, maybe scattered around different methods. If he modifies his program by inserting in one place

b.cache()

this implicitly alters the semantics and behaviour of his code all over the place, maybe in ways that might cause problems. For example, what if the underlying data is changing?

Having invisible side effects is also not very clean. For example, think about something like this (but more complicated):

Table b = ...;

if (some_condition) {
  processTable1(b)
} else {
  processTable2(b)
}

// do more stuff with b

and the user adds a `b.cache()` call to only one of the `processTable1` or `processTable2` methods.

On the other hand,

Table materialisedB = b.materialize()

avoids (at least some of) the side effect issues and forces the user to explicitly use `materialisedB` where it's appropriate, and forces the user to think about what it actually means. And if something doesn't work in the end for the user, he will know what he has changed, instead of blaming Flink for some "magic" underneath. In the above example, after materialising b in only one of the methods, he should/would realise the issue when handling the `MaterializedTable` return value of that method.

I guess it comes down to personal preference whether you like things to be implicit or not. The more of a power user someone is, the more likely he probably is to like/understand implicit behaviour. And we as Table API designers are the most power users out there, so I would proceed with caution (so that we do not end up in the crazy perl realm with its lovely implicit method arguments ;) <https://stackoverflow.com/a/14922656/8149051>)

> Table API to also support non-relational processing cases, cache() might be slightly better.

I think even such an extended Table API could benefit from sticking to/being consistent with SQL, where both SQL and the Table API are basically the same.

One more thing: `MaterializedTable materialize()` could be more powerful/flexible, allowing the user to operate both on the materialised and the non-materialised view at the same time, for whatever reasons (underlying data changing/better optimisation opportunities after pushing down more filters etc). For example:

Table b = …;

MaterializedTable mb = b.materialize();

Val min = mb.min();
Val max = mb.max();

Val user42 = b.filter('userId = 42);

could be more efficient compared to `b.cache()` if `filter('userId = 42)` allows for much more aggressive optimisations.

Piotrek

On 26 Nov 2018, at 12:14, Fabian Hueske <fhueske@gmail.com> wrote:

I'm not suggesting to add support for Ignite. This was just an example. Plasma and Arrow sound interesting, too.
For the sake of this proposal, it would be up to the user to implement a TableFactory and corresponding TableSource / TableSink classes to persist and read the data.

On Mon, 26 Nov 2018 at 12:06, Flavio Pompermaier <pompermaier@okkam.it> wrote:

What about also adding Apache Plasma + Arrow as an alternative to Apache Ignite?
[1] https://arrow.apache.org/blog/2017/08/08/plasma-in-memory-object-store/

On Mon, Nov 26, 2018 at 11:56 AM Fabian Hueske <fhueske@gmail.com> wrote:

Hi,

Thanks for the proposal!

To summarize, you propose a new method Table.cache(): Table that will trigger a job and write the result into some temporary storage as defined by a TableFactory. The cache() call blocks while the job is running and eventually returns a Table object that represents a scan of the temporary table. When the "session" is closed (closing to be defined?), the temporary tables are all dropped.

I think this behavior makes sense and is a good first step towards more interactive workloads. However, its performance suffers from writing to and reading from external systems. I think this is OK for now. Changes that would significantly improve the situation (i.e., pinning data in-memory across jobs) would have large impacts on many components of Flink. Users could use in-memory filesystems or storage grids (Apache Ignite) to mitigate some of the performance effects.

Best, Fabian
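
That summary can be condensed into a small sketch of a session-scoped cache service. The TempStorageFactory/Sink/Source types below are illustrative stand-ins for a user-provided TableFactory with its TableSink/TableSource; none of this is existing Flink code.

class SessionCacheSketch {
    interface TempStorageFactory {
        Sink createSink(String table);
        Source createSource(String table);
        void drop(String table);
    }
    interface Sink { }
    interface Source { }

    private final TempStorageFactory storage;
    private final java.util.List<String> tempTables = new java.util.ArrayList<>();

    SessionCacheSketch(TempStorageFactory storage) { this.storage = storage; }

    // Blocking cache(): run a job that writes `plan` into a temporary
    // table, then return a source that scans the stored result.
    Source cache(Object plan) {
        String name = "tmp_" + java.util.UUID.randomUUID();
        runJobBlocking(plan, storage.createSink(name)); // blocks until done
        tempTables.add(name);
        return storage.createSource(name);
    }

    // Session close: all temporary tables are dropped.
    void close() {
        tempTables.forEach(storage::drop);
    }

    private void runJobBlocking(Object plan, Sink sink) { /* submit & wait */ }
}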

On Mon, 26 Nov 2018 at 03:38, Becket Qin <becket.qin@gmail.com> wrote:

Thanks for the explanation, Piotrek.

Is there any extra thing user can do on a MaterializedTable that they cannot do on a Table? After users call *table.cache()*, users can just use that table and do anything that is supported on a Table, including SQL.

Naming wise, either cache() or materialize() sounds fine to me. cache() is a bit more general than materialize(). Given that we are enhancing the Table API to also support non-relational processing cases, cache() might be slightly better.

Thanks,

Jiangjie (Becket) Qin
On Fri, Nov 23, 2018 at 11:25 PM Piotr Nowojski <piotr@data-artisans.com> wrote:

Hi Becket,

Oops, sorry, I didn't notice that you intend to reuse the existing `TableFactory`. I don't know why, but I assumed that you wanted to provide an alternate way of writing the data.

Now that I hopefully understand the proposal, maybe we could rename `cache()` to

void materialize()

or, going a step further,

MaterializedTable materialize()
MaterializedTable createMaterializedView()

?

The second option, returning a handle, I think is more flexible and could provide features such as "refresh"/"delete" or, generally speaking, managing the view. In the future we could also think about adding hooks to automatically refresh the view etc. It is also more explicit - materialization returning a new table handle will not have the same implicit side effects as adding a simple line of code like `b.cache()` would have.

It would also be more SQL like, making it more intuitive for users already familiar with SQL.

Piotrek

On 23 Nov 2018, at 14:53, Becket Qin <becket.qin@gmail.com> wrote:

Hi Piotrek,

For the cache() method itself, yes, it is equivalent to creating a BUILT-IN materialized view with a lifecycle. That functionality is missing today, though. Not sure if I understand your question: do you mean we already have the functionality and just need some syntactic sugar?

What's more interesting in the proposal is: do we want to stop at creating the materialized view? Or do we want to extend that in the future to a more useful unified data store distributed with Flink? And do we want to have a mechanism that allows more flexible user job patterns with their own user-defined services? These considerations are much more architectural.

Thanks,

Jiangjie (Becket) Qin
On Fri, Nov 23, 2018 at 6:01 PM Piotr Nowojski <piotr@data-artisans.com> wrote:

Hi,

Interesting idea. I'm trying to understand the problem. Isn't the `cache()` call an equivalent of writing data to a sink and later reading from it, where this sink has a limited life scope/lifetime, and the sink could be implemented as an in-memory or a file sink?

If so, what's the problem with creating a materialised view from a table "b" (from your document's example) and reusing this materialised view later? Maybe we are lacking mechanisms to clean up materialised views (for example when the current session finishes)? Maybe we need some syntactic sugar on top of it?

Piotrek
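
What cache() would be sugar for today can be spelled out roughly as follows, using Flink 1.7-era batch APIs (error handling omitted; the path, the schema and the registered "someRegisteredTable" input are made up for the example):

import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.TableEnvironment;
import org.apache.flink.table.api.java.BatchTableEnvironment;
import org.apache.flink.table.sinks.CsvTableSink;
import org.apache.flink.table.sources.CsvTableSource;

public class ManualCacheSketch {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        BatchTableEnvironment tEnv = TableEnvironment.getTableEnvironment(env);

        Table b = tEnv.scan("someRegisteredTable"); // assumed to exist

        // "cache": persist b to a temporary location...
        b.writeToSink(new CsvTableSink("/tmp/cache/b", ","));
        env.execute("materialize b");

        // ...and read it back as a new table for later queries.
        CsvTableSource cached = CsvTableSource.builder()
            .path("/tmp/cache/b")
            .field("userId", Types.LONG)
            .field("total", Types.DOUBLE)
            .build();
        tEnv.registerTableSource("b_cached", cached);
        Table bCached = tEnv.scan("b_cached");
        // What cache() would add on top: scoping/clean-up of /tmp/cache/b
        // when the session ends, and automatic reuse of the result.
    }
}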

On 23 Nov 2018, at 07:21, Becket Qin <becket.qin@gmail.com> wrote:

Thanks for the suggestion, Jincheng.

Yes, I think it makes sense to have a persist() with a lifecycle/defined scope. I just added a section in the future work for this.

Thanks,

Jiangjie (Becket) Qin
On Fri, Nov 23, 2018 at 1:55 PM jincheng sun <sunjincheng121@gmail.com> wrote:

Hi Jiangjie,

Thank you for the explanation about the name of `cache()`; I understand why you designed it this way!

Another idea is whether we can specify a lifecycle for data persistence. For example, with persist(LifeCycle.SESSION) the user is not worried about data loss and clearly specifies the time range for keeping the data. At the same time, if we want to expand on this, we could also share data within a certain group of sessions, for example: LifeCycle.SESSION_GROUP(...). I am not sure, just an immature suggestion, for reference only!

Bests,
Jincheng
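
A sketch of that lifecycle idea is below. Everything in it is hypothetical API design for discussion, not existing Flink code; SESSION_GROUP carries a group id, so a plain enum would not fit.

abstract class LifeCycle {
    // Data lives exactly as long as the session that created it.
    static final LifeCycle SESSION = new LifeCycle("session") { };

    // Shared within a named group of sessions (SESSION_GROUP(...) above).
    static LifeCycle sessionGroup(String groupId) {
        return new LifeCycle("group:" + groupId) { };
    }

    final String scope;
    private LifeCycle(String scope) { this.scope = scope; }
}

interface Table {
    // Persisted data is kept for exactly the given scope, then dropped.
    Table persist(LifeCycle scope);
}

class PersistUsage {
    static void use(Table t) {
        Table perSession = t.persist(LifeCycle.SESSION);
        Table shared = t.persist(LifeCycle.sessionGroup("teamA"));
    }
}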

On Fri, Nov 23, 2018 at 1:33 PM, Becket Qin <be...@gmail.com> wrote:

Re: Jincheng,

Thanks for the feedback. Regarding cache() vs. persist(), personally I find cache() to be more accurately describing the behavior, i.e. the Table is cached for the session, but will be deleted after the session is closed. persist() seems a little misleading, as people might think the table will still be there even after the session is gone.

Great point about mixing batch and stream processing in the same job. We should absolutely move towards that goal. I imagine that would be a huge change across the board, including sources, operators and optimizations, to name some. Likely we will need several separate in-depth discussions.

Thanks,

Jiangjie (Becket) Qin
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Fri, Nov 23, 2018 at 5:14 AM
> Xingcan
> > >>>>>>> Cui <
> > >>>>>>>>>>>>>>>>>>>>>>>>>> xingcanc@gmail.com>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> wrote:
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi all,
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> @Shaoxuan, I think the lifecycle or
> > >>>>>> access
> > >>>>>>>>>> domain
> > >>>>>>>>>>>>> are
> > >>>>>>>>>>>>>>>>>>> both
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> orthogonal
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> to
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the cache problem. Essentially, this
> > >>>>> may
> > >>>>>>> be
> > >>>>>>>>> the
> > >>>>>>>>>>>>> first
> > >>>>>>>>>>>>>>>>>>> time
> > >>>>>>>>>>>>>>>>>>>>> we
> > >>>>>>>>>>>>>>>>>>>>>>>>>>> plan
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> to
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> introduce another storage mechanism
> > >>>>>> other
> > >>>>>>>> than
> > >>>>>>>>>> the
> > >>>>>>>>>>>>>>>>>>> state.
> > >>>>>>>>>>>>>>>>>>>>>>>>>> Maybe
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>> it’s
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> better
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> to first draw a big picture and then
> > >>>>>>>>> concentrate
> > >>>>>>>>>>> on
> > >>>>>>>>>>>>> a
> > >>>>>>>>>>>>>>>>>>>>> specific
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>> part?
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> @Becket, yes, actually I am more
> > >>>>>> concerned
> > >>>>>>>>> with
> > >>>>>>>>>>> the
> > >>>>>>>>>>>>>>>>>>>>> underlying
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> service.
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> This seems to be quite a major change
> > >>>>> to
> > >>>>>>> the
> > >>>>>>>>>>>>> existing
> > >>>>>>>>>>>>>>>>>>>>>>>>>> codebase.
> > >>>>>>>>>>>>>>>>>>>>>>>>>>> As
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> you
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> claimed, the service should be
> > >>>>>> extendible
> > >>>>>>> to
> > >>>>>>>>>>> support
> > >>>>>>>>>>>>>>>>>>> other
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> components
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> and
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> we’d better discussed it in another
> > >>>>>>> thread.
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> All in all, I also eager to enjoy the
> > >>>>>> more
> > >>>>>>>>>>>>> interactive
> > >>>>>>>>>>>>>>>>>>>> Table
> > >>>>>>>>>>>>>>>>>>>>>>>>>>> API,
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>> in
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> case
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> of a general and flexible enough
> > >>>>> service
> > >>>>>>>>>>> mechanism.
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Best,
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Xingcan
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Nov 22, 2018, at 10:16 AM,
> Xiaowei
> > >>>>>>>> Jiang <
> > >>>>>>>>>>>>>>>>>>>>>>>>>>> xiaoweij@gmail.com>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> wrote:
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Relying on a callback for the temp
> > >>>>>> table
> > >>>>>>>> for
> > >>>>>>>>>>> clean
> > >>>>>>>>>>>>> up
> > >>>>>>>>>>>>>>>>>>> is
> > >>>>>>>>>>>>>>>>>>>>> not
> > >>>>>>>>>>>>>>>>>>>>>>>>>>> very
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> reliable.
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> There is no guarantee that it will
> be
> > >>>>>>>>> executed
> > >>>>>>>>>>>>>>>>>>>>> successfully.
> > >>>>>>>>>>>>>>>>>>>>>>>>>> We
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>> may
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> risk
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> leaks when that happens. I think
> that
> > >>>>>>> it's
> > >>>>>>>>>> safer
> > >>>>>>>>>>> to
> > >>>>>>>>>>>>>>>>>>> have
> > >>>>>>>>>>>>>>>>>>>> an
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> association
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> between temp table and session id.
> So
> > >>>>>> we
> > >>>>>>>> can
> > >>>>>>>>>>> always
> > >>>>>>>>>>>>>>>>>>> clean
> > >>>>>>>>>>>>>>>>>>>>> up
> > >>>>>>>>>>>>>>>>>>>>>>>>>>> temp
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> tables
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> which are no longer associated with
> > >>>>> any
> > >>>>>>>>> active
> > >>>>>>>>>>>>>>>>>>> sessions.
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Regards,
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Xiaowei
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Thu, Nov 22, 2018 at 12:55 PM
> > >>>>>> jincheng
> > >>>>>>>>> sun <
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> sunjincheng121@gmail.com>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> wrote:
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi Jiangjie&Shaoxuan,
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for initiating this great
> > >>>>>>> proposal!
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Interactive Programming is very
> > >>>>> useful
> > >>>>>>> and
> > >>>>>>>>>> user
> > >>>>>>>>>>>>>>>>>>> friendly
> > >>>>>>>>>>>>>>>>>>>>> in
> > >>>>>>>>>>>>>>>>>>>>>>>>>>> case
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> of
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> your
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> examples.
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Moreover, especially when a
> business
> > >>>>>> has
> > >>>>>>>> to
> > >>>>>>>>> be
> > >>>>>>>>>>>>>>>>>>> executed
> > >>>>>>>>>>>>>>>>>>>> in
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>> several
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> stages
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> with dependencies,such as the
> > >>>>> pipeline
> > >>>>>>> of
> > >>>>>>>>>> Flink
> > >>>>>>>>>>>>> ML,
> > >>>>>>>>>>>>>>>> in
> > >>>>>>>>>>>>>>>>>>>>> order
> > >>>>>>>>>>>>>>>>>>>>>>>>>>> to
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> utilize
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> intermediate calculation results we
> > >>>>>> have
> > >>>>>>>> to
> > >>>>>>>>>>>>> submit a
> > >>>>>>>>>>>>>>>>>>> job
> > >>>>>>>>>>>>>>>>>>>>> by
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> env.execute().
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> About the `cache()`  , I think is
> > >>>>>> better
> > >>>>>>>> to
> > >>>>>>>>>>> named
> > >>>>>>>>>>>>>>>>>>>>>>>>>> `persist()`,
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>> And
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> The
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Flink framework determines whether
> > >>>>> we
> > >>>>>>>>>> internally
> > >>>>>>>>>>>>>>>> cache
> > >>>>>>>>>>>>>>>>>>>> in
> > >>>>>>>>>>>>>>>>>>>>>>>>>>> memory
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> or
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> persist
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> to the storage system,Maybe save
> the
> > >>>>>>> data
> > >>>>>>>>> into
> > >>>>>>>>>>>>> state
> > >>>>>>>>>>>>>>>>>>>>> backend
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> (MemoryStateBackend or
> > >>>>>>> RocksDBStateBackend
> > >>>>>>>>>> etc.)
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> BTW, from the points of my view in
> > >>>>> the
> > >>>>>>>>> future,
> > >>>>>>>>>>>>>>>> support
> > >>>>>>>>>>>>>>>>>>>> for
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> streaming
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> and
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> batch mode switching in the same
> job
> > >>>>>>> will
> > >>>>>>>>> also
> > >>>>>>>>>>>>>>>> benefit
> > >>>>>>>>>>>>>>>>>>>> in
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> "Interactive
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Programming",  I am looking forward
> > >>>>> to
> > >>>>>>>> your
> > >>>>>>>>>>> JIRAs
> > >>>>>>>>>>>>>>>> and
> > >>>>>>>>>>>>>>>>>>>>> FLIP!
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Best,
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Jincheng
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Becket Qin <be...@gmail.com>
> > >>>>>>>>>>> 于2018年11月20日周二
> > >>>>>>>>>>>>>>>>>>>>> 下午9:56写道:
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi all,
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> As a few recent email threads have
> > >>>>>>>> pointed
> > >>>>>>>>>> out,
> > >>>>>>>>>>>>> it
> > >>>>>>>>>>>>>>>>>>> is a
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>> promising
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> opportunity to enhance Flink Table
> > >>>>>> API
> > >>>>>>> in
> > >>>>>>>>>>> various
> > >>>>>>>>>>>>>>>>>>>>> aspects,
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> including
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> functionality and ease of use
> among
> > >>>>>>>> others.
> > >>>>>>>>>> One
> > >>>>>>>>>>>>> of
> > >>>>>>>>>>>>>>>>>>> the
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>> scenarios
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> where
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> we
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> feel Flink could improve is
> > >>>>>> interactive
> > >>>>>>>>>>>>>>>> programming.
> > >>>>>>>>>>>>>>>>>>> To
> > >>>>>>>>>>>>>>>>>>>>>>>>>>> explain
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> the
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> issues
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> and facilitate the discussion on
> > >>>>> the
> > >>>>>>>>>> solution,
> > >>>>>>>>>>> we
> > >>>>>>>>>>>>>>>> put
> > >>>>>>>>>>>>>>>>>>>>>>>>>>> together
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> the
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> following document with our
> > >>>>> proposal.
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>
> > >>>>>>>
> > >>>>>>
> > >>>>>
> > >>>
> >
> https://docs.google.com/document/d/1d4T2zTyfe7hdncEUAxrlNOYr4e5IMNEZLyqSuuswkA0/edit?usp=sharing
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Feedback and comments are very
> > >>>>>> welcome!
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks,
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Jiangjie (Becket) Qin
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>
> > >>>>>>>
> > >>>>>>
> > >>>>>
> > >>>
> > >>>
> >
> >
>

Re: [DISCUSS] Support Interactive Programming in Flink Table API

Posted by Till Rohrmann <tr...@apache.org>.
Hi Becket,

I was aiming at semantics similar to 1. I actually thought that `cache()`
would tell the system to materialize the intermediate result so that
subsequent queries don't need to reprocess it. This means that the usage of
the cached table in this example

{
 val cachedTable = a.cache()
 val b1 = cachedTable.select(…)
 val b2 = cachedTable.foo().select(…)
 val b3 = cachedTable.bar().select(...)
 val c1 = a.select(…)
 val c2 = a.foo().select(…)
 val c3 = a.bar().select(...)
}

strongly depends on interleaved calls which trigger the execution of sub
queries. So for example, if there is only a single env.execute call at the
end of the block, then b1, b2, b3, c1, c2 and c3 would all be computed by
reading directly from the sources (given that there is only a single
JobGraph). It just happens that the result of `a` will be cached such that
we skip the processing of `a` when there are subsequent queries reading
from `cachedTable`. If for some reason the system cannot materialize the
table (e.g. running out of disk space, ttl expired), then it could also
happen that we need to reprocess `a`. In that sense `cachedTable` simply is
an identifier for the materialized result of `a` with the lineage how to
reprocess it.
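
To sketch the intended behaviour across two submissions (cache() as proposed
in this thread, not an existing API):

  val cachedTable = a.cache()
  cachedTable.count()             // 1st submission: computes `a` and materializes the result
  val b = cachedTable.select(...) // later submissions: served from the materialized result
  val c = a.select(...)           // later submissions: re-execute the plan of `a`
  // if the materialized result is lost (disk full, TTL expired), reads of
  // cachedTable fall back to reprocessing `a` via its lineage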

Cheers,
Till





On Tue, Dec 11, 2018 at 11:01 AM Piotr Nowojski <pi...@data-artisans.com>
wrote:

> Hi Becket,
>
> > {
> >  val cachedTable = a.cache()
> >  val b = cachedTable.select(...)
> >  val c = a.select(...)
> > }
> >
> > Semantic 1. b uses cachedTable, as the user demanded. c uses the original
> > DAG, as the user demanded. In this case, the optimizer has no chance to
> > optimize.
> > Semantic 2. b uses cachedTable, as the user demanded. c leaves the optimizer
> > to choose whether the cache or DAG should be used. In this case, the user
> > loses the option to NOT use the cache.
> >
> > As you can see, neither of the options seems perfect. However, I guess you
> > and Till are proposing the third option:
> >
> > Semantic 3. b leaves the optimizer to choose whether the cache or DAG
> > should be used. c always uses the DAG.
>
> I am pretty sure that me, Till, Fabian and others were all proposing and
> advocating in favour of semantic “1”. No cost based optimiser decisions at
> all.
>
> {
>  val cachedTable = a.cache()
>  val b1 = cachedTable.select(…)
>  val b2 = cachedTable.foo().select(…)
>  val b3 = cachedTable.bar().select(...)
>  val c1 = a.select(…)
>  val c2 = a.foo().select(…)
>  val c3 = a.bar().select(...)
> }
>
> All b1, b2 and b3 are reading from the cache, while c1, c2 and c3 are
> re-executing the whole plan for “a”.
>
> In the future we could discuss going one step further, introducing some
> global optimisation (that can be manually enabled/disabled): deduplicate
> plan nodes/deduplicate sub queries/re-use sub queries results/or whatever
> we could call it. It could do two things:
>
> 1. Automatically try to deduplicate fragments of the plan and share the
> result using CachedTable - in other words, automatically insert `CachedTable
> cache()` calls (sketched after this list).
> 2. Automatically make decision to bypass explicit `CachedTable` access
> (this would be the equivalent of what you described as “semantic 3”).
>
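> For example, a sketch of point 1 (the rewrite would be performed by the
> optimiser, not written by the user; the names are illustrative):
>
> // user program: two queries share the sub-plan of `a`
> val b = a.select(...)
> val c = a.where(...)
> // automatically rewritten by the optimiser into the equivalent of:
> val cached = a.cache()
> val b = cached.select(...)
> val c = cached.where(...)
>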
> However as I wrote previously, I have big doubts if such cost-based
> optimisation would work (this applies also to “Semantic 2”). I would expect
> it to do more harm than good in so many cases, that it wouldn’t make sense.
> Even assuming that we calculate statistics perfectly (this ain’t gonna
> happen), it’s virtually impossible to correctly estimate correct exchange
> rate of CPU cycles vs IO operations as it is changing so much from
> deployment to deployment.
>
> Is this the core of our disagreement here? That you would like this
> “cache()” to be mostly a hint for the optimiser?
>
> Piotrek
>
> > On 11 Dec 2018, at 06:00, Becket Qin <be...@gmail.com> wrote:
> >
> > Another potential concern for semantic 3 is that, in the future, we may
> > add automatic caching to Flink, e.g. caching the intermediate results at
> > the shuffle boundary. If our semantics are that a reference to the
> > original table means skipping the cache, those users may not be able to
> > benefit from the implicit cache.
> >
> > On Tue, Dec 11, 2018 at 12:10 PM Becket Qin <be...@gmail.com> wrote:
> >
> >> Hi Piotrek,
> >>
> >> Thanks for the reply. Thought about it again, I might have misunderstood
> >> your proposal in earlier emails. Returning a CachedTable might not be a
> >> bad idea.
> >>
> >> I was more concerned about the semantics and intuitiveness when a
> >> CachedTable is returned, i.e., if cache() returns a CachedTable, what are
> >> the semantics of the following code:
> >> {
> >>  val cachedTable = a.cache()
> >>  val b = cachedTable.select(...)
> >>  val c = a.select(...)
> >> }
> >> What is the difference between b and c? At first glance, I see two
> >> options:
> >>
> >> Semantic 1. b uses cachedTable, as the user demanded. c uses the original
> >> DAG, as the user demanded. In this case, the optimizer has no chance to
> >> optimize.
> >> Semantic 2. b uses cachedTable, as the user demanded. c leaves the
> >> optimizer to choose whether the cache or DAG should be used. In this case,
> >> the user loses the option to NOT use the cache.
> >>
> >> As you can see, neither of the options seems perfect. However, I guess you
> >> and Till are proposing the third option:
> >>
> >> Semantic 3. b leaves the optimizer to choose whether the cache or DAG
> >> should be used. c always uses the DAG.
> >>
> >> This does address all the concerns. It is just that, from an intuitiveness
> >> perspective, I found that asking the user to explicitly use a CachedTable
> >> while the optimizer might choose to ignore it is a little weird. That was
> >> why I did not think about that semantic. But given there is material
> >> benefit, I think this semantic is acceptable.
> >>
> >>> 1. If we want to let the optimiser make decisions whether to use the
> >>> cache or not, then why do we need a “void cache()” method at all? Would
> >>> it “increase” the chance of using the cache? That sounds strange. What
> >>> would be the mechanism of deciding whether to use the cache or not? If we
> >>> want to introduce such kind of automated optimisations of “plan nodes
> >>> deduplication” I would turn it on globally, not per table, and let the
> >>> optimiser do all of the work.
> >>> 2. We do not have statistics at the moment for any use/not use cache
> >>> decision.
> >>> 3. Even if we had, I would be veeerryy sceptical whether such cost based
> >>> optimisations would work properly and I would still insist first on
> >>> providing an explicit caching mechanism (`CachedTable cache()`)
> >>>
> >> We are absolutely on the same page here. An explicit cache() method is
> >> necessary not only because the optimizer may not be able to make the
> >> right decision, but also because of the nature of interactive
> >> programming. For example, if users write the following code in a Scala
> >> shell:
> >>  val b = a.select(...)
> >>  val c = b.select(...)
> >>  val d = c.select(...).writeToSink(...)
> >>  tEnv.execute()
> >> There is no way the optimizer will know whether b or c will be used in
> >> later code, unless users hint explicitly.
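> >>
> >> To sketch the explicit hint in the same shell session (cache() as
> >> proposed in this thread, not an existing API):
> >>  val b = a.select(...)
> >>  b.cache()  // explicit hint: b will be referenced again later
> >>  val d = b.select(...).writeToSink(...)
> >>  tEnv.execute()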
> >>
> >>> At the same time I’m not sure if you have responded to our objections of
> >>> `void cache()` being implicit/having side effects, which me, Jark,
> >>> Fabian, Till and I think also Shaoxuan are supporting.
> >>
> >> Are there any other side effects if we use semantic 3 mentioned above?
> >>
> >> Thanks,
> >>
> >> JIangjie (Becket) Qin
> >>
> >>
> >> On Mon, Dec 10, 2018 at 7:54 PM Piotr Nowojski <piotr@data-artisans.com> wrote:
> >>
> >>> Hi Becket,
> >>>
> >>> Sorry for not responding for a long time.
> >>>
> >>> Regarding case 1:
> >>>
> >>> There wouldn’t be an “a.unCache()” method; I would expect only
> >>> `cachedTableA1.dropCache()`. Dropping `cachedTableA1` wouldn’t affect
> >>> `cachedTableA2`. Just as in any other database, dropping/modifying one
> >>> independent table/materialised view does not affect others.
> >>>
> >>>> What I meant is that assuming there is already a cached table, ideally
> >>>> users need not specify whether the next query should read from the cache
> >>>> or use the original DAG. This should be decided by the optimizer.
> >>>
> >>> 1. If we want to let the optimiser make decisions whether to use the
> >>> cache or not, then why do we need a “void cache()” method at all? Would
> >>> it “increase” the chance of using the cache? That sounds strange. What
> >>> would be the mechanism of deciding whether to use the cache or not? If we
> >>> want to introduce such kind of automated optimisations of “plan nodes
> >>> deduplication” I would turn it on globally, not per table, and let the
> >>> optimiser do all of the work.
> >>> 2. We do not have statistics at the moment for any use/not use cache
> >>> decision.
> >>> 3. Even if we had, I would be veeerryy sceptical whether such cost based
> >>> optimisations would work properly and I would still insist first on
> >>> providing an explicit caching mechanism (`CachedTable cache()`)
> >>> 4. As Till wrote, having an explicit `CachedTable cache()` doesn’t
> >>> contradict future work on automated cost based caching.
> >>>
> >>>
> >>> At the same time I’m not sure if you have responded to our objections of
> >>> `void cache()` being implicit/having side effects, which me, Jark,
> >>> Fabian, Till and I think also Shaoxuan are supporting.
> >>>
> >>> Piotrek
> >>>
> >>>> On 5 Dec 2018, at 12:42, Becket Qin <be...@gmail.com> wrote:
> >>>>
> >>>> Hi Till,
> >>>>
> >>>> It is true that after the first job submission, there will be no
> >>>> ambiguity in terms of whether a cached table is used or not. That is the
> >>>> same for a cache() without returning a CachedTable.
> >>>>
> >>>>> Conceptually one could think of cache() as introducing a caching
> >>>>> operator from which you need to consume if you want to benefit from the
> >>>>> caching functionality.
> >>>>
> >>>> I am thinking a little differently. I think it is a hint (as you
> >>>> mentioned later) instead of a new operator. I'd like to be careful about
> >>>> the semantics of the API. A hint is a property set on an existing
> >>>> operator, but is not itself an operator, as it does not really
> >>>> manipulate the data.
> >>>>
> >>>>> I agree, ideally the optimizer makes this kind of decision which
> >>>>> intermediate result should be cached. But especially when executing
> >>>>> ad-hoc queries the user might better know which results need to be
> >>>>> cached because Flink might not see the full DAG. In that sense, I would
> >>>>> consider the cache() method as a hint for the optimizer. Of course, in
> >>>>> the future we might add functionality which tries to automatically
> >>>>> cache results (e.g. caching the latest intermediate results until so
> >>>>> and so much space is used). But this should hopefully not contradict
> >>>>> with `CachedTable cache()`.
> >>>>
> >>>> I agree that the cache() method is needed for exactly the reason you
> >>>> mentioned, i.e. Flink cannot predict what users are going to write
> >>>> later, so users need to tell Flink explicitly that this table will be
> >>>> used later. What I meant is that assuming there is already a cached
> >>>> table, ideally users need not specify whether the next query should read
> >>>> from the cache or use the original DAG. This should be decided by the
> >>>> optimizer.
> >>>>
> >>>> To explain the difference between returning / not returning a
> >>>> CachedTable, I want to compare the following two cases:
> >>>>
> >>>> *Case 1: returning a CachedTable*
> >>>> b = a.map(...)
> >>>> val cachedTableA1 = a.cache()
> >>>> val cachedTableA2 = a.cache()
> >>>> b.print() // Just to make sure a is cached.
> >>>>
> >>>> c = a.filter(...) // User specifies that the original DAG is used? Or
> >>>> the optimizer decides whether the DAG or cache should be used?
> >>>> d = cachedTableA1.filter() // User specifies that the cached table is used.
> >>>>
> >>>> a.unCache() // Can cachedTableA still be used afterwards?
> >>>> cachedTableA1.uncache() // Can cachedTableA2 still be used?
> >>>>
> >>>> *Case 2: not returning a CachedTable*
> >>>> b = a.map()
> >>>> a.cache()
> >>>> a.cache() // no-op
> >>>> b.print() // Just to make sure a is cached
> >>>>
> >>>> c = a.filter(...) // Optimizer decides whether the cache or DAG should be used
> >>>> d = a.filter(...) // Optimizer decides whether the cache or DAG should be used
> >>>>
> >>>> a.unCache()
> >>>> a.unCache() // no-op
> >>>>
> >>>> In case 1, semantics-wise, the optimizer loses the option to choose
> >>>> between the DAG and the cache. And the unCache() call becomes tricky.
> >>>> In case 2, users do not need to worry about whether the cache or DAG is
> >>>> used. And the unCache() semantics are clear. However, the caveat is that
> >>>> users cannot explicitly ignore the cache.
> >>>>
> >>>> In order to address the issues mentioned in case 2, and inspired by the
> >>>> discussion so far, I am thinking about using a hint to allow users to
> >>>> explicitly ignore the cache. Although we do not have hints yet, we
> >>>> probably should have them. So the code becomes:
> >>>>
> >>>> *Case 3: returning this table*
> >>>> b = a.map()
> >>>> a.cache()
> >>>> a.cache() // no-op
> >>>> b.print() // Just to make sure a is cached
> >>>>
> >>>> c = a.filter(...) // Optimizer decides whether the cache or DAG should be used
> >>>> d = a.hint("ignoreCache").filter(...) // DAG will be used instead of the cache.
> >>>>
> >>>> a.unCache()
> >>>> a.unCache() // no-op
> >>>>
> >>>> We could also let cache() return this table to allow chained method
> >>>> calls, for example:
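> >>>>
> >>>>  b = a.cache().map(...)  // a sketch: cache() returns this, so calls can be chained
> >>>>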
> >>>> Do you think this API addresses the concerns?
> >>>>
> >>>> Thanks,
> >>>>
> >>>> Jiangjie (Becket) Qin
> >>>>
> >>>>
> >>>> On Wed, Dec 5, 2018 at 10:55 AM Jark Wu <im...@gmail.com> wrote:
> >>>>
> >>>>> Hi,
> >>>>>
> >>>>> All the recent discussions are focused on whether there is a problem
> >>>>> if cache() does not return a Table.
> >>>>> It seems that returning a Table explicitly is more clear (and safe?).
> >>>>>
> >>>>> So are there any problems if cache() returns a Table? @Becket
> >>>>>
> >>>>> Best,
> >>>>> Jark
> >>>>>
> >>>>> On Tue, 4 Dec 2018 at 22:27, Till Rohrmann <tr...@apache.org> wrote:
> >>>>>
> >>>>>> It's true that b, c, d and e will all read from the original DAG that
> >>>>>> generates a. But all subsequent operators (when running multiple
> >>>>>> queries) which reference cachedTableA should not need to reproduce `a`
> >>>>>> but directly consume the intermediate result.
> >>>>>>
> >>>>>> Conceptually one could think of cache() as introducing a caching
> >>>>>> operator from which you need to consume if you want to benefit from
> >>>>>> the caching functionality.
> >>>>>>
> >>>>>> I agree, ideally the optimizer makes this kind of decision which
> >>>>>> intermediate result should be cached. But especially when executing
> >>>>>> ad-hoc queries the user might better know which results need to be
> >>>>>> cached because Flink might not see the full DAG. In that sense, I
> >>>>>> would consider the cache() method as a hint for the optimizer. Of
> >>>>>> course, in the future we might add functionality which tries to
> >>>>>> automatically cache results (e.g. caching the latest intermediate
> >>>>>> results until so and so much space is used). But this should hopefully
> >>>>>> not contradict with `CachedTable cache()`.
> >>>>>>
> >>>>>> Cheers,
> >>>>>> Till
> >>>>>>
> >>>>>> On Tue, Dec 4, 2018 at 2:33 PM Becket Qin <be...@gmail.com> wrote:
> >>>>>>
> >>>>>>> Hi Till,
> >>>>>>>
> >>>>>>> Thanks for the clarification. I am still a little confused.
> >>>>>>>
> >>>>>>> If cache() returns a CachedTable, the example might become:
> >>>>>>>
> >>>>>>> b = a.map(...)
> >>>>>>> c = a.map(...)
> >>>>>>>
> >>>>>>> cachedTableA = a.cache()
> >>>>>>> d = cachedTableA.map(...)
> >>>>>>> e = a.map()
> >>>>>>>
> >>>>>>> In the above case, if cache() is lazily evaluated, b, c, d and e are
> >>>>>>> all going to be reading from the original DAG that generates a. But
> >>>>>>> with a naive expectation, d should be reading from the cache. This
> >>>>>>> does not seem to solve the potential confusion you raised, right?
> >>>>>>>
> >>>>>>> Just to be clear, my understanding is all based on the assumption
> >>>>>>> that the tables are immutable. Therefore, after a.cache(), the
> >>>>>>> *cachedTableA* and the original table *a* should be completely
> >>>>>>> interchangeable.
> >>>>>>>
> >>>>>>> That said, I think a valid argument is optimization. There are indeed
> >>>>>>> cases where reading from the original DAG could be faster than
> >>>>>>> reading from the cache. For example:
> >>>>>>>
> >>>>>>> a = a.filter('f1 > 100)
> >>>>>>> a.cache()
> >>>>>>> b = a.filter('f1 < 100)
> >>>>>>>
> >>>>>>> Ideally the optimizer should be intelligent enough to decide which
> >>>>>>> way is faster, without user intervention. In this case, it will
> >>>>>>> identify that b would just be an empty table, and thus skip reading
> >>>>>>> from the cache completely.
> >>>>>>> But I agree that returning a CachedTable would give the user control
> >>>>>>> of when to use the cache, even though I still feel that letting the
> >>>>>>> optimizer handle this is a better option in the long run.
> >>>>>>>
> >>>>>>> Thanks,
> >>>>>>>
> >>>>>>> Jiangjie (Becket) Qin
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> On Tue, Dec 4, 2018 at 6:51 PM Till Rohrmann <trohrmann@apache.org> wrote:
> >>>>>>>
> >>>>>>>> Yes you are right Becket that it still depends on the actual
> >>>>>>>> execution of the job whether a consumer reads from a cached result
> >>>>>>>> or not.
> >>>>>>>>
> >>>>>>>> My point was actually about the properties of a (cached vs.
> >>>>>>>> non-cached) and not about the execution. I would not make cache
> >>>>>>>> trigger the execution of the job because one loses some flexibility
> >>>>>>>> by eagerly triggering the execution.
> >>>>>>>>
> >>>>>>>> I tried to argue for an explicit CachedTable which is returned by
> >>>>>>>> the cache() method, like Piotr did, in order to make the API more
> >>>>>>>> explicit.
> >>>>>>>>
> >>>>>>>> Cheers,
> >>>>>>>> Till
> >>>>>>>>
> >>>>>>>> On Mon, Dec 3, 2018 at 4:23 PM Becket Qin <be...@gmail.com> wrote:
> >>>>>>>>
> >>>>>>>>> Hi Till,
> >>>>>>>>>
> >>>>>>>>> That is a good example. Just a minor correction: in this case, b, c
> >>>>>>>>> and d will all consume from a non-cached a. This is because the
> >>>>>>>>> cache will only be created on the very first job submission that
> >>>>>>>>> generates the table to be cached.
> >>>>>>>>>
> >>>>>>>>> If I understand correctly, this example is about whether the
> >>>>>>>>> .cache() method should be eagerly evaluated or lazily evaluated. In
> >>>>>>>>> other words, if the cache() method actually triggers a job that
> >>>>>>>>> creates the cache, there will be no such confusion. Is that right?
> >>>>>>>>>
> >>>>>>>>> In the example, although d will not consume from the cached Table
> >>>>>>>>> while it looks supposed to, from a correctness perspective the code
> >>>>>>>>> will still return the correct result, assuming that tables are
> >>>>>>>>> immutable.
> >>>>>>>>>
> >>>>>>>>> Personally I feel it is OK because users probably won't really
> >>>>>>>>> worry about whether the table is cached or not. And lazy caching
> >>>>>>>>> could avoid some unnecessary caching if a cached table is never
> >>>>>>>>> created in the user application. But I am not opposed to eager
> >>>>>>>>> evaluation of the cache.
> user
> >>>>>>>>> application. But I am not opposed to do eager evaluation of
> cache.
> >>>>>>>>>
> >>>>>>>>> Thanks,
> >>>>>>>>>
> >>>>>>>>> Jiangjie (Becket) Qin
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> On Mon, Dec 3, 2018 at 10:01 PM Till Rohrmann <trohrmann@apache.org> wrote:
> >>>>>>>>>
> >>>>>>>>>> Another argument for Piotr's point is that lazily changing
> >>>>>>>>>> properties of a node affects all downstream consumers but does not
> >>>>>>>>>> necessarily have to happen before these consumers are defined.
> >>>>>>>>>> From a user's perspective this can be quite confusing:
> >>>>>>>>>>
> >>>>>>>>>> b = a.map(...)
> >>>>>>>>>> c = a.map(...)
> >>>>>>>>>>
> >>>>>>>>>> a.cache()
> >>>>>>>>>> d = a.map(...)
> >>>>>>>>>>
> >>>>>>>>>> now b, c and d will consume from a cached operator. In this case,
> >>>>>>>>>> the user would most likely expect that only d reads from a cached
> >>>>>>>>>> result.
> >>>>>>>>>>
> >>>>>>>>>> Cheers,
> >>>>>>>>>> Till
> >>>>>>>>>>
> >>>>>>>>>> On Mon, Dec 3, 2018 at 11:32 AM Piotr Nowojski <piotr@data-artisans.com> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> Hey Shaoxuan and Becket,
> >>>>>>>>>>>
> >>>>>>>>>>>> Can you explain a bit more on what the side effects are? So far
> >>>>>>>>>>>> my understanding is that such side effects only exist if a table
> >>>>>>>>>>>> is mutable. Is that the case?
> >>>>>>>>>>>
> >>>>>>>>>>> Not only that. There are also performance implications, and those
> >>>>>>>>>>> are another implicit side effect of using `void cache()`. As I
> >>>>>>>>>>> wrote before, reading from the cache might not always be
> >>>>>>>>>>> desirable, thus it can cause performance degradation and I’m fine
> >>>>>>>>>>> with that - user's or optimiser’s choice. What I do not like is
> >>>>>>>>>>> that this implicit side effect can manifest in a completely
> >>>>>>>>>>> different part of the code, that wasn’t touched by a user while
> >>>>>>>>>>> he was adding the `void cache()` call somewhere else. And even if
> >>>>>>>>>>> caching improves performance, it’s still a side effect of `void
> >>>>>>>>>>> cache()`. Almost by definition `void` methods have only side
> >>>>>>>>>>> effects. As I wrote before, there are a couple of scenarios where
> >>>>>>>>>>> this might be undesirable and/or unexpected, for example:
> >>>>>>>>>>>
> >>>>>>>>>>> 1.
> >>>>>>>>>>> Table b = …;
> >>>>>>>>>>> b.cache()
> >>>>>>>>>>> x = b.join(…)
> >>>>>>>>>>> y = b.count()
> >>>>>>>>>>> // ...
> >>>>>>>>>>> // 100
> >>>>>>>>>>> // hundred
> >>>>>>>>>>> // lines
> >>>>>>>>>>> // of
> >>>>>>>>>>> // code
> >>>>>>>>>>> // later
> >>>>>>>>>>> z = b.filter(…).groupBy(…) // this might even be hidden in a
> >>>>>>>>>>> // different method/file/package/dependency
> >>>>>>>>>>>
> >>>>>>>>>>> 2.
> >>>>>>>>>>>
> >>>>>>>>>>> Table b = ...
> >>>>>>>>>>> If (some_condition) {
> >>>>>>>>>>> foo(b)
> >>>>>>>>>>> }
> >>>>>>>>>>> Else {
> >>>>>>>>>>> bar(b)
> >>>>>>>>>>> }
> >>>>>>>>>>> z = b.filter(…).groupBy(…)
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> Void foo(Table b) {
> >>>>>>>>>>> b.cache()
> >>>>>>>>>>> // do something with b
> >>>>>>>>>>> }
> >>>>>>>>>>>
> >>>>>>>>>>> In both examples above, `b.cache()` will implicitly affect (the
> >>>>>>>>>>> semantics of the program in case of sources being mutable, and
> >>>>>>>>>>> performance) `z = b.filter(…).groupBy(…)`, which might be far
> >>>>>>>>>>> from obvious.
> >>>>>>>>>>>
> >>>>>>>>>>> On top of that, there is still this argument of mine that having
> >>>>>>>>>>> a `MaterializedTable` or `CachedTable` handle is more flexible
> >>>>>>>>>>> for us for the future and for the user (as a manual option to
> >>>>>>>>>>> bypass cache reads).
> >>>>>>>>>>>
> >>>>>>>>>>>> But Jiangjie is correct, the source table in batching should be
> >>>>>>>>>>>> immutable. It is the user’s responsibility to ensure it,
> >>>>>>>>>>>> otherwise even a regular failover may lead to inconsistent
> >>>>>>>>>>>> results.
> >>>>>>>>>>>
> >>>>>>>>>>> Yes, I agree that’s what a perfect world/good deployment should
> >>>>>>>>>>> be. But it often isn’t, and while I’m not trying to fix this
> >>>>>>>>>>> (since the proper fix is to support transactions), I’m just
> >>>>>>>>>>> trying to minimise confusion for the users that are not fully
> >>>>>>>>>>> aware of what’s going on and operate in a less than perfect
> >>>>>>>>>>> setup. And if something bites them after adding a `b.cache()`
> >>>>>>>>>>> call, to make sure that they at least know all of the places that
> >>>>>>>>>>> adding this line can affect.
> >>>>>>>>>>>
> >>>>>>>>>>> Thanks, Piotrek
> >>>>>>>>>>>
> >>>>>>>>>>>> On 1 Dec 2018, at 15:39, Becket Qin <be...@gmail.com> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>> Hi Piotrek,
> >>>>>>>>>>>>
> >>>>>>>>>>>> Thanks again for the clarification. Some more replies follow.
> >>>>>>>>>>>>
> >>>>>>>>>>>>> But keep in mind that `.cache()` will/might not only be used in
> >>>>>>>>>>>>> interactive programming and not only in batching.
> >>>>>>>>>>>>
> >>>>>>>>>>>> It is true. Actually in stream processing, cache() has the same
> >>>>>>>>>>>> semantics as in batch processing. The semantics are the
> >>>>>>>>>>>> following: for a table created via a series of computations,
> >>>>>>>>>>>> save that table for later reference to avoid running the
> >>>>>>>>>>>> computation logic to regenerate the table. Once the application
> >>>>>>>>>>>> exits, drop all the cache.
> >>>>>>>>>>>> This semantic is the same for both batch and stream processing.
> >>>>>>>>>>>> The difference is that stream applications will only run once,
> >>>>>>>>>>>> as they are long running. And batch applications may be run
> >>>>>>>>>>>> multiple times, hence the cache may be created and dropped each
> >>>>>>>>>>>> time the application runs.
> >>>>>>>>>>>> Admittedly, there will probably be some resource management
> >>>>>>>>>>>> requirements for the streaming cached table, such as time based
> >>>>>>>>>>>> / size based retention, to address the infinite data issue. But
> >>>>>>>>>>>> such requirements do not change the semantics.
> >>>>>>>>>>>> You are right that interactive programming is just one use case
> >>>>>>>>>>>> of cache(). It is not the only use case.
> >>>>>>>>>>>>
> >>>>>>>>>>>>> For me the more important issue is that of not having the `void
> >>>>>>>>>>>>> cache()` with side effects.
> >>>>>>>>>>>>
> >>>>>>>>>>>> This is indeed the key point. The argument around whether
> >>>>>>>>>>>> cache() should return something already indicates that cache()
> >>>>>>>>>>>> and materialize() address different issues.
> >>>>>>>>>>>> Can you explain a bit more on what the side effects are? So far
> >>>>>>>>>>>> my understanding is that such side effects only exist if a table
> >>>>>>>>>>>> is mutable. Is that the case?
> >>>>>>>>>>>>
> >>>>>>>>>>>>> I don’t know, probably initially we should make CachedTable
> >>>>>>>>>>>>> read-only. I don’t find it more confusing than the fact that a
> >>>>>>>>>>>>> user can not write to views or materialised views in SQL, or
> >>>>>>>>>>>>> that a user currently can not write to a Table.
> >>>>>>>>>>>>
> >>>>>>>>>>>> I don't think anyone should insert something into a cache. By
> >>>>>>>>>>>> definition the cache should only be updated when the
> >>>>>>>>>>>> corresponding original table is updated. What I am wondering is
> >>>>>>>>>>>> that, given the following two facts:
> >>>>>>>>>>>> 1. If and only if a table is mutable (with something like
> >>>>>>>>>>>> insert()), a CachedTable may have implicit behavior.
> >>>>>>>>>>>> 2. A CachedTable extends a Table.
> >>>>>>>>>>>> We can come to the conclusion that a CachedTable is mutable and
> >>>>>>>>>>>> users can insert into the CachedTable directly. This is where I
> >>>>>>>>>>>> found it confusing.
> >>>>>>>>>>>>
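> >>>>>>>>>>>> To put that in code (hypothetical signatures, purely to
> >>>>>>>>>>>> illustrate the confusion):
> >>>>>>>>>>>>   val cached: CachedTable = a.cache()
> >>>>>>>>>>>>   cached.insertInto("sink") // type-checks if CachedTable extends
> >>>>>>>>>>>>                             // a writable Table
> >>>>>>>>>>>>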
> >>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>
> >>>>>>>>>>>> Jiangjie (Becket) Qin
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Sat, Dec 1, 2018 at 2:45 AM Piotr Nowojski <piotr@data-artisans.com> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>>> Hi all,
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Regarding naming `cache()` vs `materialize()`. One more
> >>>>>>>>>>>>> explanation of why `materialize()` is more natural to me is
> >>>>>>>>>>>>> that I think of all “Table”s in the Table API as views. They
> >>>>>>>>>>>>> behave the same way as SQL views; the only difference for me is
> >>>>>>>>>>>>> that their live scope is short - the current session - which is
> >>>>>>>>>>>>> limited by a different execution model. That’s why “caching” a
> >>>>>>>>>>>>> view for me is just materialising it.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> However I see and I understand your point of view. Coming from
> >>>>>>>>>>>>> DataSet/DataStream and, generally speaking, the non-SQL world,
> >>>>>>>>>>>>> `cache()` is more natural. But keep in mind that `.cache()`
> >>>>>>>>>>>>> will/might not only be used in interactive programming and not
> >>>>>>>>>>>>> only in batching. But naming is one issue, and not that
> >>>>>>>>>>>>> critical to me. Especially that once we implement proper
> >>>>>>>>>>>>> materialised views, we can always deprecate/rename `cache()` if
> >>>>>>>>>>>>> we deem so.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> For me the more important issue is that of not having the `void
> >>>>>>>>>>>>> cache()` with side effects. Exactly for the reasons that you
> >>>>>>>>>>>>> have mentioned. True: results might be non-deterministic if the
> >>>>>>>>>>>>> underlying source tables are changing. The problem is that
> >>>>>>>>>>>>> `void cache()` implicitly changes the semantics of subsequent
> >>>>>>>>>>>>> uses of the cached/materialized Table. It can cause a “wtf”
> >>>>>>>>>>>>> moment for a user if he inserts a “b.cache()” call in some
> >>>>>>>>>>>>> place in his code and suddenly some other random places are
> >>>>>>>>>>>>> behaving differently. If `materialize()` or `cache()` returns a
> >>>>>>>>>>>>> Table handle, we force the user to explicitly use the cache,
> >>>>>>>>>>>>> which removes the “random” part from the "suddenly some other
> >>>>>>>>>>>>> random places are behaving differently”.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> This argument and others that I’ve raised (greater
> >>>>>>>>>>>>> flexibility/allowing the user to explicitly bypass the cache)
> >>>>>>>>>>>>> are independent of the `cache()` vs `materialize()` discussion.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> Does that mean one can also insert into the CachedTable? This
> >>>>>>>>>>>>>> sounds pretty confusing.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> I don’t know, probably initially we should make CachedTable
> >>>>>>>>>>>>> read-only. I don’t find it more confusing than the fact that a
> >>>>>>>>>>>>> user can not write to views or materialised views in SQL, or
> >>>>>>>>>>>>> that a user currently can not write to a Table.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Piotrek
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> On 30 Nov 2018, at 17:38, Xingcan Cui <xi...@gmail.com> wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Hi all,
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> I agree with @Becket that `cache()` and `materialize()` should
> >>>>>>>>>>>>>> be considered as two different methods, where the latter is
> >>>>>>>>>>>>>> more sophisticated.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> According to my understanding, the initial idea is just to
> >>>>>>>>>>>>>> introduce a simple cache or persist mechanism, but as the
> >>>>>>>>>>>>>> TableAPI is a high-level API, it’s natural for us to think in
> >>>>>>>>>>>>>> a SQL way.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Maybe we can add the `cache()` method to the DataSet API and
> >>>>>>>>>>>>>> force users to translate a Table to a DataSet before caching
> >>>>>>>>>>>>>> it. Then the users should manually register the cached dataset
> >>>>>>>>>>>>>> as a table again (we may need some table replacement
> >>>>>>>>>>>>>> mechanisms for datasets with an identical schema but different
> >>>>>>>>>>>>>> contents here). After all, it’s the dataset rather than the
> >>>>>>>>>>>>>> dynamic table that needs to be cached, right?
> >>>>>>>>>>>>>>
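> >>>>>>>>>>>>>> A rough sketch of that flow (DataSet.cache() is hypothetical
> >>>>>>>>>>>>>> and does not exist today; the other calls are existing
> >>>>>>>>>>>>>> BatchTableEnvironment methods):
> >>>>>>>>>>>>>>   val ds: DataSet[Row] = tEnv.toDataSet[Row](t)
> >>>>>>>>>>>>>>   ds.cache() // hypothetical
> >>>>>>>>>>>>>>   tEnv.registerDataSet("cachedT", ds)
> >>>>>>>>>>>>>>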
> >>>>>>>>>>>>>> Best,
> >>>>>>>>>>>>>> Xingcan
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> On Nov 30, 2018, at 10:57 AM, Becket Qin <becket.qin@gmail.com> wrote:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Hi Piotrek and Jark,
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Thanks for the feedback and explanation. Those are good
> >>>>>>>>>>>>>>> arguments. But I think those arguments are mostly about
> >>>>>>>>>>>>>>> materialized views. Let me try to explain the reason I
> >>>>>>>>>>>>>>> believe cache() and materialize() are different.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> I think cache() and materialize() have quite different
> >>>>>>>>>>>>>>> implications. An analogy I can think of is save()/publish().
> >>>>>>>>>>>>>>> When users call cache(), it is just like they are saving an
> >>>>>>>>>>>>>>> intermediate result as a draft of their work; this
> >>>>>>>>>>>>>>> intermediate result may not have any realistic meaning.
> >>>>>>>>>>>>>>> Calling cache() does not mean users want to publish the
> >>>>>>>>>>>>>>> cached table in any manner. But when users call
> >>>>>>>>>>>>>>> materialize(), that means "I have something meaningful to be
> >>>>>>>>>>>>>>> reused by others", and now users need to think about the
> >>>>>>>>>>>>>>> validation, update & versioning, lifecycle of the result,
> >>>>>>>>>>>>>>> etc.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Piotrek's suggestions on variations of the materialize()
> >>>>>>>>>>>>>>> methods are very useful. It would be great if Flink had them.
> >>>>>>>>>>>>>>> The concept of materialized views is actually a pretty big
> >>>>>>>>>>>>>>> feature, not to mention the related stuff like triggers/hooks
> >>>>>>>>>>>>>>> you mentioned earlier. I think the materialized view itself
> >>>>>>>>>>>>>>> should be discussed in a more thorough and systematic manner.
> >>>>>>>>>>>>>>> And I found that discussion is kind of orthogonal to, and way
> >>>>>>>>>>>>>>> beyond, the interactive programming experience.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> The example you gave was interesting. I still have some
> >>>>>>>>>>>>>>> questions, though.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Table source = … // some source that scans files from a
> >>>>>>>>>>>>>>>> directory “/foo/bar/“
> >>>>>>>>>>>>>>>> Table t1 = source.groupBy(…).select(…).where(…) ….;
> >>>>>>>>>>>>>>>> Table t2 = t1.materialize() // (or `cache()`)
> >>>>>>>>>>>>>>>> t2.count() // initialise cache (if it’s lazily initialised)
> >>>>>>>>>>>>>>>> int a1 = t1.count()
> >>>>>>>>>>>>>>>> int b1 = t2.count()
> >>>>>>>>>>>>>>>> // something in the background (or we trigger it) writes new
> >>>>>>>>>>>>>>>> files to /foo/bar
> >>>>>>>>>>>>>>>> int a2 = t1.count()
> >>>>>>>>>>>>>>>> int b2 = t2.count()
> >>>>>>>>>>>>>>>> t2.refresh() // possible future extension, not to be
> >>>>>>>>>>>>>>>> implemented in the initial version
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> what if someone else added some more files to /foo/bar at
> >>>>>>>>>>>>>>> this point? In that case, a3 won't equal b3, and the result
> >>>>>>>>>>>>>>> becomes non-deterministic, right?
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> int a3 = t1.count()
> >>>>>>>>>>>>>>>> int b3 = t2.count()
> >>>>>>>>>>>>>>>> t2.drop() // another possible future extension, manual
> >>>>>>>>>>>>>>>> “cache” dropping
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> When we talk about interactive programming, in most cases, we
> >>>>>>>>>>>>>>> are talking about batch applications. A fundamental
> >>>>>>>>>>>>>>> assumption of such a case is that the source data is complete
> >>>>>>>>>>>>>>> before the data processing begins, and the data will not
> >>>>>>>>>>>>>>> change during the data processing. IMO, if additional rows
> >>>>>>>>>>>>>>> need to be added to some source during the processing, it
> >>>>>>>>>>>>>>> should be done in ways like unioning the source with another
> >>>>>>>>>>>>>>> table containing the rows to be added.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> There are a few cases where computations are executed
> >>>>>>>>>>>>>>> repeatedly on a changing data source.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> For example, people may run an ML training job every hour
> >>>>>>>>>>>>>>> with the samples newly added in the past hour. In that case,
> >>>>>>>>>>>>>>> the source data between runs will indeed change. But still,
> >>>>>>>>>>>>>>> the data remains unchanged within one run. And usually in
> >>>>>>>>>>>>>>> that case, the result will need versioning, i.e. for a given
> >>>>>>>>>>>>>>> result, it tells that the result is derived from the source
> >>>>>>>>>>>>>>> data as of a certain timestamp.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Another example is something like a data warehouse. In this
> >>>>>>>>>>>>>>> case, there are a few sources of original/raw data. On top of
> >>>>>>>>>>>>>>> those sources, many materialized views / queries / reports /
> >>>>>>>>>>>>>>> dashboards can be created to generate derived data. Those
> >>>>>>>>>>>>>>> derived data need to be updated when the underlying original
> >>>>>>>>>>>>>>> data changes. In that case, the processing logic that derives
> >>>>>>>>>>>>>>> data from the original data needs to be executed repeatedly
> >>>>>>>>>>>>>>> to update those reports/views. Again, all those derived data
> >>>>>>>>>>>>>>> also need to have version management, such as timestamps.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> In either of the above two cases, during a single run of the
> >>>>>>>>>>>>>>> processing logic, the data cannot change. Otherwise the
> >>>>>>>>>>>>>>> behavior of the processing logic may be undefined. In the
> >>>>>>>>>>>>>>> above two examples, when writing the processing logic, users
> >>>>>>>>>>>>>>> can use .cache() to hint Flink that those results should be
> >>>>>>>>>>>>>>> saved to avoid repeated computation. And then for the result
> >>>>>>>>>>>>>>> of my application logic, I'll call materialize(), so that
> >>>>>>>>>>>>>>> these results can be managed by the system with versioning,
> >>>>>>>>>>>>>>> metadata management, lifecycle management, ACLs, etc.
> >>>>>>>>>>>>>>>
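> >>>>>>>>>>>>>>> A sketch of that division of labor (names are placeholders;
> >>>>>>>>>>>>>>> cache() and materialize() as discussed in this thread, not
> >>>>>>>>>>>>>>> existing APIs):
> >>>>>>>>>>>>>>>   val samples = rawData.groupBy(...).select(...)
> >>>>>>>>>>>>>>>   samples.cache()      // session-scoped draft, reused within this run
> >>>>>>>>>>>>>>>   val report = samples.join(other).select(...)
> >>>>>>>>>>>>>>>   report.materialize() // published result: versioned and managed by the system
> >>>>>>>>>>>>>>>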
> >>>>>>>>>>>>>>> It is true we can use materialize() to do the cache() job,
> >>>>>>>>>>>>>>> but I am really reluctant to shoehorn cache() into
> >>>>>>>>>>>>>>> materialize() and force users to worry about a bunch of
> >>>>>>>>>>>>>>> implications that they needn't have to. I am absolutely on
> >>>>>>>>>>>>>>> your side that a redundant API is bad. But it is equally
> >>>>>>>>>>>>>>> frustrating, if not more, that the same API does different
> >>>>>>>>>>>>>>> things.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Jiangjie (Becket) Qin
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> On Fri, Nov 30, 2018 at 10:34 PM Shaoxuan Wang <wshaoxuan@gmail.com> wrote:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Thanks Piotrek,
> >>>>>>>>>>>>>>>> You provided a very good example; it explains all the
> >>>>>>>>>>>>>>>> confusions I had. It is clear that there is something we
> >>>>>>>>>>>>>>>> have not considered in the initial proposal. We intended to
> >>>>>>>>>>>>>>>> force the user to reuse the cached/materialized table if its
> >>>>>>>>>>>>>>>> cache() method is executed. We did not expect that the user
> >>>>>>>>>>>>>>>> may want to re-execute the plan from the source table. Let
> >>>>>>>>>>>>>>>> me re-think about it and get back to you later.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> In the meanwhile, this example/observation also implies that
> >>>>>>>>>>>>>>>> we cannot fully involve the optimizer in deciding the plan
> >>>>>>>>>>>>>>>> if a cache/materialize is explicitly used, because whether
> >>>>>>>>>>>>>>>> to reuse the cached data or re-execute the query from the
> >>>>>>>>>>>>>>>> source data may lead to different results. (But I guess the
> >>>>>>>>>>>>>>>> optimizer can still help in some cases ---- as long as it
> >>>>>>>>>>>>>>>> does not re-execute from the varied source, we should be
> >>>>>>>>>>>>>>>> safe.)
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Regards,
> >>>>>>>>>>>>>>>> Shaoxuan
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> On Fri, Nov 30, 2018 at 9:13 PM Piotr Nowojski <piotr@data-artisans.com> wrote:
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Hi Shaoxuan,
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Re 2:
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Table t3 = methodThatAppliesOperators(t1) // t1 is modified to-> t1’
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> What do you mean by “t1 is modified to-> t1’”? That the
> >>>>>>>>>>>>>>>>> `methodThatAppliesOperators()` method has changed its plan?
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> I was thinking more about something like this:
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Table source = … // some source that scans files from a
> >>>>>>>>>>>>>>>>> directory “/foo/bar/“
> >>>>>>>>>>>>>>>>> Table t1 = source.groupBy(…).select(…).where(…) ….;
> >>>>>>>>>>>>>>>>> Table t2 = t1.materialize() // (or `cache()`)
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> t2.count() // initialise cache (if it’s lazily initialised)
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> int a1 = t1.count()
> >>>>>>>>>>>>>>>>> int b1 = t2.count()
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> // something in the background (or we trigger it) writes
> >>>>>>>>>>>>>>>>> new files to /foo/bar
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> int a2 = t1.count()
> >>>>>>>>>>>>>>>>> int b2 = t2.count()
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> t2.refresh() // possible future extension, not to be
> >>>>>>>>>>>>>>>>> implemented in the initial version
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> int a3 = t1.count()
> >>>>>>>>>>>>>>>>> int b3 = t2.count()
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> t2.drop() // another possible future extension, manual “cache” dropping
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> assertTrue(a1 == b1) // same results, but b1 comes from the “cache"
> >>>>>>>>>>>>>>>>> assertTrue(b1 == b2) // both values come from the same cache
> >>>>>>>>>>>>>>>>> assertTrue(a2 > b2)  // b2 comes from the cache; a2 re-executed a
> >>>>>>>>>>>>>>>>> full table scan and has more data
> >>>>>>>>>>>>>>>>> assertTrue(b3 > b2)  // b3 comes from the refreshed cache
> >>>>>>>>>>>>>>>>> assertTrue(b3 == a2 == a3)
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Piotrek
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> On 30 Nov 2018, at 10:22, Jark Wu <im...@gmail.com>
> >>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Hi,
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> It is an very interesting and useful design!
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Here I want to share some of my thoughts:
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> 1. Agree that the cache() method should return some
> >>>>>> Table
> >>>>>>> to
> >>>>>>>>>> avoid
> >>>>>>>>>>>>>>>> some
> >>>>>>>>>>>>>>>>>> unexpected problems because of the mutable object.
> >>>>>>>>>>>>>>>>>> All the existing methods of Table are returning a new
> >>>>>> Table
> >>>>>>>>>>> instance.
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> 2. I think materialize() would be more consistent with
> >>>>>> SQL,
> >>>>>>>>> this
> >>>>>>>>>>>>> makes
> >>>>>>>>>>>>>>>> it
> >>>>>>>>>>>>>>>>>> possible to support the same feature for SQL
> >>>>> (materialized
> >>>>>>>> view)
> >>>>>>>>>> and
> >>>>>>>>>>>>>>>> keep
> >>>>>>>>>>>>>>>>>> the same API for users in the future.
> >>>>>>>>>>>>>>>>>> But I'm also fine if we choose cache().
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> 3. In the proposal, a TableService (or FlinkService?)
> >>>>> is
> >>>>>>> used
> >>>>>>>>> to
> >>>>>>>>>>>>> cache
> >>>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>> result of the (intermediate) table.
> >>>>>>>>>>>>>>>>>> But the name of TableService may be a bit general, which
> >>>>>>>>>>>>>>>>>> is not easy to understand correctly at first glance (a
> >>>>>>>>>>>>>>>>>> metastore for tables?).
> >>>>>>>>>>>>>>>>>> Maybe a more specific name would be better, such as
> >>>>>>>>>>>>>>>>>> TableCacheService or TableMaterializeService or something
> >>>>>>>>>>>>>>>>>> else.
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Best,
> >>>>>>>>>>>>>>>>>> Jark
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> On Thu, 29 Nov 2018 at 21:16, Fabian Hueske <
> >>>>>>>> fhueske@gmail.com
> >>>>>>>>>>
> >>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Hi,
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Thanks for the clarification Becket!
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> I have a few thoughts to share / questions:
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> 1) I'd like to know how you plan to implement the
> >>>>>> feature
> >>>>>>>> on a
> >>>>>>>>>>> plan
> >>>>>>>>>>>>> /
> >>>>>>>>>>>>>>>>>>> planner level.
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> I would imagine the following to happen when
> >>>>>> Table.cache()
> >>>>>>>> is
> >>>>>>>>>>>>> called:
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> 1) immediately optimize the Table and internally
> >>>>> convert
> >>>>>>> it
> >>>>>>>>>> into a
> >>>>>>>>>>>>>>>>>>> DataSet/DataStream. This is necessary, to avoid that
> >>>>>>>> operators
> >>>>>>>>>> of
> >>>>>>>>>>>>>>>> later
> >>>>>>>>>>>>>>>>>>> queries on top of the Table are pushed down.
> >>>>>>>>>>>>>>>>>>> 2) register the DataSet/DataStream as a
> >>>>>>>>>> DataSet/DataStream-backed
> >>>>>>>>>>>>>>>> Table
> >>>>>>>>>>>>>>>>> X
> >>>>>>>>>>>>>>>>>>> 3) add a sink to the DataSet/DataStream. This is the
> >>>>>>>>>>> materialization
> >>>>>>>>>>>>>>>> of
> >>>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>> Table X
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Based on your proposal the following would happen:
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Table t1 = ....
> >>>>>>>>>>>>>>>>>>> t1.cache(); // cache() returns void. The logical plan
> >>>>> of
> >>>>>>> t1
> >>>>>>>> is
> >>>>>>>>>>>>>>>> replaced
> >>>>>>>>>>>>>>>>> by
> >>>>>>>>>>>>>>>>>>> a scan of X. There is also a reference to the
> >>>>>>>> materialization
> >>>>>>>>> of
> >>>>>>>>>>> X.
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> t1.count(); // this executes the program, including
> >>>>> the
> >>>>>>>>>>>>>>>>> DataSet/DataStream
> >>>>>>>>>>>>>>>>>>> that backs X and the sink that writes the
> >>>>>> materialization
> >>>>>>>> of X
> >>>>>>>>>>>>>>>>>>> t1.count(); // this executes the program, but reads X
> >>>>>> from
> >>>>>>>> the
> >>>>>>>>>>>>>>>>>>> materialization.
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> My question is, how do you determine when the
> >>>>>> scan
> >>>>>>>> of
> >>>>>>>>> t1
> >>>>>>>>>>>>>>>> should
> >>>>>>>>>>>>>>>>> go
> >>>>>>>>>>>>>>>>>>> against the DataSet/DataStream program and when
> >>>>> against
> >>>>>>> the
> >>>>>>>>>>>>>>>>>>> materialization?
> >>>>>>>>>>>>>>>>>>> AFAIK, there is no hook that will tell you that a part
> >>>>>> of
> >>>>>>>> the
> >>>>>>>>>>>>> program
> >>>>>>>>>>>>>>>>> was
> >>>>>>>>>>>>>>>>>>> executed. Flipping a switch during optimization or
> >>>>> plan
> >>>>>>>>>> generation
> >>>>>>>>>>>>> is
> >>>>>>>>>>>>>>>>> not
> >>>>>>>>>>>>>>>>>>> sufficient as there is no guarantee that the plan is
> >>>>>> also
> >>>>>>>>>>> executed.
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Overall, this behavior is somewhat similar to what I
> >>>>>>>> proposed
> >>>>>>>>> in
> >>>>>>>>>>>>>>>>>>> FLINK-8950, which does not include persisting the
> >>>>> table,
> >>>>>>> but
> >>>>>>>>>> just
> >>>>>>>>>>>>>>>>>>> optimizing and re-registering it as a DataSet/DataStream
> >>>>>>> scan.
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> 2) I think Piotr has a point about the implicit
> >>>>> behavior
> >>>>>>> and
> >>>>>>>>>> side
> >>>>>>>>>>>>>>>>> effects
> >>>>>>>>>>>>>>>>>>> of the cache() method if it does not return anything.
> >>>>>>>>>>>>>>>>>>> Consider the following example:
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Table t1 = ???
> >>>>>>>>>>>>>>>>>>> Table t2 = methodThatAppliesOperators(t1);
> >>>>>>>>>>>>>>>>>>> Table t3 = methodThatAppliesOtherOperators(t1);
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> In this case, the behavior/performance of the plan
> >>>>> that
> >>>>>>>>> results
> >>>>>>>>>>> from
> >>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>> second method call depends on whether t1 was modified
> >>>>> by
> >>>>>>> the
> >>>>>>>>>> first
> >>>>>>>>>>>>>>>>> method
> >>>>>>>>>>>>>>>>>>> or not.
> >>>>>>>>>>>>>>>>>>> This is the classic issue of mutable vs. immutable
> >>>>>>> objects.
> >>>>>>>>>>>>>>>>>>> Also, as Piotr pointed out, it might also be good to
> >>>>>> have
> >>>>>>>> the
> >>>>>>>>>>>>> original
> >>>>>>>>>>>>>>>>> plan
> >>>>>>>>>>>>>>>>>>> of t1, because in some cases it is possible to push
> >>>>>>> filters
> >>>>>>>>> down
> >>>>>>>>>>>>> such
> >>>>>>>>>>>>>>>>> that
> >>>>>>>>>>>>>>>>>>> evaluating the query from scratch might be more
> >>>>>> efficient
> >>>>>>>> than
> >>>>>>>>>>>>>>>> accessing
> >>>>>>>>>>>>>>>>>>> the cache.
> >>>>>>>>>>>>>>>>>>> Moreover, a CachedTable could extend Table and
> >>>>> offer a
> >>>>>>>>> method
> >>>>>>>>>>>>>>>>> refresh().
> >>>>>>>>>>>>>>>>>>> This sounds quite useful in an interactive session
> >>>>> mode.
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> 3) Regarding the name, I can see both arguments. IMO,
> >>>>>>>>>>> materialize()
> >>>>>>>>>>>>>>>>> seems
> >>>>>>>>>>>>>>>>>>> to be more future proof.
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Best, Fabian
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> On Thu, Nov 29, 2018 at 12:56 PM Shaoxuan Wang <
> >>>>>>>>>>>>>>>>>>> wshaoxuan@gmail.com> wrote:
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> Hi Piotr,
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> Thanks for sharing your ideas on the method naming.
> >>>>> We
> >>>>>>> will
> >>>>>>>>>> think
> >>>>>>>>>>>>>>>> about
> >>>>>>>>>>>>>>>>>>>> your suggestions. But I don't understand why we need
> >>>>> to
> >>>>>>>>> change
> >>>>>>>>>>> the
> >>>>>>>>>>>>>>>>> return
> >>>>>>>>>>>>>>>>>>>> type of cache().
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> Cache() is a physical operation, it does not change
> >>>>> the
> >>>>>>>> logic
> >>>>>>>>>> of
> >>>>>>>>>>>>>>>>>>>> the `Table`. On the tableAPI layer, we should not
> >>>>>>>> introduce a
> >>>>>>>>>> new
> >>>>>>>>>>>>>>>> table
> >>>>>>>>>>>>>>>>>>>> type unless the logic of table has been changed. If
> >>>>> we
> >>>>>>>>>> introduce
> >>>>>>>>>>> a
> >>>>>>>>>>>>>>>> new
> >>>>>>>>>>>>>>>>>>>> table type `CachedTable`, we need to create the same set
> >>>>>> of
> >>>>>>>>>> methods
> >>>>>>>>>>> of
> >>>>>>>>>>>>>>>>>>> `Table`
> >>>>>>>>>>>>>>>>>>>> for it. I don't think it is worth doing this. Or can
> >>>>>> you
> >>>>>>>>> please
> >>>>>>>>>>>>>>>>> elaborate
> >>>>>>>>>>>>>>>>>>>> more on what could be the "implicit behaviours/side
> >>>>>>>> effects"
> >>>>>>>>>> you
> >>>>>>>>>>>>> are
> >>>>>>>>>>>>>>>>>>>> thinking about?
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> Regards,
> >>>>>>>>>>>>>>>>>>>> Shaoxuan
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> On Thu, Nov 29, 2018 at 7:05 PM Piotr Nowojski <
> >>>>>>>>>>>>>>>>> piotr@data-artisans.com>
> >>>>>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> Hi Becket,
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> Thanks for the response.
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> 1. I wasn’t saying that materialised view must be
> >>>>>>> mutable
> >>>>>>>> or
> >>>>>>>>>>> not.
> >>>>>>>>>>>>>>>> The
> >>>>>>>>>>>>>>>>>>>> same
> >>>>>>>>>>>>>>>>>>>>> thing applies to caches as well. To the contrary, I
> >>>>>>> would
> >>>>>>>>>> expect
> >>>>>>>>>>>>>>>> more
> >>>>>>>>>>>>>>>>>>>>> consistency and updates from something that is
> >>>>> called
> >>>>>>>>> “cache”
> >>>>>>>>>> vs
> >>>>>>>>>>>>>>>>>>>> something
> >>>>>>>>>>>>>>>>>>>>> that’s a “materialised view”. In other words, IMO
> >>>>> most
> >>>>>>>>> caches
> >>>>>>>>>> do
> >>>>>>>>>>>>> not
> >>>>>>>>>>>>>>>>>>>> serve
> >>>>>>>>>>>>>>>>>>>>> you invalid/outdated data and they handle updates on
> >>>>>>> their
> >>>>>>>>>> own.
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> 2. I don’t think that having in the future two very
> >>>>>>>> similar
> >>>>>>>>>>>>> concepts
> >>>>>>>>>>>>>>>>> of
> >>>>>>>>>>>>>>>>>>>>> `materialized` view and `cache` is a good idea. It
> >>>>>> would
> >>>>>>>> be
> >>>>>>>>>>>>>>>> confusing
> >>>>>>>>>>>>>>>>>>> for
> >>>>>>>>>>>>>>>>>>>>> the users. I think it could be handled by
> >>>>>>>>>> variations/overloading
> >>>>>>>>>>>>> of
> >>>>>>>>>>>>>>>>>>>>> materialised view concept. We could start with:
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> `MaterializedTable materialize()` - immutable,
> >>>>> session
> >>>>>>>> life
> >>>>>>>>>>> scope
> >>>>>>>>>>>>>>>>>>>>> (basically the same semantics as you are proposing).
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> And then in the future (if ever) build on top of
> >>>>>>>> that/expand
> >>>>>>>>>> it
> >>>>>>>>>>>>>>>> with:
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> `MaterializedTable materialize(refreshTime=…)` or
> >>>>>>>>>>>>> `MaterializedTable
> >>>>>>>>>>>>>>>>>>>>> materialize(refreshHook=…)`
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> Or with cross session support:
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> `MaterializedTable materializeInto(connector=…)` or
> >>>>>>>>>>>>>>>> `MaterializedTable
> >>>>>>>>>>>>>>>>>>>>> materializeInto(tableFactory=…)`
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> I’m not saying that we should implement cross
> >>>>>>>>>> session/refreshing
> >>>>>>>>>>>>> now
> >>>>>>>>>>>>>>>>> or
> >>>>>>>>>>>>>>>>>>>>> even in the near future. I’m just arguing that
> >>>>> naming
> >>>>>>>>> current
> >>>>>>>>>>>>>>>>> immutable
> >>>>>>>>>>>>>>>>>>>>> session life scope method `materialize()` is more
> >>>>>> future
> >>>>>>>>> proof
> >>>>>>>>>>> and
> >>>>>>>>>>>>>>>>> more
> >>>>>>>>>>>>>>>>>>>>> consistent with SQL (on which, after all, the Table API
> >>>>>>>>>>>>>>>>>>>>> is heavily based).
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> 3. Even if we agree on naming it `cache()`, I would
> >>>>>>> still
> >>>>>>>>>> insist
> >>>>>>>>>>>>> on
> >>>>>>>>>>>>>>>>>>>>> `cache()` returning `CachedTable` handle to avoid
> >>>>>>> implicit
> >>>>>>>>>>>>>>>>>>>> behaviours/side
> >>>>>>>>>>>>>>>>>>>>> effects and to give both us & users more
> >>>>> flexibility.
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> Piotrek
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> On 29 Nov 2018, at 06:20, Becket Qin <
> >>>>>>>> becket.qin@gmail.com
> >>>>>>>>>>
> >>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> Just to add a little bit, the materialized view is
> >>>>>>>> probably
> >>>>>>>>>>> more
> >>>>>>>>>>>>>>>>>>>> similar
> >>>>>>>>>>>>>>>>>>>>> to
> >>>>>>>>>>>>>>>>>>>>>> the persistent() brought up earlier in the thread.
> >>>>> So
> >>>>>>> it
> >>>>>>>> is
> >>>>>>>>>>>>> usually
> >>>>>>>>>>>>>>>>>>>> cross
> >>>>>>>>>>>>>>>>>>>>>> session and could be used in a larger scope. For
> >>>>>>>> example, a
> >>>>>>>>>>>>>>>>>>>> materialized
> >>>>>>>>>>>>>>>>>>>>>> view created by user A may be visible to user B. It
> >>>>>> is
> >>>>>>>>>> probably
> >>>>>>>>>>>>>>>>>>>> something
> >>>>>>>>>>>>>>>>>>>>>> we want to have in the future. I'll put it in the
> >>>>>>> future
> >>>>>>>>> work
> >>>>>>>>>>>>>>>>>>> section.
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> Jiangjie (Becket) Qin
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> On Thu, Nov 29, 2018 at 9:47 AM Becket Qin <
> >>>>>>>>>>> becket.qin@gmail.com
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> Hi Piotrek,
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> Thanks for the explanation.
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> Right now we are mostly thinking of the cached
> >>>>> table
> >>>>>>> as
> >>>>>>>>>>>>>>>> immutable. I
> >>>>>>>>>>>>>>>>>>>> can
> >>>>>>>>>>>>>>>>>>>>>>> see the Materialized view would be useful in the
> >>>>>>> future.
> >>>>>>>>>> That
> >>>>>>>>>>>>>>>> said,
> >>>>>>>>>>>>>>>>>>> I
> >>>>>>>>>>>>>>>>>>>>> think
> >>>>>>>>>>>>>>>>>>>>>>> a simple cache mechanism is probably still needed.
> >>>>>> So
> >>>>>>> to
> >>>>>>>>> me,
> >>>>>>>>>>>>>>>> cache()
> >>>>>>>>>>>>>>>>>>>> and
> >>>>>>>>>>>>>>>>>>>>>>> materialize() should be two separate method as
> >>>>> they
> >>>>>>>>> address
> >>>>>>>>>>>>>>>>>>> different
> >>>>>>>>>>>>>>>>>>>>>>> needs. Materialize() is a higher level concept
> >>>>>> usually
> >>>>>>>>>>> implying
> >>>>>>>>>>>>>>>>>>>>> periodical
> >>>>>>>>>>>>>>>>>>>>>>> update, while cache() has much simpler semantic.
> >>>>> For
> >>>>>>>>>> example,
> >>>>>>>>>>>>> one
> >>>>>>>>>>>>>>>>>>> may
> >>>>>>>>>>>>>>>>>>>>>>> create a materialized view and use cache() method
> >>>>> in
> >>>>>>> the
> >>>>>>>>>>>>>>>>>>> materialized
> >>>>>>>>>>>>>>>>>>>>> view
> >>>>>>>>>>>>>>>>>>>>>>> creation logic. So that during the materialized
> >>>>> view
> >>>>>>>>> update,
> >>>>>>>>>>>>> they
> >>>>>>>>>>>>>>>> do
> >>>>>>>>>>>>>>>>>>>> not
> >>>>>>>>>>>>>>>>>>>>>>> need to worry about the case that the cached table
> >>>>>> is
> >>>>>>>> also
> >>>>>>>>>>>>>>>> changed.
> >>>>>>>>>>>>>>>>>>>>> Maybe
> >>>>>>>>>>>>>>>>>>>>>>> under the hood, materialized() and cache() could
> >>>>>> share
> >>>>>>>>> some
> >>>>>>>>>>>>>>>>>>> mechanism,
> >>>>>>>>>>>>>>>>>>>>> but
> >>>>>>>>>>>>>>>>>>>>>>> I think a simple cache() method would be handy in
> >>>>> a
> >>>>>>> lot
> >>>>>>>> of
> >>>>>>>>>>>>> cases.
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> Jiangjie (Becket) Qin
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> On Mon, Nov 26, 2018 at 9:38 PM Piotr Nowojski <
> >>>>>>>>>>>>>>>>>>>> piotr@data-artisans.com
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> Hi Becket,
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>> Is there any extra thing user can do on a
> >>>>>>>>>> MaterializedTable
> >>>>>>>>>>>>> that
> >>>>>>>>>>>>>>>>>>>> they
> >>>>>>>>>>>>>>>>>>>>>>>> cannot do on a Table?
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> Maybe not in the initial implementation, but
> >>>>>> various
> >>>>>>>> DBs
> >>>>>>>>>>> offer
> >>>>>>>>>>>>>>>>>>>>> different
> >>>>>>>>>>>>>>>>>>>>>>>> ways to “refresh” the materialised view. Hooks,
> >>>>>>>> triggers,
> >>>>>>>>>>>>> timers,
> >>>>>>>>>>>>>>>>>>>>> manually
> >>>>>>>>>>>>>>>>>>>>>>>> etc. Having `MaterializedTable` would help us to
> >>>>>>> handle
> >>>>>>>>>> that
> >>>>>>>>>>> in
> >>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>>>> future.
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>> After users call *table.cache(), *users can just
> >>>>>> use
> >>>>>>>>> that
> >>>>>>>>>>>>> table
> >>>>>>>>>>>>>>>>>>> and
> >>>>>>>>>>>>>>>>>>>> do
> >>>>>>>>>>>>>>>>>>>>>>>> anything that is supported on a Table, including
> >>>>>> SQL.
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> This is some implicit behaviour with side
> >>>>> effects.
> >>>>>>>>> Imagine
> >>>>>>>>>> if
> >>>>>>>>>>>>>>>> user
> >>>>>>>>>>>>>>>>>>>> has
> >>>>>>>>>>>>>>>>>>>>> a
> >>>>>>>>>>>>>>>>>>>>>>>> long and complicated program, that touches table
> >>>>>> `b`
> >>>>>>>>>> multiple
> >>>>>>>>>>>>>>>>>>> times,
> >>>>>>>>>>>>>>>>>>>>> maybe
> >>>>>>>>>>>>>>>>>>>>>>>> scattered around different methods. If he
> >>>>> modifies
> >>>>>>> his
> >>>>>>>>>>> program
> >>>>>>>>>>>>> by
> >>>>>>>>>>>>>>>>>>>>> inserting
> >>>>>>>>>>>>>>>>>>>>>>>> in one place
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> b.cache()
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> This implicitly alters the semantic and behaviour
> >>>>>> of
> >>>>>>>> his
> >>>>>>>>>> code
> >>>>>>>>>>>>> all
> >>>>>>>>>>>>>>>>>>>> over
> >>>>>>>>>>>>>>>>>>>>>>>> the place, maybe in a ways that might cause
> >>>>>> problems.
> >>>>>>>> For
> >>>>>>>>>>>>> example
> >>>>>>>>>>>>>>>>>>>> what
> >>>>>>>>>>>>>>>>>>>>> if
> >>>>>>>>>>>>>>>>>>>>>>>> underlying data is changing?
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> Having invisible side effects is also not very
> >>>>>> clean,
> >>>>>>>> for
> >>>>>>>>>>>>> example
> >>>>>>>>>>>>>>>>>>>> think
> >>>>>>>>>>>>>>>>>>>>>>>> about something like this (but more complicated):
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> Table b = ...;
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> If (some_condition) {
> >>>>>>>>>>>>>>>>>>>>>>>> processTable1(b)
> >>>>>>>>>>>>>>>>>>>>>>>> }
> >>>>>>>>>>>>>>>>>>>>>>>> else {
> >>>>>>>>>>>>>>>>>>>>>>>> processTable2(b)
> >>>>>>>>>>>>>>>>>>>>>>>> }
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> // do more stuff with b
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> And user adds `b.cache()` call to only one of the
> >>>>>>>>>>>>> `processTable1`
> >>>>>>>>>>>>>>>>>>> or
> >>>>>>>>>>>>>>>>>>>>>>>> `processTable2` methods.
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> On the other hand
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> Table materialisedB = b.materialize()
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> Avoids (at least some of) the side effect issues
> >>>>>> and
> >>>>>>>>> forces
> >>>>>>>>>>>>> user
> >>>>>>>>>>>>>>>> to
> >>>>>>>>>>>>>>>>>>>>>>>> explicitly use `materialisedB` where it’s
> >>>>>> appropriate
> >>>>>>>> and
> >>>>>>>>>>>>> forces
> >>>>>>>>>>>>>>>>>>> user
> >>>>>>>>>>>>>>>>>>>>> to
> >>>>>>>>>>>>>>>>>>>>>>>> think what does it actually mean. And if
> >>>>> something
> >>>>>>>>> doesn’t
> >>>>>>>>>>> work
> >>>>>>>>>>>>>>>> in
> >>>>>>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>>>> end
> >>>>>>>>>>>>>>>>>>>>>>>> for the user, he will know what has he changed
> >>>>>>> instead
> >>>>>>>> of
> >>>>>>>>>>>>> blaming
> >>>>>>>>>>>>>>>>>>>>> Flink for
> >>>>>>>>>>>>>>>>>>>>>>>> some “magic” underneath. In the above example,
> >>>>>> after
> >>>>>>>>>>>>>>>> materialising
> >>>>>>>>>>>>>>>>>>> b
> >>>>>>>>>>>>>>>>>>>> in
> >>>>>>>>>>>>>>>>>>>>>>>> only one of the methods, he should/would realise
> >>>>>>> about
> >>>>>>>>> the
> >>>>>>>>>>>>> issue
> >>>>>>>>>>>>>>>>>>> when
> >>>>>>>>>>>>>>>>>>>>>>>> handling the return value `MaterializedTable` of
> >>>>>> that
> >>>>>>>>>> method.
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> I guess it comes down to personal preferences if
> >>>>>> you
> >>>>>>>> like
> >>>>>>>>>>>>> things
> >>>>>>>>>>>>>>>> to
> >>>>>>>>>>>>>>>>>>>> be
> >>>>>>>>>>>>>>>>>>>>>>>> implicit or not. The more of a power user someone is,
> >>>>>> probably
> >>>>>>>> the
> >>>>>>>>>>> more
> >>>>>>>>>>>>>>>>>>> likely
> >>>>>>>>>>>>>>>>>>>>> he is
> >>>>>>>>>>>>>>>>>>>>>>>> to like/understand implicit behaviour. And we as
> >>>>>>> Table
> >>>>>>>>> API
> >>>>>>>>>>>>>>>>>>> designers
> >>>>>>>>>>>>>>>>>>>>> are
> >>>>>>>>>>>>>>>>>>>>>>>> the most power users out there, so I would
> >>>>> proceed
> >>>>>>> with
> >>>>>>>>>>> caution
> >>>>>>>>>>>>>>>> (so
> >>>>>>>>>>>>>>>>>>>>> that we
> >>>>>>>>>>>>>>>>>>>>>>>> do not end up in the crazy perl realm with it’s
> >>>>>>> lovely
> >>>>>>>>>>> implicit
> >>>>>>>>>>>>>>>>>>>> method
> >>>>>>>>>>>>>>>>>>>>>>>> arguments ;)  <
> >>>>>>>>>> https://stackoverflow.com/a/14922656/8149051
> >>>>>>>>>>>> )
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>> Table API to also support non-relational
> >>>>>> processing
> >>>>>>>>> cases,
> >>>>>>>>>>>>>>>> cache()
> >>>>>>>>>>>>>>>>>>>>>>>> might be slightly better.
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> I think even such extended Table API could
> >>>>> benefit
> >>>>>>> from
> >>>>>>>>>>>>> sticking
> >>>>>>>>>>>>>>>>>>>>> to/being
> >>>>>>>>>>>>>>>>>>>>>>>> consistent with SQL where both SQL and Table API
> >>>>>> are
> >>>>>>>>>>> basically
> >>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>>>> same.
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> One more thing. `MaterializedTable materialize()`
> >>>>>>> could
> >>>>>>>>> be
> >>>>>>>>>>> more
> >>>>>>>>>>>>>>>>>>>>>>>> powerful/flexible allowing the user to operate
> >>>>> both
> >>>>>>> on
> >>>>>>>>>>>>>>>> materialised
> >>>>>>>>>>>>>>>>>>>>> and not
> >>>>>>>>>>>>>>>>>>>>>>>> materialised view at the same time for whatever
> >>>>>>> reasons
> >>>>>>>>>>>>>>>> (underlying
> >>>>>>>>>>>>>>>>>>>>> data
> >>>>>>>>>>>>>>>>>>>>>>>> changing/better optimisation opportunities after
> >>>>>>>> pushing
> >>>>>>>>>> down
> >>>>>>>>>>>>>>>> more
> >>>>>>>>>>>>>>>>>>>>> filters
> >>>>>>>>>>>>>>>>>>>>>>>> etc). For example:
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> Table b = …;
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> MaterializedTable mb = b.materialize();
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> Val min = mb.min();
> >>>>>>>>>>>>>>>>>>>>>>>> Val max = mb.max();
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> Val user42 = b.filter(‘userId = 42);
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> Could be more efficient compared to `b.cache()` if
> >>>>>>>>>>>>>>>>>>>>>>>> `filter(‘userId = 42)` allows for much more
> >>>>>>>>>>>>>>>>>>>>>>>> aggressive optimisations.
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> Piotrek
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>> On 26 Nov 2018, at 12:14, Fabian Hueske <
> >>>>>>>>>> fhueske@gmail.com>
> >>>>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>> I'm not suggesting to add support for Ignite.
> >>>>> This
> >>>>>>> was
> >>>>>>>>>> just
> >>>>>>>>>>> an
> >>>>>>>>>>>>>>>>>>>>> example.
> >>>>>>>>>>>>>>>>>>>>>>>>> Plasma and Arrow sound interesting, too.
> >>>>>>>>>>>>>>>>>>>>>>>>> For the sake of this proposal, it would be up to
> >>>>>> the
> >>>>>>>>> user
> >>>>>>>>>> to
> >>>>>>>>>>>>>>>>>>>>> implement a
> >>>>>>>>>>>>>>>>>>>>>>>>> TableFactory and corresponding TableSource /
> >>>>>>> TableSink
> >>>>>>>>>>> classes
> >>>>>>>>>>>>>>>> to
> >>>>>>>>>>>>>>>>>>>>>>>> persist
> >>>>>>>>>>>>>>>>>>>>>>>>> and read the data.
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>> On Mon, Nov 26, 2018 at 12:06 PM Flavio Pompermaier <
> >>>>>>>>>>>>>>>>>>>>>>>>> pompermaier@okkam.it> wrote:
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> What about to add also Apache Plasma + Arrow as
> >>>>>> an
> >>>>>>>>>>>>> alternative
> >>>>>>>>>>>>>>>> to
> >>>>>>>>>>>>>>>>>>>>>>>> Apache
> >>>>>>>>>>>>>>>>>>>>>>>>>> Ignite?
> >>>>>>>>>>>>>>>>>>>>>>>>>> [1]
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>
> >>>>>>
> >>>
> https://arrow.apache.org/blog/2017/08/08/plasma-in-memory-object-store/
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> On Mon, Nov 26, 2018 at 11:56 AM Fabian Hueske
> >>>>> <
> >>>>>>>>>>>>>>>>>>> fhueske@gmail.com>
> >>>>>>>>>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>> Hi,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for the proposal!
> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>> To summarize, you propose a new method
> >>>>>>>> Table.cache():
> >>>>>>>>>>> Table
> >>>>>>>>>>>>>>>> that
> >>>>>>>>>>>>>>>>>>>>> will
> >>>>>>>>>>>>>>>>>>>>>>>>>>> trigger a job and write the result into some
> >>>>>>>> temporary
> >>>>>>>>>>>>> storage
> >>>>>>>>>>>>>>>>>>> as
> >>>>>>>>>>>>>>>>>>>>>>>> defined
> >>>>>>>>>>>>>>>>>>>>>>>>>>> by a TableFactory.
> >>>>>>>>>>>>>>>>>>>>>>>>>>> The cache() call blocks while the job is
> >>>>> running
> >>>>>>> and
> >>>>>>>>>>>>>>>> eventually
> >>>>>>>>>>>>>>>>>>>>>>>> returns a
> >>>>>>>>>>>>>>>>>>>>>>>>>>> Table object that represents a scan of the
> >>>>>>> temporary
> >>>>>>>>>>> table.
> >>>>>>>>>>>>>>>>>>>>>>>>>>> When the "session" is closed (closing to be
> >>>>>>>> defined?),
> >>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>>> temporary
> >>>>>>>>>>>>>>>>>>>>>>>>>> tables
> >>>>>>>>>>>>>>>>>>>>>>>>>>> are all dropped.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>> I think this behavior makes sense and is a
> >>>>> good
> >>>>>>>> first
> >>>>>>>>>> step
> >>>>>>>>>>>>>>>>>>> towards
> >>>>>>>>>>>>>>>>>>>>>>>> more
> >>>>>>>>>>>>>>>>>>>>>>>>>>> interactive workloads.
> >>>>>>>>>>>>>>>>>>>>>>>>>>> However, its performance suffers from writing
> >>>>> to
> >>>>>>> and
> >>>>>>>>>>> reading
> >>>>>>>>>>>>>>>>>>> from
> >>>>>>>>>>>>>>>>>>>>>>>>>> external
> >>>>>>>>>>>>>>>>>>>>>>>>>>> systems.
> >>>>>>>>>>>>>>>>>>>>>>>>>>> I think this is OK for now. Changes that would
> >>>>>>>>>>> significantly
> >>>>>>>>>>>>>>>>>>>> improve
> >>>>>>>>>>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>>>>>>>>>> situation (i.e., pinning data in-memory across
> >>>>>>> jobs)
> >>>>>>>>>> would
> >>>>>>>>>>>>>>>> have
> >>>>>>>>>>>>>>>>>>>>> large
> >>>>>>>>>>>>>>>>>>>>>>>>>>> impacts on many components of Flink.
> >>>>>>>>>>>>>>>>>>>>>>>>>>> Users could use in-memory filesystems or
> >>>>> storage
> >>>>>>>> grids
> >>>>>>>>>>>>> (Apache
> >>>>>>>>>>>>>>>>>>>>>>>> Ignite) to
> >>>>>>>>>>>>>>>>>>>>>>>>>>> mitigate some of the performance effects.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>> Best, Fabian
> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>> On Mon, Nov 26, 2018 at 3:38 AM Becket Qin <
> >>>>>>>>>>>>>>>>>>>>>>>>>>> becket.qin@gmail.com> wrote:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for the explanation, Piotrek.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> Is there any extra thing user can do on a
> >>>>>>>>>>> MaterializedTable
> >>>>>>>>>>>>>>>>>>> that
> >>>>>>>>>>>>>>>>>>>>> they
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> cannot do on a Table? After users call
> >>>>>>>>> *table.cache(),
> >>>>>>>>>>>>> *users
> >>>>>>>>>>>>>>>>>>> can
> >>>>>>>>>>>>>>>>>>>>>>>> just
> >>>>>>>>>>>>>>>>>>>>>>>>>>> use
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> that table and do anything that is supported
> >>>>>> on a
> >>>>>>>>>> Table,
> >>>>>>>>>>>>>>>>>>>> including
> >>>>>>>>>>>>>>>>>>>>>>>> SQL.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> Naming wise, either cache() or materialize()
> >>>>>>> sounds
> >>>>>>>>>> fine
> >>>>>>>>>>> to
> >>>>>>>>>>>>>>>> me.
> >>>>>>>>>>>>>>>>>>>>>>>> cache()
> >>>>>>>>>>>>>>>>>>>>>>>>>>> is
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> a bit more general than materialize(). Given
> >>>>>> that
> >>>>>>>> we
> >>>>>>>>>> are
> >>>>>>>>>>>>>>>>>>>> enhancing
> >>>>>>>>>>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> Table API to also support non-relational
> >>>>>>> processing
> >>>>>>>>>>> cases,
> >>>>>>>>>>>>>>>>>>>> cache()
> >>>>>>>>>>>>>>>>>>>>>>>>>> might
> >>>>>>>>>>>>>>>>>>>>>>>>>>> be
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> slightly better.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> Jiangjie (Becket) Qin
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> On Fri, Nov 23, 2018 at 11:25 PM Piotr
> >>>>>> Nowojski <
> >>>>>>>>>>>>>>>>>>>>>>>>>> piotr@data-artisans.com
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi Becket,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> Ops, sorry I didn’t notice that you intend
> >>>>> to
> >>>>>>>> reuse
> >>>>>>>>>>>>> existing
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> `TableFactory`. I don’t know why, but I
> >>>>>> assumed
> >>>>>>>> that
> >>>>>>>>>> you
> >>>>>>>>>>>>>>>> want
> >>>>>>>>>>>>>>>>>>> to
> >>>>>>>>>>>>>>>>>>>>>>>>>>> provide
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> an
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> alternate way of writing the data.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> Now that I hopefully understand the
> >>>>> proposal,
> >>>>>>>> maybe
> >>>>>>>>> we
> >>>>>>>>>>>>> could
> >>>>>>>>>>>>>>>>>>>>> rename
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> `cache()` to
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> void materialize()
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> or going step further
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> MaterializedTable materialize()
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> MaterializedTable createMaterializedView()
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> ?
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> The second option with returning a handle I
> >>>>>>> think
> >>>>>>>> is
> >>>>>>>>>>> more
> >>>>>>>>>>>>>>>>>>>> flexible
> >>>>>>>>>>>>>>>>>>>>>>>>>> and
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> could provide features such as
> >>>>>>> “refresh”/“delete”
> >>>>>>>> or
> >>>>>>>>>>>>>>>> generally
> >>>>>>>>>>>>>>>>>>>>>>>>>> speaking
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> manage the the view. In the future we could
> >>>>>> also
> >>>>>>>>> think
> >>>>>>>>>>>>> about
> >>>>>>>>>>>>>>>>>>>>> adding
> >>>>>>>>>>>>>>>>>>>>>>>>>>> hooks
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> to automatically refresh view etc. It is
> >>>>> also
> >>>>>>> more
> >>>>>>>>>>>>> explicit
> >>>>>>>>>>>>>>>> -
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> materialization returning a new table handle
> >>>>>>> will
> >>>>>>>>> not
> >>>>>>>>>>> have
> >>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>>>> same
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> implicit side effects as adding a simple
> >>>>> line
> >>>>>> of
> >>>>>>>>> code
> >>>>>>>>>>> like
> >>>>>>>>>>>>>>>>>>>>>>>>>> `b.cache()`
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> would have.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> It would also be more SQL like, making it
> >>>>> more
> >>>>>>>>>> intuitive
> >>>>>>>>>>>>> for
> >>>>>>>>>>>>>>>>>>>> users
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> already
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> familiar with the SQL.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> Piotrek
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On 23 Nov 2018, at 14:53, Becket Qin <
> >>>>>>>>>>>>> becket.qin@gmail.com
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi Piotrek,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> For the cache() method itself, yes, it is
> >>>>>>>>> equivalent
> >>>>>>>>>> to
> >>>>>>>>>>>>>>>>>>>> creating
> >>>>>>>>>>>>>>>>>>>>> a
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> BUILT-IN
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> materialized view with a lifecycle. That
> >>>>>>>>>> functionality
> >>>>>>>>>>> is
> >>>>>>>>>>>>>>>>>>>> missing
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> today,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> though. Not sure if I understand your
> >>>>>> question.
> >>>>>>>> Do
> >>>>>>>>>> you
> >>>>>>>>>>>>> mean
> >>>>>>>>>>>>>>>>>>> we
> >>>>>>>>>>>>>>>>>>>>>>>>>>> already
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> have
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the functionality and just need a syntax
> >>>>>> sugar?
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> What's more interesting in the proposal is
> >>>>> do
> >>>>>>> we
> >>>>>>>>> want
> >>>>>>>>>>> to
> >>>>>>>>>>>>>>>> stop
> >>>>>>>>>>>>>>>>>>>> at
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> creating
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the materialized view? Or do we want to
> >>>>>> extend
> >>>>>>>> that
> >>>>>>>>>> in
> >>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>>> future
> >>>>>>>>>>>>>>>>>>>>>>>>>> to
> >>>>>>>>>>>>>>>>>>>>>>>>>>> a
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> more
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> useful unified data store distributed with
> >>>>>>> Flink?
> >>>>>>>>> And
> >>>>>>>>>>> do
> >>>>>>>>>>>>> we
> >>>>>>>>>>>>>>>>>>>> want
> >>>>>>>>>>>>>>>>>>>>> to
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> have
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> a
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> mechanism allow more flexible user job
> >>>>>> pattern
> >>>>>>>> with
> >>>>>>>>>>> their
> >>>>>>>>>>>>>>>> own
> >>>>>>>>>>>>>>>>>>>>> user
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> defined
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> services. These considerations are much
> >>>>> more
> >>>>>>>>>>>>> architectural.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Jiangjie (Becket) Qin
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Fri, Nov 23, 2018 at 6:01 PM Piotr
> >>>>>> Nowojski
> >>>>>>> <
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> piotr@data-artisans.com>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Interesting idea. I’m trying to understand
> >>>>>> the
> >>>>>>>>>>> problem.
> >>>>>>>>>>>>>>>>>>> Isn’t
> >>>>>>>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> `cache()` call an equivalent of writing
> >>>>> data
> >>>>>>> to
> >>>>>>>> a
> >>>>>>>>>> sink
> >>>>>>>>>>>>> and
> >>>>>>>>>>>>>>>>>>>> later
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> reading
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> from it? Where this sink has a limited
> >>>>> live
> >>>>>>>>>> scope/live
> >>>>>>>>>>>>>>>> time?
> >>>>>>>>>>>>>>>>>>>> And
> >>>>>>>>>>>>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> sink
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> could be implemented as in memory or a
> >>>>> file
> >>>>>>>> sink?
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> If so, what’s the problem with creating a
> >>>>>>>>>> materialised
> >>>>>>>>>>>>>>>> view
> >>>>>>>>>>>>>>>>>>>>> from a
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> table
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> “b” (from your document’s example) and
> >>>>>> reusing
> >>>>>>>>> this
> >>>>>>>>>>>>>>>>>>>> materialised
> >>>>>>>>>>>>>>>>>>>>>>>>>>> view
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> later? Maybe we are lacking mechanisms to
> >>>>>>> clean
> >>>>>>>> up
> >>>>>>>>>>>>>>>>>>>> materialised
> >>>>>>>>>>>>>>>>>>>>>>>>>>> views
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> (for
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> example when current session finishes)?
> >>>>>> Maybe
> >>>>>>> we
> >>>>>>>>>> need
> >>>>>>>>>>>>> some
> >>>>>>>>>>>>>>>>>>>>>>>>>> syntactic
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> sugar
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> on top of it?
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Piotrek
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On 23 Nov 2018, at 07:21, Becket Qin <
> >>>>>>>>>>>>>>>> becket.qin@gmail.com
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for the suggestion, Jincheng.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Yes, I think it makes sense to have a
> >>>>>>> persist()
> >>>>>>>>>> with
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> lifecycle/defined
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> scope. I just added a section in the
> >>>>> future
> >>>>>>>> work
> >>>>>>>>>> for
> >>>>>>>>>>>>>>>> this.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Jiangjie (Becket) Qin
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Fri, Nov 23, 2018 at 1:55 PM jincheng
> >>>>>> sun
> >>>>>>> <
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> sunjincheng121@gmail.com
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi Jiangjie,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thank you for the explanation about the
> >>>>>> name
> >>>>>>>> of
> >>>>>>>>>>>>>>>>>>> `cache()`, I
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> understand
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> why
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> you designed this way!
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Another idea is whether we can specify a
> >>>>>>>>> lifecycle
> >>>>>>>>>>> for
> >>>>>>>>>>>>>>>>>>> data
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> persistence?
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> For example, persist
> >>>>> (LifeCycle.SESSION),
> >>>>>> so
> >>>>>>>>> that
> >>>>>>>>>>> the
> >>>>>>>>>>>>>>>> user
> >>>>>>>>>>>>>>>>>>>> is
> >>>>>>>>>>>>>>>>>>>>>>>>>> not
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> worried
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> about data loss, and will clearly
> >>>>> specify
> >>>>>>> the
> >>>>>>>>> time
> >>>>>>>>>>>>> range
> >>>>>>>>>>>>>>>>>>> for
> >>>>>>>>>>>>>>>>>>>>>>>>>>> keeping
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> time.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> At the same time, if we want to expand,
> >>>>> we
> >>>>>>> can
> >>>>>>>>>> also
> >>>>>>>>>>>>>>>> share
> >>>>>>>>>>>>>>>>>>>> in a
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> certain
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> group of session, for example:
> >>>>>>>>>>>>>>>>>>>> LifeCycle.SESSION_GROUP(...), I
> >>>>>>>>>>>>>>>>>>>>>>>>>> am
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> not
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> sure,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> just an immature suggestion, for
> >>>>> reference
> >>>>>>>> only!
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Bests,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Jincheng
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Becket Qin <be...@gmail.com>
> >>>>>>>>> 于2018年11月23日周五
> >>>>>>>>>>>>>>>>>>> 下午1:33写道:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Re: Jincheng,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for the feedback. Regarding
> >>>>>> cache()
> >>>>>>>> v.s.
> >>>>>>>>>>>>>>>>>>> persist(),
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> personally I
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> find cache() to be more accurately
> >>>>>>> describing
> >>>>>>>>> the
> >>>>>>>>>>>>>>>>>>> behavior,
> >>>>>>>>>>>>>>>>>>>>>>>>>> i.e.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Table
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> is cached for the session, but will be
> >>>>>>>> deleted
> >>>>>>>>>>> after
> >>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>>>>>>>>> session
> >>>>>>>>>>>>>>>>>>>>>>>>>>> is
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> closed.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> persist() seems a little misleading as
> >>>>>>> people
> >>>>>>>>>> might
> >>>>>>>>>>>>>>>> think
> >>>>>>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>>>>>>>>>> table
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> will
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> still be there even after the session
> >>>>> is
> >>>>>>>> gone.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Great point about mixing the batch and
> >>>>>>> stream
> >>>>>>>>>>>>>>>> processing
> >>>>>>>>>>>>>>>>>>> in
> >>>>>>>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> same
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> job.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> We should absolutely move towards that
> >>>>>>> goal.
> >>>>>>>> I
> >>>>>>>>>>>>> imagine
> >>>>>>>>>>>>>>>>>>> that
> >>>>>>>>>>>>>>>>>>>>>>>>>> would
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> be
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> a
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> huge
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> change across the board, including
> >>>>>> sources,
> >>>>>>>>>>> operators
> >>>>>>>>>>>>>>>> and
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> optimizations,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> to
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> name some. Likely we will need several
> >>>>>>>> separate
> >>>>>>>>>>>>>>>> in-depth
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> discussions.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Jiangjie (Becket) Qin
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Fri, Nov 23, 2018 at 5:14 AM Xingcan
> >>>>>>> Cui <
> >>>>>>>>>>>>>>>>>>>>>>>>>> xingcanc@gmail.com>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi all,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> @Shaoxuan, I think the lifecycle or
> >>>>>> access
> >>>>>>>>>> domain
> >>>>>>>>>>>>> are
> >>>>>>>>>>>>>>>>>>> both
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> orthogonal
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> to
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the cache problem. Essentially, this
> >>>>> may
> >>>>>>> be
> >>>>>>>>> the
> >>>>>>>>>>>>> first
> >>>>>>>>>>>>>>>>>>> time
> >>>>>>>>>>>>>>>>>>>>> we
> >>>>>>>>>>>>>>>>>>>>>>>>>>> plan
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> to
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> introduce another storage mechanism
> >>>>>> other
> >>>>>>>> than
> >>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>> state.
> >>>>>>>>>>>>>>>>>>>>>>>>>> Maybe
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> it’s
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> better
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> to first draw a big picture and then
> >>>>>>>>> concentrate
> >>>>>>>>>>> on
> >>>>>>>>>>>>> a
> >>>>>>>>>>>>>>>>>>>>> specific
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> part?
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> @Becket, yes, actually I am more
> >>>>>> concerned
> >>>>>>>>> with
> >>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>>>> underlying
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> service.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> This seems to be quite a major change
> >>>>> to
> >>>>>>> the
> >>>>>>>>>>>>> existing
> >>>>>>>>>>>>>>>>>>>>>>>>>> codebase.
> >>>>>>>>>>>>>>>>>>>>>>>>>>> As
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> you
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> claimed, the service should be
> >>>>>> extendible
> >>>>>>> to
> >>>>>>>>>>> support
> >>>>>>>>>>>>>>>>>>> other
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> components
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> and
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> we’d better discussed it in another
> >>>>>>> thread.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> All in all, I also eager to enjoy the
> >>>>>> more
> >>>>>>>>>>>>> interactive
> >>>>>>>>>>>>>>>>>>>> Table
> >>>>>>>>>>>>>>>>>>>>>>>>>>> API,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> in
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> case
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> of a general and flexible enough
> >>>>> service
> >>>>>>>>>>> mechanism.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Best,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Xingcan
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Nov 22, 2018, at 10:16 AM, Xiaowei
> >>>>>>>> Jiang <
> >>>>>>>>>>>>>>>>>>>>>>>>>>> xiaoweij@gmail.com>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Relying on a callback for the temp
> >>>>>> table
> >>>>>>>> for
> >>>>>>>>>>> clean
> >>>>>>>>>>>>> up
> >>>>>>>>>>>>>>>>>>> is
> >>>>>>>>>>>>>>>>>>>>> not
> >>>>>>>>>>>>>>>>>>>>>>>>>>> very
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> reliable.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> There is no guarantee that it will be
> >>>>>>>>> executed
> >>>>>>>>>>>>>>>>>>>>> successfully.
> >>>>>>>>>>>>>>>>>>>>>>>>>> We
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> may
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> risk
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> leaks when that happens. I think that
> >>>>>>> it's
> >>>>>>>>>> safer
> >>>>>>>>>>> to
> >>>>>>>>>>>>>>>>>>> have
> >>>>>>>>>>>>>>>>>>>> an
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> association
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> between temp table and session id. So
> >>>>>> we
> >>>>>>>> can
> >>>>>>>>>>> always
> >>>>>>>>>>>>>>>>>>> clean
> >>>>>>>>>>>>>>>>>>>>> up
> >>>>>>>>>>>>>>>>>>>>>>>>>>> temp
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> tables
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> which are no longer associated with
> >>>>> any
> >>>>>>>>> active
> >>>>>>>>>>>>>>>>>>> sessions.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Regards,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Xiaowei
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Thu, Nov 22, 2018 at 12:55 PM
> >>>>>> jincheng
> >>>>>>>>> sun <
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> sunjincheng121@gmail.com>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi Jiangjie&Shaoxuan,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for initiating this great
> >>>>>>> proposal!
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Interactive Programming is very
> >>>>> useful
> >>>>>>> and
> >>>>>>>>>> user
> >>>>>>>>>>>>>>>>>>> friendly
> >>>>>>>>>>>>>>>>>>>>> in
> >>>>>>>>>>>>>>>>>>>>>>>>>>> case
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> of
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> your
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> examples.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Moreover, especially when a business
> >>>>>> has
> >>>>>>>> to
> >>>>>>>>> be
> >>>>>>>>>>>>>>>>>>> executed
> >>>>>>>>>>>>>>>>>>>> in
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> several
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> stages
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> with dependencies,such as the
> >>>>> pipeline
> >>>>>>> of
> >>>>>>>>>> Flink
> >>>>>>>>>>>>> ML,
> >>>>>>>>>>>>>>>> in
> >>>>>>>>>>>>>>>>>>>>> order
> >>>>>>>>>>>>>>>>>>>>>>>>>>> to
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> utilize
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> intermediate calculation results we
> >>>>>> have
> >>>>>>>> to
> >>>>>>>>>>>>> submit a
> >>>>>>>>>>>>>>>>>>> job
> >>>>>>>>>>>>>>>>>>>>> by
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> env.execute().
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> About the `cache()`  , I think is
> >>>>>> better
> >>>>>>>> to
> >>>>>>>>>>> named
> >>>>>>>>>>>>>>>>>>>>>>>>>> `persist()`,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> And
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> The
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Flink framework determines whether
> >>>>> we
> >>>>>>>>>> internally
> >>>>>>>>>>>>>>>> cache
> >>>>>>>>>>>>>>>>>>>> in
> >>>>>>>>>>>>>>>>>>>>>>>>>>> memory
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> or
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> persist
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> to the storage system,Maybe save the
> >>>>>>> data
> >>>>>>>>> into
> >>>>>>>>>>>>> state
> >>>>>>>>>>>>>>>>>>>>> backend
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> (MemoryStateBackend or
> >>>>>>> RocksDBStateBackend
> >>>>>>>>>> etc.)
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> BTW, from the points of my view in
> >>>>> the
> >>>>>>>>> future,
> >>>>>>>>>>>>>>>> support
> >>>>>>>>>>>>>>>>>>>> for
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> streaming
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> and
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> batch mode switching in the same job
> >>>>>>> will
> >>>>>>>>> also
> >>>>>>>>>>>>>>>> benefit
> >>>>>>>>>>>>>>>>>>>> in
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> "Interactive
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Programming",  I am looking forward
> >>>>> to
> >>>>>>>> your
> >>>>>>>>>>> JIRAs
> >>>>>>>>>>>>>>>> and
> >>>>>>>>>>>>>>>>>>>>> FLIP!
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Best,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Jincheng
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Becket Qin <be...@gmail.com>
> >>>>>>>>>>> 于2018年11月20日周二
> >>>>>>>>>>>>>>>>>>>>> 下午9:56写道:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi all,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> As a few recent email threads have
> >>>>>>>> pointed
> >>>>>>>>>> out,
> >>>>>>>>>>>>> it
> >>>>>>>>>>>>>>>>>>> is a
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> promising
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> opportunity to enhance Flink Table
> >>>>>> API
> >>>>>>> in
> >>>>>>>>>>> various
> >>>>>>>>>>>>>>>>>>>>> aspects,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> including
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> functionality and ease of use among
> >>>>>>>> others.
> >>>>>>>>>> One
> >>>>>>>>>>>>> of
> >>>>>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> scenarios
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> where
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> we
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> feel Flink could improve is
> >>>>>> interactive
> >>>>>>>>>>>>>>>> programming.
> >>>>>>>>>>>>>>>>>>> To
> >>>>>>>>>>>>>>>>>>>>>>>>>>> explain
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> issues
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> and facilitate the discussion on
> >>>>> the
> >>>>>>>>>> solution,
> >>>>>>>>>>> we
> >>>>>>>>>>>>>>>> put
> >>>>>>>>>>>>>>>>>>>>>>>>>>> together
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> following document with our
> >>>>> proposal.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>
> https://docs.google.com/document/d/1d4T2zTyfe7hdncEUAxrlNOYr4e5IMNEZLyqSuuswkA0/edit?usp=sharing
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Feedback and comments are very
> >>>>>> welcome!
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Jiangjie (Becket) Qin
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>
> >>>
>
>

Re: [DISCUSS] Support Interactive Programming in Flink Table API

Posted by Piotr Nowojski <pi...@data-artisans.com>.
Hi Becket,

> {
>  val cachedTable = a.cache()
>  val b = cachedTable.select(...)
>  val c = a.select(...)
> }
> 
> Semantic 1. b uses cachedTable as the user demanded. c uses the original
> DAG as the user demanded. In this case, the optimizer has no chance to
> optimize.
> Semantic 2. b uses cachedTable as the user demanded. c leaves the optimizer
> to choose whether the cache or the DAG should be used. In this case, the
> user loses the option to NOT use the cache.
>
> As you can see, neither of the options seems perfect. However, I guess you
> and Till are proposing the third option:
> 
> Semantic 3. b leaves the optimizer to choose whether the cache or the DAG
> should be used. c always uses the DAG.

I am pretty sure that Till, Fabian, others and I were all proposing and advocating in favour of semantic “1”: no cost-based optimiser decisions at all.

{
 val cachedTable = a.cache()
 val b1 = cachedTable.select(…)
 val b2 = cachedTable.foo().select(…)
 val b3 = cachedTable.bar().select(...)
 val c1 = a.select(…)
 val c2 = a.foo().select(…)
 val c3 = a.bar().select(...)
}

All of b1, b2 and b3 read from the cache, while c1, c2 and c3 re-execute the whole plan for “a”.
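
To make that contract concrete, here is a minimal Scala sketch of what the explicit handle could look like. The names (`Table`, `CachedTable`, `dropCache()`) and the operator signatures are illustrative assumptions for this discussion, not an agreed design:

// Sketch only: an explicit-handle contract under semantic "1".
trait Table {
  def select(fields: String): Table
  def filter(predicate: String): Table
  // Proposed: eagerly materialise this table's result and return an explicit
  // handle to the materialisation.
  def cache(): CachedTable
}

// A CachedTable is still a Table, so all relational operators stay available,
// but every scan of it reads the materialised result - it never re-executes
// the original plan.
trait CachedTable extends Table {
  // Possible future extension: manual disposal of the materialised result.
  def dropCache(): Unit
}

With such a handle, reading from the cache versus re-executing the plan is always an explicit choice made by the user at the call site, never an optimiser decision.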

In the future we could discuss going one step further and introducing some global optimisation (that can be manually enabled/disabled): deduplicate plan nodes / deduplicate subqueries / reuse subquery results, or whatever we would call it. It could do two things:

1. Automatically try to deduplicate fragments of the plan and share the result using CachedTable - in other words, automatically insert `CachedTable cache()` calls (see the sketch after this list).
2. Automatically make the decision to bypass explicit `CachedTable` access (this would be the equivalent of what you described as “semantic 3”).
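
As a sketch of what point 1 would mean in practice (illustrative only: it assumes the proposed `cache()` call, some source table `src`, and Scala Table API expression syntax):

// Without the optimisation, this program executes the plan for `a` twice:
val a = src.groupBy('key).select('key, 'v.sum as 'total)
val b = a.filter('total > 10)
val c = a.filter('total <= 10)

// With automatic deduplication enabled, the optimiser would behave as if the
// user had written:
val cachedA = a.cache()               // inserted by the optimiser, not the user
val b2 = cachedA.filter('total > 10)  // same result as b
val c2 = cachedA.filter('total <= 10) // same result as c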

However, as I wrote previously, I have big doubts whether such cost-based optimisation would work (this also applies to “Semantic 2”). I would expect it to do more harm than good in so many cases that it wouldn’t make sense. Even assuming that we could calculate statistics perfectly (which is not going to happen), it’s virtually impossible to correctly estimate the exchange rate of CPU cycles vs IO operations, as it changes so much from deployment to deployment.
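
To illustrate why (all symbols and numbers here are purely hypothetical), the decision boils down to comparing something like

  cost(read cache) = cachedRows * ioCostPerRow
  cost(recompute)  = sourceRows * cpuCostPerRow + shuffleCost

and every term on both sides is unknown or unstable in practice. ioCostPerRow alone can differ by orders of magnitude between a local SSD and a remote object store, for exactly the same query and the same statistics.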

Is this the core of our disagreement here? That you would like this “cache()” to be mostly a hint for the optimiser?

Piotrek  

> On 11 Dec 2018, at 06:00, Becket Qin <be...@gmail.com> wrote:
> 
> Another potential concern for semantic 3: in the future, we may add
> automatic caching to Flink, e.g. caching the intermediate results at the
> shuffle boundary. If our semantics say that a reference to the original
> table means skipping the cache, those users may not be able to benefit
> from the implicit cache.
> 
> 
> 
> On Tue, Dec 11, 2018 at 12:10 PM Becket Qin <be...@gmail.com> wrote:
> 
>> Hi Piotrek,
>> 
>> Thanks for the reply. Thinking about it again, I might have misunderstood
>> your proposal in earlier emails. Returning a CachedTable might not be a bad
>> idea.
>> 
>> I was more concerned about the semantic and its intuitiveness when a
>> CachedTable is returned. i..e, if cache() returns CachedTable. What are the
>> semantic in the following code:
>> {
>>  val cachedTable = a.cache()
>>  val b = cachedTable.select(...)
>>  val c = a.select(...)
>> }
>> What is the difference between b and c? At first glance, I see two
>> options:
>> 
>> Semantic 1. b uses cachedTable as the user demanded. c uses the original
>> DAG as the user demanded. In this case, the optimizer has no chance to
>> optimize.
>> Semantic 2. b uses cachedTable as the user demanded. c leaves the optimizer
>> to choose whether the cache or the DAG should be used. In this case, users
>> lose the option to NOT use the cache.
>> 
>> As you can see, neither of the options seem perfect. However, I guess you
>> and Till are proposing the third option:
>> 
>> Semantic 3. b leaves the optimizer to choose whether cache or DAG should
>> be used. c always uses the DAG.
>> 
>> This does address all the concerns. It is just that, from an intuitiveness
>> perspective, I find asking users to explicitly use a CachedTable that the
>> optimizer might choose to ignore a little weird. That was why I did not
>> think of that semantic. But given there is material benefit, I think this
>> semantic is acceptable.
>> 
>>> 1. If we want to let the optimiser make decisions whether to use the cache
>>> or not, then why do we need a “void cache()” method at all? Would it
>>> “increase” the chance of using the cache? That sounds strange. What would
>>> be the mechanism of deciding whether to use the cache or not? If we want to
>>> introduce such kind of automated optimisations of “plan nodes
>>> deduplication” I would turn it on globally, not per table, and let the
>>> optimiser do all of the work.
>>> 2. We do not have statistics at the moment for any use/not use cache
>>> decision.
>>> 3. Even if we had, I would be veeerryy sceptical whether such cost based
>>> optimisations would work properly and I would still insist first on
>>> providing an explicit caching mechanism (`CachedTable cache()`)
>>> 
>> We are absolutely on the same page here. An explicit cache() method is
>> necessary not only because the optimizer may not be able to make the right
>> decision, but also because of the nature of interactive programming. For
>> example, if users write the following code in the Scala shell:
>>  val b = a.select(...)
>>  val c = b.select(...)
>>  val d = c.select(...).writeToSink(...)
>>  tEnv.execute()
>> There is no way the optimizer will know whether b or c will be used in
>> later code, unless users hint explicitly.
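>>
>> With an explicit cache(), the intent is visible no matter what comes later
>> in the session, e.g. the same shell snippet with one added line:
>>   val b = a.select(...)
>>   b.cache() // explicit hint: b will be reused later in this session
>>   val c = b.select(...)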
>> 
>>> At the same time I’m not sure if you have responded to our objections to
>>> `void cache()` being implicit/having side effects, which Jark, Fabian,
>>> Till and I (and, I think, also Shaoxuan) are supporting.
>> 
>> Are there any other side effects if we use semantic 3 mentioned above?
>> 
>> Thanks,
>> 
>> JIangjie (Becket) Qin
>> 
>> 
>> On Mon, Dec 10, 2018 at 7:54 PM Piotr Nowojski <pi...@data-artisans.com>
>> wrote:
>> 
>>> Hi Becket,
>>> 
>>> Sorry for not responding long time.
>>> 
>>> Regarding case1.
>>> 
>>> There wouldn’t be an “a.unCache()” method, but I would expect only
>>> `cachedTableA1.dropCache()`. Dropping `cachedTableA1` wouldn’t affect
>>> `cachedTableA2`. Just as in any other database, dropping or modifying one
>>> independent table/materialised view does not affect others.
>>> 
>>>> What I meant is that, assuming there is already a cached table, ideally
>>>> users need not specify whether the next query should read from the cache
>>>> or use the original DAG. This should be decided by the optimizer.
>>> 
>>> 1. If we want to let the optimiser make decisions whether to use the cache
>>> or not, then why do we need a “void cache()” method at all? Would it
>>> “increase” the chance of using the cache? That sounds strange. What would
>>> be the mechanism of deciding whether to use the cache or not? If we want to
>>> introduce such kind of automated optimisations of “plan nodes
>>> deduplication” I would turn it on globally, not per table, and let the
>>> optimiser do all of the work.
>>> 2. We do not have statistics at the moment for any use/not use cache
>>> decision.
>>> 3. Even if we had, I would be veeerryy sceptical whether such cost based
>>> optimisations would work properly and I would still insist first on
>>> providing an explicit caching mechanism (`CachedTable cache()`)
>>> 4. As Till wrote, having an explicit `CachedTable cache()` doesn’t
>>> contradict future work on automated cost based caching.
>>> 
>>> 
>>> At the same time I’m not sure if you have responded to our objections to
>>> `void cache()` being implicit/having side effects, which Jark, Fabian,
>>> Till and I (and, I think, also Shaoxuan) are supporting.
>>> 
>>> Piotrek
>>> 
>>>> On 5 Dec 2018, at 12:42, Becket Qin <be...@gmail.com> wrote:
>>>> 
>>>> Hi Till,
>>>> 
>>>> It is true that after the first job submission, there will be no
>>> ambiguity
>>>> in terms of whether a cached table is used or not. That is the same for
>>> the
>>>> cache() without returning a CachedTable.
>>>> 
>>>> Conceptually one could think of cache() as introducing a caching
>>> operator
>>>>> from which you need to consume if you want to benefit from the
>>> caching
>>>>> functionality.
>>>> 
>>>> I am thinking a little differently. I think it is a hint (as you
>>>> mentioned later) instead of a new operator. I'd like to be careful about
>>>> the semantics of the API. A hint is a property set on an existing
>>>> operator, but is not itself an operator as it does not really manipulate
>>>> the data.
>>>> 
>>>> I agree, ideally the optimizer makes this kind of decision which
>>>>> intermediate result should be cached. But especially when executing
>>> ad-hoc
>>>>> queries the user might better know which results need to be cached
>>> because
>>>>> Flink might not see the full DAG. In that sense, I would consider the
>>>>> cache() method as a hint for the optimizer. Of course, in the future we
>>>>> might add functionality which tries to automatically cache results
>>> (e.g.
>>>>> caching the latest intermediate results until so and so much space is
>>>>> used). But this should hopefully not contradict with `CachedTable
>>> cache()`.
>>>> 
>>>> I agree that the cache() method is needed for exactly the reason you
>>>> mentioned, i.e. Flink cannot predict what users are going to write later,
>>>> so users need to tell Flink explicitly that this table will be used
>>>> later. What I meant is that, assuming there is already a cached table,
>>>> ideally users need not specify whether the next query should read from
>>>> the cache or use the original DAG. This should be decided by the
>>>> optimizer.
>>>> 
>>>> To explain the difference between returning / not returning a
>>> CachedTable,
>>>> I want compare the following two case:
>>>> 
>>>> *Case 1:  returning a CachedTable*
>>>> b = a.map(...)
>>>> val cachedTableA1 = a.cache()
>>>> val cachedTableA2 = a.cache()
>>>> b.print() // Just to make sure a is cached.
>>>> 
>>>> c = a.filter(...) // User specify that the original DAG is used? Or the
>>>> optimizer decides whether DAG or cache should be used?
>>>> d = cachedTableA1.filter() // User specify that the cached table is
>>> used.
>>>> 
>>>> a.unCache() // Can cachedTableA still be used afterwards?
>>>> cachedTableA1.uncache() // Can cachedTableA2 still be used?
>>>> 
>>>> *Case 2: not returning a CachedTable*
>>>> b = a.map()
>>>> a.cache()
>>>> a.cache() // no-op
>>>> b.print() // Just to make sure a is cached
>>>> 
>>>> c = a.filter(...) // Optimizer decides whether the cache or DAG should
>>> be
>>>> used
>>>> d = a.filter(...) // Optimizer decides whether the cache or DAG should
>>> be
>>>> used
>>>> 
>>>> a.unCache()
>>>> a.unCache() // no-op
>>>> 
>>>> In case 1, semantics-wise, the optimizer loses the option to choose
>>>> between the DAG and the cache. And the unCache() call becomes tricky.
>>>> In case 2, users do not need to worry about whether the cache or the DAG
>>>> is used. And the unCache() semantics are clear. However, the caveat is
>>>> that users cannot explicitly ignore the cache.
>>>> 
>>>> In order to address the issues mentioned in case 2, and inspired by the
>>>> discussion so far, I am thinking about using a hint to allow users to
>>>> explicitly ignore the cache. Although we do not have hints yet, we
>>>> probably should have them. So the code becomes:
>>>> 
>>>> *Case 3: returning this table*
>>>> b = a.map()
>>>> a.cache()
>>>> a.cache() // no-op
>>>> b.print() // Just to make sure a is cached
>>>> 
>>>> c = a.filter(...) // Optimizer decides whether the cache or DAG should
>>> be
>>>> used
>>>> d = a.hint("ignoreCache").filter(...) // DAG will be used instead of the
>>>> cache.
>>>> 
>>>> a.unCache()
>>>> a.unCache() // no-op
>>>> 
>>>> We could also let cache() return this table to allow chained method
>>>> calls; a quick sketch of how that might read is below.
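>>>>
>>>> // Illustration only: assumes cache() returning this and the hypothetical
>>>> // hint() above - neither exists yet:
>>>> d = a.cache().filter(...)              // chained; optimizer decides cache vs DAG
>>>> e = a.hint("ignoreCache").filter(...)  // DAG used, the cache explicitly bypassed
>>>>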
>>>> Do you think this API addresses the concerns?
>>>> 
>>>> Thanks,
>>>> 
>>>> Jiangjie (Becket) Qin
>>>> 
>>>> 
>>>> On Wed, Dec 5, 2018 at 10:55 AM Jark Wu <im...@gmail.com> wrote:
>>>> 
>>>>> Hi,
>>>>> 
>>>>> All the recent discussions have focused on whether there is a problem if
>>>>> cache() does not return a Table.
>>>>> It seems that returning a Table explicitly is clearer (and safer?).
>>>>> 
>>>>> So are there any problems if cache() returns a Table? @Becket
>>>>> 
>>>>> Best,
>>>>> Jark
>>>>> 
>>>>> On Tue, 4 Dec 2018 at 22:27, Till Rohrmann <tr...@apache.org>
>>> wrote:
>>>>> 
>>>>>> It's true that b, c, d and e will all read from the original DAG that
>>>>>> generates a. But all subsequent operators (when running multiple
>>> queries)
>>>>>> which reference cachedTableA should not need to reproduce `a` but
>>>>> directly
>>>>>> consume the intermediate result.
>>>>>> 
>>>>>> Conceptually one could think of cache() as introducing a caching
>>> operator
>>>>>> from which you need to consume if you want to benefit from the
>>>>> caching
>>>>>> functionality.
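>>>>>>
>>>>>> In other words (a rough sketch of the conceptual plan, not actual API
>>>>>> behaviour today):
>>>>>>
>>>>>> Table a = ...               // source -> operators
>>>>>> Table cachedA = a.cache()   // conceptually: a -> [caching operator]
>>>>>> cachedA.select(...)         // consumes the cached intermediate result
>>>>>> a.select(...)               // re-executes the operators producing a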
>>>>>> 
>>>>>> I agree, ideally the optimizer makes this kind of decision which
>>>>>> intermediate result should be cached. But especially when executing
>>>>> ad-hoc
>>>>>> queries the user might better know which results need to be cached
>>>>> because
>>>>>> Flink might not see the full DAG. In that sense, I would consider the
>>>>>> cache() method as a hint for the optimizer. Of course, in the future
>>> we
>>>>>> might add functionality which tries to automatically cache results
>>> (e.g.
>>>>>> caching the latest intermediate results until so and so much space is
>>>>>> used). But this should hopefully not contradict with `CachedTable
>>>>> cache()`.
>>>>>> 
>>>>>> Cheers,
>>>>>> Till
>>>>>> 
>>>>>> On Tue, Dec 4, 2018 at 2:33 PM Becket Qin <be...@gmail.com>
>>> wrote:
>>>>>> 
>>>>>>> Hi Till,
>>>>>>> 
>>>>>>> Thanks for the clarification. I am still a little confused.
>>>>>>> 
>>>>>>> If cache() returns a CachedTable, the example might become:
>>>>>>> 
>>>>>>> b = a.map(...)
>>>>>>> c = a.map(...)
>>>>>>> 
>>>>>>> cachedTableA = a.cache()
>>>>>>> d = cachedTableA.map(...)
>>>>>>> e = a.map()
>>>>>>> 
>>>>>>> In the above case, if cache() is lazily evaluated, b, c, d and e are
>>>>>>> all going to be reading from the original DAG that generates a. But
>>>>>>> with a naive expectation, d should be reading from the cache. This
>>>>>>> does not seem to solve the potential confusion you raised, right?
>>>>>>> 
>>>>>>> Just to be clear, my understanding is all based on the assumption that
>>>>>>> the tables are immutable. Therefore, after a.cache(), the *cachedTableA*
>>>>>>> and the original table *a* should be completely interchangeable.
>>>>>>> 
>>>>>>> That said, I think a valid argument is optimization. There are indeed
>>>>>>> cases where reading from the original DAG could be faster than reading
>>>>>>> from the cache. For example:
>>>>>>> 
>>>>>>> a.filter(f1' > 100)
>>>>>>> a.cache()
>>>>>>> b = a.filter(f1' < 100)
>>>>>>> 
>>>>>>> Ideally the optimizer should be intelligent enough to decide which way
>>>>>>> is faster, without user intervention. In this case, it would identify
>>>>>>> that b would just be an empty table, and thus skip reading from the
>>>>>>> cache completely.
>>>>>>> But I agree that returning a CachedTable would give the user control
>>>>>>> over when to use the cache, even though I still feel that letting the
>>>>>>> optimizer handle this is a better option in the long run.
>>>>>>> 
>>>>>>> Thanks,
>>>>>>> 
>>>>>>> Jiangjie (Becket) Qin
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> On Tue, Dec 4, 2018 at 6:51 PM Till Rohrmann <tr...@apache.org>
>>>>>> wrote:
>>>>>>> 
>>>>>>>> Yes you are right Becket that it still depends on the actual
>>>>> execution
>>>>>> of
>>>>>>>> the job whether a consumer reads from a cached result or not.
>>>>>>>> 
>>>>>>>> My point was actually about the properties of a (cached vs.
>>>>> non-cached)
>>>>>>> and
>>>>>>>> not about the execution. I would not make cache trigger the
>>> execution
>>>>>> of
>>>>>>>> the job because one loses some flexibility by eagerly triggering the
>>>>>>>> execution.
>>>>>>>> 
>>>>>>>> I tried to argue for an explicit CachedTable which is returned by
>>> the
>>>>>>>> cache() method like Piotr did in order to make the API more
>>> explicit.
>>>>>>>> 
>>>>>>>> Cheers,
>>>>>>>> Till
>>>>>>>> 
>>>>>>>> On Mon, Dec 3, 2018 at 4:23 PM Becket Qin <be...@gmail.com>
>>>>>> wrote:
>>>>>>>> 
>>>>>>>>> Hi Till,
>>>>>>>>> 
>>>>>>>>> That is a good example. Just a minor correction: in this case, b, c
>>>>>>>>> and d will all consume from a non-cached a. This is because the cache
>>>>>>>>> will only be created on the very first job submission that generates
>>>>>>>>> the table to be cached.
>>>>>>>>> 
>>>>>>>>> If I understand correctly, this example is about whether the .cache()
>>>>>>>>> method should be eagerly or lazily evaluated. In other words, if the
>>>>>>>>> cache() method actually triggers a job that creates the cache, there
>>>>>>>>> will be no such confusion. Is that right?
>>>>>>>>> 
>>>>>>>>> In the example, although d will not consume from the cached Table
>>>>>>>>> even though it looks like it should, from a correctness perspective
>>>>>>>>> the code will still return the correct result, assuming that tables
>>>>>>>>> are immutable.
>>>>>>>>> 
>>>>>>>>> Personally I feel it is OK because users probably won't really worry
>>>>>>>>> about whether the table is cached or not. And lazy caching could avoid
>>>>>>>>> some unnecessary caching if the cached table is never actually used in
>>>>>>>>> the user application. But I am not opposed to eager evaluation of the
>>>>>>>>> cache.
>>>>>>>>> 
>>>>>>>>> Thanks,
>>>>>>>>> 
>>>>>>>>> Jiangjie (Becket) Qin
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On Mon, Dec 3, 2018 at 10:01 PM Till Rohrmann <
>>>>> trohrmann@apache.org>
>>>>>>>>> wrote:
>>>>>>>>> 
>>>>>>>>>> Another argument for Piotr's point is that lazily changing properties
>>>>>>>>>> of a node affects all downstream consumers but does not necessarily
>>>>>>>>>> have to happen before these consumers are defined. From a user's
>>>>>>>>>> perspective this can be quite confusing:
>>>>>>>>>> 
>>>>>>>>>> b = a.map(...)
>>>>>>>>>> c = a.map(...)
>>>>>>>>>> 
>>>>>>>>>> a.cache()
>>>>>>>>>> d = a.map(...)
>>>>>>>>>> 
>>>>>>>>>> now b, c and d will consume from a cached operator. In this case,
>>>>>> the
>>>>>>>>> user
>>>>>>>>>> would most likely expect that only d reads from a cached result.
>>>>>>>>>> 
>>>>>>>>>> Cheers,
>>>>>>>>>> Till
>>>>>>>>>> 
>>>>>>>>>> On Mon, Dec 3, 2018 at 11:32 AM Piotr Nowojski <
>>>>>>>> piotr@data-artisans.com>
>>>>>>>>>> wrote:
>>>>>>>>>> 
>>>>>>>>>>> Hey Shaoxuan and Becket,
>>>>>>>>>>> 
>>>>>>>>>>>> Can you explain a bit more on what the side effects are? So
>>>>>> far
>>>>>>> my
>>>>>>>>>>>> understanding is that such side effects only exist if a table
>>>>>> is
>>>>>>>>>> mutable.
>>>>>>>>>>>> Is that the case?
>>>>>>>>>>> 
>>>>>>>>>>> Not only that. There are also performance implications, and those
>>>>>>>>>>> are another implicit side effect of using `void cache()`. As I wrote
>>>>>>>>>>> before, reading from the cache might not always be desirable, thus
>>>>>>>>>>> it can cause performance degradation and I’m fine with that - the
>>>>>>>>>>> user's or the optimiser’s choice. What I do not like is that this
>>>>>>>>>>> implicit side effect can manifest in a completely different part of
>>>>>>>>>>> the code that wasn’t touched by the user while he was adding the
>>>>>>>>>>> `void cache()` call somewhere else. And even if caching improves
>>>>>>>>>>> performance, it’s still a side effect of `void cache()`. Almost by
>>>>>>>>>>> definition, `void` methods have only side effects. As I wrote
>>>>>>>>>>> before, there are a couple of scenarios where this might be
>>>>>>>>>>> undesirable and/or unexpected, for example:
>>>>>>>>>>> 
>>>>>>>>>>> 1.
>>>>>>>>>>> Table b = …;
>>>>>>>>>>> b.cache()
>>>>>>>>>>> x = b.join(…)
>>>>>>>>>>> y = b.count()
>>>>>>>>>>> // ...
>>>>>>>>>>> // 100
>>>>>>>>>>> // hundred
>>>>>>>>>>> // lines
>>>>>>>>>>> // of
>>>>>>>>>>> // code
>>>>>>>>>>> // later
>>>>>>>>>>> z = b.filter(…).groupBy(…) // this might be even hidden in a
>>>>>>>> different
>>>>>>>>>>> method/file/package/dependency
>>>>>>>>>>> 
>>>>>>>>>>> 2.
>>>>>>>>>>> 
>>>>>>>>>>> Table b = ...
>>>>>>>>>>> If (some_condition) {
>>>>>>>>>>> foo(b)
>>>>>>>>>>> }
>>>>>>>>>>> Else {
>>>>>>>>>>> bar(b)
>>>>>>>>>>> }
>>>>>>>>>>> z = b.filter(…).groupBy(…)
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> Void foo(Table b) {
>>>>>>>>>>> b.cache()
>>>>>>>>>>> // do something with b
>>>>>>>>>>> }
>>>>>>>>>>> 
>>>>>>>>>>> In both above examples, `b.cache()` will implicitly affect
>>>>>>> (semantic
>>>>>>>>> of a
>>>>>>>>>>> program in case of sources being mutable and performance) `z =
>>>>>>>>>>> b.filter(…).groupBy(…)` which might be far from obvious.
>>>>>>>>>>> 
>>>>>>>>>>> On top of that, there is still this argument of mine that
>>>>> having
>>>>>> a
>>>>>>>>>>> `MaterializedTable` or `CachedTable` handle is more flexible
>>>>> for
>>>>>> us
>>>>>>>> for
>>>>>>>>>> the
>>>>>>>>>>> future and for the user (as a manual option to bypass cache
>>>>>> reads).
>>>>>>>>>>> 
>>>>>>>>>>>> But Jiangjie is correct,
>>>>>>>>>>>> the source table in batching should be immutable. It is the
>>>>>>> user’s
>>>>>>>>>>>> responsibility to ensure it, otherwise even a regular
>>>>> failover
>>>>>>> may
>>>>>>>>> lead
>>>>>>>>>>>> to inconsistent results.
>>>>>>>>>>> 
>>>>>>>>>>> Yes, I agree that’s what perfect world/good deployment should
>>>>> be.
>>>>>>> But
>>>>>>>>> it
>>>>>>>>>>> often isn’t and while I’m not trying to fix this (since the
>>>>>> proper
>>>>>>>> fix
>>>>>>>>> is
>>>>>>>>>>> to support transactions), I’m just trying to minimise confusion
>>>>>> for
>>>>>>>> the
>>>>>>>>>>> users that are not fully aware what’s going on and operate in
>>>>>> less
>>>>>>>> than
>>>>>>>>>>> perfect setup. And if something bites them after adding
>>>>>> `b.cache()`
>>>>>>>>> call,
>>>>>>>>>>> to make sure that they at least know all of the places that
>>>>>> adding
>>>>>>>> this
>>>>>>>>>>> line can affect.
>>>>>>>>>>> 
>>>>>>>>>>> Thanks, Piotrek
>>>>>>>>>>> 
>>>>>>>>>>>> On 1 Dec 2018, at 15:39, Becket Qin <be...@gmail.com>
>>>>>>> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>> Hi Piotrek,
>>>>>>>>>>>> 
>>>>>>>>>>>> Thanks again for the clarification. Some more replies are
>>>>>>>> following.
>>>>>>>>>>>> 
>>>>>>>>>>>> But keep in mind that `.cache()` will/might not only be used
>>>>> in
>>>>>>>>>>> interactive
>>>>>>>>>>>>> programming and not only in batching.
>>>>>>>>>>>> 
>>>>>>>>>>>> It is true. Actually in stream processing, cache() has the
>>>>> same
>>>>>>>>>> semantic
>>>>>>>>>>> as
>>>>>>>>>>>> batch processing. The semantic is following:
>>>>>>>>>>>> For a table created via a series of computation, save that
>>>>>> table
>>>>>>>> for
>>>>>>>>>>> later
>>>>>>>>>>>> reference to avoid running the computation logic to
>>>>> regenerate
>>>>>>> the
>>>>>>>>>> table.
>>>>>>>>>>>> Once the application exits, drop all the cache.
>>>>>>>>>>>> This semantic is same for both batch and stream processing.
>>>>> The
>>>>>>>>>>> difference
>>>>>>>>>>>> is that stream applications will only run once as they are
>>>>> long
>>>>>>>>>> running.
>>>>>>>>>>>> And the batch applications may be run multiple times, hence
>>>>> the
>>>>>>>> cache
>>>>>>>>>> may
>>>>>>>>>>>> be created and dropped each time the application runs.
>>>>>>>>>>>> Admittedly, there will probably be some resource management
>>>>>>>>>> requirements
>>>>>>>>>>>> for the streaming cached table, such as time based / size
>>>>> based
>>>>>>>>>>> retention,
>>>>>>>>>>>> to address the infinite data issue. But such requirement does
>>>>>> not
>>>>>>>>>> change
>>>>>>>>>>>> the semantic.
>>>>>>>>>>>> You are right that interactive programming is just one use
>>>>> case
>>>>>>> of
>>>>>>>>>>> cache().
>>>>>>>>>>>> It is not the only use case.
>>>>>>>>>>>> 
>>>>>>>>>>>> For me the more important issue is of not having the `void
>>>>>>> cache()`
>>>>>>>>>> with
>>>>>>>>>>>>> side effects.
>>>>>>>>>>>> 
>>>>>>>>>>>> This is indeed the key point. The argument around whether
>>>>>> cache()
>>>>>>>>>> should
>>>>>>>>>>>> return something already indicates that cache() and
>>>>>> materialize()
>>>>>>>>>> address
>>>>>>>>>>>> different issues.
>>>>>>>>>>>> Can you explain a bit more on what the side effects are? So
>>>>>> far
>>>>>>> my
>>>>>>>>>>>> understanding is that such side effects only exist if a table
>>>>>> is
>>>>>>>>>> mutable.
>>>>>>>>>>>> Is that the case?
>>>>>>>>>>>> 
>>>>>>>>>>>> I don’t know, probably initially we should make CachedTable
>>>>>>>>> read-only.
>>>>>>>>>> I
>>>>>>>>>>>>> don’t find it more confusing than the fact that user can not
>>>>>>> write
>>>>>>>>> to
>>>>>>>>>>> views
>>>>>>>>>>>>> or materialised views in SQL or that user currently can not
>>>>>>> write
>>>>>>>>> to a
>>>>>>>>>>>>> Table.
>>>>>>>>>>>> 
>>>>>>>>>>>> I don't think anyone should insert something to a cache. By
>>>>>>>>> definition
>>>>>>>>>>> the
>>>>>>>>>>>> cache should only be updated when the corresponding original
>>>>>>> table
>>>>>>>> is
>>>>>>>>>>>> updated. What I am wondering is that given the following two
>>>>>>> facts:
>>>>>>>>>>>> 1. If and only if a table is mutable (with something like
>>>>>>>> insert()),
>>>>>>>>> a
>>>>>>>>>>>> CachedTable may have implicit behavior.
>>>>>>>>>>>> 2. A CachedTable extends a Table.
>>>>>>>>>>>> We can come to the conclusion that a CachedTable is mutable
>>>>> and
>>>>>>>> users
>>>>>>>>>> can
>>>>>>>>>>>> insert into the CachedTable directly. This is what I found
>>>>>>>>>> confusing.
>>>>>>>>>>>> 
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> 
>>>>>>>>>>>> Jiangjie (Becket) Qin
>>>>>>>>>>>> 
>>>>>>>>>>>> On Sat, Dec 1, 2018 at 2:45 AM Piotr Nowojski <
>>>>>>>>> piotr@data-artisans.com
>>>>>>>>>>> 
>>>>>>>>>>>> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Regarding naming `cache()` vs `materialize()`. One more
>>>>>>>> explanation
>>>>>>>>>> why
>>>>>>>>>>> I
>>>>>>>>>>>>> think `materialize()` is more natural to me is that I think
>>>>> of
>>>>>>> all
>>>>>>>>>>> “Table”s
>>>>>>>>>>>>> in Table-API as views. They behave the same way as SQL
>>>>> views,
>>>>>>> the
>>>>>>>>> only
>>>>>>>>>>>>> difference for me is that their live scope is short -
>>>>> current
>>>>>>>>> session
>>>>>>>>>>> which
>>>>>>>>>>>>> is limited by different execution model. That’s why
>>>>> “caching”
>>>>>> a
>>>>>>>> view
>>>>>>>>>>> for me
>>>>>>>>>>>>> is just materialising it.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> However I see and I understand your point of view. Coming
>>>>> from
>>>>>>>>>>>>> DataSet/DataStream and generally speaking non-SQL world,
>>>>>>> `cache()`
>>>>>>>>> is
>>>>>>>>>>> more
>>>>>>>>>>>>> natural. But keep in mind that `.cache()` will/might not
>>>>> only
>>>>>> be
>>>>>>>>> used
>>>>>>>>>> in
>>>>>>>>>>>>> interactive programming and not only in batching. But naming
>>>>>> is
>>>>>>>> one
>>>>>>>>>>> issue,
>>>>>>>>>>>>> and not that critical to me. Especially that once we
>>>>> implement
>>>>>>>>> proper
>>>>>>>>>>>>> materialised views, we can always deprecate/rename `cache()`
>>>>>> if
>>>>>>> we
>>>>>>>>>> deem
>>>>>>>>>>> so.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> For me the more important issue is of not having the `void
>>>>>>>> cache()`
>>>>>>>>>> with
>>>>>>>>>>>>> side effects. Exactly for the reasons that you have
>>>>> mentioned.
>>>>>>>> True:
>>>>>>>>>>>>> results might be non deterministic if underlying source
>>>>> table
>>>>>>> are
>>>>>>>>>>> changing.
>>>>>>>>>>>>> Problem is that `void cache()` implicitly changes the
>>>>> semantic
>>>>>>> of
>>>>>>>>>>>>> subsequent uses of the cached/materialized Table. It can
>>>>> cause
>>>>>>>> “wtf”
>>>>>>>>>>> moment
>>>>>>>>>>>>> for a user if he inserts “b.cache()” call in some place in
>>>>> his
>>>>>>>> code
>>>>>>>>>> and
>>>>>>>>>>>>> suddenly some other random places are behaving differently.
>>>>> If
>>>>>>>>>>>>> `materialize()` or `cache()` returns a Table handle, we
>>>>> force
>>>>>>> user
>>>>>>>>> to
>>>>>>>>>>>>> explicitly use the cache which removes the “random” part
>>>>> from
>>>>>>> the
>>>>>>>>>>> "suddenly
>>>>>>>>>>>>> some other random places are behaving differently”.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> This argument and others that I’ve raised (greater
>>>>>>>>>> flexibility/allowing
>>>>>>>>>>>>> user to explicitly bypass the cache) are independent of
>>>>>>> `cache()`
>>>>>>>> vs
>>>>>>>>>>>>> `materialize()` discussion.
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Does that mean one can also insert into the CachedTable?
>>>>> This
>>>>>>>>> sounds
>>>>>>>>>>>>> pretty confusing.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> I don’t know, probably initially we should make CachedTable
>>>>>>>>>> read-only. I
>>>>>>>>>>>>> don’t find it more confusing than the fact that user can not
>>>>>>> write
>>>>>>>>> to
>>>>>>>>>>> views
>>>>>>>>>>>>> or materialised views in SQL or that user currently can not
>>>>>>> write
>>>>>>>>> to a
>>>>>>>>>>>>> Table.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Piotrek
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On 30 Nov 2018, at 17:38, Xingcan Cui <xi...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> I agree with @Becket that `cache()` and `materialize()`
>>>>>> should
>>>>>>> be
>>>>>>>>>>>>>> considered as two different methods where the latter one is
>>>>>> more
>>>>>>>>>>>>> sophisticated.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> According to my understanding, the initial idea is just to
>>>>>>>>> introduce
>>>>>>>>>> a
>>>>>>>>>>>>> simple cache or persist mechanism, but as the TableAPI is a
>>>>>>>>> high-level
>>>>>>>>>>> API,
>>>>>>>>>>>>>> it’s natural for us to think in a SQL way.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Maybe we can add the `cache()` method to the DataSet API
>>>>> and
>>>>>>>> force
>>>>>>>>>>> users
>>>>>>>>>>>>> to translate a Table to a Dataset before caching it. Then
>>>>> the
>>>>>>>> users
>>>>>>>>>>> should
>>>>>>>>>>>>> manually register the cached dataset to a table again (we
>>>>> may
>>>>>>> need
>>>>>>>>>> some
>>>>>>>>>>>>> table replacement mechanisms for datasets with an identical
>>>>>>> schema
>>>>>>>>> but
>>>>>>>>>>>>> different contents here). After all, it’s the dataset rather
>>>>>>> than
>>>>>>>>> the
>>>>>>>>>>>>>> dynamic table that needs to be cached, right?
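>>>>>>>>>>>>>>
>>>>>>>>>>>>>> A rough sketch of that flow (ds.cache() is the hypothetical part;
>>>>>>>>>>>>>> the translation and registration calls are the existing
>>>>>>>>>>>>>> BatchTableEnvironment ones):
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> DataSet<Row> ds = tEnv.toDataSet(t1, Row.class);
>>>>>>>>>>>>>> DataSet<Row> cached = ds.cache(); // hypothetical DataSet#cache()
>>>>>>>>>>>>>> tEnv.registerDataSet("t1Cached", cached);
>>>>>>>>>>>>>> Table t1Cached = tEnv.scan("t1Cached");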
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>> Xingcan
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> On Nov 30, 2018, at 10:57 AM, Becket Qin <
>>>>>>> becket.qin@gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Hi Piotrek and Jark,
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Thanks for the feedback and explanation. Those are good
>>>>>>>> arguments.
>>>>>>>>>>> But I
>>>>>>>>>>>>>>> think those arguments are mostly about materialized view.
>>>>>> Let
>>>>>>> me
>>>>>>>>> try
>>>>>>>>>>> to
>>>>>>>>>>>>>>> explain the reason I believe cache() and materialize() are
>>>>>>>>>> different.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> I think cache() and materialize() have quite different
>>>>>>>>> implications.
>>>>>>>>>>> An
>>>>>>>>>>>>>>> analogy I can think of is save()/publish(). When users
>>>>> call
>>>>>>>>> cache(),
>>>>>>>>>>> it
>>>>>>>>>>>>> is
>>>>>>>>>>>>>>> just like they are saving an intermediate result as a
>>>>> draft
>>>>>> of
>>>>>>>>> their
>>>>>>>>>>>>> work,
>>>>>>>>>>>>>>> this intermediate result may not have any realistic
>>>>> meaning.
>>>>>>>>> Calling
>>>>>>>>>>>>>>> cache() does not mean users want to publish the cached
>>>>> table
>>>>>>> in
>>>>>>>>> any
>>>>>>>>>>>>> manner.
>>>>>>>>>>>>>>> But when users call materialize(), that means "I have
>>>>>>> something
>>>>>>>>>>>>> meaningful
>>>>>>>>>>>>>>> to be reused by others", now users need to think about the
>>>>>>>>>> validation,
>>>>>>>>>>>>>>> update & versioning, lifecycle of the result, etc.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Piotrek's suggestions on variations of the materialize()
>>>>>>> methods
>>>>>>>>> are
>>>>>>>>>>>>> very
>>>>>>>>>>>>>>> useful. It would be great if Flink have them. The concept
>>>>> of
>>>>>>>>>>>>> materialized
>>>>>>>>>>>>>>> view is actually a pretty big feature, not to say the
>>>>>> related
>>>>>>>>> stuff
>>>>>>>>>>> like
>>>>>>>>>>>>>>> triggers/hooks you mentioned earlier. I think the
>>>>>> materialized
>>>>>>>>> view
>>>>>>>>>>>>> itself
>>>>>>>>>>>>>>> should be discussed in a more thorough and systematic
>>>>>> manner.
>>>>>>>> And
>>>>>>>>> I
>>>>>>>>>>>>> found
>>>>>>>>>>>>>>> that discussion is kind of orthogonal and way beyond
>>>>>>> interactive
>>>>>>>>>>>>>>> programming experience.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> The example you gave was interesting. I still have some
>>>>>>>> questions,
>>>>>>>>>>>>> though.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Table source = … // some source that scans files from a
>>>>>>>> directory
>>>>>>>>>>>>>>>> “/foo/bar/“
>>>>>>>>>>>>>>>> Table t1 = source.groupBy(…).select(…).where(…) ….;
>>>>>>>>>>>>>>>> Table t2 = t1.materialize() // (or `cache()`)
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> t2.count() // initialise cache (if it’s lazily
>>>>> initialised)
>>>>>>>>>>>>>>>> int a1 = t1.count()
>>>>>>>>>>>>>>>> int b1 = t2.count()
>>>>>>>>>>>>>>>> // something in the background (or we trigger it) writes
>>>>>> new
>>>>>>>>> files
>>>>>>>>>> to
>>>>>>>>>>>>>>>> /foo/bar
>>>>>>>>>>>>>>>> int a2 = t1.count()
>>>>>>>>>>>>>>>> int b2 = t2.count()
>>>>>>>>>>>>>>>> t2.refresh() // possible future extension, not to be
>>>>>>>> implemented
>>>>>>>>> in
>>>>>>>>>>> the
>>>>>>>>>>>>>>>> initial version
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> what if someone else added some more files to /foo/bar at
>>>>>> this
>>>>>>>>>> point?
>>>>>>>>>>> In
>>>>>>>>>>>>>>> that case, a3 won't equal b3, and the result becomes
>>>>>>>>>>>>> non-deterministic,
>>>>>>>>>>>>>>> right?
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> int a3 = t1.count()
>>>>>>>>>>>>>>>> int b3 = t2.count()
>>>>>>>>>>>>>>>> t2.drop() // another possible future extension, manual
>>>>>>> “cache”
>>>>>>>>>>> dropping
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> When we talk about interactive programming, in most cases,
>>>>>> we
>>>>>>>> are
>>>>>>>>>>>>> talking
>>>>>>>>>>>>>>> about batch applications. A fundamental assumption of such
>>>>>>> case
>>>>>>>> is
>>>>>>>>>>> that
>>>>>>>>>>>>> the
>>>>>>>>>>>>>>> source data is complete before the data processing begins,
>>>>>> and
>>>>>>>> the
>>>>>>>>>>> data
>>>>>>>>>>>>>>> will not change during the data processing. IMO, if
>>>>>> additional
>>>>>>>>> rows
>>>>>>>>>>>>> needs
>>>>>>>>>>>>>>> to be added to some source during the processing, it
>>>>> should
>>>>>> be
>>>>>>>>> done
>>>>>>>>>> in
>>>>>>>>>>>>> ways
>>>>>>>>>>>>>>> like union the source with another table containing the
>>>>> rows
>>>>>>> to
>>>>>>>> be
>>>>>>>>>>>>> added.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> There are a few cases that computations are executed
>>>>>>> repeatedly
>>>>>>>> on
>>>>>>>>>> the
>>>>>>>>>>>>>>> changing data source.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> For example, people may run a ML training job every hour
>>>>>> with
>>>>>>>> the
>>>>>>>>>>>>> samples
>>>>>>>>>>>>>>> newly added in the past hour. In that case, the source
>>>>> data
>>>>>>>>> between
>>>>>>>>>>> will
>>>>>>>>>>>>>>> indeed change. But still, the data remain unchanged within
>>>>>> one
>>>>>>>>> run.
>>>>>>>>>>> And
>>>>>>>>>>>>>>> usually in that case, the result will need versioning,
>>>>> i.e.
>>>>>>> for
>>>>>>>> a
>>>>>>>>>>> given
>>>>>>>>>>>>>>> result, it tells that the result is a result from the
>>>>> source
>>>>>>>> data
>>>>>>>>>> by a
>>>>>>>>>>>>>>> certain timestamp.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Another example is something like data warehouse. In this
>>>>>>> case,
>>>>>>>>>> there
>>>>>>>>>>>>> are a
>>>>>>>>>>>>>>> few source of original/raw data. On top of those sources,
>>>>>> many
>>>>>>>>>>>>> materialized
>>>>>>>>>>>>>>> view / queries / reports / dashboards can be created to
>>>>>>> generate
>>>>>>>>>>> derived
>>>>>>>>>>>>>>> data. Those derived data needs to be updated when the
>>>>>>> underlying
>>>>>>>>>>>>> original
>>>>>>>>>>>>>>> data changes. In that case, the processing logic that
>>>>>> derives
>>>>>>>> the
>>>>>>>>>>>>> original
>>>>>>>>>>>>>>> data needs to be executed repeatedly to update those
>>>>>>>>> reports/views.
>>>>>>>>>>>>> Again,
>>>>>>>>>>>>>>> all those derived data also need to have version
>>>>> management,
>>>>>>>> such
>>>>>>>>> as
>>>>>>>>>>>>>>> timestamp.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> In any of the above two cases, during a single run of the
>>>>>>>>> processing
>>>>>>>>>>>>> logic,
>>>>>>>>>>>>>>> the data cannot change. Otherwise the behavior of the
>>>>>>> processing
>>>>>>>>>> logic
>>>>>>>>>>>>> may
>>>>>>>>>>>>>>> be undefined. In the above two examples, when writing the
>>>>>>>>> processing
>>>>>>>>>>>>> logic,
>>>>>>>>>>>>>>> Users can use .cache() to hint Flink that those results
>>>>>> should
>>>>>>>> be
>>>>>>>>>>> saved
>>>>>>>>>>>>> to
>>>>>>>>>>>>>>> avoid repeated computation. And then for the result of my
>>>>>>>>>> application
>>>>>>>>>>>>>>> logic, I'll call materialize(), so that these results
>>>>> could
>>>>>> be
>>>>>>>>>> managed
>>>>>>>>>>>>> by
>>>>>>>>>>>>>>> the system with versioning, metadata management, lifecycle
>>>>>>>>>> management,
>>>>>>>>>>>>>>> ACLs, etc.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> It is true we can use materialize() to do the cache() job,
>>>>>>> but I
>>>>>>>>> am
>>>>>>>>>>>>> really
>>>>>>>>>>>>>>> reluctant to shoehorn cache() into materialize() and force
>>>>>>> users
>>>>>>>>> to
>>>>>>>>>>>>> worry
>>>>>>>>>>>>>>> about a bunch of implications that they needn't have to. I
>>>>>> am
>>>>>>>>>>>>> absolutely on
>>>>>>>>>>>>>>> your side that redundant API is bad. But it is equally
>>>>>>>>> frustrating,
>>>>>>>>>> if
>>>>>>>>>>>>> not
>>>>>>>>>>>>>>> more, that the same API does different things.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Jiangjie (Becket) Qin
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> On Fri, Nov 30, 2018 at 10:34 PM Shaoxuan Wang <
>>>>>>>>> wshaoxuan@gmail.com
>>>>>>>>>>> 
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Thanks Piotrek,
>>>>>>>>>>>>>>>> You provided a very good example, it explains all the
>>>>>>>> confusions
>>>>>>>>> I
>>>>>>>>>>>>> have.
>>>>>>>>>>>>>>>> It is clear that there is something we have not
>>>>> considered
>>>>>> in
>>>>>>>> the
>>>>>>>>>>>>> initial
>>>>>>>>>>>>>>>> proposal. We intend to force the user to reuse the
>>>>>>>>>>> cached/materialized
>>>>>>>>>>>>>>>> table, if its cache() method is executed. We did not
>>>>> expect
>>>>>>>> that
>>>>>>>>>> user
>>>>>>>>>>>>> may
>>>>>>>>>>>>>>>> want to re-executed the plan from the source table. Let
>>>>> me
>>>>>>>>> re-think
>>>>>>>>>>>>> about
>>>>>>>>>>>>>>>> it and get back to you later.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> In the meanwhile, this example/observation also infers
>>>>> that
>>>>>>> we
>>>>>>>>>> cannot
>>>>>>>>>>>>> fully
>>>>>>>>>>>>>>>> involve the optimizer to decide the plan if a
>>>>>>> cache/materialize
>>>>>>>>> is
>>>>>>>>>>>>>>>> explicitly used, because weather to reuse the cache data
>>>>> or
>>>>>>>>>>> re-execute
>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>> query from source data may lead to different results.
>>>>> (But
>>>>>> I
>>>>>>>>> guess
>>>>>>>>>>>>>>>> optimizer can still help in some cases ---- as long as it
>>>>>>> does
>>>>>>>>> not
>>>>>>>>>>>>>>>> re-execute from the varied source, we should be safe).
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>> Shaoxuan
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> On Fri, Nov 30, 2018 at 9:13 PM Piotr Nowojski <
>>>>>>>>>>>>> piotr@data-artisans.com>
>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Hi Shaoxuan,
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Re 2:
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Table t3 = methodThatAppliesOperators(t1) // t1 is
>>>>>> modified
>>>>>>>>> to->
>>>>>>>>>>> t1’
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> What do you mean that “ t1 is modified to-> t1’ ” ? That
>>>>>>>>>>>>>>>>> `methodThatAppliesOperators()` method has changed its
>>>>>> plan?
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> I was thinking more about something like this:
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Table source = … // some source that scans files from a
>>>>>>>>> directory
>>>>>>>>>>>>>>>>> “/foo/bar/“
>>>>>>>>>>>>>>>>> Table t1 = source.groupBy(…).select(…).where(…) ….;
>>>>>>>>>>>>>>>>> Table t2 = t1.materialize() // (or `cache()`)
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> t2.count() // initialise cache (if it’s lazily
>>>>>> initialised)
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> int a1 = t1.count()
>>>>>>>>>>>>>>>>> int b1 = t2.count()
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> // something in the background (or we trigger it) writes
>>>>>> new
>>>>>>>>> files
>>>>>>>>>>> to
>>>>>>>>>>>>>>>>> /foo/bar
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> int a2 = t1.count()
>>>>>>>>>>>>>>>>> int b2 = t2.count()
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> t2.refresh() // possible future extension, not to be
>>>>>>>> implemented
>>>>>>>>>> in
>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>> initial version
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> int a3 = t1.count()
>>>>>>>>>>>>>>>>> int b3 = t2.count()
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> t2.drop() // another possible future extension, manual
>>>>>>> “cache”
>>>>>>>>>>>>> dropping
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> assertTrue(a1 == b1) // same results, but b1 comes from
>>>>>> the
>>>>>>>>>> “cache"
>>>>>>>>>>>>>>>>> assertTrue(b1 == b2) // both values come from the same
>>>>>> cache
>>>>>>>>>>>>>>>>> assertTrue(a2 > b2) // b2 comes from cache, a2
>>>>> re-executed
>>>>>>>> full
>>>>>>>>>>> table
>>>>>>>>>>>>>>>> scan
>>>>>>>>>>>>>>>>> and has more data
>>>>>>>>>>>>>>>>> assertTrue(b3 > b2) // b3 comes from refreshed cache
>>>>>>>>>>>>>>>>> assertTrue(b3 == a2 == a3)
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Piotrek
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> On 30 Nov 2018, at 10:22, Jark Wu <im...@gmail.com>
>>>>>>> wrote:
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> It is an very interesting and useful design!
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Here I want to share some of my thoughts:
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 1. Agree with that cache() method should return some
>>>>>> Table
>>>>>>> to
>>>>>>>>>> avoid
>>>>>>>>>>>>>>>> some
>>>>>>>>>>>>>>>>>> unexpected problems because of the mutable object.
>>>>>>>>>>>>>>>>>> All the existing methods of Table are returning a new
>>>>>> Table
>>>>>>>>>>> instance.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 2. I think materialize() would be more consistent with
>>>>>> SQL,
>>>>>>>>> this
>>>>>>>>>>>>> makes
>>>>>>>>>>>>>>>> it
>>>>>>>>>>>>>>>>>> possible to support the same feature for SQL
>>>>> (materialize
>>>>>>>> view)
>>>>>>>>>> and
>>>>>>>>>>>>>>>> keep
>>>>>>>>>>>>>>>>>> the same API for users in the future.
>>>>>>>>>>>>>>>>>> But I'm also fine if we choose cache().
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 3. In the proposal, a TableService (or FlinkService?)
>>>>> is
>>>>>>> used
>>>>>>>>> to
>>>>>>>>>>>>> cache
>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>> result of the (intermediate) table.
>>>>>>>>>>>>>>>>>> But the name of TableService may be a bit general which
>>>>>> is
>>>>>>>> not
>>>>>>>>>>> quite
>>>>>>>>>>>>>>>>>> understanding correctly in the first glance (a
>>>>> metastore
>>>>>>> for
>>>>>>>>>>>>> tables?).
>>>>>>>>>>>>>>>>>> Maybe a more specific name would be better, such as
>>>>>>>>>>> TableCacheService
>>>>>>>>>>>>>>>> or
>>>>>>>>>>>>>>>>>> TableMaterializeService or something else.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>>>> Jark
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> On Thu, 29 Nov 2018 at 21:16, Fabian Hueske <
>>>>>>>> fhueske@gmail.com
>>>>>>>>>> 
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Thanks for the clarification Becket!
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> I have a few thoughts to share / questions:
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> 1) I'd like to know how you plan to implement the
>>>>>> feature
>>>>>>>> on a
>>>>>>>>>>> plan
>>>>>>>>>>>>> /
>>>>>>>>>>>>>>>>>>> planner level.
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> I would imagine the following to happen when
>>>>>> Table.cache()
>>>>>>>> is
>>>>>>>>>>>>> called:
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> 1) immediately optimize the Table and internally
>>>>> convert
>>>>>>> it
>>>>>>>>>> into a
>>>>>>>>>>>>>>>>>>> DataSet/DataStream. This is necessary, to avoid that
>>>>>>>> operators
>>>>>>>>>> of
>>>>>>>>>>>>>>>> later
>>>>>>>>>>>>>>>>>>> queries on top of the Table are pushed down.
>>>>>>>>>>>>>>>>>>> 2) register the DataSet/DataStream as a
>>>>>>>>>> DataSet/DataStream-backed
>>>>>>>>>>>>>>>> Table
>>>>>>>>>>>>>>>>> X
>>>>>>>>>>>>>>>>>>> 3) add a sink to the DataSet/DataStream. This is the
>>>>>>>>>>> materialization
>>>>>>>>>>>>>>>> of
>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>> Table X
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Based on your proposal the following would happen:
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Table t1 = ....
>>>>>>>>>>>>>>>>>>> t1.cache(); // cache() returns void. The logical plan
>>>>> of
>>>>>>> t1
>>>>>>>> is
>>>>>>>>>>>>>>>> replaced
>>>>>>>>>>>>>>>>> by
>>>>>>>>>>>>>>>>>>> a scan of X. There is also a reference to the
>>>>>>>> materialization
>>>>>>>>> of
>>>>>>>>>>> X.
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> t1.count(); // this executes the program, including
>>>>> the
>>>>>>>>>>>>>>>>> DataSet/DataStream
>>>>>>>>>>>>>>>>>>> that backs X and the sink that writes the
>>>>>> materialization
>>>>>>>> of X
>>>>>>>>>>>>>>>>>>> t1.count(); // this executes the program, but reads X
>>>>>> from
>>>>>>>> the
>>>>>>>>>>>>>>>>>>> materialization.
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> My question is, how do you determine whether the
>>>>>> scan
>>>>>>>> of
>>>>>>>>> t1
>>>>>>>>>>>>>>>> should
>>>>>>>>>>>>>>>>> go
>>>>>>>>>>>>>>>>>>> against the DataSet/DataStream program and when
>>>>> against
>>>>>>> the
>>>>>>>>>>>>>>>>>>> materialization?
>>>>>>>>>>>>>>>>>>> AFAIK, there is no hook that will tell you that a part
>>>>>> of
>>>>>>>> the
>>>>>>>>>>>>> program
>>>>>>>>>>>>>>>>> was
>>>>>>>>>>>>>>>>>>> executed. Flipping a switch during optimization or
>>>>> plan
>>>>>>>>>> generation
>>>>>>>>>>>>> is
>>>>>>>>>>>>>>>>> not
>>>>>>>>>>>>>>>>>>> sufficient as there is no guarantee that the plan is
>>>>>> also
>>>>>>>>>>> executed.
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Overall, this behavior is somewhat similar to what I
>>>>>>>> proposed
>>>>>>>>> in
>>>>>>>>>>>>>>>>>>> FLINK-8950, which does not include persisting the
>>>>> table,
>>>>>>> but
>>>>>>>>>> just
>>>>>>>>>>>>>>>>>>> optimizing and reregistering it as DataSet/DataStream
>>>>>>> scan.
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> 2) I think Piotr has a point about the implicit
>>>>> behavior
>>>>>>> and
>>>>>>>>>> side
>>>>>>>>>>>>>>>>> effects
>>>>>>>>>>>>>>>>>>> of the cache() method if it does not return anything.
>>>>>>>>>>>>>>>>>>> Consider the following example:
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Table t1 = ???
>>>>>>>>>>>>>>>>>>> Table t2 = methodThatAppliesOperators(t1);
>>>>>>>>>>>>>>>>>>> Table t3 = methodThatAppliesOtherOperators(t1);
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> In this case, the behavior/performance of the plan
>>>>> that
>>>>>>>>> results
>>>>>>>>>>> from
>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>> second method call depends on whether t1 was modified
>>>>> by
>>>>>>> the
>>>>>>>>>> first
>>>>>>>>>>>>>>>>> method
>>>>>>>>>>>>>>>>>>> or not.
>>>>>>>>>>>>>>>>>>> This is the classic issue of mutable vs. immutable
>>>>>>> objects.
>>>>>>>>>>>>>>>>>>> Also, as Piotr pointed out, it might also be good to
>>>>>> have
>>>>>>>> the
>>>>>>>>>>>>> original
>>>>>>>>>>>>>>>>> plan
>>>>>>>>>>>>>>>>>>> of t1, because in some cases it is possible to push
>>>>>>> filters
>>>>>>>>> down
>>>>>>>>>>>>> such
>>>>>>>>>>>>>>>>> that
>>>>>>>>>>>>>>>>>>> evaluating the query from scratch might be more
>>>>>> efficient
>>>>>>>> than
>>>>>>>>>>>>>>>> accessing
>>>>>>>>>>>>>>>>>>> the cache.
>>>>>>>>>>>>>>>>>>> Moreover, a CachedTable could extend Table() and
>>>>> offer a
>>>>>>>>> method
>>>>>>>>>>>>>>>>> refresh().
>>>>>>>>>>>>>>>>>>> This sounds quite useful in an interactive session
>>>>> mode.
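>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> A minimal sketch of that shape (names are placeholders, not a
>>>>>>>>>>>>>>>>>>> proposal of the final API):
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> class CachedTable extends Table {
>>>>>>>>>>>>>>>>>>>   void refresh() { ... } // re-run the original plan, refresh the cache
>>>>>>>>>>>>>>>>>>>   void drop() { ... }    // release the cached intermediate result
>>>>>>>>>>>>>>>>>>> }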
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> 3) Regarding the name, I can see both arguments. IMO,
>>>>>>>>>>> materialize()
>>>>>>>>>>>>>>>>> seems
>>>>>>>>>>>>>>>>>>> to be more future proof.
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Best, Fabian
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Am Do., 29. Nov. 2018 um 12:56 Uhr schrieb Shaoxuan
>>>>>> Wang <
>>>>>>>>>>>>>>>>>>> wshaoxuan@gmail.com>:
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> Hi Piotr,
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> Thanks for sharing your ideas on the method naming.
>>>>> We
>>>>>>> will
>>>>>>>>>> think
>>>>>>>>>>>>>>>> about
>>>>>>>>>>>>>>>>>>>> your suggestions. But I don't understand why we need
>>>>> to
>>>>>>>>> change
>>>>>>>>>>> the
>>>>>>>>>>>>>>>>> return
>>>>>>>>>>>>>>>>>>>> type of cache().
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> Cache() is a physical operation, it does not change
>>>>> the
>>>>>>>> logic
>>>>>>>>>> of
>>>>>>>>>>>>>>>>>>>> the `Table`. On the tableAPI layer, we should not
>>>>>>>> introduce a
>>>>>>>>>> new
>>>>>>>>>>>>>>>> table
>>>>>>>>>>>>>>>>>>>> type unless the logic of table has been changed. If
>>>>> we
>>>>>>>>>> introduce
>>>>>>>>>>> a
>>>>>>>>>>>>>>>> new
>>>>>>>>>>>>>>>>>>>> table type `CachedTable`, we need to create the same set
>>>>>> of
>>>>>>>>>> methods
>>>>>>>>>>> of
>>>>>>>>>>>>>>>>>>> `Table`
>>>>>>>>>>>>>>>>>>>> for it. I don't think it is worth doing this. Or can
>>>>>> you
>>>>>>>>> please
>>>>>>>>>>>>>>>>> elaborate
>>>>>>>>>>>>>>>>>>>> more on what could be the "implicit behaviours/side
>>>>>>>> effects"
>>>>>>>>>> you
>>>>>>>>>>>>> are
>>>>>>>>>>>>>>>>>>>> thinking about?
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>>> Shaoxuan
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> On Thu, Nov 29, 2018 at 7:05 PM Piotr Nowojski <
>>>>>>>>>>>>>>>>> piotr@data-artisans.com>
>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> Hi Becket,
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> Thanks for the response.
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> 1. I wasn’t saying that materialised view must be
>>>>>>> mutable
>>>>>>>> or
>>>>>>>>>>> not.
>>>>>>>>>>>>>>>> The
>>>>>>>>>>>>>>>>>>>> same
>>>>>>>>>>>>>>>>>>>>> thing applies to caches as well. To the contrary, I
>>>>>>> would
>>>>>>>>>> expect
>>>>>>>>>>>>>>>> more
>>>>>>>>>>>>>>>>>>>>> consistency and updates from something that is
>>>>> called
>>>>>>>>> “cache”
>>>>>>>>>> vs
>>>>>>>>>>>>>>>>>>>> something
>>>>>>>>>>>>>>>>>>>>> that’s a “materialised view”. In other words, IMO
>>>>> most
>>>>>>>>> caches
>>>>>>>>>> do
>>>>>>>>>>>>> not
>>>>>>>>>>>>>>>>>>>> serve
>>>>>>>>>>>>>>>>>>>>> you invalid/outdated data and they handle updates on
>>>>>>> their
>>>>>>>>>> own.
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> 2. I don’t think that having in the future two very
>>>>>>>> similar
>>>>>>>>>>>>> concepts
>>>>>>>>>>>>>>>>> of
>>>>>>>>>>>>>>>>>>>>> `materialized` view and `cache` is a good idea. It
>>>>>> would
>>>>>>>> be
>>>>>>>>>>>>>>>> confusing
>>>>>>>>>>>>>>>>>>> for
>>>>>>>>>>>>>>>>>>>>> the users. I think it could be handled by
>>>>>>>>>> variations/overloading
>>>>>>>>>>>>> of
>>>>>>>>>>>>>>>>>>>>> materialised view concept. We could start with:
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> `MaterializedTable materialize()` - immutable,
>>>>> session
>>>>>>>> life
>>>>>>>>>>> scope
>>>>>>>>>>>>>>>>>>>>> (basically the same semantic as you are proposing
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> And then in the future (if ever) build on top of
>>>>>>>> that/expand
>>>>>>>>>> it
>>>>>>>>>>>>>>>> with:
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> `MaterializedTable materialize(refreshTime=…)` or
>>>>>>>>>>>>> `MaterializedTable
>>>>>>>>>>>>>>>>>>>>> materialize(refreshHook=…)`
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> Or with cross session support:
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> `MaterializedTable materializeInto(connector=…)` or
>>>>>>>>>>>>>>>> `MaterializedTable
>>>>>>>>>>>>>>>>>>>>> materializeInto(tableFactory=…)`
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> I’m not saying that we should implement cross
>>>>>>>>>> session/refreshing
>>>>>>>>>>>>> now
>>>>>>>>>>>>>>>>> or
>>>>>>>>>>>>>>>>>>>>> even in the near future. I’m just arguing that
>>>>> naming
>>>>>>>>> current
>>>>>>>>>>>>>>>>> immutable
>>>>>>>>>>>>>>>>>>>>> session life scope method `materialize()` is more
>>>>>> future
>>>>>>>>> proof
>>>>>>>>>>> and
>>>>>>>>>>>>>>>>> more
>>>>>>>>>>>>>>>>>>>>> consistent with SQL (on which after all table-api is
>>>>>>>> heavily
>>>>>>>>>>>>> basing
>>>>>>>>>>>>>>>>>>> on).
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> 3. Even if we agree on naming it `cache()`, I would
>>>>>>> still
>>>>>>>>>> insist
>>>>>>>>>>>>> on
>>>>>>>>>>>>>>>>>>>>> `cache()` returning `CachedTable` handle to avoid
>>>>>>> implicit
>>>>>>>>>>>>>>>>>>>> behaviours/side
>>>>>>>>>>>>>>>>>>>>> effects and to give both us & users more
>>>>> flexibility.
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> Piotrek
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> On 29 Nov 2018, at 06:20, Becket Qin <
>>>>>>>> becket.qin@gmail.com
>>>>>>>>>> 
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> Just to add a little bit, the materialized view is
>>>>>>>> probably
>>>>>>>>>>> more
>>>>>>>>>>>>>>>>>>>> similar
>>>>>>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>>>>> the persistent() brought up earlier in the thread.
>>>>> So
>>>>>>> it
>>>>>>>> is
>>>>>>>>>>>>> usually
>>>>>>>>>>>>>>>>>>>> cross
>>>>>>>>>>>>>>>>>>>>>> session and could be used in a larger scope. For
>>>>>>>> example, a
>>>>>>>>>>>>>>>>>>>> materialized
>>>>>>>>>>>>>>>>>>>>>> view created by user A may be visible to user B. It
>>>>>> is
>>>>>>>>>> probably
>>>>>>>>>>>>>>>>>>>> something
>>>>>>>>>>>>>>>>>>>>>> we want to have in the future. I'll put it in the
>>>>>>> future
>>>>>>>>> work
>>>>>>>>>>>>>>>>>>> section.
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> Jiangjie (Becket) Qin
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> On Thu, Nov 29, 2018 at 9:47 AM Becket Qin <
>>>>>>>>>>> becket.qin@gmail.com
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> Hi Piotrek,
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> Thanks for the explanation.
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> Right now we are mostly thinking of the cached
>>>>> table
>>>>>>> as
>>>>>>>>>>>>>>>> immutable. I
>>>>>>>>>>>>>>>>>>>> can
>>>>>>>>>>>>>>>>>>>>>>> see the Materialized view would be useful in the
>>>>>>> future.
>>>>>>>>>> That
>>>>>>>>>>>>>>>> said,
>>>>>>>>>>>>>>>>>>> I
>>>>>>>>>>>>>>>>>>>>> think
>>>>>>>>>>>>>>>>>>>>>>> a simple cache mechanism is probably still needed.
>>>>>> So
>>>>>>> to
>>>>>>>>> me,
>>>>>>>>>>>>>>>> cache()
>>>>>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>>>>>>> materialize() should be two separate method as
>>>>> they
>>>>>>>>> address
>>>>>>>>>>>>>>>>>>> different
>>>>>>>>>>>>>>>>>>>>>>> needs. Materialize() is a higher level concept
>>>>>> usually
>>>>>>>>>>> implying
>>>>>>>>>>>>>>>>>>>>> periodical
>>>>>>>>>>>>>>>>>>>>>>> update, while cache() has much simpler semantic.
>>>>> For
>>>>>>>>>> example,
>>>>>>>>>>>>> one
>>>>>>>>>>>>>>>>>>> may
>>>>>>>>>>>>>>>>>>>>>>> create a materialized view and use cache() method
>>>>> in
>>>>>>> the
>>>>>>>>>>>>>>>>>>> materialized
>>>>>>>>>>>>>>>>>>>>> view
>>>>>>>>>>>>>>>>>>>>>>> creation logic. So that during the materialized
>>>>> view
>>>>>>>>> update,
>>>>>>>>>>>>> they
>>>>>>>>>>>>>>>> do
>>>>>>>>>>>>>>>>>>>> not


Re: [DISCUSS] Support Interactive Programming in Flink Table API

Posted by Becket Qin <be...@gmail.com>.
Another potential concern for semantic 3: in the future, we may add
automatic caching to Flink, e.g. caching the intermediate results at the
shuffle boundary. If our semantic is that a reference to the original table
means skipping the cache, those users may not be able to benefit from the
implicit cache.
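
To make the concern concrete, here is a rough sketch in Scala-flavored
pseudocode (the automatic shuffle-boundary caching shown here is
hypothetical, not an existing Flink feature):

val a = src.groupBy('key).select('key, 'value.sum) // shuffle boundary
val b = a.filter('value > 10)
b.print() // first job runs; the shuffle result behind a could be
          // cached automatically
val c = a.filter('value <= 10)
c.print() // if referencing the original table means "skip the cache",
          // this second job cannot reuse the auto-cached shuffle result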



On Tue, Dec 11, 2018 at 12:10 PM Becket Qin <be...@gmail.com> wrote:

> Hi Piotrek,
>
> Thanks for the reply. Having thought about it again, I might have
> misunderstood your proposal in earlier emails. Returning a CachedTable
> might not be a bad idea.
>
> I was more concerned about the semantics and their intuitiveness when a
> CachedTable is returned, i.e., if cache() returns a CachedTable. What are
> the semantics of the following code?
> {
>   val cachedTable = a.cache()
>   val b = cachedTable.select(...)
>   val c = a.select(...)
> }
> What is the difference between b and c? At first glance, I see two
> options:
>
> Semantic 1. b uses cachedTable because the user demanded so; c uses the
> original DAG because the user demanded so. In this case, the optimizer
> has no chance to optimize.
> Semantic 2. b uses cachedTable because the user demanded so; c leaves the
> optimizer to choose whether the cache or the DAG should be used. In this
> case, the user loses the option to NOT use the cache.
>
> As you can see, neither of the options seems perfect. However, I guess you
> and Till are proposing a third option:
>
> Semantic 3. b leaves the optimizer to choose whether the cache or the DAG
> should be used; c always uses the DAG.
>
> This does address all the concerns. It is just that, from an intuitiveness
> perspective, I find it a little weird to ask users to explicitly use a
> CachedTable that the optimizer might then choose to ignore. That was why I
> did not think of that semantic. But given that there is a material
> benefit, I think this semantic is acceptable.
>
>> 1. If we want to let the optimiser make decisions on whether to use the
>> cache or not, then why do we need a “void cache()” method at all? Would it
>> “increase” the chance of using the cache? That sounds strange. What would
>> be the mechanism for deciding whether to use the cache or not? If we want
>> to introduce such kind of automated optimisation of “plan nodes
>> deduplication”, I would turn it on globally, not per table, and let the
>> optimiser do all of the work.
>> 2. We do not have statistics at the moment for any use/not-use cache
>> decision.
>> 3. Even if we had, I would be veeerryy sceptical whether such cost-based
>> optimisations would work properly and I would still insist first on
>> providing an explicit caching mechanism (`CachedTable cache()`)
>>
> We are absolutely on the same page here. An explicit cache() method is
> necessary not only because the optimizer may not be able to make the right
> decision, but also because of the nature of interactive programming. For
> example, if users write the following code in the Scala shell:
>   val b = a.select(...)
>   val c = b.select(...)
>   val d = c.select(...).writeToSink(...)
>   tEnv.execute()
> There is no way the optimizer will know whether b or c will be used in
> later code unless users hint explicitly.
>
>> At the same time I’m not sure if you have responded to our objections to
>> `void cache()` being implicit/having side effects, which Jark, Fabian,
>> Till, I, and I think also Shaoxuan are supporting.
>
> Are there any other side effects if we use semantic 3 as mentioned above?
>
> Thanks,
>
> Jiangjie (Becket) Qin
>
>
> On Mon, Dec 10, 2018 at 7:54 PM Piotr Nowojski <pi...@data-artisans.com>
> wrote:
>
>> Hi Becket,
>>
>> Sorry for not responding for a long time.
>>
>> Regarding case 1.
>>
>> There wouldn’t be an “a.unCache()” method, but I would expect only
>> `cachedTableA1.dropCache()`. Dropping `cachedTableA1` wouldn’t affect
>> `cachedTableA2`. Just as in any other database, dropping/modifying one
>> independent table/materialised view does not affect others.
>>
>> > What I meant is that assuming there is already a cached table, ideally
>> > users need not specify whether the next query should read from the
>> > cache or use the original DAG. This should be decided by the optimizer.
>>
>> 1. If we want to let the optimiser make decisions on whether to use the
>> cache or not, then why do we need a “void cache()” method at all? Would it
>> “increase” the chance of using the cache? That sounds strange. What would
>> be the mechanism for deciding whether to use the cache or not? If we want
>> to introduce such kind of automated optimisation of “plan nodes
>> deduplication”, I would turn it on globally, not per table, and let the
>> optimiser do all of the work.
>> 2. We do not have statistics at the moment for any use/not-use cache
>> decision.
>> 3. Even if we had, I would be veeerryy sceptical whether such cost-based
>> optimisations would work properly and I would still insist first on
>> providing an explicit caching mechanism (`CachedTable cache()`)
>> 4. As Till wrote, having an explicit `CachedTable cache()` doesn’t
>> contradict future work on automated cost-based caching.
>>
>>
>> At the same time I’m not sure if you have responded to our objections to
>> `void cache()` being implicit/having side effects, which Jark, Fabian,
>> Till, I, and I think also Shaoxuan are supporting.
>>
>> Piotrek
>>
>> > On 5 Dec 2018, at 12:42, Becket Qin <be...@gmail.com> wrote:
>> >
>> > Hi Till,
>> >
>> > It is true that after the first job submission, there will be no
>> > ambiguity in terms of whether a cached table is used or not. That is
>> > the same for a cache() that does not return a CachedTable.
>> >
>> >> Conceptually one could think of cache() as introducing a caching
>> >> operator from which you need to consume if you want to benefit from
>> >> the caching functionality.
>> >
>> > I am thinking a little differently. I think it is a hint (as you
>> > mentioned later) instead of a new operator. I'd like to be careful
>> > about the semantics of the API. A hint is a property set on an existing
>> > operator, but is not itself an operator, as it does not really
>> > manipulate the data.
>> >
>> >> I agree, ideally the optimizer makes this kind of decision about which
>> >> intermediate result should be cached. But especially when executing
>> >> ad-hoc queries the user might better know which results need to be
>> >> cached because Flink might not see the full DAG. In that sense, I would
>> >> consider the cache() method as a hint for the optimizer. Of course, in
>> >> the future we might add functionality which tries to automatically
>> >> cache results (e.g. caching the latest intermediate results until so
>> >> and so much space is used). But this should hopefully not contradict
>> >> `CachedTable cache()`.
>> >
>> > I agree that the cache() method is needed for exactly the reason you
>> > mentioned, i.e. Flink cannot predict what users are going to write
>> > later, so users need to tell Flink explicitly that this table will be
>> > used later. What I meant is that assuming there is already a cached
>> > table, ideally users need not specify whether the next query should
>> > read from the cache or use the original DAG. This should be decided by
>> > the optimizer.
>> >
>> > To explain the difference between returning / not returning a
>> > CachedTable, I want to compare the following two cases:
>> >
>> > *Case 1: returning a CachedTable*
>> > b = a.map(...)
>> > val cachedTableA1 = a.cache()
>> > val cachedTableA2 = a.cache()
>> > b.print() // Just to make sure a is cached.
>> >
>> > c = a.filter(...) // User specifies that the original DAG is used? Or
>> > the optimizer decides whether the DAG or the cache should be used?
>> > d = cachedTableA1.filter() // User specifies that the cached table is
>> > used.
>> >
>> > a.unCache() // Can cachedTableA1 still be used afterwards?
>> > cachedTableA1.unCache() // Can cachedTableA2 still be used?
>> >
>> > *Case 2: not returning a CachedTable*
>> > b = a.map()
>> > a.cache()
>> > a.cache() // no-op
>> > b.print() // Just to make sure a is cached
>> >
>> > c = a.filter(...) // Optimizer decides whether the cache or the DAG
>> > should be used
>> > d = a.filter(...) // Optimizer decides whether the cache or the DAG
>> > should be used
>> >
>> > a.unCache()
>> > a.unCache() // no-op
>> >
>> > In case 1, semantics-wise, the optimizer loses the option to choose
>> > between the DAG and the cache. And the unCache() call becomes tricky.
>> > In case 2, users do not need to worry about whether the cache or the
>> > DAG is used. And the unCache() semantic is clear. However, the caveat
>> > is that users cannot explicitly ignore the cache.
>> >
>> > In order to address the issues mentioned in case 2, and inspired by the
>> > discussion so far, I am thinking about using a hint to allow users to
>> > explicitly ignore the cache. Although we do not have hints yet, we
>> > probably should have one. So the code becomes:
>> >
>> > *Case 3: returning this table*
>> > b = a.map()
>> > a.cache()
>> > a.cache() // no-op
>> > b.print() // Just to make sure a is cached
>> >
>> > c = a.filter(...) // Optimizer decides whether the cache or the DAG
>> > should be used
>> > d = a.hint("ignoreCache").filter(...) // DAG will be used instead of
>> > the cache.
>> >
>> > a.unCache()
>> > a.unCache() // no-op
>> >
>> > We could also let cache() return this table to allow chained method
>> > calls, e.g. as sketched below.
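>> >
>> > A rough sketch of the chained usage (just a sketch; cache() returning
>> > `this` and the "ignoreCache" hint are the hypothetical pieces from
>> > above):
>> >
>> > d = a.cache().filter(...) // reads the cache or the DAG, optimizer's
>> > // choice
>> > e = a.hint("ignoreCache").filter(...) // explicitly bypass the cache
>> >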
>> > Do you think this API addresses the concerns?
>> >
>> > Thanks,
>> >
>> > Jiangjie (Becket) Qin
>> >
>> >
>> > On Wed, Dec 5, 2018 at 10:55 AM Jark Wu <im...@gmail.com> wrote:
>> >
>> >> Hi,
>> >>
>> >> All the recent discussions are focused on whether there is a problem if
>> >> cache() does not return a Table.
>> >> It seems that returning a Table explicitly is clearer (and safer?).
>> >>
>> >> So are there any problems if cache() returns a Table?  @Becket
>> >>
>> >> Best,
>> >> Jark
>> >>
>> >> On Tue, 4 Dec 2018 at 22:27, Till Rohrmann <tr...@apache.org> wrote:
>> >>
>> >>> It's true that b, c, d and e will all read from the original DAG that
>> >>> generates a. But all subsequent operators (when running multiple
>> >>> queries) which reference cachedTableA should not need to reproduce `a`
>> >>> but directly consume the intermediate result.
>> >>>
>> >>> Conceptually one could think of cache() as introducing a caching
>> >>> operator from which you need to consume if you want to benefit from
>> >>> the caching functionality.
>> >>>
>> >>> I agree, ideally the optimizer makes this kind of decision about which
>> >>> intermediate result should be cached. But especially when executing
>> >>> ad-hoc queries the user might better know which results need to be
>> >>> cached because Flink might not see the full DAG. In that sense, I
>> >>> would consider the cache() method as a hint for the optimizer. Of
>> >>> course, in the future we might add functionality which tries to
>> >>> automatically cache results (e.g. caching the latest intermediate
>> >>> results until so and so much space is used). But this should hopefully
>> >>> not contradict `CachedTable cache()`.
>> >>>
>> >>> Cheers,
>> >>> Till
>> >>>
>> >>> On Tue, Dec 4, 2018 at 2:33 PM Becket Qin <be...@gmail.com> wrote:
>> >>>
>> >>>> Hi Till,
>> >>>>
>> >>>> Thanks for the clarification. I am still a little confused.
>> >>>>
>> >>>> If cache() returns a CachedTable, the example might become:
>> >>>>
>> >>>> b = a.map(...)
>> >>>> c = a.map(...)
>> >>>>
>> >>>> cachedTableA = a.cache()
>> >>>> d = cachedTableA.map(...)
>> >>>> e = a.map()
>> >>>>
>> >>>> In the above case, if cache() is lazily evaluated, b, c, d and e are
>> >>>> all going to be reading from the original DAG that generates a. But
>> >>>> with a naive expectation, d should be reading from the cache. This
>> >>>> does not seem to solve the potential confusion you raised, right?
>> >>>>
>> >>>> Just to be clear, my understanding is all based on the assumption
>> >>>> that the tables are immutable. Therefore, after a.cache(), the
>> >>>> *cachedTableA* and the original table *a* should be completely
>> >>>> interchangeable.
>> >>>>
>> >>>> That said, I think a valid argument is optimization. There are
>> >>>> indeed cases where reading from the original DAG could be faster than
>> >>>> reading from the cache. For example:
>> >>>>
>> >>>> a.filter(f1' > 100)
>> >>>> a.cache()
>> >>>> b = a.filter(f1' < 100)
>> >>>>
>> >>>> Ideally the optimizer should be intelligent enough to decide which
>> >>>> way is faster, without user intervention. In this case, it will
>> >>>> identify that b would just be an empty table, and thus skip reading
>> >>>> from the cache completely.
>> >>>> But I agree that returning a CachedTable would give users control
>> >>>> over when to use the cache, even though I still feel that letting
>> >>>> the optimizer handle this is a better option in the long run.
>> >>>>
>> >>>> Thanks,
>> >>>>
>> >>>> Jiangjie (Becket) Qin
>> >>>>
>> >>>>
>> >>>>
>> >>>>
>> >>>> On Tue, Dec 4, 2018 at 6:51 PM Till Rohrmann <tr...@apache.org>
>> >>>> wrote:
>> >>>>
>> >>>>> Yes, you are right, Becket, that it still depends on the actual
>> >>>>> execution of the job whether a consumer reads from a cached result
>> >>>>> or not.
>> >>>>>
>> >>>>> My point was actually about the properties of a (cached vs.
>> >>>>> non-cached) and not about the execution. I would not make cache
>> >>>>> trigger the execution of the job because one loses some flexibility
>> >>>>> by eagerly triggering the execution.
>> >>>>>
>> >>>>> I tried to argue for an explicit CachedTable which is returned by
>> >>>>> the cache() method, like Piotr did, in order to make the API more
>> >>>>> explicit.
>> >>>>>
>> >>>>> Cheers,
>> >>>>> Till
>> >>>>>
>> >>>>> On Mon, Dec 3, 2018 at 4:23 PM Becket Qin <be...@gmail.com>
>> >>>>> wrote:
>> >>>>>
>> >>>>>> Hi Till,
>> >>>>>>
>> >>>>>> That is a good example. Just a minor correction: in this case, b,
>> >>>>>> c and d will all consume from a non-cached a. This is because the
>> >>>>>> cache will only be created on the very first job submission that
>> >>>>>> generates the table to be cached.
>> >>>>>>
>> >>>>>> If I understand correctly, this example is about whether the
>> >>>>>> .cache() method should be eagerly evaluated or lazily evaluated.
>> >>>>>> In other words, if the cache() method actually triggers a job that
>> >>>>>> creates the cache, there will be no such confusion. Is that right?
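>> >>>>>>
>> >>>>>> To spell out the two options in code (a sketch only; neither
>> >>>>>> behavior has been decided):
>> >>>>>>
>> >>>>>> // Eager: cache() itself submits a job that materializes a.
>> >>>>>> a.cache()      // a job runs here
>> >>>>>> d = a.map(...) // guaranteed to read from the cache
>> >>>>>>
>> >>>>>> // Lazy: cache() only marks a; the cache is created by the first
>> >>>>>> // job submission that happens to compute a.
>> >>>>>> a.cache()      // no job yet
>> >>>>>> d = a.map(...) // reads the original DAG; the cache is created as
>> >>>>>> // a side effect of this first submission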
>> >>>>>>
>> >>>>>> In the example, although d will not consume from the cached Table
>> >>>>>> while it looks like it is supposed to, from a correctness
>> >>>>>> perspective the code will still return the correct result, assuming
>> >>>>>> that tables are immutable.
>> >>>>>>
>> >>>>>> Personally I feel it is OK because users probably won't really
>> >>>>>> worry about whether the table is cached or not. And a lazy cache
>> >>>>>> could avoid some unnecessary caching if the cached table is never
>> >>>>>> actually used in the user application. But I am not opposed to
>> >>>>>> doing eager evaluation of the cache.
>> >>>>>>
>> >>>>>> Thanks,
>> >>>>>>
>> >>>>>> Jiangjie (Becket) Qin
>> >>>>>>
>> >>>>>>
>> >>>>>>
>> >>>>>> On Mon, Dec 3, 2018 at 10:01 PM Till Rohrmann
>> >>>>>> <trohrmann@apache.org> wrote:
>> >>>>>>
>> >>>>>>> Another argument for Piotr's point is that lazily changing
>> >>>>>>> properties of a node affects all downstream consumers but does not
>> >>>>>>> necessarily have to happen before these consumers are defined.
>> >>>>>>> From a user's perspective this can be quite confusing:
>> >>>>>>>
>> >>>>>>> b = a.map(...)
>> >>>>>>> c = a.map(...)
>> >>>>>>>
>> >>>>>>> a.cache()
>> >>>>>>> d = a.map(...)
>> >>>>>>>
>> >>>>>>> now b, c and d will consume from a cached operator. In this case,
>> >>>>>>> the user would most likely expect that only d reads from a cached
>> >>>>>>> result.
>> >>>>>>>
>> >>>>>>> Cheers,
>> >>>>>>> Till
>> >>>>>>>
>> >>>>>>> On Mon, Dec 3, 2018 at 11:32 AM Piotr Nowojski
>> >>>>>>> <piotr@data-artisans.com> wrote:
>> >>>>>>>
>> >>>>>>>> Hey Shaoxuan and Becket,
>> >>>>>>>>
>> >>>>>>>>> Can you explain a bit more on what the side effects are? So far
>> >>>>>>>>> my understanding is that such side effects only exist if a table
>> >>>>>>>>> is mutable. Is that the case?
>> >>>>>>>>
>> >>>>>>>> Not only that. There are also performance implications, and those
>> >>>>>>>> are another implicit side effect of using `void cache()`. As I
>> >>>>>>>> wrote before, reading from the cache might not always be
>> >>>>>>>> desirable, thus it can cause performance degradation, and I’m
>> >>>>>>>> fine with that - user's or optimiser’s choice. What I do not like
>> >>>>>>>> is that this implicit side effect can manifest in a completely
>> >>>>>>>> different part of the code that wasn’t touched by a user while he
>> >>>>>>>> was adding a `void cache()` call somewhere else. And even if
>> >>>>>>>> caching improves performance, it’s still a side effect of `void
>> >>>>>>>> cache()`. Almost by definition, `void` methods have only side
>> >>>>>>>> effects. As I wrote before, there are a couple of scenarios where
>> >>>>>>>> this might be undesirable and/or unexpected, for example:
>> >>>>>>>>
>> >>>>>>>> 1.
>> >>>>>>>> Table b = …;
>> >>>>>>>> b.cache()
>> >>>>>>>> x = b.join(…)
>> >>>>>>>> y = b.count()
>> >>>>>>>> // ...
>> >>>>>>>> // 100
>> >>>>>>>> // hundred
>> >>>>>>>> // lines
>> >>>>>>>> // of
>> >>>>>>>> // code
>> >>>>>>>> // later
>> >>>>>>>> z = b.filter(…).groupBy(…) // this might even be hidden in a
>> >>>>>>>> // different method/file/package/dependency
>> >>>>>>>>
>> >>>>>>>> 2.
>> >>>>>>>>
>> >>>>>>>> Table b = ...
>> >>>>>>>> if (some_condition) {
>> >>>>>>>>  foo(b)
>> >>>>>>>> }
>> >>>>>>>> else {
>> >>>>>>>>  bar(b)
>> >>>>>>>> }
>> >>>>>>>> z = b.filter(…).groupBy(…)
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>> void foo(Table b) {
>> >>>>>>>>  b.cache()
>> >>>>>>>>  // do something with b
>> >>>>>>>> }
>> >>>>>>>>
>> >>>>>>>> In both of the above examples, `b.cache()` will implicitly affect
>> >>>>>>>> `z = b.filter(…).groupBy(…)` (both the semantics of the program,
>> >>>>>>>> in case sources are mutable, and its performance), which might be
>> >>>>>>>> far from obvious.
>> >>>>>>>>
>> >>>>>>>> On top of that, there is still this argument of mine that having
>> >>>>>>>> a `MaterializedTable` or `CachedTable` handle is more flexible
>> >>>>>>>> for us in the future and for the user (as a manual option to
>> >>>>>>>> bypass cache reads).
>> >>>>>>>>
>> >>>>>>>>> But Jiangjie is correct,
>> >>>>>>>>> the source table in batching should be immutable. It is the
>> >>>>>>>>> user’s responsibility to ensure it, otherwise even a regular
>> >>>>>>>>> failover may lead to inconsistent results.
>> >>>>>>>>
>> >>>>>>>> Yes, I agree that’s what a perfect world/good deployment should
>> >>>>>>>> look like. But it often isn’t, and while I’m not trying to fix
>> >>>>>>>> this (since the proper fix is to support transactions), I’m just
>> >>>>>>>> trying to minimise confusion for the users that are not fully
>> >>>>>>>> aware of what’s going on and operate in a less than perfect
>> >>>>>>>> setup. And if something bites them after adding a `b.cache()`
>> >>>>>>>> call, I want to make sure that they at least know all of the
>> >>>>>>>> places that adding this line can affect.
>> >>>>>>>>
>> >>>>>>>> Thanks, Piotrek
>> >>>>>>>>
>> >>>>>>>>> On 1 Dec 2018, at 15:39, Becket Qin <be...@gmail.com> wrote:
>> >>>>>>>>>
>> >>>>>>>>> Hi Piotrek,
>> >>>>>>>>>
>> >>>>>>>>> Thanks again for the clarification. Some more replies follow
>> >>>>>>>>> below.
>> >>>>>>>>>
>> >>>>>>>>>> But keep in mind that `.cache()` will/might not only be used in
>> >>>>>>>>>> interactive programming and not only in batching.
>> >>>>>>>>>
>> >>>>>>>>> It is true. Actually in stream processing, cache() has the same
>> >>>>>>>>> semantics as in batch processing. The semantics are as follows:
>> >>>>>>>>> for a table created via a series of computations, save that
>> >>>>>>>>> table for later reference to avoid re-running the computation
>> >>>>>>>>> logic to regenerate the table. Once the application exits, drop
>> >>>>>>>>> all the caches.
>> >>>>>>>>> This semantic is the same for both batch and stream processing.
>> >>>>>>>>> The difference is that stream applications will only run once,
>> >>>>>>>>> as they are long running, while batch applications may be run
>> >>>>>>>>> multiple times, hence the cache may be created and dropped each
>> >>>>>>>>> time the application runs.
>> >>>>>>>>> Admittedly, there will probably be some resource management
>> >>>>>>>>> requirements for the streaming cached table, such as time-based
>> >>>>>>>>> / size-based retention, to address the infinite data issue. But
>> >>>>>>>>> such requirements do not change the semantics.
>> >>>>>>>>> You are right that interactive programming is just one use case
>> >>>>>>>>> of cache(). It is not the only use case.
>> >>>>>>>>>
>> >>>>>>>>>> For me the more important issue is of not having the `void
>> >>>>>>>>>> cache()` with side effects.
>> >>>>>>>>>
>> >>>>>>>>> This is indeed the key point. The argument around whether
>> >>>>>>>>> cache() should return something already indicates that cache()
>> >>>>>>>>> and materialize() address different issues.
>> >>>>>>>>> Can you explain a bit more on what the side effects are? So far
>> >>>>>>>>> my understanding is that such side effects only exist if a table
>> >>>>>>>>> is mutable. Is that the case?
>> >>>>>>>>>
>> >>>>>>>>>> I don’t know, probably initially we should make CachedTable
>> >>>>>>>>>> read-only. I don’t find it more confusing than the fact that a
>> >>>>>>>>>> user can not write to views or materialised views in SQL or
>> >>>>>>>>>> that a user currently can not write to a Table.
>> >>>>>>>>>
>> >>>>>>>>> I don't think anyone should insert something into a cache. By
>> >>>>>>>>> definition, the cache should only be updated when the
>> >>>>>>>>> corresponding original table is updated. What I am wondering is
>> >>>>>>>>> that, given the following two facts:
>> >>>>>>>>> 1. If and only if a table is mutable (with something like
>> >>>>>>>>> insert()), a CachedTable may have implicit behavior.
>> >>>>>>>>> 2. A CachedTable extends a Table.
>> >>>>>>>>> we can come to the conclusion that a CachedTable is mutable and
>> >>>>>>>>> users can insert into the CachedTable directly. This is where I
>> >>>>>>>>> found it confusing.
>> >>>>>>>>>
>> >>>>>>>>> Thanks,
>> >>>>>>>>>
>> >>>>>>>>> Jiangjie (Becket) Qin
>> >>>>>>>>>
>> >>>>>>>>> On Sat, Dec 1, 2018 at 2:45 AM Piotr Nowojski
>> >>>>>>>>> <piotr@data-artisans.com> wrote:
>> >>>>>>>>>
>> >>>>>>>>>> Hi all,
>> >>>>>>>>>>
>> >>>>>>>>>> Regarding naming `cache()` vs `materialize()`: one more
>> >>>>>>>>>> explanation of why `materialize()` is more natural to me is
>> >>>>>>>>>> that I think of all “Table”s in the Table API as views. They
>> >>>>>>>>>> behave the same way as SQL views; the only difference for me
>> >>>>>>>>>> is that their life scope is short - the current session - which
>> >>>>>>>>>> is limited by the different execution model. That’s why
>> >>>>>>>>>> “caching” a view for me is just materialising it.
>> >>>>>>>>>>
>> >>>>>>>>>> However, I see and I understand your point of view. Coming
>> >>>>>>>>>> from DataSet/DataStream and, generally speaking, the non-SQL
>> >>>>>>>>>> world, `cache()` is more natural. But keep in mind that
>> >>>>>>>>>> `.cache()` will/might not only be used in interactive
>> >>>>>>>>>> programming and not only in batching. But naming is one issue,
>> >>>>>>>>>> and not that critical to me. Especially since, once we
>> >>>>>>>>>> implement proper materialised views, we can always
>> >>>>>>>>>> deprecate/rename `cache()` if we deem it so.
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>> For me the more important issue is of not having the `void
>> >>>>> cache()`
>> >>>>>>> with
>> >>>>>>>>>> side effects. Exactly for the reasons that you have
>> >> mentioned.
>> >>>>> True:
>> >>>>>>>>>> results might be non deterministic if underlying source
>> >> table
>> >>>> are
>> >>>>>>>> changing.
>> >>>>>>>>>> Problem is that `void cache()` implicitly changes the
>> >> semantic
>> >>>> of
>> >>>>>>>>>> subsequent uses of the cached/materialized Table. It can
>> >> cause
>> >>>>> “wtf”
>> >>>>>>>> moment
>> >>>>>>>>>> for a user if he inserts “b.cache()” call in some place in
>> >> his
>> >>>>> code
>> >>>>>>> and
>> >>>>>>>>>> suddenly some other random places are behaving differently.
>> >> If
>> >>>>>>>>>> `materialize()` or `cache()` returns a Table handle, we
>> >> force
>> >>>> user
>> >>>>>> to
>> >>>>>>>>>> explicitly use the cache which removes the “random” part
>> >> from
>> >>>> the
>> >>>>>>>> "suddenly
>> >>>>>>>>>> some other random places are behaving differently”.
>> >>>>>>>>>>
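>> >>>>>>>>>> A small sketch of the difference (all names hypothetical):
>> >>>>>>>>>>
>> >>>>>>>>>> // void cache(): implicit, every later use of b silently changes
>> >>>>>>>>>> b.cache();
>> >>>>>>>>>> long x = b.count(); // cache or source? invisible at the call site
>> >>>>>>>>>>
>> >>>>>>>>>> // handle-returning: explicit, the reader sees which plan runs
>> >>>>>>>>>> Table cachedB = b.cache(); // or b.materialize()
>> >>>>>>>>>> long fromCache = cachedB.count(); // reads the cached data
>> >>>>>>>>>> long fromSource = b.count(); // deliberately bypasses the cache
>> >>>>>>>>>>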
>> >>>>>>>>>> This argument and others that I’ve raised (greater
>> >>>>>>> flexibility/allowing
>> >>>>>>>>>> user to explicitly bypass the cache) are independent of
>> >>>> `cache()`
>> >>>>> vs
>> >>>>>>>>>> `materialize()` discussion.
>> >>>>>>>>>>
>> >>>>>>>>>>> Does that mean one can also insert into the CachedTable?
>> >> This
>> >>>>>> sounds
>> >>>>>>>>>> pretty confusing.
>> >>>>>>>>>>
>> >>>>>>>>>> I don’t know, probably initially we should make CachedTable
>> >>>>>>> read-only. I
>> >>>>>>>>>> don’t find it more confusing than the fact that user can not
>> >>>> write
>> >>>>>> to
>> >>>>>>>> views
>> >>>>>>>>>> or materialised views in SQL or that user currently can not
>> >>>> write
>> >>>>>> to a
>> >>>>>>>>>> Table.
>> >>>>>>>>>>
>> >>>>>>>>>> Piotrek
>> >>>>>>>>>>
>> >>>>>>>>>>> On 30 Nov 2018, at 17:38, Xingcan Cui <xi...@gmail.com>
>> >>>>> wrote:
>> >>>>>>>>>>>
>> >>>>>>>>>>> Hi all,
>> >>>>>>>>>>>
>> >>>>>>>>>>> I agree with @Becket that `cache()` and `materialize()`
>> >>> should
>> >>>> be
>> >>>>>>>>>> considered as two different methods where the latter one is
>> >>> more
>> >>>>>>>>>> sophisticated.
>> >>>>>>>>>>>
>> >>>>>>>>>>> According to my understanding, the initial idea is just to
>> >>>>>> introduce
>> >>>>>>> a
>> >>>>>>>>>> simple cache or persist mechanism, but as the Table API is a
>> >>>>>>>>>> high-level API, it’s natural for us to think in a SQL way.
>> >>>>>>>>>>>
>> >>>>>>>>>>> Maybe we can add the `cache()` method to the DataSet API
>> >> and
>> >>>>> force
>> >>>>>>>> users
>> >>>>>>>>>> to translate a Table to a Dataset before caching it. Then
>> >> the
>> >>>>> users
>> >>>>>>>> should
>> >>>>>>>>>> manually register the cached dataset to a table again (we
>> >> may
>> >>>> need
>> >>>>>>> some
>> >>>>>>>>>> table replacement mechanisms for datasets with an identical
>> >>>> schema
>> >>>>>> but
>> >>>>>>>>>> different contents here). After all, it’s the dataset rather than
>> >>>>>>>>>> the dynamic table that needs to be cached, right?
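>> >>>>>>>>>>
>> >>>>>>>>>> A rough sketch of that workaround on the batch API (only the
>> >>>>>>>>>> DataSet-level cache() is the new, hypothetical part):
>> >>>>>>>>>>
>> >>>>>>>>>> DataSet<Row> ds = tableEnv.toDataSet(t, Row.class);
>> >>>>>>>>>> DataSet<Row> cached = ds.cache(); // proposed DataSet cache
>> >>>>>>>>>> tableEnv.registerDataSet("t_cached", cached); // re-register
>> >>>>>>>>>> Table tCached = tableEnv.scan("t_cached");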
>> >>>>>>>>>>>
>> >>>>>>>>>>> Best,
>> >>>>>>>>>>> Xingcan
>> >>>>>>>>>>>
>> >>>>>>>>>>>> On Nov 30, 2018, at 10:57 AM, Becket Qin <
>> >>>> becket.qin@gmail.com>
>> >>>>>>>> wrote:
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> Hi Piotrek and Jark,
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> Thanks for the feedback and explanation. Those are good
>> >>>>> arguments.
>> >>>>>>>> But I
>> >>>>>>>>>>>> think those arguments are mostly about materialized view.
>> >>> Let
>> >>>> me
>> >>>>>> try
>> >>>>>>>> to
>> >>>>>>>>>>>> explain the reason I believe cache() and materialize() are
>> >>>>>>> different.
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> I think cache() and materialize() have quite different
>> >>>>>> implications.
>> >>>>>>>> An
>> >>>>>>>>>>>> analogy I can think of is save()/publish(). When users
>> >> call
>> >>>>>> cache(),
>> >>>>>>>> it
>> >>>>>>>>>> is
>> >>>>>>>>>>>> just like they are saving an intermediate result as a
>> >> draft
>> >>> of
>> >>>>>> their
>> >>>>>>>>>> work,
>> >>>>>>>>>>>> this intermediate result may not have any realistic
>> >> meaning.
>> >>>>>> Calling
>> >>>>>>>>>>>> cache() does not mean users want to publish the cached
>> >> table
>> >>>> in
>> >>>>>> any
>> >>>>>>>>>> manner.
>> >>>>>>>>>>>> But when users call materialize(), that means "I have
>> >>>> something
>> >>>>>>>>>> meaningful
>> >>>>>>>>>>>> to be reused by others", now users need to think about the
>> >>>>>>> validation,
>> >>>>>>>>>>>> update & versioning, lifecycle of the result, etc.
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> Piotrek's suggestions on variations of the materialize()
>> >>>> methods
>> >>>>>> are
>> >>>>>>>>>> very
>> >>>>>>>>>>>> useful. It would be great if Flink have them. The concept
>> >> of
>> >>>>>>>>>> materialized
>> >>>>>>>>>>>> view is actually a pretty big feature, not to say the
>> >>> related
>> >>>>>> stuff
>> >>>>>>>> like
>> >>>>>>>>>>>> triggers/hooks you mentioned earlier. I think the
>> >>> materialized
>> >>>>>> view
>> >>>>>>>>>> itself
>> >>>>>>>>>>>> should be discussed in a more thorough and systematic
>> >>> manner.
>> >>>>> And
>> >>>>>> I
>> >>>>>>>>>> found
>> >>>>>>>>>>>> that discussion is kind of orthogonal and way beyond
>> >>>> interactive
>> >>>>>>>>>>>> programming experience.
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> The example you gave was interesting. I still have some
>> >>>>> questions,
>> >>>>>>>>>> though.
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> Table source = … // some source that scans files from a
>> >>>>> directory
>> >>>>>>>>>>>>> “/foo/bar/“
>> >>>>>>>>>>>>> Table t1 = source.groupBy(…).select(…).where(…) ….;
>> >>>>>>>>>>>>> Table t2 = t1.materialize() // (or `cache()`)
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> t2.count() // initialise cache (if it’s lazily
>> >> initialised)
>> >>>>>>>>>>>>> int a1 = t1.count()
>> >>>>>>>>>>>>> int b1 = t2.count()
>> >>>>>>>>>>>>> // something in the background (or we trigger it) writes
>> >>> new
>> >>>>>> files
>> >>>>>>> to
>> >>>>>>>>>>>>> /foo/bar
>> >>>>>>>>>>>>> int a2 = t1.count()
>> >>>>>>>>>>>>> int b2 = t2.count()
>> >>>>>>>>>>>>> t2.refresh() // possible future extension, not to be
>> >>>>> implemented
>> >>>>>> in
>> >>>>>>>> the
>> >>>>>>>>>>>>> initial version
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> what if someone else added some more files to /foo/bar at
>> >>> this
>> >>>>>>> point?
>> >>>>>>>> In
>> >>>>>>>>>>>> that case, a3 won't equal b3, and the result becomes
>> >>>>>>>>>>>> non-deterministic, right?
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> int a3 = t1.count()
>> >>>>>>>>>>>>> int b3 = t2.count()
>> >>>>>>>>>>>>> t2.drop() // another possible future extension, manual
>> >>>> “cache”
>> >>>>>>>> dropping
>> >>>>>>>>>>>>
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> When we talk about interactive programming, in most cases,
>> >>> we
>> >>>>> are
>> >>>>>>>>>> talking
>> >>>>>>>>>>>> about batch applications. A fundamental assumption of such
>> >>>> case
>> >>>>> is
>> >>>>>>>> that
>> >>>>>>>>>> the
>> >>>>>>>>>>>> source data is complete before the data processing begins,
>> >>> and
>> >>>>> the
>> >>>>>>>> data
>> >>>>>>>>>>>> will not change during the data processing. IMO, if additional
>> >>>>>>>>>>>> rows need to be added to some source during the processing, it
>> >>>>>>>>>>>> should be done in ways like unioning the source with another
>> >>>>>>>>>>>> table containing the rows to be added.
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> There are a few cases that computations are executed
>> >>>> repeatedly
>> >>>>> on
>> >>>>>>> the
>> >>>>>>>>>>>> changing data source.
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> For example, people may run an ML training job every hour with
>> >>>>>>>>>>>> the samples newly added in the past hour. In that case, the
>> >>>>>>>>>>>> source data between runs will indeed change. But still, the data
>> >>>>>>>>>>>> remains unchanged within one run. And usually in that case, the
>> >>>>>>>>>>>> result will need versioning, i.e. for a given result, it tells
>> >>>>>>>>>>>> that the result was derived from the source data as of a certain
>> >>>>>>>>>>>> timestamp.
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> Another example is something like data warehouse. In this
>> >>>> case,
>> >>>>>>> there
>> >>>>>>>>>> are a
>> >>>>>>>>>>>> few source of original/raw data. On top of those sources,
>> >>> many
>> >>>>>>>>>> materialized
>> >>>>>>>>>>>> view / queries / reports / dashboards can be created to
>> >>>> generate
>> >>>>>>>> derived
>> >>>>>>>>>>>> data. Those derived data need to be updated when the underlying
>> >>>>>>>>>>>> original data changes. In that case, the processing logic that
>> >>>>>>>>>>>> derives data from the original sources needs to be executed
>> >>>>>>>>>>>> repeatedly to update those reports/views.
>> >>>>>>>>>> Again,
>> >>>>>>>>>>>> all those derived data also need to have version
>> >> management,
>> >>>>> such
>> >>>>>> as
>> >>>>>>>>>>>> timestamp.
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> In any of the above two cases, during a single run of the
>> >>>>>> processing
>> >>>>>>>>>> logic,
>> >>>>>>>>>>>> the data cannot change. Otherwise the behavior of the
>> >>>> processing
>> >>>>>>> logic
>> >>>>>>>>>> may
>> >>>>>>>>>>>> be undefined. In the above two examples, when writing the
>> >>>>>> processing
>> >>>>>>>>>> logic,
>> >>>>>>>>>>>> Users can use .cache() to hint Flink that those results
>> >>> should
>> >>>>> be
>> >>>>>>>> saved
>> >>>>>>>>>> to
>> >>>>>>>>>>>> avoid repeated computation. And then for the result of my
>> >>>>>>> application
>> >>>>>>>>>>>> logic, I'll call materialize(), so that these results
>> >> could
>> >>> be
>> >>>>>>> managed
>> >>>>>>>>>> by
>> >>>>>>>>>>>> the system with versioning, metadata management, lifecycle
>> >>>>>>> management,
>> >>>>>>>>>>>> ACLs, etc.
>> >>>>>>>>>>>>
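>> >>>>>>>>>>>> As a sketch of that division of labour (API still hypothetical):
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> Table samples = rawData.filter(...).select(...);
>> >>>>>>>>>>>> samples.cache(); // cheap hint: avoid recomputation in this job
>> >>>>>>>>>>>> Table report = buildReport(samples); // reuses the cached result
>> >>>>>>>>>>>> report.materialize(); // managed: versioning, lifecycle, ACLs
>> >>>>>>>>>>>>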
>> >>>>>>>>>>>> It is true we can use materialize() to do the cache() job,
>> >>>> but I
>> >>>>>> am
>> >>>>>>>>>> really
>> >>>>>>>>>>>> reluctant to shoehorn cache() into materialize() and force
>> >>>> users
>> >>>>>> to
>> >>>>>>>>>> worry
>> >>>>>>>>>>>> about a bunch of implications that they needn't have to. I
>> >>> am
>> >>>>>>>>>> absolutely on
>> >>>>>>>>>>>> your side that redundant API is bad. But it is equally
>> >>>>>> frustrating,
>> >>>>>>> if
>> >>>>>>>>>> not
>> >>>>>>>>>>>> more, that the same API does different things.
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> Thanks,
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> Jiangjie (Becket) Qin
>> >>>>>>>>>>>>
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> On Fri, Nov 30, 2018 at 10:34 PM Shaoxuan Wang <
>> >>>>>> wshaoxuan@gmail.com
>> >>>>>>>>
>> >>>>>>>>>> wrote:
>> >>>>>>>>>>>>
>> >>>>>>>>>>>>> Thanks Piotrek,
>> >>>>>>>>>>>>> You provided a very good example, it explains all the
>> >>>>> confusions
>> >>>>>> I
>> >>>>>>>>>> have.
>> >>>>>>>>>>>>> It is clear that there is something we have not
>> >> considered
>> >>> in
>> >>>>> the
>> >>>>>>>>>> initial
>> >>>>>>>>>>>>> proposal. We intend to force the user to reuse the
>> >>>>>>>> cached/materialized
>> >>>>>>>>>>>>> table, if its cache() method is executed. We did not expect
>> >>>>>>>>>>>>> that a user may want to re-execute the plan from the source
>> >>>>>>>>>>>>> table. Let me re-think about it and get back to you later.
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>> In the meantime, this example/observation also implies that
>> >>>>>>>>>>>>> we cannot fully
>> >>>>>>>>>>>>> involve the optimizer to decide the plan if a
>> >>>> cache/materialize
>> >>>>>> is
>> >>>>>>>>>>>>> explicitly used, because whether to reuse the cached data or
>> >>>>>>>>>>>>> re-execute the query from the source data may lead to different
>> >>>>>>>>>>>>> results. (But I guess the optimizer can still help in some
>> >>>>>>>>>>>>> cases ---- as long as it does not re-execute from the varied
>> >>>>>>>>>>>>> source, we should be safe).
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>> Regards,
>> >>>>>>>>>>>>> Shaoxuan
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>> On Fri, Nov 30, 2018 at 9:13 PM Piotr Nowojski <
>> >>>>>>>>>> piotr@data-artisans.com>
>> >>>>>>>>>>>>> wrote:
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>>> Hi Shaoxuan,
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>> Re 2:
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> Table t3 = methodThatAppliesOperators(t1) // t1 is
>> >>> modified
>> >>>>>> to->
>> >>>>>>>> t1’
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>> What do you mean that “ t1 is modified to-> t1’ ” ? That
>> >>>>>>>>>>>>>> `methodThatAppliesOperators()` method has changed it’s
>> >>> plan?
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>> I was thinking more about something like this:
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>> Table source = … // some source that scans files from a
>> >>>>>> directory
>> >>>>>>>>>>>>>> “/foo/bar/“
>> >>>>>>>>>>>>>> Table t1 = source.groupBy(…).select(…).where(…) ….;
>> >>>>>>>>>>>>>> Table t2 = t1.materialize() // (or `cache()`)
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>> t2.count() // initialise cache (if it’s lazily
>> >>> initialised)
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>> int a1 = t1.count()
>> >>>>>>>>>>>>>> int b1 = t2.count()
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>> // something in the background (or we trigger it) writes
>> >>> new
>> >>>>>> files
>> >>>>>>>> to
>> >>>>>>>>>>>>>> /foo/bar
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>> int a2 = t1.count()
>> >>>>>>>>>>>>>> int b2 = t2.count()
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>> t2.refresh() // possible future extension, not to be
>> >>>>> implemented
>> >>>>>>> in
>> >>>>>>>>>> the
>> >>>>>>>>>>>>>> initial version
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>> int a3 = t1.count()
>> >>>>>>>>>>>>>> int b3 = t2.count()
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>> t2.drop() // another possible future extension, manual
>> >>>> “cache”
>> >>>>>>>>>> dropping
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>> assertTrue(a1 == b1) // same results, but b1 comes from
>> >>> the
>> >>>>>>> “cache"
>> >>>>>>>>>>>>>> assertTrue(b1 == b2) // both values come from the same
>> >>> cache
>> >>>>>>>>>>>>>> assertTrue(a2 > b2) // b2 comes from cache, a2
>> >> re-executed
>> >>>>> full
>> >>>>>>>> table
>> >>>>>>>>>>>>> scan
>> >>>>>>>>>>>>>> and has more data
>> >>>>>>>>>>>>>> assertTrue(b3 > b2) // b3 comes from refreshed cache
>> >>>>>>>>>>>>>> assertTrue(b3 == a2 == a3)
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>> Piotrek
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> On 30 Nov 2018, at 10:22, Jark Wu <im...@gmail.com>
>> >>>> wrote:
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> Hi,
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> It is an very interesting and useful design!
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> Here I want to share some of my thoughts:
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> 1. Agree with that cache() method should return some
>> >>> Table
>> >>>> to
>> >>>>>>> avoid
>> >>>>>>>>>>>>> some
>> >>>>>>>>>>>>>>> unexpected problems because of the mutable object.
>> >>>>>>>>>>>>>>> All the existing methods of Table are returning a new
>> >>> Table
>> >>>>>>>> instance.
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> 2. I think materialize() would be more consistent with
>> >>> SQL,
>> >>>>>> this
>> >>>>>>>>>> makes
>> >>>>>>>>>>>>> it
>> >>>>>>>>>>>>>>> possible to support the same feature for SQL (materialized
>> >>>>>>>>>>>>>>> views) and keep
>> >>>>>>>>>>>>>>> the same API for users in the future.
>> >>>>>>>>>>>>>>> But I'm also fine if we choose cache().
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> 3. In the proposal, a TableService (or FlinkService?)
>> >> is
>> >>>> used
>> >>>>>> to
>> >>>>>>>>>> cache
>> >>>>>>>>>>>>>> the
>> >>>>>>>>>>>>>>> result of the (intermediate) table.
>> >>>>>>>>>>>>>>> But the name of TableService may be a bit general, which is
>> >>>>>>>>>>>>>>> hard to understand correctly at first glance (a metastore for
>> >>>>>>>>>>>>>>> tables?). Maybe a more specific name would be better, such as
>> >>>>>>>>>>>>>>> TableCacheService or TableMaterializeService or something else.
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> Best,
>> >>>>>>>>>>>>>>> Jark
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> On Thu, 29 Nov 2018 at 21:16, Fabian Hueske <
>> >>>>> fhueske@gmail.com
>> >>>>>>>
>> >>>>>>>>>> wrote:
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>> Hi,
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>> Thanks for the clarification Becket!
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>> I have a few thoughts to share / questions:
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>> 1) I'd like to know how you plan to implement the
>> >>> feature
>> >>>>> on a
>> >>>>>>>> plan
>> >>>>>>>>>> /
>> >>>>>>>>>>>>>>>> planner level.
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>> I would imagine the following to happen when Table.cache()
>> >>>>>>>>>>>>>>>> is called:
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>> 1) immediately optimize the Table and internally
>> >> convert
>> >>>> it
>> >>>>>>> into a
>> >>>>>>>>>>>>>>>> DataSet/DataStream. This is necessary, to avoid that
>> >>>>> operators
>> >>>>>>> of
>> >>>>>>>>>>>>> later
>> >>>>>>>>>>>>>>>> queries on top of the Table are pushed down.
>> >>>>>>>>>>>>>>>> 2) register the DataSet/DataStream as a
>> >>>>>>> DataSet/DataStream-backed
>> >>>>>>>>>>>>> Table
>> >>>>>>>>>>>>>> X
>> >>>>>>>>>>>>>>>> 3) add a sink to the DataSet/DataStream. This is the
>> >>>>>>>> materialization
>> >>>>>>>>>>>>> of
>> >>>>>>>>>>>>>> the
>> >>>>>>>>>>>>>>>> Table X
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>> Based on your proposal the following would happen:
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>> Table t1 = ....
>> >>>>>>>>>>>>>>>> t1.cache(); // cache() returns void. The logical plan
>> >> of
>> >>>> t1
>> >>>>> is
>> >>>>>>>>>>>>> replaced
>> >>>>>>>>>>>>>> by
>> >>>>>>>>>>>>>>>> a scan of X. There is also a reference to the
>> >>>>> materialization
>> >>>>>> of
>> >>>>>>>> X.
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>> t1.count(); // this executes the program, including
>> >> the
>> >>>>>>>>>>>>>> DataSet/DataStream
>> >>>>>>>>>>>>>>>> that backs X and the sink that writes the
>> >>> materialization
>> >>>>> of X
>> >>>>>>>>>>>>>>>> t1.count(); // this executes the program, but reads X
>> >>> from
>> >>>>> the
>> >>>>>>>>>>>>>>>> materialization.
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>> My question is, how do you determine when the scan of t1
>> >>>>>>>>>>>>>>>> should go against the DataSet/DataStream program and when
>> >>>>>>>>>>>>>>>> against the materialization?
>> >>>>>>>>>>>>>>>> AFAIK, there is no hook that will tell you that a part
>> >>> of
>> >>>>> the
>> >>>>>>>>>> program
>> >>>>>>>>>>>>>> was
>> >>>>>>>>>>>>>>>> executed. Flipping a switch during optimization or
>> >> plan
>> >>>>>>> generation
>> >>>>>>>>>> is
>> >>>>>>>>>>>>>> not
>> >>>>>>>>>>>>>>>> sufficient as there is no guarantee that the plan is
>> >>> also
>> >>>>>>>> executed.
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>> Overall, this behavior is somewhat similar to what I
>> >>>>> proposed
>> >>>>>> in
>> >>>>>>>>>>>>>>>> FLINK-8950, which does not include persisting the
>> >> table,
>> >>>> but
>> >>>>>>> just
>> >>>>>>>>>>>>>>>> optimizing and reregistering it as DataSet/DataStream
>> >>>> scan.
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>> 2) I think Piotr has a point about the implicit
>> >> behavior
>> >>>> and
>> >>>>>>> side
>> >>>>>>>>>>>>>> effects
>> >>>>>>>>>>>>>>>> of the cache() method if it does not return anything.
>> >>>>>>>>>>>>>>>> Consider the following example:
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>> Table t1 = ???
>> >>>>>>>>>>>>>>>> Table t2 = methodThatAppliesOperators(t1);
>> >>>>>>>>>>>>>>>> Table t3 = methodThatAppliesOtherOperators(t1);
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>> In this case, the behavior/performance of the plan
>> >> that
>> >>>>>> results
>> >>>>>>>> from
>> >>>>>>>>>>>>> the
>> >>>>>>>>>>>>>>>> second method call depends on whether t1 was modified
>> >> by
>> >>>> the
>> >>>>>>> first
>> >>>>>>>>>>>>>> method
>> >>>>>>>>>>>>>>>> or not.
>> >>>>>>>>>>>>>>>> This is the classic issue of mutable vs. immutable
>> >>>> objects.
>> >>>>>>>>>>>>>>>> Also, as Piotr pointed out, it might also be good to
>> >>> have
>> >>>>> the
>> >>>>>>>>>> original
>> >>>>>>>>>>>>>> plan
>> >>>>>>>>>>>>>>>> of t1, because in some cases it is possible to push
>> >>>> filters
>> >>>>>> down
>> >>>>>>>>>> such
>> >>>>>>>>>>>>>> that
>> >>>>>>>>>>>>>>>> evaluating the query from scratch might be more
>> >>> efficient
>> >>>>> than
>> >>>>>>>>>>>>> accessing
>> >>>>>>>>>>>>>>>> the cache.
>> >>>>>>>>>>>>>>>> Moreover, a CachedTable could extend Table() and
>> >> offer a
>> >>>>>> method
>> >>>>>>>>>>>>>> refresh().
>> >>>>>>>>>>>>>>>> This sounds quite useful in an interactive session
>> >> mode.
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>> 3) Regarding the name, I can see both arguments. IMO,
>> >>>>>>>> materialize()
>> >>>>>>>>>>>>>> seems
>> >>>>>>>>>>>>>>>> to be more future proof.
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>> Best, Fabian
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>> Am Do., 29. Nov. 2018 um 12:56 Uhr schrieb Shaoxuan
>> >>> Wang <
>> >>>>>>>>>>>>>>>> wshaoxuan@gmail.com>:
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>> Hi Piotr,
>> >>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>> Thanks for sharing your ideas on the method naming.
>> >> We
>> >>>> will
>> >>>>>>> think
>> >>>>>>>>>>>>> about
>> >>>>>>>>>>>>>>>>> your suggestions. But I don't understand why we need
>> >> to
>> >>>>>> change
>> >>>>>>>> the
>> >>>>>>>>>>>>>> return
>> >>>>>>>>>>>>>>>>> type of cache().
>> >>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>> Cache() is a physical operation, it does not change
>> >> the
>> >>>>> logic
>> >>>>>>> of
>> >>>>>>>>>>>>>>>>> the `Table`. On the tableAPI layer, we should not
>> >>>>> introduce a
>> >>>>>>> new
>> >>>>>>>>>>>>> table
>> >>>>>>>>>>>>>>>>> type unless the logic of table has been changed. If
>> >> we
>> >>>>>>> introduce
>> >>>>>>>> a
>> >>>>>>>>>>>>> new
>> >>>>>>>>>>>>>>>>> table type `CachedTable`, we need create the same set
>> >>> of
>> >>>>>>> methods
>> >>>>>>>> of
>> >>>>>>>>>>>>>>>> `Table`
>> >>>>>>>>>>>>>>>>> for it. I don't think it is worth doing this. Or can
>> >>> you
>> >>>>>> please
>> >>>>>>>>>>>>>> elaborate
>> >>>>>>>>>>>>>>>>> more on what could be the "implicit behaviours/side
>> >>>>> effects"
>> >>>>>>> you
>> >>>>>>>>>> are
>> >>>>>>>>>>>>>>>>> thinking about?
>> >>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>> Regards,
>> >>>>>>>>>>>>>>>>> Shaoxuan
>> >>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>> On Thu, Nov 29, 2018 at 7:05 PM Piotr Nowojski <
>> >>>>>>>>>>>>>> piotr@data-artisans.com>
>> >>>>>>>>>>>>>>>>> wrote:
>> >>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>> Hi Becket,
>> >>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>> Thanks for the response.
>> >>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>> 1. I wasn’t saying that materialised view must be
>> >>>> mutable
>> >>>>> or
>> >>>>>>>> not.
>> >>>>>>>>>>>>> The
>> >>>>>>>>>>>>>>>>> same
>> >>>>>>>>>>>>>>>>>> thing applies to caches as well. To the contrary, I
>> >>>> would
>> >>>>>>> expect
>> >>>>>>>>>>>>> more
>> >>>>>>>>>>>>>>>>>> consistency and updates from something that is
>> >> called
>> >>>>>> “cache”
>> >>>>>>> vs
>> >>>>>>>>>>>>>>>>> something
>> >>>>>>>>>>>>>>>>>> that’s a “materialised view”. In other words, IMO
>> >> most
>> >>>>>> caches
>> >>>>>>> do
>> >>>>>>>>>> not
>> >>>>>>>>>>>>>>>>> serve
>> >>>>>>>>>>>>>>>>>> you invalid/outdated data and they handle updates on
>> >>>> their
>> >>>>>>> own.
>> >>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>> 2. I don’t think that having in the future two very
>> >>>>> similar
>> >>>>>>>>>> concepts
>> >>>>>>>>>>>>>> of
>> >>>>>>>>>>>>>>>>>> `materialized` view and `cache` is a good idea. It
>> >>> would
>> >>>>> be
>> >>>>>>>>>>>>> confusing
>> >>>>>>>>>>>>>>>> for
>> >>>>>>>>>>>>>>>>>> the users. I think it could be handled by
>> >>>>>>> variations/overloading
>> >>>>>>>>>> of
>> >>>>>>>>>>>>>>>>>> materialised view concept. We could start with:
>> >>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>> `MaterializedTable materialize()` - immutable,
>> >> session
>> >>>>> life
>> >>>>>>>> scope
>> >>>>>>>>>>>>>>>>>> (basically the same semantic as you are proposing
>> >>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>> And then in the future (if ever) build on top of
>> >>>>> that/expand
>> >>>>>>> it
>> >>>>>>>>>>>>> with:
>> >>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>> `MaterializedTable materialize(refreshTime=…)` or
>> >>>>>>>>>> `MaterializedTable
>> >>>>>>>>>>>>>>>>>> materialize(refreshHook=…)`
>> >>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>> Or with cross session support:
>> >>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>> `MaterializedTable materializeInto(connector=…)` or
>> >>>>>>>>>>>>> `MaterializedTable
>> >>>>>>>>>>>>>>>>>> materializeInto(tableFactory=…)`
>> >>>>>>>>>>>>>>>>>>
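>> >>>>>>>>>>>>>>>>>> In Java terms this evolution path could look roughly like
>> >>>>>>>>>>>>>>>>>> (signatures purely illustrative):
>> >>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>> interface Table {
>> >>>>>>>>>>>>>>>>>>   MaterializedTable materialize(); // immutable, session scope
>> >>>>>>>>>>>>>>>>>>   MaterializedTable materialize(Duration refreshInterval);
>> >>>>>>>>>>>>>>>>>>   MaterializedTable materializeInto(TableSink<?> connector);
>> >>>>>>>>>>>>>>>>>> }
>> >>>>>>>>>>>>>>>>>>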
>> >>>>>>>>>>>>>>>>>> I’m not saying that we should implement cross
>> >>>>>>> session/refreshing
>> >>>>>>>>>> now
>> >>>>>>>>>>>>>> or
>> >>>>>>>>>>>>>>>>>> even in the near future. I’m just arguing that
>> >> naming
>> >>>>>> current
>> >>>>>>>>>>>>>> immutable
>> >>>>>>>>>>>>>>>>>> session life scope method `materialize()` is more
>> >>> future
>> >>>>>> proof
>> >>>>>>>> and
>> >>>>>>>>>>>>>> more
>> >>>>>>>>>>>>>>>>>> consistent with SQL (on which after all table-api is
>> >>>>> heavily
>> >>>>>>>>>> basing
>> >>>>>>>>>>>>>>>> on).
>> >>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>> 3. Even if we agree on naming it `cache()`, I would
>> >>>> still
>> >>>>>>> insist
>> >>>>>>>>>> on
>> >>>>>>>>>>>>>>>>>> `cache()` returning `CachedTable` handle to avoid
>> >>>> implicit
>> >>>>>>>>>>>>>>>>> behaviours/side
>> >>>>>>>>>>>>>>>>>> effects and to give both us & users more
>> >> flexibility.
>> >>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>> Piotrek
>> >>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>> On 29 Nov 2018, at 06:20, Becket Qin <
>> >>>>> becket.qin@gmail.com
>> >>>>>>>
>> >>>>>>>>>> wrote:
>> >>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>> Just to add a little bit, the materialized view is
>> >>>>> probably
>> >>>>>>>> more
>> >>>>>>>>>>>>>>>>> similar
>> >>>>>>>>>>>>>>>>>> to
>> >>>>>>>>>>>>>>>>>>> the persist() brought up earlier in the thread.
>> >> So
>> >>>> it
>> >>>>> is
>> >>>>>>>>>> usually
>> >>>>>>>>>>>>>>>>> cross
>> >>>>>>>>>>>>>>>>>>> session and could be used in a larger scope. For
>> >>>>> example, a
>> >>>>>>>>>>>>>>>>> materialized
>> >>>>>>>>>>>>>>>>>>> view created by user A may be visible to user B. It
>> >>> is
>> >>>>>>> probably
>> >>>>>>>>>>>>>>>>> something
>> >>>>>>>>>>>>>>>>>>> we want to have in the future. I'll put it in the
>> >>>> future
>> >>>>>> work
>> >>>>>>>>>>>>>>>> section.
>> >>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>> Thanks,
>> >>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>> Jiangjie (Becket) Qin
>> >>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>> On Thu, Nov 29, 2018 at 9:47 AM Becket Qin <
>> >>>>>>>> becket.qin@gmail.com
>> >>>>>>>>>>>
>> >>>>>>>>>>>>>>>>> wrote:
>> >>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>> Hi Piotrek,
>> >>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>> Thanks for the explanation.
>> >>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>> Right now we are mostly thinking of the cached
>> >> table
>> >>>> as
>> >>>>>>>>>>>>> immutable. I
>> >>>>>>>>>>>>>>>>> can
>> >>>>>>>>>>>>>>>>>>>> see the Materialized view would be useful in the
>> >>>> future.
>> >>>>>>> That
>> >>>>>>>>>>>>> said,
>> >>>>>>>>>>>>>>>> I
>> >>>>>>>>>>>>>>>>>> think
>> >>>>>>>>>>>>>>>>>>>> a simple cache mechanism is probably still needed.
>> >>> So
>> >>>> to
>> >>>>>> me,
>> >>>>>>>>>>>>> cache()
>> >>>>>>>>>>>>>>>>> and
>> >>>>>>>>>>>>>>>>>>>> materialize() should be two separate methods as they
>> >>>>>>>>>>>>>>>>>>>> address different needs. Materialize() is a higher level
>> >>>>>>>>>>>>>>>>>>>> concept usually implying periodical update, while cache()
>> >>>>>>>>>>>>>>>>>>>> has much simpler semantics.
>> >> For
>> >>>>>>> example,
>> >>>>>>>>>> one
>> >>>>>>>>>>>>>>>> may
>> >>>>>>>>>>>>>>>>>>>> create a materialized view and use cache() method
>> >> in
>> >>>> the
>> >>>>>>>>>>>>>>>> materialized
>> >>>>>>>>>>>>>>>>>> view
>> >>>>>>>>>>>>>>>>>>>> creation logic. So that during the materialized
>> >> view
>> >>>>>> update,
>> >>>>>>>>>> they
>> >>>>>>>>>>>>> do
>> >>>>>>>>>>>>>>>>> not
>> >>>>>>>>>>>>>>>>>>>> need to worry about the case that the cached table
>> >>> is
>> >>>>> also
>> >>>>>>>>>>>>> changed.
>> >>>>>>>>>>>>>>>>>> Maybe
>> >>>>>>>>>>>>>>>>>>>> under the hood, materialized() and cache() could
>> >>> share
>> >>>>>> some
>> >>>>>>>>>>>>>>>> mechanism,
>> >>>>>>>>>>>>>>>>>> but
>> >>>>>>>>>>>>>>>>>>>> I think a simple cache() method would be handy in
>> >> a
>> >>>> lot
>> >>>>> of
>> >>>>>>>>>> cases.
>> >>>>>>>>>>>>>>>>>>>>
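>> >>>>>>>>>>>>>>>>>>>> For instance, sketched with hypothetical names:
>> >>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>> void updateView(Table source, MaterializedTable view) {
>> >>>>>>>>>>>>>>>>>>>>   Table snapshot = source.where(...).select(...);
>> >>>>>>>>>>>>>>>>>>>>   snapshot.cache(); // pin the intermediate result for this run
>> >>>>>>>>>>>>>>>>>>>>   view.refreshFrom(snapshot); // snapshot cannot change mid-update
>> >>>>>>>>>>>>>>>>>>>> }
>> >>>>>>>>>>>>>>>>>>>>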
>> >>>>>>>>>>>>>>>>>>>> Thanks,
>> >>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>> Jiangjie (Becket) Qin
>> >>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>> On Mon, Nov 26, 2018 at 9:38 PM Piotr Nowojski <
>> >>>>>>>>>>>>>>>>> piotr@data-artisans.com
>> >>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>> wrote:
>> >>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>> Hi Becket,
>> >>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>> Is there any extra thing user can do on a
>> >>>>>>> MaterializedTable
>> >>>>>>>>>> that
>> >>>>>>>>>>>>>>>>> they
>> >>>>>>>>>>>>>>>>>>>>> cannot do on a Table?
>> >>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>> Maybe not in the initial implementation, but
>> >>> various
>> >>>>> DBs
>> >>>>>>>> offer
>> >>>>>>>>>>>>>>>>>> different
>> >>>>>>>>>>>>>>>>>>>>> ways to “refresh” the materialised view. Hooks,
>> >>>>> triggers,
>> >>>>>>>>>> timers,
>> >>>>>>>>>>>>>>>>>> manually
>> >>>>>>>>>>>>>>>>>>>>> etc. Having `MaterializedTable` would help us to
>> >>>> handle
>> >>>>>>> that
>> >>>>>>>> in
>> >>>>>>>>>>>>> the
>> >>>>>>>>>>>>>>>>>> future.
>> >>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>> After users call *table.cache(), *users can just
>> >>> use
>> >>>>>> that
>> >>>>>>>>>> table
>> >>>>>>>>>>>>>>>> and
>> >>>>>>>>>>>>>>>>> do
>> >>>>>>>>>>>>>>>>>>>>> anything that is supported on a Table, including
>> >>> SQL.
>> >>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>> This is some implicit behaviour with side
>> >> effects.
>> >>>>>> Imagine
>> >>>>>>> if
>> >>>>>>>>>>>>> user
>> >>>>>>>>>>>>>>>>> has
>> >>>>>>>>>>>>>>>>>> a
>> >>>>>>>>>>>>>>>>>>>>> long and complicated program, that touches table
>> >>> `b`
>> >>>>>>> multiple
>> >>>>>>>>>>>>>>>> times,
>> >>>>>>>>>>>>>>>>>> maybe
>> >>>>>>>>>>>>>>>>>>>>> scattered around different methods. If he
>> >> modifies
>> >>>> his
>> >>>>>>>> program
>> >>>>>>>>>> by
>> >>>>>>>>>>>>>>>>>> inserting
>> >>>>>>>>>>>>>>>>>>>>> in one place
>> >>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>> b.cache()
>> >>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>> This implicitly alters the semantic and behaviour
>> >>> of
>> >>>>> his
>> >>>>>>> code
>> >>>>>>>>>> all
>> >>>>>>>>>>>>>>>>> over
>> >>>>>>>>>>>>>>>>>>>>> the place, maybe in a ways that might cause
>> >>> problems.
>> >>>>> For
>> >>>>>>>>>> example
>> >>>>>>>>>>>>>>>>> what
>> >>>>>>>>>>>>>>>>>> if
>> >>>>>>>>>>>>>>>>>>>>> underlying data is changing?
>> >>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>> Having invisible side effects is also not very
>> >>> clean,
>> >>>>> for
>> >>>>>>>>>> example
>> >>>>>>>>>>>>>>>>> think
>> >>>>>>>>>>>>>>>>>>>>> about something like this (but more complicated):
>> >>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>> Table b = ...;
>> >>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>> if (someCondition) {
>> >>>>>>>>>>>>>>>>>>>>>     processTable1(b);
>> >>>>>>>>>>>>>>>>>>>>> } else {
>> >>>>>>>>>>>>>>>>>>>>>     processTable2(b);
>> >>>>>>>>>>>>>>>>>>>>> }
>> >>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>> // do more stuff with b
>> >>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>> And user adds `b.cache()` call to only one of the
>> >>>>>>>>>> `processTable1`
>> >>>>>>>>>>>>>>>> or
>> >>>>>>>>>>>>>>>>>>>>> `processTable2` methods.
>> >>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>> On the other hand
>> >>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>> Table materialisedB = b.materialize()
>> >>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>> Avoids (at least some of) the side effect issues
>> >>> and
>> >>>>>> forces
>> >>>>>>>>>> user
>> >>>>>>>>>>>>> to
>> >>>>>>>>>>>>>>>>>>>>> explicitly use `materialisedB` where it’s
>> >>> appropriate
>> >>>>> and
>> >>>>>>>>>> forces
>> >>>>>>>>>>>>>>>> user
>> >>>>>>>>>>>>>>>>>> to
>> >>>>>>>>>>>>>>>>>>>>> think what does it actually mean. And if
>> >> something
>> >>>>>> doesn’t
>> >>>>>>>> work
>> >>>>>>>>>>>>> in
>> >>>>>>>>>>>>>>>>> the
>> >>>>>>>>>>>>>>>>>> end
>> >>>>>>>>>>>>>>>>>>>>> for the user, he will know what has he changed
>> >>>> instead
>> >>>>> of
>> >>>>>>>>>> blaming
>> >>>>>>>>>>>>>>>>>> Flink for
>> >>>>>>>>>>>>>>>>>>>>> some “magic” underneath. In the above example,
>> >>> after
>> >>>>>>>>>>>>> materialising
>> >>>>>>>>>>>>>>>> b
>> >>>>>>>>>>>>>>>>> in
>> >>>>>>>>>>>>>>>>>>>>> only one of the methods, he should/would realise
>> >>>> about
>> >>>>>> the
>> >>>>>>>>>> issue
>> >>>>>>>>>>>>>>>> when
>> >>>>>>>>>>>>>>>>>>>>> handling the return value `MaterializedTable` of
>> >>> that
>> >>>>>>> method.
>> >>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>> I guess it comes down to personal preferences if
>> >>> you
>> >>>>> like
>> >>>>>>>>>> things
>> >>>>>>>>>>>>> to
>> >>>>>>>>>>>>>>>>> be
>> >>>>>>>>>>>>>>>>>>>>> implicit or not. The more power is the user,
>> >>> probably
>> >>>>> the
>> >>>>>>>> more
>> >>>>>>>>>>>>>>>> likely
>> >>>>>>>>>>>>>>>>>> he is
>> >>>>>>>>>>>>>>>>>>>>> to like/understand implicit behaviour. And we as
>> >>>> Table
>> >>>>>> API
>> >>>>>>>>>>>>>>>> designers
>> >>>>>>>>>>>>>>>>>> are
>> >>>>>>>>>>>>>>>>>>>>> the most power users out there, so I would
>> >> proceed
>> >>>> with
>> >>>>>>>> caution
>> >>>>>>>>>>>>> (so
>> >>>>>>>>>>>>>>>>>> that we
>> >>>>>>>>>>>>>>>>>>>>> do not end up in the crazy perl realm with it’s
>> >>>> lovely
>> >>>>>>>> implicit
>> >>>>>>>>>>>>>>>>> method
>> >>>>>>>>>>>>>>>>>>>>> arguments ;)  <
>> >>>>>>> https://stackoverflow.com/a/14922656/8149051
>> >>>>>>>>> )
>> >>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>> Table API to also support non-relational
>> >>> processing
>> >>>>>> cases,
>> >>>>>>>>>>>>> cache()
>> >>>>>>>>>>>>>>>>>>>>> might be slightly better.
>> >>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>> I think even such extended Table API could
>> >> benefit
>> >>>> from
>> >>>>>>>>>> sticking
>> >>>>>>>>>>>>>>>>>> to/being
>> >>>>>>>>>>>>>>>>>>>>> consistent with SQL where both SQL and Table API
>> >>> are
>> >>>>>>>> basically
>> >>>>>>>>>>>>> the
>> >>>>>>>>>>>>>>>>>> same.
>> >>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>> One more thing. `MaterializedTable materialize()`
>> >>>> could
>> >>>>>> be
>> >>>>>>>> more
>> >>>>>>>>>>>>>>>>>>>>> powerful/flexible allowing the user to operate
>> >> both
>> >>>> on
>> >>>>>>>>>>>>> materialised
>> >>>>>>>>>>>>>>>>>> and not
>> >>>>>>>>>>>>>>>>>>>>> materialised view at the same time for whatever
>> >>>> reasons
>> >>>>>>>>>>>>> (underlying
>> >>>>>>>>>>>>>>>>>> data
>> >>>>>>>>>>>>>>>>>>>>> changing/better optimisation opportunities after
>> >>>>> pushing
>> >>>>>>> down
>> >>>>>>>>>>>>> more
>> >>>>>>>>>>>>>>>>>> filters
>> >>>>>>>>>>>>>>>>>>>>> etc). For example:
>> >>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>> Table b = …;
>> >>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>> MaterializedTable mb = b.materialize();
>> >>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>> Val min = mb.min();
>> >>>>>>>>>>>>>>>>>>>>> Val max = mb.max();
>> >>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>> Val user42 = b.filter(‘userId = 42);
>> >>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>> Could be more efficient compared to `b.cache()`
>> >> if
>> >>>>>>>>>>>>> `filter(‘userId
>> >>>>>>>>>>>>>>>> =
>> >>>>>>>>>>>>>>>>>>>>> 42);` allows for much more aggressive
>> >>> optimisations.
>> >>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>> Piotrek
>> >>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>> On 26 Nov 2018, at 12:14, Fabian Hueske <
>> >>>>>>> fhueske@gmail.com>
>> >>>>>>>>>>>>>>>> wrote:
>> >>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>> I'm not suggesting to add support for Ignite.
>> >> This
>> >>>> was
>> >>>>>>> just
>> >>>>>>>> an
>> >>>>>>>>>>>>>>>>>> example.
>> >>>>>>>>>>>>>>>>>>>>>> Plasma and Arrow sound interesting, too.
>> >>>>>>>>>>>>>>>>>>>>>> For the sake of this proposal, it would be up to
>> >>> the
>> >>>>>> user
>> >>>>>>> to
>> >>>>>>>>>>>>>>>>>> implement a
>> >>>>>>>>>>>>>>>>>>>>>> TableFactory and corresponding TableSource /
>> >>>> TableSink
>> >>>>>>>> classes
>> >>>>>>>>>>>>> to
>> >>>>>>>>>>>>>>>>>>>>> persist
>> >>>>>>>>>>>>>>>>>>>>>> and read the data.
>> >>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>> Am Mo., 26. Nov. 2018 um 12:06 Uhr schrieb
>> >> Flavio
>> >>>>>>>> Pompermaier
>> >>>>>>>>>> <
>> >>>>>>>>>>>>>>>>>>>>>> pompermaier@okkam.it>:
>> >>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>> What about to add also Apache Plasma + Arrow as
>> >>> an
>> >>>>>>>>>> alternative
>> >>>>>>>>>>>>> to
>> >>>>>>>>>>>>>>>>>>>>> Apache
>> >>>>>>>>>>>>>>>>>>>>>>> Ignite?
>> >>>>>>>>>>>>>>>>>>>>>>> [1]
>> >>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>
>> >>>>>
>> >>>
>> https://arrow.apache.org/blog/2017/08/08/plasma-in-memory-object-store/
>> >>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>> On Mon, Nov 26, 2018 at 11:56 AM Fabian Hueske
>> >> <
>> >>>>>>>>>>>>>>>> fhueske@gmail.com>
>> >>>>>>>>>>>>>>>>>>>>> wrote:
>> >>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>> Hi,
>> >>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>> Thanks for the proposal!
>> >>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>> To summarize, you propose a new method
>> >>>>> Table.cache():
>> >>>>>>>> Table
>> >>>>>>>>>>>>> that
>> >>>>>>>>>>>>>>>>>> will
>> >>>>>>>>>>>>>>>>>>>>>>>> trigger a job and write the result into some
>> >>>>> temporary
>> >>>>>>>>>> storage
>> >>>>>>>>>>>>>>>> as
>> >>>>>>>>>>>>>>>>>>>>> defined
>> >>>>>>>>>>>>>>>>>>>>>>>> by a TableFactory.
>> >>>>>>>>>>>>>>>>>>>>>>>> The cache() call blocks while the job is
>> >> running
>> >>>> and
>> >>>>>>>>>>>>> eventually
>> >>>>>>>>>>>>>>>>>>>>> returns a
>> >>>>>>>>>>>>>>>>>>>>>>>> Table object that represents a scan of the
>> >>>> temporary
>> >>>>>>>> table.
>> >>>>>>>>>>>>>>>>>>>>>>>> When the "session" is closed (closing to be
>> >>>>> defined?),
>> >>>>>>> the
>> >>>>>>>>>>>>>>>>> temporary
>> >>>>>>>>>>>>>>>>>>>>>>> tables
>> >>>>>>>>>>>>>>>>>>>>>>>> are all dropped.
>> >>>>>>>>>>>>>>>>>>>>>>>>
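>> >>>>>>>>>>>>>>>>>>>>>>>> In other words, roughly (hypothetical API):
>> >>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>> Table t = env.scan("src").groupBy(...).select(...);
>> >>>>>>>>>>>>>>>>>>>>>>>> Table cached = t.cache(); // blocks: runs a job, writes via a TableFactory sink
>> >>>>>>>>>>>>>>>>>>>>>>>> cached.select(...); // scans the temporary table, no recomputation
>> >>>>>>>>>>>>>>>>>>>>>>>> // session close: all temporary tables are dropped
>> >>>>>>>>>>>>>>>>>>>>>>>>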
>> >>>>>>>>>>>>>>>>>>>>>>>> I think this behavior makes sense and is a
>> >> good
>> >>>>> first
>> >>>>>>> step
>> >>>>>>>>>>>>>>>> towards
>> >>>>>>>>>>>>>>>>>>>>> more
>> >>>>>>>>>>>>>>>>>>>>>>>> interactive workloads.
>> >>>>>>>>>>>>>>>>>>>>>>>> However, its performance suffers from writing
>> >> to
>> >>>> and
>> >>>>>>>> reading
>> >>>>>>>>>>>>>>>> from
>> >>>>>>>>>>>>>>>>>>>>>>> external
>> >>>>>>>>>>>>>>>>>>>>>>>> systems.
>> >>>>>>>>>>>>>>>>>>>>>>>> I think this is OK for now. Changes that would
>> >>>>>>>> significantly
>> >>>>>>>>>>>>>>>>> improve
>> >>>>>>>>>>>>>>>>>>>>> the
>> >>>>>>>>>>>>>>>>>>>>>>>> situation (i.e., pinning data in-memory across
>> >>>> jobs)
>> >>>>>>> would
>> >>>>>>>>>>>>> have
>> >>>>>>>>>>>>>>>>>> large
>> >>>>>>>>>>>>>>>>>>>>>>>> impacts on many components of Flink.
>> >>>>>>>>>>>>>>>>>>>>>>>> Users could use in-memory filesystems or
>> >> storage
>> >>>>> grids
>> >>>>>>>>>> (Apache
>> >>>>>>>>>>>>>>>>>>>>> Ignite) to
>> >>>>>>>>>>>>>>>>>>>>>>>> mitigate some of the performance effects.
>> >>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>> Best, Fabian
>> >>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>> Am Mo., 26. Nov. 2018 um 03:38 Uhr schrieb
>> >>> Becket
>> >>>>> Qin
>> >>>>>> <
>> >>>>>>>>>>>>>>>>>>>>>>>> becket.qin@gmail.com
>> >>>>>>>>>>>>>>>>>>>>>>>>> :
>> >>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>> Thanks for the explanation, Piotrek.
>> >>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>> Is there any extra thing user can do on a
>> >>>>>>>> MaterializedTable
>> >>>>>>>>>>>>>>>> that
>> >>>>>>>>>>>>>>>>>> they
>> >>>>>>>>>>>>>>>>>>>>>>>>> cannot do on a Table? After users call
>> >>>>>> *table.cache(),
>> >>>>>>>>>> *users
>> >>>>>>>>>>>>>>>> can
>> >>>>>>>>>>>>>>>>>>>>> just
>> >>>>>>>>>>>>>>>>>>>>>>>> use
>> >>>>>>>>>>>>>>>>>>>>>>>>> that table and do anything that is supported
>> >>> on a
>> >>>>>>> Table,
>> >>>>>>>>>>>>>>>>> including
>> >>>>>>>>>>>>>>>>>>>>> SQL.
>> >>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>> Naming wise, either cache() or materialize()
>> >>>> sounds
>> >>>>>>> fine
>> >>>>>>>> to
>> >>>>>>>>>>>>> me.
>> >>>>>>>>>>>>>>>>>>>>> cache()
>> >>>>>>>>>>>>>>>>>>>>>>>> is
>> >>>>>>>>>>>>>>>>>>>>>>>>> a bit more general than materialize(). Given
>> >>> that
>> >>>>> we
>> >>>>>>> are
>> >>>>>>>>>>>>>>>>> enhancing
>> >>>>>>>>>>>>>>>>>>>>> the
>> >>>>>>>>>>>>>>>>>>>>>>>>> Table API to also support non-relational
>> >>>> processing
>> >>>>>>>> cases,
>> >>>>>>>>>>>>>>>>> cache()
>> >>>>>>>>>>>>>>>>>>>>>>> might
>> >>>>>>>>>>>>>>>>>>>>>>>> be
>> >>>>>>>>>>>>>>>>>>>>>>>>> slightly better.
>> >>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>> Thanks,
>> >>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>> Jiangjie (Becket) Qin
>> >>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>> On Fri, Nov 23, 2018 at 11:25 PM Piotr
>> >>> Nowojski <
>> >>>>>>>>>>>>>>>>>>>>>>> piotr@data-artisans.com
>> >>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>> wrote:
>> >>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>> Hi Becket,
>> >>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>> Oops, sorry I didn’t notice that you intend
>> >> to
>> >>>>> reuse
>> >>>>>>>>>> existing
>> >>>>>>>>>>>>>>>>>>>>>>>>>> `TableFactory`. I don’t know why, but I
>> >>> assumed
>> >>>>> that
>> >>>>>>> you
>> >>>>>>>>>>>>> want
>> >>>>>>>>>>>>>>>> to
>> >>>>>>>>>>>>>>>>>>>>>>>> provide
>> >>>>>>>>>>>>>>>>>>>>>>>>> an
>> >>>>>>>>>>>>>>>>>>>>>>>>>> alternate way of writing the data.
>> >>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>> Now that I hopefully understand the
>> >> proposal,
>> >>>>> maybe
>> >>>>>> we
>> >>>>>>>>>> could
>> >>>>>>>>>>>>>>>>>> rename
>> >>>>>>>>>>>>>>>>>>>>>>>>>> `cache()` to
>> >>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>> void materialize()
>> >>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>> or going step further
>> >>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>> MaterializedTable materialize()
>> >>>>>>>>>>>>>>>>>>>>>>>>>> MaterializedTable createMaterializedView()
>> >>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>> ?
>> >>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>> The second option with returning a handle I
>> >>>> think
>> >>>>> is
>> >>>>>>>> more
>> >>>>>>>>>>>>>>>>> flexible
>> >>>>>>>>>>>>>>>>>>>>>>> and
>> >>>>>>>>>>>>>>>>>>>>>>>>>> could provide features such as
>> >>>> “refresh”/“delete”
>> >>>>> or
>> >>>>>>>>>>>>> generally
>> >>>>>>>>>>>>>>>>>>>>>>> speaking
>> >>>>>>>>>>>>>>>>>>>>>>>>>> manage the view. In the future we could
>> >>> also
>> >>>>>> think
>> >>>>>>>>>> about
>> >>>>>>>>>>>>>>>>>> adding
>> >>>>>>>>>>>>>>>>>>>>>>>> hooks
>> >>>>>>>>>>>>>>>>>>>>>>>>>> to automatically refresh view etc. It is
>> >> also
>> >>>> more
>> >>>>>>>>>> explicit
>> >>>>>>>>>>>>> -
>> >>>>>>>>>>>>>>>>>>>>>>>>>> materialization returning a new table handle
>> >>>> will
>> >>>>>> not
>> >>>>>>>> have
>> >>>>>>>>>>>>> the
>> >>>>>>>>>>>>>>>>>> same
>> >>>>>>>>>>>>>>>>>>>>>>>>>> implicit side effects as adding a simple
>> >> line
>> >>> of
>> >>>>>> code
>> >>>>>>>> like
>> >>>>>>>>>>>>>>>>>>>>>>> `b.cache()`
>> >>>>>>>>>>>>>>>>>>>>>>>>>> would have.
>> >>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>> It would also be more SQL like, making it
>> >> more
>> >>>>>>> intuitive
>> >>>>>>>>>> for
>> >>>>>>>>>>>>>>>>> users
>> >>>>>>>>>>>>>>>>>>>>>>>>> already
>> >>>>>>>>>>>>>>>>>>>>>>>>>> familiar with the SQL.
>> >>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>> Piotrek
>> >>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>>> On 23 Nov 2018, at 14:53, Becket Qin <
>> >>>>>>>>>> becket.qin@gmail.com
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>> wrote:
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>>> Hi Piotrek,
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>>> For the cache() method itself, yes, it is
>> >>>>>> equivalent
>> >>>>>>> to
>> >>>>>>>>>>>>>>>>> creating
>> >>>>>>>>>>>>>>>>>> a
>> >>>>>>>>>>>>>>>>>>>>>>>>>> BUILT-IN
>> >>>>>>>>>>>>>>>>>>>>>>>>>>> materialized view with a lifecycle. That
>> >>>>>>> functionality
>> >>>>>>>> is
>> >>>>>>>>>>>>>>>>> missing
>> >>>>>>>>>>>>>>>>>>>>>>>>> today,
>> >>>>>>>>>>>>>>>>>>>>>>>>>>> though. Not sure if I understand your
>> >>> question.
>> >>>>> Do
>> >>>>>>> you
>> >>>>>>>>>> mean
>> >>>>>>>>>>>>>>>> we
>> >>>>>>>>>>>>>>>>>>>>>>>> already
>> >>>>>>>>>>>>>>>>>>>>>>>>>> have
>> >>>>>>>>>>>>>>>>>>>>>>>>> the functionality and just need some syntactic
>> >>>>>>>>>>>>>>>>>>>>>>>>> sugar?
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>>> What's more interesting in the proposal is
>> >> do
>> >>>> we
>> >>>>>> want
>> >>>>>>>> to
>> >>>>>>>>>>>>> stop
>> >>>>>>>>>>>>>>>>> at
>> >>>>>>>>>>>>>>>>>>>>>>>>> creating
>> >>>>>>>>>>>>>>>>>>>>>>>>>>> the materialized view? Or do we want to
>> >>> extend
>> >>>>> that
>> >>>>>>> in
>> >>>>>>>>>> the
>> >>>>>>>>>>>>>>>>> future
>> >>>>>>>>>>>>>>>>>>>>>>> to
>> >>>>>>>>>>>>>>>>>>>>>>>> a
>> >>>>>>>>>>>>>>>>>>>>>>>>>> more
>> >>>>>>>>>>>>>>>>>>>>>>>>>>> useful unified data store distributed with
>> >>>> Flink?
>> >>>>>> And
>> >>>>>>>> do
>> >>>>>>>>>> we
>> >>>>>>>>>>>>>>>>> want
>> >>>>>>>>>>>>>>>>>> to
>> >>>>>>>>>>>>>>>>>>>>>>>>> have
>> >>>>>>>>>>>>>>>>>>>>>>>>>> a
>> >>>>>>>>>>>>>>>>>>>>>>>>>>> mechanism allow more flexible user job
>> >>> pattern
>> >>>>> with
>> >>>>>>>> their
>> >>>>>>>>>>>>> own
>> >>>>>>>>>>>>>>>>>> user
>> >>>>>>>>>>>>>>>>>>>>>>>>>> defined
>> >>>>>>>>>>>>>>>>>>>>>>>>>>> services. These considerations are much
>> >> more
>> >>>>>>>>>> architectural.
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks,
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>>> Jiangjie (Becket) Qin
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>>> On Fri, Nov 23, 2018 at 6:01 PM Piotr
>> >>> Nowojski
>> >>>> <
>> >>>>>>>>>>>>>>>>>>>>>>>>> piotr@data-artisans.com>
>> >>>>>>>>>>>>>>>>>>>>>>>>>>> wrote:
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi,
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> Interesting idea. I’m trying to understand
>> >>> the
>> >>>>>>>> problem.
>> >>>>>>>>>>>>>>>> Isn’t
>> >>>>>>>>>>>>>>>>>> the
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> `cache()` call an equivalent of writing
>> >> data
>> >>>> to
>> >>>>> a
>> >>>>>>> sink
>> >>>>>>>>>> and
>> >>>>>>>>>>>>>>>>> later
>> >>>>>>>>>>>>>>>>>>>>>>>>> reading
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> from it? Where this sink has a limited life
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> scope/lifetime?
>> >>>>>>>>>>>>>>>>> And
>> >>>>>>>>>>>>>>>>>>>>>>> the
>> >>>>>>>>>>>>>>>>>>>>>>>>>> sink
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> could be implemented as in memory or a
>> >> file
>> >>>>> sink?
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
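>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> I.e., roughly (the temp sink is the hypothetical
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> part):
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> b.writeToSink(new TempTableSink("b_tmp")); // materialise once
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> env.execute();
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> Table cachedB = tableEnv.scan("b_tmp"); // later reads hit the temp table
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>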
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> If so, what’s the problem with creating a
>> >>>>>>> materialised
>> >>>>>>>>>>>>> view
>> >>>>>>>>>>>>>>>>>> from a
>> >>>>>>>>>>>>>>>>>>>>>>>>> table
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> “b” (from your document’s example) and
>> >>> reusing
>> >>>>>> this
>> >>>>>>>>>>>>>>>>> materialised
>> >>>>>>>>>>>>>>>>>>>>>>>> view
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> later? Maybe we are lacking mechanisms to
>> >>>> clean
>> >>>>> up
>> >>>>>>>>>>>>>>>>> materialised
>> >>>>>>>>>>>>>>>>>>>>>>>> views
>> >>>>>>>>>>>>>>>>>>>>>>>>>> (for
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> example when current session finishes)?
>> >>> Maybe
>> >>>> we
>> >>>>>>> need
>> >>>>>>>>>> some
>> >>>>>>>>>>>>>>>>>>>>>>> syntactic
>> >>>>>>>>>>>>>>>>>>>>>>>>>> sugar
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> on top of it?
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> Piotrek
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> On 23 Nov 2018, at 07:21, Becket Qin <
>> >>>>>>>>>>>>> becket.qin@gmail.com
>> >>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>> wrote:
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for the suggestion, Jincheng.
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> Yes, I think it makes sense to have a
>> >>>> persist()
>> >>>>>>> with
>> >>>>>>>>>>>>>>>>>>>>>>>>> lifecycle/defined
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> scope. I just added a section in the
>> >> future
>> >>>>> work
>> >>>>>>> for
>> >>>>>>>>>>>>> this.
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks,
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> Jiangjie (Becket) Qin
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Fri, Nov 23, 2018 at 1:55 PM jincheng
>> >>> sun
>> >>>> <
>> >>>>>>>>>>>>>>>>>>>>>>>>> sunjincheng121@gmail.com
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> wrote:
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi Jiangjie,
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thank you for the explanation about the name
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> of `cache()`, I understand why you designed it
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> this way!
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Another idea is whether we can specify a
>> >>>>>> lifecycle
>> >>>>>>>> for
>> >>>>>>>>>>>>>>>> data
>> >>>>>>>>>>>>>>>>>>>>>>>>>> persistence?
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> For example, persist
>> >> (LifeCycle.SESSION),
>> >>> so
>> >>>>>> that
>> >>>>>>>> the
>> >>>>>>>>>>>>> user
>> >>>>>>>>>>>>>>>>> is
>> >>>>>>>>>>>>>>>>>>>>>>> not
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> worried
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> about data loss, and will clearly
>> >> specify
>> >>>> the
>> >>>>>> time
>> >>>>>>>>>> range
>> >>>>>>>>>>>>>>>> for
>> >>>>>>>>>>>>>>>>>>>>>>>> keeping
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> time.
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> At the same time, if we want to expand,
>> >> we
>> >>>> can
>> >>>>>>> also
>> >>>>>>>>>>>>> share
>> >>>>>>>>>>>>>>>>> in a
>> >>>>>>>>>>>>>>>>>>>>>>>>> certain
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> group of session, for example:
>> >>>>>>>>>>>>>>>>> LifeCycle.SESSION_GROUP(...), I
>> >>>>>>>>>>>>>>>>>>>>>>> am
>> >>>>>>>>>>>>>>>>>>>>>>>>> not
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> sure,
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> just an immature suggestion, for
>> >> reference
>> >>>>> only!
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
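>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Something like (all names illustrative only):
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> enum LifeCycle { SESSION, SESSION_GROUP }
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> t.persist(LifeCycle.SESSION); // dropped when the session closes
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> t.persist(LifeCycle.SESSION_GROUP, groupId); // shared in a group
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>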
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Bests,
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Jincheng
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Becket Qin <be...@gmail.com>
>> >>>>>> 于2018年11月23日周五
>> >>>>>>>>>>>>>>>> 下午1:33写道:
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Re: Jincheng,
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for the feedback. Regarding
>> >>> cache()
>> >>>>> v.s.
>> >>>>>>>>>>>>>>>> persist(),
>> >>>>>>>>>>>>>>>>>>>>>>>>>> personally I
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> find cache() to be more accurately
>> >>>> describing
>> >>>>>> the
>> >>>>>>>>>>>>>>>> behavior,
>> >>>>>>>>>>>>>>>>>>>>>>> i.e.
>> >>>>>>>>>>>>>>>>>>>>>>>>> the
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Table
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> is cached for the session, but will be
>> >>>>> deleted
>> >>>>>>>> after
>> >>>>>>>>>>>>> the
>> >>>>>>>>>>>>>>>>>>>>>>> session
>> >>>>>>>>>>>>>>>>>>>>>>>> is
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> closed.
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> persist() seems a little misleading as
>> >>>> people
>> >>>>>>> might
>> >>>>>>>>>>>>> think
>> >>>>>>>>>>>>>>>>> the
>> >>>>>>>>>>>>>>>>>>>>>>>> table
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> will
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> still be there even after the session
>> >> is
>> >>>>> gone.
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Great point about mixing the batch and
>> >>>> stream
>> >>>>>>>>>>>>> processing
>> >>>>>>>>>>>>>>>> in
>> >>>>>>>>>>>>>>>>>> the
>> >>>>>>>>>>>>>>>>>>>>>>>>> same
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> job.
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> We should absolutely move towards that
>> >>>> goal.
>> >>>>> I
>> >>>>>>>>>> imagine
>> >>>>>>>>>>>>>>>> that
>> >>>>>>>>>>>>>>>>>>>>>>> would
>> >>>>>>>>>>>>>>>>>>>>>>>>> be
>> >>>>>>>>>>>>>>>>>>>>>>>>>> a
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> huge
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> change across the board, including
>> >>> sources,
>> >>>>>>>> operators
>> >>>>>>>>>>>>> and
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> optimizations,
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> to
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> name some. Likely we will need several
>> >>>>> separate
>> >>>>>>>>>>>>> in-depth
>> >>>>>>>>>>>>>>>>>>>>>>>>> discussions.
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks,
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Jiangjie (Becket) Qin
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Fri, Nov 23, 2018 at 5:14 AM Xingcan
>> >>>> Cui <
>> >>>>>>>>>>>>>>>>>>>>>>> xingcanc@gmail.com>
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> wrote:
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi all,
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> @Shaoxuan, I think the lifecycle or
>> >>> access
>> >>>>>>> domain
>> >>>>>>>>>> are
>> >>>>>>>>>>>>>>>> both
>> >>>>>>>>>>>>>>>>>>>>>>>>>> orthogonal
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> to
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the cache problem. Essentially, this
>> >> may
>> >>>> be
>> >>>>>> the
>> >>>>>>>>>> first
>> >>>>>>>>>>>>>>>> time
>> >>>>>>>>>>>>>>>>>> we
>> >>>>>>>>>>>>>>>>>>>>>>>> plan
>> >>>>>>>>>>>>>>>>>>>>>>>>>> to
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> introduce another storage mechanism
>> >>> other
>> >>>>> than
>> >>>>>>> the
>> >>>>>>>>>>>>>>>> state.
>> >>>>>>>>>>>>>>>>>>>>>>> Maybe
>> >>>>>>>>>>>>>>>>>>>>>>>>> it’s
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> better
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> to first draw a big picture and then
>> >>>>>> concentrate
>> >>>>>>>> on
>> >>>>>>>>>> a
>> >>>>>>>>>>>>>>>>>> specific
>> >>>>>>>>>>>>>>>>>>>>>>>>> part?
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> @Becket, yes, actually I am more
>> >>> concerned
>> >>>>>> with
>> >>>>>>>> the
>> >>>>>>>>>>>>>>>>>> underlying
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> service.
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> This seems to be quite a major change
>> >> to
>> >>>> the
>> >>>>>>>>>> existing
>> >>>>>>>>>>>>>>>>>>>>>>> codebase.
>> >>>>>>>>>>>>>>>>>>>>>>>> As
>> >>>>>>>>>>>>>>>>>>>>>>>>>> you
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> claimed, the service should be extendible to
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> support other components, and we’d better
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> discuss it in another thread.
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> All in all, I am also eager to enjoy the
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> more interactive Table API, given a general
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> and flexible enough service mechanism.
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Best,
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Xingcan
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Nov 22, 2018, at 10:16 AM, Xiaowei
>> >>>>> Jiang <
>> >>>>>>>>>>>>>>>>>>>>>>>> xiaoweij@gmail.com>
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> wrote:
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Relying on a callback for the temp
>> >>> table
>> >>>>> for
>> >>>>>>>> clean
>> >>>>>>>>>> up
>> >>>>>>>>>>>>>>>> is
>> >>>>>>>>>>>>>>>>>> not
>> >>>>>>>>>>>>>>>>>>>>>>>> very
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> reliable.
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> There is no guarantee that it will be
>> >>>>>> executed
>> >>>>>>>>>>>>>>>>>> successfully.
>> >>>>>>>>>>>>>>>>>>>>>>> We
>> >>>>>>>>>>>>>>>>>>>>>>>>> may
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> risk
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> leaks when that happens. I think that
>> >>>> it's
>> >>>>>>> safer
>> >>>>>>>> to
>> >>>>>>>>>>>>>>>> have
>> >>>>>>>>>>>>>>>>> an
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> association
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> between temp table and session id. So
>> >>> we
>> >>>>> can
>> >>>>>>>> always
>> >>>>>>>>>>>>>>>> clean
>> >>>>>>>>>>>>>>>>>> up
>> >>>>>>>>>>>>>>>>>>>>>>>> temp
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> tables
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> which are no longer associated with
>> >> any
>> >>>>>> active
>> >>>>>>>>>>>>>>>> sessions.
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Regards,
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Xiaowei
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Thu, Nov 22, 2018 at 12:55 PM
>> >>> jincheng
>> >>>>>> sun <
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> sunjincheng121@gmail.com>
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> wrote:
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi Jiangjie&Shaoxuan,
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for initiating this great
>> >>>> proposal!
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Interactive Programming is very
>> >> useful
>> >>>> and
>> >>>>>>> user
>> >>>>>>>>>>>>>>>> friendly
>> >>>>>>>>>>>>>>>>>> in
>> >>>>>>>>>>>>>>>>>>>>>>>> case
>> >>>>>>>>>>>>>>>>>>>>>>>>>> of
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> your
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> examples.
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Moreover, especially when a business
>> >>> has
>> >>>>> to
>> >>>>>> be
>> >>>>>>>>>>>>>>>> executed
>> >>>>>>>>>>>>>>>>> in
>> >>>>>>>>>>>>>>>>>>>>>>>>> several
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> stages
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> with dependencies,such as the
>> >> pipeline
>> >>>> of
>> >>>>>>> Flink
>> >>>>>>>>>> ML,
>> >>>>>>>>>>>>> in
>> >>>>>>>>>>>>>>>>>> order
>> >>>>>>>>>>>>>>>>>>>>>>>> to
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> utilize
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> intermediate calculation results we
>> >>> have
>> >>>>> to
>> >>>>>>>>>> submit a
>> >>>>>>>>>>>>>>>> job
>> >>>>>>>>>>>>>>>>>> by
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> env.execute().
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> About the `cache()`  , I think is
>> >>> better
>> >>>>> to
>> >>>>>>>> named
>> >>>>>>>>>>>>>>>>>>>>>>> `persist()`,
>> >>>>>>>>>>>>>>>>>>>>>>>>> And
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> The
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Flink framework determines whether
>> >> we
>> >>>>>>> internally
>> >>>>>>>>>>>>> cache
>> >>>>>>>>>>>>>>>>> in
>> >>>>>>>>>>>>>>>>>>>>>>>> memory
>> >>>>>>>>>>>>>>>>>>>>>>>>>> or
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> persist
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> to the storage system,Maybe save the
>> >>>> data
>> >>>>>> into
>> >>>>>>>>>> state
>> >>>>>>>>>>>>>>>>>> backend
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> (MemoryStateBackend or
>> >>>> RocksDBStateBackend
>> >>>>>>> etc.)
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> BTW, from the points of my view in
>> >> the
>> >>>>>> future,
>> >>>>>>>>>>>>> support
>> >>>>>>>>>>>>>>>>> for
>> >>>>>>>>>>>>>>>>>>>>>>>>>> streaming
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> and
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> batch mode switching in the same job
>> >>>> will
>> >>>>>> also
>> >>>>>>>>>>>>> benefit
>> >>>>>>>>>>>>>>>>> in
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> "Interactive
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Programming",  I am looking forward
>> >> to
>> >>>>> your
>> >>>>>>>> JIRAs
>> >>>>>>>>>>>>> and
>> >>>>>>>>>>>>>>>>>> FLIP!
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Best,
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Jincheng

Re: [DISCUSS] Support Interactive Programming in Flink Table API

Posted by Becket Qin <be...@gmail.com>.
Hi Piotrek,

Thanks for the reply. Thinking about it again, I might have misunderstood
your proposal in the earlier emails. Returning a CachedTable might not be a
bad idea.

I was more concerned about the semantics and their intuitiveness when a
CachedTable is returned, i.e., if cache() returns a CachedTable, what are the
semantics of the following code:
{
  val cachedTable = a.cache()
  val b = cachedTable.select(...)
  val c = a.select(...)
}
What is the difference between b and c? At first glance, I see two options:

Semantic 1. b uses cachedTable because the user demanded so. c uses the
original DAG because the user demanded so. In this case, the optimizer has no
chance to optimize.
Semantic 2. b uses cachedTable because the user demanded so. c leaves the
optimizer to choose whether the cache or the DAG should be used. In this case,
users lose the option to NOT use the cache.

As you can see, neither of the options seems perfect. However, I guess you
and Till are proposing a third option:

Semantic 3. b leaves the optimizer to choose whether the cache or the DAG
should be used. c always uses the DAG.

This does address all the concerns. It is just that, from an intuitiveness
perspective, I find it a little weird to ask users to explicitly use a
CachedTable that the optimizer might then choose to ignore. That was why I did
not think of that semantic. But given that there is material benefit, I think
this semantic is acceptable.

1. If we want to let the optimiser make decisions whether to use cache or not,
> then why do we need a “void cache()” method at all? Would it “increase” the
> chance of using the cache? That sounds strange. What would be the
> mechanism of deciding whether to use the cache or not? If we want to
> introduce such kind of automated optimisations of “plan nodes deduplication”
> I would turn it on globally, not per table, and let the optimiser do all of
> the work.
> 2. We do not have statistics at the moment for any use/not use cache
> decision.
> 3. Even if we had, I would be veeerryy sceptical whether such cost based
> optimisations would work properly and I would still insist first on
> providing explicit caching mechanism (`CachedTable cache()`)
>
We are absolutely on the same page here. An explicit cache() method is
necessary not only because the optimizer may not be able to make the right
decision, but also because of the nature of interactive programming. For
example, if users write the following code in a Scala shell:
  val b = a.select(...)
  val c = b.select(...)
  val d = c.select(...).writeToSink(...)
  tEnv.execute()
There is no way the optimizer can know whether b or c will be used in later
code, unless users hint explicitly.
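
For instance, with an explicit hint the session above might continue like
this (just a sketch; cache() is the proposed API and when exactly the cache
gets materialized is still an open question):
  val b = a.select(...)
  b.cache()                            // hint: b will be reused later
  val c = b.select(...)
  val d = c.select(...).writeToSink(...)
  tEnv.execute()                       // runs the job; b gets materialized
  val e = b.filter(...)                // a later query can now be served
                                       // from the cached result of b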

At the same time I’m not sure if you have responded to our objections of
> `void cache()` being implicit/having side effects, which me, Jark, Fabian,
> Till and I think also Shaoxuan are supporting.

Are there any other side effects if we use Semantic 3 mentioned above?

Thanks,

Jiangjie (Becket) Qin


On Mon, Dec 10, 2018 at 7:54 PM Piotr Nowojski <pi...@data-artisans.com>
wrote:

> Hi Becket,
>
> Sorry for not responding long time.
>
> Regarding case1.
>
> There wouldn’t be an “a.unCache()” method, but I would expect only
> `cachedTableA1.dropCache()`. Dropping `cachedTableA1` wouldn’t affect
> `cachedTableA2`. Just as in any other database, dropping or modifying one
> independent table/materialised view does not affect others.
>
> > What I meant is that assuming there is already a cached table, ideally
> > users need not specify whether the next query should read from the cache
> > or use the original DAG. This should be decided by the optimizer.
>
> 1. If we want to let the optimiser make decisions whether to use cache or not,
> then why do we need a “void cache()” method at all? Would it “increase” the
> chance of using the cache? That sounds strange. What would be the
> mechanism of deciding whether to use the cache or not? If we want to
> introduce such kind of automated optimisations of “plan nodes deduplication”
> I would turn it on globally, not per table, and let the optimiser do all of
> the work.
> 2. We do not have statistics at the moment for any use/not use cache
> decision.
> 3. Even if we had, I would be veeerryy sceptical whether such cost based
> optimisations would work properly and I would still insist first on
> providing explicit caching mechanism (`CachedTable cache()`)
> 4. As Till wrote, having explicit `CachedTable cache()` doesn’t contradict
> future work on automated cost based caching.
>
>
> At the same time I’m not sure if you have responded to our objections of
> `void cache()` being implicit/having side effects, which me, Jark, Fabian,
> Till and I think also Shaoxuan are supporting.
>
> Piotrek
>
> > On 5 Dec 2018, at 12:42, Becket Qin <be...@gmail.com> wrote:
> >
> > Hi Till,
> >
> > It is true that after the first job submission, there will be no
> ambiguity
> > in terms of whether a cached table is used or not. That is the same for
> the
> > cache() without returning a CachedTable.
> >
> > Conceptually one could think of cache() as introducing a caching operator
> >> from which you need to consume if you want to benefit from the caching
> >> functionality.
> >
> > I am thinking a little differently. I think it is a hint (as you
> mentioned
> > later) instead of a new operator. I'd like to be careful about the
> semantic
> > of the API. A hint is a property set on an existing operator, but is not
> > itself an operator as it does not really manipulate the data.
> >
> > I agree, ideally the optimizer makes this kind of decision which
> >> intermediate result should be cached. But especially when executing
> ad-hoc
> >> queries the user might better know which results need to be cached
> because
> >> Flink might not see the full DAG. In that sense, I would consider the
> >> cache() method as a hint for the optimizer. Of course, in the future we
> >> might add functionality which tries to automatically cache results (e.g.
> >> caching the latest intermediate results until so and so much space is
> >> used). But this should hopefully not contradict with `CachedTable
> cache()`.
> >
> > I agree that cache() method is needed for exactly the reason you
> mentioned,
> > i.e. Flink cannot predict what users are going to write later, so users
> > need to tell Flink explicitly that this table will be used later. What I
> > meant is that assuming there is already a cached table, ideally users need
> > not specify whether the next query should read from the cache or use the
> > original DAG. This should be decided by the optimizer.
> >
> > To explain the difference between returning / not returning a
> CachedTable,
> > I want compare the following two case:
> >
> > *Case 1:  returning a CachedTable*
> > b = a.map(...)
> > val cachedTableA1 = a.cache()
> > val cachedTableA2 = a.cache()
> > b.print() // Just to make sure a is cached.
> >
> > c = a.filter(...) // User specify that the original DAG is used? Or the
> > optimizer decides whether DAG or cache should be used?
> > d = cachedTableA1.filter() // User specify that the cached table is used.
> >
> > a.unCache() // Can cachedTableA still be used afterwards?
> > cachedTableA1.uncache() // Can cachedTableA2 still be used?
> >
> > *Case 2: not returning a CachedTable*
> > b = a.map()
> > a.cache()
> > a.cache() // no-op
> > b.print() // Just to make sure a is cached
> >
> > c = a.filter(...) // Optimizer decides whether the cache or DAG should be
> > used
> > d = a.filter(...) // Optimizer decides whether the cache or DAG should be
> > used
> >
> > a.unCache()
> > a.unCache() // no-op
> >
> > In case 1, semantic wise, the optimizer loses the option to choose between
> > DAG and cache. And the unCache() call becomes tricky.
> > In case 2, users do not need to worry about whether cache or DAG is used.
> > And the unCache() semantic is clear. However, the caveat is that users
> > cannot explicitly ignore the cache.
> >
> > In order to address the issues mentioned in case 2 and inspired by the
> > discussion so far, I am thinking about using a hint to allow users to
> > explicitly ignore the cache. Although we do not have hints yet, we
> > probably should have one. So the code becomes:
> >
> > *Case 3: returning this table*
> > b = a.map()
> > a.cache()
> > a.cache() // no-op
> > b.print() // Just to make sure a is cached
> >
> > c = a.filter(...) // Optimizer decides whether the cache or DAG should be
> > used
> > d = a.hint("ignoreCache").filter(...) // DAG will be used instead of the
> > cache.
> >
> > a.unCache()
> > a.unCache() // no-op
> >
> > We could also let cache() return this table to allow chained method
> calls.
> > Do you think this API addresses the concerns?
> >
> > Thanks,
> >
> > Jiangjie (Becket) Qin
> >
> >
> > On Wed, Dec 5, 2018 at 10:55 AM Jark Wu <im...@gmail.com> wrote:
> >
> >> Hi,
> >>
> >> All the recent discussions are focused on whether there is a problem if
> >> cache() not return a Table.
> >> It seems that returning a Table explicitly is more clear (and safe?).
> >>
> >> So whether there are any problems if cache() returns a Table?  @Becket
> >>
> >> Best,
> >> Jark
> >>
> >> On Tue, 4 Dec 2018 at 22:27, Till Rohrmann <tr...@apache.org>
> wrote:
> >>
> >>> It's true that b, c, d and e will all read from the original DAG that
> >>> generates a. But all subsequent operators (when running multiple
> queries)
> >>> which reference cachedTableA should not need to reproduce `a` but
> >> directly
> >>> consume the intermediate result.
> >>>
> >>> Conceptually one could think of cache() as introducing a caching operator
> >>> from which you need to consume if you want to benefit from the caching
> >>> functionality.
> >>>
> >>> I agree, ideally the optimizer makes this kind of decision which
> >>> intermediate result should be cached. But especially when executing
> >> ad-hoc
> >>> queries the user might better know which results need to be cached
> >> because
> >>> Flink might not see the full DAG. In that sense, I would consider the
> >>> cache() method as a hint for the optimizer. Of course, in the future we
> >>> might add functionality which tries to automatically cache results
> (e.g.
> >>> caching the latest intermediate results until so and so much space is
> >>> used). But this should hopefully not contradict with `CachedTable
> >> cache()`.
> >>>
> >>> Cheers,
> >>> Till
> >>>
> >>> On Tue, Dec 4, 2018 at 2:33 PM Becket Qin <be...@gmail.com>
> wrote:
> >>>
> >>>> Hi Till,
> >>>>
> >>>> Thanks for the clarification. I am still a little confused.
> >>>>
> >>>> If cache() returns a CachedTable, the example might become:
> >>>>
> >>>> b = a.map(...)
> >>>> c = a.map(...)
> >>>>
> >>>> cachedTableA = a.cache()
> >>>> d = cachedTableA.map(...)
> >>>> e = a.map()
> >>>>
> >>>> In the above case, if cache() is lazily evaluated, b, c, d and e are
> >> all
> >>>> going to be reading from the original DAG that generates a. But with a
> >>>> naive expectation, d should be reading from the cache. This does not
> >>>> seem to solve the potential confusion you raised, right?
> >>>>
> >>>> Just to be clear, my understanding are all based on the assumption
> that
> >>> the
> >>>> tables are immutable. Therefore, after a.cache(), the *cachedTableA*
> >>>> and the original table *a* should be completely interchangeable.
> >>>>
> >>>> That said, I think a valid argument is optimization. There are indeed
> >>> cases
> >>>> that reading from the original DAG could be faster than reading from
> >> the
> >>>> cache. For example, in the following example:
> >>>>
> >>>> a.filter(f1' > 100)
> >>>> a.cache()
> >>>> b = a.filter(f1' < 100)
> >>>>
> >>>> Ideally the optimizer should be intelligent enough to decide which way
> >> is
> >>>> faster, without user intervention. In this case, it will identify that
> >> b
> >>>> would just be an empty table, thus skip reading from the cache
> >>> completely.
> >>>> But I agree that returning a CachedTable would give user the control
> of
> >>>> when to use cache, even though I still feel that letting the optimizer
> >>>> handle this is a better option in long run.
> >>>>
> >>>> Thanks,
> >>>>
> >>>> Jiangjie (Becket) Qin
> >>>>
> >>>>
> >>>>
> >>>>
> >>>> On Tue, Dec 4, 2018 at 6:51 PM Till Rohrmann <tr...@apache.org>
> >>> wrote:
> >>>>
> >>>>> Yes you are right Becket that it still depends on the actual
> >> execution
> >>> of
> >>>>> the job whether a consumer reads from a cached result or not.
> >>>>>
> >>>>> My point was actually about the properties of a (cached vs.
> >> non-cached)
> >>>> and
> >>>>> not about the execution. I would not make cache trigger the execution
> >>> of
> >>>>> the job because one loses some flexibility by eagerly triggering the
> >>>>> execution.
> >>>>>
> >>>>> I tried to argue for an explicit CachedTable which is returned by the
> >>>>> cache() method like Piotr did in order to make the API more explicit.
> >>>>>
> >>>>> Cheers,
> >>>>> Till
> >>>>>
> >>>>> On Mon, Dec 3, 2018 at 4:23 PM Becket Qin <be...@gmail.com>
> >>> wrote:
> >>>>>
> >>>>>> Hi Till,
> >>>>>>
> >>>>>> That is a good example. Just a minor correction, in this case, b, c
> >>>> and d
> >>>>>> will all consume from a non-cached a. This is because cache will
> >> only
> >>>> be
> >>>>>> created on the very first job submission that generates the table
> >> to
> >>> be
> >>>>>> cached.
> >>>>>>
> >>>>>> If I understand correctly, this example is about whether the .cache()
> >>>>>> method should be eagerly evaluated or lazily evaluated. In other words,
> >>>>>> if the cache() method actually triggers a job that creates the cache,
> >>>>>> there will be no such confusion. Is that right?
> >>>>>>
> >>>>>> In the example, although d will not consume from the cached Table
> >>> while
> >>>>> it
> >>>>>> looks supposed to, from correctness perspective the code will still
> >>>>> return
> >>>>>> correct result, assuming that tables are immutable.
> >>>>>>
> >>>>>> Personally I feel it is OK because users probably won't really
> >> worry
> >>>>> about
> >>>>>> whether the table is cached or not. And lazy cache could avoid some
> >>>>>> unnecessary caching if a cached table is never created in the user
> >>>>>> application. But I am not opposed to do eager evaluation of cache.
> >>>>>>
> >>>>>> Thanks,
> >>>>>>
> >>>>>> Jiangjie (Becket) Qin
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> On Mon, Dec 3, 2018 at 10:01 PM Till Rohrmann <
> >> trohrmann@apache.org>
> >>>>>> wrote:
> >>>>>>
> >>>>>>> Another argument for Piotr's point is that lazily changing properties
> >>>>>>> of a node affects all downstream consumers but does not necessarily
> >>>>>>> have to happen before these consumers are defined. From a user's
> >>>>>>> perspective this can be quite confusing:
> >>>>>>>
> >>>>>>> b = a.map(...)
> >>>>>>> c = a.map(...)
> >>>>>>>
> >>>>>>> a.cache()
> >>>>>>> d = a.map(...)
> >>>>>>>
> >>>>>>> now b, c and d will consume from a cached operator. In this case, the
> >>>>>>> user would most likely expect that only d reads from a cached result.
> >>>>>>>
> >>>>>>> Cheers,
> >>>>>>> Till
> >>>>>>>
> >>>>>>> On Mon, Dec 3, 2018 at 11:32 AM Piotr Nowojski <
> >>>>> piotr@data-artisans.com>
> >>>>>>> wrote:
> >>>>>>>
> >>>>>>>> Hey Shaoxuan and Becket,
> >>>>>>>>
> >>>>>>>>> Can you explain a bit more on what the side effects are? So
> >>> far
> >>>> my
> >>>>>>>>> understanding is that such side effects only exist if a table
> >>> is
> >>>>>>> mutable.
> >>>>>>>>> Is that the case?
> >>>>>>>>
> >>>>>>>> Not only that. There are also performance implications and
> >> those
> >>>> are
> >>>>>>>> another implicit side effects of using `void cache()`. As I
> >> wrote
> >>>>>> before,
> >>>>>>>> reading from cache might not always be desirable, thus it can
> >>> cause
> >>>>>>>> performance degradation and I’m fine with that - user's or
> >>>>> optimiser’s
> >>>>>>>> choice. What I do not like is that this implicit side effect
> >> can
> >>>>>> manifest
> >>>>>>>> in completely different part of code, that wasn’t touched by a
> >>> user
> >>>>>> while
> >>>>>>>> he was adding `void cache()` call somewhere else. And even if
> >>>> caching
> >>>>>>>> improves performance, it’s still a side effect of `void
> >> cache()`.
> >>>>>> Almost
> >>>>>>>> from the definition `void` methods have only side effects. As I
> >>>> wrote
> >>>>>>>> before, there are couple of scenarios where this might be
> >>>> undesirable
> >>>>>>>> and/or unexpected, for example:
> >>>>>>>>
> >>>>>>>> 1.
> >>>>>>>> Table b = …;
> >>>>>>>> b.cache()
> >>>>>>>> x = b.join(…)
> >>>>>>>> y = b.count()
> >>>>>>>> // ...
> >>>>>>>> // 100
> >>>>>>>> // hundred
> >>>>>>>> // lines
> >>>>>>>> // of
> >>>>>>>> // code
> >>>>>>>> // later
> >>>>>>>> z = b.filter(…).groupBy(…) // this might be even hidden in a
> >>>>> different
> >>>>>>>> method/file/package/dependency
> >>>>>>>>
> >>>>>>>> 2.
> >>>>>>>>
> >>>>>>>> Table b = ...
> >>>>>>>> If (some_condition) {
> >>>>>>>>  foo(b)
> >>>>>>>> }
> >>>>>>>> Else {
> >>>>>>>>  bar(b)
> >>>>>>>> }
> >>>>>>>> z = b.filter(…).groupBy(…)
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> Void foo(Table b) {
> >>>>>>>>  b.cache()
> >>>>>>>>  // do something with b
> >>>>>>>> }
> >>>>>>>>
> >>>>>>>> In both above examples, `b.cache()` will implicitly affect
> >>>> (semantic
> >>>>>> of a
> >>>>>>>> program in case of sources being mutable and performance) `z =
> >>>>>>>> b.filter(…).groupBy(…)` which might be far from obvious.
> >>>>>>>>
> >>>>>>>> On top of that, there is still this argument of mine that
> >> having
> >>> a
> >>>>>>>> `MaterializedTable` or `CachedTable` handle is more flexible
> >> for
> >>> us
> >>>>> for
> >>>>>>> the
> >>>>>>>> future and for the user (as a manual option to bypass cache
> >>> reads).
> >>>>>>>>
> >>>>>>>>> But Jiangjie is correct,
> >>>>>>>>> the source table in batching should be immutable. It is the
> >>>> user’s
> >>>>>>>>> responsibility to ensure it, otherwise even a regular
> >> failover
> >>>> may
> >>>>>> lead
> >>>>>>>>> to inconsistent results.
> >>>>>>>>
> >>>>>>>> Yes, I agree that’s what perfect world/good deployment should
> >> be.
> >>>> But
> >>>>>> its
> >>>>>>>> often isn’t and while I’m not trying to fix this (since the
> >>> proper
> >>>>> fix
> >>>>>> is
> >>>>>>>> to support transactions), I’m just trying to minimise confusion
> >>> for
> >>>>> the
> >>>>>>>> users that are not fully aware what’s going on and operate in
> >>> less
> >>>>> then
> >>>>>>>> perfect setup. And if something bites them after adding
> >>> `b.cache()`
> >>>>>> call,
> >>>>>>>> to make sure that they at least know all of the places that
> >>> adding
> >>>>> this
> >>>>>>>> line can affect.
> >>>>>>>>
> >>>>>>>> Thanks, Piotrek
> >>>>>>>>
> >>>>>>>>> On 1 Dec 2018, at 15:39, Becket Qin <be...@gmail.com>
> >>>> wrote:
> >>>>>>>>>
> >>>>>>>>> Hi Piotrek,
> >>>>>>>>>
> >>>>>>>>> Thanks again for the clarification. Some more replies are
> >>>>> following.
> >>>>>>>>>
> >>>>>>>>> But keep in mind that `.cache()` will/might not only be used
> >> in
> >>>>>>>> interactive
> >>>>>>>>>> programming and not only in batching.
> >>>>>>>>>
> >>>>>>>>> It is true. Actually in stream processing, cache() has the
> >> same
> >>>>>>> semantic
> >>>>>>>> as
> >>>>>>>>> batch processing. The semantic is following:
> >>>>>>>>> For a table created via a series of computation, save that
> >>> table
> >>>>> for
> >>>>>>>> later
> >>>>>>>>> reference to avoid running the computation logic to
> >> regenerate
> >>>> the
> >>>>>>> table.
> >>>>>>>>> Once the application exits, drop all the cache.
> >>>>>>>>> This semantic is same for both batch and stream processing.
> >> The
> >>>>>>>> difference
> >>>>>>>>> is that stream applications will only run once as they are
> >> long
> >>>>>>> running.
> >>>>>>>>> And the batch applications may be run multiple times, hence
> >> the
> >>>>> cache
> >>>>>>> may
> >>>>>>>>> be created and dropped each time the application runs.
> >>>>>>>>> Admittedly, there will probably be some resource management
> >>>>>>> requirements
> >>>>>>>>> for the streaming cached table, such as time based / size
> >> based
> >>>>>>>> retention,
> >>>>>>>>> to address the infinite data issue. But such requirement does
> >>> not
> >>>>>>> change
> >>>>>>>>> the semantic.
> >>>>>>>>> You are right that interactive programming is just one use
> >> case
> >>>> of
> >>>>>>>> cache().
> >>>>>>>>> It is not the only use case.
> >>>>>>>>>
> >>>>>>>>> For me the more important issue is of not having the `void
> >>>> cache()`
> >>>>>>> with
> >>>>>>>>>> side effects.
> >>>>>>>>>
> >>>>>>>>> This is indeed the key point. The argument around whether
> >>> cache()
> >>>>>>> should
> >>>>>>>>> return something already indicates that cache() and
> >>> materialize()
> >>>>>>> address
> >>>>>>>>> different issues.
> >>>>>>>>> Can you explain a bit more on what the side effects are? So
> >>> far
> >>>> my
> >>>>>>>>> understanding is that such side effects only exist if a table
> >>> is
> >>>>>>> mutable.
> >>>>>>>>> Is that the case?
> >>>>>>>>>
> >>>>>>>>> I don’t know, probably initially we should make CachedTable
> >>>>>> read-only.
> >>>>>>> I
> >>>>>>>>>> don’t find it more confusing than the fact that user can not
> >>>> write
> >>>>>> to
> >>>>>>>> views
> >>>>>>>>>> or materialised views in SQL or that user currently can not
> >>>> write
> >>>>>> to a
> >>>>>>>>>> Table.
> >>>>>>>>>
> >>>>>>>>> I don't think anyone should insert something to a cache. By
> >>>>>> definition
> >>>>>>>> the
> >>>>>>>>> cache should only be updated when the corresponding original
> >>>> table
> >>>>> is
> >>>>>>>>> updated. What I am wondering is that given the following two
> >>>> facts:
> >>>>>>>>> 1. If and only if a table is mutable (with something like
> >>>>> insert()),
> >>>>>> a
> >>>>>>>>> CachedTable may have implicit behavior.
> >>>>>>>>> 2. A CachedTable extends a Table.
> >>>>>>>>> We can come to the conclusion that a CachedTable is mutable
> >> and
> >>>>> users
> >>>>>>> can
> >>>>>>>>> insert into the CachedTable directly. This is where I thought
> >>>>>>> confusing.
> >>>>>>>>>
> >>>>>>>>> Thanks,
> >>>>>>>>>
> >>>>>>>>> Jiangjie (Becket) Qin
> >>>>>>>>>
> >>>>>>>>> On Sat, Dec 1, 2018 at 2:45 AM Piotr Nowojski <
> >>>>>> piotr@data-artisans.com
> >>>>>>>>
> >>>>>>>>> wrote:
> >>>>>>>>>
> >>>>>>>>>> Hi all,
> >>>>>>>>>>
> >>>>>>>>>> Regarding naming `cache()` vs `materialize()`. One more
> >>>>> explanation
> >>>>>>> why
> >>>>>>>> I
> >>>>>>>>>> think `materialize()` is more natural to me is that I think
> >> of
> >>>> all
> >>>>>>>> “Table”s
> >>>>>>>>>> in Table-API as views. They behave the same way as SQL
> >> views,
> >>>> the
> >>>>>> only
> >>>>>>>>>> difference for me is that their live scope is short -
> >> current
> >>>>>> session
> >>>>>>>> which
> >>>>>>>>>> is limited by different execution model. That’s why
> >> “caching”
> >>> a
> >>>>> view
> >>>>>>>> for me
> >>>>>>>>>> is just materialising it.
> >>>>>>>>>>
> >>>>>>>>>> However I see and I understand your point of view. Coming
> >> from
> >>>>>>>>>> DataSet/DataStream and generally speaking non-SQL world,
> >>>> `cache()`
> >>>>>> is
> >>>>>>>> more
> >>>>>>>>>> natural. But keep in mind that `.cache()` will/might not
> >> only
> >>> be
> >>>>>> used
> >>>>>>> in
> >>>>>>>>>> interactive programming and not only in batching. But naming
> >>> is
> >>>>> one
> >>>>>>>> issue,
> >>>>>>>>>> and not that critical to me. Especially that once we
> >> implement
> >>>>>> proper
> >>>>>>>>>> materialised views, we can always deprecate/rename `cache()`
> >>> if
> >>>> we
> >>>>>>> deem
> >>>>>>>> so.
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> For me the more important issue is of not having the `void
> >>>>> cache()`
> >>>>>>> with
> >>>>>>>>>> side effects. Exactly for the reasons that you have
> >> mentioned.
> >>>>> True:
> >>>>>>>>>> results might be non deterministic if underlying source
> >> table
> >>>> are
> >>>>>>>> changing.
> >>>>>>>>>> Problem is that `void cache()` implicitly changes the
> >> semantic
> >>>> of
> >>>>>>>>>> subsequent uses of the cached/materialized Table. It can
> >> cause
> >>>>> “wtf”
> >>>>>>>> moment
> >>>>>>>>>> for a user if he inserts “b.cache()” call in some place in
> >> his
> >>>>> code
> >>>>>>> and
> >>>>>>>>>> suddenly some other random places are behaving differently.
> >> If
> >>>>>>>>>> `materialize()` or `cache()` returns a Table handle, we
> >> force
> >>>> user
> >>>>>> to
> >>>>>>>>>> explicitly use the cache which removes the “random” part
> >> from
> >>>> the
> >>>>>>>> "suddenly
> >>>>>>>>>> some other random places are behaving differently”.
> >>>>>>>>>>
> >>>>>>>>>> This argument and others that I’ve raised (greater
> >>>>>>> flexibility/allowing
> >>>>>>>>>> user to explicitly bypass the cache) are independent of
> >>>> `cache()`
> >>>>> vs
> >>>>>>>>>> `materialize()` discussion.
> >>>>>>>>>>
> >>>>>>>>>>> Does that mean one can also insert into the CachedTable?
> >> This
> >>>>>> sounds
> >>>>>>>>>> pretty confusing.
> >>>>>>>>>>
> >>>>>>>>>> I don’t know, probably initially we should make CachedTable
> >>>>>>> read-only. I
> >>>>>>>>>> don’t find it more confusing than the fact that user can not
> >>>> write
> >>>>>> to
> >>>>>>>> views
> >>>>>>>>>> or materialised views in SQL or that user currently can not
> >>>> write
> >>>>>> to a
> >>>>>>>>>> Table.
> >>>>>>>>>>
> >>>>>>>>>> Piotrek
> >>>>>>>>>>
> >>>>>>>>>>> On 30 Nov 2018, at 17:38, Xingcan Cui <xi...@gmail.com>
> >>>>> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>> Hi all,
> >>>>>>>>>>>
> >>>>>>>>>>> I agree with @Becket that `cache()` and `materialize()`
> >>> should
> >>>> be
> >>>>>>>>>> considered as two different methods where the later one is
> >>> more
> >>>>>>>>>> sophisticated.
> >>>>>>>>>>>
> >>>>>>>>>>> According to my understanding, the initial idea is just to
> >>>>>> introduce
> >>>>>>> a
> >>>>>>>>>> simple cache or persist mechanism, but as the TableAPI is a
> >>>>>> high-level
> >>>>>>>> API,
> >>>>>>>>>> it’s naturally for as to think in a SQL way.
> >>>>>>>>>>>
> >>>>>>>>>>> Maybe we can add the `cache()` method to the DataSet API
> >> and
> >>>>> force
> >>>>>>>> users
> >>>>>>>>>> to translate a Table to a Dataset before caching it. Then
> >> the
> >>>>> users
> >>>>>>>> should
> >>>>>>>>>> manually register the cached dataset to a table again (we
> >> may
> >>>> need
> >>>>>>> some
> >>>>>>>>>> table replacement mechanisms for datasets with an identical
> >>>> schema
> >>>>>> but
> >>>>>>>>>> different contents here). After all, it’s the dataset rather
> >>>> than
> >>>>>> the
> >>>>>>>>>> dynamic table that need to be cached, right?
> >>>>>>>>>>>
> >>>>>>>>>>> Best,
> >>>>>>>>>>> Xingcan
> >>>>>>>>>>>
> >>>>>>>>>>>> On Nov 30, 2018, at 10:57 AM, Becket Qin <
> >>>> becket.qin@gmail.com>
> >>>>>>>> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>> Hi Piotrek and Jark,
> >>>>>>>>>>>>
> >>>>>>>>>>>> Thanks for the feedback and explanation. Those are good
> >>>>> arguments.
> >>>>>>>> But I
> >>>>>>>>>>>> think those arguments are mostly about materialized view.
> >>> Let
> >>>> me
> >>>>>> try
> >>>>>>>> to
> >>>>>>>>>>>> explain the reason I believe cache() and materialize() are
> >>>>>>> different.
> >>>>>>>>>>>>
> >>>>>>>>>>>> I think cache() and materialize() have quite different
> >>>>>> implications.
> >>>>>>>> An
> >>>>>>>>>>>> analogy I can think of is save()/publish(). When users
> >> call
> >>>>>> cache(),
> >>>>>>>> it
> >>>>>>>>>> is
> >>>>>>>>>>>> just like they are saving an intermediate result as a
> >> draft
> >>> of
> >>>>>> their
> >>>>>>>>>> work,
> >>>>>>>>>>>> this intermediate result may not have any realistic
> >> meaning.
> >>>>>> Calling
> >>>>>>>>>>>> cache() does not mean users want to publish the cached
> >> table
> >>>> in
> >>>>>> any
> >>>>>>>>>> manner.
> >>>>>>>>>>>> But when users call materialize(), that means "I have
> >>>> something
> >>>>>>>>>> meaningful
> >>>>>>>>>>>> to be reused by others", now users need to think about the
> >>>>>>> validation,
> >>>>>>>>>>>> update & versioning, lifecycle of the result, etc.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Piotrek's suggestions on variations of the materialize()
> >>>> methods
> >>>>>> are
> >>>>>>>>>> very
> >>>>>>>>>>>> useful. It would be great if Flink have them. The concept
> >> of
> >>>>>>>>>> materialized
> >>>>>>>>>>>> view is actually a pretty big feature, not to say the
> >>> related
> >>>>>> stuff
> >>>>>>>> like
> >>>>>>>>>>>> triggers/hooks you mentioned earlier. I think the
> >>> materialized
> >>>>>> view
> >>>>>>>>>> itself
> >>>>>>>>>>>> should be discussed in a more thorough and systematic
> >>> manner.
> >>>>> And
> >>>>>> I
> >>>>>>>>>> found
> >>>>>>>>>>>> that discussion is kind of orthogonal and way beyond
> >>>> interactive
> >>>>>>>>>>>> programming experience.
> >>>>>>>>>>>>
> >>>>>>>>>>>> The example you gave was interesting. I still have some
> >>>>> questions,
> >>>>>>>>>> though.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Table source = … // some source that scans files from a
> >>>>> directory
> >>>>>>>>>>>>> “/foo/bar/“
> >>>>>>>>>>>>> Table t1 = source.groupBy(…).select(…).where(…) ….;
> >>>>>>>>>>>>> Table t2 = t1.materialize() // (or `cache()`)
> >>>>>>>>>>>>
> >>>>>>>>>>>> t2.count() // initialise cache (if it’s lazily
> >> initialised)
> >>>>>>>>>>>>> int a1 = t1.count()
> >>>>>>>>>>>>> int b1 = t2.count()
> >>>>>>>>>>>>> // something in the background (or we trigger it) writes
> >>> new
> >>>>>> files
> >>>>>>> to
> >>>>>>>>>>>>> /foo/bar
> >>>>>>>>>>>>> int a2 = t1.count()
> >>>>>>>>>>>>> int b2 = t2.count()
> >>>>>>>>>>>>> t2.refresh() // possible future extension, not to be
> >>>>> implemented
> >>>>>> in
> >>>>>>>> the
> >>>>>>>>>>>>> initial version
> >>>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> what if someone else added some more files to /foo/bar at this
> >>>>>>>>>>>> point? In that case, a3 won't equal b3, and the result becomes
> >>>>>>>>>>>> non-deterministic, right?
> >>>>>>>>>>>>
> >>>>>>>>>>>> int a3 = t1.count()
> >>>>>>>>>>>>> int b3 = t2.count()
> >>>>>>>>>>>>> t2.drop() // another possible future extension, manual
> >>>> “cache”
> >>>>>>>> dropping
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> When we talk about interactive programming, in most cases,
> >>> we
> >>>>> are
> >>>>>>>>>> talking
> >>>>>>>>>>>> about batch applications. A fundamental assumption of such
> >>>> case
> >>>>> is
> >>>>>>>> that
> >>>>>>>>>> the
> >>>>>>>>>>>> source data is complete before the data processing begins,
> >>> and
> >>>>> the
> >>>>>>>> data
> >>>>>>>>>>>> will not change during the data processing. IMO, if additional
> >>>>>>>>>>>> rows need to be added to some source during the processing, it
> >>>>>>>>>>>> should be done in ways like unioning the source with another
> >>>>>>>>>>>> table containing the rows to be added.
> >>>>>>>>>>>>
> >>>>>>>>>>>> There are a few cases that computations are executed
> >>>> repeatedly
> >>>>> on
> >>>>>>> the
> >>>>>>>>>>>> changing data source.
> >>>>>>>>>>>>
> >>>>>>>>>>>> For example, people may run a ML training job every hour
> >>> with
> >>>>> the
> >>>>>>>>>> samples
> >>>>>>>>>>>> newly added in the past hour. In that case, the source data
> >>>>>>>>>>>> between runs will indeed change. But still, the data remains
> >>>>>>>>>>>> unchanged within one run. And usually in that case, the result
> >>>>>>>>>>>> will need versioning, i.e. a given result indicates that it was
> >>>>>>>>>>>> computed from the source data as of a certain timestamp.
> >>>>>>>>>>>> certain timestamp.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Another example is something like data warehouse. In this
> >>>> case,
> >>>>>>> there
> >>>>>>>>>> are a
> >>>>>>>>>>>> few source of original/raw data. On top of those sources,
> >>> many
> >>>>>>>>>> materialized
> >>>>>>>>>>>> view / queries / reports / dashboards can be created to
> >>>> generate
> >>>>>>>> derived
> >>>>>>>>>>>> data. Those derived data needs to be updated when the
> >>>> underlying
> >>>>>>>>>> original
> >>>>>>>>>>>> data changes. In that case, the processing logic that
> >>> derives
> >>>>> the
> >>>>>>>>>> original
> >>>>>>>>>>>> data needs to be executed repeatedly to update those
> >>>>>> reports/views.
> >>>>>>>>>> Again,
> >>>>>>>>>>>> all those derived data also need to have version
> >> management,
> >>>>> such
> >>>>>> as
> >>>>>>>>>>>> timestamp.
> >>>>>>>>>>>>
> >>>>>>>>>>>> In any of the above two cases, during a single run of the
> >>>>>> processing
> >>>>>>>>>> logic,
> >>>>>>>>>>>> the data cannot change. Otherwise the behavior of the
> >>>> processing
> >>>>>>> logic
> >>>>>>>>>> may
> >>>>>>>>>>>> be undefined. In the above two examples, when writing the
> >>>>>> processing
> >>>>>>>>>> logic,
> >>>>>>>>>>>> Users can use .cache() to hint Flink that those results
> >>> should
> >>>>> be
> >>>>>>>> saved
> >>>>>>>>>> to
> >>>>>>>>>>>> avoid repeated computation. And then for the result of my
> >>>>>>> application
> >>>>>>>>>>>> logic, I'll call materialize(), so that these results
> >> could
> >>> be
> >>>>>>> managed
> >>>>>>>>>> by
> >>>>>>>>>>>> the system with versioning, metadata management, lifecycle
> >>>>>>> management,
> >>>>>>>>>>>> ACLs, etc.
> >>>>>>>>>>>>
> >>>>>>>>>>>> It is true we can use materialize() to do the cache() job,
> >>>> but I
> >>>>>> am
> >>>>>>>>>> really
> >>>>>>>>>>>> reluctant to shoehorn cache() into materialize() and force
> >>>> users
> >>>>>> to
> >>>>>>>>>> worry
> >>>>>>>>>>>> about a bunch of implications that they needn't have to. I
> >>> am
> >>>>>>>>>> absolutely on
> >>>>>>>>>>>> your side that redundant API is bad. But it is equally
> >>>>>> frustrating,
> >>>>>>> if
> >>>>>>>>>> not
> >>>>>>>>>>>> more, that the same API does different things.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>
> >>>>>>>>>>>> Jiangjie (Becket) Qin
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Fri, Nov 30, 2018 at 10:34 PM Shaoxuan Wang <
> >>>>>> wshaoxuan@gmail.com
> >>>>>>>>
> >>>>>>>>>> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>>> Thanks Piotrek,
> >>>>>>>>>>>>> You provided a very good example, it explains all the
> >>>>> confusions
> >>>>>> I
> >>>>>>>>>> have.
> >>>>>>>>>>>>> It is clear that there is something we have not
> >> considered
> >>> in
> >>>>> the
> >>>>>>>>>> initial
> >>>>>>>>>>>>> proposal. We intend to force the user to reuse the
> >>>>>>>>>>>>> cached/materialized table, if its cache() method is executed.
> >>>>>>>>>>>>> We did not expect that a user may want to re-execute the plan
> >>>>>>>>>>>>> from the source table. Let me re-think about it and get back
> >>>>>>>>>>>>> to you later.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> In the meanwhile, this example/observation also implies that
> >>>>>>>>>>>>> we cannot fully involve the optimizer to decide the plan if a
> >>>>>>>>>>>>> cache/materialize is explicitly used, because whether to reuse
> >>>>>>>>>>>>> the cached data or re-execute the query from source data may
> >>>>>>>>>>>>> lead to different results. (But I guess the optimizer can
> >>>>>>>>>>>>> still help in some cases ---- as long as it does not
> >>>>>>>>>>>>> re-execute from the varied source, we should be safe).
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Regards,
> >>>>>>>>>>>>> Shaoxuan
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> On Fri, Nov 30, 2018 at 9:13 PM Piotr Nowojski <
> >>>>>>>>>> piotr@data-artisans.com>
> >>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> Hi Shaoxuan,
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Re 2:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Table t3 = methodThatAppliesOperators(t1) // t1 is
> >>> modified
> >>>>>> to->
> >>>>>>>> t1’
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> What do you mean that “ t1 is modified to-> t1’ ” ? That
> >>>>>>>>>>>>>> `methodThatAppliesOperators()` method has changed it’s
> >>> plan?
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> I was thinking more about something like this:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Table source = … // some source that scans files from a
> >>>>>> directory
> >>>>>>>>>>>>>> “/foo/bar/“
> >>>>>>>>>>>>>> Table t1 = source.groupBy(…).select(…).where(…) ….;
> >>>>>>>>>>>>>> Table t2 = t1.materialize() // (or `cache()`)
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> t2.count() // initialise cache (if it’s lazily
> >>> initialised)
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> int a1 = t1.count()
> >>>>>>>>>>>>>> int b1 = t2.count()
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> // something in the background (or we trigger it) writes
> >>> new
> >>>>>> files
> >>>>>>>> to
> >>>>>>>>>>>>>> /foo/bar
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> int a2 = t1.count()
> >>>>>>>>>>>>>> int b2 = t2.count()
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> t2.refresh() // possible future extension, not to be
> >>>>> implemented
> >>>>>>> in
> >>>>>>>>>> the
> >>>>>>>>>>>>>> initial version
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> int a3 = t1.count()
> >>>>>>>>>>>>>> int b3 = t2.count()
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> t2.drop() // another possible future extension, manual
> >>>> “cache”
> >>>>>>>>>> dropping
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> assertTrue(a1 == b1) // same results, but b1 comes from
> >>> the
> >>>>>>> “cache"
> >>>>>>>>>>>>>> assertTrue(b1 == b2) // both values come from the same
> >>> cache
> >>>>>>>>>>>>>> assertTrue(a2 > b2) // b2 comes from cache, a2
> >> re-executed
> >>>>> full
> >>>>>>>> table
> >>>>>>>>>>>>> scan
> >>>>>>>>>>>>>> and has more data
> >>>>>>>>>>>>>> assertTrue(b3 > b2) // b3 comes from refreshed cache
> >>>>>>>>>>>>>> assertTrue(b3 == a2 == a3)
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Piotrek
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> On 30 Nov 2018, at 10:22, Jark Wu <im...@gmail.com>
> >>>> wrote:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Hi,
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> It is an very interesting and useful design!
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Here I want to share some of my thoughts:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> 1. Agree with that cache() method should return some
> >>> Table
> >>>> to
> >>>>>>> avoid
> >>>>>>>>>>>>> some
> >>>>>>>>>>>>>>> unexpected problems because of the mutable object.
> >>>>>>>>>>>>>>> All the existing methods of Table are returning a new
> >>> Table
> >>>>>>>> instance.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> 2. I think materialize() would be more consistent with
> >>> SQL,
> >>>>>> this
> >>>>>>>>>> makes
> >>>>>>>>>>>>> it
> >>>>>>>>>>>>>>> possible to support the same feature for SQL
> >> (materialize
> >>>>> view)
> >>>>>>> and
> >>>>>>>>>>>>> keep
> >>>>>>>>>>>>>>> the same API for users in the future.
> >>>>>>>>>>>>>>> But I'm also fine if we choose cache().
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> 3. In the proposal, a TableService (or FlinkService?)
> >> is
> >>>> used
> >>>>>> to
> >>>>>>>>>> cache
> >>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>> result of the (intermediate) table.
> >>>>>>>>>>>>>>> But the name of TableService may be a bit general and might
> >>>>>>>>>>>>>>> not be understood correctly at first glance (a metastore for
> >>>>>>>>>>>>>>> tables?).
> >>>>>>>>>>>>>>> Maybe a more specific name would be better, such as
> >>>>>>>>>>>>>>> TableCacheService or TableMaterializeService or something
> >>>>>>>>>>>>>>> else.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Best,
> >>>>>>>>>>>>>>> Jark
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> On Thu, 29 Nov 2018 at 21:16, Fabian Hueske <
> >>>>> fhueske@gmail.com
> >>>>>>>
> >>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Hi,
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Thanks for the clarification Becket!
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> I have a few thoughts to share / questions:
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> 1) I'd like to know how you plan to implement the
> >>> feature
> >>>>> on a
> >>>>>>>> plan
> >>>>>>>>>> /
> >>>>>>>>>>>>>>>> planner level.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> I would imaging the following to happen when
> >>> Table.cache()
> >>>>> is
> >>>>>>>>>> called:
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> 1) immediately optimize the Table and internally
> >> convert
> >>>> it
> >>>>>>> into a
> >>>>>>>>>>>>>>>> DataSet/DataStream. This is necessary, to avoid that
> >>>>> operators
> >>>>>>> of
> >>>>>>>>>>>>> later
> >>>>>>>>>>>>>>>> queries on top of the Table are pushed down.
> >>>>>>>>>>>>>>>> 2) register the DataSet/DataStream as a
> >>>>>>> DataSet/DataStream-backed
> >>>>>>>>>>>>> Table
> >>>>>>>>>>>>>> X
> >>>>>>>>>>>>>>>> 3) add a sink to the DataSet/DataStream. This is the
> >>>>>>>> materialization
> >>>>>>>>>>>>> of
> >>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>> Table X
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Based on your proposal the following would happen:
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Table t1 = ....
> >>>>>>>>>>>>>>>> t1.cache(); // cache() returns void. The logical plan
> >> of
> >>>> t1
> >>>>> is
> >>>>>>>>>>>>> replaced
> >>>>>>>>>>>>>> by
> >>>>>>>>>>>>>>>> a scan of X. There is also a reference to the
> >>>>> materialization
> >>>>>> of
> >>>>>>>> X.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> t1.count(); // this executes the program, including
> >> the
> >>>>>>>>>>>>>> DataSet/DataStream
> >>>>>>>>>>>>>>>> that backs X and the sink that writes the
> >>> materialization
> >>>>> of X
> >>>>>>>>>>>>>>>> t1.count(); // this executes the program, but reads X
> >>> from
> >>>>> the
> >>>>>>>>>>>>>>>> materialization.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> My question is, how do you determine when the scan of t1
> >>>>>>>>>>>>>>>> should go against the DataSet/DataStream program and when
> >>>>>>>>>>>>>>>> against the materialization?
> >>>>>>>>>>>>>>>> AFAIK, there is no hook that will tell you that a part
> >>> of
> >>>>> the
> >>>>>>>>>> program
> >>>>>>>>>>>>>> was
> >>>>>>>>>>>>>>>> executed. Flipping a switch during optimization or
> >> plan
> >>>>>>> generation
> >>>>>>>>>> is
> >>>>>>>>>>>>>> not
> >>>>>>>>>>>>>>>> sufficient as there is no guarantee that the plan is
> >>> also
> >>>>>>>> executed.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Overall, this behavior is somewhat similar to what I
> >>>>> proposed
> >>>>>> in
> >>>>>>>>>>>>>>>> FLINK-8950, which does not include persisting the
> >> table,
> >>>> but
> >>>>>>> just
> >>>>>>>>>>>>>>>> optimizing and reregistering it as DataSet/DataStream
> >>>> scan.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> 2) I think Piotr has a point about the implicit
> >> behavior
> >>>> and
> >>>>>>> side
> >>>>>>>>>>>>>> effects
> >>>>>>>>>>>>>>>> of the cache() method if it does not return anything.
> >>>>>>>>>>>>>>>> Consider the following example:
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Table t1 = ???
> >>>>>>>>>>>>>>>> Table t2 = methodThatAppliesOperators(t1);
> >>>>>>>>>>>>>>>> Table t3 = methodThatAppliesOtherOperators(t1);
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> In this case, the behavior/performance of the plan
> >> that
> >>>>>> results
> >>>>>>>> from
> >>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>> second method call depends on whether t1 was modified
> >> by
> >>>> the
> >>>>>>> first
> >>>>>>>>>>>>>> method
> >>>>>>>>>>>>>>>> or not.
> >>>>>>>>>>>>>>>> This is the classic issue of mutable vs. immutable
> >>>> objects.
> >>>>>>>>>>>>>>>> Also, as Piotr pointed out, it might also be good to
> >>> have
> >>>>> the
> >>>>>>>>>> original
> >>>>>>>>>>>>>> plan
> >>>>>>>>>>>>>>>> of t1, because in some cases it is possible to push
> >>>> filters
> >>>>>> down
> >>>>>>>>>> such
> >>>>>>>>>>>>>> that
> >>>>>>>>>>>>>>>> evaluating the query from scratch might be more
> >>> efficient
> >>>>> than
> >>>>>>>>>>>>> accessing
> >>>>>>>>>>>>>>>> the cache.
> >>>>>>>>>>>>>>>> Moreover, a CachedTable could extend Table() and
> >> offer a
> >>>>>> method
> >>>>>>>>>>>>>> refresh().
> >>>>>>>>>>>>>>>> This sounds quite useful in an interactive session
> >> mode.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> 3) Regarding the name, I can see both arguments. IMO,
> >>>>>>>> materialize()
> >>>>>>>>>>>>>> seems
> >>>>>>>>>>>>>>>> to be more future proof.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Best, Fabian
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> On Thu, Nov 29, 2018 at 12:56 PM Shaoxuan Wang <
> >>>>>>>>>>>>>>>> wshaoxuan@gmail.com> wrote:
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Hi Piotr,
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Thanks for sharing your ideas on the method naming.
> >> We
> >>>> will
> >>>>>>> think
> >>>>>>>>>>>>> about
> >>>>>>>>>>>>>>>>> your suggestions. But I don't understand why we need
> >> to
> >>>>>> change
> >>>>>>>> the
> >>>>>>>>>>>>>> return
> >>>>>>>>>>>>>>>>> type of cache().
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Cache() is a physical operation, it does not change
> >> the
> >>>>> logic
> >>>>>>> of
> >>>>>>>>>>>>>>>>> the `Table`. On the tableAPI layer, we should not
> >>>>> introduce a
> >>>>>>> new
> >>>>>>>>>>>>> table
> >>>>>>>>>>>>>>>>> type unless the logic of table has been changed. If
> >> we
> >>>>>>> introduce
> >>>>>>>> a
> >>>>>>>>>>>>> new
> >>>>>>>>>>>>>>>>> table type `CachedTable`, we need create the same set
> >>> of
> >>>>>>> methods
> >>>>>>>> of
> >>>>>>>>>>>>>>>> `Table`
> >>>>>>>>>>>>>>>>> for it. I don't think it is worth doing this. Or can
> >>> you
> >>>>>> please
> >>>>>>>>>>>>>> elaborate
> >>>>>>>>>>>>>>>>> more on what could be the "implicit behaviours/side
> >>>>> effects"
> >>>>>>> you
> >>>>>>>>>> are
> >>>>>>>>>>>>>>>>> thinking about?
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Regards,
> >>>>>>>>>>>>>>>>> Shaoxuan
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> On Thu, Nov 29, 2018 at 7:05 PM Piotr Nowojski <
> >>>>>>>>>>>>>> piotr@data-artisans.com>
> >>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Hi Becket,
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Thanks for the response.
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> 1. I wasn’t saying that materialised view must be
> >>>> mutable
> >>>>> or
> >>>>>>>> not.
> >>>>>>>>>>>>> The
> >>>>>>>>>>>>>>>>> same
> >>>>>>>>>>>>>>>>>> thing applies to caches as well. To the contrary, I
> >>>> would
> >>>>>>> expect
> >>>>>>>>>>>>> more
> >>>>>>>>>>>>>>>>>> consistency and updates from something that is
> >> called
> >>>>>> “cache”
> >>>>>>> vs
> >>>>>>>>>>>>>>>>> something
> >>>>>>>>>>>>>>>>>> that’s a “materialised view”. In other words, IMO
> >> most
> >>>>>> caches
> >>>>>>> do
> >>>>>>>>>> not
> >>>>>>>>>>>>>>>>> serve
> >>>>>>>>>>>>>>>>>> you invalid/outdated data and they handle updates on
> >>>> their
> >>>>>>> own.
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> 2. I don’t think that having in the future two very
> >>>>> similar
> >>>>>>>>>> concepts
> >>>>>>>>>>>>>> of
> >>>>>>>>>>>>>>>>>> `materialized` view and `cache` is a good idea. It
> >>> would
> >>>>> be
> >>>>>>>>>>>>> confusing
> >>>>>>>>>>>>>>>> for
> >>>>>>>>>>>>>>>>>> the users. I think it could be handled by
> >>>>>>> variations/overloading
> >>>>>>>>>> of
> >>>>>>>>>>>>>>>>>> materialised view concept. We could start with:
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> `MaterializedTable materialize()` - immutable,
> >> session
> >>>>> life
> >>>>>>>> scope
> >>>>>>>>>>>>>>>>>> (basically the same semantic as you are proposing)
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> And then in the future (if ever) build on top of
> >>>>> that/expand
> >>>>>>> it
> >>>>>>>>>>>>> with:
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> `MaterializedTable materialize(refreshTime=…)` or
> >>>>>>>>>> `MaterializedTable
> >>>>>>>>>>>>>>>>>> materialize(refreshHook=…)`
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Or with cross session support:
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> `MaterializedTable materializeInto(connector=…)` or
> >>>>>>>>>>>>> `MaterializedTable
> >>>>>>>>>>>>>>>>>> materializeInto(tableFactory=…)`
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> I’m not saying that we should implement cross
> >>>>>>> session/refreshing
> >>>>>>>>>> now
> >>>>>>>>>>>>>> or
> >>>>>>>>>>>>>>>>>> even in the near future. I’m just arguing that
> >> naming
> >>>>>> current
> >>>>>>>>>>>>>> immutable
> >>>>>>>>>>>>>>>>>> session life scope method `materialize()` is more
> >>> future
> >>>>>> proof
> >>>>>>>> and
> >>>>>>>>>>>>>> more
> >>>>>>>>>>>>>>>>>> consistent with SQL (on which after all table-api is
> >>>>> heavily
> >>>>>>>>>> basing
> >>>>>>>>>>>>>>>> on).
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> 3. Even if we agree on naming it `cache()`, I would
> >>>> still
> >>>>>>> insist
> >>>>>>>>>> on
> >>>>>>>>>>>>>>>>>> `cache()` returning `CachedTable` handle to avoid
> >>>> implicit
> >>>>>>>>>>>>>>>>> behaviours/side
> >>>>>>>>>>>>>>>>>> effects and to give both us & users more
> >> flexibility.
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Piotrek
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> On 29 Nov 2018, at 06:20, Becket Qin <
> >>>>> becket.qin@gmail.com
> >>>>>>>
> >>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Just to add a little bit, the materialized view is
> >>>>> probably
> >>>>>>>> more
> >>>>>>>>>>>>>>>>> similar
> >>>>>>>>>>>>>>>>>> to
> >>>>>>>>>>>>>>>>>>> the persistent() brought up earlier in the thread.
> >> So
> >>>> it
> >>>>> is
> >>>>>>>>>> usually
> >>>>>>>>>>>>>>>>> cross
> >>>>>>>>>>>>>>>>>>> session and could be used in a larger scope. For
> >>>>> example, a
> >>>>>>>>>>>>>>>>> materialized
> >>>>>>>>>>>>>>>>>>> view created by user A may be visible to user B. It
> >>> is
> >>>>>>> probably
> >>>>>>>>>>>>>>>>> something
> >>>>>>>>>>>>>>>>>>> we want to have in the future. I'll put it in the
> >>>> future
> >>>>>> work
> >>>>>>>>>>>>>>>> section.
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Jiangjie (Becket) Qin
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> On Thu, Nov 29, 2018 at 9:47 AM Becket Qin <
> >>>>>>>> becket.qin@gmail.com
> >>>>>>>>>>>
> >>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> Hi Piotrek,
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> Thanks for the explanation.
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> Right now we are mostly thinking of the cached
> >> table
> >>>> as
> >>>>>>>>>>>>> immutable. I
> >>>>>>>>>>>>>>>>> can
> >>>>>>>>>>>>>>>>>>>> see the Materialized view would be useful in the
> >>>> future.
> >>>>>>> That
> >>>>>>>>>>>>> said,
> >>>>>>>>>>>>>>>> I
> >>>>>>>>>>>>>>>>>> think
> >>>>>>>>>>>>>>>>>>>> a simple cache mechanism is probably still needed.
> >>> So
> >>>> to
> >>>>>> me,
> >>>>>>>>>>>>> cache()
> >>>>>>>>>>>>>>>>> and
> >>>>>>>>>>>>>>>>>>>> materialize() should be two separate method as
> >> they
> >>>>>> address
> >>>>>>>>>>>>>>>> different
> >>>>>>>>>>>>>>>>>>>> needs. Materialize() is a higher level concept
> >>> usually
> >>>>>>>> implying
> >>>>>>>>>>>>>>>>>> periodical
> >>>>>>>>>>>>>>>>>>>> update, while cache() has much simpler semantic.
> >> For
> >>>>>>> example,
> >>>>>>>>>> one
> >>>>>>>>>>>>>>>> may
> >>>>>>>>>>>>>>>>>>>> create a materialized view and use cache() method
> >> in
> >>>> the
> >>>>>>>>>>>>>>>> materialized
> >>>>>>>>>>>>>>>>>> view
> >>>>>>>>>>>>>>>>>>>> creation logic. So that during the materialized
> >> view
> >>>>>> update,
> >>>>>>>>>> they
> >>>>>>>>>>>>> do
> >>>>>>>>>>>>>>>>> not
> >>>>>>>>>>>>>>>>>>>> need to worry about the case that the cached table
> >>> is
> >>>>> also
> >>>>>>>>>>>>> changed.
> >>>>>>>>>>>>>>>>>> Maybe
> >>>>>>>>>>>>>>>>>>>> under the hood, materialized() and cache() could
> >>> share
> >>>>>> some
> >>>>>>>>>>>>>>>> mechanism,
> >>>>>>>>>>>>>>>>>> but
> >>>>>>>>>>>>>>>>>>>> I think a simple cache() method would be handy in
> >> a
> >>>> lot
> >>>>> of
> >>>>>>>>>> cases.
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> Jiangjie (Becket) Qin
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> On Mon, Nov 26, 2018 at 9:38 PM Piotr Nowojski <
> >>>>>>>>>>>>>>>>> piotr@data-artisans.com
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> Hi Becket,
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> Is there any extra thing user can do on a
> >>>>>>> MaterializedTable
> >>>>>>>>>> that
> >>>>>>>>>>>>>>>>> they
> >>>>>>>>>>>>>>>>>>>>> cannot do on a Table?
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> Maybe not in the initial implementation, but
> >>> various
> >>>>> DBs
> >>>>>>>> offer
> >>>>>>>>>>>>>>>>>> different
> >>>>>>>>>>>>>>>>>>>>> ways to “refresh” the materialised view. Hooks,
> >>>>> triggers,
> >>>>>>>>>> timers,
> >>>>>>>>>>>>>>>>>> manually
> >>>>>>>>>>>>>>>>>>>>> etc. Having `MaterializedTable` would help us to
> >>>> handle
> >>>>>>> that
> >>>>>>>> in
> >>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>> future.
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> After users call *table.cache(), *users can just
> >>> use
> >>>>>> that
> >>>>>>>>>> table
> >>>>>>>>>>>>>>>> and
> >>>>>>>>>>>>>>>>> do
> >>>>>>>>>>>>>>>>>>>>> anything that is supported on a Table, including
> >>> SQL.
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> This is some implicit behaviour with side
> >> effects.
> >>>>>> Imagine
> >>>>>>> if
> >>>>>>>>>>>>> user
> >>>>>>>>>>>>>>>>> has
> >>>>>>>>>>>>>>>>>> a
> >>>>>>>>>>>>>>>>>>>>> long and complicated program, that touches table
> >>> `b`
> >>>>>>> multiple
> >>>>>>>>>>>>>>>> times,
> >>>>>>>>>>>>>>>>>> maybe
> >>>>>>>>>>>>>>>>>>>>> scattered around different methods. If he
> >> modifies
> >>>> his
> >>>>>>>> program
> >>>>>>>>>> by
> >>>>>>>>>>>>>>>>>> inserting
> >>>>>>>>>>>>>>>>>>>>> in one place
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> b.cache()
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> This implicitly alters the semantic and behaviour
> >>> of
> >>>>> his
> >>>>>>> code
> >>>>>>>>>> all
> >>>>>>>>>>>>>>>>> over
> >>>>>>>>>>>>>>>>>>>>> the place, maybe in a ways that might cause
> >>> problems.
> >>>>> For
> >>>>>>>>>> example
> >>>>>>>>>>>>>>>>> what
> >>>>>>>>>>>>>>>>>> if
> >>>>>>>>>>>>>>>>>>>>> underlying data is changing?
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> Having invisible side effects is also not very
> >>> clean,
> >>>>> for
> >>>>>>>>>> example
> >>>>>>>>>>>>>>>>> think
> >>>>>>>>>>>>>>>>>>>>> about something like this (but more complicated):
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> Table b = ...;
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> If (some_condition) {
> >>>>>>>>>>>>>>>>>>>>> processTable1(b)
> >>>>>>>>>>>>>>>>>>>>> }
> >>>>>>>>>>>>>>>>>>>>> else {
> >>>>>>>>>>>>>>>>>>>>> processTable2(b)
> >>>>>>>>>>>>>>>>>>>>> }
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> // do more stuff with b
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> And user adds `b.cache()` call to only one of the
> >>>>>>>>>> `processTable1`
> >>>>>>>>>>>>>>>> or
> >>>>>>>>>>>>>>>>>>>>> `processTable2` methods.
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> On the other hand
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> Table materialisedB = b.materialize()
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> Avoids (at least some of) the side effect issues
> >>> and
> >>>>>> forces
> >>>>>>>>>> user
> >>>>>>>>>>>>> to
> >>>>>>>>>>>>>>>>>>>>> explicitly use `materialisedB` where it’s
> >>> appropriate
> >>>>> and
> >>>>>>>>>> forces
> >>>>>>>>>>>>>>>> user
> >>>>>>>>>>>>>>>>>> to
> >>>>>>>>>>>>>>>>>>>>> think what does it actually mean. And if
> >> something
> >>>>>> doesn’t
> >>>>>>>> work
> >>>>>>>>>>>>> in
> >>>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>> end
> >>>>>>>>>>>>>>>>>>>>> for the user, he will know what has he changed
> >>>> instead
> >>>>> of
> >>>>>>>>>> blaming
> >>>>>>>>>>>>>>>>>> Flink for
> >>>>>>>>>>>>>>>>>>>>> some “magic” underneath. In the above example,
> >>> after
> >>>>>>>>>>>>> materialising
> >>>>>>>>>>>>>>>> b
> >>>>>>>>>>>>>>>>> in
> >>>>>>>>>>>>>>>>>>>>> only one of the methods, he should/would realise
> >>>> about
> >>>>>> the
> >>>>>>>>>> issue
> >>>>>>>>>>>>>>>> when
> >>>>>>>>>>>>>>>>>>>>> handling the return value `MaterializedTable` of
> >>> that
> >>>>>>> method.
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> I guess it comes down to personal preferences if
> >>> you
> >>>>> like
> >>>>>>>>>> things
> >>>>>>>>>>>>> to
> >>>>>>>>>>>>>>>>> be
> >>>>>>>>>>>>>>>>>>>>> implicit or not. The more power is the user,
> >>> probably
> >>>>> the
> >>>>>>>> more
> >>>>>>>>>>>>>>>> likely
> >>>>>>>>>>>>>>>>>> he is
> >>>>>>>>>>>>>>>>>>>>> to like/understand implicit behaviour. And we as
> >>>> Table
> >>>>>> API
> >>>>>>>>>>>>>>>> designers
> >>>>>>>>>>>>>>>>>> are
> >>>>>>>>>>>>>>>>>>>>> the most power users out there, so I would
> >> proceed
> >>>> with
> >>>>>>>> caution
> >>>>>>>>>>>>> (so
> >>>>>>>>>>>>>>>>>> that we
> >>>>>>>>>>>>>>>>>>>>> do not end up in the crazy perl realm with it’s
> >>>> lovely
> >>>>>>>> implicit
> >>>>>>>>>>>>>>>>> method
> >>>>>>>>>>>>>>>>>>>>> arguments ;)  <
> >>>>>>> https://stackoverflow.com/a/14922656/8149051
> >>>>>>>>> )
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> Table API to also support non-relational
> >>> processing
> >>>>>> cases,
> >>>>>>>>>>>>> cache()
> >>>>>>>>>>>>>>>>>>>>> might be slightly better.
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> I think even such extended Table API could
> >> benefit
> >>>> from
> >>>>>>>>>> sticking
> >>>>>>>>>>>>>>>>>> to/being
> >>>>>>>>>>>>>>>>>>>>> consistent with SQL where both SQL and Table API
> >>> are
> >>>>>>>> basically
> >>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>> same.
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> One more thing. `MaterializedTable materialize()`
> >>>> could
> >>>>>> be
> >>>>>>>> more
> >>>>>>>>>>>>>>>>>>>>> powerful/flexible allowing the user to operate
> >> both
> >>>> on
> >>>>>>>>>>>>> materialised
> >>>>>>>>>>>>>>>>>> and not
> >>>>>>>>>>>>>>>>>>>>> materialised view at the same time for whatever
> >>>> reasons
> >>>>>>>>>>>>> (underlying
> >>>>>>>>>>>>>>>>>> data
> >>>>>>>>>>>>>>>>>>>>> changing/better optimisation opportunities after
> >>>>> pushing
> >>>>>>> down
> >>>>>>>>>>>>> more
> >>>>>>>>>>>>>>>>>> filters
> >>>>>>>>>>>>>>>>>>>>> etc). For example:
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> Table b = …;
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> MaterlizedTable mb = b.materialize();
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> Val min = mb.min();
> >>>>>>>>>>>>>>>>>>>>> Val max = mb.max();
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> Val user42 = b.filter(‘userId = 42);
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> Could be more efficient compared to `b.cache()`
> >> if
> >>>>>>>>>>>>> `filter(‘userId
> >>>>>>>>>>>>>>>> =
> >>>>>>>>>>>>>>>>>>>>> 42);` allows for much more aggressive
> >>> optimisations.
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> Piotrek
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> On 26 Nov 2018, at 12:14, Fabian Hueske <
> >>>>>>> fhueske@gmail.com>
> >>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> I'm not suggesting to add support for Ignite.
> >> This
> >>>> was
> >>>>>>> just
> >>>>>>>> an
> >>>>>>>>>>>>>>>>>> example.
> >>>>>>>>>>>>>>>>>>>>>> Plasma and Arrow sound interesting, too.
> >>>>>>>>>>>>>>>>>>>>>> For the sake of this proposal, it would be up to
> >>> the
> >>>>>> user
> >>>>>>> to
> >>>>>>>>>>>>>>>>>> implement a
> >>>>>>>>>>>>>>>>>>>>>> TableFactory and corresponding TableSource /
> >>>> TableSink
> >>>>>>>> classes
> >>>>>>>>>>>>> to
> >>>>>>>>>>>>>>>>>>>>> persist
> >>>>>>>>>>>>>>>>>>>>>> and read the data.
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> Am Mo., 26. Nov. 2018 um 12:06 Uhr schrieb
> >> Flavio
> >>>>>>>> Pompermaier
> >>>>>>>>>> <
> >>>>>>>>>>>>>>>>>>>>>> pompermaier@okkam.it>:
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> What about to add also Apache Plasma + Arrow as
> >>> an
> >>>>>>>>>> alternative
> >>>>>>>>>>>>> to
> >>>>>>>>>>>>>>>>>>>>> Apache
> >>>>>>>>>>>>>>>>>>>>>>> Ignite?
> >>>>>>>>>>>>>>>>>>>>>>> [1]
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>
> >>>>>
> >>>
> https://arrow.apache.org/blog/2017/08/08/plasma-in-memory-object-store/
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> On Mon, Nov 26, 2018 at 11:56 AM Fabian Hueske
> >> <
> >>>>>>>>>>>>>>>> fhueske@gmail.com>
> >>>>>>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> Hi,
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> Thanks for the proposal!
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> To summarize, you propose a new method
> >>>>> Table.cache():
> >>>>>>>> Table
> >>>>>>>>>>>>> that
> >>>>>>>>>>>>>>>>>> will
> >>>>>>>>>>>>>>>>>>>>>>>> trigger a job and write the result into some
> >>>>> temporary
> >>>>>>>>>> storage
> >>>>>>>>>>>>>>>> as
> >>>>>>>>>>>>>>>>>>>>> defined
> >>>>>>>>>>>>>>>>>>>>>>>> by a TableFactory.
> >>>>>>>>>>>>>>>>>>>>>>>> The cache() call blocks while the job is
> >> running
> >>>> and
> >>>>>>>>>>>>> eventually
> >>>>>>>>>>>>>>>>>>>>> returns a
> >>>>>>>>>>>>>>>>>>>>>>>> Table object that represents a scan of the
> >>>> temporary
> >>>>>>>> table.
> >>>>>>>>>>>>>>>>>>>>>>>> When the "session" is closed (closing to be
> >>>>> defined?),
> >>>>>>> the
> >>>>>>>>>>>>>>>>> temporary
> >>>>>>>>>>>>>>>>>>>>>>> tables
> >>>>>>>>>>>>>>>>>>>>>>>> are all dropped.
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> I think this behavior makes sense and is a
> >> good
> >>>>> first
> >>>>>>> step
> >>>>>>>>>>>>>>>> towards
> >>>>>>>>>>>>>>>>>>>>> more
> >>>>>>>>>>>>>>>>>>>>>>>> interactive workloads.
> >>>>>>>>>>>>>>>>>>>>>>>> However, its performance suffers from writing
> >> to
> >>>> and
> >>>>>>>> reading
> >>>>>>>>>>>>>>>> from
> >>>>>>>>>>>>>>>>>>>>>>> external
> >>>>>>>>>>>>>>>>>>>>>>>> systems.
> >>>>>>>>>>>>>>>>>>>>>>>> I think this is OK for now. Changes that would
> >>>>>>>> significantly
> >>>>>>>>>>>>>>>>> improve
> >>>>>>>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>>>>>>> situation (i.e., pinning data in-memory across
> >>>> jobs)
> >>>>>>> would
> >>>>>>>>>>>>> have
> >>>>>>>>>>>>>>>>>> large
> >>>>>>>>>>>>>>>>>>>>>>>> impacts on many components of Flink.
> >>>>>>>>>>>>>>>>>>>>>>>> Users could use in-memory filesystems or
> >> storage
> >>>>> grids
> >>>>>>>>>> (Apache
> >>>>>>>>>>>>>>>>>>>>> Ignite) to
> >>>>>>>>>>>>>>>>>>>>>>>> mitigate some of the performance effects.
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> Best, Fabian
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> Am Mo., 26. Nov. 2018 um 03:38 Uhr schrieb
> >>> Becket
> >>>>> Qin
> >>>>>> <
> >>>>>>>>>>>>>>>>>>>>>>>> becket.qin@gmail.com
> >>>>>>>>>>>>>>>>>>>>>>>>> :
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>> Thanks for the explanation, Piotrek.
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>> Is there any extra thing user can do on a
> >>>>>>>> MaterializedTable
> >>>>>>>>>>>>>>>> that
> >>>>>>>>>>>>>>>>>> they
> >>>>>>>>>>>>>>>>>>>>>>>>> cannot do on a Table? After users call
> >>>>>> *table.cache(),
> >>>>>>>>>> *users
> >>>>>>>>>>>>>>>> can
> >>>>>>>>>>>>>>>>>>>>> just
> >>>>>>>>>>>>>>>>>>>>>>>> use
> >>>>>>>>>>>>>>>>>>>>>>>>> that table and do anything that is supported
> >>> on a
> >>>>>>> Table,
> >>>>>>>>>>>>>>>>> including
> >>>>>>>>>>>>>>>>>>>>> SQL.
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>> Naming wise, either cache() or materialize()
> >>>> sounds
> >>>>>>> fine
> >>>>>>>> to
> >>>>>>>>>>>>> me.
> >>>>>>>>>>>>>>>>>>>>> cache()
> >>>>>>>>>>>>>>>>>>>>>>>> is
> >>>>>>>>>>>>>>>>>>>>>>>>> a bit more general than materialize(). Given
> >>> that
> >>>>> we
> >>>>>>> are
> >>>>>>>>>>>>>>>>> enhancing
> >>>>>>>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>>>>>>>> Table API to also support non-relational
> >>>> processing
> >>>>>>>> cases,
> >>>>>>>>>>>>>>>>> cache()
> >>>>>>>>>>>>>>>>>>>>>>> might
> >>>>>>>>>>>>>>>>>>>>>>>> be
> >>>>>>>>>>>>>>>>>>>>>>>>> slightly better.
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>> Jiangjie (Becket) Qin
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>> On Fri, Nov 23, 2018 at 11:25 PM Piotr
> >>> Nowojski <
> >>>>>>>>>>>>>>>>>>>>>>> piotr@data-artisans.com
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> Hi Becket,
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> Ops, sorry I didn’t notice that you intend
> >> to
> >>>>> reuse
> >>>>>>>>>> existing
> >>>>>>>>>>>>>>>>>>>>>>>>>> `TableFactory`. I don’t know why, but I
> >>> assumed
> >>>>> that
> >>>>>>> you
> >>>>>>>>>>>>> want
> >>>>>>>>>>>>>>>> to
> >>>>>>>>>>>>>>>>>>>>>>>> provide
> >>>>>>>>>>>>>>>>>>>>>>>>> an
> >>>>>>>>>>>>>>>>>>>>>>>>>> alternate way of writing the data.
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> Now that I hopefully understand the
> >> proposal,
> >>>>> maybe
> >>>>>> we
> >>>>>>>>>> could
> >>>>>>>>>>>>>>>>>> rename
> >>>>>>>>>>>>>>>>>>>>>>>>>> `cache()` to
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> void materialize()
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> or going step further
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> MaterializedTable materialize()
> >>>>>>>>>>>>>>>>>>>>>>>>>> MaterializedTable createMaterializedView()
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> ?
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> The second option with returning a handle I
> >>>> think
> >>>>> is
> >>>>>>>> more
> >>>>>>>>>>>>>>>>> flexible
> >>>>>>>>>>>>>>>>>>>>>>> and
> >>>>>>>>>>>>>>>>>>>>>>>>>> could provide features such as
> >>>> “refresh”/“delete”
> >>>>> or
> >>>>>>>>>>>>> generally
> >>>>>>>>>>>>>>>>>>>>>>> speaking
> >>>>>>>>>>>>>>>>>>>>>>>>>> manage the the view. In the future we could
> >>> also
> >>>>>> think
> >>>>>>>>>> about
> >>>>>>>>>>>>>>>>>> adding
> >>>>>>>>>>>>>>>>>>>>>>>> hooks
> >>>>>>>>>>>>>>>>>>>>>>>>>> to automatically refresh view etc. It is
> >> also
> >>>> more
> >>>>>>>>>> explicit
> >>>>>>>>>>>>> -
> >>>>>>>>>>>>>>>>>>>>>>>>>> materialization returning a new table handle
> >>>> will
> >>>>>> not
> >>>>>>>> have
> >>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>> same
> >>>>>>>>>>>>>>>>>>>>>>>>>> implicit side effects as adding a simple
> >> line
> >>> of
> >>>>>> code
> >>>>>>>> like
> >>>>>>>>>>>>>>>>>>>>>>> `b.cache()`
> >>>>>>>>>>>>>>>>>>>>>>>>>> would have.
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> It would also be more SQL like, making it
> >> more
> >>>>>>> intuitive
> >>>>>>>>>> for
> >>>>>>>>>>>>>>>>> users
> >>>>>>>>>>>>>>>>>>>>>>>>> already
> >>>>>>>>>>>>>>>>>>>>>>>>>> familiar with the SQL.
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> Piotrek
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>> On 23 Nov 2018, at 14:53, Becket Qin <
> >>>>>>>>>> becket.qin@gmail.com
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>> Hi Piotrek,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>> For the cache() method itself, yes, it is
> >>>>>> equivalent
> >>>>>>> to
> >>>>>>>>>>>>>>>>> creating
> >>>>>>>>>>>>>>>>>> a
> >>>>>>>>>>>>>>>>>>>>>>>>>> BUILT-IN
> >>>>>>>>>>>>>>>>>>>>>>>>>>> materialized view with a lifecycle. That
> >>>>>>> functionality
> >>>>>>>> is
> >>>>>>>>>>>>>>>>> missing
> >>>>>>>>>>>>>>>>>>>>>>>>> today,
> >>>>>>>>>>>>>>>>>>>>>>>>>>> though. Not sure if I understand your
> >>> question.
> >>>>> Do
> >>>>>>> you
> >>>>>>>>>> mean
> >>>>>>>>>>>>>>>> we
> >>>>>>>>>>>>>>>>>>>>>>>> already
> >>>>>>>>>>>>>>>>>>>>>>>>>> have
> >>>>>>>>>>>>>>>>>>>>>>>>>>> the functionality and just need a syntax
> >>> sugar?
> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>> What's more interesting in the proposal is
> >> do
> >>>> we
> >>>>>> want
> >>>>>>>> to
> >>>>>>>>>>>>> stop
> >>>>>>>>>>>>>>>>> at
> >>>>>>>>>>>>>>>>>>>>>>>>> creating
> >>>>>>>>>>>>>>>>>>>>>>>>>>> the materialized view? Or do we want to
> >>> extend
> >>>>> that
> >>>>>>> in
> >>>>>>>>>> the
> >>>>>>>>>>>>>>>>> future
> >>>>>>>>>>>>>>>>>>>>>>> to
> >>>>>>>>>>>>>>>>>>>>>>>> a
> >>>>>>>>>>>>>>>>>>>>>>>>>> more
> >>>>>>>>>>>>>>>>>>>>>>>>>>> useful unified data store distributed with
> >>>> Flink?
> >>>>>> And
> >>>>>>>> do
> >>>>>>>>>> we
> >>>>>>>>>>>>>>>>> want
> >>>>>>>>>>>>>>>>>> to
> >>>>>>>>>>>>>>>>>>>>>>>>> have
> >>>>>>>>>>>>>>>>>>>>>>>>>> a
> >>>>>>>>>>>>>>>>>>>>>>>>>>> mechanism allow more flexible user job
> >>> pattern
> >>>>> with
> >>>>>>>> their
> >>>>>>>>>>>>> own
> >>>>>>>>>>>>>>>>>> user
> >>>>>>>>>>>>>>>>>>>>>>>>>> defined
> >>>>>>>>>>>>>>>>>>>>>>>>>>> services. These considerations are much
> >> more
> >>>>>>>>>> architectural.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>> Jiangjie (Becket) Qin
> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>> On Fri, Nov 23, 2018 at 6:01 PM Piotr
> >>> Nowojski
> >>>> <
> >>>>>>>>>>>>>>>>>>>>>>>>> piotr@data-artisans.com>
> >>>>>>>>>>>>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> Interesting idea. I’m trying to understand
> >>> the
> >>>>>>>> problem.
> >>>>>>>>>>>>>>>> Isn’t
> >>>>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> `cache()` call an equivalent of writing
> >> data
> >>>> to
> >>>>> a
> >>>>>>> sink
> >>>>>>>>>> and
> >>>>>>>>>>>>>>>>> later
> >>>>>>>>>>>>>>>>>>>>>>>>> reading
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> from it? Where this sink has a limited
> >> live
> >>>>>>> scope/live
> >>>>>>>>>>>>> time?
> >>>>>>>>>>>>>>>>> And
> >>>>>>>>>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>>>>>>>>> sink
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> could be implemented as in memory or a
> >> file
> >>>>> sink?
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> If so, what’s the problem with creating a
> >>>>>>> materialised
> >>>>>>>>>>>>> view
> >>>>>>>>>>>>>>>>>> from a
> >>>>>>>>>>>>>>>>>>>>>>>>> table
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> “b” (from your document’s example) and
> >>> reusing
> >>>>>> this
> >>>>>>>>>>>>>>>>> materialised
> >>>>>>>>>>>>>>>>>>>>>>>> view
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> later? Maybe we are lacking mechanisms to
> >>>> clean
> >>>>> up
> >>>>>>>>>>>>>>>>> materialised
> >>>>>>>>>>>>>>>>>>>>>>>> views
> >>>>>>>>>>>>>>>>>>>>>>>>>> (for
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> example when current session finishes)?
> >>> Maybe
> >>>> we
> >>>>>>> need
> >>>>>>>>>> some
> >>>>>>>>>>>>>>>>>>>>>>> syntactic
> >>>>>>>>>>>>>>>>>>>>>>>>>> sugar
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> on top of it?
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> Piotrek
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> On 23 Nov 2018, at 07:21, Becket Qin <
> >>>>>>>>>>>>> becket.qin@gmail.com
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for the suggestion, Jincheng.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> Yes, I think it makes sense to have a
> >>>> persist()
> >>>>>>> with
> >>>>>>>>>>>>>>>>>>>>>>>>> lifecycle/defined
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> scope. I just added a section in the
> >> future
> >>>>> work
> >>>>>>> for
> >>>>>>>>>>>>> this.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> Jiangjie (Becket) Qin
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Fri, Nov 23, 2018 at 1:55 PM jincheng
> >>> sun
> >>>> <
> >>>>>>>>>>>>>>>>>>>>>>>>> sunjincheng121@gmail.com
> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi Jiangjie,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thank you for the explanation about the
> >>> name
> >>>>> of
> >>>>>>>>>>>>>>>> `cache()`, I
> >>>>>>>>>>>>>>>>>>>>>>>>>> understand
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> why
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> you designed this way!
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Another idea is whether we can specify a
> >>>>>> lifecycle
> >>>>>>>> for
> >>>>>>>>>>>>>>>> data
> >>>>>>>>>>>>>>>>>>>>>>>>>> persistence?
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> For example, persist
> >> (LifeCycle.SESSION),
> >>> so
> >>>>>> that
> >>>>>>>> the
> >>>>>>>>>>>>> user
> >>>>>>>>>>>>>>>>> is
> >>>>>>>>>>>>>>>>>>>>>>> not
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> worried
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> about data loss, and will clearly
> >> specify
> >>>> the
> >>>>>> time
> >>>>>>>>>> range
> >>>>>>>>>>>>>>>> for
> >>>>>>>>>>>>>>>>>>>>>>>> keeping
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> time.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> At the same time, if we want to expand,
> >> we
> >>>> can
> >>>>>>> also
> >>>>>>>>>>>>> share
> >>>>>>>>>>>>>>>>> in a
> >>>>>>>>>>>>>>>>>>>>>>>>> certain
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> group of session, for example:
> >>>>>>>>>>>>>>>>> LifeCycle.SESSION_GROUP(...), I
> >>>>>>>>>>>>>>>>>>>>>>> am
> >>>>>>>>>>>>>>>>>>>>>>>>> not
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> sure,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> just an immature suggestion, for
> >> reference
> >>>>> only!
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Bests,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Jincheng
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Becket Qin <be...@gmail.com>
> >>>>>> 于2018年11月23日周五
> >>>>>>>>>>>>>>>> 下午1:33写道:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Re: Jincheng,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for the feedback. Regarding
> >>> cache()
> >>>>> v.s.
> >>>>>>>>>>>>>>>> persist(),
> >>>>>>>>>>>>>>>>>>>>>>>>>> personally I
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> find cache() to be more accurately
> >>>> describing
> >>>>>> the
> >>>>>>>>>>>>>>>> behavior,
> >>>>>>>>>>>>>>>>>>>>>>> i.e.
> >>>>>>>>>>>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Table
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> is cached for the session, but will be
> >>>>> deleted
> >>>>>>>> after
> >>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>>>>>> session
> >>>>>>>>>>>>>>>>>>>>>>>> is
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> closed.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> persist() seems a little misleading as
> >>>> people
> >>>>>>> might
> >>>>>>>>>>>>> think
> >>>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>>>>>>> table
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> will
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> still be there even after the session
> >> is
> >>>>> gone.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Great point about mixing the batch and
> >>>> stream
> >>>>>>>>>>>>> processing
> >>>>>>>>>>>>>>>> in
> >>>>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>>>>>>>> same
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> job.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> We should absolutely move towards that
> >>>> goal.
> >>>>> I
> >>>>>>>>>> imagine
> >>>>>>>>>>>>>>>> that
> >>>>>>>>>>>>>>>>>>>>>>> would
> >>>>>>>>>>>>>>>>>>>>>>>>> be
> >>>>>>>>>>>>>>>>>>>>>>>>>> a
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> huge
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> change across the board, including
> >>> sources,
> >>>>>>>> operators
> >>>>>>>>>>>>> and
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> optimizations,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> to
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> name some. Likely we will need several
> >>>>> separate
> >>>>>>>>>>>>> in-depth
> >>>>>>>>>>>>>>>>>>>>>>>>> discussions.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Jiangjie (Becket) Qin
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Fri, Nov 23, 2018 at 5:14 AM Xingcan
> >>>> Cui <
> >>>>>>>>>>>>>>>>>>>>>>> xingcanc@gmail.com>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi all,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> @Shaoxuan, I think the lifecycle or
> >>> access
> >>>>>>> domain
> >>>>>>>>>> are
> >>>>>>>>>>>>>>>> both
> >>>>>>>>>>>>>>>>>>>>>>>>>> orthogonal
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> to
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the cache problem. Essentially, this
> >> may
> >>>> be
> >>>>>> the
> >>>>>>>>>> first
> >>>>>>>>>>>>>>>> time
> >>>>>>>>>>>>>>>>>> we
> >>>>>>>>>>>>>>>>>>>>>>>> plan
> >>>>>>>>>>>>>>>>>>>>>>>>>> to
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> introduce another storage mechanism
> >>> other
> >>>>> than
> >>>>>>> the
> >>>>>>>>>>>>>>>> state.
> >>>>>>>>>>>>>>>>>>>>>>> Maybe
> >>>>>>>>>>>>>>>>>>>>>>>>> it’s
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> better
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> to first draw a big picture and then
> >>>>>> concentrate
> >>>>>>>> on
> >>>>>>>>>> a
> >>>>>>>>>>>>>>>>>> specific
> >>>>>>>>>>>>>>>>>>>>>>>>> part?
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> @Becket, yes, actually I am more
> >>> concerned
> >>>>>> with
> >>>>>>>> the
> >>>>>>>>>>>>>>>>>> underlying
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> service.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> This seems to be quite a major change
> >> to
> >>>> the
> >>>>>>>>>> existing
> >>>>>>>>>>>>>>>>>>>>>>> codebase.
> >>>>>>>>>>>>>>>>>>>>>>>> As
> >>>>>>>>>>>>>>>>>>>>>>>>>> you
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> claimed, the service should be
> >>> extendible
> >>>> to
> >>>>>>>> support
> >>>>>>>>>>>>>>>> other
> >>>>>>>>>>>>>>>>>>>>>>>>>> components
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> and
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> we’d better discussed it in another
> >>>> thread.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> All in all, I also eager to enjoy the
> >>> more
> >>>>>>>>>> interactive
> >>>>>>>>>>>>>>>>> Table
> >>>>>>>>>>>>>>>>>>>>>>>> API,
> >>>>>>>>>>>>>>>>>>>>>>>>> in
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> case
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> of a general and flexible enough
> >> service
> >>>>>>>> mechanism.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Best,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Xingcan
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Nov 22, 2018, at 10:16 AM, Xiaowei
> >>>>> Jiang <
> >>>>>>>>>>>>>>>>>>>>>>>> xiaoweij@gmail.com>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Relying on a callback for the temp
> >>> table
> >>>>> for
> >>>>>>>> clean
> >>>>>>>>>> up
> >>>>>>>>>>>>>>>> is
> >>>>>>>>>>>>>>>>>> not
> >>>>>>>>>>>>>>>>>>>>>>>> very
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> reliable.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> There is no guarantee that it will be
> >>>>>> executed
> >>>>>>>>>>>>>>>>>> successfully.
> >>>>>>>>>>>>>>>>>>>>>>> We
> >>>>>>>>>>>>>>>>>>>>>>>>> may
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> risk
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> leaks when that happens. I think that
> >>>> it's
> >>>>>>> safer
> >>>>>>>> to
> >>>>>>>>>>>>>>>> have
> >>>>>>>>>>>>>>>>> an
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> association
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> between temp table and session id. So
> >>> we
> >>>>> can
> >>>>>>>> always
> >>>>>>>>>>>>>>>> clean
> >>>>>>>>>>>>>>>>>> up
> >>>>>>>>>>>>>>>>>>>>>>>> temp
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> tables
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> which are no longer associated with
> >> any
> >>>>>> active
> >>>>>>>>>>>>>>>> sessions.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Regards,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Xiaowei
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Thu, Nov 22, 2018 at 12:55 PM
> >>> jincheng
> >>>>>> sun <
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> sunjincheng121@gmail.com>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi Jiangjie&Shaoxuan,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for initiating this great
> >>>> proposal!
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Interactive Programming is very
> >> useful
> >>>> and
> >>>>>>> user
> >>>>>>>>>>>>>>>> friendly
> >>>>>>>>>>>>>>>>>> in
> >>>>>>>>>>>>>>>>>>>>>>>> case
> >>>>>>>>>>>>>>>>>>>>>>>>>> of
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> your
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> examples.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Moreover, especially when a business
> >>> has
> >>>>> to
> >>>>>> be
> >>>>>>>>>>>>>>>> executed
> >>>>>>>>>>>>>>>>> in
> >>>>>>>>>>>>>>>>>>>>>>>>> several
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> stages
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> with dependencies,such as the
> >> pipeline
> >>>> of
> >>>>>>> Flink
> >>>>>>>>>> ML,
> >>>>>>>>>>>>> in
> >>>>>>>>>>>>>>>>>> order
> >>>>>>>>>>>>>>>>>>>>>>>> to
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> utilize
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> intermediate calculation results we
> >>> have
> >>>>> to
> >>>>>>>>>> submit a
> >>>>>>>>>>>>>>>> job
> >>>>>>>>>>>>>>>>>> by
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> env.execute().
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> About the `cache()`  , I think is
> >>> better
> >>>>> to
> >>>>>>>> named
> >>>>>>>>>>>>>>>>>>>>>>> `persist()`,
> >>>>>>>>>>>>>>>>>>>>>>>>> And
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> The
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Flink framework determines whether
> >> we
> >>>>>>> internally
> >>>>>>>>>>>>> cache
> >>>>>>>>>>>>>>>>> in
> >>>>>>>>>>>>>>>>>>>>>>>> memory
> >>>>>>>>>>>>>>>>>>>>>>>>>> or
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> persist
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> to the storage system,Maybe save the
> >>>> data
> >>>>>> into
> >>>>>>>>>> state
> >>>>>>>>>>>>>>>>>> backend
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> (MemoryStateBackend or
> >>>> RocksDBStateBackend
> >>>>>>> etc.)
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> BTW, from the points of my view in
> >> the
> >>>>>> future,
> >>>>>>>>>>>>> support
> >>>>>>>>>>>>>>>>> for
> >>>>>>>>>>>>>>>>>>>>>>>>>> streaming
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> and
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> batch mode switching in the same job
> >>>> will
> >>>>>> also
> >>>>>>>>>>>>> benefit
> >>>>>>>>>>>>>>>>> in
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> "Interactive
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Programming",  I am looking forward
> >> to
> >>>>> your
> >>>>>>>> JIRAs
> >>>>>>>>>>>>> and
> >>>>>>>>>>>>>>>>>> FLIP!
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Best,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Jincheng
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Becket Qin <be...@gmail.com>
> >>>>>>>> 于2018年11月20日周二
> >>>>>>>>>>>>>>>>>> 下午9:56写道:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi all,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> As a few recent email threads have
> >>>>> pointed
> >>>>>>> out,
> >>>>>>>>>> it
> >>>>>>>>>>>>>>>> is a
> >>>>>>>>>>>>>>>>>>>>>>>>> promising
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> opportunity to enhance Flink Table
> >>> API
> >>>> in
> >>>>>>>> various
> >>>>>>>>>>>>>>>>>> aspects,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> including
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> functionality and ease of use among
> >>>>> others.
> >>>>>>> One
> >>>>>>>>>> of
> >>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>>>>>>>> scenarios
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> where
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> we
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> feel Flink could improve is
> >>> interactive
> >>>>>>>>>>>>> programming.
> >>>>>>>>>>>>>>>> To
> >>>>>>>>>>>>>>>>>>>>>>>> explain
> >>>>>>>>>>>>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> issues
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> and facilitate the discussion on
> >> the
> >>>>>>> solution,
> >>>>>>>> we
> >>>>>>>>>>>>> put
> >>>>>>>>>>>>>>>>>>>>>>>> together
> >>>>>>>>>>>>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> following document with our
> >> proposal.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
> https://docs.google.com/document/d/1d4T2zTyfe7hdncEUAxrlNOYr4e5IMNEZLyqSuuswkA0/edit?usp=sharing
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Feedback and comments are very
> >>> welcome!
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Jiangjie (Becket) Qin
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
>
>

Re: [DISCUSS] Support Interactive Programming in Flink Table API

Posted by Piotr Nowojski <pi...@data-artisans.com>.
Hi Becket,

Sorry for not responding for a long time.

Regarding case 1:

There wouldn’t be an “a.unCache()” method; I would expect only `cachedTableA1.dropCache()`. Dropping `cachedTableA1` wouldn’t affect `cachedTableA2`. Just as in any other database, dropping or modifying one independent table/materialised view does not affect the others.
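
For example, a rough sketch of the semantics I have in mind (the `CachedTable` handle and its `dropCache()` are of course hypothetical at this point):

Table a = tableEnv.scan("src").filter("f1 > 100");  // some expensive derivation
CachedTable cachedTableA1 = a.cache();              // first handle
CachedTable cachedTableA2 = a.cache();              // independent second handle

Table x = cachedTableA1.select("f1");  // reads the cached data
cachedTableA1.dropCache();             // invalidates only this handle
Table y = cachedTableA2.select("f1");  // still reads the cached data
Table z = a.select("f1");              // always goes through the original DAG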

> What I meant is that assuming there is already a cached table, ideally users need
> not specify whether the next query should read from the cache or use the
> original DAG. This should be decided by the optimizer.

1. If we want to let the optimiser make the decision whether to use the cache or not, then why do we need a “void cache()” method at all? Would it “increase” the chance of using the cache? That sounds strange. What would be the mechanism for deciding whether to use the cache or not? If we want to introduce this kind of automated optimisation of “plan node deduplication”, I would turn it on globally, not per table, and let the optimiser do all of the work.
2. We do not have statistics at the moment for any use/not-use-cache decision.
3. Even if we had, I would be veeerryy sceptical whether such cost based optimisations would work properly, and I would still insist on first providing an explicit caching mechanism (`CachedTable cache()`).
4. As Till wrote, having an explicit `CachedTable cache()` doesn’t contradict future work on automated cost based caching.


At the same time I’m not sure if you have responded to our objections against `void cache()` being implicit/having side effects, which Jark, Fabian, Till, I, and I think also Shaoxuan are supporting.

Piotrek

> On 5 Dec 2018, at 12:42, Becket Qin <be...@gmail.com> wrote:
> 
> Hi Till,
> 
> It is true that after the first job submission, there will be no ambiguity
> in terms of whether a cached table is used or not. The same holds for a
> cache() that does not return a CachedTable.
> 
>> Conceptually one could think of cache() as introducing a caching operator
>> from which you need to consume if you want to benefit from the caching
>> functionality.
> 
> I am thinking a little differently. I think it is a hint (as you mentioned
> later) rather than a new operator. I'd like to be careful about the semantics
> of the API. A hint is a property set on an existing operator; it is not itself
> an operator, as it does not really manipulate the data.
> 
>> I agree, ideally the optimizer makes this kind of decision which
>> intermediate result should be cached. But especially when executing ad-hoc
>> queries the user might better know which results need to be cached because
>> Flink might not see the full DAG. In that sense, I would consider the
>> cache() method as a hint for the optimizer. Of course, in the future we
>> might add functionality which tries to automatically cache results (e.g.
>> caching the latest intermediate results until so and so much space is
>> used). But this should hopefully not contradict with `CachedTable cache()`.
> 
> I agree that the cache() method is needed for exactly the reason you mentioned,
> i.e. Flink cannot predict what users are going to write later, so users
> need to tell Flink explicitly that this table will be used later. What I
> meant is that assuming there is already a cached table, ideally users need
> not specify whether the next query should read from the cache or use the
> original DAG. This should be decided by the optimizer.
> 
> To explain the difference between returning / not returning a CachedTable,
> I want to compare the following two cases:
> 
> *Case 1: returning a CachedTable*
> b = a.map(...)
> val cachedTableA1 = a.cache()
> val cachedTableA2 = a.cache()
> b.print() // Just to make sure a is cached.
> 
> c = a.filter(...) // Does the user specify that the original DAG is used? Or
> does the optimizer decide whether the DAG or the cache should be used?
> d = cachedTableA1.filter() // The user specifies that the cached table is used.
> 
> a.unCache() // Can cachedTableA1 still be used afterwards?
> cachedTableA1.unCache() // Can cachedTableA2 still be used?
> 
> *Case 2: not returning a CachedTable*
> b = a.map(...)
> a.cache()
> a.cache() // no-op
> b.print() // Just to make sure a is cached
> 
> c = a.filter(...) // Optimizer decides whether the cache or DAG should be used
> d = a.filter(...) // Optimizer decides whether the cache or DAG should be used
> 
> a.unCache()
> a.unCache() // no-op
> 
> In case 1, semantics-wise, the optimizer loses the option to choose between
> the DAG and the cache. And the unCache() call becomes tricky.
> In case 2, users do not need to worry about whether the cache or the DAG is
> used, and the unCache() semantics are clear. However, the caveat is that users
> cannot explicitly ignore the cache.
> 
> In order to address the issues mentioned in case 2, and inspired by the
> discussion so far, I am thinking about using a hint to allow users to
> explicitly ignore the cache. We do not have hints yet, but we probably
> should have them. So the code becomes:
> 
> *Case 3: returning this table*
> b = a.map(...)
> a.cache()
> a.cache() // no-op
> b.print() // Just to make sure a is cached
> 
> c = a.filter(...) // Optimizer decides whether the cache or DAG should be used
> d = a.hint("ignoreCache").filter(...) // DAG will be used instead of the cache.
> 
> a.unCache()
> a.unCache() // no-op
> 
> We could also let cache() return this table to allow chained method calls.
> Do you think this API addresses the concerns?
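> 
> As a sketch, the Case 3 shape could look roughly like the following (the
> hint() method is hypothetical here, nothing like it exists yet):
> 
> Table a = tableEnv.scan("src").select("f1, f2");
> a.cache();                      // returns this table; no-op if already cached
> Table c = a.filter("f1 > 10");  // optimizer picks the cache or the DAG
> Table d = a.hint("ignoreCache") // hypothetical hint: always use the DAG
>            .filter("f1 > 10");
> a.unCache();                    // drops the cache; a itself stays valid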
> 
> Thanks,
> 
> Jiangjie (Becket) Qin
> 
> 
> On Wed, Dec 5, 2018 at 10:55 AM Jark Wu <im...@gmail.com> wrote:
> 
>> Hi,
>> 
>> All the recent discussions have focused on whether there is a problem if
>> cache() does not return a Table.
>> It seems that returning a Table explicitly is more clear (and safe?).
>> 
>> So, are there any problems if cache() returns a Table? @Becket
>> 
>> Best,
>> Jark
>> 
>> On Tue, 4 Dec 2018 at 22:27, Till Rohrmann <tr...@apache.org> wrote:
>> 
>>> It's true that b, c, d and e will all read from the original DAG that generates a. But all subsequent operators (when running multiple queries) which reference cachedTableA should not need to reproduce `a` but directly consume the intermediate result.
>>> 
>>> Conceptually one could think of cache() as introducing a caching operator from which you need to consume if you want to benefit from the caching functionality.
>>> 
>>> I agree, ideally the optimizer makes this kind of decision which intermediate result should be cached. But especially when executing ad-hoc queries the user might better know which results need to be cached because Flink might not see the full DAG. In that sense, I would consider the cache() method as a hint for the optimizer. Of course, in the future we might add functionality which tries to automatically cache results (e.g. caching the latest intermediate results until so and so much space is used). But this should hopefully not contradict with `CachedTable cache()`.
>>> 
>>> Cheers,
>>> Till
>>> 
>>> On Tue, Dec 4, 2018 at 2:33 PM Becket Qin <be...@gmail.com> wrote:
>>> 
>>>> Hi Till,
>>>> 
>>>> Thanks for the clarification. I am still a little confused.
>>>> 
>>>> If cache() returns a CachedTable, the example might become:
>>>> 
>>>> b = a.map(...)
>>>> c = a.map(...)
>>>> 
>>>> cachedTableA = a.cache()
>>>> d = cachedTableA.map(...)
>>>> e = a.map(...)
>>>> 
>>>> In the above case, if cache() is lazily evaluated, b, c, d and e are all going to be reading from the original DAG that generates a. But with a naive expectation, d should be reading from the cache. This seems not to solve the potential confusion you raised, right?
>>>> 
>>>> Just to be clear, my understanding is all based on the assumption that the tables are immutable. Therefore, after a.cache(), the *cachedTableA* and the original table *a* should be completely interchangeable.
>>>> 
>>>> That said, I think a valid argument is optimization. There are indeed cases where reading from the original DAG could be faster than reading from the cache. For example:
>>>> 
>>>> a = ….filter('f1 > 100)
>>>> a.cache()
>>>> b = a.filter('f1 < 100)
>>>> 
>>>> Ideally the optimizer should be intelligent enough to decide which way is faster, without user intervention. In this case, it will identify that b would just be an empty table, and thus skip reading from the cache completely. But I agree that returning a CachedTable would give the user control over when to use the cache, even though I still feel that letting the optimizer handle this is a better option in the long run.
>>>> 
>>>> Thanks,
>>>> 
>>>> Jiangjie (Becket) Qin
>>>> 
>>>> 
>>>> 
>>>> 
>>>> On Tue, Dec 4, 2018 at 6:51 PM Till Rohrmann <tr...@apache.org> wrote:
>>>> 
>>>>> Yes you are right Becket that it still depends on the actual execution of the job whether a consumer reads from a cached result or not.
>>>>> 
>>>>> My point was actually about the properties of a (cached vs. non-cached) and not about the execution. I would not make cache trigger the execution of the job because one loses some flexibility by eagerly triggering the execution.
>>>>> 
>>>>> I tried to argue for an explicit CachedTable which is returned by the cache() method, like Piotr did, in order to make the API more explicit.
>>>>> 
>>>>> Cheers,
>>>>> Till
>>>>> 
>>>>> On Mon, Dec 3, 2018 at 4:23 PM Becket Qin <be...@gmail.com> wrote:
>>>>> 
>>>>>> Hi Till,
>>>>>> 
>>>>>> That is a good example. Just a minor correction: in this case, b, c and d will all consume from a non-cached a. This is because the cache will only be created on the very first job submission that generates the table to be cached.
>>>>>> 
>>>>>> If I understand correctly, this example is about whether the .cache() method should be eagerly evaluated or lazily evaluated. In other words, if the cache() method actually triggers a job that creates the cache, there will be no such confusion. Is that right?
>>>>>> 
>>>>>> In the example, although d will not consume from the cached Table while it looks supposed to, from a correctness perspective the code will still return correct results, assuming that tables are immutable.
>>>>>> 
>>>>>> Personally I feel it is OK because users probably won't really worry about whether the table is cached or not. And a lazy cache could avoid some unnecessary caching if the cached table is never actually needed in the user application. But I am not opposed to doing eager evaluation of cache().
>>>>>> 
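>>>>>> To spell out the eager vs. lazy difference, a rough sketch (assuming the
>>>>>> semantics described above; this is not an existing API):
>>>>>> 
>>>>>> Table a = ...;
>>>>>> a.cache();  // eager: submits a job right now and materializes a
>>>>>>             // lazy: only marks a; the cache is built as a by-product
>>>>>>             //       of the first job that happens to compute a
>>>>>> Table b = a.filter(...);
>>>>>> b.print();  // eager: this job already reads a from the cache
>>>>>>             // lazy: this job builds the cache, but b itself still
>>>>>>             //       consumes the original DAG
>>>>>> 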
>>>>>> Thanks,
>>>>>> 
>>>>>> Jiangjie (Becket) Qin
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> On Mon, Dec 3, 2018 at 10:01 PM Till Rohrmann <trohrmann@apache.org> wrote:
>>>>>> 
>>>>>>> Another argument for Piotr's point is that lazily changing properties of a node affects all downstream consumers but does not necessarily have to happen before these consumers are defined. From a user's perspective this can be quite confusing:
>>>>>>> 
>>>>>>> b = a.map(...)
>>>>>>> c = a.map(...)
>>>>>>> 
>>>>>>> a.cache()
>>>>>>> d = a.map(...)
>>>>>>> 
>>>>>>> Now b, c and d will consume from a cached operator. In this case, the user would most likely expect that only d reads from a cached result.
>>>>>>> 
>>>>>>> Cheers,
>>>>>>> Till
>>>>>>> 
>>>>>>> On Mon, Dec 3, 2018 at 11:32 AM Piotr Nowojski <piotr@data-artisans.com> wrote:
>>>>>>> 
>>>>>>>> Hey Shaoxuan and Becket,
>>>>>>>> 
>>>>>>>>> Can you explain a bit more on what the side effects are? So far my understanding is that such side effects only exist if a table is mutable. Is that the case?
>>>>>>>> 
>>>>>>>> Not only that. There are also performance implications, and those are another implicit side effect of using `void cache()`. As I wrote before, reading from the cache might not always be desirable, thus it can cause performance degradation and I’m fine with that - the user's or the optimiser’s choice. What I do not like is that this implicit side effect can manifest in a completely different part of the code, one that wasn’t touched by the user while he was adding the `void cache()` call somewhere else. And even if caching improves performance, it’s still a side effect of `void cache()`. Almost by definition, `void` methods have only side effects. As I wrote before, there are a couple of scenarios where this might be undesirable and/or unexpected, for example:
>>>>>>>> 
>>>>>>>> 1.
>>>>>>>> Table b = …;
>>>>>>>> b.cache()
>>>>>>>> x = b.join(…)
>>>>>>>> y = b.count()
>>>>>>>> // ... a hundred lines of code later ...
>>>>>>>> z = b.filter(…).groupBy(…) // this might even be hidden in a different method/file/package/dependency
>>>>>>>> 
>>>>>>>> 2.
>>>>>>>> 
>>>>>>>> Table b = ...
>>>>>>>> if (some_condition) {
>>>>>>>>   foo(b)
>>>>>>>> } else {
>>>>>>>>   bar(b)
>>>>>>>> }
>>>>>>>> z = b.filter(…).groupBy(…)
>>>>>>>> 
>>>>>>>> void foo(Table b) {
>>>>>>>>   b.cache()
>>>>>>>>   // do something with b
>>>>>>>> }
>>>>>>>> 
>>>>>>>> In both examples above, `b.cache()` will implicitly affect `z = b.filter(…).groupBy(…)` - both the semantics of the program (in case of mutable sources) and its performance - which might be far from obvious.
>>>>>>>> 
>>>>>>>> On top of that, there is still this argument of mine that having a `MaterializedTable` or `CachedTable` handle is more flexible for us in the future and for the user (as a manual option to bypass cache reads).
>>>>>>>> 
>>>>>>>>> But Jiangjie is correct, the source table in batching should be immutable. It is the user’s responsibility to ensure it, otherwise even a regular failover may lead to inconsistent results.
>>>>>>>> 
>>>>>>>> Yes, I agree that’s what a perfect world/good deployment should look like. But it often isn’t, and while I’m not trying to fix this (since the proper fix is to support transactions), I’m just trying to minimise confusion for the users that are not fully aware of what’s going on and operate in a less than perfect setup. And if something bites them after adding a `b.cache()` call, I want to make sure that they at least know all of the places that adding this line can affect.
>>>>>>>> 
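>>>>>>>> For comparison, a rough sketch of example 2 with an explicit handle (the
>>>>>>>> `CachedTable` type is hypothetical): the caching decision becomes visible
>>>>>>>> at every use site.
>>>>>>>> 
>>>>>>>> Table b = ...;
>>>>>>>> if (some_condition) {
>>>>>>>>   b = foo(b);               // caller sees that caching happened
>>>>>>>> } else {
>>>>>>>>   bar(b);
>>>>>>>> }
>>>>>>>> z = b.filter(…).groupBy(…)  // reads the cache only if foo() ran
>>>>>>>> 
>>>>>>>> Table foo(Table b) {
>>>>>>>>   CachedTable cached = b.cache();
>>>>>>>>   // do something with cached
>>>>>>>>   return cached;            // caching is now part of the method's contract
>>>>>>>> }
>>>>>>>> 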
>>>>>>>> Thanks, Piotrek
>>>>>>>> 
>>>>>>>>> On 1 Dec 2018, at 15:39, Becket Qin <be...@gmail.com> wrote:
>>>>>>>>> 
>>>>>>>>> Hi Piotrek,
>>>>>>>>> 
>>>>>>>>> Thanks again for the clarification. Some more replies follow.
>>>>>>>>> 
>>>>>>>>>> But keep in mind that `.cache()` will/might not only be used in interactive programming and not only in batching.
>>>>>>>>> 
>>>>>>>>> It is true. Actually, in stream processing cache() has the same semantics as in batch processing, namely: for a table created via a series of computations, save that table for later reference, to avoid re-running the computation logic to regenerate the table; once the application exits, drop all the caches. These semantics are the same for both batch and stream processing. The difference is that stream applications will only run once, as they are long running, while batch applications may be run multiple times, hence the cache may be created and dropped each time the application runs. Admittedly, there will probably be some resource management requirements for a streaming cached table, such as time-based / size-based retention, to address the infinite data issue. But such requirements do not change the semantics. You are right that interactive programming is just one use case of cache(). It is not the only use case.
>>>>>>>>> 
>>>>>>>>>> For me the more important issue is of not having the `void cache()` with side effects.
>>>>>>>>> 
>>>>>>>>> This is indeed the key point. The argument around whether cache() should return something already indicates that cache() and materialize() address different issues. Can you explain a bit more on what the side effects are? So far my understanding is that such side effects only exist if a table is mutable. Is that the case?
>>>>>>>>> 
>>>>>>>>>> I don’t know, probably initially we should make CachedTable read-only. I don’t find it more confusing than the fact that user can not write to views or materialised views in SQL or that user currently can not write to a Table.
>>>>>>>>> 
>>>>>>>>> I don't think anyone should insert something into a cache. By definition, the cache should only be updated when the corresponding original table is updated. What I am wondering about is that, given the following two facts:
>>>>>>>>> 1. If and only if a table is mutable (with something like insert()), a CachedTable may have implicit behavior.
>>>>>>>>> 2. A CachedTable extends a Table.
>>>>>>>>> we can come to the conclusion that a CachedTable is mutable and users can insert into the CachedTable directly. This is where I found it confusing.
>>>>>>>>> 
>>>>>>>>> Thanks,
>>>>>>>>> 
>>>>>>>>> Jiangjie (Becket) Qin
>>>>>>>>> 
>>>>>>>>> On Sat, Dec 1, 2018 at 2:45 AM Piotr Nowojski <piotr@data-artisans.com> wrote:
>>>>>>>>> 
>>>>>>>>>> Hi all,
>>>>>>>>>> 
>>>>>>>>>> Regarding naming `cache()` vs `materialize()`. One more explanation of why `materialize()` is more natural to me is that I think of all “Table”s in the Table API as views. They behave the same way as SQL views; the only difference for me is that their life scope is short - the current session, which is limited by the different execution model. That’s why “caching” a view for me is just materialising it.
>>>>>>>>>> 
>>>>>>>>>> However, I see and understand your point of view. Coming from DataSet/DataStream and, generally speaking, the non-SQL world, `cache()` is more natural. But keep in mind that `.cache()` will/might not only be used in interactive programming and not only in batching. Naming is one issue though, and not that critical to me. Especially since, once we implement proper materialised views, we can always deprecate/rename `cache()` if we deem so.
>>>>>>>>>> 
>>>>>>>>>> For me the more important issue is not having the `void cache()` with side effects, exactly for the reasons that you have mentioned. True: results might be non-deterministic if the underlying source tables are changing. The problem is that `void cache()` implicitly changes the semantics of subsequent uses of the cached/materialized Table. It can cause a “wtf” moment for a user if he inserts a “b.cache()” call in some place in his code and suddenly some other random places behave differently. If `materialize()` or `cache()` returns a Table handle, we force the user to explicitly use the cache, which removes the “random” part from the “suddenly some other random places are behaving differently”.
>>>>>>>>>> 
>>>>>>>>>> This argument and the others that I’ve raised (greater flexibility/allowing the user to explicitly bypass the cache) are independent of the `cache()` vs `materialize()` discussion.
>>>>>>>>>> 
>>>>>>>>>>> Does that mean one can also insert into the CachedTable? This sounds pretty confusing.
>>>>>>>>>> 
>>>>>>>>>> I don’t know, probably initially we should make CachedTable read-only. I don’t find it more confusing than the fact that a user can not write to views or materialised views in SQL, or that a user currently can not write to a Table.
>>>>>>>>>> 
>>>>>>>>>> Piotrek
>>>>>>>>>> 
>>>>>>>>>>> On 30 Nov 2018, at 17:38, Xingcan Cui <xi...@gmail.com>
>>>>> wrote:
>>>>>>>>>>> 
>>>>>>>>>>> Hi all,
>>>>>>>>>>> 
>>>>>>>>>>> I agree with @Becket that `cache()` and `materialize()`
>>> should
>>>> be
>>>>>>>>>> considered as two different methods where the later one is
>>> more
>>>>>>>>>> sophisticated.
>>>>>>>>>>> 
>>>>>>>>>>> According to my understanding, the initial idea is just to
>>>>>> introduce
>>>>>>> a
>>>>>>>>>> simple cache or persist mechanism, but as the TableAPI is a
>>>>>> high-level
>>>>>>>> API,
>>>>>>>>>> it’s naturally for as to think in a SQL way.
>>>>>>>>>>> 
>>>>>>>>>>> Maybe we can add the `cache()` method to the DataSet API
>> and
>>>>> force
>>>>>>>> users
>>>>>>>>>> to translate a Table to a Dataset before caching it. Then
>> the
>>>>> users
>>>>>>>> should
>>>>>>>>>> manually register the cached dataset to a table again (we
>> may
>>>> need
>>>>>>> some
>>>>>>>>>> table replacement mechanisms for datasets with an identical
>>>> schema
>>>>>> but
>>>>>>>>>> different contents here). After all, it’s the dataset rather
>>>> than
>>>>>> the
>>>>>>>>>> dynamic table that need to be cached, right?
>>>>>>>>>>> 
>>>>>>>>>>> Best,
>>>>>>>>>>> Xingcan
>>>>>>>>>>> 
>>>>>>>>>>>> On Nov 30, 2018, at 10:57 AM, Becket Qin <
>>>> becket.qin@gmail.com>
>>>>>>>> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>> Hi Piotrek and Jark,
>>>>>>>>>>>> 
>>>>>>>>>>>> Thanks for the feedback and explanation. Those are good
>>>>> arguments.
>>>>>>>> But I
>>>>>>>>>>>> think those arguments are mostly about materialized view.
>>> Let
>>>> me
>>>>>> try
>>>>>>>> to
>>>>>>>>>>>> explain the reason I believe cache() and materialize() are
>>>>>>> different.
>>>>>>>>>>>> 
>>>>>>>>>>>> I think cache() and materialize() have quite different
>>>>>> implications.
>>>>>>>> An
>>>>>>>>>>>> analogy I can think of is save()/publish(). When users
>> call
>>>>>> cache(),
>>>>>>>> it
>>>>>>>>>> is
>>>>>>>>>>>> just like they are saving an intermediate result as a
>> draft
>>> of
>>>>>> their
>>>>>>>>>> work,
>>>>>>>>>>>> this intermediate result may not have any realistic
>> meaning.
>>>>>> Calling
>>>>>>>>>>>> cache() does not mean users want to publish the cached
>> table
>>>> in
>>>>>> any
>>>>>>>>>> manner.
>>>>>>>>>>>> But when users call materialize(), that means "I have
>>>> something
>>>>>>>>>> meaningful
>>>>>>>>>>>> to be reused by others", now users need to think about the
>>>>>>> validation,
>>>>>>>>>>>> update & versioning, lifecycle of the result, etc.
>>>>>>>>>>>> 
>>>>>>>>>>>> Piotrek's suggestions on variations of the materialize()
>>>> methods
>>>>>> are
>>>>>>>>>> very
>>>>>>>>>>>> useful. It would be great if Flink have them. The concept
>> of
>>>>>>>>>> materialized
>>>>>>>>>>>> view is actually a pretty big feature, not to say the
>>> related
>>>>>> stuff
>>>>>>>> like
>>>>>>>>>>>> triggers/hooks you mentioned earlier. I think the
>>> materialized
>>>>>> view
>>>>>>>>>> itself
>>>>>>>>>>>> should be discussed in a more thorough and systematic
>>> manner.
>>>>> And
>>>>>> I
>>>>>>>>>> found
>>>>>>>>>>>> that discussion is kind of orthogonal and way beyond
>>>> interactive
>>>>>>>>>>>> programming experience.
>>>>>>>>>>>> 
>>>>>>>>>>>> The example you gave was interesting. I still have some
>>>>> questions,
>>>>>>>>>> though.
>>>>>>>>>>>> 
>>>>>>>>>>>> Table source = … // some source that scans files from a
>>>>> directory
>>>>>>>>>>>>> “/foo/bar/“
>>>>>>>>>>>>> Table t1 = source.groupBy(…).select(…).where(…) ….;
>>>>>>>>>>>>> Table t2 = t1.materialize() // (or `cache()`)
>>>>>>>>>>>> 
>>>>>>>>>>>> t2.count() // initialise cache (if it’s lazily
>> initialised)
>>>>>>>>>>>>> int a1 = t1.count()
>>>>>>>>>>>>> int b1 = t2.count()
>>>>>>>>>>>>> // something in the background (or we trigger it) writes
>>> new
>>>>>> files
>>>>>>> to
>>>>>>>>>>>>> /foo/bar
>>>>>>>>>>>>> int a2 = t1.count()
>>>>>>>>>>>>> int b2 = t2.count()
>>>>>>>>>>>>> t2.refresh() // possible future extension, not to be
>>>>> implemented
>>>>>> in
>>>>>>>> the
>>>>>>>>>>>>> initial version
>>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> what if someone else added some more files to /foo/bar at
>>> this
>>>>>>> point?
>>>>>>>> In
>>>>>>>>>>>> that case, a3 won't equals to b3, and the result become
>>>>>>>>>> non-deterministic,
>>>>>>>>>>>> right?
>>>>>>>>>>>> 
>>>>>>>>>>>> int a3 = t1.count()
>>>>>>>>>>>>> int b3 = t2.count()
>>>>>>>>>>>>> t2.drop() // another possible future extension, manual
>>>> “cache”
>>>>>>>> dropping
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> When we talk about interactive programming, in most cases,
>>> we
>>>>> are
>>>>>>>>>> talking
>>>>>>>>>>>> about batch applications. A fundamental assumption of such
>>>> case
>>>>> is
>>>>>>>> that
>>>>>>>>>> the
>>>>>>>>>>>> source data is complete before the data processing begins,
>>> and
>>>>> the
>>>>>>>> data
>>>>>>>>>>>> will not change during the data processing. IMO, if
>>> additional
>>>>>> rows
>>>>>>>>>> needs
>>>>>>>>>>>> to be added to some source during the processing, it
>> should
>>> be
>>>>>> done
>>>>>>> in
>>>>>>>>>> ways
>>>>>>>>>>>> like union the source with another table containing the
>> rows
>>>> to
>>>>> be
>>>>>>>>>> added.
>>>>>>>>>>>> 
>>>>>>>>>>>> There are a few cases that computations are executed
>>>> repeatedly
>>>>> on
>>>>>>> the
>>>>>>>>>>>> changing data source.
>>>>>>>>>>>> 
>>>>>>>>>>>> For example, people may run a ML training job every hour
>>> with
>>>>> the
>>>>>>>>>> samples
>>>>>>>>>>>> newly added in the past hour. In that case, the source
>> data
>>>>>> between
>>>>>>>> will
>>>>>>>>>>>> indeed change. But still, the data remain unchanged within
>>> one
>>>>>> run.
>>>>>>>> And
>>>>>>>>>>>> usually in that case, the result will need versioning,
>> i.e.
>>>> for
>>>>> a
>>>>>>>> given
>>>>>>>>>>>> result, it tells that the result is a result from the
>> source
>>>>> data
>>>>>>> by a
>>>>>>>>>>>> certain timestamp.
>>>>>>>>>>>> 
>>>>>>>>>>>> Another example is something like data warehouse. In this
>>>> case,
>>>>>>> there
>>>>>>>>>> are a
>>>>>>>>>>>> few source of original/raw data. On top of those sources,
>>> many
>>>>>>>>>> materialized
>>>>>>>>>>>> view / queries / reports / dashboards can be created to
>>>> generate
>>>>>>>> derived
>>>>>>>>>>>> data. Those derived data needs to be updated when the
>>>> underlying
>>>>>>>>>> original
>>>>>>>>>>>> data changes. In that case, the processing logic that
>>> derives
>>>>> the
>>>>>>>>>> original
>>>>>>>>>>>> data needs to be executed repeatedly to update those
>>>>>> reports/views.
>>>>>>>>>> Again,
>>>>>>>>>>>> all those derived data also need to have version
>> management,
>>>>> such
>>>>>> as
>>>>>>>>>>>> timestamp.
>>>>>>>>>>>> 
>>>>>>>>>>>> In any of the above two cases, during a single run of the
>>>>>> processing
>>>>>>>>>> logic,
>>>>>>>>>>>> the data cannot change. Otherwise the behavior of the
>>>> processing
>>>>>>> logic
>>>>>>>>>> may
>>>>>>>>>>>> be undefined. In the above two examples, when writing the
>>>>>> processing
>>>>>>>>>> logic,
>>>>>>>>>>>> Users can use .cache() to hint Flink that those results
>>> should
>>>>> be
>>>>>>>> saved
>>>>>>>>>> to
>>>>>>>>>>>> avoid repeated computation. And then for the result of my
>>>>>>> application
>>>>>>>>>>>> logic, I'll call materialize(), so that these results
>> could
>>> be
>>>>>>> managed
>>>>>>>>>> by
>>>>>>>>>>>> the system with versioning, metadata management, lifecycle
>>>>>>> management,
>>>>>>>>>>>> ACLs, etc.
>>>>>>>>>>>> 
>>>>>>>>>>>> It is true we can use materialize() to do the cache() job,
>>>> but I
>>>>>> am
>>>>>>>>>> really
>>>>>>>>>>>> reluctant to shoehorn cache() into materialize() and force
>>>> users
>>>>>> to
>>>>>>>>>> worry
>>>>>>>>>>>> about a bunch of implications that they needn't have to. I
>>> am
>>>>>>>>>> absolutely on
>>>>>>>>>>>> your side that redundant API is bad. But it is equally
>>>>>> frustrating,
>>>>>>> if
>>>>>>>>>> not
>>>>>>>>>>>> more, that the same API does different things.
>>>>>>>>>>>> 
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> 
>>>>>>>>>>>> Jiangjie (Becket) Qin
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> On Fri, Nov 30, 2018 at 10:34 PM Shaoxuan Wang <
>>>>>> wshaoxuan@gmail.com
>>>>>>>> 
>>>>>>>>>> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>>> Thanks Piotrek,
>>>>>>>>>>>>> You provided a very good example, it explains all the
>>>>> confusions
>>>>>> I
>>>>>>>>>> have.
>>>>>>>>>>>>> It is clear that there is something we have not
>> considered
>>> in
>>>>> the
>>>>>>>>>> initial
>>>>>>>>>>>>> proposal. We intend to force the user to reuse the
>>>>>>>> cached/materialized
>>>>>>>>>>>>> table, if its cache() method is executed. We did not
>> expect
>>>>> that
>>>>>>> user
>>>>>>>>>> may
>>>>>>>>>>>>> want to re-executed the plan from the source table. Let
>> me
>>>>>> re-think
>>>>>>>>>> about
>>>>>>>>>>>>> it and get back to you later.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> In the meanwhile, this example/observation also infers
>> that
>>>> we
>>>>>>> cannot
>>>>>>>>>> fully
>>>>>>>>>>>>> involve the optimizer to decide the plan if a
>>>> cache/materialize
>>>>>> is
>>>>>>>>>>>>> explicitly used, because weather to reuse the cache data
>> or
>>>>>>>> re-execute
>>>>>>>>>> the
>>>>>>>>>>>>> query from source data may lead to different results.
>> (But
>>> I
>>>>>> guess
>>>>>>>>>>>>> optimizer can still help in some cases ---- as long as it
>>>> does
>>>>>> not
>>>>>>>>>>>>> re-execute from the varied source, we should be safe).
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>> Shaoxuan
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On Fri, Nov 30, 2018 at 9:13 PM Piotr Nowojski <
>>>>>>>>>> piotr@data-artisans.com>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Hi Shaoxuan,
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Re 2:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Table t3 = methodThatAppliesOperators(t1) // t1 is
>>> modified
>>>>>> to->
>>>>>>>> t1’
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> What do you mean that “ t1 is modified to-> t1’ ” ? That
>>>>>>>>>>>>>> `methodThatAppliesOperators()` method has changed it’s
>>> plan?
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> I was thinking more about something like this:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Table source = … // some source that scans files from a
>>>>>> directory
>>>>>>>>>>>>>> “/foo/bar/“
>>>>>>>>>>>>>> Table t1 = source.groupBy(…).select(…).where(…) ….;
>>>>>>>>>>>>>> Table t2 = t1.materialize() // (or `cache()`)
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> t2.count() // initialise cache (if it’s lazily
>>> initialised)
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> int a1 = t1.count()
>>>>>>>>>>>>>> int b1 = t2.count()
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> // something in the background (or we trigger it) writes
>>> new
>>>>>> files
>>>>>>>> to
>>>>>>>>>>>>>> /foo/bar
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> int a2 = t1.count()
>>>>>>>>>>>>>> int b2 = t2.count()
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> t2.refresh() // possible future extension, not to be
>>>>> implemented
>>>>>>> in
>>>>>>>>>> the
>>>>>>>>>>>>>> initial version
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> int a3 = t1.count()
>>>>>>>>>>>>>> int b3 = t2.count()
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> t2.drop() // another possible future extension, manual
>>>> “cache”
>>>>>>>>>> dropping
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> assertTrue(a1 == b1) // same results, but b1 comes from
>>> the
>>>>>>> “cache"
>>>>>>>>>>>>>> assertTrue(b1 == b2) // both values come from the same
>>> cache
>>>>>>>>>>>>>> assertTrue(a2 > b2) // b2 comes from cache, a2
>> re-executed
>>>>> full
>>>>>>>> table
>>>>>>>>>>>>> scan
>>>>>>>>>>>>>> and has more data
>>>>>>>>>>>>>> assertTrue(b3 > b2) // b3 comes from refreshed cache
>>>>>>>>>>>>>> assertTrue(b3 == a2 == a3)
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Piotrek
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> On 30 Nov 2018, at 10:22, Jark Wu <im...@gmail.com>
>>>> wrote:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> It is an very interesting and useful design!
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Here I want to share some of my thoughts:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 1. Agree with that cache() method should return some
>>> Table
>>>> to
>>>>>>> avoid
>>>>>>>>>>>>> some
>>>>>>>>>>>>>>> unexpected problems because of the mutable object.
>>>>>>>>>>>>>>> All the existing methods of Table are returning a new
>>> Table
>>>>>>>> instance.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 2. I think materialize() would be more consistent with
>>> SQL,
>>>>>> this
>>>>>>>>>> makes
>>>>>>>>>>>>> it
>>>>>>>>>>>>>>> possible to support the same feature for SQL
>> (materialize
>>>>> view)
>>>>>>> and
>>>>>>>>>>>>> keep
>>>>>>>>>>>>>>> the same API for users in the future.
>>>>>>>>>>>>>>> But I'm also fine if we choose cache().
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 3. In the proposal, a TableService (or FlinkService?)
>> is
>>>> used
>>>>>> to
>>>>>>>>>> cache
>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>> result of the (intermediate) table.
>>>>>>>>>>>>>>> But the name of TableService may be a bit general which
>>> is
>>>>> not
>>>>>>>> quite
>>>>>>>>>>>>>>> understanding correctly in the first glance (a
>> metastore
>>>> for
>>>>>>>>>> tables?).
>>>>>>>>>>>>>>> Maybe a more specific name would be better, such as
>>>>>>>> TableCacheSerive
>>>>>>>>>>>>> or
>>>>>>>>>>>>>>> TableMaterializeSerivce or something else.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>> Jark
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> On Thu, 29 Nov 2018 at 21:16, Fabian Hueske <
>>>>> fhueske@gmail.com
>>>>>>> 
>>>>>>>>>> wrote:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Thanks for the clarification Becket!
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> I have a few thoughts to share / questions:
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 1) I'd like to know how you plan to implement the
>>> feature
>>>>> on a
>>>>>>>> plan
>>>>>>>>>> /
>>>>>>>>>>>>>>>> planner level.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> I would imaging the following to happen when
>>> Table.cache()
>>>>> is
>>>>>>>>>> called:
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 1) immediately optimize the Table and internally
>> convert
>>>> it
>>>>>>> into a
>>>>>>>>>>>>>>>> DataSet/DataStream. This is necessary, to avoid that
>>>>> operators
>>>>>>> of
>>>>>>>>>>>>> later
>>>>>>>>>>>>>>>> queries on top of the Table are pushed down.
>>>>>>>>>>>>>>>> 2) register the DataSet/DataStream as a
>>>>>>> DataSet/DataStream-backed
>>>>>>>>>>>>> Table
>>>>>>>>>>>>>> X
>>>>>>>>>>>>>>>> 3) add a sink to the DataSet/DataStream. This is the
>>>>>>>> materialization
>>>>>>>>>>>>> of
>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>> Table X
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Based on your proposal the following would happen:
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Table t1 = ....
>>>>>>>>>>>>>>>> t1.cache(); // cache() returns void. The logical plan
>> of
>>>> t1
>>>>> is
>>>>>>>>>>>>> replaced
>>>>>>>>>>>>>> by
>>>>>>>>>>>>>>>> a scan of X. There is also a reference to the
>>>>> materialization
>>>>>> of
>>>>>>>> X.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> t1.count(); // this executes the program, including
>> the
>>>>>>>>>>>>>> DataSet/DataStream
>>>>>>>>>>>>>>>> that backs X and the sink that writes the
>>> materialization
>>>>> of X
>>>>>>>>>>>>>>>> t1.count(); // this executes the program, but reads X
>>> from
>>>>> the
>>>>>>>>>>>>>>>> materialization.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> My question is, how do you determine when whether the
>>> scan
>>>>> of
>>>>>> t1
>>>>>>>>>>>>> should
>>>>>>>>>>>>>> go
>>>>>>>>>>>>>>>> against the DataSet/DataStream program and when
>> against
>>>> the
>>>>>>>>>>>>>>>> materialization?
>>>>>>>>>>>>>>>> AFAIK, there is no hook that will tell you that a part
>>> of
>>>>> the
>>>>>>>>>> program
>>>>>>>>>>>>>> was
>>>>>>>>>>>>>>>> executed. Flipping a switch during optimization or
>> plan
>>>>>>> generation
>>>>>>>>>> is
>>>>>>>>>>>>>> not
>>>>>>>>>>>>>>>> sufficient as there is no guarantee that the plan is
>>> also
>>>>>>>> executed.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Overall, this behavior is somewhat similar to what I
>>>>> proposed
>>>>>> in
>>>>>>>>>>>>>>>> FLINK-8950, which does not include persisting the
>> table,
>>>> but
>>>>>>> just
>>>>>>>>>>>>>>>> optimizing and reregistering it as DataSet/DataStream
>>>> scan.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 2) I think Piotr has a point about the implicit
>> behavior
>>>> and
>>>>>>> side
>>>>>>>>>>>>>> effects
>>>>>>>>>>>>>>>> of the cache() method if it does not return anything.
>>>>>>>>>>>>>>>> Consider the following example:
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Table t1 = ???
>>>>>>>>>>>>>>>> Table t2 = methodThatAppliesOperators(t1);
>>>>>>>>>>>>>>>> Table t3 = methodThatAppliesOtherOperators(t1);
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> In this case, the behavior/performance of the plan
>> that
>>>>>> results
>>>>>>>> from
>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>> second method call depends on whether t1 was modified
>> by
>>>> the
>>>>>>> first
>>>>>>>>>>>>>> method
>>>>>>>>>>>>>>>> or not.
>>>>>>>>>>>>>>>> This is the classic issue of mutable vs. immutable
>>>> objects.
>>>>>>>>>>>>>>>> Also, as Piotr pointed out, it might also be good to
>>> have
>>>>> the
>>>>>>>>>> original
>>>>>>>>>>>>>> plan
>>>>>>>>>>>>>>>> of t1, because in some cases it is possible to push
>>>> filters
>>>>>> down
>>>>>>>>>> such
>>>>>>>>>>>>>> that
>>>>>>>>>>>>>>>> evaluating the query from scratch might be more
>>> efficient
>>>>> than
>>>>>>>>>>>>> accessing
>>>>>>>>>>>>>>>> the cache.
>>>>>>>>>>>>>>>> Moreover, a CachedTable could extend Table() and
>> offer a
>>>>>> method
>>>>>>>>>>>>>> refresh().
>>>>>>>>>>>>>>>> This sounds quite useful in an interactive session
>> mode.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 3) Regarding the name, I can see both arguments. IMO,
>>>>>>>> materialize()
>>>>>>>>>>>>>> seems
>>>>>>>>>>>>>>>> to be more future proof.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Best, Fabian
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Am Do., 29. Nov. 2018 um 12:56 Uhr schrieb Shaoxuan
>>> Wang <
>>>>>>>>>>>>>>>> wshaoxuan@gmail.com>:
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Hi Piotr,
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Thanks for sharing your ideas on the method naming.
>> We
>>>> will
>>>>>>> think
>>>>>>>>>>>>> about
>>>>>>>>>>>>>>>>> your suggestions. But I don't understand why we need
>> to
>>>>>> change
>>>>>>>> the
>>>>>>>>>>>>>> return
>>>>>>>>>>>>>>>>> type of cache().
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Cache() is a physical operation, it does not change
>> the
>>>>> logic
>>>>>>> of
>>>>>>>>>>>>>>>>> the `Table`. On the tableAPI layer, we should not
>>>>> introduce a
>>>>>>> new
>>>>>>>>>>>>> table
>>>>>>>>>>>>>>>>> type unless the logic of table has been changed. If
>> we
>>>>>>> introduce
>>>>>>>> a
>>>>>>>>>>>>> new
>>>>>>>>>>>>>>>>> table type `CachedTable`, we need create the same set
>>> of
>>>>>>> methods
>>>>>>>> of
>>>>>>>>>>>>>>>> `Table`
>>>>>>>>>>>>>>>>> for it. I don't think it is worth doing this. Or can
>>> you
>>>>>> please
>>>>>>>>>>>>>> elaborate
>>>>>>>>>>>>>>>>> more on what could be the "implicit behaviours/side
>>>>> effects"
>>>>>>> you
>>>>>>>>>> are
>>>>>>>>>>>>>>>>> thinking about?
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>> Shaoxuan
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> On Thu, Nov 29, 2018 at 7:05 PM Piotr Nowojski <
>>>>>>>>>>>>>> piotr@data-artisans.com>
>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Hi Becket,
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Thanks for the response.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 1. I wasn’t saying that materialised view must be
>>>> mutable
>>>>> or
>>>>>>>> not.
>>>>>>>>>>>>> The
>>>>>>>>>>>>>>>>> same
>>>>>>>>>>>>>>>>>> thing applies to caches as well. To the contrary, I
>>>> would
>>>>>>> expect
>>>>>>>>>>>>> more
>>>>>>>>>>>>>>>>>> consistency and updates from something that is
>> called
>>>>>> “cache”
>>>>>>> vs
>>>>>>>>>>>>>>>>> something
>>>>>>>>>>>>>>>>>> that’s a “materialised view”. In other words, IMO
>> most
>>>>>> caches
>>>>>>> do
>>>>>>>>>> not
>>>>>>>>>>>>>>>>> serve
>>>>>>>>>>>>>>>>>> you invalid/outdated data and they handle updates on
>>>> their
>>>>>>> own.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 2. I don’t think that having in the future two very
>>>>> similar
>>>>>>>>>> concepts
>>>>>>>>>>>>>> of
>>>>>>>>>>>>>>>>>> `materialized` view and `cache` is a good idea. It
>>> would
>>>>> be
>>>>>>>>>>>>> confusing
>>>>>>>>>>>>>>>> for
>>>>>>>>>>>>>>>>>> the users. I think it could be handled by
>>>>>>> variations/overloading
>>>>>>>>>> of
>>>>>>>>>>>>>>>>>> materialised view concept. We could start with:
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> `MaterializedTable materialize()` - immutable,
>> session
>>>>> life
>>>>>>>> scope
>>>>>>>>>>>>>>>>>> (basically the same semantic as you are proposing
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> And then in the future (if ever) build on top of
>>>>> that/expand
>>>>>>> it
>>>>>>>>>>>>> with:
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> `MaterializedTable materialize(refreshTime=…)` or
>>>>>>>>>> `MaterializedTable
>>>>>>>>>>>>>>>>>> materialize(refreshHook=…)`
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Or with cross session support:
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> `MaterializedTable materializeInto(connector=…)` or
>>>>>>>>>>>>> `MaterializedTable
>>>>>>>>>>>>>>>>>> materializeInto(tableFactory=…)`
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> I’m not saying that we should implement cross
>>>>>>> session/refreshing
>>>>>>>>>> now
>>>>>>>>>>>>>> or
>>>>>>>>>>>>>>>>>> even in the near future. I’m just arguing that
>> naming
>>>>>> current
>>>>>>>>>>>>>> immutable
>>>>>>>>>>>>>>>>>> session life scope method `materialize()` is more
>>> future
>>>>>> proof
>>>>>>>> and
>>>>>>>>>>>>>> more
>>>>>>>>>>>>>>>>>> consistent with SQL (on which after all table-api is
>>>>> heavily
>>>>>>>>>> basing
>>>>>>>>>>>>>>>> on).
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 3. Even if we agree on naming it `cache()`, I would
>>>> still
>>>>>>> insist
>>>>>>>>>> on
>>>>>>>>>>>>>>>>>> `cache()` returning `CachedTable` handle to avoid
>>>> implicit
>>>>>>>>>>>>>>>>> behaviours/side
>>>>>>>>>>>>>>>>>> effects and to give both us & users more
>> flexibility.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Piotrek
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> On 29 Nov 2018, at 06:20, Becket Qin <
>>>>> becket.qin@gmail.com
>>>>>>> 
>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Just to add a little bit, the materialized view is
>>>>> probably
>>>>>>>> more
>>>>>>>>>>>>>>>>> similar
>>>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>> the persistent() brought up earlier in the thread.
>> So
>>>> it
>>>>> is
>>>>>>>>>> usually
>>>>>>>>>>>>>>>>> cross
>>>>>>>>>>>>>>>>>>> session and could be used in a larger scope. For
>>>>> example, a
>>>>>>>>>>>>>>>>> materialized
>>>>>>>>>>>>>>>>>>> view created by user A may be visible to user B. It
>>> is
>>>>>>> probably
>>>>>>>>>>>>>>>>> something
>>>>>>>>>>>>>>>>>>> we want to have in the future. I'll put it in the
>>>> future
>>>>>> work
>>>>>>>>>>>>>>>> section.
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Jiangjie (Becket) Qin
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> On Thu, Nov 29, 2018 at 9:47 AM Becket Qin <
>>>>>>>> becket.qin@gmail.com
>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> Hi Piotrek,
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> Thanks for the explanation.
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> Right now we are mostly thinking of the cached
>> table
>>>> as
>>>>>>>>>>>>> immutable. I
>>>>>>>>>>>>>>>>> can
>>>>>>>>>>>>>>>>>>>> see the Materialized view would be useful in the
>>>> future.
>>>>>>> That
>>>>>>>>>>>>> said,
>>>>>>>>>>>>>>>> I
>>>>>>>>>>>>>>>>>> think
>>>>>>>>>>>>>>>>>>>> a simple cache mechanism is probably still needed.
>>> So
>>>> to
>>>>>> me,
>>>>>>>>>>>>> cache()
>>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>>>> materialize() should be two separate method as
>> they
>>>>>> address
>>>>>>>>>>>>>>>> different
>>>>>>>>>>>>>>>>>>>> needs. Materialize() is a higher level concept
>>> usually
>>>>>>>> implying
>>>>>>>>>>>>>>>>>> periodical
>>>>>>>>>>>>>>>>>>>> update, while cache() has much simpler semantic.
>> For
>>>>>>> example,
>>>>>>>>>> one
>>>>>>>>>>>>>>>> may
>>>>>>>>>>>>>>>>>>>> create a materialized view and use cache() method
>> in
>>>> the
>>>>>>>>>>>>>>>> materialized
>>>>>>>>>>>>>>>>>> view
>>>>>>>>>>>>>>>>>>>> creation logic. So that during the materialized
>> view
>>>>>> update,
>>>>>>>>>> they
>>>>>>>>>>>>> do
>>>>>>>>>>>>>>>>> not
>>>>>>>>>>>>>>>>>>>> need to worry about the case that the cached table
>>> is
>>>>> also
>>>>>>>>>>>>> changed.
>>>>>>>>>>>>>>>>>> Maybe
>>>>>>>>>>>>>>>>>>>> under the hood, materialized() and cache() could
>>> share
>>>>>> some
>>>>>>>>>>>>>>>> mechanism,
>>>>>>>>>>>>>>>>>> but
>>>>>>>>>>>>>>>>>>>> I think a simple cache() method would be handy in
>> a
>>>> lot
>>>>> of
>>>>>>>>>> cases.
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> Jiangjie (Becket) Qin
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> On Mon, Nov 26, 2018 at 9:38 PM Piotr Nowojski <
>>>>>>>>>>>>>>>>> piotr@data-artisans.com
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> Hi Becket,
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> Is there any extra thing user can do on a
>>>>>>> MaterializedTable
>>>>>>>>>> that
>>>>>>>>>>>>>>>>> they
>>>>>>>>>>>>>>>>>>>>> cannot do on a Table?
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> Maybe not in the initial implementation, but
>>> various
>>>>> DBs
>>>>>>>> offer
>>>>>>>>>>>>>>>>>> different
>>>>>>>>>>>>>>>>>>>>> ways to “refresh” the materialised view. Hooks,
>>>>> triggers,
>>>>>>>>>> timers,
>>>>>>>>>>>>>>>>>> manually
>>>>>>>>>>>>>>>>>>>>> etc. Having `MaterializedTable` would help us to
>>>> handle
>>>>>>> that
>>>>>>>> in
>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>> future.
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> After users call *table.cache(), *users can just
>>> use
>>>>>> that
>>>>>>>>>> table
>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>> do
>>>>>>>>>>>>>>>>>>>>> anything that is supported on a Table, including
>>> SQL.
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> This is some implicit behaviour with side
>> effects.
>>>>>> Imagine
>>>>>>> if
>>>>>>>>>>>>> user
>>>>>>>>>>>>>>>>> has
>>>>>>>>>>>>>>>>>> a
>>>>>>>>>>>>>>>>>>>>> long and complicated program, that touches table
>>> `b`
>>>>>>> multiple
>>>>>>>>>>>>>>>> times,
>>>>>>>>>>>>>>>>>> maybe
>>>>>>>>>>>>>>>>>>>>> scattered around different methods. If he
>> modifies
>>>> his
>>>>>>>> program
>>>>>>>>>> by
>>>>>>>>>>>>>>>>>> inserting
>>>>>>>>>>>>>>>>>>>>> in one place
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> b.cache()
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> This implicitly alters the semantic and behaviour
>>> of
>>>>> his
>>>>>>> code
>>>>>>>>>> all
>>>>>>>>>>>>>>>>> over
>>>>>>>>>>>>>>>>>>>>> the place, maybe in a ways that might cause
>>> problems.
>>>>> For
>>>>>>>>>> example
>>>>>>>>>>>>>>>>> what
>>>>>>>>>>>>>>>>>> if
>>>>>>>>>>>>>>>>>>>>> underlying data is changing?
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> Having invisible side effects is also not very
>>> clean,
>>>>> for
>>>>>>>>>> example
>>>>>>>>>>>>>>>>> think
>>>>>>>>>>>>>>>>>>>>> about something like this (but more complicated):
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> Table b = ...;
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> If (some_condition) {
>>>>>>>>>>>>>>>>>>>>> processTable1(b)
>>>>>>>>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>>>>>>>> else {
>>>>>>>>>>>>>>>>>>>>> processTable2(b)
>>>>>>>>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> // do more stuff with b
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> And user adds `b.cache()` call to only one of the
>>>>>>>>>> `processTable1`
>>>>>>>>>>>>>>>> or
>>>>>>>>>>>>>>>>>>>>> `processTable2` methods.
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> On the other hand
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> Table materialisedB = b.materialize()
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> Avoids (at least some of) the side effect issues
>>> and
>>>>>> forces
>>>>>>>>>> user
>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>>>> explicitly use `materialisedB` where it’s
>>> appropriate
>>>>> and
>>>>>>>>>> forces
>>>>>>>>>>>>>>>> user
>>>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>>>> think what does it actually mean. And if
>> something
>>>>>> doesn’t
>>>>>>>> work
>>>>>>>>>>>>> in
>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>> end
>>>>>>>>>>>>>>>>>>>>> for the user, he will know what has he changed
>>>> instead
>>>>> of
>>>>>>>>>> blaming
>>>>>>>>>>>>>>>>>> Flink for
>>>>>>>>>>>>>>>>>>>>> some “magic” underneath. In the above example,
>>> after
>>>>>>>>>>>>> materialising
>>>>>>>>>>>>>>>> b
>>>>>>>>>>>>>>>>> in
>>>>>>>>>>>>>>>>>>>>> only one of the methods, he should/would realise
>>>> about
>>>>>> the
>>>>>>>>>> issue
>>>>>>>>>>>>>>>> when
>>>>>>>>>>>>>>>>>>>>> handling the return value `MaterializedTable` of
>>> that
>>>>>>> method.
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> I guess it comes down to personal preferences if
>>> you
>>>>> like
>>>>>>>>>> things
>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>> be
>>>>>>>>>>>>>>>>>>>>> implicit or not. The more power is the user,
>>> probably
>>>>> the
>>>>>>>> more
>>>>>>>>>>>>>>>> likely
>>>>>>>>>>>>>>>>>> he is
>>>>>>>>>>>>>>>>>>>>> to like/understand implicit behaviour. And we as
>>>> Table
>>>>>> API
>>>>>>>>>>>>>>>> designers
>>>>>>>>>>>>>>>>>> are
>>>>>>>>>>>>>>>>>>>>> the most power users out there, so I would
>> proceed
>>>> with
>>>>>>>> caution
>>>>>>>>>>>>> (so
>>>>>>>>>>>>>>>>>> that we
>>>>>>>>>>>>>>>>>>>>> do not end up in the crazy perl realm with it’s
>>>> lovely
>>>>>>>> implicit
>>>>>>>>>>>>>>>>> method
>>>>>>>>>>>>>>>>>>>>> arguments ;)  <
>>>>>>> https://stackoverflow.com/a/14922656/8149051
>>>>>>>>> )
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> Table API to also support non-relational
>>> processing
>>>>>> cases,
>>>>>>>>>>>>> cache()
>>>>>>>>>>>>>>>>>>>>> might be slightly better.
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> I think even such extended Table API could
>> benefit
>>>> from
>>>>>>>>>> sticking
>>>>>>>>>>>>>>>>>> to/being
>>>>>>>>>>>>>>>>>>>>> consistent with SQL where both SQL and Table API
>>> are
>>>>>>>> basically
>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>> same.
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> One more thing. `MaterializedTable materialize()`
>>>> could
>>>>>> be
>>>>>>>> more
>>>>>>>>>>>>>>>>>>>>> powerful/flexible allowing the user to operate
>> both
>>>> on
>>>>>>>>>>>>> materialised
>>>>>>>>>>>>>>>>>> and not
>>>>>>>>>>>>>>>>>>>>> materialised view at the same time for whatever
>>>> reasons
>>>>>>>>>>>>> (underlying
>>>>>>>>>>>>>>>>>> data
>>>>>>>>>>>>>>>>>>>>> changing/better optimisation opportunities after
>>>>> pushing
>>>>>>> down
>>>>>>>>>>>>> more
>>>>>>>>>>>>>>>>>> filters
>>>>>>>>>>>>>>>>>>>>> etc). For example:
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> Table b = …;
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> MaterlizedTable mb = b.materialize();
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> Val min = mb.min();
>>>>>>>>>>>>>>>>>>>>> Val max = mb.max();
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> Val user42 = b.filter(‘userId = 42);
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> Could be more efficient compared to `b.cache()`
>> if
>>>>>>>>>>>>> `filter(‘userId
>>>>>>>>>>>>>>>> =
>>>>>>>>>>>>>>>>>>>>> 42);` allows for much more aggressive
>>> optimisations.
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> Piotrek
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> On 26 Nov 2018, at 12:14, Fabian Hueske <
>>>>>>> fhueske@gmail.com>
>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> I'm not suggesting to add support for Ignite.
>> This
>>>> was
>>>>>>> just
>>>>>>>> an
>>>>>>>>>>>>>>>>>> example.
>>>>>>>>>>>>>>>>>>>>>> Plasma and Arrow sound interesting, too.
>>>>>>>>>>>>>>>>>>>>>> For the sake of this proposal, it would be up to
>>> the
>>>>>> user
>>>>>>> to
>>>>>>>>>>>>>>>>>> implement a
>>>>>>>>>>>>>>>>>>>>>> TableFactory and corresponding TableSource /
>>>> TableSink
>>>>>>>> classes
>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>>>> persist
>>>>>>>>>>>>>>>>>>>>>> and read the data.
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> Am Mo., 26. Nov. 2018 um 12:06 Uhr schrieb
>> Flavio
>>>>>>>> Pompermaier
>>>>>>>>>> <
>>>>>>>>>>>>>>>>>>>>>> pompermaier@okkam.it>:
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> What about to add also Apache Plasma + Arrow as
>>> an
>>>>>>>>>> alternative
>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>>>> Apache
>>>>>>>>>>>>>>>>>>>>>>> Ignite?
>>>>>>>>>>>>>>>>>>>>>>> [1]
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>> 
>>>>> 
>>> https://arrow.apache.org/blog/2017/08/08/plasma-in-memory-object-store/
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> On Mon, Nov 26, 2018 at 11:56 AM Fabian Hueske
>> <
>>>>>>>>>>>>>>>> fhueske@gmail.com>
>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> Thanks for the proposal!
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> To summarize, you propose a new method
>>>>> Table.cache():
>>>>>>>> Table
>>>>>>>>>>>>> that
>>>>>>>>>>>>>>>>>> will
>>>>>>>>>>>>>>>>>>>>>>>> trigger a job and write the result into some
>>>>> temporary
>>>>>>>>>> storage
>>>>>>>>>>>>>>>> as
>>>>>>>>>>>>>>>>>>>>> defined
>>>>>>>>>>>>>>>>>>>>>>>> by a TableFactory.
>>>>>>>>>>>>>>>>>>>>>>>> The cache() call blocks while the job is
>> running
>>>> and
>>>>>>>>>>>>> eventually
>>>>>>>>>>>>>>>>>>>>> returns a
>>>>>>>>>>>>>>>>>>>>>>>> Table object that represents a scan of the
>>>> temporary
>>>>>>>> table.
>>>>>>>>>>>>>>>>>>>>>>>> When the "session" is closed (closing to be
>>>>> defined?),
>>>>>>> the
>>>>>>>>>>>>>>>>> temporary
>>>>>>>>>>>>>>>>>>>>>>> tables
>>>>>>>>>>>>>>>>>>>>>>>> are all dropped.
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> I think this behavior makes sense and is a
>> good
>>>>> first
>>>>>>> step
>>>>>>>>>>>>>>>> towards
>>>>>>>>>>>>>>>>>>>>> more
>>>>>>>>>>>>>>>>>>>>>>>> interactive workloads.
>>>>>>>>>>>>>>>>>>>>>>>> However, its performance suffers from writing
>> to
>>>> and
>>>>>>>> reading
>>>>>>>>>>>>>>>> from
>>>>>>>>>>>>>>>>>>>>>>> external
>>>>>>>>>>>>>>>>>>>>>>>> systems.
>>>>>>>>>>>>>>>>>>>>>>>> I think this is OK for now. Changes that would
>>>>>>>> significantly
>>>>>>>>>>>>>>>>> improve
>>>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>>> situation (i.e., pinning data in-memory across
>>>> jobs)
>>>>>>> would
>>>>>>>>>>>>> have
>>>>>>>>>>>>>>>>>> large
>>>>>>>>>>>>>>>>>>>>>>>> impacts on many components of Flink.
>>>>>>>>>>>>>>>>>>>>>>>> Users could use in-memory filesystems or
>> storage
>>>>> grids
>>>>>>>>>> (Apache
>>>>>>>>>>>>>>>>>>>>> Ignite) to
>>>>>>>>>>>>>>>>>>>>>>>> mitigate some of the performance effects.
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> Best, Fabian
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> Am Mo., 26. Nov. 2018 um 03:38 Uhr schrieb
>>> Becket
>>>>> Qin
>>>>>> <
>>>>>>>>>>>>>>>>>>>>>>>> becket.qin@gmail.com
>>>>>>>>>>>>>>>>>>>>>>>>> :
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for the explanation, Piotrek.
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> Is there any extra thing user can do on a
>>>>>>>> MaterializedTable
>>>>>>>>>>>>>>>> that
>>>>>>>>>>>>>>>>>> they
>>>>>>>>>>>>>>>>>>>>>>>>> cannot do on a Table? After users call
>>>>>> *table.cache(),
>>>>>>>>>> *users
>>>>>>>>>>>>>>>> can
>>>>>>>>>>>>>>>>>>>>> just
>>>>>>>>>>>>>>>>>>>>>>>> use
>>>>>>>>>>>>>>>>>>>>>>>>> that table and do anything that is supported
>>> on a
>>>>>>> Table,
>>>>>>>>>>>>>>>>> including
>>>>>>>>>>>>>>>>>>>>> SQL.
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> Naming wise, either cache() or materialize()
>>>> sounds
>>>>>>> fine
>>>>>>>> to
>>>>>>>>>>>>> me.
>>>>>>>>>>>>>>>>>>>>> cache()
>>>>>>>>>>>>>>>>>>>>>>>> is
>>>>>>>>>>>>>>>>>>>>>>>>> a bit more general than materialize(). Given
>>> that
>>>>> we
>>>>>>> are
>>>>>>>>>>>>>>>>> enhancing
>>>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>>>> Table API to also support non-relational
>>>> processing
>>>>>>>> cases,
>>>>>>>>>>>>>>>>> cache()
>>>>>>>>>>>>>>>>>>>>>>> might
>>>>>>>>>>>>>>>>>>>>>>>> be
>>>>>>>>>>>>>>>>>>>>>>>>> slightly better.
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> Jiangjie (Becket) Qin
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> On Fri, Nov 23, 2018 at 11:25 PM Piotr
>>> Nowojski <
>>>>>>>>>>>>>>>>>>>>>>> piotr@data-artisans.com
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> Hi Becket,
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> Ops, sorry I didn’t notice that you intend
>> to
>>>>> reuse
>>>>>>>>>> existing
>>>>>>>>>>>>>>>>>>>>>>>>>> `TableFactory`. I don’t know why, but I
>>> assumed
>>>>> that
>>>>>>> you
>>>>>>>>>>>>> want
>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>>>>>>> provide
>>>>>>>>>>>>>>>>>>>>>>>>> an
>>>>>>>>>>>>>>>>>>>>>>>>>> alternate way of writing the data.
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> Now that I hopefully understand the
>> proposal,
>>>>> maybe
>>>>>> we
>>>>>>>>>> could
>>>>>>>>>>>>>>>>>> rename
>>>>>>>>>>>>>>>>>>>>>>>>>> `cache()` to
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> void materialize()
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> or going step further
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> MaterializedTable materialize()
>>>>>>>>>>>>>>>>>>>>>>>>>> MaterializedTable createMaterializedView()
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> ?
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> The second option with returning a handle I
>>>> think
>>>>> is
>>>>>>>> more
>>>>>>>>>>>>>>>>> flexible
>>>>>>>>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>>>>>>>>>> could provide features such as
>>>> “refresh”/“delete”
>>>>> or
>>>>>>>>>>>>> generally
>>>>>>>>>>>>>>>>>>>>>>> speaking
>>>>>>>>>>>>>>>>>>>>>>>>>> manage the the view. In the future we could
>>> also
>>>>>> think
>>>>>>>>>> about
>>>>>>>>>>>>>>>>>> adding
>>>>>>>>>>>>>>>>>>>>>>>> hooks
>>>>>>>>>>>>>>>>>>>>>>>>>> to automatically refresh view etc. It is
>> also
>>>> more
>>>>>>>>>> explicit
>>>>>>>>>>>>> -
>>>>>>>>>>>>>>>>>>>>>>>>>> materialization returning a new table handle
>>>> will
>>>>>> not
>>>>>>>> have
>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>> same
>>>>>>>>>>>>>>>>>>>>>>>>>> implicit side effects as adding a simple
>> line
>>> of
>>>>>> code
>>>>>>>> like
>>>>>>>>>>>>>>>>>>>>>>> `b.cache()`
>>>>>>>>>>>>>>>>>>>>>>>>>> would have.
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> It would also be more SQL like, making it
>> more
>>>>>>> intuitive
>>>>>>>>>> for
>>>>>>>>>>>>>>>>> users
>>>>>>>>>>>>>>>>>>>>>>>>> already
>>>>>>>>>>>>>>>>>>>>>>>>>> familiar with the SQL.
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> Piotrek
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>> On 23 Nov 2018, at 14:53, Becket Qin <
>>>>>>>>>> becket.qin@gmail.com
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi Piotrek,
>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>> For the cache() method itself, yes, it is
>>>>>> equivalent
>>>>>>> to
>>>>>>>>>>>>>>>>> creating
>>>>>>>>>>>>>>>>>> a
>>>>>>>>>>>>>>>>>>>>>>>>>> BUILT-IN
>>>>>>>>>>>>>>>>>>>>>>>>>>> materialized view with a lifecycle. That
>>>>>>> functionality
>>>>>>>> is
>>>>>>>>>>>>>>>>> missing
>>>>>>>>>>>>>>>>>>>>>>>>> today,
>>>>>>>>>>>>>>>>>>>>>>>>>>> though. Not sure if I understand your
>>> question.
>>>>> Do
>>>>>>> you
>>>>>>>>>> mean
>>>>>>>>>>>>>>>> we
>>>>>>>>>>>>>>>>>>>>>>>> already
>>>>>>>>>>>>>>>>>>>>>>>>>> have
>>>>>>>>>>>>>>>>>>>>>>>>>>> the functionality and just need a syntax
>>> sugar?
>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>> What's more interesting in the proposal is
>> do
>>>> we
>>>>>> want
>>>>>>>> to
>>>>>>>>>>>>> stop
>>>>>>>>>>>>>>>>> at
>>>>>>>>>>>>>>>>>>>>>>>>> creating
>>>>>>>>>>>>>>>>>>>>>>>>>>> the materialized view? Or do we want to
>>> extend
>>>>> that
>>>>>>> in
>>>>>>>>>> the
>>>>>>>>>>>>>>>>> future
>>>>>>>>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>>>>>>> a
>>>>>>>>>>>>>>>>>>>>>>>>>> more
>>>>>>>>>>>>>>>>>>>>>>>>>>> useful unified data store distributed with
>>>> Flink?
>>>>>> And
>>>>>>>> do
>>>>>>>>>> we
>>>>>>>>>>>>>>>>> want
>>>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>>>>>>>> have
>>>>>>>>>>>>>>>>>>>>>>>>>> a
>>>>>>>>>>>>>>>>>>>>>>>>>>> mechanism allow more flexible user job
>>> pattern
>>>>> with
>>>>>>>> their
>>>>>>>>>>>>> own
>>>>>>>>>>>>>>>>>> user
>>>>>>>>>>>>>>>>>>>>>>>>>> defined
>>>>>>>>>>>>>>>>>>>>>>>>>>> services. These considerations are much
>> more
>>>>>>>>>> architectural.
>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>> Jiangjie (Becket) Qin
>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>> On Fri, Nov 23, 2018 at 6:01 PM Piotr
>>> Nowojski
>>>> <
>>>>>>>>>>>>>>>>>>>>>>>>> piotr@data-artisans.com>
>>>>>>>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Interesting idea. I’m trying to understand
>>> the
>>>>>>>> problem.
>>>>>>>>>>>>>>>> Isn’t
>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>>>>>>> `cache()` call an equivalent of writing
>> data
>>>> to
>>>>> a
>>>>>>> sink
>>>>>>>>>> and
>>>>>>>>>>>>>>>>> later
>>>>>>>>>>>>>>>>>>>>>>>>> reading
>>>>>>>>>>>>>>>>>>>>>>>>>>>> from it? Where this sink has a limited
>> live
>>>>>>> scope/live
>>>>>>>>>>>>> time?
>>>>>>>>>>>>>>>>> And
>>>>>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>>>>> sink
>>>>>>>>>>>>>>>>>>>>>>>>>>>> could be implemented as in memory or a
>> file
>>>>> sink?
>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> If so, what’s the problem with creating a
>>>>>>> materialised
>>>>>>>>>>>>> view
>>>>>>>>>>>>>>>>>> from a
>>>>>>>>>>>>>>>>>>>>>>>>> table
>>>>>>>>>>>>>>>>>>>>>>>>>>>> “b” (from your document’s example) and
>>> reusing
>>>>>> this
>>>>>>>>>>>>>>>>> materialised
>>>>>>>>>>>>>>>>>>>>>>>> view
>>>>>>>>>>>>>>>>>>>>>>>>>>>> later? Maybe we are lacking mechanisms to
>>>> clean
>>>>> up
>>>>>>>>>>>>>>>>> materialised
>>>>>>>>>>>>>>>>>>>>>>>> views
>>>>>>>>>>>>>>>>>>>>>>>>>> (for
>>>>>>>>>>>>>>>>>>>>>>>>>>>> example when current session finishes)?
>>> Maybe
>>>> we
>>>>>>> need
>>>>>>>>>> some
>>>>>>>>>>>>>>>>>>>>>>> syntactic
>>>>>>>>>>>>>>>>>>>>>>>>>> sugar
>>>>>>>>>>>>>>>>>>>>>>>>>>>> on top of it?
>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Piotrek
>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On 23 Nov 2018, at 07:21, Becket Qin <
>>>>>>>>>>>>> becket.qin@gmail.com
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for the suggestion, Jincheng.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Yes, I think it makes sense to have a
>>>> persist()
>>>>>>> with
>>>>>>>>>>>>>>>>>>>>>>>>> lifecycle/defined
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> scope. I just added a section in the
>> future
>>>>> work
>>>>>>> for
>>>>>>>>>>>>> this.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Jiangjie (Becket) Qin
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Fri, Nov 23, 2018 at 1:55 PM jincheng
>>> sun
>>>> <
>>>>>>>>>>>>>>>>>>>>>>>>> sunjincheng121@gmail.com
>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi Jiangjie,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thank you for the explanation about the
>>> name
>>>>> of
>>>>>>>>>>>>>>>> `cache()`, I
>>>>>>>>>>>>>>>>>>>>>>>>>> understand
>>>>>>>>>>>>>>>>>>>>>>>>>>>> why
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> you designed this way!
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Another idea is whether we can specify a
>>>>>> lifecycle
>>>>>>>> for
>>>>>>>>>>>>>>>> data
>>>>>>>>>>>>>>>>>>>>>>>>>> persistence?
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> For example, persist
>> (LifeCycle.SESSION),
>>> so
>>>>>> that
>>>>>>>> the
>>>>>>>>>>>>> user
>>>>>>>>>>>>>>>>> is
>>>>>>>>>>>>>>>>>>>>>>> not
>>>>>>>>>>>>>>>>>>>>>>>>>>>> worried
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> about data loss, and will clearly
>> specify
>>>> the
>>>>>> time
>>>>>>>>>> range
>>>>>>>>>>>>>>>> for
>>>>>>>>>>>>>>>>>>>>>>>> keeping
>>>>>>>>>>>>>>>>>>>>>>>>>>>> time.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> At the same time, if we want to expand,
>> we
>>>> can
>>>>>>> also
>>>>>>>>>>>>> share
>>>>>>>>>>>>>>>>> in a
>>>>>>>>>>>>>>>>>>>>>>>>> certain
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> group of session, for example:
>>>>>>>>>>>>>>>>> LifeCycle.SESSION_GROUP(...), I
>>>>>>>>>>>>>>>>>>>>>>> am
>>>>>>>>>>>>>>>>>>>>>>>>> not
>>>>>>>>>>>>>>>>>>>>>>>>>>>> sure,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> just an immature suggestion, for
>> reference
>>>>> only!
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Bests,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Jincheng
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Becket Qin <be...@gmail.com>
>>>>>> 于2018年11月23日周五
>>>>>>>>>>>>>>>> 下午1:33写道:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Re: Jincheng,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for the feedback. Regarding
>>> cache()
>>>>> v.s.
>>>>>>>>>>>>>>>> persist(),
>>>>>>>>>>>>>>>>>>>>>>>>>> personally I
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> find cache() to be more accurately
>>>> describing
>>>>>> the
>>>>>>>>>>>>>>>> behavior,
>>>>>>>>>>>>>>>>>>>>>>> i.e.
>>>>>>>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Table
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> is cached for the session, but will be
>>>>> deleted
>>>>>>>> after
>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>> session
>>>>>>>>>>>>>>>>>>>>>>>> is
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> closed.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> persist() seems a little misleading as
>>>> people
>>>>>>> might
>>>>>>>>>>>>> think
>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>>> table
>>>>>>>>>>>>>>>>>>>>>>>>>>>> will
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> still be there even after the session
>> is
>>>>> gone.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Great point about mixing the batch and
>>>> stream
>>>>>>>>>>>>> processing
>>>>>>>>>>>>>>>> in
>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>>>> same
>>>>>>>>>>>>>>>>>>>>>>>>>>>> job.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> We should absolutely move towards that
>>>> goal.
>>>>> I
>>>>>>>>>> imagine
>>>>>>>>>>>>>>>> that
>>>>>>>>>>>>>>>>>>>>>>> would
>>>>>>>>>>>>>>>>>>>>>>>>> be
>>>>>>>>>>>>>>>>>>>>>>>>>> a
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> huge
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> change across the board, including
>>> sources,
>>>>>>>> operators
>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>>>>>>>>>>>> optimizations,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> name some. Likely we will need several
>>>>> separate
>>>>>>>>>>>>> in-depth
>>>>>>>>>>>>>>>>>>>>>>>>> discussions.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Jiangjie (Becket) Qin
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Fri, Nov 23, 2018 at 5:14 AM Xingcan
>>>> Cui <
>>>>>>>>>>>>>>>>>>>>>>> xingcanc@gmail.com>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> @Shaoxuan, I think the lifecycle or
>>> access
>>>>>>> domain
>>>>>>>>>> are
>>>>>>>>>>>>>>>> both
>>>>>>>>>>>>>>>>>>>>>>>>>> orthogonal
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the cache problem. Essentially, this
>> may
>>>> be
>>>>>> the
>>>>>>>>>> first
>>>>>>>>>>>>>>>> time
>>>>>>>>>>>>>>>>>> we
>>>>>>>>>>>>>>>>>>>>>>>> plan
>>>>>>>>>>>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> introduce another storage mechanism
>>> other
>>>>> than
>>>>>>> the
>>>>>>>>>>>>>>>> state.
>>>>>>>>>>>>>>>>>>>>>>> Maybe
>>>>>>>>>>>>>>>>>>>>>>>>> it’s
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> better
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> to first draw a big picture and then
>>>>>> concentrate
>>>>>>>> on
>>>>>>>>>> a
>>>>>>>>>>>>>>>>>> specific
>>>>>>>>>>>>>>>>>>>>>>>>> part?
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> @Becket, yes, actually I am more
>>> concerned
>>>>>> with
>>>>>>>> the
>>>>>>>>>>>>>>>>>> underlying
>>>>>>>>>>>>>>>>>>>>>>>>>>>> service.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> This seems to be quite a major change
>> to
>>>> the
>>>>>>>>>> existing
>>>>>>>>>>>>>>>>>>>>>>> codebase.
>>>>>>>>>>>>>>>>>>>>>>>> As
>>>>>>>>>>>>>>>>>>>>>>>>>> you
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> claimed, the service should be
>>> extendible
>>>> to
>>>>>>>> support
>>>>>>>>>>>>>>>> other
>>>>>>>>>>>>>>>>>>>>>>>>>> components
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> we’d better discussed it in another
>>>> thread.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> All in all, I also eager to enjoy the
>>> more
>>>>>>>>>> interactive
>>>>>>>>>>>>>>>>> Table
>>>>>>>>>>>>>>>>>>>>>>>> API,
>>>>>>>>>>>>>>>>>>>>>>>>> in
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> case
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> of a general and flexible enough
>> service
>>>>>>>> mechanism.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Xingcan
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Nov 22, 2018, at 10:16 AM, Xiaowei
>>>>> Jiang <
>>>>>>>>>>>>>>>>>>>>>>>> xiaoweij@gmail.com>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Relying on a callback for the temp
>>> table
>>>>> for
>>>>>>>> clean
>>>>>>>>>> up
>>>>>>>>>>>>>>>> is
>>>>>>>>>>>>>>>>>> not
>>>>>>>>>>>>>>>>>>>>>>>> very
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> reliable.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> There is no guarantee that it will be
>>>>>> executed
>>>>>>>>>>>>>>>>>> successfully.
>>>>>>>>>>>>>>>>>>>>>>> We
>>>>>>>>>>>>>>>>>>>>>>>>> may
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> risk
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> leaks when that happens. I think that
>>>> it's
>>>>>>> safer
>>>>>>>> to
>>>>>>>>>>>>>>>> have
>>>>>>>>>>>>>>>>> an
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> association
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> between temp table and session id. So
>>> we
>>>>> can
>>>>>>>> always
>>>>>>>>>>>>>>>> clean
>>>>>>>>>>>>>>>>>> up
>>>>>>>>>>>>>>>>>>>>>>>> temp
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> tables
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> which are no longer associated with
>> any
>>>>>> active
>>>>>>>>>>>>>>>> sessions.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Xiaowei
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Thu, Nov 22, 2018 at 12:55 PM
>>> jincheng
>>>>>> sun <
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> sunjincheng121@gmail.com>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi Jiangjie&Shaoxuan,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for initiating this great
>>>> proposal!
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Interactive Programming is very
>> useful
>>>> and
>>>>>>> user
>>>>>>>>>>>>>>>> friendly
>>>>>>>>>>>>>>>>>> in
>>>>>>>>>>>>>>>>>>>>>>>> case
>>>>>>>>>>>>>>>>>>>>>>>>>> of
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> your
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> examples.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Moreover, especially when a business
>>> has
>>>>> to
>>>>>> be
>>>>>>>>>>>>>>>> executed
>>>>>>>>>>>>>>>>> in
>>>>>>>>>>>>>>>>>>>>>>>>> several
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> stages
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> with dependencies,such as the
>> pipeline
>>>> of
>>>>>>> Flink
>>>>>>>>>> ML,
>>>>>>>>>>>>> in
>>>>>>>>>>>>>>>>>> order
>>>>>>>>>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> utilize
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> intermediate calculation results we
>>> have
>>>>> to
>>>>>>>>>> submit a
>>>>>>>>>>>>>>>> job
>>>>>>>>>>>>>>>>>> by
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> env.execute().
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> About the `cache()`  , I think is
>>> better
>>>>> to
>>>>>>>> named
>>>>>>>>>>>>>>>>>>>>>>> `persist()`,
>>>>>>>>>>>>>>>>>>>>>>>>> And
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> The
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Flink framework determines whether
>> we
>>>>>>> internally
>>>>>>>>>>>>> cache
>>>>>>>>>>>>>>>>> in
>>>>>>>>>>>>>>>>>>>>>>>> memory
>>>>>>>>>>>>>>>>>>>>>>>>>> or
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> persist
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> to the storage system,Maybe save the
>>>> data
>>>>>> into
>>>>>>>>>> state
>>>>>>>>>>>>>>>>>> backend
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> (MemoryStateBackend or
>>>> RocksDBStateBackend
>>>>>>> etc.)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> BTW, from the points of my view in
>> the
>>>>>> future,
>>>>>>>>>>>>> support
>>>>>>>>>>>>>>>>> for
>>>>>>>>>>>>>>>>>>>>>>>>>> streaming
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> batch mode switching in the same job
>>>> will
>>>>>> also
>>>>>>>>>>>>> benefit
>>>>>>>>>>>>>>>>> in
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> "Interactive
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Programming",  I am looking forward
>> to
>>>>> your
>>>>>>>> JIRAs
>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>> FLIP!
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Jincheng


Re: [DISCUSS] Support Interactive Programming in Flink Table API

Posted by Becket Qin <be...@gmail.com>.
Hi Till,

It is true that after the first job submission, there will be no ambiguity
about whether a cached table is used or not. The same holds for a cache()
that does not return a CachedTable.

> Conceptually one could think of cache() as introducing a caching operator
> from which you need to consume from if you want to benefit from the caching
> functionality.

I am thinking a little differently. I think it is a hint (as you mentioned
later) rather than a new operator. I'd like to be careful about the semantics
of the API. A hint is a property set on an existing operator; it is not
itself an operator, as it does not really manipulate the data.

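Roughly, the distinction I have in mind looks like the following sketch (the
class names are invented for illustration; none of this is existing Flink
code):

import java.util.HashSet;
import java.util.Set;

// Illustrative sketch only. A hint is a property attached to an existing
// logical node; it adds no node to the DAG and transforms no data.
class LogicalNode {
    final Set<String> hints = new HashSet<>(); // e.g. "cache", "ignoreCache"
}

// An operator, by contrast, is itself a node that consumes an input.
class MapNode extends LogicalNode {
    LogicalNode input;
}
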
> I agree, ideally the optimizer makes this kind of decision which
> intermediate result should be cached. But especially when executing ad-hoc
> queries the user might better know which results need to be cached because
> Flink might not see the full DAG. In that sense, I would consider the
> cache() method as a hint for the optimizer. Of course, in the future we
> might add functionality which tries to automatically cache results (e.g.
> caching the latest intermediate results until so and so much space is
> used). But this should hopefully not contradict with `CachedTable cache()`.

I agree that the cache() method is needed for exactly the reason you
mentioned, i.e. Flink cannot predict what users are going to write later, so
users need to tell Flink explicitly that this table will be used later. What
I meant is that, assuming a cached table already exists, ideally users should
not need to specify whether the next query reads from the cache or uses the
original DAG. That should be decided by the optimizer.

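For illustration only (none of these classes exist in Flink; this is just to
pin down the idea), the optimizer's choice could boil down to a cost
comparison between the two alternatives:

// Illustrative sketch, not actual Flink planner code. When a scan of a
// cached table is planned, the optimizer picks whichever alternative it
// estimates to be cheaper.
interface Plan {
    double estimatedCost();
}

final class ScanChooser {
    static Plan choose(Plan originalDag, Plan cachedScan) {
        if (cachedScan == null) { // the table was never cached
            return originalDag;
        }
        return cachedScan.estimatedCost() <= originalDag.estimatedCost()
                ? cachedScan
                : originalDag;
    }
}
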
To explain the difference between returning and not returning a CachedTable,
I want to compare the following two cases:

*Case 1: returning a CachedTable*
b = a.map(...)
cachedTableA1 = a.cache()
cachedTableA2 = a.cache()
b.print() // Just to make sure a is cached.

c = a.filter(...) // Does the user specify that the original DAG is used? Or
// does the optimizer decide whether the DAG or the cache should be used?
d = cachedTableA1.filter(...) // The user specifies that the cached table is used.

a.unCache() // Can cachedTableA1 and cachedTableA2 still be used afterwards?
cachedTableA1.unCache() // Can cachedTableA2 still be used?

*Case 2: not returning a CachedTable*
b = a.map(...)
a.cache()
a.cache() // no-op
b.print() // Just to make sure a is cached

c = a.filter(...) // The optimizer decides whether the cache or the DAG should be used
d = a.filter(...) // The optimizer decides whether the cache or the DAG should be used

a.unCache()
a.unCache() // no-op

In case 1, semantics-wise, the optimizer loses the option to choose between
the DAG and the cache, and the unCache() calls become tricky.
In case 2, users do not need to worry about whether the cache or the DAG is
used, and the unCache() semantics are clear. However, the caveat is that
users cannot explicitly ignore the cache.

In order to address the issue mentioned in case 2, and inspired by the
discussion so far, I am thinking about using a hint to let users explicitly
ignore the cache. Although we do not have hints yet, we probably should have
them. So the code becomes:

*Case 3: returning this table*
b = a.map(...)
a.cache()
a.cache() // no-op
b.print() // Just to make sure a is cached

c = a.filter(...) // The optimizer decides whether the cache or the DAG should be used
d = a.hint("ignoreCache").filter(...) // The DAG will be used instead of the cache.

a.unCache()
a.unCache() // no-op

We could also let cache() return the table itself to allow chained method
calls. Putting the pieces together, the API surface could look roughly like
the sketch below. Do you think this API addresses the concerns?
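
(Sketch only; cache() and unCache() follow the case 3 semantics above, and
the hint() method with the "ignoreCache" name is made up for illustration
and would need its own design discussion.)

// Hypothetical API sketch, not the actual Flink Table interface.
public interface Table {
    // Marks this table to be cached by the first job that materializes it.
    // Returns this table so that calls can be chained.
    Table cache();

    // Drops the cache; a no-op if this table is not cached.
    Table unCache();

    // Attaches a hint, e.g. "ignoreCache" to read from the original DAG
    // instead of the cache.
    Table hint(String hint);
}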

Thanks,

Jiangjie (Becket) Qin


On Wed, Dec 5, 2018 at 10:55 AM Jark Wu <im...@gmail.com> wrote:

> Hi,
>
> All the recent discussions are focused on whether there is a problem if
> cache() does not return a Table.
> It seems that returning a Table explicitly is more clear (and safe?).
>
> So whether there are any problems if cache() returns a Table?  @Becket
>
> Best,
> Jark
>
> On Tue, 4 Dec 2018 at 22:27, Till Rohrmann <tr...@apache.org> wrote:
>
> > It's true that b, c, d and e will all read from the original DAG that
> > generates a. But all subsequent operators (when running multiple queries)
> > which reference cachedTableA should not need to reproduce `a` but
> directly
> > consume the intermediate result.
> >
> > Conceptually one could think of cache() as introducing a caching operator
> > from which you need to consume from if you want to benefit from the
> caching
> > functionality.
> >
> > I agree, ideally the optimizer makes this kind of decision which
> > intermediate result should be cached. But especially when executing
> ad-hoc
> > queries the user might better know which results need to be cached
> because
> > Flink might not see the full DAG. In that sense, I would consider the
> > cache() method as a hint for the optimizer. Of course, in the future we
> > might add functionality which tries to automatically cache results (e.g.
> > caching the latest intermediate results until so and so much space is
> > used). But this should hopefully not contradict with `CachedTable
> cache()`.
> >
> > Cheers,
> > Till
> >
> > On Tue, Dec 4, 2018 at 2:33 PM Becket Qin <be...@gmail.com> wrote:
> >
> > > Hi Till,
> > >
> > > Thanks for the clarification. I am still a little confused.
> > >
> > > If cache() returns a CachedTable, the example might become:
> > >
> > > b = a.map(...)
> > > c = a.map(...)
> > >
> > > cachedTableA = a.cache()
> > > d = cachedTableA.map(...)
> > > e = a.map()
> > >
> > > In the above case, if cache() is lazily evaluated, b, c, d and e are
> all
> > > going to be reading from the original DAG that generates a. But with a
> > > naive expectation, d should be reading from the cache. This seems not
> > > solving the potential confusion you raised, right?
> > >
> > > Just to be clear, my understanding is all based on the assumption that
> > > the tables are immutable. Therefore, after a.cache(), the *cachedTableA*
> > > and the original table *a* should be completely interchangeable.
> > >
> > > That said, I think a valid argument is optimization. There are indeed
> > cases
> > > that reading from the original DAG could be faster than reading from
> the
> > > cache. For example, in the following example:
> > >
> > > a.filter(f1' > 100)
> > > a.cache()
> > > b = a.filter(f1' < 100)
> > >
> > > Ideally the optimizer should be intelligent enough to decide which way
> is
> > > faster, without user intervention. In this case, it will identify that
> b
> > > would just be an empty table, thus skip reading from the cache
> > completely.
> > > But I agree that returning a CachedTable would give user the control of
> > > when to use cache, even though I still feel that letting the optimizer
> > > handle this is a better option in long run.
> > >
> > > Thanks,
> > >
> > > Jiangjie (Becket) Qin
> > >
> > >
> > >
> > >
> > > On Tue, Dec 4, 2018 at 6:51 PM Till Rohrmann <tr...@apache.org>
> > wrote:
> > >
> > > > Yes you are right Becket that it still depends on the actual
> execution
> > of
> > > > the job whether a consumer reads from a cached result or not.
> > > >
> > > > My point was actually about the properties of a (cached vs.
> non-cached)
> > > and
> > > > not about the execution. I would not make cache trigger the execution
> > of
> > > > the job because one loses some flexibility by eagerly triggering the
> > > > execution.
> > > >
> > > > I tried to argue for an explicit CachedTable which is returned by the
> > > > cache() method like Piotr did in order to make the API more explicit.
> > > >
> > > > Cheers,
> > > > Till
> > > >
> > > > On Mon, Dec 3, 2018 at 4:23 PM Becket Qin <be...@gmail.com>
> > wrote:
> > > >
> > > > > Hi Till,
> > > > >
> > > > > That is a good example. Just a minor correction, in this case, b, c
> > > and d
> > > > > will all consume from a non-cached a. This is because cache will
> only
> > > be
> > > > > created on the very first job submission that generates the table
> to
> > be
> > > > > cached.
> > > > >
> > > > > If I understand correctly, this example is about whether the .cache()
> > > > > method should be eagerly evaluated or lazily evaluated. In other
> > > > > words, if the cache() method actually triggers a job that creates the
> > > > > cache, there will be no such confusion. Is that right?
> > > > >
> > > > > In the example, although d will not consume from the cached Table
> > > > > while it looks supposed to, from a correctness perspective the code
> > > > > will still return a correct result, assuming that tables are immutable.
> > > > >
> > > > > Personally I feel it is OK because users probably won't really
> worry
> > > > about
> > > > > whether the table is cached or not. And lazy cache could avoid some
> > > > > unnecessary caching if a cached table is never created in the user
> > > > > application. But I am not opposed to do eager evaluation of cache.
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Jiangjie (Becket) Qin
> > > > >
> > > > >
> > > > >
> > > > > On Mon, Dec 3, 2018 at 10:01 PM Till Rohrmann <
> trohrmann@apache.org>
> > > > > wrote:
> > > > >
> > > > > > Another argument for Piotr's point is that lazily changing
> > properties
> > > > of
> > > > > a
> > > > > > node affects all down stream consumers but does not necessarily
> > have
> > > to
> > > > > > happen before these consumers are defined. From a user's
> > perspective
> > > > this
> > > > > > can be quite confusing:
> > > > > >
> > > > > > b = a.map(...)
> > > > > > c = a.map(...)
> > > > > >
> > > > > > a.cache()
> > > > > > d = a.map(...)
> > > > > >
> > > > > > now b, c and d will consume from a cached operator. In this case,
> > the
> > > > > user
> > > > > > would most likely expect that only d reads from a cached result.
> > > > > >
> > > > > > Cheers,
> > > > > > Till
> > > > > >
> > > > > > On Mon, Dec 3, 2018 at 11:32 AM Piotr Nowojski <
> > > > piotr@data-artisans.com>
> > > > > > wrote:
> > > > > >
> > > > > > > Hey Shaoxuan and Becket,
> > > > > > >
> > > > > > > > Can you explain a bit more one what are the side effects? So
> > far
> > > my
> > > > > > > > understanding is that such side effects only exist if a table
> > is
> > > > > > mutable.
> > > > > > > > Is that the case?
> > > > > > >
> > > > > > > Not only that. There are also performance implications and
> those
> > > are
> > > > > > > another implicit side effects of using `void cache()`. As I
> wrote
> > > > > before,
> > > > > > > reading from cache might not always be desirable, thus it can
> > cause
> > > > > > > performance degradation and I’m fine with that - user's or
> > > > optimiser’s
> > > > > > > choice. What I do not like is that this implicit side effect
> can
> > > > > manifest
> > > > > > > in completely different part of code, that wasn’t touched by a
> > user
> > > > > while
> > > > > > > he was adding `void cache()` call somewhere else. And even if
> > > caching
> > > > > > > improves performance, it’s still a side effect of `void
> cache()`.
> > > > > Almost
> > > > > > > from the definition `void` methods have only side effects. As I
> > > wrote
> > > > > > > before, there are couple of scenarios where this might be
> > > undesirable
> > > > > > > and/or unexpected, for example:
> > > > > > >
> > > > > > > 1.
> > > > > > > Table b = …;
> > > > > > > b.cache()
> > > > > > > x = b.join(…)
> > > > > > > y = b.count()
> > > > > > > // ...
> > > > > > > // 100
> > > > > > > // hundred
> > > > > > > // lines
> > > > > > > // of
> > > > > > > // code
> > > > > > > // later
> > > > > > > z = b.filter(…).groupBy(…) // this might be even hidden in a
> > > > different
> > > > > > > method/file/package/dependency
> > > > > > >
> > > > > > > 2.
> > > > > > >
> > > > > > > Table b = ...
> > > > > > > If (some_condition) {
> > > > > > >   foo(b)
> > > > > > > }
> > > > > > > Else {
> > > > > > >   bar(b)
> > > > > > > }
> > > > > > > z = b.filter(…).groupBy(…)
> > > > > > >
> > > > > > >
> > > > > > > Void foo(Table b) {
> > > > > > >   b.cache()
> > > > > > >   // do something with b
> > > > > > > }
> > > > > > >
> > > > > > > In both above examples, `b.cache()` will implicitly affect
> > > (semantic
> > > > > of a
> > > > > > > program in case of sources being mutable and performance) `z =
> > > > > > > b.filter(…).groupBy(…)` which might be far from obvious.
> > > > > > >
> > > > > > > On top of that, there is still this argument of mine that
> having
> > a
> > > > > > > `MaterializedTable` or `CachedTable` handle is more flexible
> for
> > us
> > > > for
> > > > > > the
> > > > > > > future and for the user (as a manual option to bypass cache
> > reads).
> > > > > > >
> > > > > > > >  But Jiangjie is correct,
> > > > > > > > the source table in batching should be immutable. It is the
> > > user’s
> > > > > > > > responsibility to ensure it, otherwise even a regular
> failover
> > > may
> > > > > lead
> > > > > > > > to inconsistent results.
> > > > > > >
> > > > > > > Yes, I agree that’s what perfect world/good deployment should
> be.
> > > But
> > > > > its
> > > > > > > often isn’t and while I’m not trying to fix this (since the
> > proper
> > > > fix
> > > > > is
> > > > > > > to support transactions), I’m just trying to minimise confusion
> > for
> > > > the
> > > > > > > users that are not fully aware what’s going on and operate in
> > less
> > > > then
> > > > > > > perfect setup. And if something bites them after adding
> > `b.cache()`
> > > > > call,
> > > > > > > to make sure that they at least know all of the places that
> > adding
> > > > this
> > > > > > > line can affect.
> > > > > > >
> > > > > > > Thanks, Piotrek
> > > > > > >
> > > > > > > > On 1 Dec 2018, at 15:39, Becket Qin <be...@gmail.com>
> > > wrote:
> > > > > > > >
> > > > > > > > Hi Piotrek,
> > > > > > > >
> > > > > > > > Thanks again for the clarification. Some more replies are
> > > > following.
> > > > > > > >
> > > > > > > > But keep in mind that `.cache()` will/might not only be used
> in
> > > > > > > interactive
> > > > > > > >> programming and not only in batching.
> > > > > > > >
> > > > > > > > It is true. Actually in stream processing, cache() has the
> same
> > > > > > semantic
> > > > > > > as
> > > > > > > > batch processing. The semantic is following:
> > > > > > > > For a table created via a series of computation, save that
> > table
> > > > for
> > > > > > > later
> > > > > > > > reference to avoid running the computation logic to
> regenerate
> > > the
> > > > > > table.
> > > > > > > > Once the application exits, drop all the cache.
> > > > > > > > This semantic is same for both batch and stream processing.
> The
> > > > > > > difference
> > > > > > > > is that stream applications will only run once as they are
> long
> > > > > > running.
> > > > > > > > And the batch applications may be run multiple times, hence
> the
> > > > cache
> > > > > > may
> > > > > > > > be created and dropped each time the application runs.
> > > > > > > > Admittedly, there will probably be some resource management
> > > > > > requirements
> > > > > > > > for the streaming cached table, such as time based / size
> based
> > > > > > > retention,
> > > > > > > > to address the infinite data issue. But such requirement does
> > not
> > > > > > change
> > > > > > > > the semantic.
> > > > > > > > You are right that interactive programming is just one use
> case
> > > of
> > > > > > > cache().
> > > > > > > > It is not the only use case.
> > > > > > > >
> > > > > > > > For me the more important issue is of not having the `void
> > > cache()`
> > > > > > with
> > > > > > > >> side effects.
> > > > > > > >
> > > > > > > > This is indeed the key point. The argument around whether
> > cache()
> > > > > > should
> > > > > > > > return something already indicates that cache() and
> > materialize()
> > > > > > address
> > > > > > > > different issues.
> > > > > > > > Can you explain a bit more one what are the side effects? So
> > far
> > > my
> > > > > > > > understanding is that such side effects only exist if a table
> > is
> > > > > > mutable.
> > > > > > > > Is that the case?
> > > > > > > >
> > > > > > > > I don’t know, probably initially we should make CachedTable
> > > > > read-only.
> > > > > > I
> > > > > > > >> don’t find it more confusing than the fact that user can not
> > > write
> > > > > to
> > > > > > > views
> > > > > > > >> or materialised views in SQL or that user currently can not
> > > write
> > > > > to a
> > > > > > > >> Table.
> > > > > > > >
> > > > > > > > I don't think anyone should insert something to a cache. By
> > > > > definition
> > > > > > > the
> > > > > > > > cache should only be updated when the corresponding original
> > > table
> > > > is
> > > > > > > > updated. What I am wondering is that given the following two
> > > facts:
> > > > > > > > 1. If and only if a table is mutable (with something like
> > > > insert()),
> > > > > a
> > > > > > > > CachedTable may have implicit behavior.
> > > > > > > > 2. A CachedTable extends a Table.
> > > > > > > > We can come to the conclusion that a CachedTable is mutable
> and
> > > > users
> > > > > > can
> > > > > > > > insert into the CachedTable directly. This is what I thought was
> > > > > > > > confusing.
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > >
> > > > > > > > Jiangjie (Becket) Qin
> > > > > > > >
> > > > > > > > On Sat, Dec 1, 2018 at 2:45 AM Piotr Nowojski <
> > > > > piotr@data-artisans.com
> > > > > > >
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > >> Hi all,
> > > > > > > >>
> > > > > > > >> Regarding naming `cache()` vs `materialize()`. One more
> > > > explanation
> > > > > > why
> > > > > > > I
> > > > > > > >> think `materialize()` is more natural to me is that I think
> of
> > > all
> > > > > > > “Table”s
> > > > > > > >> in Table-API as views. They behave the same way as SQL
> views,
> > > the
> > > > > only
> > > > > > > >> difference for me is that their live scope is short -
> current
> > > > > session
> > > > > > > which
> > > > > > > >> is limited by different execution model. That’s why
> “cashing”
> > a
> > > > view
> > > > > > > for me
> > > > > > > >> is just materialising it.
> > > > > > > >>
> > > > > > > >> However I see and I understand your point of view. Coming
> from
> > > > > > > >> DataSet/DataStream and generally speaking non-SQL world,
> > > `cache()`
> > > > > is
> > > > > > > more
> > > > > > > >> natural. But keep in mind that `.cache()` will/might not
> only
> > be
> > > > > used
> > > > > > in
> > > > > > > >> interactive programming and not only in batching. But naming
> > is
> > > > one
> > > > > > > issue,
> > > > > > > >> and not that critical to me. Especially that once we
> implement
> > > > > proper
> > > > > > > >> materialised views, we can always deprecate/rename `cache()`
> > if
> > > we
> > > > > > deem
> > > > > > > so.
> > > > > > > >>
> > > > > > > >>
> > > > > > > >> For me the more important issue is of not having the `void
> > > > cache()`
> > > > > > with
> > > > > > > >> side effects. Exactly for the reasons that you have
> mentioned.
> > > > True:
> > > > > > > >> results might be non deterministic if underlying source
> table
> > > are
> > > > > > > changing.
> > > > > > > >> Problem is that `void cache()` implicitly changes the
> semantic
> > > of
> > > > > > > >> subsequent uses of the cached/materialized Table. It can
> cause
> > > > “wtf”
> > > > > > > moment
> > > > > > > >> for a user if he inserts “b.cache()” call in some place in
> his
> > > > code
> > > > > > and
> > > > > > > >> suddenly some other random places are behaving differently.
> If
> > > > > > > >> `materialize()` or `cache()` returns a Table handle, we
> force
> > > user
> > > > > to
> > > > > > > >> explicitly use the cache which removes the “random” part
> from
> > > the
> > > > > > > "suddenly
> > > > > > > >> some other random places are behaving differently”.
> > > > > > > >>
> > > > > > > >> This argument and others that I’ve raised (greater
> > > > > > flexibility/allowing
> > > > > > > >> user to explicitly bypass the cache) are independent of
> > > `cache()`
> > > > vs
> > > > > > > >> `materialize()` discussion.
> > > > > > > >>
> > > > > > > >>> Does that mean one can also insert into the CachedTable?
> This
> > > > > sounds
> > > > > > > >> pretty confusing.
> > > > > > > >>
> > > > > > > >> I don’t know, probably initially we should make CachedTable
> > > > > > read-only. I
> > > > > > > >> don’t find it more confusing than the fact that user can not
> > > write
> > > > > to
> > > > > > > views
> > > > > > > >> or materialised views in SQL or that user currently can not
> > > write
> > > > > to a
> > > > > > > >> Table.
> > > > > > > >>
> > > > > > > >> Piotrek
> > > > > > > >>
> > > > > > > >>> On 30 Nov 2018, at 17:38, Xingcan Cui <xi...@gmail.com>
> > > > wrote:
> > > > > > > >>>
> > > > > > > >>> Hi all,
> > > > > > > >>>
> > > > > > > >>> I agree with @Becket that `cache()` and `materialize()` should
> > > > > > > >> be considered as two different methods where the latter one is
> > > > > > > >> more sophisticated.
> > > > > > > >>>
> > > > > > > >>> According to my understanding, the initial idea is just to
> > > > > introduce
> > > > > > a
> > > > > > > >> simple cache or persist mechanism, but as the TableAPI is a
> > > > > high-level
> > > > > > > API,
> > > > > > > >> it’s naturally for as to think in a SQL way.
> > > > > > > >>>
> > > > > > > >>> Maybe we can add the `cache()` method to the DataSet API
> and
> > > > force
> > > > > > > users
> > > > > > > >> to translate a Table to a Dataset before caching it. Then
> the
> > > > users
> > > > > > > should
> > > > > > > >> manually register the cached dataset to a table again (we
> may
> > > need
> > > > > > some
> > > > > > > >> table replacement mechanisms for datasets with an identical
> > > schema
> > > > > but
> > > > > > > >> different contents here). After all, it’s the dataset rather
> > > than
> > > > > the
> > > > > > > >> dynamic table that need to be cached, right?
> > > > > > > >>>
> > > > > > > >>> Best,
> > > > > > > >>> Xingcan
> > > > > > > >>>
> > > > > > > >>>> On Nov 30, 2018, at 10:57 AM, Becket Qin <
> > > becket.qin@gmail.com>
> > > > > > > wrote:
> > > > > > > >>>>
> > > > > > > >>>> Hi Piotrek and Jark,
> > > > > > > >>>>
> > > > > > > >>>> Thanks for the feedback and explanation. Those are good
> > > > arguments.
> > > > > > > But I
> > > > > > > >>>> think those arguments are mostly about materialized view.
> > Let
> > > me
> > > > > try
> > > > > > > to
> > > > > > > >>>> explain the reason I believe cache() and materialize() are
> > > > > > different.
> > > > > > > >>>>
> > > > > > > >>>> I think cache() and materialize() have quite different
> > > > > implications.
> > > > > > > An
> > > > > > > >>>> analogy I can think of is save()/publish(). When users
> call
> > > > > cache(),
> > > > > > > it
> > > > > > > >> is
> > > > > > > >>>> just like they are saving an intermediate result as a
> draft
> > of
> > > > > their
> > > > > > > >> work,
> > > > > > > >>>> this intermediate result may not have any realistic
> meaning.
> > > > > Calling
> > > > > > > >>>> cache() does not mean users want to publish the cached
> table
> > > in
> > > > > any
> > > > > > > >> manner.
> > > > > > > >>>> But when users call materialize(), that means "I have
> > > something
> > > > > > > >> meaningful
> > > > > > > >>>> to be reused by others", now users need to think about the
> > > > > > validation,
> > > > > > > >>>> update & versioning, lifecycle of the result, etc.
> > > > > > > >>>>
> > > > > > > >>>> Piotrek's suggestions on variations of the materialize()
> > > methods
> > > > > are
> > > > > > > >> very
> > > > > > > >>>> useful. It would be great if Flink have them. The concept
> of
> > > > > > > >> materialized
> > > > > > > >>>> view is actually a pretty big feature, not to say the
> > related
> > > > > stuff
> > > > > > > like
> > > > > > > >>>> triggers/hooks you mentioned earlier. I think the
> > materialized
> > > > > view
> > > > > > > >> itself
> > > > > > > >>>> should be discussed in a more thorough and systematic
> > manner.
> > > > And
> > > > > I
> > > > > > > >> found
> > > > > > > >>>> that discussion is kind of orthogonal and way beyond
> > > interactive
> > > > > > > >>>> programming experience.
> > > > > > > >>>>
> > > > > > > >>>> The example you gave was interesting. I still have some
> > > > questions,
> > > > > > > >> though.
> > > > > > > >>>>
> > > > > > > >>>> Table source = … // some source that scans files from a
> > > > directory
> > > > > > > >>>>> “/foo/bar/“
> > > > > > > >>>>> Table t1 = source.groupBy(…).select(…).where(…) ….;
> > > > > > > >>>>> Table t2 = t1.materialize() // (or `cache()`)
> > > > > > > >>>>
> > > > > > > >>>> t2.count() // initialise cache (if it’s lazily
> initialised)
> > > > > > > >>>>> int a1 = t1.count()
> > > > > > > >>>>> int b1 = t2.count()
> > > > > > > >>>>> // something in the background (or we trigger it) writes
> > new
> > > > > files
> > > > > > to
> > > > > > > >>>>> /foo/bar
> > > > > > > >>>>> int a2 = t1.count()
> > > > > > > >>>>> int b2 = t2.count()
> > > > > > > >>>>> t2.refresh() // possible future extension, not to be
> > > > implemented
> > > > > in
> > > > > > > the
> > > > > > > >>>>> initial version
> > > > > > > >>>>>
> > > > > > > >>>>
> > > > > > > >>>> what if someone else added some more files to /foo/bar at
> > this
> > > > > > point?
> > > > > > > In
> > > > > > > >>>> that case, a3 won't equals to b3, and the result become
> > > > > > > >> non-deterministic,
> > > > > > > >>>> right?
> > > > > > > >>>>
> > > > > > > >>>> int a3 = t1.count()
> > > > > > > >>>>> int b3 = t2.count()
> > > > > > > >>>>> t2.drop() // another possible future extension, manual
> > > “cache”
> > > > > > > dropping
> > > > > > > >>>>
> > > > > > > >>>>
> > > > > > > >>>> When we talk about interactive programming, in most cases,
> > we
> > > > are
> > > > > > > >> talking
> > > > > > > >>>> about batch applications. A fundamental assumption of such
> > > case
> > > > is
> > > > > > > that
> > > > > > > >> the
> > > > > > > >>>> source data is complete before the data processing begins,
> > and
> > > > the
> > > > > > > data
> > > > > > > >>>> will not change during the data processing. IMO, if
> > additional
> > > > > rows
> > > > > > > >> needs
> > > > > > > >>>> to be added to some source during the processing, it
> should
> > be
> > > > > done
> > > > > > in
> > > > > > > >> ways
> > > > > > > >>>> like union the source with another table containing the
> rows
> > > to
> > > > be
> > > > > > > >> added.
> > > > > > > >>>>
> > > > > > > >>>> There are a few cases that computations are executed
> > > repeatedly
> > > > on
> > > > > > the
> > > > > > > >>>> changing data source.
> > > > > > > >>>>
> > > > > > > >>>> For example, people may run a ML training job every hour
> > with
> > > > the
> > > > > > > >> samples
> > > > > > > >>>> newly added in the past hour. In that case, the source
> data
> > > > > between
> > > > > > > will
> > > > > > > >>>> indeed change. But still, the data remain unchanged within
> > one
> > > > > run.
> > > > > > > And
> > > > > > > >>>> usually in that case, the result will need versioning,
> i.e.
> > > for
> > > > a
> > > > > > > given
> > > > > > > >>>> result, it tells that the result is a result from the
> source
> > > > data
> > > > > > by a
> > > > > > > >>>> certain timestamp.
> > > > > > > >>>>
> > > > > > > >>>> Another example is something like data warehouse. In this
> > > case,
> > > > > > there
> > > > > > > >> are a
> > > > > > > >>>> few source of original/raw data. On top of those sources,
> > many
> > > > > > > >> materialized
> > > > > > > >>>> view / queries / reports / dashboards can be created to
> > > generate
> > > > > > > derived
> > > > > > > >>>> data. Those derived data needs to be updated when the
> > > underlying
> > > > > > > >> original
> > > > > > > >>>> data changes. In that case, the processing logic that
> > derives
> > > > the
> > > > > > > >> original
> > > > > > > >>>> data needs to be executed repeatedly to update those
> > > > > reports/views.
> > > > > > > >> Again,
> > > > > > > >>>> all those derived data also need to have version
> management,
> > > > such
> > > > > as
> > > > > > > >>>> timestamp.
> > > > > > > >>>>
> > > > > > > >>>> In any of the above two cases, during a single run of the
> > > > > processing
> > > > > > > >> logic,
> > > > > > > >>>> the data cannot change. Otherwise the behavior of the
> > > processing
> > > > > > logic
> > > > > > > >> may
> > > > > > > >>>> be undefined. In the above two examples, when writing the
> > > > > processing
> > > > > > > >> logic,
> > > > > > > >>>> Users can use .cache() to hint Flink that those results
> > should
> > > > be
> > > > > > > saved
> > > > > > > >> to
> > > > > > > >>>> avoid repeated computation. And then for the result of my
> > > > > > application
> > > > > > > >>>> logic, I'll call materialize(), so that these results
> could
> > be
> > > > > > managed
> > > > > > > >> by
> > > > > > > >>>> the system with versioning, metadata management, lifecycle
> > > > > > management,
> > > > > > > >>>> ACLs, etc.
> > > > > > > >>>>
> > > > > > > >>>> It is true we can use materialize() to do the cache() job,
> > > but I
> > > > > am
> > > > > > > >> really
> > > > > > > >>>> reluctant to shoehorn cache() into materialize() and force
> > > users
> > > > > to
> > > > > > > >> worry
> > > > > > > >>>> about a bunch of implications that they needn't have to. I
> > am
> > > > > > > >> absolutely on
> > > > > > > >>>> your side that redundant API is bad. But it is equally
> > > > > frustrating,
> > > > > > if
> > > > > > > >> not
> > > > > > > >>>> more, that the same API does different things.
> > > > > > > >>>>
> > > > > > > >>>> Thanks,
> > > > > > > >>>>
> > > > > > > >>>> Jiangjie (Becket) Qin
> > > > > > > >>>>
> > > > > > > >>>>
> > > > > > > >>>> On Fri, Nov 30, 2018 at 10:34 PM Shaoxuan Wang <
> > > > > wshaoxuan@gmail.com
> > > > > > >
> > > > > > > >> wrote:
> > > > > > > >>>>
> > > > > > > >>>>> Thanks Piotrek,
> > > > > > > >>>>> You provided a very good example, it explains all the
> > > > confusions
> > > > > I
> > > > > > > >> have.
> > > > > > > >>>>> It is clear that there is something we have not
> considered
> > in
> > > > the
> > > > > > > >> initial
> > > > > > > >>>>> proposal. We intend to force the user to reuse the
> > > > > > > cached/materialized
> > > > > > > >>>>> table, if its cache() method is executed. We did not
> expect
> > > > that
> > > > > > user
> > > > > > > >> may
> > > > > > > >>>>> want to re-executed the plan from the source table. Let
> me
> > > > > re-think
> > > > > > > >> about
> > > > > > > >>>>> it and get back to you later.
> > > > > > > >>>>>
> > > > > > > >>>>> In the meanwhile, this example/observation also infers
> that
> > > we
> > > > > > cannot
> > > > > > > >> fully
> > > > > > > >>>>> involve the optimizer to decide the plan if a
> > > cache/materialize
> > > > > is
> > > > > > > >>>>> explicitly used, because weather to reuse the cache data
> or
> > > > > > > re-execute
> > > > > > > >> the
> > > > > > > >>>>> query from source data may lead to different results.
> (But
> > I
> > > > > guess
> > > > > > > >>>>> optimizer can still help in some cases ---- as long as it
> > > does
> > > > > not
> > > > > > > >>>>> re-execute from the varied source, we should be safe).
> > > > > > > >>>>>
> > > > > > > >>>>> Regards,
> > > > > > > >>>>> Shaoxuan
> > > > > > > >>>>>
> > > > > > > >>>>>
> > > > > > > >>>>>
> > > > > > > >>>>> On Fri, Nov 30, 2018 at 9:13 PM Piotr Nowojski <
> > > > > > > >> piotr@data-artisans.com>
> > > > > > > >>>>> wrote:
> > > > > > > >>>>>
> > > > > > > >>>>>> Hi Shaoxuan,
> > > > > > > >>>>>>
> > > > > > > >>>>>> Re 2:
> > > > > > > >>>>>>
> > > > > > > >>>>>>> Table t3 = methodThatAppliesOperators(t1) // t1 is
> > modified
> > > > > to->
> > > > > > > t1’
> > > > > > > >>>>>>
> > > > > > > >>>>>> What do you mean that “ t1 is modified to-> t1’ ” ? That
> > > > > > > >>>>>> `methodThatAppliesOperators()` method has changed it’s
> > plan?
> > > > > > > >>>>>>
> > > > > > > >>>>>> I was thinking more about something like this:
> > > > > > > >>>>>>
> > > > > > > >>>>>> Table source = … // some source that scans files from a
> > > > > directory
> > > > > > > >>>>>> “/foo/bar/“
> > > > > > > >>>>>> Table t1 = source.groupBy(…).select(…).where(…) ….;
> > > > > > > >>>>>> Table t2 = t1.materialize() // (or `cache()`)
> > > > > > > >>>>>>
> > > > > > > >>>>>> t2.count() // initialise cache (if it’s lazily
> > initialised)
> > > > > > > >>>>>>
> > > > > > > >>>>>> int a1 = t1.count()
> > > > > > > >>>>>> int b1 = t2.count()
> > > > > > > >>>>>>
> > > > > > > >>>>>> // something in the background (or we trigger it) writes
> > new
> > > > > files
> > > > > > > to
> > > > > > > >>>>>> /foo/bar
> > > > > > > >>>>>>
> > > > > > > >>>>>> int a2 = t1.count()
> > > > > > > >>>>>> int b2 = t2.count()
> > > > > > > >>>>>>
> > > > > > > >>>>>> t2.refresh() // possible future extension, not to be
> > > > implemented
> > > > > > in
> > > > > > > >> the
> > > > > > > >>>>>> initial version
> > > > > > > >>>>>>
> > > > > > > >>>>>> int a3 = t1.count()
> > > > > > > >>>>>> int b3 = t2.count()
> > > > > > > >>>>>>
> > > > > > > >>>>>> t2.drop() // another possible future extension, manual
> > > “cache”
> > > > > > > >> dropping
> > > > > > > >>>>>>
> > > > > > > >>>>>> assertTrue(a1 == b1) // same results, but b1 comes from
> > the
> > > > > > “cache"
> > > > > > > >>>>>> assertTrue(b1 == b2) // both values come from the same
> > cache
> > > > > > > >>>>>> assertTrue(a2 > b2) // b2 comes from cache, a2
> re-executed
> > > > full
> > > > > > > table
> > > > > > > >>>>> scan
> > > > > > > >>>>>> and has more data
> > > > > > > >>>>>> assertTrue(b3 > b2) // b3 comes from refreshed cache
> > > > > > > >>>>>> assertTrue(b3 == a2 == a3)
> > > > > > > >>>>>>
> > > > > > > >>>>>> Piotrek
> > > > > > > >>>>>>
> > > > > > > >>>>>>> On 30 Nov 2018, at 10:22, Jark Wu <im...@gmail.com>
> > > wrote:
> > > > > > > >>>>>>>
> > > > > > > >>>>>>> Hi,
> > > > > > > >>>>>>>
> > > > > > > >>>>>>> It is a very interesting and useful design!
> > > > > > > >>>>>>>
> > > > > > > >>>>>>> Here I want to share some of my thoughts:
> > > > > > > >>>>>>>
> > > > > > > >>>>>>> 1. Agree with that cache() method should return some
> > Table
> > > to
> > > > > > avoid
> > > > > > > >>>>> some
> > > > > > > >>>>>>> unexpected problems because of the mutable object.
> > > > > > > >>>>>>> All the existing methods of Table are returning a new
> > Table
> > > > > > > instance.
> > > > > > > >>>>>>>
> > > > > > > >>>>>>> 2. I think materialize() would be more consistent with
> > SQL,
> > > > > this
> > > > > > > >> makes
> > > > > > > >>>>> it
> > > > > > > >>>>>>> possible to support the same feature for SQL
> (materialize
> > > > view)
> > > > > > and
> > > > > > > >>>>> keep
> > > > > > > >>>>>>> the same API for users in the future.
> > > > > > > >>>>>>> But I'm also fine if we choose cache().
> > > > > > > >>>>>>>
> > > > > > > >>>>>>> 3. In the proposal, a TableService (or FlinkService?)
> is
> > > used
> > > > > to
> > > > > > > >> cache
> > > > > > > >>>>>> the
> > > > > > > >>>>>>> result of the (intermediate) table.
> > > > > > > >>>>>>> But the name of TableService may be a bit general which
> > is
> > > > not
> > > > > > > quite
> > > > > > > >>>>>>> understanding correctly in the first glance (a
> metastore
> > > for
> > > > > > > >> tables?).
> > > > > > > >>>>>>> Maybe a more specific name would be better, such as
> > > > > > > >>>>>>> TableCacheService or TableMaterializeService or something else.
> > > > > > > >>>>>>>
> > > > > > > >>>>>>> Best,
> > > > > > > >>>>>>> Jark
> > > > > > > >>>>>>>
> > > > > > > >>>>>>>
> > > > > > > >>>>>>> On Thu, 29 Nov 2018 at 21:16, Fabian Hueske <
> > > > fhueske@gmail.com
> > > > > >
> > > > > > > >> wrote:
> > > > > > > >>>>>>>
> > > > > > > >>>>>>>> Hi,
> > > > > > > >>>>>>>>
> > > > > > > >>>>>>>> Thanks for the clarification Becket!
> > > > > > > >>>>>>>>
> > > > > > > >>>>>>>> I have a few thoughts to share / questions:
> > > > > > > >>>>>>>>
> > > > > > > >>>>>>>> 1) I'd like to know how you plan to implement the
> > feature
> > > > on a
> > > > > > > plan
> > > > > > > >> /
> > > > > > > >>>>>>>> planner level.
> > > > > > > >>>>>>>>
> > > > > > > >>>>>>>> I would imagine the following to happen when Table.cache()
> > > > > > > >>>>>>>> is called:
> > > > > > > >>>>>>>>
> > > > > > > >>>>>>>> 1) immediately optimize the Table and internally
> convert
> > > it
> > > > > > into a
> > > > > > > >>>>>>>> DataSet/DataStream. This is necessary, to avoid that
> > > > operators
> > > > > > of
> > > > > > > >>>>> later
> > > > > > > >>>>>>>> queries on top of the Table are pushed down.
> > > > > > > >>>>>>>> 2) register the DataSet/DataStream as a
> > > > > > DataSet/DataStream-backed
> > > > > > > >>>>> Table
> > > > > > > >>>>>> X
> > > > > > > >>>>>>>> 3) add a sink to the DataSet/DataStream. This is the
> > > > > > > materialization
> > > > > > > >>>>> of
> > > > > > > >>>>>> the
> > > > > > > >>>>>>>> Table X
> > > > > > > >>>>>>>>
> > > > > > > >>>>>>>> Based on your proposal the following would happen:
> > > > > > > >>>>>>>>
> > > > > > > >>>>>>>> Table t1 = ....
> > > > > > > >>>>>>>> t1.cache(); // cache() returns void. The logical plan
> of
> > > t1
> > > > is
> > > > > > > >>>>> replaced
> > > > > > > >>>>>> by
> > > > > > > >>>>>>>> a scan of X. There is also a reference to the
> > > > materialization
> > > > > of
> > > > > > > X.
> > > > > > > >>>>>>>>
> > > > > > > >>>>>>>> t1.count(); // this executes the program, including
> the
> > > > > > > >>>>>> DataSet/DataStream
> > > > > > > >>>>>>>> that backs X and the sink that writes the
> > materialization
> > > > of X
> > > > > > > >>>>>>>> t1.count(); // this executes the program, but reads X
> > from
> > > > the
> > > > > > > >>>>>>>> materialization.
> > > > > > > >>>>>>>>
> > > > > > > >>>>>>>> My question is, how do you determine when whether the
> > scan
> > > > of
> > > > > t1
> > > > > > > >>>>> should
> > > > > > > >>>>>> go
> > > > > > > >>>>>>>> against the DataSet/DataStream program and when
> against
> > > the
> > > > > > > >>>>>>>> materialization?
> > > > > > > >>>>>>>> AFAIK, there is no hook that will tell you that a part
> > of
> > > > the
> > > > > > > >> program
> > > > > > > >>>>>> was
> > > > > > > >>>>>>>> executed. Flipping a switch during optimization or
> plan
> > > > > > generation
> > > > > > > >> is
> > > > > > > >>>>>> not
> > > > > > > >>>>>>>> sufficient as there is no guarantee that the plan is
> > also
> > > > > > > executed.
> > > > > > > >>>>>>>>
> > > > > > > >>>>>>>> Overall, this behavior is somewhat similar to what I
> > > > proposed
> > > > > in
> > > > > > > >>>>>>>> FLINK-8950, which does not include persisting the
> table,
> > > but
> > > > > > just
> > > > > > > >>>>>>>> optimizing and reregistering it as DataSet/DataStream
> > > scan.
> > > > > > > >>>>>>>>
> > > > > > > >>>>>>>> 2) I think Piotr has a point about the implicit
> behavior
> > > and
> > > > > > side
> > > > > > > >>>>>> effects
> > > > > > > >>>>>>>> of the cache() method if it does not return anything.
> > > > > > > >>>>>>>> Consider the following example:
> > > > > > > >>>>>>>>
> > > > > > > >>>>>>>> Table t1 = ???
> > > > > > > >>>>>>>> Table t2 = methodThatAppliesOperators(t1);
> > > > > > > >>>>>>>> Table t3 = methodThatAppliesOtherOperators(t1);
> > > > > > > >>>>>>>>
> > > > > > > >>>>>>>> In this case, the behavior/performance of the plan
> that
> > > > > results
> > > > > > > from
> > > > > > > >>>>> the
> > > > > > > >>>>>>>> second method call depends on whether t1 was modified
> by
> > > the
> > > > > > first
> > > > > > > >>>>>> method
> > > > > > > >>>>>>>> or not.
> > > > > > > >>>>>>>> This is the classic issue of mutable vs. immutable
> > > objects.
> > > > > > > >>>>>>>> Also, as Piotr pointed out, it might also be good to
> > have
> > > > the
> > > > > > > >> original
> > > > > > > >>>>>> plan
> > > > > > > >>>>>>>> of t1, because in some cases it is possible to push
> > > filters
> > > > > down
> > > > > > > >> such
> > > > > > > >>>>>> that
> > > > > > > >>>>>>>> evaluating the query from scratch might be more
> > efficient
> > > > than
> > > > > > > >>>>> accessing
> > > > > > > >>>>>>>> the cache.
> > > > > > > >>>>>>>> Moreover, a CachedTable could extend Table() and
> offer a
> > > > > method
> > > > > > > >>>>>> refresh().
> > > > > > > >>>>>>>> This sounds quite useful in an interactive session
> mode.
> > > > > > > >>>>>>>>
> > > > > > > >>>>>>>> 3) Regarding the name, I can see both arguments. IMO,
> > > > > > > materialize()
> > > > > > > >>>>>> seems
> > > > > > > >>>>>>>> to be more future proof.
> > > > > > > >>>>>>>>
> > > > > > > >>>>>>>> Best, Fabian
> > > > > > > >>>>>>>>
> > > > > > > >>>>>>>> On Thu, Nov 29, 2018 at 12:56 PM Shaoxuan Wang <
> > > > > > > >>>>>>>> wshaoxuan@gmail.com> wrote:
> > > > > > > >>>>>>>>
> > > > > > > >>>>>>>>> Hi Piotr,
> > > > > > > >>>>>>>>>
> > > > > > > >>>>>>>>> Thanks for sharing your ideas on the method naming.
> We
> > > will
> > > > > > think
> > > > > > > >>>>> about
> > > > > > > >>>>>>>>> your suggestions. But I don't understand why we need
> to
> > > > > change
> > > > > > > the
> > > > > > > >>>>>> return
> > > > > > > >>>>>>>>> type of cache().
> > > > > > > >>>>>>>>>
> > > > > > > >>>>>>>>> Cache() is a physical operation, it does not change
> the
> > > > logic
> > > > > > of
> > > > > > > >>>>>>>>> the `Table`. On the tableAPI layer, we should not
> > > > introduce a
> > > > > > new
> > > > > > > >>>>> table
> > > > > > > >>>>>>>>> type unless the logic of table has been changed. If
> we
> > > > > > introduce
> > > > > > > a
> > > > > > > >>>>> new
> > > > > > > >>>>>>>>> table type `CachedTable`, we need create the same set
> > of
> > > > > > methods
> > > > > > > of
> > > > > > > >>>>>>>> `Table`
> > > > > > > >>>>>>>>> for it. I don't think it is worth doing this. Or can
> > you
> > > > > please
> > > > > > > >>>>>> elaborate
> > > > > > > >>>>>>>>> more on what could be the "implicit behaviours/side
> > > > effects"
> > > > > > you
> > > > > > > >> are
> > > > > > > >>>>>>>>> thinking about?
> > > > > > > >>>>>>>>>
> > > > > > > >>>>>>>>> Regards,
> > > > > > > >>>>>>>>> Shaoxuan
> > > > > > > >>>>>>>>>
> > > > > > > >>>>>>>>>
> > > > > > > >>>>>>>>>
> > > > > > > >>>>>>>>> On Thu, Nov 29, 2018 at 7:05 PM Piotr Nowojski <
> > > > > > > >>>>>> piotr@data-artisans.com>
> > > > > > > >>>>>>>>> wrote:
> > > > > > > >>>>>>>>>
> > > > > > > >>>>>>>>>> Hi Becket,
> > > > > > > >>>>>>>>>>
> > > > > > > >>>>>>>>>> Thanks for the response.
> > > > > > > >>>>>>>>>>
> > > > > > > >>>>>>>>>> 1. I wasn’t saying that materialised view must be
> > > mutable
> > > > or
> > > > > > > not.
> > > > > > > >>>>> The
> > > > > > > >>>>>>>>> same
> > > > > > > >>>>>>>>>> thing applies to caches as well. To the contrary, I
> > > would
> > > > > > expect
> > > > > > > >>>>> more
> > > > > > > >>>>>>>>>> consistency and updates from something that is
> called
> > > > > “cache”
> > > > > > vs
> > > > > > > >>>>>>>>> something
> > > > > > > >>>>>>>>>> that’s a “materialised view”. In other words, IMO
> most
> > > > > caches
> > > > > > do
> > > > > > > >> not
> > > > > > > >>>>>>>>> serve
> > > > > > > >>>>>>>>>> you invalid/outdated data and they handle updates on
> > > their
> > > > > > own.
> > > > > > > >>>>>>>>>>
> > > > > > > >>>>>>>>>> 2. I don’t think that having in the future two very
> > > > similar
> > > > > > > >> concepts
> > > > > > > >>>>>> of
> > > > > > > >>>>>>>>>> `materialized` view and `cache` is a good idea. It
> > would
> > > > be
> > > > > > > >>>>> confusing
> > > > > > > >>>>>>>> for
> > > > > > > >>>>>>>>>> the users. I think it could be handled by
> > > > > > variations/overloading
> > > > > > > >> of
> > > > > > > >>>>>>>>>> materialised view concept. We could start with:
> > > > > > > >>>>>>>>>>
> > > > > > > >>>>>>>>>> `MaterializedTable materialize()` - immutable,
> session
> > > > life
> > > > > > > scope
> > > > > > > >>>>>>>>>> (basically the same semantic as you are proposing
> > > > > > > >>>>>>>>>>
> > > > > > > >>>>>>>>>> And then in the future (if ever) build on top of
> > > > that/expand
> > > > > > it
> > > > > > > >>>>> with:
> > > > > > > >>>>>>>>>>
> > > > > > > >>>>>>>>>> `MaterializedTable materialize(refreshTime=…)` or
> > > > > > > >> `MaterializedTable
> > > > > > > >>>>>>>>>> materialize(refreshHook=…)`
> > > > > > > >>>>>>>>>>
> > > > > > > >>>>>>>>>> Or with cross session support:
> > > > > > > >>>>>>>>>>
> > > > > > > >>>>>>>>>> `MaterializedTable materializeInto(connector=…)` or
> > > > > > > >>>>> `MaterializedTable
> > > > > > > >>>>>>>>>> materializeInto(tableFactory=…)`
> > > > > > > >>>>>>>>>>
> > > > > > > >>>>>>>>>> I’m not saying that we should implement cross
> > > > > > session/refreshing
> > > > > > > >> now
> > > > > > > >>>>>> or
> > > > > > > >>>>>>>>>> even in the near future. I’m just arguing that
> naming
> > > > > current
> > > > > > > >>>>>> immutable
> > > > > > > >>>>>>>>>> session life scope method `materialize()` is more
> > future
> > > > > proof
> > > > > > > and
> > > > > > > >>>>>> more
> > > > > > > >>>>>>>>>> consistent with SQL (on which after all table-api is
> > > > heavily
> > > > > > > >> basing
> > > > > > > >>>>>>>> on).
> > > > > > > >>>>>>>>>>
> > > > > > > >>>>>>>>>> 3. Even if we agree on naming it `cache()`, I would
> > > still
> > > > > > insist
> > > > > > > >> on
> > > > > > > >>>>>>>>>> `cache()` returning `CachedTable` handle to avoid
> > > implicit
> > > > > > > >>>>>>>>> behaviours/side
> > > > > > > >>>>>>>>>> effects and to give both us & users more
> flexibility.
> > > > > > > >>>>>>>>>>
> > > > > > > >>>>>>>>>> Piotrek
> > > > > > > >>>>>>>>>>
> > > > > > > >>>>>>>>>>> On 29 Nov 2018, at 06:20, Becket Qin <
> > > > becket.qin@gmail.com
> > > > > >
> > > > > > > >> wrote:
> > > > > > > >>>>>>>>>>>
> > > > > > > >>>>>>>>>>> Just to add a little bit, the materialized view is
> > > > probably
> > > > > > > more
> > > > > > > >>>>>>>>> similar
> > > > > > > >>>>>>>>>> to
> > > > > > > >>>>>>>>>>> the persistent() brought up earlier in the thread.
> So
> > > it
> > > > is
> > > > > > > >> usually
> > > > > > > >>>>>>>>> cross
> > > > > > > >>>>>>>>>>> session and could be used in a larger scope. For
> > > > example, a
> > > > > > > >>>>>>>>> materialized
> > > > > > > >>>>>>>>>>> view created by user A may be visible to user B. It
> > is
> > > > > > probably
> > > > > > > >>>>>>>>> something
> > > > > > > >>>>>>>>>>> we want to have in the future. I'll put it in the
> > > future
> > > > > work
> > > > > > > >>>>>>>> section.
> > > > > > > >>>>>>>>>>>
> > > > > > > >>>>>>>>>>> Thanks,
> > > > > > > >>>>>>>>>>>
> > > > > > > >>>>>>>>>>> Jiangjie (Becket) Qin
> > > > > > > >>>>>>>>>>>
> > > > > > > >>>>>>>>>>> On Thu, Nov 29, 2018 at 9:47 AM Becket Qin <
> > > > > > > becket.qin@gmail.com
> > > > > > > >>>
> > > > > > > >>>>>>>>> wrote:
> > > > > > > >>>>>>>>>>>
> > > > > > > >>>>>>>>>>>> Hi Piotrek,
> > > > > > > >>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>> Thanks for the explanation.
> > > > > > > >>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>> Right now we are mostly thinking of the cached
> table
> > > as
> > > > > > > >>>>> immutable. I
> > > > > > > >>>>>>>>> can
> > > > > > > >>>>>>>>>>>> see the Materialized view would be useful in the
> > > future.
> > > > > > That
> > > > > > > >>>>> said,
> > > > > > > >>>>>>>> I
> > > > > > > >>>>>>>>>> think
> > > > > > > >>>>>>>>>>>> a simple cache mechanism is probably still needed.
> > So
> > > to
> > > > > me,
> > > > > > > >>>>> cache()
> > > > > > > >>>>>>>>> and
> > > > > > > >>>>>>>>>>>> materialize() should be two separate method as
> they
> > > > > address
> > > > > > > >>>>>>>> different
> > > > > > > >>>>>>>>>>>> needs. Materialize() is a higher level concept
> > usually
> > > > > > > implying
> > > > > > > >>>>>>>>>> periodical
> > > > > > > >>>>>>>>>>>> update, while cache() has much simpler semantic.
> For
> > > > > > example,
> > > > > > > >> one
> > > > > > > >>>>>>>> may
> > > > > > > >>>>>>>>>>>> create a materialized view and use cache() method
> in
> > > the
> > > > > > > >>>>>>>> materialized
> > > > > > > >>>>>>>>>> view
> > > > > > > >>>>>>>>>>>> creation logic. So that during the materialized
> view
> > > > > update,
> > > > > > > >> they
> > > > > > > >>>>> do
> > > > > > > >>>>>>>>> not
> > > > > > > >>>>>>>>>>>> need to worry about the case that the cached table
> > is
> > > > also
> > > > > > > >>>>> changed.
> > > > > > > >>>>>>>>>> Maybe
> > > > > > > >>>>>>>>>>>> under the hood, materialized() and cache() could
> > share
> > > > > some
> > > > > > > >>>>>>>> mechanism,
> > > > > > > >>>>>>>>>> but
> > > > > > > >>>>>>>>>>>> I think a simple cache() method would be handy in
> a
> > > lot
> > > > of
> > > > > > > >> cases.
> > > > > > > >>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>> Thanks,
> > > > > > > >>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>> Jiangjie (Becket) Qin
> > > > > > > >>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>> On Mon, Nov 26, 2018 at 9:38 PM Piotr Nowojski <
> > > > > > > >>>>>>>>> piotr@data-artisans.com
> > > > > > > >>>>>>>>>>>
> > > > > > > >>>>>>>>>>>> wrote:
> > > > > > > >>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>> Hi Becket,
> > > > > > > >>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>> Is there any extra thing user can do on a
> > > > > > MaterializedTable
> > > > > > > >> that
> > > > > > > >>>>>>>>> they
> > > > > > > >>>>>>>>>>>>> cannot do on a Table?
> > > > > > > >>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>> Maybe not in the initial implementation, but
> > various
> > > > DBs
> > > > > > > offer
> > > > > > > >>>>>>>>>> different
> > > > > > > >>>>>>>>>>>>> ways to “refresh” the materialised view. Hooks,
> > > > triggers,
> > > > > > > >> timers,
> > > > > > > >>>>>>>>>> manually
> > > > > > > >>>>>>>>>>>>> etc. Having `MaterializedTable` would help us to
> > > handle
> > > > > > that
> > > > > > > in
> > > > > > > >>>>> the
> > > > > > > >>>>>>>>>> future.
> > > > > > > >>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>> After users call *table.cache(), *users can just
> > use
> > > > > that
> > > > > > > >> table
> > > > > > > >>>>>>>> and
> > > > > > > >>>>>>>>> do
> > > > > > > >>>>>>>>>>>>> anything that is supported on a Table, including
> > SQL.
> > > > > > > >>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>> This is some implicit behaviour with side
> effects.
> > > > > Imagine
> > > > > > if
> > > > > > > >>>>> user
> > > > > > > >>>>>>>>> has
> > > > > > > >>>>>>>>>> a
> > > > > > > >>>>>>>>>>>>> long and complicated program, that touches table
> > `b`
> > > > > > multiple
> > > > > > > >>>>>>>> times,
> > > > > > > >>>>>>>>>> maybe
> > > > > > > >>>>>>>>>>>>> scattered around different methods. If he
> modifies
> > > his
> > > > > > > program
> > > > > > > >> by
> > > > > > > >>>>>>>>>> inserting
> > > > > > > >>>>>>>>>>>>> in one place
> > > > > > > >>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>> b.cache()
> > > > > > > >>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>> This implicitly alters the semantic and behaviour
> > of
> > > > his
> > > > > > code
> > > > > > > >> all
> > > > > > > >>>>>>>>> over
> > > > > > > >>>>>>>>>>>>> the place, maybe in a ways that might cause
> > problems.
> > > > For
> > > > > > > >> example
> > > > > > > >>>>>>>>> what
> > > > > > > >>>>>>>>>> if
> > > > > > > >>>>>>>>>>>>> underlying data is changing?
> > > > > > > >>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>> Having invisible side effects is also not very
> > clean,
> > > > for
> > > > > > > >> example
> > > > > > > >>>>>>>>> think
> > > > > > > >>>>>>>>>>>>> about something like this (but more complicated):
> > > > > > > >>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>> Table b = ...;
> > > > > > > >>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>> If (some_condition) {
> > > > > > > >>>>>>>>>>>>> processTable1(b)
> > > > > > > >>>>>>>>>>>>> }
> > > > > > > >>>>>>>>>>>>> else {
> > > > > > > >>>>>>>>>>>>> processTable2(b)
> > > > > > > >>>>>>>>>>>>> }
> > > > > > > >>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>> // do more stuff with b
> > > > > > > >>>>>>>>>>>>>

Re: [DISCUSS] Support Interactive Programming in Flink Table API

Posted by Jark Wu <im...@gmail.com>.
Hi,

All the recent discussions have focused on whether there is a problem if
cache() does not return a Table.
It seems that returning a Table explicitly is clearer (and safer?).

So, are there any problems if cache() returns a Table? @Becket
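
To make the two options concrete, here is a minimal sketch of the difference
(tEnv is a TableEnvironment; CachedTable is hypothetical, since neither
variant exists yet):

// Option 1: void cache() -- a side effect on the existing table
Table a = tEnv.scan("src").select("f1, f2");
a.cache();                           // later reads of `a` may silently hit the cache
Table x = a.filter("f1 > 10");       // implicitly reads from the cache

// Option 2: CachedTable cache() -- an explicit handle
CachedTable cachedA = a.cache();
Table y = cachedA.filter("f1 > 10"); // explicitly reads from the cache
Table z = a.filter("f1 > 10");       // still reads from the original DAG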

Best,
Jark

On Tue, 4 Dec 2018 at 22:27, Till Rohrmann <tr...@apache.org> wrote:

> It's true that b, c, d and e will all read from the original DAG that
> generates a. But all subsequent operators (when running multiple queries)
> which reference cachedTableA should not need to reproduce `a` but can
> directly consume the intermediate result.
>
> Conceptually one could think of cache() as introducing a caching operator
> from which you need to consume if you want to benefit from the caching
> functionality.
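>
> As a rough illustration of that model across two queries (CachedTable and
> cache() are the proposed API; this sketch assumes a batch TableEnvironment
> and Row results):
>
> CachedTable cachedA = a.cache();
> // query 1: triggers a job and populates the cache
> tEnv.toDataSet(cachedA.groupBy("f1").select("f1, f2.sum"), Row.class).print();
> // query 2: consumes the cached intermediate result instead of re-running `a`
> tEnv.toDataSet(cachedA.filter("f2 > 0"), Row.class).print();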
>
> I agree, ideally the optimizer would make this kind of decision about which
> intermediate result to cache. But especially when executing ad-hoc queries
> the user might know better which results need to be cached, because Flink
> might not see the full DAG. In that sense, I would consider the cache()
> method a hint for the optimizer. Of course, in the future we might add
> functionality which tries to automatically cache results (e.g. caching the
> latest intermediate results until so and so much space is used). But this
> should hopefully not conflict with `CachedTable cache()`.
>
> Cheers,
> Till
>
> On Tue, Dec 4, 2018 at 2:33 PM Becket Qin <be...@gmail.com> wrote:
>
> > Hi Till,
> >
> > Thanks for the clarification. I am still a little confused.
> >
> > If cache() returns a CachedTable, the example might become:
> >
> > b = a.map(...)
> > c = a.map(...)
> >
> > cachedTableA = a.cache()
> > d = cachedTableA.map(...)
> > e = a.map(...)
> >
> > In the above case, if cache() is lazily evaluated, b, c, d and e are all
> > going to read from the original DAG that generates a. But with a naive
> > expectation, d should be reading from the cache. This does not seem to
> > solve the potential confusion you raised, right?
> >
> > Just to be clear, my understanding is all based on the assumption that
> > the tables are immutable. Therefore, after a.cache(), the *cachedTableA*
> > and the original table *a* should be completely interchangeable.
> >
> > That said, I think a valid argument is optimization. There are indeed
> > cases where reading from the original DAG could be faster than reading
> > from the cache. For example:
> >
> > a = source.filter('f1 > 100)
> > a.cache()
> > b = a.filter('f1 < 100)
> >
> > Ideally the optimizer should be intelligent enough to decide which way is
> > faster, without user intervention. In this case, it would identify that b
> > is just an empty table, and thus skip reading from the cache completely.
> > But I agree that returning a CachedTable would give the user control over
> > when to use the cache, even though I still feel that letting the
> > optimizer handle this is a better option in the long run.
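> >
> > To make that control concrete (CachedTable is the proposed, not yet
> > existing, type), the user could pick either path explicitly:
> >
> > CachedTable cachedA = a.cache();
> > Table fromCache = cachedA.filter("f1 < 100"); // always reads the cache
> > Table fromSource = a.filter("f1 < 100");      // always re-runs the original DAG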
> >
> > Thanks,
> >
> > Jiangjie (Becket) Qin
> >
> >
> >
> >
> > On Tue, Dec 4, 2018 at 6:51 PM Till Rohrmann <tr...@apache.org>
> wrote:
> >
> > > Yes, you are right, Becket, that it still depends on the actual
> > > execution of the job whether a consumer reads from a cached result or
> > > not.
> > >
> > > My point was actually about the properties of a (cached vs. non-cached)
> > > and not about the execution. I would not make cache() trigger the
> > > execution of the job, because one loses some flexibility by eagerly
> > > triggering the execution.
> > >
> > > I tried to argue for an explicit CachedTable which is returned by the
> > > cache() method, like Piotr did, in order to make the API more explicit.
> > >
> > > Cheers,
> > > Till
> > >
> > > On Mon, Dec 3, 2018 at 4:23 PM Becket Qin <be...@gmail.com>
> wrote:
> > >
> > > > Hi Till,
> > > >
> > > > That is a good example. Just a minor correction: in this case, b, c
> > > > and d will all consume from a non-cached a. This is because the cache
> > > > will only be created on the very first job submission that generates
> > > > the table to be cached.
> > > >
> > > > If I understand correctly, this example is about whether the .cache()
> > > > method should be eagerly evaluated or lazily evaluated. In other
> > > > words, if the cache() method actually triggers a job that creates the
> > > > cache, there will be no such confusion. Is that right?
> > > >
> > > > In the example, although d will not consume from the cached Table
> > > > while it looks like it is supposed to, from a correctness perspective
> > > > the code will still return the correct result, assuming that tables
> > > > are immutable.
> > > >
> > > > Personally I feel it is OK, because users probably won't really worry
> > > > about whether the table is cached or not. And a lazy cache could avoid
> > > > some unnecessary caching if the cached table is never actually used in
> > > > the user application. But I am not opposed to doing eager evaluation
> > > > of the cache.
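> > > >
> > > > As a rough sketch, eager evaluation would mean something like the
> > > > following (the blocking semantics are an assumption):
> > > >
> > > > CachedTable cachedA = a.cache(); // submits a job and blocks until the
> > > >                                  // intermediate result is materialized
> > > > Table d = cachedA.map(...);      // guaranteed to read from the cache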
> > > >
> > > > Thanks,
> > > >
> > > > Jiangjie (Becket) Qin
> > > >
> > > >
> > > >
> > > > On Mon, Dec 3, 2018 at 10:01 PM Till Rohrmann <tr...@apache.org>
> > > > wrote:
> > > >
> > > > > Another argument for Piotr's point is that lazily changing
> > > > > properties of a node affects all downstream consumers but does not
> > > > > necessarily have to happen before these consumers are defined. From a
> > > > > user's perspective this can be quite confusing:
> > > > >
> > > > > b = a.map(...)
> > > > > c = a.map(...)
> > > > >
> > > > > a.cache()
> > > > > d = a.map(...)
> > > > >
> > > > > now b, c and d will consume from a cached operator. In this case, the
> > > > > user would most likely expect that only d reads from a cached result.
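> > > > >
> > > > > With an explicit handle, the code would state that directly (a sketch
> > > > > in the same pseudocode, assuming the proposed CachedTable type):
> > > > >
> > > > > b = a.map(...)
> > > > > c = a.map(...)
> > > > > cachedA = a.cache()
> > > > > d = cachedA.map(...) // only d is declared to read from the cache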
> > > > >
> > > > > Cheers,
> > > > > Till
> > > > >
> > > > > On Mon, Dec 3, 2018 at 11:32 AM Piotr Nowojski
> > > > > <piotr@data-artisans.com> wrote:
> > > > >
> > > > > > Hey Shaoxuan and Becket,
> > > > > >
> > > > > > > Can you explain a bit more on what the side effects are? So far
> > > > > > > my understanding is that such side effects only exist if a table
> > > > > > > is mutable. Is that the case?
> > > > > >
> > > > > > Not only that. There are also performance implications, and those
> > > > > > are another implicit side effect of using `void cache()`. As I wrote
> > > > > > before, reading from the cache might not always be desirable, thus
> > > > > > it can cause performance degradation, and I'm fine with that - the
> > > > > > user's or the optimiser's choice. What I do not like is that this
> > > > > > implicit side effect can manifest in a completely different part of
> > > > > > the code that wasn't touched by the user while he was adding the
> > > > > > `void cache()` call somewhere else. And even if caching improves
> > > > > > performance, it's still a side effect of `void cache()`. Almost by
> > > > > > definition, `void` methods have only side effects. As I wrote
> > > > > > before, there are a couple of scenarios where this might be
> > > > > > undesirable and/or unexpected, for example:
> > > > > >
> > > > > > 1.
> > > > > > Table b = …;
> > > > > > b.cache()
> > > > > > x = b.join(…)
> > > > > > y = b.count()
> > > > > > // ...
> > > > > > // 100
> > > > > > // hundred
> > > > > > // lines
> > > > > > // of
> > > > > > // code
> > > > > > // later
> > > > > > z = b.filter(…).groupBy(…) // this might even be hidden in a
> > > > > > // different method/file/package/dependency
> > > > > >
> > > > > > 2.
> > > > > >
> > > > > > Table b = ...
> > > > > > if (some_condition) {
> > > > > >   foo(b)
> > > > > > } else {
> > > > > >   bar(b)
> > > > > > }
> > > > > > z = b.filter(…).groupBy(…)
> > > > > >
> > > > > > void foo(Table b) {
> > > > > >   b.cache()
> > > > > >   // do something with b
> > > > > > }
> > > > > >
> > > > > > In both examples above, `b.cache()` will implicitly affect `z =
> > > > > > b.filter(…).groupBy(…)` (both the semantics of the program, if the
> > > > > > sources are mutable, and its performance), which might be far from
> > > > > > obvious.
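> > > > > >
> > > > > > With a returned handle, the dependency would be visible at the use
> > > > > > site (a sketch in the same pseudocode, assuming the proposed
> > > > > > CachedTable type):
> > > > > >
> > > > > > CachedTable cachedB = b.cache()
> > > > > > x = cachedB.join(…)        // explicitly reads the cache
> > > > > > z = b.filter(…).groupBy(…) // unambiguously reads the original b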
> > > > > >
> > > > > > On top of that, there is still my argument that having a
> > > > > > `MaterializedTable` or `CachedTable` handle is more flexible for us
> > > > > > in the future and for the user (as a manual option to bypass cache
> > > > > > reads).
> > > > > >
> > > > > > > But Jiangjie is correct, the source table in batching should be
> > > > > > > immutable. It is the user's responsibility to ensure it, otherwise
> > > > > > > even a regular failover may lead to inconsistent results.
> > > > > >
> > > > > > Yes, I agree that's what a perfect world/good deployment should be.
> > > > > > But it often isn't, and while I'm not trying to fix this (since the
> > > > > > proper fix is to support transactions), I'm just trying to minimise
> > > > > > confusion for the users that are not fully aware of what's going on
> > > > > > and operate in a less than perfect setup. And if something bites
> > > > > > them after adding a `b.cache()` call, I want to make sure that they
> > > > > > at least know all of the places that adding this line can affect.
> > > > > >
> > > > > > Thanks, Piotrek
> > > > > >
> > > > > > > On 1 Dec 2018, at 15:39, Becket Qin <be...@gmail.com>
> > > > > > > wrote:
> > > > > > >
> > > > > > > Hi Piotrek,
> > > > > > >
> > > > > > > Thanks again for the clarification. Some more replies follow.
> > > > > > >
> > > > > > > But keep in mind that `.cache()` will/might not only be used in
> > > > > > > interactive programming and not only in batching.
> > > > > > >
> > > > > > > It is true. Actually, in stream processing, cache() has the same
> > > > > > > semantic as in batch processing. The semantic is the following:
> > > > > > > for a table created via a series of computations, save that table
> > > > > > > for later reference to avoid running the computation logic to
> > > > > > > regenerate the table. Once the application exits, drop all the
> > > > > > > caches. This semantic is the same for both batch and stream
> > > > > > > processing. The difference is that stream applications will only
> > > > > > > run once, as they are long running, while batch applications may
> > > > > > > be run multiple times, hence the cache may be created and dropped
> > > > > > > each time the application runs. Admittedly, there will probably be
> > > > > > > some resource management requirements for the streaming cached
> > > > > > > table, such as time based / size based retention, to address the
> > > > > > > infinite data issue. But such requirements do not change the
> > > > > > > semantic. You are right that interactive programming is just one
> > > > > > > use case of cache(). It is not the only use case.
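> > > > > > >
> > > > > > > For the streaming case, such retention could eventually become a
> > > > > > > parameter of the cache call; the following is purely hypothetical
> > > > > > > and not part of the proposal:
> > > > > > >
> > > > > > > CachedTable cachedA = a.cache(
> > > > > > >     CacheRetention.timeBased(Time.hours(1))); // hypothetical API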
> > > > > > >
> > > > > > >> For me the more important issue is that of not having the
> > > > > > >> `void cache()` with side effects.
> > > > > > >
> > > > > > > This is indeed the key point. The argument around whether cache()
> > > > > > > should return something already indicates that cache() and
> > > > > > > materialize() address different issues.
> > > > > > > Can you explain a bit more on what the side effects are? So far my
> > > > > > > understanding is that such side effects only exist if a table is
> > > > > > > mutable. Is that the case?
> > > > > > >
> > > > > > >> I don't know, probably initially we should make CachedTable
> > > > > > >> read-only. I don't find it more confusing than the fact that the
> > > > > > >> user can not write to views or materialised views in SQL, or that
> > > > > > >> the user currently can not write to a Table.
> > > > > > >
> > > > > > > I don't think anyone should insert something into a cache. By
> > > > > > > definition the cache should only be updated when the corresponding
> > > > > > > original table is updated. What I am wondering is that, given the
> > > > > > > following two facts:
> > > > > > > 1. If and only if a table is mutable (with something like
> > > > > > > insert()), a CachedTable may have implicit behavior.
> > > > > > > 2. A CachedTable extends a Table.
> > > > > > > we can come to the conclusion that a CachedTable is mutable and
> > > > > > > users can insert into the CachedTable directly. This is what I
> > > > > > > found confusing.
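> > > > > > >
> > > > > > > One way out of that contradiction would be to have the handle
> > > > > > > reject writes explicitly. A minimal sketch, assuming Table keeps
> > > > > > > its current insertInto(String) method:
> > > > > > >
> > > > > > > public class CachedTable extends Table {
> > > > > > >     @Override
> > > > > > >     public void insertInto(String targetTable) {
> > > > > > >         throw new UnsupportedOperationException(
> > > > > > >             "A CachedTable is read-only");
> > > > > > >     }
> > > > > > > }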
> > > > > > >
> > > > > > > Thanks,
> > > > > > >
> > > > > > > Jiangjie (Becket) Qin
> > > > > > >
> > > > > > > On Sat, Dec 1, 2018 at 2:45 AM Piotr Nowojski
> > > > > > > <piotr@data-artisans.com> wrote:
> > > > > > >
> > > > > > >> Hi all,
> > > > > > >>
> > > > > > >> Regarding naming `cache()` vs `materialize()`. One more
> > > > > > >> explanation of why `materialize()` is more natural to me is that
> > > > > > >> I think of all "Table"s in the Table API as views. They behave
> > > > > > >> the same way as SQL views; the only difference for me is that
> > > > > > >> their lifetime is short - the current session, which is limited
> > > > > > >> by a different execution model. That's why "caching" a view for
> > > > > > >> me is just materialising it.
> > > > > > >>
> > > > > > >> However, I see and understand your point of view. Coming from
> > > > > > >> DataSet/DataStream and, generally speaking, the non-SQL world,
> > > > > > >> `cache()` is more natural. But keep in mind that `.cache()`
> > > > > > >> will/might not only be used in interactive programming and not
> > > > > > >> only in batching. But naming is one issue, and not that critical
> > > > > > >> to me. Especially since, once we implement proper materialised
> > > > > > >> views, we can always deprecate/rename `cache()` if we deem so.
> > > > > > >>
> > > > > > >>
> > > > > > >> For me the more important issue is that of not having the `void
> > > > > > >> cache()` with side effects, exactly for the reasons that you have
> > > > > > >> mentioned. True: results might be non-deterministic if the
> > > > > > >> underlying source tables are changing. The problem is that `void
> > > > > > >> cache()` implicitly changes the semantics of subsequent uses of
> > > > > > >> the cached/materialized Table. It can cause a "wtf" moment for a
> > > > > > >> user if he inserts a "b.cache()" call in some place in his code
> > > > > > >> and suddenly some other random places are behaving differently.
> > > > > > >> If `materialize()` or `cache()` returns a Table handle, we force
> > > > > > >> the user to explicitly use the cache, which removes the "random"
> > > > > > >> part from the "suddenly some other random places are behaving
> > > > > > >> differently".
> > > > > > >>
> > > > > > >> This argument and others that I've raised (greater flexibility,
> > > > > > >> allowing the user to explicitly bypass the cache) are independent
> > > > > > >> of the `cache()` vs `materialize()` discussion.
> > > > > > >>
> > > > > > >>> Does that mean one can also insert into the CachedTable? This
> > > > > > >>> sounds pretty confusing.
> > > > > > >>
> > > > > > >> I don't know, probably initially we should make CachedTable
> > > > > > >> read-only. I don't find it more confusing than the fact that the
> > > > > > >> user can not write to views or materialised views in SQL, or that
> > > > > > >> the user currently can not write to a Table.
> > > > > > >>
> > > > > > >> Piotrek
> > > > > > >>
> > > > > > >>> On 30 Nov 2018, at 17:38, Xingcan Cui <xi...@gmail.com>
> > > > > > >>> wrote:
> > > > > > >>>
> > > > > > >>> Hi all,
> > > > > > >>>
> > > > > > >>> I agree with @Becket that `cache()` and `materialize()` should
> > > > > > >>> be considered as two different methods, where the latter is more
> > > > > > >>> sophisticated.
> > > > > > >>>
> > > > > > >>> According to my understanding, the initial idea is just to
> > > > > > >>> introduce a simple cache or persist mechanism, but as the Table
> > > > > > >>> API is a high-level API, it's natural for us to think in a SQL
> > > > > > >>> way.
> > > > > > >>>
> > > > > > >>> Maybe we can add the `cache()` method to the DataSet API and
> > > > > > >>> force users to translate a Table to a DataSet before caching it.
> > > > > > >>> Then the users should manually register the cached DataSet as a
> > > > > > >>> table again (we may need some table replacement mechanisms for
> > > > > > >>> datasets with an identical schema but different contents here).
> > > > > > >>> After all, it's the dataset rather than the dynamic table that
> > > > > > >>> needs to be cached, right?
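> > > > > > >>>
> > > > > > >>> A rough sketch of that workflow (DataSet.cache() is hypothetical
> > > > > > >>> and does not exist today; the rest is the existing batch API):
> > > > > > >>>
> > > > > > >>> DataSet<Row> ds = tEnv.toDataSet(t, Row.class);
> > > > > > >>> ds.cache(); // hypothetical
> > > > > > >>> tEnv.registerTable("cachedT", tEnv.fromDataSet(ds));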
> > > > > > >>>
> > > > > > >>> Best,
> > > > > > >>> Xingcan
> > > > > > >>>
> > > > > > >>>> On Nov 30, 2018, at 10:57 AM, Becket Qin <becket.qin@gmail.com>
> > > > > > >>>> wrote:
> > > > > > >>>>
> > > > > > >>>> Hi Piotrek and Jark,
> > > > > > >>>>
> > > > > > >>>> Thanks for the feedback and explanation. Those are good
> > > > > > >>>> arguments. But I think those arguments are mostly about
> > > > > > >>>> materialized views. Let me try to explain the reason I believe
> > > > > > >>>> cache() and materialize() are different.
> > > > > > >>>>
> > > > > > >>>> I think cache() and materialize() have quite different
> > > > > > >>>> implications. An analogy I can think of is save()/publish().
> > > > > > >>>> When users call cache(), it is just like they are saving an
> > > > > > >>>> intermediate result as a draft of their work; this intermediate
> > > > > > >>>> result may not have any realistic meaning. Calling cache() does
> > > > > > >>>> not mean users want to publish the cached table in any manner.
> > > > > > >>>> But when users call materialize(), that means "I have something
> > > > > > >>>> meaningful to be reused by others"; now users need to think
> > > > > > >>>> about validation, update & versioning, the lifecycle of the
> > > > > > >>>> result, etc.
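> > > > > > >>>>
> > > > > > >>>> To make the analogy concrete (cache() is the proposed API;
> > > > > > >>>> materialize() and its argument are purely illustrative here):
> > > > > > >>>>
> > > > > > >>>> Table draft = t.groupBy("f1").select("f1, f2.sum as total");
> > > > > > >>>> Table cached = draft.cache();      // a private, session-scoped draft
> > > > > > >>>> draft.materialize("daily_report"); // a published, managed result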
> > > > > > >>>>
> > > > > > >>>> Piotrek's suggestions on variations of the materialize()
> > > > > > >>>> method are very useful. It would be great if Flink had them.
> > > > > > >>>> The concept of a materialized view is actually a pretty big
> > > > > > >>>> feature, not to mention the related stuff like triggers/hooks
> > > > > > >>>> you mentioned earlier. I think the materialized view itself
> > > > > > >>>> should be discussed in a more thorough and systematic manner.
> > > > > > >>>> And I find that discussion is kind of orthogonal to, and way
> > > > > > >>>> beyond, the interactive programming experience.
> > > > > > >>>>
> > > > > > >>>> The example you gave was interesting. I still have some
> > > > > > >>>> questions, though.
> > > > > > >>>>
> > > > > > >>>>> Table source = … // some source that scans files from a
> > > > > > >>>>> // directory "/foo/bar/"
> > > > > > >>>>> Table t1 = source.groupBy(…).select(…).where(…) ….;
> > > > > > >>>>> Table t2 = t1.materialize() // (or `cache()`)
> > > > > > >>>>> t2.count() // initialise cache (if it's lazily initialised)
> > > > > > >>>>> int a1 = t1.count()
> > > > > > >>>>> int b1 = t2.count()
> > > > > > >>>>> // something in the background (or we trigger it) writes new
> > > > > > >>>>> // files to /foo/bar
> > > > > > >>>>> int a2 = t1.count()
> > > > > > >>>>> int b2 = t2.count()
> > > > > > >>>>> t2.refresh() // possible future extension, not to be
> > > > > > >>>>> // implemented in the initial version
> > > > > > >>>>
> > > > > > >>>> what if someone else added some more files to /foo/bar at this
> > > > > > >>>> point? In that case, a3 won't equal b3, and the result becomes
> > > > > > >>>> non-deterministic, right?
> > > > > > >>>>
> > > > > > >>>>> int a3 = t1.count()
> > > > > > >>>>> int b3 = t2.count()
> > > > > > >>>>> t2.drop() // another possible future extension, manual
> > > > > > >>>>> // "cache" dropping
> > > > > > >>>>
> > > > > > >>>>
> > > > > > >>>> When we talk about interactive programming, in most cases we
> > > > > > >>>> are talking about batch applications. A fundamental assumption
> > > > > > >>>> in such cases is that the source data is complete before the
> > > > > > >>>> data processing begins, and the data will not change during the
> > > > > > >>>> processing. IMO, if additional rows need to be added to some
> > > > > > >>>> source during the processing, it should be done in ways like
> > > > > > >>>> unioning the source with another table containing the rows to
> > > > > > >>>> be added.
> > > > > > >>>>
> > > > > > >>>> There are a few cases where computations are executed
> > > > > > >>>> repeatedly on a changing data source.
> > > > > > >>>>
> > > > > > >>>> For example, people may run an ML training job every hour with
> > > > > > >>>> the samples newly added in the past hour. In that case, the
> > > > > > >>>> source data between runs will indeed change, but the data
> > > > > > >>>> remains unchanged within one run. And usually in that case the
> > > > > > >>>> result will need versioning, i.e. for a given result, it tells
> > > > > > >>>> that the result comes from the source data as of a certain
> > > > > > >>>> timestamp.
> > > > > > >>>>
> > > > > > >>>> Another example is something like a data warehouse. In this
> > > > > > >>>> case, there are a few sources of original/raw data. On top of
> > > > > > >>>> those sources, many materialized views / queries / reports /
> > > > > > >>>> dashboards can be created to generate derived data. That
> > > > > > >>>> derived data needs to be updated when the underlying original
> > > > > > >>>> data changes. In that case, the processing logic that derives
> > > > > > >>>> data from the original data needs to be executed repeatedly to
> > > > > > >>>> update those reports/views. Again, all that derived data also
> > > > > > >>>> needs version management, such as a timestamp.
> > > > > > >>>>
> > > > > > >>>> In either of the above two cases, during a single run of the
> > > > > > >>>> processing logic the data cannot change; otherwise the behavior
> > > > > > >>>> of the processing logic may be undefined. In the above two
> > > > > > >>>> examples, when writing the processing logic, users can use
> > > > > > >>>> .cache() to hint to Flink that those results should be saved to
> > > > > > >>>> avoid repeated computation. And then for the result of my
> > > > > > >>>> application logic, I'll call materialize(), so that these
> > > > > > >>>> results can be managed by the system with versioning, metadata
> > > > > > >>>> management, lifecycle management, ACLs, etc.
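> > > > > > >>>>
> > > > > > >>>> Sketched end to end for the ML example (the helper methods and
> > > > > > >>>> materialize() are assumptions for illustration):
> > > > > > >>>>
> > > > > > >>>> Table samples = readSamplesOfLastHour();   // hypothetical helper
> > > > > > >>>> Table features = samples.groupBy("userId").select("userId, clicks.sum");
> > > > > > >>>> Table cachedFeatures = features.cache();   // reused within this run
> > > > > > >>>> Table model = trainModel(cachedFeatures);  // hypothetical helper
> > > > > > >>>> model.materialize();                       // versioned, published result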
> > > > > > >>>>
> > > > > > >>>> It is true we could use materialize() to do the cache() job,
> > > > > > >>>> but I am really reluctant to shoehorn cache() into
> > > > > > >>>> materialize() and force users to worry about a bunch of
> > > > > > >>>> implications that they needn't have to. I am absolutely on your
> > > > > > >>>> side that a redundant API is bad. But it is equally
> > > > > > >>>> frustrating, if not more so, when the same API does different
> > > > > > >>>> things.
> > > > > > >>>>
> > > > > > >>>> Thanks,
> > > > > > >>>>
> > > > > > >>>> Jiangjie (Becket) Qin
> > > > > > >>>>
> > > > > > >>>>
> > > > > > >>>> On Fri, Nov 30, 2018 at 10:34 PM Shaoxuan Wang
> > > > > > >>>> <wshaoxuan@gmail.com> wrote:
> > > > > > >>>>
> > > > > > >>>>> Thanks Piotrek,
> > > > > > >>>>> You provided a very good example; it explains all the
> > > > > > >>>>> confusions I had. It is clear that there is something we had
> > > > > > >>>>> not considered in the initial proposal. We intended to force
> > > > > > >>>>> the user to reuse the cached/materialized table once its
> > > > > > >>>>> cache() method has been executed. We did not expect that a
> > > > > > >>>>> user may want to re-execute the plan from the source table.
> > > > > > >>>>> Let me re-think about it and get back to you later.
> > > > > > >>>>>
> > > > > > >>>>> In the meantime, this example/observation also implies that
> > > > > > >>>>> we cannot fully rely on the optimizer to decide the plan if a
> > > > > > >>>>> cache/materialize is explicitly used, because whether to reuse
> > > > > > >>>>> the cached data or re-execute the query from the source data
> > > > > > >>>>> may lead to different results. (But I guess the optimizer can
> > > > > > >>>>> still help in some cases: as long as it does not re-execute
> > > > > > >>>>> from the varied source, we should be safe.)
> > > > > > >>>>>
> > > > > > >>>>> Regards,
> > > > > > >>>>> Shaoxuan
> > > > > > >>>>>
> > > > > > >>>>>
> > > > > > >>>>>
> > > > > > >>>>> On Fri, Nov 30, 2018 at 9:13 PM Piotr Nowojski <
> > > > > > >> piotr@data-artisans.com>
> > > > > > >>>>> wrote:
> > > > > > >>>>>
> > > > > > >>>>>> Hi Shaoxuan,
> > > > > > >>>>>>
> > > > > > >>>>>> Re 2:
> > > > > > >>>>>>
> > > > > > >>>>>>> Table t3 = methodThatAppliesOperators(t1) // t1 is modified
> > > > > > >>>>>>> to -> t1'
> > > > > > >>>>>>
> > > > > > >>>>>> What do you mean by "t1 is modified to -> t1'"? That the
> > > > > > >>>>>> `methodThatAppliesOperators()` method has changed its plan?
> > > > > > >>>>>> `methodThatAppliesOperators()` method has changed it’s
> plan?
> > > > > > >>>>>>
> > > > > > >>>>>> I was thinking more about something like this:
> > > > > > >>>>>>
> > > > > > >>>>>> Table source = … // some source that scans files from a
> > > > directory
> > > > > > >>>>>> “/foo/bar/“
> > > > > > >>>>>> Table t1 = source.groupBy(…).select(…).where(…) ….;
> > > > > > >>>>>> Table t2 = t1.materialize() // (or `cache()`)
> > > > > > >>>>>>
> > > > > > >>>>>> t2.count() // initialise cache (if it’s lazily
> initialised)
> > > > > > >>>>>>
> > > > > > >>>>>> int a1 = t1.count()
> > > > > > >>>>>> int b1 = t2.count()
> > > > > > >>>>>>
> > > > > > >>>>>> // something in the background (or we trigger it) writes
> new
> > > > files
> > > > > > to
> > > > > > >>>>>> /foo/bar
> > > > > > >>>>>>
> > > > > > >>>>>> int a2 = t1.count()
> > > > > > >>>>>> int b2 = t2.count()
> > > > > > >>>>>>
> > > > > > >>>>>> t2.refresh() // possible future extension, not to be
> > > implemented
> > > > > in
> > > > > > >> the
> > > > > > >>>>>> initial version
> > > > > > >>>>>>
> > > > > > >>>>>> int a3 = t1.count()
> > > > > > >>>>>> int b3 = t2.count()
> > > > > > >>>>>>
> > > > > > >>>>>> t2.drop() // another possible future extension, manual
> > “cache”
> > > > > > >> dropping
> > > > > > >>>>>>
> > > > > > >>>>>> assertTrue(a1 == b1) // same results, but b1 comes from
> the
> > > > > “cache"
> > > > > > >>>>>> assertTrue(b1 == b2) // both values come from the same
> cache
> > > > > > >>>>>> assertTrue(a2 > b2) // b2 comes from cache, a2 re-executed
> > > full
> > > > > > table
> > > > > > >>>>> scan
> > > > > > >>>>>> and has more data
> > > > > > >>>>>> assertTrue(b3 > b2) // b3 comes from refreshed cache
> > > > > > >>>>>> assertTrue(b3 == a2 == a3)
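
A rough sketch of the handle type this example assumes (refresh() and drop()
are the possible future extensions mentioned above, not part of the initial
proposal; whether such a handle extends Table or merely mirrors its methods
is an open design point):

interface MaterializedTable extends Table {
    void refresh(); // re-run the original query and replace the stored data
    void drop();    // manually release the stored data; later reads would
                    // fall back to re-executing the original query
}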
> > > > > > >>>>>>
> > > > > > >>>>>> Piotrek
> > > > > > >>>>>>
> > > > > > >>>>>>> On 30 Nov 2018, at 10:22, Jark Wu <im...@gmail.com>
> > wrote:
> > > > > > >>>>>>>
> > > > > > >>>>>>> Hi,
> > > > > > >>>>>>>
> > > > > > >>>>>>> It is an very interesting and useful design!
> > > > > > >>>>>>>
> > > > > > >>>>>>> Here I want to share some of my thoughts:
> > > > > > >>>>>>>
> > > > > > >>>>>>> 1. Agree with that cache() method should return some
> Table
> > to
> > > > > avoid
> > > > > > >>>>> some
> > > > > > >>>>>>> unexpected problems because of the mutable object.
> > > > > > >>>>>>> All the existing methods of Table are returning a new
> Table
> > > > > > instance.
> > > > > > >>>>>>>
> > > > > > >>>>>>> 2. I think materialize() would be more consistent with
> SQL,
> > > > this
> > > > > > >> makes
> > > > > > >>>>> it
> > > > > > >>>>>>> possible to support the same feature for SQL (materialized
> > > view)
> > > > > and
> > > > > > >>>>> keep
> > > > > > >>>>>>> the same API for users in the future.
> > > > > > >>>>>>> But I'm also fine if we choose cache().
> > > > > > >>>>>>>
> > > > > > >>>>>>> 3. In the proposal, a TableService (or FlinkService?) is
> > used
> > > > to
> > > > > > >> cache
> > > > > > >>>>>> the
> > > > > > >>>>>>> result of the (intermediate) table.
> > > > > > >>>>>>> But the name of TableService may be a bit general which
> is
> > > not
> > > > > > quite
> > > > > > >>>>>>> easy to understand at first glance (a metastore
> > for
> > > > > > >> tables?).
> > > > > > >>>>>>> Maybe a more specific name would be better, such as
> > > > > > TableCacheService
> > > > > > >>>>> or
> > > > > > >>>>>>> TableMaterializeService or something else.
> > > > > > >>>>>>>
> > > > > > >>>>>>> Best,
> > > > > > >>>>>>> Jark
> > > > > > >>>>>>>
> > > > > > >>>>>>>
> > > > > > >>>>>>> On Thu, 29 Nov 2018 at 21:16, Fabian Hueske <
> > > fhueske@gmail.com
> > > > >
> > > > > > >> wrote:
> > > > > > >>>>>>>
> > > > > > >>>>>>>> Hi,
> > > > > > >>>>>>>>
> > > > > > >>>>>>>> Thanks for the clarification Becket!
> > > > > > >>>>>>>>
> > > > > > >>>>>>>> I have a few thoughts to share / questions:
> > > > > > >>>>>>>>
> > > > > > >>>>>>>> 1) I'd like to know how you plan to implement the
> feature
> > > on a
> > > > > > plan
> > > > > > >> /
> > > > > > >>>>>>>> planner level.
> > > > > > >>>>>>>>
> > > > > > >>>>>>>> I would imagine the following to happen when
> Table.cache()
> > > is
> > > > > > >> called:
> > > > > > >>>>>>>>
> > > > > > >>>>>>>> 1) immediately optimize the Table and internally convert
> > it
> > > > > into a
> > > > > > >>>>>>>> DataSet/DataStream. This is necessary to avoid that
> > > operators
> > > > > of
> > > > > > >>>>> later
> > > > > > >>>>>>>> queries on top of the Table are pushed down.
> > > > > > >>>>>>>> 2) register the DataSet/DataStream as a
> > > > > DataSet/DataStream-backed
> > > > > > >>>>> Table
> > > > > > >>>>>> X
> > > > > > >>>>>>>> 3) add a sink to the DataSet/DataStream. This is the
> > > > > > materialization
> > > > > > >>>>> of
> > > > > > >>>>>> the
> > > > > > >>>>>>>> Table X
> > > > > > >>>>>>>>
> > > > > > >>>>>>>> Based on your proposal the following would happen:
> > > > > > >>>>>>>>
> > > > > > >>>>>>>> Table t1 = ....
> > > > > > >>>>>>>> t1.cache(); // cache() returns void. The logical plan of
> > t1
> > > is
> > > > > > >>>>> replaced
> > > > > > >>>>>> by
> > > > > > >>>>>>>> a scan of X. There is also a reference to the
> > > materialization
> > > > of
> > > > > > X.
> > > > > > >>>>>>>>
> > > > > > >>>>>>>> t1.count(); // this executes the program, including the
> > > > > > >>>>>> DataSet/DataStream
> > > > > > >>>>>>>> that backs X and the sink that writes the
> materialization
> > > of X
> > > > > > >>>>>>>> t1.count(); // this executes the program, but reads X
> from
> > > the
> > > > > > >>>>>>>> materialization.
> > > > > > >>>>>>>>
> > > > > > >>>>>>>> My question is, how do you determine whether the
> scan
> > > of
> > > > t1
> > > > > > >>>>> should
> > > > > > >>>>>> go
> > > > > > >>>>>>>> against the DataSet/DataStream program and when against
> > the
> > > > > > >>>>>>>> materialization?
> > > > > > >>>>>>>> AFAIK, there is no hook that will tell you that a part
> of
> > > the
> > > > > > >> program
> > > > > > >>>>>> was
> > > > > > >>>>>>>> executed. Flipping a switch during optimization or plan
> > > > > generation
> > > > > > >> is
> > > > > > >>>>>> not
> > > > > > >>>>>>>> sufficient as there is no guarantee that the plan is
> also
> > > > > > executed.
> > > > > > >>>>>>>>
> > > > > > >>>>>>>> Overall, this behavior is somewhat similar to what I
> > > proposed
> > > > in
> > > > > > >>>>>>>> FLINK-8950, which does not include persisting the table,
> > but
> > > > > just
> > > > > > >>>>>>>> optimizing and reregistering it as DataSet/DataStream
> > scan.
> > > > > > >>>>>>>>
> > > > > > >>>>>>>> 2) I think Piotr has a point about the implicit behavior
> > and
> > > > > side
> > > > > > >>>>>> effects
> > > > > > >>>>>>>> of the cache() method if it does not return anything.
> > > > > > >>>>>>>> Consider the following example:
> > > > > > >>>>>>>>
> > > > > > >>>>>>>> Table t1 = ???
> > > > > > >>>>>>>> Table t2 = methodThatAppliesOperators(t1);
> > > > > > >>>>>>>> Table t3 = methodThatAppliesOtherOperators(t1);
> > > > > > >>>>>>>>
> > > > > > >>>>>>>> In this case, the behavior/performance of the plan that
> > > > results
> > > > > > from
> > > > > > >>>>> the
> > > > > > >>>>>>>> second method call depends on whether t1 was modified by
> > the
> > > > > first
> > > > > > >>>>>> method
> > > > > > >>>>>>>> or not.
> > > > > > >>>>>>>> This is the classic issue of mutable vs. immutable
> > objects.
> > > > > > >>>>>>>> Also, as Piotr pointed out, it might also be good to
> have
> > > the
> > > > > > >> original
> > > > > > >>>>>> plan
> > > > > > >>>>>>>> of t1, because in some cases it is possible to push
> > filters
> > > > down
> > > > > > >> such
> > > > > > >>>>>> that
> > > > > > >>>>>>>> evaluating the query from scratch might be more
> efficient
> > > than
> > > > > > >>>>> accessing
> > > > > > >>>>>>>> the cache.
> > > > > > >>>>>>>> Moreover, a CachedTable could extend Table and offer a
> > > > method
> > > > > > >>>>>> refresh().
> > > > > > >>>>>>>> This sounds quite useful in an interactive session mode.
> > > > > > >>>>>>>>
> > > > > > >>>>>>>> 3) Regarding the name, I can see both arguments. IMO,
> > > > > > materialize()
> > > > > > >>>>>> seems
> > > > > > >>>>>>>> to be more future proof.
> > > > > > >>>>>>>>
> > > > > > >>>>>>>> Best, Fabian
> > > > > > >>>>>>>>
> > > > > > >>>>>>>> On Thu, 29 Nov 2018 at 12:56, Shaoxuan
> Wang <
> > > > > > >>>>>>>> wshaoxuan@gmail.com>:
> > > > > > >>>>>>>>
> > > > > > >>>>>>>>> Hi Piotr,
> > > > > > >>>>>>>>>
> > > > > > >>>>>>>>> Thanks for sharing your ideas on the method naming. We
> > will
> > > > > think
> > > > > > >>>>> about
> > > > > > >>>>>>>>> your suggestions. But I don't understand why we need to
> > > > change
> > > > > > the
> > > > > > >>>>>> return
> > > > > > >>>>>>>>> type of cache().
> > > > > > >>>>>>>>>
> > > > > > >>>>>>>>> Cache() is a physical operation; it does not change the
> > > logic
> > > > > of
> > > > > > >>>>>>>>> the `Table`. On the tableAPI layer, we should not
> > > introduce a
> > > > > new
> > > > > > >>>>> table
> > > > > > >>>>>>>>> type unless the logic of table has been changed. If we
> > > > > introduce
> > > > > > a
> > > > > > >>>>> new
> > > > > > >>>>>>>>> table type `CachedTable`, we need to create the same set
> of
> > > > > methods
> > > > > > of
> > > > > > >>>>>>>> `Table`
> > > > > > >>>>>>>>> for it. I don't think it is worth doing this. Or can
> you
> > > > please
> > > > > > >>>>>> elaborate
> > > > > > >>>>>>>>> more on what could be the "implicit behaviours/side
> > > effects"
> > > > > you
> > > > > > >> are
> > > > > > >>>>>>>>> thinking about?
> > > > > > >>>>>>>>>
> > > > > > >>>>>>>>> Regards,
> > > > > > >>>>>>>>> Shaoxuan
> > > > > > >>>>>>>>>
> > > > > > >>>>>>>>>
> > > > > > >>>>>>>>>
> > > > > > >>>>>>>>> On Thu, Nov 29, 2018 at 7:05 PM Piotr Nowojski <
> > > > > > >>>>>> piotr@data-artisans.com>
> > > > > > >>>>>>>>> wrote:
> > > > > > >>>>>>>>>
> > > > > > >>>>>>>>>> Hi Becket,
> > > > > > >>>>>>>>>>
> > > > > > >>>>>>>>>> Thanks for the response.
> > > > > > >>>>>>>>>>
> > > > > > >>>>>>>>>> 1. I wasn’t saying that materialised view must be
> > mutable
> > > or
> > > > > > not.
> > > > > > >>>>> The
> > > > > > >>>>>>>>> same
> > > > > > >>>>>>>>>> thing applies to caches as well. On the contrary, I
> > would
> > > > > expect
> > > > > > >>>>> more
> > > > > > >>>>>>>>>> consistency and updates from something that is called
> > > > “cache”
> > > > > vs
> > > > > > >>>>>>>>> something
> > > > > > >>>>>>>>>> that’s a “materialised view”. In other words, IMO most
> > > > caches
> > > > > do
> > > > > > >> not
> > > > > > >>>>>>>>> serve
> > > > > > >>>>>>>>>> you invalid/outdated data and they handle updates on
> > their
> > > > > own.
> > > > > > >>>>>>>>>>
> > > > > > >>>>>>>>>> 2. I don’t think that having in the future two very
> > > similar
> > > > > > >> concepts
> > > > > > >>>>>> of
> > > > > > >>>>>>>>>> `materialized` view and `cache` is a good idea. It
> would
> > > be
> > > > > > >>>>> confusing
> > > > > > >>>>>>>> for
> > > > > > >>>>>>>>>> the users. I think it could be handled by
> > > > > variations/overloading
> > > > > > >> of
> > > > > > >>>>>>>>>> materialised view concept. We could start with:
> > > > > > >>>>>>>>>>
> > > > > > >>>>>>>>>> `MaterializedTable materialize()` - immutable, session
> > > life
> > > > > > scope
> > > > > > >>>>>>>>>> (basically the same semantic as you are proposing)
> > > > > > >>>>>>>>>>
> > > > > > >>>>>>>>>> And then in the future (if ever) build on top of
> > > that/expand
> > > > > it
> > > > > > >>>>> with:
> > > > > > >>>>>>>>>>
> > > > > > >>>>>>>>>> `MaterializedTable materialize(refreshTime=…)` or
> > > > > > >> `MaterializedTable
> > > > > > >>>>>>>>>> materialize(refreshHook=…)`
> > > > > > >>>>>>>>>>
> > > > > > >>>>>>>>>> Or with cross session support:
> > > > > > >>>>>>>>>>
> > > > > > >>>>>>>>>> `MaterializedTable materializeInto(connector=…)` or
> > > > > > >>>>> `MaterializedTable
> > > > > > >>>>>>>>>> materializeInto(tableFactory=…)`
> > > > > > >>>>>>>>>>
> > > > > > >>>>>>>>>> I’m not saying that we should implement cross
> > > > > session/refreshing
> > > > > > >> now
> > > > > > >>>>>> or
> > > > > > >>>>>>>>>> even in the near future. I’m just arguing that naming
> > > > current
> > > > > > >>>>>> immutable
> > > > > > >>>>>>>>>> session life scope method `materialize()` is more
> future
> > > > proof
> > > > > > and
> > > > > > >>>>>> more
> > > > > > >>>>>>>>>> consistent with SQL (on which, after all, the Table API is
> > > heavily
> > > > > > >>>>>>>> based).
> > > > > > >>>>>>>>>>
> > > > > > >>>>>>>>>> 3. Even if we agree on naming it `cache()`, I would
> > still
> > > > > insist
> > > > > > >> on
> > > > > > >>>>>>>>>> `cache()` returning `CachedTable` handle to avoid
> > implicit
> > > > > > >>>>>>>>> behaviours/side
> > > > > > >>>>>>>>>> effects and to give both us & users more flexibility.
> > > > > > >>>>>>>>>>
> > > > > > >>>>>>>>>> Piotrek
> > > > > > >>>>>>>>>>
> > > > > > >>>>>>>>>>> On 29 Nov 2018, at 06:20, Becket Qin <
> > > becket.qin@gmail.com
> > > > >
> > > > > > >> wrote:
> > > > > > >>>>>>>>>>>
> > > > > > >>>>>>>>>>> Just to add a little bit, the materialized view is
> > > probably
> > > > > > more
> > > > > > >>>>>>>>> similar
> > > > > > >>>>>>>>>> to
> > > > > > >>>>>>>>>>> the persist() brought up earlier in the thread. So
> > it
> > > is
> > > > > > >> usually
> > > > > > >>>>>>>>> cross
> > > > > > >>>>>>>>>>> session and could be used in a larger scope. For
> > > example, a
> > > > > > >>>>>>>>> materialized
> > > > > > >>>>>>>>>>> view created by user A may be visible to user B. It
> is
> > > > > probably
> > > > > > >>>>>>>>> something
> > > > > > >>>>>>>>>>> we want to have in the future. I'll put it in the
> > future
> > > > work
> > > > > > >>>>>>>> section.
> > > > > > >>>>>>>>>>>
> > > > > > >>>>>>>>>>> Thanks,
> > > > > > >>>>>>>>>>>
> > > > > > >>>>>>>>>>> Jiangjie (Becket) Qin
> > > > > > >>>>>>>>>>>
> > > > > > >>>>>>>>>>> On Thu, Nov 29, 2018 at 9:47 AM Becket Qin <
> > > > > > becket.qin@gmail.com
> > > > > > >>>
> > > > > > >>>>>>>>> wrote:
> > > > > > >>>>>>>>>>>
> > > > > > >>>>>>>>>>>> Hi Piotrek,
> > > > > > >>>>>>>>>>>>
> > > > > > >>>>>>>>>>>> Thanks for the explanation.
> > > > > > >>>>>>>>>>>>
> > > > > > >>>>>>>>>>>> Right now we are mostly thinking of the cached table
> > as
> > > > > > >>>>> immutable. I
> > > > > > >>>>>>>>> can
> > > > > > >>>>>>>>>>>> see the Materialized view would be useful in the
> > future.
> > > > > That
> > > > > > >>>>> said,
> > > > > > >>>>>>>> I
> > > > > > >>>>>>>>>> think
> > > > > > >>>>>>>>>>>> a simple cache mechanism is probably still needed.
> So
> > to
> > > > me,
> > > > > > >>>>> cache()
> > > > > > >>>>>>>>> and
> > > > > > >>>>>>>>>>>> materialize() should be two separate method as they
> > > > address
> > > > > > >>>>>>>> different
> > > > > > >>>>>>>>>>>> needs. Materialize() is a higher level concept
> usually
> > > > > > implying
> > > > > > >>>>>>>>>> periodical
> > > > > > >>>>>>>>>>>> update, while cache() has a much simpler semantic. For
> > > > > example,
> > > > > > >> one
> > > > > > >>>>>>>> may
> > > > > > >>>>>>>>>>>> create a materialized view and use cache() method in
> > the
> > > > > > >>>>>>>> materialized
> > > > > > >>>>>>>>>> view
> > > > > > >>>>>>>>>>>> creation logic. So that during the materialized view
> > > > update,
> > > > > > >> they
> > > > > > >>>>> do
> > > > > > >>>>>>>>> not
> > > > > > >>>>>>>>>>>> need to worry about the case that the cached table
> is
> > > also
> > > > > > >>>>> changed.
> > > > > > >>>>>>>>>> Maybe
> > > > > > >>>>>>>>>>>> under the hood, materialize() and cache() could
> share
> > > > some
> > > > > > >>>>>>>> mechanism,
> > > > > > >>>>>>>>>> but
> > > > > > >>>>>>>>>>>> I think a simple cache() method would be handy in a
> > lot
> > > of
> > > > > > >> cases.
> > > > > > >>>>>>>>>>>>
> > > > > > >>>>>>>>>>>> Thanks,
> > > > > > >>>>>>>>>>>>
> > > > > > >>>>>>>>>>>> Jiangjie (Becket) Qin
> > > > > > >>>>>>>>>>>>
> > > > > > >>>>>>>>>>>> On Mon, Nov 26, 2018 at 9:38 PM Piotr Nowojski <
> > > > > > >>>>>>>>> piotr@data-artisans.com
> > > > > > >>>>>>>>>>>
> > > > > > >>>>>>>>>>>> wrote:
> > > > > > >>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>> Hi Becket,
> > > > > > >>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>> Is there any extra thing user can do on a
> > > > > MaterializedTable
> > > > > > >> that
> > > > > > >>>>>>>>> they
> > > > > > >>>>>>>>>>>>> cannot do on a Table?
> > > > > > >>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>> Maybe not in the initial implementation, but
> various
> > > DBs
> > > > > > offer
> > > > > > >>>>>>>>>> different
> > > > > > >>>>>>>>>>>>> ways to “refresh” the materialised view. Hooks,
> > > triggers,
> > > > > > >> timers,
> > > > > > >>>>>>>>>> manually
> > > > > > >>>>>>>>>>>>> etc. Having `MaterializedTable` would help us to
> > handle
> > > > > that
> > > > > > in
> > > > > > >>>>> the
> > > > > > >>>>>>>>>> future.
> > > > > > >>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>> After users call *table.cache()*, users can just
> use
> > > > that
> > > > > > >> table
> > > > > > >>>>>>>> and
> > > > > > >>>>>>>>> do
> > > > > > >>>>>>>>>>>>> anything that is supported on a Table, including
> SQL.
> > > > > > >>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>> This is some implicit behaviour with side effects.
> > > > Imagine
> > > > > if
> > > > > > >>>>> user
> > > > > > >>>>>>>>> has
> > > > > > >>>>>>>>>> a
> > > > > > >>>>>>>>>>>>> long and complicated program, that touches table
> `b`
> > > > > multiple
> > > > > > >>>>>>>> times,
> > > > > > >>>>>>>>>> maybe
> > > > > > >>>>>>>>>>>>> scattered around different methods. If he modifies
> > his
> > > > > > program
> > > > > > >> by
> > > > > > >>>>>>>>>> inserting
> > > > > > >>>>>>>>>>>>> in one place
> > > > > > >>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>> b.cache()
> > > > > > >>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>> This implicitly alters the semantic and behaviour
> of
> > > his
> > > > > code
> > > > > > >> all
> > > > > > >>>>>>>>> over
> > > > > > >>>>>>>>>>>>> the place, maybe in ways that might cause
> problems.
> > > For
> > > > > > >> example
> > > > > > >>>>>>>>> what
> > > > > > >>>>>>>>>> if
> > > > > > >>>>>>>>>>>>> underlying data is changing?
> > > > > > >>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>> Having invisible side effects is also not very
> clean,
> > > for
> > > > > > >> example
> > > > > > >>>>>>>>> think
> > > > > > >>>>>>>>>>>>> about something like this (but more complicated):
> > > > > > >>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>> Table b = ...;
> > > > > > >>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>> If (some_condition) {
> > > > > > >>>>>>>>>>>>> processTable1(b)
> > > > > > >>>>>>>>>>>>> }
> > > > > > >>>>>>>>>>>>> else {
> > > > > > >>>>>>>>>>>>> processTable2(b)
> > > > > > >>>>>>>>>>>>> }
> > > > > > >>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>> // do more stuff with b
> > > > > > >>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>> And user adds `b.cache()` call to only one of the
> > > > > > >> `processTable1`
> > > > > > >>>>>>>> or
> > > > > > >>>>>>>>>>>>> `processTable2` methods.
> > > > > > >>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>> On the other hand
> > > > > > >>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>> Table materialisedB = b.materialize()
> > > > > > >>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>> Avoids (at least some of) the side effect issues
> and
> > > > forces
> > > > > > >> user
> > > > > > >>>>> to
> > > > > > >>>>>>>>>>>>> explicitly use `materialisedB` where it’s
> appropriate
> > > and
> > > > > > >> forces
> > > > > > >>>>>>>> user
> > > > > > >>>>>>>>>> to
> > > > > > >>>>>>>>>>>>> think what does it actually mean. And if something
> > > > doesn’t
> > > > > > work
> > > > > > >>>>> in
> > > > > > >>>>>>>>> the
> > > > > > >>>>>>>>>> end
> > > > > > >>>>>>>>>>>>> for the user, he will know what has he changed
> > instead
> > > of
> > > > > > >> blaming
> > > > > > >>>>>>>>>> Flink for
> > > > > > >>>>>>>>>>>>> some “magic” underneath. In the above example,
> after
> > > > > > >>>>> materialising
> > > > > > >>>>>>>> b
> > > > > > >>>>>>>>> in
> > > > > > >>>>>>>>>>>>> only one of the methods, he should/would realise
> > about
> > > > the
> > > > > > >> issue
> > > > > > >>>>>>>> when
> > > > > > >>>>>>>>>>>>> handling the return value `MaterializedTable` of
> that
> > > > > method.
> > > > > > >>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>> I guess it comes down to personal preferences if
> you
> > > like
> > > > > > >> things
> > > > > > >>>>> to
> > > > > > >>>>>>>>> be
> > > > > > >>>>>>>>>>>>> implicit or not. The more of a power user someone is,
> probably
> > > the
> > > > > > more
> > > > > > >>>>>>>> likely
> > > > > > >>>>>>>>>> he is
> > > > > > >>>>>>>>>>>>> to like/understand implicit behaviour. And we as
> > Table
> > > > API
> > > > > > >>>>>>>> designers
> > > > > > >>>>>>>>>> are
> > > > > > >>>>>>>>>>>>> the most power users out there, so I would proceed
> > with
> > > > > > caution
> > > > > > >>>>> (so
> > > > > > >>>>>>>>>> that we
> > > > > > >>>>>>>>>>>>> do not end up in the crazy perl realm with it’s
> > lovely
> > > > > > implicit
> > > > > > >>>>>>>>> method
> > > > > > >>>>>>>>>>>>> arguments ;)  <
> > > > > https://stackoverflow.com/a/14922656/8149051
> > > > > > >)
> > > > > > >>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>> Table API to also support non-relational
> processing
> > > > cases,
> > > > > > >>>>> cache()
> > > > > > >>>>>>>>>>>>> might be slightly better.
> > > > > > >>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>> I think even such extended Table API could benefit
> > from
> > > > > > >> sticking
> > > > > > >>>>>>>>>> to/being
> > > > > > >>>>>>>>>>>>> consistent with SQL where both SQL and Table API
> are
> > > > > > basically
> > > > > > >>>>> the
> > > > > > >>>>>>>>>> same.
> > > > > > >>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>> One more thing. `MaterializedTable materialize()`
> > could
> > > > be
> > > > > > more
> > > > > > >>>>>>>>>>>>> powerful/flexible allowing the user to operate both
> > on
> > > > > > >>>>> materialised
> > > > > > >>>>>>>>>> and not
> > > > > > >>>>>>>>>>>>> materialised view at the same time for whatever
> > reasons
> > > > > > >>>>> (underlying
> > > > > > >>>>>>>>>> data
> > > > > > >>>>>>>>>>>>> changing/better optimisation opportunities after
> > > pushing
> > > > > down
> > > > > > >>>>> more
> > > > > > >>>>>>>>>> filters
> > > > > > >>>>>>>>>>>>> etc). For example:
> > > > > > >>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>> Table b = …;
> > > > > > >>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>> MaterializedTable mb = b.materialize();
> > > > > > >>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>> val min = mb.min();
> > > > > > >>>>>>>>>>>>> val max = mb.max();
> > > > > > >>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>> val user42 = b.filter('userId = 42);
> > > > > > >>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>> Could be more efficient compared to `b.cache()` if
> > > > > > >>>>> `filter('userId
> > > > > > >>>>>>>> =
> > > > > > >>>>>>>>>>>>> 42);` allows for much more aggressive
> optimisations.
> > > > > > >>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>> Piotrek
> > > > > > >>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>> On 26 Nov 2018, at 12:14, Fabian Hueske <
> > > > > fhueske@gmail.com>
> > > > > > >>>>>>>> wrote:
> > > > > > >>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>> I'm not suggesting to add support for Ignite. This
> > was
> > > > > just
> > > > > > an
> > > > > > >>>>>>>>>> example.
> > > > > > >>>>>>>>>>>>>> Plasma and Arrow sound interesting, too.
> > > > > > >>>>>>>>>>>>>> For the sake of this proposal, it would be up to
> the
> > > > user
> > > > > to
> > > > > > >>>>>>>>>> implement a
> > > > > > >>>>>>>>>>>>>> TableFactory and corresponding TableSource /
> > TableSink
> > > > > > classes
> > > > > > >>>>> to
> > > > > > >>>>>>>>>>>>> persist
> > > > > > >>>>>>>>>>>>>> and read the data.
> > > > > > >>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>> On Mon, 26 Nov 2018 at 12:06, Flavio
> > > > > > Pompermaier
> > > > > > >> <
> > > > > > >>>>>>>>>>>>>> pompermaier@okkam.it>:
> > > > > > >>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>> What about to add also Apache Plasma + Arrow as
> an
> > > > > > >> alternative
> > > > > > >>>>> to
> > > > > > >>>>>>>>>>>>> Apache
> > > > > > >>>>>>>>>>>>>>> Ignite?
> > > > > > >>>>>>>>>>>>>>> [1]
> > > > > > >>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>
> > > > > > >>>>>>>>>>
> > > > > > >>>>>>>>
> > > > > > >>>>>
> > > > > > >>
> > > > >
> > >
> https://arrow.apache.org/blog/2017/08/08/plasma-in-memory-object-store/
> > > > > > >>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>> On Mon, Nov 26, 2018 at 11:56 AM Fabian Hueske <
> > > > > > >>>>>>>> fhueske@gmail.com>
> > > > > > >>>>>>>>>>>>> wrote:
> > > > > > >>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>>> Hi,
> > > > > > >>>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>>> Thanks for the proposal!
> > > > > > >>>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>>> To summarize, you propose a new method
> > > Table.cache():
> > > > > > Table
> > > > > > >>>>> that
> > > > > > >>>>>>>>>> will
> > > > > > >>>>>>>>>>>>>>>> trigger a job and write the result into some
> > > temporary
> > > > > > >> storage
> > > > > > >>>>>>>> as
> > > > > > >>>>>>>>>>>>> defined
> > > > > > >>>>>>>>>>>>>>>> by a TableFactory.
> > > > > > >>>>>>>>>>>>>>>> The cache() call blocks while the job is running
> > and
> > > > > > >>>>> eventually
> > > > > > >>>>>>>>>>>>> returns a
> > > > > > >>>>>>>>>>>>>>>> Table object that represents a scan of the
> > temporary
> > > > > > table.
> > > > > > >>>>>>>>>>>>>>>> When the "session" is closed (closing to be
> > > defined?),
> > > > > the
> > > > > > >>>>>>>>> temporary
> > > > > > >>>>>>>>>>>>>>> tables
> > > > > > >>>>>>>>>>>>>>>> are all dropped.
> > > > > > >>>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>>> I think this behavior makes sense and is a good
> > > first
> > > > > step
> > > > > > >>>>>>>> towards
> > > > > > >>>>>>>>>>>>> more
> > > > > > >>>>>>>>>>>>>>>> interactive workloads.
> > > > > > >>>>>>>>>>>>>>>> However, its performance suffers from writing to
> > and
> > > > > > reading
> > > > > > >>>>>>>> from
> > > > > > >>>>>>>>>>>>>>> external
> > > > > > >>>>>>>>>>>>>>>> systems.
> > > > > > >>>>>>>>>>>>>>>> I think this is OK for now. Changes that would
> > > > > > significantly
> > > > > > >>>>>>>>> improve
> > > > > > >>>>>>>>>>>>> the
> > > > > > >>>>>>>>>>>>>>>> situation (i.e., pinning data in-memory across
> > jobs)
> > > > > would
> > > > > > >>>>> have
> > > > > > >>>>>>>>>> large
> > > > > > >>>>>>>>>>>>>>>> impacts on many components of Flink.
> > > > > > >>>>>>>>>>>>>>>> Users could use in-memory filesystems or storage
> > > grids
> > > > > > >> (Apache
> > > > > > >>>>>>>>>>>>> Ignite) to
> > > > > > >>>>>>>>>>>>>>>> mitigate some of the performance effects.
> > > > > > >>>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>>> Best, Fabian
> > > > > > >>>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>>> On Mon, 26 Nov 2018 at 03:38,
> Becket
> > > Qin
> > > > <
> > > > > > >>>>>>>>>>>>>>>> becket.qin@gmail.com
> > > > > > >>>>>>>>>>>>>>>>> :
> > > > > > >>>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>>>> Thanks for the explanation, Piotrek.
> > > > > > >>>>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>>>> Is there any extra thing user can do on a
> > > > > > MaterializedTable
> > > > > > >>>>>>>> that
> > > > > > >>>>>>>>>> they
> > > > > > >>>>>>>>>>>>>>>>> cannot do on a Table? After users call
> > > > *table.cache()*,
> > > > > > >> users
> > > > > > >>>>>>>> can
> > > > > > >>>>>>>>>>>>> just
> > > > > > >>>>>>>>>>>>>>>> use
> > > > > > >>>>>>>>>>>>>>>>> that table and do anything that is supported
> on a
> > > > > Table,
> > > > > > >>>>>>>>> including
> > > > > > >>>>>>>>>>>>> SQL.
> > > > > > >>>>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>>>> Naming wise, either cache() or materialize()
> > sounds
> > > > > fine
> > > > > > to
> > > > > > >>>>> me.
> > > > > > >>>>>>>>>>>>> cache()
> > > > > > >>>>>>>>>>>>>>>> is
> > > > > > >>>>>>>>>>>>>>>>> a bit more general than materialize(). Given
> that
> > > we
> > > > > are
> > > > > > >>>>>>>>> enhancing
> > > > > > >>>>>>>>>>>>> the
> > > > > > >>>>>>>>>>>>>>>>> Table API to also support non-relational
> > processing
> > > > > > cases,
> > > > > > >>>>>>>>> cache()
> > > > > > >>>>>>>>>>>>>>> might
> > > > > > >>>>>>>>>>>>>>>> be
> > > > > > >>>>>>>>>>>>>>>>> slightly better.
> > > > > > >>>>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>>>> Thanks,
> > > > > > >>>>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>>>> Jiangjie (Becket) Qin
> > > > > > >>>>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>>>> On Fri, Nov 23, 2018 at 11:25 PM Piotr
> Nowojski <
> > > > > > >>>>>>>>>>>>>>> piotr@data-artisans.com
> > > > > > >>>>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>>>> wrote:
> > > > > > >>>>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>>>>> Hi Becket,
> > > > > > >>>>>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>>>>> Oops, sorry I didn’t notice that you intend to
> > > reuse
> > > > > > >> existing
> > > > > > >>>>>>>>>>>>>>>>>> `TableFactory`. I don’t know why, but I
> assumed
> > > that
> > > > > you
> > > > > > >>>>> want
> > > > > > >>>>>>>> to
> > > > > > >>>>>>>>>>>>>>>> provide
> > > > > > >>>>>>>>>>>>>>>>> an
> > > > > > >>>>>>>>>>>>>>>>>> alternate way of writing the data.
> > > > > > >>>>>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>>>>> Now that I hopefully understand the proposal,
> > > maybe
> > > > we
> > > > > > >> could
> > > > > > >>>>>>>>>> rename
> > > > > > >>>>>>>>>>>>>>>>>> `cache()` to
> > > > > > >>>>>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>>>>> void materialize()
> > > > > > >>>>>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>>>>> or going a step further
> > > > > > >>>>>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>>>>> MaterializedTable materialize()
> > > > > > >>>>>>>>>>>>>>>>>> MaterializedTable createMaterializedView()
> > > > > > >>>>>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>>>>> ?
> > > > > > >>>>>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>>>>> The second option with returning a handle I
> > think
> > > is
> > > > > > more
> > > > > > >>>>>>>>> flexible
> > > > > > >>>>>>>>>>>>>>> and
> > > > > > >>>>>>>>>>>>>>>>>> could provide features such as
> > “refresh”/“delete”
> > > or
> > > > > > >>>>> generally
> > > > > > >>>>>>>>>>>>>>> speaking
> > > > > > >>>>>>>>>>>>>>>>>> manage the view. In the future we could
> also
> > > > think
> > > > > > >> about
> > > > > > >>>>>>>>>> adding
> > > > > > >>>>>>>>>>>>>>>> hooks
> > > > > > >>>>>>>>>>>>>>>>>> to automatically refresh view etc. It is also
> > more
> > > > > > >> explicit
> > > > > > >>>>> -
> > > > > > >>>>>>>>>>>>>>>>>> materialization returning a new table handle
> > will
> > > > not
> > > > > > have
> > > > > > >>>>> the
> > > > > > >>>>>>>>>> same
> > > > > > >>>>>>>>>>>>>>>>>> implicit side effects as adding a simple line
> of
> > > > code
> > > > > > like
> > > > > > >>>>>>>>>>>>>>> `b.cache()`
> > > > > > >>>>>>>>>>>>>>>>>> would have.
> > > > > > >>>>>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>>>>> It would also be more SQL like, making it more
> > > > > intuitive
> > > > > > >> for
> > > > > > >>>>>>>>> users
> > > > > > >>>>>>>>>>>>>>>>> already
> > > > > > >>>>>>>>>>>>>>>>>> familiar with the SQL.
> > > > > > >>>>>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>>>>> Piotrek
> > > > > > >>>>>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>>>>>> On 23 Nov 2018, at 14:53, Becket Qin <
> > > > > > >> becket.qin@gmail.com
> > > > > > >>>>>>
> > > > > > >>>>>>>>>> wrote:
> > > > > > >>>>>>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>>>>>> Hi Piotrek,
> > > > > > >>>>>>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>>>>>> For the cache() method itself, yes, it is
> > > > equivalent
> > > > > to
> > > > > > >>>>>>>>> creating
> > > > > > >>>>>>>>>> a
> > > > > > >>>>>>>>>>>>>>>>>> BUILT-IN
> > > > > > >>>>>>>>>>>>>>>>>>> materialized view with a lifecycle. That
> > > > > functionality
> > > > > > is
> > > > > > >>>>>>>>> missing
> > > > > > >>>>>>>>>>>>>>>>> today,
> > > > > > >>>>>>>>>>>>>>>>>>> though. Not sure if I understand your
> question.
> > > Do
> > > > > you
> > > > > > >> mean
> > > > > > >>>>>>>> we
> > > > > > >>>>>>>>>>>>>>>> already
> > > > > > >>>>>>>>>>>>>>>>>> have
> > > > > > >>>>>>>>>>>>>>>>>> the functionality and just need syntactic
> sugar?
> > > > > > >>>>>>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>>>>>> What's more interesting in the proposal is do
> > we
> > > > want
> > > > > > to
> > > > > > >>>>> stop
> > > > > > >>>>>>>>> at
> > > > > > >>>>>>>>>>>>>>>>> creating
> > > > > > >>>>>>>>>>>>>>>>>>> the materialized view? Or do we want to
> extend
> > > that
> > > > > in
> > > > > > >> the
> > > > > > >>>>>>>>> future
> > > > > > >>>>>>>>>>>>>>> to
> > > > > > >>>>>>>>>>>>>>>> a
> > > > > > >>>>>>>>>>>>>>>>>> more
> > > > > > >>>>>>>>>>>>>>>>>>> useful unified data store distributed with
> > Flink?
> > > > And
> > > > > > do
> > > > > > >> we
> > > > > > >>>>>>>>> want
> > > > > > >>>>>>>>>> to
> > > > > > >>>>>>>>>>>>>>>>> have
> > > > > > >>>>>>>>>>>>>>>>>> a
> > > > > > >>>>>>>>>>>>>>>>>>> mechanism that allows more flexible user job
> patterns
> > > with
> > > > > > their
> > > > > > >>>>> own
> > > > > > >>>>>>>>>> user
> > > > > > >>>>>>>>>>>>>>>>>> defined
> > > > > > >>>>>>>>>>>>>>>>>>> services? These considerations are much more
> > > > > > >> architectural.
> > > > > > >>>>>>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>>>>>> Thanks,
> > > > > > >>>>>>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>>>>>> Jiangjie (Becket) Qin
> > > > > > >>>>>>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>>>>>> On Fri, Nov 23, 2018 at 6:01 PM Piotr
> Nowojski
> > <
> > > > > > >>>>>>>>>>>>>>>>> piotr@data-artisans.com>
> > > > > > >>>>>>>>>>>>>>>>>>> wrote:
> > > > > > >>>>>>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>>>>>>> Hi,
> > > > > > >>>>>>>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>>>>>>> Interesting idea. I’m trying to understand
> the
> > > > > > problem.
> > > > > > >>>>>>>> Isn’t
> > > > > > >>>>>>>>>> the
> > > > > > >>>>>>>>>>>>>>>>>>>> `cache()` call an equivalent of writing data
> > to
> > > a
> > > > > sink
> > > > > > >> and
> > > > > > >>>>>>>>> later
> > > > > > >>>>>>>>>>>>>>>>> reading
> > > > > > >>>>>>>>>>>>>>>>>>>> from it? Where this sink has a limited
> > > > > scope and
> > > > > > >>>>> lifetime?
> > > > > > >>>>>>>>> And
> > > > > > >>>>>>>>>>>>>>> the
> > > > > > >>>>>>>>>>>>>>>>>> sink
> > > > > > >>>>>>>>>>>>>>>>>>>> could be implemented as in memory or a file
> > > sink?
> > > > > > >>>>>>>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>>>>>>> If so, what’s the problem with creating a
> > > > > materialised
> > > > > > >>>>> view
> > > > > > >>>>>>>>>> from a
> > > > > > >>>>>>>>>>>>>>>>> table
> > > > > > >>>>>>>>>>>>>>>>>>>> “b” (from your document’s example) and
> reusing
> > > > this
> > > > > > >>>>>>>>> materialised
> > > > > > >>>>>>>>>>>>>>>> view
> > > > > > >>>>>>>>>>>>>>>>>>>> later? Maybe we are lacking mechanisms to
> > clean
> > > up
> > > > > > >>>>>>>>> materialised
> > > > > > >>>>>>>>>>>>>>>> views
> > > > > > >>>>>>>>>>>>>>>>>> (for
> > > > > > >>>>>>>>>>>>>>>>>>>> example when current session finishes)?
> Maybe
> > we
> > > > > need
> > > > > > >> some
> > > > > > >>>>>>>>>>>>>>> syntactic
> > > > > > >>>>>>>>>>>>>>>>>> sugar
> > > > > > >>>>>>>>>>>>>>>>>>>> on top of it?
> > > > > > >>>>>>>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>>>>>>> Piotrek
> > > > > > >>>>>>>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>>>>>>>> On 23 Nov 2018, at 07:21, Becket Qin <
> > > > > > >>>>> becket.qin@gmail.com
> > > > > > >>>>>>>>>
> > > > > > >>>>>>>>>>>>>>> wrote:
> > > > > > >>>>>>>>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>>>>>>>> Thanks for the suggestion, Jincheng.
> > > > > > >>>>>>>>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>>>>>>>> Yes, I think it makes sense to have a
> > persist()
> > > > > with
> > > > > > >>>>>>>>>>>>>>>>> lifecycle/defined
> > > > > > >>>>>>>>>>>>>>>>>>>>> scope. I just added a section in the future
> > > work
> > > > > for
> > > > > > >>>>> this.
> > > > > > >>>>>>>>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>>>>>>>> Thanks,
> > > > > > >>>>>>>>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>>>>>>>> Jiangjie (Becket) Qin
> > > > > > >>>>>>>>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>>>>>>>> On Fri, Nov 23, 2018 at 1:55 PM jincheng
> sun
> > <
> > > > > > >>>>>>>>>>>>>>>>> sunjincheng121@gmail.com
> > > > > > >>>>>>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>>>>>>>> wrote:
> > > > > > >>>>>>>>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>>>>>>>>> Hi Jiangjie,
> > > > > > >>>>>>>>>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>>>>>>>>> Thank you for the explanation about the
> name
> > > of
> > > > > > >>>>>>>> `cache()`, I
> > > > > > >>>>>>>>>>>>>>>>>> understand
> > > > > > >>>>>>>>>>>>>>>>>>>> why
> > > > > > >>>>>>>>>>>>>>>>>>>>>> you designed it this way!
> > > > > > >>>>>>>>>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>>>>>>>>> Another idea is whether we can specify a
> > > > lifecycle
> > > > > > for
> > > > > > >>>>>>>> data
> > > > > > >>>>>>>>>>>>>>>>>> persistence?
> > > > > > >>>>>>>>>>>>>>>>>>>>>> For example, persist (LifeCycle.SESSION),
> so
> > > > that
> > > > > > the
> > > > > > >>>>> user
> > > > > > >>>>>>>>> is
> > > > > > >>>>>>>>>>>>>>> not
> > > > > > >>>>>>>>>>>>>>>>>>>> worried
> > > > > > >>>>>>>>>>>>>>>>>>>>>> about data loss, and will clearly specify
> > the
> > > > time
> > > > > > >> range
> > > > > > >>>>>>>> for
> > > > > > >>>>>>>>>>>>>>>> keeping
> > > > > > >>>>>>>>>>>>>>>>>>>> time.
> > > > > > >>>>>>>>>>>>>>>>>>>>>> At the same time, if we want to expand, we
> > can
> > > > > also
> > > > > > >>>>> share
> > > > > > >>>>>>>>> in a
> > > > > > >>>>>>>>>>>>>>>>> certain
> > > > > > >>>>>>>>>>>>>>>>>>>>>> group of sessions, for example:
> > > > > > >>>>>>>>> LifeCycle.SESSION_GROUP(...), I
> > > > > > >>>>>>>>>>>>>>> am
> > > > > > >>>>>>>>>>>>>>>>> not
> > > > > > >>>>>>>>>>>>>>>>>>>> sure,
> > > > > > >>>>>>>>>>>>>>>>>>>>>> just an immature suggestion, for reference
> > > only!
> > > > > > >>>>>>>>>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>>>>>>>>> Bests,
> > > > > > >>>>>>>>>>>>>>>>>>>>>> Jincheng
> > > > > > >>>>>>>>>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>>>>>>>>> Becket Qin <be...@gmail.com>
> > > > on Fri, 23 Nov 2018
> > > > > > >>>>>>>> at 1:33 PM wrote:
> > > > > > >>>>>>>>>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>>>>>>>>>> Re: Jincheng,
> > > > > > >>>>>>>>>>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>>>>>>>>>> Thanks for the feedback. Regarding
> cache()
> > > v.s.
> > > > > > >>>>>>>> persist(),
> > > > > > >>>>>>>>>>>>>>>>>> personally I
> > > > > > >>>>>>>>>>>>>>>>>>>>>>> find cache() to be more accurately
> > describing
> > > > the
> > > > > > >>>>>>>> behavior,
> > > > > > >>>>>>>>>>>>>>> i.e.
> > > > > > >>>>>>>>>>>>>>>>> the
> > > > > > >>>>>>>>>>>>>>>>>>>>>> Table
> > > > > > >>>>>>>>>>>>>>>>>>>>>>> is cached for the session, but will be
> > > deleted
> > > > > > after
> > > > > > >>>>> the
> > > > > > >>>>>>>>>>>>>>> session
> > > > > > >>>>>>>>>>>>>>>> is
> > > > > > >>>>>>>>>>>>>>>>>>>>>> closed.
> > > > > > >>>>>>>>>>>>>>>>>>>>>>> persist() seems a little misleading as
> > people
> > > > > might
> > > > > > >>>>> think
> > > > > > >>>>>>>>> the
> > > > > > >>>>>>>>>>>>>>>> table
> > > > > > >>>>>>>>>>>>>>>>>>>> will
> > > > > > >>>>>>>>>>>>>>>>>>>>>>> still be there even after the session is
> > > gone.
> > > > > > >>>>>>>>>>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>>>>>>>>>> Great point about mixing the batch and
> > stream
> > > > > > >>>>> processing
> > > > > > >>>>>>>> in
> > > > > > >>>>>>>>>> the
> > > > > > >>>>>>>>>>>>>>>>> same
> > > > > > >>>>>>>>>>>>>>>>>>>> job.
> > > > > > >>>>>>>>>>>>>>>>>>>>>>> We should absolutely move towards that
> > goal.
> > > I
> > > > > > >> imagine
> > > > > > >>>>>>>> that
> > > > > > >>>>>>>>>>>>>>> would
> > > > > > >>>>>>>>>>>>>>>>> be
> > > > > > >>>>>>>>>>>>>>>>>> a
> > > > > > >>>>>>>>>>>>>>>>>>>>>> huge
> > > > > > >>>>>>>>>>>>>>>>>>>>>>> change across the board, including
> sources,
> > > > > > operators
> > > > > > >>>>> and
> > > > > > >>>>>>>>>>>>>>>>>>>> optimizations,
> > > > > > >>>>>>>>>>>>>>>>>>>>>> to
> > > > > > >>>>>>>>>>>>>>>>>>>>>>> name some. Likely we will need several
> > > separate
> > > > > > >>>>> in-depth
> > > > > > >>>>>>>>>>>>>>>>> discussions.
> > > > > > >>>>>>>>>>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>>>>>>>>>> Thanks,
> > > > > > >>>>>>>>>>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>>>>>>>>>> Jiangjie (Becket) Qin
> > > > > > >>>>>>>>>>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>>>>>>>>>> On Fri, Nov 23, 2018 at 5:14 AM Xingcan
> > Cui <
> > > > > > >>>>>>>>>>>>>>> xingcanc@gmail.com>
> > > > > > >>>>>>>>>>>>>>>>>>>> wrote:
> > > > > > >>>>>>>>>>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>>>>>>>>>>> Hi all,
> > > > > > >>>>>>>>>>>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>>>>>>>>>>> @Shaoxuan, I think the lifecycle or
> access
> > > > > domain
> > > > > > >> are
> > > > > > >>>>>>>> both
> > > > > > >>>>>>>>>>>>>>>>>> orthogonal
> > > > > > >>>>>>>>>>>>>>>>>>>>>> to
> > > > > > >>>>>>>>>>>>>>>>>>>>>>>> the cache problem. Essentially, this may
> > be
> > > > the
> > > > > > >> first
> > > > > > >>>>>>>> time
> > > > > > >>>>>>>>>> we
> > > > > > >>>>>>>>>>>>>>>> plan
> > > > > > >>>>>>>>>>>>>>>>>> to
> > > > > > >>>>>>>>>>>>>>>>>>>>>>>> introduce another storage mechanism
> other
> > > than
> > > > > the
> > > > > > >>>>>>>> state.
> > > > > > >>>>>>>>>>>>>>> Maybe
> > > > > > >>>>>>>>>>>>>>>>> it’s
> > > > > > >>>>>>>>>>>>>>>>>>>>>>> better
> > > > > > >>>>>>>>>>>>>>>>>>>>>>>> to first draw a big picture and then
> > > > concentrate
> > > > > > on
> > > > > > >> a
> > > > > > >>>>>>>>>> specific
> > > > > > >>>>>>>>>>>>>>>>> part?
> > > > > > >>>>>>>>>>>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>>>>>>>>>>> @Becket, yes, actually I am more
> concerned
> > > > with
> > > > > > the
> > > > > > >>>>>>>>>> underlying
> > > > > > >>>>>>>>>>>>>>>>>>>> service.
> > > > > > >>>>>>>>>>>>>>>>>>>>>>>> This seems to be quite a major change to
> > the
> > > > > > >> existing
> > > > > > >>>>>>>>>>>>>>> codebase.
> > > > > > >>>>>>>>>>>>>>>> As
> > > > > > >>>>>>>>>>>>>>>>>> you
> > > > > > >>>>>>>>>>>>>>>>>>>>>>>> claimed, the service should be
> extendible
> > to
> > > > > > support
> > > > > > >>>>>>>> other
> > > > > > >>>>>>>>>>>>>>>>>> components
> > > > > > >>>>>>>>>>>>>>>>>>>>>> and
> > > > > > >>>>>>>>>>>>>>>>>>>>>>>> we’d better discussed it in another
> > thread.
> > > > > > >>>>>>>>>>>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>>>>>>>>>>> All in all, I am also eager to enjoy the
> more
> > > > > > >> interactive
> > > > > > >>>>>>>>> Table
> > > > > > >>>>>>>>>>>>>>>> API,
> > > > > > >>>>>>>>>>>>>>>>> in
> > > > > > >>>>>>>>>>>>>>>>>>>>>> case
> > > > > > >>>>>>>>>>>>>>>>>>>>>>>> of a general and flexible enough service
> > > > > > mechanism.
> > > > > > >>>>>>>>>>>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>>>>>>>>>>> Best,
> > > > > > >>>>>>>>>>>>>>>>>>>>>>>> Xingcan
> > > > > > >>>>>>>>>>>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>>>>>>>>>>>> On Nov 22, 2018, at 10:16 AM, Xiaowei
> > > Jiang <
> > > > > > >>>>>>>>>>>>>>>> xiaoweij@gmail.com>
> > > > > > >>>>>>>>>>>>>>>>>>>>>>> wrote:
> > > > > > >>>>>>>>>>>>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>>>>>>>>>>>> Relying on a callback for the temp
> table
> > > for
> > > > > > clean
> > > > > > >> up
> > > > > > >>>>>>>> is
> > > > > > >>>>>>>>>> not
> > > > > > >>>>>>>>>>>>>>>> very
> > > > > > >>>>>>>>>>>>>>>>>>>>>>>> reliable.
> > > > > > >>>>>>>>>>>>>>>>>>>>>>>>> There is no guarantee that it will be
> > > > executed
> > > > > > >>>>>>>>>> successfully.
> > > > > > >>>>>>>>>>>>>>> We
> > > > > > >>>>>>>>>>>>>>>>> may
> > > > > > >>>>>>>>>>>>>>>>>>>>>>> risk
> > > > > > >>>>>>>>>>>>>>>>>>>>>>>>> leaks when that happens. I think that
> > it's
> > > > > safer
> > > > > > to
> > > > > > >>>>>>>> have
> > > > > > >>>>>>>>> an
> > > > > > >>>>>>>>>>>>>>>>>>>>>> association
> > > > > > >>>>>>>>>>>>>>>>>>>>>>>>> between temp table and session id. So
> we
> > > can
> > > > > > always
> > > > > > >>>>>>>> clean
> > > > > > >>>>>>>>>> up
> > > > > > >>>>>>>>>>>>>>>> temp
> > > > > > >>>>>>>>>>>>>>>>>>>>>>> tables
> > > > > > >>>>>>>>>>>>>>>>>>>>>>>>> which are no longer associated with any
> > > > active
> > > > > > >>>>>>>> sessions.
> > > > > > >>>>>>>>>>>>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>>>>>>>>>>>> Regards,
> > > > > > >>>>>>>>>>>>>>>>>>>>>>>>> Xiaowei
> > > > > > >>>>>>>>>>>>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>>>>>>>>>>>> On Thu, Nov 22, 2018 at 12:55 PM
> jincheng
> > > > sun <
> > > > > > >>>>>>>>>>>>>>>>>>>>>>> sunjincheng121@gmail.com>
> > > > > > >>>>>>>>>>>>>>>>>>>>>>>>> wrote:
> > > > > > >>>>>>>>>>>>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>>>>>>>>>>>>> Hi Jiangjie&Shaoxuan,
> > > > > > >>>>>>>>>>>>>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for initiating this great
> > proposal!
> > > > > > >>>>>>>>>>>>>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>>>>>>>>>>>>> Interactive Programming is very useful
> > and
> > > > > user
> > > > > > >>>>>>>> friendly
> > > > > > >>>>>>>>>> in
> > > > > > >>>>>>>>>>>>>>>> case
> > > > > > >>>>>>>>>>>>>>>>>> of
> > > > > > >>>>>>>>>>>>>>>>>>>>>>> your
> > > > > > >>>>>>>>>>>>>>>>>>>>>>>>>> examples.
> > > > > > >>>>>>>>>>>>>>>>>>>>>>>>>> Moreover, especially when a business
> has
> > > to
> > > > be
> > > > > > >>>>>>>> executed
> > > > > > >>>>>>>>> in
> > > > > > >>>>>>>>>>>>>>>>> several
> > > > > > >>>>>>>>>>>>>>>>>>>>>>>> stages
> > > > > > >>>>>>>>>>>>>>>>>>>>>>>>>> with dependencies, such as the pipeline
> > of
> > > > > Flink
> > > > > > >> ML,
> > > > > > >>>>> in
> > > > > > >>>>>>>>>> order
> > > > > > >>>>>>>>>>>>>>>> to
> > > > > > >>>>>>>>>>>>>>>>>>>>>>> utilize
> > > > > > >>>>>>>>>>>>>>>>>>>>>>>> the
> > > > > > >>>>>>>>>>>>>>>>>>>>>>>>>> intermediate calculation results we
> have
> > > to
> > > > > > >> submit a
> > > > > > >>>>>>>> job
> > > > > > >>>>>>>>>> by
> > > > > > >>>>>>>>>>>>>>>>>>>>>>>> env.execute().
> > > > > > >>>>>>>>>>>>>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>>>>>>>>>>>>> About `cache()`, I think it is
> better
> > > to
> > > > > > name it
> > > > > > >>>>>>>>>>>>>>> `persist()`,
> > > > > > >>>>>>>>>>>>>>>>> and
> > > > > > >>>>>>>>>>>>>>>>>>>>>> the
> > > > > > >>>>>>>>>>>>>>>>>>>>>>>>>> Flink framework determines whether we
> > > > > internally
> > > > > > >>>>> cache
> > > > > > >>>>>>>>> in
> > > > > > >>>>>>>>>>>>>>>> memory
> > > > > > >>>>>>>>>>>>>>>>>> or
> > > > > > >>>>>>>>>>>>>>>>>>>>>>>> persist
> > > > > > >>>>>>>>>>>>>>>>>>>>>>>>>> to the storage system; maybe save the
> > data
> > > > into
> > > > > > >> a state
> > > > > > >>>>>>>>>> backend
> > > > > > >>>>>>>>>>>>>>>>>>>>>>>>>> (MemoryStateBackend or
> > RocksDBStateBackend
> > > > > etc.)
> > > > > > >>>>>>>>>>>>>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>>>>>>>>>>>>> BTW, from my point of view, in the
> > > > future,
> > > > > > >>>>> support
> > > > > > >>>>>>>>> for
> > > > > > >>>>>>>>>>>>>>>>>> streaming
> > > > > > >>>>>>>>>>>>>>>>>>>>>>> and
> > > > > > >>>>>>>>>>>>>>>>>>>>>>>>>> batch mode switching in the same job
> > will
> > > > also
> > > > > > >>>>> benefit
> > > > > > >>>>>>>>>>>>>>>>>>>>>> "Interactive
> > > > > > >>>>>>>>>>>>>>>>>>>>>>>>>> Programming",  I am looking forward to
> > > your
> > > > > > JIRAs
> > > > > > >>>>> and
> > > > > > >>>>>>>>>> FLIP!
> > > > > > >>>>>>>>>>>>>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>>>>>>>>>>>>> Best,
> > > > > > >>>>>>>>>>>>>>>>>>>>>>>>>> Jincheng
> > > > > > >>>>>>>>>>>>>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>>>>>>>>>>>>> Becket Qin <be...@gmail.com>
> > > > > > on Tue, 20 Nov 2018
> > > > > > >>>>>>>>>> at 9:56 PM wrote:
> > > > > > >>>>>>>>>>>>>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>>>>>>>>>>>>>> Hi all,
> > > > > > >>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>>>>>>>>>>>>>> As a few recent email threads have
> > > pointed
> > > > > out,
> > > > > > >> it
> > > > > > >>>>>>>> is a
> > > > > > >>>>>>>>>>>>>>>>> promising
> > > > > > >>>>>>>>>>>>>>>>>>>>>>>>>>> opportunity to enhance Flink Table
> API
> > in
> > > > > > various
> > > > > > >>>>>>>>>> aspects,
> > > > > > >>>>>>>>>>>>>>>>>>>>>> including
> > > > > > >>>>>>>>>>>>>>>>>>>>>>>>>>> functionality and ease of use among
> > > others.
> > > > > One
> > > > > > >> of
> > > > > > >>>>>>>> the
> > > > > > >>>>>>>>>>>>>>>>> scenarios
> > > > > > >>>>>>>>>>>>>>>>>>>>>>> where
> > > > > > >>>>>>>>>>>>>>>>>>>>>>>> we
> > > > > > >>>>>>>>>>>>>>>>>>>>>>>>>>> feel Flink could improve is
> interactive
> > > > > > >>>>> programming.
> > > > > > >>>>>>>> To
> > > > > > >>>>>>>>>>>>>>>> explain
> > > > > > >>>>>>>>>>>>>>>>>> the
> > > > > > >>>>>>>>>>>>>>>>>>>>>>>>>> issues
> > > > > > >>>>>>>>>>>>>>>>>>>>>>>>>>> and facilitate the discussion on the
> > > > > solution,
> > > > > > we
> > > > > > >>>>> put
> > > > > > >>>>>>>>>>>>>>>> together
> > > > > > >>>>>>>>>>>>>>>>>> the
> > > > > > >>>>>>>>>>>>>>>>>>>>>>>>>>> following document with our proposal.
> > > > > > >>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>
> > > > > > >>>>>>>>>>
> > > > > > >>>>>>>>>
> > > > > > >>>>>>>>
> > > > > > >>>>>>
> > > > > > >>>>>
> > > > > > >>
> > > > > >
> > > > >
> > > >
> > >
> >
> https://docs.google.com/document/d/1d4T2zTyfe7hdncEUAxrlNOYr4e5IMNEZLyqSuuswkA0/edit?usp=sharing
> > > > > > >>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>>>>>>>>>>>>>> Feedback and comments are very
> welcome!
> > > > > > >>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks,
> > > > > > >>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>>>>>>>>>>>>>> Jiangjie (Becket) Qin
> > > > > > >>>>>>>>>>>>>>>>>>>>>>>>>>>

Re: [DISCUSS] Support Interactive Programming in Flink Table API

Posted by Till Rohrmann <tr...@apache.org>.
It's true that b, c, d and e will all read from the original DAG that
generates a. But all subsequent operators (when running multiple queries)
which reference cachedTableA should not need to reproduce `a` but directly
consume the intermediate result.

Conceptually one could think of cache() as introducing a caching operator
from which you need to consume if you want to benefit from the caching
functionality.
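
As a minimal sketch, assuming the proposed `CachedTable cache()` signature
(none of this is existing Flink API):

Table a = tableEnv.scan("src").groupBy("key").select("key, value.sum as total");

CachedTable cachedA = a.cache(); // conceptually inserts the caching operator

Table d = cachedA.select("total"); // consumes the cached intermediate result
Table e = a.select("total");       // deliberately re-runs the original DAG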

I agree, ideally the optimizer makes this kind of decision about which
intermediate result should be cached. But especially when executing ad-hoc
queries the user might know better which results need to be cached, because
Flink might not see the full DAG. In that sense, I would consider the
cache() method as a hint for the optimizer. Of course, in the future we
might add functionality which tries to automatically cache results (e.g.
caching the latest intermediate results until a certain amount of space is
used). But this should hopefully not conflict with `CachedTable cache()`.

Cheers,
Till

On Tue, Dec 4, 2018 at 2:33 PM Becket Qin <be...@gmail.com> wrote:

> Hi Till,
>
> Thanks for the clarification. I am still a little confused.
>
> If cache() returns a CachedTable, the example might become:
>
> b = a.map(...)
> c = a.map(...)
>
> cachedTableA = a.cache()
> d = cachedTableA.map(...)
> e = a.map(...)
>
> In the above case, if cache() is lazily evaluated, b, c, d and e are all
> going to be reading from the original DAG that generates a. But with a
> naive expectation, d should be reading from the cache. This does not seem
> to solve the potential confusion you raised, right?
>
> Just to be clear, my understanding is all based on the assumption that the
> tables are immutable. Therefore, after a.cache(), the *cachedTableA* and the
> original table *a* should be completely interchangeable.
>
> That said, I think a valid argument is optimization. There are indeed cases
> where reading from the original DAG could be faster than reading from the
> cache, as in the following example:
>
> a.filter('f1 > 100)
> a.cache()
> b = a.filter('f1 < 100)
>
> Ideally the optimizer should be intelligent enough to decide which way is
> faster, without user intervention. In this case, it will identify that b
> would just be an empty table, thus skip reading from the cache completely.
> But I agree that returning a CachedTable would give the user control over
> when to use the cache, even though I still feel that letting the optimizer
> handle this is a better option in the long run.
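
To make the contrast concrete, a sketch of how the two variants would read
(assuming the proposed API, with `CachedTable` as the handle type under
discussion):

// void cache(): the planner decides where each read goes
a.cache();
Table b = a.filter("f1 < 100"); // may hit the cache or the original DAG

// CachedTable cache(): the user decides
CachedTable cachedA = a.cache();
Table b1 = cachedA.filter("f1 < 100"); // always reads the cache
Table b2 = a.filter("f1 < 100");       // always re-runs the source query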
>
> Thanks,
>
> Jiangjie (Becket) Qin
>
>
>
>
> On Tue, Dec 4, 2018 at 6:51 PM Till Rohrmann <tr...@apache.org> wrote:
>
> > Yes you are right Becket that it still depends on the actual execution of
> > the job whether a consumer reads from a cached result or not.
> >
> > My point was actually about the properties of a (cached vs. non-cached)
> and
> > not about the execution. I would not make cache trigger the execution of
> > the job because one loses some flexibility by eagerly triggering the
> > execution.
> >
> > I tried to argue for an explicit CachedTable which is returned by the
> > cache() method like Piotr did in order to make the API more explicit.
> >
> > Cheers,
> > Till
> >
> > On Mon, Dec 3, 2018 at 4:23 PM Becket Qin <be...@gmail.com> wrote:
> >
> > > Hi Till,
> > >
> > > That is a good example. Just a minor correction, in this case, b, c
> and d
> > > will all consume from a non-cached a. This is because the cache will only
> be
> > > created on the very first job submission that generates the table to be
> > > cached.
> > >
> > > If I understand correctly, this example is about whether the .cache()
> > method
> > > should be eagerly evaluated or lazily evaluated. In other words, if the
> > > cache() method actually triggers a job that creates the cache, there
> will
> > > be no such confusion. Is that right?
> > >
> > > In the example, although d will not consume from the cached Table while
> > it
> > > looks like it should, from a correctness perspective the code will still
> > return
> > > the correct result, assuming that tables are immutable.
> > >
> > > Personally I feel it is OK because users probably won't really worry
> > about
> > > whether the table is cached or not. And lazy cache could avoid some
> > > unnecessary caching if a cached table is never created in the user
> > > application. But I am not opposed to eager evaluation of the cache.
> > >
> > > Thanks,
> > >
> > > Jiangjie (Becket) Qin
> > >
> > >
> > >
> > > On Mon, Dec 3, 2018 at 10:01 PM Till Rohrmann <tr...@apache.org>
> > > wrote:
> > >
> > > > Another argument for Piotr's point is that lazily changing properties
> > of
> > > a
> > > > node affects all downstream consumers but does not necessarily have
> to
> > > > happen before these consumers are defined. From a user's perspective
> > this
> > > > can be quite confusing:
> > > >
> > > > b = a.map(...)
> > > > c = a.map(...)
> > > >
> > > > a.cache()
> > > > d = a.map(...)
> > > >
> > > > now b, c and d will consume from a cached operator. In this case, the
> > > user
> > > > would most likely expect that only d reads from a cached result.
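
With the handle variant, that expectation becomes directly expressible
(a sketch mirroring the pseudo-code above, assuming the proposed
`CachedTable cache()`):

Table b = a.map(...); // reads the original a
Table c = a.map(...); // reads the original a

CachedTable cachedA = a.cache();
Table d = cachedA.map(...); // explicitly reads the cached result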
> > > >
> > > > Cheers,
> > > > Till
> > > >
> > > > On Mon, Dec 3, 2018 at 11:32 AM Piotr Nowojski <
> > piotr@data-artisans.com>
> > > > wrote:
> > > >
> > > > > Hey Shaoxuan and Becket,
> > > > >
> > > > > > Can you explain a bit more on what the side effects are? So far
> my
> > > > > > understanding is that such side effects only exist if a table is
> > > > mutable.
> > > > > > Is that the case?
> > > > >
> > > > > Not only that. There are also performance implications and those
> are
> > > > > another implicit side effect of using `void cache()`. As I wrote
> > > before,
> > > > > reading from cache might not always be desirable, thus it can cause
> > > > > performance degradation and I’m fine with that - user's or
> > optimiser’s
> > > > > choice. What I do not like is that this implicit side effect can
> > > manifest
> > > > > in completely different part of code, that wasn’t touched by a user
> > > while
> > > > > he was adding `void cache()` call somewhere else. And even if
> caching
> > > > > improves performance, it’s still a side effect of `void cache()`.
> > > Almost
> > > > > from the definition `void` methods have only side effects. As I
> wrote
> > > > > before, there are couple of scenarios where this might be
> undesirable
> > > > > and/or unexpected, for example:
> > > > >
> > > > > 1.
> > > > > Table b = …;
> > > > > b.cache()
> > > > > x = b.join(…)
> > > > > y = b.count()
> > > > > // ...
> > > > > // a
> > > > > // hundred
> > > > > // lines
> > > > > // of
> > > > > // code
> > > > > // later
> > > > > z = b.filter(…).groupBy(…) // this might be even hidden in a
> > different
> > > > > method/file/package/dependency
> > > > >
> > > > > 2.
> > > > >
> > > > > Table b = ...
> > > > > If (some_condition) {
> > > > >   foo(b)
> > > > > }
> > > > > Else {
> > > > >   bar(b)
> > > > > }
> > > > > z = b.filter(…).groupBy(…)
> > > > >
> > > > >
> > > > > Void foo(Table b) {
> > > > >   b.cache()
> > > > >   // do something with b
> > > > > }
> > > > >
> > > > > In both examples above, `b.cache()` will implicitly affect `z =
> > > > > b.filter(…).groupBy(…)` (both the semantics of the program, if the
> > > > > sources are mutable, and its performance), which might be far from
> > > > > obvious.
> > > > >
> > > > > On top of that, there is still my argument that having a
> > > > > `MaterializedTable` or `CachedTable` handle is more flexible for us
> > > > > in the future and for the user (as a manual option to bypass cache
> > > > > reads).
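> > > > >
> > > > > For example (a sketch of the handle-based variant being argued for
> > > > > here; `CachedTable` and its methods are hypothetical):
> > > > >
> > > > > Table b = ...
> > > > > CachedTable cachedB = b.cache(); // explicit handle to the cache
> > > > > Table x = cachedB.join(...)      // explicitly reads from the cache
> > > > > Table y = b.filter(...)          // deliberately bypasses the cache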
> > > > >
> > > > > >  But Jiangjie is correct,
> > > > > > the source table in batching should be immutable. It is the
> user’s
> > > > > > responsibility to ensure it, otherwise even a regular failover
> may
> > > lead
> > > > > > to inconsistent results.
> > > > >
> > > > > Yes, I agree that's what a perfect world/good deployment should be.
> > > > > But it often isn't, and while I'm not trying to fix this (since the
> > > > > proper fix is to support transactions), I'm just trying to minimise
> > > > > confusion for the users that are not fully aware of what's going on
> > > > > and operate in a less than perfect setup. And if something bites
> > > > > them after adding a `b.cache()` call, to make sure that they at
> > > > > least know all of the places that adding this line can affect.
> > > > >
> > > > > Thanks, Piotrek
> > > > >
> > > > > > On 1 Dec 2018, at 15:39, Becket Qin <be...@gmail.com>
> wrote:
> > > > > >
> > > > > > Hi Piotrek,
> > > > > >
> > > > > > Thanks again for the clarification. Some more replies are
> > following.
> > > > > >
> > > > > > But keep in mind that `.cache()` will/might not only be used in
> > > > > interactive
> > > > > >> programming and not only in batching.
> > > > > >
> > > > > > It is true. Actually, in stream processing, cache() has the same
> > > > > > semantic as in batch processing. The semantic is the following:
> > > > > > for a table created via a series of computations, save that table
> > > > > > for later reference to avoid re-running the computation logic to
> > > > > > regenerate the table. Once the application exits, drop all the
> > > > > > caches.
> > > > > > This semantic is the same for both batch and stream processing.
> > > > > > The difference is that stream applications will only run once as
> > > > > > they are long running, while batch applications may be run
> > > > > > multiple times, hence the cache may be created and dropped each
> > > > > > time the application runs.
> > > > > > Admittedly, there will probably be some resource management
> > > > > > requirements for the streaming cached table, such as time-based /
> > > > > > size-based retention, to address the infinite data issue. But such
> > > > > > requirements do not change the semantic.
> > > > > > You are right that interactive programming is just one use case of
> > > > > > cache(). It is not the only use case.
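> > > > > >
> > > > > > As a concrete sketch of this semantic (assuming a batch
> > > > > > TableEnvironment tEnv; names are illustrative):
> > > > > >
> > > > > > Table t = tEnv.scan("src").groupBy(...).select(...)
> > > > > > t.cache();           // save t for later reference
> > > > > > long c1 = t.count(); // computes t, the cache is created
> > > > > > long c2 = t.count(); // served from the cache, no recomputation
> > > > > > // application exits -> the cache is dropped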
> > > > > >
> > > > > >> For me the more important issue is not having the `void cache()`
> > > > > >> with side effects.
> > > > > >
> > > > > > This is indeed the key point. The argument around whether cache()
> > > > should
> > > > > > return something already indicates that cache() and materialize()
> > > > address
> > > > > > different issues.
> > > > > > Can you explain a bit more on what the side effects are? So far my
> > > > > > understanding is that such side effects only exist if a table is
> > > > > > mutable. Is that the case?
> > > > > >
> > > > > >> I don't know; probably initially we should make CachedTable
> > > > > >> read-only. I don't find it more confusing than the fact that a
> > > > > >> user cannot write to views or materialised views in SQL, or that
> > > > > >> a user currently cannot write to a Table.
> > > > > >
> > > > > > I don't think anyone should insert something into a cache. By
> > > > > > definition, the cache should only be updated when the
> > > > > > corresponding original table is updated. What I am wondering is:
> > > > > > given the following two facts:
> > > > > > 1. If and only if a table is mutable (with something like
> > > > > > insert()), a CachedTable may have implicit behavior.
> > > > > > 2. A CachedTable extends a Table.
> > > > > > We can come to the conclusion that a CachedTable is mutable and
> > > > > > users can insert into the CachedTable directly. This is what I
> > > > > > found confusing.
> > > > > >
> > > > > > Thanks,
> > > > > >
> > > > > > Jiangjie (Becket) Qin
> > > > > >
> > > > > > On Sat, Dec 1, 2018 at 2:45 AM Piotr Nowojski <
> > > piotr@data-artisans.com
> > > > >
> > > > > > wrote:
> > > > > >
> > > > > >> Hi all,
> > > > > >>
> > > > > >> Regarding naming `cache()` vs `materialize()`: one more
> > > > > >> explanation of why `materialize()` is more natural to me is that
> > > > > >> I think of all “Table”s in the Table API as views. They behave
> > > > > >> the same way as SQL views; the only difference for me is that
> > > > > >> their lifetime is short - the current session - which is limited
> > > > > >> by the different execution model. That’s why “caching” a view for
> > > > > >> me is just materialising it.
> > > > > >>
> > > > > >> However, I see and understand your point of view. Coming from
> > > > > >> DataSet/DataStream and, generally speaking, the non-SQL world,
> > > > > >> `cache()` is more natural. But keep in mind that `.cache()`
> > > > > >> will/might be used not only in interactive programming and not
> > > > > >> only in batching. Naming is one issue, though, and not that
> > > > > >> critical to me. Especially since once we implement proper
> > > > > >> materialised views, we can always deprecate/rename `cache()` if
> > > > > >> we deem it appropriate.
> > > > > >>
> > > > > >>
> > > > > >> For me the more important issue is not having the `void cache()`
> > > > > >> with side effects, exactly for the reasons that you have
> > > > > >> mentioned. True: results might be non-deterministic if the
> > > > > >> underlying source tables are changing.
> > > > > >> The problem is that `void cache()` implicitly changes the
> > > > > >> semantics of subsequent uses of the cached/materialized Table. It
> > > > > >> can cause a “wtf” moment for a user if he inserts a “b.cache()”
> > > > > >> call in some place in his code and suddenly some other random
> > > > > >> places behave differently. If `materialize()` or `cache()`
> > > > > >> returns a Table handle, we force the user to explicitly use the
> > > > > >> cache, which removes the “random” part from the “suddenly some
> > > > > >> other random places are behaving differently”.
> > > > > >>
> > > > > >> This argument and others that I’ve raised (greater flexibility /
> > > > > >> allowing the user to explicitly bypass the cache) are independent
> > > > > >> of the `cache()` vs `materialize()` discussion.
> > > > > >>
> > > > > >>> Does that mean one can also insert into the CachedTable? This
> > > sounds
> > > > > >> pretty confusing.
> > > > > >>
> > > > > >> I don’t know; probably initially we should make CachedTable
> > > > > >> read-only. I don’t find it more confusing than the fact that a
> > > > > >> user cannot write to views or materialised views in SQL, or that
> > > > > >> a user currently cannot write to a Table.
> > > > > >>
> > > > > >> Piotrek
> > > > > >>
> > > > > >>> On 30 Nov 2018, at 17:38, Xingcan Cui <xi...@gmail.com>
> > wrote:
> > > > > >>>
> > > > > >>> Hi all,
> > > > > >>>
> > > > > >>> I agree with @Becket that `cache()` and `materialize()` should
> > > > > >>> be considered as two different methods, where the latter is more
> > > > > >>> sophisticated.
> > > > > >>>
> > > > > >>> According to my understanding, the initial idea is just to
> > > > > >>> introduce a simple cache or persist mechanism, but as the Table
> > > > > >>> API is a high-level API, it’s natural for us to think in a SQL
> > > > > >>> way.
> > > > > >>>
> > > > > >>> Maybe we can add the `cache()` method to the DataSet API and
> > > > > >>> force users to translate a Table to a DataSet before caching it.
> > > > > >>> Then the users should manually register the cached DataSet as a
> > > > > >>> table again (we may need some table replacement mechanisms for
> > > > > >>> datasets with an identical schema but different contents here).
> > > > > >>> After all, it’s the dataset rather than the dynamic table that
> > > > > >>> needs to be cached, right?
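> > > > > >>>
> > > > > >>> A rough sketch of that workflow (toDataSet/fromDataSet are the
> > > > > >>> existing conversions; the DataSet-level cache() is the
> > > > > >>> hypothetical part):
> > > > > >>>
> > > > > >>> DataSet<Row> ds = tEnv.toDataSet(t, Row.class);
> > > > > >>> ds.cache(); // proposed: pin the intermediate result
> > > > > >>> Table cached = tEnv.fromDataSet(ds);
> > > > > >>> tEnv.registerTable("t_cached", cached); // re-register it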
> > > > > >>>
> > > > > >>> Best,
> > > > > >>> Xingcan
> > > > > >>>
> > > > > >>>> On Nov 30, 2018, at 10:57 AM, Becket Qin <
> becket.qin@gmail.com>
> > > > > wrote:
> > > > > >>>>
> > > > > >>>> Hi Piotrek and Jark,
> > > > > >>>>
> > > > > >>>> Thanks for the feedback and explanation. Those are good
> > arguments.
> > > > > But I
> > > > > >>>> think those arguments are mostly about materialized view. Let
> me
> > > try
> > > > > to
> > > > > >>>> explain the reason I believe cache() and materialize() are
> > > > different.
> > > > > >>>>
> > > > > >>>> I think cache() and materialize() have quite different
> > > implications.
> > > > > An
> > > > > >>>> analogy I can think of is save()/publish(). When users call
> > > cache(),
> > > > > it
> > > > > >> is
> > > > > >>>> just like they are saving an intermediate result as a draft of
> > > their
> > > > > >> work,
> > > > > >>>> this intermediate result may not have any realistic meaning.
> > > Calling
> > > > > >>>> cache() does not mean users want to publish the cached table
> in
> > > any
> > > > > >> manner.
> > > > > >>>> But when users call materialize(), that means "I have
> something
> > > > > >> meaningful
> > > > > >>>> to be reused by others", now users need to think about the
> > > > validation,
> > > > > >>>> update & versioning, lifecycle of the result, etc.
> > > > > >>>>
> > > > > >>>> Piotrek's suggestions on variations of the materialize()
> methods
> > > are
> > > > > >> very
> > > > > >>>> useful. It would be great if Flink have them. The concept of
> > > > > >> materialized
> > > > > >>>> view is actually a pretty big feature, not to say the related
> > > stuff
> > > > > like
> > > > > >>>> triggers/hooks you mentioned earlier. I think the materialized
> > > view
> > > > > >> itself
> > > > > >>>> should be discussed in a more thorough and systematic manner.
> > And
> > > I
> > > > > >> found
> > > > > >>>> that discussion is kind of orthogonal and way beyond
> interactive
> > > > > >>>> programming experience.
> > > > > >>>>
> > > > > >>>> The example you gave was interesting. I still have some
> > questions,
> > > > > >> though.
> > > > > >>>>
> > > > > >>>> Table source = … // some source that scans files from a
> > directory
> > > > > >>>>> “/foo/bar/“
> > > > > >>>>> Table t1 = source.groupBy(…).select(…).where(…) ….;
> > > > > >>>>> Table t2 = t1.materialize() // (or `cache()`)
> > > > > >>>>
> > > > > >>>> t2.count() // initialise cache (if it’s lazily initialised)
> > > > > >>>>> int a1 = t1.count()
> > > > > >>>>> int b1 = t2.count()
> > > > > >>>>> // something in the background (or we trigger it) writes new
> > > files
> > > > to
> > > > > >>>>> /foo/bar
> > > > > >>>>> int a2 = t1.count()
> > > > > >>>>> int b2 = t2.count()
> > > > > >>>>> t2.refresh() // possible future extension, not to be
> > implemented
> > > in
> > > > > the
> > > > > >>>>> initial version
> > > > > >>>>>
> > > > > >>>>
> > > > > >>>> What if someone else added some more files to /foo/bar at this
> > > > > >>>> point? In that case, a3 won't equal b3, and the result becomes
> > > > > >>>> non-deterministic, right?
> > > > > >>>>
> > > > > >>>> int a3 = t1.count()
> > > > > >>>>> int b3 = t2.count()
> > > > > >>>>> t2.drop() // another possible future extension, manual
> “cache”
> > > > > dropping
> > > > > >>>>
> > > > > >>>>
> > > > > >>>> When we talk about interactive programming, in most cases, we
> > > > > >>>> are talking about batch applications. A fundamental assumption
> > > > > >>>> of such a case is that the source data is complete before the
> > > > > >>>> data processing begins, and the data will not change during the
> > > > > >>>> processing. IMO, if additional rows need to be added to some
> > > > > >>>> source during the processing, it should be done in ways like
> > > > > >>>> unioning the source with another table containing the rows to
> > > > > >>>> be added.
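> > > > > >>>>
> > > > > >>>> For example (a sketch; base and delta are placeholder tables):
> > > > > >>>>
> > > > > >>>> Table base = ...   // the original, immutable source
> > > > > >>>> Table delta = ...  // the newly arrived rows
> > > > > >>>> Table all = base.unionAll(delta); // process the combined data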
> > > > > >>>>
> > > > > >>>> There are a few cases where computations are executed
> > > > > >>>> repeatedly on a changing data source.
> > > > > >>>>
> > > > > >>>> For example, people may run an ML training job every hour with
> > > > > >>>> the samples newly added in the past hour. In that case, the
> > > > > >>>> source data between runs will indeed change. But still, the
> > > > > >>>> data remains unchanged within one run. And usually in that
> > > > > >>>> case, the result will need versioning, i.e., for a given
> > > > > >>>> result, it indicates that the result was derived from the
> > > > > >>>> source data as of a certain timestamp.
> > > > > >>>>
> > > > > >>>> Another example is something like a data warehouse. In this
> > > > > >>>> case, there are a few sources of original/raw data. On top of
> > > > > >>>> those sources, many materialized views / queries / reports /
> > > > > >>>> dashboards can be created to generate derived data. Those
> > > > > >>>> derived data need to be updated when the underlying original
> > > > > >>>> data changes. In that case, the processing logic that derives
> > > > > >>>> the data needs to be executed repeatedly to update those
> > > > > >>>> reports/views. Again, all those derived data also need version
> > > > > >>>> management, such as a timestamp.
> > > > > >>>>
> > > > > >>>> In either of the above two cases, during a single run of the
> > > > > >>>> processing logic, the data cannot change; otherwise the
> > > > > >>>> behavior of the processing logic may be undefined. In the above
> > > > > >>>> two examples, when writing the processing logic, users can use
> > > > > >>>> .cache() to hint to Flink that those results should be saved to
> > > > > >>>> avoid repeated computation. And then for the result of my
> > > > > >>>> application logic, I'll call materialize(), so that these
> > > > > >>>> results can be managed by the system with versioning, metadata
> > > > > >>>> management, lifecycle management, ACLs, etc.
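> > > > > >>>>
> > > > > >>>> In code, the division of labor could look like this (a sketch;
> > > > > >>>> materialize() is the hypothetical higher-level API discussed
> > > > > >>>> here):
> > > > > >>>>
> > > > > >>>> Table samples = tEnv.scan("raw").filter(...)
> > > > > >>>> samples.cache();      // intra-application reuse only
> > > > > >>>> Table report = samples.groupBy(...).select(...)
> > > > > >>>> report.materialize(); // versioned, system-managed result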
> > > > > >>>>
> > > > > >>>> It is true that we can use materialize() to do the cache()
> > > > > >>>> job, but I am really reluctant to shoehorn cache() into
> > > > > >>>> materialize() and force users to worry about a bunch of
> > > > > >>>> implications that they needn't have to. I am absolutely on your
> > > > > >>>> side that a redundant API is bad. But it is equally
> > > > > >>>> frustrating, if not more so, when the same API does different
> > > > > >>>> things.
> > > > > >>>>
> > > > > >>>> Thanks,
> > > > > >>>>
> > > > > >>>> Jiangjie (Becket) Qin
> > > > > >>>>
> > > > > >>>>
> > > > > >>>> On Fri, Nov 30, 2018 at 10:34 PM Shaoxuan Wang <
> > > wshaoxuan@gmail.com
> > > > >
> > > > > >> wrote:
> > > > > >>>>
> > > > > >>>>> Thanks Piotrek,
> > > > > >>>>> You provided a very good example; it clears up all the
> > > > > >>>>> confusion I had.
> > > > > >>>>> It is clear that there is something we have not considered in
> > > > > >>>>> the initial proposal. We intend to force the user to reuse the
> > > > > >>>>> cached/materialized table if its cache() method is executed.
> > > > > >>>>> We did not expect that a user may want to re-execute the plan
> > > > > >>>>> from the source table. Let me re-think about it and get back
> > > > > >>>>> to you later.
> > > > > >>>>>
> > > > > >>>>> In the meanwhile, this example/observation also implies that
> > > > > >>>>> we cannot fully involve the optimizer in deciding the plan if
> > > > > >>>>> a cache/materialize is explicitly used, because whether to
> > > > > >>>>> reuse the cached data or re-execute the query from the source
> > > > > >>>>> data may lead to different results. (But I guess the optimizer
> > > > > >>>>> can still help in some cases ---- as long as it does not
> > > > > >>>>> re-execute from the varied source, we should be safe.)
> > > > > >>>>>
> > > > > >>>>> Regards,
> > > > > >>>>> Shaoxuan
> > > > > >>>>>
> > > > > >>>>>
> > > > > >>>>>
> > > > > >>>>> On Fri, Nov 30, 2018 at 9:13 PM Piotr Nowojski <
> > > > > >> piotr@data-artisans.com>
> > > > > >>>>> wrote:
> > > > > >>>>>
> > > > > >>>>>> Hi Shaoxuan,
> > > > > >>>>>>
> > > > > >>>>>> Re 2:
> > > > > >>>>>>
> > > > > >>>>>>> Table t3 = methodThatAppliesOperators(t1) // t1 is modified
> > > to->
> > > > > t1’
> > > > > >>>>>>
> > > > > >>>>>> What do you mean by “t1 is modified to -> t1’”? That the
> > > > > >>>>>> `methodThatAppliesOperators()` method has changed its plan?
> > > > > >>>>>>
> > > > > >>>>>> I was thinking more about something like this:
> > > > > >>>>>>
> > > > > >>>>>> Table source = … // some source that scans files from a
> > > directory
> > > > > >>>>>> “/foo/bar/“
> > > > > >>>>>> Table t1 = source.groupBy(…).select(…).where(…) ….;
> > > > > >>>>>> Table t2 = t1.materialize() // (or `cache()`)
> > > > > >>>>>>
> > > > > >>>>>> t2.count() // initialise cache (if it’s lazily initialised)
> > > > > >>>>>>
> > > > > >>>>>> int a1 = t1.count()
> > > > > >>>>>> int b1 = t2.count()
> > > > > >>>>>>
> > > > > >>>>>> // something in the background (or we trigger it) writes new
> > > files
> > > > > to
> > > > > >>>>>> /foo/bar
> > > > > >>>>>>
> > > > > >>>>>> int a2 = t1.count()
> > > > > >>>>>> int b2 = t2.count()
> > > > > >>>>>>
> > > > > >>>>>> t2.refresh() // possible future extension, not to be
> > implemented
> > > > in
> > > > > >> the
> > > > > >>>>>> initial version
> > > > > >>>>>>
> > > > > >>>>>> int a3 = t1.count()
> > > > > >>>>>> int b3 = t2.count()
> > > > > >>>>>>
> > > > > >>>>>> t2.drop() // another possible future extension, manual
> “cache”
> > > > > >> dropping
> > > > > >>>>>>
> > > > > >>>>>> assertTrue(a1 == b1) // same results, but b1 comes from the
> > > > “cache"
> > > > > >>>>>> assertTrue(b1 == b2) // both values come from the same cache
> > > > > >>>>>> assertTrue(a2 > b2) // b2 comes from cache, a2 re-executed
> > full
> > > > > table
> > > > > >>>>> scan
> > > > > >>>>>> and has more data
> > > > > >>>>>> assertTrue(b3 > b2) // b3 comes from refreshed cache
> > > > > >>>>>> assertTrue(b3 == a2 == a3)
> > > > > >>>>>>
> > > > > >>>>>> Piotrek
> > > > > >>>>>>
> > > > > >>>>>>> On 30 Nov 2018, at 10:22, Jark Wu <im...@gmail.com>
> wrote:
> > > > > >>>>>>>
> > > > > >>>>>>> Hi,
> > > > > >>>>>>>
> > > > > >>>>>>> It is a very interesting and useful design!
> > > > > >>>>>>>
> > > > > >>>>>>> Here I want to share some of my thoughts:
> > > > > >>>>>>>
> > > > > >>>>>>> 1. I agree that the cache() method should return some Table
> > > > > >>>>>>> to avoid some unexpected problems caused by the mutable
> > > > > >>>>>>> object. All the existing methods of Table return a new Table
> > > > > >>>>>>> instance.
> > > > > >>>>>>>
> > > > > >>>>>>> 2. I think materialize() would be more consistent with SQL;
> > > > > >>>>>>> this makes it possible to support the same feature for SQL
> > > > > >>>>>>> (materialized view) and keep the same API for users in the
> > > > > >>>>>>> future.
> > > > > >>>>>>> But I'm also fine if we choose cache().
> > > > > >>>>>>>
> > > > > >>>>>>> 3. In the proposal, a TableService (or FlinkService?) is
> > > > > >>>>>>> used to cache the result of the (intermediate) table.
> > > > > >>>>>>> But the name TableService may be a bit too general and is
> > > > > >>>>>>> not easy to understand correctly at first glance (a
> > > > > >>>>>>> metastore for tables?). Maybe a more specific name would be
> > > > > >>>>>>> better, such as TableCacheService or TableMaterializeService
> > > > > >>>>>>> or something else.
> > > > > >>>>>>>
> > > > > >>>>>>> Best,
> > > > > >>>>>>> Jark
> > > > > >>>>>>>
> > > > > >>>>>>>
> > > > > >>>>>>> On Thu, 29 Nov 2018 at 21:16, Fabian Hueske <
> > fhueske@gmail.com
> > > >
> > > > > >> wrote:
> > > > > >>>>>>>
> > > > > >>>>>>>> Hi,
> > > > > >>>>>>>>
> > > > > >>>>>>>> Thanks for the clarification Becket!
> > > > > >>>>>>>>
> > > > > >>>>>>>> I have a few thoughts to share / questions:
> > > > > >>>>>>>>
> > > > > >>>>>>>> 1) I'd like to know how you plan to implement the feature
> > on a
> > > > > plan
> > > > > >> /
> > > > > >>>>>>>> planner level.
> > > > > >>>>>>>>
> > > > > >>>>>>>> I would imagine the following to happen when Table.cache()
> > > > > >>>>>>>> is called:
> > > > > >>>>>>>>
> > > > > >>>>>>>> 1) immediately optimize the Table and internally convert
> it
> > > > into a
> > > > > >>>>>>>> DataSet/DataStream. This is necessary, to avoid that
> > operators
> > > > of
> > > > > >>>>> later
> > > > > >>>>>>>> queries on top of the Table are pushed down.
> > > > > >>>>>>>> 2) register the DataSet/DataStream as a
> > > > DataSet/DataStream-backed
> > > > > >>>>> Table
> > > > > >>>>>> X
> > > > > >>>>>>>> 3) add a sink to the DataSet/DataStream. This is the
> > > > > materialization
> > > > > >>>>> of
> > > > > >>>>>> the
> > > > > >>>>>>>> Table X
> > > > > >>>>>>>>
> > > > > >>>>>>>> Based on your proposal the following would happen:
> > > > > >>>>>>>>
> > > > > >>>>>>>> Table t1 = ....
> > > > > >>>>>>>> t1.cache(); // cache() returns void. The logical plan of
> t1
> > is
> > > > > >>>>> replaced
> > > > > >>>>>> by
> > > > > >>>>>>>> a scan of X. There is also a reference to the
> > materialization
> > > of
> > > > > X.
> > > > > >>>>>>>>
> > > > > >>>>>>>> t1.count(); // this executes the program, including the
> > > > > >>>>>> DataSet/DataStream
> > > > > >>>>>>>> that backs X and the sink that writes the materialization
> > of X
> > > > > >>>>>>>> t1.count(); // this executes the program, but reads X from
> > the
> > > > > >>>>>>>> materialization.
> > > > > >>>>>>>>
> > > > > >>>>>>>> My question is, how do you determine when the scan of t1
> > > > > >>>>>>>> should go against the DataSet/DataStream program and when
> > > > > >>>>>>>> against the materialization?
> > > > > >>>>>>>> AFAIK, there is no hook that will tell you that a part of
> > the
> > > > > >> program
> > > > > >>>>>> was
> > > > > >>>>>>>> executed. Flipping a switch during optimization or plan
> > > > generation
> > > > > >> is
> > > > > >>>>>> not
> > > > > >>>>>>>> sufficient as there is no guarantee that the plan is also
> > > > > executed.
> > > > > >>>>>>>>
> > > > > >>>>>>>> Overall, this behavior is somewhat similar to what I
> > > > > >>>>>>>> proposed in FLINK-8950, which does not include persisting
> > > > > >>>>>>>> the table, but just optimizing and re-registering it as a
> > > > > >>>>>>>> DataSet/DataStream scan.
> > > > > >>>>>>>>
> > > > > >>>>>>>> 2) I think Piotr has a point about the implicit behavior
> and
> > > > side
> > > > > >>>>>> effects
> > > > > >>>>>>>> of the cache() method if it does not return anything.
> > > > > >>>>>>>> Consider the following example:
> > > > > >>>>>>>>
> > > > > >>>>>>>> Table t1 = ???
> > > > > >>>>>>>> Table t2 = methodThatAppliesOperators(t1);
> > > > > >>>>>>>> Table t3 = methodThatAppliesOtherOperators(t1);
> > > > > >>>>>>>>
> > > > > >>>>>>>> In this case, the behavior/performance of the plan that
> > > results
> > > > > from
> > > > > >>>>> the
> > > > > >>>>>>>> second method call depends on whether t1 was modified by
> the
> > > > first
> > > > > >>>>>> method
> > > > > >>>>>>>> or not.
> > > > > >>>>>>>> This is the classic issue of mutable vs. immutable
> objects.
> > > > > >>>>>>>> Also, as Piotr pointed out, it might also be good to have
> > the
> > > > > >> original
> > > > > >>>>>> plan
> > > > > >>>>>>>> of t1, because in some cases it is possible to push
> filters
> > > down
> > > > > >> such
> > > > > >>>>>> that
> > > > > >>>>>>>> evaluating the query from scratch might be more efficient
> > than
> > > > > >>>>> accessing
> > > > > >>>>>>>> the cache.
> > > > > >>>>>>>> Moreover, a CachedTable could extend Table and offer a
> > > > > >>>>>>>> refresh() method.
> > > > > >>>>>>>> This sounds quite useful in an interactive session mode.
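> > > > > >>>>>>>>
> > > > > >>>>>>>> Something like this (a sketch; CachedTable and refresh()
> > > > > >>>>>>>> are the hypothetical pieces):
> > > > > >>>>>>>>
> > > > > >>>>>>>> CachedTable ct = t1.cache(); // explicit handle
> > > > > >>>>>>>> ct.count();   // served from the materialization
> > > > > >>>>>>>> // source data changes in the meantime
> > > > > >>>>>>>> ct.refresh(); // recompute the materialization
> > > > > >>>>>>>> ct.count();   // now reflects the new data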
> > > > > >>>>>>>>
> > > > > >>>>>>>> 3) Regarding the name, I can see both arguments. IMO,
> > > > > materialize()
> > > > > >>>>>> seems
> > > > > >>>>>>>> to be more future proof.
> > > > > >>>>>>>>
> > > > > >>>>>>>> Best, Fabian
> > > > > >>>>>>>>
> > > > > >>>>>>>> Am Do., 29. Nov. 2018 um 12:56 Uhr schrieb Shaoxuan Wang <
> > > > > >>>>>>>> wshaoxuan@gmail.com>:
> > > > > >>>>>>>>
> > > > > >>>>>>>>> Hi Piotr,
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> Thanks for sharing your ideas on the method naming. We
> will
> > > > think
> > > > > >>>>> about
> > > > > >>>>>>>>> your suggestions. But I don't understand why we need to
> > > change
> > > > > the
> > > > > >>>>>> return
> > > > > >>>>>>>>> type of cache().
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> Cache() is a physical operation; it does not change the
> > > > > >>>>>>>>> logic of the `Table`. On the Table API layer, we should
> > > > > >>>>>>>>> not introduce a new table type unless the logic of the
> > > > > >>>>>>>>> table has been changed. If we introduce a new table type
> > > > > >>>>>>>>> `CachedTable`, we need to create the same set of methods
> > > > > >>>>>>>>> of `Table` for it. I don't think it is worth doing this.
> > > > > >>>>>>>>> Or can you please elaborate more on what the "implicit
> > > > > >>>>>>>>> behaviours/side effects" you are thinking about could be?
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> Regards,
> > > > > >>>>>>>>> Shaoxuan
> > > > > >>>>>>>>>
> > > > > >>>>>>>>>
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> On Thu, Nov 29, 2018 at 7:05 PM Piotr Nowojski <
> > > > > >>>>>> piotr@data-artisans.com>
> > > > > >>>>>>>>> wrote:
> > > > > >>>>>>>>>
> > > > > >>>>>>>>>> Hi Becket,
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>> Thanks for the response.
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>> 1. I wasn’t saying that materialised view must be
> mutable
> > or
> > > > > not.
> > > > > >>>>> The
> > > > > >>>>>>>>> same
> > > > > >>>>>>>>>> thing applies to caches as well. To the contrary, I
> would
> > > > expect
> > > > > >>>>> more
> > > > > >>>>>>>>>> consistency and updates from something that is called
> > > “cache”
> > > > vs
> > > > > >>>>>>>>> something
> > > > > >>>>>>>>>> that’s a “materialised view”. In other words, IMO most
> > > caches
> > > > do
> > > > > >> not
> > > > > >>>>>>>>> serve
> > > > > >>>>>>>>>> you invalid/outdated data and they handle updates on
> their
> > > > own.
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>> 2. I don’t think that having in the future two very
> > similar
> > > > > >> concepts
> > > > > >>>>>> of
> > > > > >>>>>>>>>> `materialized` view and `cache` is a good idea. It would
> > be
> > > > > >>>>> confusing
> > > > > >>>>>>>> for
> > > > > >>>>>>>>>> the users. I think it could be handled by
> > > > variations/overloading
> > > > > >> of
> > > > > >>>>>>>>>> materialised view concept. We could start with:
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>> `MaterializedTable materialize()` - immutable, session
> > > > > >>>>>>>>>> life scope (basically the same semantics as you are
> > > > > >>>>>>>>>> proposing)
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>> And then in the future (if ever) build on top of
> > that/expand
> > > > it
> > > > > >>>>> with:
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>> `MaterializedTable materialize(refreshTime=…)` or
> > > > > >> `MaterializedTable
> > > > > >>>>>>>>>> materialize(refreshHook=…)`
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>> Or with cross session support:
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>> `MaterializedTable materializeInto(connector=…)` or
> > > > > >>>>> `MaterializedTable
> > > > > >>>>>>>>>> materializeInto(tableFactory=…)`
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>> I’m not saying that we should implement cross-session
> > > > > >>>>>>>>>> support/refreshing now or even in the near future. I’m
> > > > > >>>>>>>>>> just arguing that naming the current immutable,
> > > > > >>>>>>>>>> session-scoped method `materialize()` is more future
> > > > > >>>>>>>>>> proof and more consistent with SQL (on which, after all,
> > > > > >>>>>>>>>> the Table API is heavily based).
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>> 3. Even if we agree on naming it `cache()`, I would
> still
> > > > insist
> > > > > >> on
> > > > > >>>>>>>>>> `cache()` returning `CachedTable` handle to avoid
> implicit
> > > > > >>>>>>>>> behaviours/side
> > > > > >>>>>>>>>> effects and to give both us & users more flexibility.
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>> Piotrek
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>>> On 29 Nov 2018, at 06:20, Becket Qin <
> > becket.qin@gmail.com
> > > >
> > > > > >> wrote:
> > > > > >>>>>>>>>>>
> > > > > >>>>>>>>>>> Just to add a little bit, the materialized view is
> > probably
> > > > > more
> > > > > >>>>>>>>> similar
> > > > > >>>>>>>>>> to
> > > > > >>>>>>>>>>> the persistent() brought up earlier in the thread. So
> it
> > is
> > > > > >> usually
> > > > > >>>>>>>>> cross
> > > > > >>>>>>>>>>> session and could be used in a larger scope. For
> > example, a
> > > > > >>>>>>>>> materialized
> > > > > >>>>>>>>>>> view created by user A may be visible to user B. It is
> > > > probably
> > > > > >>>>>>>>> something
> > > > > >>>>>>>>>>> we want to have in the future. I'll put it in the
> future
> > > work
> > > > > >>>>>>>> section.
> > > > > >>>>>>>>>>>
> > > > > >>>>>>>>>>> Thanks,
> > > > > >>>>>>>>>>>
> > > > > >>>>>>>>>>> Jiangjie (Becket) Qin
> > > > > >>>>>>>>>>>
> > > > > >>>>>>>>>>> On Thu, Nov 29, 2018 at 9:47 AM Becket Qin <
> > > > > becket.qin@gmail.com
> > > > > >>>
> > > > > >>>>>>>>> wrote:
> > > > > >>>>>>>>>>>
> > > > > >>>>>>>>>>>> Hi Piotrek,
> > > > > >>>>>>>>>>>>
> > > > > >>>>>>>>>>>> Thanks for the explanation.
> > > > > >>>>>>>>>>>>
> > > > > >>>>>>>>>>>> Right now we are mostly thinking of the cached table
> > > > > >>>>>>>>>>>> as immutable. I can see the materialized view would be
> > > > > >>>>>>>>>>>> useful in the future. That said, I think a simple cache
> > > > > >>>>>>>>>>>> mechanism is probably still needed. So to me, cache()
> > > > > >>>>>>>>>>>> and materialize() should be two separate methods as
> > > > > >>>>>>>>>>>> they address different needs. Materialize() is a
> > > > > >>>>>>>>>>>> higher-level concept, usually implying periodic
> > > > > >>>>>>>>>>>> updates, while cache() has a much simpler semantic. For
> > > > > >>>>>>>>>>>> example, one may create a materialized view and use the
> > > > > >>>>>>>>>>>> cache() method in the materialized view creation logic,
> > > > > >>>>>>>>>>>> so that during the materialized view update, they do
> > > > > >>>>>>>>>>>> not need to worry about the case that the cached table
> > > > > >>>>>>>>>>>> is also changed. Maybe under the hood, materialize()
> > > > > >>>>>>>>>>>> and cache() could share some mechanism, but I think a
> > > > > >>>>>>>>>>>> simple cache() method would be handy in a lot of cases.
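> > > > > >>>>>>>>>>>>
> > > > > >>>>>>>>>>>> Roughly like this (a sketch; materialize() and the
> > > > > >>>>>>>>>>>> refresh routine are hypothetical):
> > > > > >>>>>>>>>>>>
> > > > > >>>>>>>>>>>> void refreshView(Table raw) {
> > > > > >>>>>>>>>>>>   raw.cache(); // freeze raw for this refresh run
> > > > > >>>>>>>>>>>>   Table agg = raw.groupBy(...).select(...);
> > > > > >>>>>>>>>>>>   agg.materialize(); // publish the refreshed view
> > > > > >>>>>>>>>>>> }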
> > > > > >>>>>>>>>>>>
> > > > > >>>>>>>>>>>> Thanks,
> > > > > >>>>>>>>>>>>
> > > > > >>>>>>>>>>>> Jiangjie (Becket) Qin
> > > > > >>>>>>>>>>>>
> > > > > >>>>>>>>>>>> On Mon, Nov 26, 2018 at 9:38 PM Piotr Nowojski <
> > > > > >>>>>>>>> piotr@data-artisans.com
> > > > > >>>>>>>>>>>
> > > > > >>>>>>>>>>>> wrote:
> > > > > >>>>>>>>>>>>
> > > > > >>>>>>>>>>>>> Hi Becket,
> > > > > >>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>> Is there any extra thing user can do on a
> > > > MaterializedTable
> > > > > >> that
> > > > > >>>>>>>>> they
> > > > > >>>>>>>>>>>>> cannot do on a Table?
> > > > > >>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>> Maybe not in the initial implementation, but various
> > DBs
> > > > > offer
> > > > > >>>>>>>>>> different
> > > > > >>>>>>>>>>>>> ways to “refresh” the materialised view. Hooks,
> > triggers,
> > > > > >> timers,
> > > > > >>>>>>>>>> manually
> > > > > >>>>>>>>>>>>> etc. Having `MaterializedTable` would help us to
> handle
> > > > that
> > > > > in
> > > > > >>>>> the
> > > > > >>>>>>>>>> future.
> > > > > >>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>> After users call *table.cache(), *users can just use
> > > that
> > > > > >> table
> > > > > >>>>>>>> and
> > > > > >>>>>>>>> do
> > > > > >>>>>>>>>>>>> anything that is supported on a Table, including SQL.
> > > > > >>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>> This is some implicit behaviour with side effects.
> > > > > >>>>>>>>>>>>> Imagine if a user has a long and complicated program
> > > > > >>>>>>>>>>>>> that touches table `b` multiple times, maybe scattered
> > > > > >>>>>>>>>>>>> around different methods. If he modifies his program
> > > > > >>>>>>>>>>>>> by inserting in one place
> > > > > >>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>> b.cache()
> > > > > >>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>> This implicitly alters the semantics and behaviour of
> > > > > >>>>>>>>>>>>> his code all over the place, maybe in ways that might
> > > > > >>>>>>>>>>>>> cause problems. For example, what if the underlying
> > > > > >>>>>>>>>>>>> data is changing?
> > > > > >>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>> Having invisible side effects is also not very clean,
> > for
> > > > > >> example
> > > > > >>>>>>>>> think
> > > > > >>>>>>>>>>>>> about something like this (but more complicated):
> > > > > >>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>> Table b = ...;
> > > > > >>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>> If (some_condition) {
> > > > > >>>>>>>>>>>>> processTable1(b)
> > > > > >>>>>>>>>>>>> }
> > > > > >>>>>>>>>>>>> else {
> > > > > >>>>>>>>>>>>> processTable2(b)
> > > > > >>>>>>>>>>>>> }
> > > > > >>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>> // do more stuff with b
> > > > > >>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>> And user adds `b.cache()` call to only one of the
> > > > > >> `processTable1`
> > > > > >>>>>>>> or
> > > > > >>>>>>>>>>>>> `processTable2` methods.
> > > > > >>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>> On the other hand
> > > > > >>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>> Table materialisedB = b.materialize()
> > > > > >>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>> This avoids (at least some of) the side effect issues
> > > > > >>>>>>>>>>>>> and forces the user to explicitly use `materialisedB`
> > > > > >>>>>>>>>>>>> where it’s appropriate, and to think about what it
> > > > > >>>>>>>>>>>>> actually means. And if something doesn’t work in the
> > > > > >>>>>>>>>>>>> end for the user, he will know what he has changed
> > > > > >>>>>>>>>>>>> instead of blaming Flink for some “magic” underneath.
> > > > > >>>>>>>>>>>>> In the above example, after materialising b in only
> > > > > >>>>>>>>>>>>> one of the methods, he should/would realise the issue
> > > > > >>>>>>>>>>>>> when handling the `MaterializedTable` return value of
> > > > > >>>>>>>>>>>>> that method.
> > > > > >>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>> I guess it comes down to personal preference whether
> > > > > >>>>>>>>>>>>> you like things to be implicit or not. The more of a
> > > > > >>>>>>>>>>>>> power user someone is, the more likely he is to
> > > > > >>>>>>>>>>>>> like/understand implicit behaviour. And we as Table
> > > > > >>>>>>>>>>>>> API designers are the biggest power users out there,
> > > > > >>>>>>>>>>>>> so I would proceed with caution (so that we do not end
> > > > > >>>>>>>>>>>>> up in the crazy Perl realm with its lovely implicit
> > > > > >>>>>>>>>>>>> method arguments ;)
> > > > > >>>>>>>>>>>>> <https://stackoverflow.com/a/14922656/8149051>)
> > > > > >>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>> Table API to also support non-relational processing
> > > cases,
> > > > > >>>>> cache()
> > > > > >>>>>>>>>>>>> might be slightly better.
> > > > > >>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>> I think even such extended Table API could benefit
> from
> > > > > >> sticking
> > > > > >>>>>>>>>> to/being
> > > > > >>>>>>>>>>>>> consistent with SQL where both SQL and Table API are
> > > > > basically
> > > > > >>>>> the
> > > > > >>>>>>>>>> same.
> > > > > >>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>> One more thing. `MaterializedTable materialize()`
> could
> > > be
> > > > > more
> > > > > >>>>>>>>>>>>> powerful/flexible allowing the user to operate both
> on
> > > > > >>>>> materialised
> > > > > >>>>>>>>>> and not
> > > > > >>>>>>>>>>>>> materialised view at the same time for whatever
> reasons
> > > > > >>>>> (underlying
> > > > > >>>>>>>>>> data
> > > > > >>>>>>>>>>>>> changing/better optimisation opportunities after
> > pushing
> > > > down
> > > > > >>>>> more
> > > > > >>>>>>>>>> filters
> > > > > >>>>>>>>>>>>> etc). For example:
> > > > > >>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>> Table b = …;
> > > > > >>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>> MaterializedTable mb = b.materialize();
> > > > > >>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>> val min = mb.min();
> > > > > >>>>>>>>>>>>> val max = mb.max();
> > > > > >>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>> val user42 = b.filter('userId === 42)
> > > > > >>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>> This could be more efficient compared to `b.cache()`
> > > > > >>>>>>>>>>>>> if `filter('userId === 42)` allows for much more
> > > > > >>>>>>>>>>>>> aggressive optimisations.
> > > > > >>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>> Piotrek
> > > > > >>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>> On 26 Nov 2018, at 12:14, Fabian Hueske <
> > > > fhueske@gmail.com>
> > > > > >>>>>>>> wrote:
> > > > > >>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>> I'm not suggesting to add support for Ignite. This
> was
> > > > just
> > > > > an
> > > > > >>>>>>>>>> example.
> > > > > >>>>>>>>>>>>>> Plasma and Arrow sound interesting, too.
> > > > > >>>>>>>>>>>>>> For the sake of this proposal, it would be up to the
> > > user
> > > > to
> > > > > >>>>>>>>>> implement a
> > > > > >>>>>>>>>>>>>> TableFactory and corresponding TableSource /
> TableSink
> > > > > classes
> > > > > >>>>> to
> > > > > >>>>>>>>>>>>> persist
> > > > > >>>>>>>>>>>>>> and read the data.
> > > > > >>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>> Am Mo., 26. Nov. 2018 um 12:06 Uhr schrieb Flavio
> > > > > Pompermaier
> > > > > >> <
> > > > > >>>>>>>>>>>>>> pompermaier@okkam.it>:
> > > > > >>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>> What about to add also Apache Plasma + Arrow as an
> > > > > >> alternative
> > > > > >>>>> to
> > > > > >>>>>>>>>>>>> Apache
> > > > > >>>>>>>>>>>>>>> Ignite?
> > > > > >>>>>>>>>>>>>>> [1]
> > > > > >>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>
> > > > > >>>>>
> > > > > >>
> > > >
> > https://arrow.apache.org/blog/2017/08/08/plasma-in-memory-object-store/
> > > > > >>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>> On Mon, Nov 26, 2018 at 11:56 AM Fabian Hueske <
> > > > > >>>>>>>> fhueske@gmail.com>
> > > > > >>>>>>>>>>>>> wrote:
> > > > > >>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>> Hi,
> > > > > >>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>> Thanks for the proposal!
> > > > > >>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>> To summarize, you propose a new method
> > Table.cache():
> > > > > Table
> > > > > >>>>> that
> > > > > >>>>>>>>>> will
> > > > > >>>>>>>>>>>>>>>> trigger a job and write the result into some
> > temporary
> > > > > >> storage
> > > > > >>>>>>>> as
> > > > > >>>>>>>>>>>>> defined
> > > > > >>>>>>>>>>>>>>>> by a TableFactory.
> > > > > >>>>>>>>>>>>>>>> The cache() call blocks while the job is running
> and
> > > > > >>>>> eventually
> > > > > >>>>>>>>>>>>> returns a
> > > > > >>>>>>>>>>>>>>>> Table object that represents a scan of the
> temporary
> > > > > table.
> > > > > >>>>>>>>>>>>>>>> When the "session" is closed (closing to be
> > defined?),
> > > > the
> > > > > >>>>>>>>> temporary
> > > > > >>>>>>>>>>>>>>> tables
> > > > > >>>>>>>>>>>>>>>> are all dropped.
> > > > > >>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>> I think this behavior makes sense and is a good
> > first
> > > > step
> > > > > >>>>>>>> towards
> > > > > >>>>>>>>>>>>> more
> > > > > >>>>>>>>>>>>>>>> interactive workloads.
> > > > > >>>>>>>>>>>>>>>> However, its performance suffers from writing to
> and
> > > > > reading
> > > > > >>>>>>>> from
> > > > > >>>>>>>>>>>>>>> external
> > > > > >>>>>>>>>>>>>>>> systems.
> > > > > >>>>>>>>>>>>>>>> I think this is OK for now. Changes that would
> > > > > significantly
> > > > > >>>>>>>>> improve
> > > > > >>>>>>>>>>>>> the
> > > > > >>>>>>>>>>>>>>>> situation (i.e., pinning data in-memory across
> jobs)
> > > > would
> > > > > >>>>> have
> > > > > >>>>>>>>>> large
> > > > > >>>>>>>>>>>>>>>> impacts on many components of Flink.
> > > > > >>>>>>>>>>>>>>>> Users could use in-memory filesystems or storage
> > grids
> > > > > >> (Apache
> > > > > >>>>>>>>>>>>> Ignite) to
> > > > > >>>>>>>>>>>>>>>> mitigate some of the performance effects.
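> > > > > >>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>> For instance, a user could already emulate this by
> > > > > >>>>>>>>>>>>>>>> hand (a sketch; the RAM-backed path is
> > > > > >>>>>>>>>>>>>>>> illustrative):
> > > > > >>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>> t1.writeToSink(
> > > > > >>>>>>>>>>>>>>>>     new CsvTableSink("file:///dev/shm/flink/t1", ","));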
> > > > > >>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>> Best, Fabian
> > > > > >>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>> Am Mo., 26. Nov. 2018 um 03:38 Uhr schrieb Becket
> > Qin
> > > <
> > > > > >>>>>>>>>>>>>>>> becket.qin@gmail.com
> > > > > >>>>>>>>>>>>>>>>> :
> > > > > >>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>> Thanks for the explanation, Piotrek.
> > > > > >>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>> Is there any extra thing a user can do on a
> > > > > >>>>>>>>>>>>>>>>> MaterializedTable that they cannot do on a Table?
> > > > > >>>>>>>>>>>>>>>>> After users call *table.cache()*, users can just
> > > > > >>>>>>>>>>>>>>>>> use that table and do anything that is supported
> > > > > >>>>>>>>>>>>>>>>> on a Table, including SQL.
> > > > > >>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>> Naming wise, either cache() or materialize()
> sounds
> > > > fine
> > > > > to
> > > > > >>>>> me.
> > > > > >>>>>>>>>>>>> cache()
> > > > > >>>>>>>>>>>>>>>> is
> > > > > >>>>>>>>>>>>>>>>> a bit more general than materialize(). Given that
> > we
> > > > are
> > > > > >>>>>>>>> enhancing
> > > > > >>>>>>>>>>>>> the
> > > > > >>>>>>>>>>>>>>>>> Table API to also support non-relational
> processing
> > > > > cases,
> > > > > >>>>>>>>> cache()
> > > > > >>>>>>>>>>>>>>> might
> > > > > >>>>>>>>>>>>>>>> be
> > > > > >>>>>>>>>>>>>>>>> slightly better.
> > > > > >>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>> Thanks,
> > > > > >>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>> Jiangjie (Becket) Qin
> > > > > >>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>> On Fri, Nov 23, 2018 at 11:25 PM Piotr Nowojski <
> > > > > >>>>>>>>>>>>>>> piotr@data-artisans.com
> > > > > >>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>> wrote:
> > > > > >>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>>> Hi Becket,
> > > > > >>>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>>> Oops, sorry, I didn’t notice that you intend to
> > > > > >>>>>>>>>>>>>>>>>> reuse the existing `TableFactory`. I don’t know
> > > > > >>>>>>>>>>>>>>>>>> why, but I assumed that you wanted to provide an
> > > > > >>>>>>>>>>>>>>>>>> alternate way of writing the data.
> > > > > >>>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>>> Now that I hopefully understand the proposal,
> > maybe
> > > we
> > > > > >> could
> > > > > >>>>>>>>>> rename
> > > > > >>>>>>>>>>>>>>>>>> `cache()` to
> > > > > >>>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>>> void materialize()
> > > > > >>>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>>> or going step further
> > > > > >>>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>>> MaterializedTable materialize()
> > > > > >>>>>>>>>>>>>>>>>> MaterializedTable createMaterializedView()
> > > > > >>>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>>> ?
> > > > > >>>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>>> The second option, returning a handle, I think is
> > > > > >>>>>>>>>>>>>>>>>> more flexible and could provide features such as
> > > > > >>>>>>>>>>>>>>>>>> “refresh”/“delete” or, generally speaking, manage
> > > > > >>>>>>>>>>>>>>>>>> the view. In the future we could also think about
> > > > > >>>>>>>>>>>>>>>>>> adding hooks to automatically refresh the view,
> > > > > >>>>>>>>>>>>>>>>>> etc. It is also more explicit - materialization
> > > > > >>>>>>>>>>>>>>>>>> returning a new table handle will not have the
> > > > > >>>>>>>>>>>>>>>>>> same implicit side effects as adding a simple
> > > > > >>>>>>>>>>>>>>>>>> line of code like `b.cache()` would have.
> > > > > >>>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>>> It would also be more SQL like, making it more
> > > > intuitive
> > > > > >> for
> > > > > >>>>>>>>> users
> > > > > >>>>>>>>>>>>>>>>> already
> > > > > >>>>>>>>>>>>>>>>>> familiar with the SQL.
> > > > > >>>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>>> Piotrek
> > > > > >>>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>>>> On 23 Nov 2018, at 14:53, Becket Qin <
> > > > > >> becket.qin@gmail.com
> > > > > >>>>>>
> > > > > >>>>>>>>>> wrote:
> > > > > >>>>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>>>> Hi Piotrek,
> > > > > >>>>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>>>> For the cache() method itself, yes, it is
> > > equivalent
> > > > to
> > > > > >>>>>>>>> creating
> > > > > >>>>>>>>>> a
> > > > > >>>>>>>>>>>>>>>>>> BUILT-IN
> > > > > >>>>>>>>>>>>>>>>>>> materialized view with a lifecycle. That
> > > > functionality
> > > > > is
> > > > > >>>>>>>>> missing
> > > > > >>>>>>>>>>>>>>>>> today,
> > > > > >>>>>>>>>>>>>>>>>>> though. Not sure if I understand your question.
> > Do
> > > > you
> > > > > >> mean
> > > > > >>>>>>>> we
> > > > > >>>>>>>>>>>>>>>> already
> > > > > >>>>>>>>>>>>>>>>>> have
> > > > > >>>>>>>>>>>>>>>>>>> the functionality and just need some syntactic sugar?
> > > > > >>>>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>>>> What’s more interesting in the proposal is: do
> > > > > >>>>>>>>>>>>>>>>>>> we want to stop at creating the materialized
> > > > > >>>>>>>>>>>>>>>>>>> view? Or do we want to extend that in the future
> > > > > >>>>>>>>>>>>>>>>>>> to a more useful unified data store distributed
> > > > > >>>>>>>>>>>>>>>>>>> with Flink? And do we want to have a mechanism
> > > > > >>>>>>>>>>>>>>>>>>> that allows more flexible user job patterns with
> > > > > >>>>>>>>>>>>>>>>>>> their own user-defined services? These
> > > > > >>>>>>>>>>>>>>>>>>> considerations are much more architectural.
> > > > > >>>>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>>>> Thanks,
> > > > > >>>>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>>>> Jiangjie (Becket) Qin
> > > > > >>>>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>>>> On Fri, Nov 23, 2018 at 6:01 PM Piotr Nowojski
> <
> > > > > >>>>>>>>>>>>>>>>> piotr@data-artisans.com>
> > > > > >>>>>>>>>>>>>>>>>>> wrote:
> > > > > >>>>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>>>>> Hi,
> > > > > >>>>>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>>>>> Interesting idea. I’m trying to understand the
> > > > > >>>>>>>>>>>>>>>>>>>> problem. Isn’t the `cache()` call equivalent to
> > > > > >>>>>>>>>>>>>>>>>>>> writing data to a sink and later reading from
> > > > > >>>>>>>>>>>>>>>>>>>> it, where this sink has a limited
> > > > > >>>>>>>>>>>>>>>>>>>> scope/lifetime? And the sink could be
> > > > > >>>>>>>>>>>>>>>>>>>> implemented as an in-memory or a file sink?
> > > > > >>>>>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>>>>> If so, what’s the problem with creating a
> > > > materialised
> > > > > >>>>> view
> > > > > >>>>>>>>>> from a
> > > > > >>>>>>>>>>>>>>>>> table
> > > > > >>>>>>>>>>>>>>>>>>>> “b” (from your document’s example) and reusing
> > > this
> > > > > >>>>>>>>> materialised
> > > > > >>>>>>>>>>>>>>>> view
> > > > > >>>>>>>>>>>>>>>>>>>> later? Maybe we are lacking mechanisms to
> clean
> > up
> > > > > >>>>>>>>> materialised
> > > > > >>>>>>>>>>>>>>>> views
> > > > > >>>>>>>>>>>>>>>>>> (for
> > > > > >>>>>>>>>>>>>>>>>>>> example when the current session finishes)? Maybe
> we
> > > > need
> > > > > >> some
> > > > > >>>>>>>>>>>>>>> syntactic
> > > > > >>>>>>>>>>>>>>>>>> sugar
> > > > > >>>>>>>>>>>>>>>>>>>> on top of it?
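> > > > > >>>>>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>>>>> Roughly, cache() would then be sugar for
> > > > > >>>>>>>>>>>>>>>>>>>> something like this (a sketch; the sink name
> > > > > >>>>>>>>>>>>>>>>>>>> and instance are illustrative):
> > > > > >>>>>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>>>>> tEnv.registerTableSink("tmp_b", someTempSink);
> > > > > >>>>>>>>>>>>>>>>>>>> b.insertInto("tmp_b");  // write b out once
> > > > > >>>>>>>>>>>>>>>>>>>> Table cachedB = tEnv.scan("tmp_b"); // read back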
> > > > > >>>>>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>>>>> Piotrek
> > > > > >>>>>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>>>>>> On 23 Nov 2018, at 07:21, Becket Qin <
> > > > > >>>>> becket.qin@gmail.com
> > > > > >>>>>>>>>
> > > > > >>>>>>>>>>>>>>> wrote:
> > > > > >>>>>>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>>>>>> Thanks for the suggestion, Jincheng.
> > > > > >>>>>>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>>>>>> Yes, I think it makes sense to have a
> persist()
> > > > with
> > > > > >>>>>>>>>>>>>>>>> lifecycle/defined
> > > > > >>>>>>>>>>>>>>>>>>>>> scope. I just added a section in the future
> > work
> > > > for
> > > > > >>>>> this.
> > > > > >>>>>>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>>>>>> Thanks,
> > > > > >>>>>>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>>>>>> Jiangjie (Becket) Qin
> > > > > >>>>>>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>>>>>> On Fri, Nov 23, 2018 at 1:55 PM jincheng sun
> <
> > > > > >>>>>>>>>>>>>>>>> sunjincheng121@gmail.com
> > > > > >>>>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>>>>>> wrote:
> > > > > >>>>>>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>>>>>>> Hi Jiangjie,
> > > > > >>>>>>>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>>>>>>> Thank you for the explanation about the name
> > > > > >>>>>>>>>>>>>>>>>>>>>> of `cache()`; I understand why you designed
> > > > > >>>>>>>>>>>>>>>>>>>>>> it this way!
> > > > > >>>>>>>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>>>>>>> Another idea is whether we can specify a
> > > > > >>>>>>>>>>>>>>>>>>>>>> lifecycle for data persistence. For example,
> > > > > >>>>>>>>>>>>>>>>>>>>>> persist(LifeCycle.SESSION), so that the user
> > > > > >>>>>>>>>>>>>>>>>>>>>> is not worried about data loss and will
> > > > > >>>>>>>>>>>>>>>>>>>>>> clearly specify the time range for keeping
> > > > > >>>>>>>>>>>>>>>>>>>>>> the data.
> > > > > >>>>>>>>>>>>>>>>>>>>>> At the same time, if we want to expand, we
> > > > > >>>>>>>>>>>>>>>>>>>>>> can also share within a certain group of
> > > > > >>>>>>>>>>>>>>>>>>>>>> sessions, for example:
> > > > > >>>>>>>>>>>>>>>>>>>>>> LifeCycle.SESSION_GROUP(...). I am not sure;
> > > > > >>>>>>>>>>>>>>>>>>>>>> just an immature suggestion, for reference
> > > > > >>>>>>>>>>>>>>>>>>>>>> only!
> > > > > >>>>>>>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>>>>>>> Bests,
> > > > > >>>>>>>>>>>>>>>>>>>>>> Jincheng
> > > > > >>>>>>>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>>>>>>> Becket Qin <be...@gmail.com>
> > > 于2018年11月23日周五
> > > > > >>>>>>>> 下午1:33写道:
> > > > > >>>>>>>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>>>>>>>> Re: Jincheng,
> > > > > >>>>>>>>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>>>>>>>> Thanks for the feedback. Regarding cache()
> > v.s.
> > > > > >>>>>>>> persist(),
> > > > > >>>>>>>>>>>>>>>>>> personally I
> > > > > >>>>>>>>>>>>>>>>>>>>>>> find cache() to be more accurately
> describing
> > > the
> > > > > >>>>>>>> behavior,
> > > > > >>>>>>>>>>>>>>> i.e.
> > > > > >>>>>>>>>>>>>>>>> the
> > > > > >>>>>>>>>>>>>>>>>>>>>> Table
> > > > > >>>>>>>>>>>>>>>>>>>>>>> is cached for the session, but will be
> > deleted
> > > > > after
> > > > > >>>>> the
> > > > > >>>>>>>>>>>>>>> session
> > > > > >>>>>>>>>>>>>>>> is
> > > > > >>>>>>>>>>>>>>>>>>>>>> closed.
> > > > > >>>>>>>>>>>>>>>>>>>>>>> persist() seems a little misleading as
> people
> > > > might
> > > > > >>>>> think
> > > > > >>>>>>>>> the
> > > > > >>>>>>>>>>>>>>>> table
> > > > > >>>>>>>>>>>>>>>>>>>> will
> > > > > >>>>>>>>>>>>>>>>>>>>>>> still be there even after the session is
> > gone.
> > > > > >>>>>>>>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>>>>>>>> Great point about mixing batch and stream
> > > > > >>>>>>>>>>>>>>>>>>>>>>> processing in the same job. We should
> > > > > >>>>>>>>>>>>>>>>>>>>>>> absolutely move towards that goal. I imagine
> > > > > >>>>>>>>>>>>>>>>>>>>>>> that would be a huge change across the
> > > > > >>>>>>>>>>>>>>>>>>>>>>> board, including sources, operators and
> > > > > >>>>>>>>>>>>>>>>>>>>>>> optimizations, to name a few. Likely we will
> > > > > >>>>>>>>>>>>>>>>>>>>>>> need several separate in-depth discussions.
> > > > > >>>>>>>>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>>>>>>>> Thanks,
> > > > > >>>>>>>>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>>>>>>>> Jiangjie (Becket) Qin
> > > > > >>>>>>>>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>>>>>>>> On Fri, Nov 23, 2018 at 5:14 AM Xingcan
> Cui <
> > > > > >>>>>>>>>>>>>>> xingcanc@gmail.com>
> > > > > >>>>>>>>>>>>>>>>>>>> wrote:
> > > > > >>>>>>>>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>>>>>>>>> Hi all,
> > > > > >>>>>>>>>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>>>>>>>>> @Shaoxuan, I think the lifecycle or access
> > > > domain
> > > > > >> are
> > > > > >>>>>>>> both
> > > > > >>>>>>>>>>>>>>>>>> orthogonal
> > > > > >>>>>>>>>>>>>>>>>>>>>> to
> > > > > >>>>>>>>>>>>>>>>>>>>>>>> the cache problem. Essentially, this may
> be
> > > the
> > > > > >> first
> > > > > >>>>>>>> time
> > > > > >>>>>>>>>> we
> > > > > >>>>>>>>>>>>>>>> plan
> > > > > >>>>>>>>>>>>>>>>>> to
> > > > > >>>>>>>>>>>>>>>>>>>>>>>> introduce another storage mechanism other
> > than
> > > > the
> > > > > >>>>>>>> state.
> > > > > >>>>>>>>>>>>>>> Maybe
> > > > > >>>>>>>>>>>>>>>>> it’s
> > > > > >>>>>>>>>>>>>>>>>>>>>>> better
> > > > > >>>>>>>>>>>>>>>>>>>>>>>> to first draw a big picture and then
> > > concentrate
> > > > > on
> > > > > >> a
> > > > > >>>>>>>>>> specific
> > > > > >>>>>>>>>>>>>>>>> part?
> > > > > >>>>>>>>>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>>>>>>>>> @Becket, yes, actually I am more concerned
> > > with
> > > > > the
> > > > > >>>>>>>>>> underlying
> > > > > >>>>>>>>>>>>>>>>>>>> service.
> > > > > >>>>>>>>>>>>>>>>>>>>>>>> This seems to be quite a major change to
> the
> > > > > >> existing
> > > > > >>>>>>>>>>>>>>> codebase.
> > > > > >>>>>>>>>>>>>>>> As
> > > > > >>>>>>>>>>>>>>>>>> you
> > > > > >>>>>>>>>>>>>>>>>>>>>>>> claimed, the service should be extendible
> to
> > > > > support
> > > > > >>>>>>>> other
> > > > > >>>>>>>>>>>>>>>>>> components
> > > > > >>>>>>>>>>>>>>>>>>>>>> and
> > > > > >>>>>>>>>>>>>>>>>>>>>>>> we’d better discussed it in another
> thread.
> > > > > >>>>>>>>>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>>>>>>>>> All in all, I also eager to enjoy the more
> > > > > >> interactive
> > > > > >>>>>>>>> Table
> > > > > >>>>>>>>>>>>>>>> API,
> > > > > >>>>>>>>>>>>>>>>> in
> > > > > >>>>>>>>>>>>>>>>>>>>>> case
> > > > > >>>>>>>>>>>>>>>>>>>>>>>> of a general and flexible enough service
> > > > > mechanism.
> > > > > >>>>>>>>>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>>>>>>>>> Best,
> > > > > >>>>>>>>>>>>>>>>>>>>>>>> Xingcan
> > > > > >>>>>>>>>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>>>>>>>>>> On Nov 22, 2018, at 10:16 AM, Xiaowei
> > Jiang <
> > > > > >>>>>>>>>>>>>>>> xiaoweij@gmail.com>
> > > > > >>>>>>>>>>>>>>>>>>>>>>> wrote:
> > > > > >>>>>>>>>>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>>>>>>>>>> Relying on a callback for the temp table
> > for
> > > > > clean
> > > > > >> up
> > > > > >>>>>>>> is
> > > > > >>>>>>>>>> not
> > > > > >>>>>>>>>>>>>>>> very
> > > > > >>>>>>>>>>>>>>>>>>>>>>>> reliable.
> > > > > >>>>>>>>>>>>>>>>>>>>>>>>> There is no guarantee that it will be
> > > executed
> > > > > >>>>>>>>>> successfully.
> > > > > >>>>>>>>>>>>>>> We
> > > > > >>>>>>>>>>>>>>>>> may
> > > > > >>>>>>>>>>>>>>>>>>>>>>> risk
> > > > > >>>>>>>>>>>>>>>>>>>>>>>>> leaks when that happens. I think that
> it's
> > > > safer
> > > > > to
> > > > > >>>>>>>> have
> > > > > >>>>>>>>> an
> > > > > >>>>>>>>>>>>>>>>>>>>>> association
> > > > > >>>>>>>>>>>>>>>>>>>>>>>>> between temp table and session id. So we
> > can
> > > > > always
> > > > > >>>>>>>> clean
> > > > > >>>>>>>>>> up
> > > > > >>>>>>>>>>>>>>>> temp
> > > > > >>>>>>>>>>>>>>>>>>>>>>> tables
> > > > > >>>>>>>>>>>>>>>>>>>>>>>>> which are no longer associated with any
> > > active
> > > > > >>>>>>>> sessions.
> > > > > >>>>>>>>>>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>>>>>>>>>> Regards,
> > > > > >>>>>>>>>>>>>>>>>>>>>>>>> Xiaowei
> > > > > >>>>>>>>>>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>>>>>>>>>> On Thu, Nov 22, 2018 at 12:55 PM jincheng
> > > sun <
> > > > > >>>>>>>>>>>>>>>>>>>>>>> sunjincheng121@gmail.com>
> > > > > >>>>>>>>>>>>>>>>>>>>>>>>> wrote:
> > > > > >>>>>>>>>>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>>>>>>>>>>> Hi Jiangjie&Shaoxuan,
> > > > > >>>>>>>>>>>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for initiating this great
> proposal!
> > > > > >>>>>>>>>>>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>>>>>>>>>>> Interactive Programming is very useful
> and
> > > > user
> > > > > >>>>>>>> friendly
> > > > > >>>>>>>>>> in
> > > > > >>>>>>>>>>>>>>>> case
> > > > > >>>>>>>>>>>>>>>>>> of
> > > > > >>>>>>>>>>>>>>>>>>>>>>> your
> > > > > >>>>>>>>>>>>>>>>>>>>>>>>>> examples.
> > > > > >>>>>>>>>>>>>>>>>>>>>>>>>> Moreover, especially when a business has
> > to
> > > be
> > > > > >>>>>>>> executed
> > > > > >>>>>>>>> in
> > > > > >>>>>>>>>>>>>>>>> several
> > > > > >>>>>>>>>>>>>>>>>>>>>>>> stages
> > > > > >>>>>>>>>>>>>>>>>>>>>>>>>> with dependencies,such as the pipeline
> of
> > > > Flink
> > > > > >> ML,
> > > > > >>>>> in
> > > > > >>>>>>>>>> order
> > > > > >>>>>>>>>>>>>>>> to
> > > > > >>>>>>>>>>>>>>>>>>>>>>> utilize
> > > > > >>>>>>>>>>>>>>>>>>>>>>>> the
> > > > > >>>>>>>>>>>>>>>>>>>>>>>>>> intermediate calculation results we have
> > to
> > > > > >> submit a
> > > > > >>>>>>>> job
> > > > > >>>>>>>>>> by
> > > > > >>>>>>>>>>>>>>>>>>>>>>>> env.execute().
> > > > > >>>>>>>>>>>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>>>>>>>>>>> About the `cache()`  , I think is better
> > to
> > > > > named
> > > > > >>>>>>>>>>>>>>> `persist()`,
> > > > > >>>>>>>>>>>>>>>>> And
> > > > > >>>>>>>>>>>>>>>>>>>>>> The
> > > > > >>>>>>>>>>>>>>>>>>>>>>>>>> Flink framework determines whether we
> > > > internally
> > > > > >>>>> cache
> > > > > >>>>>>>>> in
> > > > > >>>>>>>>>>>>>>>> memory
> > > > > >>>>>>>>>>>>>>>>>> or
> > > > > >>>>>>>>>>>>>>>>>>>>>>>> persist
> > > > > >>>>>>>>>>>>>>>>>>>>>>>>>> to the storage system,Maybe save the
> data
> > > into
> > > > > >> state
> > > > > >>>>>>>>>> backend
> > > > > >>>>>>>>>>>>>>>>>>>>>>>>>> (MemoryStateBackend or
> RocksDBStateBackend
> > > > etc.)
> > > > > >>>>>>>>>>>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>>>>>>>>>>> BTW, from the points of my view in the
> > > future,
> > > > > >>>>> support
> > > > > >>>>>>>>> for
> > > > > >>>>>>>>>>>>>>>>>> streaming
> > > > > >>>>>>>>>>>>>>>>>>>>>>> and
> > > > > >>>>>>>>>>>>>>>>>>>>>>>>>> batch mode switching in the same job
> will
> > > also
> > > > > >>>>> benefit
> > > > > >>>>>>>>> in
> > > > > >>>>>>>>>>>>>>>>>>>>>> "Interactive
> > > > > >>>>>>>>>>>>>>>>>>>>>>>>>> Programming",  I am looking forward to
> > your
> > > > > JIRAs
> > > > > >>>>> and
> > > > > >>>>>>>>>> FLIP!
> > > > > >>>>>>>>>>>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>>>>>>>>>>> Best,
> > > > > >>>>>>>>>>>>>>>>>>>>>>>>>> Jincheng
> > > > > >>>>>>>>>>>>>>>>>>>>>>>>>>

Re: [DISCUSS] Support Interactive Programming in Flink Table API

Posted by Becket Qin <be...@gmail.com>.
Hi Till,

Thanks for the clarification. I am still a little confused.

If cache() returns a CachedTable, the example might become:

b = a.map(...)
c = a.map(...)

cachedTableA = a.cache()
d = cachedTableA.map(...)
e = a.map(...)

In the above case, if cache() is lazily evaluated, b, c, d and e are all
going to be reading from the original DAG that generates a. But with a
naive expectation, d should be reading from the cache. This does not seem
to solve the potential confusion you raised, right?

Just to be clear, my understanding is all based on the assumption that the
tables are immutable. Therefore, after a.cache(), the *cachedTableA* and
the original table *a* should be completely interchangeable.

That said, I think a valid argument is optimization. There are indeed
cases where reading from the original DAG could be faster than reading
from the cache. For example:

a = a.filter('f1 > 100)
a.cache()
b = a.filter('f1 < 100)

Ideally the optimizer should be intelligent enough to decide which way is
faster, without user intervention. In this case, it will identify that b
would just be an empty table, and thus skip reading from the cache
completely. But I agree that returning a CachedTable would give the user
control over when to use the cache, even though I still feel that letting
the optimizer handle this is a better option in the long run.
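
To make the difference concrete, here is a small sketch of the two
semantics under discussion (illustrative only, based on the behavior
described above):

Table a = ...        // some expensive computation

// Lazy evaluation: cache() only marks a. The cache is populated by the
// first job that happens to compute a, so the first count() below runs
// the full DAG and fills the cache, and only the second one reads it.
a.cache();
a.count();           // runs the original DAG, creates the cache
a.count();           // reads from the cache

// Eager evaluation: cache() itself would trigger a job that materializes
// a, so any later operation, e.g. a.map(...), is guaranteed to read from
// the cache.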

Thanks,

Jiangjie (Becket) Qin




On Tue, Dec 4, 2018 at 6:51 PM Till Rohrmann <tr...@apache.org> wrote:

> Yes you are right Becket that it still depends on the actual execution of
> the job whether a consumer reads from a cached result or not.
>
> My point was actually about the properties of a (cached vs. non-cached) and
> not about the execution. I would not make cache trigger the execution of
> the job because one loses some flexibility by eagerly triggering the
> execution.
>
> I tried to argue for an explicit CachedTable which is returned by the
> cache() method like Piotr did in order to make the API more explicit.
>
> Cheers,
> Till
>
> On Mon, Dec 3, 2018 at 4:23 PM Becket Qin <be...@gmail.com> wrote:
>
> > Hi Till,
> >
> > That is a good example. Just a minor correction, in this case, b, c and d
> > will all consume from a non-cached a. This is because cache will only be
> > created on the very first job submission that generates the table to be
> > cached.
> >
> > If I understand correctly, this example is about whether the .cache()
> > method should be eagerly evaluated or lazily evaluated. In other words,
> > if the cache() method actually triggers a job that creates the cache,
> > there will be no such confusion. Is that right?
> >
> > In the example, although d will not consume from the cached Table
> > while it looks like it is supposed to, from a correctness perspective
> > the code will still return the correct result, assuming that tables
> > are immutable.
> >
> > Personally I feel it is OK because users probably won't really worry
> > about whether the table is cached or not. And lazy caching could avoid
> > some unnecessary caching if a cached table is never actually used in
> > the user application. But I am not opposed to doing eager evaluation
> > of the cache.
> >
> > Thanks,
> >
> > Jiangjie (Becket) Qin
> >
> >
> >
> > On Mon, Dec 3, 2018 at 10:01 PM Till Rohrmann <tr...@apache.org>
> > wrote:
> >
> > > Another argument for Piotr's point is that lazily changing the
> > > properties of a node affects all downstream consumers but does not
> > > necessarily have to happen before these consumers are defined. From a
> > > user's perspective this can be quite confusing:
> > >
> > > b = a.map(...)
> > > c = a.map(...)
> > >
> > > a.cache()
> > > d = a.map(...)
> > >
> > > now b, c and d will consume from a cached operator. In this case, the
> > > user would most likely expect that only d reads from a cached result.
> > >
> > > Cheers,
> > > Till
> > >
> > > On Mon, Dec 3, 2018 at 11:32 AM Piotr Nowojski <
> piotr@data-artisans.com>
> > > wrote:
> > >
> > > > Hey Shaoxuan and Becket,
> > > >
> > > > > Can you explain a bit more on what the side effects are? So far my
> > > > > understanding is that such side effects only exist if a table is
> > > > > mutable. Is that the case?
> > > >
> > > > Not only that. There are also performance implications, and those
> > > > are another implicit side effect of using `void cache()`. As I wrote
> > > > before, reading from the cache might not always be desirable, thus
> > > > it can cause performance degradation and I'm fine with that - user's
> > > > or optimiser's choice. What I do not like is that this implicit side
> > > > effect can manifest in a completely different part of the code that
> > > > wasn't touched by the user while he was adding the `void cache()`
> > > > call somewhere else. And even if caching improves performance, it's
> > > > still a side effect of `void cache()`. Almost by definition, `void`
> > > > methods have only side effects. As I wrote before, there are a
> > > > couple of scenarios where this might be undesirable and/or
> > > > unexpected, for example:
> > > >
> > > > 1.
> > > > Table b = …;
> > > > b.cache()
> > > > x = b.join(…)
> > > > y = b.count()
> > > > // ...
> > > > // 100
> > > > // hundred
> > > > // lines
> > > > // of
> > > > // code
> > > > // later
> > > > z = b.filter(…).groupBy(…) // this might even be hidden in a
> > > > // different method/file/package/dependency
> > > >
> > > > 2.
> > > >
> > > > Table b = ...
> > > > If (some_condition) {
> > > >   foo(b)
> > > > }
> > > > Else {
> > > >   bar(b)
> > > > }
> > > > z = b.filter(…).groupBy(…)
> > > >
> > > >
> > > > Void foo(Table b) {
> > > >   b.cache()
> > > >   // do something with b
> > > > }
> > > >
> > > > In both examples above, `b.cache()` will implicitly affect `z =
> > > > b.filter(…).groupBy(…)` (both the semantics of the program, in case
> > > > the sources are mutable, and its performance), which might be far
> > > > from obvious.
> > > >
> > > > On top of that, there is still my argument that having a
> > > > `MaterializedTable` or `CachedTable` handle is more flexible for us
> > > > in the future and for the user (as a manual option to bypass cache
> > > > reads).
> > > >
> > > > >  But Jiangjie is correct,
> > > > > the source table in batching should be immutable. It is the user’s
> > > > > responsibility to ensure it, otherwise even a regular failover may
> > lead
> > > > > to inconsistent results.
> > > >
> > > > Yes, I agree that's what a perfect world/good deployment should
> > > > look like. But it often isn't, and while I'm not trying to fix this
> > > > (since the proper fix is to support transactions), I'm just trying
> > > > to minimise confusion for the users that are not fully aware of
> > > > what's going on and operate in a less than perfect setup. And if
> > > > something bites them after adding a `b.cache()` call, I want to make
> > > > sure that they at least know all of the places that adding this line
> > > > can affect.
> > > >
> > > > Thanks, Piotrek
> > > >
> > > > > On 1 Dec 2018, at 15:39, Becket Qin <be...@gmail.com> wrote:
> > > > >
> > > > > Hi Piotrek,
> > > > >
> > > > > Thanks again for the clarification. Some more replies are
> following.
> > > > >
> > > > >> But keep in mind that `.cache()` will/might not only be used in
> > > > >> interactive programming and not only in batching.
> > > > >
> > > > > It is true. Actually in stream processing, cache() has the same
> > > > > semantics as in batch processing. The semantics are the following:
> > > > > for a table created via a series of computations, save that table
> > > > > for later reference to avoid re-running the computation logic to
> > > > > regenerate the table. Once the application exits, drop all the
> > > > > caches.
> > > > > This semantic is the same for both batch and stream processing.
> > > > > The difference is that stream applications will only run once as
> > > > > they are long running, while batch applications may be run
> > > > > multiple times, hence the cache may be created and dropped each
> > > > > time the application runs.
> > > > > Admittedly, there will probably be some resource management
> > > > > requirements for a streaming cached table, such as time based /
> > > > > size based retention, to address the infinite data issue. But such
> > > > > requirements do not change the semantics.
> > > > > You are right that interactive programming is just one use case of
> > > > cache().
> > > > > It is not the only use case.
> > > > >
> > > > >> For me the more important issue is of not having the `void
> > > > >> cache()` with side effects.
> > > > >
> > > > > This is indeed the key point. The argument around whether cache()
> > > > > should return something already indicates that cache() and
> > > > > materialize() address different issues.
> > > > > Can you explain a bit more on what the side effects are? So far my
> > > > > understanding is that such side effects only exist if a table is
> > > > > mutable. Is that the case?
> > > > >
> > > > >> I don’t know, probably initially we should make CachedTable
> > > > >> read-only. I don’t find it more confusing than the fact that user
> > > > >> can not write to views or materialised views in SQL or that user
> > > > >> currently can not write to a Table.
> > > > >
> > > > > I don't think anyone should insert something into a cache. By
> > > > > definition, the cache should only be updated when the
> > > > > corresponding original table is updated. What I am wondering is,
> > > > > given the following two facts:
> > > > > 1. If and only if a table is mutable (with something like
> > > > > insert()), a CachedTable may have implicit behavior.
> > > > > 2. A CachedTable extends a Table.
> > > > > We can come to the conclusion that a CachedTable is mutable and
> > > > > users can insert into the CachedTable directly. This is where I
> > > > > found it confusing.
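> > > > >
> > > > > (A hypothetical sketch of the type hierarchy, just to illustrate
> > > > > the concern; these interfaces are made up for this discussion:)
> > > > >
> > > > > interface Table {
> > > > >   // if Table ever exposes a mutating operation ...
> > > > >   void insertInto(String targetTable);
> > > > > }
> > > > >
> > > > > interface CachedTable extends Table {
> > > > >   // ... then CachedTable inherits it, implying users could write
> > > > >   // into the cache directly, which should not be allowed
> > > > > }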
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Jiangjie (Becket) Qin
> > > > >
> > > > > On Sat, Dec 1, 2018 at 2:45 AM Piotr Nowojski <
> > piotr@data-artisans.com
> > > >
> > > > > wrote:
> > > > >
> > > > >> Hi all,
> > > > >>
> > > > >> Regarding naming `cache()` vs `materialize()`: one more
> > > > >> explanation of why `materialize()` is more natural to me is that
> > > > >> I think of all "Table"s in the Table API as views. They behave
> > > > >> the same way as SQL views; the only difference for me is that
> > > > >> their life scope is short - the current session, which is limited
> > > > >> by the different execution model. That's why "caching" a view for
> > > > >> me is just materialising it.
> > > > >>
> > > > >> However I see and I understand your point of view. Coming from
> > > > >> DataSet/DataStream and generally speaking non-SQL world, `cache()`
> > is
> > > > more
> > > > >> natural. But keep in mind that `.cache()` will/might not only be
> > used
> > > in
> > > > >> interactive programming and not only in batching. But naming is
> one
> > > > issue,
> > > > >> and not that critical to me. Especially that once we implement
> > proper
> > > > >> materialised views, we can always deprecate/rename `cache()` if we
> > > deem
> > > > so.
> > > > >>
> > > > >>
> > > > >> For me the more important issue is of not having the `void
> cache()`
> > > with
> > > > >> side effects. Exactly for the reasons that you have mentioned.
> True:
> > > > >> results might be non deterministic if underlying source table are
> > > > changing.
> > > > >> Problem is that `void cache()` implicitly changes the semantic of
> > > > >> subsequent uses of the cached/materialized Table. It can cause
> “wtf”
> > > > moment
> > > > >> for a user if he inserts “b.cache()” call in some place in his
> code
> > > and
> > > > >> suddenly some other random places are behaving differently. If
> > > > >> `materialize()` or `cache()` returns a Table handle, we force user
> > to
> > > > >> explicitly use the cache which removes the “random” part from the
> > > > "suddenly
> > > > >> some other random places are behaving differently”.
> > > > >>
> > > > >> This argument and others that I’ve raised (greater
> > > flexibility/allowing
> > > > >> user to explicitly bypass the cache) are independent of `cache()`
> vs
> > > > >> `materialize()` discussion.
> > > > >>
> > > > >>> Does that mean one can also insert into the CachedTable? This
> > sounds
> > > > >> pretty confusing.
> > > > >>
> > > > >> I don’t know, probably initially we should make CachedTable
> > > read-only. I
> > > > >> don’t find it more confusing than the fact that user can not write
> > to
> > > > views
> > > > >> or materialised views in SQL or that user currently can not write
> > to a
> > > > >> Table.
> > > > >>
> > > > >> Piotrek
> > > > >>
> > > > >>> On 30 Nov 2018, at 17:38, Xingcan Cui <xi...@gmail.com>
> wrote:
> > > > >>>
> > > > >>> Hi all,
> > > > >>>
> > > > >>> I agree with @Becket that `cache()` and `materialize()` should
> > > > >>> be considered as two different methods, where the latter one is
> > > > >>> more sophisticated.
> > > > >>>
> > > > >>> According to my understanding, the initial idea is just to
> > > > >>> introduce a simple cache or persist mechanism, but as the Table
> > > > >>> API is a high-level API, it's natural for us to think in a SQL
> > > > >>> way.
> > > > >>>
> > > > >>> Maybe we can add the `cache()` method to the DataSet API and
> > > > >>> force users to translate a Table to a DataSet before caching it.
> > > > >>> Then the users should manually register the cached DataSet as a
> > > > >>> table again (we may need some table replacement mechanisms for
> > > > >>> datasets with an identical schema but different contents here).
> > > > >>> After all, it's the dataset rather than the dynamic table that
> > > > >>> needs to be cached, right?
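> > > > >>>
> > > > >>> (A rough sketch of that workflow; DataSet.cache() below is the
> > > > >>> hypothetical new method, the rest is the existing API:)
> > > > >>>
> > > > >>> Table t = tEnv.scan("src").groupBy("k").select("k, v.sum as total");
> > > > >>> DataSet<Row> ds = tEnv.toDataSet(t, Row.class);
> > > > >>> DataSet<Row> cached = ds.cache();   // hypothetical
> > > > >>> // re-register the cached data under a new table name
> > > > >>> tEnv.registerDataSet("src_agg_cached", cached);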
> > > > >>>
> > > > >>> Best,
> > > > >>> Xingcan
> > > > >>>
> > > > >>>> On Nov 30, 2018, at 10:57 AM, Becket Qin <be...@gmail.com>
> > > > wrote:
> > > > >>>>
> > > > >>>> Hi Piotrek and Jark,
> > > > >>>>
> > > > >>>> Thanks for the feedback and explanation. Those are good
> arguments.
> > > > But I
> > > > >>>> think those arguments are mostly about materialized view. Let me
> > try
> > > > to
> > > > >>>> explain the reason I believe cache() and materialize() are
> > > different.
> > > > >>>>
> > > > >>>> I think cache() and materialize() have quite different
> > implications.
> > > > An
> > > > >>>> analogy I can think of is save()/publish(). When users call
> > cache(),
> > > > it
> > > > >> is
> > > > >>>> just like they are saving an intermediate result as a draft of
> > their
> > > > >> work,
> > > > >>>> this intermediate result may not have any realistic meaning.
> > Calling
> > > > >>>> cache() does not mean users want to publish the cached table in
> > any
> > > > >> manner.
> > > > >>>> But when users call materialize(), that means "I have something
> > > > >> meaningful
> > > > >>>> to be reused by others", now users need to think about the
> > > validation,
> > > > >>>> update & versioning, lifecycle of the result, etc.
> > > > >>>>
> > > > >>>> Piotrek's suggestions on variations of the materialize() methods
> > are
> > > > >> very
> > > > >>>> useful. It would be great if Flink have them. The concept of
> > > > >> materialized
> > > > >>>> view is actually a pretty big feature, not to say the related
> > stuff
> > > > like
> > > > >>>> triggers/hooks you mentioned earlier. I think the materialized
> > view
> > > > >> itself
> > > > >>>> should be discussed in a more thorough and systematic manner.
> And
> > I
> > > > >> found
> > > > >>>> that discussion is kind of orthogonal and way beyond interactive
> > > > >>>> programming experience.
> > > > >>>>
> > > > >>>> The example you gave was interesting. I still have some
> questions,
> > > > >> though.
> > > > >>>>
> > > > >>>> Table source = … // some source that scans files from a
> directory
> > > > >>>>> “/foo/bar/“
> > > > >>>>> Table t1 = source.groupBy(…).select(…).where(…) ….;
> > > > >>>>> Table t2 = t1.materialize() // (or `cache()`)
> > > > >>>>
> > > > >>>> t2.count() // initialise cache (if it’s lazily initialised)
> > > > >>>>> int a1 = t1.count()
> > > > >>>>> int b1 = t2.count()
> > > > >>>>> // something in the background (or we trigger it) writes new
> > files
> > > to
> > > > >>>>> /foo/bar
> > > > >>>>> int a2 = t1.count()
> > > > >>>>> int b2 = t2.count()
> > > > >>>>> t2.refresh() // possible future extension, not to be
> implemented
> > in
> > > > the
> > > > >>>>> initial version
> > > > >>>>>
> > > > >>>>
> > > > >>>> what if someone else added some more files to /foo/bar at this
> > > > >>>> point? In that case, a3 won't equal b3, and the result becomes
> > > > >>>> non-deterministic, right?
> > > > >>>>
> > > > >>>> int a3 = t1.count()
> > > > >>>>> int b3 = t2.count()
> > > > >>>>> t2.drop() // another possible future extension, manual “cache”
> > > > dropping
> > > > >>>>
> > > > >>>>
> > > > >>>> When we talk about interactive programming, in most cases, we
> > > > >>>> are talking about batch applications. A fundamental assumption
> > > > >>>> of such a case is that the source data is complete before the
> > > > >>>> data processing begins, and the data will not change during the
> > > > >>>> data processing. IMO, if additional rows need to be added to
> > > > >>>> some source during the processing, it should be done in ways
> > > > >>>> like unioning the source with another table containing the rows
> > > > >>>> to be added.
> > > > >>>>
> > > > >>>> There are a few cases where computations are executed
> > > > >>>> repeatedly on a changing data source.
> > > > >>>>
> > > > >>>> For example, people may run an ML training job every hour with
> > > > >>>> the samples newly added in the past hour. In that case, the
> > > > >>>> source data between runs will indeed change. But still, the
> > > > >>>> data remains unchanged within one run. And usually in that
> > > > >>>> case, the result will need versioning, i.e. for a given result,
> > > > >>>> it tells that the result was derived from the source data as of
> > > > >>>> a certain timestamp.
> > > > >>>>
> > > > >>>> Another example is something like a data warehouse. In this
> > > > >>>> case, there are a few sources of original/raw data. On top of
> > > > >>>> those sources, many materialized views / queries / reports /
> > > > >>>> dashboards can be created to generate derived data. That
> > > > >>>> derived data needs to be updated when the underlying original
> > > > >>>> data changes. In that case, the processing logic that derives
> > > > >>>> data from the original data needs to be executed repeatedly to
> > > > >>>> update those reports/views. Again, all that derived data also
> > > > >>>> needs to have version management, such as a timestamp.
> > > > >>>>
> > > > >>>> In either of the above two cases, during a single run of the
> > > > >>>> processing logic, the data cannot change. Otherwise the
> > > > >>>> behavior of the processing logic may be undefined. In the above
> > > > >>>> two examples, when writing the processing logic, users can use
> > > > >>>> .cache() to hint to Flink that those results should be saved to
> > > > >>>> avoid repeated computation. And then for the result of my
> > > > >>>> application logic, I'll call materialize(), so that these
> > > > >>>> results could be managed by the system with versioning,
> > > > >>>> metadata management, lifecycle management, ACLs, etc.
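> > > > >>>>
> > > > >>>> (For illustration, a sketch of how the two could be combined in
> > > > >>>> one program; train() and evaluate() are hypothetical helpers,
> > > > >>>> and materialize() is not part of the current proposal:)
> > > > >>>>
> > > > >>>> Table samples = tEnv.scan("samples_last_hour");
> > > > >>>> Table features = samples.groupBy("userId").select("userId, f.avg");
> > > > >>>> features.cache();          // reused twice below, computed once
> > > > >>>> Table model = train(features);
> > > > >>>> Table report = evaluate(features, model);
> > > > >>>> report.materialize();      // versioned, managed result of the run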
> > > > >>>>
> > > > >>>> It is true we can use materialize() to do the cache() job, but
> > > > >>>> I am really reluctant to shoehorn cache() into materialize()
> > > > >>>> and force users to worry about a bunch of implications that
> > > > >>>> they needn't have to. I am absolutely on your side that
> > > > >>>> redundant API is bad. But it is equally frustrating, if not
> > > > >>>> more so, that the same API does different things.
> > > > >>>>
> > > > >>>> Thanks,
> > > > >>>>
> > > > >>>> Jiangjie (Becket) Qin
> > > > >>>>
> > > > >>>>
> > > > >>>> On Fri, Nov 30, 2018 at 10:34 PM Shaoxuan Wang <
> > wshaoxuan@gmail.com
> > > >
> > > > >> wrote:
> > > > >>>>
> > > > >>>>> Thanks Piotrek,
> > > > >>>>> You provided a very good example, it explains all the
> > > > >>>>> confusions I have. It is clear that there is something we have
> > > > >>>>> not considered in the initial proposal. We intend to force the
> > > > >>>>> user to reuse the cached/materialized table, if its cache()
> > > > >>>>> method is executed. We did not expect that the user may want
> > > > >>>>> to re-execute the plan from the source table. Let me re-think
> > > > >>>>> about it and get back to you later.
> > > > >>>>>
> > > > >>>>> In the meanwhile, this example/observation also implies that
> > > > >>>>> we cannot fully involve the optimizer in deciding the plan if
> > > > >>>>> a cache/materialize is explicitly used, because whether to
> > > > >>>>> reuse the cached data or re-execute the query from the source
> > > > >>>>> data may lead to different results. (But I guess the optimizer
> > > > >>>>> can still help in some cases ---- as long as it does not
> > > > >>>>> re-execute from the varied source, we should be safe.)
> > > > >>>>>
> > > > >>>>> Regards,
> > > > >>>>> Shaoxuan
> > > > >>>>>
> > > > >>>>>
> > > > >>>>>
> > > > >>>>> On Fri, Nov 30, 2018 at 9:13 PM Piotr Nowojski <
> > > > >> piotr@data-artisans.com>
> > > > >>>>> wrote:
> > > > >>>>>
> > > > >>>>>> Hi Shaoxuan,
> > > > >>>>>>
> > > > >>>>>> Re 2:
> > > > >>>>>>
> > > > >>>>>>> Table t3 = methodThatAppliesOperators(t1) // t1 is modified
> > to->
> > > > t1’
> > > > >>>>>>
> > > > >>>>>> What do you mean that “ t1 is modified to-> t1’ ” ? That
> > > > >>>>>> `methodThatAppliesOperators()` method has changed it’s plan?
> > > > >>>>>>
> > > > >>>>>> I was thinking more about something like this:
> > > > >>>>>>
> > > > >>>>>> Table source = … // some source that scans files from a
> > directory
> > > > >>>>>> “/foo/bar/“
> > > > >>>>>> Table t1 = source.groupBy(…).select(…).where(…) ….;
> > > > >>>>>> Table t2 = t1.materialize() // (or `cache()`)
> > > > >>>>>>
> > > > >>>>>> t2.count() // initialise cache (if it’s lazily initialised)
> > > > >>>>>>
> > > > >>>>>> int a1 = t1.count()
> > > > >>>>>> int b1 = t2.count()
> > > > >>>>>>
> > > > >>>>>> // something in the background (or we trigger it) writes new
> > files
> > > > to
> > > > >>>>>> /foo/bar
> > > > >>>>>>
> > > > >>>>>> int a2 = t1.count()
> > > > >>>>>> int b2 = t2.count()
> > > > >>>>>>
> > > > >>>>>> t2.refresh() // possible future extension, not to be
> implemented
> > > in
> > > > >> the
> > > > >>>>>> initial version
> > > > >>>>>>
> > > > >>>>>> int a3 = t1.count()
> > > > >>>>>> int b3 = t2.count()
> > > > >>>>>>
> > > > >>>>>> t2.drop() // another possible future extension, manual “cache”
> > > > >> dropping
> > > > >>>>>>
> > > > >>>>>> assertTrue(a1 == b1) // same results, but b1 comes from the
> > > “cache"
> > > > >>>>>> assertTrue(b1 == b2) // both values come from the same cache
> > > > >>>>>> assertTrue(a2 > b2) // b2 comes from cache, a2 re-executed
> full
> > > > table
> > > > >>>>> scan
> > > > >>>>>> and has more data
> > > > >>>>>> assertTrue(b3 > b2) // b3 comes from refreshed cache
> > > > >>>>>> assertTrue(b3 == a2 == a3)
> > > > >>>>>>
> > > > >>>>>> Piotrek
> > > > >>>>>>
> > > > >>>>>>> On 30 Nov 2018, at 10:22, Jark Wu <im...@gmail.com> wrote:
> > > > >>>>>>>
> > > > >>>>>>> Hi,
> > > > >>>>>>>
> > > > >>>>>>> It is an very interesting and useful design!
> > > > >>>>>>>
> > > > >>>>>>> Here I want to share some of my thoughts:
> > > > >>>>>>>
> > > > >>>>>>> 1. Agree that the cache() method should return some Table to
> > > > >>>>>>> avoid some unexpected problems because of the mutable object.
> > > > >>>>>>> All the existing methods of Table return a new Table
> > > > >>>>>>> instance.
> > > > >>>>>>>
> > > > >>>>>>> 2. I think materialize() would be more consistent with SQL;
> > > > >>>>>>> this makes it possible to support the same feature for SQL
> > > > >>>>>>> (materialized view) and keep the same API for users in the
> > > > >>>>>>> future. But I'm also fine if we choose cache().
> > > > >>>>>>>
> > > > >>>>>>> 3. In the proposal, a TableService (or FlinkService?) is used
> > > > >>>>>>> to cache the result of the (intermediate) table.
> > > > >>>>>>> But the name TableService may be a bit too general and might
> > > > >>>>>>> not be understood correctly at first glance (a metastore for
> > > > >>>>>>> tables?). Maybe a more specific name would be better, such as
> > > > >>>>>>> TableCacheService or TableMaterializeService or something
> > > > >>>>>>> else.
> > > > >>>>>>>
> > > > >>>>>>> Best,
> > > > >>>>>>> Jark
> > > > >>>>>>>
> > > > >>>>>>>
> > > > >>>>>>> On Thu, 29 Nov 2018 at 21:16, Fabian Hueske <
> fhueske@gmail.com
> > >
> > > > >> wrote:
> > > > >>>>>>>
> > > > >>>>>>>> Hi,
> > > > >>>>>>>>
> > > > >>>>>>>> Thanks for the clarification Becket!
> > > > >>>>>>>>
> > > > >>>>>>>> I have a few thoughts to share / questions:
> > > > >>>>>>>>
> > > > >>>>>>>> 1) I'd like to know how you plan to implement the feature
> > > > >>>>>>>> on a plan / planner level.
> > > > >>>>>>>>
> > > > >>>>>>>> I would imagine the following to happen when Table.cache()
> > > > >>>>>>>> is called:
> > > > >>>>>>>>
> > > > >>>>>>>> 1) immediately optimize the Table and internally convert it
> > > > >>>>>>>> into a DataSet/DataStream. This is necessary to avoid that
> > > > >>>>>>>> operators of later queries on top of the Table are pushed
> > > > >>>>>>>> down.
> > > > >>>>>>>> 2) register the DataSet/DataStream as a
> > > > >>>>>>>> DataSet/DataStream-backed Table X
> > > > >>>>>>>> 3) add a sink to the DataSet/DataStream. This is the
> > > > >>>>>>>> materialization of the Table X
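> > > > >>>>>>>>
> > > > >>>>>>>> (Roughly, in pseudocode; the planner and sink names here
> > > > >>>>>>>> are made up for illustration:)
> > > > >>>>>>>>
> > > > >>>>>>>> // 1) optimize and translate the logical plan of t1 once
> > > > >>>>>>>> DataSet<Row> ds = planner.translate(t1);
> > > > >>>>>>>> // 2) re-register t1 as a scan over the translated program
> > > > >>>>>>>> tEnv.registerDataSet("X", ds);
> > > > >>>>>>>> // 3) additionally write X out, so later jobs can read it
> > > > >>>>>>>> ds.output(cacheSink);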
> > > > >>>>>>>>
> > > > >>>>>>>> Based on your proposal the following would happen:
> > > > >>>>>>>>
> > > > >>>>>>>> Table t1 = ....
> > > > >>>>>>>> t1.cache(); // cache() returns void. The logical plan of t1
> > > > >>>>>>>> // is replaced by a scan of X. There is also a reference to
> > > > >>>>>>>> // the materialization of X.
> > > > >>>>>>>>
> > > > >>>>>>>> t1.count(); // this executes the program, including the
> > > > >>>>>>>> // DataSet/DataStream that backs X and the sink that writes
> > > > >>>>>>>> // the materialization of X
> > > > >>>>>>>> t1.count(); // this executes the program, but reads X from
> > > > >>>>>>>> // the materialization.
> > > > >>>>>>>>
> > > > >>>>>>>> My question is, how do you determine whether the scan of t1
> > > > >>>>>>>> should go against the DataSet/DataStream program and when
> > > > >>>>>>>> against the materialization?
> > > > >>>>>>>> AFAIK, there is no hook that will tell you that a part of
> > > > >>>>>>>> the program was executed. Flipping a switch during
> > > > >>>>>>>> optimization or plan generation is not sufficient, as there
> > > > >>>>>>>> is no guarantee that the plan is also executed.
> > > > >>>>>>>>
> > > > >>>>>>>> Overall, this behavior is somewhat similar to what I
> proposed
> > in
> > > > >>>>>>>> FLINK-8950, which does not include persisting the table, but
> > > just
> > > > >>>>>>>> optimizing and reregistering it as DataSet/DataStream scan.
> > > > >>>>>>>>
> > > > >>>>>>>> 2) I think Piotr has a point about the implicit behavior and
> > > side
> > > > >>>>>> effects
> > > > >>>>>>>> of the cache() method if it does not return anything.
> > > > >>>>>>>> Consider the following example:
> > > > >>>>>>>>
> > > > >>>>>>>> Table t1 = ???
> > > > >>>>>>>> Table t2 = methodThatAppliesOperators(t1);
> > > > >>>>>>>> Table t3 = methodThatAppliesOtherOperators(t1);
> > > > >>>>>>>>
> > > > >>>>>>>> In this case, the behavior/performance of the plan that
> > results
> > > > from
> > > > >>>>> the
> > > > >>>>>>>> second method call depends on whether t1 was modified by the
> > > first
> > > > >>>>>> method
> > > > >>>>>>>> or not.
> > > > >>>>>>>> This is the classic issue of mutable vs. immutable objects.
> > > > >>>>>>>> Also, as Piotr pointed out, it might also be good to have
> the
> > > > >> original
> > > > >>>>>> plan
> > > > >>>>>>>> of t1, because in some cases it is possible to push filters
> > down
> > > > >> such
> > > > >>>>>> that
> > > > >>>>>>>> evaluating the query from scratch might be more efficient
> than
> > > > >>>>> accessing
> > > > >>>>>>>> the cache.
> > > > >>>>>>>> Moreover, a CachedTable could extend Table and offer a
> > > > >>>>>>>> method refresh(). This sounds quite useful in an interactive
> > > > >>>>>>>> session mode.
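> > > > >>>>>>>>
> > > > >>>>>>>> (A hypothetical sketch of such an interface:)
> > > > >>>>>>>>
> > > > >>>>>>>> interface CachedTable extends Table {
> > > > >>>>>>>>   void refresh(); // re-run the plan, overwrite the cache
> > > > >>>>>>>>   void drop();    // drop the cached data
> > > > >>>>>>>> }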
> > > > >>>>>>>>
> > > > >>>>>>>> 3) Regarding the name, I can see both arguments. IMO,
> > > > materialize()
> > > > >>>>>> seems
> > > > >>>>>>>> to be more future proof.
> > > > >>>>>>>>
> > > > >>>>>>>> Best, Fabian
> > > > >>>>>>>>
> > > > >>>>>>>> On Thu, Nov 29, 2018 at 12:56 PM, Shaoxuan Wang <
> > > > >>>>>>>> wshaoxuan@gmail.com> wrote:
> > > > >>>>>>>>
> > > > >>>>>>>>> Hi Piotr,
> > > > >>>>>>>>>
> > > > >>>>>>>>> Thanks for sharing your ideas on the method naming. We will
> > > think
> > > > >>>>> about
> > > > >>>>>>>>> your suggestions. But I don't understand why we need to
> > change
> > > > the
> > > > >>>>>> return
> > > > >>>>>>>>> type of cache().
> > > > >>>>>>>>>
> > > > >>>>>>>>> Cache() is a physical operation, it does not change the
> logic
> > > of
> > > > >>>>>>>>> the `Table`. On the tableAPI layer, we should not
> introduce a
> > > new
> > > > >>>>> table
> > > > >>>>>>>>> type unless the logic of table has been changed. If we
> > > introduce
> > > > a
> > > > >>>>> new
> > > > >>>>>>>>> table type `CachedTable`, we need create the same set of
> > > methods
> > > > of
> > > > >>>>>>>> `Table`
> > > > >>>>>>>>> for it. I don't think it is worth doing this. Or can you
> > please
> > > > >>>>>> elaborate
> > > > >>>>>>>>> more on what could be the "implicit behaviours/side
> effects"
> > > you
> > > > >> are
> > > > >>>>>>>>> thinking about?
> > > > >>>>>>>>>
> > > > >>>>>>>>> Regards,
> > > > >>>>>>>>> Shaoxuan
> > > > >>>>>>>>>
> > > > >>>>>>>>>
> > > > >>>>>>>>>
> > > > >>>>>>>>> On Thu, Nov 29, 2018 at 7:05 PM Piotr Nowojski <
> > > > >>>>>> piotr@data-artisans.com>
> > > > >>>>>>>>> wrote:
> > > > >>>>>>>>>
> > > > >>>>>>>>>> Hi Becket,
> > > > >>>>>>>>>>
> > > > >>>>>>>>>> Thanks for the response.
> > > > >>>>>>>>>>
> > > > >>>>>>>>>> 1. I wasn’t saying that materialised view must be mutable
> or
> > > > not.
> > > > >>>>> The
> > > > >>>>>>>>> same
> > > > >>>>>>>>>> thing applies to caches as well. To the contrary, I would
> > > expect
> > > > >>>>> more
> > > > >>>>>>>>>> consistency and updates from something that is called
> > “cache”
> > > vs
> > > > >>>>>>>>> something
> > > > >>>>>>>>>> that’s a “materialised view”. In other words, IMO most
> > caches
> > > do
> > > > >> not
> > > > >>>>>>>>> serve
> > > > >>>>>>>>>> you invalid/outdated data and they handle updates on their
> > > own.
> > > > >>>>>>>>>>
> > > > >>>>>>>>>> 2. I don’t think that having in the future two very
> similar
> > > > >> concepts
> > > > >>>>>> of
> > > > >>>>>>>>>> `materialized` view and `cache` is a good idea. It would
> be
> > > > >>>>> confusing
> > > > >>>>>>>> for
> > > > >>>>>>>>>> the users. I think it could be handled by
> > > variations/overloading
> > > > >> of
> > > > >>>>>>>>>> materialised view concept. We could start with:
> > > > >>>>>>>>>>
> > > > >>>>>>>>>> `MaterializedTable materialize()` - immutable, session
> > > > >>>>>>>>>> life scope (basically the same semantics as you are
> > > > >>>>>>>>>> proposing)
> > > > >>>>>>>>>>
> > > > >>>>>>>>>> And then in the future (if ever) build on top of
> that/expand
> > > it
> > > > >>>>> with:
> > > > >>>>>>>>>>
> > > > >>>>>>>>>> `MaterializedTable materialize(refreshTime=…)` or
> > > > >> `MaterializedTable
> > > > >>>>>>>>>> materialize(refreshHook=…)`
> > > > >>>>>>>>>>
> > > > >>>>>>>>>> Or with cross session support:
> > > > >>>>>>>>>>
> > > > >>>>>>>>>> `MaterializedTable materializeInto(connector=…)` or
> > > > >>>>> `MaterializedTable
> > > > >>>>>>>>>> materializeInto(tableFactory=…)`
> > > > >>>>>>>>>>
> > > > >>>>>>>>>> I’m not saying that we should implement cross
> > > session/refreshing
> > > > >> now
> > > > >>>>>> or
> > > > >>>>>>>>>> even in the near future. I’m just arguing that naming
> > current
> > > > >>>>>> immutable
> > > > >>>>>>>>>> session life scope method `materialize()` is more future
> > proof
> > > > and
> > > > >>>>>> more
> > > > >>>>>>>>>> consistent with SQL (on which after all table-api is
> heavily
> > > > >> basing
> > > > >>>>>>>> on).
> > > > >>>>>>>>>>
> > > > >>>>>>>>>> 3. Even if we agree on naming it `cache()`, I would still
> > > insist
> > > > >> on
> > > > >>>>>>>>>> `cache()` returning `CachedTable` handle to avoid implicit
> > > > >>>>>>>>> behaviours/side
> > > > >>>>>>>>>> effects and to give both us & users more flexibility.
> > > > >>>>>>>>>>
> > > > >>>>>>>>>> Piotrek
> > > > >>>>>>>>>>
> > > > >>>>>>>>>>> On 29 Nov 2018, at 06:20, Becket Qin <
> becket.qin@gmail.com
> > >
> > > > >> wrote:
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>>> Just to add a little bit, the materialized view is
> probably
> > > > more
> > > > >>>>>>>>> similar
> > > > >>>>>>>>>> to
> > > > >>>>>>>>>>> the persistent() brought up earlier in the thread. So it
> is
> > > > >> usually
> > > > >>>>>>>>> cross
> > > > >>>>>>>>>>> session and could be used in a larger scope. For
> example, a
> > > > >>>>>>>>> materialized
> > > > >>>>>>>>>>> view created by user A may be visible to user B. It is
> > > probably
> > > > >>>>>>>>> something
> > > > >>>>>>>>>>> we want to have in the future. I'll put it in the future
> > work
> > > > >>>>>>>> section.
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>>> Thanks,
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>>> Jiangjie (Becket) Qin
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>>> On Thu, Nov 29, 2018 at 9:47 AM Becket Qin <
> > > > becket.qin@gmail.com
> > > > >>>
> > > > >>>>>>>>> wrote:
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>>>> Hi Piotrek,
> > > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>> Thanks for the explanation.
> > > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>> Right now we are mostly thinking of the cached table as
> > > > >>>>> immutable. I
> > > > >>>>>>>>> can
> > > > >>>>>>>>>>>> see the Materialized view would be useful in the future.
> > > That
> > > > >>>>> said,
> > > > >>>>>>>> I
> > > > >>>>>>>>>> think
> > > > >>>>>>>>>>>> a simple cache mechanism is probably still needed. So to
> > me,
> > > > >>>>> cache()
> > > > >>>>>>>>> and
> > > > >>>>>>>>>>>> materialize() should be two separate method as they
> > address
> > > > >>>>>>>> different
> > > > >>>>>>>>>>>> needs. Materialize() is a higher level concept usually
> > > > implying
> > > > >>>>>>>>>> periodical
> > > > >>>>>>>>>>>> update, while cache() has much simpler semantic. For
> > > example,
> > > > >> one
> > > > >>>>>>>> may
> > > > >>>>>>>>>>>> create a materialized view and use cache() method in the
> > > > >>>>>>>> materialized
> > > > >>>>>>>>>> view
> > > > >>>>>>>>>>>> creation logic. So that during the materialized view
> > update,
> > > > >> they
> > > > >>>>> do
> > > > >>>>>>>>> not
> > > > >>>>>>>>>>>> need to worry about the case that the cached table is
> also
> > > > >>>>> changed.
> > > > >>>>>>>>>> Maybe
> > > > >>>>>>>>>>>> under the hood, materialized() and cache() could share
> > some
> > > > >>>>>>>> mechanism,
> > > > >>>>>>>>>> but
> > > > >>>>>>>>>>>> I think a simple cache() method would be handy in a lot
> of
> > > > >> cases.
> > > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>> Thanks,
> > > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>> Jiangjie (Becket) Qin
> > > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>> On Mon, Nov 26, 2018 at 9:38 PM Piotr Nowojski <
> > > > >>>>>>>>> piotr@data-artisans.com
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>>>> wrote:
> > > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>>> Hi Becket,
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>> Is there any extra thing user can do on a
> > > MaterializedTable
> > > > >> that
> > > > >>>>>>>>> they
> > > > >>>>>>>>>>>>> cannot do on a Table?
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>>> Maybe not in the initial implementation, but various
> DBs
> > > > offer
> > > > >>>>>>>>>> different
> > > > >>>>>>>>>>>>> ways to “refresh” the materialised view. Hooks,
> triggers,
> > > > >> timers,
> > > > >>>>>>>>>> manually
> > > > >>>>>>>>>>>>> etc. Having `MaterializedTable` would help us to handle
> > > that
> > > > in
> > > > >>>>> the
> > > > >>>>>>>>>> future.
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>> After users call *table.cache(), *users can just use
> > that
> > > > >> table
> > > > >>>>>>>> and
> > > > >>>>>>>>> do
> > > > >>>>>>>>>>>>> anything that is supported on a Table, including SQL.
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>>> This is some implicit behaviour with side effects.
> > Imagine
> > > if
> > > > >>>>> user
> > > > >>>>>>>>> has
> > > > >>>>>>>>>> a
> > > > >>>>>>>>>>>>> long and complicated program, that touches table `b`
> > > multiple
> > > > >>>>>>>> times,
> > > > >>>>>>>>>> maybe
> > > > >>>>>>>>>>>>> scattered around different methods. If he modifies his
> > > > program
> > > > >> by
> > > > >>>>>>>>>> inserting
> > > > >>>>>>>>>>>>> in one place
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>>> b.cache()
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>>> This implicitly alters the semantic and behaviour of
> his
> > > code
> > > > >> all
> > > > >>>>>>>>> over
> > > > >>>>>>>>>>>>> the place, maybe in a ways that might cause problems.
> For
> > > > >> example
> > > > >>>>>>>>> what
> > > > >>>>>>>>>> if
> > > > >>>>>>>>>>>>> underlying data is changing?
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>>> Having invisible side effects is also not very clean,
> for
> > > > >> example
> > > > >>>>>>>>> think
> > > > >>>>>>>>>>>>> about something like this (but more complicated):
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>>> Table b = ...;
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>>> If (some_condition) {
> > > > >>>>>>>>>>>>> processTable1(b)
> > > > >>>>>>>>>>>>> }
> > > > >>>>>>>>>>>>> else {
> > > > >>>>>>>>>>>>> processTable2(b)
> > > > >>>>>>>>>>>>> }
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>>> // do more stuff with b
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>>> And user adds `b.cache()` call to only one of the
> > > > >> `processTable1`
> > > > >>>>>>>> or
> > > > >>>>>>>>>>>>> `processTable2` methods.
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>>> On the other hand
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>>> Table materialisedB = b.materialize()
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>>> Avoids (at least some of) the side effect issues and
> > forces
> > > > >> user
> > > > >>>>> to
> > > > >>>>>>>>>>>>> explicitly use `materialisedB` where it’s appropriate
> and
> > > > >> forces
> > > > >>>>>>>> user
> > > > >>>>>>>>>> to
> > > > >>>>>>>>>>>>> think what does it actually mean. And if something
> > doesn’t
> > > > work
> > > > >>>>> in
> > > > >>>>>>>>> the
> > > > >>>>>>>>>> end
> > > > >>>>>>>>>>>>> for the user, he will know what has he changed instead
> of
> > > > >> blaming
> > > > >>>>>>>>>> Flink for
> > > > >>>>>>>>>>>>> some “magic” underneath. In the above example, after
> > > > >>>>> materialising
> > > > >>>>>>>> b
> > > > >>>>>>>>> in
> > > > >>>>>>>>>>>>> only one of the methods, he should/would realise about
> > the
> > > > >> issue
> > > > >>>>>>>> when
> > > > >>>>>>>>>>>>> handling the return value `MaterializedTable` of that
> > > method.
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>>> I guess it comes down to personal preferences if you
> like
> > > > >> things
> > > > >>>>> to
> > > > >>>>>>>>> be
> > > > >>>>>>>>>>>>> implicit or not. The more power is the user, probably
> the
> > > > more
> > > > >>>>>>>> likely
> > > > >>>>>>>>>> he is
> > > > >>>>>>>>>>>>> to like/understand implicit behaviour. And we as Table
> > API
> > > > >>>>>>>> designers
> > > > >>>>>>>>>> are
> > > > >>>>>>>>>>>>> the most power users out there, so I would proceed with
> > > > caution
> > > > >>>>> (so
> > > > >>>>>>>>>> that we
> > > > >>>>>>>>>>>>> do not end up in the crazy perl realm with it’s lovely
> > > > implicit
> > > > >>>>>>>>> method
> > > > >>>>>>>>>>>>> arguments ;)  <
> > > https://stackoverflow.com/a/14922656/8149051
> > > > >)
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>> Table API to also support non-relational processing
> > cases,
> > > > >>>>> cache()
> > > > >>>>>>>>>>>>> might be slightly better.
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>>> I think even such extended Table API could benefit from
> > > > >> sticking
> > > > >>>>>>>>>> to/being
> > > > >>>>>>>>>>>>> consistent with SQL where both SQL and Table API are
> > > > basically
> > > > >>>>> the
> > > > >>>>>>>>>> same.
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>>> One more thing. `MaterializedTable materialize()` could
> > be
> > > > more
> > > > >>>>>>>>>>>>> powerful/flexible allowing the user to operate both on
> > > > >>>>> materialised
> > > > >>>>>>>>>> and not
> > > > >>>>>>>>>>>>> materialised view at the same time for whatever reasons
> > > > >>>>> (underlying
> > > > >>>>>>>>>> data
> > > > >>>>>>>>>>>>> changing/better optimisation opportunities after
> pushing
> > > down
> > > > >>>>> more
> > > > >>>>>>>>>> filters
> > > > >>>>>>>>>>>>> etc). For example:
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>>> Table b = …;
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>>> MaterlizedTable mb = b.materialize();
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>>> Val min = mb.min();
> > > > >>>>>>>>>>>>> Val max = mb.max();
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>>> Val user42 = b.filter(‘userId = 42);
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>>> Could be more efficient compared to `b.cache()` if
> > > > >>>>> `filter(‘userId
> > > > >>>>>>>> =
> > > > >>>>>>>>>>>>> 42);` allows for much more aggressive optimisations.
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>>> Piotrek
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>> On 26 Nov 2018, at 12:14, Fabian Hueske <
> > > fhueske@gmail.com>
> > > > >>>>>>>> wrote:
> > > > >>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>> I'm not suggesting to add support for Ignite. This was
> > > just
> > > > an
> > > > >>>>>>>>>> example.
> > > > >>>>>>>>>>>>>> Plasma and Arrow sound interesting, too.
> > > > >>>>>>>>>>>>>> For the sake of this proposal, it would be up to the
> > user
> > > to
> > > > >>>>>>>>>> implement a
> > > > >>>>>>>>>>>>>> TableFactory and corresponding TableSource / TableSink
> > > > classes
> > > > >>>>> to
> > > > >>>>>>>>>>>>> persist
> > > > >>>>>>>>>>>>>> and read the data.
> > > > >>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>> Am Mo., 26. Nov. 2018 um 12:06 Uhr schrieb Flavio
> > > > Pompermaier
> > > > >> <
> > > > >>>>>>>>>>>>>> pompermaier@okkam.it>:
> > > > >>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>> What about also adding Apache Plasma + Arrow as an
> > > > >>>>>>>>>>>>>>> alternative to Apache Ignite?
> > > > >>>>>>>>>>>>>>> [1]
> > > > >>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>
> > > > >>>>>>>>
> > > > >>>>>
> > > > >>
> > >
> https://arrow.apache.org/blog/2017/08/08/plasma-in-memory-object-store/
> > > > >>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>> On Mon, Nov 26, 2018 at 11:56 AM Fabian Hueske <
> > > > >>>>>>>> fhueske@gmail.com>
> > > > >>>>>>>>>>>>> wrote:
> > > > >>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>> Hi,
> > > > >>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>> Thanks for the proposal!
> > > > >>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>> To summarize, you propose a new method
> Table.cache():
> > > > Table
> > > > >>>>> that
> > > > >>>>>>>>>> will
> > > > >>>>>>>>>>>>>>>> trigger a job and write the result into some
> temporary
> > > > >> storage
> > > > >>>>>>>> as
> > > > >>>>>>>>>>>>> defined
> > > > >>>>>>>>>>>>>>>> by a TableFactory.
> > > > >>>>>>>>>>>>>>>> The cache() call blocks while the job is running and
> > > > >>>>> eventually
> > > > >>>>>>>>>>>>> returns a
> > > > >>>>>>>>>>>>>>>> Table object that represents a scan of the temporary
> > > > table.
> > > > >>>>>>>>>>>>>>>> When the "session" is closed (closing to be
> defined?),
> > > the
> > > > >>>>>>>>> temporary
> > > > >>>>>>>>>>>>>>> tables
> > > > >>>>>>>>>>>>>>>> are all dropped.
> > > > >>>>>>>>>>>>>>>>
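> > > > >>>>>>>>>>>>>>>> (In code, the proposed behavior as I read it; the
> > > > >>>>>>>>>>>>>>>> blocking call and the temp table handling are my
> > > > >>>>>>>>>>>>>>>> assumptions:)
> > > > >>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>> Table t = tEnv.scan("src").select("a, b");
> > > > >>>>>>>>>>>>>>>> Table cached = t.cache(); // blocks, writes temp storage
> > > > >>>>>>>>>>>>>>>> cached.count();           // scans the temporary table
> > > > >>>>>>>>>>>>>>>> // when the session closes, temp tables are dropped
> > > > >>>>>>>>>>>>>>>>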
> > > > >>>>>>>>>>>>>>>> I think this behavior makes sense and is a good
> first
> > > step
> > > > >>>>>>>> towards
> > > > >>>>>>>>>>>>> more
> > > > >>>>>>>>>>>>>>>> interactive workloads.
> > > > >>>>>>>>>>>>>>>> However, its performance suffers from writing to and
> > > > reading
> > > > >>>>>>>> from
> > > > >>>>>>>>>>>>>>> external
> > > > >>>>>>>>>>>>>>>> systems.
> > > > >>>>>>>>>>>>>>>> I think this is OK for now. Changes that would
> > > > significantly
> > > > >>>>>>>>> improve
> > > > >>>>>>>>>>>>> the
> > > > >>>>>>>>>>>>>>>> situation (i.e., pinning data in-memory across jobs)
> > > would
> > > > >>>>> have
> > > > >>>>>>>>>> large
> > > > >>>>>>>>>>>>>>>> impacts on many components of Flink.
> > > > >>>>>>>>>>>>>>>> Users could use in-memory filesystems or storage
> grids
> > > > >> (Apache
> > > > >>>>>>>>>>>>> Ignite) to
> > > > >>>>>>>>>>>>>>>> mitigate some of the performance effects.
> > > > >>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>> Best, Fabian
> > > > >>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>> Am Mo., 26. Nov. 2018 um 03:38 Uhr schrieb Becket
> Qin
> > <
> > > > >>>>>>>>>>>>>>>> becket.qin@gmail.com
> > > > >>>>>>>>>>>>>>>>> :
> > > > >>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>> Thanks for the explanation, Piotrek.
> > > > >>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>> Is there any extra thing user can do on a
> > > > MaterializedTable
> > > > >>>>>>>> that
> > > > >>>>>>>>>> they
> > > > >>>>>>>>>>>>>>>>> cannot do on a Table? After users call
> > *table.cache(),
> > > > >> *users
> > > > >>>>>>>> can
> > > > >>>>>>>>>>>>> just
> > > > >>>>>>>>>>>>>>>> use
> > > > >>>>>>>>>>>>>>>>> that table and do anything that is supported on a
> > > Table,
> > > > >>>>>>>>> including
> > > > >>>>>>>>>>>>> SQL.
> > > > >>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>> Naming wise, either cache() or materialize() sounds
> > > fine
> > > > to
> > > > >>>>> me.
> > > > >>>>>>>>>>>>> cache()
> > > > >>>>>>>>>>>>>>>> is
> > > > >>>>>>>>>>>>>>>>> a bit more general than materialize(). Given that
> we
> > > are
> > > > >>>>>>>>> enhancing
> > > > >>>>>>>>>>>>> the
> > > > >>>>>>>>>>>>>>>>> Table API to also support non-relational processing
> > > > cases,
> > > > >>>>>>>>> cache()
> > > > >>>>>>>>>>>>>>> might
> > > > >>>>>>>>>>>>>>>> be
> > > > >>>>>>>>>>>>>>>>> slightly better.
> > > > >>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>> Thanks,
> > > > >>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>> Jiangjie (Becket) Qin
> > > > >>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>> On Fri, Nov 23, 2018 at 11:25 PM Piotr Nowojski <
> > > > >>>>>>>>>>>>>>> piotr@data-artisans.com
> > > > >>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>> wrote:
> > > > >>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>> Hi Becket,
> > > > >>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>> Ops, sorry I didn’t notice that you intend to
> reuse
> > > > >> existing
> > > > >>>>>>>>>>>>>>>>>> `TableFactory`. I don’t know why, but I assumed
> that
> > > you
> > > > >>>>> want
> > > > >>>>>>>> to
> > > > >>>>>>>>>>>>>>>> provide
> > > > >>>>>>>>>>>>>>>>> an
> > > > >>>>>>>>>>>>>>>>>> alternate way of writing the data.
> > > > >>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>> Now that I hopefully understand the proposal,
> maybe
> > we
> > > > >> could
> > > > >>>>>>>>>> rename
> > > > >>>>>>>>>>>>>>>>>> `cache()` to
> > > > >>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>> void materialize()
> > > > >>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>> or going step further
> > > > >>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>> MaterializedTable materialize()
> > > > >>>>>>>>>>>>>>>>>> MaterializedTable createMaterializedView()
> > > > >>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>> ?
> > > > >>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>> The second option with returning a handle I think
> is
> > > > more
> > > > >>>>>>>>> flexible
> > > > >>>>>>>>>>>>>>> and
> > > > >>>>>>>>>>>>>>>>>> could provide features such as “refresh”/“delete”
> or
> > > > >>>>> generally
> > > > >>>>>>>>>>>>>>> speaking
> > > > >>>>>>>>>>>>>>>>>> manage the the view. In the future we could also
> > think
> > > > >> about
> > > > >>>>>>>>>> adding
> > > > >>>>>>>>>>>>>>>> hooks
> > > > >>>>>>>>>>>>>>>>>> to automatically refresh view etc. It is also more
> > > > >> explicit
> > > > >>>>> -
> > > > >>>>>>>>>>>>>>>>>> materialization returning a new table handle will
> > not
> > > > have
> > > > >>>>> the
> > > > >>>>>>>>>> same
> > > > >>>>>>>>>>>>>>>>>> implicit side effects as adding a simple line of
> > code
> > > > like
> > > > >>>>>>>>>>>>>>> `b.cache()`
> > > > >>>>>>>>>>>>>>>>>> would have.
> > > > >>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>> It would also be more SQL like, making it more
> > > intuitive
> > > > >> for
> > > > >>>>>>>>> users
> > > > >>>>>>>>>>>>>>>>> already
> > > > >>>>>>>>>>>>>>>>>> familiar with the SQL.
> > > > >>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>> Piotrek
> > > > >>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>> On 23 Nov 2018, at 14:53, Becket Qin <
> > > > >> becket.qin@gmail.com
> > > > >>>>>>
> > > > >>>>>>>>>> wrote:
> > > > >>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>> Hi Piotrek,
> > > > >>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>> For the cache() method itself, yes, it is
> > equivalent
> > > to
> > > > >>>>>>>>> creating
> > > > >>>>>>>>>> a
> > > > >>>>>>>>>>>>>>>>>> BUILT-IN
> > > > >>>>>>>>>>>>>>>>>>> materialized view with a lifecycle. That
> > > functionality
> > > > is
> > > > >>>>>>>>> missing
> > > > >>>>>>>>>>>>>>>>> today,
> > > > >>>>>>>>>>>>>>>>>>> though. Not sure if I understand your question.
> Do
> > > you
> > > > >> mean
> > > > >>>>>>>> we
> > > > >>>>>>>>>>>>>>>> already
> > > > >>>>>>>>>>>>>>>>>> have
> > > > >>>>>>>>>>>>>>>>>>> the functionality and just need a syntax sugar?
> > > > >>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>> What's more interesting in the proposal is do we
> > want
> > > > to
> > > > >>>>> stop
> > > > >>>>>>>>> at
> > > > >>>>>>>>>>>>>>>>> creating
> > > > >>>>>>>>>>>>>>>>>>> the materialized view? Or do we want to extend
> that
> > > in
> > > > >> the
> > > > >>>>>>>>> future
> > > > >>>>>>>>>>>>>>> to
> > > > >>>>>>>>>>>>>>>> a
> > > > >>>>>>>>>>>>>>>>>> more
> > > > >>>>>>>>>>>>>>>>>>> useful unified data store distributed with Flink?
> > And
> > > > do
> > > > >> we
> > > > >>>>>>>>> want
> > > > >>>>>>>>>> to
> > > > >>>>>>>>>>>>>>>>> have
> > > > >>>>>>>>>>>>>>>>>> a
> > > > >>>>>>>>>>>>>>>>>>> mechanism allow more flexible user job pattern
> with
> > > > their
> > > > >>>>> own
> > > > >>>>>>>>>> user
> > > > >>>>>>>>>>>>>>>>>> defined
> > > > >>>>>>>>>>>>>>>>>>> services. These considerations are much more
> > > > >> architectural.
> > > > >>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>> Thanks,
> > > > >>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>> Jiangjie (Becket) Qin
> > > > >>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>> On Fri, Nov 23, 2018 at 6:01 PM Piotr Nowojski <
> > > > >>>>>>>>>>>>>>>>> piotr@data-artisans.com>
> > > > >>>>>>>>>>>>>>>>>>> wrote:
> > > > >>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>>> Hi,
> > > > >>>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>>> Interesting idea. I’m trying to understand the
> > > > problem.
> > > > >>>>>>>> Isn’t
> > > > >>>>>>>>>> the
> > > > >>>>>>>>>>>>>>>>>>>> `cache()` call an equivalent of writing data to
> a
> > > sink
> > > > >> and
> > > > >>>>>>>>> later
> > > > >>>>>>>>>>>>>>>>> reading
> > > > >>>>>>>>>>>>>>>>>>>> from it? Where this sink has a limited live
> > > scope/live
> > > > >>>>> time?
> > > > >>>>>>>>> And
> > > > >>>>>>>>>>>>>>> the
> > > > >>>>>>>>>>>>>>>>>> sink
> > > > >>>>>>>>>>>>>>>>>>>> could be implemented as in memory or a file
> sink?
> > > > >>>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>>> If so, what’s the problem with creating a
> > > materialised
> > > > >>>>> view
> > > > >>>>>>>>>> from a
> > > > >>>>>>>>>>>>>>>>> table
> > > > >>>>>>>>>>>>>>>>>>>> “b” (from your document’s example) and reusing
> > this
> > > > >>>>>>>>> materialised
> > > > >>>>>>>>>>>>>>>> view
> > > > >>>>>>>>>>>>>>>>>>>> later? Maybe we are lacking mechanisms to clean
> up
> > > > >>>>>>>>> materialised
> > > > >>>>>>>>>>>>>>>> views
> > > > >>>>>>>>>>>>>>>>>> (for
> > > > >>>>>>>>>>>>>>>>>>>> example when current session finishes)? Maybe we
> > > need
> > > > >> some
> > > > >>>>>>>>>>>>>>> syntactic
> > > > >>>>>>>>>>>>>>>>>> sugar
> > > > >>>>>>>>>>>>>>>>>>>> on top of it?
> > > > >>>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>>> Piotrek
> > > > >>>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>>>> On 23 Nov 2018, at 07:21, Becket Qin <
> > > > >>>>> becket.qin@gmail.com
> > > > >>>>>>>>>
> > > > >>>>>>>>>>>>>>> wrote:
> > > > >>>>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>>>> Thanks for the suggestion, Jincheng.
> > > > >>>>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>>>> Yes, I think it makes sense to have a persist()
> > > with
> > > > >>>>>>>>>>>>>>>>> lifecycle/defined
> > > > >>>>>>>>>>>>>>>>>>>>> scope. I just added a section in the future
> work
> > > for
> > > > >>>>> this.
> > > > >>>>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>>>> Thanks,
> > > > >>>>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>>>> Jiangjie (Becket) Qin
> > > > >>>>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>>>> On Fri, Nov 23, 2018 at 1:55 PM jincheng sun <
> > > > >>>>>>>>>>>>>>>>> sunjincheng121@gmail.com
> > > > >>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>>>> wrote:
> > > > >>>>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>>>>> Hi Jiangjie,
> > > > >>>>>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>>>>> Thank you for the explanation about the name
> of
> > > > >>>>>>>> `cache()`, I
> > > > >>>>>>>>>>>>>>>>>> understand
> > > > >>>>>>>>>>>>>>>>>>>> why
> > > > >>>>>>>>>>>>>>>>>>>>>> you designed this way!
> > > > >>>>>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>>>>> Another idea is whether we can specify a
> > lifecycle
> > > > for
> > > > >>>>>>>> data
> > > > >>>>>>>>>>>>>>>>>> persistence?
> > > > >>>>>>>>>>>>>>>>>>>>>> For example, persist (LifeCycle.SESSION), so
> > that
> > > > the
> > > > >>>>> user
> > > > >>>>>>>>> is
> > > > >>>>>>>>>>>>>>> not
> > > > >>>>>>>>>>>>>>>>>>>> worried
> > > > >>>>>>>>>>>>>>>>>>>>>> about data loss, and will clearly specify the
> > time
> > > > >> range
> > > > >>>>>>>> for
> > > > >>>>>>>>>>>>>>>> keeping
> > > > >>>>>>>>>>>>>>>>>>>> time.
> > > > >>>>>>>>>>>>>>>>>>>>>> At the same time, if we want to expand, we can
> > > also
> > > > >>>>> share
> > > > >>>>>>>>> in a
> > > > >>>>>>>>>>>>>>>>> certain
> > > > >>>>>>>>>>>>>>>>>>>>>> group of session, for example:
> > > > >>>>>>>>> LifeCycle.SESSION_GROUP(...), I
> > > > >>>>>>>>>>>>>>> am
> > > > >>>>>>>>>>>>>>>>> not
> > > > >>>>>>>>>>>>>>>>>>>> sure,
> > > > >>>>>>>>>>>>>>>>>>>>>> just an immature suggestion, for reference
> only!
> > > > >>>>>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>>>>> Bests,
> > > > >>>>>>>>>>>>>>>>>>>>>> Jincheng
> > > > >>>>>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>>>>> Becket Qin <be...@gmail.com>
> > 于2018年11月23日周五
> > > > >>>>>>>> 下午1:33写道:
> > > > >>>>>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>>>>>> Re: Jincheng,
> > > > >>>>>>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>>>>>> Thanks for the feedback. Regarding cache()
> v.s.
> > > > >>>>>>>> persist(),
> > > > >>>>>>>>>>>>>>>>>> personally I
> > > > >>>>>>>>>>>>>>>>>>>>>>> find cache() to be more accurately describing
> > the
> > > > >>>>>>>> behavior,
> > > > >>>>>>>>>>>>>>> i.e.
> > > > >>>>>>>>>>>>>>>>> the
> > > > >>>>>>>>>>>>>>>>>>>>>> Table
> > > > >>>>>>>>>>>>>>>>>>>>>>> is cached for the session, but will be
> deleted
> > > > after
> > > > >>>>> the
> > > > >>>>>>>>>>>>>>> session
> > > > >>>>>>>>>>>>>>>> is
> > > > >>>>>>>>>>>>>>>>>>>>>> closed.
> > > > >>>>>>>>>>>>>>>>>>>>>>> persist() seems a little misleading as people
> > > might
> > > > >>>>> think
> > > > >>>>>>>>> the
> > > > >>>>>>>>>>>>>>>> table
> > > > >>>>>>>>>>>>>>>>>>>> will
> > > > >>>>>>>>>>>>>>>>>>>>>>> still be there even after the session is
> gone.
> > > > >>>>>>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>>>>>> Great point about mixing the batch and stream
> > > > >>>>> processing
> > > > >>>>>>>> in
> > > > >>>>>>>>>> the
> > > > >>>>>>>>>>>>>>>>> same
> > > > >>>>>>>>>>>>>>>>>>>> job.
> > > > >>>>>>>>>>>>>>>>>>>>>>> We should absolutely move towards that goal.
> I
> > > > >> imagine
> > > > >>>>>>>> that
> > > > >>>>>>>>>>>>>>> would
> > > > >>>>>>>>>>>>>>>>> be
> > > > >>>>>>>>>>>>>>>>>> a
> > > > >>>>>>>>>>>>>>>>>>>>>> huge
> > > > >>>>>>>>>>>>>>>>>>>>>>> change across the board, including sources,
> > > > operators
> > > > >>>>> and
> > > > >>>>>>>>>>>>>>>>>>>> optimizations,
> > > > >>>>>>>>>>>>>>>>>>>>>> to
> > > > >>>>>>>>>>>>>>>>>>>>>>> name some. Likely we will need several
> separate
> > > > >>>>> in-depth
> > > > >>>>>>>>>>>>>>>>> discussions.
> > > > >>>>>>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>>>>>> Thanks,
> > > > >>>>>>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>>>>>> Jiangjie (Becket) Qin
> > > > >>>>>>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>>>>>> On Fri, Nov 23, 2018 at 5:14 AM Xingcan Cui <
> > > > >>>>>>>>>>>>>>> xingcanc@gmail.com>
> > > > >>>>>>>>>>>>>>>>>>>> wrote:
> > > > >>>>>>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>>>>>>> Hi all,
> > > > >>>>>>>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>>>>>>> @Shaoxuan, I think the lifecycle or access
> > > domain
> > > > >> are
> > > > >>>>>>>> both
> > > > >>>>>>>>>>>>>>>>>> orthogonal
> > > > >>>>>>>>>>>>>>>>>>>>>> to
> > > > >>>>>>>>>>>>>>>>>>>>>>>> the cache problem. Essentially, this may be
> > the
> > > > >> first
> > > > >>>>>>>> time
> > > > >>>>>>>>>> we
> > > > >>>>>>>>>>>>>>>> plan
> > > > >>>>>>>>>>>>>>>>>> to
> > > > >>>>>>>>>>>>>>>>>>>>>>>> introduce another storage mechanism other
> than
> > > the
> > > > >>>>>>>> state.
> > > > >>>>>>>>>>>>>>> Maybe
> > > > >>>>>>>>>>>>>>>>> it’s
> > > > >>>>>>>>>>>>>>>>>>>>>>> better
> > > > >>>>>>>>>>>>>>>>>>>>>>>> to first draw a big picture and then
> > concentrate
> > > > on
> > > > >> a
> > > > >>>>>>>>>> specific
> > > > >>>>>>>>>>>>>>>>> part?
> > > > >>>>>>>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>>>>>>> @Becket, yes, actually I am more concerned
> > with
> > > > the
> > > > >>>>>>>>>> underlying
> > > > >>>>>>>>>>>>>>>>>>>> service.
> > > > >>>>>>>>>>>>>>>>>>>>>>>> This seems to be quite a major change to the
> > > > >> existing
> > > > >>>>>>>>>>>>>>> codebase.
> > > > >>>>>>>>>>>>>>>> As
> > > > >>>>>>>>>>>>>>>>>> you
> > > > >>>>>>>>>>>>>>>>>>>>>>>> claimed, the service should be extendible to
> > > > support
> > > > >>>>>>>> other
> > > > >>>>>>>>>>>>>>>>>> components
> > > > >>>>>>>>>>>>>>>>>>>>>> and
> > > > >>>>>>>>>>>>>>>>>>>>>>>> we’d better discussed it in another thread.
> > > > >>>>>>>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>>>>>>> All in all, I also eager to enjoy the more
> > > > >> interactive
> > > > >>>>>>>>> Table
> > > > >>>>>>>>>>>>>>>> API,
> > > > >>>>>>>>>>>>>>>>> in
> > > > >>>>>>>>>>>>>>>>>>>>>> case
> > > > >>>>>>>>>>>>>>>>>>>>>>>> of a general and flexible enough service
> > > > mechanism.
> > > > >>>>>>>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>>>>>>> Best,
> > > > >>>>>>>>>>>>>>>>>>>>>>>> Xingcan
> > > > >>>>>>>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>>>>>>>> On Nov 22, 2018, at 10:16 AM, Xiaowei
> Jiang <
> > > > >>>>>>>>>>>>>>>> xiaoweij@gmail.com>
> > > > >>>>>>>>>>>>>>>>>>>>>>> wrote:
> > > > >>>>>>>>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>>>>>>>> Relying on a callback for the temp table
> for
> > > > clean
> > > > >> up
> > > > >>>>>>>> is
> > > > >>>>>>>>>> not
> > > > >>>>>>>>>>>>>>>> very
> > > > >>>>>>>>>>>>>>>>>>>>>>>> reliable.
> > > > >>>>>>>>>>>>>>>>>>>>>>>>> There is no guarantee that it will be
> > executed
> > > > >>>>>>>>>> successfully.
> > > > >>>>>>>>>>>>>>> We
> > > > >>>>>>>>>>>>>>>>> may
> > > > >>>>>>>>>>>>>>>>>>>>>>> risk
> > > > >>>>>>>>>>>>>>>>>>>>>>>>> leaks when that happens. I think that it's
> > > safer
> > > > to
> > > > >>>>>>>> have
> > > > >>>>>>>>> an
> > > > >>>>>>>>>>>>>>>>>>>>>> association
> > > > >>>>>>>>>>>>>>>>>>>>>>>>> between temp table and session id. So we
> can
> > > > always
> > > > >>>>>>>> clean
> > > > >>>>>>>>>> up
> > > > >>>>>>>>>>>>>>>> temp
> > > > >>>>>>>>>>>>>>>>>>>>>>> tables
> > > > >>>>>>>>>>>>>>>>>>>>>>>>> which are no longer associated with any
> > active
> > > > >>>>>>>> sessions.
> > > > >>>>>>>>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>>>>>>>> Regards,
> > > > >>>>>>>>>>>>>>>>>>>>>>>>> Xiaowei
> > > > >>>>>>>>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>>>>>>>> On Thu, Nov 22, 2018 at 12:55 PM jincheng
> > sun <
> > > > >>>>>>>>>>>>>>>>>>>>>>> sunjincheng121@gmail.com>
> > > > >>>>>>>>>>>>>>>>>>>>>>>>> wrote:
> > > > >>>>>>>>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>>>>>>>>> Hi Jiangjie&Shaoxuan,
> > > > >>>>>>>>>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for initiating this great proposal!
> > > > >>>>>>>>>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>>>>>>>>> Interactive Programming is very useful and
> > > user
> > > > >>>>>>>> friendly
> > > > >>>>>>>>>> in
> > > > >>>>>>>>>>>>>>>> case
> > > > >>>>>>>>>>>>>>>>>> of
> > > > >>>>>>>>>>>>>>>>>>>>>>> your
> > > > >>>>>>>>>>>>>>>>>>>>>>>>>> examples.
> > > > >>>>>>>>>>>>>>>>>>>>>>>>>> Moreover, especially when a business has
> to
> > be
> > > > >>>>>>>> executed
> > > > >>>>>>>>> in
> > > > >>>>>>>>>>>>>>>>> several
> > > > >>>>>>>>>>>>>>>>>>>>>>>> stages
> > > > >>>>>>>>>>>>>>>>>>>>>>>>>> with dependencies,such as the pipeline of
> > > Flink
> > > > >> ML,
> > > > >>>>> in
> > > > >>>>>>>>>> order
> > > > >>>>>>>>>>>>>>>> to
> > > > >>>>>>>>>>>>>>>>>>>>>>> utilize
> > > > >>>>>>>>>>>>>>>>>>>>>>>> the
> > > > >>>>>>>>>>>>>>>>>>>>>>>>>> intermediate calculation results we have
> to
> > > > >> submit a
> > > > >>>>>>>> job
> > > > >>>>>>>>>> by
> > > > >>>>>>>>>>>>>>>>>>>>>>>> env.execute().
> > > > >>>>>>>>>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>>>>>>>>> About the `cache()`  , I think is better
> to
> > > > named
> > > > >>>>>>>>>>>>>>> `persist()`,
> > > > >>>>>>>>>>>>>>>>> And
> > > > >>>>>>>>>>>>>>>>>>>>>> The
> > > > >>>>>>>>>>>>>>>>>>>>>>>>>> Flink framework determines whether we
> > > internally
> > > > >>>>> cache
> > > > >>>>>>>>> in
> > > > >>>>>>>>>>>>>>>> memory
> > > > >>>>>>>>>>>>>>>>>> or
> > > > >>>>>>>>>>>>>>>>>>>>>>>> persist
> > > > >>>>>>>>>>>>>>>>>>>>>>>>>> to the storage system,Maybe save the data
> > into
> > > > >> state
> > > > >>>>>>>>>> backend
> > > > >>>>>>>>>>>>>>>>>>>>>>>>>> (MemoryStateBackend or RocksDBStateBackend
> > > etc.)
> > > > >>>>>>>>>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>>>>>>>>> BTW, from the points of my view in the
> > future,
> > > > >>>>> support
> > > > >>>>>>>>> for
> > > > >>>>>>>>>>>>>>>>>> streaming
> > > > >>>>>>>>>>>>>>>>>>>>>>> and
> > > > >>>>>>>>>>>>>>>>>>>>>>>>>> batch mode switching in the same job will
> > also
> > > > >>>>> benefit
> > > > >>>>>>>>> in
> > > > >>>>>>>>>>>>>>>>>>>>>> "Interactive
> > > > >>>>>>>>>>>>>>>>>>>>>>>>>> Programming",  I am looking forward to
> your
> > > > JIRAs
> > > > >>>>> and
> > > > >>>>>>>>>> FLIP!
> > > > >>>>>>>>>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>>>>>>>>> Best,
> > > > >>>>>>>>>>>>>>>>>>>>>>>>>> Jincheng
> > > > >>>>>>>>>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>>>>>>>>> Becket Qin <be...@gmail.com>
> > > > 于2018年11月20日周二
> > > > >>>>>>>>>> 下午9:56写道:
> > > > >>>>>>>>>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>>>>>>>>>> Hi all,
> > > > >>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>>>>>>>>>> As a few recent email threads have
> pointed
> > > out,
> > > > >> it
> > > > >>>>>>>> is a
> > > > >>>>>>>>>>>>>>>>> promising
> > > > >>>>>>>>>>>>>>>>>>>>>>>>>>> opportunity to enhance Flink Table API in
> > > > various
> > > > >>>>>>>>>> aspects,
> > > > >>>>>>>>>>>>>>>>>>>>>> including
> > > > >>>>>>>>>>>>>>>>>>>>>>>>>>> functionality and ease of use among
> others.
> > > One
> > > > >> of
> > > > >>>>>>>> the
> > > > >>>>>>>>>>>>>>>>> scenarios
> > > > >>>>>>>>>>>>>>>>>>>>>>> where
> > > > >>>>>>>>>>>>>>>>>>>>>>>> we
> > > > >>>>>>>>>>>>>>>>>>>>>>>>>>> feel Flink could improve is interactive
> > > > >>>>> programming.
> > > > >>>>>>>> To
> > > > >>>>>>>>>>>>>>>> explain
> > > > >>>>>>>>>>>>>>>>>> the
> > > > >>>>>>>>>>>>>>>>>>>>>>>>>> issues
> > > > >>>>>>>>>>>>>>>>>>>>>>>>>>> and facilitate the discussion on the
> > > solution,
> > > > we
> > > > >>>>> put
> > > > >>>>>>>>>>>>>>>> together
> > > > >>>>>>>>>>>>>>>>>> the
> > > > >>>>>>>>>>>>>>>>>>>>>>>>>>> following document with our proposal.
> > > > >>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>
> > > > >>>>>>>>>
> > > > >>>>>>>>
> > > > >>>>>>
> > > > >>>>>
> > > > >>
> > > >
> > >
> >
> https://docs.google.com/document/d/1d4T2zTyfe7hdncEUAxrlNOYr4e5IMNEZLyqSuuswkA0/edit?usp=sharing
> > > > >>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>>>>>>>>>> Feedback and comments are very welcome!
> > > > >>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks,
> > > > >>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>>>>>>>>>> Jiangjie (Becket) Qin
> > > > >>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>
> > > > >>>>>>>>>>
> > > > >>>>>>>>>
> > > > >>>>>>>>
> > > > >>>>>>
> > > > >>>>>>
> > > > >>>>>
> > > > >>>
> > > > >>>
> > > > >>
> > > > >>
> > > > >>
> > > >
> > > >
> > >
> >
>

Re: [DISCUSS] Support Interactive Programming in Flink Table API

Posted by Till Rohrmann <tr...@apache.org>.
Yes, you are right, Becket, that it still depends on the actual execution of
the job whether a consumer reads from a cached result or not.

My point was actually about the properties of a (cached vs. non-cached) table
and not about the execution. I would not make cache() trigger the execution
of the job because one loses some flexibility by eagerly triggering the
execution.

I tried to argue for an explicit CachedTable which is returned by the
cache() method, like Piotr did, in order to make the API more explicit.
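
For example, something along these lines (just a sketch to illustrate the
idea; the names are placeholders, not a finished design):

Table a = ...;
CachedTable cachedA = a.cache();  // explicit handle to the cached result

Table b = a.map(...);        // unambiguously reads from the original plan
Table d = cachedA.map(...);  // unambiguously reads from the cache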

Cheers,
Till

On Mon, Dec 3, 2018 at 4:23 PM Becket Qin <be...@gmail.com> wrote:

> Hi Till,
>
> That is a good example. Just a minor correction: in this case, b, c and d
> will all consume from a non-cached a. This is because the cache will only be
> created on the very first job submission that generates the table to be
> cached.
>
> If I understand correctly, this example is about whether the .cache() method
> should be eagerly evaluated or lazily evaluated. In other words, if the
> cache() method actually triggers a job that creates the cache, there will
> be no such confusion. Is that right?
>
> In the example, although d will not consume from the cached Table as it
> looks like it should, from a correctness perspective the code will still
> return the correct result, assuming that tables are immutable.
>
> Personally I feel it is OK because users probably won't really worry about
> whether the table is cached or not. And lazy caching could avoid creating
> some unnecessary caches if a cached table is never actually used in the user
> application. But I am not opposed to doing eager evaluation of the cache.
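>
> For illustration, the eager alternative would look like this (just a sketch
> of the semantics, not a new API):
>
> Table t1 = ...;
> t1.cache();          // eager: this call itself runs the job that creates the cache
> int c = t1.count();  // always reads from the already-created cache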
>
> Thanks,
>
> Jiangjie (Becket) Qin
>
>
>
> On Mon, Dec 3, 2018 at 10:01 PM Till Rohrmann <tr...@apache.org> wrote:
>
> > Another argument for Piotr's point is that lazily changing properties of a
> > node affects all downstream consumers but does not necessarily have to
> > happen before these consumers are defined. From a user's perspective this
> > can be quite confusing:
> >
> > b = a.map(...)
> > c = a.map(...)
> >
> > a.cache()
> > d = a.map(...)
> >
> > now b, c and d will consume from a cached operator. In this case, the
> > user would most likely expect that only d reads from a cached result.
> >
> > Cheers,
> > Till
> >
> > On Mon, Dec 3, 2018 at 11:32 AM Piotr Nowojski <pi...@data-artisans.com> wrote:
> >
> > > Hey Shaoxuan and Becket,
> > >
> > > > Can you explain a bit more on what the side effects are? So far my
> > > > understanding is that such side effects only exist if a table is
> > > > mutable. Is that the case?
> > >
> > > Not only that. There are also performance implications and those are
> > > another implicit side effect of using `void cache()`. As I wrote before,
> > > reading from cache might not always be desirable, thus it can cause
> > > performance degradation and I’m fine with that - user's or optimiser’s
> > > choice. What I do not like is that this implicit side effect can manifest
> > > in a completely different part of the code that wasn’t touched by a user
> > > while he was adding the `void cache()` call somewhere else. And even if
> > > caching improves performance, it’s still a side effect of `void cache()`.
> > > Almost by definition, `void` methods have only side effects. As I wrote
> > > before, there are a couple of scenarios where this might be undesirable
> > > and/or unexpected, for example:
> > >
> > > 1.
> > > Table b = …;
> > > b.cache()
> > > x = b.join(…)
> > > y = b.count()
> > > // ...
> > > // one
> > > // hundred
> > > // lines
> > > // of
> > > // code
> > > // later
> > > z = b.filter(…).groupBy(…) // this might even be hidden in a different
> > > // method/file/package/dependency
> > >
> > > 2.
> > >
> > > Table b = ...
> > > if (some_condition) {
> > >   foo(b)
> > > }
> > > else {
> > >   bar(b)
> > > }
> > > z = b.filter(…).groupBy(…)
> > >
> > >
> > > void foo(Table b) {
> > >   b.cache()
> > >   // do something with b
> > > }
> > >
> > > In both of the above examples, `b.cache()` will implicitly affect `z =
> > > b.filter(…).groupBy(…)` (both the semantics of the program, in case of
> > > mutable sources, and its performance), which might be far from obvious.
> > >
> > > On top of that, there is still my argument that having a
> > > `MaterializedTable` or `CachedTable` handle is more flexible for us in
> > > the future and for the user (as a manual option to bypass cache reads).
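> > >
> > > With a handle, the choice is visible at every use site, roughly like
> > > this (again just a sketch, the class name is a placeholder):
> > >
> > > Table b = …;
> > > CachedTable cachedB = b.cache()
> > > x = cachedB.join(…)        // explicitly reads from the cache
> > > z = b.filter(…).groupBy(…) // explicitly bypasses the cache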
> > >
> > > > But Jiangjie is correct, the source table in batching should be
> > > > immutable. It is the user’s responsibility to ensure it, otherwise
> > > > even a regular failover may lead to inconsistent results.
> > >
> > > Yes, I agree that’s what a perfect world/good deployment should be. But
> > > it often isn’t, and while I’m not trying to fix this (since the proper
> > > fix is to support transactions), I’m just trying to minimise confusion
> > > for the users that are not fully aware of what’s going on and operate in
> > > a less than perfect setup. And if something bites them after adding a
> > > `b.cache()` call, I want to make sure that they at least know all of the
> > > places that adding this line can affect.
> > >
> > > Thanks, Piotrek
> > >
> > > > On 1 Dec 2018, at 15:39, Becket Qin <be...@gmail.com> wrote:
> > > >
> > > > Hi Piotrek,
> > > >
> > > > Thanks again for the clarification. Some more replies are following.
> > > >
> > > >> But keep in mind that `.cache()` will/might not only be used in
> > > >> interactive programming and not only in batching.
> > > >
> > > > It is true. Actually, in stream processing, cache() has the same
> > > > semantics as in batch processing. The semantics are the following:
> > > > for a table created via a series of computations, save that table for
> > > > later reference to avoid running the computation logic to regenerate
> > > > the table. Once the application exits, drop all the caches.
> > > > These semantics are the same for both batch and stream processing. The
> > > > difference is that stream applications will only run once as they are
> > > > long running, while batch applications may be run multiple times, hence
> > > > the cache may be created and dropped each time the application runs.
> > > > Admittedly, there will probably be some resource management
> > > > requirements for the streaming cached table, such as time-based /
> > > > size-based retention, to address the infinite data issue. But such a
> > > > requirement does not change the semantics.
> > > > You are right that interactive programming is just one use case of
> > > > cache(). It is not the only use case.
> > > >
> > > >> For me the more important issue is of not having the `void cache()`
> > > >> with side effects.
> > > >
> > > > This is indeed the key point. The argument around whether cache()
> > > > should return something already indicates that cache() and
> > > > materialize() address different issues.
> > > > Can you explain a bit more on what the side effects are? So far my
> > > > understanding is that such side effects only exist if a table is
> > > > mutable. Is that the case?
> > > >
> > > > I don’t know, probably initially we should make CachedTable
> > > > read-only. I don’t find it more confusing than the fact that a user
> > > > cannot write to views or materialised views in SQL or that a user
> > > > currently cannot write to a Table.
> > > >
> > > > I don't think anyone should insert something into a cache. By
> > > > definition, the cache should only be updated when the corresponding
> > > > original table is updated. What I am wondering about is that, given
> > > > the following two facts:
> > > > 1. If and only if a table is mutable (with something like insert()), a
> > > > CachedTable may have implicit behavior.
> > > > 2. A CachedTable extends a Table.
> > > > we can come to the conclusion that a CachedTable is mutable and users
> > > > can insert into the CachedTable directly. This is where I find it
> > > > confusing.
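> > > >
> > > > In code, the confusing combination would look like this (a
> > > > hypothetical sketch; insertInto() stands for whatever mutation the
> > > > Table ends up supporting):
> > > >
> > > > CachedTable cached = t.cache();
> > > > cached.insertInto("someSink"); // compiles if CachedTable extends a
> > > >                                // mutable Table - what should it mean?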
> > > >
> > > > Thanks,
> > > >
> > > > Jiangjie (Becket) Qin
> > > >
> > > > On Sat, Dec 1, 2018 at 2:45 AM Piotr Nowojski <piotr@data-artisans.com> wrote:
> > > >
> > > >> Hi all,
> > > >>
> > > >> Regarding naming `cache()` vs `materialize()`. One more explanation of
> > > >> why `materialize()` is more natural to me is that I think of all
> > > >> “Table”s in the Table API as views. They behave the same way as SQL
> > > >> views; the only difference for me is that their life scope is short -
> > > >> the current session, which is limited by a different execution model.
> > > >> That’s why “caching” a view for me is just materialising it.
> > > >>
> > > >> However, I see and understand your point of view. Coming from
> > > >> DataSet/DataStream and, generally speaking, the non-SQL world,
> > > >> `cache()` is more natural. But keep in mind that `.cache()` will/might
> > > >> not only be used in interactive programming and not only in batching.
> > > >> Naming is one issue though, and not that critical to me. Especially
> > > >> since, once we implement proper materialised views, we can always
> > > >> deprecate/rename `cache()` if we deem so.
> > > >>
> > > >>
> > > >> For me the more important issue is not having the `void cache()` with
> > > >> side effects, exactly for the reasons that you have mentioned. True:
> > > >> results might be non-deterministic if the underlying source tables are
> > > >> changing. The problem is that `void cache()` implicitly changes the
> > > >> semantics of subsequent uses of the cached/materialized Table. It can
> > > >> cause a “wtf” moment for a user if he inserts a “b.cache()” call in
> > > >> some place in his code and suddenly some other random places are
> > > >> behaving differently. If `materialize()` or `cache()` returns a Table
> > > >> handle, we force the user to explicitly use the cache, which removes
> > > >> the “random” part from the “suddenly some other random places are
> > > >> behaving differently”.
> > > >>
> > > >> This argument and others that I’ve raised (greater flexibility /
> > > >> allowing the user to explicitly bypass the cache) are independent of
> > > >> the `cache()` vs `materialize()` discussion.
> > > >>
> > > >>> Does that mean one can also insert into the CachedTable? This
> > > >>> sounds pretty confusing.
> > > >>
> > > >> I don’t know, probably initially we should make CachedTable
> > > >> read-only. I don’t find it more confusing than the fact that a user
> > > >> cannot write to views or materialised views in SQL or that a user
> > > >> currently cannot write to a Table.
> > > >>
> > > >> Piotrek
> > > >>
> > > >>> On 30 Nov 2018, at 17:38, Xingcan Cui <xi...@gmail.com> wrote:
> > > >>>
> > > >>> Hi all,
> > > >>>
> > > >>> I agree with @Becket that `cache()` and `materialize()` should be
> > > >>> considered as two different methods, where the latter one is more
> > > >>> sophisticated.
> > > >>>
> > > >>> According to my understanding, the initial idea is just to
> > > >>> introduce a simple cache or persist mechanism, but as the Table API
> > > >>> is a high-level API, it’s natural for us to think in a SQL way.
> > > >>>
> > > >>> Maybe we can add the `cache()` method to the DataSet API and force
> > > >>> users to translate a Table to a DataSet before caching it. Then the
> > > >>> users should manually register the cached dataset as a table again
> > > >>> (we may need some table replacement mechanisms for datasets with an
> > > >>> identical schema but different contents here). After all, it’s the
> > > >>> dataset rather than the dynamic table that needs to be cached, right?
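> > > >>>
> > > >>> In code, that workflow could look roughly like this (a sketch only;
> > > >>> DataSet has no cache() today, so that call is hypothetical):
> > > >>>
> > > >>> DataSet<Row> ds = tableEnv.toDataSet(t, Row.class); // translate
> > > >>> ds.cache();                               // hypothetical DataSet cache
> > > >>> tableEnv.registerDataSet("cached_t", ds); // re-register as a table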
> > > >>>
> > > >>> Best,
> > > >>> Xingcan
> > > >>>
> > > >>>> On Nov 30, 2018, at 10:57 AM, Becket Qin <be...@gmail.com> wrote:
> > > >>>>
> > > >>>> Hi Piotrek and Jark,
> > > >>>>
> > > >>>> Thanks for the feedback and explanation. Those are good arguments,
> > > >>>> but I think those arguments are mostly about materialized views. Let
> > > >>>> me try to explain the reason I believe cache() and materialize() are
> > > >>>> different.
> > > >>>>
> > > >>>> I think cache() and materialize() have quite different
> > > >>>> implications. An analogy I can think of is save()/publish(). When
> > > >>>> users call cache(), it is just like they are saving an intermediate
> > > >>>> result as a draft of their work; this intermediate result may not
> > > >>>> have any realistic meaning. Calling cache() does not mean users want
> > > >>>> to publish the cached table in any manner. But when users call
> > > >>>> materialize(), that means "I have something meaningful to be reused
> > > >>>> by others", and now users need to think about the validation, update
> > > >>>> & versioning, lifecycle of the result, etc.
> > > >>>>
> > > >>>> Piotrek's suggestions on variations of the materialize() methods
> > > >>>> are very useful. It would be great if Flink had them. The concept of
> > > >>>> a materialized view is actually a pretty big feature, not to mention
> > > >>>> the related stuff like the triggers/hooks you mentioned earlier. I
> > > >>>> think the materialized view itself should be discussed in a more
> > > >>>> thorough and systematic manner. And I found that discussion is kind
> > > >>>> of orthogonal to and way beyond the interactive programming
> > > >>>> experience.
> > > >>>>
> > > >>>> The example you gave was interesting. I still have some
> > > >>>> questions, though.
> > > >>>>
> > > >>>>> Table source = … // some source that scans files from a directory
> > > >>>>> // “/foo/bar/“
> > > >>>>> Table t1 = source.groupBy(…).select(…).where(…) ….;
> > > >>>>> Table t2 = t1.materialize() // (or `cache()`)
> > > >>>>>
> > > >>>>> t2.count() // initialise cache (if it’s lazily initialised)
> > > >>>>> int a1 = t1.count()
> > > >>>>> int b1 = t2.count()
> > > >>>>> // something in the background (or we trigger it) writes new files
> > > >>>>> // to /foo/bar
> > > >>>>> int a2 = t1.count()
> > > >>>>> int b2 = t2.count()
> > > >>>>> t2.refresh() // possible future extension, not to be implemented
> > > >>>>> // in the initial version
> > > >>>>>
> > > >>>>
> > > >>>> What if someone else added some more files to /foo/bar at this
> > > >>>> point? In that case, a3 won't be equal to b3, and the result becomes
> > > >>>> non-deterministic, right?
> > > >>>>
> > > >>>>> int a3 = t1.count()
> > > >>>>> int b3 = t2.count()
> > > >>>>> t2.drop() // another possible future extension, manual “cache” dropping
> > > >>>>
> > > >>>>
> > > >>>> When we talk about interactive programming, in most cases we are
> > > >>>> talking about batch applications. A fundamental assumption of such
> > > >>>> cases is that the source data is complete before the data processing
> > > >>>> begins, and the data will not change during the data processing.
> > > >>>> IMO, if additional rows need to be added to some source during the
> > > >>>> processing, it should be done in ways like unioning the source with
> > > >>>> another table containing the rows to be added.
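> > > >>>>
> > > >>>> For example, instead of appending files to /foo/bar while the job
> > > >>>> runs, the new rows would be brought in explicitly (a sketch):
> > > >>>>
> > > >>>> Table newRows = … // the additional rows, from a separate source
> > > >>>> Table updatedSource = source.union(newRows) // explicit, no mutation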
> > > >>>>
> > > >>>> There are a few cases where computations are executed repeatedly
> > > >>>> on a changing data source.
> > > >>>>
> > > >>>> For example, people may run an ML training job every hour with the
> > > >>>> samples newly added in the past hour. In that case, the source data
> > > >>>> will indeed change between runs. But still, the data remains
> > > >>>> unchanged within one run. And usually in that case, the result will
> > > >>>> need versioning, i.e. for a given result, it tells that the result
> > > >>>> was derived from the source data as of a certain timestamp.
> > > >>>>
> > > >>>> Another example is something like a data warehouse. In this case,
> > > >>>> there are a few sources of original/raw data. On top of those
> > > >>>> sources, many materialized views / queries / reports / dashboards
> > > >>>> can be created to generate derived data. That derived data needs to
> > > >>>> be updated when the underlying original data changes. In that case,
> > > >>>> the processing logic that derives the data needs to be executed
> > > >>>> repeatedly to update those reports/views. Again, all the derived
> > > >>>> data also needs to have version management, such as a timestamp.
> > > >>>>
> > > >>>> In either of the above two cases, during a single run of the
> > > >>>> processing logic, the data cannot change; otherwise the behavior of
> > > >>>> the processing logic may be undefined. In the above two examples,
> > > >>>> when writing the processing logic, users can use .cache() to hint to
> > > >>>> Flink that those results should be saved to avoid repeated
> > > >>>> computation. And then for the result of my application logic, I'll
> > > >>>> call materialize(), so that these results can be managed by the
> > > >>>> system with versioning, metadata management, lifecycle management,
> > > >>>> ACLs, etc.
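> > > >>>>
> > > >>>> Putting the two together, the hourly training example might look
> > > >>>> like this (purely illustrative; train() is a made-up helper):
> > > >>>>
> > > >>>> Table samples = … // this hour's samples
> > > >>>> Table features = samples.groupBy(…).select(…)
> > > >>>> features.cache()    // reused several times within this run only
> > > >>>> Table model = train(features)
> > > >>>> model.materialize() // published, versioned, outlives the session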
> > > >>>>
> > > >>>> It is true that we can use materialize() to do the cache() job,
> > > >>>> but I am really reluctant to shoehorn cache() into materialize() and
> > > >>>> force users to worry about a bunch of implications that they
> > > >>>> needn't have to. I am absolutely on your side that a redundant API
> > > >>>> is bad. But it is equally frustrating, if not more, that the same
> > > >>>> API does different things.
> > > >>>>
> > > >>>> Thanks,
> > > >>>>
> > > >>>> Jiangjie (Becket) Qin
> > > >>>>
> > > >>>>
> > > >>>> On Fri, Nov 30, 2018 at 10:34 PM Shaoxuan Wang <wshaoxuan@gmail.com> wrote:
> > > >>>>
> > > >>>>> Thanks Piotrek,
> > > >>>>> You provided a very good example; it explains all the confusions I
> > > >>>>> had. It is clear that there is something we had not considered in
> > > >>>>> the initial proposal. We intend to force the user to reuse the
> > > >>>>> cached/materialized table if its cache() method is executed. We did
> > > >>>>> not expect that the user may want to re-execute the plan from the
> > > >>>>> source table. Let me re-think about it and get back to you later.
> > > >>>>>
> > > >>>>> In the meantime, this example/observation also implies that we
> > > >>>>> cannot fully involve the optimizer in deciding the plan if a
> > > >>>>> cache/materialize is explicitly used, because whether to reuse the
> > > >>>>> cached data or re-execute the query from the source data may lead
> > > >>>>> to different results. (But I guess the optimizer can still help in
> > > >>>>> some cases ---- as long as it does not re-execute from the varied
> > > >>>>> source, we should be safe.)
> > > >>>>>
> > > >>>>> Regards,
> > > >>>>> Shaoxuan
> > > >>>>>
> > > >>>>>
> > > >>>>>
> > > >>>>> On Fri, Nov 30, 2018 at 9:13 PM Piotr Nowojski <piotr@data-artisans.com> wrote:
> > > >>>>>
> > > >>>>>> Hi Shaoxuan,
> > > >>>>>>
> > > >>>>>> Re 2:
> > > >>>>>>
> > > >>>>>>> Table t3 = methodThatAppliesOperators(t1) // t1 is modified to-> t1’
> > > >>>>>>
> > > >>>>>> What do you mean by “t1 is modified to -> t1’”? That the
> > > >>>>>> `methodThatAppliesOperators()` method has changed its plan?
> > > >>>>>>
> > > >>>>>> I was thinking more about something like this:
> > > >>>>>>
> > > >>>>>> Table source = … // some source that scans files from a directory
> > > >>>>>> // “/foo/bar/“
> > > >>>>>> Table t1 = source.groupBy(…).select(…).where(…) ….;
> > > >>>>>> Table t2 = t1.materialize() // (or `cache()`)
> > > >>>>>>
> > > >>>>>> t2.count() // initialise cache (if it’s lazily initialised)
> > > >>>>>>
> > > >>>>>> int a1 = t1.count()
> > > >>>>>> int b1 = t2.count()
> > > >>>>>>
> > > >>>>>> // something in the background (or we trigger it) writes new files
> > > >>>>>> // to /foo/bar
> > > >>>>>>
> > > >>>>>> int a2 = t1.count()
> > > >>>>>> int b2 = t2.count()
> > > >>>>>>
> > > >>>>>> t2.refresh() // possible future extension, not to be implemented
> > > >>>>>> // in the initial version
> > > >>>>>>
> > > >>>>>> int a3 = t1.count()
> > > >>>>>> int b3 = t2.count()
> > > >>>>>>
> > > >>>>>> t2.drop() // another possible future extension, manual “cache” dropping
> > > >>>>>>
> > > >>>>>> assertTrue(a1 == b1) // same results, but b1 comes from the “cache"
> > > >>>>>> assertTrue(b1 == b2) // both values come from the same cache
> > > >>>>>> assertTrue(a2 > b2)  // b2 comes from cache, a2 re-executed a full
> > > >>>>>>                      // table scan and has more data
> > > >>>>>> assertTrue(b3 > b2)  // b3 comes from the refreshed cache
> > > >>>>>> assertTrue(b3 == a2 && a2 == a3)
> > > >>>>>>
> > > >>>>>> Piotrek
> > > >>>>>>
> > > >>>>>>> On 30 Nov 2018, at 10:22, Jark Wu <im...@gmail.com> wrote:
> > > >>>>>>>
> > > >>>>>>> Hi,
> > > >>>>>>>
> > > >>>>>>> It is a very interesting and useful design!
> > > >>>>>>>
> > > >>>>>>> Here I want to share some of my thoughts:
> > > >>>>>>>
> > > >>>>>>> 1. Agreed that the cache() method should return some Table to
> > > >>>>>>> avoid some unexpected problems because of the mutable object.
> > > >>>>>>> All the existing methods of Table return a new Table instance.
> > > >>>>>>>
> > > >>>>>>> 2. I think materialize() would be more consistent with SQL; this
> > > >>>>>>> makes it possible to support the same feature for SQL
> > > >>>>>>> (materialized views) and keep the same API for users in the
> > > >>>>>>> future. But I'm also fine if we choose cache().
> > > >>>>>>>
> > > >>>>>>> 3. In the proposal, a TableService (or FlinkService?) is used to
> > > >>>>>>> cache the result of the (intermediate) table.
> > > >>>>>>> But the name TableService may be a bit general and not easy to
> > > >>>>>>> understand correctly at first glance (a metastore for tables?).
> > > >>>>>>> Maybe a more specific name would be better, such as
> > > >>>>>>> TableCacheService or TableMaterializeService or something else.
> > > >>>>>>>
> > > >>>>>>> Best,
> > > >>>>>>> Jark
> > > >>>>>>>
> > > >>>>>>>
> > > >>>>>>> On Thu, 29 Nov 2018 at 21:16, Fabian Hueske <fhueske@gmail.com> wrote:
> > > >>>>>>>
> > > >>>>>>>> Hi,
> > > >>>>>>>>
> > > >>>>>>>> Thanks for the clarification Becket!
> > > >>>>>>>>
> > > >>>>>>>> I have a few thoughts to share / questions:
> > > >>>>>>>>
> > > >>>>>>>> 1) I'd like to know how you plan to implement the feature on a
> > > >>>>>>>> plan / planner level.
> > > >>>>>>>>
> > > >>>>>>>> I would imagine the following to happen when Table.cache() is
> > > >>>>>>>> called:
> > > >>>>>>>>
> > > >>>>>>>> 1) immediately optimize the Table and internally convert it into
> > > >>>>>>>> a DataSet/DataStream. This is necessary to avoid that operators
> > > >>>>>>>> of later queries on top of the Table are pushed down.
> > > >>>>>>>> 2) register the DataSet/DataStream as a DataSet/DataStream-backed
> > > >>>>>>>> Table X
> > > >>>>>>>> 3) add a sink to the DataSet/DataStream. This is the
> > > >>>>>>>> materialization of the Table X
> > > >>>>>>>>
> > > >>>>>>>> Based on your proposal the following would happen:
> > > >>>>>>>>
> > > >>>>>>>> Table t1 = ....
> > > >>>>>>>> t1.cache(); // cache() returns void. The logical plan of t1 is
> > > >>>>>>>> // replaced by a scan of X. There is also a reference to the
> > > >>>>>>>> // materialization of X.
> > > >>>>>>>>
> > > >>>>>>>> t1.count(); // this executes the program, including the
> > > >>>>>>>> // DataSet/DataStream that backs X and the sink that writes the
> > > >>>>>>>> // materialization of X
> > > >>>>>>>> t1.count(); // this executes the program, but reads X from the
> > > >>>>>>>> // materialization.
> > > >>>>>>>>
> > > >>>>>>>> My question is, how do you determine whether the scan of t1
> > > >>>>>>>> should go against the DataSet/DataStream program and when it
> > > >>>>>>>> should go against the materialization?
> > > >>>>>>>> AFAIK, there is no hook that will tell you that a part of the
> > > >>>>>>>> program was executed. Flipping a switch during optimization or
> > > >>>>>>>> plan generation is not sufficient as there is no guarantee that
> > > >>>>>>>> the plan is also executed.
> > > >>>>>>>>
> > > >>>>>>>> Overall, this behavior is somewhat similar to what I proposed
> > > >>>>>>>> in FLINK-8950, which does not include persisting the table, but
> > > >>>>>>>> just optimizing and re-registering it as a DataSet/DataStream
> > > >>>>>>>> scan.
> > > >>>>>>>>
> > > >>>>>>>> 2) I think Piotr has a point about the implicit behavior and
> > > >>>>>>>> side effects of the cache() method if it does not return
> > > >>>>>>>> anything. Consider the following example:
> > > >>>>>>>>
> > > >>>>>>>> Table t1 = ???
> > > >>>>>>>> Table t2 = methodThatAppliesOperators(t1);
> > > >>>>>>>> Table t3 = methodThatAppliesOtherOperators(t1);
> > > >>>>>>>>
> > > >>>>>>>> In this case, the behavior/performance of the plan that results
> > > >>>>>>>> from the second method call depends on whether t1 was modified
> > > >>>>>>>> by the first method or not.
> > > >>>>>>>> This is the classic issue of mutable vs. immutable objects.
> > > >>>>>>>> Also, as Piotr pointed out, it might be good to have the original
> > > >>>>>>>> plan of t1, because in some cases it is possible to push filters
> > > >>>>>>>> down such that evaluating the query from scratch might be more
> > > >>>>>>>> efficient than accessing the cache.
> > > >>>>>>>> Moreover, a CachedTable could extend Table and offer a method
> > > >>>>>>>> refresh(). This sounds quite useful in an interactive session
> > > >>>>>>>> mode.
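> > > >>>>>>>>
> > > >>>>>>>> As a sketch (purely hypothetical, ignoring constructors and the
> > > >>>>>>>> actual Table internals):
> > > >>>>>>>>
> > > >>>>>>>> class CachedTable extends Table {
> > > >>>>>>>>   // re-run the original plan and replace the cached data
> > > >>>>>>>>   public void refresh() { … }
> > > >>>>>>>> }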
> > > >>>>>>>>
> > > >>>>>>>> 3) Regarding the name, I can see both arguments. IMO,
> > > >>>>>>>> materialize() seems to be more future proof.
> > > >>>>>>>>
> > > >>>>>>>> Best, Fabian
> > > >>>>>>>>
> > > >>>>>>>> On Thu., 29 Nov. 2018 at 12:56, Shaoxuan Wang <wshaoxuan@gmail.com> wrote:
> > > >>>>>>>>
> > > >>>>>>>>> Hi Piotr,
> > > >>>>>>>>>
> > > >>>>>>>>> Thanks for sharing your ideas on the method naming. We will
> > > >>>>>>>>> think about your suggestions. But I don't understand why we
> > > >>>>>>>>> need to change the return type of cache().
> > > >>>>>>>>>
> > > >>>>>>>>> cache() is a physical operation; it does not change the logic
> > > >>>>>>>>> of the `Table`. On the Table API layer, we should not introduce
> > > >>>>>>>>> a new table type unless the logic of the table has been
> > > >>>>>>>>> changed. If we introduce a new table type `CachedTable`, we
> > > >>>>>>>>> need to create the same set of methods of `Table` for it. I
> > > >>>>>>>>> don't think it is worth doing this. Or can you please elaborate
> > > >>>>>>>>> more on what the "implicit behaviours/side effects" you are
> > > >>>>>>>>> thinking about could be?
> > > >>>>>>>>>
> > > >>>>>>>>> Regards,
> > > >>>>>>>>> Shaoxuan
> > > >>>>>>>>>
> > > >>>>>>>>>
> > > >>>>>>>>>
> > > >>>>>>>>> On Thu, Nov 29, 2018 at 7:05 PM Piotr Nowojski <piotr@data-artisans.com> wrote:
> > > >>>>>>>>>
> > > >>>>>>>>>> Hi Becket,
> > > >>>>>>>>>>
> > > >>>>>>>>>> Thanks for the response.
> > > >>>>>>>>>>
> > > >>>>>>>>>> 1. I wasn’t saying that a materialised view must be mutable or
> > > >>>>>>>>>> not. The same thing applies to caches as well. To the
> > > >>>>>>>>>> contrary, I would expect more consistency and updates from
> > > >>>>>>>>>> something that is called a “cache” vs something that’s a
> > > >>>>>>>>>> “materialised view”. In other words, IMO most caches do not
> > > >>>>>>>>>> serve you invalid/outdated data and they handle updates on
> > > >>>>>>>>>> their own.
> > > >>>>>>>>>>
> > > >>>>>>>>>> 2. I don’t think that having two very similar concepts of
> > > >>>>>>>>>> `materialized` view and `cache` in the future is a good idea.
> > > >>>>>>>>>> It would be confusing for the users. I think it could be
> > > >>>>>>>>>> handled by variations/overloading of the materialised view
> > > >>>>>>>>>> concept. We could start with:
> > > >>>>>>>>>>
> > > >>>>>>>>>> `MaterializedTable materialize()` - immutable, session life
> > > >>>>>>>>>> scope (basically the same semantics as you are proposing)
> > > >>>>>>>>>>
> > > >>>>>>>>>> And then in the future (if ever) build on top of that/expand
> > > >>>>>>>>>> it with:
> > > >>>>>>>>>>
> > > >>>>>>>>>> `MaterializedTable materialize(refreshTime=…)` or
> > > >>>>>>>>>> `MaterializedTable materialize(refreshHook=…)`
> > > >>>>>>>>>>
> > > >>>>>>>>>> Or with cross session support:
> > > >>>>>>>>>>
> > > >>>>>>>>>> `MaterializedTable materializeInto(connector=…)` or
> > > >>>>>>>>>> `MaterializedTable materializeInto(tableFactory=…)`
> > > >>>>>>>>>>
> > > >>>>>>>>>> I’m not saying that we should implement cross
> > > >>>>>>>>>> session/refreshing now or even in the near future. I’m just
> > > >>>>>>>>>> arguing that naming the current immutable, session-life-scope
> > > >>>>>>>>>> method `materialize()` is more future proof and more
> > > >>>>>>>>>> consistent with SQL (on which, after all, the Table API is
> > > >>>>>>>>>> heavily based).
> > > >>>>>>>>>>
> > > >>>>>>>>>> 3. Even if we agree on naming it `cache()`, I would still
> > > >>>>>>>>>> insist on `cache()` returning a `CachedTable` handle to avoid
> > > >>>>>>>>>> implicit behaviours/side effects and to give both us & users
> > > >>>>>>>>>> more flexibility.
> > > >>>>>>>>>>
> > > >>>>>>>>>> Piotrek
> > > >>>>>>>>>>
> > > >>>>>>>>>>> On 29 Nov 2018, at 06:20, Becket Qin <becket.qin@gmail.com> wrote:
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> Just to add a little bit: the materialized view is probably
> > > >>>>>>>>>>> more similar to the persist() brought up earlier in the
> > > >>>>>>>>>>> thread. So it is usually cross-session and could be used in a
> > > >>>>>>>>>>> larger scope. For example, a materialized view created by
> > > >>>>>>>>>>> user A may be visible to user B. It is probably something we
> > > >>>>>>>>>>> want to have in the future. I'll put it in the future work
> > > >>>>>>>>>>> section.
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> Thanks,
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> Jiangjie (Becket) Qin
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> On Thu, Nov 29, 2018 at 9:47 AM Becket Qin <becket.qin@gmail.com> wrote:
> > > >>>>>>>>>>>
> > > >>>>>>>>>>>> Hi Piotrek,
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> Thanks for the explanation.
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> Right now we are mostly thinking of the cached table as
> > > >>>>>>>>>>>> immutable. I can see that the materialized view would be
> > > >>>>>>>>>>>> useful in the future. That said, I think a simple cache
> > > >>>>>>>>>>>> mechanism is probably still needed. So to me, cache() and
> > > >>>>>>>>>>>> materialize() should be two separate methods as they address
> > > >>>>>>>>>>>> different needs. materialize() is a higher level concept
> > > >>>>>>>>>>>> usually implying periodical updates, while cache() has much
> > > >>>>>>>>>>>> simpler semantics. For example, one may create a
> > > >>>>>>>>>>>> materialized view and use the cache() method in the
> > > >>>>>>>>>>>> materialized view creation logic, so that during the
> > > >>>>>>>>>>>> materialized view update, they do not need to worry about
> > > >>>>>>>>>>>> the case that the cached table is also changed. Maybe under
> > > >>>>>>>>>>>> the hood, materialize() and cache() could share some
> > > >>>>>>>>>>>> mechanism, but I think a simple cache() method would be
> > > >>>>>>>>>>>> handy in a lot of cases.
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> Thanks,
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> Jiangjie (Becket) Qin
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> On Mon, Nov 26, 2018 at 9:38 PM Piotr Nowojski <piotr@data-artisans.com> wrote:
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>>> Hi Becket,
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> Is there any extra thing a user can do on a
> > > >>>>>>>>>>>>>> MaterializedTable that they cannot do on a Table?
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>> Maybe not in the initial implementation, but various DBs
> > > >>>>>>>>>>>>> offer different ways to “refresh” the materialised view:
> > > >>>>>>>>>>>>> hooks, triggers, timers, manually, etc. Having a
> > > >>>>>>>>>>>>> `MaterializedTable` would help us to handle that in the
> > > >>>>>>>>>>>>> future.
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> After users call *table.cache()*, users can just use
> > > >>>>>>>>>>>>>> that table and do anything that is supported on a Table,
> > > >>>>>>>>>>>>>> including SQL.
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>> This is some implicit behaviour with side effects.
> > > >>>>>>>>>>>>> Imagine if a user has a long and complicated program that
> > > >>>>>>>>>>>>> touches table `b` multiple times, maybe scattered around
> > > >>>>>>>>>>>>> different methods. If he modifies his program by inserting
> > > >>>>>>>>>>>>> in one place
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>> b.cache()
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>> this implicitly alters the semantics and behaviour of his
> > > >>>>>>>>>>>>> code all over the place, maybe in ways that might cause
> > > >>>>>>>>>>>>> problems. For example, what if the underlying data is
> > > >>>>>>>>>>>>> changing?
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>> Having invisible side effects is also not very clean; for
> > > >>>>>>>>>>>>> example, think about something like this (but more
> > > >>>>>>>>>>>>> complicated):
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>> Table b = ...;
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>> if (some_condition) {
> > > >>>>>>>>>>>>>   processTable1(b)
> > > >>>>>>>>>>>>> }
> > > >>>>>>>>>>>>> else {
> > > >>>>>>>>>>>>>   processTable2(b)
> > > >>>>>>>>>>>>> }
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>> // do more stuff with b
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>> And the user adds a `b.cache()` call to only one of the
> > > >>>>>>>>>>>>> `processTable1` or `processTable2` methods.
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>> On the other hand,
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>> Table materialisedB = b.materialize()
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>> avoids (at least some of) the side effect issues and forces
> > > >>>>>>>>>>>>> the user to explicitly use `materialisedB` where it’s
> > > >>>>>>>>>>>>> appropriate and forces the user to think about what it
> > > >>>>>>>>>>>>> actually means. And if something doesn’t work in the end
> > > >>>>>>>>>>>>> for the user, he will know what he has changed instead of
> > > >>>>>>>>>>>>> blaming Flink for some “magic” underneath. In the above
> > > >>>>>>>>>>>>> example, after materialising b in only one of the methods,
> > > >>>>>>>>>>>>> he should/would realise the issue when handling the return
> > > >>>>>>>>>>>>> value `MaterializedTable` of that method.
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>> I guess it comes down to personal preferences whether you
> > > >>>>>>>>>>>>> like things to be implicit or not. The more of a power user
> > > >>>>>>>>>>>>> someone is, probably the more likely he is to
> > > >>>>>>>>>>>>> like/understand implicit behaviour. And we as Table API
> > > >>>>>>>>>>>>> designers are the most power users out there, so I would
> > > >>>>>>>>>>>>> proceed with caution (so that we do not end up in the crazy
> > > >>>>>>>>>>>>> perl realm with its lovely implicit method arguments ;)
> > > >>>>>>>>>>>>> <https://stackoverflow.com/a/14922656/8149051>)
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> Table API to also support non-relational processing
> cases,
> > > >>>>> cache()
> > > >>>>>>>>>>>>> might be slightly better.
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>> I think even such extended Table API could benefit from
> > > >> sticking
> > > >>>>>>>>>> to/being
> > > >>>>>>>>>>>>> consistent with SQL where both SQL and Table API are
> > > basically
> > > >>>>> the
> > > >>>>>>>>>> same.
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>> One more thing. `MaterializedTable materialize()` could
> be
> > > more
> > > >>>>>>>>>>>>> powerful/flexible allowing the user to operate both on
> > > >>>>> materialised
> > > >>>>>>>>>> and not
> > > >>>>>>>>>>>>> materialised view at the same time for whatever reasons
> > > >>>>> (underlying
> > > >>>>>>>>>> data
> > > >>>>>>>>>>>>> changing/better optimisation opportunities after pushing
> > down
> > > >>>>> more
> > > >>>>>>>>>> filters
> > > >>>>>>>>>>>>> etc). For example:
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>> Table b = …;
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>> MaterlizedTable mb = b.materialize();
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>> Val min = mb.min();
> > > >>>>>>>>>>>>> Val max = mb.max();
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>> Val user42 = b.filter(‘userId = 42);
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>> Could be more efficient compared to `b.cache()` if
> > > >>>>> `filter(‘userId
> > > >>>>>>>> =
> > > >>>>>>>>>>>>> 42);` allows for much more aggressive optimisations.
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>> Piotrek
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> On 26 Nov 2018, at 12:14, Fabian Hueske <
> > fhueske@gmail.com>
> > > >>>>>>>> wrote:
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> I'm not suggesting to add support for Ignite. This was
> > just
> > > an
> > > >>>>>>>>>> example.
> > > >>>>>>>>>>>>>> Plasma and Arrow sound interesting, too.
> > > >>>>>>>>>>>>>> For the sake of this proposal, it would be up to the
> user
> > to
> > > >>>>>>>>>> implement a
> > > >>>>>>>>>>>>>> TableFactory and corresponding TableSource / TableSink
> > > classes
> > > >>>>> to
> > > >>>>>>>>>>>>> persist
> > > >>>>>>>>>>>>>> and read the data.
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> Am Mo., 26. Nov. 2018 um 12:06 Uhr schrieb Flavio
> > > Pompermaier
> > > >> <
> > > >>>>>>>>>>>>>> pompermaier@okkam.it>:
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>> What about to add also Apache Plasma + Arrow as an
> > > >> alternative
> > > >>>>> to
> > > >>>>>>>>>>>>> Apache
> > > >>>>>>>>>>>>>>> Ignite?
> > > >>>>>>>>>>>>>>> [1]
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>
> > > >>>>>>>>
> > > >>>>>
> > > >>
> > https://arrow.apache.org/blog/2017/08/08/plasma-in-memory-object-store/
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>> On Mon, Nov 26, 2018 at 11:56 AM Fabian Hueske <
> > > >>>>>>>> fhueske@gmail.com>
> > > >>>>>>>>>>>>> wrote:
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>> Hi,
> > > >>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>> Thanks for the proposal!
> > > >>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>> To summarize, you propose a new method Table.cache():
> > > Table
> > > >>>>> that
> > > >>>>>>>>>> will
> > > >>>>>>>>>>>>>>>> trigger a job and write the result into some temporary
> > > >> storage
> > > >>>>>>>> as
> > > >>>>>>>>>>>>> defined
> > > >>>>>>>>>>>>>>>> by a TableFactory.
> > > >>>>>>>>>>>>>>>> The cache() call blocks while the job is running and
> > > >>>>> eventually
> > > >>>>>>>>>>>>> returns a
> > > >>>>>>>>>>>>>>>> Table object that represents a scan of the temporary
> > > table.
> > > >>>>>>>>>>>>>>>> When the "session" is closed (closing to be defined?),
> > the
> > > >>>>>>>>> temporary
> > > >>>>>>>>>>>>>>> tables
> > > >>>>>>>>>>>>>>>> are all dropped.
> > > >>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>> I think this behavior makes sense and is a good first
> > step
> > > >>>>>>>> towards
> > > >>>>>>>>>>>>> more
> > > >>>>>>>>>>>>>>>> interactive workloads.
> > > >>>>>>>>>>>>>>>> However, its performance suffers from writing to and
> > > reading
> > > >>>>>>>> from
> > > >>>>>>>>>>>>>>> external
> > > >>>>>>>>>>>>>>>> systems.
> > > >>>>>>>>>>>>>>>> I think this is OK for now. Changes that would
> > > significantly
> > > >>>>>>>>> improve
> > > >>>>>>>>>>>>> the
> > > >>>>>>>>>>>>>>>> situation (i.e., pinning data in-memory across jobs)
> > would
> > > >>>>> have
> > > >>>>>>>>>> large
> > > >>>>>>>>>>>>>>>> impacts on many components of Flink.
> > > >>>>>>>>>>>>>>>> Users could use in-memory filesystems or storage grids
> > > >> (Apache
> > > >>>>>>>>>>>>> Ignite) to
> > > >>>>>>>>>>>>>>>> mitigate some of the performance effects.
> > > >>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>> Best, Fabian
> > > >>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>> Am Mo., 26. Nov. 2018 um 03:38 Uhr schrieb Becket Qin
> <
> > > >>>>>>>>>>>>>>>> becket.qin@gmail.com
> > > >>>>>>>>>>>>>>>>> :
> > > >>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>> Thanks for the explanation, Piotrek.
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>> Is there any extra thing user can do on a
> > > MaterializedTable
> > > >>>>>>>> that
> > > >>>>>>>>>> they
> > > >>>>>>>>>>>>>>>>> cannot do on a Table? After users call
> *table.cache(),
> > > >> *users
> > > >>>>>>>> can
> > > >>>>>>>>>>>>> just
> > > >>>>>>>>>>>>>>>> use
> > > >>>>>>>>>>>>>>>>> that table and do anything that is supported on a
> > Table,
> > > >>>>>>>>> including
> > > >>>>>>>>>>>>> SQL.
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>> Naming wise, either cache() or materialize() sounds
> > fine
> > > to
> > > >>>>> me.
> > > >>>>>>>>>>>>> cache()
> > > >>>>>>>>>>>>>>>> is
> > > >>>>>>>>>>>>>>>>> a bit more general than materialize(). Given that we
> > are
> > > >>>>>>>>> enhancing
> > > >>>>>>>>>>>>> the
> > > >>>>>>>>>>>>>>>>> Table API to also support non-relational processing
> > > cases,
> > > >>>>>>>>> cache()
> > > >>>>>>>>>>>>>>> might
> > > >>>>>>>>>>>>>>>> be
> > > >>>>>>>>>>>>>>>>> slightly better.
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>> Thanks,
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>> Jiangjie (Becket) Qin
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>> On Fri, Nov 23, 2018 at 11:25 PM Piotr Nowojski <
> > > >>>>>>>>>>>>>>> piotr@data-artisans.com
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>> wrote:
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>> Hi Becket,
> > > >>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>> Ops, sorry I didn’t notice that you intend to reuse
> > > >> existing
> > > >>>>>>>>>>>>>>>>>> `TableFactory`. I don’t know why, but I assumed that
> > you
> > > >>>>> want
> > > >>>>>>>> to
> > > >>>>>>>>>>>>>>>> provide
> > > >>>>>>>>>>>>>>>>> an
> > > >>>>>>>>>>>>>>>>>> alternate way of writing the data.
> > > >>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>> Now that I hopefully understand the proposal, maybe
> we
> > > >> could
> > > >>>>>>>>>> rename
> > > >>>>>>>>>>>>>>>>>> `cache()` to
> > > >>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>> void materialize()
> > > >>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>> or going step further
> > > >>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>> MaterializedTable materialize()
> > > >>>>>>>>>>>>>>>>>> MaterializedTable createMaterializedView()
> > > >>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>> ?
> > > >>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>> The second option with returning a handle I think is
> > > more
> > > >>>>>>>>> flexible
> > > >>>>>>>>>>>>>>> and
> > > >>>>>>>>>>>>>>>>>> could provide features such as “refresh”/“delete” or
> > > >>>>> generally
> > > >>>>>>>>>>>>>>> speaking
> > > >>>>>>>>>>>>>>>>>> manage the the view. In the future we could also
> think
> > > >> about
> > > >>>>>>>>>> adding
> > > >>>>>>>>>>>>>>>> hooks
> > > >>>>>>>>>>>>>>>>>> to automatically refresh view etc. It is also more
> > > >> explicit
> > > >>>>> -
> > > >>>>>>>>>>>>>>>>>> materialization returning a new table handle will
> not
> > > have
> > > >>>>> the
> > > >>>>>>>>>> same
> > > >>>>>>>>>>>>>>>>>> implicit side effects as adding a simple line of
> code
> > > like
> > > >>>>>>>>>>>>>>> `b.cache()`
> > > >>>>>>>>>>>>>>>>>> would have.
> > > >>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>> It would also be more SQL like, making it more
> > intuitive
> > > >> for
> > > >>>>>>>>> users
> > > >>>>>>>>>>>>>>>>> already
> > > >>>>>>>>>>>>>>>>>> familiar with the SQL.
> > > >>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>> Piotrek
> > > >>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>> On 23 Nov 2018, at 14:53, Becket Qin <
> > > >> becket.qin@gmail.com
> > > >>>>>>
> > > >>>>>>>>>> wrote:
> > > >>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>> Hi Piotrek,
> > > >>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>> For the cache() method itself, yes, it is
> equivalent
> > to
> > > >>>>>>>>> creating
> > > >>>>>>>>>> a
> > > >>>>>>>>>>>>>>>>>> BUILT-IN
> > > >>>>>>>>>>>>>>>>>>> materialized view with a lifecycle. That
> > functionality
> > > is
> > > >>>>>>>>> missing
> > > >>>>>>>>>>>>>>>>> today,
> > > >>>>>>>>>>>>>>>>>>> though. Not sure if I understand your question. Do
> > you
> > > >> mean
> > > >>>>>>>> we
> > > >>>>>>>>>>>>>>>> already
> > > >>>>>>>>>>>>>>>>>> have
> > > >>>>>>>>>>>>>>>>>>> the functionality and just need a syntax sugar?
> > > >>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>> What's more interesting in the proposal is do we
> want
> > > to
> > > >>>>> stop
> > > >>>>>>>>> at
> > > >>>>>>>>>>>>>>>>> creating
> > > >>>>>>>>>>>>>>>>>>> the materialized view? Or do we want to extend that
> > in
> > > >> the
> > > >>>>>>>>> future
> > > >>>>>>>>>>>>>>> to
> > > >>>>>>>>>>>>>>>> a
> > > >>>>>>>>>>>>>>>>>> more
> > > >>>>>>>>>>>>>>>>>>> useful unified data store distributed with Flink?
> And
> > > do
> > > >> we
> > > >>>>>>>>> want
> > > >>>>>>>>>> to
> > > >>>>>>>>>>>>>>>>> have
> > > >>>>>>>>>>>>>>>>>> a
> > > >>>>>>>>>>>>>>>>>>> mechanism allow more flexible user job pattern with
> > > their
> > > >>>>> own
> > > >>>>>>>>>> user
> > > >>>>>>>>>>>>>>>>>> defined
> > > >>>>>>>>>>>>>>>>>>> services. These considerations are much more
> > > >> architectural.
> > > >>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>> Thanks,
> > > >>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>> Jiangjie (Becket) Qin
> > > >>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>> On Fri, Nov 23, 2018 at 6:01 PM Piotr Nowojski <
> > > >>>>>>>>>>>>>>>>> piotr@data-artisans.com>
> > > >>>>>>>>>>>>>>>>>>> wrote:
> > > >>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>> Hi,
> > > >>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>> Interesting idea. I’m trying to understand the
> > > problem.
> > > >>>>>>>> Isn’t
> > > >>>>>>>>>> the
> > > >>>>>>>>>>>>>>>>>>>> `cache()` call an equivalent of writing data to a
> > sink
> > > >> and
> > > >>>>>>>>> later
> > > >>>>>>>>>>>>>>>>> reading
> > > >>>>>>>>>>>>>>>>>>>> from it? Where this sink has a limited live
> > scope/live
> > > >>>>> time?
> > > >>>>>>>>> And
> > > >>>>>>>>>>>>>>> the
> > > >>>>>>>>>>>>>>>>>> sink
> > > >>>>>>>>>>>>>>>>>>>> could be implemented as in memory or a file sink?
> > > >>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>> If so, what’s the problem with creating a
> > materialised
> > > >>>>> view
> > > >>>>>>>>>> from a
> > > >>>>>>>>>>>>>>>>> table
> > > >>>>>>>>>>>>>>>>>>>> “b” (from your document’s example) and reusing
> this
> > > >>>>>>>>> materialised
> > > >>>>>>>>>>>>>>>> view
> > > >>>>>>>>>>>>>>>>>>>> later? Maybe we are lacking mechanisms to clean up
> > > >>>>>>>>> materialised
> > > >>>>>>>>>>>>>>>> views
> > > >>>>>>>>>>>>>>>>>> (for
> > > >>>>>>>>>>>>>>>>>>>> example when current session finishes)? Maybe we
> > need
> > > >> some
> > > >>>>>>>>>>>>>>> syntactic
> > > >>>>>>>>>>>>>>>>>> sugar
> > > >>>>>>>>>>>>>>>>>>>> on top of it?
> > > >>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>> Piotrek
> > > >>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>> On 23 Nov 2018, at 07:21, Becket Qin <
> > > >>>>> becket.qin@gmail.com
> > > >>>>>>>>>
> > > >>>>>>>>>>>>>>> wrote:
> > > >>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>> Thanks for the suggestion, Jincheng.
> > > >>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>> Yes, I think it makes sense to have a persist()
> > with
> > > >>>>>>>>>>>>>>>>> lifecycle/defined
> > > >>>>>>>>>>>>>>>>>>>>> scope. I just added a section in the future work
> > for
> > > >>>>> this.
> > > >>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>> Thanks,
> > > >>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>> Jiangjie (Becket) Qin
> > > >>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>> On Fri, Nov 23, 2018 at 1:55 PM jincheng sun <
> > > >>>>>>>>>>>>>>>>> sunjincheng121@gmail.com
> > > >>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>> wrote:
> > > >>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>> Hi Jiangjie,
> > > >>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>> Thank you for the explanation about the name of
> > > >>>>>>>> `cache()`, I
> > > >>>>>>>>>>>>>>>>>> understand
> > > >>>>>>>>>>>>>>>>>>>> why
> > > >>>>>>>>>>>>>>>>>>>>>> you designed this way!
> > > >>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>> Another idea is whether we can specify a
> lifecycle
> > > for
> > > >>>>>>>> data
> > > >>>>>>>>>>>>>>>>>> persistence?
> > > >>>>>>>>>>>>>>>>>>>>>> For example, persist (LifeCycle.SESSION), so
> that
> > > the
> > > >>>>> user
> > > >>>>>>>>> is
> > > >>>>>>>>>>>>>>> not
> > > >>>>>>>>>>>>>>>>>>>> worried
> > > >>>>>>>>>>>>>>>>>>>>>> about data loss, and will clearly specify the
> time
> > > >> range
> > > >>>>>>>> for
> > > >>>>>>>>>>>>>>>> keeping
> > > >>>>>>>>>>>>>>>>>>>> time.
> > > >>>>>>>>>>>>>>>>>>>>>> At the same time, if we want to expand, we can
> > also
> > > >>>>> share
> > > >>>>>>>>> in a
> > > >>>>>>>>>>>>>>>>> certain
> > > >>>>>>>>>>>>>>>>>>>>>> group of session, for example:
> > > >>>>>>>>> LifeCycle.SESSION_GROUP(...), I
> > > >>>>>>>>>>>>>>> am
> > > >>>>>>>>>>>>>>>>> not
> > > >>>>>>>>>>>>>>>>>>>> sure,
> > > >>>>>>>>>>>>>>>>>>>>>> just an immature suggestion, for reference only!
> > > >>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>> Bests,
> > > >>>>>>>>>>>>>>>>>>>>>> Jincheng
> > > >>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>> Becket Qin <be...@gmail.com>
> 于2018年11月23日周五
> > > >>>>>>>> 下午1:33写道:
> > > >>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>> Re: Jincheng,
> > > >>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>> Thanks for the feedback. Regarding cache() v.s.
> > > >>>>>>>> persist(),
> > > >>>>>>>>>>>>>>>>>> personally I
> > > >>>>>>>>>>>>>>>>>>>>>>> find cache() to be more accurately describing
> the
> > > >>>>>>>> behavior,
> > > >>>>>>>>>>>>>>> i.e.
> > > >>>>>>>>>>>>>>>>> the
> > > >>>>>>>>>>>>>>>>>>>>>> Table
> > > >>>>>>>>>>>>>>>>>>>>>>> is cached for the session, but will be deleted
> > > after
> > > >>>>> the
> > > >>>>>>>>>>>>>>> session
> > > >>>>>>>>>>>>>>>> is
> > > >>>>>>>>>>>>>>>>>>>>>> closed.
> > > >>>>>>>>>>>>>>>>>>>>>>> persist() seems a little misleading as people
> > might
> > > >>>>> think
> > > >>>>>>>>> the
> > > >>>>>>>>>>>>>>>> table
> > > >>>>>>>>>>>>>>>>>>>> will
> > > >>>>>>>>>>>>>>>>>>>>>>> still be there even after the session is gone.
> > > >>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>> Great point about mixing the batch and stream
> > > >>>>> processing
> > > >>>>>>>> in
> > > >>>>>>>>>> the
> > > >>>>>>>>>>>>>>>>> same
> > > >>>>>>>>>>>>>>>>>>>> job.
> > > >>>>>>>>>>>>>>>>>>>>>>> We should absolutely move towards that goal. I
> > > >> imagine
> > > >>>>>>>> that
> > > >>>>>>>>>>>>>>> would
> > > >>>>>>>>>>>>>>>>> be
> > > >>>>>>>>>>>>>>>>>> a
> > > >>>>>>>>>>>>>>>>>>>>>> huge
> > > >>>>>>>>>>>>>>>>>>>>>>> change across the board, including sources,
> > > operators
> > > >>>>> and
> > > >>>>>>>>>>>>>>>>>>>> optimizations,
> > > >>>>>>>>>>>>>>>>>>>>>> to
> > > >>>>>>>>>>>>>>>>>>>>>>> name some. Likely we will need several separate
> > > >>>>> in-depth
> > > >>>>>>>>>>>>>>>>> discussions.
> > > >>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>> Thanks,
> > > >>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>> Jiangjie (Becket) Qin
> > > >>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>> On Fri, Nov 23, 2018 at 5:14 AM Xingcan Cui <
> > > >>>>>>>>>>>>>>> xingcanc@gmail.com>
> > > >>>>>>>>>>>>>>>>>>>> wrote:
> > > >>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>> Hi all,
> > > >>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>> @Shaoxuan, I think the lifecycle or access
> > domain
> > > >> are
> > > >>>>>>>> both
> > > >>>>>>>>>>>>>>>>>> orthogonal
> > > >>>>>>>>>>>>>>>>>>>>>> to
> > > >>>>>>>>>>>>>>>>>>>>>>>> the cache problem. Essentially, this may be
> the
> > > >> first
> > > >>>>>>>> time
> > > >>>>>>>>>> we
> > > >>>>>>>>>>>>>>>> plan
> > > >>>>>>>>>>>>>>>>>> to
> > > >>>>>>>>>>>>>>>>>>>>>>>> introduce another storage mechanism other than
> > the
> > > >>>>>>>> state.
> > > >>>>>>>>>>>>>>> Maybe
> > > >>>>>>>>>>>>>>>>> it’s
> > > >>>>>>>>>>>>>>>>>>>>>>> better
> > > >>>>>>>>>>>>>>>>>>>>>>>> to first draw a big picture and then
> concentrate
> > > on
> > > >> a
> > > >>>>>>>>>> specific
> > > >>>>>>>>>>>>>>>>> part?
> > > >>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>> @Becket, yes, actually I am more concerned
> with
> > > the
> > > >>>>>>>>>> underlying
> > > >>>>>>>>>>>>>>>>>>>> service.
> > > >>>>>>>>>>>>>>>>>>>>>>>> This seems to be quite a major change to the
> > > >> existing
> > > >>>>>>>>>>>>>>> codebase.
> > > >>>>>>>>>>>>>>>> As
> > > >>>>>>>>>>>>>>>>>> you
> > > >>>>>>>>>>>>>>>>>>>>>>>> claimed, the service should be extendible to
> > > support
> > > >>>>>>>> other
> > > >>>>>>>>>>>>>>>>>> components
> > > >>>>>>>>>>>>>>>>>>>>>> and
> > > >>>>>>>>>>>>>>>>>>>>>>>> we’d better discussed it in another thread.
> > > >>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>> All in all, I also eager to enjoy the more
> > > >> interactive
> > > >>>>>>>>> Table
> > > >>>>>>>>>>>>>>>> API,
> > > >>>>>>>>>>>>>>>>> in
> > > >>>>>>>>>>>>>>>>>>>>>> case
> > > >>>>>>>>>>>>>>>>>>>>>>>> of a general and flexible enough service
> > > mechanism.
> > > >>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>> Best,
> > > >>>>>>>>>>>>>>>>>>>>>>>> Xingcan
> > > >>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>> On Nov 22, 2018, at 10:16 AM, Xiaowei Jiang <
> > > >>>>>>>>>>>>>>>> xiaoweij@gmail.com>
> > > >>>>>>>>>>>>>>>>>>>>>>> wrote:
> > > >>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>> Relying on a callback for the temp table for
> > > clean
> > > >> up
> > > >>>>>>>> is
> > > >>>>>>>>>> not
> > > >>>>>>>>>>>>>>>> very
> > > >>>>>>>>>>>>>>>>>>>>>>>> reliable.
> > > >>>>>>>>>>>>>>>>>>>>>>>>> There is no guarantee that it will be
> executed
> > > >>>>>>>>>> successfully.
> > > >>>>>>>>>>>>>>> We
> > > >>>>>>>>>>>>>>>>> may
> > > >>>>>>>>>>>>>>>>>>>>>>> risk
> > > >>>>>>>>>>>>>>>>>>>>>>>>> leaks when that happens. I think that it's
> > safer
> > > to
> > > >>>>>>>> have
> > > >>>>>>>>> an
> > > >>>>>>>>>>>>>>>>>>>>>> association
> > > >>>>>>>>>>>>>>>>>>>>>>>>> between temp table and session id. So we can
> > > always
> > > >>>>>>>> clean
> > > >>>>>>>>>> up
> > > >>>>>>>>>>>>>>>> temp
> > > >>>>>>>>>>>>>>>>>>>>>>> tables
> > > >>>>>>>>>>>>>>>>>>>>>>>>> which are no longer associated with any
> active
> > > >>>>>>>> sessions.
> > > >>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>> Regards,
> > > >>>>>>>>>>>>>>>>>>>>>>>>> Xiaowei
> > > >>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>> On Thu, Nov 22, 2018 at 12:55 PM jincheng
> sun <
> > > >>>>>>>>>>>>>>>>>>>>>>> sunjincheng121@gmail.com>
> > > >>>>>>>>>>>>>>>>>>>>>>>>> wrote:
> > > >>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>> Hi Jiangjie&Shaoxuan,
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for initiating this great proposal!
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>> Interactive Programming is very useful and
> > user
> > > >>>>>>>> friendly
> > > >>>>>>>>>> in
> > > >>>>>>>>>>>>>>>> case
> > > >>>>>>>>>>>>>>>>>> of
> > > >>>>>>>>>>>>>>>>>>>>>>> your
> > > >>>>>>>>>>>>>>>>>>>>>>>>>> examples.
> > > >>>>>>>>>>>>>>>>>>>>>>>>>> Moreover, especially when a business has to
> be
> > > >>>>>>>> executed
> > > >>>>>>>>> in
> > > >>>>>>>>>>>>>>>>> several
> > > >>>>>>>>>>>>>>>>>>>>>>>> stages
> > > >>>>>>>>>>>>>>>>>>>>>>>>>> with dependencies,such as the pipeline of
> > Flink
> > > >> ML,
> > > >>>>> in
> > > >>>>>>>>>> order
> > > >>>>>>>>>>>>>>>> to
> > > >>>>>>>>>>>>>>>>>>>>>>> utilize
> > > >>>>>>>>>>>>>>>>>>>>>>>> the
> > > >>>>>>>>>>>>>>>>>>>>>>>>>> intermediate calculation results we have to
> > > >> submit a
> > > >>>>>>>> job
> > > >>>>>>>>>> by
> > > >>>>>>>>>>>>>>>>>>>>>>>> env.execute().
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>> About the `cache()`  , I think is better to
> > > named
> > > >>>>>>>>>>>>>>> `persist()`,
> > > >>>>>>>>>>>>>>>>> And
> > > >>>>>>>>>>>>>>>>>>>>>> The
> > > >>>>>>>>>>>>>>>>>>>>>>>>>> Flink framework determines whether we
> > internally
> > > >>>>> cache
> > > >>>>>>>>> in
> > > >>>>>>>>>>>>>>>> memory
> > > >>>>>>>>>>>>>>>>>> or
> > > >>>>>>>>>>>>>>>>>>>>>>>> persist
> > > >>>>>>>>>>>>>>>>>>>>>>>>>> to the storage system,Maybe save the data
> into
> > > >> state
> > > >>>>>>>>>> backend
> > > >>>>>>>>>>>>>>>>>>>>>>>>>> (MemoryStateBackend or RocksDBStateBackend
> > etc.)
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>> BTW, from the points of my view in the
> future,
> > > >>>>> support
> > > >>>>>>>>> for
> > > >>>>>>>>>>>>>>>>>> streaming
> > > >>>>>>>>>>>>>>>>>>>>>>> and
> > > >>>>>>>>>>>>>>>>>>>>>>>>>> batch mode switching in the same job will
> also
> > > >>>>> benefit
> > > >>>>>>>>> in
> > > >>>>>>>>>>>>>>>>>>>>>> "Interactive
> > > >>>>>>>>>>>>>>>>>>>>>>>>>> Programming",  I am looking forward to your
> > > JIRAs
> > > >>>>> and
> > > >>>>>>>>>> FLIP!
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>> Best,
> > > >>>>>>>>>>>>>>>>>>>>>>>>>> Jincheng
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>> Becket Qin <be...@gmail.com>
> > > 于2018年11月20日周二
> > > >>>>>>>>>> 下午9:56写道:

Re: [DISCUSS] Support Interactive Programming in Flink Table API

Posted by Becket Qin <be...@gmail.com>.
Hi Till,

That is a good example. Just a minor correction: in this case, b, c and d
will all consume from a non-cached a. This is because the cache will only be
created on the very first job submission that generates the table to be
cached.

If I understand correctly, this example is about whether the .cache() method
should be eagerly evaluated or lazily evaluated. In other words, if the
cache() method actually triggers a job that creates the cache, there will
be no such confusion. Is that right?

In the example, although d will not consume from the cached Table even
though it looks like it should, from a correctness perspective the code will
still return the correct result, assuming that tables are immutable.

Personally I feel it is OK because users probably won't really worry about
whether the table is cached or not. And a lazy cache could avoid some
unnecessary caching if a cached table is never actually used in the user
application. But I am not opposed to doing eager evaluation of the cache.
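
To make the difference concrete, here is a minimal sketch (hypothetical
code: cache() is only proposed and does not exist yet, and the tEnv /
"src" table are assumptions):

Table a = tEnv.scan("src");
a.cache();          // lazy (proposed): only marks a; no job runs here
int c1 = a.count(); // first job computes a and also creates the cache
int c2 = a.count(); // later jobs are rewritten to scan the cache

An eager cache() would instead submit the caching job inside the cache()
call itself, so every table defined afterwards consistently reads from
the cache.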

Thanks,

Jiangjie (Becket) Qin



On Mon, Dec 3, 2018 at 10:01 PM Till Rohrmann <tr...@apache.org> wrote:

> Another argument for Piotr's point is that lazily changing properties of a
> node affects all downstream consumers but does not necessarily have to
> happen before these consumers are defined. From a user's perspective this
> can be quite confusing:
>
> b = a.map(...)
> c = a.map(...)
>
> a.cache()
> d = a.map(...)
>
> now b, c and d will consume from a cached operator. In this case, the user
> would most likely expect that only d reads from a cached result.
>
> Cheers,
> Till
>
> On Mon, Dec 3, 2018 at 11:32 AM Piotr Nowojski <pi...@data-artisans.com>
> wrote:
>
> > Hey Shaoxuan and Becket,
> >
> > > Can you explain a bit more one what are the side effects? So far my
> > > understanding is that such side effects only exist if a table is
> mutable.
> > > Is that the case?
> >
> > Not only that. There are also performance implications, and those are
> > another implicit side effect of using `void cache()`. As I wrote before,
> > reading from the cache might not always be desirable, so it can cause
> > performance degradation, and I’m fine with that - it is the user's or the
> > optimiser’s choice. What I do not like is that this implicit side effect
> > can manifest in a completely different part of the code that wasn’t
> > touched by the user while he was adding the `void cache()` call somewhere
> > else. And even if caching improves performance, it’s still a side effect
> > of `void cache()`. Almost by definition, `void` methods have only side
> > effects. As I wrote before, there are a couple of scenarios where this
> > might be undesirable and/or unexpected, for example:
> >
> > 1.
> > Table b = …;
> > b.cache()
> > x = b.join(…)
> > y = b.count()
> > // ...
> > // 100
> > // hundred
> > // lines
> > // of
> > // code
> > // later
> > z = b.filter(…).groupBy(…) // this might be even hidden in a different
> > method/file/package/dependency
> >
> > 2.
> >
> > Table b = ...
> > if (some_condition) {
> >   foo(b)
> > } else {
> >   bar(b)
> > }
> > z = b.filter(…).groupBy(…)
> >
> >
> > void foo(Table b) {
> >   b.cache()
> >   // do something with b
> > }
> >
> > In both of the above examples, `b.cache()` will implicitly affect `z =
> > b.filter(…).groupBy(…)`, both in the program's semantics (in case the
> > sources are mutable) and in its performance, which might be far from
> > obvious.
> >
> > On top of that, there is still this argument of mine that having a
> > `MaterializedTable` or `CachedTable` handle is more flexible for us for
> > the future and for the user (as a manual option to bypass cache reads).
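> >
> > As a rough sketch of what such a handle could enable (all of these names
> > are hypothetical, none of this exists yet):
> >
> > CachedTable cachedB = b.cache()
> > x = cachedB.join(…)  // explicitly reads from the cache
> > y = b.count()        // explicitly bypasses the cache, re-runs b's plan
> > cachedB.refresh()    // possible future extension
> > cachedB.drop()       // possible future extension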
> >
> > >  But Jiangjie is correct,
> > > the source table in batching should be immutable. It is the user’s
> > > responsibility to ensure it, otherwise even a regular failover may lead
> > > to inconsistent results.
> >
> > Yes, I agree that’s what a perfect world/good deployment should be. But
> > it often isn’t, and while I’m not trying to fix this (since the proper
> > fix is to support transactions), I’m just trying to minimise confusion
> > for the users that are not fully aware of what’s going on and operate in
> > a less than perfect setup. And if something bites them after adding a
> > `b.cache()` call, I want to make sure that they at least know all of the
> > places that adding this line can affect.
> >
> > Thanks, Piotrek
> >
> > > On 1 Dec 2018, at 15:39, Becket Qin <be...@gmail.com> wrote:
> > >
> > > Hi Piotrek,
> > >
> > > Thanks again for the clarification. Some more replies follow.
> > >
> > >> But keep in mind that `.cache()` will/might not only be used in
> > >> interactive programming and not only in batching.
> > >
> > > It is true. Actually, in stream processing cache() has the same
> > > semantics as in batch processing, namely: for a table created via a
> > > series of computations, save that table for later reference to avoid
> > > re-running the computation logic to regenerate the table. Once the
> > > application exits, drop all the caches.
> > > This semantic is the same for both batch and stream processing. The
> > > difference is that stream applications will only run once, as they are
> > > long running, while batch applications may be run multiple times, hence
> > > the cache may be created and dropped each time the application runs.
> > > Admittedly, there will probably be some resource management
> > > requirements for a cached streaming table, such as time based / size
> > > based retention, to address the infinite data issue. But such
> > > requirements do not change the semantics.
> > > You are right that interactive programming is just one use case of
> > > cache(). It is not the only use case.
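> > >
> > > As a small sketch of that lifecycle (hypothetical code, assuming the
> > > proposed void-returning cache() inside a TableEnvironment session):
> > >
> > > Table t = tEnv.scan("src").groupBy(…).select(…)
> > > t.cache()   // mark t to be cached; no job is triggered yet
> > > t.count()   // job 1 computes t and creates the cache
> > > t.count()   // job 2 is served from the cache
> > > // session closes -> the cache is dropped automatically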
> > >
> > >> For me the more important issue is that of not having a `void
> > >> cache()` with side effects.
> > >
> > > This is indeed the key point. The argument around whether cache()
> > > should return something already indicates that cache() and
> > > materialize() address different issues.
> > > Can you explain a bit more on what the side effects are? So far my
> > > understanding is that such side effects only exist if a table is
> > > mutable. Is that the case?
> > >
> > >> I don’t know, probably initially we should make CachedTable
> > >> read-only. I don’t find it more confusing than the fact that a user
> > >> can not write to views or materialised views in SQL, or that a user
> > >> currently can not write to a Table.
> > >
> > > I don't think anyone should insert something into a cache. By
> > > definition the cache should only be updated when the corresponding
> > > original table is updated. What I am wondering is that, given the
> > > following two facts:
> > > 1. If and only if a table is mutable (with something like insert()),
> > > a CachedTable may have implicit behavior.
> > > 2. A CachedTable extends a Table.
> > > we can come to the conclusion that a CachedTable is mutable and users
> > > can insert into the CachedTable directly. This is where I found it
> > > confusing.
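> > >
> > > In (hypothetical) code, the confusion would look like this, assuming a
> > > future mutation method on Table that does not exist today:
> > >
> > > class Table {
> > >   void insert(Row row) { … }  // hypothetical future mutation API
> > > }
> > > class CachedTable extends Table {
> > >   // inherits insert() - but what does inserting into a cache mean?
> > > }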
> > >
> > > Thanks,
> > >
> > > Jiangjie (Becket) Qin
> > >
> > > On Sat, Dec 1, 2018 at 2:45 AM Piotr Nowojski <piotr@data-artisans.com
> >
> > > wrote:
> > >
> > >> Hi all,
> > >>
> > >> Regarding naming `cache()` vs `materialize()`: one more explanation
> > >> of why `materialize()` is more natural to me is that I think of all
> > >> “Table”s in the Table API as views. They behave the same way as SQL
> > >> views; the only difference for me is that their live scope is short -
> > >> the current session - which is limited by the different execution
> > >> model. That’s why “caching” a view for me is just materialising it.
> > >> is just materialising it.
> > >>
> > >> However I see and I understand your point of view. Coming from
> > >> DataSet/DataStream and generally speaking non-SQL world, `cache()` is
> > more
> > >> natural. But keep in mind that `.cache()` will/might not only be used
> in
> > >> interactive programming and not only in batching. But naming is one
> > issue,
> > >> and not that critical to me. Especially that once we implement proper
> > >> materialised views, we can always deprecate/rename `cache()` if we
> deem
> > so.
> > >>
> > >>
> > >> For me the more important issue is of not having the `void cache()`
> with
> > >> side effects. Exactly for the reasons that you have mentioned. True:
> > >> results might be non deterministic if underlying source table are
> > changing.
> > >> Problem is that `void cache()` implicitly changes the semantic of
> > >> subsequent uses of the cached/materialized Table. It can cause “wtf”
> > moment
> > >> for a user if he inserts “b.cache()” call in some place in his code
> and
> > >> suddenly some other random places are behaving differently. If
> > >> `materialize()` or `cache()` returns a Table handle, we force user to
> > >> explicitly use the cache which removes the “random” part from the
> > "suddenly
> > >> some other random places are behaving differently”.
> > >>
> > >> This argument and others that I’ve raised (greater
> flexibility/allowing
> > >> user to explicitly bypass the cache) are independent of `cache()` vs
> > >> `materialize()` discussion.
> > >>
> > >>> Does that mean one can also insert into the CachedTable? This sounds
> > >> pretty confusing.
> > >>
> > >> I don’t know, probably initially we should make CachedTable
> read-only. I
> > >> don’t find it more confusing than the fact that user can not write to
> > views
> > >> or materialised views in SQL or that user currently can not write to a
> > >> Table.
> > >>
> > >> Piotrek
> > >>
> > >>> On 30 Nov 2018, at 17:38, Xingcan Cui <xi...@gmail.com> wrote:
> > >>>
> > >>> Hi all,
> > >>>
> > >>> I agree with @Becket that `cache()` and `materialize()` should be
> > >>> considered as two different methods, where the latter is more
> > >>> sophisticated.
> > >>>
> > >>> According to my understanding, the initial idea is just to introduce
> > >>> a simple cache or persist mechanism, but as the Table API is a
> > >>> high-level API, it’s natural for us to think in a SQL way.
> > >>>
> > >>> Maybe we can add the `cache()` method to the DataSet API and force
> > >>> users to translate a Table to a DataSet before caching it. Then the
> > >>> users should manually register the cached DataSet as a table again
> > >>> (we may need some table replacement mechanism for datasets with an
> > >>> identical schema but different contents here). After all, it’s the
> > >>> dataset rather than the dynamic table that needs to be cached, right?
> > >>>
> > >>> Best,
> > >>> Xingcan
> > >>>
> > >>>> On Nov 30, 2018, at 10:57 AM, Becket Qin <be...@gmail.com>
> > wrote:
> > >>>>
> > >>>> Hi Piotrek and Jark,
> > >>>>
> > >>>> Thanks for the feedback and explanation. Those are good arguments.
> > >>>> But I think those arguments are mostly about the materialized view.
> > >>>> Let me try to explain the reason I believe cache() and materialize()
> > >>>> are different.
> > >>>>
> > >>>> I think cache() and materialize() have quite different implications.
> > >>>> An analogy I can think of is save()/publish(). When users call
> > >>>> cache(), it is just like they are saving an intermediate result as a
> > >>>> draft of their work; this intermediate result may not have any
> > >>>> realistic meaning. Calling cache() does not mean users want to
> > >>>> publish the cached table in any manner.
> > >>>> But when users call materialize(), that means "I have something
> > >>>> meaningful to be reused by others"; now users need to think about
> > >>>> the validation, update & versioning, lifecycle of the result, etc.
> > >>>>
> > >>>> Piotrek's suggestions on variations of the materialize() methods
> > >>>> are very useful. It would be great if Flink had them. The concept
> > >>>> of materialized views is actually a pretty big feature, not to
> > >>>> mention the related stuff like triggers/hooks you mentioned earlier.
> > >>>> I think the materialized view itself should be discussed in a more
> > >>>> thorough and systematic manner. And I found that discussion is kind
> > >>>> of orthogonal to, and way beyond, the interactive programming
> > >>>> experience.
> > >>>>
> > >>>> The example you gave was interesting. I still have some questions,
> > >> though.
> > >>>>
> > >>>>> Table source = … // some source that scans files from a directory
> > >>>>> “/foo/bar/“
> > >>>>> Table t1 = source.groupBy(…).select(…).where(…) ….;
> > >>>>> Table t2 = t1.materialize() // (or `cache()`)
> > >>>>>
> > >>>>> t2.count() // initialise cache (if it’s lazily initialised)
> > >>>>> int a1 = t1.count()
> > >>>>> int b1 = t2.count()
> > >>>>> // something in the background (or we trigger it) writes new files
> > >>>>> // to /foo/bar
> > >>>>> int a2 = t1.count()
> > >>>>> int b2 = t2.count()
> > >>>>> t2.refresh() // possible future extension, not to be implemented
> > >>>>> // in the initial version
> > >>>>>
> > >>>>
> > >>>> What if someone else added some more files to /foo/bar at this
> > >>>> point? In that case, a3 won't equal b3, and the result becomes
> > >>>> non-deterministic, right?
> > >>>>
> > >>>>> int a3 = t1.count()
> > >>>>> int b3 = t2.count()
> > >>>>> t2.drop() // another possible future extension, manual “cache”
> > >>>>> // dropping
> > >>>>
> > >>>>
> > >>>> When we talk about interactive programming, in most cases we are
> > >>>> talking about batch applications. A fundamental assumption of such
> > >>>> cases is that the source data is complete before the data processing
> > >>>> begins, and the data will not change during the data processing.
> > >>>> IMO, if additional rows need to be added to some source during the
> > >>>> processing, it should be done in ways like unioning the source with
> > >>>> another table containing the rows to be added.
> > >>>>
> > >>>> There are a few cases where computations are executed repeatedly on
> > >>>> a changing data source.
> > >>>>
> > >>>> For example, people may run an ML training job every hour with the
> > >>>> samples newly added in the past hour. In that case, the source data
> > >>>> between runs will indeed change. But still, the data remains
> > >>>> unchanged within one run. And usually in that case, the result will
> > >>>> need versioning, i.e. for a given result, it tells that the result
> > >>>> is derived from the source data as of a certain timestamp.
> > >>>>
> > >>>> Another example is something like a data warehouse. In this case,
> > >>>> there are a few sources of original/raw data. On top of those
> > >>>> sources, many materialized views / queries / reports / dashboards
> > >>>> can be created to generate derived data. That derived data needs to
> > >>>> be updated when the underlying original data changes. In that case,
> > >>>> the processing logic that derives the data needs to be executed
> > >>>> repeatedly to update those reports/views. Again, all the derived
> > >>>> data also needs version management, such as a timestamp.
> > >>>>
> > >>>> In either of the above two cases, during a single run of the
> > >>>> processing logic the data cannot change; otherwise the behavior of
> > >>>> the processing logic may be undefined. In the above two examples,
> > >>>> when writing the processing logic, users can use .cache() to hint to
> > >>>> Flink that those results should be saved to avoid repeated
> > >>>> computation. And then for the result of my application logic, I'll
> > >>>> call materialize(), so that these results can be managed by the
> > >>>> system with versioning, metadata management, lifecycle management,
> > >>>> ACLs, etc.
> > >>>>
> > >>>> It is true we can use materialize() to do the cache() job, but I am
> > >>>> really reluctant to shoehorn cache() into materialize() and force
> > >>>> users to worry about a bunch of implications that they needn't have
> > >>>> to. I am absolutely on your side that a redundant API is bad. But it
> > >>>> is equally frustrating, if not more so, that the same API does
> > >>>> different things.
> > >>>>
> > >>>> Thanks,
> > >>>>
> > >>>> Jiangjie (Becket) Qin
> > >>>>
> > >>>>
> > >>>> On Fri, Nov 30, 2018 at 10:34 PM Shaoxuan Wang <wshaoxuan@gmail.com
> >
> > >> wrote:
> > >>>>
> > >>>>> Thanks Piotrek,
> > >>>>> You provided a very good example; it explains all the confusion I
> > >>>>> had. It is clear that there is something we have not considered in
> > >>>>> the initial proposal. We intend to force the user to reuse the
> > >>>>> cached/materialized table if its cache() method is executed. We did
> > >>>>> not expect that the user may want to re-execute the plan from the
> > >>>>> source table. Let me re-think about it and get back to you later.
> > >>>>>
> > >>>>> In the meanwhile, this example/observation also implies that we
> > >>>>> cannot fully involve the optimizer in deciding the plan if a
> > >>>>> cache/materialize is explicitly used, because whether to reuse the
> > >>>>> cached data or re-execute the query from the source data may lead
> > >>>>> to different results. (But I guess the optimizer can still help in
> > >>>>> some cases ---- as long as it does not re-execute from the varied
> > >>>>> source, we should be safe.)
> > >>>>>
> > >>>>> Regards,
> > >>>>> Shaoxuan
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>> On Fri, Nov 30, 2018 at 9:13 PM Piotr Nowojski <
> > >> piotr@data-artisans.com>
> > >>>>> wrote:
> > >>>>>
> > >>>>>> Hi Shaoxuan,
> > >>>>>>
> > >>>>>> Re 2:
> > >>>>>>
> > >>>>>>> Table t3 = methodThatAppliesOperators(t1) // t1 is modified to -> t1’
> > >>>>>>
> > >>>>>> What do you mean by “t1 is modified to -> t1’”? That the
> > >>>>>> `methodThatAppliesOperators()` method has changed its plan?
> > >>>>>>
> > >>>>>> I was thinking more about something like this:
> > >>>>>>
> > >>>>>> Table source = … // some source that scans files from a directory
> > >>>>>> “/foo/bar/“
> > >>>>>> Table t1 = source.groupBy(…).select(…).where(…) ….;
> > >>>>>> Table t2 = t1.materialize() // (or `cache()`)
> > >>>>>>
> > >>>>>> t2.count() // initialise cache (if it’s lazily initialised)
> > >>>>>>
> > >>>>>> int a1 = t1.count()
> > >>>>>> int b1 = t2.count()
> > >>>>>>
> > >>>>>> // something in the background (or we trigger it) writes new
> > >>>>>> // files to /foo/bar
> > >>>>>>
> > >>>>>> int a2 = t1.count()
> > >>>>>> int b2 = t2.count()
> > >>>>>>
> > >>>>>> t2.refresh() // possible future extension, not to be implemented
> > >>>>>> // in the initial version
> > >>>>>>
> > >>>>>> int a3 = t1.count()
> > >>>>>> int b3 = t2.count()
> > >>>>>>
> > >>>>>> t2.drop() // another possible future extension, manual “cache”
> > >>>>>> // dropping
> > >>>>>>
> > >>>>>> assertTrue(a1 == b1) // same results, but b1 comes from the “cache"
> > >>>>>> assertTrue(b1 == b2) // both values come from the same cache
> > >>>>>> assertTrue(a2 > b2)  // b2 comes from the cache; a2 re-executed a
> > >>>>>>                      // full table scan and has more data
> > >>>>>> assertTrue(b3 > b2)  // b3 comes from the refreshed cache
> > >>>>>> assertTrue(b3 == a2 && a2 == a3)
> > >>>>>>
> > >>>>>> Piotrek
> > >>>>>>
> > >>>>>>> On 30 Nov 2018, at 10:22, Jark Wu <im...@gmail.com> wrote:
> > >>>>>>>
> > >>>>>>> Hi,
> > >>>>>>>
> > >>>>>>> It is a very interesting and useful design!
> > >>>>>>>
> > >>>>>>> Here I want to share some of my thoughts:
> > >>>>>>>
> > >>>>>>> 1. Agree that the cache() method should return some Table to
> > >>>>>>> avoid some unexpected problems caused by mutable objects.
> > >>>>>>> All the existing methods of Table return a new Table instance.
> > >>>>>>>
> > >>>>>>> 2. I think materialize() would be more consistent with SQL; this
> > >>>>>>> makes it possible to support the same feature for SQL
> > >>>>>>> (materialized view) and keep the same API for users in the future.
> > >>>>>>> But I'm also fine if we choose cache().
> > >>>>>>>
> > >>>>>>> 3. In the proposal, a TableService (or FlinkService?) is used to
> > >>>>>>> cache the result of the (intermediate) table.
> > >>>>>>> But the name TableService may be a bit too general and is not
> > >>>>>>> easy to understand correctly at first glance (a metastore for
> > >>>>>>> tables?). Maybe a more specific name would be better, such as
> > >>>>>>> TableCacheService or TableMaterializeService or something else.
> > >>>>>>>
> > >>>>>>> Best,
> > >>>>>>> Jark
> > >>>>>>>
> > >>>>>>>
> > >>>>>>> On Thu, 29 Nov 2018 at 21:16, Fabian Hueske <fh...@gmail.com>
> > >> wrote:
> > >>>>>>>
> > >>>>>>>> Hi,
> > >>>>>>>>
> > >>>>>>>> Thanks for the clarification Becket!
> > >>>>>>>>
> > >>>>>>>> I have a few thoughts to share / questions:
> > >>>>>>>>
> > >>>>>>>> 1) I'd like to know how you plan to implement the feature on a
> > >>>>>>>> plan / planner level.
> > >>>>>>>>
> > >>>>>>>> I would imagine the following to happen when Table.cache() is
> > >>>>>>>> called:
> > >>>>>>>>
> > >>>>>>>> 1) immediately optimize the Table and internally convert it into
> > >>>>>>>> a DataSet/DataStream. This is necessary to avoid that operators
> > >>>>>>>> of later queries on top of the Table are pushed down.
> > >>>>>>>> 2) register the DataSet/DataStream as a DataSet/DataStream-backed
> > >>>>>>>> Table X
> > >>>>>>>> 3) add a sink to the DataSet/DataStream. This is the
> > >>>>>>>> materialization of Table X
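> > >>>>>>>>
> > >>>>>>>> Roughly in code, these three steps could look as follows (just a
> > >>>>>>>> sketch against the batch Table API; CachingOutputFormat is a
> > >>>>>>>> made-up name for the materializing sink):
> > >>>>>>>>
> > >>>>>>>> DataSet<Row> ds = tableEnv.toDataSet(t1, Row.class);   // step 1
> > >>>>>>>> tableEnv.registerTable("X", tableEnv.fromDataSet(ds)); // step 2
> > >>>>>>>> ds.output(new CachingOutputFormat());                  // step 3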
> > >>>>>>>>
> > >>>>>>>> Based on your proposal the following would happen:
> > >>>>>>>>
> > >>>>>>>> Table t1 = ....
> > >>>>>>>> t1.cache(); // cache() returns void. The logical plan of t1 is
> > >>>>>>>> //  replaced by a scan of X. There is also a reference to the
> > >>>>>>>> //  materialization of X.
> > >>>>>>>>
> > >>>>>>>> t1.count(); // this executes the program, including the
> > >>>>>>>> //  DataSet/DataStream that backs X and the sink that writes the
> > >>>>>>>> //  materialization of X
> > >>>>>>>> t1.count(); // this executes the program, but reads X from the
> > >>>>>>>> //  materialization.
> > >>>>>>>>
> > >>>>>>>> My question is, how do you determine whether the scan of t1
> > >>>>>>>> should go against the DataSet/DataStream program and when it
> > >>>>>>>> should go against the materialization?
> > >>>>>>>> AFAIK, there is no hook that will tell you that a part of the
> > >>>>>>>> program was executed. Flipping a switch during optimization or
> > >>>>>>>> plan generation is not sufficient as there is no guarantee that
> > >>>>>>>> the plan is also executed.
> > >>>>>>>>
> > >>>>>>>> Overall, this behavior is somewhat similar to what I proposed in
> > >>>>>>>> FLINK-8950, which does not include persisting the table, but just
> > >>>>>>>> optimizing and re-registering it as a DataSet/DataStream scan.
> > >>>>>>>>
> > >>>>>>>> 2) I think Piotr has a point about the implicit behavior and
> > >>>>>>>> side effects of the cache() method if it does not return
> > >>>>>>>> anything.
> > >>>>>>>> Consider the following example:
> > >>>>>>>> Consider the following example:
> > >>>>>>>>
> > >>>>>>>> Table t1 = ???
> > >>>>>>>> Table t2 = methodThatAppliesOperators(t1);
> > >>>>>>>> Table t3 = methodThatAppliesOtherOperators(t1);
> > >>>>>>>>
> > >>>>>>>> In this case, the behavior/performance of the plan that results
> > >>>>>>>> from the second method call depends on whether t1 was modified
> > >>>>>>>> by the first method or not.
> > >>>>>>>> This is the classic issue of mutable vs. immutable objects.
> > >>>>>>>> Also, as Piotr pointed out, it might also be good to have the
> > >>>>>>>> original plan of t1, because in some cases it is possible to push
> > >>>>>>>> filters down such that evaluating the query from scratch might be
> > >>>>>>>> more efficient than accessing the cache.
> > >>>>>>>> Moreover, a CachedTable could extend Table and offer a method
> > >>>>>>>> refresh(). This sounds quite useful in an interactive session
> > >>>>>>>> mode.
> > >>>>>>>>
> > >>>>>>>> 3) Regarding the name, I can see both arguments. IMO,
> > >>>>>>>> materialize() seems to be more future proof.
> > >>>>>>>>
> > >>>>>>>> Best, Fabian
> > >>>>>>>>
> > >>>>>>>> Am Do., 29. Nov. 2018 um 12:56 Uhr schrieb Shaoxuan Wang <
> > >>>>>>>> wshaoxuan@gmail.com>:
> > >>>>>>>>
> > >>>>>>>>> Hi Piotr,
> > >>>>>>>>>
> > >>>>>>>>> Thanks for sharing your ideas on the method naming. We will
> > >>>>>>>>> think about your suggestions. But I don't understand why we need
> > >>>>>>>>> to change the return type of cache().
> > >>>>>>>>>
> > >>>>>>>>> Cache() is a physical operation; it does not change the logic of
> > >>>>>>>>> the `Table`. On the Table API layer, we should not introduce a
> > >>>>>>>>> new table type unless the logic of the table has been changed.
> > >>>>>>>>> If we introduce a new table type `CachedTable`, we need to
> > >>>>>>>>> create the same set of methods of `Table` for it. I don't think
> > >>>>>>>>> it is worth doing this. Or can you please elaborate more on what
> > >>>>>>>>> the "implicit behaviours/side effects" you are thinking about
> > >>>>>>>>> could be?
> > >>>>>>>>>
> > >>>>>>>>> Regards,
> > >>>>>>>>> Shaoxuan
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>> On Thu, Nov 29, 2018 at 7:05 PM Piotr Nowojski <
> > >>>>>> piotr@data-artisans.com>
> > >>>>>>>>> wrote:
> > >>>>>>>>>
> > >>>>>>>>>> Hi Becket,
> > >>>>>>>>>>
> > >>>>>>>>>> Thanks for the response.
> > >>>>>>>>>>
> > >>>>>>>>>> 1. I wasn’t saying that a materialised view must be mutable or
> > >>>>>>>>>> not. The same thing applies to caches as well. To the contrary,
> > >>>>>>>>>> I would expect more consistency and updates from something that
> > >>>>>>>>>> is called “cache” vs something that’s a “materialised view”. In
> > >>>>>>>>>> other words, IMO most caches do not serve you invalid/outdated
> > >>>>>>>>>> data and they handle updates on their own.
> > >>>>>>>>>>
> > >>>>>>>>>> 2. I don’t think that having in the future two very similar
> > >>>>>>>>>> concepts of `materialized` view and `cache` is a good idea. It
> > >>>>>>>>>> would be confusing for the users. I think it could be handled
> > >>>>>>>>>> by variations/overloading of the materialised view concept. We
> > >>>>>>>>>> could start with:
> > >>>>>>>>>>
> > >>>>>>>>>> `MaterializedTable materialize()` - immutable, session life
> > >>>>>>>>>> scope (basically the same semantics as you are proposing).
> > >>>>>>>>>>
> > >>>>>>>>>> And then in the future (if ever) build on top of that/expand it
> > >>>>>>>>>> with:
> > >>>>>>>>>>
> > >>>>>>>>>> `MaterializedTable materialize(refreshTime=…)` or
> > >>>>>>>>>> `MaterializedTable materialize(refreshHook=…)`
> > >>>>>>>>>>
> > >>>>>>>>>> Or with cross session support:
> > >>>>>>>>>>
> > >>>>>>>>>> `MaterializedTable materializeInto(connector=…)` or
> > >>>>>>>>>> `MaterializedTable materializeInto(tableFactory=…)`
> > >>>>>>>>>>
> > >>>>>>>>>> I’m not saying that we should implement cross session/refreshing
> > >>>>>>>>>> now or even in the near future. I’m just arguing that naming the
> > >>>>>>>>>> current immutable session-life-scope method `materialize()` is
> > >>>>>>>>>> more future proof and more consistent with SQL (on which, after
> > >>>>>>>>>> all, the Table API is heavily based).
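> > >>>>>>>>>>
> > >>>>>>>>>> Sketched as a (purely hypothetical) type, that evolution path
> > >>>>>>>>>> could look like this:
> > >>>>>>>>>>
> > >>>>>>>>>> interface MaterializedTable extends Table {
> > >>>>>>>>>>   void refresh(); // re-run the plan and replace the contents
> > >>>>>>>>>>   void drop();    // drop the materialisation before session end
> > >>>>>>>>>> }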
> > >>>>>>>>>>
> > >>>>>>>>>> 3. Even if we agree on naming it `cache()`, I would still
> > >>>>>>>>>> insist on `cache()` returning a `CachedTable` handle to avoid
> > >>>>>>>>>> implicit behaviours/side effects and to give both us & users
> > >>>>>>>>>> more flexibility.
> > >>>>>>>>>>
> > >>>>>>>>>> Piotrek
> > >>>>>>>>>>
> > >>>>>>>>>>> On 29 Nov 2018, at 06:20, Becket Qin <be...@gmail.com>
> > >> wrote:
> > >>>>>>>>>>>
> > >>>>>>>>>>> Just to add a little bit, the materialized view is probably
> > >>>>>>>>>>> more similar to the persist() brought up earlier in the
> > >>>>>>>>>>> thread. So it is usually cross session and could be used in a
> > >>>>>>>>>>> larger scope. For example, a materialized view created by user
> > >>>>>>>>>>> A may be visible to user B. It is probably something we want
> > >>>>>>>>>>> to have in the future. I'll put it in the future work section.
> > >>>>>>>>>>>
> > >>>>>>>>>>> Thanks,
> > >>>>>>>>>>>
> > >>>>>>>>>>> Jiangjie (Becket) Qin
> > >>>>>>>>>>>
> > >>>>>>>>>>> On Thu, Nov 29, 2018 at 9:47 AM Becket Qin <
> > becket.qin@gmail.com
> > >>>
> > >>>>>>>>> wrote:
> > >>>>>>>>>>>
> > >>>>>>>>>>>> Hi Piotrek,
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Thanks for the explanation.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Right now we are mostly thinking of the cached table as
> > >>>>>>>>>>>> immutable. I can see that the materialized view would be
> > >>>>>>>>>>>> useful in the future. That said, I think a simple cache
> > >>>>>>>>>>>> mechanism is probably still needed. So to me, cache() and
> > >>>>>>>>>>>> materialize() should be two separate methods as they address
> > >>>>>>>>>>>> different needs. Materialize() is a higher level concept,
> > >>>>>>>>>>>> usually implying periodic updates, while cache() has much
> > >>>>>>>>>>>> simpler semantics. For example, one may create a materialized
> > >>>>>>>>>>>> view and use the cache() method in the materialized view
> > >>>>>>>>>>>> creation logic, so that during the materialized view update
> > >>>>>>>>>>>> they do not need to worry about the case that the cached
> > >>>>>>>>>>>> table is also changed (see the sketch below). Maybe under the
> > >>>>>>>>>>>> hood, materialize() and cache() could share some mechanism,
> > >>>>>>>>>>>> but I think a simple cache() method would be handy in a lot
> > >>>>>>>>>>>> of cases.
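> > >>>>>>>>>>>>
> > >>>>>>>>>>>> A rough sketch of that usage (hypothetical code, assuming the
> > >>>>>>>>>>>> proposed void-returning cache() and a materialize() API that
> > >>>>>>>>>>>> does not exist yet):
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Table cleaned = rawEvents.filter(…).select(…)
> > >>>>>>>>>>>> cleaned.cache()  // reuse `cleaned` without recomputing it
> > >>>>>>>>>>>> Table report1 = cleaned.groupBy(…).select(…)
> > >>>>>>>>>>>> Table report2 = cleaned.join(dimTable).select(…)
> > >>>>>>>>>>>> report1.materialize()  // published, periodically updated
> > >>>>>>>>>>>> report2.materialize()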
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Thanks,
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Jiangjie (Becket) Qin
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> On Mon, Nov 26, 2018 at 9:38 PM Piotr Nowojski <
> > >>>>>>>>> piotr@data-artisans.com
> > >>>>>>>>>>>
> > >>>>>>>>>>>> wrote:
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>> Hi Becket,
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>> Is there any extra thing user can do on a
> MaterializedTable
> > >> that
> > >>>>>>>>> they
> > >>>>>>>>>>>>> cannot do on a Table?
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> Maybe not in the initial implementation, but various DBs
> > offer
> > >>>>>>>>>> different
> > >>>>>>>>>>>>> ways to “refresh” the materialised view. Hooks, triggers,
> > >> timers,
> > >>>>>>>>>> manually
> > >>>>>>>>>>>>> etc. Having `MaterializedTable` would help us to handle
> that
> > in
> > >>>>> the
> > >>>>>>>>>> future.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>> After users call *table.cache(), *users can just use that
> > >> table
> > >>>>>>>> and
> > >>>>>>>>> do
> > >>>>>>>>>>>>> anything that is supported on a Table, including SQL.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> This is some implicit behaviour with side effects. Imagine
> if
> > >>>>> user
> > >>>>>>>>> has
> > >>>>>>>>>> a
> > >>>>>>>>>>>>> long and complicated program, that touches table `b`
> multiple
> > >>>>>>>> times,
> > >>>>>>>>>> maybe
> > >>>>>>>>>>>>> scattered around different methods. If he modifies his
> > program
> > >> by
> > >>>>>>>>>> inserting
> > >>>>>>>>>>>>> in one place
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> b.cache()
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> This implicitly alters the semantic and behaviour of his
> code
> > >> all
> > >>>>>>>>> over
> > >>>>>>>>>>>>> the place, maybe in a ways that might cause problems. For
> > >> example
> > >>>>>>>>> what
> > >>>>>>>>>> if
> > >>>>>>>>>>>>> underlying data is changing?
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> Having invisible side effects is also not very clean, for
> > >> example
> > >>>>>>>>> think
> > >>>>>>>>>>>>> about something like this (but more complicated):
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> Table b = ...;
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> If (some_condition) {
> > >>>>>>>>>>>>> processTable1(b)
> > >>>>>>>>>>>>> }
> > >>>>>>>>>>>>> else {
> > >>>>>>>>>>>>> processTable2(b)
> > >>>>>>>>>>>>> }
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> // do more stuff with b
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> And user adds `b.cache()` call to only one of the
> > >> `processTable1`
> > >>>>>>>> or
> > >>>>>>>>>>>>> `processTable2` methods.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> On the other hand
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> Table materialisedB = b.materialize()
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> Avoids (at least some of) the side effect issues and forces
> > >> user
> > >>>>> to
> > >>>>>>>>>>>>> explicitly use `materialisedB` where it’s appropriate and
> > >> forces
> > >>>>>>>> user
> > >>>>>>>>>> to
> > >>>>>>>>>>>>> think what does it actually mean. And if something doesn’t
> > work
> > >>>>> in
> > >>>>>>>>> the
> > >>>>>>>>>> end
> > >>>>>>>>>>>>> for the user, he will know what has he changed instead of
> > >> blaming
> > >>>>>>>>>> Flink for
> > >>>>>>>>>>>>> some “magic” underneath. In the above example, after
> > >>>>> materialising
> > >>>>>>>> b
> > >>>>>>>>> in
> > >>>>>>>>>>>>> only one of the methods, he should/would realise about the
> > >> issue
> > >>>>>>>> when
> > >>>>>>>>>>>>> handling the return value `MaterializedTable` of that
> method.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> I guess it comes down to personal preferences if you like
> > >> things
> > >>>>> to
> > >>>>>>>>> be
> > >>>>>>>>>>>>> implicit or not. The more of a power user the user is, probably the
> > more
> > >>>>>>>> likely
> > >>>>>>>>>> he is
> > >>>>>>>>>>>>> to like/understand implicit behaviour. And we as Table API
> > >>>>>>>> designers
> > >>>>>>>>>> are
> > >>>>>>>>>>>>> the most power users out there, so I would proceed with
> > caution
> > >>>>> (so
> > >>>>>>>>>> that we
> > >>>>>>>>>>>>> do not end up in the crazy Perl realm with its lovely
> > implicit
> > >>>>>>>>> method
> > >>>>>>>>>>>>> arguments ;)  <
> https://stackoverflow.com/a/14922656/8149051
> > >)
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>> Table API to also support non-relational processing cases,
> > >>>>> cache()
> > >>>>>>>>>>>>> might be slightly better.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> I think even such extended Table API could benefit from
> > >> sticking
> > >>>>>>>>>> to/being
> > >>>>>>>>>>>>> consistent with SQL where both SQL and Table API are
> > basically
> > >>>>> the
> > >>>>>>>>>> same.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> One more thing. `MaterializedTable materialize()` could be
> > more
> > >>>>>>>>>>>>> powerful/flexible allowing the user to operate both on
> > >>>>> materialised
> > >>>>>>>>>> and not
> > >>>>>>>>>>>>> materialised view at the same time for whatever reasons
> > >>>>> (underlying
> > >>>>>>>>>> data
> > >>>>>>>>>>>>> changing/better optimisation opportunities after pushing
> down
> > >>>>> more
> > >>>>>>>>>> filters
> > >>>>>>>>>>>>> etc). For example:
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> Table b = …;
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> MaterializedTable mb = b.materialize();
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> Val min = mb.min();
> > >>>>>>>>>>>>> Val max = mb.max();
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> Val user42 = b.filter(‘userId = 42);
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> Could be more efficient compared to `b.cache()` if
> > >>>>> `filter(‘userId
> > >>>>>>>> =
> > >>>>>>>>>>>>> 42);` allows for much more aggressive optimisations.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> Piotrek
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>> On 26 Nov 2018, at 12:14, Fabian Hueske <
> fhueske@gmail.com>
> > >>>>>>>> wrote:
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> I'm not suggesting to add support for Ignite. This was
> just
> > an
> > >>>>>>>>>> example.
> > >>>>>>>>>>>>>> Plasma and Arrow sound interesting, too.
> > >>>>>>>>>>>>>> For the sake of this proposal, it would be up to the user
> to
> > >>>>>>>>>> implement a
> > >>>>>>>>>>>>>> TableFactory and corresponding TableSource / TableSink
> > classes
> > >>>>> to
> > >>>>>>>>>>>>> persist
> > >>>>>>>>>>>>>> and read the data.
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> On Mon, Nov 26, 2018 at 12:06 PM Flavio Pompermaier <pompermaier@okkam.it> wrote:
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> What about also adding Apache Plasma + Arrow as an
> > >> alternative
> > >>>>> to
> > >>>>>>>>>>>>> Apache
> > >>>>>>>>>>>>>>> Ignite?
> > >>>>>>>>>>>>>>> [1]
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>
> > >>>>>
> > >>
> https://arrow.apache.org/blog/2017/08/08/plasma-in-memory-object-store/
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> On Mon, Nov 26, 2018 at 11:56 AM Fabian Hueske <
> > >>>>>>>> fhueske@gmail.com>
> > >>>>>>>>>>>>> wrote:
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> Hi,
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> Thanks for the proposal!
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> To summarize, you propose a new method Table.cache():
> > Table
> > >>>>> that
> > >>>>>>>>>> will
> > >>>>>>>>>>>>>>>> trigger a job and write the result into some temporary
> > >> storage
> > >>>>>>>> as
> > >>>>>>>>>>>>> defined
> > >>>>>>>>>>>>>>>> by a TableFactory.
> > >>>>>>>>>>>>>>>> The cache() call blocks while the job is running and
> > >>>>> eventually
> > >>>>>>>>>>>>> returns a
> > >>>>>>>>>>>>>>>> Table object that represents a scan of the temporary
> > table.
> > >>>>>>>>>>>>>>>> When the "session" is closed (closing to be defined?),
> the
> > >>>>>>>>> temporary
> > >>>>>>>>>>>>>>> tables
> > >>>>>>>>>>>>>>>> are all dropped.
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> I think this behavior makes sense and is a good first
> step
> > >>>>>>>> towards
> > >>>>>>>>>>>>> more
> > >>>>>>>>>>>>>>>> interactive workloads.
> > >>>>>>>>>>>>>>>> However, its performance suffers from writing to and
> > reading
> > >>>>>>>> from
> > >>>>>>>>>>>>>>> external
> > >>>>>>>>>>>>>>>> systems.
> > >>>>>>>>>>>>>>>> I think this is OK for now. Changes that would
> > significantly
> > >>>>>>>>> improve
> > >>>>>>>>>>>>> the
> > >>>>>>>>>>>>>>>> situation (i.e., pinning data in-memory across jobs)
> would
> > >>>>> have
> > >>>>>>>>>> large
> > >>>>>>>>>>>>>>>> impacts on many components of Flink.
> > >>>>>>>>>>>>>>>> Users could use in-memory filesystems or storage grids
> > >> (Apache
> > >>>>>>>>>>>>> Ignite) to
> > >>>>>>>>>>>>>>>> mitigate some of the performance effects.
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> Best, Fabian
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> On Mon, Nov 26, 2018 at 3:38 AM Becket Qin <becket.qin@gmail.com> wrote:
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> Thanks for the explanation, Piotrek.
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> Is there any extra thing user can do on a
> > MaterializedTable
> > >>>>>>>> that
> > >>>>>>>>>> they
> > >>>>>>>>>>>>>>>>> cannot do on a Table? After users call *table.cache(),
> > >> *users
> > >>>>>>>> can
> > >>>>>>>>>>>>> just
> > >>>>>>>>>>>>>>>> use
> > >>>>>>>>>>>>>>>>> that table and do anything that is supported on a
> Table,
> > >>>>>>>>> including
> > >>>>>>>>>>>>> SQL.
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> Naming wise, either cache() or materialize() sounds
> fine
> > to
> > >>>>> me.
> > >>>>>>>>>>>>> cache()
> > >>>>>>>>>>>>>>>> is
> > >>>>>>>>>>>>>>>>> a bit more general than materialize(). Given that we
> are
> > >>>>>>>>> enhancing
> > >>>>>>>>>>>>> the
> > >>>>>>>>>>>>>>>>> Table API to also support non-relational processing
> > cases,
> > >>>>>>>>> cache()
> > >>>>>>>>>>>>>>> might
> > >>>>>>>>>>>>>>>> be
> > >>>>>>>>>>>>>>>>> slightly better.
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> Thanks,
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> Jiangjie (Becket) Qin
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> On Fri, Nov 23, 2018 at 11:25 PM Piotr Nowojski <
> > >>>>>>>>>>>>>>> piotr@data-artisans.com
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> wrote:
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> Hi Becket,
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> Ops, sorry I didn’t notice that you intend to reuse
> > >> existing
> > >>>>>>>>>>>>>>>>>> `TableFactory`. I don’t know why, but I assumed that
> you
> > >>>>> want
> > >>>>>>>> to
> > >>>>>>>>>>>>>>>> provide
> > >>>>>>>>>>>>>>>>> an
> > >>>>>>>>>>>>>>>>>> alternate way of writing the data.
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> Now that I hopefully understand the proposal, maybe we
> > >> could
> > >>>>>>>>>> rename
> > >>>>>>>>>>>>>>>>>> `cache()` to
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> void materialize()
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> or going step further
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> MaterializedTable materialize()
> > >>>>>>>>>>>>>>>>>> MaterializedTable createMaterializedView()
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> ?
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> The second option with returning a handle I think is
> > more
> > >>>>>>>>> flexible
> > >>>>>>>>>>>>>>> and
> > >>>>>>>>>>>>>>>>>> could provide features such as “refresh”/“delete” or
> > >>>>> generally
> > >>>>>>>>>>>>>>> speaking
> > >>>>>>>>>>>>>>>>>> manage the the view. In the future we could also think
> > >> about
> > >>>>>>>>>> adding
> > >>>>>>>>>>>>>>>> hooks
> > >>>>>>>>>>>>>>>>>> to automatically refresh view etc. It is also more
> > >> explicit
> > >>>>> -
> > >>>>>>>>>>>>>>>>>> materialization returning a new table handle will not
> > have
> > >>>>> the
> > >>>>>>>>>> same
> > >>>>>>>>>>>>>>>>>> implicit side effects as adding a simple line of code
> > like
> > >>>>>>>>>>>>>>> `b.cache()`
> > >>>>>>>>>>>>>>>>>> would have.
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> It would also be more SQL like, making it more
> intuitive
> > >> for
> > >>>>>>>>> users
> > >>>>>>>>>>>>>>>>> already
> > >>>>>>>>>>>>>>>>>> familiar with the SQL.
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> Piotrek
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> On 23 Nov 2018, at 14:53, Becket Qin <
> > >> becket.qin@gmail.com
> > >>>>>>
> > >>>>>>>>>> wrote:
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> Hi Piotrek,
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> For the cache() method itself, yes, it is equivalent
> to
> > >>>>>>>>> creating
> > >>>>>>>>>> a
> > >>>>>>>>>>>>>>>>>> BUILT-IN
> > >>>>>>>>>>>>>>>>>>> materialized view with a lifecycle. That
> functionality
> > is
> > >>>>>>>>> missing
> > >>>>>>>>>>>>>>>>> today,
> > >>>>>>>>>>>>>>>>>>> though. Not sure if I understand your question. Do
> you
> > >> mean
> > >>>>>>>> we
> > >>>>>>>>>>>>>>>> already
> > >>>>>>>>>>>>>>>>>> have
> > >>>>>>>>>>>>>>>>>>> the functionality and just need a syntax sugar?
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> What's more interesting in the proposal is do we want
> > to
> > >>>>> stop
> > >>>>>>>>> at
> > >>>>>>>>>>>>>>>>> creating
> > >>>>>>>>>>>>>>>>>>> the materialized view? Or do we want to extend that
> in
> > >> the
> > >>>>>>>>> future
> > >>>>>>>>>>>>>>> to
> > >>>>>>>>>>>>>>>> a
> > >>>>>>>>>>>>>>>>>> more
> > >>>>>>>>>>>>>>>>>>> useful unified data store distributed with Flink? And
> > do
> > >> we
> > >>>>>>>>> want
> > >>>>>>>>>> to
> > >>>>>>>>>>>>>>>>> have
> > >>>>>>>>>>>>>>>>>> a
> > >>>>>>>>>>>>>>>>>>> mechanism allow more flexible user job pattern with
> > their
> > >>>>> own
> > >>>>>>>>>> user
> > >>>>>>>>>>>>>>>>>> defined
> > >>>>>>>>>>>>>>>>>>> services. These considerations are much more
> > >> architectural.
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> Thanks,
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> Jiangjie (Becket) Qin
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> On Fri, Nov 23, 2018 at 6:01 PM Piotr Nowojski <
> > >>>>>>>>>>>>>>>>> piotr@data-artisans.com>
> > >>>>>>>>>>>>>>>>>>> wrote:
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>> Hi,
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>> Interesting idea. I’m trying to understand the
> > problem.
> > >>>>>>>> Isn’t
> > >>>>>>>>>> the
> > >>>>>>>>>>>>>>>>>>>> `cache()` call an equivalent of writing data to a
> sink
> > >> and
> > >>>>>>>>> later
> > >>>>>>>>>>>>>>>>> reading
> > >>>>>>>>>>>>>>>>>>>> from it? Where this sink has a limited live
> scope/live
> > >>>>> time?
> > >>>>>>>>> And
> > >>>>>>>>>>>>>>> the
> > >>>>>>>>>>>>>>>>>> sink
> > >>>>>>>>>>>>>>>>>>>> could be implemented as in memory or a file sink?
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>> If so, what’s the problem with creating a
> materialised
> > >>>>> view
> > >>>>>>>>>> from a
> > >>>>>>>>>>>>>>>>> table
> > >>>>>>>>>>>>>>>>>>>> “b” (from your document’s example) and reusing this
> > >>>>>>>>> materialised
> > >>>>>>>>>>>>>>>> view
> > >>>>>>>>>>>>>>>>>>>> later? Maybe we are lacking mechanisms to clean up
> > >>>>>>>>> materialised
> > >>>>>>>>>>>>>>>> views
> > >>>>>>>>>>>>>>>>>> (for
> > >>>>>>>>>>>>>>>>>>>> example when current session finishes)? Maybe we
> need
> > >> some
> > >>>>>>>>>>>>>>> syntactic
> > >>>>>>>>>>>>>>>>>> sugar
> > >>>>>>>>>>>>>>>>>>>> on top of it?
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>> Piotrek
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>> On 23 Nov 2018, at 07:21, Becket Qin <
> > >>>>> becket.qin@gmail.com
> > >>>>>>>>>
> > >>>>>>>>>>>>>>> wrote:
> > >>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>> Thanks for the suggestion, Jincheng.
> > >>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>> Yes, I think it makes sense to have a persist()
> with
> > >>>>>>>>>>>>>>>>> lifecycle/defined
> > >>>>>>>>>>>>>>>>>>>>> scope. I just added a section in the future work
> for
> > >>>>> this.
> > >>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>> Thanks,
> > >>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>> Jiangjie (Becket) Qin
> > >>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>> On Fri, Nov 23, 2018 at 1:55 PM jincheng sun <
> > >>>>>>>>>>>>>>>>> sunjincheng121@gmail.com
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>> wrote:
> > >>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>> Hi Jiangjie,
> > >>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>> Thank you for the explanation about the name of
> > >>>>>>>> `cache()`, I
> > >>>>>>>>>>>>>>>>>> understand
> > >>>>>>>>>>>>>>>>>>>> why
> > >>>>>>>>>>>>>>>>>>>>>> you designed this way!
> > >>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>> Another idea is whether we can specify a lifecycle
> > for
> > >>>>>>>> data
> > >>>>>>>>>>>>>>>>>> persistence?
> > >>>>>>>>>>>>>>>>>>>>>> For example, persist (LifeCycle.SESSION), so that
> > the
> > >>>>> user
> > >>>>>>>>> is
> > >>>>>>>>>>>>>>> not
> > >>>>>>>>>>>>>>>>>>>> worried
> > >>>>>>>>>>>>>>>>>>>>>> about data loss, and will clearly specify the time
> > >> range
> > >>>>>>>> for
> > >>>>>>>>>>>>>>>> keeping
> > >>>>>>>>>>>>>>>>>>>> time.
> > >>>>>>>>>>>>>>>>>>>>>> At the same time, if we want to expand, we can
> also
> > >>>>> share
> > >>>>>>>>> in a
> > >>>>>>>>>>>>>>>>> certain
> > >>>>>>>>>>>>>>>>>>>>>> group of session, for example:
> > >>>>>>>>> LifeCycle.SESSION_GROUP(...), I
> > >>>>>>>>>>>>>>> am
> > >>>>>>>>>>>>>>>>> not
> > >>>>>>>>>>>>>>>>>>>> sure,
> > >>>>>>>>>>>>>>>>>>>>>> just an immature suggestion, for reference only!
> > >>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>> Bests,
> > >>>>>>>>>>>>>>>>>>>>>> Jincheng
> > >>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>> On Fri, Nov 23, 2018 at 1:33 PM Becket Qin <be...@gmail.com> wrote:
> > >>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>> Re: Jincheng,
> > >>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>> Thanks for the feedback. Regarding cache() v.s.
> > >>>>>>>> persist(),
> > >>>>>>>>>>>>>>>>>> personally I
> > >>>>>>>>>>>>>>>>>>>>>>> find cache() to be more accurately describing the
> > >>>>>>>> behavior,
> > >>>>>>>>>>>>>>> i.e.
> > >>>>>>>>>>>>>>>>> the
> > >>>>>>>>>>>>>>>>>>>>>> Table
> > >>>>>>>>>>>>>>>>>>>>>>> is cached for the session, but will be deleted
> > after
> > >>>>> the
> > >>>>>>>>>>>>>>> session
> > >>>>>>>>>>>>>>>> is
> > >>>>>>>>>>>>>>>>>>>>>> closed.
> > >>>>>>>>>>>>>>>>>>>>>>> persist() seems a little misleading as people
> might
> > >>>>> think
> > >>>>>>>>> the
> > >>>>>>>>>>>>>>>> table
> > >>>>>>>>>>>>>>>>>>>> will
> > >>>>>>>>>>>>>>>>>>>>>>> still be there even after the session is gone.
> > >>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>> Great point about mixing the batch and stream
> > >>>>> processing
> > >>>>>>>> in
> > >>>>>>>>>> the
> > >>>>>>>>>>>>>>>>> same
> > >>>>>>>>>>>>>>>>>>>> job.
> > >>>>>>>>>>>>>>>>>>>>>>> We should absolutely move towards that goal. I
> > >> imagine
> > >>>>>>>> that
> > >>>>>>>>>>>>>>> would
> > >>>>>>>>>>>>>>>>> be
> > >>>>>>>>>>>>>>>>>> a
> > >>>>>>>>>>>>>>>>>>>>>> huge
> > >>>>>>>>>>>>>>>>>>>>>>> change across the board, including sources,
> > operators
> > >>>>> and
> > >>>>>>>>>>>>>>>>>>>> optimizations,
> > >>>>>>>>>>>>>>>>>>>>>> to
> > >>>>>>>>>>>>>>>>>>>>>>> name some. Likely we will need several separate
> > >>>>> in-depth
> > >>>>>>>>>>>>>>>>> discussions.
> > >>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>> Thanks,
> > >>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>> Jiangjie (Becket) Qin
> > >>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>> On Fri, Nov 23, 2018 at 5:14 AM Xingcan Cui <
> > >>>>>>>>>>>>>>> xingcanc@gmail.com>
> > >>>>>>>>>>>>>>>>>>>> wrote:
> > >>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>> Hi all,
> > >>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>> @Shaoxuan, I think the lifecycle or access
> domain
> > >> are
> > >>>>>>>> both
> > >>>>>>>>>>>>>>>>>> orthogonal
> > >>>>>>>>>>>>>>>>>>>>>> to
> > >>>>>>>>>>>>>>>>>>>>>>>> the cache problem. Essentially, this may be the
> > >> first
> > >>>>>>>> time
> > >>>>>>>>>> we
> > >>>>>>>>>>>>>>>> plan
> > >>>>>>>>>>>>>>>>>> to
> > >>>>>>>>>>>>>>>>>>>>>>>> introduce another storage mechanism other than
> the
> > >>>>>>>> state.
> > >>>>>>>>>>>>>>> Maybe
> > >>>>>>>>>>>>>>>>> it’s
> > >>>>>>>>>>>>>>>>>>>>>>> better
> > >>>>>>>>>>>>>>>>>>>>>>>> to first draw a big picture and then concentrate
> > on
> > >> a
> > >>>>>>>>>> specific
> > >>>>>>>>>>>>>>>>> part?
> > >>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>> @Becket, yes, actually I am more concerned with
> > the
> > >>>>>>>>>> underlying
> > >>>>>>>>>>>>>>>>>>>> service.
> > >>>>>>>>>>>>>>>>>>>>>>>> This seems to be quite a major change to the
> > >> existing
> > >>>>>>>>>>>>>>> codebase.
> > >>>>>>>>>>>>>>>> As
> > >>>>>>>>>>>>>>>>>> you
> > >>>>>>>>>>>>>>>>>>>>>>>> claimed, the service should be extendible to
> > support
> > >>>>>>>> other
> > >>>>>>>>>>>>>>>>>> components
> > >>>>>>>>>>>>>>>>>>>>>> and
> > >>>>>>>>>>>>>>>>>>>>>>>> we’d better discussed it in another thread.
> > >>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>> All in all, I also eager to enjoy the more
> > >> interactive
> > >>>>>>>>> Table
> > >>>>>>>>>>>>>>>> API,
> > >>>>>>>>>>>>>>>>> in
> > >>>>>>>>>>>>>>>>>>>>>> case
> > >>>>>>>>>>>>>>>>>>>>>>>> of a general and flexible enough service
> > mechanism.
> > >>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>> Best,
> > >>>>>>>>>>>>>>>>>>>>>>>> Xingcan
> > >>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>> On Nov 22, 2018, at 10:16 AM, Xiaowei Jiang <
> > >>>>>>>>>>>>>>>> xiaoweij@gmail.com>
> > >>>>>>>>>>>>>>>>>>>>>>> wrote:
> > >>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>> Relying on a callback for the temp table for
> > clean
> > >> up
> > >>>>>>>> is
> > >>>>>>>>>> not
> > >>>>>>>>>>>>>>>> very
> > >>>>>>>>>>>>>>>>>>>>>>>> reliable.
> > >>>>>>>>>>>>>>>>>>>>>>>>> There is no guarantee that it will be executed
> > >>>>>>>>>> successfully.
> > >>>>>>>>>>>>>>> We
> > >>>>>>>>>>>>>>>>> may
> > >>>>>>>>>>>>>>>>>>>>>>> risk
> > >>>>>>>>>>>>>>>>>>>>>>>>> leaks when that happens. I think that it's
> safer
> > to
> > >>>>>>>> have
> > >>>>>>>>> an
> > >>>>>>>>>>>>>>>>>>>>>> association
> > >>>>>>>>>>>>>>>>>>>>>>>>> between temp table and session id. So we can
> > always
> > >>>>>>>> clean
> > >>>>>>>>>> up
> > >>>>>>>>>>>>>>>> temp
> > >>>>>>>>>>>>>>>>>>>>>>> tables
> > >>>>>>>>>>>>>>>>>>>>>>>>> which are no longer associated with any active
> > >>>>>>>> sessions.
> > >>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>> Regards,
> > >>>>>>>>>>>>>>>>>>>>>>>>> Xiaowei
> > >>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>> On Thu, Nov 22, 2018 at 12:55 PM jincheng sun <
> > >>>>>>>>>>>>>>>>>>>>>>> sunjincheng121@gmail.com>
> > >>>>>>>>>>>>>>>>>>>>>>>>> wrote:
> > >>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>> Hi Jiangjie&Shaoxuan,
> > >>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for initiating this great proposal!
> > >>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>> Interactive Programming is very useful and
> user
> > >>>>>>>> friendly
> > >>>>>>>>>> in
> > >>>>>>>>>>>>>>>> case
> > >>>>>>>>>>>>>>>>>> of
> > >>>>>>>>>>>>>>>>>>>>>>> your
> > >>>>>>>>>>>>>>>>>>>>>>>>>> examples.
> > >>>>>>>>>>>>>>>>>>>>>>>>>> Moreover, especially when a business has to be
> > >>>>>>>> executed
> > >>>>>>>>> in
> > >>>>>>>>>>>>>>>>> several
> > >>>>>>>>>>>>>>>>>>>>>>>> stages
> > >>>>>>>>>>>>>>>>>>>>>>>>>> with dependencies,such as the pipeline of
> Flink
> > >> ML,
> > >>>>> in
> > >>>>>>>>>> order
> > >>>>>>>>>>>>>>>> to
> > >>>>>>>>>>>>>>>>>>>>>>> utilize
> > >>>>>>>>>>>>>>>>>>>>>>>> the
> > >>>>>>>>>>>>>>>>>>>>>>>>>> intermediate calculation results we have to
> > >> submit a
> > >>>>>>>> job
> > >>>>>>>>>> by
> > >>>>>>>>>>>>>>>>>>>>>>>> env.execute().
> > >>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>> About the `cache()`, I think it is better to name it
> > >>>>>>>>>>>>>>>>>>>>>>>>>> `persist()`, and let the Flink framework determine whether we
> > >>>>>>>>>>>>>>>>>>>>>>>>>> internally cache in memory or persist to the storage system.
> > >>>>>>>>>>>>>>>>>>>>>>>>>> Maybe save the data into a state backend
> > >>>>>>>>>>>>>>>>>>>>>>>>>> (MemoryStateBackend or RocksDBStateBackend etc.)
> > >>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>> BTW, from the points of my view in the future,
> > >>>>> support
> > >>>>>>>>> for
> > >>>>>>>>>>>>>>>>>> streaming
> > >>>>>>>>>>>>>>>>>>>>>>> and
> > >>>>>>>>>>>>>>>>>>>>>>>>>> batch mode switching in the same job will also
> > >>>>> benefit
> > >>>>>>>>> in
> > >>>>>>>>>>>>>>>>>>>>>> "Interactive
> > >>>>>>>>>>>>>>>>>>>>>>>>>> Programming",  I am looking forward to your
> > JIRAs
> > >>>>> and
> > >>>>>>>>>> FLIP!
> > >>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>> Best,
> > >>>>>>>>>>>>>>>>>>>>>>>>>> Jincheng
> > >>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>> On Tue, Nov 20, 2018 at 9:56 PM Becket Qin <be...@gmail.com> wrote:
> > >>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>> Hi all,
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>> As a few recent email threads have pointed
> out,
> > >> it
> > >>>>>>>> is a
> > >>>>>>>>>>>>>>>>> promising
> > >>>>>>>>>>>>>>>>>>>>>>>>>>> opportunity to enhance Flink Table API in
> > various
> > >>>>>>>>>> aspects,
> > >>>>>>>>>>>>>>>>>>>>>> including
> > >>>>>>>>>>>>>>>>>>>>>>>>>>> functionality and ease of use among others.
> One
> > >> of
> > >>>>>>>> the
> > >>>>>>>>>>>>>>>>> scenarios
> > >>>>>>>>>>>>>>>>>>>>>>> where
> > >>>>>>>>>>>>>>>>>>>>>>>> we
> > >>>>>>>>>>>>>>>>>>>>>>>>>>> feel Flink could improve is interactive
> > >>>>> programming.
> > >>>>>>>> To
> > >>>>>>>>>>>>>>>> explain
> > >>>>>>>>>>>>>>>>>> the
> > >>>>>>>>>>>>>>>>>>>>>>>>>> issues
> > >>>>>>>>>>>>>>>>>>>>>>>>>>> and facilitate the discussion on the
> solution,
> > we
> > >>>>> put
> > >>>>>>>>>>>>>>>> together
> > >>>>>>>>>>>>>>>>>> the
> > >>>>>>>>>>>>>>>>>>>>>>>>>>> following document with our proposal.
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>
> > >>>>>>
> > >>>>>
> > >>
> >
> https://docs.google.com/document/d/1d4T2zTyfe7hdncEUAxrlNOYr4e5IMNEZLyqSuuswkA0/edit?usp=sharing
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>> Feedback and comments are very welcome!
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks,
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>> Jiangjie (Becket) Qin
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>
> > >>>>>>
> > >>>>>>
> > >>>>>
> > >>>
> > >>>
> > >>
> > >>
> > >>
> >
> >
>

Re: [DISCUSS] Support Interactive Programming in Flink Table API

Posted by Till Rohrmann <tr...@apache.org>.
Another argument for Piotr's point is that lazily changing the properties of a
node affects all downstream consumers but does not necessarily have to
happen before these consumers are defined. From a user's perspective this
can be quite confusing:

b = a.map(...)
c = a.map(...)

a.cache()
d = a.map(...)

Now b, c and d will all consume from the cached operator. In this case, the user
would most likely expect that only d reads from the cached result.
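
For contrast, a minimal sketch of the handle-returning variant under discussion
(the CachedTable type and the returning signature are assumptions, not an
agreed API):

b = a.map(...)        // planned against the original a
c = a.map(...)        // planned against the original a

cachedA = a.cache()   // explicit handle to the cached result
d = cachedA.map(...)  // only d reads from the cache

Here b and c keep their original plans, which matches the expectation above.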

Cheers,
Till

On Mon, Dec 3, 2018 at 11:32 AM Piotr Nowojski <pi...@data-artisans.com>
wrote:

> Hey Shaoxuan and Becket,
>
> > Can you explain a bit more on what the side effects are? So far my
> > understanding is that such side effects only exist if a table is mutable.
> > Is that the case?
>
> Not only that. There are also performance implications, and those are
> another implicit side effect of using `void cache()`. As I wrote before,
> reading from the cache might not always be desirable, thus it can cause
> performance degradation, and I’m fine with that - it is the user's or optimiser’s
> choice. What I do not like is that this implicit side effect can manifest
> in a completely different part of the code that wasn’t touched by the user while
> he was adding the `void cache()` call somewhere else. And even if caching
> improves performance, it’s still a side effect of `void cache()`. Almost
> by definition, `void` methods have only side effects. As I wrote
> before, there are a couple of scenarios where this might be undesirable
> and/or unexpected, for example:
>
> 1.
> Table b = …;
> b.cache()
> x = b.join(…)
> y = b.count()
> // ...
> // 100
> // hundred
> // lines
> // of
> // code
> // later
> z = b.filter(…).groupBy(…) // this might even be hidden in a different
> method/file/package/dependency
>
> 2.
>
> Table b = ...
> if (some_condition) {
>   foo(b)
> }
> else {
>   bar(b)
> }
> z = b.filter(…).groupBy(…)
>
>
> void foo(Table b) {
>   b.cache()
>   // do something with b
> }
>
> In both examples above, `b.cache()` will implicitly affect `z =
> b.filter(…).groupBy(…)` (both the semantics of the program, in case of
> mutable sources, and its performance), which might be far from obvious.
>
> On top of that, there is still this argument of mine that having a
> `MaterializedTable` or `CachedTable` handle gives more flexibility to us for
> the future and to the user (as a manual option to bypass cache reads).
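>
> A rough sketch of that manual bypass (assuming a `CachedTable cache()`
> signature, which is not settled yet):
>
> Table b = …;
> CachedTable cachedB = b.cache();
>
> long fromCache = cachedB.count();  // explicitly reads the cached result
> long fromSource = b.count();       // explicitly re-runs the original plan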
>
> >  But Jiangjie is correct,
> > the source table in batching should be immutable. It is the user’s
> > responsibility to ensure it, otherwise even a regular failover may lead
> > to inconsistent results.
>
> Yes, I agree that’s what a perfect world/good deployment should look like. But it
> often isn’t, and while I’m not trying to fix this (since the proper fix is
> to support transactions), I’m just trying to minimise confusion for the
> users that are not fully aware of what’s going on and operate in a less than
> perfect setup. And if something bites them after adding a `b.cache()` call,
> I want to make sure that they at least know all of the places that adding this
> line can affect.
>
> Thanks, Piotrek
>
> > On 1 Dec 2018, at 15:39, Becket Qin <be...@gmail.com> wrote:
> >
> > Hi Piotrek,
> >
> > Thanks again for the clarification. Some more replies follow below.
> >
> > But keep in mind that `.cache()` will/might not only be used in
> interactive
> >> programming and not only in batching.
> >
> > It is true. Actually in stream processing, cache() has the same semantics
> > as batch processing. The semantics are the following:
> > For a table created via a series of computations, save that table for
> > later reference to avoid re-running the computation logic to regenerate the
> > table. Once the application exits, drop all the caches.
> > These semantics are the same for both batch and stream processing. The
> > difference is that stream applications will only run once as they are long
> > running. And batch applications may be run multiple times, hence the cache
> > may be created and dropped each time the application runs.
> > Admittedly, there will probably be some resource management requirements
> > for the streaming cached table, such as time-based / size-based retention,
> > to address the infinite data issue. But such requirements do not change
> > the semantics.
> > You are right that interactive programming is just one use case of cache().
> > It is not the only use case.
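> >
> > To make that semantic concrete, a minimal sketch (assuming the proposed
> > `void cache()`; the table names and count() calls are illustrative):
> >
> > Table t = tEnv.scan("src").groupBy("k").select("k, v.sum as total");
> > t.cache();          // hint: keep the result of t once it is computed
> > long a = t.count(); // triggers a job and populates the cache
> > long b = t.count(); // served from the cache, t is not recomputed
> > // when the application exits, the cached table is dropped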
> >
> > For me the more important issue is that of not having the `void cache()`
> >> with side effects.
> >
> > This is indeed the key point. The argument around whether cache() should
> > return something already indicates that cache() and materialize() address
> > different issues.
> > Can you explain a bit more on what the side effects are? So far my
> > understanding is that such side effects only exist if a table is mutable.
> > Is that the case?
> >
> > I don’t know, probably initially we should make CachedTable read-only. I
> >> don’t find it more confusing than the fact that a user cannot write to
> >> views or materialised views in SQL or that a user currently cannot write
> >> to a Table.
> >
> > I don't think anyone should insert something into a cache. By definition
> > the cache should only be updated when the corresponding original table is
> > updated. What I am wondering about is this - given the following two facts:
> > 1. If and only if a table is mutable (with something like insert()), a
> > CachedTable may have implicit behavior.
> > 2. A CachedTable extends a Table.
> > We can come to the conclusion that a CachedTable is mutable and users can
> > insert into the CachedTable directly. This is what I found confusing.
> >
> > Thanks,
> >
> > Jiangjie (Becket) Qin
> >
> > On Sat, Dec 1, 2018 at 2:45 AM Piotr Nowojski <pi...@data-artisans.com>
> > wrote:
> >
> >> Hi all,
> >>
> >> Regarding naming `cache()` vs `materialize()`. One more explanation why I
> >> think `materialize()` is more natural to me is that I think of all “Table”s
> >> in the Table API as views. They behave the same way as SQL views; the only
> >> difference for me is that their life scope is short - the current session,
> >> which is limited by a different execution model. That’s why “caching” a view
> >> for me is just materialising it.
> >>
> >> However, I see and understand your point of view. Coming from
> >> DataSet/DataStream and, generally speaking, the non-SQL world, `cache()` is
> >> more natural. But keep in mind that `.cache()` will/might not only be used in
> >> interactive programming and not only in batching. But naming is one issue,
> >> and not that critical to me. Especially since once we implement proper
> >> materialised views, we can always deprecate/rename `cache()` if we deem it
> >> necessary.
> >>
> >>
> >> For me the more important issue is that of not having the `void cache()`
> >> with side effects. Exactly for the reasons that you have mentioned. True:
> >> results might be non-deterministic if the underlying source tables are
> >> changing. The problem is that `void cache()` implicitly changes the semantics
> >> of subsequent uses of the cached/materialized Table. It can cause a “wtf”
> >> moment for a user if he inserts a “b.cache()” call in some place in his code
> >> and suddenly some other random places are behaving differently. If
> >> `materialize()` or `cache()` returns a Table handle, we force the user to
> >> explicitly use the cache, which removes the “random” part from the
> >> "suddenly some other random places are behaving differently”.
> >>
> >> This argument and others that I’ve raised (greater flexibility/allowing
> >> the user to explicitly bypass the cache) are independent of the `cache()`
> >> vs `materialize()` discussion.
> >>
> >>> Does that mean one can also insert into the CachedTable? This sounds
> >> pretty confusing.
> >>
> >> I don’t know, probably initially we should make CachedTable read-only. I
> >> don’t find it more confusing than the fact that a user cannot write to
> >> views or materialised views in SQL or that a user currently cannot write
> >> to a Table.
> >>
> >> Piotrek
> >>
> >>> On 30 Nov 2018, at 17:38, Xingcan Cui <xi...@gmail.com> wrote:
> >>>
> >>> Hi all,
> >>>
> >>> I agree with @Becket that `cache()` and `materialize()` should be
> >> considered as two different methods where the latter one is more
> >> sophisticated.
> >>>
> >>> According to my understanding, the initial idea is just to introduce a
> >> simple cache or persist mechanism, but as the Table API is a high-level
> >> API, it’s natural for us to think in a SQL way.
> >>>
> >>> Maybe we can add the `cache()` method to the DataSet API and force
> >> users to translate a Table to a DataSet before caching it. Then the users
> >> should manually register the cached dataset as a table again (we may need some
> >> table replacement mechanisms for datasets with an identical schema but
> >> different contents here). After all, it’s the dataset rather than the
> >> dynamic table that needs to be cached, right?
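> >>
> >> A rough sketch of that workflow (toDataSet and registerDataSet exist in
> >> the batch Table API; DataSet.cache() is hypothetical here):
> >>
> >> Table t = tEnv.scan("src").select("a, b");
> >> DataSet<Row> ds = tEnv.toDataSet(t, Row.class);  // Table -> DataSet
> >> DataSet<Row> cached = ds.cache();                // hypothetical cache()
> >> tEnv.registerDataSet("srcCached", cached);       // re-register as a table
> >> Table t2 = tEnv.scan("srcCached");               // later queries read the cache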
> >>>
> >>> Best,
> >>> Xingcan
> >>>
> >>>> On Nov 30, 2018, at 10:57 AM, Becket Qin <be...@gmail.com>
> wrote:
> >>>>
> >>>> Hi Piotrek and Jark,
> >>>>
> >>>> Thanks for the feedback and explanation. Those are good arguments. But I
> >>>> think those arguments are mostly about the materialized view. Let me try
> >>>> to explain the reason I believe cache() and materialize() are different.
> >>>>
> >>>> I think cache() and materialize() have quite different implications. An
> >>>> analogy I can think of is save()/publish(). When users call cache(), it
> >>>> is just like they are saving an intermediate result as a draft of their
> >>>> work; this intermediate result may not have any realistic meaning.
> >>>> Calling cache() does not mean users want to publish the cached table in
> >>>> any manner. But when users call materialize(), that means "I have
> >>>> something meaningful to be reused by others"; now users need to think
> >>>> about the validation, update & versioning, lifecycle of the result, etc.
> >>>>
> >>>> Piotrek's suggestions on variations of the materialize() methods are
> >>>> very useful. It would be great if Flink had them. The concept of a
> >>>> materialized view is actually a pretty big feature, not to mention the
> >>>> related stuff like the triggers/hooks you mentioned earlier. I think the
> >>>> materialized view itself should be discussed in a more thorough and
> >>>> systematic manner. And I found that discussion is kind of orthogonal to,
> >>>> and way beyond, the interactive programming experience.
> >>>>
> >>>> The example you gave was interesting. I still have some questions,
> >> though.
> >>>>
> >>>> Table source = … // some source that scans files from a directory
> >>>>> “/foo/bar/“
> >>>>> Table t1 = source.groupBy(…).select(…).where(…) ….;
> >>>>> Table t2 = t1.materialize() // (or `cache()`)
> >>>>
> >>>> t2.count() // initialise cache (if it’s lazily initialised)
> >>>>> int a1 = t1.count()
> >>>>> int b1 = t2.count()
> >>>>> // something in the background (or we trigger it) writes new files to
> >>>>> /foo/bar
> >>>>> int a2 = t1.count()
> >>>>> int b2 = t2.count()
> >>>>> t2.refresh() // possible future extension, not to be implemented in
> the
> >>>>> initial version
> >>>>>
> >>>>
> >>>> What if someone else added some more files to /foo/bar at this point?
> >>>> In that case, a3 won't equal b3, and the result becomes
> >>>> non-deterministic, right?
> >>>>
> >>>> int a3 = t1.count()
> >>>>> int b3 = t2.count()
> >>>>> t2.drop() // another possible future extension, manual “cache”
> dropping
> >>>>
> >>>>
> >>>> When we talk about interactive programming, in most cases, we are
> >>>> talking about batch applications. A fundamental assumption of such a
> >>>> case is that the source data is complete before the data processing
> >>>> begins, and the data will not change during the data processing. IMO, if
> >>>> additional rows need to be added to some source during the processing,
> >>>> it should be done in ways like unioning the source with another table
> >>>> containing the rows to be added.
> >>>>
> >>>> There are a few cases where computations are executed repeatedly on a
> >>>> changing data source.
> >>>>
> >>>> For example, people may run an ML training job every hour with the
> >>>> samples newly added in the past hour. In that case, the source data
> >>>> between runs will indeed change. But still, the data remains unchanged
> >>>> within one run. And usually in that case, the result will need
> >>>> versioning, i.e. for a given result, it tells that the result was derived
> >>>> from the source data as of a certain timestamp.
> >>>>
> >>>> Another example is something like a data warehouse. In this case, there
> >>>> are a few sources of original/raw data. On top of those sources, many
> >>>> materialized views / queries / reports / dashboards can be created to
> >>>> generate derived data. That derived data needs to be updated when the
> >>>> underlying original data changes. In that case, the processing logic that
> >>>> derives data from the original data needs to be executed repeatedly to
> >>>> update those reports/views. Again, all that derived data also needs
> >>>> version management, such as a timestamp.
> >>>>
> >>>> In any of the above two cases, during a single run of the processing
> >>>> logic, the data cannot change. Otherwise the behavior of the processing
> >>>> logic may be undefined. In the above two examples, when writing the
> >>>> processing logic, users can use .cache() to hint to Flink that those
> >>>> results should be saved to avoid repeated computation. And then for the
> >>>> result of my application logic, I'll call materialize(), so that these
> >>>> results could be managed by the system with versioning, metadata
> >>>> management, lifecycle management, ACLs, etc.
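> >>>>
> >>>> A sketch of that division of labor (materialize() is the API under
> >>>> discussion; extractFeatures/train/evaluate are illustrative helpers):
> >>>>
> >>>> Table samples = tEnv.scan("samplesLastHour");
> >>>> Table features = extractFeatures(samples);
> >>>> features.cache();              // draft: reused within this run only
> >>>> Table model = train(features); // reads the cache instead of recomputing
> >>>> Table report = evaluate(features, model);
> >>>> report.materialize();          // published: versioned and managed by the system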
> >>>>
> >>>> It is true we can use materialize() to do the cache() job, but I am
> >>>> really reluctant to shoehorn cache() into materialize() and force users
> >>>> to worry about a bunch of implications that they shouldn't have to. I am
> >>>> absolutely on your side that a redundant API is bad. But it is equally
> >>>> frustrating, if not more so, that the same API does different things.
> >>>>
> >>>> Thanks,
> >>>>
> >>>> Jiangjie (Becket) Qin
> >>>>
> >>>>
> >>>> On Fri, Nov 30, 2018 at 10:34 PM Shaoxuan Wang <ws...@gmail.com>
> >> wrote:
> >>>>
> >>>>> Thanks Piotrek,
> >>>>> You provided a very good example; it clears up all the confusion I
> >>>>> had. It is clear that there is something we have not considered in the
> >>>>> initial proposal. We intend to force the user to reuse the cached/materialized
> >>>>> table, if its cache() method is executed. We did not expect that a user
> >>>>> may want to re-execute the plan from the source table. Let me re-think
> >>>>> about it and get back to you later.
> >>>>>
> >>>>> In the meantime, this example/observation also implies that we cannot
> >>>>> fully involve the optimizer to decide the plan if a cache/materialize is
> >>>>> explicitly used, because whether to reuse the cached data or re-execute
> >>>>> the query from the source data may lead to different results. (But I guess
> >>>>> the optimizer can still help in some cases -- as long as it does not
> >>>>> re-execute from the varied source, we should be safe).
> >>>>>
> >>>>> Regards,
> >>>>> Shaoxuan
> >>>>>
> >>>>>
> >>>>>
> >>>>> On Fri, Nov 30, 2018 at 9:13 PM Piotr Nowojski <
> >> piotr@data-artisans.com>
> >>>>> wrote:
> >>>>>
> >>>>>> Hi Shaoxuan,
> >>>>>>
> >>>>>> Re 2:
> >>>>>>
> >>>>>>> Table t3 = methodThatAppliesOperators(t1) // t1 is modified to->
> t1’
> >>>>>>
> >>>>>> What do you mean that “ t1 is modified to-> t1’ ” ? That
> >>>>>> `methodThatAppliesOperators()` method has changed its plan?
> >>>>>>
> >>>>>> I was thinking more about something like this:
> >>>>>>
> >>>>>> Table source = … // some source that scans files from a directory
> >>>>>> “/foo/bar/“
> >>>>>> Table t1 = source.groupBy(…).select(…).where(…) ….;
> >>>>>> Table t2 = t1.materialize() // (or `cache()`)
> >>>>>>
> >>>>>> t2.count() // initialise cache (if it’s lazily initialised)
> >>>>>>
> >>>>>> int a1 = t1.count()
> >>>>>> int b1 = t2.count()
> >>>>>>
> >>>>>> // something in the background (or we trigger it) writes new files
> to
> >>>>>> /foo/bar
> >>>>>>
> >>>>>> int a2 = t1.count()
> >>>>>> int b2 = t2.count()
> >>>>>>
> >>>>>> t2.refresh() // possible future extension, not to be implemented in
> >> the
> >>>>>> initial version
> >>>>>>
> >>>>>> int a3 = t1.count()
> >>>>>> int b3 = t2.count()
> >>>>>>
> >>>>>> t2.drop() // another possible future extension, manual “cache”
> >> dropping
> >>>>>>
> >>>>>> assertTrue(a1 == b1) // same results, but b1 comes from the “cache"
> >>>>>> assertTrue(b1 == b2) // both values come from the same cache
> >>>>>> assertTrue(a2 > b2)  // b2 comes from the cache, a2 re-executed a full
> >>>>>> table scan and has more data
> >>>>>> assertTrue(b3 > b2)  // b3 comes from the refreshed cache
> >>>>>> assertTrue(b3 == a2 && a2 == a3)
> >>>>>>
> >>>>>> Piotrek
> >>>>>>
> >>>>>>> On 30 Nov 2018, at 10:22, Jark Wu <im...@gmail.com> wrote:
> >>>>>>>
> >>>>>>> Hi,
> >>>>>>>
> >>>>>>> It is a very interesting and useful design!
> >>>>>>>
> >>>>>>> Here I want to share some of my thoughts:
> >>>>>>>
> >>>>>>> 1. Agree that the cache() method should return some Table to avoid
> >>>>>>> some unexpected problems caused by the mutable object.
> >>>>>>> All the existing methods of Table return a new Table instance.
> >>>>>>>
> >>>>>>> 2. I think materialize() would be more consistent with SQL; this
> >>>>>>> makes it possible to support the same feature for SQL (materialized
> >>>>>>> view) and keep the same API for users in the future.
> >>>>>>> But I'm also fine if we choose cache().
> >>>>>>>
> >>>>>>> 3. In the proposal, a TableService (or FlinkService?) is used to
> >>>>>>> cache the result of the (intermediate) table.
> >>>>>>> But the name TableService may be a bit too general and might not be
> >>>>>>> understood correctly at first glance (a metastore for tables?).
> >>>>>>> Maybe a more specific name would be better, such as TableCacheService
> >>>>>>> or TableMaterializeService or something else.
> >>>>>>>
> >>>>>>> Best,
> >>>>>>> Jark
> >>>>>>>
> >>>>>>>
> >>>>>>> On Thu, 29 Nov 2018 at 21:16, Fabian Hueske <fh...@gmail.com>
> >> wrote:
> >>>>>>>
> >>>>>>>> Hi,
> >>>>>>>>
> >>>>>>>> Thanks for the clarification Becket!
> >>>>>>>>
> >>>>>>>> I have a few thoughts to share / questions:
> >>>>>>>>
> >>>>>>>> 1) I'd like to know how you plan to implement the feature on a
> plan
> >> /
> >>>>>>>> planner level.
> >>>>>>>>
> >>>>>>>> I would imagine the following to happen when Table.cache() is
> >> called:
> >>>>>>>>
> >>>>>>>> 1) immediately optimize the Table and internally convert it into a
> >>>>>>>> DataSet/DataStream. This is necessary, to avoid that operators of
> >>>>> later
> >>>>>>>> queries on top of the Table are pushed down.
> >>>>>>>> 2) register the DataSet/DataStream as a DataSet/DataStream-backed
> >>>>> Table
> >>>>>> X
> >>>>>>>> 3) add a sink to the DataSet/DataStream. This is the
> materialization
> >>>>> of
> >>>>>> the
> >>>>>>>> Table X
> >>>>>>>>
> >>>>>>>> Based on your proposal the following would happen:
> >>>>>>>>
> >>>>>>>> Table t1 = ....
> >>>>>>>> t1.cache(); // cache() returns void. The logical plan of t1 is
> >>>>> replaced
> >>>>>> by
> >>>>>>>> a scan of X. There is also a reference to the materialization of
> X.
> >>>>>>>>
> >>>>>>>> t1.count(); // this executes the program, including the
> >>>>>> DataSet/DataStream
> >>>>>>>> that backs X and the sink that writes the materialization of X
> >>>>>>>> t1.count(); // this executes the program, but reads X from the
> >>>>>>>> materialization.
> >>>>>>>>
> >>>>>>>> My question is, how do you determine when the scan of t1 should
> >>>>>>>> go against the DataSet/DataStream program and when against the
> >>>>>>>> materialization?
> >>>>>>>> AFAIK, there is no hook that will tell you that a part of the
> >> program
> >>>>>> was
> >>>>>>>> executed. Flipping a switch during optimization or plan generation
> >> is
> >>>>>> not
> >>>>>>>> sufficient as there is no guarantee that the plan is also
> executed.
> >>>>>>>>
> >>>>>>>> Overall, this behavior is somewhat similar to what I proposed in
> >>>>>>>> FLINK-8950, which does not include persisting the table, but just
> >>>>>>>> optimizing and reregistering it as a DataSet/DataStream scan.
> >>>>>>>>
> >>>>>>>> 2) I think Piotr has a point about the implicit behavior and side
> >>>>>> effects
> >>>>>>>> of the cache() method if it does not return anything.
> >>>>>>>> Consider the following example:
> >>>>>>>>
> >>>>>>>> Table t1 = ???
> >>>>>>>> Table t2 = methodThatAppliesOperators(t1);
> >>>>>>>> Table t3 = methodThatAppliesOtherOperators(t1);
> >>>>>>>>
> >>>>>>>> In this case, the behavior/performance of the plan that results
> from
> >>>>> the
> >>>>>>>> second method call depends on whether t1 was modified by the first
> >>>>>> method
> >>>>>>>> or not.
> >>>>>>>> This is the classic issue of mutable vs. immutable objects.
> >>>>>>>> Also, as Piotr pointed out, it might also be good to have the
> >>>>>>>> original plan of t1, because in some cases it is possible to push
> >>>>>>>> filters down such that evaluating the query from scratch might be
> >>>>>>>> more efficient than accessing the cache.
> >>>>>>>> Moreover, a CachedTable could extend Table and offer a refresh()
> >>>>>>>> method. This sounds quite useful in an interactive session mode.
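> >>>>>>>>
> >>>>>>>> A speculative shape of such a handle (all names illustrative):
> >>>>>>>>
> >>>>>>>> interface CachedTable extends Table {
> >>>>>>>>     void refresh(); // re-run the original plan, replace the cached data
> >>>>>>>> }
> >>>>>>>>
> >>>>>>>> CachedTable ct = t1.cache();
> >>>>>>>> // ... source data changes during the interactive session ...
> >>>>>>>> ct.refresh();   // bring the cached result up to date on demand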
> >>>>>>>>
> >>>>>>>> 3) Regarding the name, I can see both arguments. IMO,
> materialize()
> >>>>>> seems
> >>>>>>>> to be more future-proof.
> >>>>>>>>
> >>>>>>>> Best, Fabian
> >>>>>>>>
> >>>>>>>> On Thu, Nov 29, 2018 at 12:56 PM Shaoxuan Wang <wshaoxuan@gmail.com> wrote:
> >>>>>>>>
> >>>>>>>>> Hi Piotr,
> >>>>>>>>>
> >>>>>>>>> Thanks for sharing your ideas on the method naming. We will think
> >>>>> about
> >>>>>>>>> your suggestions. But I don't understand why we need to change
> the
> >>>>>> return
> >>>>>>>>> type of cache().
> >>>>>>>>>
> >>>>>>>>> cache() is a physical operation; it does not change the logic of
> >>>>>>>>> the `Table`. On the Table API layer, we should not introduce a new
> >>>>>>>>> table type unless the logic of the table has been changed. If we
> >>>>>>>>> introduce a new table type `CachedTable`, we need to create the same
> >>>>>>>>> set of methods as `Table` for it. I don't think it is worth doing
> >>>>>>>>> this. Or can you please elaborate more on what could be the
> >>>>>>>>> "implicit behaviours/side effects" you are thinking about?
> >>>>>>>>>
> >>>>>>>>> Regards,
> >>>>>>>>> Shaoxuan
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> On Thu, Nov 29, 2018 at 7:05 PM Piotr Nowojski <
> >>>>>> piotr@data-artisans.com>
> >>>>>>>>> wrote:
> >>>>>>>>>
> >>>>>>>>>> Hi Becket,
> >>>>>>>>>>
> >>>>>>>>>> Thanks for the response.
> >>>>>>>>>>
> >>>>>>>>>> 1. I wasn’t saying that a materialised view must be mutable or
> >>>>>>>>>> not. The same thing applies to caches as well. On the contrary, I
> >>>>>>>>>> would expect more consistency and updates from something that is
> >>>>>>>>>> called a “cache” vs something that’s a “materialised view”. In other
> >>>>>>>>>> words, IMO most caches do not serve you invalid/outdated data and
> >>>>>>>>>> they handle updates on their own.
> >>>>>>>>>>
> >>>>>>>>>> 2. I don’t think that having in the future two very similar
> >> concepts
> >>>>>> of
> >>>>>>>>>> `materialized` view and `cache` is a good idea. It would be
> >>>>> confusing
> >>>>>>>> for
> >>>>>>>>>> the users. I think it could be handled by variations/overloading
> >>>>>>>>>> of the materialised view concept. We could start with:
> >>>>>>>>>>
> >>>>>>>>>> `MaterializedTable materialize()` - immutable, session life
> >>>>>>>>>> scope (basically the same semantics as you are proposing)
> >>>>>>>>>>
> >>>>>>>>>> And then in the future (if ever) build on top of that/expand it
> >>>>> with:
> >>>>>>>>>>
> >>>>>>>>>> `MaterializedTable materialize(refreshTime=…)` or
> >> `MaterializedTable
> >>>>>>>>>> materialize(refreshHook=…)`
> >>>>>>>>>>
> >>>>>>>>>> Or with cross session support:
> >>>>>>>>>>
> >>>>>>>>>> `MaterializedTable materializeInto(connector=…)` or
> >>>>> `MaterializedTable
> >>>>>>>>>> materializeInto(tableFactory=…)`
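> >>>>>>>>>>
> >>>>>>>>>> In code form, those variants could look roughly like this (all of
> >>>>>>>>>> it speculative, mirroring the signatures above):
> >>>>>>>>>>
> >>>>>>>>>> MaterializedTable materialize();                          // immutable, session scope
> >>>>>>>>>> MaterializedTable materialize(Duration refreshTime);      // periodic refresh
> >>>>>>>>>> MaterializedTable materializeInto(TableFactory factory);  // cross-session storage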
> >>>>>>>>>>
> >>>>>>>>>> I’m not saying that we should implement cross-session/refreshing
> >>>>>>>>>> now or even in the near future. I’m just arguing that naming the
> >>>>>>>>>> current immutable session-life-scope method `materialize()` is more
> >>>>>>>>>> future-proof and more consistent with SQL (on which, after all,
> >>>>>>>>>> the Table API is heavily based).
> >>>>>>>>>>
> >>>>>>>>>> 3. Even if we agree on naming it `cache()`, I would still insist
> >>>>>>>>>> on `cache()` returning a `CachedTable` handle to avoid implicit
> >>>>>>>>>> behaviours/side effects and to give both us & users more flexibility.
> >>>>>>>>>>
> >>>>>>>>>> Piotrek
> >>>>>>>>>>
> >>>>>>>>>>> On 29 Nov 2018, at 06:20, Becket Qin <be...@gmail.com>
> >> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>> Just to add a little bit, the materialized view is probably
> more
> >>>>>>>>> similar
> >>>>>>>>>> to
> >>>>>>>>>>> the persistent() brought up earlier in the thread. So it is
> >> usually
> >>>>>>>>> cross
> >>>>>>>>>>> session and could be used in a larger scope. For example, a
> >>>>>>>>> materialized
> >>>>>>>>>>> view created by user A may be visible to user B. It is probably
> >>>>>>>>> something
> >>>>>>>>>>> we want to have in the future. I'll put it in the future work
> >>>>>>>> section.
> >>>>>>>>>>>
> >>>>>>>>>>> Thanks,
> >>>>>>>>>>>
> >>>>>>>>>>> Jiangjie (Becket) Qin
> >>>>>>>>>>>
> >>>>>>>>>>> On Thu, Nov 29, 2018 at 9:47 AM Becket Qin <
> becket.qin@gmail.com
> >>>
> >>>>>>>>> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>> Hi Piotrek,
> >>>>>>>>>>>>
> >>>>>>>>>>>> Thanks for the explanation.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Right now we are mostly thinking of the cached table as
> >>>>> immutable. I
> >>>>>>>>> can
> >>>>>>>>>>>> see the Materialized view would be useful in the future. That
> >>>>> said,
> >>>>>>>> I
> >>>>>>>>>> think
> >>>>>>>>>>>> a simple cache mechanism is probably still needed. So to me,
> >>>>> cache()
> >>>>>>>>> and
> >>>>>>>>>>>> materialize() should be two separate method as they address
> >>>>>>>> different
> >>>>>>>>>>>> needs. Materialize() is a higher level concept usually
> implying
> >>>>>>>>>> periodical
> >>>>>>>>>>>> update, while cache() has much simpler semantic. For example,
> >> one
> >>>>>>>> may
> >>>>>>>>>>>> create a materialized view and use cache() method in the
> >>>>>>>> materialized
> >>>>>>>>>> view
> >>>>>>>>>>>> creation logic. So that during the materialized view update,
> >> they
> >>>>> do
> >>>>>>>>> not
> >>>>>>>>>>>> need to worry about the case that the cached table is also
> >>>>> changed.
> >>>>>>>>>> Maybe
> >>>>>>>>>>>> under the hood, materialized() and cache() could share some
> >>>>>>>> mechanism,
> >>>>>>>>>> but
> >>>>>>>>>>>> I think a simple cache() method would be handy in a lot of
> >> cases.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>
> >>>>>>>>>>>> Jiangjie (Becket) Qin
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Mon, Nov 26, 2018 at 9:38 PM Piotr Nowojski <
> >>>>>>>>> piotr@data-artisans.com
> >>>>>>>>>>>
> >>>>>>>>>>>> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>>> Hi Becket,
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> Is there any extra thing user can do on a MaterializedTable
> >> that
> >>>>>>>>> they
> >>>>>>>>>>>>> cannot do on a Table?
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Maybe not in the initial implementation, but various DBs
> offer
> >>>>>>>>>> different
> >>>>>>>>>>>>> ways to “refresh” the materialised view. Hooks, triggers,
> >> timers,
> >>>>>>>>>> manually
> >>>>>>>>>>>>> etc. Having `MaterializedTable` would help us to handle that
> in
> >>>>> the
> >>>>>>>>>> future.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> After users call *table.cache(), *users can just use that
> >> table
> >>>>>>>> and
> >>>>>>>>> do
> >>>>>>>>>>>>> anything that is supported on a Table, including SQL.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> This is some implicit behaviour with side effects. Imagine if
> >>>>> user
> >>>>>>>>> has
> >>>>>>>>>> a
> >>>>>>>>>>>>> long and complicated program, that touches table `b` multiple
> >>>>>>>> times,
> >>>>>>>>>> maybe
> >>>>>>>>>>>>> scattered around different methods. If he modifies his
> program
> >> by
> >>>>>>>>>> inserting
> >>>>>>>>>>>>> in one place
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> b.cache()
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> This implicitly alters the semantic and behaviour of his code
> >> all
> >>>>>>>>> over
> >>>>>>>>>>>>> the place, maybe in a ways that might cause problems. For
> >> example
> >>>>>>>>> what
> >>>>>>>>>> if
> >>>>>>>>>>>>> underlying data is changing?
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Having invisible side effects is also not very clean, for
> >> example
> >>>>>>>>> think
> >>>>>>>>>>>>> about something like this (but more complicated):
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Table b = ...;
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> If (some_condition) {
> >>>>>>>>>>>>> processTable1(b)
> >>>>>>>>>>>>> }
> >>>>>>>>>>>>> else {
> >>>>>>>>>>>>> processTable2(b)
> >>>>>>>>>>>>> }
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> // do more stuff with b
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> And user adds `b.cache()` call to only one of the
> >> `processTable1`
> >>>>>>>> or
> >>>>>>>>>>>>> `processTable2` methods.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> On the other hand
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Table materialisedB = b.materialize()
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Avoids (at least some of) the side effect issues and forces
> >> user
> >>>>> to
> >>>>>>>>>>>>> explicitly use `materialisedB` where it’s appropriate and
> >> forces
> >>>>>>>> user
> >>>>>>>>>> to
> >>>>>>>>>>>>> think what does it actually mean. And if something doesn’t
> work
> >>>>> in
> >>>>>>>>> the
> >>>>>>>>>> end
> >>>>>>>>>>>>> for the user, he will know what has he changed instead of
> >> blaming
> >>>>>>>>>> Flink for
> >>>>>>>>>>>>> some “magic” underneath. In the above example, after
> >>>>> materialising
> >>>>>>>> b
> >>>>>>>>> in
> >>>>>>>>>>>>> only one of the methods, he should/would realise about the
> >> issue
> >>>>>>>> when
> >>>>>>>>>>>>> handling the return value `MaterializedTable` of that method.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> I guess it comes down to personal preferences if you like
> >> things
> >>>>> to
> >>>>>>>>> be
> >>>>>>>>>>>>> implicit or not. The more of a power user the user is, probably the
> more
> >>>>>>>> likely
> >>>>>>>>>> he is
> >>>>>>>>>>>>> to like/understand implicit behaviour. And we as Table API
> >>>>>>>> designers
> >>>>>>>>>> are
> >>>>>>>>>>>>> the most power users out there, so I would proceed with
> caution
> >>>>> (so
> >>>>>>>>>> that we
> >>>>>>>>>>>>> do not end up in the crazy perl realm with its lovely implicit
> >>>>>>>>>>>>> method arguments ;)  <https://stackoverflow.com/a/14922656/8149051>)
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> Table API to also support non-relational processing cases,
> >>>>> cache()
> >>>>>>>>>>>>> might be slightly better.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> I think even such extended Table API could benefit from
> >> sticking
> >>>>>>>>>> to/being
> >>>>>>>>>>>>> consistent with SQL where both SQL and Table API are
> basically
> >>>>> the
> >>>>>>>>>> same.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> One more thing. `MaterializedTable materialize()` could be
> more
> >>>>>>>>>>>>> powerful/flexible allowing the user to operate both on
> >>>>> materialised
> >>>>>>>>>> and not
> >>>>>>>>>>>>> materialised view at the same time for whatever reasons
> >>>>> (underlying
> >>>>>>>>>> data
> >>>>>>>>>>>>> changing/better optimisation opportunities after pushing down
> >>>>> more
> >>>>>>>>>> filters
> >>>>>>>>>>>>> etc). For example:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Table b = …;
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> MaterializedTable mb = b.materialize();
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> val min = mb.min();
> >>>>>>>>>>>>> val max = mb.max();
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> val user42 = b.filter('userId = 42);
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Could be more efficient compared to `b.cache()` if
> >>>>> `filter(‘userId
> >>>>>>>> =
> >>>>>>>>>>>>> 42);` allows for much more aggressive optimisations.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Piotrek
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> On 26 Nov 2018, at 12:14, Fabian Hueske <fh...@gmail.com>
> >>>>>>>> wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> I'm not suggesting to add support for Ignite. This was just
> an
> >>>>>>>>>> example.
> >>>>>>>>>>>>>> Plasma and Arrow sound interesting, too.
> >>>>>>>>>>>>>> For the sake of this proposal, it would be up to the user to
> >>>>>>>>>> implement a
> >>>>>>>>>>>>>> TableFactory and corresponding TableSource / TableSink
> classes
> >>>>> to
> >>>>>>>>>>>>> persist
> >>>>>>>>>>>>>> and read the data.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Am Mo., 26. Nov. 2018 um 12:06 Uhr schrieb Flavio
> Pompermaier
> >> <
> >>>>>>>>>>>>>> pompermaier@okkam.it>:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> What about to add also Apache Plasma + Arrow as an
> >> alternative
> >>>>> to
> >>>>>>>>>>>>> Apache
> >>>>>>>>>>>>>>> Ignite?
> >>>>>>>>>>>>>>> [1]
> >> https://arrow.apache.org/blog/2017/08/08/plasma-in-memory-object-store/
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> On Mon, Nov 26, 2018 at 11:56 AM Fabian Hueske <
> >>>>>>>> fhueske@gmail.com>
> >>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Hi,
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Thanks for the proposal!
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> To summarize, you propose a new method Table.cache():
> Table
> >>>>> that
> >>>>>>>>>> will
> >>>>>>>>>>>>>>>> trigger a job and write the result into some temporary
> >> storage
> >>>>>>>> as
> >>>>>>>>>>>>> defined
> >>>>>>>>>>>>>>>> by a TableFactory.
> >>>>>>>>>>>>>>>> The cache() call blocks while the job is running and
> >>>>> eventually
> >>>>>>>>>>>>> returns a
> >>>>>>>>>>>>>>>> Table object that represents a scan of the temporary
> table.
> >>>>>>>>>>>>>>>> When the "session" is closed (closing to be defined?), the
> >>>>>>>>> temporary
> >>>>>>>>>>>>>>> tables
> >>>>>>>>>>>>>>>> are all dropped.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> I think this behavior makes sense and is a good first step
> >>>>>>>> towards
> >>>>>>>>>>>>> more
> >>>>>>>>>>>>>>>> interactive workloads.
> >>>>>>>>>>>>>>>> However, its performance suffers from writing to and
> reading
> >>>>>>>> from
> >>>>>>>>>>>>>>> external
> >>>>>>>>>>>>>>>> systems.
> >>>>>>>>>>>>>>>> I think this is OK for now. Changes that would
> significantly
> >>>>>>>>> improve
> >>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>> situation (i.e., pinning data in-memory across jobs) would
> >>>>> have
> >>>>>>>>>> large
> >>>>>>>>>>>>>>>> impacts on many components of Flink.
> >>>>>>>>>>>>>>>> Users could use in-memory filesystems or storage grids
> >> (Apache
> >>>>>>>>>>>>> Ignite) to
> >>>>>>>>>>>>>>>> mitigate some of the performance effects.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Best, Fabian
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Am Mo., 26. Nov. 2018 um 03:38 Uhr schrieb Becket Qin <
> >>>>>>>>>>>>>>>> becket.qin@gmail.com
> >>>>>>>>>>>>>>>>> :
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Thanks for the explanation, Piotrek.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Is there any extra thing user can do on a
> MaterializedTable
> >>>>>>>> that
> >>>>>>>>>> they
> >>>>>>>>>>>>>>>>> cannot do on a Table? After users call *table.cache(),
> >> *users
> >>>>>>>> can
> >>>>>>>>>>>>> just
> >>>>>>>>>>>>>>>> use
> >>>>>>>>>>>>>>>>> that table and do anything that is supported on a Table,
> >>>>>>>>> including
> >>>>>>>>>>>>> SQL.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Naming wise, either cache() or materialize() sounds fine
> to
> >>>>> me.
> >>>>>>>>>>>>> cache()
> >>>>>>>>>>>>>>>> is
> >>>>>>>>>>>>>>>>> a bit more general than materialize(). Given that we are
> >>>>>>>>> enhancing
> >>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>> Table API to also support non-relational processing
> cases,
> >>>>>>>>> cache()
> >>>>>>>>>>>>>>> might
> >>>>>>>>>>>>>>>> be
> >>>>>>>>>>>>>>>>> slightly better.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Jiangjie (Becket) Qin
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> On Fri, Nov 23, 2018 at 11:25 PM Piotr Nowojski <
> >>>>>>>>>>>>>>> piotr@data-artisans.com
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Hi Becket,
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Ops, sorry I didn’t notice that you intend to reuse
> >> existing
> >>>>>>>>>>>>>>>>>> `TableFactory`. I don’t know why, but I assumed that you
> >>>>> want
> >>>>>>>> to
> >>>>>>>>>>>>>>>> provide
> >>>>>>>>>>>>>>>>> an
> >>>>>>>>>>>>>>>>>> alternate way of writing the data.
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Now that I hopefully understand the proposal, maybe we
> >> could
> >>>>>>>>>> rename
> >>>>>>>>>>>>>>>>>> `cache()` to
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> void materialize()
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> or going step further
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> MaterializedTable materialize()
> >>>>>>>>>>>>>>>>>> MaterializedTable createMaterializedView()
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> ?
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> The second option with returning a handle I think is
> more
> >>>>>>>>> flexible
> >>>>>>>>>>>>>>> and
> >>>>>>>>>>>>>>>>>> could provide features such as “refresh”/“delete” or
> >>>>> generally
> >>>>>>>>>>>>>>> speaking
> >>>>>>>>>>>>>>>>>> manage the view. In the future we could also think
> >> about
> >>>>>>>>>> adding
> >>>>>>>>>>>>>>>> hooks
> >>>>>>>>>>>>>>>>>> to automatically refresh view etc. It is also more
> >> explicit
> >>>>> -
> >>>>>>>>>>>>>>>>>> materialization returning a new table handle will not
> have
> >>>>> the
> >>>>>>>>>> same
> >>>>>>>>>>>>>>>>>> implicit side effects as adding a simple line of code
> like
> >>>>>>>>>>>>>>> `b.cache()`
> >>>>>>>>>>>>>>>>>> would have.
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> It would also be more SQL like, making it more intuitive
> >> for
> >>>>>>>>> users
> >>>>>>>>>>>>>>>>> already
> >>>>>>>>>>>>>>>>>> familiar with the SQL.
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Piotrek
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> On 23 Nov 2018, at 14:53, Becket Qin <
> >> becket.qin@gmail.com
> >>>>>>
> >>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Hi Piotrek,
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> For the cache() method itself, yes, it is equivalent to
> >>>>>>>>> creating
> >>>>>>>>>> a
> >>>>>>>>>>>>>>>>>> BUILT-IN
> >>>>>>>>>>>>>>>>>>> materialized view with a lifecycle. That functionality
> is
> >>>>>>>>> missing
> >>>>>>>>>>>>>>>>> today,
> >>>>>>>>>>>>>>>>>>> though. Not sure if I understand your question. Do you
> >> mean
> >>>>>>>> we
> >>>>>>>>>>>>>>>> already
> >>>>>>>>>>>>>>>>>> have
> >>>>>>>>>>>>>>>>>>> the functionality and just need a syntax sugar?
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> What's more interesting in the proposal is whether we want
> >>>>>>>>>>>>>>>>>>> to stop at creating
> >>>>>>>>>>>>>>>>>>> the materialized view? Or do we want to extend that in
> >> the
> >>>>>>>>> future
> >>>>>>>>>>>>>>> to
> >>>>>>>>>>>>>>>> a
> >>>>>>>>>>>>>>>>>> more
> >>>>>>>>>>>>>>>>>>> useful unified data store distributed with Flink? And
> do
> >> we
> >>>>>>>>> want
> >>>>>>>>>> to
> >>>>>>>>>>>>>>>>> have
> >>>>>>>>>>>>>>>>>> a
> >>>>>>>>>>>>>>>>>>> mechanism to allow more flexible user job patterns with
> >>>>>>>>>>>>>>>>>>> their own user-defined services. These considerations are
> >>>>>>>>>>>>>>>>>>> much more architectural.
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Jiangjie (Becket) Qin
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> On Fri, Nov 23, 2018 at 6:01 PM Piotr Nowojski <
> >>>>>>>>>>>>>>>>> piotr@data-artisans.com>
> >>>>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> Hi,
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> Interesting idea. I’m trying to understand the
> problem.
> >>>>>>>> Isn’t
> >>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>>> `cache()` call an equivalent of writing data to a sink
> >> and
> >>>>>>>>> later
> >>>>>>>>>>>>>>>>> reading
> >>>>>>>>>>>>>>>>>>>> from it? Where this sink has a limited live scope/live
> >>>>> time?
> >>>>>>>>> And
> >>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>> sink
> >>>>>>>>>>>>>>>>>>>> could be implemented as in memory or a file sink?
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> If so, what’s the problem with creating a materialised
> >>>>> view
> >>>>>>>>>> from a
> >>>>>>>>>>>>>>>>> table
> >>>>>>>>>>>>>>>>>>>> “b” (from your document’s example) and reusing this
> >>>>>>>>> materialised
> >>>>>>>>>>>>>>>> view
> >>>>>>>>>>>>>>>>>>>> later? Maybe we are lacking mechanisms to clean up
> >>>>>>>>> materialised
> >>>>>>>>>>>>>>>> views
> >>>>>>>>>>>>>>>>>> (for
> >>>>>>>>>>>>>>>>>>>> example when current session finishes)? Maybe we need
> >> some
> >>>>>>>>>>>>>>> syntactic
> >>>>>>>>>>>>>>>>>> sugar
> >>>>>>>>>>>>>>>>>>>> on top of it?
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> Piotrek
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> On 23 Nov 2018, at 07:21, Becket Qin <
> >>>>> becket.qin@gmail.com
> >>>>>>>>>
> >>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> Thanks for the suggestion, Jincheng.
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> Yes, I think it makes sense to have a persist() with
> >>>>>>>>>>>>>>>>> lifecycle/defined
> >>>>>>>>>>>>>>>>>>>>> scope. I just added a section in the future work for
> >>>>> this.
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> Jiangjie (Becket) Qin
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> On Fri, Nov 23, 2018 at 1:55 PM jincheng sun <
> >>>>>>>>>>>>>>>>> sunjincheng121@gmail.com
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> Hi Jiangjie,
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> Thank you for the explanation about the name of
> >>>>>>>> `cache()`, I
> >>>>>>>>>>>>>>>>>> understand
> >>>>>>>>>>>>>>>>>>>> why
> >>>>>>>>>>>>>>>>>>>>>> you designed this way!
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> Another idea is whether we can specify a lifecycle
> for
> >>>>>>>> data
> >>>>>>>>>>>>>>>>>> persistence?
> >>>>>>>>>>>>>>>>>>>>>> For example, persist (LifeCycle.SESSION), so that
> the
> >>>>> user
> >>>>>>>>> is
> >>>>>>>>>>>>>>> not
> >>>>>>>>>>>>>>>>>>>> worried
> >>>>>>>>>>>>>>>>>>>>>> about data loss, and will clearly specify the time
> >> range
> >>>>>>>> for
> >>>>>>>>>>>>>>>> keeping
> >>>>>>>>>>>>>>>>>>>> time.
> >>>>>>>>>>>>>>>>>>>>>> At the same time, if we want to expand, we can also
> >>>>> share
> >>>>>>>>> in a
> >>>>>>>>>>>>>>>>> certain
> >>>>>>>>>>>>>>>>>>>>>> group of session, for example:
> >>>>>>>>> LifeCycle.SESSION_GROUP(...), I
> >>>>>>>>>>>>>>> am
> >>>>>>>>>>>>>>>>> not
> >>>>>>>>>>>>>>>>>>>> sure,
> >>>>>>>>>>>>>>>>>>>>>> just an immature suggestion, for reference only!
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> Bests,
> >>>>>>>>>>>>>>>>>>>>>> Jincheng
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> Becket Qin <be...@gmail.com> 于2018年11月23日周五
> >>>>>>>> 下午1:33写道:
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> Re: Jincheng,
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> Thanks for the feedback. Regarding cache() v.s.
> >>>>>>>> persist(),
> >>>>>>>>>>>>>>>>>> personally I
> >>>>>>>>>>>>>>>>>>>>>>> find cache() to be more accurately describing the
> >>>>>>>> behavior,
> >>>>>>>>>>>>>>> i.e.
> >>>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>>>>> Table
> >>>>>>>>>>>>>>>>>>>>>>> is cached for the session, but will be deleted
> after
> >>>>> the
> >>>>>>>>>>>>>>> session
> >>>>>>>>>>>>>>>> is
> >>>>>>>>>>>>>>>>>>>>>> closed.
> >>>>>>>>>>>>>>>>>>>>>>> persist() seems a little misleading as people might
> >>>>> think
> >>>>>>>>> the
> >>>>>>>>>>>>>>>> table
> >>>>>>>>>>>>>>>>>>>> will
> >>>>>>>>>>>>>>>>>>>>>>> still be there even after the session is gone.
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> Great point about mixing the batch and stream
> >>>>> processing
> >>>>>>>> in
> >>>>>>>>>> the
> >>>>>>>>>>>>>>>>> same
> >>>>>>>>>>>>>>>>>>>> job.
> >>>>>>>>>>>>>>>>>>>>>>> We should absolutely move towards that goal. I
> >> imagine
> >>>>>>>> that
> >>>>>>>>>>>>>>> would
> >>>>>>>>>>>>>>>>> be
> >>>>>>>>>>>>>>>>>> a
> >>>>>>>>>>>>>>>>>>>>>> huge
> >>>>>>>>>>>>>>>>>>>>>>> change across the board, including sources,
> operators
> >>>>> and
> >>>>>>>>>>>>>>>>>>>> optimizations,
> >>>>>>>>>>>>>>>>>>>>>> to
> >>>>>>>>>>>>>>>>>>>>>>> name some. Likely we will need several separate
> >>>>> in-depth
> >>>>>>>>>>>>>>>>> discussions.
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> Jiangjie (Becket) Qin
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> On Fri, Nov 23, 2018 at 5:14 AM Xingcan Cui <
> >>>>>>>>>>>>>>> xingcanc@gmail.com>
> >>>>>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> Hi all,
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> @Shaoxuan, I think the lifecycle or access domain
> >> are
> >>>>>>>> both
> >>>>>>>>>>>>>>>>>> orthogonal
> >>>>>>>>>>>>>>>>>>>>>> to
> >>>>>>>>>>>>>>>>>>>>>>>> the cache problem. Essentially, this may be the
> >> first
> >>>>>>>> time
> >>>>>>>>>> we
> >>>>>>>>>>>>>>>> plan
> >>>>>>>>>>>>>>>>>> to
> >>>>>>>>>>>>>>>>>>>>>>>> introduce another storage mechanism other than the
> >>>>>>>> state.
> >>>>>>>>>>>>>>> Maybe
> >>>>>>>>>>>>>>>>> it’s
> >>>>>>>>>>>>>>>>>>>>>>> better
> >>>>>>>>>>>>>>>>>>>>>>>> to first draw a big picture and then concentrate
> on
> >> a
> >>>>>>>>>> specific
> >>>>>>>>>>>>>>>>> part?
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> @Becket, yes, actually I am more concerned with
> the
> >>>>>>>>>> underlying
> >>>>>>>>>>>>>>>>>>>> service.
> >>>>>>>>>>>>>>>>>>>>>>>> This seems to be quite a major change to the
> >> existing
> >>>>>>>>>>>>>>> codebase.
> >>>>>>>>>>>>>>>> As
> >>>>>>>>>>>>>>>>>> you
> >>>>>>>>>>>>>>>>>>>>>>>> claimed, the service should be extendible to
> support
> >>>>>>>> other
> >>>>>>>>>>>>>>>>>> components
> >>>>>>>>>>>>>>>>>>>>>> and
> >>>>>>>>>>>>>>>>>>>>>>>> we’d better discussed it in another thread.
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> All in all, I also eager to enjoy the more
> >> interactive
> >>>>>>>>> Table
> >>>>>>>>>>>>>>>> API,
> >>>>>>>>>>>>>>>>> in
> >>>>>>>>>>>>>>>>>>>>>> case
> >>>>>>>>>>>>>>>>>>>>>>>> of a general and flexible enough service
> mechanism.
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> Best,
> >>>>>>>>>>>>>>>>>>>>>>>> Xingcan
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>> On Nov 22, 2018, at 10:16 AM, Xiaowei Jiang <
> >>>>>>>>>>>>>>>> xiaoweij@gmail.com>
> >>>>>>>>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>> Relying on a callback for the temp table for
> clean
> >> up
> >>>>>>>> is
> >>>>>>>>>> not
> >>>>>>>>>>>>>>>> very
> >>>>>>>>>>>>>>>>>>>>>>>> reliable.
> >>>>>>>>>>>>>>>>>>>>>>>>> There is no guarantee that it will be executed
> >>>>>>>>>> successfully.
> >>>>>>>>>>>>>>> We
> >>>>>>>>>>>>>>>>> may
> >>>>>>>>>>>>>>>>>>>>>>> risk
> >>>>>>>>>>>>>>>>>>>>>>>>> leaks when that happens. I think that it's safer
> to
> >>>>>>>> have
> >>>>>>>>> an
> >>>>>>>>>>>>>>>>>>>>>> association
> >>>>>>>>>>>>>>>>>>>>>>>>> between temp table and session id. So we can
> always
> >>>>>>>> clean
> >>>>>>>>>> up
> >>>>>>>>>>>>>>>> temp
> >>>>>>>>>>>>>>>>>>>>>>> tables
> >>>>>>>>>>>>>>>>>>>>>>>>> which are no longer associated with any active
> >>>>>>>> sessions.
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>> Regards,
> >>>>>>>>>>>>>>>>>>>>>>>>> Xiaowei
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>> On Thu, Nov 22, 2018 at 12:55 PM jincheng sun <
> >>>>>>>>>>>>>>>>>>>>>>> sunjincheng121@gmail.com>
> >>>>>>>>>>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> Hi Jiangjie&Shaoxuan,
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for initiating this great proposal!
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> Interactive Programming is very useful and user
> >>>>>>>> friendly
> >>>>>>>>>> in
> >>>>>>>>>>>>>>>> case
> >>>>>>>>>>>>>>>>>> of
> >>>>>>>>>>>>>>>>>>>>>>> your
> >>>>>>>>>>>>>>>>>>>>>>>>>> examples.
> >>>>>>>>>>>>>>>>>>>>>>>>>> Moreover, especially when a business has to be
> >>>>>>>> executed
> >>>>>>>>> in
> >>>>>>>>>>>>>>>>> several
> >>>>>>>>>>>>>>>>>>>>>>>> stages
> >>>>>>>>>>>>>>>>>>>>>>>>>> with dependencies,such as the pipeline of Flink
> >> ML,
> >>>>> in
> >>>>>>>>>> order
> >>>>>>>>>>>>>>>> to
> >>>>>>>>>>>>>>>>>>>>>>> utilize
> >>>>>>>>>>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>>>>>>>>> intermediate calculation results we have to
> >> submit a
> >>>>>>>> job
> >>>>>>>>>> by
> >>>>>>>>>>>>>>>>>>>>>>>> env.execute().
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> About the `cache()`  , I think is better to
> named
> >>>>>>>>>>>>>>> `persist()`,
> >>>>>>>>>>>>>>>>> And
> >>>>>>>>>>>>>>>>>>>>>> The
> >>>>>>>>>>>>>>>>>>>>>>>>>> Flink framework determines whether we internally
> >>>>> cache
> >>>>>>>>> in
> >>>>>>>>>>>>>>>> memory
> >>>>>>>>>>>>>>>>>> or
> >>>>>>>>>>>>>>>>>>>>>>>> persist
> >>>>>>>>>>>>>>>>>>>>>>>>>> to the storage system,Maybe save the data into
> >> state
> >>>>>>>>>> backend
> >>>>>>>>>>>>>>>>>>>>>>>>>> (MemoryStateBackend or RocksDBStateBackend etc.)
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> BTW, from the points of my view in the future,
> >>>>> support
> >>>>>>>>> for
> >>>>>>>>>>>>>>>>>> streaming
> >>>>>>>>>>>>>>>>>>>>>>> and
> >>>>>>>>>>>>>>>>>>>>>>>>>> batch mode switching in the same job will also
> >>>>> benefit
> >>>>>>>>> in
> >>>>>>>>>>>>>>>>>>>>>> "Interactive
> >>>>>>>>>>>>>>>>>>>>>>>>>> Programming",  I am looking forward to your
> JIRAs
> >>>>> and
> >>>>>>>>>> FLIP!
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> Best,
> >>>>>>>>>>>>>>>>>>>>>>>>>> Jincheng
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> Becket Qin <be...@gmail.com>
> 于2018年11月20日周二
> >>>>>>>>>> 下午9:56写道:
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>> Hi all,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>> As a few recent email threads have pointed out,
> >> it
> >>>>>>>> is a
> >>>>>>>>>>>>>>>>> promising
> >>>>>>>>>>>>>>>>>>>>>>>>>>> opportunity to enhance Flink Table API in
> various
> >>>>>>>>>> aspects,
> >>>>>>>>>>>>>>>>>>>>>> including
> >>>>>>>>>>>>>>>>>>>>>>>>>>> functionality and ease of use among others. One
> >> of
> >>>>>>>> the
> >>>>>>>>>>>>>>>>> scenarios
> >>>>>>>>>>>>>>>>>>>>>>> where
> >>>>>>>>>>>>>>>>>>>>>>>> we
> >>>>>>>>>>>>>>>>>>>>>>>>>>> feel Flink could improve is interactive
> >>>>> programming.
> >>>>>>>> To
> >>>>>>>>>>>>>>>> explain
> >>>>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>>>>>>>>> issues
> >>>>>>>>>>>>>>>>>>>>>>>>>>> and facilitate the discussion on the solution,
> we
> >>>>> put
> >>>>>>>>>>>>>>>> together
> >>>>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>>>>>>>>>> following document with our proposal.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> https://docs.google.com/document/d/1d4T2zTyfe7hdncEUAxrlNOYr4e5IMNEZLyqSuuswkA0/edit?usp=sharing
> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>> Feedback and comments are very welcome!
> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>> Jiangjie (Becket) Qin
> >>>>>>>>>>>>>>>>>>>>>>>>>>>

Re: [DISCUSS] Support Interactive Programming in Flink Table API

Posted by Piotr Nowojski <pi...@data-artisans.com>.
Hey Shaoxuan and Becket,

> Can you explain a bit more on what the side effects are? So far my
> understanding is that such side effects only exist if a table is mutable.
> Is that the case?

Not only that. There are also performance implications, and those are another implicit side effect of using `void cache()`. As I wrote before, reading from the cache might not always be desirable, thus it can cause performance degradation, and I’m fine with that - it is the user's or the optimiser’s choice. What I do not like is that this implicit side effect can manifest in a completely different part of the code that wasn’t touched by the user while he was adding the `void cache()` call somewhere else. And even if caching improves performance, it’s still a side effect of `void cache()`. Almost by definition, `void` methods have only side effects. As I wrote before, there are a couple of scenarios where this might be undesirable and/or unexpected, for example:

1.
Table b = …;
b.cache()
x = b.join(…)
y = b.count()
// ...
// 100
// hundred
// lines 
// of  
// code
// later
z = b.filter(…).groupBy(…) // this might be even hidden in a different method/file/package/dependency

2.

Table b = ...
if (some_condition) {
  foo(b)
}
else {
  bar(b)
}
z = b.filter(…).groupBy(…)


void foo(Table b) {
  b.cache()
  // do something with b
}

In both of the above examples, `b.cache()` will implicitly affect `z = b.filter(…).groupBy(…)`, both in performance and, if the sources are mutable, in the semantics of the program, which might be far from obvious.
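
Just to illustrate (CachedTable is a hypothetical handle type from this
discussion, not an existing API), the second example with an explicit handle
could look like:

Table b = ...
CachedTable cachedB = b.cache()

if (some_condition) {
  foo(cachedB) // the signature alone tells the reader that the cache is used
}
else {
  bar(b)
}
z = b.filter(…).groupBy(…) // unambiguously runs against the original plan

void foo(CachedTable b) {
  // do something with b, explicitly reading from the cache
}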

On top of that, there is still this argument of mine that having a `MaterializedTable` or `CachedTable` handle is more flexible for us in the future and for the user (as a manual option to bypass cache reads).
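
For example, a rough sketch of the manual bypass (again assuming the
hypothetical CachedTable handle):

CachedTable cachedB = b.cache()
long total = cachedB.count()  // cheap, served from the cache
Table user42 = b.filter(…)    // deliberately bypasses the cache, so the
                              // filter can still be pushed down to the source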

>  But Jiangjie is correct,
> the source table in batching should be immutable. It is the user’s
> responsibility to ensure it, otherwise even a regular failover may lead
> to inconsistent results.

Yes, I agree that’s what a perfect world/good deployment should be. But it often isn’t, and while I’m not trying to fix this (since the proper fix is to support transactions), I’m just trying to minimise confusion for the users that are not fully aware of what’s going on and operate in a less than perfect setup. And if something bites them after adding a `b.cache()` call, I want to make sure that they at least know all of the places that adding this line can affect.

Thanks, Piotrek

> On 1 Dec 2018, at 15:39, Becket Qin <be...@gmail.com> wrote:
> 
> Hi Piotrek,
> 
> Thanks again for the clarification. Some more replies are following.
> 
> But keep in mind that `.cache()` will/might not only be used in interactive
>> programming and not only in batching.
> 
> It is true. Actually, in stream processing cache() has the same semantics as
> in batch processing. The semantics are the following:
> For a table created via a series of computation, save that table for later
> reference to avoid running the computation logic to regenerate the table.
> Once the application exits, drop all the cache.
> This semantic is same for both batch and stream processing. The difference
> is that stream applications will only run once as they are long running.
> And the batch applications may be run multiple times, hence the cache may
> be created and dropped each time the application runs.
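> As a minimal sketch of this lifecycle (table names are just illustrative):
> Table t = a.join(b).groupBy(…).select(…);
> t.cache();    // hint: keep the result of the computation above
> t.count();    // first job: runs the plan and populates the cache
> t.select(…);  // later references read the cached table instead
> // the application exits -> the cache is dropped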
> Admittedly, there will probably be some resource management requirements
> for the streaming cached table, such as time based / size based retention,
> to address the infinite data issue. But such requirements do not change
> the semantics.
> You are right that interactive programming is just one use case of cache().
> It is not the only use case.
> 
> For me the more important issue is not having the `void cache()` with
>> side effects.
> 
> This is indeed the key point. The argument around whether cache() should
> return something already indicates that cache() and materialize() address
> different issues.
> Can you explain a bit more on what the side effects are? So far my
> understanding is that such side effects only exist if a table is mutable.
> Is that the case?
> 
> I don’t know, probably initially we should make CachedTable read-only. I
>> don’t find it more confusing than the fact that a user cannot write to views
>> or materialised views in SQL, or that a user currently cannot write to a
>> Table.
> 
> I don't think anyone should insert something to a cache. By definition the
> cache should only be updated when the corresponding original table is
> updated. What I am wondering is that given the following two facts:
> 1. If and only if a table is mutable (with something like insert()), a
> CachedTable may have implicit behavior.
> 2. A CachedTable extends a Table.
> We can come to the conclusion that a CachedTable is mutable and users can
> insert into the CachedTable directly. This is what I find confusing.
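> To illustrate with the hypothetical API (insert() is the assumed mutation
> method here, not an existing one):
> CachedTable cachedT = t.cache();
> cachedT.insert(…);  // type-checks if CachedTable extends a mutable Table,
>                     // but semantically it should never be allowed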
> 
> Thanks,
> 
> Jiangjie (Becket) Qin
> 
> On Sat, Dec 1, 2018 at 2:45 AM Piotr Nowojski <pi...@data-artisans.com>
> wrote:
> 
>> Hi all,
>> 
>> Regarding naming `cache()` vs `materialize()`. One more explanation why I
>> think `materialize()` is more natural to me is that I think of all “Table”s
>> in Table-API as views. They behave the same way as SQL views, the only
>> difference for me is that their live scope is short - current session which
>> is limited by different execution model. That’s why “cashing” a view for me
>> is just materialising it.
>> 
>> However I see and I understand your point of view. Coming from
>> DataSet/DataStream and generally speaking non-SQL world, `cache()` is more
>> natural. But keep in mind that `.cache()` will/might not only be used in
>> interactive programming and not only in batching. But naming is one issue,
>> and not that critical to me. Especially that once we implement proper
>> materialised views, we can always deprecate/rename `cache()` if we deem so.
>> 
>> 
>> For me the more important issue is not having the `void cache()` with
>> side effects, exactly for the reasons that you have mentioned. True:
>> results might be non deterministic if underlying source table are changing.
>> Problem is that `void cache()` implicitly changes the semantic of
>> subsequent uses of the cached/materialized Table. It can cause a “wtf” moment
>> for a user if he inserts “b.cache()” call in some place in his code and
>> suddenly some other random places are behaving differently. If
>> `materialize()` or `cache()` returns a Table handle, we force user to
>> explicitly use the cache which removes the “random” part from the "suddenly
>> some other random places are behaving differently”.
>> 
>> This argument and others that I’ve raised (greater flexibility/allowing
>> user to explicitly bypass the cache) are independent of `cache()` vs
>> `materialize()` discussion.
>> 
>>> Does that mean one can also insert into the CachedTable? This sounds
>> pretty confusing.
>> 
>> I don’t know, probably initially we should make CachedTable read-only. I
>> don’t find it more confusing than the fact that a user cannot write to views
>> or materialised views in SQL, or that a user currently cannot write to a
>> Table.
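>> For illustration only, such a read-only handle could start as small as (all
>> names hypothetical):
>> 
>> class CachedTable extends Table {
>>   void refresh() { … } // possible future extension
>>   void drop() { … }    // manual cache dropping, another possible extension
>>   // no new write methods; inherited write paths would be rejected for now
>> }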
>> 
>> Piotrek
>> 
>>> On 30 Nov 2018, at 17:38, Xingcan Cui <xi...@gmail.com> wrote:
>>> 
>>> Hi all,
>>> 
>>> I agree with @Becket that `cache()` and `materialize()` should be
>> considered as two different methods where the later one is more
>> sophisticated.
>>> 
>>> According to my understanding, the initial idea is just to introduce a
>> simple cache or persist mechanism, but as the TableAPI is a high-level API,
>> it’s natural for us to think in a SQL way.
>>> 
>>> Maybe we can add the `cache()` method to the DataSet API and force users
>> to translate a Table to a Dataset before caching it. Then the users should
>> manually register the cached dataset to a table again (we may need some
>> table replacement mechanisms for datasets with an identical schema but
>> different contents here). After all, it’s the dataset rather than the
>> dynamic table that needs to be cached, right?
>>> 
>>> Best,
>>> Xingcan
>>> 
>>>> On Nov 30, 2018, at 10:57 AM, Becket Qin <be...@gmail.com> wrote:
>>>> 
>>>> Hi Piotrek and Jark,
>>>> 
>>>> Thanks for the feedback and explanation. Those are good arguments. But I
>>>> think those arguments are mostly about materialized view. Let me try to
>>>> explain the reason I believe cache() and materialize() are different.
>>>> 
>>>> I think cache() and materialize() have quite different implications. An
>>>> analogy I can think of is save()/publish(). When users call cache(), it
>> is
>>>> just like they are saving an intermediate result as a draft of their
>>>> work; this intermediate result may not have any realistic meaning. Calling
>>>> cache() does not mean users want to publish the cached table in any
>> manner.
>>>> But when users call materialize(), that means "I have something
>> meaningful
>>>> to be reused by others", now users need to think about the validation,
>>>> update & versioning, lifecycle of the result, etc.
>>>> 
>>>> Piotrek's suggestions on variations of the materialize() methods are
>> very
>>>> useful. It would be great if Flink have them. The concept of
>> materialized
>>>> view is actually a pretty big feature, not to say the related stuff like
>>>> triggers/hooks you mentioned earlier. I think the materialized view
>> itself
>>>> should be discussed in a more thorough and systematic manner. And I
>> found
>>>> that discussion is kind of orthogonal and way beyond interactive
>>>> programming experience.
>>>> 
>>>> The example you gave was interesting. I still have some questions,
>> though.
>>>> 
>>>> Table source = … // some source that scans files from a directory
>>>>> “/foo/bar/“
>>>>> Table t1 = source.groupBy(…).select(…).where(…) ….;
>>>>> Table t2 = t1.materialize() // (or `cache()`)
>>>> 
>>>> t2.count() // initialise cache (if it’s lazily initialised)
>>>>> int a1 = t1.count()
>>>>> int b1 = t2.count()
>>>>> // something in the background (or we trigger it) writes new files to
>>>>> /foo/bar
>>>>> int a2 = t1.count()
>>>>> int b2 = t2.count()
>>>>> t2.refresh() // possible future extension, not to be implemented in the
>>>>> initial version
>>>>> 
>>>> 
>>>> what if someone else added some more files to /foo/bar at this point? In
>>>> that case, a3 won't equal b3, and the result becomes non-deterministic,
>>>> right?
>>>> 
>>>> int a3 = t1.count()
>>>>> int b3 = t2.count()
>>>>> t2.drop() // another possible future extension, manual “cache” dropping
>>>> 
>>>> 
>>>> When we talk about interactive programming, in most cases, we are
>> talking
>>>> about batch applications. A fundamental assumption of such a case is that
>> the
>>>> source data is complete before the data processing begins, and the data
>>>> will not change during the data processing. IMO, if additional rows
>> need
>>>> to be added to some source during the processing, it should be done in
>> ways
>>>> like union the source with another table containing the rows to be
>> added.
>>>> 
>>>> There are a few cases that computations are executed repeatedly on the
>>>> changing data source.
>>>> 
>>>> For example, people may run an ML training job every hour with the
>>>> samples newly added in the past hour. In that case, the source data
>>>> between runs will indeed change. But still, the data remains unchanged
>>>> within one run. And
>>>> usually in that case, the result will need versioning, i.e. for a given
>>>> result, it tells that the result is a result from the source data by a
>>>> certain timestamp.
>>>> 
>>>> Another example is something like a data warehouse. In this case, there
>>>> are a few sources of original/raw data. On top of those sources, many
>>>> materialized views / queries / reports / dashboards can be created to
>>>> generate derived data. Those derived data need to be updated when the
>>>> underlying original data changes. In that case, the processing logic that
>>>> derives the data from the original data needs to be executed repeatedly
>>>> to update those reports/views. Again,
>>>> all those derived data also need to have version management, such as
>>>> timestamp.
>>>> 
>>>> In any of the above two cases, during a single run of the processing
>> logic,
>>>> the data cannot change. Otherwise the behavior of the processing logic
>> may
>>>> be undefined. In the above two examples, when writing the processing
>> logic,
>>>> Users can use .cache() to hint Flink that those results should be saved
>> to
>>>> avoid repeated computation. And then for the result of my application
>>>> logic, I'll call materialize(), so that these results could be managed
>> by
>>>> the system with versioning, metadata management, lifecycle management,
>>>> ACLs, etc.
>>>> 
>>>> It is true we can use materialize() to do the cache() job, but I am
>> really
>>>> reluctant to shoehorn cache() into materialize() and force users to
>> worry
>>>> about a bunch of implications that they needn't have to. I am
>> absolutely on
>>>> your side that redundant API is bad. But it is equally frustrating, if
>> not
>>>> more, that the same API does different things.
>>>> 
>>>> Thanks,
>>>> 
>>>> Jiangjie (Becket) Qin
>>>> 
>>>> 
>>>> On Fri, Nov 30, 2018 at 10:34 PM Shaoxuan Wang <ws...@gmail.com>
>> wrote:
>>>> 
>>>>> Thanks Piotrek,
>>>>> You provided a very good example, it explains all the confusions I
>> have.
>>>>> It is clear that there is something we have not considered in the
>> initial
>>>>> proposal. We intend to force the user to reuse the cached/materialized
>>>>> table, if its cache() method is executed. We did not expect that a user
>>>>> may want to re-execute the plan from the source table. Let me re-think
>> about
>>>>> it and get back to you later.
>>>>> 
>>>>> In the meantime, this example/observation also implies that we cannot
>>>>> fully involve the optimizer in deciding the plan if a cache/materialize
>>>>> is explicitly used, because whether to reuse the cached data or
>>>>> re-execute the query from the source data may lead to different results.
>>>>> (But I guess the optimizer can still help in some cases; as long as it
>>>>> does not re-execute from the varied source, we should be safe.)
>>>>> 
>>>>> Regards,
>>>>> Shaoxuan
>>>>> 
>>>>> 
>>>>> 
>>>>> On Fri, Nov 30, 2018 at 9:13 PM Piotr Nowojski <
>> piotr@data-artisans.com>
>>>>> wrote:
>>>>> 
>>>>>> Hi Shaoxuan,
>>>>>> 
>>>>>> Re 2:
>>>>>> 
>>>>>>> Table t3 = methodThatAppliesOperators(t1) // t1 is modified to-> t1’
>>>>>> 
>>>>>> What do you mean that “ t1 is modified to-> t1’ ” ? That
>>>>>> `methodThatAppliesOperators()` method has changed it’s plan?
>>>>>> 
>>>>>> I was thinking more about something like this:
>>>>>> 
>>>>>> Table source = … // some source that scans files from a directory
>>>>>> “/foo/bar/“
>>>>>> Table t1 = source.groupBy(…).select(…).where(…) ….;
>>>>>> Table t2 = t1.materialize() // (or `cache()`)
>>>>>> 
>>>>>> t2.count() // initialise cache (if it’s lazily initialised)
>>>>>> 
>>>>>> int a1 = t1.count()
>>>>>> int b1 = t2.count()
>>>>>> 
>>>>>> // something in the background (or we trigger it) writes new files to
>>>>>> /foo/bar
>>>>>> 
>>>>>> int a2 = t1.count()
>>>>>> int b2 = t2.count()
>>>>>> 
>>>>>> t2.refresh() // possible future extension, not to be implemented in
>> the
>>>>>> initial version
>>>>>> 
>>>>>> int a3 = t1.count()
>>>>>> int b3 = t2.count()
>>>>>> 
>>>>>> t2.drop() // another possible future extension, manual “cache”
>> dropping
>>>>>> 
>>>>>> assertTrue(a1 == b1) // same results, but b1 comes from the “cache"
>>>>>> assertTrue(b1 == b2) // both values come from the same cache
>>>>>> assertTrue(a2 > b2) // b2 comes from cache, a2 re-executed full table
>>>>> scan
>>>>>> and has more data
>>>>>> assertTrue(b3 > b2) // b3 comes from refreshed cache
>>>>>> assertTrue(b3 == a2 && a2 == a3)
>>>>>> 
>>>>>> Piotrek
>>>>>> 
>>>>>>> On 30 Nov 2018, at 10:22, Jark Wu <im...@gmail.com> wrote:
>>>>>>> 
>>>>>>> Hi,
>>>>>>> 
>>>>>>> It is an very interesting and useful design!
>>>>>>> 
>>>>>>> Here I want to share some of my thoughts:
>>>>>>> 
>>>>>>> 1. Agree with that cache() method should return some Table to avoid
>>>>> some
>>>>>>> unexpected problems because of the mutable object.
>>>>>>> All the existing methods of Table return a new Table instance.
>>>>>>> 
>>>>>>> 2. I think materialize() would be more consistent with SQL, this
>> makes
>>>>> it
>>>>>>> possible to support the same feature for SQL (materialize view) and
>>>>> keep
>>>>>>> the same API for users in the future.
>>>>>>> But I'm also fine if we choose cache().
>>>>>>> 
>>>>>>> 3. In the proposal, a TableService (or FlinkService?) is used to
>> cache
>>>>>> the
>>>>>>> result of the (intermediate) table.
>>>>>>> But the name of TableService may be a bit too general and not quite
>>>>>>> understandable at first glance (a metastore for tables?).
>>>>>>> Maybe a more specific name would be better, such as TableCacheService
>>>>>>> or TableMaterializeService or something else.
>>>>>>> 
>>>>>>> Best,
>>>>>>> Jark
>>>>>>> 
>>>>>>> 
>>>>>>> On Thu, 29 Nov 2018 at 21:16, Fabian Hueske <fh...@gmail.com>
>> wrote:
>>>>>>> 
>>>>>>>> Hi,
>>>>>>>> 
>>>>>>>> Thanks for the clarification Becket!
>>>>>>>> 
>>>>>>>> I have a few thoughts to share / questions:
>>>>>>>> 
>>>>>>>> 1) I'd like to know how you plan to implement the feature on a plan
>> /
>>>>>>>> planner level.
>>>>>>>> 
>>>>>>>> I would imagine the following to happen when Table.cache() is
>> called:
>>>>>>>> 
>>>>>>>> 1) immediately optimize the Table and internally convert it into a
>>>>>>>> DataSet/DataStream. This is necessary, to avoid that operators of
>>>>> later
>>>>>>>> queries on top of the Table are pushed down.
>>>>>>>> 2) register the DataSet/DataStream as a DataSet/DataStream-backed
>>>>> Table
>>>>>> X
>>>>>>>> 3) add a sink to the DataSet/DataStream. This is the materialization
>>>>> of
>>>>>> the
>>>>>>>> Table X
>>>>>>>> 
>>>>>>>> Based on your proposal the following would happen:
>>>>>>>> 
>>>>>>>> Table t1 = ....
>>>>>>>> t1.cache(); // cache() returns void. The logical plan of t1 is
>>>>> replaced
>>>>>> by
>>>>>>>> a scan of X. There is also a reference to the materialization of X.
>>>>>>>> 
>>>>>>>> t1.count(); // this executes the program, including the
>>>>>> DataSet/DataStream
>>>>>>>> that backs X and the sink that writes the materialization of X
>>>>>>>> t1.count(); // this executes the program, but reads X from the
>>>>>>>> materialization.
>>>>>>>> 
>>>>>>>> My question is, how do you determine when the scan of t1 should go
>>>>>>>> against the DataSet/DataStream program and when against the
>>>>>>>> materialization?
>>>>>>>> AFAIK, there is no hook that will tell you that a part of the
>> program
>>>>>> was
>>>>>>>> executed. Flipping a switch during optimization or plan generation
>> is
>>>>>> not
>>>>>>>> sufficient as there is no guarantee that the plan is also executed.
>>>>>>>> 
>>>>>>>> Overall, this behavior is somewhat similar to what I proposed in
>>>>>>>> FLINK-8950, which does not include persisting the table, but just
>>>>>>>> optimizing and reregistering it as DataSet/DataStream scan.
>>>>>>>> 
>>>>>>>> 2) I think Piotr has a point about the implicit behavior and side
>>>>>> effects
>>>>>>>> of the cache() method if it does not return anything.
>>>>>>>> Consider the following example:
>>>>>>>> 
>>>>>>>> Table t1 = ???
>>>>>>>> Table t2 = methodThatAppliesOperators(t1);
>>>>>>>> Table t3 = methodThatAppliesOtherOperators(t1);
>>>>>>>> 
>>>>>>>> In this case, the behavior/performance of the plan that results from
>>>>> the
>>>>>>>> second method call depends on whether t1 was modified by the first
>>>>>> method
>>>>>>>> or not.
>>>>>>>> This is the classic issue of mutable vs. immutable objects.
>>>>>>>> Also, as Piotr pointed out, it might also be good to have the
>> original
>>>>>> plan
>>>>>>>> of t1, because in some cases it is possible to push filters down
>> such
>>>>>> that
>>>>>>>> evaluating the query from scratch might be more efficient than
>>>>> accessing
>>>>>>>> the cache.
>>>>>>>> Moreover, a CachedTable could extend Table and offer a method
>>>>>> refresh().
>>>>>>>> This sounds quite useful in an interactive session mode.
>>>>>>>> 
>>>>>>>> 3) Regarding the name, I can see both arguments. IMO, materialize()
>>>>>> seems
>>>>>>>> to be more future proof.
>>>>>>>> 
>>>>>>>> Best, Fabian
>>>>>>>> 
>>>>>>>> Am Do., 29. Nov. 2018 um 12:56 Uhr schrieb Shaoxuan Wang <
>>>>>>>> wshaoxuan@gmail.com>:
>>>>>>>> 
>>>>>>>>> Hi Piotr,
>>>>>>>>> 
>>>>>>>>> Thanks for sharing your ideas on the method naming. We will think
>>>>> about
>>>>>>>>> your suggestions. But I don't understand why we need to change the
>>>>>> return
>>>>>>>>> type of cache().
>>>>>>>>> 
>>>>>>>>> Cache() is a physical operation; it does not change the logic of
>>>>>>>>> the `Table`. On the Table API layer, we should not introduce a new
>>>>>>>>> table type unless the logic of the table has been changed. If we
>>>>>>>>> introduce a new table type `CachedTable`, we need to create the same
>>>>>>>>> set of methods of `Table` for it. I don't think it is worth doing
>>>>>>>>> this. Or can you please
>>>>>> elaborate
>>>>>>>>> more on what could be the "implicit behaviours/side effects" you
>> are
>>>>>>>>> thinking about?
>>>>>>>>> 
>>>>>>>>> Regards,
>>>>>>>>> Shaoxuan
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On Thu, Nov 29, 2018 at 7:05 PM Piotr Nowojski <
>>>>>> piotr@data-artisans.com>
>>>>>>>>> wrote:
>>>>>>>>> 
>>>>>>>>>> Hi Becket,
>>>>>>>>>> 
>>>>>>>>>> Thanks for the response.
>>>>>>>>>> 
>>>>>>>>>> 1. I wasn’t saying that materialised view must be mutable or not.
>>>>> The
>>>>>>>>> same
>>>>>>>>>> thing applies to caches as well. On the contrary, I would expect
>>>>> more
>>>>>>>>>> consistency and updates from something that is called “cache” vs
>>>>>>>>> something
>>>>>>>>>> that’s a “materialised view”. In other words, IMO most caches do
>> not
>>>>>>>>> serve
>>>>>>>>>> you invalid/outdated data and they handle updates on their own.
>>>>>>>>>> 
>>>>>>>>>> 2. I don’t think that having in the future two very similar
>> concepts
>>>>>> of
>>>>>>>>>> `materialized` view and `cache` is a good idea. It would be
>>>>> confusing
>>>>>>>> for
>>>>>>>>>> the users. I think it could be handled by variations/overloading
>> of
>>>>>>>>>> materialised view concept. We could start with:
>>>>>>>>>> 
>>>>>>>>>> `MaterializedTable materialize()` - immutable, session life scope
>>>>>>>>>> (basically the same semantics as you are proposing)
>>>>>>>>>> 
>>>>>>>>>> And then in the future (if ever) build on top of that/expand it
>>>>> with:
>>>>>>>>>> 
>>>>>>>>>> `MaterializedTable materialize(refreshTime=…)` or
>> `MaterializedTable
>>>>>>>>>> materialize(refreshHook=…)`
>>>>>>>>>> 
>>>>>>>>>> Or with cross session support:
>>>>>>>>>> 
>>>>>>>>>> `MaterializedTable materializeInto(connector=…)` or
>>>>> `MaterializedTable
>>>>>>>>>> materializeInto(tableFactory=…)`
>>>>>>>>>> 
>>>>>>>>>> I’m not saying that we should implement cross session/refreshing
>> now
>>>>>> or
>>>>>>>>>> even in the near future. I’m just arguing that naming the current
>>>>>>>>>> immutable, session-scoped method `materialize()` is more future proof
>>>>>>>>>> and more consistent with SQL (on which, after all, the Table API is
>>>>>>>>>> heavily based).
>>>>>>>>>> 
>>>>>>>>>> 3. Even if we agree on naming it `cache()`, I would still insist
>> on
>>>>>>>>>> `cache()` returning `CachedTable` handle to avoid implicit
>>>>>>>>> behaviours/side
>>>>>>>>>> effects and to give both us & users more flexibility.
>>>>>>>>>> 
>>>>>>>>>> Piotrek
>>>>>>>>>> 
>>>>>>>>>>> On 29 Nov 2018, at 06:20, Becket Qin <be...@gmail.com>
>> wrote:
>>>>>>>>>>> 
>>>>>>>>>>> Just to add a little bit, the materialized view is probably more
>>>>>>>>> similar
>>>>>>>>>> to
>>>>>>>>>>> the persistent() brought up earlier in the thread. So it is
>> usually
>>>>>>>>> cross
>>>>>>>>>>> session and could be used in a larger scope. For example, a
>>>>>>>>> materialized
>>>>>>>>>>> view created by user A may be visible to user B. It is probably
>>>>>>>>> something
>>>>>>>>>>> we want to have in the future. I'll put it in the future work
>>>>>>>> section.
>>>>>>>>>>> 
>>>>>>>>>>> Thanks,
>>>>>>>>>>> 
>>>>>>>>>>> Jiangjie (Becket) Qin
>>>>>>>>>>> 
>>>>>>>>>>> On Thu, Nov 29, 2018 at 9:47 AM Becket Qin <becket.qin@gmail.com
>>> 
>>>>>>>>> wrote:
>>>>>>>>>>> 
>>>>>>>>>>>> Hi Piotrek,
>>>>>>>>>>>> 
>>>>>>>>>>>> Thanks for the explanation.
>>>>>>>>>>>> 
>>>>>>>>>>>> Right now we are mostly thinking of the cached table as
>>>>> immutable. I
>>>>>>>>> can
>>>>>>>>>>>> see the Materialized view would be useful in the future. That
>>>>> said,
>>>>>>>> I
>>>>>>>>>> think
>>>>>>>>>>>> a simple cache mechanism is probably still needed. So to me,
>>>>> cache()
>>>>>>>>> and
>>>>>>>>>>>> materialize() should be two separate method as they address
>>>>>>>> different
>>>>>>>>>>>> needs. Materialize() is a higher level concept usually implying
>>>>>>>>>> periodical
>>>>>>>>>>> update, while cache() has much simpler semantics. For example, one
>> one
>>>>>>>> may
>>>>>>>>>>>> create a materialized view and use cache() method in the
>>>>>>>> materialized
>>>>>>>>>> view
>>>>>>>>>>>> creation logic. So that during the materialized view update,
>> they
>>>>> do
>>>>>>>>> not
>>>>>>>>>>>> need to worry about the case that the cached table is also
>>>>> changed.
>>>>>>>>>> Maybe
>>>>>>>>>>>> under the hood, materialized() and cache() could share some
>>>>>>>> mechanism,
>>>>>>>>>> but
>>>>>>>>>>>> I think a simple cache() method would be handy in a lot of
>> cases.
>>>>>>>>>>>> 
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> 
>>>>>>>>>>>> Jiangjie (Becket) Qin
>>>>>>>>>>>> 
>>>>>>>>>>>> On Mon, Nov 26, 2018 at 9:38 PM Piotr Nowojski <
>>>>>>>>> piotr@data-artisans.com
>>>>>>>>>>> 
>>>>>>>>>>>> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>>> Hi Becket,
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Is there any extra thing user can do on a MaterializedTable
>> that
>>>>>>>>> they
>>>>>>>>>>>>> cannot do on a Table?
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Maybe not in the initial implementation, but various DBs offer
>>>>>>>>>> different
>>>>>>>>>>>>> ways to “refresh” the materialised view. Hooks, triggers,
>> timers,
>>>>>>>>>> manually
>>>>>>>>>>>>> etc. Having `MaterializedTable` would help us to handle that in
>>>>> the
>>>>>>>>>> future.
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> After users call *table.cache(), *users can just use that
>> table
>>>>>>>> and
>>>>>>>>> do
>>>>>>>>>>>>> anything that is supported on a Table, including SQL.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> This is some implicit behaviour with side effects. Imagine if
>>>>> user
>>>>>>>>> has
>>>>>>>>>> a
>>>>>>>>>>>>> long and complicated program, that touches table `b` multiple
>>>>>>>> times,
>>>>>>>>>> maybe
>>>>>>>>>>>>> scattered around different methods. If he modifies his program
>> by
>>>>>>>>>> inserting
>>>>>>>>>>>>> in one place
>>>>>>>>>>>>> 
>>>>>>>>>>>>> b.cache()
>>>>>>>>>>>>> 
>>>>>>>>>>>>> This implicitly alters the semantic and behaviour of his code
>> all
>>>>>>>>> over
>>>>>>>>>>>>> the place, maybe in a ways that might cause problems. For
>> example
>>>>>>>>> what
>>>>>>>>>> if
>>>>>>>>>>>>> underlying data is changing?
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Having invisible side effects is also not very clean, for
>> example
>>>>>>>>> think
>>>>>>>>>>>>> about something like this (but more complicated):
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Table b = ...;
>>>>>>>>>>>>> 
>>>>>>>>>>>>> if (some_condition) {
>>>>>>>>>>>>> processTable1(b)
>>>>>>>>>>>>> }
>>>>>>>>>>>>> else {
>>>>>>>>>>>>> processTable2(b)
>>>>>>>>>>>>> }
>>>>>>>>>>>>> 
>>>>>>>>>>>>> // do more stuff with b
>>>>>>>>>>>>> 
>>>>>>>>>>>>> And user adds `b.cache()` call to only one of the
>> `processTable1`
>>>>>>>> or
>>>>>>>>>>>>> `processTable2` methods.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On the other hand
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Table materialisedB = b.materialize()
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Avoids (at least some of) the side effect issues and forces
>> user
>>>>> to
>>>>>>>>>>>>> explicitly use `materialisedB` where it’s appropriate and
>> forces
>>>>>>>> user
>>>>>>>>>> to
>>>>>>>>>>>>> think what does it actually mean. And if something doesn’t work
>>>>> in
>>>>>>>>> the
>>>>>>>>>> end
>>>>>>>>>>>>> for the user, he will know what he has changed instead of
>> blaming
>>>>>>>>>> Flink for
>>>>>>>>>>>>> some “magic” underneath. In the above example, after
>>>>> materialising
>>>>>>>> b
>>>>>>>>> in
>>>>>>>>>>>>> only one of the methods, he should/would realise about the
>> issue
>>>>>>>> when
>>>>>>>>>>>>> handling the return value `MaterializedTable` of that method.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> I guess it comes down to personal preferences if you like
>> things
>>>>> to
>>>>>>>>> be
>>>>>>>>>>>>> implicit or not. The more of a power user someone is, the more
>>>>>>>>>>>>> likely he is
>>>>>>>>>>>>> to like/understand implicit behaviour. And we as Table API
>>>>>>>> designers
>>>>>>>>>> are
>>>>>>>>>>>>> the most power users out there, so I would proceed with caution
>>>>> (so
>>>>>>>>>> that we
>>>>>>>>>>>>> do not end up in the crazy perl realm with its lovely implicit
>>>>>>>>> method
>>>>>>>>>>>>> arguments ;)  <https://stackoverflow.com/a/14922656/8149051>)
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Table API to also support non-relational processing cases,
>>>>> cache()
>>>>>>>>>>>>> might be slightly better.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> I think even such extended Table API could benefit from
>> sticking
>>>>>>>>>> to/being
>>>>>>>>>>>>> consistent with SQL where both SQL and Table API are basically
>>>>> the
>>>>>>>>>> same.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> One more thing. `MaterializedTable materialize()` could be more
>>>>>>>>>>>>> powerful/flexible allowing the user to operate both on
>>>>> materialised
>>>>>>>>>> and not
>>>>>>>>>>>>> materialised view at the same time for whatever reasons
>>>>> (underlying
>>>>>>>>>> data
>>>>>>>>>>>>> changing/better optimisation opportunities after pushing down
>>>>> more
>>>>>>>>>> filters
>>>>>>>>>>>>> etc). For example:
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Table b = …;
>>>>>>>>>>>>> 
>>>>>>>>>>>>> MaterializedTable mb = b.materialize();
>>>>>>>>>>>>> 
>>>>>>>>>>>>> val min = mb.min();
>>>>>>>>>>>>> val max = mb.max();
>>>>>>>>>>>>> 
>>>>>>>>>>>>> val user42 = b.filter('userId = 42);
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Could be more efficient compared to `b.cache()` if
>>>>> `filter(‘userId
>>>>>>>> =
>>>>>>>>>>>>> 42);` allows for much more aggressive optimisations.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Piotrek
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On 26 Nov 2018, at 12:14, Fabian Hueske <fh...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> I'm not suggesting to add support for Ignite. This was just an
>>>>>>>>>> example.
>>>>>>>>>>>>>> Plasma and Arrow sound interesting, too.
>>>>>>>>>>>>>> For the sake of this proposal, it would be up to the user to
>>>>>>>>>> implement a
>>>>>>>>>>>>>> TableFactory and corresponding TableSource / TableSink classes
>>>>> to
>>>>>>>>>>>>> persist
>>>>>>>>>>>>>> and read the data.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Am Mo., 26. Nov. 2018 um 12:06 Uhr schrieb Flavio Pompermaier
>> <
>>>>>>>>>>>>>> pompermaier@okkam.it>:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> What about to add also Apache Plasma + Arrow as an
>> alternative
>>>>> to
>>>>>>>>>>>>> Apache
>>>>>>>>>>>>>>> Ignite?
>>>>>>>>>>>>>>> [1]
>> https://arrow.apache.org/blog/2017/08/08/plasma-in-memory-object-store/
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> On Mon, Nov 26, 2018 at 11:56 AM Fabian Hueske <
>>>>>>>> fhueske@gmail.com>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Thanks for the proposal!
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> To summarize, you propose a new method Table.cache(): Table
>>>>> that
>>>>>>>>>> will
>>>>>>>>>>>>>>>> trigger a job and write the result into some temporary
>> storage
>>>>>>>> as
>>>>>>>>>>>>> defined
>>>>>>>>>>>>>>>> by a TableFactory.
>>>>>>>>>>>>>>>> The cache() call blocks while the job is running and
>>>>> eventually
>>>>>>>>>>>>> returns a
>>>>>>>>>>>>>>>> Table object that represents a scan of the temporary table.
>>>>>>>>>>>>>>>> When the "session" is closed (closing to be defined?), the
>>>>>>>>> temporary
>>>>>>>>>>>>>>> tables
>>>>>>>>>>>>>>>> are all dropped.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> I think this behavior makes sense and is a good first step
>>>>>>>> towards
>>>>>>>>>>>>> more
>>>>>>>>>>>>>>>> interactive workloads.
>>>>>>>>>>>>>>>> However, its performance suffers from writing to and reading
>>>>>>>> from
>>>>>>>>>>>>>>> external
>>>>>>>>>>>>>>>> systems.
>>>>>>>>>>>>>>>> I think this is OK for now. Changes that would significantly
>>>>>>>>> improve
>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>> situation (i.e., pinning data in-memory across jobs) would
>>>>> have
>>>>>>>>>> large
>>>>>>>>>>>>>>>> impacts on many components of Flink.
>>>>>>>>>>>>>>>> Users could use in-memory filesystems or storage grids
>> (Apache
>>>>>>>>>>>>> Ignite) to
>>>>>>>>>>>>>>>> mitigate some of the performance effects.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Best, Fabian
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Am Mo., 26. Nov. 2018 um 03:38 Uhr schrieb Becket Qin <
>>>>>>>>>>>>>>>> becket.qin@gmail.com
>>>>>>>>>>>>>>>>> :
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Thanks for the explanation, Piotrek.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Is there any extra thing user can do on a MaterializedTable
>>>>>>>> that
>>>>>>>>>> they
>>>>>>>>>>>>>>>>> cannot do on a Table? After users call *table.cache(),
>> *users
>>>>>>>> can
>>>>>>>>>>>>> just
>>>>>>>>>>>>>>>> use
>>>>>>>>>>>>>>>>> that table and do anything that is supported on a Table,
>>>>>>>>> including
>>>>>>>>>>>>> SQL.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Naming wise, either cache() or materialize() sounds fine to
>>>>> me.
>>>>>>>>>>>>> cache()
>>>>>>>>>>>>>>>> is
>>>>>>>>>>>>>>>>> a bit more general than materialize(). Given that we are
>>>>>>>>> enhancing
>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>> Table API to also support non-relational processing cases,
>>>>>>>>> cache()
>>>>>>>>>>>>>>> might
>>>>>>>>>>>>>>>> be
>>>>>>>>>>>>>>>>> slightly better.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Jiangjie (Becket) Qin
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> On Fri, Nov 23, 2018 at 11:25 PM Piotr Nowojski <
>>>>>>>>>>>>>>> piotr@data-artisans.com
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Hi Becket,
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Ops, sorry I didn’t notice that you intend to reuse
>> existing
>>>>>>>>>>>>>>>>>> `TableFactory`. I don’t know why, but I assumed that you
>>>>> want
>>>>>>>> to
>>>>>>>>>>>>>>>> provide
>>>>>>>>>>>>>>>>> an
>>>>>>>>>>>>>>>>>> alternate way of writing the data.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Now that I hopefully understand the proposal, maybe we
>> could
>>>>>>>>>> rename
>>>>>>>>>>>>>>>>>> `cache()` to
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> void materialize()
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> or going step further
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> MaterializedTable materialize()
>>>>>>>>>>>>>>>>>> MaterializedTable createMaterializedView()
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> ?
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> The second option with returning a handle I think is more
>>>>>>>>> flexible
>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>> could provide features such as “refresh”/“delete” or
>>>>> generally
>>>>>>>>>>>>>>> speaking
>>>>>>>>>>>>>>>>>> manage the the view. In the future we could also think
>> about
>>>>>>>>>> adding
>>>>>>>>>>>>>>>> hooks
>>>>>>>>>>>>>>>>>> to automatically refresh view etc. It is also more
>> explicit
>>>>> -
>>>>>>>>>>>>>>>>>> materialization returning a new table handle will not have
>>>>> the
>>>>>>>>>> same
>>>>>>>>>>>>>>>>>> implicit side effects as adding a simple line of code like
>>>>>>>>>>>>>>> `b.cache()`
>>>>>>>>>>>>>>>>>> would have.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> It would also be more SQL like, making it more intuitive
>> for
>>>>>>>>> users
>>>>>>>>>>>>>>>>> already
>>>>>>>>>>>>>>>>>> familiar with the SQL.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Piotrek
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> On 23 Nov 2018, at 14:53, Becket Qin <
>> becket.qin@gmail.com
>>>>>> 
>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Hi Piotrek,
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> For the cache() method itself, yes, it is equivalent to
>>>>>>>>> creating
>>>>>>>>>> a
>>>>>>>>>>>>>>>>>> BUILT-IN
>>>>>>>>>>>>>>>>>>> materialized view with a lifecycle. That functionality is
>>>>>>>>> missing
>>>>>>>>>>>>>>>>> today,
>>>>>>>>>>>>>>>>>>> though. Not sure if I understand your question. Do you
>> mean
>>>>>>>> we
>>>>>>>>>>>>>>>> already
>>>>>>>>>>>>>>>>>> have
>>>>>>>>>>>>>>>>>>> the functionality and just need a syntax sugar?
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> What's more interesting in the proposal is do we want to
>>>>> stop
>>>>>>>>> at
>>>>>>>>>>>>>>>>> creating
>>>>>>>>>>>>>>>>>>> the materialized view? Or do we want to extend that in
>> the
>>>>>>>>> future
>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>> a
>>>>>>>>>>>>>>>>>> more
>>>>>>>>>>>>>>>>>>> useful unified data store distributed with Flink? And do
>> we
>>>>>>>>> want
>>>>>>>>>> to
>>>>>>>>>>>>>>>>> have
>>>>>>>>>>>>>>>>>> a
>>>>>>>>>>>>>>>>>>> mechanism allow more flexible user job pattern with their
>>>>> own
>>>>>>>>>> user
>>>>>>>>>>>>>>>>>> defined
>>>>>>>>>>>>>>>>>>> services. These considerations are much more
>> architectural.
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Jiangjie (Becket) Qin
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> On Fri, Nov 23, 2018 at 6:01 PM Piotr Nowojski <
>>>>>>>>>>>>>>>>> piotr@data-artisans.com>
>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> Interesting idea. I’m trying to understand the problem.
>>>>>>>> Isn’t
>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>> `cache()` call an equivalent of writing data to a sink
>> and
>>>>>>>>> later
>>>>>>>>>>>>>>>>> reading
>>>>>>>>>>>>>>>>>>>> from it? Where this sink has a limited live scope/live
>>>>> time?
>>>>>>>>> And
>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>> sink
>>>>>>>>>>>>>>>>>>>> could be implemented as in memory or a file sink?
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> If so, what’s the problem with creating a materialised
>>>>> view
>>>>>>>>>> from a
>>>>>>>>>>>>>>>>> table
>>>>>>>>>>>>>>>>>>>> “b” (from your document’s example) and reusing this
>>>>>>>>> materialised
>>>>>>>>>>>>>>>> view
>>>>>>>>>>>>>>>>>>>> later? Maybe we are lacking mechanisms to clean up
>>>>>>>>> materialised
>>>>>>>>>>>>>>>> views
>>>>>>>>>>>>>>>>>> (for
>>>>>>>>>>>>>>>>>>>> example when current session finishes)? Maybe we need
>> some
>>>>>>>>>>>>>>> syntactic
>>>>>>>>>>>>>>>>>> sugar
>>>>>>>>>>>>>>>>>>>> on top of it?
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> Piotrek
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> On 23 Nov 2018, at 07:21, Becket Qin <
>>>>> becket.qin@gmail.com
>>>>>>>>> 
>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> Thanks for the suggestion, Jincheng.
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> Yes, I think it makes sense to have a persist() with
>>>>>>>>>>>>>>>>> lifecycle/defined
>>>>>>>>>>>>>>>>>>>>> scope. I just added a section in the future work for
>>>>> this.
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> Jiangjie (Becket) Qin
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> On Fri, Nov 23, 2018 at 1:55 PM jincheng sun <
>>>>>>>>>>>>>>>>> sunjincheng121@gmail.com
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> Hi Jiangjie,
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> Thank you for the explanation about the name of
>>>>>>>> `cache()`, I
>>>>>>>>>>>>>>>>>> understand
>>>>>>>>>>>>>>>>>>>> why
>>>>>>>>>>>>>>>>>>>>>> you designed this way!
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> Another idea is whether we can specify a lifecycle for
>>>>>>>> data
>>>>>>>>>>>>>>>>>> persistence?
>>>>>>>>>>>>>>>>>>>>>> For example, persist (LifeCycle.SESSION), so that the
>>>>> user
>>>>>>>>> is
>>>>>>>>>>>>>>> not
>>>>>>>>>>>>>>>>>>>> worried
>>>>>>>>>>>>>>>>>>>>>> about data loss, and will clearly specify the time
>> range
>>>>>>>> for
>>>>>>>>>>>>>>>> keeping
>>>>>>>>>>>>>>>>>>>> time.
>>>>>>>>>>>>>>>>>>>>>> At the same time, if we want to expand, we can also
>>>>> share
>>>>>>>>> in a
>>>>>>>>>>>>>>>>> certain
>>>>>>>>>>>>>>>>>>>>>> group of session, for example:
>>>>>>>>> LifeCycle.SESSION_GROUP(...), I
>>>>>>>>>>>>>>> am
>>>>>>>>>>>>>>>>> not
>>>>>>>>>>>>>>>>>>>> sure,
>>>>>>>>>>>>>>>>>>>>>> just an immature suggestion, for reference only!
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> Bests,
>>>>>>>>>>>>>>>>>>>>>> Jincheng
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> Becket Qin <be...@gmail.com> 于2018年11月23日周五
>>>>>>>> 下午1:33写道:
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> Re: Jincheng,
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> Thanks for the feedback. Regarding cache() v.s.
>>>>>>>> persist(),
>>>>>>>>>>>>>>>>>> personally I
>>>>>>>>>>>>>>>>>>>>>>> find cache() to be more accurately describing the
>>>>>>>> behavior,
>>>>>>>>>>>>>>> i.e.
>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>> Table
>>>>>>>>>>>>>>>>>>>>>>> is cached for the session, but will be deleted after
>>>>> the
>>>>>>>>>>>>>>> session
>>>>>>>>>>>>>>>> is
>>>>>>>>>>>>>>>>>>>>>> closed.
>>>>>>>>>>>>>>>>>>>>>>> persist() seems a little misleading as people might
>>>>> think
>>>>>>>>> the
>>>>>>>>>>>>>>>> table
>>>>>>>>>>>>>>>>>>>> will
>>>>>>>>>>>>>>>>>>>>>>> still be there even after the session is gone.
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> Great point about mixing the batch and stream
>>>>> processing
>>>>>>>> in
>>>>>>>>>> the
>>>>>>>>>>>>>>>>> same
>>>>>>>>>>>>>>>>>>>> job.
>>>>>>>>>>>>>>>>>>>>>>> We should absolutely move towards that goal. I
>> imagine
>>>>>>>> that
>>>>>>>>>>>>>>> would
>>>>>>>>>>>>>>>>> be
>>>>>>>>>>>>>>>>>> a
>>>>>>>>>>>>>>>>>>>>>> huge
>>>>>>>>>>>>>>>>>>>>>>> change across the board, including sources, operators
>>>>> and
>>>>>>>>>>>>>>>>>>>> optimizations,
>>>>>>>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>>>>>> name some. Likely we will need several separate
>>>>> in-depth
>>>>>>>>>>>>>>>>> discussions.
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> Jiangjie (Becket) Qin
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> On Fri, Nov 23, 2018 at 5:14 AM Xingcan Cui <
>>>>>>>>>>>>>>> xingcanc@gmail.com>
>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> @Shaoxuan, I think the lifecycle or access domain
>> are
>>>>>>>> both
>>>>>>>>>>>>>>>>>> orthogonal
>>>>>>>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>>>>>>> the cache problem. Essentially, this may be the
>> first
>>>>>>>> time
>>>>>>>>>> we
>>>>>>>>>>>>>>>> plan
>>>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>>>>>>> introduce another storage mechanism other than the
>>>>>>>> state.
>>>>>>>>>>>>>>> Maybe
>>>>>>>>>>>>>>>>> it’s
>>>>>>>>>>>>>>>>>>>>>>> better
>>>>>>>>>>>>>>>>>>>>>>>> to first draw a big picture and then concentrate on
>> a
>>>>>>>>>> specific
>>>>>>>>>>>>>>>>> part?
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> @Becket, yes, actually I am more concerned with the
>>>>>>>>>> underlying
>>>>>>>>>>>>>>>>>>>> service.
>>>>>>>>>>>>>>>>>>>>>>>> This seems to be quite a major change to the
>> existing
>>>>>>>>>>>>>>> codebase.
>>>>>>>>>>>>>>>> As
>>>>>>>>>>>>>>>>>> you
>>>>>>>>>>>>>>>>>>>>>>>> claimed, the service should be extendible to support
>>>>>>>> other
>>>>>>>>>>>>>>>>>> components
>>>>>>>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>>>>>>>> we’d better discussed it in another thread.
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> All in all, I also eager to enjoy the more
>> interactive
>>>>>>>>> Table
>>>>>>>>>>>>>>>> API,
>>>>>>>>>>>>>>>>> in
>>>>>>>>>>>>>>>>>>>>>> case
>>>>>>>>>>>>>>>>>>>>>>>> of a general and flexible enough service mechanism.
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>>>>>>>>>> Xingcan
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> On Nov 22, 2018, at 10:16 AM, Xiaowei Jiang <
>>>>>>>>>>>>>>>> xiaoweij@gmail.com>
>>>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> Relying on a callback for the temp table for clean
>> up
>>>>>>>> is
>>>>>>>>>> not
>>>>>>>>>>>>>>>> very
>>>>>>>>>>>>>>>>>>>>>>>> reliable.
>>>>>>>>>>>>>>>>>>>>>>>>> There is no guarantee that it will be executed
>>>>>>>>>> successfully.
>>>>>>>>>>>>>>> We
>>>>>>>>>>>>>>>>> may
>>>>>>>>>>>>>>>>>>>>>>> risk
>>>>>>>>>>>>>>>>>>>>>>>>> leaks when that happens. I think that it's safer to
>>>>>>>> have
>>>>>>>>> an
>>>>>>>>>>>>>>>>>>>>>> association
>>>>>>>>>>>>>>>>>>>>>>>>> between temp table and session id. So we can always
>>>>>>>> clean
>>>>>>>>>> up
>>>>>>>>>>>>>>>> temp
>>>>>>>>>>>>>>>>>>>>>>> tables
>>>>>>>>>>>>>>>>>>>>>>>>> which are no longer associated with any active
>>>>>>>> sessions.
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>>>>>>>> Xiaowei
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> On Thu, Nov 22, 2018 at 12:55 PM jincheng sun <
>>>>>>>>>>>>>>>>>>>>>>> sunjincheng121@gmail.com>
>>>>>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> Hi Jiangjie&Shaoxuan,
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for initiating this great proposal!
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> Interactive Programming is very useful and user
>>>>>>>> friendly
>>>>>>>>>> in
>>>>>>>>>>>>>>>> case
>>>>>>>>>>>>>>>>>> of
>>>>>>>>>>>>>>>>>>>>>>> your
>>>>>>>>>>>>>>>>>>>>>>>>>> examples.
>>>>>>>>>>>>>>>>>>>>>>>>>> Moreover, especially when a business has to be
>>>>>>>> executed
>>>>>>>>> in
>>>>>>>>>>>>>>>>> several
>>>>>>>>>>>>>>>>>>>>>>>> stages
>>>>>>>>>>>>>>>>>>>>>>>>>> with dependencies,such as the pipeline of Flink
>> ML,
>>>>> in
>>>>>>>>>> order
>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>>>>>> utilize
>>>>>>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>>>>> intermediate calculation results we have to
>> submit a
>>>>>>>> job
>>>>>>>>>> by
>>>>>>>>>>>>>>>>>>>>>>>> env.execute().
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> About the `cache()`  , I think is better to named
>>>>>>>>>>>>>>> `persist()`,
>>>>>>>>>>>>>>>>> And
>>>>>>>>>>>>>>>>>>>>>> The
>>>>>>>>>>>>>>>>>>>>>>>>>> Flink framework determines whether we internally
>>>>> cache
>>>>>>>>> in
>>>>>>>>>>>>>>>> memory
>>>>>>>>>>>>>>>>>> or
>>>>>>>>>>>>>>>>>>>>>>>> persist
>>>>>>>>>>>>>>>>>>>>>>>>>> to the storage system,Maybe save the data into
>> state
>>>>>>>>>> backend
>>>>>>>>>>>>>>>>>>>>>>>>>> (MemoryStateBackend or RocksDBStateBackend etc.)
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> BTW, from the points of my view in the future,
>>>>> support
>>>>>>>>> for
>>>>>>>>>>>>>>>>>> streaming
>>>>>>>>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>>>>>>>>>> batch mode switching in the same job will also
>>>>> benefit
>>>>>>>>> in
>>>>>>>>>>>>>>>>>>>>>> "Interactive
>>>>>>>>>>>>>>>>>>>>>>>>>> Programming",  I am looking forward to your JIRAs
>>>>> and
>>>>>>>>>> FLIP!
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>>>>>>>>>>>> Jincheng
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> Becket Qin <be...@gmail.com> 于2018年11月20日周二
>>>>>>>>>> 下午9:56写道:
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>> As a few recent email threads have pointed out,
>> it
>>>>>>>> is a
>>>>>>>>>>>>>>>>> promising
>>>>>>>>>>>>>>>>>>>>>>>>>>> opportunity to enhance Flink Table API in various
>>>>>>>>>> aspects,
>>>>>>>>>>>>>>>>>>>>>> including
>>>>>>>>>>>>>>>>>>>>>>>>>>> functionality and ease of use among others. One
>> of
>>>>>>>> the
>>>>>>>>>>>>>>>>> scenarios
>>>>>>>>>>>>>>>>>>>>>>> where
>>>>>>>>>>>>>>>>>>>>>>>> we
>>>>>>>>>>>>>>>>>>>>>>>>>>> feel Flink could improve is interactive
>>>>> programming.
>>>>>>>> To
>>>>>>>>>>>>>>>> explain
>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>>>>> issues
>>>>>>>>>>>>>>>>>>>>>>>>>>> and facilitate the discussion on the solution, we
>>>>> put
>>>>>>>>>>>>>>>> together
>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>>>>>> following document with our proposal.
>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>> 
>>>>> 
>> https://docs.google.com/document/d/1d4T2zTyfe7hdncEUAxrlNOYr4e5IMNEZLyqSuuswkA0/edit?usp=sharing
>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>> Feedback and comments are very welcome!
>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>> Jiangjie (Becket) Qin
>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>> 
>>> 
>> 
>> 
>> 


Re: [DISCUSS] Support Interactive Programming in Flink Table API

Posted by Becket Qin <be...@gmail.com>.
Hi Piotrek,

Thanks again for the clarification. More replies follow below.

> But keep in mind that `.cache()` will/might not only be used in interactive
> programming and not only in batching.

It is true. Actually, in stream processing cache() has the same semantics as
in batch processing, namely:
for a table created via a series of computations, save that table for later
reference to avoid re-running the computation logic to regenerate it, and
once the application exits, drop all the caches.
This semantic is the same for both batch and stream processing. The
difference is that a stream application only runs once, as it is long
running, while a batch application may run multiple times, so the cache may
be created and dropped on each run.
Admittedly, there will probably be some resource management requirements
for a cached streaming table, such as time-based / size-based retention,
to address the infinite data issue. But such requirements do not change
the semantics.
You are right that interactive programming is just one use case of cache();
it is not the only one.
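
To make the intended semantics concrete, here is a minimal sketch of the
usage I have in mind (cache() is the proposed API; count() is used, as in
the earlier examples in this thread, as a shorthand for an action that
triggers execution, and the table/field names are illustrative only):

Table cleaned = tEnv.scan("Orders")          // some registered source table
    .filter("amount > 0")
    .select("userId, amount");
cleaned.cache();    // hint: save the result of `cleaned` once it is computed

cleaned.groupBy("userId").select("userId, amount.sum as total")
    .count();       // first job: computes `cleaned` and populates the cache

cleaned.select("userId")
    .count();       // second job: reads `cleaned` from the cache instead of
                    // re-running filter/select against the source

// when the application / session exits, the cached table is dropped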

> For me the more important issue is of not having the `void cache()` with
> side effects.

This is indeed the key point. The argument around whether cache() should
return something already indicates that cache() and materialize() address
different issues.
Can you explain a bit more on what the side effects are? So far my
understanding is that such side effects only exist if a table is mutable.
Is that the case?
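
For reference, here is how I currently read the concern, as a hypothetical
sketch (f() and g() stand for arbitrary user methods that build plans on
top of the same table):

Table b = source.groupBy(...).select(...);

Table t1 = f(b);  // built on top of b's original logical plan
b.cache();        // void cache(): silently replaces b's plan with a cache scan
Table t2 = g(b);  // now built on the cache scan instead of the original plan

// If the underlying source never changes, t1 and t2 behave as expected.
// If the source can change between jobs, t1 and t2 may see different data,
// which I assume is the side effect being referred to.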

> I don’t know, probably initially we should make CachedTable read-only. I
> don’t find it more confusing than the fact that user can not write to views
> or materialised views in SQL or that user currently can not write to a
> Table.

I don't think anyone should insert something into a cache. By definition,
the cache should only be updated when the corresponding original table is
updated. What I am wondering about is that, given the following two facts:
1. If and only if a table is mutable (with something like insert()), a
CachedTable may have implicit behavior.
2. A CachedTable extends a Table.
We can come to the conclusion that a CachedTable is mutable and users can
insert into the CachedTable directly. This is what I find confusing.
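
A hypothetical snippet to illustrate the point (assuming cache() returned a
CachedTable, and that Table had a write path such as insertInto()):

CachedTable cached = b.cache();
cached.insertInto("someSink");  // compiles, since a CachedTable is a Table --
                                // but what would writing into a cache mean?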

Thanks,

Jiangjie (Becket) Qin

On Sat, Dec 1, 2018 at 2:45 AM Piotr Nowojski <pi...@data-artisans.com>
wrote:

> Hi all,
>
> Regarding naming `cache()` vs `materialize()`. One more explanation why I
> think `materialize()` is more natural to me is that I think of all “Table”s
> in Table-API as views. They behave the same way as SQL views, the only
> difference for me is that their live scope is short - current session which
> is limited by a different execution model. That’s why “caching” a view for me
> is just materialising it.
>
> However I see and I understand your point of view. Coming from
> DataSet/DataStream and generally speaking non-SQL world, `cache()` is more
> natural. But keep in mind that `.cache()` will/might not only be used in
> interactive programming and not only in batching. But naming is one issue,
> and not that critical to me. Especially that once we implement proper
> materialised views, we can always deprecate/rename `cache()` if we deem so.
>
>
> For me the more important issue is of not having the `void cache()` with
> side effects. Exactly for the reasons that you have mentioned. True:
> results might be non deterministic if underlying source table are changing.
> Problem is that `void cache()` implicitly changes the semantic of
> subsequent uses of the cached/materialized Table. It can cause a “wtf” moment
> for a user if he inserts “b.cache()” call in some place in his code and
> suddenly some other random places are behaving differently. If
> `materialize()` or `cache()` returns a Table handle, we force user to
> explicitly use the cache which removes the “random” part from the "suddenly
> some other random places are behaving differently”.
>
> This argument and others that I’ve raised (greater flexibility/allowing
> user to explicitly bypass the cache) are independent of `cache()` vs
> `materialize()` discussion.
>
> > Does that mean one can also insert into the CachedTable? This sounds
> pretty confusing.
>
> I don’t know, probably initially we should make CachedTable read-only. I
> don’t find it more confusing than the fact that user can not write to views
> or materialised views in SQL or that user currently can not write to a
> Table.
>
> Piotrek
> >>>>>>>>>>>>>>>>>>>>>>>>> and facilitate the discussion on the solution, we
> >>> put
> >>>>>>>>>>>>>> together
> >>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>>>>>>>> following document with our proposal.
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>
> >>>
> https://docs.google.com/document/d/1d4T2zTyfe7hdncEUAxrlNOYr4e5IMNEZLyqSuuswkA0/edit?usp=sharing
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>> Feedback and comments are very welcome!
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>> Jiangjie (Becket) Qin
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>
> >>>>
> >>>
> >
> >
>
>
>

Re: [DISCUSS] Support Interactive Programming in Flink Table API

Posted by Shaoxuan Wang <ws...@gmail.com>.
Hi Piotrek,

Cache() should not affect semantics or business logic, and thus it will
not lead to random behavior/results; the underlying design should ensure
this. I did take your example as a valid anti-case, but Jiangjie is correct:
the source table in batch processing should be immutable. It is the user’s
responsibility to ensure that, because otherwise even a regular failover may
lead to inconsistent results. If you consider cache as an optimization hint,
rather than as a special case of a materialized view, it might be easier to
understand the problem we are trying to solve.
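
To make the intended semantics concrete, here is a rough sketch
(Table.cache() is the proposed API and does not exist in Flink yet; the
table and field names are made up):

Table summary = tEnv.scan("Orders")              // immutable batch source
    .groupBy("currency")
    .select("currency, amount.sum as total");

summary.cache();   // optimization hint: reuse the result within this session

Table big  = summary.filter("total > 1000");     // may be served from the cache
Table tiny = summary.filter("total <= 1000");    // same rows either way

Because the source is immutable for the lifetime of a batch session, reading
`summary` from the cache and recomputing it from the source must yield the
same rows; cache() only changes how often that work is done.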

Regards,
Shaoxuan


On Sat, Dec 1, 2018 at 2:45 AM Piotr Nowojski <pi...@data-artisans.com>
wrote:

> Hi all,
>
> Regarding naming `cache()` vs `materialize()`. One more explanation why I
> think `materialize()` is more natural to me is that I think of all “Table”s
> in Table-API as views. They behave the same way as SQL views, the only
> difference for me is that their live scope is short - the current session,
> which is limited by a different execution model. That’s why “caching” a view for me
> is just materialising it.
>
> However I see and I understand your point of view. Coming from
> DataSet/DataStream and generally speaking the non-SQL world, `cache()` is more
> natural. But keep in mind that `.cache()` will/might be used not only in
> interactive programming and not only in batching. But naming is one issue,
> and not that critical to me. Especially since once we implement proper
> materialised views, we can always deprecate/rename `cache()` if we see fit.
>
>
> For me the more important issue is that of not having the `void cache()`
> with side effects, exactly for the reasons that you have mentioned. True:
> results might be non-deterministic if the underlying source tables are
> changing. The problem is that `void cache()` implicitly changes the
> semantics of subsequent uses of the cached/materialized Table. It can cause
> a “wtf” moment for a user if he inserts a “b.cache()” call in some place in
> his code and suddenly some other random places are behaving differently. If
> `materialize()` or `cache()` returns a Table handle, we force the user to
> explicitly use the cache, which removes the “random” part from the "suddenly
> some other random places are behaving differently”.
>
> This argument and others that I’ve raised (greater flexibility/allowing
> user to explicitly bypass the cache) are independent of `cache()` vs
> `materialize()` discussion.
>
> > Does that mean one can also insert into the CachedTable? This sounds
> pretty confusing.
>
> I don’t know, probably initially we should make CachedTable read-only. I
> don’t find it more confusing than the fact that users cannot write to views
> or materialised views in SQL, or that a user currently cannot write to a
> Table.
>
> Piotrek
>
> > On 30 Nov 2018, at 17:38, Xingcan Cui <xi...@gmail.com> wrote:
> >
> > Hi all,
> >
> > I agree with @Becket that `cache()` and `materialize()` should be
> considered as two different methods where the latter one is more
> sophisticated.
> >
> > According to my understanding, the initial idea is just to introduce a
> simple cache or persist mechanism, but as the TableAPI is a high-level API,
> it’s natural for us to think in a SQL way.
> >
> > Maybe we can add the `cache()` method to the DataSet API and force users
> to translate a Table to a Dataset before caching it. Then the users should
> manually register the cached dataset to a table again (we may need some
> table replacement mechanisms for datasets with an identical schema but
> different contents here). After all, it’s the dataset rather than the
> dynamic table that needs to be cached, right?
> >
> > Best,
> > Xingcan
> >
> >> On Nov 30, 2018, at 10:57 AM, Becket Qin <be...@gmail.com> wrote:
> >>
> >> Hi Piotrek and Jark,
> >>
> >> Thanks for the feedback and explanation. Those are good arguments. But I
> >> think those arguments are mostly about materialized view. Let me try to
> >> explain the reason I believe cache() and materialize() are different.
> >>
> >> I think cache() and materialize() have quite different implications. An
> >> analogy I can think of is save()/publish(). When users call cache(), it
> is
> >> just like they are saving an intermediate result as a draft of their
> work,
> >> this intermediate result may not have any realistic meaning. Calling
> >> cache() does not mean users want to publish the cached table in any
> manner.
> >> But when users call materialize(), that means "I have something
> meaningful
> >> to be reused by others", now users need to think about the validation,
> >> update & versioning, lifecycle of the result, etc.
> >>
> >> Piotrek's suggestions on variations of the materialize() methods are
> very
> >> useful. It would be great if Flink have them. The concept of
> materialized
> >> view is actually a pretty big feature, not to say the related stuff like
> >> triggers/hooks you mentioned earlier. I think the materialized view
> itself
> >> should be discussed in a more thorough and systematic manner. And I
> found
> >> that discussion is kind of orthogonal and way beyond interactive
> >> programming experience.
> >>
> >> The example you gave was interesting. I still have some questions,
> though.
> >>
> >> Table source = … // some source that scans files from a directory
> >>> “/foo/bar/“
> >>> Table t1 = source.groupBy(…).select(…).where(…) ….;
> >>> Table t2 = t1.materialize() // (or `cache()`)
> >>
> >> t2.count() // initialise cache (if it’s lazily initialised)
> >>> int a1 = t1.count()
> >>> int b1 = t2.count()
> >>> // something in the background (or we trigger it) writes new files to
> >>> /foo/bar
> >>> int a2 = t1.count()
> >>> int b2 = t2.count()
> >>> t2.refresh() // possible future extension, not to be implemented in the
> >>> initial version
> >>>
> >>
> >> what if someone else added some more files to /foo/bar at this point? In
> >> that case, a3 won't equal b3, and the result becomes
> non-deterministic,
> >> right?
> >>
> >> int a3 = t1.count()
> >>> int b3 = t2.count()
> >>> t2.drop() // another possible future extension, manual “cache” dropping
> >>
> >>
> >> When we talk about interactive programming, in most cases, we are
> talking
> >> about batch applications. A fundamental assumption of such a case is that
> the
> >> source data is complete before the data processing begins, and the data
> >> will not change during the data processing. IMO, if additional rows
> need
> >> to be added to some source during the processing, it should be done in
> ways
> >> like union the source with another table containing the rows to be
> added.
> >>
> >> There are a few cases in which computations are executed repeatedly on the
> >> changing data source.
> >>
> >> For example, people may run a ML training job every hour with the
> samples
> >> newly added in the past hour. In that case, the source data between runs
> >> will indeed change. But still, the data remains unchanged within one run. And
> >> usually in that case, the result will need versioning, i.e. for a given
> >> result, it indicates that the result was derived from the source data as of
> >> a certain timestamp.
> >>
> >> Another example is something like a data warehouse. In this case, there
> are a
> >> few sources of original/raw data. On top of those sources, many
> materialized
> >> view / queries / reports / dashboards can be created to generate derived
> >> data. Those derived data need to be updated when the underlying original
> >> data changes. In that case, the processing logic that derives them from
> >> the original data needs to be executed repeatedly to update those
> >> reports/views.
> Again,
> >> all those derived data also need to have version management, such as
> >> timestamp.
> >>
> >> In any of the above two cases, during a single run of the processing
> logic,
> >> the data cannot change. Otherwise the behavior of the processing logic
> may
> >> be undefined. In the above two examples, when writing the processing
> logic,
> >> Users can use .cache() to hint Flink that those results should be saved
> to
> >> avoid repeated computation. And then for the result of my application
> >> logic, I'll call materialize(), so that these results could be managed
> by
> >> the system with versioning, metadata management, lifecycle management,
> >> ACLs, etc.
> >>
> >> It is true we can use materialize() to do the cache() job, but I am
> really
> >> reluctant to shoehorn cache() into materialize() and force users to
> worry
> >> about a bunch of implications that they shouldn't have to. I am
> absolutely on
> >> your side that redundant API is bad. But it is equally frustrating, if
> not
> >> more, that the same API does different things.
> >>
> >> Thanks,
> >>
> >> Jiangjie (Becket) Qin
> >>
> >>
> >> On Fri, Nov 30, 2018 at 10:34 PM Shaoxuan Wang <ws...@gmail.com>
> wrote:
> >>
> >>> Thanks Piotrek,
> >>> You provided a very good example, it explains all the confusions I
> have.
> >>> It is clear that there is something we have not considered in the
> initial
> >>> proposal. We intend to force the user to reuse the cached/materialized
> >>> table, if its cache() method is executed. We did not expect that user
> may
> >>> want to re-execute the plan from the source table. Let me re-think
> about
> >>> it and get back to you later.
> >>>
> >>> In the meantime, this example/observation also implies that we cannot
> fully
> >>> involve the optimizer to decide the plan if a cache/materialize is
> >>> explicitly used, because whether to reuse the cached data or re-execute
> the
> >>> query from source data may lead to different results. (But I guess
> >>> optimizer can still help in some cases ---- as long as it does not
> >>> re-execute from the varied source, we should be safe).
> >>>
> >>> Regards,
> >>> Shaoxuan
> >>>
> >>>
> >>>
> >>> On Fri, Nov 30, 2018 at 9:13 PM Piotr Nowojski <
> piotr@data-artisans.com>
> >>> wrote:
> >>>
> >>>> Hi Shaoxuan,
> >>>>
> >>>> Re 2:
> >>>>
> >>>>> Table t3 = methodThatAppliesOperators(t1) // t1 is modified to-> t1’
> >>>>
> >>>> What do you mean that “ t1 is modified to-> t1’ ” ? That
> >>>> `methodThatAppliesOperators()` method has changed its plan?
> >>>>
> >>>> I was thinking more about something like this:
> >>>>
> >>>> Table source = … // some source that scans files from a directory
> >>>> “/foo/bar/“
> >>>> Table t1 = source.groupBy(…).select(…).where(…) ….;
> >>>> Table t2 = t1.materialize() // (or `cache()`)
> >>>>
> >>>> t2.count() // initialise cache (if it’s lazily initialised)
> >>>>
> >>>> int a1 = t1.count()
> >>>> int b1 = t2.count()
> >>>>
> >>>> // something in the background (or we trigger it) writes new files to
> >>>> /foo/bar
> >>>>
> >>>> int a2 = t1.count()
> >>>> int b2 = t2.count()
> >>>>
> >>>> t2.refresh() // possible future extension, not to be implemented in
> the
> >>>> initial version
> >>>>
> >>>> int a3 = t1.count()
> >>>> int b3 = t2.count()
> >>>>
> >>>> t2.drop() // another possible future extension, manual “cache”
> dropping
> >>>>
> >>>> assertTrue(a1 == b1) // same results, but b1 comes from the “cache"
> >>>> assertTrue(b1 == b2) // both values come from the same cache
> >>>> assertTrue(a2 > b2) // b2 comes from cache, a2 re-executed full table
> >>> scan
> >>>> and has more data
> >>>> assertTrue(b3 > b2) // b3 comes from refreshed cache
> >>>> assertTrue(b3 == a2 && a2 == a3)
> >>>>
> >>>> Piotrek
> >>>>
> >>>>> On 30 Nov 2018, at 10:22, Jark Wu <im...@gmail.com> wrote:
> >>>>>
> >>>>> Hi,
> >>>>>
> >>>>> It is a very interesting and useful design!
> >>>>>
> >>>>> Here I want to share some of my thoughts:
> >>>>>
> >>>>> 1. Agree with that cache() method should return some Table to avoid
> >>> some
> >>>>> unexpected problems because of the mutable object.
> >>>>> All the existing methods of Table are returning a new Table instance.
> >>>>>
> >>>>> 2. I think materialize() would be more consistent with SQL, this
> makes
> >>> it
> >>>>> possible to support the same feature for SQL (materialize view) and
> >>> keep
> >>>>> the same API for users in the future.
> >>>>> But I'm also fine if we choose cache().
> >>>>>
> >>>>> 3. In the proposal, a TableService (or FlinkService?) is used to
> cache
> >>>> the
> >>>>> result of the (intermediate) table.
> >>>>> But the name of TableService may be a bit general and might not be
> >>>>> understood correctly at first glance (a metastore for
> tables?).
> >>>>> Maybe a more specific name would be better, such as TableCacheService
> >>> or
> >>>>> TableMaterializeService or something else.
> >>>>>
> >>>>> Best,
> >>>>> Jark
> >>>>>
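
For illustration only, such a service's surface might look roughly like
this (entirely hypothetical: no such interface exists in Flink, and the
names and signatures are invented just to show what a more specific name
like TableCacheService could cover):

public interface TableCacheService {
    /** Execute the table's plan, persist the result, and return a handle. */
    String cache(Table table);

    /** A table backed by a previously cached result. */
    Table read(String cacheId);

    /** Drop all cached results that belong to a closed session. */
    void releaseSession(String sessionId);
}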
> >>>>>
> >>>>> On Thu, 29 Nov 2018 at 21:16, Fabian Hueske <fh...@gmail.com> wrote:
> >>>>> [The rest of the deeply nested quoted history is trimmed here. It
> >>>>> repeats earlier messages in this thread (from Fabian, Shaoxuan, Piotr,
> >>>>> Becket, Flavio, jincheng, Xingcan and Xiaowei, plus the original
> >>>>> proposal announcement), each of which appears in full as its own post
> >>>>> earlier in the thread.]
>

Re: [DISCUSS] Support Interactive Programming in Flink Table API

Posted by Piotr Nowojski <pi...@data-artisans.com>.
Hi all,

Regarding naming `cache()` vs `materialize()`. One more explanation why I think `materialize()` is more natural to me is that I think of all “Table”s in Table-API as views. They behave the same way as SQL views, the only difference for me is that their live scope is short - the current session, which is limited by a different execution model. That’s why “caching” a view for me is just materialising it.

However I see and I understand your point of view. Coming from DataSet/DataStream and generally speaking the non-SQL world, `cache()` is more natural. But keep in mind that `.cache()` will/might be used not only in interactive programming and not only in batching. But naming is one issue, and not that critical to me. Especially since once we implement proper materialised views, we can always deprecate/rename `cache()` if we see fit.


For me the more important issue is that of not having the `void cache()` with side effects, exactly for the reasons that you have mentioned. True: results might be non-deterministic if the underlying source tables are changing. The problem is that `void cache()` implicitly changes the semantics of subsequent uses of the cached/materialized Table. It can cause a “wtf” moment for a user if he inserts a “b.cache()” call in some place in his code and suddenly some other random places are behaving differently. If `materialize()` or `cache()` returns a Table handle, we force the user to explicitly use the cache, which removes the “random” part from the "suddenly some other random places are behaving differently”.

This argument and others that I’ve raised (greater flexibility/allowing user to explicitly bypass the cache) are independent of `cache()` vs `materialize()` discussion.
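
To spell the difference out, here is a sketch of the two API shapes under
discussion (neither exists in Flink; the names follow this thread, and
count() is used as shorthand the same way it is elsewhere in the
discussion):

// Shape 1: the proposed `void cache()`. A side effect on `t`: every later
// use of `t`, anywhere in the program, silently switches to the cache.
Table t = tEnv.scan("Src").groupBy("k").select("k, v.sum as total");
t.cache();            // flips a flag on t
long c1 = t.count();  // first execution also populates the cache
long c2 = t.count();  // now implicitly served from the cache

// Shape 2: `CachedTable cache()` (or `materialize()`). An explicit handle:
// the user decides, per use, whether to read the cache or re-execute.
CachedTable cached = t.cache();
long fromCache  = cached.count();  // reads the materialized result
long recomputed = t.count();       // re-runs the original plan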

> Does that mean one can also insert into the CachedTable? This sounds pretty confusing.

I don’t know, probably initially we should make CachedTable read-only. I don’t find it more confusing than the fact that users cannot write to views or materialised views in SQL, or that a user currently cannot write to a Table.

Piotrek

> On 30 Nov 2018, at 17:38, Xingcan Cui <xi...@gmail.com> wrote:
> 
> Hi all,
> 
> I agree with @Becket that `cache()` and `materialize()` should be considered as two different methods where the latter one is more sophisticated.
> 
> According to my understanding, the initial idea is just to introduce a simple cache or persist mechanism, but as the TableAPI is a high-level API, it’s natural for us to think in a SQL way.
> 
> Maybe we can add the `cache()` method to the DataSet API and force users to translate a Table to a Dataset before caching it. Then the users should manually register the cached dataset to a table again (we may need some table replacement mechanisms for datasets with an identical schema but different contents here). After all, it’s the dataset rather than the dynamic table that needs to be cached, right?
> 
> Best,
> Xingcan
> 
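
A rough sketch of that workaround with the batch Table API of that time
(toDataSet() and registerDataSet() exist on BatchTableEnvironment; the
cache() call on the DataSet is the hypothetical piece that would have to
be added):

DataSet<Row> rows = tEnv.toDataSet(table, Row.class);
rows.cache();                           // hypothetical DataSet-level cache
tEnv.registerDataSet("CachedT", rows);  // re-register under a new name
Table cached = tEnv.scan("CachedT");    // later queries read the cached data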
>> On Nov 30, 2018, at 10:57 AM, Becket Qin <be...@gmail.com> wrote:
>> 
>> Hi Piotrek and Jark,
>> 
>> Thanks for the feedback and explanation. Those are good arguments. But I
>> think those arguments are mostly about materialized view. Let me try to
>> explain the reason I believe cache() and materialize() are different.
>> 
>> I think cache() and materialize() have quite different implications. An
>> analogy I can think of is save()/publish(). When users call cache(), it is
>> just like they are saving an intermediate result as a draft of their work,
>> this intermediate result may not have any realistic meaning. Calling
>> cache() does not mean users want to publish the cached table in any manner.
>> But when users call materialize(), that means "I have something meaningful
>> to be reused by others", now users need to think about the validation,
>> update & versioning, lifecycle of the result, etc.
>> 
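
A sketch of the kind of API split that distinction could imply (both calls
are hypothetical here; only the intent matters):

t.cache();                                     // draft: session-scoped, unmanaged
t.materialize("daily_report", "2018-11-30");  // published: named and versioned,
                                               // visible beyond this session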
>> Piotrek's suggestions on variations of the materialize() methods are very
>> useful. It would be great if Flink had them. The concept of materialized
>> view is actually a pretty big feature, not to mention the related stuff like
>> triggers/hooks you mentioned earlier. I think the materialized view itself
>> should be discussed in a more thorough and systematic manner. And I found
>> that discussion is kind of orthogonal and way beyond interactive
>> programming experience.
>> 
>> The example you gave was interesting. I still have some questions, though.
>> 
>> Table source = … // some source that scans files from a directory
>>> “/foo/bar/“
>>> Table t1 = source.groupBy(…).select(…).where(…) ….;
>>> Table t2 = t1.materialize() // (or `cache()`)
>> 
>> t2.count() // initialise cache (if it’s lazily initialised)
>>> int a1 = t1.count()
>>> int b1 = t2.count()
>>> // something in the background (or we trigger it) writes new files to
>>> /foo/bar
>>> int a2 = t1.count()
>>> int b2 = t2.count()
>>> t2.refresh() // possible future extension, not to be implemented in the
>>> initial version
>>> 
>> 
>> what if someone else added some more files to /foo/bar at this point? In
>> that case, a3 won't equal b3, and the result becomes non-deterministic,
>> right?
>> 
>> int a3 = t1.count()
>>> int b3 = t2.count()
>>> t2.drop() // another possible future extension, manual “cache” dropping
>> 
>> 
>> When we talk about interactive programming, in most cases, we are talking
>> about batch applications. A fundamental assumption of such a case is that the
>> source data is complete before the data processing begins, and the data
>> will not change during the data processing. IMO, if additional rows need
>> to be added to some source during the processing, it should be done in ways
>> like union the source with another table containing the rows to be added.
>> 
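
A minimal sketch of that union approach (unionAll() is existing Table API;
the table names are made up and cache() is the proposed method):

Table base = tEnv.scan("Samples");      // fixed when the job starts
Table late = tEnv.scan("LateSamples");  // rows that arrived afterwards
Table all  = base.unionAll(late);       // the addition is made explicit
all.cache();                            // snapshot of the combined input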
>> There are a few cases in which computations are executed repeatedly on the
>> changing data source.
>> 
>> For example, people may run a ML training job every hour with the samples
>> newly added in the past hour. In that case, the source data between runs
>> will indeed change. But still, the data remains unchanged within one run. And
>> usually in that case, the result will need versioning, i.e. for a given
>> result, it indicates that the result was derived from the source data as of
>> a certain timestamp.
>> 
>> Another example is something like a data warehouse. In this case, there are a
>> few sources of original/raw data. On top of those sources, many materialized
>> view / queries / reports / dashboards can be created to generate derived
>> data. Those derived data need to be updated when the underlying original
>> data changes. In that case, the processing logic that derives them from the
>> original data needs to be executed repeatedly to update those reports/views. Again,
>> all those derived data also need to have version management, such as
>> timestamp.
>> 
>> In any of the above two cases, during a single run of the processing logic,
>> the data cannot change. Otherwise the behavior of the processing logic may
>> be undefined. In the above two examples, when writing the processing logic,
>> users can use .cache() to hint Flink that those results should be saved to
>> avoid repeated computation. And then for the result of my application
>> logic, I'll call materialize(), so that these results could be managed by
>> the system with versioning, metadata management, lifecycle management,
>> ACLs, etc.
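>> 
>> To make the split concrete, a rough sketch of the hourly-training example
>> above (cache()/materialize() as proposed here; train() and evaluate() are
>> hypothetical user methods):
>> 
>> Table samples = tEnv.scan("samples_last_hour");
>> Table features = samples.groupBy("userId").select("userId, clicks.sum as cnt");
>> features.cache();                  // draft: avoid recomputing the features
>> Table model = train(features);     // reuses the cached intermediate result
>> Table report = evaluate(model, features);
>> report.materialize();              // published result: versioned and managed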
>> 
>> It is true we can use materialize() to do the cache() job, but I am really
>> reluctant to shoehorn cache() into materialize() and force users to worry
>> about a bunch of implications that they shouldn't have to. I am absolutely on
>> your side that redundant API is bad. But it is equally frustrating, if not
>> more, that the same API does different things.
>> 
>> Thanks,
>> 
>> Jiangjie (Becket) Qin
>> 
>> 
>> On Fri, Nov 30, 2018 at 10:34 PM Shaoxuan Wang <ws...@gmail.com> wrote:
>> 
>>> Thanks Piotrek,
>>> You provided a very good example, it explains all the confusions I have.
>>> It is clear that there is something we have not considered in the initial
>>> proposal. We intend to force the user to reuse the cached/materialized
>>> table, if its cache() method is executed. We did not expect that users may
>>> want to re-execute the plan from the source table. Let me re-think about
>>> it and get back to you later.
>>> 
>>> In the meantime, this example/observation also implies that we cannot fully
>>> involve the optimizer in deciding the plan if a cache/materialize is
>>> explicitly used, because whether to reuse the cached data or re-execute the
>>> query from the source data may lead to different results. (But I guess the
>>> optimizer can still help in some cases ---- as long as it does not
>>> re-execute from the varied source, we should be safe).
>>> 
>>> Regards,
>>> Shaoxuan
>>> 
>>> 
>>> 
>>> On Fri, Nov 30, 2018 at 9:13 PM Piotr Nowojski <pi...@data-artisans.com>
>>> wrote:
>>> 
>>>> Hi Shaoxuan,
>>>> 
>>>> Re 2:
>>>> 
>>>>> Table t3 = methodThatAppliesOperators(t1) // t1 is modified to-> t1’
>>>> 
>>>> What do you mean that “ t1 is modified to-> t1’ ” ? That
>>>> `methodThatAppliesOperators()` method has changed its plan?
>>>> 
>>>> I was thinking more about something like this:
>>>> 
>>>> Table source = … // some source that scans files from a directory
>>>> “/foo/bar/“
>>>> Table t1 = source.groupBy(…).select(…).where(…) ….;
>>>> Table t2 = t1.materialize() // (or `cache()`)
>>>> 
>>>> t2.count() // initialise cache (if it’s lazily initialised)
>>>> 
>>>> int a1 = t1.count()
>>>> int b1 = t2.count()
>>>> 
>>>> // something in the background (or we trigger it) writes new files to
>>>> /foo/bar
>>>> 
>>>> int a2 = t1.count()
>>>> int b2 = t2.count()
>>>> 
>>>> t2.refresh() // possible future extension, not to be implemented in the
>>>> initial version
>>>> 
>>>> int a3 = t1.count()
>>>> int b3 = t2.count()
>>>> 
>>>> t2.drop() // another possible future extension, manual “cache” dropping
>>>> 
>>>> assertTrue(a1 == b1) // same results, but b1 comes from the “cache"
>>>> assertTrue(b1 == b2) // both values come from the same cache
>>>> assertTrue(a2 > b2) // b2 comes from cache, a2 re-executed a full table
>>>> scan and has more data
>>>> assertTrue(b3 > b2) // b3 comes from refreshed cache
>>>> assertTrue(b3 == a2 && a2 == a3)
>>>> 
>>>> Piotrek
>>>> 
>>>>> On 30 Nov 2018, at 10:22, Jark Wu <im...@gmail.com> wrote:
>>>>> 
>>>>> Hi,
>>>>> 
>>>>> It is a very interesting and useful design!
>>>>> 
>>>>> Here I want to share some of my thoughts:
>>>>> 
>>>>> 1. Agree with that cache() method should return some Table to avoid
>>> some
>>>>> unexpected problems because of the mutable object.
>>>>> All the existing methods of Table are returning a new Table instance.
>>>>> 
>>>>> 2. I think materialize() would be more consistent with SQL; this makes it
>>>>> possible to support the same feature for SQL (materialized view) and keep
>>>>> the same API for users in the future.
>>>>> But I'm also fine if we choose cache().
>>>>> 
>>>>> 3. In the proposal, a TableService (or FlinkService?) is used to cache
>>>> the
>>>>> result of the (intermediate) table.
>>>>> But the name TableService may be a bit too general and easy to
>>>>> misunderstand at first glance (a metastore for tables?).
>>>>> Maybe a more specific name would be better, such as TableCacheService or
>>>>> TableMaterializeService or something else.
>>>>> 
>>>>> Best,
>>>>> Jark
>>>>> 
>>>>> 
>>>>> On Thu, 29 Nov 2018 at 21:16, Fabian Hueske <fh...@gmail.com> wrote:
>>>>> 
>>>>>> Hi,
>>>>>> 
>>>>>> Thanks for the clarification Becket!
>>>>>> 
>>>>>> I have a few thoughts to share / questions:
>>>>>> 
>>>>>> 1) I'd like to know how you plan to implement the feature on a plan /
>>>>>> planner level.
>>>>>> 
>>>>>> I would imagine the following to happen when Table.cache() is called:
>>>>>> 
>>>>>> 1) immediately optimize the Table and internally convert it into a
>>>>>> DataSet/DataStream. This is necessary, to avoid that operators of
>>> later
>>>>>> queries on top of the Table are pushed down.
>>>>>> 2) register the DataSet/DataStream as a DataSet/DataStream-backed Table X
>>>>>> 3) add a sink to the DataSet/DataStream. This is the materialization of the
>>>>>> Table X
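>>>>>> 
>>>>>> In rough pseudo-Java, those three steps could look like this (all internal
>>>>>> names are invented for illustration):
>>>>>> 
>>>>>> DataSet<Row> ds = translateToDataSet(t1);   // 1) freeze the optimized plan
>>>>>> tableEnv.registerDataSet("X", ds);          // 2) DataSet-backed Table X
>>>>>> ds.output(cacheSinkOutputFormat());         // 3) the materialization of X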
>>>>>> 
>>>>>> Based on your proposal the following would happen:
>>>>>> 
>>>>>> Table t1 = ....
>>>>>> t1.cache(); // cache() returns void. The logical plan of t1 is
>>> replaced
>>>> by
>>>>>> a scan of X. There is also a reference to the materialization of X.
>>>>>> 
>>>>>> t1.count(); // this executes the program, including the
>>>> DataSet/DataStream
>>>>>> that backs X and the sink that writes the materialization of X
>>>>>> t1.count(); // this executes the program, but reads X from the
>>>>>> materialization.
>>>>>> 
>>>>>> My question is, how do you determine whether the scan of t1 should go
>>>>>> against the DataSet/DataStream program and when it should go against the
>>>>>> materialization?
>>>>>> AFAIK, there is no hook that will tell you that a part of the program
>>>> was
>>>>>> executed. Flipping a switch during optimization or plan generation is
>>>> not
>>>>>> sufficient as there is no guarantee that the plan is also executed.
>>>>>> 
>>>>>> Overall, this behavior is somewhat similar to what I proposed in
>>>>>> FLINK-8950, which does not include persisting the table, but just
>>>>>> optimizing and reregistering it as DataSet/DataStream scan.
>>>>>> 
>>>>>> 2) I think Piotr has a point about the implicit behavior and side
>>>> effects
>>>>>> of the cache() method if it does not return anything.
>>>>>> Consider the following example:
>>>>>> 
>>>>>> Table t1 = ???
>>>>>> Table t2 = methodThatAppliesOperators(t1);
>>>>>> Table t3 = methodThatAppliesOtherOperators(t1);
>>>>>> 
>>>>>> In this case, the behavior/performance of the plan that results from
>>> the
>>>>>> second method call depends on whether t1 was modified by the first
>>>> method
>>>>>> or not.
>>>>>> This is the classic issue of mutable vs. immutable objects.
>>>>>> Also, as Piotr pointed out, it might also be good to have the original
>>>> plan
>>>>>> of t1, because in some cases it is possible to push filters down such
>>>> that
>>>>>> evaluating the query from scratch might be more efficient than
>>> accessing
>>>>>> the cache.
>>>>>> Moreover, a CachedTable could extend Table and offer a method refresh().
>>>>>> This sounds quite useful in an interactive session mode.
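>>>>>> 
>>>>>> A minimal sketch of such a handle (names illustrative, not a settled API):
>>>>>> 
>>>>>> public interface CachedTable extends Table {
>>>>>>   void refresh(); // re-run the original plan and replace the cached data
>>>>>>   void drop();    // release the underlying temporary table early
>>>>>> }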
>>>>>> 
>>>>>> 3) Regarding the name, I can see both arguments. IMO, materialize()
>>>> seems
>>>>>> to be more future proof.
>>>>>> 
>>>>>> Best, Fabian
>>>>>> 
>>>>>> On Thu, Nov 29, 2018 at 12:56, Shaoxuan Wang <wshaoxuan@gmail.com>
>>>>>> wrote:
>>>>>> 
>>>>>>> Hi Piotr,
>>>>>>> 
>>>>>>> Thanks for sharing your ideas on the method naming. We will think
>>> about
>>>>>>> your suggestions. But I don't understand why we need to change the
>>>> return
>>>>>>> type of cache().
>>>>>>> 
>>>>>>> Cache() is a physical operation, it does not change the logic of
>>>>>>> the `Table`. On the tableAPI layer, we should not introduce a new
>>> table
>>>>>>> type unless the logic of table has been changed. If we introduce a
>>> new
>>>>>>> table type `CachedTable`, we need to create the same set of methods of
>>>>>> `Table`
>>>>>>> for it. I don't think it is worth doing this. Or can you please
>>>> elaborate
>>>>>>> more on what could be the "implicit behaviours/side effects" you are
>>>>>>> thinking about?
>>>>>>> 
>>>>>>> Regards,
>>>>>>> Shaoxuan
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> On Thu, Nov 29, 2018 at 7:05 PM Piotr Nowojski <
>>>> piotr@data-artisans.com>
>>>>>>> wrote:
>>>>>>> 
>>>>>>>> Hi Becket,
>>>>>>>> 
>>>>>>>> Thanks for the response.
>>>>>>>> 
>>>>>>>> 1. I wasn’t saying that materialised view must be mutable or not.
>>> The
>>>>>>> same
>>>>>>>> thing applies to caches as well. To the contrary, I would expect
>>> more
>>>>>>>> consistency and updates from something that is called “cache” vs
>>>>>>> something
>>>>>>>> that’s a “materialised view”. In other words, IMO most caches do not
>>>>>>> serve
>>>>>>>> you invalid/outdated data and they handle updates on their own.
>>>>>>>> 
>>>>>>>> 2. I don’t think that having in the future two very similar concepts
>>>> of
>>>>>>>> `materialized` view and `cache` is a good idea. It would be
>>> confusing
>>>>>> for
>>>>>>>> the users. I think it could be handled by variations/overloading of
>>>>>>>> materialised view concept. We could start with:
>>>>>>>> 
>>>>>>>> `MaterializedTable materialize()` - immutable, session life scope
>>>>>>>> (basically the same semantics as you are proposing)
>>>>>>>> 
>>>>>>>> And then in the future (if ever) build on top of that/expand it
>>> with:
>>>>>>>> 
>>>>>>>> `MaterializedTable materialize(refreshTime=…)` or `MaterializedTable
>>>>>>>> materialize(refreshHook=…)`
>>>>>>>> 
>>>>>>>> Or with cross session support:
>>>>>>>> 
>>>>>>>> `MaterializedTable materializeInto(connector=…)` or
>>> `MaterializedTable
>>>>>>>> materializeInto(tableFactory=…)`
>>>>>>>> 
>>>>>>>> I'm not saying that we should implement cross session/refreshing now or
>>>>>>>> even in the near future. I'm just arguing that naming the current
>>>>>>>> immutable, session-scoped method `materialize()` is more future proof and
>>>>>>>> more consistent with SQL (on which, after all, the Table API is heavily
>>>>>>>> based).
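>>>>>>>> 
>>>>>>>> For illustration, the corresponding Java signatures might read (parameter
>>>>>>>> types are placeholders, nothing here is a committed API):
>>>>>>>> 
>>>>>>>> MaterializedTable materialize();                          // immutable, session scope
>>>>>>>> MaterializedTable materialize(Duration refreshInterval);  // future: auto refresh
>>>>>>>> MaterializedTable materializeInto(TableFactory factory);  // future: cross-session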
>>>>>>>> 
>>>>>>>> 3. Even if we agree on naming it `cache()`, I would still insist on
>>>>>>>> `cache()` returning `CachedTable` handle to avoid implicit
>>>>>>> behaviours/side
>>>>>>>> effects and to give both us & users more flexibility.
>>>>>>>> 
>>>>>>>> Piotrek
>>>>>>>> 
>>>>>>>>> On 29 Nov 2018, at 06:20, Becket Qin <be...@gmail.com> wrote:
>>>>>>>>> 
>>>>>>>>> Just to add a little bit, the materialized view is probably more
>>>>>>> similar
>>>>>>>> to
>>>>>>>>> the persistent() brought up earlier in the thread. So it is usually
>>>>>>> cross
>>>>>>>>> session and could be used in a larger scope. For example, a
>>>>>>> materialized
>>>>>>>>> view created by user A may be visible to user B. It is probably
>>>>>>> something
>>>>>>>>> we want to have in the future. I'll put it in the future work
>>>>>> section.
>>>>>>>>> 
>>>>>>>>> Thanks,
>>>>>>>>> 
>>>>>>>>> Jiangjie (Becket) Qin
>>>>>>>>> 
>>>>>>>>> On Thu, Nov 29, 2018 at 9:47 AM Becket Qin <be...@gmail.com>
>>>>>>> wrote:
>>>>>>>>> 
>>>>>>>>>> Hi Piotrek,
>>>>>>>>>> 
>>>>>>>>>> Thanks for the explanation.
>>>>>>>>>> 
>>>>>>>>>> Right now we are mostly thinking of the cached table as
>>> immutable. I
>>>>>>> can
>>>>>>>>>> see the Materialized view would be useful in the future. That
>>> said,
>>>>>> I
>>>>>>>> think
>>>>>>>>>> a simple cache mechanism is probably still needed. So to me,
>>> cache()
>>>>>>> and
>>>>>>>>>> materialize() should be two separate methods as they address different
>>>>>>>>>> needs. Materialize() is a higher-level concept usually implying
>>>>>>>> periodical
>>>>>>>>>> update, while cache() has much simpler semantic. For example, one
>>>>>> may
>>>>>>>>>> create a materialized view and use cache() method in the
>>>>>> materialized
>>>>>>>> view
>>>>>>>>>> creation logic. So that during the materialized view update, they
>>> do
>>>>>>> not
>>>>>>>>>> need to worry about the case that the cached table is also
>>> changed.
>>>>>>>> Maybe
>>>>>>>>>> under the hood, materialize() and cache() could share some
>>>>>> mechanism,
>>>>>>>> but
>>>>>>>>>> I think a simple cache() method would be handy in a lot of cases.
>>>>>>>>>> 
>>>>>>>>>> Thanks,
>>>>>>>>>> 
>>>>>>>>>> Jiangjie (Becket) Qin
>>>>>>>>>> 
>>>>>>>>>> On Mon, Nov 26, 2018 at 9:38 PM Piotr Nowojski <
>>>>>>> piotr@data-artisans.com
>>>>>>>>> 
>>>>>>>>>> wrote:
>>>>>>>>>> 
>>>>>>>>>>> Hi Becket,
>>>>>>>>>>> 
>>>>>>>>>>>> Is there any extra thing user can do on a MaterializedTable that
>>>>>>> they
>>>>>>>>>>> cannot do on a Table?
>>>>>>>>>>> 
>>>>>>>>>>> Maybe not in the initial implementation, but various DBs offer
>>>>>>>> different
>>>>>>>>>>> ways to “refresh” the materialised view. Hooks, triggers, timers,
>>>>>>>> manually
>>>>>>>>>>> etc. Having `MaterializedTable` would help us to handle that in
>>> the
>>>>>>>> future.
>>>>>>>>>>> 
>>>>>>>>>>>> After users call *table.cache(), *users can just use that table
>>>>>> and
>>>>>>> do
>>>>>>>>>>> anything that is supported on a Table, including SQL.
>>>>>>>>>>> 
>>>>>>>>>>> This is some implicit behaviour with side effects. Imagine if
>>> user
>>>>>>> has
>>>>>>>> a
>>>>>>>>>>> long and complicated program, that touches table `b` multiple
>>>>>> times,
>>>>>>>> maybe
>>>>>>>>>>> scattered around different methods. If he modifies his program by
>>>>>>>> inserting
>>>>>>>>>>> in one place
>>>>>>>>>>> 
>>>>>>>>>>> b.cache()
>>>>>>>>>>> 
>>>>>>>>>>> This implicitly alters the semantics and behaviour of his code all
>>>>>>>>>>> over the place, maybe in ways that might cause problems. For example,
>>>>>>> what
>>>>>>>> if
>>>>>>>>>>> underlying data is changing?
>>>>>>>>>>> 
>>>>>>>>>>> Having invisible side effects is also not very clean, for example
>>>>>>> think
>>>>>>>>>>> about something like this (but more complicated):
>>>>>>>>>>> 
>>>>>>>>>>> Table b = ...;
>>>>>>>>>>> 
>>>>>>>>>>> if (someCondition) {
>>>>>>>>>>>   processTable1(b);  // suppose b.cache() is added only inside here
>>>>>>>>>>> } else {
>>>>>>>>>>>   processTable2(b);
>>>>>>>>>>> }
>>>>>>>>>>> 
>>>>>>>>>>> // do more stuff with b
>>>>>>>>>>> 
>>>>>>>>>>> And user adds `b.cache()` call to only one of the `processTable1`
>>>>>> or
>>>>>>>>>>> `processTable2` methods.
>>>>>>>>>>> 
>>>>>>>>>>> On the other hand
>>>>>>>>>>> 
>>>>>>>>>>> Table materialisedB = b.materialize()
>>>>>>>>>>> 
>>>>>>>>>>> Avoids (at least some of) the side effect issues and forces user
>>> to
>>>>>>>>>>> explicitly use `materialisedB` where it’s appropriate and forces
>>>>>> user
>>>>>>>> to
>>>>>>>>>>> think what does it actually mean. And if something doesn’t work
>>> in
>>>>>>> the
>>>>>>>> end
>>>>>>>>>>> for the user, he will know what he has changed instead of blaming
>>>>>>>> Flink for
>>>>>>>>>>> some “magic” underneath. In the above example, after
>>> materialising
>>>>>> b
>>>>>>> in
>>>>>>>>>>> only one of the methods, he should/would realise the issue when
>>>>>>>>>>> handling the return value `MaterializedTable` of that method.
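>>>>>>>>>>> 
>>>>>>>>>>> E.g. the change then surfaces in the method signature (a sketch based
>>>>>>>>>>> on the example above):
>>>>>>>>>>> 
>>>>>>>>>>> MaterializedTable processTable1(Table b) {
>>>>>>>>>>>   // ... apply operators to b ...
>>>>>>>>>>>   return b.materialize(); // the caller is forced to handle the new type
>>>>>>>>>>> }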
>>>>>>>>>>> 
>>>>>>>>>>> I guess it comes down to personal preference whether you like things
>>>>>>>>>>> to be implicit or not. The more of a power user someone is, the more
>>>>>>>>>>> likely they are to like/understand implicit behaviour. And we as Table
>>>>>>>>>>> API designers are the most power users out there, so I would proceed
>>>>>>>>>>> with caution (so that we do not end up in the crazy perl realm with its
>>>>>>>>>>> lovely implicit method arguments ;)
>>>>>>>>>>> <https://stackoverflow.com/a/14922656/8149051>)
>>>>>>>>>>> 
>>>>>>>>>>>> Table API to also support non-relational processing cases,
>>> cache()
>>>>>>>>>>> might be slightly better.
>>>>>>>>>>> 
>>>>>>>>>>> I think even such extended Table API could benefit from sticking
>>>>>>>> to/being
>>>>>>>>>>> consistent with SQL where both SQL and Table API are basically
>>> the
>>>>>>>> same.
>>>>>>>>>>> 
>>>>>>>>>>> One more thing. `MaterializedTable materialize()` could be more
>>>>>>>>>>> powerful/flexible allowing the user to operate both on
>>> materialised
>>>>>>>> and not
>>>>>>>>>>> materialised view at the same time for whatever reasons
>>> (underlying
>>>>>>>> data
>>>>>>>>>>> changing/better optimisation opportunities after pushing down
>>> more
>>>>>>>> filters
>>>>>>>>>>> etc). For example:
>>>>>>>>>>> 
>>>>>>>>>>> Table b = …;
>>>>>>>>>>> 
>>>>>>>>>>> MaterializedTable mb = b.materialize();
>>>>>>>>>>> 
>>>>>>>>>>> val min = mb.min();
>>>>>>>>>>> val max = mb.max();
>>>>>>>>>>> 
>>>>>>>>>>> val user42 = b.filter('userId === 42);
>>>>>>>>>>> 
>>>>>>>>>>> Could be more efficient compared to `b.cache()` if `filter('userId ===
>>>>>>>>>>> 42)` allows for much more aggressive optimisations.
>>>>>>>>>>> 
>>>>>>>>>>> Piotrek
>>>>>>>>>>> 
>>>>>>>>>>>> On 26 Nov 2018, at 12:14, Fabian Hueske <fh...@gmail.com>
>>>>>> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>> I'm not suggesting to add support for Ignite. This was just an
>>>>>>>> example.
>>>>>>>>>>>> Plasma and Arrow sound interesting, too.
>>>>>>>>>>>> For the sake of this proposal, it would be up to the user to
>>>>>>>> implement a
>>>>>>>>>>>> TableFactory and corresponding TableSource / TableSink classes
>>> to
>>>>>>>>>>> persist
>>>>>>>>>>>> and read the data.
>>>>>>>>>>>> 
>>>>>>>>>>>> On Mon, Nov 26, 2018 at 12:06, Flavio Pompermaier
>>>>>>>>>>>> <pompermaier@okkam.it> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>>> What about to add also Apache Plasma + Arrow as an alternative
>>> to
>>>>>>>>>>> Apache
>>>>>>>>>>>>> Ignite?
>>>>>>>>>>>>> [1]
>>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>> 
>>>>>> 
>>> https://arrow.apache.org/blog/2017/08/08/plasma-in-memory-object-store/
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On Mon, Nov 26, 2018 at 11:56 AM Fabian Hueske <
>>>>>> fhueske@gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Thanks for the proposal!
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> To summarize, you propose a new method Table.cache(): Table
>>> that
>>>>>>>> will
>>>>>>>>>>>>>> trigger a job and write the result into some temporary storage
>>>>>> as
>>>>>>>>>>> defined
>>>>>>>>>>>>>> by a TableFactory.
>>>>>>>>>>>>>> The cache() call blocks while the job is running and
>>> eventually
>>>>>>>>>>> returns a
>>>>>>>>>>>>>> Table object that represents a scan of the temporary table.
>>>>>>>>>>>>>> When the "session" is closed (closing to be defined?), the
>>>>>>> temporary
>>>>>>>>>>>>> tables
>>>>>>>>>>>>>> are all dropped.
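>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> If I read the proposal right, usage would look roughly like this (the
>>>>>>>>>>>>>> temp-table internals are my guess):
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Table t = tEnv.scan("src").groupBy("k").select("k, v.sum as total");
>>>>>>>>>>>>>> Table cached = t.cache(); // blocks: runs a job, writes a temp table
>>>>>>>>>>>>>> cached.count();           // served by a scan of the temporary table
>>>>>>>>>>>>>> // session closes -> all temporary tables are dropped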
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> I think this behavior makes sense and is a good first step
>>>>>> towards
>>>>>>>>>>> more
>>>>>>>>>>>>>> interactive workloads.
>>>>>>>>>>>>>> However, its performance suffers from writing to and reading
>>>>>> from
>>>>>>>>>>>>> external
>>>>>>>>>>>>>> systems.
>>>>>>>>>>>>>> I think this is OK for now. Changes that would significantly
>>>>>>> improve
>>>>>>>>>>> the
>>>>>>>>>>>>>> situation (i.e., pinning data in-memory across jobs) would
>>> have
>>>>>>>> large
>>>>>>>>>>>>>> impacts on many components of Flink.
>>>>>>>>>>>>>> Users could use in-memory filesystems or storage grids (Apache
>>>>>>>>>>> Ignite) to
>>>>>>>>>>>>>> mitigate some of the performance effects.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Best, Fabian
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On Mon, Nov 26, 2018 at 03:38, Becket Qin <becket.qin@gmail.com>
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Thanks for the explanation, Piotrek.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Is there any extra thing user can do on a MaterializedTable
>>>>>> that
>>>>>>>> they
>>>>>>>>>>>>>>> cannot do on a Table? After users call *table.cache(), *users
>>>>>> can
>>>>>>>>>>> just
>>>>>>>>>>>>>> use
>>>>>>>>>>>>>>> that table and do anything that is supported on a Table,
>>>>>>> including
>>>>>>>>>>> SQL.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Naming wise, either cache() or materialize() sounds fine to
>>> me.
>>>>>>>>>>> cache()
>>>>>>>>>>>>>> is
>>>>>>>>>>>>>>> a bit more general than materialize(). Given that we are
>>>>>>> enhancing
>>>>>>>>>>> the
>>>>>>>>>>>>>>> Table API to also support non-relational processing cases,
>>>>>>> cache()
>>>>>>>>>>>>> might
>>>>>>>>>>>>>> be
>>>>>>>>>>>>>>> slightly better.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Jiangjie (Becket) Qin
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> On Fri, Nov 23, 2018 at 11:25 PM Piotr Nowojski <
>>>>>>>>>>>>> piotr@data-artisans.com
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Hi Becket,
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Oops, sorry I didn't notice that you intend to reuse the existing
>>>>>>>>>>>>>>>> `TableFactory`. I don’t know why, but I assumed that you
>>> want
>>>>>> to
>>>>>>>>>>>>>> provide
>>>>>>>>>>>>>>> an
>>>>>>>>>>>>>>>> alternate way of writing the data.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Now that I hopefully understand the proposal, maybe we could
>>>>>>>> rename
>>>>>>>>>>>>>>>> `cache()` to
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> void materialize()
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> or going step further
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> MaterializedTable materialize()
>>>>>>>>>>>>>>>> MaterializedTable createMaterializedView()
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> ?
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> The second option with returning a handle I think is more
>>>>>>> flexible
>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>> could provide features such as “refresh”/“delete” or
>>> generally
>>>>>>>>>>>>> speaking
>>>>>>>>>>>>>>>> manage the the view. In the future we could also think about
>>>>>>>> adding
>>>>>>>>>>>>>> hooks
>>>>>>>>>>>>>>>> to automatically refresh view etc. It is also more explicit
>>> -
>>>>>>>>>>>>>>>> materialization returning a new table handle will not have
>>> the
>>>>>>>> same
>>>>>>>>>>>>>>>> implicit side effects as adding a simple line of code like
>>>>>>>>>>>>> `b.cache()`
>>>>>>>>>>>>>>>> would have.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> It would also be more SQL like, making it more intuitive for
>>>>>>> users
>>>>>>>>>>>>>>> already
>>>>>>>>>>>>>>>> familiar with the SQL.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Piotrek
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> On 23 Nov 2018, at 14:53, Becket Qin <becket.qin@gmail.com
>>>> 
>>>>>>>> wrote:
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Hi Piotrek,
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> For the cache() method itself, yes, it is equivalent to
>>>>>>> creating
>>>>>>>> a
>>>>>>>>>>>>>>>> BUILT-IN
>>>>>>>>>>>>>>>>> materialized view with a lifecycle. That functionality is
>>>>>>> missing
>>>>>>>>>>>>>>> today,
>>>>>>>>>>>>>>>>> though. Not sure if I understand your question. Do you mean we
>>>>>>>>>>>>>>>>> already have the functionality and just need some syntactic sugar?
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> What's more interesting in the proposal is: do we want to stop at
>>>>>>>>>>>>>>>>> creating the materialized view? Or do we want to extend that in the
>>>>>>>>>>>>>>>>> future to a more useful unified data store distributed with Flink?
>>>>>>>>>>>>>>>>> And do we want to have a mechanism that allows more flexible user
>>>>>>>>>>>>>>>>> job patterns with their own user-defined services? These
>>>>>>>>>>>>>>>>> considerations are much more architectural.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Jiangjie (Becket) Qin
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> On Fri, Nov 23, 2018 at 6:01 PM Piotr Nowojski <
>>>>>>>>>>>>>>> piotr@data-artisans.com>
>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Interesting idea. I’m trying to understand the problem.
>>>>>> Isn’t
>>>>>>>> the
>>>>>>>>>>>>>>>>>> `cache()` call an equivalent of writing data to a sink and
>>>>>>> later
>>>>>>>>>>>>>>> reading
>>>>>>>>>>>>>>>>>> from it? Where this sink has a limited live scope/live
>>> time?
>>>>>>> And
>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>> sink
>>>>>>>>>>>>>>>>>> could be implemented as in memory or a file sink?
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> If so, what’s the problem with creating a materialised
>>> view
>>>>>>>> from a
>>>>>>>>>>>>>>> table
>>>>>>>>>>>>>>>>>> “b” (from your document’s example) and reusing this
>>>>>>> materialised
>>>>>>>>>>>>>> view
>>>>>>>>>>>>>>>>>> later? Maybe we are lacking mechanisms to clean up
>>>>>>> materialised
>>>>>>>>>>>>>> views
>>>>>>>>>>>>>>>> (for
>>>>>>>>>>>>>>>>>> example when current session finishes)? Maybe we need some
>>>>>>>>>>>>> syntactic
>>>>>>>>>>>>>>>> sugar
>>>>>>>>>>>>>>>>>> on top of it?
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Piotrek
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> On 23 Nov 2018, at 07:21, Becket Qin <
>>> becket.qin@gmail.com
>>>>>>> 
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Thanks for the suggestion, Jincheng.
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Yes, I think it makes sense to have a persist() with
>>>>>>>>>>>>>>> lifecycle/defined
>>>>>>>>>>>>>>>>>>> scope. I just added a section in the future work for
>>> this.
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Jiangjie (Becket) Qin
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> On Fri, Nov 23, 2018 at 1:55 PM jincheng sun <
>>>>>>>>>>>>>>> sunjincheng121@gmail.com
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> Hi Jiangjie,
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> Thank you for the explanation about the name of `cache()`, I
>>>>>>>>>>>>>>>>>>>> understand why you designed it this way!
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> Another idea is whether we can specify a lifecycle for data
>>>>>>>>>>>>>>>>>>>> persistence? For example, persist(LifeCycle.SESSION), so that the
>>>>>>>>>>>>>>>>>>>> user is not worried about data loss, and the retention period is
>>>>>>>>>>>>>>>>>>>> clearly specified.
>>>>>>>>>>>>>>>>>>>> At the same time, if we want to expand, we could also share within
>>>>>>>>>>>>>>>>>>>> a certain group of sessions, for example:
>>>>>>>>>>>>>>>>>>>> LifeCycle.SESSION_GROUP(...). I am not sure, just an immature
>>>>>>>>>>>>>>>>>>>> suggestion, for reference only!
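>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> A sketch of how that could read (enum and values illustrative only):
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> table.persist(LifeCycle.SESSION);                 // dropped with the session
>>>>>>>>>>>>>>>>>>>> table.persist(LifeCycle.SESSION_GROUP("team-a")); // shared in a session group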
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> Bests,
>>>>>>>>>>>>>>>>>>>> Jincheng
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> On Fri, Nov 23, 2018 at 1:33 PM, Becket Qin <be...@gmail.com>
>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> Re: Jincheng,
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> Thanks for the feedback. Regarding cache() v.s.
>>>>>> persist(),
>>>>>>>>>>>>>>>> personally I
>>>>>>>>>>>>>>>>>>>>> find cache() to be more accurately describing the
>>>>>> behavior,
>>>>>>>>>>>>> i.e.
>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>> Table
>>>>>>>>>>>>>>>>>>>>> is cached for the session, but will be deleted after
>>> the
>>>>>>>>>>>>> session
>>>>>>>>>>>>>> is
>>>>>>>>>>>>>>>>>>>> closed.
>>>>>>>>>>>>>>>>>>>>> persist() seems a little misleading as people might
>>> think
>>>>>>> the
>>>>>>>>>>>>>> table
>>>>>>>>>>>>>>>>>> will
>>>>>>>>>>>>>>>>>>>>> still be there even after the session is gone.
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> Great point about mixing the batch and stream
>>> processing
>>>>>> in
>>>>>>>> the
>>>>>>>>>>>>>>> same
>>>>>>>>>>>>>>>>>> job.
>>>>>>>>>>>>>>>>>>>>> We should absolutely move towards that goal. I imagine
>>>>>> that
>>>>>>>>>>>>> would
>>>>>>>>>>>>>>> be
>>>>>>>>>>>>>>>> a
>>>>>>>>>>>>>>>>>>>> huge
>>>>>>>>>>>>>>>>>>>>> change across the board, including sources, operators
>>> and
>>>>>>>>>>>>>>>>>> optimizations,
>>>>>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>>>> name some. Likely we will need several separate
>>> in-depth
>>>>>>>>>>>>>>> discussions.
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> Jiangjie (Becket) Qin
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> On Fri, Nov 23, 2018 at 5:14 AM Xingcan Cui <
>>>>>>>>>>>>> xingcanc@gmail.com>
>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> @Shaoxuan, I think the lifecycle or access domain are
>>>>>> both
>>>>>>>>>>>>>>>> orthogonal
>>>>>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>>>>> the cache problem. Essentially, this may be the first
>>>>>> time
>>>>>>>> we
>>>>>>>>>>>>>> plan
>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>>>>> introduce another storage mechanism other than the
>>>>>> state.
>>>>>>>>>>>>> Maybe
>>>>>>>>>>>>>>> it’s
>>>>>>>>>>>>>>>>>>>>> better
>>>>>>>>>>>>>>>>>>>>>> to first draw a big picture and then concentrate on a
>>>>>>>> specific
>>>>>>>>>>>>>>> part?
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> @Becket, yes, actually I am more concerned with the
>>>>>>>> underlying
>>>>>>>>>>>>>>>>>> service.
>>>>>>>>>>>>>>>>>>>>>> This seems to be quite a major change to the existing
>>>>>>>>>>>>> codebase.
>>>>>>>>>>>>>> As
>>>>>>>>>>>>>>>> you
>>>>>>>>>>>>>>>>>>>>>> claimed, the service should be extendible to support
>>>>>> other
>>>>>>>>>>>>>>>> components
>>>>>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>>>>>> we’d better discussed it in another thread.
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> All in all, I am also eager to enjoy the more interactive
>>>>>>>>>>>>>>>>>>>>>> Table API, provided the service mechanism is general and
>>>>>>>>>>>>>>>>>>>>>> flexible enough.
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>>>>>>>> Xingcan
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> On Nov 22, 2018, at 10:16 AM, Xiaowei Jiang <
>>>>>>>>>>>>>> xiaoweij@gmail.com>
>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> Relying on a callback for the temp table for cleanup
>>>>>> is
>>>>>>>> not
>>>>>>>>>>>>>> very
>>>>>>>>>>>>>>>>>>>>>> reliable.
>>>>>>>>>>>>>>>>>>>>>>> There is no guarantee that it will be executed
>>>>>>>> successfully.
>>>>>>>>>>>>> We
>>>>>>>>>>>>>>> may
>>>>>>>>>>>>>>>>>>>>> risk
>>>>>>>>>>>>>>>>>>>>>>> leaks when that happens. I think that it's safer to
>>>>>> have
>>>>>>> an
>>>>>>>>>>>>>>>>>>>> association
>>>>>>>>>>>>>>>>>>>>>>> between temp table and session id. So we can always
>>>>>> clean
>>>>>>>> up
>>>>>>>>>>>>>> temp
>>>>>>>>>>>>>>>>>>>>> tables
>>>>>>>>>>>>>>>>>>>>>>> which are no longer associated with any active
>>>>>> sessions.
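>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> A sketch of such session-scoped bookkeeping (all names hypothetical):
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> Map<SessionId, Set<String>> tempTables = new HashMap<>();
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> void onSessionInactive(SessionId id) {
>>>>>>>>>>>>>>>>>>>>>>>   for (String t : tempTables.remove(id)) {
>>>>>>>>>>>>>>>>>>>>>>>     dropTempTable(t); // safe even if the client never calls back
>>>>>>>>>>>>>>>>>>>>>>>   }
>>>>>>>>>>>>>>>>>>>>>>> }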
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>>>>>> Xiaowei
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> On Thu, Nov 22, 2018 at 12:55 PM jincheng sun <
>>>>>>>>>>>>>>>>>>>>> sunjincheng121@gmail.com>
>>>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> Hi Jiangjie&Shaoxuan,
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> Thanks for initiating this great proposal!
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> Interactive Programming is very useful and user
>>>>>> friendly
>>>>>>>> in
>>>>>>>>>>>>>> case
>>>>>>>>>>>>>>>> of
>>>>>>>>>>>>>>>>>>>>> your
>>>>>>>>>>>>>>>>>>>>>>>> examples.
>>>>>>>>>>>>>>>>>>>>>>>> Moreover, especially when a business has to be executed in
>>>>>>>>>>>>>>>>>>>>>>>> several stages with dependencies, such as the pipeline of
>>>>>>>>>>>>>>>>>>>>>>>> Flink ML, in order to utilize the intermediate calculation
>>>>>>>>>>>>>>>>>>>>>>>> results we have to submit a job by env.execute().
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> About the `cache()`: I think it is better named `persist()`,
>>>>>>>>>>>>>>>>>>>>>>>> and let the Flink framework determine whether we internally
>>>>>>>>>>>>>>>>>>>>>>>> cache in memory or persist to the storage system. Maybe save
>>>>>>>>>>>>>>>>>>>>>>>> the data into a state backend (MemoryStateBackend or
>>>>>>>>>>>>>>>>>>>>>>>> RocksDBStateBackend etc.)
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> BTW, from my point of view, in the future, support for
>>>>>>>>>>>>>>>>>>>>>>>> switching between streaming and batch mode in the same job
>>>>>>>>>>>>>>>>>>>>>>>> will also benefit "Interactive Programming". I am looking
>>>>>>>>>>>>>>>>>>>>>>>> forward to your JIRAs and FLIP!
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>>>>>>>>>> Jincheng
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> On Tue, Nov 20, 2018 at 9:56 PM, Becket Qin
>>>>>>>>>>>>>>>>>>>>>>>> <be...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> As a few recent email threads have pointed out, it
>>>>>> is a
>>>>>>>>>>>>>>> promising
>>>>>>>>>>>>>>>>>>>>>>>>> opportunity to enhance Flink Table API in various
>>>>>>>> aspects,
>>>>>>>>>>>>>>>>>>>> including
>>>>>>>>>>>>>>>>>>>>>>>>> functionality and ease of use among others. One of
>>>>>> the
>>>>>>>>>>>>>>> scenarios
>>>>>>>>>>>>>>>>>>>>> where
>>>>>>>>>>>>>>>>>>>>>> we
>>>>>>>>>>>>>>>>>>>>>>>>> feel Flink could improve is interactive
>>> programming.
>>>>>> To
>>>>>>>>>>>>>> explain
>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>>> issues
>>>>>>>>>>>>>>>>>>>>>>>>> and facilitate the discussion on the solution, we
>>> put
>>>>>>>>>>>>>> together
>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>>>> following document with our proposal.
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>> 
>>>> 
>>> https://docs.google.com/document/d/1d4T2zTyfe7hdncEUAxrlNOYr4e5IMNEZLyqSuuswkA0/edit?usp=sharing
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> Feedback and comments are very welcome!
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> Jiangjie (Becket) Qin
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>> 
>>>> 
>>>> 
>>> 
> 
> 



Re: [DISCUSS] Support Interactive Programming in Flink Table API

Posted by Xingcan Cui <xi...@gmail.com>.
Hi all,

I agree with @Becket that `cache()` and `materialize()` should be considered as two different methods, where the latter is more sophisticated.

According to my understanding, the initial idea is just to introduce a simple cache or persist mechanism, but as the Table API is a high-level API, it's natural for us to think in a SQL way.

Maybe we can add the `cache()` method to the DataSet API and force users to translate a Table to a DataSet before caching it. Then the users should manually register the cached DataSet as a table again (we may need some table replacement mechanisms for DataSets with an identical schema but different contents here). After all, it's the DataSet rather than the dynamic table that needs to be cached, right?
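
A rough sketch of that flow with the Java batch API (toDataSet/registerDataSet exist today; the cache() on DataSet is the hypothetical part):

DataSet<Row> ds = tEnv.toDataSet(t, Row.class);  // translate the Table
DataSet<Row> cached = ds.cache();                // hypothetical DataSet.cache()
tEnv.registerDataSet("t_cached", cached);        // re-register under a new name
Table t2 = tEnv.scan("t_cached");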

Best,
Xingcan

> On Nov 30, 2018, at 10:57 AM, Becket Qin <be...@gmail.com> wrote:
> 
> Hi Piotrek and Jark,
> 
> Thanks for the feedback and explanation. Those are good arguments. But I
> think those arguments are mostly about materialized view. Let me try to
> explain the reason I believe cache() and materialize() are different.
> 
> I think cache() and materialize() have quite different implications. An
> analogy I can think of is save()/publish(). When users call cache(), it is
> just like they are saving an intermediate result as a draft of their work,
> this intermediate result may not have any realistic meaning. Calling
> cache() does not mean users want to publish the cached table in any manner.
> But when users call materialize(), that means "I have something meaningful
> to be reused by others", now users need to think about the validation,
> update & versioning, lifecycle of the result, etc.
> 
> Piotrek's suggestions on variations of the materialize() methods are very
> useful. It would be great if Flink have them. The concept of materialized
> view is actually a pretty big feature, not to say the related stuff like
> triggers/hooks you mentioned earlier. I think the materialized view itself
> should be discussed in a more thorough and systematic manner. And I found
> that discussion is kind of orthogonal and way beyond interactive
> programming experience.
> 
> The example you gave was interesting. I still have some questions, though.
> 
> Table source = … // some source that scans files from a directory
>> “/foo/bar/“
>> Table t1 = source.groupBy(…).select(…).where(…) ….;
>> Table t2 = t1.materialize() // (or `cache()`)
> 
> t2.count() // initialise cache (if it’s lazily initialised)
>> int a1 = t1.count()
>> int b1 = t2.count()
>> // something in the background (or we trigger it) writes new files to
>> /foo/bar
>> int a2 = t1.count()
>> int b2 = t2.count()
>> t2.refresh() // possible future extension, not to be implemented in the
>> initial version
>> 
> 
> what if someone else added some more files to /foo/bar at this point? In
> that case, a3 won't equal b3, and the result becomes non-deterministic,
> right?
> 
> int a3 = t1.count()
>> int b3 = t2.count()
>> t2.drop() // another possible future extension, manual “cache” dropping
> 
> 
> When we talk about interactive programming, in most cases, we are talking
> about batch applications. A fundamental assumption of such a case is that the
> source data is complete before the data processing begins, and the data
> will not change during the data processing. IMO, if additional rows need
> to be added to some source during the processing, it should be done in ways
> like unioning the source with another table containing the rows to be added.
> 
> There are a few cases in which computations are executed repeatedly on a
> changing data source.
> 
> For example, people may run an ML training job every hour with the samples
> newly added in the past hour. In that case, the source data between runs will
> indeed change. But still, the data remains unchanged within one run. And
> usually in that case, the result will need versioning, i.e. for a given
> result, it tells which snapshot of the source data (by a certain timestamp)
> it was derived from.
> 
> Another example is something like a data warehouse. In this case, there are a
> few sources of original/raw data. On top of those sources, many materialized
> views / queries / reports / dashboards can be created to generate derived
> data. That derived data needs to be updated when the underlying original
> data changes. In that case, the processing logic that derives data from the
> original data needs to be executed repeatedly to update those reports/views.
> Again, all that derived data also needs version management, such as a
> timestamp.
> 
> In any of the above two cases, during a single run of the processing logic,
> the data cannot change. Otherwise the behavior of the processing logic may
> be undefined. In the above two examples, when writing the processing logic,
> users can use .cache() to hint Flink that those results should be saved to
> avoid repeated computation. And then for the result of my application
> logic, I'll call materialize(), so that these results could be managed by
> the system with versioning, metadata management, lifecycle management,
> ACLs, etc.
> 
> It is true we can use materialize() to do the cache() job, but I am really
> reluctant to shoehorn cache() into materialize() and force users to worry
> about a bunch of implications that they shouldn't have to. I am absolutely on
> your side that redundant API is bad. But it is equally frustrating, if not
> more, that the same API does different things.
> 
> Thanks,
> 
> Jiangjie (Becket) Qin
> 
> 
> On Fri, Nov 30, 2018 at 10:34 PM Shaoxuan Wang <ws...@gmail.com> wrote:
> 
>> Thanks Piotrek,
>> You provided a very good example, it explains all the confusions I have.
>> It is clear that there is something we have not considered in the initial
>> proposal. We intend to force the user to reuse the cached/materialized
>> table, if its cache() method is executed. We did not expect that users may
>> want to re-execute the plan from the source table. Let me re-think about
>> it and get back to you later.
>> 
>> In the meantime, this example/observation also implies that we cannot fully
>> involve the optimizer in deciding the plan if a cache/materialize is
>> explicitly used, because whether to reuse the cached data or re-execute the
>> query from the source data may lead to different results. (But I guess the
>> optimizer can still help in some cases ---- as long as it does not
>> re-execute from the varied source, we should be safe).
>> 
>> Regards,
>> Shaoxuan
>> 
>> 
>> 
>> On Fri, Nov 30, 2018 at 9:13 PM Piotr Nowojski <pi...@data-artisans.com>
>> wrote:
>> 
>>> Hi Shaoxuan,
>>> 
>>> Re 2:
>>> 
>>>> Table t3 = methodThatAppliesOperators(t1) // t1 is modified to-> t1’
>>> 
>>> What do you mean that “ t1 is modified to-> t1’ ” ? That
>>> `methodThatAppliesOperators()` method has changed its plan?
>>> 
>>> I was thinking more about something like this:
>>> 
>>> Table source = … // some source that scans files from a directory
>>> “/foo/bar/“
>>> Table t1 = source.groupBy(…).select(…).where(…) ….;
>>> Table t2 = t1.materialize() // (or `cache()`)
>>> 
>>> t2.count() // initialise cache (if it’s lazily initialised)
>>> 
>>> int a1 = t1.count()
>>> int b1 = t2.count()
>>> 
>>> // something in the background (or we trigger it) writes new files to
>>> /foo/bar
>>> 
>>> int a2 = t1.count()
>>> int b2 = t2.count()
>>> 
>>> t2.refresh() // possible future extension, not to be implemented in the
>>> initial version
>>> 
>>> int a3 = t1.count()
>>> int b3 = t2.count()
>>> 
>>> t2.drop() // another possible future extension, manual “cache” dropping
>>> 
>>> assertTrue(a1 == b1) // same results, but b1 comes from the “cache"
>>> assertTrue(b1 == b2) // both values come from the same cache
>>> assertTrue(a2 > b2) // b2 comes from cache, a2 re-executed a full table
>>> scan and has more data
>>> assertTrue(b3 > b2) // b3 comes from refreshed cache
>>> assertTrue(b3 == a2 && a2 == a3)
>>> 
>>> Piotrek
>>> 
>>>> On 30 Nov 2018, at 10:22, Jark Wu <im...@gmail.com> wrote:
>>>> 
>>>> Hi,
>>>> 
>>>> It is a very interesting and useful design!
>>>> 
>>>> Here I want to share some of my thoughts:
>>>> 
>>>> 1. Agree with that cache() method should return some Table to avoid
>> some
>>>> unexpected problems because of the mutable object.
>>>>  All the existing methods of Table are returning a new Table instance.
>>>> 
>>>> 2. I think materialize() would be more consistent with SQL; this makes it
>>>> possible to support the same feature for SQL (materialized view) and keep
>>>> the same API for users in the future.
>>>>  But I'm also fine if we choose cache().
>>>> 
>>>> 3. In the proposal, a TableService (or FlinkService?) is used to cache
>>> the
>>>> result of the (intermediate) table.
>>>>  But the name TableService may be a bit too general and easy to
>>>> misunderstand at first glance (a metastore for tables?).
>>>>  Maybe a more specific name would be better, such as TableCacheService or
>>>> TableMaterializeService or something else.
>>>> 
>>>> Best,
>>>> Jark
>>>> 
>>>> 
>>>> On Thu, 29 Nov 2018 at 21:16, Fabian Hueske <fh...@gmail.com> wrote:
>>>> 
>>>>> Hi,
>>>>> 
>>>>> Thanks for the clarification Becket!
>>>>> 
>>>>> I have a few thoughts to share / questions:
>>>>> 
>>>>> 1) I'd like to know how you plan to implement the feature on a plan /
>>>>> planner level.
>>>>> 
>>>>> I would imagine the following to happen when Table.cache() is called:
>>>>> 
>>>>> 1) immediately optimize the Table and internally convert it into a
>>>>> DataSet/DataStream. This is necessary, to avoid that operators of
>> later
>>>>> queries on top of the Table are pushed down.
>>>>> 2) register the DataSet/DataStream as a DataSet/DataStream-backed Table X
>>>>> 3) add a sink to the DataSet/DataStream. This is the materialization of the
>>>>> Table X
>>>>> 
>>>>> Based on your proposal the following would happen:
>>>>> 
>>>>> Table t1 = ....
>>>>> t1.cache(); // cache() returns void. The logical plan of t1 is
>> replaced
>>> by
>>>>> a scan of X. There is also a reference to the materialization of X.
>>>>> 
>>>>> t1.count(); // this executes the program, including the
>>> DataSet/DataStream
>>>>> that backs X and the sink that writes the materialization of X
>>>>> t1.count(); // this executes the program, but reads X from the
>>>>> materialization.
>>>>> 
>>>>> My question is, how do you determine whether the scan of t1 should go
>>>>> against the DataSet/DataStream program and when it should go against the
>>>>> materialization?
>>>>> AFAIK, there is no hook that will tell you that a part of the program
>>> was
>>>>> executed. Flipping a switch during optimization or plan generation is
>>> not
>>>>> sufficient as there is no guarantee that the plan is also executed.
>>>>> 
>>>>> Overall, this behavior is somewhat similar to what I proposed in
>>>>> FLINK-8950, which does not include persisting the table, but just
>>>>> optimizing and reregistering it as DataSet/DataStream scan.
>>>>> 
>>>>> 2) I think Piotr has a point about the implicit behavior and side
>>> effects
>>>>> of the cache() method if it does not return anything.
>>>>> Consider the following example:
>>>>> 
>>>>> Table t1 = ???
>>>>> Table t2 = methodThatAppliesOperators(t1);
>>>>> Table t3 = methodThatAppliesOtherOperators(t1);
>>>>> 
>>>>> In this case, the behavior/performance of the plan that results from
>> the
>>>>> second method call depends on whether t1 was modified by the first
>>> method
>>>>> or not.
>>>>> This is the classic issue of mutable vs. immutable objects.
>>>>> Also, as Piotr pointed out, it might also be good to have the original
>>> plan
>>>>> of t1, because in some cases it is possible to push filters down such
>>> that
>>>>> evaluating the query from scratch might be more efficient than
>> accessing
>>>>> the cache.
>>>>> Moreover, a CachedTable could extend Table and offer a method refresh().
>>>>> This sounds quite useful in an interactive session mode.
>>>>> 
>>>>> 3) Regarding the name, I can see both arguments. IMO, materialize()
>>> seems
>>>>> to be more future proof.
>>>>> 
>>>>> Best, Fabian
>>>>> 
>>>>> On Thu, Nov 29, 2018 at 12:56, Shaoxuan Wang <wshaoxuan@gmail.com>
>>>>> wrote:
>>>>> 
>>>>>> Hi Piotr,
>>>>>> 
>>>>>> Thanks for sharing your ideas on the method naming. We will think
>> about
>>>>>> your suggestions. But I don't understand why we need to change the
>>> return
>>>>>> type of cache().
>>>>>> 
>>>>>> Cache() is a physical operation, it does not change the logic of
>>>>>> the `Table`. On the tableAPI layer, we should not introduce a new
>> table
>>>>>> type unless the logic of table has been changed. If we introduce a
>> new
>>>>>> table type `CachedTable`, we need to create the same set of methods of
>>>>> `Table`
>>>>>> for it. I don't think it is worth doing this. Or can you please
>>> elaborate
>>>>>> more on what could be the "implicit behaviours/side effects" you are
>>>>>> thinking about?
>>>>>> 
>>>>>> Regards,
>>>>>> Shaoxuan
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> On Thu, Nov 29, 2018 at 7:05 PM Piotr Nowojski <
>>> piotr@data-artisans.com>
>>>>>> wrote:
>>>>>> 
>>>>>>> Hi Becket,
>>>>>>> 
>>>>>>> Thanks for the response.
>>>>>>> 
>>>>>>> 1. I wasn’t saying that materialised view must be mutable or not.
>> The
>>>>>> same
>>>>>>> thing applies to caches as well. To the contrary, I would expect
>> more
>>>>>>> consistency and updates from something that is called “cache” vs
>>>>>> something
>>>>>>> that’s a “materialised view”. In other words, IMO most caches do not
>>>>>> serve
>>>>>>> you invalid/outdated data and they handle updates on their own.
>>>>>>> 
>>>>>>> 2. I don’t think that having in the future two very similar concepts
>>> of
>>>>>>> `materialized` view and `cache` is a good idea. It would be
>> confusing
>>>>> for
>>>>>>> the users. I think it could be handled by variations/overloading of
>>>>>>> materialised view concept. We could start with:
>>>>>>> 
>>>>>>> `MaterializedTable materialize()` - immutable, session life scope
>>>>>>> (basically the same semantics as you are proposing)
>>>>>>> 
>>>>>>> And then in the future (if ever) build on top of that/expand it
>> with:
>>>>>>> 
>>>>>>> `MaterializedTable materialize(refreshTime=…)` or `MaterializedTable
>>>>>>> materialize(refreshHook=…)`
>>>>>>> 
>>>>>>> Or with cross session support:
>>>>>>> 
>>>>>>> `MaterializedTable materializeInto(connector=…)` or
>> `MaterializedTable
>>>>>>> materializeInto(tableFactory=…)`
>>>>>>> 
>>>>>>> I'm not saying that we should implement cross session/refreshing now or
>>>>>>> even in the near future. I'm just arguing that naming the current
>>>>>>> immutable, session-scoped method `materialize()` is more future proof and
>>>>>>> more consistent with SQL (on which, after all, the Table API is heavily
>>>>>>> based).
>>>>>>> 
>>>>>>> 3. Even if we agree on naming it `cache()`, I would still insist on
>>>>>>> `cache()` returning `CachedTable` handle to avoid implicit
>>>>>> behaviours/side
>>>>>>> effects and to give both us & users more flexibility.
>>>>>>> 
>>>>>>> Piotrek
>>>>>>> 
>>>>>>>> On 29 Nov 2018, at 06:20, Becket Qin <be...@gmail.com> wrote:
>>>>>>>> 
>>>>>>>> Just to add a little bit, the materialized view is probably more
>>>>>> similar
>>>>>>> to
>>>>>>>> the persistent() brought up earlier in the thread. So it is usually
>>>>>> cross
>>>>>>>> session and could be used in a larger scope. For example, a
>>>>>> materialized
>>>>>>>> view created by user A may be visible to user B. It is probably
>>>>>> something
>>>>>>>> we want to have in the future. I'll put it in the future work
>>>>> section.
>>>>>>>> 
>>>>>>>> Thanks,
>>>>>>>> 
>>>>>>>> Jiangjie (Becket) Qin
>>>>>>>> 
>>>>>>>> On Thu, Nov 29, 2018 at 9:47 AM Becket Qin <be...@gmail.com>
>>>>>> wrote:
>>>>>>>> 
>>>>>>>>> Hi Piotrek,
>>>>>>>>> 
>>>>>>>>> Thanks for the explanation.
>>>>>>>>> 
>>>>>>>>> Right now we are mostly thinking of the cached table as
>> immutable. I
>>>>>> can
>>>>>>>>> see the Materialized view would be useful in the future. That
>> said,
>>>>> I
>>>>>>> think
>>>>>>>>> a simple cache mechanism is probably still needed. So to me,
>> cache()
>>>>>> and
>>>>>>>>> materialize() should be two separate methods as they address different
>>>>>>>>> needs. Materialize() is a higher-level concept usually implying
>>>>>>> periodical
>>>>>>>>> update, while cache() has much simpler semantic. For example, one
>>>>> may
>>>>>>>>> create a materialized view and use cache() method in the
>>>>> materialized
>>>>>>> view
>>>>>>>>> creation logic. So that during the materialized view update, they
>> do
>>>>>> not
>>>>>>>>> need to worry about the case that the cached table is also
>> changed.
>>>>>>> Maybe
>>>>>>>>> under the hood, materialize() and cache() could share some
>>>>> mechanism,
>>>>>>> but
>>>>>>>>> I think a simple cache() method would be handy in a lot of cases.
>>>>>>>>> 
>>>>>>>>> Thanks,
>>>>>>>>> 
>>>>>>>>> Jiangjie (Becket) Qin
>>>>>>>>> 
>>>>>>>>> On Mon, Nov 26, 2018 at 9:38 PM Piotr Nowojski <
>>>>>> piotr@data-artisans.com
>>>>>>>> 
>>>>>>>>> wrote:
>>>>>>>>> 
>>>>>>>>>> Hi Becket,
>>>>>>>>>> 
>>>>>>>>>>> Is there any extra thing user can do on a MaterializedTable that
>>>>>> they
>>>>>>>>>> cannot do on a Table?
>>>>>>>>>> 
>>>>>>>>>> Maybe not in the initial implementation, but various DBs offer
>>>>>>> different
>>>>>>>>>> ways to “refresh” the materialised view. Hooks, triggers, timers,
>>>>>>> manually
>>>>>>>>>> etc. Having `MaterializedTable` would help us to handle that in
>> the
>>>>>>> future.
>>>>>>>>>> 
>>>>>>>>>>> After users call *table.cache(), *users can just use that table
>>>>> and
>>>>>> do
>>>>>>>>>> anything that is supported on a Table, including SQL.
>>>>>>>>>> 
>>>>>>>>>> This is some implicit behaviour with side effects. Imagine if
>> user
>>>>>> has
>>>>>>> a
>>>>>>>>>> long and complicated program, that touches table `b` multiple
>>>>> times,
>>>>>>> maybe
>>>>>>>>>> scattered around different methods. If he modifies his program by
>>>>>>> inserting
>>>>>>>>>> in one place
>>>>>>>>>> 
>>>>>>>>>> b.cache()
>>>>>>>>>> 
>>>>>>>>>> This implicitly alters the semantics and behaviour of his code all
>>>>>>>>>> over the place, maybe in ways that might cause problems. For example,
>>>>>> what
>>>>>>> if
>>>>>>>>>> underlying data is changing?
>>>>>>>>>> 
>>>>>>>>>> Having invisible side effects is also not very clean, for example
>>>>>> think
>>>>>>>>>> about something like this (but more complicated):
>>>>>>>>>> 
>>>>>>>>>> Table b = ...;
>>>>>>>>>> 
>>>>>>>>>> if (someCondition) {
>>>>>>>>>>   processTable1(b);  // suppose b.cache() is added only inside here
>>>>>>>>>> } else {
>>>>>>>>>>   processTable2(b);
>>>>>>>>>> }
>>>>>>>>>> 
>>>>>>>>>> // do more stuff with b
>>>>>>>>>> 
>>>>>>>>>> And user adds `b.cache()` call to only one of the `processTable1`
>>>>> or
>>>>>>>>>> `processTable2` methods.
>>>>>>>>>> 
>>>>>>>>>> On the other hand
>>>>>>>>>> 
>>>>>>>>>> Table materialisedB = b.materialize()
>>>>>>>>>> 
>>>>>>>>>> Avoids (at least some of) the side effect issues and forces user
>> to
>>>>>>>>>> explicitly use `materialisedB` where it’s appropriate and forces
>>>>> user
>>>>>>> to
>>>>>>>>>> think what does it actually mean. And if something doesn’t work
>> in
>>>>>> the
>>>>>>> end
>>>>>>>>>> for the user, he will know what he has changed instead of blaming
>>>>>>> Flink for
>>>>>>>>>> some “magic” underneath. In the above example, after
>> materialising
>>>>> b
>>>>>> in
>>>>>>>>>> only one of the methods, he should/would realise the issue when
>>>>>>>>>> handling the return value `MaterializedTable` of that method.
>>>>>>>>>> 
>>>>>>>>>> I guess it comes down to personal preference whether you like things
>>>>>>>>>> to be implicit or not. The more of a power user someone is, the more
>>>>>>>>>> likely they are to like/understand implicit behaviour. And we as Table
>>>>>>>>>> API designers are the most power users out there, so I would proceed
>>>>>>>>>> with caution (so that we do not end up in the crazy perl realm with its
>>>>>>>>>> lovely implicit method arguments ;)
>>>>>>>>>> <https://stackoverflow.com/a/14922656/8149051>)
>>>>>>>>>> 
>>>>>>>>>>> Table API to also support non-relational processing cases,
>> cache()
>>>>>>>>>> might be slightly better.
>>>>>>>>>> 
>>>>>>>>>> I think even such extended Table API could benefit from sticking
>>>>>>> to/being
>>>>>>>>>> consistent with SQL where both SQL and Table API are basically
>> the
>>>>>>> same.
>>>>>>>>>> 
>>>>>>>>>> One more thing. `MaterializedTable materialize()` could be more
>>>>>>>>>> powerful/flexible allowing the user to operate both on
>> materialised
>>>>>>> and not
>>>>>>>>>> materialised view at the same time for whatever reasons
>> (underlying
>>>>>>> data
>>>>>>>>>> changing/better optimisation opportunities after pushing down
>> more
>>>>>>> filters
>>>>>>>>>> etc). For example:
>>>>>>>>>> 
>>>>>>>>>> Table b = …;
>>>>>>>>>> 
>>>>>>>>>> MaterializedTable mb = b.materialize();
>>>>>>>>>> 
>>>>>>>>>> val min = mb.min();
>>>>>>>>>> val max = mb.max();
>>>>>>>>>> 
>>>>>>>>>> val user42 = b.filter('userId === 42);
>>>>>>>>>> 
>>>>>>>>>> Could be more efficient compared to `b.cache()` if `filter('userId ===
>>>>>>>>>> 42)` allows for much more aggressive optimisations.
>>>>>>>>>> 
>>>>>>>>>> Piotrek
>>>>>>>>>> 
>>>>>>>>>>> On 26 Nov 2018, at 12:14, Fabian Hueske <fh...@gmail.com>
>>>>> wrote:
>>>>>>>>>>> 
>>>>>>>>>>> I'm not suggesting to add support for Ignite. This was just an
>>>>>>> example.
>>>>>>>>>>> Plasma and Arrow sound interesting, too.
>>>>>>>>>>> For the sake of this proposal, it would be up to the user to
>>>>>>> implement a
>>>>>>>>>>> TableFactory and corresponding TableSource / TableSink classes
>> to
>>>>>>>>>> persist
>>>>>>>>>>> and read the data.
>>>>>>>>>>> 
>>>>>>>>>>> Am Mo., 26. Nov. 2018 um 12:06 Uhr schrieb Flavio Pompermaier <
>>>>>>>>>>> pompermaier@okkam.it>:
>>>>>>>>>>> 
>>>>>>>>>>>> What about to add also Apache Plasma + Arrow as an alternative
>> to
>>>>>>>>>> Apache
>>>>>>>>>>>> Ignite?
>>>>>>>>>>>> [1]
>>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>> 
>>>>> 
>> https://arrow.apache.org/blog/2017/08/08/plasma-in-memory-object-store/
>>>>>>>>>>>> 
>>>>>>>>>>>> On Mon, Nov 26, 2018 at 11:56 AM Fabian Hueske <
>>>>> fhueske@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Thanks for the proposal!
>>>>>>>>>>>>> 
>>>>>>>>>>>>> To summarize, you propose a new method Table.cache(): Table
>> that
>>>>>>> will
>>>>>>>>>>>>> trigger a job and write the result into some temporary storage
>>>>> as
>>>>>>>>>> defined
>>>>>>>>>>>>> by a TableFactory.
>>>>>>>>>>>>> The cache() call blocks while the job is running and
>> eventually
>>>>>>>>>> returns a
>>>>>>>>>>>>> Table object that represents a scan of the temporary table.
>>>>>>>>>>>>> When the "session" is closed (closing to be defined?), the
>>>>>> temporary
>>>>>>>>>>>> tables
>>>>>>>>>>>>> are all dropped.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> I think this behavior makes sense and is a good first step
>>>>> towards
>>>>>>>>>> more
>>>>>>>>>>>>> interactive workloads.
>>>>>>>>>>>>> However, its performance suffers from writing to and reading
>>>>> from
>>>>>>>>>>>> external
>>>>>>>>>>>>> systems.
>>>>>>>>>>>>> I think this is OK for now. Changes that would significantly
>>>>>> improve
>>>>>>>>>> the
>>>>>>>>>>>>> situation (i.e., pinning data in-memory across jobs) would
>> have
>>>>>>> large
>>>>>>>>>>>>> impacts on many components of Flink.
>>>>>>>>>>>>> Users could use in-memory filesystems or storage grids (Apache
>>>>>>>>>> Ignite) to
>>>>>>>>>>>>> mitigate some of the performance effects.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Best, Fabian
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Am Mo., 26. Nov. 2018 um 03:38 Uhr schrieb Becket Qin <
>>>>>>>>>>>>> becket.qin@gmail.com
>>>>>>>>>>>>>> :
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Thanks for the explanation, Piotrek.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Is there any extra thing user can do on a MaterializedTable
>>>>> that
>>>>>>> they
>>>>>>>>>>>>>> cannot do on a Table? After users call *table.cache(), *users
>>>>> can
>>>>>>>>>> just
>>>>>>>>>>>>> use
>>>>>>>>>>>>>> that table and do anything that is supported on a Table,
>>>>>> including
>>>>>>>>>> SQL.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Naming wise, either cache() or materialize() sounds fine to
>> me.
>>>>>>>>>> cache()
>>>>>>>>>>>>> is
>>>>>>>>>>>>>> a bit more general than materialize(). Given that we are
>>>>>> enhancing
>>>>>>>>>> the
>>>>>>>>>>>>>> Table API to also support non-relational processing cases,
>>>>>> cache()
>>>>>>>>>>>> might
>>>>>>>>>>>>> be
>>>>>>>>>>>>>> slightly better.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Jiangjie (Becket) Qin
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On Fri, Nov 23, 2018 at 11:25 PM Piotr Nowojski <
>>>>>>>>>>>> piotr@data-artisans.com
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Hi Becket,
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Ops, sorry I didn’t notice that you intend to reuse existing
>>>>>>>>>>>>>>> `TableFactory`. I don’t know why, but I assumed that you
>> want
>>>>> to
>>>>>>>>>>>>> provide
>>>>>>>>>>>>>> an
>>>>>>>>>>>>>>> alternate way of writing the data.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Now that I hopefully understand the proposal, maybe we could
>>>>>>> rename
>>>>>>>>>>>>>>> `cache()` to
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> void materialize()
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> or going step further
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> MaterializedTable materialize()
>>>>>>>>>>>>>>> MaterializedTable createMaterializedView()
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> ?
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> The second option with returning a handle I think is more
>>>>>> flexible
>>>>>>>>>>>> and
>>>>>>>>>>>>>>> could provide features such as “refresh”/“delete” or
>> generally
>>>>>>>>>>>> speaking
>>>>>>>>>>>>>>> manage the the view. In the future we could also think about
>>>>>>> adding
>>>>>>>>>>>>> hooks
>>>>>>>>>>>>>>> to automatically refresh view etc. It is also more explicit
>> -
>>>>>>>>>>>>>>> materialization returning a new table handle will not have
>> the
>>>>>>> same
>>>>>>>>>>>>>>> implicit side effects as adding a simple line of code like
>>>>>>>>>>>> `b.cache()`
>>>>>>>>>>>>>>> would have.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> It would also be more SQL like, making it more intuitive for
>>>>>> users
>>>>>>>>>>>>>> already
>>>>>>>>>>>>>>> familiar with the SQL.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Piotrek
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> On 23 Nov 2018, at 14:53, Becket Qin <becket.qin@gmail.com
>>> 
>>>>>>> wrote:
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Hi Piotrek,
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> For the cache() method itself, yes, it is equivalent to
>>>>>> creating
>>>>>>> a
>>>>>>>>>>>>>>> BUILT-IN
>>>>>>>>>>>>>>>> materialized view with a lifecycle. That functionality is
>>>>>> missing
>>>>>>>>>>>>>> today,
>>>>>>>>>>>>>>>> though. Not sure if I understand your question. Do you mean
>>>>> we
>>>>>>>>>>>>> already
>>>>>>>>>>>>>>> have
>>>>>>>>>>>>>>>> the functionality and just need a syntax sugar?
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> What's more interesting in the proposal is do we want to
>> stop
>>>>>> at
>>>>>>>>>>>>>> creating
>>>>>>>>>>>>>>>> the materialized view? Or do we want to extend that in the
>>>>>> future
>>>>>>>>>>>> to
>>>>>>>>>>>>> a
>>>>>>>>>>>>>>> more
>>>>>>>>>>>>>>>> useful unified data store distributed with Flink? And do we
>>>>>> want
>>>>>>> to
>>>>>>>>>>>>>> have
>>>>>>>>>>>>>>> a
>>>>>>>>>>>>>>>> mechanism allow more flexible user job pattern with their
>> own
>>>>>>> user
>>>>>>>>>>>>>>> defined
>>>>>>>>>>>>>>>> services. These considerations are much more architectural.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Jiangjie (Becket) Qin
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> On Fri, Nov 23, 2018 at 6:01 PM Piotr Nowojski <
>>>>>>>>>>>>>> piotr@data-artisans.com>
>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Interesting idea. I’m trying to understand the problem.
>>>>> Isn’t
>>>>>>> the
>>>>>>>>>>>>>>>>> `cache()` call an equivalent of writing data to a sink and
>>>>>> later
>>>>>>>>>>>>>> reading
>>>>>>>>>>>>>>>>> from it? Where this sink has a limited live scope/live
>> time?
>>>>>> And
>>>>>>>>>>>> the
>>>>>>>>>>>>>>> sink
>>>>>>>>>>>>>>>>> could be implemented as in memory or a file sink?
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> If so, what’s the problem with creating a materialised
>> view
>>>>>>> from a
>>>>>>>>>>>>>> table
>>>>>>>>>>>>>>>>> “b” (from your document’s example) and reusing this
>>>>>> materialised
>>>>>>>>>>>>> view
>>>>>>>>>>>>>>>>> later? Maybe we are lacking mechanisms to clean up
>>>>>> materialised
>>>>>>>>>>>>> views
>>>>>>>>>>>>>>> (for
>>>>>>>>>>>>>>>>> example when current session finishes)? Maybe we need some
>>>>>>>>>>>> syntactic
>>>>>>>>>>>>>>> sugar
>>>>>>>>>>>>>>>>> on top of it?
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Piotrek
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> On 23 Nov 2018, at 07:21, Becket Qin <
>> becket.qin@gmail.com
>>>>>> 
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Thanks for the suggestion, Jincheng.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Yes, I think it makes sense to have a persist() with
>>>>>>>>>>>>>> lifecycle/defined
>>>>>>>>>>>>>>>>>> scope. I just added a section in the future work for
>> this.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Jiangjie (Becket) Qin
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> On Fri, Nov 23, 2018 at 1:55 PM jincheng sun <
>>>>>>>>>>>>>> sunjincheng121@gmail.com
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Hi Jiangjie,
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Thank you for the explanation about the name of
>>>>> `cache()`, I
>>>>>>>>>>>>>>> understand
>>>>>>>>>>>>>>>>> why
>>>>>>>>>>>>>>>>>>> you designed this way!
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Another idea is whether we can specify a lifecycle for
>>>>> data
>>>>>>>>>>>>>>> persistence?
>>>>>>>>>>>>>>>>>>> For example, persist (LifeCycle.SESSION), so that the
>> user
>>>>>> is
>>>>>>>>>>>> not
>>>>>>>>>>>>>>>>> worried
>>>>>>>>>>>>>>>>>>> about data loss, and will clearly specify the time range
>>>>> for
>>>>>>>>>>>>> keeping
>>>>>>>>>>>>>>>>> time.
>>>>>>>>>>>>>>>>>>> At the same time, if we want to expand, we can also
>> share
>>>>>> in a
>>>>>>>>>>>>>> certain
>>>>>>>>>>>>>>>>>>> group of session, for example:
>>>>>> LifeCycle.SESSION_GROUP(...), I
>>>>>>>>>>>> am
>>>>>>>>>>>>>> not
>>>>>>>>>>>>>>>>> sure,
>>>>>>>>>>>>>>>>>>> just an immature suggestion, for reference only!
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Bests,
>>>>>>>>>>>>>>>>>>> Jincheng
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Becket Qin <be...@gmail.com> 于2018年11月23日周五
>>>>> 下午1:33写道:
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> Re: Jincheng,
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> Thanks for the feedback. Regarding cache() v.s.
>>>>> persist(),
>>>>>>>>>>>>>>> personally I
>>>>>>>>>>>>>>>>>>>> find cache() to be more accurately describing the
>>>>> behavior,
>>>>>>>>>>>> i.e.
>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>> Table
>>>>>>>>>>>>>>>>>>>> is cached for the session, but will be deleted after
>> the
>>>>>>>>>>>> session
>>>>>>>>>>>>> is
>>>>>>>>>>>>>>>>>>> closed.
>>>>>>>>>>>>>>>>>>>> persist() seems a little misleading as people might
>> think
>>>>>> the
>>>>>>>>>>>>> table
>>>>>>>>>>>>>>>>> will
>>>>>>>>>>>>>>>>>>>> still be there even after the session is gone.
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> Great point about mixing the batch and stream
>> processing
>>>>> in
>>>>>>> the
>>>>>>>>>>>>>> same
>>>>>>>>>>>>>>>>> job.
>>>>>>>>>>>>>>>>>>>> We should absolutely move towards that goal. I imagine
>>>>> that
>>>>>>>>>>>> would
>>>>>>>>>>>>>> be
>>>>>>>>>>>>>>> a
>>>>>>>>>>>>>>>>>>> huge
>>>>>>>>>>>>>>>>>>>> change across the board, including sources, operators
>> and
>>>>>>>>>>>>>>>>> optimizations,
>>>>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>>> name some. Likely we will need several separate
>> in-depth
>>>>>>>>>>>>>> discussions.
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> Jiangjie (Becket) Qin
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> On Fri, Nov 23, 2018 at 5:14 AM Xingcan Cui <
>>>>>>>>>>>> xingcanc@gmail.com>
>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> @Shaoxuan, I think the lifecycle or access domain are
>>>>> both
>>>>>>>>>>>>>>> orthogonal
>>>>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>>>> the cache problem. Essentially, this may be the first
>>>>> time
>>>>>>> we
>>>>>>>>>>>>> plan
>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>>>> introduce another storage mechanism other than the
>>>>> state.
>>>>>>>>>>>> Maybe
>>>>>>>>>>>>>> it’s
>>>>>>>>>>>>>>>>>>>> better
>>>>>>>>>>>>>>>>>>>>> to first draw a big picture and then concentrate on a
>>>>>>> specific
>>>>>>>>>>>>>> part?
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> @Becket, yes, actually I am more concerned with the
>>>>>>> underlying
>>>>>>>>>>>>>>>>> service.
>>>>>>>>>>>>>>>>>>>>> This seems to be quite a major change to the existing
>>>>>>>>>>>> codebase.
>>>>>>>>>>>>> As
>>>>>>>>>>>>>>> you
>>>>>>>>>>>>>>>>>>>>> claimed, the service should be extendible to support
>>>>> other
>>>>>>>>>>>>>>> components
>>>>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>>>>> we’d better discussed it in another thread.
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> All in all, I also eager to enjoy the more interactive
>>>>>> Table
>>>>>>>>>>>>> API,
>>>>>>>>>>>>>> in
>>>>>>>>>>>>>>>>>>> case
>>>>>>>>>>>>>>>>>>>>> of a general and flexible enough service mechanism.
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>>>>>>> Xingcan
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> On Nov 22, 2018, at 10:16 AM, Xiaowei Jiang <
>>>>>>>>>>>>> xiaoweij@gmail.com>
>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> Relying on a callback for the temp table for clean up
>>>>> is
>>>>>>> not
>>>>>>>>>>>>> very
>>>>>>>>>>>>>>>>>>>>> reliable.
>>>>>>>>>>>>>>>>>>>>>> There is no guarantee that it will be executed
>>>>>>> successfully.
>>>>>>>>>>>> We
>>>>>>>>>>>>>> may
>>>>>>>>>>>>>>>>>>>> risk
>>>>>>>>>>>>>>>>>>>>>> leaks when that happens. I think that it's safer to
>>>>> have
>>>>>> an
>>>>>>>>>>>>>>>>>>> association
>>>>>>>>>>>>>>>>>>>>>> between temp table and session id. So we can always
>>>>> clean
>>>>>>> up
>>>>>>>>>>>>> temp
>>>>>>>>>>>>>>>>>>>> tables
>>>>>>>>>>>>>>>>>>>>>> which are no longer associated with any active
>>>>> sessions.
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>>>>> Xiaowei
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> On Thu, Nov 22, 2018 at 12:55 PM jincheng sun <
>>>>>>>>>>>>>>>>>>>> sunjincheng121@gmail.com>
>>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> Hi Jiangjie&Shaoxuan,
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> Thanks for initiating this great proposal!
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> Interactive Programming is very useful and user
>>>>> friendly
>>>>>>> in
>>>>>>>>>>>>> case
>>>>>>>>>>>>>>> of
>>>>>>>>>>>>>>>>>>>> your
>>>>>>>>>>>>>>>>>>>>>>> examples.
>>>>>>>>>>>>>>>>>>>>>>> Moreover, especially when a business has to be
>>>>> executed
>>>>>> in
>>>>>>>>>>>>>> several
>>>>>>>>>>>>>>>>>>>>> stages
>>>>>>>>>>>>>>>>>>>>>>> with dependencies,such as the pipeline of Flink ML,
>> in
>>>>>>> order
>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>>> utilize
>>>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>> intermediate calculation results we have to submit a
>>>>> job
>>>>>>> by
>>>>>>>>>>>>>>>>>>>>> env.execute().
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> About the `cache()`  , I think is better to named
>>>>>>>>>>>> `persist()`,
>>>>>>>>>>>>>> And
>>>>>>>>>>>>>>>>>>> The
>>>>>>>>>>>>>>>>>>>>>>> Flink framework determines whether we internally
>> cache
>>>>>> in
>>>>>>>>>>>>> memory
>>>>>>>>>>>>>>> or
>>>>>>>>>>>>>>>>>>>>> persist
>>>>>>>>>>>>>>>>>>>>>>> to the storage system,Maybe save the data into state
>>>>>>> backend
>>>>>>>>>>>>>>>>>>>>>>> (MemoryStateBackend or RocksDBStateBackend etc.)
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> BTW, from the points of my view in the future,
>> support
>>>>>> for
>>>>>>>>>>>>>>> streaming
>>>>>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>>>>>>> batch mode switching in the same job will also
>> benefit
>>>>>> in
>>>>>>>>>>>>>>>>>>> "Interactive
>>>>>>>>>>>>>>>>>>>>>>> Programming",  I am looking forward to your JIRAs
>> and
>>>>>>> FLIP!
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>>>>>>>>> Jincheng
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> Becket Qin <be...@gmail.com> 于2018年11月20日周二
>>>>>>> 下午9:56写道:
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> As a few recent email threads have pointed out, it
>>>>> is a
>>>>>>>>>>>>>> promising
>>>>>>>>>>>>>>>>>>>>>>>> opportunity to enhance Flink Table API in various
>>>>>>> aspects,
>>>>>>>>>>>>>>>>>>> including
>>>>>>>>>>>>>>>>>>>>>>>> functionality and ease of use among others. One of
>>>>> the
>>>>>>>>>>>>>> scenarios
>>>>>>>>>>>>>>>>>>>> where
>>>>>>>>>>>>>>>>>>>>> we
>>>>>>>>>>>>>>>>>>>>>>>> feel Flink could improve is interactive
>> programming.
>>>>> To
>>>>>>>>>>>>> explain
>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>> issues
>>>>>>>>>>>>>>>>>>>>>>>> and facilitate the discussion on the solution, we
>> put
>>>>>>>>>>>>> together
>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>>> following document with our proposal.
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>> 
>>> 
>> https://docs.google.com/document/d/1d4T2zTyfe7hdncEUAxrlNOYr4e5IMNEZLyqSuuswkA0/edit?usp=sharing
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> Feedback and comments are very welcome!
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> Jiangjie (Becket) Qin
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>> 
>>> 
>>> 
>> 



Re: [DISCUSS] Support Interactive Programming in Flink Table API

Posted by Becket Qin <be...@gmail.com>.
Hi Piotrek and Jark,

Thanks for the feedback and explanation. Those are good arguments. But I
think those arguments are mostly about materialized views. Let me try to
explain why I believe cache() and materialize() are different.

I think cache() and materialize() have quite different implications. An
analogy I can think of is save()/publish(). When users call cache(), it is
as if they are saving an intermediate result as a draft of their work;
this intermediate result may not have any meaning on its own. Calling
cache() does not mean users want to publish the cached table in any manner.
But when users call materialize(), that means "I have something meaningful
to be reused by others", and users then need to think about validation,
update & versioning, the lifecycle of the result, etc.
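
To make the distinction concrete, here is a rough sketch of how I picture
the two calls being used (the method names are illustrative only, not a
final API):

Table t = source.groupBy(…).select(…);

// draft: session-scoped, dropped when the session closes
t.cache();
int a = t.count(); // first access runs the job and populates the cache
int b = t.count(); // subsequent accesses are served from the cache

// publish: a managed result with versioning, lifecycle, ACLs, etc.
MaterializedTable mt = t.materialize();
mt.refresh(); // the owner explicitly manages updates (a possible extension)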

Piotrek's suggestions on variations of the materialize() method are very
useful. It would be great if Flink had them. The concept of a materialized
view is actually a pretty big feature, not to mention the related machinery
like the triggers/hooks you mentioned earlier. I think materialized views
should be discussed in a more thorough and systematic manner, and I find
that discussion to be largely orthogonal to, and well beyond, the
interactive programming experience.

The example you gave was interesting. I still have some questions, though.

> Table source = … // some source that scans files from a directory
> “/foo/bar/“
> Table t1 = source.groupBy(…).select(…).where(…) ….;
> Table t2 = t1.materialize() // (or `cache()`)
>
> t2.count() // initialise cache (if it’s lazily initialised)
>
> int a1 = t1.count()
> int b1 = t2.count()
>
> // something in the background (or we trigger it) writes new files to
> /foo/bar
>
> int a2 = t1.count()
> int b2 = t2.count()
>
> t2.refresh() // possible future extension, not to be implemented in the
> initial version

What if someone else adds more files to /foo/bar at this point? In that
case, a3 won't equal b3, and the result becomes non-deterministic, right?

> int a3 = t1.count()
> int b3 = t2.count()
> t2.drop() // another possible future extension, manual “cache” dropping


When we talk about interactive programming, in most cases we are talking
about batch applications. A fundamental assumption in such cases is that
the source data is complete before the processing begins, and that the data
will not change during the processing. IMO, if additional rows need to be
added to a source during processing, that should be done explicitly, e.g.
by unioning the source with another table containing the rows to be added.
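
For instance, a minimal sketch of what I mean, using the existing union
semantics of the Table API:

Table newRows = … // the rows that arrived after the snapshot was taken
Table fullSource = source.unionAll(newRows); // the addition is explicit

Any cache built on top of fullSource then refers to a well-defined, fixed
input.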

There are a few cases where computations are executed repeatedly on a
changing data source.

For example, people may run an ML training job every hour with the samples
newly added in the past hour. In that case, the source data will indeed
change between runs, but it remains unchanged within a single run. And
usually in that case, the result needs versioning, i.e. a given result is
tied to the source data as of a certain timestamp.

Another example is something like a data warehouse. In this case, there are
a few sources of original/raw data. On top of those sources, many
materialized views / queries / reports / dashboards can be created to
generate derived data. That derived data needs to be updated when the
underlying original data changes, so the processing logic that derives it
needs to be executed repeatedly to update those reports/views. Again, all
of that derived data also needs version management, such as a timestamp.

In either of the above cases, the data cannot change during a single run of
the processing logic; otherwise the behavior of the processing logic may be
undefined. In both examples, when writing the processing logic, users can
call cache() to hint to Flink that intermediate results should be saved to
avoid repeated computation, and then call materialize() on the final result
of the application logic, so that it can be managed by the system with
versioning, metadata management, lifecycle management, ACLs, etc.
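
As a rough sketch of the hourly training example (train() stands for some
hypothetical user-defined logic, just for illustration):

Table samples = tEnv.scan("samples"); // a fixed snapshot for this run
Table features = samples.groupBy(…).select(…);
features.cache(); // intermediate result, reused within this run only
Table model = train(features); // hypothetical application logic
MaterializedTable result = model.materialize(); // managed, versioned result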

It is true that we could use materialize() to do cache()'s job, but I am
really reluctant to shoehorn cache() into materialize() and force users to
worry about a bunch of implications they shouldn't have to. I am absolutely
on your side that redundant APIs are bad. But it is equally frustrating, if
not more so, when the same API does different things.

Thanks,

Jiangjie (Becket) Qin



Re: [DISCUSS] Support Interactive Programming in Flink Table API

Posted by Shaoxuan Wang <ws...@gmail.com>.
Thanks Piotrek,
You provided a very good example; it clears up all the confusion I had.
It is clear that there is something we did not consider in the initial
proposal. We intended to force the user to reuse the cached/materialized
table once its cache() method has been executed. We did not expect that
users may want to re-execute the plan from the source table. Let me
re-think about it and get back to you later.

In the meantime, this example/observation also implies that we cannot fully
rely on the optimizer to decide the plan when a cache/materialize is
explicitly used, because reusing the cached data and re-executing the query
from the source data may lead to different results. (But I guess the
optimizer can still help in some cases: as long as it does not re-execute
from the changed source, we should be safe.)
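
To illustrate with the /foo/bar example below: once the cache exists and
new files have arrived,

int fromCache = t2.count(); // reads the cached snapshot
int recomputed = t1.count(); // rescans the source and sees the new files

fromCache and recomputed differ, so the optimizer must not silently
substitute one plan for the other.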

Regards,
Shaoxuan

Re: [DISCUSS] Support Interactive Programming in Flink Table API

Posted by Piotr Nowojski <pi...@data-artisans.com>.
Hi Shaoxuan,

Re 2:

> Table t3 = methodThatAppliesOperators(t1) // t1 is modified to-> t1’

What do you mean by “t1 is modified to -> t1’”? That the `methodThatAppliesOperators()` method has changed its plan?

I was thinking more about something like this:

Table source = … // some source that scans files from a directory “/foo/bar/“
Table t1 = source.groupBy(…).select(…).where(…) ….; 
Table t2 = t1.materialize() // (or `cache()`)

t2.count() // initialise cache (if it’s lazily initialised)

int a1 = t1.count()
int b1 = t2.count() 

// something in the background (or we trigger it) writes new files to /foo/bar

int a2 = t1.count()
int b2 = t2.count() 

t2.refresh() // possible future extension, not to be implemented in the initial version

int a3 = t1.count()
int b3 = t2.count() 

t2.drop() // another possible future extension, manual “cache” dropping

assertTrue(a1 == b1) // same results, but b1 comes from the “cache”
assertTrue(b1 == b2) // both values come from the same cache
assertTrue(a2 > b2) // b2 comes from the cache, a2 re-executed the full table scan and has more data
assertTrue(b3 > b2) // b3 comes from the refreshed cache
assertTrue(b3 == a2 && a2 == a3)

Piotrek

Re: [DISCUSS] Support Interactive Programming in Flink Table API

Posted by Jark Wu <im...@gmail.com>.
Hi,

It is a very interesting and useful design!

Here I want to share some of my thoughts:

1. Agree that the cache() method should return some Table to avoid
unexpected problems caused by the mutable object.
   All the existing methods of Table return a new Table instance.
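
   For example, a minimal sketch assuming cache() returns a new Table (the
   source table "orders" and its column names are made up for illustration):

   Table t1 = tableEnv.scan("orders");       // hypothetical registered table
   Table cached = t1.cache();                // a new Table backed by the cached result
   Table a = cached.select("user, amount");  // reads the cached data
   Table b = t1.select("user, amount");      // original plan stays untouched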

2. I think materialize() would be more consistent with SQL; this makes it
possible to support the same feature for SQL (materialized views) and keep
the same API for users in the future.
   But I'm also fine if we choose cache().

3. In the proposal, a TableService (or FlinkService?) is used to cache the
result of the (intermediate) table.
   But the name TableService may be a bit too general and easy to
misread at first glance (a metastore for tables?).
   Maybe a more specific name would be better, such as TableCacheService or
TableMaterializeService or something else.
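
   As a purely hypothetical sketch of such a narrower contract (the name
   and methods are only illustrative, not part of the proposal):

   // Table is org.apache.flink.table.api.Table
   public interface TableCacheService {
       void cacheTable(String tableUuid, Table table); // materialize the table's result
       Table readCachedTable(String tableUuid);        // scan over the cached result
       void releaseCachedTable(String tableUuid);      // drop it when the session ends
   }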

Best,
Jark


Re: [DISCUSS] Support Interactive Programming in Flink Table API

Posted by Fabian Hueske <fh...@gmail.com>.
Hi,

Thanks for the clarification Becket!

I have a few thoughts to share / questions:

1) I'd like to know how you plan to implement the feature on a plan /
planner level.

I would imagine the following happening when Table.cache() is called:

1) immediately optimize the Table and internally convert it into a
DataSet/DataStream. This is necessary to prevent operators of later
queries on top of the Table from being pushed down.
2) register the DataSet/DataStream as a DataSet/DataStream-backed Table X
3) add a sink to the DataSet/DataStream. This is the materialization of the
Table X
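
In rough code terms, these steps could look as follows for a Table t1 in
the batch case (a sketch only; CacheOutputFormat stands in for whatever
sink the TableFactory provides):

DataSet<Row> ds = tableEnv.toDataSet(t1, Row.class); // 1) optimize + convert
tableEnv.registerDataSet("X", ds);                   // 2) register Table X
ds.output(new CacheOutputFormat());                  // 3) materialize Table X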

Based on your proposal the following would happen:

Table t1 = ....
t1.cache(); // cache() returns void. The logical plan of t1 is replaced by
a scan of X. There is also a reference to the materialization of X.

t1.count(); // this executes the program, including the DataSet/DataStream
that backs X and the sink that writes the materialization of X
t1.count(); // this executes the program, but reads X from the
materialization.

My question is: how do you determine when the scan of t1 should go
against the DataSet/DataStream program and when against the materialization?
AFAIK, there is no hook that will tell you that a part of the program was
executed. Flipping a switch during optimization or plan generation is not
sufficient as there is no guarantee that the plan is also executed.

Overall, this behavior is somewhat similar to what I proposed in
FLINK-8950, which does not include persisting the table, but just
optimizing and reregistering it as a DataSet/DataStream scan.

2) I think Piotr has a point about the implicit behavior and side effects
of the cache() method if it does not return anything.
Consider the following example:

Table t1 = ???
Table t2 = methodThatAppliesOperators(t1);
Table t3 = methodThatAppliesOtherOperators(t1);

In this case, the behavior/performance of the plan that results from the
second method call depends on whether t1 was modified by the first method
or not.
This is the classic issue of mutable vs. immutable objects.
Also, as Piotr pointed out, it might also be good to have the original plan
of t1, because in some cases it is possible to push filters down such that
evaluating the query from scratch might be more efficient than accessing
the cache.
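
For instance, with the original plan still available,

Table user42 = t1.filter("userId = 42"); // filter can be pushed towards the source

could read far less data than scanning the full cached result (userId is
just an illustrative column).
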
Moreover, a CachedTable could extend Table and offer a refresh() method.
This sounds quite useful in an interactive session mode.
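
A minimal sketch of such a handle (release() is my own addition here, and I
use composition rather than inheritance only to keep the sketch
self-contained):

// Table is org.apache.flink.table.api.Table
public interface CachedTable {
    Table toTable();  // the cached data, usable like any other Table
    void refresh();   // re-run the original plan and overwrite the cached data
    void release();   // drop the cached data before the session ends
}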

3) Regarding the name, I can see both arguments. IMO, materialize() seems
to be more future proof.

Best, Fabian


Re: [DISCUSS] Support Interactive Programming in Flink Table API

Posted by Shaoxuan Wang <ws...@gmail.com>.
Hi Piotr,

Thanks for sharing your ideas on the method naming. We will think about
your suggestions. But I don't understand why we need to change the return
type of cache().

Cache() is a physical operation; it does not change the logic of
the `Table`. At the Table API layer, we should not introduce a new table
type unless the logic of the table has changed. If we introduce a new
table type `CachedTable`, we would need to create the same set of methods
as `Table` for it, and I don't think that is worth doing. Or could you
please elaborate on what the "implicit behaviours/side effects" you are
thinking about might be?
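
To make the duplication concrete, here is a minimal sketch (hypothetical
Java interfaces, not the actual Flink API): a `CachedTable` subtype would
have to re-declare every relational method of `Table` just so that chained
calls keep the `CachedTable` type.

interface Table {
    Table filter(String predicate);
    Table select(String fields);
    // ... dozens of other relational methods
}

interface CachedTable extends Table {
    // Pure boilerplate: re-declared only to narrow the return type.
    @Override
    CachedTable filter(String predicate);

    @Override
    CachedTable select(String fields);
    // ... and so on for every method of Table
}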

Regards,
Shaoxuan



On Thu, Nov 29, 2018 at 7:05 PM Piotr Nowojski <pi...@data-artisans.com>
wrote:

> Hi Becket,
>
> Thanks for the response.
>
> 1. I wasn’t saying that a materialised view must be mutable or not; the
> same applies to caches as well. On the contrary, I would expect more
> consistency and updates from something that is called a “cache” than from
> something that’s a “materialised view”. In other words, IMO most caches do
> not serve you invalid/outdated data, and they handle updates on their own.
>
> 2. I don’t think that having two very similar concepts of `materialized`
> view and `cache` in the future is a good idea. It would be confusing for
> the users. I think it could be handled by variations/overloading of the
> materialised view concept. We could start with:
>
> `MaterializedTable materialize()` - immutable, session-lifetime scope
> (basically the same semantics as you are proposing).
>
> And then in the future (if ever) build on top of that/expand it with:
>
> `MaterializedTable materialize(refreshTime=…)` or `MaterializedTable
> materialize(refreshHook=…)`
>
> Or with cross session support:
>
> `MaterializedTable materializeInto(connector=…)` or `MaterializedTable
> materializeInto(tableFactory=…)`
>
> I’m not saying that we should implement cross-session support or
> refreshing now, or even in the near future. I’m just arguing that naming
> the current immutable, session-lifetime-scoped method `materialize()` is
> more future-proof and more consistent with SQL (on which, after all, the
> Table API is heavily based).
>
> 3. Even if we agree on naming it `cache()`, I would still insist on
> `cache()` returning a `CachedTable` handle, to avoid implicit
> behaviours/side effects and to give both us and users more flexibility.
>
> Piotrek
>
> > [...]

Re: [DISCUSS] Support Interactive Programming in Flink Table API

Posted by Piotr Nowojski <pi...@data-artisans.com>.
Hi Becket,

Thanks for the response.

1. I wasn’t saying that a materialised view must be mutable or not; the same applies to caches as well. On the contrary, I would expect more consistency and updates from something that is called a “cache” than from something that’s a “materialised view”. In other words, IMO most caches do not serve you invalid/outdated data, and they handle updates on their own.

2. I don’t think that having two very similar concepts of `materialized` view and `cache` in the future is a good idea. It would be confusing for the users. I think it could be handled by variations/overloading of the materialised view concept. We could start with:

`MaterializedTable materialize()` - immutable, session-lifetime scope (basically the same semantics as you are proposing).

And then in the future (if ever) build on top of that/expand it with:

`MaterializedTable materialize(refreshTime=…)` or `MaterializedTable materialize(refreshHook=…)`

Or with cross session support:

`MaterializedTable materializeInto(connector=…)` or `MaterializedTable materializeInto(tableFactory=…)`

I’m not saying that we should implement cross-session support or refreshing now, or even in the near future. I’m just arguing that naming the current immutable, session-lifetime-scoped method `materialize()` is more future-proof and more consistent with SQL (on which, after all, the Table API is heavily based).

3. Even if we agree on naming it `cache()`, I would still insist on `cache()` returning a `CachedTable` handle, to avoid implicit behaviours/side effects and to give both us and users more flexibility.
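
As a minimal sketch of the explicit-handle idea (hypothetical API: `cache()`, `CachedTable` and `count()` are the methods under discussion in this thread, not existing Flink calls):

Table b = tableEnv.scan("orders").filter("amount > 10");

CachedTable cachedB = b.cache();        // explicit handle to the cached data

long cnt = cachedB.count();             // reads the cached data
Table fresh = b.filter("userId = 42");  // still the original, un-cached plan

Only code that opts in through the handle sees the cached data, so adding the cache() call cannot silently change the behaviour of other call sites.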

Piotrek

> On 29 Nov 2018, at 06:20, Becket Qin <be...@gmail.com> wrote:
> 
> Just to add a little bit, the materialized view is probably more similar to
> the persist() brought up earlier in the thread. So it is usually
> cross-session and could be used in a larger scope. For example, a
> materialized view created by user A may be visible to user B. It is probably
> something we want to have in the future. I'll put it in the future work section.
> 
> Thanks,
> 
> Jiangjie (Becket) Qin
> 
> On Thu, Nov 29, 2018 at 9:47 AM Becket Qin <be...@gmail.com> wrote:
> 
>> Hi Piotrek,
>> 
>> Thanks for the explanation.
>> 
>> Right now we are mostly thinking of the cached table as immutable. I can
>> see that the materialized view would be useful in the future. That said, I
>> think a simple cache mechanism is probably still needed. So to me, cache()
>> and materialize() should be two separate methods, as they address different
>> needs. Materialize() is a higher-level concept, usually implying periodic
>> updates, while cache() has much simpler semantics. For example, one may
>> create a materialized view and use the cache() method in the materialized
>> view creation logic, so that during the materialized view update they do
>> not need to worry about the cached table also changing. Maybe under the
>> hood, materialize() and cache() could share some mechanism, but I think a
>> simple cache() method would be handy in a lot of cases.
>> 
>> Thanks,
>> 
>> Jiangjie (Becket) Qin
>> 
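
A rough sketch of the pattern Becket describes, where cache() pins a snapshot inside the refresh logic of a materialized view (hypothetical API; the method name refreshView, the sink, and writeToSink are illustrative only):

void refreshView(Table source, TableSink<?> viewSink) {
    // Pin an immutable snapshot of the input for the whole update, so the
    // refresh does not have to worry about the source changing under it.
    Table snapshot = source.cache();

    snapshot.groupBy("userId")
            .select("userId, amount.sum as total")
            .writeToSink(viewSink);
}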
>> [...]


Re: [DISCUSS] Support Interactive Programming in Flink Table API

Posted by Becket Qin <be...@gmail.com>.
Just to add a little bit, the materialized view is probably more similar to
the persist() brought up earlier in the thread. So it is usually
cross-session and could be used in a larger scope. For example, a
materialized view created by user A may be visible to user B. It is probably
something we want to have in the future. I'll put it in the future work section.
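
A sketch of what such cross-session sharing could look like, reusing the materializeInto(...) variant suggested earlier in the thread (hypothetical API; the catalog path and scan() arguments are illustrative only):

// Session of user A: materialize under a name that other sessions can see.
MaterializedTable ordersMv = orders.materializeInto("shared_catalog.orders_mv");

// Session of user B: pick up the same materialized data by name.
Table sharedOrders = tableEnvB.scan("shared_catalog", "orders_mv");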

Thanks,

Jiangjie (Becket) Qin

On Thu, Nov 29, 2018 at 9:47 AM Becket Qin <be...@gmail.com> wrote:

> Hi Piotrek,
>
> Thanks for the explanation.
>
> Right now we are mostly thinking of the cached table as immutable. I can
> see that the materialized view would be useful in the future. That said, I
> think a simple cache mechanism is probably still needed. So to me, cache()
> and materialize() should be two separate methods, as they address different
> needs. Materialize() is a higher-level concept, usually implying periodic
> updates, while cache() has much simpler semantics. For example, one may
> create a materialized view and use the cache() method in the materialized
> view creation logic, so that during the materialized view update they do
> not need to worry about the cached table also changing. Maybe under the
> hood, materialize() and cache() could share some mechanism, but I think a
> simple cache() method would be handy in a lot of cases.
>
> Thanks,
>
> Jiangjie (Becket) Qin
>
> On Mon, Nov 26, 2018 at 9:38 PM Piotr Nowojski <pi...@data-artisans.com>
> wrote:
>
>> Hi Becket,
>>
>> > Is there any extra thing user can do on a MaterializedTable that they
>> > cannot do on a Table?
>>
>> Maybe not in the initial implementation, but various DBs offer different
>> ways to “refresh” the materialised view. Hooks, triggers, timers, manually
>> etc. Having `MaterializedTable` would help us to handle that in the future.
>>
>> > After users call *table.cache(), *users can just use that table and do
>> > anything that is supported on a Table, including SQL.
>>
>> This is some implicit behaviour with side effects. Imagine a user has a
>> long and complicated program that touches table `b` multiple times, maybe
>> scattered around different methods, and he modifies his program by
>> inserting in one place
>>
>> b.cache()
>>
>> This implicitly alters the semantics and behaviour of his code all over
>> the place, maybe in ways that might cause problems. For example, what if
>> the underlying data is changing?
>>
>> Having invisible side effects is also not very clean, for example think
>> about something like this (but more complicated):
>>
>> Table b = ...;
>>
>> if (some_condition) {
>>   processTable1(b)
>> }
>> else {
>>   processTable2(b)
>> }
>>
>> // do more stuff with b
>>
>> And the user adds a `b.cache()` call to only one of the `processTable1` or
>> `processTable2` methods.
>>
>> On the other hand
>>
>> Table materialisedB = b.materialize()
>>
>> Avoids (at least some of) the side effect issues, forces the user to
>> explicitly use `materialisedB` where it’s appropriate, and forces the user
>> to think about what it actually means. And if something doesn’t work in
>> the end for the user, he will know what he has changed instead of blaming
>> Flink for some “magic” underneath. In the above example, after
>> materialising b in only one of the methods, he should/would realise the
>> issue when handling the return value `MaterializedTable` of that method.
>>
>> I guess it comes down to personal preference whether you like things to be
>> implicit or not. The more of a power user someone is, the more likely he is
>> to like/understand implicit behaviour. And we as Table API designers are
>> the most power users out there, so I would proceed with caution (so that we
>> do not end up in the crazy perl realm with its lovely implicit method
>> arguments ;)  <https://stackoverflow.com/a/14922656/8149051>)
>>
>> > Table API to also support non-relational processing cases, cache()
>> > might be slightly better.
>>
>> I think even such an extended Table API could benefit from sticking
>> to/being consistent with SQL, where both SQL and the Table API are
>> basically the same.
>>
>> One more thing. `MaterializedTable materialize()` could be more
>> powerful/flexible, allowing the user to operate on both the materialised
>> and the non-materialised view at the same time for whatever reason
>> (underlying data changing, better optimisation opportunities after pushing
>> down more filters, etc.). For example:
>>
>> Table b = …;
>>
>> MaterializedTable mb = b.materialize();
>>
>> val min = mb.min();
>> val max = mb.max();
>>
>> val user42 = b.filter('userId === 42);
>>
>> Could be more efficient compared to `b.cache()` if `filter('userId ===
>> 42)` allows for much more aggressive optimisations.
>>
>> Piotrek
>>
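
A minimal sketch of the handle described in the quoted message (hypothetical interface; refresh() and drop() stand for the kind of "refresh"/"delete" hooks mentioned above):

interface MaterializedTable extends Table {
    void refresh();  // re-run the backing job (hook, trigger, timer, or manual)
    void drop();     // release the underlying storage before the session ends
}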
>> > On 26 Nov 2018, at 12:14, Fabian Hueske <fh...@gmail.com> wrote:
>> >
>> > I'm not suggesting that we add support for Ignite. This was just an
>> > example. Plasma and Arrow sound interesting, too.
>> > For the sake of this proposal, it would be up to the user to implement a
>> > TableFactory and corresponding TableSource / TableSink classes to
>> > persist and read the data.
>> >
>> > Am Mo., 26. Nov. 2018 um 12:06 Uhr schrieb Flavio Pompermaier <
>> > pompermaier@okkam.it>:
>> >
>> >> What about also adding Apache Plasma + Arrow as an alternative to
>> >> Apache Ignite?
>> >> [1]
>> >>
>> >> https://arrow.apache.org/blog/2017/08/08/plasma-in-memory-object-store/
>> >>
>> >> On Mon, Nov 26, 2018 at 11:56 AM Fabian Hueske <fh...@gmail.com> wrote:
>> >>
>> >>> Hi,
>> >>>
>> >>> Thanks for the proposal!
>> >>>
>> >>> To summarize, you propose a new method Table.cache(): Table that will
>> >>> trigger a job and write the result into some temporary storage as
>> >>> defined by a TableFactory.
>> >>> The cache() call blocks while the job is running and eventually returns
>> >>> a Table object that represents a scan of the temporary table.
>> >>> When the "session" is closed (closing to be defined?), the temporary
>> >>> tables are all dropped.
>> >>>
>> >>> I think this behavior makes sense and is a good first step towards more
>> >>> interactive workloads.
>> >>> However, its performance suffers from writing to and reading from
>> >>> external systems.
>> >>> I think this is OK for now. Changes that would significantly improve the
>> >>> situation (i.e., pinning data in-memory across jobs) would have large
>> >>> impacts on many components of Flink.
>> >>> Users could use in-memory filesystems or storage grids (Apache Ignite)
>> >>> to mitigate some of the performance effects.
>> >>>
>> >>> Best, Fabian
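
A rough sketch of the bookkeeping behind the lifecycle Fabian summarizes (hypothetical helper class, not actual Flink code; the job submission and sink creation are left as stubs):

import java.util.ArrayList;
import java.util.List;
import java.util.UUID;

final class TempTableRegistry implements AutoCloseable {

    private final List<String> tempTables = new ArrayList<>();

    /** Called by cache(): run the job, remember the temp table, return its name. */
    String materializeToTemp(String tableName) {
        String tempName = "temp_" + tableName + "_" + UUID.randomUUID();
        // 1) Create a sink for tempName via the configured TableFactory (stub).
        // 2) Submit the job and block until it finishes (stub).
        tempTables.add(tempName);
        return tempName;  // the Table returned by cache() is a scan of this name
    }

    @Override
    public void close() {
        // "Session" closed: drop all temporary tables.
        for (String t : tempTables) {
            System.out.println("DROP TABLE " + t);  // stand-in for real cleanup
        }
        tempTables.clear();
    }
}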
>> >>> [...]
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>
>> >>>>>>>
>> >>>>>
>> >>>>>
>> >>>>
>> >>>
>> >>
>>
>>

Re: [DISCUSS] Support Interactive Programming in Flink Table API

Posted by Becket Qin <be...@gmail.com>.
Hi Piotrek,

Thanks for the explanation.

Right now we are mostly thinking of the cached table as immutable. I can
see that a materialized view would be useful in the future. That said, I
think a simple cache mechanism is probably still needed. So to me, cache()
and materialize() should be two separate methods, as they address different
needs. materialize() is a higher-level concept, usually implying periodic
updates, while cache() has much simpler semantics. For example, one may
create a materialized view and use the cache() method in the materialized
view creation logic, so that during the materialized view update there is
no need to worry about the cached table also changing. Maybe under the
hood, materialize() and cache() could share some mechanism, but I think a
simple cache() method would be handy in a lot of cases.
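
To make the distinction concrete, here is a rough sketch of how I imagine
the two could be used together. This is pseudocode based on this proposal
only; cache(), materialize(), MaterializedTable and refresh() are
assumptions from this discussion, not an existing Flink API:

// Pseudocode sketch of the proposal, not an existing API.
Table b = tEnv.scan("orders").filter("amount > 100");

// cache(): an immutable snapshot for the session. Later operations on
// `cached` read the stored result instead of recomputing the filter.
Table cached = b.cache();
Table report = cached.groupBy("userId").select("userId, amount.sum");

// materialize(): a higher-level, refreshable view. Its update logic could
// itself call cache() to work on a stable input snapshot.
MaterializedTable view = b.materialize();
view.refresh(); // assumed manual-refresh hook on MaterializedTable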

Thanks,

Jiangjie (Becket) Qin

On Mon, Nov 26, 2018 at 9:38 PM Piotr Nowojski <pi...@data-artisans.com>
wrote:

> Hi Becket,
>
> > Is there any extra thing user can do on a MaterializedTable that they
> cannot do on a Table?
>
> Maybe not in the initial implementation, but various DBs offer different
> ways to “refresh” the materialised view. Hooks, triggers, timers, manually
> etc. Having `MaterializedTable` would help us to handle that in the future.
>
> > After users call *table.cache(), *users can just use that table and do
> anything that is supported on a Table, including SQL.
>
> This is some implicit behaviour with side effects. Imagine if user has a
> long and complicated program, that touches table `b` multiple times, maybe
> scattered around different methods. If he modifies his program by inserting
> in one place
>
> b.cache()
>
> This implicitly alters the semantic and behaviour of his code all over the
> place, maybe in a ways that might cause problems. For example what if
> underlying data is changing?
>
> Having invisible side effects is also not very clean, for example think
> about something like this (but more complicated):
>
> Table b = ...;
>
> If (some_condition) {
>   processTable1(b)
> }
> else {
>   processTable2(b)
> }
>
> // do more stuff with b
>
> And user adds `b.cache()` call to only one of the `processTable1` or
> `processTable2` methods.
>
> On the other hand
>
> Table materialisedB = b.materialize()
>
> Avoids (at least some of) the side effect issues and forces user to
> explicitly use `materialisedB` where it’s appropriate and forces user to
> think what does it actually mean. And if something doesn’t work in the end
> for the user, he will know what has he changed instead of blaming Flink for
> some “magic” underneath. In the above example, after materialising b in
> only one of the methods, he should/would realise about the issue when
> handling the return value `MaterializedTable` of that method.
>
> I guess it comes down to personal preferences if you like things to be
> implicit or not. The more power is the user, probably the more likely he is
> to like/understand implicit behaviour. And we as Table API designers are
> the most power users out there, so I would proceed with caution (so that we
> do not end up in the crazy perl realm with it’s lovely implicit method
> arguments ;)  <https://stackoverflow.com/a/14922656/8149051>)
>
> > Table API to also support non-relational processing cases, cache() might
> be slightly better.
>
> I think even such extended Table API could benefit from sticking to/being
> consistent with SQL where both SQL and Table API are basically the same.
>
> One more thing. `MaterializedTable materialize()` could be more
> powerful/flexible allowing the user to operate both on materialised and not
> materialised view at the same time for whatever reasons (underlying data
> changing/better optimisation opportunities after pushing down more filters
> etc). For example:
>
> Table b = …;
>
> MaterlizedTable mb = b.materialize();
>
> Val min = mb.min();
> Val max = mb.max();
>
> Val user42 = b.filter(‘userId = 42);
>
> Could be more efficient compared to `b.cache()` if `filter(‘userId = 42);`
> allows for much more aggressive optimisations.
>
> Piotrek
>
> > On 26 Nov 2018, at 12:14, Fabian Hueske <fh...@gmail.com> wrote:
> >
> > I'm not suggesting to add support for Ignite. This was just an example.
> > Plasma and Arrow sound interesting, too.
> > For the sake of this proposal, it would be up to the user to implement a
> > TableFactory and corresponding TableSource / TableSink classes to persist
> > and read the data.
> >
> > Am Mo., 26. Nov. 2018 um 12:06 Uhr schrieb Flavio Pompermaier <
> > pompermaier@okkam.it>:
> >
> >> What about to add also Apache Plasma + Arrow as an alternative to Apache
> >> Ignite?
> >> [1]
> >> https://arrow.apache.org/blog/2017/08/08/plasma-in-memory-object-store/
> >>
> >> On Mon, Nov 26, 2018 at 11:56 AM Fabian Hueske <fh...@gmail.com>
> wrote:
> >>
> >>> Hi,
> >>>
> >>> Thanks for the proposal!
> >>>
> >>> To summarize, you propose a new method Table.cache(): Table that will
> >>> trigger a job and write the result into some temporary storage as
> defined
> >>> by a TableFactory.
> >>> The cache() call blocks while the job is running and eventually
> returns a
> >>> Table object that represents a scan of the temporary table.
> >>> When the "session" is closed (closing to be defined?), the temporary
> >> tables
> >>> are all dropped.
> >>>
> >>> I think this behavior makes sense and is a good first step towards more
> >>> interactive workloads.
> >>> However, its performance suffers from writing to and reading from
> >> external
> >>> systems.
> >>> I think this is OK for now. Changes that would significantly improve
> the
> >>> situation (i.e., pinning data in-memory across jobs) would have large
> >>> impacts on many components of Flink.
> >>> Users could use in-memory filesystems or storage grids (Apache Ignite)
> to
> >>> mitigate some of the performance effects.
> >>>
> >>> Best, Fabian
> >>>
> >>>
> >>>
> >>> Am Mo., 26. Nov. 2018 um 03:38 Uhr schrieb Becket Qin <
> >>> becket.qin@gmail.com
> >>>> :
> >>>
> >>>> Thanks for the explanation, Piotrek.
> >>>>
> >>>> Is there any extra thing user can do on a MaterializedTable that they
> >>>> cannot do on a Table? After users call *table.cache(), *users can just
> >>> use
> >>>> that table and do anything that is supported on a Table, including
> SQL.
> >>>>
> >>>> Naming wise, either cache() or materialize() sounds fine to me.
> cache()
> >>> is
> >>>> a bit more general than materialize(). Given that we are enhancing the
> >>>> Table API to also support non-relational processing cases, cache()
> >> might
> >>> be
> >>>> slightly better.
> >>>>
> >>>> Thanks,
> >>>>
> >>>> Jiangjie (Becket) Qin
> >>>>
> >>>>
> >>>>
> >>>> On Fri, Nov 23, 2018 at 11:25 PM Piotr Nowojski <
> >> piotr@data-artisans.com
> >>>>
> >>>> wrote:
> >>>>
> >>>>> Hi Becket,
> >>>>>
> >>>>> Ops, sorry I didn’t notice that you intend to reuse existing
> >>>>> `TableFactory`. I don’t know why, but I assumed that you want to
> >>> provide
> >>>> an
> >>>>> alternate way of writing the data.
> >>>>>
> >>>>> Now that I hopefully understand the proposal, maybe we could rename
> >>>>> `cache()` to
> >>>>>
> >>>>> void materialize()
> >>>>>
> >>>>> or going step further
> >>>>>
> >>>>> MaterializedTable materialize()
> >>>>> MaterializedTable createMaterializedView()
> >>>>>
> >>>>> ?
> >>>>>
> >>>>> The second option with returning a handle I think is more flexible
> >> and
> >>>>> could provide features such as “refresh”/“delete” or generally
> >> speaking
> >>>>> manage the the view. In the future we could also think about adding
> >>> hooks
> >>>>> to automatically refresh view etc. It is also more explicit -
> >>>>> materialization returning a new table handle will not have the same
> >>>>> implicit side effects as adding a simple line of code like
> >> `b.cache()`
> >>>>> would have.
> >>>>>
> >>>>> It would also be more SQL like, making it more intuitive for users
> >>>> already
> >>>>> familiar with the SQL.
> >>>>>
> >>>>> Piotrek
> >>>>>
> >>>>>> On 23 Nov 2018, at 14:53, Becket Qin <be...@gmail.com> wrote:
> >>>>>>
> >>>>>> Hi Piotrek,
> >>>>>>
> >>>>>> For the cache() method itself, yes, it is equivalent to creating a
> >>>>> BUILT-IN
> >>>>>> materialized view with a lifecycle. That functionality is missing
> >>>> today,
> >>>>>> though. Not sure if I understand your question. Do you mean we
> >>> already
> >>>>> have
> >>>>>> the functionality and just need a syntax sugar?
> >>>>>>
> >>>>>> What's more interesting in the proposal is do we want to stop at
> >>>> creating
> >>>>>> the materialized view? Or do we want to extend that in the future
> >> to
> >>> a
> >>>>> more
> >>>>>> useful unified data store distributed with Flink? And do we want to
> >>>> have
> >>>>> a
> >>>>>> mechanism allow more flexible user job pattern with their own user
> >>>>> defined
> >>>>>> services. These considerations are much more architectural.
> >>>>>>
> >>>>>> Thanks,
> >>>>>>
> >>>>>> Jiangjie (Becket) Qin
> >>>>>>
> >>>>>> On Fri, Nov 23, 2018 at 6:01 PM Piotr Nowojski <
> >>>> piotr@data-artisans.com>
> >>>>>> wrote:
> >>>>>>
> >>>>>>> Hi,
> >>>>>>>
> >>>>>>> Interesting idea. I’m trying to understand the problem. Isn’t the
> >>>>>>> `cache()` call an equivalent of writing data to a sink and later
> >>>> reading
> >>>>>>> from it? Where this sink has a limited live scope/live time? And
> >> the
> >>>>> sink
> >>>>>>> could be implemented as in memory or a file sink?
> >>>>>>>
> >>>>>>> If so, what’s the problem with creating a materialised view from a
> >>>> table
> >>>>>>> “b” (from your document’s example) and reusing this materialised
> >>> view
> >>>>>>> later? Maybe we are lacking mechanisms to clean up materialised
> >>> views
> >>>>> (for
> >>>>>>> example when current session finishes)? Maybe we need some
> >> syntactic
> >>>>> sugar
> >>>>>>> on top of it?
> >>>>>>>
> >>>>>>> Piotrek
> >>>>>>>
> >>>>>>>> On 23 Nov 2018, at 07:21, Becket Qin <be...@gmail.com>
> >> wrote:
> >>>>>>>>
> >>>>>>>> Thanks for the suggestion, Jincheng.
> >>>>>>>>
> >>>>>>>> Yes, I think it makes sense to have a persist() with
> >>>> lifecycle/defined
> >>>>>>>> scope. I just added a section in the future work for this.
> >>>>>>>>
> >>>>>>>> Thanks,
> >>>>>>>>
> >>>>>>>> Jiangjie (Becket) Qin
> >>>>>>>>
> >>>>>>>> On Fri, Nov 23, 2018 at 1:55 PM jincheng sun <
> >>>> sunjincheng121@gmail.com
> >>>>>>
> >>>>>>>> wrote:
> >>>>>>>>
> >>>>>>>>> Hi Jiangjie,
> >>>>>>>>>
> >>>>>>>>> Thank you for the explanation about the name of `cache()`, I
> >>>>> understand
> >>>>>>> why
> >>>>>>>>> you designed this way!
> >>>>>>>>>
> >>>>>>>>> Another idea is whether we can specify a lifecycle for data
> >>>>> persistence?
> >>>>>>>>> For example, persist (LifeCycle.SESSION), so that the user is
> >> not
> >>>>>>> worried
> >>>>>>>>> about data loss, and will clearly specify the time range for
> >>> keeping
> >>>>>>> time.
> >>>>>>>>> At the same time, if we want to expand, we can also share in a
> >>>> certain
> >>>>>>>>> group of session, for example: LifeCycle.SESSION_GROUP(...), I
> >> am
> >>>> not
> >>>>>>> sure,
> >>>>>>>>> just an immature suggestion, for reference only!
> >>>>>>>>>
> >>>>>>>>> Bests,
> >>>>>>>>> Jincheng
> >>>>>>>>>
> >>>>>>>>> Becket Qin <be...@gmail.com> 于2018年11月23日周五 下午1:33写道:
> >>>>>>>>>
> >>>>>>>>>> Re: Jincheng,
> >>>>>>>>>>
> >>>>>>>>>> Thanks for the feedback. Regarding cache() v.s. persist(),
> >>>>> personally I
> >>>>>>>>>> find cache() to be more accurately describing the behavior,
> >> i.e.
> >>>> the
> >>>>>>>>> Table
> >>>>>>>>>> is cached for the session, but will be deleted after the
> >> session
> >>> is
> >>>>>>>>> closed.
> >>>>>>>>>> persist() seems a little misleading as people might think the
> >>> table
> >>>>>>> will
> >>>>>>>>>> still be there even after the session is gone.
> >>>>>>>>>>
> >>>>>>>>>> Great point about mixing the batch and stream processing in the
> >>>> same
> >>>>>>> job.
> >>>>>>>>>> We should absolutely move towards that goal. I imagine that
> >> would
> >>>> be
> >>>>> a
> >>>>>>>>> huge
> >>>>>>>>>> change across the board, including sources, operators and
> >>>>>>> optimizations,
> >>>>>>>>> to
> >>>>>>>>>> name some. Likely we will need several separate in-depth
> >>>> discussions.
> >>>>>>>>>>
> >>>>>>>>>> Thanks,
> >>>>>>>>>>
> >>>>>>>>>> Jiangjie (Becket) Qin
> >>>>>>>>>>
> >>>>>>>>>> On Fri, Nov 23, 2018 at 5:14 AM Xingcan Cui <
> >> xingcanc@gmail.com>
> >>>>>>> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> Hi all,
> >>>>>>>>>>>
> >>>>>>>>>>> @Shaoxuan, I think the lifecycle or access domain are both
> >>>>> orthogonal
> >>>>>>>>> to
> >>>>>>>>>>> the cache problem. Essentially, this may be the first time we
> >>> plan
> >>>>> to
> >>>>>>>>>>> introduce another storage mechanism other than the state.
> >> Maybe
> >>>> it’s
> >>>>>>>>>> better
> >>>>>>>>>>> to first draw a big picture and then concentrate on a specific
> >>>> part?
> >>>>>>>>>>>
> >>>>>>>>>>> @Becket, yes, actually I am more concerned with the underlying
> >>>>>>> service.
> >>>>>>>>>>> This seems to be quite a major change to the existing
> >> codebase.
> >>> As
> >>>>> you
> >>>>>>>>>>> claimed, the service should be extendible to support other
> >>>>> components
> >>>>>>>>> and
> >>>>>>>>>>> we’d better discussed it in another thread.
> >>>>>>>>>>>
> >>>>>>>>>>> All in all, I also eager to enjoy the more interactive Table
> >>> API,
> >>>> in
> >>>>>>>>> case
> >>>>>>>>>>> of a general and flexible enough service mechanism.
> >>>>>>>>>>>
> >>>>>>>>>>> Best,
> >>>>>>>>>>> Xingcan
> >>>>>>>>>>>
> >>>>>>>>>>>> On Nov 22, 2018, at 10:16 AM, Xiaowei Jiang <
> >>> xiaoweij@gmail.com>
> >>>>>>>>>> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>> Relying on a callback for the temp table for clean up is not
> >>> very
> >>>>>>>>>>> reliable.
> >>>>>>>>>>>> There is no guarantee that it will be executed successfully.
> >> We
> >>>> may
> >>>>>>>>>> risk
> >>>>>>>>>>>> leaks when that happens. I think that it's safer to have an
> >>>>>>>>> association
> >>>>>>>>>>>> between temp table and session id. So we can always clean up
> >>> temp
> >>>>>>>>>> tables
> >>>>>>>>>>>> which are no longer associated with any active sessions.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Regards,
> >>>>>>>>>>>> Xiaowei
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Thu, Nov 22, 2018 at 12:55 PM jincheng sun <
> >>>>>>>>>> sunjincheng121@gmail.com>
> >>>>>>>>>>>> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>>> Hi Jiangjie&Shaoxuan,
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Thanks for initiating this great proposal!
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Interactive Programming is very useful and user friendly in
> >>> case
> >>>>> of
> >>>>>>>>>> your
> >>>>>>>>>>>>> examples.
> >>>>>>>>>>>>> Moreover, especially when a business has to be executed in
> >>>> several
> >>>>>>>>>>> stages
> >>>>>>>>>>>>> with dependencies,such as the pipeline of Flink ML, in order
> >>> to
> >>>>>>>>>> utilize
> >>>>>>>>>>> the
> >>>>>>>>>>>>> intermediate calculation results we have to submit a job by
> >>>>>>>>>>> env.execute().
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> About the `cache()`  , I think is better to named
> >> `persist()`,
> >>>> And
> >>>>>>>>> The
> >>>>>>>>>>>>> Flink framework determines whether we internally cache in
> >>> memory
> >>>>> or
> >>>>>>>>>>> persist
> >>>>>>>>>>>>> to the storage system,Maybe save the data into state backend
> >>>>>>>>>>>>> (MemoryStateBackend or RocksDBStateBackend etc.)
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> BTW, from the points of my view in the future, support for
> >>>>> streaming
> >>>>>>>>>> and
> >>>>>>>>>>>>> batch mode switching in the same job will also benefit in
> >>>>>>>>> "Interactive
> >>>>>>>>>>>>> Programming",  I am looking forward to your JIRAs and FLIP!
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Best,
> >>>>>>>>>>>>> Jincheng
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Becket Qin <be...@gmail.com> 于2018年11月20日周二 下午9:56写道:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> Hi all,
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> As a few recent email threads have pointed out, it is a
> >>>> promising
> >>>>>>>>>>>>>> opportunity to enhance Flink Table API in various aspects,
> >>>>>>>>> including
> >>>>>>>>>>>>>> functionality and ease of use among others. One of the
> >>>> scenarios
> >>>>>>>>>> where
> >>>>>>>>>>> we
> >>>>>>>>>>>>>> feel Flink could improve is interactive programming. To
> >>> explain
> >>>>> the
> >>>>>>>>>>>>> issues
> >>>>>>>>>>>>>> and facilitate the discussion on the solution, we put
> >>> together
> >>>>> the
> >>>>>>>>>>>>>> following document with our proposal.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>
> >>>>>
> >>>>
> >>>
> >>
> https://docs.google.com/document/d/1d4T2zTyfe7hdncEUAxrlNOYr4e5IMNEZLyqSuuswkA0/edit?usp=sharing
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Feedback and comments are very welcome!
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Jiangjie (Becket) Qin
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>
> >>>>>
> >>>>
> >>>
> >>
>
>

Re: [DISCUSS] Support Interactive Programming in Flink Table API

Posted by Piotr Nowojski <pi...@data-artisans.com>.
Hi Becket,

> Is there any extra thing user can do on a MaterializedTable that they cannot do on a Table? 

Maybe not in the initial implementation, but various DBs offer different ways to “refresh” a materialised view: hooks, triggers, timers, manual refreshes, etc. Having `MaterializedTable` would help us handle that in the future.
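
Just to sketch the kind of handle I have in mind (purely hypothetical,
nothing of this exists yet and the method names are only placeholders):

import java.time.Duration;
import org.apache.flink.table.api.Table;

// Hypothetical handle for managing a materialisation's lifecycle; it wraps
// the Table it was created from. All names here are placeholders.
public interface MaterializedTable {
    Table toTable();                      // the materialised result as a Table
    void refresh();                       // manually re-run the defining query
    void refreshEvery(Duration interval); // timer-based refresh
    void drop();                          // explicitly remove the materialisation
}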

> After users call *table.cache(), *users can just use that table and do anything that is supported on a Table, including SQL.

This is implicit behaviour with side effects. Imagine a user has a long and complicated program that touches table `b` multiple times, maybe scattered across different methods. If he modifies his program by inserting, in one place,

b.cache()

This implicitly alters the semantics and behaviour of his code all over the place, maybe in ways that cause problems. For example, what if the underlying data is changing?

Having invisible side effects is also not very clean. For example, think about something like this (but more complicated):

Table b = ...;

if (someCondition) {
  processTable1(b);
} else {
  processTable2(b);
}

// do more stuff with b
 
And the user adds a `b.cache()` call to only one of the `processTable1` or `processTable2` methods.

On the other hand

Table materialisedB = b.materialize();

This avoids (at least some of) the side effect issues, forces the user to explicitly use `materialisedB` where it’s appropriate, and forces the user to think about what it actually means. And if something doesn’t work in the end, the user will know what he has changed instead of blaming Flink for some “magic” underneath. In the above example, after materialising `b` in only one of the methods, he would realise the issue when handling the `MaterializedTable` return value of that method.

I guess it comes down to personal preference whether you like things to be implicit or not. The more of a power user someone is, the more likely he is to like/understand implicit behaviour. And we as Table API designers are the most power users out there, so I would proceed with caution (so that we do not end up in the crazy Perl realm with its lovely implicit method arguments ;)  <https://stackoverflow.com/a/14922656/8149051>)

> Table API to also support non-relational processing cases, cache() might be slightly better.

I think even such an extended Table API could benefit from sticking to / being consistent with SQL, where both SQL and the Table API are basically the same.

One more thing: `MaterializedTable materialize()` could be more powerful/flexible, allowing the user to operate on both the materialised and the non-materialised view at the same time, for whatever reason (underlying data changing / better optimisation opportunities after pushing down more filters, etc.). For example:

Table b = …;

MaterializedTable mb = b.materialize();

val min = mb.min();
val max = mb.max();

val user42 = b.filter('userId === 42);

This could be more efficient compared to `b.cache()` if `filter('userId === 42)` allows for much more aggressive optimisations.

Piotrek
 
> On 26 Nov 2018, at 12:14, Fabian Hueske <fh...@gmail.com> wrote:
> 
> I'm not suggesting to add support for Ignite. This was just an example.
> Plasma and Arrow sound interesting, too.
> For the sake of this proposal, it would be up to the user to implement a
> TableFactory and corresponding TableSource / TableSink classes to persist
> and read the data.
> 
> Am Mo., 26. Nov. 2018 um 12:06 Uhr schrieb Flavio Pompermaier <
> pompermaier@okkam.it>:
> 
>> What about to add also Apache Plasma + Arrow as an alternative to Apache
>> Ignite?
>> [1]
>> https://arrow.apache.org/blog/2017/08/08/plasma-in-memory-object-store/
>> 
>> On Mon, Nov 26, 2018 at 11:56 AM Fabian Hueske <fh...@gmail.com> wrote:
>> 
>>> Hi,
>>> 
>>> Thanks for the proposal!
>>> 
>>> To summarize, you propose a new method Table.cache(): Table that will
>>> trigger a job and write the result into some temporary storage as defined
>>> by a TableFactory.
>>> The cache() call blocks while the job is running and eventually returns a
>>> Table object that represents a scan of the temporary table.
>>> When the "session" is closed (closing to be defined?), the temporary
>> tables
>>> are all dropped.
>>> 
>>> I think this behavior makes sense and is a good first step towards more
>>> interactive workloads.
>>> However, its performance suffers from writing to and reading from
>> external
>>> systems.
>>> I think this is OK for now. Changes that would significantly improve the
>>> situation (i.e., pinning data in-memory across jobs) would have large
>>> impacts on many components of Flink.
>>> Users could use in-memory filesystems or storage grids (Apache Ignite) to
>>> mitigate some of the performance effects.
>>> 
>>> Best, Fabian
>>> 
>>> 
>>> 
>>> Am Mo., 26. Nov. 2018 um 03:38 Uhr schrieb Becket Qin <
>>> becket.qin@gmail.com
>>>> :
>>> 
>>>> Thanks for the explanation, Piotrek.
>>>> 
>>>> Is there any extra thing user can do on a MaterializedTable that they
>>>> cannot do on a Table? After users call *table.cache(), *users can just
>>> use
>>>> that table and do anything that is supported on a Table, including SQL.
>>>> 
>>>> Naming wise, either cache() or materialize() sounds fine to me. cache()
>>> is
>>>> a bit more general than materialize(). Given that we are enhancing the
>>>> Table API to also support non-relational processing cases, cache()
>> might
>>> be
>>>> slightly better.
>>>> 
>>>> Thanks,
>>>> 
>>>> Jiangjie (Becket) Qin
>>>> 
>>>> 
>>>> 
>>>> On Fri, Nov 23, 2018 at 11:25 PM Piotr Nowojski <
>> piotr@data-artisans.com
>>>> 
>>>> wrote:
>>>> 
>>>>> Hi Becket,
>>>>> 
>>>>> Ops, sorry I didn’t notice that you intend to reuse existing
>>>>> `TableFactory`. I don’t know why, but I assumed that you want to
>>> provide
>>>> an
>>>>> alternate way of writing the data.
>>>>> 
>>>>> Now that I hopefully understand the proposal, maybe we could rename
>>>>> `cache()` to
>>>>> 
>>>>> void materialize()
>>>>> 
>>>>> or going step further
>>>>> 
>>>>> MaterializedTable materialize()
>>>>> MaterializedTable createMaterializedView()
>>>>> 
>>>>> ?
>>>>> 
>>>>> The second option with returning a handle I think is more flexible
>> and
>>>>> could provide features such as “refresh”/“delete” or generally
>> speaking
>>>>> manage the the view. In the future we could also think about adding
>>> hooks
>>>>> to automatically refresh view etc. It is also more explicit -
>>>>> materialization returning a new table handle will not have the same
>>>>> implicit side effects as adding a simple line of code like
>> `b.cache()`
>>>>> would have.
>>>>> 
>>>>> It would also be more SQL like, making it more intuitive for users
>>>> already
>>>>> familiar with the SQL.
>>>>> 
>>>>> Piotrek
>>>>> 
>>>>>> On 23 Nov 2018, at 14:53, Becket Qin <be...@gmail.com> wrote:
>>>>>> 
>>>>>> Hi Piotrek,
>>>>>> 
>>>>>> For the cache() method itself, yes, it is equivalent to creating a
>>>>> BUILT-IN
>>>>>> materialized view with a lifecycle. That functionality is missing
>>>> today,
>>>>>> though. Not sure if I understand your question. Do you mean we
>>> already
>>>>> have
>>>>>> the functionality and just need a syntax sugar?
>>>>>> 
>>>>>> What's more interesting in the proposal is do we want to stop at
>>>> creating
>>>>>> the materialized view? Or do we want to extend that in the future
>> to
>>> a
>>>>> more
>>>>>> useful unified data store distributed with Flink? And do we want to
>>>> have
>>>>> a
>>>>>> mechanism allow more flexible user job pattern with their own user
>>>>> defined
>>>>>> services. These considerations are much more architectural.
>>>>>> 
>>>>>> Thanks,
>>>>>> 
>>>>>> Jiangjie (Becket) Qin
>>>>>> 
>>>>>> On Fri, Nov 23, 2018 at 6:01 PM Piotr Nowojski <
>>>> piotr@data-artisans.com>
>>>>>> wrote:
>>>>>> 
>>>>>>> Hi,
>>>>>>> 
>>>>>>> Interesting idea. I’m trying to understand the problem. Isn’t the
>>>>>>> `cache()` call an equivalent of writing data to a sink and later
>>>> reading
>>>>>>> from it? Where this sink has a limited live scope/live time? And
>> the
>>>>> sink
>>>>>>> could be implemented as in memory or a file sink?
>>>>>>> 
>>>>>>> If so, what’s the problem with creating a materialised view from a
>>>> table
>>>>>>> “b” (from your document’s example) and reusing this materialised
>>> view
>>>>>>> later? Maybe we are lacking mechanisms to clean up materialised
>>> views
>>>>> (for
>>>>>>> example when current session finishes)? Maybe we need some
>> syntactic
>>>>> sugar
>>>>>>> on top of it?
>>>>>>> 
>>>>>>> Piotrek
>>>>>>> 
>>>>>>>> On 23 Nov 2018, at 07:21, Becket Qin <be...@gmail.com>
>> wrote:
>>>>>>>> 
>>>>>>>> Thanks for the suggestion, Jincheng.
>>>>>>>> 
>>>>>>>> Yes, I think it makes sense to have a persist() with
>>>> lifecycle/defined
>>>>>>>> scope. I just added a section in the future work for this.
>>>>>>>> 
>>>>>>>> Thanks,
>>>>>>>> 
>>>>>>>> Jiangjie (Becket) Qin
>>>>>>>> 
>>>>>>>> On Fri, Nov 23, 2018 at 1:55 PM jincheng sun <
>>>> sunjincheng121@gmail.com
>>>>>> 
>>>>>>>> wrote:
>>>>>>>> 
>>>>>>>>> Hi Jiangjie,
>>>>>>>>> 
>>>>>>>>> Thank you for the explanation about the name of `cache()`, I
>>>>> understand
>>>>>>> why
>>>>>>>>> you designed this way!
>>>>>>>>> 
>>>>>>>>> Another idea is whether we can specify a lifecycle for data
>>>>> persistence?
>>>>>>>>> For example, persist (LifeCycle.SESSION), so that the user is
>> not
>>>>>>> worried
>>>>>>>>> about data loss, and will clearly specify the time range for
>>> keeping
>>>>>>> time.
>>>>>>>>> At the same time, if we want to expand, we can also share in a
>>>> certain
>>>>>>>>> group of session, for example: LifeCycle.SESSION_GROUP(...), I
>> am
>>>> not
>>>>>>> sure,
>>>>>>>>> just an immature suggestion, for reference only!
>>>>>>>>> 
>>>>>>>>> Bests,
>>>>>>>>> Jincheng
>>>>>>>>> 
>>>>>>>>> Becket Qin <be...@gmail.com> 于2018年11月23日周五 下午1:33写道:
>>>>>>>>> 
>>>>>>>>>> Re: Jincheng,
>>>>>>>>>> 
>>>>>>>>>> Thanks for the feedback. Regarding cache() v.s. persist(),
>>>>> personally I
>>>>>>>>>> find cache() to be more accurately describing the behavior,
>> i.e.
>>>> the
>>>>>>>>> Table
>>>>>>>>>> is cached for the session, but will be deleted after the
>> session
>>> is
>>>>>>>>> closed.
>>>>>>>>>> persist() seems a little misleading as people might think the
>>> table
>>>>>>> will
>>>>>>>>>> still be there even after the session is gone.
>>>>>>>>>> 
>>>>>>>>>> Great point about mixing the batch and stream processing in the
>>>> same
>>>>>>> job.
>>>>>>>>>> We should absolutely move towards that goal. I imagine that
>> would
>>>> be
>>>>> a
>>>>>>>>> huge
>>>>>>>>>> change across the board, including sources, operators and
>>>>>>> optimizations,
>>>>>>>>> to
>>>>>>>>>> name some. Likely we will need several separate in-depth
>>>> discussions.
>>>>>>>>>> 
>>>>>>>>>> Thanks,
>>>>>>>>>> 
>>>>>>>>>> Jiangjie (Becket) Qin
>>>>>>>>>> 
>>>>>>>>>> On Fri, Nov 23, 2018 at 5:14 AM Xingcan Cui <
>> xingcanc@gmail.com>
>>>>>>> wrote:
>>>>>>>>>> 
>>>>>>>>>>> Hi all,
>>>>>>>>>>> 
>>>>>>>>>>> @Shaoxuan, I think the lifecycle or access domain are both
>>>>> orthogonal
>>>>>>>>> to
>>>>>>>>>>> the cache problem. Essentially, this may be the first time we
>>> plan
>>>>> to
>>>>>>>>>>> introduce another storage mechanism other than the state.
>> Maybe
>>>> it’s
>>>>>>>>>> better
>>>>>>>>>>> to first draw a big picture and then concentrate on a specific
>>>> part?
>>>>>>>>>>> 
>>>>>>>>>>> @Becket, yes, actually I am more concerned with the underlying
>>>>>>> service.
>>>>>>>>>>> This seems to be quite a major change to the existing
>> codebase.
>>> As
>>>>> you
>>>>>>>>>>> claimed, the service should be extendible to support other
>>>>> components
>>>>>>>>> and
>>>>>>>>>>> we’d better discussed it in another thread.
>>>>>>>>>>> 
>>>>>>>>>>> All in all, I also eager to enjoy the more interactive Table
>>> API,
>>>> in
>>>>>>>>> case
>>>>>>>>>>> of a general and flexible enough service mechanism.
>>>>>>>>>>> 
>>>>>>>>>>> Best,
>>>>>>>>>>> Xingcan
>>>>>>>>>>> 
>>>>>>>>>>>> On Nov 22, 2018, at 10:16 AM, Xiaowei Jiang <
>>> xiaoweij@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>> Relying on a callback for the temp table for clean up is not
>>> very
>>>>>>>>>>> reliable.
>>>>>>>>>>>> There is no guarantee that it will be executed successfully.
>> We
>>>> may
>>>>>>>>>> risk
>>>>>>>>>>>> leaks when that happens. I think that it's safer to have an
>>>>>>>>> association
>>>>>>>>>>>> between temp table and session id. So we can always clean up
>>> temp
>>>>>>>>>> tables
>>>>>>>>>>>> which are no longer associated with any active sessions.
>>>>>>>>>>>> 
>>>>>>>>>>>> Regards,
>>>>>>>>>>>> Xiaowei
>>>>>>>>>>>> 
>>>>>>>>>>>> On Thu, Nov 22, 2018 at 12:55 PM jincheng sun <
>>>>>>>>>> sunjincheng121@gmail.com>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>>> Hi Jiangjie&Shaoxuan,
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Thanks for initiating this great proposal!
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Interactive Programming is very useful and user friendly in
>>> case
>>>>> of
>>>>>>>>>> your
>>>>>>>>>>>>> examples.
>>>>>>>>>>>>> Moreover, especially when a business has to be executed in
>>>> several
>>>>>>>>>>> stages
>>>>>>>>>>>>> with dependencies,such as the pipeline of Flink ML, in order
>>> to
>>>>>>>>>> utilize
>>>>>>>>>>> the
>>>>>>>>>>>>> intermediate calculation results we have to submit a job by
>>>>>>>>>>> env.execute().
>>>>>>>>>>>>> 
>>>>>>>>>>>>> About the `cache()`  , I think is better to named
>> `persist()`,
>>>> And
>>>>>>>>> The
>>>>>>>>>>>>> Flink framework determines whether we internally cache in
>>> memory
>>>>> or
>>>>>>>>>>> persist
>>>>>>>>>>>>> to the storage system,Maybe save the data into state backend
>>>>>>>>>>>>> (MemoryStateBackend or RocksDBStateBackend etc.)
>>>>>>>>>>>>> 
>>>>>>>>>>>>> BTW, from the points of my view in the future, support for
>>>>> streaming
>>>>>>>>>> and
>>>>>>>>>>>>> batch mode switching in the same job will also benefit in
>>>>>>>>> "Interactive
>>>>>>>>>>>>> Programming",  I am looking forward to your JIRAs and FLIP!
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Best,
>>>>>>>>>>>>> Jincheng
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Becket Qin <be...@gmail.com> 于2018年11月20日周二 下午9:56写道:
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> As a few recent email threads have pointed out, it is a
>>>> promising
>>>>>>>>>>>>>> opportunity to enhance Flink Table API in various aspects,
>>>>>>>>> including
>>>>>>>>>>>>>> functionality and ease of use among others. One of the
>>>> scenarios
>>>>>>>>>> where
>>>>>>>>>>> we
>>>>>>>>>>>>>> feel Flink could improve is interactive programming. To
>>> explain
>>>>> the
>>>>>>>>>>>>> issues
>>>>>>>>>>>>>> and facilitate the discussion on the solution, we put
>>> together
>>>>> the
>>>>>>>>>>>>>> following document with our proposal.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>> 
>>>>> 
>>>> 
>>> 
>> https://docs.google.com/document/d/1d4T2zTyfe7hdncEUAxrlNOYr4e5IMNEZLyqSuuswkA0/edit?usp=sharing
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Feedback and comments are very welcome!
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Jiangjie (Becket) Qin
>>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>> 
>> 


Re: [DISCUSS] Support Interactive Programming in Flink Table API

Posted by Fabian Hueske <fh...@gmail.com>.
I'm not suggesting that we add support for Ignite; that was just an example.
Plasma and Arrow sound interesting, too.
For the sake of this proposal, it would be up to the user to implement a
TableFactory and corresponding TableSource / TableSink classes to persist
and read the data.
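
To sketch that plugin point (only an outline; the "session-cache"
connector type and the property name are made up for the example, and the
matching TableSource / TableSink classes are omitted):

import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.apache.flink.table.factories.TableFactory;

// Sketch of a user-provided factory that the proposed cache() mechanism
// could discover in order to persist and re-read temporary tables.
public class SessionCacheTableFactory implements TableFactory {

    @Override
    public Map<String, String> requiredContext() {
        Map<String, String> context = new HashMap<>();
        context.put("connector.type", "session-cache"); // made-up type
        return context;
    }

    @Override
    public List<String> supportedProperties() {
        return Collections.singletonList("cache.table-name"); // made-up key
    }
}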

On Mon, Nov 26, 2018 at 12:06 PM Flavio Pompermaier <
pompermaier@okkam.it> wrote:

> What about to add also Apache Plasma + Arrow as an alternative to Apache
> Ignite?
> [1]
> https://arrow.apache.org/blog/2017/08/08/plasma-in-memory-object-store/
>
> On Mon, Nov 26, 2018 at 11:56 AM Fabian Hueske <fh...@gmail.com> wrote:
>
> > Hi,
> >
> > Thanks for the proposal!
> >
> > To summarize, you propose a new method Table.cache(): Table that will
> > trigger a job and write the result into some temporary storage as defined
> > by a TableFactory.
> > The cache() call blocks while the job is running and eventually returns a
> > Table object that represents a scan of the temporary table.
> > When the "session" is closed (closing to be defined?), the temporary
> tables
> > are all dropped.
> >
> > I think this behavior makes sense and is a good first step towards more
> > interactive workloads.
> > However, its performance suffers from writing to and reading from
> external
> > systems.
> > I think this is OK for now. Changes that would significantly improve the
> > situation (i.e., pinning data in-memory across jobs) would have large
> > impacts on many components of Flink.
> > Users could use in-memory filesystems or storage grids (Apache Ignite) to
> > mitigate some of the performance effects.
> >
> > Best, Fabian
> >
> >
> >
> > Am Mo., 26. Nov. 2018 um 03:38 Uhr schrieb Becket Qin <
> > becket.qin@gmail.com
> > >:
> >
> > > Thanks for the explanation, Piotrek.
> > >
> > > Is there any extra thing user can do on a MaterializedTable that they
> > > cannot do on a Table? After users call *table.cache(), *users can just
> > use
> > > that table and do anything that is supported on a Table, including SQL.
> > >
> > > Naming wise, either cache() or materialize() sounds fine to me. cache()
> > is
> > > a bit more general than materialize(). Given that we are enhancing the
> > > Table API to also support non-relational processing cases, cache()
> might
> > be
> > > slightly better.
> > >
> > > Thanks,
> > >
> > > Jiangjie (Becket) Qin
> > >
> > >
> > >
> > > On Fri, Nov 23, 2018 at 11:25 PM Piotr Nowojski <
> piotr@data-artisans.com
> > >
> > > wrote:
> > >
> > > > Hi Becket,
> > > >
> > > > Ops, sorry I didn’t notice that you intend to reuse existing
> > > > `TableFactory`. I don’t know why, but I assumed that you want to
> > provide
> > > an
> > > > alternate way of writing the data.
> > > >
> > > > Now that I hopefully understand the proposal, maybe we could rename
> > > > `cache()` to
> > > >
> > > > void materialize()
> > > >
> > > > or going step further
> > > >
> > > > MaterializedTable materialize()
> > > > MaterializedTable createMaterializedView()
> > > >
> > > > ?
> > > >
> > > > The second option with returning a handle I think is more flexible
> and
> > > > could provide features such as “refresh”/“delete” or generally
> speaking
> > > > manage the the view. In the future we could also think about adding
> > hooks
> > > > to automatically refresh view etc. It is also more explicit -
> > > > materialization returning a new table handle will not have the same
> > > > implicit side effects as adding a simple line of code like
> `b.cache()`
> > > > would have.
> > > >
> > > > It would also be more SQL like, making it more intuitive for users
> > > already
> > > > familiar with the SQL.
> > > >
> > > > Piotrek
> > > >
> > > > > On 23 Nov 2018, at 14:53, Becket Qin <be...@gmail.com> wrote:
> > > > >
> > > > > Hi Piotrek,
> > > > >
> > > > > For the cache() method itself, yes, it is equivalent to creating a
> > > > BUILT-IN
> > > > > materialized view with a lifecycle. That functionality is missing
> > > today,
> > > > > though. Not sure if I understand your question. Do you mean we
> > already
> > > > have
> > > > > the functionality and just need a syntax sugar?
> > > > >
> > > > > What's more interesting in the proposal is do we want to stop at
> > > creating
> > > > > the materialized view? Or do we want to extend that in the future
> to
> > a
> > > > more
> > > > > useful unified data store distributed with Flink? And do we want to
> > > have
> > > > a
> > > > > mechanism allow more flexible user job pattern with their own user
> > > > defined
> > > > > services. These considerations are much more architectural.
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Jiangjie (Becket) Qin
> > > > >
> > > > > On Fri, Nov 23, 2018 at 6:01 PM Piotr Nowojski <
> > > piotr@data-artisans.com>
> > > > > wrote:
> > > > >
> > > > >> Hi,
> > > > >>
> > > > >> Interesting idea. I’m trying to understand the problem. Isn’t the
> > > > >> `cache()` call an equivalent of writing data to a sink and later
> > > reading
> > > > >> from it? Where this sink has a limited live scope/live time? And
> the
> > > > sink
> > > > >> could be implemented as in memory or a file sink?
> > > > >>
> > > > >> If so, what’s the problem with creating a materialised view from a
> > > table
> > > > >> “b” (from your document’s example) and reusing this materialised
> > view
> > > > >> later? Maybe we are lacking mechanisms to clean up materialised
> > views
> > > > (for
> > > > >> example when current session finishes)? Maybe we need some
> syntactic
> > > > sugar
> > > > >> on top of it?
> > > > >>
> > > > >> Piotrek
> > > > >>
> > > > >>> On 23 Nov 2018, at 07:21, Becket Qin <be...@gmail.com>
> wrote:
> > > > >>>
> > > > >>> Thanks for the suggestion, Jincheng.
> > > > >>>
> > > > >>> Yes, I think it makes sense to have a persist() with
> > > lifecycle/defined
> > > > >>> scope. I just added a section in the future work for this.
> > > > >>>
> > > > >>> Thanks,
> > > > >>>
> > > > >>> Jiangjie (Becket) Qin
> > > > >>>
> > > > >>> On Fri, Nov 23, 2018 at 1:55 PM jincheng sun <
> > > sunjincheng121@gmail.com
> > > > >
> > > > >>> wrote:
> > > > >>>
> > > > >>>> Hi Jiangjie,
> > > > >>>>
> > > > >>>> Thank you for the explanation about the name of `cache()`, I
> > > > understand
> > > > >> why
> > > > >>>> you designed this way!
> > > > >>>>
> > > > >>>> Another idea is whether we can specify a lifecycle for data
> > > > persistence?
> > > > >>>> For example, persist (LifeCycle.SESSION), so that the user is
> not
> > > > >> worried
> > > > >>>> about data loss, and will clearly specify the time range for
> > keeping
> > > > >> time.
> > > > >>>> At the same time, if we want to expand, we can also share in a
> > > certain
> > > > >>>> group of session, for example: LifeCycle.SESSION_GROUP(...), I
> am
> > > not
> > > > >> sure,
> > > > >>>> just an immature suggestion, for reference only!
> > > > >>>>
> > > > >>>> Bests,
> > > > >>>> Jincheng
> > > > >>>>
> > > > >>>> Becket Qin <be...@gmail.com> 于2018年11月23日周五 下午1:33写道:
> > > > >>>>
> > > > >>>>> Re: Jincheng,
> > > > >>>>>
> > > > >>>>> Thanks for the feedback. Regarding cache() v.s. persist(),
> > > > personally I
> > > > >>>>> find cache() to be more accurately describing the behavior,
> i.e.
> > > the
> > > > >>>> Table
> > > > >>>>> is cached for the session, but will be deleted after the
> session
> > is
> > > > >>>> closed.
> > > > >>>>> persist() seems a little misleading as people might think the
> > table
> > > > >> will
> > > > >>>>> still be there even after the session is gone.
> > > > >>>>>
> > > > >>>>> Great point about mixing the batch and stream processing in the
> > > same
> > > > >> job.
> > > > >>>>> We should absolutely move towards that goal. I imagine that
> would
> > > be
> > > > a
> > > > >>>> huge
> > > > >>>>> change across the board, including sources, operators and
> > > > >> optimizations,
> > > > >>>> to
> > > > >>>>> name some. Likely we will need several separate in-depth
> > > discussions.
> > > > >>>>>
> > > > >>>>> Thanks,
> > > > >>>>>
> > > > >>>>> Jiangjie (Becket) Qin
> > > > >>>>>
> > > > >>>>> On Fri, Nov 23, 2018 at 5:14 AM Xingcan Cui <
> xingcanc@gmail.com>
> > > > >> wrote:
> > > > >>>>>
> > > > >>>>>> Hi all,
> > > > >>>>>>
> > > > >>>>>> @Shaoxuan, I think the lifecycle or access domain are both
> > > > orthogonal
> > > > >>>> to
> > > > >>>>>> the cache problem. Essentially, this may be the first time we
> > plan
> > > > to
> > > > >>>>>> introduce another storage mechanism other than the state.
> Maybe
> > > it’s
> > > > >>>>> better
> > > > >>>>>> to first draw a big picture and then concentrate on a specific
> > > part?
> > > > >>>>>>
> > > > >>>>>> @Becket, yes, actually I am more concerned with the underlying
> > > > >> service.
> > > > >>>>>> This seems to be quite a major change to the existing
> codebase.
> > As
> > > > you
> > > > >>>>>> claimed, the service should be extendible to support other
> > > > components
> > > > >>>> and
> > > > >>>>>> we’d better discussed it in another thread.
> > > > >>>>>>
> > > > >>>>>> All in all, I also eager to enjoy the more interactive Table
> > API,
> > > in
> > > > >>>> case
> > > > >>>>>> of a general and flexible enough service mechanism.
> > > > >>>>>>
> > > > >>>>>> Best,
> > > > >>>>>> Xingcan
> > > > >>>>>>
> > > > >>>>>>> On Nov 22, 2018, at 10:16 AM, Xiaowei Jiang <
> > xiaoweij@gmail.com>
> > > > >>>>> wrote:
> > > > >>>>>>>
> > > > >>>>>>> Relying on a callback for the temp table for clean up is not
> > very
> > > > >>>>>> reliable.
> > > > >>>>>>> There is no guarantee that it will be executed successfully.
> We
> > > may
> > > > >>>>> risk
> > > > >>>>>>> leaks when that happens. I think that it's safer to have an
> > > > >>>> association
> > > > >>>>>>> between temp table and session id. So we can always clean up
> > temp
> > > > >>>>> tables
> > > > >>>>>>> which are no longer associated with any active sessions.
> > > > >>>>>>>
> > > > >>>>>>> Regards,
> > > > >>>>>>> Xiaowei
> > > > >>>>>>>
> > > > >>>>>>> On Thu, Nov 22, 2018 at 12:55 PM jincheng sun <
> > > > >>>>> sunjincheng121@gmail.com>
> > > > >>>>>>> wrote:
> > > > >>>>>>>
> > > > >>>>>>>> Hi Jiangjie&Shaoxuan,
> > > > >>>>>>>>
> > > > >>>>>>>> Thanks for initiating this great proposal!
> > > > >>>>>>>>
> > > > >>>>>>>> Interactive Programming is very useful and user friendly in
> > case
> > > > of
> > > > >>>>> your
> > > > >>>>>>>> examples.
> > > > >>>>>>>> Moreover, especially when a business has to be executed in
> > > several
> > > > >>>>>> stages
> > > > >>>>>>>> with dependencies,such as the pipeline of Flink ML, in order
> > to
> > > > >>>>> utilize
> > > > >>>>>> the
> > > > >>>>>>>> intermediate calculation results we have to submit a job by
> > > > >>>>>> env.execute().
> > > > >>>>>>>>
> > > > >>>>>>>> About the `cache()`  , I think is better to named
> `persist()`,
> > > And
> > > > >>>> The
> > > > >>>>>>>> Flink framework determines whether we internally cache in
> > memory
> > > > or
> > > > >>>>>> persist
> > > > >>>>>>>> to the storage system,Maybe save the data into state backend
> > > > >>>>>>>> (MemoryStateBackend or RocksDBStateBackend etc.)
> > > > >>>>>>>>
> > > > >>>>>>>> BTW, from the points of my view in the future, support for
> > > > streaming
> > > > >>>>> and
> > > > >>>>>>>> batch mode switching in the same job will also benefit in
> > > > >>>> "Interactive
> > > > >>>>>>>> Programming",  I am looking forward to your JIRAs and FLIP!
> > > > >>>>>>>>
> > > > >>>>>>>> Best,
> > > > >>>>>>>> Jincheng
> > > > >>>>>>>>
> > > > >>>>>>>>
> > > > >>>>>>>> Becket Qin <be...@gmail.com> 于2018年11月20日周二 下午9:56写道:
> > > > >>>>>>>>
> > > > >>>>>>>>> Hi all,
> > > > >>>>>>>>>
> > > > >>>>>>>>> As a few recent email threads have pointed out, it is a
> > > promising
> > > > >>>>>>>>> opportunity to enhance Flink Table API in various aspects,
> > > > >>>> including
> > > > >>>>>>>>> functionality and ease of use among others. One of the
> > > scenarios
> > > > >>>>> where
> > > > >>>>>> we
> > > > >>>>>>>>> feel Flink could improve is interactive programming. To
> > explain
> > > > the
> > > > >>>>>>>> issues
> > > > >>>>>>>>> and facilitate the discussion on the solution, we put
> > together
> > > > the
> > > > >>>>>>>>> following document with our proposal.
> > > > >>>>>>>>>
> > > > >>>>>>>>>
> > > > >>>>>>>>>
> > > > >>>>>>>>
> > > > >>>>>>
> > > > >>>>>
> > > > >>>>
> > > > >>
> > > >
> > >
> >
> https://docs.google.com/document/d/1d4T2zTyfe7hdncEUAxrlNOYr4e5IMNEZLyqSuuswkA0/edit?usp=sharing
> > > > >>>>>>>>>
> > > > >>>>>>>>> Feedback and comments are very welcome!
> > > > >>>>>>>>>
> > > > >>>>>>>>> Thanks,
> > > > >>>>>>>>>
> > > > >>>>>>>>> Jiangjie (Becket) Qin
> > > > >>>>>>>>>
> > > > >>>>>>>>
> > > > >>>>>>
> > > > >>>>>>
> > > > >>>>>
> > > > >>>>
> > > > >>
> > > > >>
> > > >
> > > >
> > >
> >
>

Re: [DISCUSS] Support Interactive Programming in Flink Table API

Posted by Flavio Pompermaier <po...@okkam.it>.
What about also adding Apache Plasma + Arrow as an alternative to Apache
Ignite?
[1] https://arrow.apache.org/blog/2017/08/08/plasma-in-memory-object-store/

On Mon, Nov 26, 2018 at 11:56 AM Fabian Hueske <fh...@gmail.com> wrote:

> Hi,
>
> Thanks for the proposal!
>
> To summarize, you propose a new method Table.cache(): Table that will
> trigger a job and write the result into some temporary storage as defined
> by a TableFactory.
> The cache() call blocks while the job is running and eventually returns a
> Table object that represents a scan of the temporary table.
> When the "session" is closed (closing to be defined?), the temporary tables
> are all dropped.
>
> I think this behavior makes sense and is a good first step towards more
> interactive workloads.
> However, its performance suffers from writing to and reading from external
> systems.
> I think this is OK for now. Changes that would significantly improve the
> situation (i.e., pinning data in-memory across jobs) would have large
> impacts on many components of Flink.
> Users could use in-memory filesystems or storage grids (Apache Ignite) to
> mitigate some of the performance effects.
>
> Best, Fabian
>
>
>
> Am Mo., 26. Nov. 2018 um 03:38 Uhr schrieb Becket Qin <
> becket.qin@gmail.com
> >:
>
> > Thanks for the explanation, Piotrek.
> >
> > Is there any extra thing user can do on a MaterializedTable that they
> > cannot do on a Table? After users call *table.cache(), *users can just
> use
> > that table and do anything that is supported on a Table, including SQL.
> >
> > Naming wise, either cache() or materialize() sounds fine to me. cache()
> is
> > a bit more general than materialize(). Given that we are enhancing the
> > Table API to also support non-relational processing cases, cache() might
> be
> > slightly better.
> >
> > Thanks,
> >
> > Jiangjie (Becket) Qin
> >
> >
> >
> > On Fri, Nov 23, 2018 at 11:25 PM Piotr Nowojski <piotr@data-artisans.com
> >
> > wrote:
> >
> > > Hi Becket,
> > >
> > > Ops, sorry I didn’t notice that you intend to reuse existing
> > > `TableFactory`. I don’t know why, but I assumed that you want to
> provide
> > an
> > > alternate way of writing the data.
> > >
> > > Now that I hopefully understand the proposal, maybe we could rename
> > > `cache()` to
> > >
> > > void materialize()
> > >
> > > or going step further
> > >
> > > MaterializedTable materialize()
> > > MaterializedTable createMaterializedView()
> > >
> > > ?
> > >
> > > The second option with returning a handle I think is more flexible and
> > > could provide features such as “refresh”/“delete” or generally speaking
> > > manage the the view. In the future we could also think about adding
> hooks
> > > to automatically refresh view etc. It is also more explicit -
> > > materialization returning a new table handle will not have the same
> > > implicit side effects as adding a simple line of code like `b.cache()`
> > > would have.
> > >
> > > It would also be more SQL like, making it more intuitive for users
> > already
> > > familiar with the SQL.
> > >
> > > Piotrek
> > >
> > > > On 23 Nov 2018, at 14:53, Becket Qin <be...@gmail.com> wrote:
> > > >
> > > > Hi Piotrek,
> > > >
> > > > For the cache() method itself, yes, it is equivalent to creating a
> > > BUILT-IN
> > > > materialized view with a lifecycle. That functionality is missing
> > today,
> > > > though. Not sure if I understand your question. Do you mean we
> already
> > > have
> > > > the functionality and just need a syntax sugar?
> > > >
> > > > What's more interesting in the proposal is do we want to stop at
> > creating
> > > > the materialized view? Or do we want to extend that in the future to
> a
> > > more
> > > > useful unified data store distributed with Flink? And do we want to
> > have
> > > a
> > > > mechanism allow more flexible user job pattern with their own user
> > > defined
> > > > services. These considerations are much more architectural.
> > > >
> > > > Thanks,

Re: [DISCUSS] Support Interactive Programming in Flink Table API

Posted by Becket Qin <be...@gmail.com>.
Thanks for the feedback, Fabian.

As you mentioned, the cache() method itself does not imply any
implementation detail. In fact, we plan to implement a default table
service that is locality aware, so the default table service will
hopefully be satisfactory in most cases. We could also explore more
memory-based storage as you suggested.

Just two clarifications:

1. Table.cache() itself will not trigger a job execution. It just marks
that this table needs to be cached when a job containing that table
executes. The cache() method does not block and does not return anything.
Instead, it simply sets a flag on the table. When a job involving the
cached table runs for the first time, the TableEnvironment will add an
additional sink to the cached table. Users can keep using the table
variable that they called cache() on, and that table will be recognized
by the TableEnvironment. If the table has already been successfully
cached (the first job involving that table has finished), it is replaced
with a source reading from the table service to avoid redundant
computation (see the sketch after the second clarification below).

2. Currently we are thinking of defining the session as a YARN
application, so we can embed the cleanup logic in the YARN
ApplicationMaster. Ideally we would use an application shutdown hook
provided by YARN, so that cleanup is guaranteed to run when the
application exits. Unfortunately, we did not find such shutdown hook
support.
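
To make the first clarification concrete, here is a minimal usage
sketch. The source name, the schema, and the count() action used to
submit jobs are illustrative assumptions, not part of the proposal
itself:

Table t = tEnv.scan("orders").groupBy("user").select("user, amount.sum as total");
t.cache();  // non-blocking: only sets the cache flag on t, returns nothing

t.count();  // 1st job: an extra sink on t materializes it in the table service
t.count();  // 2nd job: t's sub-plan is replaced by a scan of the cached table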

Cheers,

Jiangjie (Becket) Qin


Re: [DISCUSS] Support Interactive Programming in Flink Table API

Posted by Fabian Hueske <fh...@gmail.com>.
Hi,

Thanks for the proposal!

To summarize, you propose a new method Table.cache(): Table that will
trigger a job and write the result into some temporary storage as defined
by a TableFactory.
The cache() call blocks while the job is running and eventually returns a
Table object that represents a scan of the temporary table.
When the "session" is closed (closing to be defined?), the temporary tables
are all dropped.
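
A rough usage sketch under this (blocking) reading, with illustrative
names:

Table cached = expensive.cache();             // runs a job now, writes a temp table
Table result = cached.select("user, total");  // scans the temp table, no recompute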

I think this behavior makes sense and is a good first step towards more
interactive workloads.
However, its performance suffers from writing to and reading from external
systems.
I think this is OK for now. Changes that would significantly improve the
situation (e.g., pinning data in memory across jobs) would have large
impacts on many components of Flink.
Users could use in-memory filesystems or storage grids (e.g., Apache
Ignite) to mitigate some of the performance effects.

Best, Fabian

Re: [DISCUSS] Support Interactive Programming in Flink Table API

Posted by Becket Qin <be...@gmail.com>.
Thanks for the explanation, Piotrek.

Is there anything extra a user can do on a MaterializedTable that they
cannot do on a Table? After users call table.cache(), they can just use
that table and do anything that is supported on a Table, including SQL.
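
For example (a sketch; the registered name and the query are
illustrative):

Table t = tEnv.scan("orders").filter("amount > 100");
t.cache();                               // same Table handle, now flagged for caching
tEnv.registerTable("cached_orders", t);  // usable like any other Table
Table agg = tEnv.sqlQuery(
    "SELECT user, SUM(amount) FROM cached_orders GROUP BY user");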

Naming wise, either cache() or materialize() sounds fine to me. cache() is
a bit more general than materialize(). Given that we are enhancing the
Table API to also support non-relational processing cases, cache() might be
slightly better.

Thanks,

Jiangjie (Becket) Qin

Re: [DISCUSS] Support Interactive Programming in Flink Table API

Posted by Piotr Nowojski <pi...@data-artisans.com>.
Hi Becket,

Oops, sorry, I didn’t notice that you intend to reuse the existing `TableFactory`. I don’t know why, but I assumed that you wanted to provide an alternate way of writing the data.

Now that I hopefully understand the proposal, maybe we could rename `cache()` to 

void materialize()

or going step further

MaterializedTable materialize()
MaterializedTable createMaterializedView()

? 

I think the second option, returning a handle, is more flexible: it could provide features such as “refresh”/“delete” and, generally speaking, let the user manage the view. In the future we could also think about adding hooks to automatically refresh the view, etc. It is also more explicit: materialization returning a new table handle will not have the same implicit side effects as adding a simple line of code like `b.cache()` would have.

It would also be more SQL-like, making it more intuitive for users already familiar with SQL.
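
To illustrate, a sketch of the handle-based option; every name below is a hypothetical illustration of the API's shape, not a concrete proposal:

// Hypothetical handle returned by createMaterializedView().
public interface MaterializedTable extends Table {
    void refresh();  // recompute the view from its defining query
    void drop();     // eagerly delete the materialized data
}

MaterializedTable view = b.createMaterializedView();  // explicit, no hidden side effects
Table joined = a.join(view, "id = vid");              // usable like any other Table
view.refresh();                                       // later: refresh the view's data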

Piotrek


Re: [DISCUSS] Support Interactive Programming in Flink Table API

Posted by Becket Qin <be...@gmail.com>.
Hi Piotrek,

For the cache() method itself, yes, it is equivalent to creating a BUILT-IN
materialized view with a lifecycle. That functionality is missing today,
though. I am not sure I understand your question. Do you mean we already
have the functionality and just need some syntactic sugar?

What's more interesting in the proposal is whether we want to stop at
creating the materialized view, or extend that in the future to a more
useful unified data store distributed with Flink. And do we want to have a
mechanism that allows more flexible user job patterns with their own
user-defined services? These considerations are much more architectural.

Thanks,

Jiangjie (Becket) Qin


Re: [DISCUSS] Support Interactive Programming in Flink Table API

Posted by Piotr Nowojski <pi...@data-artisans.com>.
Hi,

Interesting idea. I’m trying to understand the problem. Isn’t the `cache()` call equivalent to writing data to a sink and later reading from it, where this sink has a limited scope/lifetime? And the sink could be implemented as an in-memory or a file sink?
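
In other words, something like the following manual pattern, which `cache()` would then automate. The sketch uses the old CSV connector with an illustrative path and schema, purely as an example:

// Job 1: materialize table b by writing it to a sink.
b.writeToSink(new CsvTableSink("/tmp/b_cache", "|"));
env.execute("materialize b");

// Later jobs: read the materialized data back as a new table.
CsvTableSource bSource = CsvTableSource.builder()
    .path("/tmp/b_cache")
    .field("id", Types.LONG)
    .field("name", Types.STRING)
    .build();
tableEnv.registerTableSource("b_cached", bSource);
Table b2 = tableEnv.scan("b_cached");  // reuse instead of recomputing b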

If so, what’s the problem with creating a materialised view from a table “b” (from your document’s example) and reusing this materialised view later? Maybe we are lacking mechanisms to clean up materialised views (for example when the current session finishes)? Maybe we need some syntactic sugar on top of it?

Piotrek


Re: [DISCUSS] Support Interactive Programming in Flink Table API

Posted by Becket Qin <be...@gmail.com>.
Thanks for the suggestion, Jincheng.

Yes, I think it makes sense to have a persist() with a lifecycle/defined
scope. I just added a section on this to the future work.

Thanks,

Jiangjie (Becket) Qin

Re: [DISCUSS] Support Interactive Programming in Flink Table API

Posted by jincheng sun <su...@gmail.com>.
Hi Jiangjie,

Thank you for the explanation about the name of `cache()`; I understand why
you designed it this way!

Another idea: could we specify a lifecycle for data persistence? For
example, persist(LifeCycle.SESSION), so that the user does not have to
worry about data loss and the retention period is stated explicitly. At the
same time, if we want to extend this later, we could also share data within
a certain group of sessions, for example LifeCycle.SESSION_GROUP(...). I am
not sure, it is just an immature suggestion, for reference only!
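
To make the idea concrete, here is a rough sketch of what I mean
(LifeCycle and the persist() signature are only placeholders for
illustration, not a concrete API proposal):

Table t = tEnv.scan("Orders").select("user, amount");

// Kept until this session exits, then cleaned up automatically.
t.persist(LifeCycle.SESSION);

// Possible extension: share the data within a group of sessions.
t.persist(LifeCycle.SESSION_GROUP("my_group"));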

Bests,
Jincheng

Re: [DISCUSS] Support Interactive Programming in Flink Table API

Posted by Becket Qin <be...@gmail.com>.
Hi Xiaowei,

Thanks for the comment. That is a valid point.

The callback is not only associated with a particular temp table; it is
cleanup logic provided by the user. The temp-table-to-session-ID mapping is
tracked internally. We also need to associate the callback with the session
lifecycle and make sure it is invoked when the session exits, whether
normally or abnormally. We haven't decided exactly how that should be done
yet. Several options being explored are:
1. Invoke the callback in the Yarn application session shutdown hook, if
there is one. (Probably the best option when available.)
2. Put the logic into the Yarn AM.
3. Launch a WatchDog service and let it heartbeat with the client. If the
client indicates the session is closed, or the client goes away
accidentally, the cleanup service kicks in (see the sketch below).

In any case, the callback is unlikely to be invoked on the client side.
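
To illustrate option 3, here is a much-simplified sketch of such a
watchdog (all class and method names are made up; this is not part of the
proposal):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Consumer;

// A session is considered dead once its client has not answered a
// heartbeat within the timeout; the user-provided cleanup logic is then
// invoked for that session.
class SessionWatchDog implements Runnable {
    private static final long HEARTBEAT_TIMEOUT_MS = 60_000L;
    private final Map<String, Long> lastSeen = new ConcurrentHashMap<>();
    private final Consumer<String> cleanupCallback;

    SessionWatchDog(Consumer<String> cleanupCallback) {
        this.cleanupCallback = cleanupCallback;
    }

    // Record that the client of this session answered a heartbeat.
    void clientResponded(String sessionId) {
        lastSeen.put(sessionId, System.currentTimeMillis());
    }

    @Override
    public void run() {
        long now = System.currentTimeMillis();
        lastSeen.entrySet().removeIf(e -> {
            if (now - e.getValue() > HEARTBEAT_TIMEOUT_MS) {
                cleanupCallback.accept(e.getKey()); // session went away
                return true;
            }
            return false;
        });
    }
}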

Thanks,

Jiangjie (Becket) Qin

Re: [DISCUSS] Support Interactive Programming in Flink Table API

Posted by Becket Qin <be...@gmail.com>.
Re: Jincheng,

Thanks for the feedback. Regarding cache() vs. persist(), personally I
find cache() to describe the behavior more accurately, i.e. the Table is
cached for the session but will be deleted after the session is closed.
persist() seems a little misleading, as people might think the table will
still be there even after the session is gone.
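
In other words, the intended lifetime is roughly the following (the
session shutdown call below is hypothetical, just to show the scope):

Table t = ...
t.cache();       // the Table is cached, but only for this session
// ... reuse t across multiple jobs within the session ...
session.close(); // hypothetical: tables cached in the session are deleted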

Great point about mixing batch and stream processing in the same job. We
should absolutely move towards that goal. I imagine that would be a huge
change across the board, including sources, operators and optimizations, to
name a few. We will likely need several separate in-depth discussions.

Thanks,

Jiangjie (Becket) Qin

Re: [DISCUSS] Support Interactive Programming in Flink Table API

Posted by Xingcan Cui <xi...@gmail.com>.
Hi all,

@Shaoxuan, I think the lifecycle and the access domain are both orthogonal to the cache problem. Essentially, this may be the first time we plan to introduce a storage mechanism other than the state. Maybe it’s better to first draw a big picture and then concentrate on a specific part?

@Becket, yes, actually I am more concerned with the underlying service. This seems to be quite a major change to the existing codebase. As you claimed, the service should be extensible to support other components, and we’d better discuss it in another thread.

All in all, I am also eager to enjoy a more interactive Table API, provided the service mechanism is general and flexible enough.

Best,
Xingcan

Re: [DISCUSS] Support Interactive Programming in Flink Table API

Posted by Xiaowei Jiang <xi...@gmail.com>.
Relying on a callback for temp table cleanup is not very reliable. There is
no guarantee that it will be executed successfully, and we risk leaks when
that happens. I think it's safer to have an association between temp tables
and session IDs, so we can always clean up temp tables that are no longer
associated with any active session.
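
For illustration, such an association could be a simple registry keyed by
session id (all names here are made up, just a sketch):

import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Track which temp tables belong to which session, so tables of
// inactive sessions can always be cleaned up, callback or not.
class TempTableRegistry {
    private final Map<String, Set<String>> tablesBySession =
        new ConcurrentHashMap<>();

    void register(String sessionId, String tableId) {
        tablesBySession
            .computeIfAbsent(sessionId, k -> ConcurrentHashMap.newKeySet())
            .add(tableId);
    }

    // Periodically called with the sessions known to be alive.
    void cleanUpInactive(Set<String> activeSessions) {
        tablesBySession.entrySet().removeIf(entry -> {
            if (!activeSessions.contains(entry.getKey())) {
                entry.getValue().forEach(this::deleteBackingData);
                return true;
            }
            return false;
        });
    }

    private void deleteBackingData(String tableId) {
        // delete the storage backing the temp table
    }
}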

Regards,
Xiaowei

Re: [DISCUSS] Support Interactive Programming in Flink Table API

Posted by jincheng sun <su...@gmail.com>.
Hi Jiangjie & Shaoxuan,

Thanks for initiating this great proposal!

Interactive Programming is very useful and user friendly in the case of
your examples. Moreover, especially when a business process has to be
executed in several stages with dependencies, such as a Flink ML pipeline,
we currently have to submit a job via env.execute() in order to utilize the
intermediate calculation results.

Regarding `cache()`, I think it is better to name it `persist()`, and let
the Flink framework determine whether to cache the data in memory
internally or persist it to a storage system, e.g. saving the data into a
state backend (MemoryStateBackend, RocksDBStateBackend, etc.).
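
A tiny sketch of what I mean (illustrative only, not a concrete API):

Table t = ...
t.persist(); // the framework, not the user, decides whether the data
             // lives in memory or in a state backend / external storage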

BTW, from my point of view, supporting switching between streaming and
batch mode in the same job will also benefit "Interactive Programming" in
the future. I am looking forward to your JIRAs and FLIP!

Best,
Jincheng

Re: [DISCUSS] Support Interactive Programming in Flink Table API

Posted by Becket Qin <be...@gmail.com>.
Hi Weihua,

Thanks for the comments. These are great questions!

To answer question 1, I think it depends on what we want from the cache
service. At this point, it is not quite clear to me whether Flink needs
different caching levels. For example, in Spark, memory-level caching is
mostly used for iteration. I kind of think it is a little ugly to ask
users to explicitly call cache() and uncache() when writing iterations.
In Flink, iteration is done more efficiently without requiring the user to
explicitly manage the cache. BTW, the Table API does not have iteration
support at this point, but we have been working on this and will send a
design doc shortly.
Another consideration here is that if we allow pluggable temp table
services, those implementations may not be able to provide all levels of
caching, which would make the cache levels a bit confusing.

WRT the cleanup of the temp tables, that is a great point. As of now, the
cleanup is done in the callback when the session exits, i.e. when the
application program finishes. This assumes that the caching service can
host all the cached tables created in the entire session. I agree that an
explicit uncache() could be useful; we should probably add that.

We haven't thought through the FlinkService API yet. A rough idea is that
there will be a ServiceDescriptor/ServiceConfig as the contract between
Flink and a user-defined service. The service could be configured to either
run in a standalone process or within the TMs. That said, FlinkService
itself is probably a big topic and justifies a discussion thread of its
own. In this proposal it only affects how the default caching service is
launched; we can always adapt that to the FlinkService API once it is
nailed down.
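
Just to give a feeling for that contract, a very rough placeholder sketch
(none of these names are committed to):

// Placeholder sketch of the Flink <-> user-defined service contract.
interface FlinkService {
    void start(ServiceConfig config); // bring the service up
    void stop();                      // tear it down with the session
}

class ServiceConfig {
    java.util.Map<String, String> options; // service-specific settings
}

class ServiceDescriptor {
    String serviceClassName;   // which FlinkService implementation to use
    boolean runInTaskManagers; // run inside TMs vs. a standalone process
    ServiceConfig config;
}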

Thanks,

Jiangjie (Becket) Qin

Re: [DISCUSS] Support Interactive Programming in Flink Table API

Posted by Weihua Jiang <we...@gmail.com>.
Hi Becket,

The design is quite interesting and useful.

I have several questions about your design:
1. Shall we add a persistence-level hint to the cache() function for data
of different temperatures? E.g. IN_MEM, IN_DISK, etc., or HOTTEST, HOT,
WARM, COLD? (See the sketch below.)
2. When will the corresponding cached data be cleaned, by some kind of GC?
Shall we add an uncache() function to allow users to manually delete the
cached data?
3. Must the FlinkService be a separately running service, or will Flink run
the service in the TMs?
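
A rough sketch of what I mean in questions 1 and 2 (CacheLevel and
uncache() are only suggestions, not existing API):

enum CacheLevel { IN_MEM, IN_DISK, HOTTEST, HOT, WARM, COLD }

Table t = ...
t.cache(CacheLevel.IN_MEM); // hint: hot data, keep it in memory
// ... use the cached table ...
t.uncache();                // release manually instead of waiting for GC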

Thanks
Weihua

Re: [DISCUSS] Support Interactive Programming in Flink Table API

Posted by Shaoxuan Wang <ws...@gmail.com>.
Hi Xingcan,

I think you probably misunderstood our proposal. The proposed “cache()” API
basically implies that the data is only available to its own session, not
available forever for other sessions to access. It will be cleaned up when
the session exits. “cache” does not imply that the underlying
implementation only utilizes a cache; actually, the default implementation
we proposed is a file system. In the future, we may want to add a
“persistent” interface which allows the application to really materialize
the data in QoS/lifetime-ensured storage.

I just left a comment clarifying this in the Google doc. Feel free to leave
comments in the Google doc if you have any further questions.

Regards,
Shaoxuan

Re: [DISCUSS] Support Interactive Programming in Flink Table API

Posted by Becket Qin <be...@gmail.com>.
Hi Xingcan,

These are great points. We are on the same page regarding the potential
capabilities of the proposed changes. There are actually two main parts in
the proposal, the API and the underlying service. Both parts can be
extended in the future.

We made a few design choices when drafting the doc to restrict the scope of
this proposal, yet leave room for future extensions. For example, we did
not specify the interface of the underlying TableService. This is because
in the future we may use it not only as a caching service, but also as
unified storage with functions such as stream/batch storage with indexing,
columnar/row-oriented formatting, schema awareness, etc.

Similarly, WRT the API changes, right now we are just adding a cache()
method, and the cached table is only available within the session (it won't
be lost before the session exits). We found that this already addresses
most of our concerns. We can always add a persist(String tableId) method in
the future and let the table be accessible globally. But this may introduce
a lot of interesting questions, such as: what if table names conflict?
Should there be a session group? What should the life cycle look like for
such tables? Again, we are trying to restrict the scope and leave such
questions to future discussions.
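
To illustrate the distinction (persist(String) below is only a possible
future extension, not part of this proposal):

Table t = ...
t.cache();               // this proposal: reusable only within the session
t.persist("orders_agg"); // hypothetical future: accessible across sessions

// A different session could then, hypothetically, look it up by id:
Table t2 = tEnv.scan("orders_agg");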

Thanks,

Jiangjie (Becket) Qin

Re: [DISCUSS] Support Interactive Programming in Flink Table API

Posted by Xingcan Cui <xi...@gmail.com>.
Hi all,

Thanks for the replies.

@Becket I think whether to put the persist/cache methods in a separate util class or inside DataSet/Table depends on what we want to introduce. The former sounds more like a data storage component, where users may even somehow get a stored DataSet/Table via an ID or something, whereas the latter sounds only like a cache mechanism. I’m not quite sure what we really need, but either approach is acceptable to me.

@Shaoxuan Yes, maybe “generally” is a more accurate word here. As the Table API only works with row-type records, I just wondered whether a cache for it can be generalized to arbitrary data types. Anyway, if contributions can be made to enhance the Table API and rebuild other libs on top of it, that’s not a problem. Another point is, as I replied to @Becket, whether we introduce only a cache mechanism or a data storage component. IMO, compared to data storage, a cache could be volatile, which means it only serves to (possibly) accelerate computation and doesn’t need to absolutely guarantee the existence of the DataSets/Tables.

What do you think?

Best,
Xingcan

Re: [DISCUSS] Support Interactive Programming in Flink Table API

Posted by Ruidong Li <le...@gmail.com>.
Hi Becket,

I think the Flink Service is a good abstraction, with which we can easily
build Interactive Programming and some other features.
We might bring in the concept of a 'Session'; then we can think of Flink
Services as system processes and user jobs as user processes, so the
management of their life cycles needs to be discussed.

Kind Regards
Xpray

Re: [DISCUSS] Support Interactive Programming in Flink Table API

Posted by Becket Qin <be...@gmail.com>.
Hi Xingcan,

Thanks for the feedback.

Adding the cache to DataSet is useful. In fact, the current proposal does
not assume the "PersistService" can only be used by the Table. We can
always add DataSet.cache() and let it benefit from the underlying
persistence support. So it seems more of a wording issue; I'll clarify
that. In this proposal we are trying to focus on the Table API, as it
aligns with other ongoing efforts at this point.

Regarding FLINK-1730, it looks like the actual difference is whether we put
the cache()/persist() method in a util class or on the Table/DataSet
classes themselves. Personally, I prefer having the method on the
Table/DataSet classes. It is more straightforward and concise, so users do
not need to wonder where the persist method is (or whether it even exists).
What do you think?
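
Side by side, the two styles look roughly like this (both signatures are
illustrative):

// Style 1: a util class, roughly in the spirit of FLINK-1730.
PersistService.persist(table);

// Style 2: a method on Table/DataSet itself, which is easier to discover.
table.cache();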

Thanks,

Jiangjie (Becket) Qin

Re: [DISCUSS] Support Interactive Programming in Flink Table API

Posted by Shaoxuan Wang <ws...@gmail.com>.
Hi Xingcan,

Thanks for the comments. Yes, caching/persisting the intermediate data is
useful and can benefit many scenarios, but different scenarios may have
different ways to solve it. For instance, as I replied to
http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-Embracing-Table-API-in-Flink-ML-td25368.html,
I expect FlinkML to be implemented on top of the Table API in the near
future. We already have some ideas/prototypes for how to do iterations on
the Table API and will share them with the dev list soon.

I am not sure what you mean by “more thoroughly”. If you are referring to
“more general”, I think the underlying implementation of our proposal can
indeed be extended to other APIs. But for now we want to focus on the Table
API, as we see a lot of user interest in the Table API as opposed to
DataSet. As you may have already read, our proposal basically consists of
two parts. One is the changes to the Table API, including table.cache() and
how to hook the table/store service into the table environment. The other
is a table/store service interface with which users can plug in and
configure different table/store services according to their own
environment (a rough sketch follows below). It would not be difficult to
implement the same functionality for DataSet as what we proposed.
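
A rough sketch of that second part (the factory and the registration hook
are placeholders, not the actual interface):

// Placeholder sketch: plugging in a user-provided table service.
TableServiceFactory factory = new MyHdfsTableServiceFactory(); // user-defined
tEnv.registerTableService(factory); // hypothetical hook on the environment

Table t = ...
t.cache(); // cached data now goes through the configured service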

Regards,
Shaoxuan

Re: [DISCUSS] Support Interactive Programming in Flink Table API

Posted by Xingcan Cui <xi...@gmail.com>.
Hi Becket,

Thanks for bringing this up! For a long time, the intermediate cache problem has been a pain point of the Flink streaming model. As far as I know, it’s quite a blocker for iterative operations in batch-related libs such as Gelly and FlinkML.

Actually, there’s an old JIRA[1] aiming to solve the cache problem more “thoroughly”. Compared with your proposal, it implements the persistence at the DataSet level, which also allows internal operations based on the DataSet API to benefit.

I totally understand the importance of the Table API, but I just wonder whether we should consider this problem in a larger view, i.e., adding a `PersistentService` rather than a `TablePersistentService` (as described in the "Flink Services" section).

Thanks,
Xingcan

[1] https://issues.apache.org/jira/browse/FLINK-1730
