Posted to dev@flink.apache.org by Radu Tudoran <ra...@huawei.com> on 2017/02/07 18:27:19 UTC
RE: Stream SQL and Dynamic tables

Hi,

I added some comments to the Dynamic Tables document. I was not sure how to ask for feedback on them, hence this email.

Please let me know what you think.

https://docs.google.com/document/d/1qVVt_16kdaZQ8RTfA_f4konQPW4tnl8THw6rzGUdaqU/edit#heading=h.3eo2vkvydld6

Dr. Radu Tudoran
Senior Research Engineer - Big Data Expert
IT R&D Division


HUAWEI TECHNOLOGIES Duesseldorf GmbH
European Research Center
Riesstrasse 25, 80992 München

E-mail: radu.tudoran@huawei.com
Mobile: +49 15209084330
Telephone: +49 891588344173

HUAWEI TECHNOLOGIES Duesseldorf GmbH
Hansaallee 205, 40549 Düsseldorf, Germany, www.huawei.com
Registered Office: Düsseldorf, Register Court Düsseldorf, HRB 56063,
Managing Director: Bo PENG, Wanzhou MENG, Lifang CHEN
Sitz der Gesellschaft: Düsseldorf, Amtsgericht Düsseldorf, HRB 56063,
Geschäftsführer: Bo PENG, Wanzhou MENG, Lifang CHEN
This e-mail and its attachments contain confidential information from HUAWEI, which is intended only for the person or entity whose address is listed above. Any use of the information contained herein in any way (including, but not limited to, total or partial disclosure, reproduction, or dissemination) by persons other than the intended recipient(s) is prohibited. If you receive this e-mail in error, please notify the sender by phone or email immediately and delete it!

-----Original Message-----
From: Fabian Hueske [mailto:fhueske@gmail.com] 
Sent: Monday, January 30, 2017 9:07 PM
To: dev@flink.apache.org
Subject: Re: Stream SQL and Dynamic tables

Hi Radu,

yes, the clean-up timeout would need to be defined somewhere.
I would actually prefer to do that within the query, because the clean-up timeout affects the result and hence the semantics of the computed result.
For instance, this could look as follows:

SELECT a, sum(b)
FROM myTable
WHERE rowtime BETWEEN now() - INTERVAL '1' DAY AND now()
GROUP BY a;

In this query, now() would always refer to the current time, i.e., the current wall-clock time for processing time or the current watermark time for event time.
The result of the query would be the grouped aggregate of the data received in the last day.
We could add syntactic sugar with built-in functions, for example:
last(rowtime, INTERVAL '1' DAY).

In addition we can also add a configuration parameter to the TableEnvironment to control the clean-up timeout.
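
To make the effect of the time predicate on state concrete, here is a minimal in-memory sketch in plain Java. It is not Flink code and not how the optimizer would implement it: the class TimeBoundedSum and its methods are hypothetical names used only for illustration. It assumes rows arrive in ascending rowtime order and that now never moves backwards. Rows older than now minus the interval are evicted, so the state stays bounded and the result only reflects the last day.

```java
import java.time.Duration;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical in-memory stand-in for the query state; not a Flink API. */
class TimeBoundedSum {

    /** One row of myTable: key a, value b, rowtime in epoch milliseconds. */
    private static final class Row {
        final String a;
        final long b;
        final long rowtime;

        Row(String a, long b, long rowtime) {
            this.a = a;
            this.b = b;
            this.rowtime = rowtime;
        }
    }

    /** Running aggregate per key; count tells us when a key can be dropped. */
    private static final class Acc {
        long sum;
        int count;
    }

    private final long intervalMillis;
    private final Deque<Row> rows = new ArrayDeque<>();
    private final Map<String, Acc> state = new HashMap<>();

    TimeBoundedSum(Duration interval) {
        this.intervalMillis = interval.toMillis();
    }

    /** Appends a row. Assumption: rows arrive in ascending rowtime order. */
    void add(String a, long b, long rowtime) {
        rows.addLast(new Row(a, b, rowtime));
        Acc acc = state.computeIfAbsent(a, k -> new Acc());
        acc.sum += b;
        acc.count++;
    }

    /** Evaluates SELECT a, sum(b) ... GROUP BY a with now() equal to now. */
    Map<String, Long> result(long now) {
        long lowerBound = now - intervalMillis;
        // Evict every row that fell out of [now - interval, now].
        while (!rows.isEmpty() && rows.peekFirst().rowtime < lowerBound) {
            Row expired = rows.removeFirst();
            Acc acc = state.get(expired.a);
            acc.sum -= expired.b;
            acc.count--;
            if (acc.count == 0) {
                state.remove(expired.a); // no rows left for this key
            }
        }
        // Sorted copy of the current aggregates, one entry per key.
        Map<String, Long> out = new TreeMap<>();
        for (Map.Entry<String, Acc> e : state.entrySet()) {
            out.put(e.getKey(), e.getValue().sum);
        }
        return out;
    }
}
```

With new TimeBoundedSum(Duration.ofDays(1)), calling result(now) at a later time drops the contribution of rows that are more than one day old. A configuration parameter on the TableEnvironment would play the role of the constructor argument in this sketch.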

Cheers,
Fabian

2017-01-30 18:14 GMT+01:00 Radu Tudoran <ra...@huawei.com>:

> Hi Fabian,
>
> Thanks for the clarifications. I have a follow-up question: you say
> that operations are expected to be bounded in space and time (e.g.,
> the optimizer will do a cleanup after a certain timeout period). Can
> I assume that this implies that the system will have a couple of
> parameters that hold these thresholds and that can be configured?
>
> For example, setting it on the environment:
>
> env.setCleanupTimeout(100, TimeUnit.MINUTES);
>
> ...or alternatively perhaps directly at the level of the table (either
> table environment or the table itself)
>
> TableEnvironment tbEnv = ...
> tbEnv.setCleanupTimeout(100, TimeUnit.MINUTES);
> Table tb = ...
> tb.setCleanupTimeout(100, TimeUnit.MINUTES);
>
>
>
> -----Original Message-----
> From: Fabian Hueske [mailto:fhueske@gmail.com]
> Sent: Friday, January 27, 2017 9:41 PM
> To: dev@flink.apache.org
> Subject: Re: Stream SQL and Dynamic tables
>
> Hi Radu,
>
> the idea is to only support operations that are bounded in space and 
> compute time:
>
> - space: the size of state may not grow infinitely over time or with
> growing key domains. For these cases, the optimizer will enforce a
> cleanup timeout and all data which is past that timeout will be discarded.
> Operations which cannot be bounded in space will be rejected.
>
> - compute time: certain queries cannot be executed efficiently because
> newly arriving data (late data or just newly appended rows) might
> trigger recomputation of large parts of the current state. Operations
> that would result in such a computation pattern will be rejected. One
> example would be event-time OVER ROWS windows as we discussed in the other thread.
>
> So the plan is that the optimizer takes care of limiting the space 
> requirements and computation effort.
> However, you are of course right. Retraction and long-running windows
> can result in significant amounts of operator state.
> I don't think this is a special requirement for the Table API (there 
> are users of the DataStream API with jobs that manage TBs of state). 
> Persisting state to disk with RocksDB and scaling out to more nodes 
> should address the scaling problem initially. In the long run, the 
> Flink community will work to improve the handling of large state with 
> features such as incremental checkpoints and new state backends.
>
> Looking forward to your comments.
>
> Best,
> Fabian
>
> 2017-01-27 11:01 GMT+01:00 Radu Tudoran <ra...@huawei.com>:
>
> > Hi,
> >
> > Thanks for the clarification Fabian - it is really useful.
> > I agree that we should consolidate the module and avoid the need to
> > further maintain 3 different "projects". It does make sense to see
> > the current (I would call it) "Stream SQL" as a table with append semantics.
> > However, one thing that should be clarified is the best way,
> > from the implementation point of view, to keep the state of the table
> > (if we can actually keep it, though the need is clear for
> > supporting retraction). As the input is a stream and the table is
> > append-only, we of course run into the classic problem of unbounded
> > streams. What should be the approach?
> > Should we consider keeping the data in something like the state
> > backend used now for windows, and then pushing it to disk
> > (e.g., like FsStateBackend)? Perhaps with the disk we can at least
> > enlarge the horizon of what we keep.
> > I will add some comments and thoughts about this in the document.
> >
> >
> > -----Original Message-----
> > From: Fabian Hueske [mailto:fhueske@gmail.com]
> > Sent: Thursday, January 26, 2017 3:37 PM
> > To: dev@flink.apache.org
> > Subject: Re: Stream SQL and Dynamic tables
> >
> > Hi Radu,
> >
> > the idea is to have dynamic tables as the common ground for Table 
> > API and SQL.
> > I don't think it is a good idea to implement and maintain 3 
> > different relational APIs with possibly varying semantics.
> >
> > Actually, you can see the current status of the Table API / SQL on
> > streams as a subset of the proposed semantics.
> > Right now, all streams are implicitly converted into Tables with
> > APPEND semantics. The currently supported operations (selection,
> > filter, union, group windows) return streams.
> > The only thing that would change for these operations is that the
> > output mode would be retraction mode by default, in order to be able to
> > emit updated records (e.g., updated aggregates due to late records).
> >
> > The document is not final and we can of course discuss the proposal.
> >
> > Best, Fabian
> >
> > 2017-01-26 11:33 GMT+01:00 Radu Tudoran <ra...@huawei.com>:
> >
> > > Hi all,
> > >
> > >
> > >
> > > I have a question about the scope of the initiative on
> > > relational queries on data streams:
> > >
> > > https://docs.google.com/document/d/1qVVt_16kdaZQ8RTfA_f4konQPW4tnl8THw6rzGUdaqU/edit#
> > >
> > >
> > >
> > > Is the approach of using dynamic tables intended to replace the
> > > implementation and mechanisms built now in Stream SQL? Or will
> > > the two co-exist, with one built on top of the other?
> > >
> > >
> > >
> > > Also – is the document in the final form or can we still provide 
> > > feedback / ask questions?
> > >
> > >
> > >
> > > Thanks for the clarification (and sorry if I missed at some point 
> > > the discussion that might have clarified this)
> >
>